Unit 5 Text Analytics
This week introduces text analytics.
Course link:
https://www.edx.org/course/the-analytics-edge
First, read in the data.
setwd("E:\\The Analytics Edge\\Unit 5 Text Analytics")
emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)
emails$email = iconv(emails$email, "WINDOWS-1252", "UTF-8")
str(emails)
'data.frame': 855 obs. of 2 variables:
$ email : chr "North America's integrated electricity market requires cooperation on environmental policies Commission for Env"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favori"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: "| __truncated__ ...
$ responsive: int 0 1 0 1 0 0 1 0 0 0 ...
Note that you must pass stringsAsFactors=FALSE when reading in the data.
If later steps throw an encoding error, it can be fixed as follows:
emails$email = iconv(emails$email,"WINDOWS-1252","UTF-8")
Before running any analysis, the text must be preprocessed. The approach used here is Bag of Words, which simply counts how often each word occurs:
Since case is not meaningful here, all words are converted to lowercase;
Stop words such as "I" and "we" are removed;
Finally, words sharing the same stem are treated as one word (see the small stemming sketch below).
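As a quick illustration of stemming, here is a sketch (my own toy example, not code from the course) using tm's stemDocument, which relies on the SnowballC package:
# Different inflections of "argue" all collapse to the same stem
library(tm)
library(SnowballC)
stemDocument(c("argue", "argued", "argues", "arguing"))
# expected: "argu" "argu" "argu" "argu"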
Now let's carry this out in R; note that the tm package needs to be installed.
Load tm package
library(tm)
Create corpus
corpus = VCorpus(VectorSource(emails$email))
Pre-process data
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
Create matrix
Store the data in a document-term matrix.
dtm = DocumentTermMatrix(corpus)
dtm
<<DocumentTermMatrix (documents: 855, terms: 21997)>>
Non-/sparse entries: 102755/18704680
Sparsity : 99%
Maximal term length: 113
Weighting : term frequency (tf)
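Before pruning, it can be useful to peek at the most common terms. A sketch using tm's findFreqTerms (the cutoff of 100 is an arbitrary choice of mine, not from the course):
# List terms that appear at least 100 times across the corpus
findFreqTerms(dtm, lowfreq = 100)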
We can see there are far too many terms, so we remove the infrequent ones. Passing 0.97 to removeSparseTerms keeps only terms that appear in at least 3% of the documents.
Remove sparse terms
dtm = removeSparseTerms(dtm, 0.97)
dtm
<<DocumentTermMatrix (documents: 855, terms: 788)>>
Non-/sparse entries: 51645/622095
Sparsity : 92%
Maximal term length: 19
Weighting : term frequency (tf)
Create data frame
Construct a data frame and attach the corresponding labels.
labeledTerms = as.data.frame(as.matrix(dtm))
labeledTerms$responsive = emails$responsive
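One caveat worth noting (a safeguard I am adding, not a step shown in this lecture): column names built from raw terms are not guaranteed to be syntactically valid R names, which can trip up modeling functions. Base R's make.names fixes this:
# Optional: make all column names valid R identifiers
colnames(labeledTerms) = make.names(colnames(labeledTerms))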
Preprocessing is now complete; from here on we simply apply the classification methods introduced earlier.
Split the data
library(caTools)
set.seed(144)
spl = sample.split(labeledTerms$responsive, 0.7)
train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)
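To confirm that sample.split preserved the class balance across the two sets (a quick check of mine, not part of the original write-up):
# The proportion of responsive emails should be similar in both splits
prop.table(table(train$responsive))
prop.table(table(test$responsive))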
Build a CART model
library(rpart)
library(rpart.plot)
emailCART = rpart(responsive~., data=train, method="class")
prp(emailCART)
Predict
pred = predict(emailCART, newdata=test)
table(test$responsive, pred[, 2] >= 0.5)
    FALSE TRUE
  0   195   20
  1    17   25
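From this confusion matrix, the test-set accuracy is (195 + 25) / 257 ≈ 0.856, versus a baseline of 215 / 257 ≈ 0.837 for always predicting non-responsive. A sketch of computing the accuracy and the ROC curve with the ROCR package (ROCR is used elsewhere in the course, but this particular code is my own):
# Accuracy from the confusion matrix above
(195 + 25) / (195 + 20 + 17 + 25)
# ROC curve and AUC for the CART class probabilities
library(ROCR)
predROCR = prediction(pred[, 2], test$responsive)
perfROCR = performance(predROCR, "tpr", "fpr")
plot(perfROCR, colorize = TRUE)
performance(predROCR, "auc")@y.values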