This week's unit covers text analytics.

Course link:

https://www.edx.org/course/the-analytics-edge

First, read in the data.

setwd("E:\\The Analytics Edge\\Unit 5 Text Analytics")
emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)
emails$email = iconv(emails$email, "WINDOWS-1252", "UTF-8")
str(emails)
'data.frame':    855 obs. of  2 variables:
 $ email     : chr  "North America's integrated electricity market requires cooperation on environmental policies Commission for Env"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favori"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: "| __truncated__ ...
 $ responsive: int  0 1 0 1 0 0 1 0 0 0 ...

Note: be sure to pass stringsAsFactors=FALSE when reading the data.

If encoding errors come up later, the following conversion fixes them:

emails$email = iconv(emails$email, "WINDOWS-1252", "UTF-8")

Before any analysis, the text needs pre-processing. The approach used here is Bag of Words, which simply counts how often each word appears:

Case is ignored, so all words are first converted to lowercase.

Stop words such as "I" and "we" are removed, since they carry little information.

The last step, stemming, treats words that share the same root as one word.
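As a toy illustration of stemming (this uses the SnowballC package, which tm's stemDocument relies on; it is assumed to be installed):

```r
library(SnowballC)
# Words sharing a root collapse to the same stem:
wordStem(c("argue", "argued", "argues", "arguing"))
# all four reduce to the common stem "argu"
```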

Now carry this out in R; note that the tm package needs to be installed first.

Load tm package

library(tm)

Create corpus

corpus = VCorpus(VectorSource(emails$email))

Pre-process data

corpus = tm_map(corpus, content_transformer(tolower))

corpus = tm_map(corpus, removePunctuation)

corpus = tm_map(corpus, removeWords, stopwords("english"))

corpus = tm_map(corpus, stemDocument)
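To see what the pipeline did, it can help to compare a raw email with its processed version (a quick sanity check, assuming the corpus above has been built):

```r
# First email before pre-processing:
strwrap(emails$email[1])[1:3]
# ...and after lowercasing, punctuation/stop-word removal, and stemming:
strwrap(corpus[[1]]$content)[1:3]
```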

Create matrix

Store the word counts in a document-term matrix.

dtm = DocumentTermMatrix(corpus)
dtm
<<DocumentTermMatrix (documents: 855, terms: 21997)>>
Non-/sparse entries: 102755/18704680
Sparsity           : 99%
Maximal term length: 113
Weighting          : term frequency (tf)

The number of terms is far too large, so we drop the infrequently used ones.
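Before trimming, tm's findFreqTerms can show which terms already occur often; here, a sketch listing terms appearing at least 100 times in total:

```r
# Terms with total frequency >= 100 across all 855 emails:
findFreqTerms(dtm, lowfreq = 100)
```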

Remove sparse terms

dtm = removeSparseTerms(dtm, 0.97)
dtm
<<DocumentTermMatrix (documents: 855, terms: 788)>>
Non-/sparse entries: 51645/622095
Sparsity           : 92%
Maximal term length: 19
Weighting          : term frequency (tf)

Create data frame

Build a data frame from the matrix and attach the outcome labels.

labeledTerms = as.data.frame(as.matrix(dtm))
labeledTerms$responsive = emails$responsive

Pre-processing is now complete; from here we can apply the classification methods introduced earlier.

Split the data

library(caTools)

set.seed(144)

spl = sample.split(labeledTerms$responsive, SplitRatio = 0.7)

train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)
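sample.split stratifies on the outcome, so both sets should contain roughly the same fraction of responsive emails; a quick check:

```r
# Proportion of responsive (1) vs non-responsive (0) emails in each set:
prop.table(table(train$responsive))
prop.table(table(test$responsive))
```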

Build a CART model

library(rpart)
library(rpart.plot)

emailCART = rpart(responsive~., data=train, method="class")

prp(emailCART)


Predict

pred = predict(emailCART, newdata=test)
table(test$responsive, pred[, 2] >= 0.5)
    FALSE TRUE
  0   195   20
  1    17   25
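From the confusion matrix above, the model's test-set accuracy and the baseline (always predicting non-responsive) follow directly:

```r
# CART model accuracy: correct predictions / total
(195 + 25) / (195 + 25 + 20 + 17)   # = 220/257, about 0.856
# Baseline that predicts 0 for every email:
(195 + 20) / (195 + 25 + 20 + 17)   # = 215/257, about 0.837
```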