This week covers clustering, specifically how to use hierarchical clustering and K-means.

Course link:

https://www.edx.org/course/the-analytics-edge

setwd("E:\\The Analytics Edge\\Unit 6 Clustering")

The data file for this unit is pipe-delimited rather than CSV, so it is read with read.table.

movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
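To see what the sep and quote arguments do, here is a minimal sketch on three made-up records (the titles and values are hypothetical, not taken from movieLens.txt). Setting quote to only the double quote means an apostrophe inside a title is kept as ordinary text.

```r
# Three made-up pipe-delimited records (hypothetical, for illustration only)
raw = c("1|'Til Example (1997)|0|1",
        "2|Movie B (1995)|1|0",
        "3|Movie C (1995)|0|0")

# text= lets read.table parse an in-memory character vector instead of a file
toy = read.table(text = raw, header = FALSE, sep = "|", quote = "\"",
                 stringsAsFactors = FALSE)

# With header = FALSE the columns are auto-named V1, V2, ...
str(toy)
```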

str(movies)
'data.frame':    1682 obs. of  24 variables:
 $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 593 342 1317 1544 110 390 1239 ...
 $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
 $ V4 : logi  NA NA NA NA NA NA ...
 $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
 $ V6 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V7 : int  0 1 0 1 0 0 0 0 0 0 ...
 $ V8 : int  0 1 0 0 0 0 0 0 0 0 ...
 $ V9 : int  1 0 0 0 0 0 0 0 0 0 ...
 $ V10: int  1 0 0 0 0 0 0 1 0 0 ...
 $ V11: int  1 0 0 1 0 0 0 1 0 0 ...
 $ V12: int  0 0 0 0 1 0 0 0 0 0 ...
 $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V14: int  0 0 0 1 1 1 1 1 1 1 ...
 $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V21: int  0 0 0 0 0 0 1 0 0 0 ...
 $ V22: int  0 1 1 0 1 0 0 0 0 0 ...
 $ V23: int  0 0 0 0 0 0 0 0 0 1 ...
 $ V24: int  0 0 0 0 0 0 0 0 0 0 ...

Give each column a name.

colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")

str(movies)
'data.frame':    1682 obs. of  24 variables:
 $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Title           : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 593 342 1317 1544 110 390 1239 ...
 $ ReleaseDate     : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
 $ VideoReleaseDate: logi  NA NA NA NA NA NA ...
 $ IMDB            : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
 $ Unknown         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Action          : int  0 1 0 1 0 0 0 0 0 0 ...
 $ Adventure       : int  0 1 0 0 0 0 0 0 0 0 ...
 $ Animation       : int  1 0 0 0 0 0 0 0 0 0 ...
 $ Childrens       : int  1 0 0 0 0 0 0 1 0 0 ...
 $ Comedy          : int  1 0 0 1 0 0 0 1 0 0 ...
 $ Crime           : int  0 0 0 0 1 0 0 0 0 0 ...
 $ Documentary     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Drama           : int  0 0 0 1 1 1 1 1 1 1 ...
 $ Fantasy         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ FilmNoir        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Horror          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Musical         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Mystery         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Romance         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SciFi           : int  0 0 0 0 0 0 1 0 0 0 ...
 $ Thriller        : int  0 1 1 0 1 0 0 0 0 0 ...
 $ War             : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Western         : int  0 0 0 0 0 0 0 0 0 0 ...

Remove the variables that are not needed.

movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL

Remove duplicate rows.

movies = unique(movies)
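As a quick check that unique works on rows of a data frame and leaves the columns alone, here is a small sketch on a made-up data frame (the values are hypothetical):

```r
# Made-up data: the first and third rows are identical
toy = data.frame(Title  = c("A", "B", "A"),
                 Action = c(1, 0, 1),
                 Drama  = c(0, 1, 0))

toyUnique = unique(toy)

nrow(toy)        # 3 rows before
nrow(toyUnique)  # 2 rows after: the repeated "A" row is dropped
ncol(toyUnique)  # 3 columns, unchanged
```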

Hierarchical

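Hierarchical clustering is introduced first. It starts with every observation as its own cluster and repeatedly merges the two closest clusters until only one remains. In this description the distance between two clusters is the distance between their centroids; note that the R code below passes method = "ward.D" to hclust, which selects a Ward-type linkage rather than centroid linkage. The result can be drawn as a tree diagram (dendrogram).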

The number of clusters is then chosen manually.
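The merge-then-cut idea can be sketched on five made-up one-dimensional points (hypothetical values, not the movie data). This sketch uses centroid linkage to match the description above; as far as I know, hclust expects squared Euclidean distances for that method.

```r
# Five made-up 1-D observations (illustrative only)
pts = c(a = 1, b = 2, c = 6, d = 8, e = 20)

# Centroid linkage is meant to be used with squared Euclidean distances
d = dist(pts)^2
hc = hclust(d, method = "centroid")

# Merge history: negative entries are single observations,
# positive entries refer to clusters formed at an earlier step
hc$merge

g3 = cutree(hc, k = 3)  # a and b together, c and d together, e alone
g2 = cutree(hc, k = 2)  # a, b, c, d together; e alone
g3
g2
```

Choosing k here plays the same role as deciding where to cut the dendrogram.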

The steps in R are as follows.

Compute distances

distances = dist(movies[2:20], method = "euclidean")
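For 0/1 genre indicators, the Euclidean distance between two movies is the square root of the number of genres on which they differ. A minimal sketch with three made-up movies and four made-up genre columns (all hypothetical):

```r
# Each row is a made-up movie, each column a made-up 0/1 genre flag
genres = rbind(m1 = c(0, 1, 1, 0),
               m2 = c(0, 1, 0, 1),
               m3 = c(0, 1, 1, 0))

d = dist(genres, method = "euclidean")

# Convert to a full matrix to look up individual pairs
m = as.matrix(d)
m["m1", "m2"]  # differs in two genres, so sqrt(2)
m["m1", "m3"]  # identical genre vectors, so 0
```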

Hierarchical clustering

clusterMovies = hclust(distances, method = "ward.D")

Plot the dendrogram

plot(clusterMovies)


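The bottom of the dendrogram appears solid black because every observation starts out as its own cluster, so the many leaves and their labels overlap.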

Assign points to clusters

Next, choose the number of clusters.

clusterGroups = cutree(clusterMovies, k = 10)

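Look at each cluster by computing, for a given genre, the fraction of movies in the cluster that belong to it.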

tapply(movies$Action, clusterGroups, mean)
        1         2         3         4         5         6         7         8         9        10
0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000 0.0000000 0.0000000 0.0000000
tapply(movies$Romance, clusterGroups, mean)
         1          2          3          4          5          6          7          8          9         10
0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000 1.00000000 0.00000000 0.00000000 0.00000000
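Calling tapply once per genre gets repetitive. One possible shortcut, sketched here on a made-up data frame with a made-up cluster assignment (both hypothetical), is to split the rows by cluster and take column means, which gives every genre for every cluster in one table.

```r
# Made-up 0/1 genre flags for six movies
toy = data.frame(Action  = c(1, 1, 0, 0, 0, 1),
                 Romance = c(0, 0, 1, 1, 1, 0),
                 Drama   = c(0, 1, 1, 0, 1, 0))

# Made-up cluster labels, standing in for the output of cutree
groups = c(1, 1, 2, 2, 2, 1)

# One row per genre, one column per cluster
genreMeans = sapply(split(toy, groups), colMeans)
genreMeans
```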

K-means

Next, use K-means.

setwd("E:\\The Analytics Edge\\Unit 6 Clustering")

data = read.csv("dailykos.csv")

Run the algorithm.

set.seed(1000)
KMC = kmeans(data, centers=7)
str(KMC)
List of 9
 $ cluster     : int [1:3430] 4 4 6 4 1 4 7 4 4 4 ...
 $ centers     : num [1:7, 1:1545] 0.0342 0.0556 0.0253 0.0136 0.0491 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:7] "1" "2" "3" "4" ...
  .. ..$ : chr [1:1545] "abandon" "abc" "ability" "abortion" ...
 $ totss       : num 896461
 $ withinss    : num [1:7] 76583 52693 99504 258927 88632 ...
 $ tot.withinss: num 730632
 $ betweenss   : num 165829
 $ size        : int [1:7] 146 144 277 2063 163 329 308
 $ iter        : int 7
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
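In the output above, tot.withinss plus betweenss equals totss (730632 + 165829 = 896461). The same relationship can be checked on a small made-up data set; the two simulated groups below are an assumption for illustration and have nothing to do with dailykos.csv.

```r
set.seed(1000)

# Two made-up groups of 20 points each in two dimensions
toy = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 5), ncol = 2))

km = kmeans(toy, centers = 2)

# Cluster sizes always add up to the number of observations
sum(km$size)

# Total sum of squares splits into within-cluster and between-cluster parts
all.equal(km$totss, km$tot.withinss + km$betweenss)
```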

Check how many observations fall into each cluster.

table(KMC$cluster)
   1    2    3    4    5    6    7 
 146  144  277 2063  163  329  308 
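Beyond cluster sizes, one way to inspect a cluster is to subset its rows and look at the columns with the highest means. This is a sketch on a made-up word-count table; the column names and cluster labels are hypothetical stand-ins for the data and KMC$cluster.

```r
# Made-up word counts for four documents
toy = data.frame(wordA = c(3, 4, 0, 0),
                 wordB = c(0, 1, 5, 4),
                 wordC = c(1, 0, 2, 3))

# Made-up cluster labels, standing in for the cluster vector from kmeans
cluster = c(1, 1, 2, 2)

# Keep only the rows assigned to cluster 2
c2 = toy[cluster == 2, ]

# Sort the column means in increasing order and keep the two largest
topWords = tail(sort(colMeans(c2)), 2)
topWords
```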

Additional note

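Before running a clustering algorithm, the data should usually be normalized so that variables with large magnitudes do not dominate the distance calculation. This can be done as follows.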

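library(caret)
Loading required package: lattice
Loading required package: ggplot2
# preProcess only estimates the centering and scaling parameters
preproc = preProcess(data)
# predict applies them and returns the normalized data
dataNorm = predict(preproc, data)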
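If caret is not available, base R's scale function performs the same kind of centering and scaling. This is a swapped-in alternative rather than the method used above, shown on a made-up data frame whose column names and values are hypothetical.

```r
# Two made-up variables on very different scales
toy = data.frame(height_cm = c(150, 160, 170, 180),
                 income    = c(20000, 40000, 60000, 80000))

# scale() subtracts each column's mean and divides by its standard deviation
toyNorm = as.data.frame(scale(toy))

colMeans(toyNorm)      # each column now has mean 0 (up to rounding)
apply(toyNorm, 2, sd)  # each column now has standard deviation 1
```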