Clustering and PCA on wines dataset
Seetharam Annepu
First, let's discuss what PCA is in simple terms:
PCA is based on a decomposition of the data matrix X into two matrices V and U
X=U*V'
The matrix V is orthogonal, and the columns of U are mutually orthogonal. V is usually called the loadings matrix, and U is called the scores matrix. The loadings can be understood as the weights applied to each original variable (each data column of the original dataset) when calculating a principal component. The matrix U contains the original data in a rotated coordinate system. Furthermore, we can use PCA or SVD to reduce the dimensionality of the original data to a few important components that capture the most variance, which speeds up processing in machine learning. You will learn more about this in the next post.
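The decomposition can be checked numerically. The sketch below is only an illustration: it assumes R's built-in USArrests data (not the wine data) so that it runs on its own, and the names pca_demo, Z and X_rebuilt are made up for this example. It verifies that the scores multiplied by the transposed loadings rebuild the standardized data, and that the loadings matrix is orthogonal.

```r
# Illustrative data: built-in USArrests (50 x 4), not the wine dataset
X <- as.matrix(USArrests)

# PCA on the correlation matrix, the same princomp options used for the wines
pca_demo <- princomp(X, cor = TRUE, scores = TRUE)

V <- unclass(pca_demo$loadings)  # loadings: one column of weights per component
U <- pca_demo$scores             # scores: the data in the rotated coordinates

# With cor = TRUE the decomposition applies to the standardized data,
# using the centering and scaling vectors stored by princomp
Z <- scale(X, center = pca_demo$center, scale = pca_demo$scale)

# Z = U V': scores times the transposed loadings rebuild Z
X_rebuilt <- U %*% t(V)
max_err <- max(abs(Z - X_rebuilt))

# V is orthogonal: V'V is the identity matrix
vtv_err <- max(abs(crossprod(V) - diag(ncol(V))))

# The score columns are mutually orthogonal: off-diagonals of U'U vanish
utu <- crossprod(U)
offdiag_err <- max(abs(utu - diag(diag(utu))))

print(c(max_err = max_err, vtv_err = vtv_err, offdiag_err = offdiag_err))
```

All three printed values should be zero up to floating-point rounding.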
if (!require(data.table)) { install.packages("data.table"); library(data.table) }
## Loading required package: data.table
if (!require(dplyr)) { install.packages("dplyr"); library(dplyr) }
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require(rgl)) { install.packages("rgl"); library(rgl) }
## Loading required package: rgl
## Warning: package 'rgl' was built under R version 3.5.3
if (!require(pca3d)) { install.packages("pca3d"); library(pca3d) }
## Loading required package: pca3d
## Warning: package 'pca3d' was built under R version 3.5.3
if (!require(plotly)) { install.packages("plotly"); library(plotly) }
## Loading required package: plotly
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Choose the wine data file interactively; the file has no header row
dataWines <- read.delim(file.choose(), header = FALSE, sep = ",")
# Rename the default columns V1..V14: the wine type first, then the 13 measured attributes
setnames(dataWines, old = paste0("V", 1:14),
new = c('Type','Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue','OD280/OD315 of diluted wines','Proline'))
apply(dataWines,2, function(x) sum(is.na(x)))
## Type Alcohol
## 0 0
## Malic acid Ash
## 0 0
## Alcalinity of ash Magnesium
## 0 0
## Total phenols Flavanoids
## 0 0
## Nonflavanoid phenols Proanthocyanins
## 0 0
## Color intensity Hue
## 0 0
## OD280/OD315 of diluted wines Proline
## 0 0
pca<-princomp(dataWines[,c(2:ncol(dataWines))], cor = TRUE, scores = TRUE, covmat = NULL)
summary(pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.1692972 1.5801816 1.2025273 0.9586313 0.92370351
## Proportion of Variance 0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
## Cumulative Proportion 0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.80103498 0.74231281 0.59033665 0.53747553
## Proportion of Variance 0.04935823 0.04238679 0.02680749 0.02222153
## Cumulative Proportion 0.85098116 0.89336795 0.92017544 0.94239698
## Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation 0.50090167 0.47517222 0.41081655 0.321524394
## Proportion of Variance 0.01930019 0.01736836 0.01298233 0.007952149
## Cumulative Proportion 0.96169717 0.97906553 0.99204785 1.000000000
plot(pca)
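The scree plot shows that the first three principal components capture the largest share of the variance among the columns: about 66.5% cumulatively, according to the summary above. These three components can therefore be used to approximate the data, at the cost of losing roughly one third of the total variance.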
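The proportions in the summary follow directly from the component standard deviations. As a small worked sketch, the snippet below copies the standard deviations printed by summary(pca) above (hard-coded here only so the example runs without the data file; with the fitted object one would read pca$sdev instead) and recomputes the proportion and cumulative proportion of variance. The variable names are illustrative.

```r
# Standard deviations of the 13 components, copied from summary(pca) above
sdev <- c(2.1692972, 1.5801816, 1.2025273, 0.9586313, 0.92370351,
          0.80103498, 0.74231281, 0.59033665, 0.53747553,
          0.50090167, 0.47517222, 0.41081655, 0.321524394)

# The variance of each component is its squared standard deviation
variances <- sdev^2

# Share of the total variance per component, and the running total
prop_var <- variances / sum(variances)
cum_var <- cumsum(prop_var)

# Cumulative share captured by the first three components
print(round(cum_var[1:3], 4))

# Smallest number of components whose cumulative share reaches 80%
k80 <- which(cum_var >= 0.80)[1]
print(k80)
```

Because the PCA was run on the correlation matrix of 13 variables, the variances sum to 13, so each proportion is simply the squared standard deviation divided by 13.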
loadings(pca)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Alcohol 0.144 0.484 0.207 0.266 0.214
## Malic acid -0.245 0.225 -0.537 0.537
## Ash 0.316 -0.626 0.214 0.143 0.154
## Alcalinity of ash -0.239 -0.612 -0.101
## Magnesium 0.142 0.300 -0.131 0.352 -0.727
## Total phenols 0.395 -0.146 -0.198 0.149
## Flavanoids 0.423 -0.151 -0.152 0.109
## Nonflavanoid phenols -0.299 -0.170 0.203 0.501 -0.259
## Proanthocyanins 0.313 -0.149 -0.399 -0.137 -0.534
## Color intensity 0.530 0.137 -0.419
## Hue 0.297 -0.279 0.428 0.174 0.106
## OD280/OD315 of diluted wines 0.376 -0.164 -0.166 -0.184 0.101 0.266
## Proline 0.287 0.365 0.127 0.232 0.158 0.120
## Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Alcohol 0.396 0.509 0.212 0.226 0.266
## Malic acid -0.421 -0.309 -0.122
## Ash 0.149 -0.170 -0.308 0.499
## Alcalinity of ash 0.287 0.428 0.200 -0.479
## Magnesium -0.323 -0.156 0.271
## Total phenols -0.406 0.286 -0.320 -0.304 0.304
## Flavanoids -0.187 -0.163
## Nonflavanoid phenols -0.595 -0.233 0.196 0.216 -0.117
## Proanthocyanins -0.372 0.368 -0.209 0.134 0.237
## Color intensity 0.228 -0.291 -0.604
## Hue -0.232 0.437 -0.522 -0.259
## OD280/OD315 of diluted wines 0.137 0.524 -0.601
## Proline 0.120 -0.576 0.162 -0.539
## Comp.13
## Alcohol
## Malic acid
## Ash -0.141
## Alcalinity of ash
## Magnesium
## Total phenols -0.464
## Flavanoids 0.832
## Nonflavanoid phenols 0.114
## Proanthocyanins -0.117
## Color intensity
## Hue
## OD280/OD315 of diluted wines -0.157
## Proline
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077 0.077 0.077 0.077 0.077
## Cumulative Var 0.077 0.154 0.231 0.308 0.385 0.462 0.538 0.615
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## SS loadings 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077 0.077
## Cumulative Var 0.692 0.769 0.846 0.923 1.000
pca$loadings ## same object as loadings(pca); prints the identical table shown above
biplot(pca$scores[,1:2],pca$loadings[,1:2]) ## scores against loadings, if you didn't want rescaling of the axis
biplot(pca)
pc_scores <- pca$scores
#pc_scores
gr <- factor(dataWines[,1])
pca2d(pca, col=dataWines$Type,group=gr, biplot=TRUE )
#pca3d(pca, col=dataWines$Type,group=gr)
# Drop the Type column and standardize the 13 attributes
wines <- dataWines[, 2:ncol(dataWines)]
normalized_data <- scale(wines)
# Total within-cluster sum of squares for k = 1..10 clusters (elbow method)
wineCluster_Variability <- matrix(nrow = 10, ncol = 1)
for (i in 1:10) wineCluster_Variability[i] <- kmeans(normalized_data, centers = i, nstart = 10)$tot.withinss
plot(1:10, wineCluster_Variability, type = "b", xlab = "Number of clusters", ylab = "Within groups sum of squares")
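The loop above records the total within-cluster sum of squares (WSS) for each candidate number of clusters. The sketch below repeats the idea on R's built-in iris measurements, which are an assumption made only to keep the example self-contained, and the seed value is arbitrary. It also shows a useful sanity check: with a single cluster, the WSS of standardized data equals (n - 1) * p, where n is the number of rows and p the number of columns.

```r
# Illustrative data: the four numeric columns of built-in iris, standardized
x <- scale(iris[, 1:4])

set.seed(42)  # k-means starts from random centers; the seed is arbitrary

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# With one cluster the WSS is the total sum of squares, which for
# standardized data equals (n - 1) * p
total_ss <- (nrow(x) - 1) * ncol(x)

# Reduction in WSS gained by each additional cluster; the "elbow" is
# where these gains become small
gain <- -diff(wss)

print(round(wss, 1))
print(round(gain, 1))
```

Reading the gains rather than the raw curve can make the elbow easier to spot, although the choice of k remains a judgment call.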
# k-means with 3 clusters; the cluster numbers are arbitrary labels
winesk <- kmeans(normalized_data, centers = 3, iter.max = 20, nstart = 5)
wine_clust <- cbind(dataWines, winesk$cluster)
#write.csv(wine_clust,file="wine_clusters_k.csv")
#summary(wine_clust[which(wine_clust$`winesk$cluster`==3),])
with(wine_clust, table(`winesk$cluster`, Type))
## Type
## winesk$cluster 1 2 3
## 1 59 3 0
## 2 0 3 48
## 3 0 65 0
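The cross-tabulation above can be summarized as a single agreement figure. As a rough sketch, the snippet below retypes the counts from that table (hard-coded so the example runs without the data file) and matches each cluster to the wine type that dominates it, since k-means cluster numbers carry no meaning of their own. The names tab, majority_type and agreement are illustrative.

```r
# Counts copied from the table above (rows: k-means cluster, columns: Type)
tab <- matrix(c(59,  3,  0,
                 0,  3, 48,
                 0, 65,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = as.character(1:3),
                              Type = as.character(1:3)))

# Cluster numbers are arbitrary, so pair each cluster with its majority type
majority_type <- apply(tab, 1, which.max)

# Wines that fall in the majority type of their cluster
matched <- sum(apply(tab, 1, max))

# Share of all wines whose cluster agrees with their type
agreement <- matched / sum(tab)

print(majority_type)
print(round(agreement, 3))
```

With these counts, 172 of the 178 wines sit in the cluster dominated by their own type, and the six mismatches are all Type 2 wines.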
wine_clust[["Type"]] <- as.factor(wine_clust[["Type"]])
cluster_means <- wine_clust %>%
  group_by(Type) %>%
  summarise_all("mean")
cluster_means
## # A tibble: 3 x 15
## Type Alcohol `Malic acid` Ash `Alcalinity of ~ Magnesium
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 13.7 2.01 2.46 17.0 106.
## 2 2 12.3 1.93 2.24 20.2 94.5
## 3 3 13.2 3.33 2.44 21.4 99.3
## # ... with 9 more variables: `Total phenols` <dbl>, Flavanoids <dbl>,
## # `Nonflavanoid phenols` <dbl>, Proanthocyanins <dbl>, `Color
## # intensity` <dbl>, Hue <dbl>, `OD280/OD315 of diluted wines` <dbl>,
## # Proline <dbl>, `winesk$cluster` <dbl>
cols<-names(cluster_means)
length(cols)
## [1] 15
l <- htmltools::tagList()
for (i in 2:15) {
  l[[i]] <- as_widget(plot_ly(cluster_means, x = ~`winesk$cluster`, y = cluster_means[[cols[i]]], type = 'bar', text = n, color = ~`winesk$cluster`) %>%
    layout(title = cols[i],
           xaxis = list(title = "Cluster ID"),
           yaxis = list(title = cols[i])))
}
l
## Warning: textfont.color doesn't (yet) support data arrays
With PCA we can see which attributes of the wines are correlated and how these attributes differentiate each wine type. PCA captures the directions of greatest variance among the attributes, which can help winemakers understand which properties to vary in order to produce different types of wine.
Now let's use Tableau for a more detailed analysis.
[Figure: Tableau analysis]
We can subset the features by picking out those whose composition varies widely from cluster to cluster. The wines differ mostly in color intensity, magnesium, alcalinity of ash, malic acid, proline and total phenols. All correlated features of the wines are shown in one color in the visualization below.
[Figure: Tableau analysis]
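The feature selection described above, finding attributes that vary widely across clusters, can also be approximated in R. The sketch below is one possible approach and is not the Tableau workflow itself: it assumes the built-in iris measurements as stand-in data and an arbitrary seed, clusters the standardized features, and ranks each feature by the gap between its largest and smallest cluster mean.

```r
# Illustrative data: the four numeric columns of built-in iris, standardized
x <- scale(iris[, 1:4])

set.seed(1)  # arbitrary seed so the clustering is repeatable
km <- kmeans(x, centers = 3, nstart = 10)

# km$centers holds the mean of each standardized feature within each cluster
centers <- km$centers

# Spread per feature: largest cluster mean minus smallest cluster mean
spread <- apply(centers, 2, function(m) max(m) - min(m))

# Features sorted from most to least variable across clusters
ranked <- sort(spread, decreasing = TRUE)
print(round(ranked, 2))
```

Standardizing first matters here: without it, features measured on larger scales (such as proline in the wine data) would dominate the ranking regardless of how well they separate the clusters.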