Videos
Topic | Access |
---|---|
Hierarchical Clustering, Stanford University (15 min) | https://www.youtube.com/watch?v=rg2cjfMsCk4 |
Hierarchical Clustering in R (44 min) | https://www.youtube.com/watch?v=9U4h6pZw6f8 |
The k-Means Algorithm, Stanford University (13 min) | https://www.youtube.com/watch?v=RD0nNK51Fp8 |
K-Means Clustering: Pros and Cons of K-Means Clustering (24 min) | https://www.youtube.com/watch?v=YIGtalP1mv0 |
Clustering: K-Means and Hierarchical (customer segmentation) (17 min, optional) | https://www.youtube.com/watch?v=QXOkPvFM6NU |
Articles
Topic | Access |
---|---|
Hierarchical Clustering in R | https://www.datacamp.com/community/tutorials/hierarchical-clustering-R |
Hierarchical Cluster Analysis | https://uc-r.github.io/hc_clustering |
K-Means Clustering in R | https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/ |
K-Means Clustering in R Tutorial | https://www.datacamp.com/community/tutorials/k-means-clustering-r |
K-means Cluster Analysis | https://uc-r.github.io/kmeans_clustering |
# Reference: https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_nbclust.html
library(factoextra)
library(cluster)
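# df is assumed throughout what follows; as a minimal example (an assumption,
# not part of the original script), use two scaled numeric columns from the
# built-in mtcars data set:
df <- as.data.frame(scale(mtcars[, c("mpg", "wt")]))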
# Find the optimal number of clusters.
# Partitioning methods, such as k-means clustering, require the user to specify
# the number of clusters to be generated.
# fviz_nbclust(): determines and visualizes the optimal number of clusters using
# different methods: within-cluster sums of squares, average silhouette, and gap statistic.
# Any one of the following calls is enough to suggest the optimal k.
# df is ideally a two-column data frame, so the clusters can later be
# visualized in two-dimensional space.
factoextra::fviz_nbclust(x = df, FUNcluster = kmeans, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = hcut, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = cluster::pam, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = cluster::clara, method = "silhouette")
########################################################
# To visualize the clusters, first create a distance matrix.
DistanceMatrix <- factoextra::get_dist(df, method = "euclidean")
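# A quick visual check of the distance matrix (fviz_dist() is also in factoextra;
# by default small distances are shown in red, large ones in blue):
factoextra::fviz_dist(DistanceMatrix, show_labels = FALSE)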
####################################
# Create a hierarchical cluster object.
# method is one of:
# "ward.D", "ward.D2", "single", "complete",
# "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
# stats::hclust performs a hierarchical cluster analysis using a set of
# dissimilarities for the n objects being clustered:
# 1) initially, each object is assigned to its own cluster, then the algorithm proceeds iteratively,
# 2) at each stage joining the two most similar clusters,
# 3) continuing until there is just a single cluster.
Hclustermodel <- stats::hclust(DistanceMatrix, method = "ward.D")
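# A quick look at the full tree before cutting it (base-R plot method for hclust):
plot(Hclustermodel, cex = 0.6, hang = -1, main = "Dendrogram (ward.D linkage)")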
# Cut the tree at k = 2.
# stats::cutree cuts a tree into groups; this step is necessary for
# hierarchical cluster objects, because it is at this step that a cluster
# is assigned to each point.
CutModel <- stats::cutree(Hclustermodel, k = 2)
# CutModel contains the cluster each point belongs to.
CutModel
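# A one-line summary of the assignment: how many points fell into each cluster.
table(CutModel)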
# Visualize the cut clusters.
plot.new()
# factoextra::fviz_cluster can't handle an object of class hclust directly,
# so we pass a list with data and cluster components instead.
factoextra::fviz_cluster(list(data = df, cluster = CutModel))
# For hierarchical clustering, typically resulting from agnes() or diana():
hc <- cluster::agnes(df, method = "ward")
cluster::pltree(hc, cex = 0.6, hang = -1, main = "Dendrogram of hierarchical cluster made by agnes (Agglomerative Nesting)")
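# A sketch comparing linkage methods via the agglomerative coefficient
# (agnes() returns it as $ac; values closer to 1 suggest stronger clustering structure):
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) cluster::agnes(df, method = m)$ac)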
#####################################
# K-means clustering
# A k-means model contains the mean of each cluster in n-dimensional space,
# and each point is assigned to a cluster.
# Perform k-means clustering on a data matrix.
Kmeans_result <- stats::kmeans(df, centers = 3)
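# Inspect the standard components of the fitted kmeans object:
Kmeans_result$centers       # the mean of each cluster
Kmeans_result$size          # number of points assigned to each cluster
Kmeans_result$tot.withinss  # total within-cluster sum of squares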
# Visualize the clusters.
# object = an object of class "partition" created by the functions pam(),
# clara() or fanny() in the cluster package;
# "kmeans" [in the stats package];
# "dbscan" [in the fpc package];
# "Mclust" [in mclust]; "hkmeans",
# "eclust" [in factoextra].
# Possible values are also any list object with data and cluster components
# (e.g.: object = list(data = mydata, cluster = myclust)).
factoextra::fviz_cluster(object=Kmeans_result, data = df)
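# A sketch with a pam() partition instead; fviz_cluster() accepts "partition"
# objects directly because they carry their own data:
pam_result <- cluster::pam(df, k = 3)
factoextra::fviz_cluster(object = pam_result)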
If there are more than two dimensions (variables) in the data frame, fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.