Machine Learning #4 — Unsupervised Learning, Clustering, K-Means Algorithm, Dimension Reduction, Principal Component Analysis

Göker Güner
Published in Analytics Vidhya · 6 min read · Feb 4, 2022


Captured from https://adimadimgurme.com/2020/09/20/sante-wine-more/

This post is the fourth article in my Machine Learning series, which I write to consolidate and share what I have learned. For the third article of the series, published on Analytics Vidhya: link

For the Turkish version of this article: link

Hello! In the fourth article of our Machine Learning series, we will:

  • examine the concept of unsupervised learning,
  • learn about clustering and the K-Means algorithm, one of the best-known methods for it,
  • touch on dimension reduction and Principal Component Analysis (PCA), two concepts that are critical for clustering.

We will use the same wine quality dataset as in the second article of the series, but this time we will examine it with unsupervised learning techniques, without using the labels.

In this article, we will talk about a learning model that differs from the ones in the previous articles of the series: unsupervised learning. Unlike supervised learning, unsupervised learning discovers the relationships and structures that exist in the data without the data being labeled as cause-effect or input-output pairs.

Unlike supervised learning, we do not need labeled training data. An unsupervised learning algorithm interprets and groups the data on its own, and then predicts which group a new sample belongs to. While making this prediction, it determines which group the sample's features are most similar to, but it cannot say what the sample actually is, because the data is not labeled. That interpretation is left to the practitioner; even so, the approach is quite useful because the results are easy to visualize.

For a clustering algorithm, it is important that all values are non-null and contain no strings. All our values must be numerical, because the clustering is based on distance calculations in a metric space.

K-Means Clustering Algorithm

The K-means clustering method partitions a dataset into K clusters, where K is given as an input parameter. The aim is that the clusters obtained at the end of the partitioning have maximum similarity within each cluster and minimum similarity between clusters.

According to the working mechanism of the K-means algorithm, K objects are randomly selected to represent the center point, or mean, of each cluster. The remaining objects are assigned to the cluster they are most similar to, based on their distance to the cluster means. Then, by recalculating the mean of each cluster, new cluster centers are determined, and the distances of the objects to the centers are examined again. The algorithm keeps repeating until nothing changes anymore.

The algorithm basically consists of 4 stages:

  1. Determine the initial cluster centers.
  2. Assign the remaining data points to clusters according to their distances to the centers.
  3. Determine new centers based on the resulting clusters (i.e. shift the old centers to the new cluster means).
  4. Repeat steps 2 and 3 until the centers stabilize.
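To make these four stages concrete, a minimal NumPy sketch of the loop could look like the following. This is only an illustration, not the exact code used for the article; the function and variable names (kmeans_fit, X, k) are my own.

```python
import numpy as np

def kmeans_fit(X, k, n_iter=100, seed=42):
    """Minimal K-Means; X is an (n_samples, n_features) array, k is the cluster count."""
    rng = np.random.default_rng(seed)
    # 1. Determine the initial cluster centers (k randomly chosen samples).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign every sample to the nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Move each center to the mean of the samples assigned to it
        #    (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. Repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```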

In the previous article where we used this dataset, we noted, by calling df.describe() and interpreting the numerical values in the table, that most samples of our dataset have quality values of 5 or 6. To represent this visually, we can draw such a plot with the seaborn library.
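Such a plot could be produced roughly as follows (a sketch; the file name winequality-red.csv is assumed from the earlier article):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("winequality-red.csv")  # assumed file name from the earlier article
sns.countplot(x="quality", data=df)      # most samples fall at quality 5 or 6
plt.show()
```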

We have mentioned the concept of a correlation matrix in previous articles of the series. This time, we drew it with a different approach, without the top half.

To make the correlations easier to see, I highlighted variable pairs with stronger correlations rather than our target variable "quality". As we can see from the correlation matrix, the fixed acidity feature has a strong positive correlation (0.67) with density and a strong negative correlation (-0.68) with pH.
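One way to draw a correlation heatmap without its top half is to mask the upper triangle (again a sketch, not necessarily the exact code behind the figure):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the upper half of the matrix
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```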

Centroid & Inertia Concepts

The purpose of the K-Means clustering algorithm is simply to place the centers of the k clusters as well as possible. These centers of gravity are called centroids. The inertia is the sum of the squared distances of each sample to its nearest centroid; when the centroids are chosen as well as possible, this loss function (i.e. the inertia value) is minimized.

Note: Of course, our ultimate goal is not just to shrink the inertia as much as possible. If we set the number of clusters (centroids) equal to the number of samples in the dataset, the inertia will naturally be zero, but then our data is not clustered at all.
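To make the inertia concrete, here is a small check showing that scikit-learn's reported inertia is exactly this sum of squared distances (a sketch; X stands for the numeric feature matrix, as a NumPy array, that we prepare below):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
# inertia_ is the sum of squared distances of each sample to its closest centroid
manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(km.inertia_, manual)  # the two values should agree
```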

The ideal number of clusters can be found by the elbow method.

We dropped the "quality" column from the features of the samples in our dataset, assigned the remaining values to a variable X, and scaled it to remove the effect of differing magnitudes, thus preparing the input set our algorithm will use to understand the data.
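In code, this preparation might look like the following (a sketch; StandardScaler is one reasonable choice, the article does not state which scaler was used):

```python
from sklearn.preprocessing import StandardScaler

X = df.drop("quality", axis=1).values          # features only, no label
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
```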

Note: We did not use the scaled X values when determining the optimum number of clusters, because scaling is a numerical correction we apply to improve model performance. At this first step we want to examine the data itself, not an adjusted version of it, to decide how many clusters to divide it into.
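The elbow curve itself can be produced with a simple loop over candidate values of k (a sketch, using the unscaled X as described in the note above):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```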

It seems that we can set our cluster number to 3, because beyond that point the inertia decreases much more slowly.

Our K-Means model will use the first 1000 samples of the dataset we have prepared as training data, and it will assign the remaining 500 samples to clusters based on what it learned from those first 1000 samples.
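A sketch of that split with scikit-learn could look like this (the 1000/500 split follows the description above; the two plotted columns and the use of the scaled features are illustrative choices, not necessarily the article's originals):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X_train, X_new = X_scaled[:1000], X_scaled[1000:]

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_train)
labels_new = km.predict(X_new)   # cluster assignments for the remaining samples

# plot two of the features, drawing the centroids as squares
plt.scatter(X_new[:, 0], X_new[:, 1], c=labels_new, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="s", s=100)
plt.show()
```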

You can see the centroids of the clusters as squares on the graph. While some well-separated samples are clustered properly, the colors look rather mixed. How can we help our algorithm perform better?

Dimension Reduction & PCA

Dimension reduction finds patterns in the data and restates them in a "compressed" format, which makes computation more efficient in terms of both time and memory. When working on much larger datasets, this matters not only for accuracy but also for runtime.

Its most important function is to remove features that carry relatively little information from the dataset. The most basic technique used for dimension reduction is PCA (Principal Component Analysis).

The main thing we will try to do is find the number of "principal components" that carry the important information in the data, and then reduce the features in the dataset to that number.

In some sources these components are referred to as the "intrinsic dimension". For this example, we can say that although our dataset looks multidimensional, it is "intrinsically" 2-dimensional.
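To see how the information is distributed across the components, we can fit a PCA with all components and inspect explained_variance_ratio_ (a sketch; the article does not show whether the scaled or raw features were used here, so the raw X is only an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X)                       # keep all components for inspection
plt.bar(range(1, X.shape[1] + 1), pca.explained_variance_ratio_)
plt.xlabel("principal component")
plt.ylabel("explained variance ratio")
plt.show()
```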

Here, we see that a significant portion of the components carry almost no information. If we set our component count to 2, it looks like we can preserve most of the information that will be useful to us. So, exactly how much of this "information" do we preserve?

When we set our component number to 2, we can preserve about 99.8% of the total variance, that is, almost all of it.
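Projecting the data onto those two components and re-running K-Means might look like this (again a sketch following the steps described above, with the same assumptions about X):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())   # fraction of total variance kept by 2 components

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_pca[:1000])
labels_new = km.predict(X_pca[1000:])

plt.scatter(X_pca[1000:, 0], X_pca[1000:, 1], c=labels_new, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="s", s=100)
plt.show()
```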

Both the centroids and the clusters definitely look better than they did on our first try.

Hierarchical Clustering

Hierarchical clustering organizes samples in a hierarchy of clusters from small to large, and it is a good way to visualize data for non-technical people. Much like the classification of living things we saw in high-school biology, it starts from the smallest (or largest) common features and proceeds by grouping.
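A complete-linkage dendrogram can be drawn with SciPy (a sketch, not the article's original code; only a small subset of the scaled samples is used, purely to keep the plot readable):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

merged = linkage(X_scaled[:50], method="complete")  # farthest-neighbour distances
dendrogram(merged)
plt.show()
```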

In complete linkage (referred to as "complete" in the linkage call above), the distance between two clusters is the distance between their farthest points, which is why this method is also called the Furthest Neighbor Method. Here the hierarchy is built with the agglomerative approach: at the beginning, all objects are separate from each other, that is, each sample is a cluster on its own. Clusters with similar attributes are then merged step by step until a single cluster remains.

In single linkage (referred to as "single" in the linkage call below), the distance between two clusters is the distance between their closest points, which is why this method is also known as the Nearest Neighbor Method. The opposite of the agglomerative approach is the divisive strategy: there is only one cluster at the beginning, and at each stage objects are split off from the main cluster according to the distance/similarity matrix, forming different subsets, until each sample ends up as a cluster of its own.
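And the single-linkage counterpart, where only the method argument changes (same assumptions as the sketch above):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

merged = linkage(X_scaled[:50], method="single")   # nearest-neighbour distances
dendrogram(merged)
plt.show()
```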

See you in the next articles.
