There are many types of data mining clustering algorithms, but only a few are widely used. Fundamentally, all clustering algorithms rely on a distance measure: data points that lie closer together in the data space exhibit more similar characteristics than points lying further apart. Each algorithm follows a different approach to finding these ‘similar characteristics’ among the data points.
Let’s look at the different types of Data Mining Clustering Algorithms in detail:
Data Mining Connectivity Models
These models follow two approaches.
- In the first approach, every data point starts in its own cluster, and clusters are then aggregated as the distance between them decreases.
- In the second approach, all the data points start as a single cluster, which is then partitioned as the distance increases.
These models are easy to interpret, but they do not scale well to large data sets. Hierarchical clustering is the classic example.
Data Mining Hierarchical Clustering Method Steps
Below are the steps to solve the Hierarchical Clustering Method:
Given a set of ‘n’ items to be clustered and an ‘n×n’ distance matrix:
Step-1: Assign each item to its own cluster, so that with ‘n’ items you start with ‘n’ clusters, each containing just one item. Let the similarities between the clusters equal the similarities between the items they contain.
Step-2: Find the most similar pair of clusters and merge them into a single cluster.
Step-3: Compute the similarities between the new cluster and each of the old clusters.
Step-4: Repeat Step-2 and Step-3 until all ‘n’ items are merged into a single cluster.
Step-3 can be carried out in different ways: single-link, complete-link, or average-link clustering. In single-link clustering, the distance between two clusters is the shortest distance from any data point in one cluster to any data point in the other. In complete-link clustering (also called the diameter or maximum method), it is the longest distance from any data point in one cluster to any data point in the other. In average-link clustering, it is the average distance from any data point in one cluster to any data point in the other.
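As an illustration, here is a minimal sketch of these three linkage options using SciPy's agglomerative clustering routines; the synthetic data, the cut into three clusters, and all parameter values are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # 20 two-dimensional points (synthetic)

# Each 'method' corresponds to one of the linkage criteria described above.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # bottom-up merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 flat clusters
    print(method, labels)
```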
Data Mining Centroid Models
The data mining K-means algorithm is the best-known example in this category.
In this model, the number of clusters required at the end is known in advance, so it is important to have prior knowledge of the data set. These are iterative algorithms in which the data points closest to a centroid in the data space are aggregated into the same cluster. The number of centroids is always equal to the number of clusters.
Data Mining K-Means Method Steps
Below are the steps for K Means Clustering Method:
Step-1: Place K points into the data space represented by the data objects that are being clustered. These points represent the initial group centroids.
Step-2: Assign each data object to the group that has the closest centroid.
Step-3: When all the data objects have been assigned, recalculate the positions of the K centroids.
Step-4: Repeat Step-2 and Step-3 until the centroids no longer move. This produces a separation of the data objects into groups.
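To make the steps concrete, here is a minimal from-scratch sketch in Python, assuming NumPy is available; the function name, the convergence test, and all parameter values are illustrative rather than a reference implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Illustrative K-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Step-1: choose K initial centroids from the data points themselves.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step-2: assign each data object to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-3: recalculate each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step-4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage on two synthetic blobs:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```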
Data Mining Distribution Models
These models are based on estimating how probable it is that the data points in a cluster belong to the same distribution (for example, Gaussian). A popular example of this model is the Expectation-Maximization (EM) algorithm.
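As a sketch, fitting a Gaussian mixture with EM might look like the following, assuming scikit-learn is installed; the synthetic data and the choice of two components are made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs with different means.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# EM alternates an E-step (compute membership probabilities) with an
# M-step (re-estimate each Gaussian's mean and covariance).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignment to the most probable component
probs = gmm.predict_proba(X)  # soft, probabilistic membership per component
```

Unlike K-means, the soft memberships make explicit how confidently each point belongs to its cluster.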
Data Mining Density Models
These models search for areas of varying density of data points in the data space. They isolate regions of different density and assign the data points within the same region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
Data Mining DBSCAN (Density-Based Spatial Clustering of Applications with Noise) Method
Below are the steps for DBSCAN Clustering Method:
Step-1: The method requires two parameters: epsilon (Eps) and minimum points (MinPts). It starts with a random point that has not yet been visited.
Step-2: Find all the neighboring data points within distance Eps of the starting point.
Step-3: If the number of neighbors is greater than or equal to MinPts, a cluster is formed and the starting point is marked as visited.
Step-4: If the number of neighbors is less than MinPts, the data point is marked as noise.
Step-5: The algorithm repeats the process recursively from each newly visited point until every point has been visited.
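A minimal sketch of running DBSCAN with scikit-learn, assuming it is installed; eps and min_samples correspond to the Eps and MinPts parameters above, and their values here are made up for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense synthetic blobs plus a few scattered outliers.
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),
    rng.normal(4, 0.3, (100, 2)),
    rng.uniform(-2, 6, (10, 2)),
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Points labelled -1 were marked as noise; all others belong to a dense cluster.
print(set(labels))
```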