# Clustering Algorithms

Clustering is the task of grouping data points into clusters such that points within a cluster are similar and points in different clusters are dissimilar.

## Hierarchical Algorithms

Hierarchical clustering builds hierarchies or trees of clusters. It can be either:

**Agglomerative Clustering (Bottom-Up)**

Each data point starts in its cluster.

Clusters are merged as we move up the hierarchy.

Performs well for large datasets.

Merging is based on the similarity between clusters using:

Single-linkage: Minimum distance between points in two clusters.

Complete-linkage: Maximum distance between points in two clusters.

Average linkage: Average distance between points in two clusters.

Produces a dendrogram showing the cluster hierarchy.

Steps:

Assign each point to its own cluster

Calculate the distance matrix

Merge the closest pair of clusters

Update the distance matrix

Repeat steps 3 and 4 until all points are in one cluster

**Divisive Clustering (Top-Down)**

All points start in one cluster.

Splits clusters as we move down the hierarchy.

Performs poorly for large datasets due to computational complexity.

Splits are based on similarity measures like variance.

Steps:

Start with all points in one cluster

Split the cluster with the largest variance into two sub-clusters

Continue splitting clusters with the largest variance

Stop when each point is in its cluster

## Partitional Algorithms

Partitional clustering algorithms partition the data into a set of disjoint clusters.

**K-Means Clustering**

The most popular partitional clustering algorithm.

Assumes that clusters are spherical in shape with similar spread.

Steps:

Choose the number of clusters K

Randomly initialize K cluster centers

Assign each point to the nearest cluster centre

Recalculate the cluster centres

Repeat steps 3 and 4 until convergence.

Advantages:

Simple and easy to implement

Scales well to large datasets

Disadvantages:

Requires specifying the number of clusters K

Sensitive to initialization

Assumes spherical clusters

**Expectation Maximization (EM)**

Assumes that data points are generated from a mixture of underlying probability distributions (Gaussian, Poisson etc.)

Estimates model parameters using an iterative refinement technique.

Steps:

Initialize model parameters randomly

E-step: Calculate the probability that each point belongs to each cluster

M-step: Re-estimate model parameters to maximize the likelihood

Repeat the E and M steps until convergence.

Advantages:

Can estimate the number of clusters automatically.

Can handle non-spherical clusters.

Disadvantages:

Computationally expensive

Can get stuck in local optima

## Clustering Large Databases

Traditional clustering algorithms do not scale well to large datasets due to their time and memory requirements. Various algorithms have been proposed to cluster large databases:

**BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)**

Agglomerative clustering algorithm.

Builds a CF tree (Clustering Feature tree) to cluster large amounts of numeric data.

The CF-tree summarizes subclusters using clustering features - number of points, linear sum and squared sum.

Performs clustering on a sample of the data and then assigns the remaining points.

Advantages:

Very efficient for large datasets

Can handle data streams

Disadvantage:

- Performance degrades for non-spherical clusters

Diagram and Code: Here

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**

Density-based clustering algorithm.

Finds clusters based on areas of high density separated by areas of low density.

Uses two parameters - epsilon (neighbourhood radius) and minPoints (minimum number of points).

Can identify noise points and handle clusters of arbitrary shape.

To scale to large datasets:

Use spatial indexes like R-trees to index the data

Only calculate the density for points in neighbourhoods

Advantage:

- Very efficient for clustering large spatial datasets

**CURE (Clustering Using Representatives)**

Agglomerative clustering algorithm.

Uses a representative set of points to summarize clusters.

The representatives are selected by shrinking each data point towards the centroid of the cluster.

To scale to large datasets:

- Use a small representative set of points instead of all points.

Advantage:

- Can handle clusters of arbitrary shape and density.

## Evaluation of Clusters

Various metrics can be used to evaluate the quality of clusters produced by clustering algorithms:

**Silhouette Coefficient**

Measures how closely points are grouped within a cluster compared to other clusters.

Calculates a silhouette score for each data point which ranges from -1 to 1.

A score near +1 indicates the point is well-matched to its own cluster and poorly matched to neighbouring clusters.

A score near -1 indicates the point is probably assigned to the wrong cluster.

The average of all silhouette scores is the Silhouette Coefficient for the overall clustering.

A high Silhouette Coefficient indicates a good clustering.

**Davies-Bouldin Index**

Measures the ratio of within-cluster distances to between-cluster distances.

Lower values of the index indicate better clustering.

Calculated as the average similarity measure between each cluster and its most similar cluster.

Smaller clusters that are well separated give a lower index.

Formula:

`DB = (1/n) ∑ (i=1 to n) max (j≠i) { Ri + Rj / d(Ci , Cj) }`

Where:

Ri is the within-cluster distance for cluster i

d(Ci, Cj) is the distance between clusters i and j

## Association Rule Mining

Association rule mining aims to find interesting associations or correlation relationships among items in a transaction database.

For example, in a supermarket transaction database:

80% of customers who buy milk also buy bread.

60% of customers who buy nappies also buy beer.

An association rule has two parts:

Antecedent (if)

Consequent (then)

For example:

{milk} → {bread}

{nappies} → {beer}

Two important metrics for association rules:

Support - how often the rule occurs in the dataset. Measures importance.

Confidence - how often items, in consequence, appear given items in the antecedent. Measures reliability.

For example:

Support = 20% (out of all transactions, 20% contain both milk and bread)

Confidence = 80% (out of transactions with milk, 80% also contain bread)

Popular algorithms:

Apriori - generates candidate itemsets and then scans the database to determine frequent itemsets.

FP-Growth - builds a frequent pattern tree and mines the database to find frequent itemsets.

To scale to large datasets:

Use parallel and distributed algorithms that partition the dataset across multiple machines.

Apriori and FP-Growth can both be parallelized.

## Apriori Algorithm

The Apriori algorithm is a popular method for finding frequent item sets and generating association rules.

**How it works**

It makes multiple passes over the transaction database.

In the first pass, it counts item occurrences to find frequent 1-itemsets.

In subsequent passes, it generates candidate k-itemsets from frequent (k-1)--itemsets found in the previous pass.

It then prunes the candidates that have any subset which is not frequent.

Finally, it scans the database to determine frequent candidates among the pruned candidates.

This process repeats until no frequent k-itemsets are found.

**Pseudocode**

Find all frequent 1-itemsets

Join frequent k-itemsets to generate candidates for (k + 1)-itemsets

Prune (k + 1)-itemsets whose k-subsets are infrequent

Scan the database and calculate support for candidates

Return all frequent (k + 1)-itemsets

Ck: Candidate itemset of size k

Lk: frequent itemset of size k

`L1 = {frequent items}; for (k = 1; Lk !=∅; k++) do begin C(k+1) = candidates generated from Lk; for each transaction t in the database do increment the count of all candidates in C(k+1) that are contained in t L(k+1) = candidates in Ck+1 with min_support end return ∪k Lk;`

**Pros and cons**

Pros:

Simple and easy-to-understand algorithm

Guarantees to find all frequent itemsets

Cons:

Multiple passes over the dataset required

Generates a lot of candidate sets, especially for large itemsets

Performance degrades with lower minimum support

## FP-Growth Algorithm

The FP-growth algorithm is an efficient method for mining frequent item sets without candidate generation.

**How it works**

FP-growth works by constructing a frequent pattern tree (FP-tree), which stores transaction data in a compressed form.

The root of the tree contains all frequent items. Each node represents an item and contains information about that item.

The FP-tree is constructed by first scanning the database to find frequent items. Transactions are then inserted into the tree.

Pattern growth is performed by constructing conditional FP-trees for each frequent item. These trees contain transactions that contain that item.

Frequent itemsets are generated by mining each conditional FP-tree. The itemsets found are combined with the conditional item to generate all frequent itemsets.

**Pseudocode**

Scan the database to find frequent items and sort them in descending order by frequency

Construct the FP-tree by inserting transactions

Call FP-growth on the FP-tree to generate frequent itemsets

FP-growth works as follows:

If the FP-tree contains a single path, generate itemsets from the path

Otherwise, recursively generate itemsets from conditional FP-trees

**Pros and cons**

Pros:

Efficient for mining frequent itemsets without candidate generation

Uses less memory than Apriori

Scales well for large datasets

Cons:

Requires sorting of items by frequency before tree construction

Performance degrades with low minimum support

## Parallel and Distributed Association Rule Mining

With the advent of big data, traditional association rule mining algorithms like Apriori and FP-growth face scalability challenges. Parallel and distributed techniques are needed to mine large datasets efficiently.

**MapReduce-based Apriori**

MapReduce can be used to parallelize the Apriori algorithm.

In the map phase, transactions are partitioned among map tasks. Each map task finds local frequent 1-itemsets.

The reduce phase aggregates the local frequent itemsets from all maps to find global frequent 1-itemsets.

This process is repeated to find higher-order frequent itemsets.

The candidate generation and frequency counting steps of Apriori are well suited for the map and reduce functions.

**Distributed FP-Growth**

The FP-tree structure can be partitioned and distributed across nodes.

Each node constructs a local FP tree from its partition of the dataset.

Local frequent itemsets are found by mining each local FP-tree.

The local frequent itemsets are aggregated to find global frequent itemsets.

The conditional FP-trees can also be distributed in the same way.

This approach scales well for large datasets since the FP-tree is divided and mined in parallel.