clustering Module

Clustering algorithms for unsupervised learning tasks.

@author: drusk

class pml.unsupervised.clustering.ClusteredDataSet(dataset, cluster_assignments)

A collection of data which has been analysed by a clustering algorithm. It contains both the original DataSet and the results of the clustering. It provides methods for analysing these clustering results.

__init__(dataset, cluster_assignments)

Creates a new ClusteredDataSet.

Args:
    dataset: model.DataSet
        A dataset which does not have cluster assignments.
    cluster_assignments: pandas.Series
        A Series with the cluster assignment for each sample in the dataset.
calculate_purity()

Calculate the purity, a measurement of quality for the clustering results.

Each cluster is assigned to the class which is most frequent in the cluster. Purity is then the fraction of samples whose label matches the class assigned to their cluster.

Returns:
    A number between 0 and 1. Poor clusterings have a purity close to 0 while a perfect clustering has a purity of 1.
Raises:
    UnlabelledDataSetError if the dataset is not labelled.
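The purity calculation described above can be sketched in plain Python. This is an illustration only and does not call pml: the function name purity and the use of two parallel lists (true labels and cluster ids) are assumptions made for the example, not the library's implementation.

```python
from collections import Counter


def purity(labels, clusters):
    """Fraction of samples whose label matches their cluster's majority label.

    Hypothetical helper for illustration; not part of pml.
    """
    # Group the true labels by cluster id.
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, []).append(label)
    # Each cluster contributes the count of its most frequent label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )
    return majority_total / len(labels)


labels = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
# Cluster 0 holds two "a"; cluster 1 holds one "a" and three "b".
print(purity(labels, clusters))  # 5 of 6 samples match their cluster's class
```

Note that purity never penalises extra clusters: giving every sample its own cluster yields a purity of 1.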
calculate_rand_index()

Calculate the Rand index, a measurement of quality for the clustering results. It is essentially the fraction of pairwise decisions that the clustering gets right.

The clustering is viewed as a series of decisions, one for each of the N*(N-1)/2 pairs of samples in the dataset. A decision is considered correct if the two samples in the pair have the same label and are in the same cluster, or have different labels and are in different clusters. The number of correct decisions divided by the total number of decisions gives the Rand index, or accuracy.

Returns:
    The accuracy, a number between 0 and 1. The closer to 1, the better the clustering.
Raises:
    UnlabelledDataSetError if the dataset is not labelled.
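The pairwise-decision view above translates directly into a short plain-Python sketch. It does not call pml; the function name rand_index and the list-based inputs are assumptions for illustration, and the sketch expects at least two samples.

```python
from itertools import combinations


def rand_index(labels, clusters):
    """Fraction of sample pairs on which labels and clusters agree.

    Hypothetical helper for illustration; not part of pml.
    """
    # All N*(N-1)/2 unordered pairs of sample positions.
    pairs = list(combinations(range(len(labels)), 2))
    # A decision is correct when "same label" and "same cluster" agree.
    correct = sum(
        (labels[i] == labels[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return correct / len(pairs)


labels = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(rand_index(labels, clusters))  # 10 correct decisions out of 15 pairs
```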
get_cluster_assignments()

Retrieves the cluster assignments produced for this dataset by a clustering algorithm.

Returns:
    A pandas Series. It shares the index of the original dataset, and each value is a number identifying the cluster that sample is a part of.
pml.unsupervised.clustering.create_random_centroids(dataset, k)

Initializes centroids at random positions.

The random value chosen for each feature will always be limited to the range of values found in the dataset. For example, if a certain feature has a minimum value of 0 in the dataset, and a maximum value of 9, the random value chosen for that feature will lie between 0 and 9.

Args:
    dataset: DataSet
        The DataSet to create the random centroids for.
    k: int
        The number of centroids to create.
Returns:
    A list of centroids. Each centroid is a pandas Series with the same labels as the dataset’s headers.
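A minimal sketch of range-limited random initialisation follows. It uses plain dicts in place of pandas Series and does not call pml. The name random_centroids, the rows-as-dicts input, and the choice of a uniform distribution are all assumptions for illustration; the description above only states that values stay within each feature's observed range.

```python
import random


def random_centroids(rows, k, rng=random):
    """Create k centroids whose features lie within the observed ranges.

    rows: list of dicts mapping feature name -> numeric value.
    Hypothetical helper for illustration; not part of pml.
    """
    features = list(rows[0].keys())
    # Observed minimum and maximum of every feature.
    lows = {f: min(row[f] for row in rows) for f in features}
    highs = {f: max(row[f] for row in rows) for f in features}
    # Uniform sampling is an assumption made for this sketch.
    return [
        {f: rng.uniform(lows[f], highs[f]) for f in features}
        for _ in range(k)
    ]


rows = [
    {"height": 0.0, "width": 2.0},
    {"height": 9.0, "width": 3.0},
    {"height": 4.0, "width": 5.0},
]
for centroid in random_centroids(rows, k=2, rng=random.Random(0)):
    print(centroid)  # height within [0, 9], width within [2, 5]
```

Passing a seeded random.Random instance, as in the usage lines, keeps the result reproducible between runs.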
pml.unsupervised.clustering.kmeans(dataset, k=2, create_centroids=create_random_centroids)

K-means clustering algorithm.

This algorithm partitions a dataset into k clusters in which each observation (sample) belongs to the cluster with the nearest mean.

Args:
    dataset: model.DataSet
        The DataSet to perform the clustering on.
    k: int
        The number of clusters to partition the dataset into.
    create_centroids: function
        The function specifying how to create the initial centroids for the clusters. Defaults to creating them randomly.
Returns:
    A ClusteredDataSet which contains the cluster assignments as well as the original data. In the cluster assignments, each sample index is assigned a numerical value representing the cluster it is part of.
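For readers unfamiliar with the algorithm, the following is a sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members) in plain Python. It is not pml's implementation: the names kmeans_sketch and random_init, the tuple-based input, the Euclidean distance, the max_iter cap, and the handling of empty clusters are assumptions for illustration. The create_centroids argument mirrors the idea of a pluggable initialiser but takes a list of points rather than a DataSet.

```python
import math
import random


def random_init(points, k, rng=random):
    """Draw k centroids uniformly within each feature's observed range."""
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    return [[rng.uniform(lows[d], highs[d]) for d in dims] for _ in range(k)]


def kmeans_sketch(points, k=2, create_centroids=random_init, max_iter=100):
    """Return (assignments, centroids) for a list of numeric tuples."""
    centroids = [list(c) for c in create_centroids(points, k)]
    dims = range(len(points[0]))
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the loop has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in dims
                ]
    return assignments, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# A deterministic initialiser: start from the first and fourth points.
first_and_fourth = lambda pts, k: [pts[0], pts[3]]
assignments, centroids = kmeans_sketch(points, k=2, create_centroids=first_and_fourth)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

With random initialisation the cluster numbering, and sometimes the partition itself, can differ between runs, which is one reason to allow a custom initialiser.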
