Intermediate Python

Lesson 29 of 33

Clustering and Classification Algorithms

Clustering and classification are two important tasks in machine learning that allow us to explore and understand our data. Clustering algorithms group unlabeled data points into clusters based on how similar they are to one another, while classification algorithms learn from labeled examples to predict the class or category of new data. In this article, we will explore some popular clustering and classification algorithms implemented in the Scikit-learn library in Python.

K-Means Clustering:

K-Means clustering is a simple and widely used clustering algorithm that divides a dataset into a specified number of clusters. It works by first initializing K centroids, and then assigning each data point to the cluster with the closest centroid. The centroids are then updated to the mean of the points in their respective clusters, and the process is repeated until the centroids stop moving or a maximum number of iterations is reached.

Here’s an example of using K-Means to cluster the iris dataset provided by Scikit-learn:

from sklearn import datasets
from sklearn.cluster import KMeans

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data

# Create an instance of the KMeans class
model = KMeans(n_clusters=3)

# Fit the model to the data
model.fit(X)

# Predict the cluster labels for the data
y_pred = model.predict(X)

print(y_pred)
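
Since K-Means is defined by its centroids, it is worth inspecting them after fitting. Both attributes below are standard Scikit-learn API:

# Fitted centroids: one row per cluster, one column per feature
print(model.cluster_centers_)

# Inertia: sum of squared distances from each sample to its closest
# centroid; lower means tighter clusters, useful for comparing values of K
print(model.inertia_)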

Hierarchical Clustering:

Hierarchical clustering is a clustering algorithm that produces a hierarchy of clusters rather than a single flat partition. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each data point as its own cluster and repeatedly merges the closest pair of clusters until all points belong to one cluster. Divisive clustering starts with all points in one cluster and repeatedly splits clusters until each point is its own cluster. Scikit-learn implements the agglomerative variant through the AgglomerativeClustering class.

Here’s an example of using agglomerative hierarchical clustering to cluster the iris dataset:

from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data

# Create an instance of the AgglomerativeClustering class
model = AgglomerativeClustering(n_clusters=3)

# Fit the model to the data
model.fit(X)

# Retrieve the cluster labels assigned during fitting
# (AgglomerativeClustering has no predict method)
y_pred = model.labels_

print(y_pred)
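
To see the hierarchy itself rather than just the final labels, you can build a linkage matrix with SciPy and plot a dendrogram. A minimal sketch, assuming SciPy and Matplotlib are installed:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the merge hierarchy with Ward's criterion (the default linkage
# used by AgglomerativeClustering) and draw it as a dendrogram
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()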

Logistic Regression:

Logistic regression is a classification algorithm that predicts the probability of a binary outcome (e.g. 0 or 1, True or False) from a set of features. The model is trained by minimizing the log loss (cross-entropy), which measures the difference between the predicted probabilities and the true labels.

Here’s an example of using logistic regression to predict whether or not a person will default on a loan based on their credit score and income. Scikit-learn does not ship a credit dataset, so this example assumes a local CSV file (here called credit.csv) with credit_score, income, and default columns, loaded with pandas:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the (hypothetical) credit dataset from a local CSV file
credit = pd.read_csv('credit.csv')
X = credit[['credit_score', 'income']]
y = credit['default']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the LogisticRegression class
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = model.predict(X_test)

print(y_pred)
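
Since logistic regression produces probabilities rather than just labels, you can inspect them with predict_proba, and score the model on the held-out data with accuracy_score (both standard Scikit-learn API):

from sklearn.metrics import accuracy_score

# Per-class probabilities for the test samples
# (column 0 = no default, column 1 = default)
print(model.predict_proba(X_test))

# Fraction of test samples classified correctly
print(accuracy_score(y_test, y_pred))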

Decision Trees:

Decision trees are a popular classification method that builds a tree of decision nodes, where each node tests one feature of the data. The tree is constructed by choosing the feature (and threshold) that best splits the data into classes, then repeating the process on each resulting subset until a stopping criterion is met.

Here’s an example of using a decision tree to predict the type of iris based on sepal length and width:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # use only sepal length and width
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the DecisionTreeClassifier class
model = DecisionTreeClassifier()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = model.predict(X_test)

print(y_pred)
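
To see which splits the tree actually learned, Scikit-learn can render the fitted tree as text with export_text:

from sklearn.tree import export_text

# Print the learned decision rules with readable feature names
print(export_text(model, feature_names=iris.feature_names[:2]))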

Random Forests:

Random forests are an ensemble learning method that combines the predictions of multiple decision trees to improve the accuracy and stability of the model. Each tree in the forest is trained on a random subset of the data (and considers a random subset of the features at each split). For classification, the final prediction is made by majority vote among the trees; for regression, by averaging their outputs.

Here’s an example of using a random forest to predict the type of iris based on all four features:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the RandomForestClassifier class
model = RandomForestClassifier(n_estimators=100)

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = model.predict(X_test)

print(y_pred)
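
A useful by-product of a fitted forest is a per-feature importance score, exposed through the feature_importances_ attribute:

# Impurity-based importance of each feature
# (higher means the feature contributed more to the splits)
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f'{name}: {importance:.3f}')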

Exercises

To review these concepts, we will go through a series of exercises designed to test your understanding and help you apply what you have learned.

Use the KMeans clustering algorithm to cluster the iris dataset into 3 clusters.

from sklearn import datasets
from sklearn.cluster import KMeans

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data

# Create an instance of the KMeans class
model = KMeans(n_clusters=3)

# Fit the model to the data
model.fit(X)

# Predict the cluster labels for the data
y_pred = model.predict(X)

print(y_pred)

The cluster labels for each sample in the dataset will be printed.

Using the credit dataset from the logistic regression example above, create a K-Means clustering model to group the data into 2 clusters. Print the cluster labels for each sample in the test set.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Load the (hypothetical) credit dataset from a local CSV file
credit = pd.read_csv('credit.csv')
X = credit[['credit_score', 'income']]
y = credit['default']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the KMeans class with 2 clusters
model = KMeans(n_clusters=2)

# Fit the model to the training data
model.fit(X_train)

# Predict the cluster labels of the test data
y_pred = model.predict(X_test)

print(y_pred)

The predicted labels for the test set will be printed.
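
Note that credit_score and income are on very different scales, and distance-based methods like K-Means are sensitive to that. A minimal sketch of standardizing the features first with StandardScaler, fit on the training split only so no information leaks from the test set:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = KMeans(n_clusters=2)
model.fit(X_train_scaled)
print(model.predict(X_test_scaled))

The same scaling step also helps the support vector machine and K-Nearest Neighbors models in the exercises below.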

Using the credit dataset from the logistic regression example above, create a support vector machine classification model to predict whether or not a person will default on a loan based on their credit score and income. Print the predicted labels for the test set.

import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load the (hypothetical) credit dataset from a local CSV file
credit = pd.read_csv('credit.csv')
X = credit[['credit_score', 'income']]
y = credit['default']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the SVC class
model = SVC()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = model.predict(X_test)

print(y_pred)

The predicted labels for the test set will be printed.

Using the credit dataset from the logistic regression example above, create a K-Nearest Neighbors classification model to predict whether or not a person will default on a loan based on their credit score and income. Print the predicted labels for the test set.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the (hypothetical) credit dataset from a local CSV file
credit = pd.read_csv('credit.csv')
X = credit[['credit_score', 'income']]
y = credit['default']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the KNeighborsClassifier class with 3 neighbors
model = KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = model.predict(X_test)

print(y_pred)

The predicted labels for the test set will be printed.

Using the credit dataset from the logistic regression example above, create a Gaussian Naive Bayes classification model to predict whether or not a person will default on a loan based on their credit score and income. Print the predicted labels for the test set.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Load the (hypothetical) credit dataset from a local CSV file
credit = pd.read_csv('credit.csv')
X = credit[['credit_score', 'income']]
y = credit['default']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create an instance of the GaussianNB class
model = GaussianNB()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = model.predict(X_test)

print(y_pred)

The predicted labels for the test set will be printed.