In machine learning, datasets often have a large number of features, or dimensions. This can lead to the “curse of dimensionality”, where models become increasingly difficult to train as the number of dimensions grows. One way to overcome this is to use dimensionality reduction techniques, which reduce the number of dimensions in the data while retaining as much of the original information as possible.
There are many dimensionality reduction techniques, including principal component analysis (PCA), independent component analysis (ICA), t-distributed stochastic neighbor embedding (t-SNE), and linear discriminant analysis (LDA). In this article, we will focus on PCA and LDA, as they are commonly used and relatively easy to understand; the exercises at the end also touch on t-SNE and on kernel and randomized variants of PCA.
Principal Component Analysis (PCA):
PCA is a linear dimensionality reduction technique that projects the data onto a lower-dimensional space by finding the directions of maximum variance in the data. The resulting projection can be used to visualize the data, to reduce the number of dimensions for training a machine learning model, or for data compression.
Here is an example of how to use PCA to reduce the number of dimensions in a dataset using the scikit-learn library in Python:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Create an instance of the PCA class
pca = PCA(n_components=2)
# Fit the model to the data and transform the data
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
This will print the shape of the transformed data, which should be (569, 2), indicating that the number of dimensions has been reduced from 30 to 2.
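To see how much of the original information those two components actually retain, you can inspect the fitted model’s explained_variance_ratio_ attribute. A minimal sketch follows; the standardization step is an assumption added here (PCA is sensitive to feature scale, and the breast cancer features have very different ranges), not something the example above requires:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize so features with large numeric ranges do not dominate the variance
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2)
pca.fit(X_scaled)
# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)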
Linear Discriminant Analysis (LDA):
LDA is a supervised dimensionality reduction technique that aims to maximize the separation between different classes in the data. It is particularly useful for classification tasks, as it can provide a lower-dimensional projection of the data that is more suitable for training a classifier.
Here is an example of how to use LDA to reduce the number of dimensions in a dataset using the scikit-learn library in Python:
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Create an instance of the LinearDiscriminantAnalysis class
lda = LinearDiscriminantAnalysis(n_components=1)
# Fit the model to the data and transform the data
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)
This will print the shape of the transformed data, which should be (569, 1). LDA can produce at most one fewer components than there are classes, and the breast cancer dataset has only two classes, so the 30 original features are reduced to a single discriminant axis rather than two.
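Because LDA is supervised, the reduced representation is often separable enough for a simple classifier, which is the point of using it before classification. The sketch below is one way to illustrate this; the 80/20 split and the choice of logistic regression are illustrative assumptions, not part of the example above:
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)
# Fit LDA on the training split only, then project both splits
lda = LinearDiscriminantAnalysis(n_components=1)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
# Train a simple classifier on the one-dimensional projection
clf = LogisticRegression()
clf.fit(X_train_lda, y_train)
print(clf.score(X_test_lda, y_test))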
Exercises
To review these concepts, we will go through a series of exercises designed to test your understanding and apply what you have learned.
Use PCA to reduce the number of dimensions in the iris dataset to 2 and plot the resulting data using a 2D scatter plot.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the iris dataset
data = load_iris()
X = data.data
y = data.target
# Create an instance of the PCA class
pca = PCA(n_components=2)
# Fit the model to the data and transform the data
X_reduced = pca.fit_transform(X)
# Plot the data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.show()
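The iris features happen to share similar scales, but in general PCA is scale-sensitive, so it is worth knowing the standardized variant. A short sketch, assuming you want to standardize before projecting:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize the features, then apply PCA as before
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)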
Use LDA to reduce the number of dimensions in the iris dataset to 2 and plot the resulting data using a 2D scatter plot.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt
# Load the iris dataset
data = load_iris()
X = data.data
y = data.target
# Create an instance of the LinearDiscriminantAnalysis class
lda = LinearDiscriminantAnalysis(n_components=2)
# Fit the model to the data and transform the data
X_reduced = lda.fit_transform(X, y)
# Plot the data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.show()
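Note that n_components=2 is the maximum LDA allows here: LDA yields at most one fewer components than there are classes, and iris has three. To see how much between-class separation each discriminant axis carries, you can check the model’s explained_variance_ratio_ attribute, as in this quick sketch:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
data = load_iris()
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(data.data, data.target)
# Ratio of between-class variance explained by each discriminant axis
print(lda.explained_variance_ratio_)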
Use t-SNE to reduce the number of dimensions in the iris dataset to 2 and plot the resulting data using a 2D scatter plot.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Load the iris dataset
data = load_iris()
X = data.data
y = data.target
# Create an instance of the TSNE class
tsne = TSNE(n_components=2)
# Fit the model to the data and transform the data
X_reduced = tsne.fit_transform(X)
# Plot the data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.show()
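Unlike PCA and LDA, t-SNE is non-linear and stochastic: repeated runs can produce different layouts. If you want a reproducible plot, fix the seed, as in this minimal sketch (the seed value is arbitrary):
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
# Fixing random_state makes repeated runs produce the same embedding
tsne = TSNE(n_components=2, random_state=42)
X_reduced = tsne.fit_transform(load_iris().data)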
Use kernel PCA to reduce the number of dimensions in the iris dataset to 2 and plot the resulting data using a 2D scatter plot.
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt
# Load the iris dataset
data = load_iris()
X = data.data
y = data.target
# Create an instance of the KernelPCA class
kpca = KernelPCA(n_components=2, kernel='rbf')
# Fit the model to the data and transform the data
X_reduced = kpca.fit_transform(X)
# Plot the data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.show()
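The shape of the RBF kernel is controlled by its gamma parameter, which the example above leaves at the library default. Here is a sketch showing that knob; the value 0.1 is an illustrative assumption, not a tuned setting:
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
# gamma controls the width of the RBF kernel; larger values emphasize
# more local structure (0.1 is an arbitrary illustrative choice)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_reduced = kpca.fit_transform(load_iris().data)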
Use randomized PCA to reduce the number of dimensions in the iris dataset to 2 and plot the resulting data using a 2D scatter plot.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the iris dataset
data = load_iris()
X = data.data
y = data.target
# Create an instance of the PCA class with the randomized SVD solver
rpca = PCA(n_components=2, svd_solver='randomized', random_state=42)
# Fit the model to the data and transform the data
X_reduced = rpca.fit_transform(X)
# Plot the data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.show()
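On a dataset as small as iris the randomized solver brings no real speedup; its advantage appears on large matrices, where randomized SVD approximates the leading components much faster than a full decomposition. A small sketch comparing the two solvers (component signs may differ, but the explained variance should agree closely):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X = load_iris().data
full = PCA(n_components=2, svd_solver='full').fit(X)
rand = PCA(n_components=2, svd_solver='randomized', random_state=42).fit(X)
# Both solvers should capture essentially the same variance
print(full.explained_variance_ratio_, rand.explained_variance_ratio_)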