Principal component analysis
Resources
- Principal component analysis by Hervé Abdi and Lynne J. Williams is excellent at explaining PCA interpretation. It also covers some extensions to PCA.
- A Tutorial on Principal Component Analysis by Jonathon Shlens goes into more detail on the intuition behind PCA, while also discussing its applicability and limits.
- I learnt PCA from these lecture notes by Xavier Gendre. He provides a comprehensive walkthrough using a dataset of Skyrim bows.
Data
PCA assumes you have a dataframe consisting of numerical variables. This includes booleans and integers.
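If your dataframe mixes types, a common preparatory step is to keep only the numerical columns. The following is a generic pandas sketch with made-up data, not something Prince requires:

import pandas as pd

df = pd.DataFrame({
    'height': [1.7, 1.8],         # float
    'n_rooms': [3, 5],            # integer
    'has_garden': [True, False],  # boolean, usable as 0/1
    'city': ['Lyon', 'Paris'],    # not numerical: drop it or encode it first
})

# Keep the numerical columns (booleans included) and cast everything to floats
numeric_df = df.select_dtypes(include=['number', 'bool']).astype(float)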
As an example, let’s use a dataset describing the energy mix of each country in the world, in 2019. The dataset is normalized into percentages, which makes countries comparable with one another.
import prince
dataset = prince.datasets.load_energy_mix(year=2019, normalize=True)
dataset.head().style.format('{:.0%}')
| continent | country | coal | oil | gas | nuclear | hydro | wind | solar | other renewables |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 1% | 35% | 64% | 0% | 0% | 0% | 0% | 0% |
| South America | Argentina | 1% | 35% | 50% | 2% | 10% | 1% | 0% | 1% |
| Oceania | Australia | 28% | 34% | 30% | 0% | 2% | 3% | 3% | 1% |
| Europe | Austria | 9% | 37% | 22% | 0% | 25% | 4% | 1% | 3% |
| Asia | Azerbaijan | 0% | 33% | 65% | 0% | 2% | 0% | 0% | 0% |
Fitting
The `PCA` estimator implements the `fit`/`transform` API from scikit-learn.
pca = prince.PCA(
n_components=3,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=True,
copy=True,
check_input=True,
engine='sklearn',
random_state=42
)
pca = pca.fit(dataset)
The available parameters are:

- `n_components` — the number of components that are computed. You only need two if your intention is to visualize the two major components.
- `n_iter` — the number of iterations used for computing the SVD.
- `rescale_with_mean` — whether to subtract each column's mean.
- `rescale_with_std` — whether to divide each column by its standard deviation.
- `copy` — if `False`, the computations are done in place, which can have side-effects on the input data.
- `engine` — the SVD engine to use (one of `'fbpca'`, `'scipy'`, `'sklearn'`).
- `random_state` — controls the randomness of the SVD results.
These parameters are common to other Prince estimators.
Eigenvalues
The importance of a principal component is indicated by the proportion of dataset inertia it explains. This is called the explained inertia, and is obtained by dividing the eigenvalues obtained with SVD by the total inertia.
The ideal situation is when a large share of inertia is explained by a low number of principal components.
pca.eigenvalues_summary
| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 1.963 | 24.54% | 24.54% |
| 1 | 1.561 | 19.51% | 44.05% |
| 2 | 1.403 | 17.54% | 61.59% |
In this dataset, the first three components explain ~62% of the dataset inertia. This isn’t great, but isn’t bad either.
The columns in the above table are also available as separate attributes:
pca.eigenvalues_
array([1.96301214, 1.56069986, 1.40327823])
pca.percentage_of_variance_
array([24.5376518 , 19.50874822, 17.54097785])
pca.cumulative_percentage_of_variance_
array([24.5376518 , 44.04640003, 61.58737788])
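As a sanity check, because the columns were standardized, the total inertia equals the number of active columns, so the percentages above can be recomputed from the raw eigenvalues. This is just a verification sketch, not part of the Prince API:

import numpy as np

# With standardized columns, the total inertia is the number of active columns (8 here)
total_inertia = dataset.shape[1]
np.allclose(pca.eigenvalues_ / total_inertia * 100, pca.percentage_of_variance_)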
Eigenvalues can also be visualized with a scree plot.
pca.scree_plot()
Coordinates
The row principal coordinates can be obtained once the `PCA` has been fitted to the data.
pca.transform(dataset).head()
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | -2.189068 | 0.380243 | -0.388572 |
| South America | Argentina | -1.244981 | 0.801917 | -0.389456 |
| Oceania | Australia | 0.203969 | -1.470254 | 0.531819 |
| Europe | Austria | 0.847122 | 1.008296 | -0.521998 |
| Asia | Azerbaijan | -2.190535 | 0.632250 | -0.365515 |
The `transform` method is in fact an alias for `row_coordinates`:
pca.transform(dataset).equals(pca.row_coordinates(dataset))
True
In other words, this transforms the original dataset into factor scores.
The column coordinates are obtained during the PCA training process.
pca.column_coordinates_
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |
Visualization
The row and column coordinates can be visualized together on a scatter chart.
pca.plot(
dataset,
x_component=0,
y_component=1,
color_rows_by='continent',
show_row_markers=True,
show_column_markers=True,
show_row_labels=False,
row_labels_column=None, # for DataFrames with a MultiIndex
show_column_labels=False
)
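If you prefer full control over the figure, you can also plot the row coordinates by hand. Here is a matplotlib sketch (not part of Prince's API) that colours countries by continent:

import matplotlib.pyplot as plt

coordinates = pca.row_coordinates(dataset)

fig, ax = plt.subplots()
for continent, group in coordinates.groupby(level='continent'):
    ax.scatter(group[0], group[1], label=continent)
ax.axhline(0, color='grey', linewidth=0.5)
ax.axvline(0, color='grey', linewidth=0.5)
ax.set_xlabel('component 0')
ax.set_ylabel('component 1')
ax.legend()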
Contributions

Contributions indicate how much each row or column accounts for a component's inertia; they are expressed as percentages that sum to 100% for each component.
pca.row_contributions_.head().style.format('{:.0%}')
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 3% | 0% | 0% |
| South America | Argentina | 1% | 1% | 0% |
| Oceania | Australia | 0% | 2% | 0% |
| Europe | Austria | 0% | 1% | 0% |
| Asia | Azerbaijan | 3% | 0% | 0% |
Observations with high contributions and opposite signs can be contrasted with one another to help interpret a component, because they represent its two extremes.
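For instance, here is a quick pandas sketch (not a dedicated Prince method) to surface the countries that weigh most on the first component:

# Countries sorted by their contribution to the first component
pca.row_contributions_[0].sort_values(ascending=False).head()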
Column contributions are also available.
pca.column_contributions_.style.format('{:.0%}')
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 3% | 10% | 39% |
| oil | 2% | 11% | 39% |
| gas | 38% | 2% | 1% |
| nuclear | 5% | 0% | 11% |
| hydro | 10% | 36% | 0% |
| wind | 14% | 2% | 9% |
| solar | 9% | 22% | 0% |
| other renewables | 20% | 17% | 1% |
Cosine similarities

Cosine similarities measure how well each row or column is represented by a given component: the closer a value is to 1, the better that point is captured by the component.
pca.row_cosine_similarities(dataset).head()
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 0.902764 | 0.027238 | 0.028445 |
| South America | Argentina | 0.626517 | 0.259936 | 0.061309 |
| Oceania | Australia | 0.008468 | 0.439999 | 0.057570 |
| Europe | Austria | 0.236087 | 0.334470 | 0.089643 |
| Asia | Azerbaijan | 0.868801 | 0.072377 | 0.024190 |
pca.column_cosine_similarities_
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.052994 | 0.156400 | 0.548707 |
| oil | 0.031046 | 0.175492 | 0.544430 |
| gas | 0.751563 | 0.033485 | 0.009381 |
| nuclear | 0.096294 | 0.000007 | 0.160487 |
| hydro | 0.193937 | 0.554750 | 0.000268 |
| wind | 0.269062 | 0.026085 | 0.132928 |
| solar | 0.172788 | 0.347260 | 0.000001 |
| other renewables | 0.395327 | 0.267222 | 0.007075 |
Column correlations

The correlation between each original variable and each component is also available. Squaring these correlations gives back the column cosine similarities, as checked below.
pca.column_correlations
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |
(pca.column_correlations ** 2).equals(pca.column_cosine_similarities_)
True
Inverse transformation
You can transform row projections back into their original space by using the inverse_transform
method.
reconstructed = pca.inverse_transform(pca.transform(dataset))
reconstructed.head()
| continent | country | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 0.024204 | 0.369866 | 0.599244 | -0.007082 | 0.024826 | -0.003249 | 0.000650 | -0.008460 |
| South America | Argentina | 0.027773 | 0.366342 | 0.487562 | 0.008596 | 0.090082 | 0.006963 | 0.001391 | 0.011292 |
| Oceania | Australia | 0.287372 | 0.425348 | 0.208594 | 0.056641 | -0.018484 | 0.025867 | 0.015104 | -0.000441 |
| Europe | Austria | 0.060875 | 0.410393 | 0.219553 | 0.040122 | 0.183231 | 0.034237 | 0.006167 | 0.045423 |
| Asia | Azerbaijan | 0.013098 | 0.354083 | 0.606928 | -0.006558 | 0.042620 | -0.004640 | -0.000439 | -0.005092 |
This makes it possible to measure the reconstruction error of the PCA, which is usually defined as the $L^2$ norm of the difference between the reconstructed and the original dataset.
import numpy as np
np.linalg.norm(reconstructed.values - dataset.values, ord=2)
1.1927159589872411
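As a rough illustration (this is a sketch, not part of the original walkthrough, and the exact figures depend on the randomized SVD), the reconstruction error shrinks as more components are kept and should approach zero once all of them are used:

# Reconstruction error as a function of the number of components kept
for n_components in range(1, dataset.shape[1] + 1):
    model = prince.PCA(n_components=n_components, random_state=42).fit(dataset)
    reconstructed_k = model.inverse_transform(model.transform(dataset))
    error = np.linalg.norm(reconstructed_k.values - dataset.values, ord=2)
    print(f'{n_components} components -> error {error:.4f}')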
Supplementary data
Active rows and columns make up the dataset you fit the PCA with. Anything you provide afterwards is considered supplementary data.
For example, we can fit the PCA on all countries that are not part of North America.
active = dataset.query("continent != 'North America'")
pca = prince.PCA().fit(active)
The data for North America can still be projected onto the principal components.
pca.transform(dataset).query("continent == 'North America'")
| continent | country | component 0 | component 1 |
|---|---|---|---|
| North America | Canada | 0.009864 | 1.332859 |
| North America | Mexico | -0.695023 | -0.402230 |
| North America | Trinidad and Tobago | -3.072833 | 1.393067 |
| North America | United States | -0.226122 | -0.433260 |
As for supplementary columns, they must be provided during the `fit` call.
pca = prince.PCA().fit(dataset, supplementary_columns=['hydro', 'wind', 'solar'])
pca.column_correlations
| variable | component 0 | component 1 |
|---|---|---|
| coal | 0.621633 | -0.380008 |
| oil | 0.048966 | 0.948650 |
| gas | -0.916078 | -0.331546 |
| nuclear | 0.384458 | -0.401675 |
| other renewables | 0.467116 | 0.086656 |
| hydro | 0.246646 | 0.037118 |
| wind | 0.195675 | 0.184247 |
| solar | 0.251247 | 0.076237 |
There can be supplementary rows and columns at the same time.
pca = prince.PCA().fit(active, supplementary_columns=['hydro', 'wind', 'solar'])
Performance
Under the hood, Prince uses a randomized version of SVD. This is much faster than traditional SVD. However, the results may have a small inherent randomness, which can be controlled with the `random_state` parameter. The accuracy of the results can be increased by raising the `n_iter` parameter.
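For instance, if reproducibility and accuracy matter more than speed, you might fix the seed and raise the number of iterations. A small sketch reusing the parameters shown earlier:

# More power iterations make the randomized SVD more accurate, at the cost of speed
pca_precise = prince.PCA(n_components=3, n_iter=10, random_state=42).fit(dataset)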
By default, `prince` uses scikit-learn's randomized SVD implementation (the one used in `TruncatedSVD`).
Prince supports different SVD backends. For now, the only other supported randomized backend is Facebook's randomized SVD implementation, fbpca:
prince.PCA(engine='fbpca')
PCA(engine='fbpca')
You can also use a non-randomized SVD implementation, which relies on the `scipy.linalg.svd` function:
prince.PCA(engine='scipy')
PCA(engine='scipy')
Here is a very crude benchmark that compares each engine on different dataset sizes.
import itertools
import numpy as np
import pandas as pd
N = [100, 10_000, 100_000]
P = [3, 10, 30]
ENGINES = ['sklearn', 'fbpca', 'scipy']
perf = []
for n, p, engine in itertools.product(N, P, ENGINES):
# Too slow
if engine == 'scipy' and n > 10_000:
continue
X = pd.DataFrame(np.random.random(size=(n, p)))
duration = %timeit -q -n 1 -r 3 -o prince.PCA(engine=engine, n_iter=3).fit(X)
perf.append({
'n': n,
'p': p,
'engine': engine,
'seconds': duration.average
})
(
pd.DataFrame(perf)
.set_index(['n', 'p'])
.groupby(['n', 'p'], group_keys=False)
.apply(lambda x: x[['engine', 'seconds']]
.sort_values('seconds'))
)
| n | p | engine | seconds |
|---|---|---|---|
| 100 | 3 | scipy | 0.002407 |
| 100 | 3 | fbpca | 0.002465 |
| 100 | 3 | sklearn | 0.002710 |
| 100 | 10 | scipy | 0.002431 |
| 100 | 10 | fbpca | 0.002704 |
| 100 | 10 | sklearn | 0.002723 |
| 100 | 30 | scipy | 0.002971 |
| 100 | 30 | sklearn | 0.003015 |
| 100 | 30 | fbpca | 0.003187 |
| 10000 | 3 | fbpca | 0.006824 |
| 10000 | 3 | sklearn | 0.012736 |
| 10000 | 3 | scipy | 0.305234 |
| 10000 | 10 | fbpca | 0.009869 |
| 10000 | 10 | sklearn | 0.011739 |
| 10000 | 10 | scipy | 0.733266 |
| 10000 | 30 | fbpca | 0.011290 |
| 10000 | 30 | sklearn | 0.018454 |
| 10000 | 30 | scipy | 1.834851 |
| 100000 | 3 | fbpca | 0.024184 |
| 100000 | 3 | sklearn | 0.040077 |
| 100000 | 10 | fbpca | 0.061728 |
| 100000 | 10 | sklearn | 0.171649 |
| 100000 | 30 | fbpca | 0.129219 |
| 100000 | 30 | sklearn | 0.218774 |