# Principal component analysis

## Resources

• Principal component analysis by Hervé Abdi and Lynne J. Williams is excellent at explaining PCA interpretation. It also covers some extensions to PCA.
• A Tutorial on Principal Component Analysis by Jonathon Shlens goes into more detail on the intuition behind PCA, while also discussing its applicability and limits.
• I learnt PCA from these lecture notes from Xavier Gendre. He provides a comprehensive walkthrough using a dataset of Skyrim bows.

## Data

PCA assumes you have a dataframe consisting of numerical variables. This includes booleans and integers.

As an example, let’s use a dataset describing the energy mix of each country in the world, in 2019. The dataset is normalized into percentages, which makes countries comparable with one another.

```python
import prince

dataset = prince.datasets.load_energy_mix(year=2019, normalize=True)
dataset.head().style.format('{:.0%}')
```
| continent | country | coal | oil | gas | nuclear | hydro | wind | solar | other renewables |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 1% | 35% | 64% | 0% | 0% | 0% | 0% | 0% |
| South America | Argentina | 1% | 35% | 50% | 2% | 10% | 1% | 0% | 1% |
| Oceania | Australia | 28% | 34% | 30% | 0% | 2% | 3% | 3% | 1% |
| Europe | Austria | 9% | 37% | 22% | 0% | 25% | 4% | 1% | 3% |
| Asia | Azerbaijan | 0% | 33% | 65% | 0% | 2% | 0% | 0% | 0% |

## Fitting

The `PCA` estimator implements the `fit/transform` API from scikit-learn.

```python
pca = prince.PCA(
    n_components=3,
    n_iter=3,
    rescale_with_mean=True,
    rescale_with_std=True,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
pca = pca.fit(dataset)
```

The available parameters are:

• `n_components` — the number of components to compute. Two are enough if your intention is to visualize the two major components.
• `n_iter` — the number of iterations used to compute the SVD.
• `rescale_with_mean` — whether to subtract each column's mean.
• `rescale_with_std` — whether to divide each column by its standard deviation.
• `copy` — if `False`, the computations are done in place, which may modify the input data.
• `check_input` — whether to validate the input data.
• `engine` — which SVD engine to use (one of `['fbpca', 'scipy', 'sklearn']`).
• `random_state` — controls the randomness of the SVD results.

These parameters are common to the other Prince estimators.

## Eigenvalues

The importance of a principal component is indicated by the proportion of dataset inertia it explains. This is called the explained inertia, and is obtained by dividing the eigenvalues obtained with SVD by the total inertia.

The ideal situation is when a large share of inertia is explained by a low number of principal components.
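As a sanity check, the explained-inertia computation can be reproduced with plain NumPy. This is a sketch on made-up data, not Prince's actual implementation: after standardizing the columns, the eigenvalues are the squared singular values of the scaled matrix, and their sum (the total inertia) equals the number of columns.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random(size=(100, 8))  # stand-in for a numeric dataset

# standardize, as rescale_with_mean/rescale_with_std do
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# the eigenvalues are the squared singular values of Z / sqrt(n)
eigenvalues = np.linalg.svd(Z / np.sqrt(len(Z)), compute_uv=False) ** 2

# with standardized columns, the total inertia equals the number of columns
explained_inertia = eigenvalues / eigenvalues.sum()
```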

```python
pca.eigenvalues_summary
```

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 1.963 | 24.54% | 24.54% |
| 1 | 1.561 | 19.51% | 44.05% |
| 2 | 1.403 | 17.54% | 61.59% |

In this dataset, the first three components explain ~62% of the dataset inertia. This isn’t great, but isn’t bad either.

The columns in the above table are also available as separate attributes:

```python
pca.eigenvalues_
```
```
array([1.96301214, 1.56069986, 1.40327823])
```
```python
pca.percentage_of_variance_
```
```
array([24.5376518 , 19.50874822, 17.54097785])
```
```python
pca.cumulative_percentage_of_variance_
```
```
array([24.5376518 , 44.04640003, 61.58737788])
```

Eigenvalues can also be visualized with a scree plot.

```python
pca.scree_plot()
```

## Coordinates

The row principal coordinates can be obtained once the `PCA` has been fitted to the data.

```python
pca.transform(dataset).head()
```

| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | -2.189068 | 0.380243 | -0.388572 |
| South America | Argentina | -1.244981 | 0.801917 | -0.389456 |
| Oceania | Australia | 0.203969 | -1.470254 | 0.531819 |
| Europe | Austria | 0.847122 | 1.008296 | -0.521998 |
| Asia | Azerbaijan | -2.190535 | 0.632250 | -0.365515 |

The `transform` method is in fact an alias for `row_coordinates`:

```python
pca.transform(dataset).equals(pca.row_coordinates(dataset))
```
```
True
```

This transforms the original dataset into factor scores.

The column coordinates are obtained during the PCA training process.

```python
pca.column_coordinates_
```

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |

## Visualization

The row and column coordinates can be visualized together with a scatter chart.

```python
pca.plot(
    dataset,
    x_component=0,
    y_component=1,
    color_rows_by='continent',
    show_row_markers=True,
    show_column_markers=True,
    show_row_labels=False,
    row_labels_column=None,  # for DataFrames with a MultiIndex
    show_column_labels=False
)
```

## Contributions

Row contributions indicate how much each observation accounts for each component's inertia.

```python
pca.row_contributions_.head().style.format('{:.0%}')
```
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 3% | 0% | 0% |
| South America | Argentina | 1% | 1% | 0% |
| Oceania | Australia | 0% | 2% | 0% |
| Europe | Austria | 0% | 1% | 0% |
| Asia | Azerbaijan | 3% | 0% | 0% |

Observations with high contributions and coordinates of opposite signs can be contrasted to help interpret a component, because they represent the component's two extremes.
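The contribution of a row to a component is its squared coordinate divided by the component's total squared coordinates, so each column of the contributions table sums to 100%. A minimal sketch of this relationship, using made-up coordinates rather than Prince's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))  # stand-in for row coordinates

# each row's share of each component's inertia
contributions = coords ** 2 / (coords ** 2).sum(axis=0)
```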

Column contributions are also available.

```python
pca.column_contributions_.style.format('{:.0%}')
```
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 3% | 10% | 39% |
| oil | 2% | 11% | 39% |
| gas | 38% | 2% | 1% |
| nuclear | 5% | 0% | 11% |
| hydro | 10% | 36% | 0% |
| wind | 14% | 2% | 9% |
| solar | 9% | 22% | 0% |
| other renewables | 20% | 17% | 1% |

## Cosine similarities

Cosine similarities (squared cosines) measure how well each observation is represented by each component.

```python
pca.row_cosine_similarities(dataset).head()
```

| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 0.902764 | 0.027238 | 0.028445 |
| South America | Argentina | 0.626517 | 0.259936 | 0.061309 |
| Oceania | Australia | 0.008468 | 0.439999 | 0.057570 |
| Europe | Austria | 0.236087 | 0.334470 | 0.089643 |
| Asia | Azerbaijan | 0.868801 | 0.072377 | 0.024190 |
```python
pca.column_cosine_similarities_
```

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.052994 | 0.156400 | 0.548707 |
| oil | 0.031046 | 0.175492 | 0.544430 |
| gas | 0.751563 | 0.033485 | 0.009381 |
| nuclear | 0.096294 | 0.000007 | 0.160487 |
| hydro | 0.193937 | 0.554750 | 0.000268 |
| wind | 0.269062 | 0.026085 | 0.132928 |
| solar | 0.172788 | 0.347260 | 0.000001 |
| other renewables | 0.395327 | 0.267222 | 0.007075 |
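A squared cosine is the share of a point's squared distance to the origin that falls on a given component. A sketch with made-up coordinates: assuming `coords` held the coordinates on all components, each row of the result would sum to one; when only the first few components are kept, as above, the rows sum to less than one.

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(5, 3))  # stand-in for coordinates on all components

# share of each point's squared norm carried by each component
cos2 = coords ** 2 / (coords ** 2).sum(axis=1, keepdims=True)
```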

## Column correlations

The correlations between the original variables and the components are also available. Because this PCA is standardized, they coincide with the column coordinates.

```python
pca.column_correlations
```

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |
```python
(pca.column_correlations ** 2).equals(pca.column_cosine_similarities_)
```
```
True
```

## Inverse transformation

You can transform row projections back into their original space by using the `inverse_transform` method.

```python
reconstructed = pca.inverse_transform(pca.transform(dataset))
reconstructed.head()
```

| continent | country | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 0.024204 | 0.369866 | 0.599244 | -0.007082 | 0.024826 | -0.003249 | 0.000650 | -0.008460 |
| South America | Argentina | 0.027773 | 0.366342 | 0.487562 | 0.008596 | 0.090082 | 0.006963 | 0.001391 | 0.011292 |
| Oceania | Australia | 0.287372 | 0.425348 | 0.208594 | 0.056641 | -0.018484 | 0.025867 | 0.015104 | -0.000441 |
| Europe | Austria | 0.060875 | 0.410393 | 0.219553 | 0.040122 | 0.183231 | 0.034237 | 0.006167 | 0.045423 |
| Asia | Azerbaijan | 0.013098 | 0.354083 | 0.606928 | -0.006558 | 0.042620 | -0.004640 | -0.000439 | -0.005092 |

This allows measuring the reconstruction error of the PCA, which is usually defined as the $L^2$ norm between the reconstructed dataset and the original one.

```python
import numpy as np

np.linalg.norm(reconstructed.values - dataset.values, ord=2)
```
```
1.1927159589872411
```

## Supplementary data

Active rows and columns make up the dataset you fit the PCA with. Anything you provide afterwards is considered supplementary data.

For example, we can fit the PCA on all countries that are not part of North America.

```python
active = dataset.query("continent != 'North America'")
pca = prince.PCA().fit(active)
```

The data for North America can still be projected onto the principal components.

```python
pca.transform(dataset).query("continent == 'North America'")
```

| continent | country | component 0 | component 1 |
|---|---|---|---|
| North America | Canada | 0.009864 | 1.332859 |
| North America | Mexico | -0.695023 | -0.402230 |
| North America | Trinidad and Tobago | -3.072833 | 1.393067 |
| North America | United States | -0.226122 | -0.433260 |

As for supplementary columns, they must be provided during the `fit` call.

```python
pca = prince.PCA().fit(dataset, supplementary_columns=['hydro', 'wind', 'solar'])
pca.column_correlations
```

| variable | component 0 | component 1 |
|---|---|---|
| coal | 0.621633 | -0.380008 |
| oil | 0.048966 | 0.948650 |
| gas | -0.916078 | -0.331546 |
| nuclear | 0.384458 | -0.401675 |
| other renewables | 0.467116 | 0.086656 |
| hydro | 0.246646 | 0.037118 |
| wind | 0.195675 | 0.184247 |
| solar | 0.251247 | 0.076237 |

There can be supplementary rows and columns at the same time.

```python
pca = prince.PCA().fit(active, supplementary_columns=['hydro', 'wind', 'solar'])
```

## Performance

Under the hood, Prince uses a randomized version of SVD, which is much faster than traditional SVD. However, the results may have a small inherent randomness, which can be controlled with the `random_state` parameter. The accuracy of the results can be increased by providing a higher `n_iter` parameter.
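To see what this looks like outside of Prince, here is a standalone sketch calling scikit-learn's randomized SVD directly; `n_components`, `n_iter`, and `random_state` play the same roles as the parameters above.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(42)
X = rng.random(size=(1000, 10))

# truncated, randomized SVD: only the first 3 singular triplets are computed
U, s, Vt = randomized_svd(X, n_components=3, n_iter=3, random_state=42)
```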

By default `prince` uses scikit-learn’s randomized SVD implementation – the one used in `TruncatedSVD`.

Prince supports different SVD backends. For now, the only other supported randomized backend is fbpca, Facebook's randomized SVD implementation:

```python
prince.PCA(engine='fbpca')
```
```
PCA(engine='fbpca')
```

You can also use a non-randomized SVD implementation, using the `scipy.linalg.svd` function:

```python
prince.PCA(engine='scipy')
```
```
PCA(engine='scipy')
```

Here is a very crude benchmark that compares each engine on different dataset sizes.

```python
import itertools
import numpy as np
import pandas as pd

N = [100, 10_000, 100_000]
P = [3, 10, 30]
ENGINES = ['sklearn', 'fbpca', 'scipy']

perf = []

for n, p, engine in itertools.product(N, P, ENGINES):

    # Too slow
    if engine == 'scipy' and n > 10_000:
        continue

    X = pd.DataFrame(np.random.random(size=(n, p)))
    duration = %timeit -q -n 1 -r 3 -o prince.PCA(engine=engine, n_iter=3).fit(X)
    perf.append({
        'n': n,
        'p': p,
        'engine': engine,
        'seconds': duration.average
    })
```
```python
(
    pd.DataFrame(perf)
    .set_index(['n', 'p'])
    .groupby(['n', 'p'], group_keys=False)
    .apply(lambda x: x[['engine', 'seconds']].sort_values('seconds'))
)
```

| n | p | engine | seconds |
|---|---|---|---|
| 100 | 3 | fbpca | 0.002423 |
| 100 | 3 | scipy | 0.002429 |
| 100 | 3 | sklearn | 0.002854 |
| 100 | 10 | scipy | 0.002442 |
| 100 | 10 | fbpca | 0.002709 |
| 100 | 10 | sklearn | 0.002763 |
| 100 | 30 | fbpca | 0.002880 |
| 100 | 30 | scipy | 0.002975 |
| 100 | 30 | sklearn | 0.003034 |
| 10000 | 3 | fbpca | 0.006477 |
| 10000 | 3 | sklearn | 0.013475 |
| 10000 | 3 | scipy | 0.316646 |
| 10000 | 10 | fbpca | 0.007613 |
| 10000 | 10 | sklearn | 0.014766 |
| 10000 | 10 | scipy | 0.676616 |
| 10000 | 30 | fbpca | 0.020828 |
| 10000 | 30 | sklearn | 0.029073 |
| 10000 | 30 | scipy | 1.778515 |
| 100000 | 3 | fbpca | 0.024544 |
| 100000 | 3 | sklearn | 0.047487 |
| 100000 | 10 | fbpca | 0.068111 |
| 100000 | 10 | sklearn | 0.139859 |
| 100000 | 30 | fbpca | 0.125326 |
| 100000 | 30 | sklearn | 0.221661 |