Principal component analysis

Data

PCA assumes you have a dataframe consisting of numerical variables. This includes booleans and integers.

As an example, let’s use a dataset describing the energy mix of each country in the world, in 2019. The dataset is normalized into percentages, which makes countries comparable with one another.

import prince

dataset = prince.datasets.load_energy_mix(year=2019, normalize=True)
dataset.head().style.format('{:.0%}')  

| continent | country | coal | oil | gas | nuclear | hydro | wind | solar | other renewables |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 1% | 35% | 64% | 0% | 0% | 0% | 0% | 0% |
| South America | Argentina | 1% | 35% | 50% | 2% | 10% | 1% | 0% | 1% |
| Oceania | Australia | 28% | 34% | 30% | 0% | 2% | 3% | 3% | 1% |
| Europe | Austria | 9% | 37% | 22% | 0% | 25% | 4% | 1% | 3% |
| Asia | Azerbaijan | 0% | 33% | 65% | 0% | 2% | 0% | 0% | 0% |
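
Because the mix is expressed as shares of each country's total consumption, every row should sum to one. A quick sanity check (a sketch, assuming the loader returns exact shares):

import numpy as np

# Each row is a percentage breakdown of one country's energy mix.
assert np.allclose(dataset.sum(axis='columns'), 1)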

Fitting

The PCA estimator implements the fit/transform API from scikit-learn.

pca = prince.PCA(
    n_components=3,
    n_iter=3,
    rescale_with_mean=True,
    rescale_with_std=True,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
pca = pca.fit(dataset)

Eigenvalues

The importance of a principal component is indicated by the proportion of the dataset's inertia it explains. This proportion is called the explained inertia, and is obtained by dividing each eigenvalue from the SVD by the total inertia.

The ideal situation is when a large share of inertia is explained by a low number of principal components.

pca.eigenvalues_summary

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 1.963 | 24.54% | 24.54% |
| 1 | 1.561 | 19.51% | 44.05% |
| 2 | 1.403 | 17.54% | 61.59% |

In this dataset, the first three components explain ~62% of the dataset inertia. This isn’t great, but isn’t bad either.

The columns in the above table are also available as separate attributes:

pca.eigenvalues_
array([1.96301214, 1.56069986, 1.40327823])
pca.percentage_of_variance_
array([24.5376518 , 19.50874822, 17.54097785])
pca.cumulative_percentage_of_variance_
array([24.5376518 , 44.04640003, 61.58737788])
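
As a sanity check: because the columns are standardized (rescale_with_std=True), the total inertia equals the number of active columns, so the percentages above can be recomputed directly from the eigenvalues. A minimal sketch, assuming all 8 columns of the dataset are active:

import numpy as np

# With standardized columns, each variable carries one unit of inertia,
# so the total inertia is simply the number of columns.
total_inertia = dataset.shape[1]
np.testing.assert_allclose(
    pca.percentage_of_variance_,
    pca.eigenvalues_ / total_inertia * 100
)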

Eigenvalues can also be visualized with a scree plot.

pca.scree_plot()

Coordinates

The row principal coordinates can be obtained once the PCA has been fitted to the data.

pca.transform(dataset).head()

| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | -2.189068 | 0.380243 | -0.388572 |
| South America | Argentina | -1.244981 | 0.801917 | -0.389456 |
| Oceania | Australia | 0.203969 | -1.470254 | 0.531819 |
| Europe | Austria | 0.847122 | 1.008296 | -0.521998 |
| Asia | Azerbaijan | -2.190535 | 0.632250 | -0.365515 |

The transform method is in fact an alias for row_coordinates:

pca.transform(dataset).equals(pca.row_coordinates(dataset))
True

In other words, the transform method projects the original dataset into factor scores.
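
Because PCA centers the data, the factor scores of the active rows average to zero on every component; a quick check:

# Each component's factor scores have (numerically) zero mean.
pca.transform(dataset).mean().round(10)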

The column coordinates are obtained during the PCA training process.

pca.column_coordinates_

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |

Visualization

The row and column coordinates can be visualized together in a scatter chart.

pca.plot(
    dataset,
    x_component=0,
    y_component=1,
    color_rows_by='continent',
    show_row_markers=True,
    show_column_markers=True,
    show_row_labels=False,
    row_labels_column=None,  # for DataFrames with a MultiIndex
    show_column_labels=False
)

Contributions

Each row's contribution to a component indicates the share of that component's inertia it accounts for; contributions sum to 100% over the rows for each component.

pca.row_contributions_.head().style.format('{:.0%}')

| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 3% | 0% | 0% |
| South America | Argentina | 1% | 1% | 0% |
| Oceania | Australia | 0% | 2% | 0% |
| Europe | Austria | 0% | 1% | 0% |
| Asia | Azerbaijan | 3% | 0% | 0% |

Observations with high contributions and coordinates of opposite signs lie at the two endpoints of a component; opposing them helps interpret what the component represents.
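
For instance, here is a minimal sketch that surfaces the five countries contributing most to component 0 and sorts them by their coordinate, making the two endpoints visible. It assumes the contribution columns are labelled by component number, as in the table above:

# Countries that weigh most on component 0, ordered from one endpoint to the other.
top = pca.row_contributions_[0].nlargest(5)
pca.row_coordinates(dataset)[0].loc[top.index].sort_values()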

Column contributions are also available.

pca.column_contributions_.style.format('{:.0%}')

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 3% | 10% | 39% |
| oil | 2% | 11% | 39% |
| gas | 38% | 2% | 1% |
| nuclear | 5% | 0% | 11% |
| hydro | 10% | 36% | 0% |
| wind | 14% | 2% | 9% |
| solar | 9% | 22% | 0% |
| other renewables | 20% | 17% | 1% |

Cosine similarities

Cosine similarities measure how well each point is represented by each component. They are squared cosines: the value is close to 1 when the component accounts for most of the point's distance to the center.

pca.row_cosine_similarities(dataset).head()

| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 0.902764 | 0.027238 | 0.028445 |
| South America | Argentina | 0.626517 | 0.259936 | 0.061309 |
| Oceania | Australia | 0.008468 | 0.439999 | 0.057570 |
| Europe | Austria | 0.236087 | 0.334470 | 0.089643 |
| Asia | Azerbaijan | 0.868801 | 0.072377 | 0.024190 |
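
Summing these values across the retained components gives each row's overall quality of representation, i.e. the share of its squared distance to the centroid captured by the first three components:

# Rows with a total close to 1 are well represented by three components.
pca.row_cosine_similarities(dataset).sum(axis='columns').head()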
pca.column_cosine_similarities_

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.052994 | 0.156400 | 0.548707 |
| oil | 0.031046 | 0.175492 | 0.544430 |
| gas | 0.751563 | 0.033485 | 0.009381 |
| nuclear | 0.096294 | 0.000007 | 0.160487 |
| hydro | 0.193937 | 0.554750 | 0.000268 |
| wind | 0.269062 | 0.026085 | 0.132928 |
| solar | 0.172788 | 0.347260 | 0.000001 |
| other renewables | 0.395327 | 0.267222 | 0.007075 |

Column correlations

The correlations between each original variable and each component are also available. Because this PCA standardizes the variables, these correlations coincide with the column coordinates shown above.

pca.column_correlations

| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |
(pca.column_correlations ** 2).equals(pca.column_cosine_similarities_)
True

Inverse transformation

You can transform row projections back into their original space by using the inverse_transform method.

reconstructed = pca.inverse_transform(pca.transform(dataset))
reconstructed.head()

| continent | country | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 0.024204 | 0.369866 | 0.599244 | -0.007082 | 0.024826 | -0.003249 | 0.000650 | -0.008460 |
| South America | Argentina | 0.027773 | 0.366342 | 0.487562 | 0.008596 | 0.090082 | 0.006963 | 0.001391 | 0.011292 |
| Oceania | Australia | 0.287372 | 0.425348 | 0.208594 | 0.056641 | -0.018484 | 0.025867 | 0.015104 | -0.000441 |
| Europe | Austria | 0.060875 | 0.410393 | 0.219553 | 0.040122 | 0.183231 | 0.034237 | 0.006167 | 0.045423 |
| Asia | Azerbaijan | 0.013098 | 0.354083 | 0.606928 | -0.006558 | 0.042620 | -0.004640 | -0.000439 | -0.005092 |

This allows measuring the reconstruction error of the PCA, usually defined as a norm of the difference between the reconstructed and the original dataset. With ord=2, np.linalg.norm computes the matrix spectral norm, i.e. the largest singular value of the difference.

import numpy as np

np.linalg.norm(reconstructed.values - dataset.values, ord=2)
1.1927159589872411
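
The Frobenius norm, which aggregates the element-wise errors, is a common alternative:

np.linalg.norm(reconstructed.values - dataset.values, ord='fro')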

Supplementary data

Active rows and columns make up the dataset you fit the PCA with. Anything you provide afterwards is considered supplementary data.

For example, we can fit the PCA on all countries that are not part of North America.

active = dataset.query("continent != 'North America'")
pca = prince.PCA().fit(active)

The data for North America can still be projected onto the principal components.

pca.transform(dataset).query("continent == 'North America'")

| continent | country | component 0 | component 1 |
|---|---|---|---|
| North America | Canada | 0.009864 | 1.332859 |
| North America | Mexico | -0.695023 | -0.402230 |
| North America | Trinidad and Tobago | -3.072833 | 1.393067 |
| North America | United States | -0.226122 | -0.433260 |

As for supplementary columns, they must be provided during the fit call.

pca = prince.PCA().fit(dataset, supplementary_columns=['hydro', 'wind', 'solar'])
pca.column_correlations

| variable | component 0 | component 1 |
|---|---|---|
| coal | 0.621633 | -0.380008 |
| oil | 0.048966 | 0.948650 |
| gas | -0.916078 | -0.331546 |
| nuclear | 0.384458 | -0.401675 |
| other renewables | 0.467116 | 0.086656 |
| hydro | 0.246646 | 0.037118 |
| wind | 0.195675 | 0.184247 |
| solar | 0.251247 | 0.076237 |

There can be supplementary rows and columns at the same time.

pca = prince.PCA().fit(active, supplementary_columns=['hydro', 'wind', 'solar'])
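
With this fit, the North American countries act as supplementary rows while hydro, wind and solar act as supplementary columns. As a sketch, and assuming transform picks out the active columns from the dataframe it is given, the supplementary rows can still be projected:

pca.transform(dataset).query("continent == 'North America'")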

Performance

Under the hood, Prince uses a randomized version of SVD, which is much faster than traditional SVD. The trade-off is that the results contain a small amount of randomness, which can be controlled with the random_state parameter. Accuracy can be improved by increasing the n_iter parameter, at the cost of extra computation.
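
As a quick sketch, two fits sharing the same random_state produce matching results:

import numpy as np

# Same seed, same randomized SVD, same eigenvalues.
a = prince.PCA(n_components=3, random_state=42).fit(dataset)
b = prince.PCA(n_components=3, random_state=42).fit(dataset)
assert np.allclose(a.eigenvalues_, b.eigenvalues_)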

By default, Prince uses scikit-learn's randomized SVD implementation, the one used in TruncatedSVD.

Prince supports different SVD backends. For the time being, the only other supported randomized backend is fbpca, Facebook's randomized SVD implementation:

prince.PCA(engine='fbpca')
PCA(engine='fbpca')

You can also use a non-randomized SVD implementation, using the scipy.linalg.svd function:

prince.PCA(engine='scipy')
PCA(engine='scipy')

Here is a very crude benchmark that compares each engine on different dataset sizes.

import itertools
import numpy as np
import pandas as pd

N = [100, 10_000, 100_000]
P = [3, 10, 30]
ENGINES = ['sklearn', 'fbpca', 'scipy']

perf = []

for n, p, engine in itertools.product(N, P, ENGINES):

    # Too slow
    if engine == 'scipy' and n > 10_000:
        continue

    X = pd.DataFrame(np.random.random(size=(n, p)))
    duration = %timeit -q -n 1 -r 3 -o prince.PCA(engine=engine, n_iter=3).fit(X)
    perf.append({
        'n': n,
        'p': p,
        'engine': engine,
        'seconds': duration.average
    })
(
    pd.DataFrame(perf)
    .set_index(['n', 'p'])
    .groupby(['n', 'p'], group_keys=False)
    .apply(lambda x: x[['engine', 'seconds']].sort_values('seconds'))
)

| n | p | engine | seconds |
|---|---|---|---|
| 100 | 3 | fbpca | 0.002423 |
| 100 | 3 | scipy | 0.002429 |
| 100 | 3 | sklearn | 0.002854 |
| 100 | 10 | scipy | 0.002442 |
| 100 | 10 | fbpca | 0.002709 |
| 100 | 10 | sklearn | 0.002763 |
| 100 | 30 | fbpca | 0.002880 |
| 100 | 30 | scipy | 0.002975 |
| 100 | 30 | sklearn | 0.003034 |
| 10000 | 3 | fbpca | 0.006477 |
| 10000 | 3 | sklearn | 0.013475 |
| 10000 | 3 | scipy | 0.316646 |
| 10000 | 10 | fbpca | 0.007613 |
| 10000 | 10 | sklearn | 0.014766 |
| 10000 | 10 | scipy | 0.676616 |
| 10000 | 30 | fbpca | 0.020828 |
| 10000 | 30 | sklearn | 0.029073 |
| 10000 | 30 | scipy | 1.778515 |
| 100000 | 3 | fbpca | 0.024544 |
| 100000 | 3 | sklearn | 0.047487 |
| 100000 | 10 | fbpca | 0.068111 |
| 100000 | 10 | sklearn | 0.139859 |
| 100000 | 30 | fbpca | 0.125326 |
| 100000 | 30 | sklearn | 0.221661 |