Principal component analysis
Resources
- Principal component analysis by Hervé Abdi and Lynne J. Williams is excellent at explaining PCA interpretation. It also covers some extensions to PCA.
- A Tutorial on Principal Component Analysis by Jonathon Shlens goes into more detail on the intuition behind PCA, while also discussing its applicability and limits.
- I learnt PCA from these lecture notes by Xavier Gendre. He provides a comprehensive walkthrough using a dataset of Skyrim bows.
Data
PCA assumes you have a dataframe consisting of numerical variables. This includes booleans and integers.
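If your dataframe mixes types, a common preparatory step is to keep only the numerical columns. The following is a generic pandas sketch with made-up data, not something Prince requires:

import pandas as pd

df = pd.DataFrame({
    'height': [1.7, 1.8],         # float
    'n_rooms': [3, 5],            # integer
    'has_garden': [True, False],  # boolean, usable as 0/1
    'city': ['Lyon', 'Paris'],    # not numerical: drop it or encode it first
})

# Keep the numerical columns (booleans included) and cast everything to floats
numeric_df = df.select_dtypes(include=['number', 'bool']).astype(float)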
As an example, let’s use a dataset describing the energy mix of each country in the world, in 2019. The dataset is normalized into percentages, which makes countries comparable with one another.
import prince
dataset = prince.datasets.load_energy_mix(year=2019, normalize=True)
dataset.head().style.format('{:.0%}')
| continent | country | coal | oil | gas | nuclear | hydro | wind | solar | other renewables |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 1% | 35% | 64% | 0% | 0% | 0% | 0% | 0% |
| South America | Argentina | 1% | 35% | 50% | 2% | 10% | 1% | 0% | 1% |
| Oceania | Australia | 28% | 34% | 30% | 0% | 2% | 3% | 3% | 1% |
| Europe | Austria | 9% | 37% | 22% | 0% | 25% | 4% | 1% | 3% |
| Asia | Azerbaijan | 0% | 33% | 65% | 0% | 2% | 0% | 0% | 0% |
Fitting
The `PCA` estimator implements the `fit`/`transform` API from scikit-learn.
pca = prince.PCA(
n_components=3,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=True,
copy=True,
check_input=True,
engine='sklearn',
random_state=42
)
pca = pca.fit(dataset)
The available parameters are:

- `n_components` — the number of components that are computed. You only need two if your intention is to visualize the two major components.
- `n_iter` — the number of iterations used for computing the SVD.
- `rescale_with_mean` — whether to subtract each column's mean.
- `rescale_with_std` — whether to divide each column by its standard deviation.
- `copy` — if `False`, the computations are done in place, which can have side-effects on the input data.
- `engine` — the SVD engine to use (one of `'fbpca'`, `'scipy'`, `'sklearn'`).
- `random_state` — controls the randomness of the SVD results.
These parameters are common to other Prince estimators.
Eigenvalues
The importance of a principal component is indicated by the proportion of dataset inertia it explains. This is called the explained inertia, and is obtained by dividing the eigenvalues obtained with SVD by the total inertia.
The ideal situation is when a large share of inertia is explained by a low number of principal components.
pca.eigenvalues_summary
| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 1.963 | 24.54% | 24.54% |
| 1 | 1.561 | 19.51% | 44.05% |
| 2 | 1.403 | 17.54% | 61.59% |
In this dataset, the first three components explain ~62% of the dataset inertia. This isn’t great, but isn’t bad either.
The columns in the above table are also available as separate attributes:
pca.eigenvalues_
array([1.96301214, 1.56069986, 1.40327823])
pca.percentage_of_variance_
array([24.5376518 , 19.50874822, 17.54097785])
pca.cumulative_percentage_of_variance_
array([24.5376518 , 44.04640003, 61.58737788])
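As a sanity check, because the columns were standardized, the total inertia equals the number of active columns, so the percentages above can be recomputed from the raw eigenvalues. This is just a verification sketch, not part of the Prince API:

import numpy as np

# With standardized columns, the total inertia is the number of active columns (8 here)
total_inertia = dataset.shape[1]
np.allclose(pca.eigenvalues_ / total_inertia * 100, pca.percentage_of_variance_)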
Eigenvalues can also be visualized with a scree plot.
pca.scree_plot()
Coordinates
The row principal coordinates can be obtained once the `PCA` has been fitted to the data.
pca.transform(dataset).head()
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | -2.189068 | 0.380243 | -0.388572 |
| South America | Argentina | -1.244981 | 0.801917 | -0.389456 |
| Oceania | Australia | 0.203969 | -1.470254 | 0.531819 |
| Europe | Austria | 0.847122 | 1.008296 | -0.521998 |
| Asia | Azerbaijan | -2.190535 | 0.632250 | -0.365515 |
The `transform` method is in fact an alias for `row_coordinates`:
pca.transform(dataset).equals(pca.row_coordinates(dataset))
True
In other words, this transforms the original dataset into factor scores.
The column coordinates are obtained during the PCA training process.
pca.column_coordinates_
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |
Visualization
The row and column coordinates can be visualized together on a scatter chart.
pca.plot(
dataset,
x_component=0,
y_component=1,
color_rows_by='continent',
show_row_markers=True,
show_column_markers=True,
show_row_labels=False,
row_labels_column=None, # for DataFrames with a MultiIndex
show_column_labels=False
)
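If you prefer full control over the figure, you can also plot the row coordinates by hand. Here is a matplotlib sketch (not part of Prince's API) that colours countries by continent:

import matplotlib.pyplot as plt

coordinates = pca.row_coordinates(dataset)

fig, ax = plt.subplots()
for continent, group in coordinates.groupby(level='continent'):
    ax.scatter(group[0], group[1], label=continent)
ax.axhline(0, color='grey', linewidth=0.5)
ax.axvline(0, color='grey', linewidth=0.5)
ax.set_xlabel('component 0')
ax.set_ylabel('component 1')
ax.legend()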
Contributions

Contributions indicate how much each row or column accounts for a component's inertia; they are expressed as percentages that sum to 100% for each component.
pca.row_contributions_.head().style.format('{:.0%}')
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 3% | 0% | 0% |
| South America | Argentina | 1% | 1% | 0% |
| Oceania | Australia | 0% | 2% | 0% |
| Europe | Austria | 0% | 1% | 0% |
| Asia | Azerbaijan | 3% | 0% | 0% |
Observations with high contributions and opposite signs can be contrasted with one another to help interpret a component, because they represent its two extremes.
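For instance, here is a quick pandas sketch (not a dedicated Prince method) to surface the countries that weigh most on the first component:

# Countries sorted by their contribution to the first component
pca.row_contributions_[0].sort_values(ascending=False).head()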
Column contributions are also available.
pca.column_contributions_.style.format('{:.0%}')
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 3% | 10% | 39% |
| oil | 2% | 11% | 39% |
| gas | 38% | 2% | 1% |
| nuclear | 5% | 0% | 11% |
| hydro | 10% | 36% | 0% |
| wind | 14% | 2% | 9% |
| solar | 9% | 22% | 0% |
| other renewables | 20% | 17% | 1% |
Cosine similarities

Cosine similarities measure how well each row or column is represented by a given component: the closer a value is to 1, the better that point is captured by the component.
pca.row_cosine_similarities(dataset).head()
| continent | country | component 0 | component 1 | component 2 |
|---|---|---|---|---|
| Africa | Algeria | 0.902764 | 0.027238 | 0.028445 |
| South America | Argentina | 0.626517 | 0.259936 | 0.061309 |
| Oceania | Australia | 0.008468 | 0.439999 | 0.057570 |
| Europe | Austria | 0.236087 | 0.334470 | 0.089643 |
| Asia | Azerbaijan | 0.868801 | 0.072377 | 0.024190 |
pca.column_cosine_similarities_
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.052994 | 0.156400 | 0.548707 |
| oil | 0.031046 | 0.175492 | 0.544430 |
| gas | 0.751563 | 0.033485 | 0.009381 |
| nuclear | 0.096294 | 0.000007 | 0.160487 |
| hydro | 0.193937 | 0.554750 | 0.000268 |
| wind | 0.269062 | 0.026085 | 0.132928 |
| solar | 0.172788 | 0.347260 | 0.000001 |
| other renewables | 0.395327 | 0.267222 | 0.007075 |
Column correlations

The correlation between each original variable and each component is also available. Squaring these correlations gives back the column cosine similarities, as checked below.
pca.column_correlations
| variable | component 0 | component 1 | component 2 |
|---|---|---|---|
| coal | 0.230205 | -0.395474 | 0.740748 |
| oil | 0.176200 | -0.418917 | -0.737855 |
| gas | -0.866927 | 0.182990 | -0.096857 |
| nuclear | 0.310313 | -0.002598 | 0.400608 |
| hydro | 0.440383 | 0.744815 | -0.016375 |
| wind | 0.518712 | -0.161507 | -0.364593 |
| solar | 0.415677 | -0.589288 | 0.001012 |
| other renewables | 0.628750 | 0.516935 | -0.084114 |
(pca.column_correlations ** 2).equals(pca.column_cosine_similarities_)
True
Inverse transformation
You can transform row projections back into their original space by using the inverse_transform
method.
reconstructed = pca.inverse_transform(pca.transform(dataset))
reconstructed.head()
| continent | country | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| Africa | Algeria | 0.024204 | 0.369866 | 0.599244 | -0.007082 | 0.024826 | -0.003249 | 0.000650 | -0.008460 |
| South America | Argentina | 0.027773 | 0.366342 | 0.487562 | 0.008596 | 0.090082 | 0.006963 | 0.001391 | 0.011292 |
| Oceania | Australia | 0.287372 | 0.425348 | 0.208594 | 0.056641 | -0.018484 | 0.025867 | 0.015104 | -0.000441 |
| Europe | Austria | 0.060875 | 0.410393 | 0.219553 | 0.040122 | 0.183231 | 0.034237 | 0.006167 | 0.045423 |
| Asia | Azerbaijan | 0.013098 | 0.354083 | 0.606928 | -0.006558 | 0.042620 | -0.004640 | -0.000439 | -0.005092 |
This makes it possible to measure the reconstruction error of the PCA, which is usually defined as the $L^2$ norm of the difference between the reconstructed and the original dataset.
import numpy as np
np.linalg.norm(reconstructed.values - dataset.values, ord=2)
1.1927159589872411
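As a rough illustration (this is a sketch, not part of the original walkthrough, and the exact figures depend on the randomized SVD), the reconstruction error shrinks as more components are kept and should approach zero once all of them are used:

# Reconstruction error as a function of the number of components kept
for n_components in range(1, dataset.shape[1] + 1):
    model = prince.PCA(n_components=n_components, random_state=42).fit(dataset)
    reconstructed_k = model.inverse_transform(model.transform(dataset))
    error = np.linalg.norm(reconstructed_k.values - dataset.values, ord=2)
    print(f'{n_components} components -> error {error:.4f}')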
Supplementary data
Active rows and columns make up the dataset you fit the PCA with. Anything you provide afterwards is considered supplementary data.
For example, we can fit the PCA on all countries that are not part of North America.
active = dataset.query("continent != 'North America'")
pca = prince.PCA().fit(active)
The data for North America can still be projected onto the principal components.
pca.transform(dataset).query("continent == 'North America'")
| continent | country | component 0 | component 1 |
|---|---|---|---|
| North America | Canada | 0.009864 | 1.332859 |
| North America | Mexico | -0.695023 | -0.402230 |
| North America | Trinidad and Tobago | -3.072833 | 1.393067 |
| North America | United States | -0.226122 | -0.433260 |
As for supplementary columns, they must be provided during the `fit` call.
pca = prince.PCA().fit(dataset, supplementary_columns=['hydro', 'wind', 'solar'])
pca.column_correlations
| variable | component 0 | component 1 |
|---|---|---|
| coal | 0.621633 | -0.380008 |
| oil | 0.048966 | 0.948650 |
| gas | -0.916078 | -0.331546 |
| nuclear | 0.384458 | -0.401675 |
| other renewables | 0.467116 | 0.086656 |
| hydro | 0.246646 | 0.037118 |
| wind | 0.195675 | 0.184247 |
| solar | 0.251247 | 0.076237 |
There can be supplementary rows and columns at the same time.
pca = prince.PCA().fit(active, supplementary_columns=['hydro', 'wind', 'solar'])
Performance
Under the hood, Prince uses a randomized version of SVD. This is much faster than traditional SVD. However, the results may have a small inherent randomness, which can be controlled with the `random_state` parameter. The accuracy of the results can be increased by raising the `n_iter` parameter.
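For instance, if reproducibility and accuracy matter more than speed, you might fix the seed and raise the number of iterations. A small sketch reusing the parameters shown earlier:

# More power iterations make the randomized SVD more accurate, at the cost of speed
pca_precise = prince.PCA(n_components=3, n_iter=10, random_state=42).fit(dataset)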
By default, `prince` uses scikit-learn's randomized SVD implementation (the one used in `TruncatedSVD`).
Prince supports different SVD backends. For now, the only other supported randomized backend is Facebook's randomized SVD implementation, fbpca:
prince.PCA(engine='fbpca')
PCA(engine='fbpca')
You can also use a non-randomized SVD implementation, which relies on the `scipy.linalg.svd` function:
prince.PCA(engine='scipy')
PCA(engine='scipy')
Here is a very crude benchmark that compares each engine on different dataset sizes.
import itertools
import numpy as np
import pandas as pd
N = [100, 10_000, 100_000]
P = [3, 10, 30]
ENGINES = ['sklearn', 'fbpca', 'scipy']
perf = []
for n, p, engine in itertools.product(N, P, ENGINES):
# Too slow
if engine == 'scipy' and n > 10_000:
continue
X = pd.DataFrame(np.random.random(size=(n, p)))
duration = %timeit -q -n 1 -r 3 -o prince.PCA(engine=engine, n_iter=3).fit(X)
perf.append({
'n': n,
'p': p,
'engine': engine,
'seconds': duration.average
})
(
pd.DataFrame(perf)
.set_index(['n', 'p'])
.groupby(['n', 'p'], group_keys=False)
.apply(lambda x: x[['engine', 'seconds']]
.sort_values('seconds'))
)
| n | p | engine | seconds |
|---|---|---|---|
| 100 | 3 | scipy | 0.002407 |
| 100 | 3 | fbpca | 0.002465 |
| 100 | 3 | sklearn | 0.002710 |
| 100 | 10 | scipy | 0.002431 |
| 100 | 10 | fbpca | 0.002704 |
| 100 | 10 | sklearn | 0.002723 |
| 100 | 30 | scipy | 0.002971 |
| 100 | 30 | sklearn | 0.003015 |
| 100 | 30 | fbpca | 0.003187 |
| 10000 | 3 | fbpca | 0.006824 |
| 10000 | 3 | sklearn | 0.012736 |
| 10000 | 3 | scipy | 0.305234 |
| 10000 | 10 | fbpca | 0.009869 |
| 10000 | 10 | sklearn | 0.011739 |
| 10000 | 10 | scipy | 0.733266 |
| 10000 | 30 | fbpca | 0.011290 |
| 10000 | 30 | sklearn | 0.018454 |
| 10000 | 30 | scipy | 1.834851 |
| 100000 | 3 | fbpca | 0.024184 |
| 100000 | 3 | sklearn | 0.040077 |
| 100000 | 10 | fbpca | 0.061728 |
| 100000 | 10 | sklearn | 0.171649 |
| 100000 | 30 | fbpca | 0.129219 |
| 100000 | 30 | sklearn | 0.218774 |