Prince

Multiple correspondence analysis

Data

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the resulting indicator matrix.

As an example, we’re going to use the balloons dataset from the UCI Machine Learning Repository.

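import pandas as pd

# The raw file has no header row, so read_csv uses its first record as column
# names; the assignment below replaces those names with descriptive ones.
dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()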

    Color   Size   Action    Age Inflated
0  YELLOW  SMALL  STRETCH  ADULT        T
1  YELLOW  SMALL  STRETCH  CHILD        F
2  YELLOW  SMALL      DIP  ADULT        F
3  YELLOW  SMALL      DIP  CHILD        F
4  YELLOW  LARGE  STRETCH  ADULT        T
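
To make the one-hot step concrete, here is a minimal sketch that encodes the five rows shown above with pandas. The `head_rows` name is only illustrative, and `prefix_sep="__"` is an assumption chosen to mimic the `Color__PURPLE`-style labels that appear in the outputs further down. It is not a statement about how Prince builds those labels internally.

```python
import pandas as pd

# The five rows displayed above, re-typed by hand (illustrative only).
head_rows = pd.DataFrame(
    [
        ["YELLOW", "SMALL", "STRETCH", "ADULT", "T"],
        ["YELLOW", "SMALL", "STRETCH", "CHILD", "F"],
        ["YELLOW", "SMALL", "DIP", "ADULT", "F"],
        ["YELLOW", "SMALL", "DIP", "CHILD", "F"],
        ["YELLOW", "LARGE", "STRETCH", "ADULT", "T"],
    ],
    columns=["Color", "Size", "Action", "Age", "Inflated"],
)

# One indicator column per (variable, category) pair seen in the data.
# prefix_sep="__" is an assumption made to match the labels used on this page.
indicator = pd.get_dummies(head_rows, prefix_sep="__").astype(int)

print(indicator.columns.tolist())
# Each row holds exactly one 1 per original variable, so every row sums to 5.
print(indicator.sum(axis=1).tolist())
```

Only YELLOW occurs in these five rows, so Color yields a single indicator column here. On the full dataset, every category that occurs gets its own column.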

Fitting

import prince

mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(dataset)

MCA works by one-hot encoding the dataset and then fitting a correspondence analysis to the result. If your dataset is already one-hot encoded, you can specify one_hot=False to skip this step.

one_hot = pd.get_dummies(dataset)

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
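
The "one-hot encode, then fit a correspondence analysis" description can be sketched directly with NumPy. The block below is a textbook-style CA of an indicator matrix and is not necessarily how Prince implements it. The matrix `Z` is made-up toy data (6 observations, 2 variables with 2 levels each), and all variable names are illustrative.

```python
import numpy as np

# Hypothetical, already one-hot encoded toy data:
# columns 0-1 encode the first variable, columns 2-3 the second.
Z = np.array(
    [
        [1, 0, 1, 0],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
    ],
    dtype=float,
)
n_variables = 2           # K
n_categories = Z.shape[1] # J

P = Z / Z.sum()           # correspondence matrix
r = P.sum(axis=1)         # row masses
c = P.sum(axis=0)         # column masses

# Standardized residuals, then a singular value decomposition.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

eigenvalues = s ** 2      # principal inertias, in decreasing order
total_inertia = eigenvalues.sum()

# Principal coordinates of rows and columns.
row_coords = (U * s) / np.sqrt(r)[:, None]
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]

print(np.round(eigenvalues, 3))
print(round(total_inertia, 6), (n_categories - n_variables) / n_variables)
```

For an indicator matrix, the total inertia equals (J - K) / K, where J is the number of categories and K the number of variables, which is what the last line compares.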

Eigenvalues

mca.eigenvalues_summary

          eigenvalue % of variance % of variance (cumulative)
component
0              0.402        40.17%                     40.17%
1              0.211        21.11%                     61.28%
2              0.186        18.56%                     79.84%
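
As a rough check on how the percentage columns relate to the eigenvalues, the sketch below recomputes them by hand. It assumes each of the five variables has exactly two levels, as the values on this page suggest, so the total inertia of the indicator matrix is (J - K) / K = (10 - 5) / 5 = 1. The eigenvalues are copied from the summary above and are rounded to three decimals, so the results only agree with the table up to rounding.

```python
# Eigenvalues as displayed in the summary (rounded to three decimals).
eigenvalues = [0.402, 0.211, 0.186]

n_categories = 10  # J: assumed 5 variables x 2 levels each
n_variables = 5    # K
total_inertia = (n_categories - n_variables) / n_variables

# Share of the total inertia carried by each component, in percent.
pct = [100 * ev / total_inertia for ev in eigenvalues]

# Running total of those shares.
cumulative = []
running = 0.0
for p in pct:
    running += p
    cumulative.append(running)

for k, (p, cum) in enumerate(zip(pct, cumulative)):
    print(f"component {k}: {p:.1f}% of variance, {cum:.1f}% cumulative")
```

Because the total inertia is 1 under this assumption, each eigenvalue is numerically the same as its share of variance, which is why 0.402 sits next to 40.17%.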

Coordinates

mca.row_coordinates(dataset).head()

          0             1         2
0  0.705387  5.369158e-15  0.758639
1 -0.386586  5.724889e-15  0.626063
2 -0.386586  4.807799e-15  0.626063
3 -0.852014  5.108782e-15  0.562447
4  0.783539 -6.333333e-01  0.130201

mca.column_coordinates(dataset).head()

                      0             1         2
Color__PURPLE  0.117308  6.892024e-01 -0.641270
Color__YELLOW -0.130342 -7.657805e-01  0.712523
Size__LARGE    0.117308 -6.892024e-01 -0.641270
Size__SMALL   -0.130342  7.657805e-01  0.712523
Action__DIP   -0.853864 -6.367615e-16 -0.079340

Visualization

mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)

Contributions

mca.row_contributions_.head().style.format('{:.0%}')
     0    1    2
0   7%   0%  16%
1   2%   0%  11%
2   2%   0%  11%
3  10%   0%   9%
4   8%  10%   0%

mca.column_contributions_.head().style.format('{:.0%}')
                 0    1    2
Color__PURPLE   0%  24%  23%
Color__YELLOW   0%  26%  26%
Size__LARGE     0%  24%  23%
Size__SMALL     0%  26%  26%
Action__DIP    15%   0%   0%
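
In standard correspondence analysis, the contribution of a point to a component is its mass times its squared coordinate, divided by the component's eigenvalue, so contributions add up to 100% on each component. The self-contained sketch below illustrates that definition on made-up toy data. It is not Prince's own code, and `Z` and the other names are hypothetical.

```python
import numpy as np

# Hypothetical indicator matrix: 6 observations, 2 variables with 2 levels each.
Z = np.array(
    [
        [1, 0, 1, 0],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
    ],
    dtype=float,
)

P = Z / Z.sum()
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # keep the two components with non-zero inertia for this toy matrix
eig = s[:k] ** 2
row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]
col_coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]

# contribution = mass * coordinate^2 / eigenvalue
row_contrib = r[:, None] * row_coords ** 2 / eig
col_contrib = c[:, None] * col_coords ** 2 / eig

# Each component's contributions sum to 1 over rows and over columns.
print(row_contrib.sum(axis=0))
print(col_contrib.sum(axis=0))
```

Read this way, the column table above shows that the four Color and Size categories account for essentially all of component 1 (24% + 26% + 24% + 26%), while Action__DIP contributes to component 0 instead.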

Cosine similarities

mca.row_cosine_similarities(dataset).head()

          0             1         2
0  0.461478  2.673675e-29  0.533786
1  0.152256  3.338988e-29  0.399316
2  0.152256  2.354904e-29  0.399316
3  0.653335  2.348969e-29  0.284712
4  0.592606  3.871772e-01  0.016363

mca.column_cosine_similarities(dataset).head()

                      0             1         2
Color__PURPLE  0.015290  5.277778e-01  0.456920
Color__YELLOW  0.015290  5.277778e-01  0.456920
Size__LARGE    0.015290  5.277778e-01  0.456920
Size__SMALL    0.015290  5.277778e-01  0.456920
Action__DIP    0.530243  2.948838e-31  0.004578
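
One common way to read these values is to add them up across the retained components. Assuming they are squared cosines, which is the usual convention in correspondence analysis and consistent with the numbers shown, the sum tells you how much of a point is captured by the three components. The sketch below does this for two of the columns above; the values are copied from the table and the variable names are illustrative.

```python
# Values copied from the column table above (components 0, 1, 2).
cos2 = {
    "Color__PURPLE": [0.015290, 5.277778e-01, 0.456920],
    "Action__DIP": [0.530243, 2.948838e-31, 0.004578],
}

# Quality of representation: share of each point captured by the 3 components.
quality = {label: sum(values) for label, values in cos2.items()}

for label, q in quality.items():
    print(f"{label}: {q:.3f}")
```

Under that assumption, Color__PURPLE is almost fully described by the three components, whereas roughly half of Action__DIP lies along components that were not retained.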