Multiple correspondence analysis

Data

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the resulting indicator matrix.
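To make the one-hot encoding step concrete, here is a minimal sketch on a small made-up table. The `toy` data frame and its values are illustrative assumptions, not part of the dataset used in this page. Each categorical column expands into one indicator column per category, so every row sums to the number of original variables.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Color": ["YELLOW", "PURPLE", "YELLOW"],
    "Size": ["SMALL", "LARGE", "LARGE"],
})

# One indicator column per (variable, category) pair.
indicator = pd.get_dummies(toy).astype(int)
print(indicator)

# Each row contains exactly one 1 per original variable.
row_sums = indicator.sum(axis=1)
```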

As an example, we’re going to use the balloons dataset from the UCI Machine Learning Repository.

import pandas as pd

dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()

    Color   Size   Action    Age Inflated
0  YELLOW  SMALL  STRETCH  ADULT        T
1  YELLOW  SMALL  STRETCH  CHILD        F
2  YELLOW  SMALL      DIP  ADULT        F
3  YELLOW  SMALL      DIP  CHILD        F
4  YELLOW  LARGE  STRETCH  ADULT        T

Fitting

import prince

mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(dataset)

MCA one-hot encodes the dataset and then fits a correspondence analysis on the result. If your dataset is already one-hot encoded, pass one_hot=False to skip the encoding step.

one_hot = pd.get_dummies(dataset)

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
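The two steps described above can also be reproduced by hand. The following is a minimal NumPy sketch of correspondence analysis applied to an indicator matrix, using the standard SVD formulation. It is not Prince's internal code, and the `toy` data and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Color": ["YELLOW", "YELLOW", "PURPLE", "PURPLE", "YELLOW"],
    "Size": ["SMALL", "LARGE", "SMALL", "LARGE", "LARGE"],
    "Age": ["ADULT", "CHILD", "CHILD", "ADULT", "ADULT"],
})

# Step 1: one-hot encode into an indicator matrix.
Z = pd.get_dummies(toy).to_numpy(dtype=float)

# Step 2: correspondence analysis of the indicator matrix.
P = Z / Z.sum()        # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

eigenvalues = s ** 2                            # inertia per component
row_coords = (U * s) / np.sqrt(r)[:, None]      # principal row coordinates
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]   # principal column coordinates
```

In this formulation the mass-weighted sum of squared coordinates along a component equals that component's eigenvalue, for rows and columns alike.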

Eigenvalues

mca.eigenvalues_summary

           eigenvalue  % of variance  % of variance (cumulative)
component
0               0.402         40.17%                      40.17%
1               0.211         21.11%                      61.28%
2               0.186         18.56%                      79.84%
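
Each percentage is the eigenvalue divided by the total inertia. For an indicator matrix with K variables and J categories in total, the total inertia is J / K - 1, a standard MCA identity. Assuming each of the five variables here has two categories, as the rows shown earlier suggest, J = 10 and the total inertia is 1, which is why the percentages match the eigenvalues. The sketch below reproduces the arithmetic from the rounded eigenvalues, so its output differs slightly from the table.

```python
# Rounded eigenvalues copied from the summary table.
eigenvalues = [0.402, 0.211, 0.186]

n_variables = 5     # Color, Size, Action, Age, Inflated
n_categories = 10   # assumption: two categories per variable

# Total inertia of an indicator matrix: J / K - 1.
total_inertia = n_categories / n_variables - 1

explained = [ev / total_inertia for ev in eigenvalues]
cumulative = [sum(explained[: i + 1]) for i in range(len(explained))]

for ev, pct, cum in zip(eigenvalues, explained, cumulative):
    print(f"{ev:.3f}  {pct:.2%}  {cum:.2%}")
```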

Coordinates

mca.row_coordinates(dataset).head()

          0             1         2
0  0.705387  5.369158e-15  0.758639
1 -0.386586  5.724889e-15  0.626063
2 -0.386586  4.807799e-15  0.626063
3 -0.852014  5.108782e-15  0.562447
4  0.783539 -6.333333e-01  0.130201

mca.column_coordinates(dataset).head()

                     0             1         2
Color_PURPLE  0.117308  6.892024e-01 -0.641270
Color_YELLOW -0.130342 -7.657805e-01  0.712523
Size_LARGE    0.117308 -6.892024e-01 -0.641270
Size_SMALL   -0.130342  7.657805e-01  0.712523
Action_DIP   -0.853864 -6.367615e-16 -0.079340

Visualization

mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)

Contributions

mca.row_contributions_.head().style.format('{:.0%}')

     0    1    2
0   7%   0%  16%
1   2%   0%  11%
2   2%   0%  11%
3  10%   0%   9%
4   8%  10%   0%

mca.column_contributions_.head().style.format('{:.0%}')

                0    1    2
Color_PURPLE   0%  24%  23%
Color_YELLOW   0%  26%  26%
Size_LARGE     0%  24%  23%
Size_SMALL     0%  26%  26%
Action_DIP    15%   0%   0%

Cosine similarities

mca.row_cosine_similarities(dataset).head()

          0             1         2
0  0.461478  2.673675e-29  0.533786
1  0.152256  3.338988e-29  0.399316
2  0.152256  2.354904e-29  0.399316
3  0.653335  2.348969e-29  0.284712
4  0.592606  3.871772e-01  0.016363

mca.column_cosine_similarities(dataset).head()

                     0             1         2
Color_PURPLE  0.015290  5.277778e-01  0.456920
Color_YELLOW  0.015290  5.277778e-01  0.456920
Size_LARGE    0.015290  5.277778e-01  0.456920
Size_SMALL    0.015290  5.277778e-01  0.456920
Action_DIP    0.530243  2.948838e-31  0.004578
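
These values behave like squared cosines: the squared coordinate of a point on a component divided by its squared distance to the origin in the full space. As a quick check using only numbers printed on this page, dividing each squared coordinate of row 0 by the matching cosine similarity should recover the same squared distance on every component. This is an interpretation inferred from the printed values, not a statement taken from Prince's documentation.

```python
# Values for row 0, copied from the row coordinate and
# row cosine similarity tables.
coords = [0.705387, 5.369158e-15, 0.758639]
cos2 = [0.461478, 2.673675e-29, 0.533786]

# squared coordinate / squared cosine = squared distance to the origin
squared_distances = [f ** 2 / q for f, q in zip(coords, cos2)]
print(squared_distances)

# Share of the point's inertia not captured by the three components kept.
unexplained = 1 - sum(cos2)
print(f"{unexplained:.4f}")
```

The three ratios agree up to the rounding of the printed values, and the remainder is the part of row 0 that lies on components beyond the three retained.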