Multiple correspondence analysis

Data

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the result.

As an example, we’re going to use the balloons dataset from the UCI Machine Learning Repository.

import pandas as pd

dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()

	Color	Size	Action	Age	Inflated
0	YELLOW	SMALL	STRETCH	ADULT	T
1	YELLOW	SMALL	STRETCH	CHILD	F
2	YELLOW	SMALL	DIP	ADULT	F
3	YELLOW	SMALL	DIP	CHILD	F
4	YELLOW	LARGE	STRETCH	ADULT	T

Fitting

import prince

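mca = prince.MCA(
    n_components=3,       # number of components to keep
    n_iter=3,             # number of iterations used by the SVD solver
    copy=True,            # work on a copy of the input rather than modifying it
    check_input=True,     # validate the input before fitting
    engine='sklearn',     # backend used to compute the SVD
    random_state=42       # seed for reproducibility
)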
mca = mca.fit(dataset)

MCA works by one-hot encoding the dataset and then fitting a correspondence analysis to the result. If your dataset is already one-hot encoded, you can specify one_hot=False to skip the encoding step.

one_hot = pd.get_dummies(dataset)

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
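
To make the two steps concrete, here is a minimal sketch that one-hot encodes a small table and then runs a plain correspondence analysis on it with NumPy. It is not prince's implementation: the toy data and all variable names are illustrative assumptions, and the CA step uses the standard formulation (SVD of the standardized residuals of the indicator matrix).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the balloons dataset).
toy = pd.DataFrame({
    "Color": ["YELLOW", "YELLOW", "PURPLE", "PURPLE", "YELLOW", "PURPLE"],
    "Size": ["SMALL", "LARGE", "SMALL", "LARGE", "LARGE", "SMALL"],
    "Age": ["ADULT", "CHILD", "CHILD", "ADULT", "ADULT", "CHILD"],
})

# Step 1: one-hot encode into an indicator matrix (rows x categories).
indicator = pd.get_dummies(toy, prefix_sep="__").to_numpy(dtype=float)

# Step 2: correspondence analysis of the indicator matrix.
P = indicator / indicator.sum()             # correspondence matrix
row_masses = P.sum(axis=1)
col_masses = P.sum(axis=0)
expected = np.outer(row_masses, col_masses) # independence model
residuals = (P - expected) / np.sqrt(expected)

# Squared singular values are the principal inertias (eigenvalues).
singular_values = np.linalg.svd(residuals, compute_uv=False)
eigenvalues = singular_values ** 2

n_variables = toy.shape[1]
n_categories = indicator.shape[1]
total_inertia = eigenvalues.sum()
print(eigenvalues.round(3), total_inertia.round(3))
```

For an indicator matrix with J categories spread over K variables, the total inertia equals (J - K) / K, which gives a quick sanity check on the decomposition.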

Both Benzécri and Greenacre corrections are available. No correction is applied by default.

mca_without_correction = prince.MCA(correction=None)
mca_with_benzecri_correction = prince.MCA(correction='benzecri')
mca_with_greenacre_correction = prince.MCA(correction='greenacre')
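
As a rough illustration of what these corrections do, the sketch below applies the textbook formulas to a list of eigenvalues. It is an assumption that prince follows these exact formulas. The function names and the eigenvalue spectrum are hypothetical. With K variables and J categories, only eigenvalues above 1/K are kept and rescaled. Benzécri percentages divide by the sum of the adjusted eigenvalues, whereas Greenacre percentages divide by an adjusted total inertia.

```python
def benzecri_eigenvalues(eigenvalues, n_variables):
    """Adjusted eigenvalues: (K / (K - 1) * (ev - 1 / K)) ** 2 for ev > 1 / K."""
    k = n_variables
    threshold = 1 / k
    return [((k / (k - 1)) * (ev - threshold)) ** 2
            for ev in eigenvalues if ev > threshold]


def greenacre_total_inertia(eigenvalues, n_variables, n_categories):
    """Adjusted total inertia: K / (K - 1) * (sum(ev ** 2) - (J - K) / K ** 2)."""
    k, j = n_variables, n_categories
    return (k / (k - 1)) * (sum(ev ** 2 for ev in eigenvalues) - (j - k) / k ** 2)


# Hypothetical full spectrum for K = 5 variables and J = 10 categories.
spectrum = [0.40, 0.25, 0.15, 0.12, 0.08]

adjusted = benzecri_eigenvalues(spectrum, n_variables=5)
greenacre_total = greenacre_total_inertia(spectrum, n_variables=5, n_categories=10)

benzecri_share = [ev / sum(adjusted) for ev in adjusted]
greenacre_share = [ev / greenacre_total for ev in adjusted]
print(benzecri_share, greenacre_share)
```

Both corrections concentrate the explained inertia on the leading components. Greenacre's version gives more conservative percentages than Benzécri's.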

Eigenvalues

mca.eigenvalues_summary

component	eigenvalue	% of variance	% of variance (cumulative)
0	0.402	40.17%	40.17%
1	0.211	21.11%	61.28%
2	0.186	18.56%	79.84%
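
The percentage columns can be reproduced by hand. The sketch below assumes that, without correction, the total inertia of the indicator matrix is (J - K) / K for J categories and K variables. Here J = 10 and K = 5, read off the column names shown on this page. Each share is then the eigenvalue divided by that total. The variable names are illustrative, and the eigenvalues are the rounded values from the table, so the last digit may differ slightly from the summary.

```python
from itertools import accumulate

# Rounded eigenvalues copied from the summary table.
eigenvalues = [0.402, 0.211, 0.186]

n_variables = 5    # Color, Size, Action, Age, Inflated
n_categories = 10  # two categories per variable

# Total inertia of the indicator matrix, assuming no correction.
total_inertia = (n_categories - n_variables) / n_variables

share = [ev / total_inertia for ev in eigenvalues]
cumulative = list(accumulate(share))

for component, (s, c) in enumerate(zip(share, cumulative)):
    print(f"{component}: {s:.1%} (cumulative {c:.1%})")
```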

Coordinates

mca.row_coordinates(dataset).head()

	0	1	2
0	0.705387	5.369158e-15	0.758639
1	-0.386586	5.724889e-15	0.626063
2	-0.386586	4.807799e-15	0.626063
3	-0.852014	5.108782e-15	0.562447
4	0.783539	-6.333333e-01	0.130201

mca.column_coordinates(dataset).head()

	0	1	2
Color__PURPLE	0.117308	6.892024e-01	-0.641270
Color__YELLOW	-0.130342	-7.657805e-01	0.712523
Size__LARGE	0.117308	-6.892024e-01	-0.641270
Size__SMALL	-0.130342	7.657805e-01	0.712523
Action__DIP	-0.853864	-6.367615e-16	-0.079340

Visualization

mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)

Contributions

mca.row_contributions_.head().style.format('{:.0%}')

	0	1	2
0	7%	0%	16%
1	2%	0%	11%
2	2%	0%	11%
3	10%	0%	9%
4	8%	10%	0%

mca.column_contributions_.head().style.format('{:.0%}')

	0	1	2
Color__PURPLE	0%	24%	23%
Color__YELLOW	0%	26%	26%
Size__LARGE	0%	24%	23%
Size__SMALL	0%	26%	26%
Action__DIP	15%	0%	0%

Cosine similarities

mca.row_cosine_similarities(dataset).head()

	0	1	2
0	0.461478	2.673675e-29	0.533786
1	0.152256	3.338988e-29	0.399316
2	0.152256	2.354904e-29	0.399316
3	0.653335	2.348969e-29	0.284712
4	0.592606	3.871772e-01	0.016363

mca.column_cosine_similarities(dataset).head()

	0	1	2
Color__PURPLE	0.015290	5.277778e-01	0.456920
Color__YELLOW	0.015290	5.277778e-01	0.456920
Size__LARGE	0.015290	5.277778e-01	0.456920
Size__SMALL	0.015290	5.277778e-01	0.456920
Action__DIP	0.530243	2.948838e-31	0.004578

Controlling the one-hot encoding

one_hot_prefix_sep allows you to specify the separator used to prefix the one-hot encoded columns. By default, it is set to __.
one_hot_columns_to_drop allows you to specify which one-hot encoded columns should be dropped before fitting the MCA. This is useful if you want to drop columns that are not relevant for the analysis, or if you want to avoid multicollinearity issues. Dropping columns in this way yields a so-called “subset MCA”.

mca = prince.MCA(
    one_hot_prefix_sep="@",
    one_hot_columns_to_drop=['Color@PURPLE', 'Action@STRETCH']
)
mca = mca.fit(dataset)
mca.column_coordinates(dataset)

	0	1
Color@YELLOW	-0.006804	-0.114754
Size@LARGE	0.196862	-0.938999
Size@SMALL	-0.049813	1.080353
Action@DIP	-0.562462	0.004154
Age@ADULT	0.823161	0.082031
Age@CHILD	-0.941808	-0.071144
Inflated@F	-0.681104	-0.024378
Inflated@T	1.384794	0.089388
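
To find the exact names to pass to one_hot_columns_to_drop, it can help to list the one-hot column names first. The sketch below uses pd.get_dummies with a custom prefix_sep on a hypothetical two-column table. It assumes the names follow the variable + separator + category pattern visible in the output above.

```python
import pandas as pd

# Hypothetical toy data with two categorical variables.
toy = pd.DataFrame({
    "Color": ["YELLOW", "PURPLE"],
    "Action": ["STRETCH", "DIP"],
})

# List the one-hot column names produced with "@" as the separator.
candidate_columns = pd.get_dummies(toy, prefix_sep="@").columns.tolist()
print(candidate_columns)
# ['Color@PURPLE', 'Color@YELLOW', 'Action@DIP', 'Action@STRETCH']
```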
