Multiple correspondence analysis

Data

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the result.
As an example, we’re going to use the balloons dataset taken from the UCI datasets website.
import pandas as pd
dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()
    Color   Size   Action    Age Inflated
0  YELLOW  SMALL  STRETCH  ADULT        T
1  YELLOW  SMALL  STRETCH  CHILD        F
2  YELLOW  SMALL      DIP  ADULT        F
3  YELLOW  SMALL      DIP  CHILD        F
4  YELLOW  LARGE  STRETCH  ADULT        T
Fitting

import prince
mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(dataset)
MCA one-hot encodes the dataset and then fits a correspondence analysis on the result. If your dataset is already one-hot encoded, pass one_hot=False to skip this step.
one_hot = pd.get_dummies(dataset)
mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
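To make the "one-hot encode, then correspondence analysis" idea concrete, here is a minimal NumPy sketch of the two steps. It is an illustration under stated assumptions, not prince's implementation: the function name `mca_sketch` and the toy data are hypothetical, no correction is applied, and a full SVD is used rather than a randomized one.

```python
import numpy as np
import pandas as pd


def mca_sketch(df, n_components=2):
    """One-hot encode `df`, then run a plain correspondence analysis on it."""
    # Step 1: indicator (one-hot) matrix, one column per category.
    Z = pd.get_dummies(df, prefix_sep="__").to_numpy(dtype=float)

    # Step 2: correspondence matrix with its row and column masses.
    P = Z / Z.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses

    # Step 3: SVD of the standardized residuals.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)

    eigenvalues = sv ** 2  # principal inertias
    row_coords = (U / np.sqrt(r)[:, None]) * sv  # principal row coordinates
    total_inertia = eigenvalues.sum()
    return eigenvalues[:n_components], row_coords[:, :n_components], total_inertia


# Hypothetical toy data: 3 variables with 2 categories each.
toy = pd.DataFrame({
    "Color": ["YELLOW", "YELLOW", "PURPLE", "PURPLE"],
    "Size": ["SMALL", "LARGE", "SMALL", "LARGE"],
    "Age": ["ADULT", "CHILD", "CHILD", "ADULT"],
})
eigenvalues, row_coords, total_inertia = mca_sketch(toy)
```

For an indicator matrix with J categories spread over K variables, the total inertia equals J/K - 1, so the toy data above (J = 6, K = 3) should give a total inertia of 1.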
Both Benzécri and Greenacre corrections are available. No correction is applied by default.
mca_without_correction = prince.MCA(correction=None)
mca_with_benzecri_correction = prince.MCA(correction='benzecri')
mca_with_greenacre_correction = prince.MCA(correction='greenacre')
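As a rough illustration of what such a correction does, the Benzécri adjustment is usually described as follows: with K variables, each eigenvalue λ above 1/K is rescaled to (K/(K-1) · (λ - 1/K))², and the remaining eigenvalues are discarded. The sketch below follows that textbook description. The function name is hypothetical, returning 0.0 for discarded eigenvalues is a choice made here for readability, and whether prince computes exactly this is an assumption.

```python
def benzecri_correction(eigenvalues, n_variables):
    """Textbook Benzécri adjustment of indicator-matrix eigenvalues."""
    k = n_variables
    threshold = 1.0 / k
    corrected = []
    for ev in eigenvalues:
        if ev > threshold:
            corrected.append((k / (k - 1) * (ev - threshold)) ** 2)
        else:
            # Eigenvalues at or below 1/K carry no adjusted inertia.
            corrected.append(0.0)
    return corrected


# Rounded eigenvalues in the style of an MCA on 5 variables.
adjusted = benzecri_correction([0.402, 0.211, 0.186], n_variables=5)
```

The adjustment shrinks every retained eigenvalue but shrinks the small ones far more, so the leading components account for a larger share of the adjusted inertia.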
Eigenvalues

           eigenvalue  % of variance  % of variance (cumulative)
component
0               0.402         40.17%                      40.17%
1               0.211         21.11%                      61.28%
2               0.186         18.56%                      79.84%
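The percentages in this table can be read as each eigenvalue divided by the total inertia. Without correction, the total inertia of an indicator matrix is J/K - 1 for J categories over K variables. The balloons data has 5 variables with 2 categories each, which gives 10/5 - 1 = 1, so each percentage matches its eigenvalue up to rounding. A small sketch of that arithmetic, with a hypothetical helper name and the rounded eigenvalues from the table:

```python
def explained_inertia(eigenvalues, total_inertia):
    """Return the share of inertia per component and the running total."""
    shares = [ev / total_inertia for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative


n_categories, n_variables = 10, 5
total_inertia = n_categories / n_variables - 1  # 1.0 for this dataset
shares, cumulative = explained_inertia([0.402, 0.211, 0.186], total_inertia)
```

Because the inputs are the rounded eigenvalues, the results agree with the table only to about three decimal places.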
Coordinates

mca.row_coordinates(dataset).head()

          0              1         2
0  0.705387   5.369158e-15  0.758639
1 -0.386586   5.724889e-15  0.626063
2 -0.386586   4.807799e-15  0.626063
3 -0.852014   5.108782e-15  0.562447
4  0.783539  -6.333333e-01  0.130201

mca.column_coordinates(dataset).head()

                      0              1         2
Color__PURPLE  0.117308   6.892024e-01 -0.641270
Color__YELLOW -0.130342  -7.657805e-01  0.712523
Size__LARGE    0.117308  -6.892024e-01 -0.641270
Size__SMALL   -0.130342   7.657805e-01  0.712523
Action__DIP   -0.853864  -6.367615e-16 -0.079340
Visualization

mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)
Contributions

mca.row_contributions_.head().style.format('{:.0%}')

     0    1    2
0   7%   0%  16%
1   2%   0%  11%
2   2%   0%  11%
3  10%   0%   9%
4   8%  10%   0%

mca.column_contributions_.head().style.format('{:.0%}')

                 0    1    2
Color__PURPLE   0%  24%  23%
Color__YELLOW   0%  26%  26%
Size__LARGE     0%  24%  23%
Size__SMALL     0%  26%  26%
Action__DIP    15%   0%   0%
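In correspondence analysis, the contribution of a point to a component is conventionally its mass times its squared coordinate, divided by that component's eigenvalue, so the contributions along one component sum to 1. The sketch below applies that formula to the first row, using the coordinate and eigenvalue shown earlier. The helper name is hypothetical, and the row mass of 1/19 is an assumption: it supposes 19 equally weighted rows, a count that is inferred from the rounded figures and not stated in the text.

```python
def contribution(coordinate, mass, eigenvalue):
    """Share of a component's inertia that is due to one point."""
    return mass * coordinate ** 2 / eigenvalue


# Row 0 on component 0: coordinate 0.705387, eigenvalue 0.402 (rounded).
# ASSUMPTION: 19 rows of equal mass.
row0_component0 = contribution(0.705387, mass=1 / 19, eigenvalue=0.402)
```

Under that assumption the value rounds to 7%, in line with the first cell of the row contributions table.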
Cosine similarities

mca.row_cosine_similarities(dataset).head()

          0             1         2
0  0.461478  2.673675e-29  0.533786
1  0.152256  3.338988e-29  0.399316
2  0.152256  2.354904e-29  0.399316
3  0.653335  2.348969e-29  0.284712
4  0.592606  3.871772e-01  0.016363

mca.column_cosine_similarities(dataset).head()

                      0             1         2
Color__PURPLE  0.015290  5.277778e-01  0.456920
Color__YELLOW  0.015290  5.277778e-01  0.456920
Size__LARGE    0.015290  5.277778e-01  0.456920
Size__SMALL    0.015290  5.277778e-01  0.456920
Action__DIP    0.530243  2.948838e-31  0.004578
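These values are squared cosines: for each point, the squared coordinate on a component divided by the point's squared distance to the origin in the full space. They measure how well each component represents the point, and over all components they sum to 1. A minimal sketch with a hypothetical helper follows. It assumes you have the coordinates on every component, not only the first few, and that no point sits exactly at the origin.

```python
def squared_cosines(full_coordinates):
    """Squared cosine of each point with each axis.

    `full_coordinates` holds one list per point, covering all components.
    """
    result = []
    for row in full_coordinates:
        squared_distance = sum(value * value for value in row)
        result.append([value * value / squared_distance for value in row])
    return result


# A hypothetical point at (3, 4): squared distance 25.
cos2 = squared_cosines([[3.0, 4.0]])
```

When only the first few components are kept, as in the tables above, the values in a row can sum to less than 1. The shortfall is the part of the point that lies on the discarded components.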
Controlling the one-hot encoding

one_hot_prefix_sep specifies the separator placed between the variable name and the category in the one-hot encoded column names. By default, it is set to __.

one_hot_columns_to_drop specifies which one-hot encoded columns should be dropped before fitting the MCA. This is useful if you want to drop columns that are not relevant to the analysis, or to avoid multicollinearity issues. It leads to so-called “subset MCA”.

mca = prince.MCA(
    one_hot_prefix_sep="@",
    one_hot_columns_to_drop=['Color@PURPLE', 'Action@STRETCH']
)
mca = mca.fit(dataset)
mca.column_coordinates(dataset)
                     0         1
Color@YELLOW -0.006804 -0.114754
Size@LARGE    0.196862 -0.938999
Size@SMALL   -0.049813  1.080353
Action@DIP   -0.562462  0.004154
Age@ADULT     0.823161  0.082031
Age@CHILD    -0.941808 -0.071144
Inflated@F   -0.681104 -0.024378
Inflated@T    1.384794  0.089388
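The names passed to one_hot_columns_to_drop have to match the encoded column names, which combine the variable name, the separator, and the category. One way to preview those names is to encode a small frame with pandas. This sketch assumes prince builds the names the same way `pd.get_dummies` does, which is consistent with the labels in the output above. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame reusing two of the balloons variables.
toy = pd.DataFrame({
    "Color": ["YELLOW", "PURPLE"],
    "Action": ["STRETCH", "DIP"],
})

# Encode with the same separator that is passed as one_hot_prefix_sep.
encoded = pd.get_dummies(toy, prefix_sep="@")
column_names = list(encoded.columns)
```

Here `column_names` contains `Color@PURPLE`, `Color@YELLOW`, `Action@DIP`, and `Action@STRETCH`, so the two names dropped in the example above follow the same pattern.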