Multiple correspondence analysis
Data
Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the resulting indicator matrix.
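To make the one-hot encoding step concrete, here is a minimal sketch using `pd.get_dummies`. The `toy` frame is invented for illustration and is not the balloons dataset.

```python
import pandas as pd

# Toy data invented for illustration; it is not the balloons dataset.
toy = pd.DataFrame({
    "Color": ["YELLOW", "PURPLE", "YELLOW"],
    "Size": ["SMALL", "LARGE", "LARGE"],
})

# One 0/1 column per (variable, category) pair.
indicator = pd.get_dummies(toy, dtype=int)
print(indicator)

# Each row contains exactly one 1 per original variable.
ones_per_row = indicator.sum(axis=1)
print(ones_per_row.tolist())
```

Every row of the indicator matrix sums to the number of original variables, which is the structure correspondence analysis then works with.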
As an example, we’re going to use the balloons dataset taken from the UCI datasets website.
import pandas as pd
dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()
| | Color | Size | Action | Age | Inflated |
|---|---|---|---|---|---|
| 0 | YELLOW | SMALL | STRETCH | ADULT | T |
| 1 | YELLOW | SMALL | STRETCH | CHILD | F |
| 2 | YELLOW | SMALL | DIP | ADULT | F |
| 3 | YELLOW | SMALL | DIP | CHILD | F |
| 4 | YELLOW | LARGE | STRETCH | ADULT | T |
Fitting
import prince
mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(dataset)
MCA works by one-hot encoding the dataset and then fitting a correspondence analysis to the result. If your dataset is already one-hot encoded, you can specify one_hot=False to skip this step.
one_hot = pd.get_dummies(dataset)
mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
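The two steps described above can be sketched by hand with NumPy. This is a textbook formulation of correspondence analysis applied to an indicator matrix, not necessarily what prince does internally; the function name `mca_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def mca_sketch(df, n_components=2):
    """Illustrative MCA: one-hot encode, then run CA through an SVD."""
    # Step 1: indicator matrix (one column per variable/category pair).
    Z = pd.get_dummies(df).to_numpy(dtype=float)

    # Step 2: correspondence matrix with its row and column masses.
    P = Z / Z.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)

    # Step 3: standardized residuals and their SVD.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Step 4: eigenvalues and principal coordinates.
    sv = sv[:n_components]
    eigenvalues = sv ** 2
    rows = (U[:, :n_components] * sv) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :n_components] * sv) / np.sqrt(c)[:, None]
    return eigenvalues, rows, cols

# Toy data invented for illustration.
toy = pd.DataFrame({
    "a": list("xxyyzz"),
    "b": list("pqpqpq"),
    "c": list("uuvuvv"),
})
eigenvalues, rows, cols = mca_sketch(toy, n_components=2)
print(eigenvalues)
```

The signs of the coordinates are arbitrary in an SVD, so a library may return the same axes flipped.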
Eigenvalues
mca.eigenvalues_summary
| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.402 | 40.17% | 40.17% |
| 1 | 0.211 | 21.11% | 61.28% |
| 2 | 0.186 | 18.56% | 79.84% |
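The percentages in this table can be approximated from the eigenvalues. The sketch below assumes that each percentage is the eigenvalue's share of the total inertia, and that the total inertia of an indicator matrix is J/K - 1, with J the number of categories and K the number of variables. For the balloons data that gives 10/5 - 1 = 1, assuming each of the five variables has two categories.

```python
import numpy as np

# Rounded eigenvalues copied from the table above.
eigenvalues = np.array([0.402, 0.211, 0.186])

# Assumption: 5 variables with 2 categories each.
n_categories = 10
n_variables = 5
total_inertia = n_categories / n_variables - 1

pct = 100 * eigenvalues / total_inertia
cum = pct.cumsum()

for k, (p, c) in enumerate(zip(pct, cum)):
    print(f"component {k}: {p:.1f}% ({c:.1f}% cumulative)")
```

Because the inputs are rounded to three decimals, the results match the table only up to rounding.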
Coordinates
mca.row_coordinates(dataset).head()
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.705387 | 5.369158e-15 | 0.758639 |
| 1 | -0.386586 | 5.724889e-15 | 0.626063 |
| 2 | -0.386586 | 4.807799e-15 | 0.626063 |
| 3 | -0.852014 | 5.108782e-15 | 0.562447 |
| 4 | 0.783539 | -6.333333e-01 | 0.130201 |
mca.column_coordinates(dataset).head()
| | 0 | 1 | 2 |
|---|---|---|---|
| Color_PURPLE | 0.117308 | 6.892024e-01 | -0.641270 |
| Color_YELLOW | -0.130342 | -7.657805e-01 | 0.712523 |
| Size_LARGE | 0.117308 | -6.892024e-01 | -0.641270 |
| Size_SMALL | -0.130342 | 7.657805e-01 | 0.712523 |
| Action_DIP | -0.853864 | -6.367615e-16 | -0.079340 |
Visualization
mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)
Contributions
mca.row_contributions_.head().style.format('{:.0%}')
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 7% | 0% | 16% |
| 1 | 2% | 0% | 11% |
| 2 | 2% | 0% | 11% |
| 3 | 10% | 0% | 9% |
| 4 | 8% | 10% | 0% |
mca.column_contributions_.head().style.format('{:.0%}')
| | 0 | 1 | 2 |
|---|---|---|---|
| Color_PURPLE | 0% | 24% | 23% |
| Color_YELLOW | 0% | 26% | 26% |
| Size_LARGE | 0% | 24% | 23% |
| Size_SMALL | 0% | 26% | 26% |
| Action_DIP | 15% | 0% | 0% |
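As a rough guide to reading these numbers, the sketch below uses the usual correspondence analysis definition: the contribution of a point to a component is its mass times its squared coordinate, divided by the component's eigenvalue. It assumes the eigenvalue equals the mass-weighted sum of squared principal coordinates on that component. The function name `contributions` and the toy values are hypothetical, and prince may compute these differently.

```python
import numpy as np

def contributions(coords, masses):
    """Share of each component's inertia attributable to each point."""
    # Mass-weighted squared coordinates, one column per component.
    weighted = masses[:, None] * coords ** 2
    # Per-component eigenvalue under the stated assumption.
    eigenvalues = weighted.sum(axis=0)
    return weighted / eigenvalues

# Toy principal coordinates for 3 points on 2 components (made up).
coords = np.array([
    [1.0, 0.0],
    [-1.0, 2.0],
    [0.0, -2.0],
])
# Equal masses, as for the rows of an indicator matrix.
masses = np.full(3, 1 / 3)

contrib = contributions(coords, masses)
print(contrib)
```

Each column of the result sums to 1, which is why contributions are naturally displayed as percentages.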
Cosine similarities
mca.row_cosine_similarities(dataset).head()
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.461478 | 2.673675e-29 | 0.533786 |
| 1 | 0.152256 | 3.338988e-29 | 0.399316 |
| 2 | 0.152256 | 2.354904e-29 | 0.399316 |
| 3 | 0.653335 | 2.348969e-29 | 0.284712 |
| 4 | 0.592606 | 3.871772e-01 | 0.016363 |
mca.column_cosine_similarities(dataset).head()
| | 0 | 1 | 2 |
|---|---|---|---|
| Color_PURPLE | 0.015290 | 5.277778e-01 | 0.456920 |
| Color_YELLOW | 0.015290 | 5.277778e-01 | 0.456920 |
| Size_LARGE | 0.015290 | 5.277778e-01 | 0.456920 |
| Size_SMALL | 0.015290 | 5.277778e-01 | 0.456920 |
| Action_DIP | 0.530243 | 2.948838e-31 | 0.004578 |
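These values are commonly read as squared cosines: the fraction of a point's squared distance from the origin that a given component captures. A minimal sketch under that assumption follows. The function name `squared_cosines` and the toy values are hypothetical, and the input must hold the coordinates on all components, not only the retained ones.

```python
import numpy as np

def squared_cosines(full_coords):
    """Squared cosine of each point with each component."""
    sq = full_coords ** 2
    # Divide by each point's squared distance to the origin.
    return sq / sq.sum(axis=1, keepdims=True)

# Toy coordinates over the full set of components (made up).
full_coords = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
])
cos2 = squared_cosines(full_coords)
print(cos2)
```

Over the full set of components each row sums to 1. When only the first few components are kept, as in the tables above, a row can sum to less than 1.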