Correspondence analysis
Data
Correspondence analysis applies when you have a contingency table, that is, when you want to analyse the dependency between two categorical variables. For instance, the dataset below counts the number of voters per region for each candidate in the 2022 French presidential election.
import prince
dataset = prince.datasets.load_french_elections()
dataset[['Le Pen', 'Macron', 'Mélenchon', 'Abstention']].head()
| region | Le Pen | Macron | Mélenchon | Abstention |
|---|---|---|---|---|
| Auvergne-Rhône-Alpes | 943294 | 1175085 | 897434 | 1228490 |
| Bourgogne-Franche-Comté | 409639 | 394117 | 277899 | 456682 |
| Bretagne | 385393 | 647172 | 407527 | 543425 |
| Centre-Val de Loire | 347845 | 383851 | 251259 | 459528 |
| Corse | 42283 | 26795 | 19779 | 90636 |
☝️ This dataset is already available as a contingency matrix. More often you will start from a flat dataset, in which case the contingency matrix can be obtained with the `pivot_table` function in pandas.
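As a minimal sketch of that step, the snippet below pivots a flat table into a contingency matrix. The flat layout and its column names (`region`, `candidate`, `votes`) are assumptions made for illustration; the counts are copied from the table above.

```python
import pandas as pd

# Hypothetical flat layout: one row per (region, candidate) pair.
# The column names are illustrative; the counts come from the table above.
flat = pd.DataFrame({
    "region": ["Bretagne", "Bretagne", "Corse", "Corse"],
    "candidate": ["Le Pen", "Macron", "Le Pen", "Macron"],
    "votes": [385393, 647172, 42283, 26795],
})

# Rows become regions, columns become candidates, cells hold summed counts.
contingency = flat.pivot_table(
    index="region",
    columns="candidate",
    values="votes",
    aggfunc="sum",
    fill_value=0,
)
print(contingency)
```

Using `aggfunc="sum"` means repeated (region, candidate) pairs are added together, and `fill_value=0` puts a zero count where a pair never occurs.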
Fitting
ca = prince.CA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
ca = ca.fit(dataset)
Eigenvalues
ca.eigenvalues_summary
| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.021 | 40.82% | 40.82% |
| 1 | 0.018 | 36.15% | 76.97% |
| 2 | 0.005 | 10.08% | 87.04% |
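To show where such numbers come from, here is a NumPy sketch of the textbook computation: the eigenvalues are the squared singular values of the standardized residuals, and each percentage is an eigenvalue's share of the total inertia. This is not taken from prince's source code. It also uses only the five regions and four columns printed in the Data section, so its output will not match the summary above, which is based on the full dataset.

```python
import numpy as np

# First five regions and four columns of the table in the Data section.
# The full dataset is larger, so these results differ from the summary above.
counts = np.array([
    [943294, 1175085, 897434, 1228490],
    [409639, 394117, 277899, 456682],
    [385393, 647172, 407527, 543425],
    [347845, 383851, 251259, 459528],
    [42283, 26795, 19779, 90636],
], dtype=float)

# Correspondence matrix and its margins (row and column masses).
P = counts / counts.sum()
row_masses = P.sum(axis=1)
col_masses = P.sum(axis=0)

# Standardized residuals with respect to the independence model.
expected = np.outer(row_masses, col_masses)
residuals = (P - expected) / np.sqrt(expected)

# Eigenvalues (principal inertias) are the squared singular values.
eigenvalues = np.linalg.svd(residuals, compute_uv=False) ** 2
total_inertia = eigenvalues.sum()
explained = eigenvalues / total_inertia
cumulative = np.cumsum(explained)

for k, (ev, pct, cum) in enumerate(zip(eigenvalues, explained, cumulative)):
    print(f"{k}  {ev:.3f}  {pct:.2%}  {cum:.2%}")
```

The total inertia equals the chi-squared statistic of the table divided by the grand total. The last eigenvalue is numerically zero because a table with 5 rows and 4 columns has at most 3 non-trivial axes.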
Coordinates
ca.row_coordinates(dataset).head()
| region | 0 | 1 | 2 |
|---|---|---|---|
| Auvergne-Rhône-Alpes | -0.058638 | 0.038303 | 0.000937 |
| Bourgogne-Franche-Comté | -0.070815 | -0.077604 | -0.016357 |
| Bretagne | -0.083655 | 0.110491 | -0.058991 |
| Centre-Val de Loire | -0.024624 | -0.055799 | -0.046167 |
| Corse | 0.127370 | -0.281755 | 0.279328 |
ca.column_coordinates(dataset).head()
| candidate | 0 | 1 | 2 |
|---|---|---|---|
| Arthaud | -0.034732 | -0.091291 | -0.122722 |
| Dupont-Aignan | -0.094708 | -0.064696 | -0.023546 |
| Hidalgo | -0.137897 | 0.052846 | 0.101351 |
| Jadot | -0.126228 | 0.188836 | -0.031329 |
| Lassalle | -0.271867 | -0.091407 | 0.365112 |
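For intuition, the sketch below computes principal coordinates under the standard textbook definition: the singular vectors of the standardized residuals are scaled by the singular values and divided by the square root of the masses. The helper `principal_coordinates` is hypothetical and is not prince's implementation. The input is the five-region, four-column subset from the Data section, so the values will differ from the tables above.

```python
import numpy as np

counts = np.array([
    [943294, 1175085, 897434, 1228490],
    [409639, 394117, 277899, 456682],
    [385393, 647172, 407527, 543425],
    [347845, 383851, 251259, 459528],
    [42283, 26795, 19779, 90636],
], dtype=float)


def principal_coordinates(counts, n_components=3):
    """Hypothetical helper: row and column principal coordinates."""
    P = counts / counts.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    U, s, V = U[:, :n_components], s[:n_components], Vt[:n_components].T
    rows = U * s / np.sqrt(r)[:, None]
    cols = V * s / np.sqrt(c)[:, None]
    return rows, cols, r, c, s ** 2


row_coords, col_coords, row_masses, col_masses, eigenvalues = (
    principal_coordinates(counts)
)
print(np.round(row_coords, 4))
print(np.round(col_coords, 4))
```

On each axis, the mass-weighted mean of the coordinates is zero and their mass-weighted variance equals that axis's eigenvalue. The sign of an axis is arbitrary, so two implementations can return coordinates that are mirror images of each other.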
Visualization
ca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_row_markers=True,
    show_column_markers=True,
    show_row_labels=False,
    show_column_labels=False
)

ca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_row_markers=False,
    show_column_markers=False,
    show_row_labels=False,
    show_column_labels=True
)
Contributions
ca.row_contributions_.head().style.format('{:.0%}')
| | 0 | 1 | 2 |
|---|---|---|---|
| Auvergne-Rhône-Alpes | 2% | 1% | 0% |
| Bourgogne-Franche-Comté | 1% | 1% | 0% |
| Bretagne | 2% | 4% | 4% |
| Centre-Val de Loire | 0% | 1% | 2% |
| Corse | 0% | 2% | 8% |
ca.column_contributions_.head().style.format('{:.0%}')
| | 0 | 1 | 2 |
|---|---|---|---|
| Arthaud | 0% | 0% | 1% |
| Dupont-Aignan | 1% | 0% | 0% |
| Hidalgo | 1% | 0% | 3% |
| Jadot | 3% | 7% | 1% |
| Lassalle | 8% | 1% | 61% |
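A contribution measures how much of an axis's inertia is due to a given row or column. Under the usual definition it is the mass times the squared coordinate, divided by the axis's eigenvalue. The sketch below applies that formula with a hypothetical helper `contributions`. It assumes this textbook definition rather than reproducing prince's code, and it runs on the five-region, four-column subset from the Data section, so the percentages differ from the tables above.

```python
import numpy as np

counts = np.array([
    [943294, 1175085, 897434, 1228490],
    [409639, 394117, 277899, 456682],
    [385393, 647172, 407527, 543425],
    [347845, 383851, 251259, 459528],
    [42283, 26795, 19779, 90636],
], dtype=float)


def contributions(counts, n_components=3):
    """Hypothetical helper: row and column contributions to each axis."""
    P = counts / counts.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    U, s, V = U[:, :n_components], s[:n_components], Vt[:n_components].T
    eigenvalues = s ** 2
    row_coords = U * s / np.sqrt(r)[:, None]
    col_coords = V * s / np.sqrt(c)[:, None]
    # contribution = mass * coordinate^2 / eigenvalue, per axis
    row_contrib = r[:, None] * row_coords ** 2 / eigenvalues
    col_contrib = c[:, None] * col_coords ** 2 / eigenvalues
    return row_contrib, col_contrib


row_contrib, col_contrib = contributions(counts)
for label, values in zip(
    ["Auvergne-Rhône-Alpes", "Bourgogne-Franche-Comté", "Bretagne",
     "Centre-Val de Loire", "Corse"],
    row_contrib,
):
    print(label, " ".join(f"{v:.0%}" for v in values))
```

Contributions on a given axis sum to 100% over all rows, and likewise over all columns. The `.head()` outputs above show only five entries, which is why their percentages do not add up to 100%.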
Cosine similarities
ca.row_cosine_similarities(dataset).head()
| region | 0 | 1 | 2 |
|---|---|---|---|
| Auvergne-Rhône-Alpes | 0.568331 | 0.242500 | 0.000145 |
| Bourgogne-Franche-Comté | 0.365626 | 0.439086 | 0.019507 |
| Bretagne | 0.212706 | 0.371061 | 0.105772 |
| Centre-Val de Loire | 0.076356 | 0.392078 | 0.268406 |
| Corse | 0.066825 | 0.327001 | 0.321391 |
ca.column_cosine_similarities(dataset).head()
| candidate | 0 | 1 | 2 |
|---|---|---|---|
| Arthaud | 0.024619 | 0.170088 | 0.307375 |
| Dupont-Aignan | 0.305277 | 0.142452 | 0.018869 |
| Hidalgo | 0.292428 | 0.042947 | 0.157968 |
| Jadot | 0.265642 | 0.594500 | 0.016364 |
| Lassalle | 0.307040 | 0.034709 | 0.553774 |
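A cosine similarity (often written cos²) tells how well an axis represents a point. It is the squared coordinate on that axis divided by the point's squared chi-squared distance to the average profile. The sketch below follows that textbook definition with a hypothetical helper `row_cos2`, which is not prince's code. It runs on the five-region, four-column subset from the Data section, so the values differ from the tables above.

```python
import numpy as np

counts = np.array([
    [943294, 1175085, 897434, 1228490],
    [409639, 394117, 277899, 456682],
    [385393, 647172, 407527, 543425],
    [347845, 383851, 251259, 459528],
    [42283, 26795, 19779, 90636],
], dtype=float)


def row_cos2(counts, n_components):
    """Hypothetical helper: squared cosines of rows on each kept axis."""
    P = counts / counts.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses (also the average row profile)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    coords = U[:, :n_components] * s[:n_components] / np.sqrt(r)[:, None]
    # Squared chi-squared distance from each row profile to the average.
    profiles = P / r[:, None]
    dist2 = ((profiles - c) ** 2 / c).sum(axis=1)
    return coords ** 2 / dist2[:, None]


cos2_two_axes = row_cos2(counts, n_components=2)
cos2_all_axes = row_cos2(counts, n_components=3)
print(np.round(cos2_two_axes, 3))
print(np.round(cos2_all_axes.sum(axis=1), 3))
```

When every non-trivial axis is kept, which is three for this 5 × 4 subset, each row's values sum to 1. With fewer axes the sum is smaller, and a low total flags a point that the retained axes represent poorly.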