Correspondence analysis

Resources

Theory of Correspondence Analysis has all the equations.
Correspondence analysis by Hervé Abdi and Michel Béra is also excellent, although it covers more than just CA.
L’Analyse Factorielle des Correspondances (AFC) by Marie Chavent is short and sweet (in French).

Data

You can use correspondence analysis when you have a contingency table, that is, when you want to analyse the dependency between two categorical variables. For instance, here is a dataset which counts, for each region, the number of voters who chose each candidate (or abstained) in the 2022 French presidential elections.

import prince

dataset = prince.datasets.load_french_elections()
dataset[['Le Pen', 'Macron', 'Mélenchon', 'Abstention']].head()

candidate	Le Pen	Macron	Mélenchon	Abstention
region
Auvergne-Rhône-Alpes	943294	1175085	897434	1228490
Bourgogne-Franche-Comté	409639	394117	277899	456682
Bretagne	385393	647172	407527	543425
Centre-Val de Loire	347845	383851	251259	459528
Corse	42283	26795	19779	90636

☝️ This dataset is already available as a contingency matrix. It is more common to start from a flat dataset, with one row per observation. In that case, a contingency matrix can be obtained using the pivot_table function in pandas.
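As a minimal sketch of that pivot, the snippet below builds a contingency matrix from a flat table. The flat layout and the column names region, candidate and votes are assumptions made for illustration; the counts are copied from the table above.

```python
import pandas as pd

# Hypothetical flat dataset: one row per (region, candidate) pair.
# The column names are illustrative; the counts come from the table above.
flat = pd.DataFrame({
    "region": ["Bretagne", "Bretagne", "Corse", "Corse"],
    "candidate": ["Macron", "Le Pen", "Macron", "Le Pen"],
    "votes": [647172, 385393, 26795, 42283],
})

# Regions become rows, candidates become columns, cells hold summed counts.
contingency = flat.pivot_table(
    index="region",
    columns="candidate",
    values="votes",
    aggfunc="sum",
    fill_value=0,
)
print(contingency)
```

If the flat dataset has one row per individual voter rather than pre-aggregated counts, pd.crosstab(flat["region"], flat["candidate"]) gives the same kind of table.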

Fitting

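ca = prince.CA(
    n_components=3,     # number of components to keep
    n_iter=3,           # number of iterations used by the SVD solver
    copy=True,          # work on a copy instead of modifying the input
    check_input=True,   # validate the input before fitting
    engine='sklearn',   # SVD backend
    random_state=42     # seed, for reproducible results
)
ca = ca.fit(dataset)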

Eigenvalues

ca.eigenvalues_summary

	eigenvalue	% of variance	% of variance (cumulative)
component
0	0.021	40.82%	40.82%
1	0.018	36.15%	76.97%
2	0.005	10.08%	87.04%
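To see where such eigenvalues come from, here is a minimal NumPy sketch of the standard textbook construction: the eigenvalues are the squared singular values of the matrix of standardized residuals. This is not taken from prince's source code, and it only uses three regions and the four columns shown earlier, so its numbers will differ from the summary above.

```python
import numpy as np

# Subset of the contingency table shown earlier (3 regions x 4 columns).
N = np.array([
    [943294, 1175085, 897434, 1228490],  # Auvergne-Rhône-Alpes
    [409639, 394117, 277899, 456682],    # Bourgogne-Franche-Comté
    [385393, 647172, 407527, 543425],    # Bretagne
], dtype=float)

P = N / N.sum()      # correspondence matrix (relative frequencies)
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Standardized residuals: departure from independence, cell by cell.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

singular_values = np.linalg.svd(S, compute_uv=False)
eigenvalues = singular_values ** 2      # principal inertias
total_inertia = eigenvalues.sum()       # equals the chi-square statistic / N.sum()
explained = eigenvalues / total_inertia

print(eigenvalues)
print(explained)
```

The total inertia is the chi-square statistic of the table divided by the grand total, which is why CA is often described as a decomposition of the chi-square statistic.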

Coordinates

ca.row_coordinates(dataset).head()

	0	1	2
region
Auvergne-Rhône-Alpes	-0.058638	0.038303	0.000937
Bourgogne-Franche-Comté	-0.070815	-0.077604	-0.016357
Bretagne	-0.083655	0.110491	-0.058991
Centre-Val de Loire	-0.024624	-0.055799	-0.046167
Corse	0.127370	-0.281755	0.279328

ca.column_coordinates(dataset).head()

	0	1	2
candidate
Arthaud	-0.034732	-0.091291	-0.122722
Dupont-Aignan	-0.094708	-0.064696	-0.023546
Hidalgo	-0.137897	0.052846	0.101351
Jadot	-0.126228	0.188836	-0.031329
Lassalle	-0.271867	-0.091407	0.365112
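The coordinates can be sketched from the same decomposition. The snippet below computes principal coordinates following the usual textbook formulas, on a three-region subset of the table shown earlier. It is an illustration rather than prince's actual implementation, and the sign of each axis is arbitrary, so values may be flipped relative to a library's output.

```python
import numpy as np

# Subset of the contingency table shown earlier (3 regions x 4 columns).
N = np.array([
    [943294, 1175085, 897434, 1228490],  # Auvergne-Rhône-Alpes
    [409639, 394117, 277899, 456682],    # Bourgogne-Franche-Comté
    [385393, 647172, 407527, 543425],    # Bretagne
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Thin SVD of the standardized residuals: S = U diag(sv) Vt
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: scale the singular vectors by the singular values,
# then divide each row by the square root of its mass.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

print(row_coords)
print(col_coords)
```

With this scaling, the mass-weighted variance of the coordinates along each axis equals that axis's eigenvalue, and the mass-weighted mean is zero.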

Visualization

ca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_row_markers=True,
    show_column_markers=True,
    show_row_labels=False,
    show_column_labels=False
)

ca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_row_markers=False,
    show_column_markers=False,
    show_row_labels=False,
    show_column_labels=True
)

Contributions

ca.row_contributions_.head().style.format('{:.0%}')

	0	1	2
Auvergne-Rhône-Alpes	2%	1%	0%
Bourgogne-Franche-Comté	1%	1%	0%
Bretagne	2%	4%	4%
Centre-Val de Loire	0%	1%	2%
Corse	0%	2%	8%

ca.column_contributions_.head().style.format('{:.0%}')

	0	1	2
Arthaud	0%	0%	1%
Dupont-Aignan	1%	0%	0%
Hidalgo	1%	0%	3%
Jadot	3%	7%	1%
Lassalle	8%	1%	61%
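A contribution measures how much of an axis's inertia is due to a given row or column. As a hedged sketch using the usual textbook definition (mass times squared principal coordinate, divided by the eigenvalue), and not necessarily the way prince computes it internally:

```python
import numpy as np

# Subset of the contingency table shown earlier (3 regions x 4 columns).
N = np.array([
    [943294, 1175085, 897434, 1228490],  # Auvergne-Rhône-Alpes
    [409639, 394117, 277899, 456682],    # Bourgogne-Franche-Comté
    [385393, 647172, 407527, 543425],    # Bretagne
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # a 3 x 4 table has at most min(3, 4) - 1 = 2 non-trivial axes
eig = sv[:k] ** 2
row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :k] * sv[:k]) / np.sqrt(c)[:, None]

# Contribution of each point to each axis: mass * coordinate^2 / eigenvalue.
row_contrib = (r[:, None] * row_coords ** 2) / eig
col_contrib = (c[:, None] * col_coords ** 2) / eig

print(row_contrib.round(3))
print(col_contrib.round(3))
```

Each column of these tables sums to 1, which is why contributions are naturally displayed as percentages.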

Cosine similarities

ca.row_cosine_similarities(dataset).head()

	0	1	2
region
Auvergne-Rhône-Alpes	0.568331	0.242500	0.000145
Bourgogne-Franche-Comté	0.365626	0.439086	0.019507
Bretagne	0.212706	0.371061	0.105772
Centre-Val de Loire	0.076356	0.392078	0.268406
Corse	0.066825	0.327001	0.321391

ca.column_cosine_similarities(dataset).head()

	0	1	2
candidate
Arthaud	0.024619	0.170088	0.307375
Dupont-Aignan	0.305277	0.142452	0.018869
Hidalgo	0.292428	0.042947	0.157968
Jadot	0.265642	0.594500	0.016364
Lassalle	0.307040	0.034709	0.553774
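Cosine similarities (often written cos²) indicate how well each point is represented by each axis. A minimal sketch under the usual textbook definition: the squared principal coordinate divided by the squared chi-square distance between the row profile and the average profile. This illustrates the idea on a three-region subset and is not prince's implementation.

```python
import numpy as np

# Subset of the contingency table shown earlier (3 regions x 4 columns).
N = np.array([
    [943294, 1175085, 897434, 1228490],  # Auvergne-Rhône-Alpes
    [409639, 394117, 277899, 456682],    # Bourgogne-Franche-Comté
    [385393, 647172, 407527, 543425],    # Bretagne
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # all non-trivial axes of a 3 x 4 table
row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]

# Squared chi-square distance from each row profile to the average profile c.
profiles = P / r[:, None]
dist2 = ((profiles - c) ** 2 / c).sum(axis=1)

# Share of each row's squared distance captured by each axis.
row_cos2 = row_coords ** 2 / dist2[:, None]

print(row_cos2.round(3))
```

When every non-trivial axis is kept, each row sums to 1. With fewer components, as in the output above where n_components=3, the row sum tells you how much of that point is captured by the retained axes.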