Reducing the memory footprint of a scikit-learn text classifier • Max Halford

Context

This week at Alan I’ve been working on parsing French medical prescriptions. There are three types of prescriptions: lenses, glasses, and pharmaceutical prescriptions. Different information needs to be extracted depending on the prescription type. Therefore, the first step is to classify the prescription. The prescriptions we receive are pictures taken by users with their phone. We run each image through an OCR to obtain a text transcription of the image. We can thus use the text transcription to classify the prescription.

I played around with writing some regex rules and reached a macro F1 score of 95%. Not bad, but not perfect. I wanted to reach 100% because it seemed to me like a simple classification task. That is, when I looked at some sample prescriptions, the type of the prescription was really obvious. I weighed the pros and cons of writing more complex regex patterns versus building a machine learning text classifier. I opted for the latter. I wrote a simple scikit-learn pipeline and tested it as so:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

model = make_pipeline(
    CountVectorizer(),
    Normalizer(),
    LogisticRegression()
)

train = prescriptions[:len(prescriptions) // 2]
test = prescriptions[len(prescriptions) // 2:]

model.fit(train['text'], train['type'])

y_pred = pd.DataFrame(
    model.predict_proba(test['text']),
    columns=model.classes_,
    index=test.index
)
y_pred.head()

prescription_id	contact_lenses	glasses	pharmacy
495866	0.984725	0.0115185	0.00375634
495838	0.041953	0.187964	0.770083
495838	0.041953	0.187964	0.770083
495825	0.0221103	0.966687	0.011203
495812	0.964409	0.0213663	0.0142247

Note: the prescriptions variable is a pandas.DataFrame containing the prescription text transcriptions along with the type (glasses, lenses, or pharmacy).

I decided to trust the classifier if the probability it assigned to a class was above 0.7. If the classifier is unsure, the prescription type can’t be determined, and it will be rejected or sent for manual processing. This rule provided the following classification report:

from sklearn.metrics import classification_report

confident = y_pred.max(axis='columns') > .7
print(classification_report(
    test[confident]['type'],
    y_pred[confident].idxmax(axis='columns'),
    digits=3
))

                precision    recall  f1-score   support

contact_lenses      0.981     0.964     0.972       950
       glasses      0.991     0.986     0.989      2104
      pharmacy      0.987     0.999     0.993      2319

      accuracy                          0.988      5373
     macro avg      0.986     0.983     0.985      5373
  weighted avg      0.988     0.988     0.987      5373

The number of cases where the model is confident enough – the recall in other words – can be computed as so:

print(f'Recall is {confident.sum() / len(confident):.2%}')

Recall is 90.24%

Good enough! I checked some false positive cases and all of them were mislabeled. Indeed, the training data I’m using are manual annotations made by our operators. Once in a while they make a mistake, which is totally acceptable and simply something we have to deal with. However, learning with noisy labels is another story altogether, and I won’t get into it in this blog post. I do however recommend this research article if you’re interested.

I’m confident enough to put this model into production. At Alan we like to keep things simple. Our website is a Flask app running on Heroku. We want to load this scikit-learn model in memory when the app boots up, and simply call it when a prescription has to be classified. If this pattern works for us, we might use it for other document classification tasks.

The model I trained weighs around 2.7 MB. That’s over 8000 times the amount of RAM in the first Apollo shuttle that went to the Moon. Obviously I’m saying this with my tongue in my cheek. And yet, I believe it’s important to be frugal and not waste the memory of our precious Heroku dynos. Especially if I want to convince the engineering team to use more machine learning models.

In a text classifier, the importance of each word follows a long-tail distribution. Most words are not important at all, which is the driving insight for knowledge distillation. As we will see, this is very true for the above model. I decided to push the envelope and lighten the memory footprint of the model, while preserving its accuracy. Anyway, enough with the (long) introduction, here goes.

Measuring the model’s size

From experience, I have a good idea of the model’s memory footprint layout. However, it’s always good to confirm this by measuring the memory footprint of each part of the model. I wrote a little utility to do this for any scikit-learn pipeline:

from humanize import naturalsize
from pickle import dumps

def obj_size(obj):
    return naturalsize(sys.getsizeof(dumps(obj)))

def estimator_size(model):
    return {
        attr: obj_size(getattr(model, attr))
        for attr in model.__dict__ if attr.endswith('_')
    }

for name, step in model.steps:
    print(name)
    for attr, size in estimator_size(step).items():
        print(f'\t{attr.ljust(20)} {size}')
    print()

print('Total'.ljust(20), ' ' * 7, obj_size(model))

countvectorizer
	fixed_vocabulary_    37 Bytes
	stop_words_          38 Bytes
	vocabulary_          980.8 kB

normalizer
	n_features_in_       50 Bytes

logisticregression
	n_features_in_       50 Bytes
	classes_             219 Bytes
	n_iter_              184 Bytes
	coef_                1.8 MB
	intercept_           204 Bytes

Total                        2.7 MB

The above table shows the memory footprint of each attribute for each step of the pipeline. I’m only looking at the attributes whose name ends with a _ because that’s the scikit-learn convention for marking attributes that have been created during the fit call. In other words, I’m ignoring hyperparameters that are provided during initialization, because their memory footprint is insignificant.

As we can see, most of the memory footprint is taken up by the logistic regression coef_ attribute. The latter is a matrix which stores the $n \times k$ weights for each of the $n$ words and the $k$ classes. The second culprit is the count vectorizer’s vocabulary_ attribute, which assigns an index to each word. This allows determining the weight vector of each word because coef_ is integer-indexed instead of being word-indexed. We can now confidently move on and reduce the memory footprint of these two attributes.

Reducing the model’s size

In a logistic regression, the importance of a feature is proportional to the magnitude of its weight vector. Therefore, unimportant features have a weight vector whose magnitude is close to 0. Intuitively, their contribution to the final sum is very small. In our case the features are the words in the text. By determining the unimportant words, we may reduce the model’s memory by limiting the considered vocabulary.

First, let’s measure the importance of each word. We can compute the feature-wise $L^2$ norm to measure the magnitude of each word’s weight vector.

from numpy import np

log_reg = model.steps[2][1]
W = log_reg.coef_
magnitudes = np.linalg.norm(W, ord=2, axis=0)

Now let’s list the most important words by sorting their associated weight vector magnitudes.

vectorizer = model.steps[0][1]
idx_to_word = {
    idx: word
    for word, idx in vectorizer.vocabulary_.items()
}

top = np.argsort(magnitudes)[::-1]

meaningful_vocab = [
    idx_to_word[k]
    for k in top[:100]
]

Here are the 10 most important words:

lentilles
monture
verres
lunettes
ophtalmologiste
50
oeil
ophtalmologie
25
gauche

Looks good to me! I’m not too sure what the 25 and 50 are doing there, but I guess they’re indicative of drug dosage (e.g. “50 mg of paracetamol”), and are thus distinctive indicators of pharmaceutical prescriptions.

We can prune the model now that we know which words matter. I found that the easiest way to proceed is to specify a list of words to the CountVectorizer. Indeed, the latter has a vocabulary parameter which overrides the lists of words that are used in the model. By default, all the words in the training set are used.

pruned_model = make_pipeline(
    CountVectorizer(
        vocabulary=meaningful_vocab
    ),
    Normalizer(),
    LogisticRegression()
)

pruned_model.fit(train['text'], train['text'])

Let’s first check the pruned model’s performance.

y_pred_pruned = pd.DataFrame(
    pruned_model.predict_proba(test['text']),
    columns=pruned_model.classes_,
    index=test.index
)

confident = y_pred_pruned.max(axis='columns') > .7
print(classification_report(
    test[confident]['type'],
    y_pred_pruned[confident].idxmax(axis='columns'),
    digits=3
))

                precision    recall  f1-score   support

contact_lenses      0.976     0.963     0.970      1068
       glasses      0.988     0.985     0.986      2170
      pharmacy      0.991     0.999     0.995      2324

      accuracy                          0.987      5562
     macro avg      0.985     0.982     0.984      5562
  weighted avg      0.987     0.987     0.987      5562

print(f'Recall is {confident.sum() / len(confident):.2%}')

Recall is 93.42%

The performance is (slightly) better! The intention of this pruning process was not to improve the model, but it did. That’s the magic of regularization for you. Indeed, the pruning process we’ve applied boils down to sparsity regularisation.

Let’s now see the memory footprint of the pruned model.

for name, step in model.steps:
    print(name)
    for attr, size in estimator_size(step).items():
        print(f'\t{attr.ljust(20)} {size}')
    print()

print('Total'.ljust(20), ' ' * 7, obj_size(model))

countvectorizer
	fixed_vocabulary_    37 Bytes
	vocabulary_          1.1 kB

normalizer
	n_features_in_       38 Bytes

logisticregression
	n_features_in_       38 Bytes
	classes_             219 Bytes
	n_iter_              184 Bytes
	coef_                2.6 kB
	intercept_           204 Bytes

Total                        5.0 kB

We’ve reduced the model’s memory consumption by a whopping factor of ~545. We’ve done this without impacting the model’s accuracy. It’s a free win! Note that you could push the envelope even further by reducing the precision of the coefficient matrix to 32 bits instead of 64 bits, but the savings would be marginal.

Table of contents

Context

Measuring the model’s size

Reducing the model’s size

Related posts