The Naïve Bayes classifier
The objective of a classifier is to decide to which class (also called label) to assign an observation based on observed data. In supervised learning, this is done by taking into account previous classifications. In other words if we know that certain observations are classified in a certain way, the goal is to determine the class of a new observation. The first group of observations on which the classifier is built is called the training set.
In mathematical terms say is a list of attributes for each one of a group of observations. In the training set, each observation has a label associated to it. The question is which to associate to a new ?
The Naïve Bayes method is a probabilistic approach. It assigns a probability to each label based on the training set. The best guess is the label with the highest probability.
The general idea is to know where is a discrete class/label and is a list of features. For example if the class was an author and the features were words in a text the conditional probability could be read as
“What is the probability that a text is written by author given that words are in the text?”
Good question. This is where the “Bayes” in the title comes from Bayes’ theorem gives
- is called the prior. It is the probability we are looking for before the data is being considered. In philosophy the prior is the state of the knowledge on a subject before an experience is performed.
- is the frequency at which appeared for . This is the observed data in the training set.
- represents the support provides for .
In a sense Bayes’ theorem is a “magic” formula which gives the probability of having A (for example an author) based on the fact that we already B (for example some words). It does because we already know at what frequency B appears when A does. In other words, if the word “Quiditch” appears in a sentence, you have a good chance that the sentence belongs to a Harry Potter book because you already know that it frequently appears in Harry Potter books.
If we calculate for every then we can sort them based on their value and choose one of them. This is called a decision rule. Not surprisingly the most common decision rule is to take the highest probability, this is called maximum a posteriori (MAP).
Now for calculating :
- is simply the frequency of the class.
- doesn’t depend on and so it doesn’t have to be calculated. Indeed is the same for every and won’t affect the order of all the .
- is the only (slighly) tricky probability to compute.
is the probability of having a list of features in a text. For the sake of simplicity we can suppose that the occurrences of features are independent from each other (this is why the classifier is considered “Naïve”). Of course this is false, indeed text often use semantic fields. However it has been shown that the classifier does well even if the assumption seems foolish. Because the events are independent the mathematics become very convenient. Indeed if there are features in a text the probability that a document belongs to a class can now be written down as
It’s important to understand that this isn’t true if the features occurrences are not considered independent. For example, the events “It’s raining” and “It’s cold” are not independent. In a probabilistic sense, the event (“It’s cold and rainy”) is not equal to . This is because most of the time one implies the other. In our case it’s mathematically convenient to consider the independence of events, that’s all.
If you want to run the code that will come up yourself, you can simply copy/paste it all into your text editor, it will work out of the box. However you will have to download the pandas module in order to read a CSV dataframe.
I like using classes, you really can make most of object-oriented programming for creating tidy packages. In this case there are two classes.
Counterclass for parsing the texts and estimating the probabilities.
NaiveBayesclass for predicting the class of new texts which inherits from
The main advantage of having a
Counter class is that it can be reused for different classifiers. In machine learning in general it’s a good idea to split feature extraction and prediction. This is the first thing to code.
Counter class takes 4 arguments. The first is a dataframe which should contain a column for the texts (3rd argument) and a column for the classes (4th argument). The
features argument is a function that extracts the desired features from a text; it can be anything as long as it returns an iterable (list, set, array). For example one could extract stemmed and/or lower cased words. We will get to that later on. In order to compute the probabilities described above we have to count the features and class occurences, the best way to do this is to use dictionaries. The last line parses all the dataframe with the following procedure.
The procedure takes a row as argument and analyses the text and class column. If the class has already appeared then it’s count is incremented, else it begins at 1. The same goes for every feature extracted from the text. The
features dictionary has two levels of depth, one for the feature and one for the class. With this implementation it’s very easy to get the occurences of a feature in a text of a specific category:
Now that the texts have been parsed the features we need a function that returns in order to compute .
A problem that I haven’t mentioned is how to deal with features that haven’t been encountered/classified in previous texts when doing a prediction. Intuitively an unencountered feature should bias our the decision rule towards a specific class. Because is the result of a product, multiplying it by won’t affect it ( is the identity element for the multiplication of real numbers).
The same kind of problem arises as to what probability to assign if a feature has been encountered but not for a certain class. In some sense this should be something low. But it shouldn’t be 0, because then would be equal to 0 and that is too harsh. However it’s also reasonable to return after the
else, which is equivalent to ignoring the feature (but not ignoring the fact that it was never associated with that class!).
Either the feature has already been associated with a class and we can calculate the
count / total ratio, which is the number of times a feature has been associated with a class over the number of occurences of a class. In other words this is the frequency of the association between a feature and a class (. In the other case the ratio is
1 / total, which is very pessimistic.
Now we can code the
NaiveBayes class, the one that will be used for prediction.
The first function classifies a sentence. The outer
for loops over all the possible classes. Lines
10 compute . The inner
for loops iterates over all the features in order to compute . Finally the class that has the highest probability is returned. The second function applies the first function to a dataframe.
The only piece of the puzzle missing is a feature function. Basically, if we have a sentence, we want to be able to transform it into a vector of features (which can be words, prefixes, stems, …) in order to compute individual probabilities for each feature. A very simple approach is to split a sentence on it’s white spaces and to put the words to lowercase.
Choosing a good feature function is crucial for the performance of a classifier, some people argue that it is even more important than the choice of the classifier.
The following piece of code puts all the pieces together.