In this repo I implement from scratch different Machine Learning algorithms commonly used in Natural Language Processing tasks.
Here you can find the different model implementations.
In Linear Regression, the output is the weighted sum of the inputs.

- Linear Regression equation: $h(x_i) = \theta^T x_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + ... + \theta_n x_{in}$
- Input features vector for a single sequence: $x_i$ (with $x_{i0} = 1$ so that $\theta_0$ acts as the bias)
- Weights vector: $\theta$
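As a minimal sketch (assuming NumPy, with the bias feature prepended to each input as a column of ones), the prediction over all inputs is a single matrix-vector product:

```python
import numpy as np

def linear_regression(X, theta):
    """Weighted sum of inputs: h(x_i) = theta^T x_i for every row of X.
    X: (m, n+1) inputs with a leading column of ones for the bias.
    theta: (n+1, 1) weights vector."""
    return np.dot(X, theta)
```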
In Logistic Regression, we pass the Linear Regression output through a sigmoid function, which maps any real value to a value between 0 and 1.

- Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Logistic Regression equation: $h(x_i, \theta) = \sigma(\theta^T x_i)$, which is $\geq 0.5$ if $\theta^T x_i \geq 0$ and $< 0.5$ if $\theta^T x_i < 0$
```python
import numpy as np


def sigmoid(X, theta):
    """
    Sigmoid function for logistic regression
    :param X: input features. Numpy array of shape (m, n+1) (m=number of input sequences, n=number of features; +1 for bias)
    :param theta: weights vector. Numpy array of shape (n+1, 1)
    :return: logistic regression output over X. Numpy array of shape (m, 1)
    """
    return 1.0 / (1.0 + np.exp(-np.dot(X, theta)))
```
Log loss for a single training example:

$loss_i = -(y_i \log(h(x_i, \theta)) + (1 - y_i) \log(1 - h(x_i, \theta)))$

- Gold label: $y_i$
- Predicted probability for sequence $x_i$ (between 0 and 1): $h(x_i, \theta)$. The logs will be negative (we add a small value to avoid $\log(0)$): $\log_{10}(0.0001) = -4$, $\log_{10}(0.5) \approx -0.3$, $\log_{10}(1) = 0$
- First term: probability of class 1: ${\color{DarkGreen} {p_i = h(x_i, \theta)}}$
- Second term: probability of class 0: ${\color{Red} {1 - p_i = 1 - h(x_i, \theta)}}$
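As a quick numeric check (hypothetical values, natural log): a confident correct prediction yields a small loss, while a confident wrong prediction yields a large one:

```python
import numpy as np

y, p = 1, 0.9  # gold label 1, predicted probability 0.9
loss_good = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # -log(0.9), small loss

y, p = 1, 0.1  # gold label 1, predicted probability 0.1
loss_bad = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # -log(0.1), large loss
```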
The cost function used for logistic regression is the average of the log loss across all training examples:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h(x_i, \theta)) + (1 - y_i) \log(1 - h(x_i, \theta)) \right]$

- Number of training examples: $m$
```python
def cost(Y_prob, Y):
    """
    Compute the log loss between predicted labels and gold labels
    :param Y_prob: predicted probabilities. Numpy array of shape (m, 1) (m=number of labels)
    :param Y: true labels. Numpy array of shape (m, 1)
    :return: cost value (float)
    """
    eps = 1e-9  # small value to avoid log(0)
    return float(-(np.dot(Y.T, np.log(Y_prob + eps)) + np.dot((1 - Y).T, np.log(1 - Y_prob + eps))) / Y.shape[0])
```
We update the weights with gradient descent:

$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h(x_i, \theta) - y_i) \, x_{ij}$

- $i$ is the index across all $m$ training examples.
- $j$ is the index of the weight $\theta_j$, so $x_j$ is the feature associated with weight $\theta_j$.
- Weights vector: $\theta$, with dimensions $(n+1, 1)$
- Learning rate: $\alpha$
- Number of training examples: $m$
- Training examples: $X$, with dimensions $(m, n+1)$
- Training labels: $Y$, with dimensions $(m, 1)$
- Predicted labels (probabilities): $Y_{prob}$
```python
def gradient_descent(X, Y, Y_prob, theta, alpha):
    """
    Update the weights vector
    :param X: training features. Numpy array of shape (m, n+1) (m=number of training sequences, n=number of features; +1 for bias)
    :param Y: training labels. Numpy array of shape (m, 1)
    :param Y_prob: predicted probabilities. Numpy array of shape (m, 1)
    :param theta: weights vector. Numpy array of shape (n+1, 1)
    :param alpha: learning rate. Float
    :return: updated weights vector. Numpy array of shape (n+1, 1)
    """
    return theta - alpha / X.shape[0] * np.dot(X.T, Y_prob - Y)
```
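A hypothetical training loop combining the pieces above (the function and variable names here are illustrative, not from the repo): predict with the sigmoid, then take a gradient descent step, and repeat.

```python
import numpy as np

def train_logistic_regression(X, Y, alpha=0.5, num_iters=5000):
    """Alternate sigmoid predictions and gradient descent updates.
    X: (m, n+1) features with a bias column of ones; Y: (m, 1) labels in {0, 1}."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(num_iters):
        Y_prob = 1.0 / (1.0 + np.exp(-np.dot(X, theta)))              # sigmoid(X, theta)
        theta = theta - alpha / X.shape[0] * np.dot(X.T, Y_prob - Y)  # gradient step
    return theta
```

On a small linearly separable dataset this converges to a `theta` whose predicted probabilities fall on the correct side of 0.5 for every example.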
Bayes' rule allows us to compute the conditional probability of a class given some features with the following equation:

$P(class \mid features) = \frac{P(features \mid class) \, P(class)}{P(features)}$

The Naïve Bayes algorithm can be used for NLP classification tasks; however, it makes two assumptions that are not always true in language data:
- Conditional Independence: the predictive features (words) are independent
- Bag of Words: the word order is not important
- $m$: number of words in the sequence
- $N_{class}$: frequency of all words in a class
- $V$: vocabulary size (number of unique words in the vocabulary)
- Laplace smoothing: we add 1 to the numerator and $V$ to the denominator to avoid multiplying by zero when we find a word that is not in our training vocabulary:

$P(w_i \mid class) = \frac{freq(w_i, class) + 1}{N_{class} + V}$
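A small illustration of the Laplace-smoothed likelihood (the counts and names below are toy values, not from the repo):

```python
# Toy (word, class) frequency counts (hypothetical)
freqs = {("great", "pos"): 3, ("bad", "neg"): 4}
N_pos = 10  # total frequency of all words in the positive class
V = 8       # vocabulary size

def likelihood(word, cls, N_class, V):
    # Laplace smoothing: add 1 to the numerator and V to the denominator
    return (freqs.get((word, cls), 0) + 1) / (N_class + V)

p_seen = likelihood("great", "pos", N_pos, V)    # (3 + 1) / (10 + 8)
p_unseen = likelihood("awful", "pos", N_pos, V)  # (0 + 1) / (10 + 8), never zero
```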
We get the likelihood score for the sequence:

$score = \frac{P(pos)}{P(neg)} \prod_{i=1}^{m} \frac{P(w_i \mid pos)}{P(w_i \mid neg)}$
- if score > 1 => class 1 (positive)
- if score < 1 => class 0 (negative)
- if score = 1 => neutral
To avoid numerical underflow issues with the likelihood product, we introduce the log:

$log\,score = \log\frac{P(pos)}{P(neg)} + \sum_{i=1}^{m} \log\frac{P(w_i \mid pos)}{P(w_i \mid neg)}$

The first component of the equation is the log prior and represents the class distribution across the whole training set, that is, the ratio of positive to negative documents in the training set. For perfectly balanced datasets, this ratio will be 1, so its log will be 0 and we won't add anything to the log likelihood.
We get the log likelihood score for the sequence:
- if score > 0 => class 1 (positive)
- if score < 0 => class 0 (negative)
- if score = 0 => neutral
Reminder: log properties:

$\log(xy) = \log(x) + \log(y)$

$\log\frac{x}{y} = \log(x) - \log(y)$
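To see why the log matters: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays at a representable magnitude (illustrative numbers):

```python
import math

probs = [1e-5] * 100  # 100 small per-word probabilities
product = 1.0
for p in probs:
    product *= p      # true value is 1e-500, below the smallest positive float

log_score = sum(math.log(p) for p in probs)  # stays at a usable magnitude
```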
- Create a frequencies dictionary with:
  - Key: (word, class)
  - Value: the frequency with which that word is mapped to that class in the training set
- Count the number of positive and negative documents
- Get the vocabulary size
- Calculate the log prior
- Create a dictionary with the log likelihood of each word in the vocabulary
- Predict the class of a new document by adding the log prior and the log likelihood of each word from the document
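The training and prediction steps above could be sketched as follows (a minimal illustration; documents are token lists, classes are 1/0, and all names are hypothetical):

```python
import math
from collections import defaultdict

def train_naive_bayes(documents, labels):
    """Build the frequencies dictionary, log prior, and log likelihoods."""
    freqs = defaultdict(int)  # (word, class) -> frequency in the training set
    for doc, y in zip(documents, labels):
        for word in doc:
            freqs[(word, y)] += 1
    n_pos = sum(1 for y in labels if y == 1)   # number of positive documents
    n_neg = len(labels) - n_pos                # number of negative documents
    vocab = {word for word, _ in freqs}
    V = len(vocab)                             # vocabulary size
    N_pos = sum(c for (w, y), c in freqs.items() if y == 1)
    N_neg = sum(c for (w, y), c in freqs.items() if y == 0)
    log_prior = math.log(n_pos / n_neg)
    # Laplace-smoothed log likelihood ratio for each vocabulary word
    log_likelihood = {
        w: math.log((freqs.get((w, 1), 0) + 1) / (N_pos + V))
           - math.log((freqs.get((w, 0), 0) + 1) / (N_neg + V))
        for w in vocab
    }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Add the log prior and the log likelihood of each known word."""
    score = log_prior + sum(log_likelihood.get(w, 0) for w in doc)
    return 1 if score > 0 else 0
```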