Machine Learning

A mix of class notes, online resources and hw assignments from the Machine Learning class at Columbia.

Online Resources

Matrix Differentiation

Defining what matrix differentiation is.
- https://betterexplained.com/articles/vector-calculus-understanding-the-gradient/
Refresher on linear algebra, drivatives and gradients.
- https://atmos.washington.edu/~dennis/MatrixCalculus.pdf
Khan tutorials on multivariable culculus. Jump straight to the Jacobian.
- https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives
Step by step solutions of gradients.
- http://www.cs.huji.ac.il/~csip/tirgul3_derivatives.pdf

Determinant

Tells us the size by which the unit vectors will scale or decrease by i.e. be transformed by.
The determinant tells us the factor by which areas are stretch or shrunk by.
- great than one, then the area is growing.
- less than one, then the area is shrinking.

   i j
  [1 0]
  [0 1]

  [3 1]
  [0 2]

^ has a derminant of 6 i.e. (3*2)-(1*0)
That means 

  [1 0]
  [0 1]

will grow 6 times it's size; the area will be 6 times the original size of the unit vectors.

If the determinant is zero, there is no inverse.

Intergral Calculus

Math for Machine Learning
- http://users.umiacs.umd.edu/~hal/courses/2013S_ML/math4ml.pdf
Calc 3 (good course to catch you up to speed with Calculus)
- https://www.udemy.com/calculus-3
- ^ derivatives, integrals, linear algebra (dot and cross products) differentials (Laplace)

Unsupervised Models

Kmeans

K-means is the simplest and most fundamental clustering algorithm.

Input: x₁, . . . , x_n, where x ∈ R^d.

Treats each x as a vector in R^d

Output: Vector c of cluster assignments, and K mean vectors µ.

c = (c₁, . . . , c_n), c_i ∈ {1, . . . , K}
- If c_i = c_j = k, then x_i and x_j are clustered together in cluster k.
- c is an integer that represents each cluster.
µ = (µ₁, . . . , µ_K), µ_k ∈ R^d (same space as xi)
- Each µ_k (called a centroid) is a set of k d-dimensional vectors that's part of some c.
- µ number of vectors for each cluster.
- µ_k defines the cluster for the k's cluster.

The K-means objective function can be written as

_{source: https://www.saedsayad.com/clustering_kmeans.htm}

Algorithm

We have a double sum, a sum over each data point. For every single data point, x_i, there's an additional sum over each cluster µ_k. Each i^th data point will sum over k clusters. We'll determine the euclidean distance of each i^th data point to the center of the cluster c i.e. centroid.

This objective is not convex; can't find the optimal µ and c. There are many theories that say you can. All we could do is derive an algorithm for a local minimum.

We can’t optimize the K-means objective function exactly by taking derivatives and setting to zero, so we use an iterative algorithm.

Can't do everything at once with a single algorithm. The algorithm will require iteration to modify values of whatever that is you're trying to learn.

Coordinate Descent

Split the parameters into two unknown sets µ and c. You could split them anyway you want. We're going to split them into two sets µ and c clusters. We'll then observe that even though we can't find the optimal value µ and c together, if we fix one, we could find the best value for the other one. And if we fixed c, we'll be able to find the best µ conditioned on c.

We split the variables into two unknown sets µ and c. We can’t find their best values at the same time to minimize L. However, we will see that

Fixing µ we can find the best c exactly.
Fixing c we can find the best µ exactly.

This optimization approach is called coordinate descent: Hold one set of parameters fixed, and optimize the other set. Then switch which set is fixed.

Clusters the data into k groups where k is predefined.
Select k points at random as cluster centers.
Assign closest x_i data points to their closest cluster center according to the Euclidean distance function.
- Given µ, find the best value c_i ∈ {1, . . . , K} for (x₁, . . . , x_i).
Calculate the centroid or mean of all objects in each cluster. Out put the updated µ and c.
- Given c, find the best vector µ_k ∈ R^d for k = 1, . . . , K.
Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

There’s a circular way of thinking about why we need to iterate:

3. Given a particular µ, we may be able to find the best c, but once wechange c we can probably find a better µ.
4. Then find the best µ for the new-and-improved c found in #1, but now that we’ve changed µ, there is probably a better c.

We have to iterate because the values of µ and c depend on each other. This happens very frequently in unsupervised models.

This is not a convex problem. Each time a new µ is initialized, we'll get different results.

Because there are only K options for each c_i , there are no derivatives. Simply calculate all the possible values for c_i and pick the best (smallest) one.

The outline of why this converges is straightforward:

Every update to c_i or µ_k decreases L compared to the previous value.
Therefore, L is monotonically decreasing.
L ≥ 0, so Step 3 converges to some point (but probably not to 0).

When c stops changing, the algorithm has converged to a local optimal solution. This is a result of L not being convex.

Non-convexity means that different initializations will give different results:

Often the results will be similar in quality, but no guarantees.
In practice, the algorithm can be run multiple times with different initializations. Then use the result with the lowest L.

Maximum Likelihood using EM Algorithm

Expectation/Maximization algorithm. Closely related to variational inference

probabilistic objective function
discussed for least squares, linear regression (gradient method) and the bayes classifier. That model is nice, because we could find the respective θ_ML analytically by writing an equation and plugging in data to solve.

Markov Chain

Markov chain is a discrete state space with a stochastic process satisfying the Markov property. ^{Alessandro Molina}

Discrete state space

A sequence of random variables X1, X2, ..., Xn.
A Markov chain has either discrete state space (set of possible values of the random variables) or discrete index set (often representing time).
State space is a countable set S of X_i and cab be anything: letters, numbers, basketball scores or weather conditions.
The state space is a matrix that will be an N x N matrix, such that entry (i, j) is the probability of transitioning from state i to state j.

Stochastic process

The transition matrix must be a stochastic matrix, a matrix whose entries in each row must add up to exactly 1.
Each row of the matrix represents its own probability distribution.

Markov property

The probability of moving to the next state depends only on the present state
The distribution on the next position only depends on the current position.
The knowledge of the previous state is all that is necessary to determine the probability distribution of the current state, satisfying the rule of conditional independence (or said other way: you only need to know the current state to determine the next state).

So, the model is characterized by a state space, a transition matrix describing the probabilities of particular transitions, and an initial state across the state space, given in the initial distribution. ^{Sejal Jaiswal}

_{A simple Markov chain on the random variable, representing the random variable Weather = {Sunny, Rainy, Snowy} and showing the probability of the random variable switching to other states in the next time instance}

_{The transition matrix represents the same information as in the dictionary, but in a more compact way. For this reason, the transition matrix is the standard way of representing Markov chains.}

Markov chains are important mathematical tools that effectively aid the simplification of predicting stochastic processes by viewing the future as independent of the past, given the present state of the process.

Nonnegative Matrix Factorization

NMF is a technique that splits your data up into a set of individual signals and weights to apply to those signals to recreate your original data. It works best when your data is truly a sum of individual signals, but it will still work to some degree when the data isn’t quite like that. There are some caveats when it comes to applying NMF; the data needs to be positive, and laid out in an appropriate matrix, but what is the relationship between the number of observations/columns in the data matrix and the number of individual signals you can decompose the data into?. ^{http://www.billconnelly.net/?p=534}

We use notation and think about the problem slightly differently from PMF

Data X has nonnegative entries. None missing, but likely many zeros.
The learned factorization W and H also have nonnegative entries.
The value X_ij ≈ ∑_kW_ikH_kj, but we won’t write this with vector notation

What Problems can NMF solve?

Text data:
- Word term frequencies
- X_{_ij} contains the number of times word i appears in document j.
Image data:
- Face identification data sets.
- Put each vectorized N×M image of a face on a column of X.
Other discrete grouped data:
- Quantize continuous sets of features using K-means.
- X_ij counts how many times group j uses cluster i.
- For example: group = song, features = d×n spectral information matrix.

Squared error objective algorithm

Multiplicative Update for ‖X−WH‖²
Randomly initialize H and W with nonnegative values.
Iterate the following, first for all values in H, then all in W:

Visual of the Multiplicative Update ‖X−WH‖²
- Use element-wise multiplication/division across three columns below.
- Use matrix multiplication within each outlined box.
Squared error objective in python
```
H = (H * np.dot(W.T, X)) / (np.dot(W.T, W).dot(H) + err)
W = (W * np.dot(X, H.T)) / (np.dot(W, H).dot(H.T) + err)
```

Divergence objective algorithm

Multiplicative Update for D(X‖WH):
Randomly initialize H and W with nonnegative values.
Iterate the following, first for all values in H, then all in W:

Visual of the Multiplicative Update D(X‖WH)
- Visualizing the update for the divergence penalty is more complicated.
- Use the color-coded definition below.
- "Purple" is the data matrix “dot-divided” by the approximation of it.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
gmm		gmm
images		images
kmeans		kmeans
knn		knn
linear_regression		linear_regression
logistic_regression		logistic_regression
markov_chain		markov_chain
matrix_factorization		matrix_factorization
naive_bayes		naive_bayes
nonnegative_matrix_factorization		nonnegative_matrix_factorization
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Machine Learning

Table of Contents

Online Resources

Matrix Differentiation

Determinant

Intergral Calculus

Unsupervised Models

Kmeans

Algorithm

Coordinate Descent

Maximum Likelihood using EM Algorithm

Markov Chain

Discrete state space

Stochastic process

Markov property

Nonnegative Matrix Factorization

What Problems can NMF solve?

Squared error objective algorithm

Visual of the Multiplicative Update ‖X−WH‖²

Squared error objective in python

Divergence objective algorithm

Visual of the Multiplicative Update D(X‖WH)

About

Releases

Packages

Languages

odubno/machine-learning-intro

Folders and files

Latest commit

History

Repository files navigation

Machine Learning

Table of Contents

Online Resources

Matrix Differentiation

Determinant

Intergral Calculus

Unsupervised Models

Kmeans

Algorithm

Coordinate Descent

Maximum Likelihood using EM Algorithm

Markov Chain

Discrete state space

Stochastic process

Markov property

Nonnegative Matrix Factorization

What Problems can NMF solve?

Squared error objective algorithm

Visual of the Multiplicative Update ‖X−WH‖2

Squared error objective in python

Divergence objective algorithm

Visual of the Multiplicative Update D(X‖WH)

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Visual of the Multiplicative Update ‖X−WH‖²

Packages