This page goes through the concepts that will be taught in the Optimization tutorial at the 2017 CBMM Summer School at Woods Hole. It provides more detail than the in-class tutorial in two ways:

  1. It includes R code that implements the examples discussed in the tutorial – this is hidden by default but can be expanded by clicking on the Code buttons on the right
  2. There are sections that provide detail on optimization that will be skipped in the tutorial
# Load the packages needed to make the plots
library(ggplot2)
library(ggrepel)

Introduction

The purpose of this tutorial is to get everyone on the same page with some of the concepts and terminology that underlie optimization. These optimization techniques and concepts are used in many of the tools that will be discussed in this program – including machine learning, neural networks, and probabilistic programming – so it is useful for everyone to understand (1) what optimization is trying to do, (2) how this works at a high level, and (3) the terminology used for optimization (since many of these terms will be thrown around elsewhere in this program). This is not meant to be a tutorial about the mathematics behind optimization, and while there will be pointers to how to implement these techniques in various common programming languages (R, Python, and Matlab), implementation will not be the focus. Instead, it is important to understand the what and why of optimization to understand how some of the techniques discussed later in the course will be used.

The important terms to know include:

  • Likelihood
  • Maximum likelihood estimate
  • Cost function
  • Gradient
  • Gradient descent
  • Global / local minima
  • Convex / non-convex functions
  • Differentiable functions
  • Stochastic gradient descent
  • Regularization
  • Sparse coding
  • Momentum

The main focus of optimization in this tutorial will be parameter fitting: given a model of known form with unknown parameter values, how should you choose those values based on the data you have observed? There are other types of optimization that use related but different techniques, but those problems will not be discussed in detail so that we can focus on the basic concepts of optimization for parameter fitting.

However, before we can use optimization for parameter fitting, we have to know what we are trying to optimize. For this, we will start with the likelihood function.

Likelihood & cost functions

The definition of likelihood

Likelihood: The probability of observing your data given a particular model and setting of its parameters

Assuming that you have a set of data \(D=\{d_1,d_2,...,d_N\}\) and a model with parameters \(\theta\), the likelihood \(L(\theta|D)\) is written as:

\[L(\theta|D) = P(D|\theta)\]

If your data is independent and identically distributed (I.I.D.), then you can treat the likelihood as the product of the probabilities of observing each individual data point:

\[L(\theta|D) = \prod_{i=1}^{N}P(d_i|\theta)\]

This may be very abstract, so let’s try an example to make this more concrete.

An example: balls in urns

Suppose you have an urn that you know contains only black and white balls, but you don’t know anything at all about the proportion of each. You pull a number of balls with replacement and get the sequence BBWBWWBWWWBW.

The first thing we can do is assume that we know the proportion of black balls, and ask ourselves how likely it would have been to draw that sequence if we were right about the proportion. We will call the proportion of black balls \(\theta\) in this case.

Because we know that \(P(B) = \theta\) and \(P(W) = (1-\theta)\) and that each draw is I.I.D., we can calculate the probability of getting that sequence as the product of the probabilities of each of the individual draws:

\[L(\theta|D)=\prod_{i=1}^{N}\begin{cases}\theta & \text{if } d_i=B \\ 1-\theta & \text{if } d_i=W\end{cases}\]
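Since every draw contributes a factor of either \(\theta\) or \(1-\theta\), this product depends only on how many balls of each color were drawn. The sequence above contains 5 black and 7 white balls, so the likelihood collapses to:

\[L(\theta|D) = \theta^{5}(1-\theta)^{7}\]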

Now we can use that equation to calculate the probability of getting that specific sequence if the balls were equally distributed (\(\theta = 0.5\)):

# The observed sequence of draws from the urn
urn.data = c("B", "B", "W", "B", "W", "W", "B", "W", "W", "W", "B", "W")

# Probability of each individual draw assuming theta = 0.5
individual.probabilities = ifelse(urn.data=="B", 0.5, 1-0.5)

# The likelihood of the full sequence is the product of the individual probabilities
total.probability = prod(individual.probabilities)

writeLines(paste('Probability of sequence with equal distribution:',total.probability,sep=' '))
## Probability of sequence with equal distribution: 0.000244140625

And similarly if black balls made up 75% of the urn:

# Probability of each individual draw assuming theta = 0.75
individual.probabilities = ifelse(urn.data=="B", 0.75, 1-0.75)

# Multiply the individual probabilities to get the likelihood of the sequence
total.probability = prod(individual.probabilities)

writeLines(paste('Probability of sequence with 75% black ball distribution:',total.probability,sep=' '))
## Probability of sequence with 75% black ball distribution: 1.44839286804199e-05
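As a quick check, both of these numbers match the closed-form expression \(\theta^{5}(1-\theta)^{7}\) derived above:

# Compare against the closed-form likelihood theta^5 * (1-theta)^7
writeLines(paste('Closed form with theta = 0.5:', 0.5^5 * (1-0.5)^7, sep=' '))
## Closed form with theta = 0.5: 0.000244140625

writeLines(paste('Closed form with theta = 0.75:', 0.75^5 * (1-0.75)^7, sep=' '))
## Closed form with theta = 0.75: 1.44839286804199e-05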

Of course, we can easily abstract this into a function of the data and \(\theta\), and graph the likelihood as a function of any value of the proportion of black balls from 0 to 1:

urn.likelihood = function(theta, data = urn.data) {

  # Probability of each individual draw given theta
  individual.probabilities = ifelse(data=="B", theta, 1-theta)
  
  # The likelihood is the product of those probabilities
  total.probability = prod(individual.probabilities)
  
  return(total.probability)
}
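As a quick usage check, calling this function with \(\theta = 0.5\) reproduces the value we computed above:

writeLines(paste('Likelihood at theta = 0.5:', urn.likelihood(0.5), sep=' '))
## Likelihood at theta = 0.5: 0.000244140625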

# Evaluate the likelihood at a grid of theta values between 0 and 1
thetas = seq(.01, .99, by=.01)

# sapply is used because urn.likelihood is not vectorized over theta
likelihoods = sapply(thetas, urn.likelihood)

# Plot the likelihood as a function of theta
qplot(x = thetas, y = likelihoods, geom = 'line', xlab = expression(theta), ylab = "Likelihood")
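The curve peaks just below \(\theta = 0.5\). As a rough check (accurate only to the grid spacing of 0.01), we can read off the grid value with the highest likelihood, which sits near the observed proportion of black draws, \(5/12 \approx 0.417\):

# Find the grid value of theta with the highest likelihood
writeLines(paste('Grid maximum at theta =', thetas[which.max(likelihoods)], sep=' '))
## Grid maximum at theta = 0.42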