import numpy as np
ML terminology
See class notes for diagrams and descriptions.
- Training sample: The data you use to train the network.
- Validation sample: An independent data sample you can test the network with.
- Overtraining: Learning details of the training sample instead of the underlying distribution.
- Weights: The parameters you fit to.
- Layers: A way to group parts of the model definition into stages. Named for the (usually linear) function that produces them:
- Fully Connected linear layer (FC): A simple matrix of weights
- Convolutional layer: A layer that relates nearby data structures
- Recurrent layer: A layer that has some "memory" - often used for variable length data.
- Hidden layer: An "in-between" state not externally visible.
- Activation function: A non-linear function applied after a layer (see the sketch after this list).
- ReLu: Rectified linear unit function
- Sigmoid: Maps $(-\infty,\infty) \rightarrow (0,1)$
- Network: The collection of weights in layers and activation functions. Also called a model.
- Loss function: A function that gives you a "score" for how poorly you did.
- Cost function: Sum of the loss function over your training sample.
- Batch: A smaller subdivision of the training sample, used for one evaluation/update step.
- Epoch: An iteration over the whole training sample (one "step")
- Backpropagation: Taking the derivative of the network (the cost with respect to the weights) by working backward through the layers.
- Learning rate: How far to move the weights based on the gradient after each epoch.
- Neuron: The combination of a weight and an activation function.
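To make a couple of these terms concrete, here is a minimal numpy sketch (an illustration of my own, not part of the class notes; the function names are just for this example) of the two activation functions above and a per-element squared-error loss.

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(x, 0)

def sigmoid(x):
    # Maps (-inf, inf) -> (0, 1).
    return 1 / (1 + np.exp(-x))

def squared_loss(y_pred, y_true):
    # Loss per element; summing over the whole training sample gives the cost.
    return (y_pred - y_true)**2

values = np.linspace(-2, 2, 5)          # [-2, -1, 0, 1, 2]
print(relu(values))                     # [0. 0. 0. 1. 2.]
print(sigmoid(values))                  # strictly between 0 and 1
print(squared_loss(relu(values), 1.0))  # one "score" per element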
Network terminology
- Deep learning: Large neural networks with hidden layers
- CNNs: Convolutional Neural Nets: Look for spatial structures, like in images (see the sketch after this list)
- RNNs: Recurrent Neural Nets: Have some form of memory (Recursive NN are one form of RNN, among others)
- GANs: Generative Adversarial Nets: can run in reverse to generate possible inputs.
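As a rough illustration of "relates nearby data structures": the sketch below (my own, with a made-up conv1d helper, not a library call) applies one small shared kernel to every neighbourhood of a 1D input, which is the core of what a convolutional layer does before its activation function.

def conv1d(signal, kernel):
    # "Valid" 1D convolution (really cross-correlation, as in most ML libraries):
    # slide the kernel across the signal and take a dot product at each position.
    n, k = len(signal), len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(n - k + 1)])

signal = np.arange(8, dtype=float)    # a small 1D "signal"
kernel = np.array([1.0, 0.0, -1.0])   # one 3-value kernel shared across all positions
print(conv1d(signal, kernel))         # each output depends only on 3 nearby inputs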
Simple network
See the excellent tutorials here: PyTorch with examples, which has examples using Numpy, Torch, and TensorFlow.
Let's start with a batch of random numbers - this will be our "data". We will do the following:
data (x): N x 1000  ->  fully connected linear layer (1000 x 100)  ->  ReLu  ->  fully connected linear layer (100 x 10)  ->  result (y): N x 10
In words: we have N samples of data with D_in = 1000 values each. We convert that to H = 100 values, use a ReLu activation function, then convert that to D_out = 10 values and compare with the expected result.
Since this is all random, we can train on small samples because we have so many parameters.
N = 64 # batch size
D_in = 1_000 # input dimension
H = 100 # hidden dimension
D_out = 10 # output dimension
epochs = 500 # How many iterations to run
learning_rate = 1e-6 # How much to move each iteration
Randomly initializing weights is important in some cases, though in this one it is a bit harder to show (IMO).
np.random.seed(42)
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
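Aside (my addition, not in the class notes): one concrete failure mode is all-zero initialization. Reusing the x, y, and dimensions defined above, the hidden layer then comes out all zeros, so every gradient below is zero and the weights would never move.

# Illustration only: all-zero initialization gets stuck immediately.
w1_zero = np.zeros((D_in, H))
w2_zero = np.zeros((H, D_out))

h_zero = np.maximum(x @ w1_zero, 0)                     # all zeros
grad_w2_zero = h_zero.T @ (2 * (h_zero @ w2_zero - y))  # all zeros, because h_zero is zero
print(np.abs(grad_w2_zero).max())                       # 0.0 -> w2 never updates
# The gradient for w1 has to flow back through w2_zero (all zeros),
# so it is zero as well: neither weight matrix ever leaves zero.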
Now, let's loop and do the calculations.
The derivatives are easy - just look at our definitions:
$$ b = x \cdot w_1 \tag{1} $$

$$ h = \mathrm{ReLu}(b) \tag{2} $$

$$ \hat{y} = h \cdot w_2 \tag{3} $$

$$ L(\hat{y}, y) = \left(\hat{y} - y\right)^2 \tag{4} $$

The derivatives:
$$ \frac{dL}{d\hat{y}} = 2 \left(\hat{y} - y\right) $$

$$ \frac{dL}{dw_2} = h^T \cdot \frac{dL}{d\hat{y}} $$

$$ \frac{dL}{dh} = \frac{dL}{d\hat{y}} \cdot w_2^T $$

$$ \frac{dL}{db} = \frac{dL}{dh} \odot \left[b > 0\right] $$

$$ \frac{dL}{dw_1} = x^T \cdot \frac{dL}{db} $$

Here $\left[b > 0\right]$ is 1 where the ReLu was active and 0 elsewhere: the ReLu passes the gradient through only for positive inputs.

for t in range(epochs):
    # Forward pass: compute predicted y
    h = x @ w1                         # First FC layer (1)
    h_relu = np.maximum(h, 0)          # Activation function (2)
    y_pred = h_relu @ w2               # Second FC layer (3)

    # Compute and print cost
    cost = np.sum((y_pred - y)**2)     # Sum of loss (4)
    print(t, cost)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)   # back (4)
    grad_w2 = h_relu.T @ grad_y_pred   # dL/dw2
    grad_h_relu = grad_y_pred @ w2.T   # back (3)
    grad_h = grad_h_relu.copy()        # back (2): only pass the gradient
    grad_h[h < 0] = 0                  #   where the ReLu was active
    grad_w1 = x.T @ grad_h             # dL/dw1

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
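As a sanity check on the hand-written backpropagation (my addition, not part of the notes), you can compare one analytic gradient entry against a central finite-difference estimate of the cost; the two numbers should agree closely, since the cost is quadratic in each entry of w2.

# Finite-difference check of dL/dw2[0, 0] at the final weights (illustration only).
eps = 1e-4
h_relu = np.maximum(x @ w1, 0)
analytic = (h_relu.T @ (2.0 * (h_relu @ w2 - y)))[0, 0]

w2_plus = w2.copy()
w2_plus[0, 0] += eps
w2_minus = w2.copy()
w2_minus[0, 0] -= eps
numeric = (np.sum((h_relu @ w2_plus - y)**2) - np.sum((h_relu @ w2_minus - y)**2)) / (2 * eps)

print(analytic, numeric)   # the two estimates should match to several digits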