import numpy as np
ML terminology
See class notes for diagrams and descriptions.
- Training sample: The data you use to train the network.
- Validation sample: An independent data sample you can test the network with.
- Overtraining: Learning details of the training sample instead of the underlying distribution.
- Weights: The parameters you fit to.
- Layers: A way to group parts of the model definition into stages. Named for the (usually linear) function that produces them:
- Fully Connected linear layer (FC): A simple matrix of weights
- Convolutional layer: A layer that relates nearby data structures
- Recurrent layer: A layer that has some "memory" - often used for variable length data.
- Hidden layer: An "in-between" state not externally visible.
- Activation function: A non-linear function applied after a layer (see the sketch after this list).
- ReLu: Rectified linear unit function
- Sigmoid: Maps $(-\infty,\infty) \rightarrow (0,1)$
- Network: The collection of weights in layers and activation functions. Also called a model.
- Loss function: A function that gives you a "score" for how poorly you did.
- Cost function: Sum of the loss function over your training sample.
- Batch: A smaller subdivision of the training sample, used for one evaluation/update step.
- Epoch: An iteration over the whole training sample (one "step")
- Backpropagation: Taking the derivative of the network (the cost with respect to the weights) by working backward through the layers.
- Learning rate: How far to move the weights based on the gradient after each epoch.
- Neuron: The combination of a weight and an activation function.
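To make a couple of these terms concrete, here is a minimal numpy sketch (an illustration of my own, not part of the class notes; the function names are just for this example) of the two activation functions above and a per-element squared-error loss.

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(x, 0)

def sigmoid(x):
    # Maps (-inf, inf) -> (0, 1).
    return 1 / (1 + np.exp(-x))

def squared_loss(y_pred, y_true):
    # Loss per element; summing over the whole training sample gives the cost.
    return (y_pred - y_true)**2

values = np.linspace(-2, 2, 5)          # [-2, -1, 0, 1, 2]
print(relu(values))                     # [0. 0. 0. 1. 2.]
print(sigmoid(values))                  # strictly between 0 and 1
print(squared_loss(relu(values), 1.0))  # one "score" per element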
Network terminology
- Deep learning: Large neural networks with hidden layers
- CNNs: Convolutional Neural Nets: Look for spatial structures, like in images (see the sketch after this list)
- RNNs: Recurrent Neural Nets: Have some form of memory (Recursive NN are one form of RNN, among others)
- GANs: Generative Adversarial Nets: can run in reverse to generate possible inputs.
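As a rough illustration of "relates nearby data structures": the sketch below (my own, with a made-up conv1d helper, not a library call) applies one small shared kernel to every neighbourhood of a 1D input, which is the core of what a convolutional layer does before its activation function.

def conv1d(signal, kernel):
    # "Valid" 1D convolution (really cross-correlation, as in most ML libraries):
    # slide the kernel across the signal and take a dot product at each position.
    n, k = len(signal), len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(n - k + 1)])

signal = np.arange(8, dtype=float)    # a small 1D "signal"
kernel = np.array([1.0, 0.0, -1.0])   # one 3-value kernel shared across all positions
print(conv1d(signal, kernel))         # each output depends only on 3 nearby inputs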
Simple network
See the excellent tutorials here: PyTorch with examples, which has examples using Numpy, Torch, and TensorFlow.
Let's start with a batch of random numbers - this will be our "data". We will do the following:
data (x): N x 1000  ->  fully connected linear layer (1000 x 100)  ->  ReLu  ->  fully connected linear layer (100 x 10)  ->  result (y): N x 10
In words: we have N samples of data with D_in = 1000 values each. We convert that to H = 100 values, use a ReLu activation function, then convert that to D_out = 10 values and compare with the expected result.
Since this is all random, we can train on small samples because we have so many parameters.
N = 64 # batch size
D_in = 1_000 # input dimension
H = 100 # hidden dimension
D_out = 10 # output dimension
epochs = 500 # How many iterations to run
learning_rate = 1e-6 # How much to move each iteration
Randomly initializing weights is important in some cases, though in this one it is a bit harder to show (IMO).
np.random.seed(42)
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
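Aside (my addition, not in the class notes): one concrete failure mode is all-zero initialization. Reusing the x, y, and dimensions defined above, the hidden layer then comes out all zeros, so every gradient below is zero and the weights would never move.

# Illustration only: all-zero initialization gets stuck immediately.
w1_zero = np.zeros((D_in, H))
w2_zero = np.zeros((H, D_out))

h_zero = np.maximum(x @ w1_zero, 0)                     # all zeros
grad_w2_zero = h_zero.T @ (2 * (h_zero @ w2_zero - y))  # all zeros, because h_zero is zero
print(np.abs(grad_w2_zero).max())                       # 0.0 -> w2 never updates
# The gradient for w1 has to flow back through w2_zero (all zeros),
# so it is zero as well: neither weight matrix ever leaves zero.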
Now, let's loop and do the calculations.
The derivatives are easy - just look at our definitions:
$$ b = x \cdot w_1 \tag{1} $$

$$ h = \mathrm{ReLu}(b) \tag{2} $$

$$ \hat{y} = h \cdot w_2 \tag{3} $$

$$ L(\hat{y}, y) = \left(\hat{y} - y\right)^2 \tag{4} $$

The derivatives:
$$ \frac{dL}{d\hat{y}} = 2 \left(\hat{y} - y\right) $$

$$ \frac{dL}{dw_2} = h^T \cdot \frac{dL}{d\hat{y}} $$

$$ \frac{dL}{dh} = \frac{dL}{d\hat{y}} \cdot w_2^T $$

$$ \frac{dL}{db} = \frac{dL}{dh} \odot \left[b > 0\right] $$

$$ \frac{dL}{dw_1} = x^T \cdot \frac{dL}{db} $$

Here $\left[b > 0\right]$ is 1 where the ReLu was active and 0 elsewhere: the ReLu passes the gradient through only for positive inputs.

for t in range(epochs):
    # Forward pass: compute predicted y
    h = x @ w1                         # First FC layer (1)
    h_relu = np.maximum(h, 0)          # Activation function (2)
    y_pred = h_relu @ w2               # Second FC layer (3)

    # Compute and print cost
    cost = np.sum((y_pred - y)**2)     # Sum of loss (4)
    print(t, cost)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)   # back (4)
    grad_w2 = h_relu.T @ grad_y_pred   # dL/dw2
    grad_h_relu = grad_y_pred @ w2.T   # back (3)
    grad_h = grad_h_relu.copy()        # back (2): only pass the gradient
    grad_h[h < 0] = 0                  #   where the ReLu was active
    grad_w1 = x.T @ grad_h             # dL/dw1

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
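As a sanity check on the hand-written backpropagation (my addition, not part of the notes), you can compare one analytic gradient entry against a central finite-difference estimate of the cost; the two numbers should agree closely, since the cost is quadratic in each entry of w2.

# Finite-difference check of dL/dw2[0, 0] at the final weights (illustration only).
eps = 1e-4
h_relu = np.maximum(x @ w1, 0)
analytic = (h_relu.T @ (2.0 * (h_relu @ w2 - y)))[0, 0]

w2_plus = w2.copy()
w2_plus[0, 0] += eps
w2_minus = w2.copy()
w2_minus[0, 0] -= eps
numeric = (np.sum((h_relu @ w2_plus - y)**2) - np.sum((h_relu @ w2_minus - y)**2)) / (2 * eps)

print(analytic, numeric)   # the two estimates should match to several digits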