```
import numpy as np
```

## ML terminology

See class notes for diagrams and descriptions.

* **Training sample**: The data you use to train the network.
* **Validation sample**: An independent data sample you can test the network with.
* **Overtraining**: Learning details of the training sample instead of the underlying distribution.
* **Weights**: The parameters you fit to.
* **Layers**: A way to group the distribution definition. Named for the (usually linear) function that produces them:
  * *Fully Connected linear layer (FC)*: A simple matrix of weights
  * *Convolutional layer*: A layer that relates nearby data structures
  * *Recurrent layer*: A layer that has some "memory" - often used for variable length data.

* **Hidden layer**: An "in-between" state not externally visible.
* **Activation function**: A non-linear function that can be applied after a layer.
  * *ReLu*: Rectified linear unit function
  * *Sigmoid*: Maps $(-\infty,\infty) \rightarrow (0,1)$

* **Network**: The collection of weights in layers and activation functions. Also called a model.
* **Loss function**: A function that gives you a "score" for how poorly you did.
* **Cost function**: Sum of the loss function over your training sample.
* **Batch**: A smaller division used for evaluating data
* **Epoch**: An iteration over the whole training sample (one "step")
* **Backpropagation**: Taking the derivative of the network
* **Learning rate**: How far to move the weights based on the gradient after each epoch.
* **Neuron**: The combination of a weight and an activation function.
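As a rough illustration of a few of the terms above, here is a minimal NumPy sketch of the two activation functions and of the loss/cost distinction. The function names are made up for this example, and squared error is assumed as the loss because it is the one used in the simple network later in these notes.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(z, 0)


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def squared_error_loss(y_pred, y):
    """Loss: a per-element score for how far a prediction is from its target."""
    return (y_pred - y) ** 2


def cost(y_pred, y):
    """Cost: the loss summed over the whole sample."""
    return np.sum(squared_error_loss(y_pred, y))


z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # values strictly between 0 and 1
print(cost(np.array([1.0, 2.0]), np.array([0.0, 4.0])))  # 5.0
```

Both activation functions act element by element, so they can be applied to the output of a layer of any shape.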

## Network terminology

* **Deep learning**: Large neural networks with hidden layers
* **CNNs**: *Convolutional Neural Nets*: Look for spatial structures, like in images
* **RNNs**: *Recurrent Neural Nets*: Have some form of memory (Recursive NN are one form of RNN, among others)
* **GANs**: *Generative Adversarial Nets*: Can run in reverse to generate possible inputs.

## Simple network

See the excellent tutorial "PyTorch with examples", which walks through examples using NumPy, Torch, and TensorFlow.

Let's start with a batch of random numbers - this will be our "data". We will do the following:

```
N x 1,000      1,000 x 100            100 x 10       N x 10
data (x)   ->              ->  ReLu  ->          ->  result (y)
               fully connected linear layers
```

In words: we have N samples of data with `D_in = 1000` values each. We convert that to `H = 100` values, use a ReLu activation function, then convert that to `D_out = 10` values and compare with the expected result.
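To make the shapes in that description concrete, here is a small sketch that pushes random data through the two layers and prints the shape at each stage. The value `N = 64` and the use of `np.random.default_rng` are choices made for this illustration only.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10   # N = 64 is an arbitrary sample count

rng = np.random.default_rng(0)
x = rng.standard_normal((N, D_in))      # N samples, D_in values each
w1 = rng.standard_normal((D_in, H))     # first fully connected layer
w2 = rng.standard_normal((H, D_out))    # second fully connected layer

hidden = np.maximum(x @ w1, 0)          # (N, D_in) @ (D_in, H) -> (N, H), then ReLu
y_pred = hidden @ w2                    # (N, H) @ (H, D_out) -> (N, D_out)

print(x.shape, hidden.shape, y_pred.shape)  # (64, 1000) (64, 100) (64, 10)
```

Each matrix product contracts the shared inner dimension, so the sample axis `N` is carried through unchanged from input to output.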

Since this is all random, there is no underlying structure to learn. We can still train on a small sample, because the network has far more parameters than the sample has target values.

```
N = 64 # batch size
D_in = 1_000 # input dimension
H = 100 # hidden dimension
D_out = 10 # output dimension
epochs = 500 # How many iterations to run
learning_rate = 1e-6 # How much to move each iteration
```
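A quick count gives a feel for how many parameters this network has compared with the number of target values it has to reproduce. This is only a back-of-the-envelope sketch; the variable names `n_weights` and `n_targets` are introduced here for illustration, and the count assumes no bias terms, as in the layers described above.

```python
N, D_in, H, D_out = 64, 1_000, 100, 10

n_weights = D_in * H + H * D_out   # entries in the two weight matrices
n_targets = N * D_out              # numbers in y the network must match

print(n_weights, n_targets)        # 101000 640
```

With roughly 150 weights per target value, the network has plenty of freedom to fit even pure noise.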

Randomly initializing weights is important in some cases, though in this one it is a bit harder to show (IMO).

```
np.random.seed(42)
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
```
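One case where random initialization clearly matters can be sketched directly: if both weight matrices start at zero, every gradient in this particular network is exactly zero, so gradient descent never moves. The snippet below is an illustration under that assumption (all-zero starting weights) and is not part of the training run that follows.

```python
import numpy as np

np.random.seed(42)
N, D_in, H, D_out = 64, 1_000, 100, 10
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Assumption for this demo: start from all-zero weights instead of random ones
w1 = np.zeros((D_in, H))
w2 = np.zeros((H, D_out))

# Forward pass: everything after the first layer is zero
h = x @ w1
h_relu = np.maximum(h, 0)
y_pred = h_relu @ w2

# Backward pass
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T @ grad_y_pred     # zero because h_relu is zero
grad_h_relu = grad_y_pred @ w2.T     # zero because w2 is zero
grad_h = grad_h_relu * (h > 0)       # ReLu passes gradient only where h > 0
grad_w1 = x.T @ grad_h

print(np.abs(grad_w1).max(), np.abs(grad_w2).max())  # 0.0 0.0
```

Random starting values avoid this dead start, because the hidden activations and `w2` are then non-zero and the gradients carry information.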

Now, let's loop and do the calculations.

The derivatives are easy - just look at our definitions:

$$ b = x \cdot w_1 \tag{1} $$

$$ h = \mathrm{ReLu}(b) \tag{2} $$

$$ \hat{y} = h \cdot w_2 \tag{3} $$

$$ L(\hat{y}, y) = \left(\hat{y} - y\right)^2 \tag{4} $$

The derivatives:

$$ \frac{dL}{d\hat{y}} = 2 \left(\hat{y} - y\right) $$

$$ \frac{d L}{d w_2} = h^T \cdot \frac{dL}{d\hat{y}} $$

$$ \frac{d L}{d h} = \frac{dL}{d\hat{y}} \cdot w_2^T $$

$$ \frac{d L}{d b} = \frac{d L}{d h} \odot \mathbb{1}\left[b > 0\right] $$

$$ \frac{d L}{d w_1} = x^T \cdot \frac{d L}{d b} $$

Here $\odot$ is the element-wise product and $\mathbb{1}[b > 0]$ is 1 where $b$ is positive and 0 elsewhere: ReLu passes the gradient through unchanged where its input was positive and blocks it everywhere else. Note that this is a mask on $b$, not ReLu applied to the gradient itself.

```
for t in range(epochs):
    # Forward pass: compute predicted y
    h = x @ w1                          # First FC layer (1); h here is b in the equations
    h_relu = np.maximum(h, 0)           # Activation function (2)
    y_pred = h_relu @ w2                # Second FC layer (3)

    # Compute and print cost
    cost = np.sum((y_pred - y) ** 2)    # Sum of loss (4)
    print(t, cost)

    # Backprop to compute gradients of the cost with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)    # back (4)
    grad_w2 = h_relu.T @ grad_y_pred    # dL/dw2
    grad_h_relu = grad_y_pred @ w2.T    # back (3)
    grad_h = grad_h_relu * (h > 0)      # back (2): keep gradient only where h > 0
    grad_w1 = x.T @ grad_h              # dL/dw1

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```
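A common way to sanity-check hand-written backpropagation is to compare it with a finite-difference estimate of the same derivative. The sketch below does that on a deliberately tiny network so it runs instantly. The helper `cost_and_grads`, the small shapes, and the step size `eps` are all assumptions made for this example, and the backward pass uses the ReLu rule that the gradient passes only where the pre-activation is positive.

```python
import numpy as np


def cost_and_grads(x, y, w1, w2):
    """Forward and backward pass for the two-layer ReLu network."""
    h = x @ w1
    h_relu = np.maximum(h, 0)
    y_pred = h_relu @ w2
    cost = np.sum((y_pred - y) ** 2)

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T @ grad_y_pred
    grad_h = (grad_y_pred @ w2.T) * (h > 0)   # gradient passes only where h > 0
    grad_w1 = x.T @ grad_h
    return cost, grad_w1, grad_w2


# Tiny hypothetical shapes: 5 samples, 4 inputs, 3 hidden units, 2 outputs
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
y = rng.standard_normal((5, 2))
w1 = rng.standard_normal((4, 3))
w2 = rng.standard_normal((3, 2))

_, grad_w1, grad_w2 = cost_and_grads(x, y, w1, w2)

# Central finite differences: nudge one weight at a time and watch the cost
eps = 1e-6
numeric_grad_w1 = np.zeros_like(w1)
for i in range(w1.shape[0]):
    for j in range(w1.shape[1]):
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        cost_plus = cost_and_grads(x, y, w_plus, w2)[0]
        cost_minus = cost_and_grads(x, y, w_minus, w2)[0]
        numeric_grad_w1[i, j] = (cost_plus - cost_minus) / (2 * eps)

print(np.allclose(numeric_grad_w1, grad_w1, rtol=1e-4, atol=1e-6))
```

If the two gradients disagree, the backward pass most likely has an error in one of its steps; the same loop can be repeated over `w2` to check `grad_w2`.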