Building the foundations of Deep Learning from scratch

From matrix multiplication to backpropagation

Software systems for deep learning are built upon several layers of abstraction. At its core, a deep learning system is nothing but optimized matrix multiplications for the forward pass and reverse-mode automatic differentiation for the backward pass. Here we shall rebuild these foundations of deep learning.

A Python Primer

First, let’s arm ourselves with all the Python tools we might need in this endeavour. Follow along by running this notebook on Google Colab.

We start with the Python Data Model. Python uses objects to abstract data and behaviour. Python classes can be given special behaviour by implementing the data model methods such as __init__. We can overload operators in Python using these special __dunder__ methods, each with a pre-defined name. For example, defining an __add__ method in our class tells Python how to add two objects of the class whenever it encounters a +.

Such special methods are also invoked by special syntax such as subscription (arr[i]), or in particular contexts such as initialising a new object (__init__) or calling len on a container object such as a list.

Here I have used the data model methods to design an abstraction for polynomials. The __str__ and __repr__ methods make a polynomial's representations match how we would actually write and construct one. The __len__ method gives the nearest interpretation of the length of a polynomial, i.e. its degree. The __call__ method lets us use our Polynomial objects like real polynomial functions $f(x)$.

class Polynomial:
    '''An abstraction of polynomials in the Python data model'''
            
    def __init__(self, *coefficients):
        self.coeff = [c for c in coefficients]
    
    def __call__(self, x):
        return sum([c*(x**i) for i, c in enumerate(self.coeff)])
    
    def __len__(self):
        '''Degree of the polynomial'''
        return len(self.coeff) - 1
    
    def __getitem__(self, i):
        '''Get the coefficients'''
        return self.coeff[i]
    
    def __str__(self):
        return ' + '.join([f'{c}x^{i}'
                           for i,c in enumerate(self.coeff)])
    
    def __repr__(self):
        return f"Polynomial({', '.join(str(c) for c in self.coeff)})"

In the notebook, I also show how to set up the with functionality of context managers in Python. Then we look at decorators, which extend the behaviour of an object without modifying its structure.
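For reference, here is a minimal sketch of both ideas (not the notebook's exact code): a context manager built from the __enter__ and __exit__ data model methods, and a decorator that wraps a function to log its calls.

import time

class Timer:
    '''Context manager: code inside the with block is timed'''
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        print(f"elapsed: {time.perf_counter() - self.start:.4f}s")

def logged(func):
    '''Decorator: extends func's behaviour without modifying it'''
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@logged
def square(x):
    return x * x

with Timer():
    square(3)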

We also see how to use callbacks in Python. In the process, I demonstrate the usage of lambda functions, inner functions and partial functions. However, these are not necessary for this blog post.

Quick Maths with Matrices!

Now let’s start with the main course of the meal. Matrix operations are at the heart of modern deep learning systems. If machine learning is just Statistics, then deep learning is just matrix multiplications.

You should run this notebook on Google Colab to go through this section. In the notebook, we incrementally optimize matrix multiplication, starting with nested loops in standard Python and then using array slicing and broadcasting to push it further.
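The naive starting point is three nested Python loops, roughly like this (a sketch, assuming A and B are 2-D torch tensors):

import torch

def matmul_naive(A, B):
    A_rows, A_cols = A.shape
    B_rows, B_cols = B.shape
    assert A_cols == B_rows
    C = torch.zeros([A_rows, B_cols])
    # one scalar multiply-accumulate at a time, painfully slow in pure Python
    for i in range(A_rows):
        for j in range(B_cols):
            for k in range(A_cols):
                C[i, j] += A[i, k] * B[k, j]
    return C

The broadcast version replaces the two inner loops with a single vectorized operation per row: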

def matmul(A, B):
    A_rows, A_cols = A.shape
    B_rows, B_cols = B.shape
    assert A_cols == B_rows, \
        f"Inner dimensions must match: {A_cols} not equal to {B_rows}"
    C = torch.zeros([A_rows, B_cols])
    for i in range(A_rows):
        # broadcast row i of A against the rows of B.t() (the columns of B)
        # and reduce, filling an entire row of C in one vectorized step
        C[i,:] += (A[i,:]*B.t()).sum(1)
    return C

We demonstrate the expressivity and efficiency of the Einstein summation notation in PyTorch.

C = torch.einsum("ik,kj->ij", A, B)

Finally, we test all these implementations against the standard PyTorch implementation. On running the notebook, you might be surprised to find that matrix multiplication in PyTorch is faster by three orders of magnitude compared to standard Python!
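One way to verify and time the implementations yourself in the notebook (a sketch; the notebook has its own test and timing cells):

A = torch.randn(64, 128)
B = torch.randn(128, 32)

# all implementations should agree with PyTorch's built-in matmul
assert torch.allclose(matmul(A, B), A @ B, atol=1e-5)
assert torch.allclose(torch.einsum("ik,kj->ij", A, B), A @ B, atol=1e-5)

%timeit matmul(A, B)   # broadcasting version
%timeit A @ B          # PyTorch's optimized kernel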

The Forward Pass

Now, we implement the forward pass through a neural network. Follow along by running this notebook on Google Colab.

There are some inherent problems in training neural networks. They are discussed in the paper Understanding the difficulty of training deep feedforward neural networks by Xavier Glorot and Yoshua Bengio (AISTATS 2010). The two problems during the forward pass are:

  1. Exploding Activations
  2. Vanishing Activations

In the notebook, we step through the matrix multiplications of each layer to get the activations. We see for ourselves how the standard deviations and means of activations increase drastically. That is the problem of exploding activations. The solution is using Xavier weight initialization.

Weights at each layer $W_{ij}$ are initialized as: $$W_{ij} \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$$ where $n$ is the size of the previous layer and $U$ is the uniform distribution.

In the notebook, we observe that the means and standard deviations of activations remain roughly close to 0 and 1 respectively when we use Xavier initialization.
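The effect is easy to reproduce in a few lines. Here is a minimal sketch with hypothetical layer sizes, using the normal-distribution variant of Xavier initialization (weights drawn with standard deviation $1/\sqrt{n}$):

import math
import torch

x = torch.randn(512, 100)    # a batch of 512 inputs of size 100

def deep_forward(x, init_scale, n_layers=10, n=100):
    a = x
    for _ in range(n_layers):
        a = a @ (torch.randn(n, n) * init_scale)
    return a

naive  = deep_forward(x, init_scale=1.0)                  # unscaled random weights
xavier = deep_forward(x, init_scale=1 / math.sqrt(100))   # Xavier-style scaling

print(naive.mean().item(),  naive.std().item())    # blows up (std grows ~10x per layer)
print(xavier.mean().item(), xavier.std().item())   # stays close to 0 and 1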

The problem of vanishing activations arises when we use the ReLU activation function after each layer. In the notebook, we step through the activations of each layer again, but this time we apply ReLU to them. Because ReLU zeroes out everything below zero, the variance of the activations is roughly halved after each layer. If this continues, our activations will soon approach zero and vanish. The solution is to use Kaiming initialization, a slight modification of Xavier initialization.

The weights $w_{l}$ of each layer are initialized from a Gaussian with a standard deviation of $\sqrt{2/n_{l}}$, where $n_l$ is the fan-in (input size) of layer $l$. $$w_{l} \sim \mathcal{N}\left(0, 2/n_{l}\right)$$

The notebook demonstrates that the standard deviations behave much better when using Kaiming initialization.
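A similar sketch shows the shrinkage with ReLU and how Kaiming's extra factor of $\sqrt{2}$ compensates for it (reusing x and the imports from the sketch above):

def deep_forward_relu(x, init_scale, n_layers=10, n=100):
    a = x
    for _ in range(n_layers):
        a = (a @ (torch.randn(n, n) * init_scale)).clamp(min=0)   # ReLU
    return a

xavier_relu  = deep_forward_relu(x, init_scale=1 / math.sqrt(100))
kaiming_relu = deep_forward_relu(x, init_scale=math.sqrt(2 / 100))

print(xavier_relu.std().item())    # shrinks towards zero
print(kaiming_relu.std().item())   # stays roughly constant across layers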

With these common issues out of the way, we start implementing the modules of a neural network. To keep it simple, we build a feedforward network using linear layers, ReLU and some standard loss functions. Following the Python Data Model, we find our API design begins to look a lot like PyTorch.

class Linear:
    '''Affine layer with weight & bias initialized using Kaiming initialization'''

    def __init__(self, in_size, out_size):
        self.weight = torch.randn(in_size, out_size) * math.sqrt(2/in_size)
        self.bias = torch.zeros(out_size)
    
    def __repr__(self):
        return f"(Linear: in={self.weight.shape[0]} out={self.weight.shape[1]})"
        
    def forward(self, x):
        return x@self.weight + self.bias
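A ReLU module in the same style is even simpler; here is a sketch (the notebook's version may differ slightly):

class ReLU:
    '''Rectified Linear Unit activation'''

    def __repr__(self):
        return "(ReLU)"

    def forward(self, x):
        return x.clamp(min=0)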

No wonder PyTorch feels so naturally pythonic!

Propagating Gradients Backwards

After the forward pass, we need to backpropagate the gradients based on the loss. You can follow along by running this notebook on Google Colab.

To mimic an autograd engine, we derive the equations for the gradients and implement them. Every layer (linear, activation and loss) must now implement a backward() method. We design an abstract layer class (just like PyTorch's nn.Module) to enforce these requirements.

class Layer:
    '''Abstract Layer class like PyTorch's nn.Module'''

    def __call__(self, *args):
        self.args = args
        self.y = self.forward(*self.args)
        return self.y

    def __repr__(self):
        raise NotImplementedError("__repr__ method not implemented")

    def forward(self, *args):
        raise NotImplementedError("forward method not implemented")
        
    def backward(self):
        raise NotImplementedError("backward method not implemented")

Just like in PyTorch, every module of our neural network must inherit from the abstract layer class. Unlike PyTorch, every module must implement its own backward computation. We lack the luxury of an autograd engine. Here's what a loss layer looks like.

class MSE(Layer):
    '''Mean Squared Error Layer'''
    
    def __repr__(self):
        return "Loss: Mean Squared Error"
    
    def forward(self, y_pred, y_true):
        return (y_pred-y_true).pow(2).mean()
    
    def backward(self):
        # gradient of the mean squared error w.r.t. the predictions
        n = self.args[0].shape[0]
        self.args[0].g = 2 * (self.args[0] - self.args[1]) / n

Every variable stores its gradient in the g attribute, just like the grad attribute in PyTorch. Here's how a linear layer would implement its backward method and store the gradients in g.

def backward(self):
    # chain rule through y = x @ W + b
    self.weight.g = self.args[0].t() @ self.y.g
    self.bias.g = self.y.g.sum(0)              # summed over the batch dimension
    self.args[0].g = self.y.g @ self.weight.t()
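The ReLU layer follows the same pattern: its gradient simply lets the upstream gradient through wherever the input was positive. Here is a sketch, now as a Layer subclass:

class ReLU(Layer):
    '''ReLU activation with its backward pass'''

    def __repr__(self):
        return "(ReLU)"

    def forward(self, x):
        return x.clamp(min=0)

    def backward(self):
        # gradient flows only through the positions where the input was positive
        self.args[0].g = (self.args[0] > 0).float() * self.y.g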

We do not maintain a computation graph. Instead, we propagate the gradients by calling backward() on each layer in reverse order.

def backward(self):
    self.loss.backward()
    for layer in reversed(self.layers):
        layer.backward()

However, even with these limitations, our implementation starts to look a lot like PyTorch from the outside. Here’s how we would write a single training step.

model = FeedForwardNN(x_dim=784, y_dim=1, n_layers=5)
y_pred = model(x_train)
loss = model.loss(y_pred.float(), y_train.unsqueeze(-1).float())
model.backward()
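To complete the step, we would nudge the parameters using the gradients stored in g. A minimal SGD update might look like this (a sketch, assuming FeedForwardNN keeps its modules in model.layers as the backward method above suggests, and a hypothetical learning rate lr):

lr = 0.01   # hypothetical learning rate

for layer in model.layers:
    if isinstance(layer, Linear):          # only Linear layers hold parameters
        layer.weight -= lr * layer.weight.g
        layer.bias   -= lr * layer.bias.g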

We have now implemented the foundations of deep learning systems - from matrix multiplications to backpropagation.
