The mathematical trick to speed up transformers

A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper “Rethinking Attention with Performers”, which proposes a way around the softmax bottleneck in transformers [1]. Their approach exploits a clever mathematical trick, which I will explain in this article.


  • Kernel functions

Topics covered:

  • The problem with transformers
  • Sidestepping the softmax bottleneck

Why transformers?

In essence, the transformer is a model designed to work efficiently with sequential data, and it is used heavily in Natural Language Processing (NLP) tasks, which require handling sequences of words or letters. Unlike other sequential models, the transformer uses attention mechanisms to process sequential data in parallel (i.e. without going through one word or input at a time) [2]. …

Why does minimizing the norms induce regularization?

If you’ve taken an introductory Machine Learning class, you’ve certainly come across the issue of overfitting and been introduced to the concepts of regularization and norms. I often see these discussed purely by looking at the formulas, so I figured I’d try to give better insight into why exactly minimizing the norm induces regularization, and how L1 and L2 differ from each other, using some visual examples.

Prerequisite knowledge

  • Linear regression
  • Gradient descent
  • Some understanding of overfitting and regularization

Topics covered

  • Why does minimizing the norm induce regularization?
  • What’s the difference between the L1 norm and the L2 norm?

Recap of regularization

Using the example of linear regression, our loss is given by the Mean Squared Error…

Evaluating NLP models using the weighted branching factor

In this post I will give a detailed overview of perplexity as it is used in Natural Language Processing (NLP), covering the two ways in which it is normally defined and the intuitions behind them.


  1. A quick recap of language models
  2. Evaluating language models
  3. Perplexity as the normalised inverse probability of the test set
    3.1 Probability of the test set
    3.2 Normalising
    3.3 Bringing it all together
  4. Perplexity as the exponential of the cross-entropy
    4.1 Cross-entropy of a language model
    4.2 Weighted branching factor: rolling a die
    4.3 Weighted branching factor: language models
  5. Summary
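
As a rough illustration of the two definitions named in the outline, the sketch below computes perplexity both ways for a sequence of per-token probabilities. The function names and the example probabilities are my own assumptions for illustration and are not taken from the article.

```python
import math


def perplexity_inverse_probability(probs):
    """Perplexity as the inverse probability of the test set,
    normalised by the number of tokens (computed in log space)."""
    n = len(probs)
    log_prob = sum(math.log(p) for p in probs)  # log of the product of probabilities
    return math.exp(-log_prob / n)


def perplexity_cross_entropy(probs):
    """Perplexity as 2 raised to the cross-entropy, measured in bits per token."""
    n = len(probs)
    cross_entropy = -sum(math.log2(p) for p in probs) / n
    return 2 ** cross_entropy


# Assumed example: a model that treats every roll of a fair six-sided die
# as equally likely assigns probability 1/6 to each of 12 observed rolls.
fair_die = [1 / 6] * 12

# Assumed example: an arbitrary, non-uniform set of per-token probabilities.
uneven = [0.5, 0.25, 0.25, 0.5]

print(perplexity_inverse_probability(fair_die), perplexity_cross_entropy(fair_die))
print(perplexity_inverse_probability(uneven), perplexity_cross_entropy(uneven))
```

For the fair die both functions return approximately 6, which matches the intuition that the model is choosing between six equally likely outcomes at every step. The two definitions also agree on the non-uniform example, since they are the same quantity written in different forms.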

1. A quick recap of language models

Skip if not needed

A language model is a statistical model that assigns probabilities to words and sentences. Typically, we might be trying to guess the next word w in a sentence given all previous words, often referred to as the “history”.
For example, given the history “For dinner I’m making __”, what’s the probability that the next word is “cement”? What’s the probability that the next word is “fajitas”?
Hopefully, P(fajitas|For dinner I’m making) > P(cement|For dinner I’m making). …
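
To make the idea of a next-word probability concrete, here is a minimal count-based sketch. The toy corpus and the helper names are hypothetical, invented only for this illustration, and the estimate is a plain relative frequency rather than how a real language model is trained.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented purely for illustration.
corpus = [
    "for dinner i'm making fajitas",
    "for dinner i'm making pasta",
    "for dinner i'm making fajitas",
    "for the patio i'm mixing cement",
]


def next_word_counts(sentences):
    """Count which word follows each history (all preceding words)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words)):
            history = tuple(words[:i])
            counts[history][words[i]] += 1
    return counts


def next_word_probability(counts, history, word):
    """Estimate P(word | history) as a relative frequency."""
    seen = counts[tuple(history.split())]
    total = sum(seen.values())
    return seen[word] / total if total else 0.0


counts = next_word_counts(corpus)
history = "for dinner i'm making"
p_fajitas = next_word_probability(counts, history, "fajitas")
p_cement = next_word_probability(counts, history, "cement")
print(p_fajitas > p_cement)
```

With this corpus the estimate for “fajitas” is higher than the one for “cement”, which never follows this history and therefore gets probability zero. Practical models avoid such hard zeros, for example by smoothing the counts.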

How to represent and manipulate data in graph form

Photo by Clint Adair

This is a summary of some of the key points from the paper “A Review of Relational Machine Learning for Knowledge Graphs (28 Sep 2015)” [1], which gives a nice introduction to Knowledge Graphs and some of the methods used to build and expand them.

The key takeaway

Information can be structured in the form of a graph, with nodes representing entities and edges representing relationships between entities. A knowledge graph can be built manually or using automatic information extraction methods on some source text (for example Wikipedia). …
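
One common way to hold such a graph in code is as a list of (subject, relation, object) triples. The sketch below is only an illustration under that assumption: the facts, relation names and helper functions are chosen for the example and are not the paper's notation or method.

```python
from collections import defaultdict

# Illustrative facts stored as (subject, relation, object) triples.
triples = [
    ("Leonard Nimoy", "played", "Spock"),
    ("Spock", "character_in", "Star Trek"),
    ("Leonard Nimoy", "starred_in", "Star Trek"),
]


def build_graph(facts):
    """Index the triples by subject: each node maps to its outgoing edges."""
    graph = defaultdict(list)
    for subject, relation, obj in facts:
        graph[subject].append((relation, obj))
    return graph


def objects_of(graph, subject, relation):
    """Return every entity linked to `subject` by `relation`."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]


graph = build_graph(triples)
print(objects_of(graph, "Leonard Nimoy", "played"))      # ['Spock']
print(objects_of(graph, "Spock", "character_in"))        # ['Star Trek']
print(objects_of(graph, "Star Trek", "character_in"))    # []
```

Entities are the dictionary keys (nodes) and each stored pair is a labelled edge, so adding a new fact to the graph amounts to appending one more triple.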


Chiara Campagnola

Machine Learning student at UCL
