A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper *Rethinking Attention with Performers*, which seeks to solve the softmax bottleneck problem in transformers [1]. Their approach exploits a clever mathematical trick, which I will explain in this article.

**Prerequisites:**

- Some knowledge of transformers
- Kernel functions

**Topics covered:**

- Why transformers?
- The problem with transformers
- Sidestepping the softmax bottleneck

In essence, the Transformer is a model designed to work efficiently with sequential data, and it is in fact employed heavily in Natural Language Processing (NLP) tasks, which require…

If you’ve taken an introductory Machine Learning class, you’ve certainly come across the issue of overfitting and been introduced to the concepts of regularization and norms. I often see these discussed purely by looking at the formulas, so I figured I’d try to give a better insight into why exactly minimizing the norm induces regularization, and how L1 and L2 differ from each other, using some visual examples.

**Prerequisites:**

- Linear regression
- Gradient descent
- Some understanding of overfitting and regularization

**Topics covered:**

- Why does minimizing the norm induce regularization?
- What’s the difference between the L1 norm and the L2 norm?

Using…

In this post I will give a detailed overview of perplexity as it is used in Natural Language Processing (NLP), covering the two ways in which it is normally defined and the intuitions behind them.

1. **A quick recap of language models**
2. **Evaluating language models**
3. **Perplexity as the normalised inverse probability of the test set**
   - 3.1 Probability of the test set
   - 3.2 Normalising
   - 3.3 Bringing it all together
4. **Perplexity as the exponential of the cross-entropy**
   - 4.1
   - 4.2 Weighted branching factor: rolling a die
   - 4.3 Weighted branching factor: language models
5. **Summary**
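As a rough illustration of the two definitions named in the outline, here is a minimal sketch that computes perplexity both ways and checks that they agree. The function names and the per-token probabilities are made up for the example, and I am assuming the cross-entropy is measured in bits (base-2 logarithm), so the "exponential" is a power of 2.

```python
import math


def perplexity_inverse_prob(token_probs):
    """Normalised inverse probability of the test set: P(W) ** (-1 / N)."""
    n = len(token_probs)
    joint_prob = math.prod(token_probs)  # probability of the whole sequence
    return joint_prob ** (-1.0 / n)


def perplexity_cross_entropy(token_probs):
    """2 raised to the cross-entropy, measured in bits per token."""
    n = len(token_probs)
    cross_entropy = -sum(math.log2(p) for p in token_probs) / n
    return 2.0 ** cross_entropy


# Hypothetical probabilities a model assigns to each token of a 4-token test set.
probs = [0.1, 0.25, 0.5, 0.2]
pp_inverse = perplexity_inverse_prob(probs)
pp_entropy = perplexity_cross_entropy(probs)

# A fair six-sided die assigns 1/6 to every roll, however many rolls we observe.
pp_die = perplexity_inverse_prob([1 / 6] * 10)
```

Both functions return the same value for the same input, and for the fair die the perplexity comes out as 6 (up to floating-point error), which matches the branching-factor reading: the model is as uncertain as if it were choosing uniformly among six options.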

*Skip if not needed*

A **language…**

This is a **summary** of some of the key points from the paper *“A Review of Relational Machine Learning for Knowledge Graphs (28 Sep 2015)”* [1], which gives a nice introduction to Knowledge Graphs and some of the methods used to build and expand them.

Information can be structured in the form of a graph, with nodes representing entities and edges representing relationships between entities. A knowledge graph can be built manually or using automatic information extraction methods on some source text (for example Wikipedia). …
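To make the node-and-edge picture concrete, here is a minimal sketch of a tiny knowledge graph stored as (subject, relation, object) triples. The facts, the relation names and the `neighbours` helper are all hypothetical choices for illustration; they are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical facts: each triple is (subject entity, relation, object entity).
triples = [
    ("UCL", "located_in", "London"),
    ("London", "capital_of", "United Kingdom"),
    ("UCL", "is_a", "University"),
]

# Adjacency view: each entity (node) maps to its outgoing labelled edges.
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))


def neighbours(entity):
    """Return the (relation, object) pairs leaving `entity`, or [] if none."""
    return graph.get(entity, [])


ucl_edges = neighbours("UCL")
```

Here `ucl_edges` holds the two edges leaving the "UCL" node. Storing triples in a flat list keeps the graph easy to extend, whether a new fact is added by hand or produced by an extraction step.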

Machine Learning student at UCL