A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper *Rethinking Attention with Performers*, which seeks a solution to the softmax bottleneck problem in transformers [1]. Their approach exploits a clever mathematical trick, which I will explain in this article.

**Prerequisites:**

- Some knowledge of transformers
- Kernel functions

**Topics covered:**

- Why transformers?
- The problem with transformers
- Sidestepping the softmax bottleneck

In essence, the Transformer is a model designed to work efficiently with sequential data, and it is used heavily in Natural Language Processing (NLP) tasks, which require handling sequences of words or letters. Unlike other sequential models, the transformer uses attention mechanisms to process sequential data in parallel (i.e., without going through one word or input at a time) [2]. …
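To make "in parallel" concrete, here is a minimal NumPy sketch of standard scaled dot-product attention for a single sequence. It is the usual softmax attention, not the Performer approach, and the function names, shapes and random inputs are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; Q and K are (L, d), V is (L, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (L, L): every position against every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Illustrative random inputs: a sequence of 5 positions with dimension 4.
rng = np.random.default_rng(0)
L, d = 5, 4
Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))
V = rng.normal(size=(L, d))
out, weights = attention(Q, K, V)
```

All positions are handled by a couple of matrix products rather than a loop over the sequence. The intermediate `weights` matrix has shape `(L, L)`, so its size grows quadratically with the sequence length.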

If you’ve taken an introductory Machine Learning class, you’ve certainly come across the issue of overfitting and been introduced to the concepts of regularization and norms. I often see this discussed purely by looking at the formulas, so I figured I’d try to give a better insight into why exactly minimising the norm induces regularization, and how L1 and L2 differ from each other, using some visual examples.

**Prerequisites:**

- Linear regression
- Gradient descent
- Some understanding of overfitting and regularization

**Topics covered:**

- Why does minimizing the norm induce regularization?
- What’s the difference between the L1 norm and the L2 norm?

Using the example of linear regression, our loss is given by the Mean Squared Error…
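For reference, one common way to write these quantities is sketched below. The symbols (weights $w$, inputs $x_i$, targets $y_i$, $n$ samples, regularization strength $\lambda$) are assumptions for illustration and may differ from the notation used in the full post.

```latex
\mathrm{MSE}(w) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^\top x_i \right)^2,
\qquad
\|w\|_1 = \sum_{j} |w_j|,
\qquad
\|w\|_2 = \sqrt{\sum_{j} w_j^2}
```

A regularized loss then typically adds a penalty on the norm, for example $\mathrm{MSE}(w) + \lambda \|w\|_1$ or $\mathrm{MSE}(w) + \lambda \|w\|_2^2$.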

In this post I will give a detailed overview of perplexity as it is used in Natural Language Processing (NLP), covering the two ways in which it is normally defined and the intuitions behind them.

1. **A quick recap of language models**
2. **Evaluating language models**
3. **Perplexity as the normalised inverse probability of the test set**
   - 3.1 Probability of the test set
   - 3.2 Normalising
   - 3.3 Bringing it all together
4. **Perplexity as the exponential of the cross-entropy**
   - 4.1
   - 4.2 Weighted branching factor: rolling a die
   - 4.3 Weighted branching factor: language models
5. **Summary**

*Skip if not needed*

A **language model** is a statistical model that assigns probabilities to words and sentences. Typically, we might be trying to guess the **next word** w in a sentence given all previous words, often referred to as the **history**.

For example, given the history “For dinner I’m making __”, what’s the probability that the next word is “cement”? What’s the probability that the next word is “fajitas”?

Hopefully, P(fajitas|For dinner I’m making) > P(cement|For dinner I’m making). …
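The two definitions of perplexity mentioned at the start of this post can be compared on a toy example. The sketch below assumes we already have the probability a model assigned to each word of a test sentence given its history. The numbers are made up, the function names are hypothetical, and base 2 is an arbitrary choice (any base works as long as the logarithm and the exponent match).

```python
import math

# Hypothetical per-word probabilities P(word | history) for one test sentence.
# These are illustrative numbers, not the output of a real model.
word_probs = [0.2, 0.1, 0.5, 0.25]

def perplexity_inverse_prob(probs):
    """Inverse probability of the test set, normalised by the number of words."""
    n = len(probs)
    total_prob = math.prod(probs)          # probability of the whole sequence
    return total_prob ** (-1.0 / n)

def perplexity_cross_entropy(probs):
    """2 raised to the average negative log2-probability per word."""
    n = len(probs)
    cross_entropy = -sum(math.log2(p) for p in probs) / n
    return 2 ** cross_entropy

pp_a = perplexity_inverse_prob(word_probs)
pp_b = perplexity_cross_entropy(word_probs)
```

Both functions return the same value up to floating-point error, since they are two ways of writing the same quantity. As a sanity check, a model that assigns probability 1/6 to every outcome, like a fair die, has perplexity 6 under either definition.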

This is a **summary** of some of the key points from the paper *“A Review of Relational Machine Learning for Knowledge Graphs”* (28 Sep 2015) [1], which gives a nice introduction to Knowledge Graphs and some of the methods used to build and expand them.

Information can be structured in the form of a graph, with nodes representing entities and edges representing relationships between entities. A knowledge graph can be built manually or using automatic information extraction methods on some source text (for example Wikipedia). …
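As a rough illustration of the nodes-and-edges idea, a small knowledge graph can be stored as a list of (subject, relation, object) triples and indexed by entity. The triple encoding, the example facts and the helper `neighbours` are assumptions made for this sketch, not something prescribed by the paper.

```python
from collections import defaultdict

# Hypothetical facts, each stored as a (subject, relation, object) triple.
triples = [
    ("Leonard Nimoy", "played", "Spock"),
    ("Spock", "character_in", "Star Trek"),
    ("Leonard Nimoy", "starred_in", "Star Trek"),
]

# Adjacency map: entity (node) -> list of (relation, neighbouring entity) edges.
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

def neighbours(entity):
    """Return the outgoing (relation, object) edges of an entity, if any."""
    return graph.get(entity, [])

nimoy_edges = neighbours("Leonard Nimoy")
```

Here entities are nodes and each triple is a labelled, directed edge. Adding a fact found by manual curation or automatic extraction amounts to appending one more triple.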
