A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper Rethinking Attention with Performers, which proposes a solution to the softmax bottleneck problem in transformers. Their approach exploits a clever mathematical trick, which I will explain in this article.
In essence, the Transformer is a model designed to work efficiently with sequential data, and it is employed heavily in Natural Language Processing (NLP) tasks, which require handling sequences of words or letters. Unlike other sequential models, the transformer exploits attention mechanisms to process sequential data in parallel (i.e., without needing to go through one word/input at a time). …
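The parallelism mentioned above comes from the fact that attention is computed with matrix products over the whole sequence at once. A minimal NumPy sketch of standard scaled dot-product attention (shapes and names are illustrative, not taken from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d)) V.

    Every position attends to every other position in one shot via
    matrix products, which is what allows parallel processing of
    the sequence (and also what makes cost quadratic in its length).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarities
    # Row-wise softmax (shifted by the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of value rows

rng = np.random.default_rng(0)
seq_len, d = 4, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

The (seq_len, seq_len) score matrix is exactly the quadratic-cost softmax step that Performers approximate.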
If you’ve taken an introductory Machine Learning class, you’ve certainly come across the issue of overfitting and been introduced to the concepts of regularization and norms. I often see this discussed purely in terms of the formulas, so I figured I’d try to give better insight into why exactly minimising the norm induces regularization, and how L1 and L2 differ from each other, using some visual examples.
Using the example of linear regression, our loss is given by the Mean Squared Error…
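As a concrete sketch of the setup (assuming the usual formulation, with illustrative variable names): the Mean Squared Error plus an L1 or L2 penalty on the weights.

```python
import numpy as np

def loss(w, X, y, l1=0.0, l2=0.0):
    """Mean squared error plus optional L1/L2 penalty terms.

    l1 scales sum(|w_i|)   (the L1 norm, inducing sparsity)
    l2 scales sum(w_i**2)  (the squared L2 norm, shrinking weights)
    """
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    return mse + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

X = np.array([[1.0], [2.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0])               # a perfect fit: X @ w == y
print(loss(w, X, y))              # 0.0 (pure MSE, no penalty)
print(loss(w, X, y, l1=0.5))      # 0.5 (the L1 term adds 0.5 * |1.0|)
```

Even at a perfect fit, a nonzero penalty pushes the optimum away from the unregularized solution toward smaller weights, which is the regularizing effect.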
In this post I will give a detailed overview of perplexity as it is used in Natural Language Processing (NLP), covering the two ways in which it is normally defined and the intuitions behind them.
(Skip this section if not needed.)
A language model is a statistical model that assigns probabilities to words and sentences. Typically, we might be trying to guess the next word w in a sentence given all previous words, often referred to as the “history”.
For example, given the history “For dinner I’m making __”, what’s the probability that the next word is “cement”? What’s the probability that the next word is “fajitas”?
Hopefully, P(fajitas|For dinner I’m making) > P(cement|For dinner I’m making). …
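One common definition of perplexity follows directly from such per-word probabilities: the inverse probability of the sentence, normalised by its length. A small sketch (the probabilities are made up for illustration):

```python
import math

def perplexity(word_probs):
    """Perplexity of a sentence from the model's probability of each
    word given its history:

        PP = (prod p_i)^(-1/N) = exp(-(1/N) * sum(log p_i))

    Lower is better: the model is less "surprised" by the sentence.
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns probability 0.25 to every word is exactly as
# confused as a uniform choice among 4 options at each step:
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

This is where the intuition of perplexity as a "branching factor" comes from: perplexity k means the model is, on average, as uncertain as if it were choosing uniformly among k words.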
This is a summary of some of the key points from the paper “A Review of Relational Machine Learning for Knowledge Graphs” (28 Sep 2015), which gives a nice introduction to Knowledge Graphs and some of the methods used to build and expand them.
Information can be structured in the form of a graph, with nodes representing entities and edges representing relationships between entities. A knowledge graph can be built manually or by applying automatic information extraction methods to some source text (for example, Wikipedia). …
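The entity/edge structure described above is often stored as (subject, predicate, object) triples. A minimal sketch (the entities and relations here are illustrative, not from the paper):

```python
# A tiny knowledge graph as a set of (subject, predicate, object)
# triples: nodes are entities, each triple is a labelled edge.
triples = {
    ("London", "capital_of", "United Kingdom"),
    ("London", "located_in", "England"),
    ("England", "part_of", "United Kingdom"),
}

def objects(graph, subject, predicate):
    """All entities related to `subject` by the edge label `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(objects(triples, "London", "capital_of"))  # {'United Kingdom'}
```

Expanding a knowledge graph then amounts to predicting plausible new triples that are missing from the set, which is what the relational machine learning methods in the paper address.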