A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper Rethinking Attention with Performers, which proposes a solution to the softmax bottleneck problem in transformers. Their approach exploits a clever mathematical trick, which I will explain in this article.
If you’ve taken an introductory Machine Learning class, you’ve certainly come across the problem of overfitting and been introduced to the concepts of regularization and norms. I often see these discussed purely in terms of the formulas, so I figured I’d try to give a better insight into why exactly minimising the norm induces regularization (and how L1 and L2 differ from each other) using some visual examples.
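As a minimal numeric sketch of the L1/L2 difference (the function and variable names here are illustrative, not from the article): the two norms can rank the same pair of weight vectors very differently, which is one intuition for why L1 tends to favour sparse solutions.

```python
def l1_norm(weights):
    """Sum of absolute values: penalizes every nonzero weight linearly."""
    return sum(abs(w) for w in weights)

def l2_norm(weights):
    """Euclidean length: penalizes large individual weights more heavily."""
    return sum(w * w for w in weights) ** 0.5

dense = [0.5, 0.5, 0.5, 0.5]   # weight spread across all coordinates
sparse = [1.0, 0.0, 0.0, 0.0]  # same weight concentrated in one coordinate

print(l1_norm(dense), l1_norm(sparse))  # 2.0 vs 1.0 -- L1 prefers the sparse vector
print(l2_norm(dense), l2_norm(sparse))  # 1.0 vs 1.0 -- L2 is indifferent here
```

Under an L1 penalty the sparse vector is strictly cheaper, while under L2 both cost the same, which is the usual algebraic seed of the "L1 induces sparsity" picture.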
In this post I will give a detailed overview of perplexity as it is used in Natural Language Processing (NLP), covering the two ways in which it is normally defined and the intuitions behind them.
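As a quick taste of one of those definitions (a sketch under my own naming, not the article's notation): perplexity can be computed as the exponential of the average negative log-likelihood the model assigns to each token.

```python
import math

def perplexity(token_probs):
    """Perplexity as exp of the mean negative log-likelihood per token.

    `token_probs` is an illustrative list of the probabilities a language
    model assigned to each token in a held-out sequence.
    """
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)

# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among 4 options, so its perplexity is 4:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

The "effective branching factor" reading falls out directly: uniform probability 1/k over every token gives perplexity exactly k.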
This is a summary of some of the key points from the paper “A Review of Relational Machine Learning for Knowledge Graphs” (28 Sep 2015), which gives a nice introduction to Knowledge Graphs and some of the methods used to build and expand them.
Information can be structured in the form of a graph, with nodes representing entities and edges representing relationships between entities. A knowledge graph can be built manually or using automatic information extraction methods on some source text (for example Wikipedia). …
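The node/edge structure above can be sketched as a set of (subject, predicate, object) triples, a common storage format for knowledge graphs. This is a minimal illustration with made-up helper names and example facts, not code from the paper.

```python
# A tiny knowledge graph as (subject, predicate, object) triples.
triples = [
    ("London", "capital_of", "United Kingdom"),
    ("Alan Turing", "born_in", "London"),
]

def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects(triples, "London", "capital_of"))  # ['United Kingdom']
```

Automatic information extraction, in this framing, amounts to mining new triples like these from source text and adding them to the graph.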
Machine Learning student at UCL