ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
ELMo representations are:
- Contextual: The representation for each word depends on the entire context in which it is used.
- Deep: The word representations combine all layers of a deep pre-trained neural network.
- Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
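The "Deep" property can be illustrated with a small sketch. The ELMo paper describes collapsing the biLM layers into one vector per token using softmax-normalized scalar weights and a task-specific scale. The pure-Python version below assumes that formulation; the function names and toy values are illustrative and are not taken from any released code.

```python
import math
from typing import List, Sequence


def softmax(weights: Sequence[float]) -> List[float]:
    """Normalize raw scalar weights so they are positive and sum to 1."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layers: Sequence[Sequence[float]],
               scalar_weights: Sequence[float],
               gamma: float = 1.0) -> List[float]:
    """Combine the per-layer vectors of ONE token into a single vector.

    layers: one vector per biLM layer, all of the same dimension.
    scalar_weights: one raw (pre-softmax) weight per layer.
    gamma: scale applied to the mixed vector.
    """
    s = softmax(scalar_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layers))
            for d in range(dim)]


# Toy example: three layers, two dimensions, equal weights (made-up values).
toy_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = mix_layers(toy_layers, scalar_weights=[0.0, 0.0, 0.0])
print(mixed)
```

In a real model the scalar weights and the scale would be learned jointly with the downstream task, so each task can emphasize different layers.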
Adding ELMo to existing NLP systems significantly improves the state of the art for every considered task. In most cases, ELMo representations can simply be swapped in for pre-trained GloVe or other word vectors.
|Task||Previous SOTA||Our baseline||ELMo + Baseline||Increase (Absolute/Relative)|
|SQuAD|| ||81.1||85.8||4.7 / 24.9%|
|SNLI||Chen et al (2017), 88.6||88.0||88.7 +/- 0.17||0.7 / 5.8%|
|SRL||He et al (2017), 81.7||81.4||84.6||3.2 / 17.2%|
|Coref||Lee et al (2017), 67.2||67.2||70.4||3.2 / 9.8%|
|NER||Peters et al (2017), 91.93 +/- 0.19||90.15||92.22 +/- 0.10||2.06 / 21%|
|Sentiment (SST-5)||McCann et al (2017), 53.7||51.4||54.7 +/- 0.5||3.3 / 6.8%|
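The "Increase (Absolute/Relative)" column in the table above appears consistent with reading the relative figure as a relative error reduction, that is, the absolute gain divided by the baseline's remaining error. The sketch below recomputes both numbers under that assumption, treating every score as a percentage out of 100; the function name is illustrative.

```python
from typing import Tuple


def error_reduction(baseline: float, with_elmo: float) -> Tuple[float, float]:
    """Return (absolute gain, relative error reduction in percent).

    Assumes both scores are percentages, so the baseline error is 100 - baseline.
    """
    absolute = with_elmo - baseline
    relative = 100.0 * absolute / (100.0 - baseline)
    return absolute, relative


# (baseline, ELMo + baseline) pairs copied from the table rows, in order.
rows = [(81.1, 85.8), (88.0, 88.7), (81.4, 84.6),
        (67.2, 70.4), (90.15, 92.22), (51.4, 54.7)]

for baseline, with_elmo in rows:
    absolute, relative = error_reduction(baseline, with_elmo)
    print(f"{baseline} -> {with_elmo}: +{absolute:.2f} absolute, "
          f"{relative:.1f}% relative")
```

Small mismatches with the listed values (for example, in the row with baseline 90.15) are presumably due to rounding in the reported scores.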
Pre-trained ELMo Models
|Model||Link (Weights/Options File)||# Parameters (Millions)||LSTM Hidden Size/Output size||# Highway Layers||SRL F1||Constituency Parsing F1|
The baseline models described are from the original ELMo paper for SRL and from "Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples" (Joshi et al, 2018) for the Constituency Parser. We do not include GloVe vectors in these models, so that the ELMo representations can be compared directly. In some cases, this results in a small drop in performance (0.5 F1 for the Constituency Parser, > 0.1 for the SRL model).
All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). In tasks where we have made a direct comparison, the 5.5B model has slightly higher performance than the original ELMo model, so we recommend it as a default model.
Contributed ELMo Models
ELMo models have been trained for other languages and domains. We maintain a list of models here but are unable to respond to quality issues ourselves.
|Model||Link (Weights/Options File)||Contributor/Notes|
|Portuguese (Wikipedia corpus)||Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs.|
|Portuguese (brWaC corpus)||Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs.|
|Japanese||ExaWizards Inc. Enkhbold Bataa, Joshua Wu. (paper)|
|German||Philip May & T-Systems onsite|
|Transformer ELMo||Joel Grus and Brendan Roof|
Code releases and AllenNLP integration
Reference implementations of the pre-trained bidirectional language model are available in both PyTorch and TensorFlow. The PyTorch version is fully integrated into AllenNLP. The TensorFlow version is available in bilm-tf.
You can retrain ELMo models using the TensorFlow code in bilm-tf.
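As a rough illustration of the PyTorch integration, the sketch below computes ELMo vectors for pre-tokenized sentences with the `Elmo` module and the `batch_to_ids` helper from `allennlp.modules.elmo`. It assumes an AllenNLP version that ships that module and that you have already downloaded an options file and a weights file for one of the models listed above. The wrapper function `embed_sentences` and its arguments are hypothetical names introduced here for illustration.

```python
from typing import List


def embed_sentences(options_file: str,
                    weight_file: str,
                    sentences: List[List[str]]):
    """Return a tensor of ELMo vectors, one per token, for tokenized sentences.

    options_file / weight_file: paths to a downloaded ELMo options (JSON)
    and weights (HDF5) file; both are supplied by the caller.
    """
    # Imported lazily so this function can be defined without AllenNLP
    # and PyTorch installed; they are only needed when it is called.
    from allennlp.modules.elmo import Elmo, batch_to_ids

    # Request a single learned weighted combination of the biLM layers.
    elmo = Elmo(options_file, weight_file,
                num_output_representations=1, dropout=0.0)
    elmo.eval()

    # Map each token to character ids, padded to the longest sentence.
    character_ids = batch_to_ids(sentences)
    output = elmo(character_ids)

    # "elmo_representations" is a list with one tensor per requested
    # representation; the returned dict also carries a padding "mask".
    return output["elmo_representations"][0]
```

A call might look like `embed_sentences("options.json", "weights.hdf5", [["First", "sentence", "."], ["Another", "."]])`, where both file names are placeholders for the files you downloaded.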
See our paper "Deep contextualized word representations" for more information about the algorithm and a detailed analysis.