Allen Institute for AI

Explicit Semantic Ranking Dataset

Semantic Scholar • 2017
This is the dataset for the paper Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding. It includes the query log used in the paper, relevance judgments for the queries, ranking lists from Semantic Scholar, candidate documents, entity embeddings trained using the knowledge graph, and the ranking results of the baselines, development methods, and alternative methods from the experiments.

This dataset is designed to demonstrate Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine, Semantic Scholar, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To address this challenge, ESR represents queries and documents in the entity space and ranks them based on their semantic connections in the knowledge graph embedding. Experiments demonstrate ESR’s ability to improve Semantic Scholar’s online production system, especially on hard queries where word-based ranking fails.


If you find this dataset helpful in your work, please cite:

title={Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding},
author={Xiong, Chenyan and Power, Russell and Callan, Jamie},
booktitle={Proceedings of the 26th International Conference on World Wide Web (WWW 2017)},
note={To appear},


In the zip file you will find the following files and folders:

s2_query.json contains the queries used in this paper. Each line is a JSON dictionary with the following format:

{
  "qid": "the query id",
  "query": "the query string",
  "ana": {
    the annotated entity id and frequency
  }
}

The entities are from Freebase. Please refer to the final dump of Freebase to get more information about these entities.
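
As a minimal sketch of reading this file, the snippet below parses one JSON dictionary per line and indexes the records by "qid". The function name and the file path in the usage comment are illustrative assumptions, and the contents of "ana" are passed through untouched because their exact layout is not specified here.

```python
import json


def load_queries(lines):
    """Parse JSON-lines query records into a dict keyed by qid.

    `lines` is any iterable of strings, e.g. an open file handle.
    The "ana" field is kept as-is; its inner layout is not assumed.
    """
    queries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        queries[record["qid"]] = record
    return queries


# Usage (the path is an assumption about where the zip was extracted):
# with open("s2_query.json", encoding="utf-8") as f:
#     queries = load_queries(f)
```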

s2.trec is the ranking file in TREC format. It contains the ranking lists from Semantic Scholar’s production search engine (as of summer 2016).
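
A ranking file of this kind can be read with a few lines of code. The sketch below assumes the conventional six-column TREC run layout (`qid Q0 docno rank score run_tag`), which this page does not spell out; the function name is illustrative.

```python
from collections import defaultdict


def read_trec_run(lines):
    """Return {qid: [(docno, score), ...]} with each list ordered by rank.

    Assumes whitespace-separated columns: qid, Q0, docno, rank, score, tag.
    Lines with fewer than six columns are skipped.
    """
    rows_by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue
        qid, _q0, docno, rank, score, _tag = parts[:6]
        rows_by_query[qid].append((int(rank), docno, float(score)))
    return {
        qid: [(docno, score) for _rank, docno, score in sorted(rows)]
        for qid, rows in rows_by_query.items()
    }


# Usage (path is an assumption):
# with open("s2.trec", encoding="utf-8") as f:
#     run = read_trec_run(f)
```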

s2_doc.json contains the candidate documents. Each line is a JSON dictionary. Its fields include:

docno: the doc id
keyPhrase: the automatically extracted key phrases for this paper.
paperAbstract: paper abstract
numCitedBy: number of citations
numKeyCitations: number of key citations. A key citation means that the citing paper considers this one a very important related work. This count comes from Semantic Scholar's production system.
Ana: the annotation of each of the title, paperAbstract, and body fields.

Due to copyright restrictions, we are not allowed to release the body text. Please check to get the full corpus and more information about each document.
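
Because the document file may be large, it can be convenient to stream it rather than load it whole. The sketch below yields one document at a time and, as an example use of the numCitedBy field, picks the most cited documents. Both function names are illustrative, and treating a missing citation count as zero is an assumption.

```python
import heapq
import json


def iter_docs(lines):
    """Yield one parsed document dictionary per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def top_cited(docs, n=10):
    """Return the n documents with the highest numCitedBy.

    A missing or null numCitedBy is treated as 0 (an assumption).
    """
    return heapq.nlargest(n, docs, key=lambda d: d.get("numCitedBy") or 0)


# Usage (path is an assumption):
# with open("s2_doc.json", encoding="utf-8") as f:
#     for doc in top_cited(iter_docs(f), n=5):
#         print(doc["docno"], doc.get("numCitedBy"))
```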

s2.qrel contains the relevance judgments for these queries. They were labeled by the first two authors. Judging the relevance of computer science papers is very hard: we had to read many papers’ abstracts, or even their introductions, before making any reasonable judgment. The current set of labels is therefore limited in size. Check back for a possible future benchmark release.
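
The judgments can be paired with a ranking list to score a run. The sketch below assumes the conventional four-column TREC qrels layout (`qid iteration docno relevance`), which this page does not spell out, and uses NDCG with linear gain as one possible metric. The function names and the default cutoff of 20 are illustrative choices, not settings taken from this page.

```python
import math


def read_qrels(lines):
    """Return {qid: {docno: relevance}} from TREC-style qrels lines."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue
        qid, _iteration, docno, rel = parts[:4]
        qrels.setdefault(qid, {})[docno] = int(rel)
    return qrels


def ndcg_at_k(ranked_docnos, judged, k=20):
    """NDCG@k with linear gain; unjudged documents count as relevance 0."""

    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [judged.get(docno, 0) for docno in ranked_docnos[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```

Other evaluation tools may use an exponential gain (2^rel - 1) instead of the linear gain shown here, so absolute numbers can differ between implementations.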

The ranking_res folder includes the ranking results of all baselines, development methods, and alternative methods in the experiments and analysis of this paper. Feel free to conduct future experiments based on them.

knowledge_graph_embedding folder contains the entity embeddings trained using our knowledge graph. It is in Google word2vec format.
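
As a rough illustration of using the embeddings, the sketch below reads the text variant of the word2vec format (a header line with vocabulary size and dimension, then one token and its vector per line) and compares two entities by cosine similarity. Whether the released files are text or binary is not stated here, so the text layout is an assumption; the function names and the entity ids in the usage comment are hypothetical.

```python
import math


def read_word2vec_text(lines):
    """Return {token: [float, ...]} from word2vec text-format lines.

    The first line is expected to be "<vocab_size> <dim>".
    Rows whose length does not match the declared dimension are skipped.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Usage (file name and entity ids are hypothetical):
# with open("embedding.txt", encoding="utf-8") as f:
#     vectors = read_word2vec_text(f)
# print(cosine(vectors["/m/entity_a"], vectors["/m/entity_b"]))
```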


queryID  query
-------  ------------------------------------------------------------
1        deep learning
2        artificial intelligence
3        information retrieval
4        machine learning
5        question answering
6        noun phrases
7        penn treebank
8        speech recognition
9        data mining
10       computer vision
11       reinforcement learning
12       natural language
13       autoencoder
14       ontology
15       sentiment analysis
16       sap
17       lstm
18       natural language processing
19       semantic web
20       mooc
21       human computer interaction
22       eye movement clustering
23       semantic relations
24       efficient estimation of word representations in vector space
25       big data
26       audio visual fusion
27       object detection
28       gfdm
29       neural network
30       generalized extreme value
31       information geometry
32       image panorama video
33       data science
34       semantic parsing
35       augmented reality
36       imbalanced data
37       recommender system
38       inverse reinforcement learning mixture
39       transfer learning
40       cnn
41       dynamic programming segmentation
42       natural language interface
43       genetic algorithm
44       prolog
45       contact prediction
46       wifi malware
47       nsdi machine learning
48       forensics and machine learning
49       words to speech
50       information theory
51       morphology morphological
52       category theory
53       graph theory
54       smart thermostat
55       exploit vulnerability
56       reinforcement learning and video game
57       system health management
58       spatial multi agent systems
59       service composition
60       mobile payment
61       3 axis gantry
62       softmax categorization
63       cost aggregation
64       chinese dialect
65       depth camera
66       mobile tcp traffic analysis
67       collective learning
68       robust production planning
69       memory hierarchy
70       hashing
71       comparable corpora
72       knowledge graph
73       social media
74       deep learning surveillance
75       cryptography
76       parametric max flow
77       deep reinforcement learning
78       varying weight grasp
79       dirichlet process
80       word embedding
81       graph drawing
82       robust principal component analysis
83       differential evolution
84       seq2seq
85       document logical structure
86       duality
87       variable neighborhood search
88       urban public transportation systems
89       edx coursera
90       fdir
91       cryptography key management
92       ontology construction
93       go game
94       personality trait
95       sparse learning
96       directed hypergraph
97       inventory management
98       clojure
99       ontology semantic web
100      convolutional neural network time series