Citeomatic is a deep learning model for the citation prediction task. Unlike previous work, Citeomatic is specifically trained to learn a robust model that gives meaningful predictions, even when it’s wrong. Relying only on the title and abstract of a query paper also allows Citeomatic to be a useful literature review tool at any stage in the writing process.
We tried many variations of models for Citeomatic before finding one that performed well both at distinguishing true citations from false ones and on the more important metric of producing results that are useful to humans. It turns out that the second part is far more difficult than the first. First, as anyone who has interacted with a human knows, human preferences can be hard to predict. Second, while you would suspect that maximizing performance on the task of predicting citations would yield a model that gives more reasonable results overall, you’d be wrong!
A big part of the problem is that our training set (papers and their citations) does not actually represent what users want. The best results for a user are the set of papers that have not been cited, but perhaps should have been.
It also turns out that models can leverage all sorts of innocuous information in diabolical ways. A model that’s given information about who authored which paper will quickly learn one fact, "authors cite themselves," and do a great job at predicting citations. Of course, these are exactly the citations that you would already know about, making the predictions useless.
The same holds for papers with large numbers of incoming citations. Of course a given paper will cite that cornerstone paper for its field, but it’s a pointless prediction to give to a user — they already know about it!
To deal with these issues, the Citeomatic model uses 2 techniques. First, it explicitly avoids using highly predictive but uninstructive features (viz. the authors of the query paper, citation overlap). Second, we train the model using a ranking loss function that encourages it to favor reasonable suggestions over bad ones.
The current Citeomatic model uses a fairly standard siamese network for computing textual similarity, with some additions to reflect the particular use case. A simplified diagram of the model is below:
The model jointly learns embeddings for query and candidate documents, as well as the candidate document authors. We ignore the authors of the source document to avoid overfitting to the obvious "I cite myself" predictions and to encourage a richer prediction set. The word embeddings are combined by summation, which in this case (short titles and abstracts) is efficient and works well. We then use circular correlation, as introduced by Nickel et al., to compute the similarity of the query and candidate authors and text.
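As a rough illustration of the two operations described above, the sketch below sums word embeddings and computes a circular correlation in plain Python. The function names, the `table` lookup, and the choice to skip unknown tokens are assumptions made for this example, not details of the Citeomatic source; a real implementation would normally use a tensor library and an FFT rather than the quadratic loop shown here.

```python
from typing import Dict, List


def sum_embeddings(tokens: List[str], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Combine word embeddings by summation.

    Assumption for this sketch: tokens missing from the table are skipped.
    """
    total = [0.0] * dim
    for token in tokens:
        vector = table.get(token)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += value
    return total


def circular_correlation(a: List[float], b: List[float]) -> List[float]:
    """Circular correlation: result[k] = sum_i a[i] * b[(i + k) mod d]."""
    d = len(a)
    if len(b) != d:
        raise ValueError("both vectors must have the same dimension")
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]
```

The output of `circular_correlation` has the same dimension as its inputs, so it can be passed on as a dense interaction feature instead of collapsing the two embeddings into a single similarity score.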
These interaction features are combined with additional inputs from the citation graph (incoming citations, influential citations) as well as the original sparse features (term intersection). Finally, we feed the combined representation through 2 fully-connected layers before making a prediction.
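To make the final step more concrete, here is a minimal sketch of concatenating the feature groups and passing them through two fully-connected layers. The ReLU and sigmoid activations, the layer sizes, and every name in the block (`dense`, `score_candidate`, the weight arguments) are assumptions chosen for illustration; the post does not specify them.

```python
import math
from typing import List


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully-connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def score_candidate(
    interaction: List[float],   # e.g. circular-correlation features
    graph_feats: List[float],   # e.g. citation counts from the citation graph
    sparse_feats: List[float],  # e.g. term-intersection features
    w1: List[List[float]],
    b1: List[float],
    w2: List[List[float]],
    b2: List[float],
) -> float:
    """Concatenate all features, apply two dense layers, return a score in (0, 1)."""
    combined = interaction + graph_feats + sparse_feats
    hidden = relu(dense(combined, w1, b1))   # assumed ReLU activation
    logit = dense(hidden, w2, b2)[0]         # w2 has a single output row
    return 1.0 / (1.0 + math.exp(-logit))    # assumed sigmoid output
```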
In an ideal world, when training a model, we would have access to all of the citations that the authors of a paper "should have" or "would have" included if they had enough space or time. Unfortunately, that type of training data is hard to come by. We do have access to a few million papers and the citations they did make, so how can we heuristically simulate our ideal case?
The approach we took for this initial model is fairly simple: we penalize the model more for making "really bad" decisions than we do for "less bad" decisions. To differentiate between these cases, we train the model using 3 example types:
- Second order citations: if A cites B, and B cites C, but A does not cite C, then we consider C a "hard negative." These are typically papers that are somewhat related to the source; sometimes they may even be valid citations themselves. For example, this paper on knowledge base completion has this paper on an energy function for relational data as a hard negative. They’re in the same space, but they’re not quite right.
- Search-based negatives: we issue a query to our search cluster and find documents that are considered similar to our query according to the BM25 model with some additional tuning (reasonable number of citations, etc.). These documents should look similar to our query, and in some cases they’re reasonable results.
- Easy negatives: these are randomly sampled from the corpus. In general, they’d be a terrible citation for us to predict.
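The second order and easy negatives in the list above can be sketched in a few lines of Python. This is only an illustration on a toy citation graph: the `cites` mapping, the function names, and the decisions to exclude the query paper itself and its true citations from the samples are assumptions, not details taken from the Citeomatic code. Search-based negatives are left out because they depend on an external search cluster.

```python
import random
from typing import Dict, List, Set


def hard_negatives(paper: str, cites: Dict[str, Set[str]]) -> Set[str]:
    """Papers cited by the paper's citations but not by the paper itself."""
    direct = cites.get(paper, set())
    second_order: Set[str] = set()
    for cited in direct:
        second_order |= cites.get(cited, set())
    # Assumption: drop direct citations and the query paper itself.
    return second_order - direct - {paper}


def easy_negatives(
    paper: str,
    cites: Dict[str, Set[str]],
    corpus: List[str],
    k: int,
    rng: random.Random,
) -> List[str]:
    """Randomly sample up to k papers from the corpus.

    Assumption: the query paper and its true citations are excluded.
    """
    excluded = cites.get(paper, set()) | {paper}
    pool = [p for p in corpus if p not in excluded]
    return rng.sample(pool, min(k, len(pool)))


# Toy graph: A cites B, B cites C and A, so C is a hard negative for A.
cites = {"A": {"B"}, "B": {"C", "A"}, "C": set()}
corpus = ["A", "B", "C", "D", "E"]
print(hard_negatives("A", cites))
print(easy_negatives("A", cites, corpus, k=2, rng=random.Random(0)))
```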
While this set could obviously be extended (3rd order citations, random walks, etc.), we’ve found that this division works well in practice. The goal of the Citeomatic loss function is to distinguish true citations from these 3 types of negatives. To encourage it to prefer reasonable results, we impose a larger margin between true citations and easy negatives than between true citations and the other negative types. In the presence of extremely noisy labels, this allows the model to err on the side of a "reasonable" prediction.
The hard negatives and search-based negatives are very noisy: many of them are not very good citation candidates. Nevertheless, we have found that this adaptive margin strategy reliably gives us models that work well for our users, which of course is our end goal.
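One way to picture the adaptive margin is as a hinge-style ranking loss whose margin depends on the type of negative. The sketch below assumes that form; the margin values in `MARGINS`, the type labels, and the function names are hypothetical and were picked only so that easy negatives get the largest margin, as described above.

```python
from typing import Dict, List, Tuple

# Hypothetical margins: easy negatives must be pushed further below true citations.
MARGINS: Dict[str, float] = {"hard": 0.25, "search": 0.25, "easy": 0.5}


def margin_ranking_loss(
    pos_score: float,
    neg_score: float,
    neg_type: str,
    margins: Dict[str, float] = MARGINS,
) -> float:
    """Hinge loss: zero once the true citation outscores the negative by the margin."""
    return max(0.0, margins[neg_type] + neg_score - pos_score)


def batch_loss(examples: List[Tuple[float, float, str]]) -> float:
    """Mean loss over (pos_score, neg_score, neg_type) triples."""
    if not examples:
        return 0.0
    return sum(margin_ranking_loss(p, n, t) for p, n, t in examples) / len(examples)
```

With these example margins, a true citation scored 1.0 against a negative scored 0.75 incurs no loss if the negative is a hard negative, but a loss of 0.25 if it is an easy negative, so the model is penalized more for ranking a random paper close to a true citation than for ranking a related paper close to it.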
Trying it out
If you’re interested in trying out the model on your own paper, take a look at the source code.