DIY Information Extraction

Data scientists have a set of tools to work with structured data in tables. But how does one extract meaning from textual data? While NLP provides some solutions, they all require expertise in machine learning, linguistics, or both. How do we expose advanced AI and text mining capabilities to domain experts who do not know ML or CS?
About DIY Information Extraction
  • Extractive search over CORD-19 with 3 powerful query modes | AI2 Israel, DIY Information Extraction

    SPIKE-CORD is a powerful sentence-level, context-aware, and linguistically informed extractive search system for exploring the CORD-19 corpus.

    Try the demo
  • Powerful extractive search | AI2 Israel, DIY Information Extraction

    SPIKE is a powerful sentence-level, context-aware, and linguistically informed extractive search system. Try SPIKE over one of our provided datasets.

    Try the demo
    • Interactive Extractive Search over Biomedical Corpora

      Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, Yoav Goldberg | ACL 2020
      We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as using patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts at dependency-based search, we introduce a light-weight query language that does not require the user to know the details of the underlying linguistic representations, and instead lets them query the corpus by providing an example sentence coupled with simple markup. Search is performed at an interactive speed due to an efficient linguistic graph indexing and retrieval engine. This allows for rapid exploration, development and refinement of user queries. We demonstrate the system using example workflows over two corpora: the PubMed corpus including 14,446,243 PubMed abstracts and the CORD-19 dataset, a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at https://allenai.
    • pyBART: Evidence-based Syntactic Transformations for IE

      Aryeh Tiktinsky, Yoav Goldberg, Reut Tsarfaty | ACL 2020
      Syntactic dependencies can be predicted with high accuracy, and are useful for both machine-learned and pattern-based information extraction tasks. However, their utility can be improved. These syntactic dependencies are designed to accurately reflect syntactic relations, and they do not make semantic relations explicit. Therefore, these representations lack many explicit connections between content words that would be useful for downstream applications. Proposals like English Enhanced UD improve the situation by extending universal dependency trees with additional explicit arcs. However, they are not available to Python users, and are also limited in coverage. We introduce a broad-coverage, data-driven and linguistically sound set of transformations that makes event structure and many lexical relations explicit. We present pyBART, an easy-to-use open-source Python library for converting English UD trees either to Enhanced UD graphs or to our representation. The library can work as a standalone package or be integrated within a spaCy NLP pipeline. When evaluated in a pattern-based relation extraction scenario, our representation results in higher extraction scores than Enhanced UD, while requiring fewer patterns.
    • Syntactic Search by Example

      Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, Yoav Goldberg | ACL 2020
      We present a system that allows a user to search a large linguistically annotated corpus using syntactic patterns over dependency graphs. In contrast to previous attempts to this effect, we introduce a light-weight query language that does not require the user to know the details of the underlying syntactic representations, and instead lets them query the corpus by providing an example sentence coupled with simple markup. Search is performed at an interactive speed due to an efficient linguistic graph indexing and retrieval engine. This allows for rapid exploration, development and refinement of syntax-based queries. We demonstrate the system using queries over two corpora: the English Wikipedia, and a collection of English PubMed abstracts. A demo of the Wikipedia system is available at: https://allenai.github.io/spike/
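
The abstracts above describe two related ideas: searching with patterns over dependency graphs, and adding explicit arcs to a dependency tree so that fewer patterns are needed. The following is a minimal toy sketch of both ideas in plain Python. It is not SPIKE's query language or pyBART's API; the function names, the hand-annotated example sentence, and the single transformation shown (copying a subject to a conjoined verb, one of the arcs Enhanced UD makes explicit) are illustrative assumptions.

```python
from typing import List, Set, Tuple

# A dependency arc is (head index, relation label, dependent index).
Arc = Tuple[int, str, int]


def propagate_conj_subjects(arcs: Set[Arc]) -> Set[Arc]:
    """Return a copy of `arcs` in which each conjoined head that has no
    subject of its own receives an explicit nsubj arc copied from the
    first conjunct (hypothetical helper, for illustration only)."""
    enhanced = set(arcs)
    for head, rel, dep in arcs:
        if rel != "conj":
            continue
        if any(h == dep and r == "nsubj" for h, r, _ in arcs):
            continue  # the second conjunct already has its own subject
        for h, r, d in arcs:
            if h == head and r == "nsubj":
                enhanced.add((dep, "nsubj", d))
    return enhanced


def match_subject_object(tokens: List[str], arcs: Set[Arc]) -> List[Tuple[str, str, str]]:
    """Apply one fixed pattern: a head with both an nsubj and an obj
    dependent. Returns (subject, head, object) word triples."""
    results = []
    for verb in range(len(tokens)):
        subjects = sorted(d for h, r, d in arcs if h == verb and r == "nsubj")
        objects = sorted(d for h, r, d in arcs if h == verb and r == "obj")
        for s in subjects:
            for o in objects:
                results.append((tokens[s], tokens[verb], tokens[o]))
    return results


# Hand-annotated toy sentence (assumed parse, not produced by a parser).
tokens = ["Aspirin", "inhibits", "COX-1", "and", "reduces", "inflammation"]
basic_arcs = {
    (1, "nsubj", 0),  # inhibits -> Aspirin
    (1, "obj", 2),    # inhibits -> COX-1
    (1, "conj", 4),   # inhibits -> reduces
    (4, "cc", 3),     # reduces -> and
    (4, "obj", 5),    # reduces -> inflammation
}

print(match_subject_object(tokens, basic_arcs))
print(match_subject_object(tokens, propagate_conj_subjects(basic_arcs)))
```

On the basic tree, the single pattern finds only the first clause; after the subject is propagated across the conjunction, the same pattern also recovers the second clause, which is the sense in which explicit arcs can reduce the number of patterns required.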

    Team

    AI2 Israel Members

    • Yoav Goldberg, Research Director, AI2 Israel
    • Hillel Taub-Tabib, Research & Engineering
    • Micah Shlain, Research & Engineering
    • Matan Eyal, Research & Engineering
    • Shoval Sadde, Linguistics

    Interns

    • Aryeh Tiktinsky, Intern
    • Shauli Ravfogel, Intern