InfoMator - Information Extraction
Who
  • InfoMator was designed and developed by Dean H. Nelson (Salt Lake City, UT) with the testing assistance of NarraText's other original co-founders, Tina Mercer and Rock Schindler.
What
  • The name "InfoMator" is a coinage derived from "informate" (to dispense information) and "-ator" (one that performs that function), conveying that InfoMator(tm) "dispenses structurized extractions from unstructured text".
  • InfoMator is a workflow-based software prototype used to construct, correct, and parse domain-specific text datasets into data vectors amenable to quantitative and qualitative statistical analysis.
  • InfoMator's workflow architecture is based on NarraText's adaptation of CRISP-DM (the CRoss-Industry Standard Process for Data Mining).
  • InfoMator, as a software prototype, emphasizes two prototyping perspectives:
      1) a fully transparent and fully functional software prototype that produces results for business use cases,
      2) a tested/validated program specification that can be used to engineer the same functionality in any desired programming language (e.g., Java, Python, Perl, Visual Basic, .NET) with minimal testing costs.

    Furthermore, InfoMator has an embedded code generator (aka "Snipper"(tm)) that is used to generate
    early-binding, executable scripts and software artifacts that run to produce the desired results.
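    As a rough illustration of early-binding code generation in the spirit of Snipper,
    the following Python sketch renders a small extraction script as source text and
    executes it immediately; the template, pattern, and all names are hypothetical,
    not Snipper's actual mechanism.

      from string import Template

      # Hypothetical script template; the extraction pattern is bound into the
      # generated source at generation time (early binding).
      SCRIPT_TEMPLATE = Template('''\
      import re

      def extract(line):
          """Auto-generated extractor with the pattern bound at generation time."""
          match = re.search(r"$pattern", line)
          return match.group(1) if match else None

      if __name__ == "__main__":
          print(extract($sample))
      ''')

      def generate_script(pattern, sample):
          """Render an executable, early-bound extraction script as source text."""
          return SCRIPT_TEMPLATE.substitute(pattern=pattern, sample=repr(sample))

      source = generate_script(r"claim\s+#?(\d+)", "Reviewed claim #4711 today")
      # The generated artifact can be written to disk and run on its own, or run here:
      exec(compile(source, "<generated>", "exec"))   # prints: 4711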
Where
  • Where Used: InfoMator was originally deployed in the insurance market to extract information from unstructured claims notes.
  • However, InfoMator's design has focused on the rapid development and implementation of new functionality involving:
  1.   structurization of unstructured data
  2.   extraction of user-defined extraction types
  3.   data sanitization and masking (see the sketch after this list)
to accommodate use-case reporting and analysis requirements in any industry or market.
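    InfoMator's actual masking rules are configurable per use case and are not
    reproduced here; the Python sketch below (patterns and placeholder tags are
    hypothetical) shows the general shape of rule-driven sanitization and masking.

      import re

      # Illustrative masking rules only; a production rule set is domain-configurable.
      MASKING_RULES = [
          (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),          # US SSNs
          (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),  # email addresses
          (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "<PHONE>"),
      ]

      def sanitize(text):
          """Apply each masking rule in order, replacing matches with placeholder tags."""
          for pattern, tag in MASKING_RULES:
              text = pattern.sub(tag, text)
          return text

      print(sanitize("Claimant (SSN 123-45-6789, jdoe@example.com) called 801-555-0142."))
      # -> Claimant (SSN <SSN>, <EMAIL>) called <PHONE>.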
When
  • InfoMator's development began in the fall of 2010.
  • InfoMator's current version is 3.3.
How 
  • InfoMator is based on a workflow process structure that is supported by a Core Knowledge Base (CKB) consisting of Core LeXicons (CLX) and Core TaXonomies (CTX); the first sketch after this list illustrates these structures.
  • Smaller, richer, operational domain-specific knowledge bases are produced by intersecting domain-specific dataset contents with a domain-specific "Knowledge Sieve" that captures and prioritizes Core Knowledge artifacts via a Dataset LeXicon (DTLX) and a Dataset TaXonomy (DTX), as the second sketch below illustrates.
  • All dataset lexical items that are not known in the DTLX & DTX are made available to InfoMator's knowledge engineering functions to mark and identify unknown acronyms and abbreviations and to correct spelling errors (e.g., "speling erors"), broken tokens (e.g., "br  oken tok  en"), and touching tokens (e.g., "touchingtokens"); see the third sketch below.
  • Knowledge engineering results reside in a User LeXicon (ULX), which is dynamically combined with the Dataset LeXicon to complement domain knowledge.
  • Once the target dataset is corrected (and normalized), each sentence (token vector) is assigned a unique Syntactic-Tag Vector and one or more Semantic-Tag Vectors (fourth sketch below).
  • Data Vectors consisting of "structured data values", grammatical constituents, discourse constituents, and user-defined use-case extractions are parsed from the combined token-syntax-semantics vectors (fifth sketch below).
  • The data vector extractions can then be distilled into higher-level extractions (thematic roles, triples, generative grammar constituents, epistemological modality, eventuality (fact vs. nonfact), etc.), as the sixth sketch below illustrates.
  • The resulting use-case data vectors provide rich, high-dimensional input for Statistical Analyses and training sets for Machine Learning methodologies.
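    First sketch (Core Knowledge Base): the CKB's internal format is proprietary, so
    this Python sketch only illustrates the kind of structures a Core LeXicon and a
    Core TaXonomy imply; every entry and field name is hypothetical.

      from dataclasses import dataclass, field

      @dataclass
      class LexiconEntry:
          """One Core LeXicon (CLX) item; the fields are illustrative."""
          term: str
          pos: str                                   # part of speech
          senses: list = field(default_factory=list)

      CORE_LEXICON = {
          "whiplash": LexiconEntry("whiplash", "NOUN", ["cervical hyperextension injury"]),
          "claimant": LexiconEntry("claimant", "NOUN", ["party filing a claim"]),
      }

      # A Core TaXonomy (CTX) modeled as child-concept -> parent-concept links.
      CORE_TAXONOMY = {
          "whiplash": "neck_injury",
          "neck_injury": "bodily_injury",
          "bodily_injury": "injury",
      }

      def ancestors(concept):
          """Walk the taxonomy upward from a concept toward its root."""
          chain = []
          while concept in CORE_TAXONOMY:
              concept = CORE_TAXONOMY[concept]
              chain.append(concept)
          return chain

      print(ancestors("whiplash"))   # ['neck_injury', 'bodily_injury', 'injury']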
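    Second sketch (Knowledge Sieve): a hypothetical sieving pass that keeps only the
    core-lexicon terms actually occurring in the dataset, ranked by corpus frequency;
    a DTX would be derived the same way by restricting the CTX to surviving concepts.

      from collections import Counter

      def sieve(dataset_sentences, core_lexicon):
          """Build a Dataset LeXicon (DTLX): core terms present in the dataset,
          prioritized by how often they occur."""
          counts = Counter(
              token.lower()
              for sentence in dataset_sentences
              for token in sentence.split()
          )
          dtlx = {term: counts[term] for term in core_lexicon if counts[term] > 0}
          return dict(sorted(dtlx.items(), key=lambda kv: -kv[1]))

      sentences = ["Claimant reports whiplash", "Claimant called adjuster"]
      print(sieve(sentences, {"claimant", "whiplash", "subrogation"}))
      # -> {'claimant': 2, 'whiplash': 1}; unused 'subrogation' is sieved out.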
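    Third sketch (unknown-token workflow): tokens absent from the DTLX are queued for
    knowledge engineering, and accepted corrections land in a User LeXicon that
    dynamically complements the DTLX; the splitting heuristic and entries are
    illustrative, not InfoMator's actual algorithms.

      DTLX = {"claimant", "reports", "whiplash", "touching", "tokens"}
      ULX = {}   # User LeXicon: surface form -> corrected form

      def find_unknowns(tokens):
          """Tokens covered by neither the DTLX nor an existing ULX correction."""
          return [t for t in tokens if t not in DTLX and t not in ULX]

      def repair_touching(token, vocab):
          """Try to split a touching token (e.g. 'touchingtokens') into known words."""
          for i in range(2, len(token) - 1):
              left, right = token[:i], token[i:]
              if left in vocab and right in vocab:
                  return left + " " + right
          return None

      for unknown in find_unknowns("claimant reports whiplsh touchingtokens".split()):
          fix = repair_touching(unknown, DTLX)
          if fix:
              ULX[unknown] = fix
          # Misspellings such as 'whiplsh' would instead go to an analyst (or an
          # edit-distance matcher) for confirmation before entering the ULX.

      effective_lexicon = DTLX | set(ULX)   # the ULX dynamically complements the DTLX
      print(ULX)                            # {'touchingtokens': 'touching tokens'}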
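    Fourth sketch (tag vectors): InfoMator's tag inventories are not public; this only
    shows the shape of the data, one syntactic tag per token plus one or more aligned
    semantic layers.

      sentence = ["claimant", "reports", "whiplash"]

      syntactic_tags = ["NOUN", "VERB", "NOUN"]      # the unique Syntactic-Tag Vector

      semantic_tag_vectors = [                       # one or more Semantic-Tag Vectors
          ["PARTY", "REPORT_EVENT", "INJURY"],       # e.g., a domain-entity layer
          ["AGENT", "PREDICATE", "THEME"],           # e.g., a role layer
      ]

      # The combined token-syntax-semantics view is a position-wise zip:
      for layer in semantic_tag_vectors:
          print(list(zip(sentence, syntactic_tags, layer)))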
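    Fifth sketch (parsing Data Vectors): a toy extraction pass that collects spans
    whose semantic tags match a user-defined extraction type; tags and types are
    hypothetical.

      combined = [
          ("claimant", "NOUN", "PARTY"),
          ("reports",  "VERB", "REPORT_EVENT"),
          ("severe",   "ADJ",  "INJURY"),
          ("whiplash", "NOUN", "INJURY"),
      ]

      def extract_type(vectors, wanted_tag):
          """Collect maximal runs of tokens carrying the requested semantic tag."""
          spans, current = [], []
          for token, _syntax, semantics in vectors:
              if semantics == wanted_tag:
                  current.append(token)
              elif current:
                  spans.append(" ".join(current))
                  current = []
          if current:
              spans.append(" ".join(current))
          return spans

      print(extract_type(combined, "INJURY"))   # ['severe whiplash']
      print(extract_type(combined, "PARTY"))    # ['claimant']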
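    Sixth sketch (distillation): a toy fold from role-labeled extractions into
    subject-predicate-object triples, carrying an eventuality judgment along as
    metadata; real thematic-role and modality analysis is far richer than this.

      extractions = [
          {"AGENT": "claimant", "PREDICATE": "reports",  "THEME": "severe whiplash",
           "EVENTUALITY": "fact"},       # asserted event
          {"AGENT": "adjuster", "PREDICATE": "may deny", "THEME": "claim",
           "EVENTUALITY": "nonfact"},    # hedged/modal event
      ]

      def to_triples(records):
          """Distill role-labeled extractions into (subject, predicate, object)
          triples plus the fact-vs-nonfact judgment."""
          return [((r["AGENT"], r["PREDICATE"], r["THEME"]), r["EVENTUALITY"])
                  for r in records]

      for triple, eventuality in to_triples(extractions):
          print(triple, "->", eventuality)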
Why
  • InfoMator was developed (1) to improve the levels of precision and recall in text extractions, (2) to provide rapid prototyping development, and (3) to extend functional robustness in our Knowledge Engineering and Text Extraction efforts beyond what we could achieve using a commercial Text Analytics software package, Attensity 5, which was in use at NarraText beginning in 2009.