Creating a Custom Extraction using NLP

Natural Language Processing (NLP) is a way of using computer systems to interpret text in much the same way a person would. At its core, NLP aims to understand human language naturally, without exhaustive sets of processing rules, and it can yield much richer results than a simple keyword search.

In a basic implementation, NLP uses content, context, and syntax to identify text strings that fall into what are called Default Named Entities. Typically these include categories such as:

  • Geographic Places
  • Administrative Places
  • People
  • Groups
  • Events
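
As an illustration of the general idea, the short sketch below runs a stock spaCy English model over a sentence and prints the default named entities it finds. spaCy is used here only as a familiar example; the model name and sample sentence are placeholders, and Voyager's NLP service may use a different engine.

# Illustrative only: default named-entity extraction with a stock spaCy model.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The 2019 floods near Cedar Rapids, Iowa forced farmers to replant corn in June.")

for ent in doc.ents:
    # Prints each recognized span with its default label, e.g. GPE, DATE, PERSON, ORG, EVENT
    print(ent.text, ent.label_)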

Custom Named Entities

Data sets often contain more, and more diverse, entities than the standard set. These are called Custom Named Entities, and they are meaningful for a particular set of data or data types. Before NLP can extract them during indexing, a model must be trained to recognize them. Before training, you need to gather the following:

  1. A list of terms representing valid examples of the entities you want to recognize. The more sample terms you have, the better the model can be trained.
  2. Text-based documents containing these terms to use for sentence extraction. PDFs work really well for this. You need a very large sample to properly train the model; a good target is about 10,000 sentences for each entity type. More is better.
  3. A spreadsheet containing the hierarchy definition. 

The Entity Training Process

There are three phases of Entity training:

  • Sentence Tagging
  • Entity Training
  • Creating Hierarchies

Sentence Tagging

  1. Create a new content repository (Voyager location) with the provided sample documents.
  2. Apply a sentence-tagging pipeline step that splits the text content of each document into sentences and then searches those sentences for any of the terms from the provided list.
  3. When a term is found, it is tagged in the sentence, i.e., <entity_type>term</entity_type>.
  4. That tagged sentence is written to a text file, which will contain all of the tagged sentences for that entity type.
  5. There will be a separate text file of tagged sentences for each entity type (a sketch of this step follows the example file names below). For example:

tagged_sentences_farms.txt

tagged_sentences_fields.txt

tagged_sentences_crops.txt
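
The following is a minimal sketch of the tagging step described above, assuming a plain-text term list with one term per line and a file containing the extracted text of a document. The file names and the "farm" entity type are placeholders, and Voyager's pipeline step may split sentences differently.

# Illustrative sketch of sentence tagging for a single entity type.
import re

with open("farm_terms.txt") as f:
    terms = [line.strip() for line in f if line.strip()]

with open("document_text.txt") as f:
    text = f.read()

# Naive sentence split on end-of-sentence punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

with open("tagged_sentences_farms.txt", "a") as out:
    for sentence in sentences:
        for term in terms:
            if term in sentence:
                # Wrap the matched term in entity tags and save the sentence.
                out.write(sentence.replace(term, "<farm>" + term + "</farm>") + "\n")
                break  # tag one term per sentence to keep the training data simple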

Once all of the entities have been identified, the next step is Entity Training.

Entity Training

  1. Run a script against the tagged sentence files, which loads an out-of-the-box model (the default spaCy English model) and trains that model with the new entities. It reads each sentence, calculates the position of the tagged entity in that sentence, then tells the model what that entity is (a farm, a field, a crop, etc.) and where in the sentence it is located. The model then learns how each entity type typically appears (see the sketch after this list).
  2. Test the accuracy of the model by asking it to find those entities in various text examples.
  3. Save this newly trained model as a pip-installable gzip, and export it to any environment in which you want to run the new NLP model.
  4. Load the new model in the NLP Service (or create a second NLP service that uses the new model). The model can be installed on any machine already using spaCy by running pip install <model_name.gzip>
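
As a minimal sketch of the training script, the example below assumes spaCy v3, the default English model (en_core_web_sm), one tagged term per sentence, and a single FARM entity type. The file and label names are placeholders, and the actual script may structure the training loop differently.

# Illustrative training sketch for one custom entity type (FARM).
import random
import re
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
nlp.get_pipe("ner").add_label("FARM")  # register the new entity type

TAG = re.compile(r"<farm>(.*?)</farm>")

def to_training_pair(tagged_sentence):
    # Strip the tags and record where the entity starts and ends in the plain text.
    match = TAG.search(tagged_sentence)
    term = match.group(1)
    text = tagged_sentence[:match.start()] + term + tagged_sentence[match.end():]
    start = match.start()
    return text, {"entities": [(start, start + len(term), "FARM")]}

with open("tagged_sentences_farms.txt") as f:
    train_data = [to_training_pair(line.strip()) for line in f if line.strip()]

# Update only the NER component, leaving the rest of the pipeline untouched.
with nlp.select_pipes(enable=["ner"]):
    optimizer = nlp.resume_training()
    for _ in range(10):  # number of passes is illustrative
        random.shuffle(train_data)
        for text, annotations in train_data:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer)

# Quick spot check, then save; the saved directory can be packaged for pip
# installation with spaCy's packaging tooling.
doc = nlp("Harvest at Willow Creek Ranch begins in June.")
print([(ent.text, ent.label_) for ent in doc.ents])
nlp.to_disk("custom_farm_model")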

Creating Hierarchies

  1. In Voyager, create a content repository with documents on which to run this new NLP. Ideally, these are not the same documents used to extract and tag sentences to create the model.
  2. Add the NLP step, which will pass the text content of each document to the service and retrieve a list of the new entities found in the text.
  3. With the response, search the taxonomy spreadsheet for each term in the appropriate column. For example, if NLP returned values for the new FARM entity, find those values in the Farm column of the spreadsheet.
  4. If the term is found, add the parents defined in the taxonomy spreadsheet as additional fields. For example, if we find a crop, we add the corresponding field and farm fields. Also add any other augmenting data in this step, such as alternate names and geographic locations.
  5. Use the fields created in the NLP step to create a hierarchy definition, which will eventually be built through a user interface but is currently created by hand (it could also be generated from the spreadsheet). The definition describes the hierarchy of the fields; each field can have multiple children, which can have children of their own, and so on.
  6. The hierarchy definition is used to generate a Solr call, building out the JSON facet according to the hierarchy. This results in a Solr response with all of the results organized into the hierarchy, which can be represented appropriately in the UI. This allows a visual representation of the hierarchical Solr documents and the ability to click through at any level to see all of the results in Navigo. This output is fully customizable (see the sketch after this list).
The Sentence Tagging and Entity Training steps only need to be done when training new entities and creating new models. Once a model has been trained and its accuracy is acceptable, it can be reused for the Creating Hierarchies step.
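
The following is a minimal sketch of steps 3, 4, and 6 above, assuming the taxonomy spreadsheet has been exported to a CSV file with Farm, Field, and Crop columns. The file name, field names, and hierarchy definition format are placeholders; the actual pipeline step and hierarchy definition may be structured differently.

# Illustrative sketch: taxonomy lookup and Solr JSON facet generation.
import csv
import json

# taxonomy.csv (illustrative contents):
# Farm,Field,Crop
# Willow Creek Ranch,North Forty,winter wheat
taxonomy = {}
with open("taxonomy.csv", newline="") as f:
    for row in csv.DictReader(f):
        taxonomy[row["Crop"]] = {"field": row["Field"], "farm": row["Farm"]}

# Step 4: augment a value returned by the NLP service with its parent fields.
crop = "winter wheat"
doc_fields = {"crop": crop, **taxonomy.get(crop, {})}

# Step 6: turn a hierarchy definition into a nested Solr JSON facet.
hierarchy = {"field": "farm",
             "children": [{"field": "field",
                           "children": [{"field": "crop", "children": []}]}]}

def to_json_facet(node):
    facet = {"type": "terms", "field": node["field"], "limit": -1}
    if node["children"]:
        facet["facet"] = {child["field"]: to_json_facet(child) for child in node["children"]}
    return facet

# The result is passed as the json.facet parameter of the Solr query.
print(json.dumps({"farm": to_json_facet(hierarchy)}, indent=2))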

The entire process may take several days to complete, depending on the size of the document archive.
