This tutorial shows how to create a new Pipeline step in HQ using Python. Pipelines provide functions that transform and enhance data as it is indexed. The tutorial covers how to add a new field to an entry and how to remove a field from an entry.
Before you begin, make sure you know how to set up and apply a pipeline to a repository in HQ. For more information, see Creating Pipelines.
Follow the steps below to create the Python pipeline.
Make sure the following are installed:
Go to your HQ home directory:
To create the pipeline:
Take some time to examine the Python code and read its docstrings and comments. The entry passed to the run function is a Python dictionary that holds the entry's field names and values under the 'fields' key. An entry looks like this:
{
    'fields': {
        'meta_table_name': 'world_countries.csv',
        'name': 'Vanuatu',
        'repository': 'r16524da57d1',
        'format': 'text/csv-record',
        'format_category': 'Office',
        'fs_SQMI': '3265.07',
        'fs_FIPS_CNTRY': 'NH',
        'fs_STATUS': 'UNMemberState',
        'fs_POP2005': '205754',
        'format_type': 'Record'
    }
}
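Given an entry shaped like the one above, adding and removing fields comes down to ordinary dictionary operations on entry['fields']. The following is a minimal sketch, not the code generated by HQ. It assumes the run function takes the entry and the config in that order and returns the modified entry. The field name fs_NAME_UPPER and the choice to drop fs_STATUS are hypothetical, used only for illustration.

```python
import copy


def run(entry, config):
    """Hypothetical pipeline step: add one field and remove another.

    Assumption: the step receives the entry dictionary and the pipeline
    configuration, and returns the modified entry. The config is not
    used in this sketch.
    """
    # Work on a copy so the caller's dictionary is left untouched.
    updated = copy.deepcopy(entry)
    fields = updated.setdefault('fields', {})

    # Add a new field derived from an existing one (hypothetical name).
    name = fields.get('name')
    if name is not None:
        fields['fs_NAME_UPPER'] = name.upper()

    # Remove a field; pop with a default avoids a KeyError if it is absent.
    fields.pop('fs_STATUS', None)

    return updated


if __name__ == '__main__':
    sample = {'fields': {'name': 'Vanuatu', 'fs_STATUS': 'UNMemberState'}}
    print(run(sample, {}))
```

Copying the entry before changing it is a design choice for the sketch: it keeps the original available for comparison while testing. A real step might modify the entry in place instead.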
The config argument passed to the run function is the pipeline configuration, which includes the parameters and their values. After creating and saving a pipeline, you can view its configuration by opening the pipelines.json file in the config directory of your HQ home location.
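To inspect the saved configuration without opening the file by hand, a short script can load and pretty-print pipelines.json. This sketch assumes only what is stated above: that the file is named pipelines.json, sits in the config directory under the HQ home location, and contains valid JSON. The helper names and the placeholder path are hypothetical, and nothing is assumed about the structure of the file's contents.

```python
import json
from pathlib import Path


def load_pipelines_config(hq_home):
    """Load pipelines.json from the config directory under hq_home."""
    config_path = Path(hq_home) / 'config' / 'pipelines.json'
    with config_path.open(encoding='utf-8') as handle:
        return json.load(handle)


def show_pipelines_config(hq_home):
    """Print the pipeline configuration in an indented, readable form."""
    config = load_pipelines_config(hq_home)
    print(json.dumps(config, indent=2, sort_keys=True))


if __name__ == '__main__':
    # Placeholder path: replace with your own HQ home location.
    show_pipelines_config('/path/to/hq_home')
```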