ETL: Filter Strategies

2024-03-28

Premium Feature — Available with all Premium Editions of LabKey Server.

This topic covers the strategies you can use to control which rows in the source your ETL acts upon, and how those rows are reflected in the destination query.

  • <sourceFilters>: An individual transform action may only need to act upon a filtered subset of rows.
  • <incrementalFilter>: The ETL as a whole should only act on the rows that have changed since the last time it was run.
Both approaches are covered below and may be used separately or together.

Source Filtering

The most common way to filter or shape data is to write a query in LabKey against the source, or to create a view on the external server housing the source. However, if you are using a remote server where filtering the data is not easily possible, or you only need minimal filtering, the ETL itself can apply filters to any of the source table's columns so that only a subset of rows is pulled. These <sourceFilters> are applied in addition to any filters configured for the incremental filtering described below.

Source filters are applied on a per-transform basis in the ETL's XML definition. As with most other filter definitions in LabKey Server, if multiple filters are defined, they are ANDed together. This example filters the source table's demographics information to include only women who are 20 years or older:

<transform id="step1forEditTest" type="org.labkey.di.pipeline.TransformTask">
    <source queryName="demographics" schemaName="etlsource">
        <sourceFilters>
            <sourceFilter column="sex" operator="eq" value="female" />
            <sourceFilter column="age" operator="gte" value="20" />
        </sourceFilters>
    </source>
    <destination queryName="demographics" schemaName="study" targetOption="merge" />
</transform>

Note that these filters are not applied to the deleted rows query (<deletedRowsSource>), if present; see Incremental Deletion of Rows below.

Incremental Filter Strategy Options

ETLs move data from a source to a destination. An incremental filter strategy is used when the ETL should track changes between runs, i.e., when subsequent runs should avoid reprocessing rows already handled by previous runs. There are several incremental filter strategy options.

A helpful technique when troubleshooting ETLs with incremental filters is to examine the query trace for the source query. The specific parameters applied (such as the dates used in modified-since comparisons) are shown in the right-hand panel of the query trace.

ModifiedSinceFilterStrategy

This strategy selects the rows that have changed since the last time the ETL was run, based on a date-time column in your source data. You might choose this option for performance reasons, or to apply business logic (such as in a trigger) that expects rows to be imported only when they have changed.

To accomplish the incremental filtering, you identify the column to use, which must be of type "Date Time". Each time the ETL runs, the framework saves the highest value seen in this column. On the next run, only rows with a value in that column greater than the stored value will be acted upon.

In a case where the ETL source is a LabKey dataset, you could use the built-in "Modified" column, which is automatically updated every time a row is changed.

...
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Modified" />
...

You can also use a column of another name, as long as it is of type "Date Time". An important distinction is that the behavior of this filter strategy is not inherently tied to the time the ETL is run. The first time you run the ETL, the filter value used is the maximum value in the specified column, so all rows will "match". On subsequent runs, only rows with a value greater than the previously stored maximum are acted upon, and the stored maximum is updated again. This can lead to unexpected results depending on the column you choose.

For instance, suppose you used a study "StartDate" column for your ModifiedSinceFilterStrategy, and the first batch of participants was added with start dates in 2008. The first ETL run will act on all rows. If you then add a new batch of participants with start dates in 2012 and run the ETL, it will act on that new batch. However, if you later add a set of participants with start dates in 2010, running the ETL will not act on those rows, because the "modified since" value was set to 2012 by the second ETL run. In a scenario like this, you would want to use an actual row creation or modification date rather than a user-entered start date.
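
As a sketch, a configuration using a custom date column might look like the following. The column name "DateCreated" is hypothetical; per the caution above, it should be a value set automatically by the system rather than entered by users.

...
<!-- "DateCreated" is a hypothetical, system-maintained creation timestamp column -->
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="DateCreated" />
...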

Note that if your intention is to use an ETL with a ModifiedSinceFilterStrategy to keep two sets of data "in sync", you also need to consider rows deleted from the source query, as they will no longer exist to "match" and be acted upon. See Incremental Deletion of Rows below.

RunFilterStrategy

The Run filter strategy relies on a separate run table, specified in the XML. The ETL will operate against rows related to a specified "run", which a further upstream process has used to stage records in the source. You specify the column to check, typically an increasing integer column (e.g. RunId, BatchId, TransferId). The source rows must have a corresponding run (or batch, etc.) column indicating the run in which they were staged. Any rows with a "run" value higher than the value recorded when the ETL last executed will be acted upon.

This strategy is often used for relational data, multi-stage transfer pipelines, or cases where an earlier upstream process writes a batch of parent-child data to the source. It is useful when child records must be transferred at the same time as their parent records.
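
As a sketch, a configuration might look like the following; the schema, table, and column names here are placeholders for your own staging setup, with pkColumnName naming the run table's key column and fkColumnName the matching column in the source rows:

...
<!-- "patient", "Transfer", "RowId", and "TransferId" are placeholder names -->
<incrementalFilter className="RunFilterStrategy"
    runTableSchema="patient" runTable="Transfer"
    pkColumnName="RowId" fkColumnName="TransferId" />
...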

SelectAllFilterStrategy

Select all data rows from the source, applying no filter. This is the default when no incremental filter is specified.
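
Because this is the default behavior, the element can simply be omitted; if you prefer to declare the strategy explicitly, a minimal declaration looks like this:

...
<incrementalFilter className="SelectAllFilterStrategy" />
...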

Incremental Deletion of Rows

When an ETL runs, the source data is queried in accordance with the ETL filter strategy, and the results are inserted or updated per the destination target option. However, rows deleted from the source will not appear in the results of the ETL query for new/updated rows, as they no longer exist in the source, so the deletions will not be propagated to the destination data. There are two options for representing those deletions in the destination table.

For ETLs that write to a file as the destination, incremental deletion is not an option, as the file is not edited upon subsequent runs.

Use Truncate

If you have a small, quick-running ETL and use the "truncate" target option, then every time the ETL runs, all rows in the destination are deleted and replaced with the rows that remain (i.e., were not deleted) in the source. While this effectively accomplishes incremental deletion, it may not be practical for large queries and complex ETLs.
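
For instance, a destination configured as in this sketch (reusing the tables from the source filtering example above) would be fully rebuilt on every run:

...
<destination queryName="demographics" schemaName="study" targetOption="truncate" />
...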

Use deletedRowsSource

When you include a <deletedRowsSource> element, you can identify another table (or view) on the source side that keeps a record of deleted rows. This table should contain the key value from the source table for each deleted row, as well as a timestamp indicating when it was deleted. After an ETL with an incremental filter imports any updated/created rows, it next queries the deletedRowsSource for any timestamps later than the last time the ETL ran, and deletes from the target table any rows returned by this query. It matches these rows using the keys identified in the deletedRowsSource attributes deletedSourceKeyColumnName and targetKeyColumnName. Most commonly, audit tables, or views created from audit tables, are used as the deletedRowsSource.

The <deletedRowsSource> element is part of the <incrementalFilter> section. For example, here we have a "ModifiedSince" filter that will also check the query "projects_deletedRowsSource" for rows to delete:

...
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="modified">
    <deletedRowsSource schemaName="etltest" queryName="projects_deletedRowsSource"
        deletedSourceKeyColumnName="project" timestampColumnName="modified" targetKeyColumnName="project"/>
</incrementalFilter>
...

<deletedRowsSource> allows you to specify both deletedSourceKeyColumnName and targetKeyColumnName, which are used to match rows for deletion from the target.

Even if there are no new (or recently modified) rows in the source query, any new records in the deletedRowsSource will still be detected and the corresponding rows deleted from the target.

Note that using <deletedRowsSource> with an <incrementalFilter> strategy does not support a remote connection.
