Transform Scripts

2024-03-28

This topic is under construction for the 24.3 (March 2024) release of LabKey Server. For current documentation of this feature, click here.

Transform scripts are attached to assay designs, run before the assay data is imported, and can reshape the data file to match the expected import format. Several scripts can run sequentially to perform different transformations. The extension of the script file identifies the scripting engine that will be used to run the script. For example, a script named test.pl will be run with the Perl scripting engine.

Transform scripts (which are always attached to assay designs) are different from trigger scripts, which are attached to other table types (datasets, lists, sample types, etc.).

Scenarios

A wide range of scenarios can be addressed using transform scripts. For example:

  • Input files might not be "parsable" as is. Instrument-generated files often contain header lines before the main data table, denoted by a leading #, !, or other symbol. These lines may contain useful metadata about the protocol, reagents, or samples tested which should either be incorporated into the data import or skipped over to find the main data to import.
  • File or data formats in the file might not be optimized for efficient storage and retrieval. Display formatting, special characters, etc. might be unnecessary for import. Transformation scripts can clean, validate, and reformat imported data.
  • During import, display values from a lookup column may need to be mapped to foreign key values for storage.
  • You may need to fill in additional quality control values with imported assay data, or calculate contents of a new column from columns in the imported data.
  • You may need to inspect and change the data, populate empty columns, or modify run- and batch-level properties. If validation only needs to be done for particular single field values, the simpler mechanism is to use a validator within the field properties for the column.
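
As a minimal sketch of the lookup-mapping and calculated-column scenarios above, the following Python function rewrites result rows in memory. The column names (Sample, Signal, Background, NetSignal), the lookup table, and the function name are hypothetical and chosen only for illustration; a real script would use the fields defined in its own assay design.

```python
# Hypothetical lookup: display value -> foreign key value to store.
SAMPLE_KEYS = {"Control A": 101, "Control B": 102}


def transform_rows(rows, sample_keys=SAMPLE_KEYS):
    """Return new rows with lookup keys substituted and a calculated column added.

    Each row is a dict keyed by column name; all names here are illustrative.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        # Map the display value to its key. Unknown values are left untouched
        # so that later validation can report them.
        new_row["Sample"] = sample_keys.get(row["Sample"], row["Sample"])
        # Calculate a new column from two imported columns.
        new_row["NetSignal"] = float(row["Signal"]) - float(row["Background"])
        out.append(new_row)
    return out


rows = [{"Sample": "Control A", "Signal": "12.5", "Background": "2.5"}]
print(transform_rows(rows))
```

Any column that a script fills in this way, such as the illustrative NetSignal, must already be defined in the assay design.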

Scripting Prerequisites

Any scripting language that can be invoked via the command line and has the ability to read/write files is supported for transformation scripts, including:

  • Perl
  • Python
  • R
  • Java
Before you can run scripts, you must configure the necessary scripting engine on your server. If you are missing the necessary engine, or the desired engine does not have the script file extension you are using, you'll get an error message similar to:
A script engine implementation was not found for the specified QC script (my_transformation_script.py). Check configurations in the Admin Console.

Permissions

In order to upload transform scripts and attach them to an assay design, the user must have the Platform Developer or Site Administrator role. Once an authorized user has added a script, it will be run any time data is imported using that design.

Users who can edit assay designs but are not Platform Developers or Site Administrators will be able to edit other aspects of the design, but will not see the transformation script options.

How Transformation Scripts Work

Transformation and validation scripts are invoked in the following Script Execution Sequence:

1. A user imports assay result data and supplies run and batch properties.

2. The server uses that input to create:

  • (1) A "runProperties.tsv" file.
    • This file contains assay-specific properties, run and batch level fields, and generated paths to various resources, all of which can be used by your transformation script.
    • Learn more in the topic: Run Properties Reference.
  • (2) If the first 'non-comment' line of the imported result data contains fields that match the assay design, the server will infer/generate a "runDataFile" which is a TSV based on the result data provided by the user (which may be in another file format like Excel or CSV). This preprocessed data TSV can then be transformed.
    • Note that if the server cannot infer this file, whether because the "relevant" part is not at the top, the column names are not recognizable to the assay mechanism, or the data generally isn't tabular/'rectangular', you cannot use the generated "runDataFile". Instead, you can use the originally uploaded file in its original format (as "runDataUploadedFile").
    • Regardless of whether the server can generate the "runDataFile" itself, the path to the intended location is included in the runProperties.tsv file.
    • A pair of examples of reading in these two types of input file is included in this topic.
3. The server invokes the transform script, which can use the ${runInfo} substitution token to locate the "runProperties.tsv" file created in step 2. The script can then extract information and paths to input and output files from that file.
  • Both the "runDataUploadedFile" (the raw data file you uploaded) and the "runDataFile" (the location for the processed/TSV version of that file that may or may not have been generated in step 2) are included in the "runProperties.tsv" file. Which you use as input depends on your assay type and transformation needs.
  • The "runDataFile" property is followed by three values; the third is the path to the output TSV file you will write the data to.
4. After script completion, the server checks whether any errors have been written by the transform script and whether any data has been transformed.

5. If transformed data is available in the specified output location, the server uses it for subsequent steps; otherwise, the original data is used.

6. If multiple transform scripts are specified, the server invokes the other scripts in the order in which they are defined, passing sequentially transformed output as input to the next script.

7. Field-level validator and quality-control checks that are included in the assay definition, including range and regular expression validation, are performed on the 'post-transformation' data.

8. If no errors have occurred, the run is loaded into the database.
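
The script's part of this sequence (steps 3 through 5) can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes the server was able to generate the "runDataFile", that runProperties.tsv is tab-separated, and that the output path is the third value on the runDataFile line. The function names and the per-row callback are hypothetical.

```python
import csv


def read_run_properties(path):
    """Parse runProperties.tsv into {property name: [values that follow it]}."""
    props = {}
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if fields:
                props[fields[0]] = fields[1:]
    return props


def run_transform(run_properties_path, transform_row):
    """Read the preprocessed TSV, transform each row, and write the output TSV."""
    props = read_run_properties(run_properties_path)
    data_file = props["runDataFile"]
    # First value: preprocessed input TSV. Third value: where to write output.
    input_path, output_path = data_file[0], data_file[2]
    with open(input_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        for row in reader:
            writer.writerow(transform_row(row))
    return output_path
```

In a real script, run_transform would be called with the path that the server substitutes for ${runInfo} and with a function that modifies each row without adding new column names.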

Use Transformation Scripts

Each assay design can be associated with one or more validation or transform scripts which are run in the order listed in the assay design.

This section describes the process of using a transform script that has already been developed for your assay type. An example workflow for creating an assay transform script in Perl can be found in Example Workflow: Develop a Transformation Script.

Add Script to an Assay Design

To use a transform script in an assay design, edit the design and click Add Script next to the Transform Scripts field. Note that you must have the Platform Developer or Site Administrator role to see or use this option.

  • Select file or drag and drop to associate your script with this design. Learn more about managing scripts below.
  • You may enter multiple scripts by clicking Add Script again.
  • Confirm that other properties and fields required by your assay are correctly specified.
  • Scroll down and click Save.

When any authorized user imports (or re-imports) run data using this assay design, the script will be executed.

There are two other useful Import Settings presented as checkboxes in the Assay designer.

  • Save Script Data for Debugging tells the framework to not delete the intermediate files such as the runProperties file after a successful run. This option is important during script development. It can be turned off to avoid cluttering the file space under the TransformAndValidationFiles directory that the framework automatically creates under the script file directory.
  • Import In Background tells the framework to create a pipeline job as part of the import process, rather than tying up the browser session. It is useful for importing large data sets.
A few notes on usage:
  • Transform scripts are triggered both when a user imports via the server graphical user interface and when the client API initiates the import (for example via saveBatch).
  • Client API calls are not supported in the body of transform scripts; only server-side code is supported.
  • Columns populated by transform scripts must already exist in the assay definition.
  • Executed scripts show up in the experimental graph, providing a record that transformations and/or quality control scripts were run.
  • Transform scripts are run before field-level validators.
  • The script is invoked once per run upload.
  • Multiple scripts are invoked in the order they are listed in the assay design.
  • Note that non-programmatic quality control remains available -- assay designs can be configured to perform basic checks for data types, required values, regular expressions, and ranges. Learn more in these topics: Field Editor and Dataset QC States: Admin Guide.
The general purpose assay tutorial includes another example use of a transform script in Set up a Data Transform Script.

Manage Script Files

When you add a transformation script using the assay designer, the script will be uploaded to a @scripts subdirectory of the file root, parallel to where other @files are stored. This separate location helps protect scripts from being modified or removed by unauthorized users, as only Platform Developers and Site Administrators will be able to access them.

Remove scripts from the design by selecting Remove path from the menu. Note that this does not remove the file itself, just removes the path from the assay design. You can also use Copy path to obtain the path for this script in order to apply it to another assay design.

To manage the actual script files, click Manage Script Files to open the @scripts location.

Here you can select the script files themselves and delete them.

Customize a File Browser

You can customize a Files web part to show the @scripts location.

  • Note: If users without access to the script files will be able to reach this folder, you will also want to customize the permissions settings of this web part. This is best accomplished by creating a dummy container and granting only Site Admins and Platform Developers access, then using that container to set the permissions for the Transform Scripts web part.

Provide the Path to the Script File

When you upload a transformation script to the assay designer, it is placed in the @scripts subdirectory of the local file root. The path is determined for you and displayed in the assay designer. This location is only visible to Site Administrators and users with the Platform Developer role, making it a secure place to locate script files.

If for some reason you have scripts located elsewhere on your system, or when you are creating a new design using the same transform script(s), you can specify the absolute path to the script instead of uploading it.

Use Copy path from the menu in an existing assay design's transform script section, or find the absolute path of a script elsewhere in the File Repository.

  • If there is a customized Files web part showing the contents of the @scripts location, click the title of the web part to open the file browser.
  • Select the transform script.
  • The absolute path is shown at the bottom of the panel.

In the file path, LabKey Server accepts either backslashes (the default Windows format) or forward slashes.

Example path to script:

/labkey/labkey/files/MyProject/MyAssayFolder/@scripts/MyTransformScript.R

When working on your own developer workstation, you can put the script file wherever you like, but using the assay designer interface to place it in the @scripts location is more secure and also makes it easier to deploy to a production server. These options also make iterative development against a remote server easier, since you can use a WebDAV-enabled file editor to directly edit the same script file that the server is calling.

Within the script, you can use the built-in substitution token "${srcDirectory}", which is automatically replaced with the directory where the script file is located.
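
One possible use of this token, sketched in Python, is to locate a companion file stored next to the script. The file name lookup_values.tsv and the helper companion_path are hypothetical. The sketch assumes the server replaces the token in the script text before the script runs; the raw string is there so that any backslashes in a substituted Windows path are kept literal.

```python
import os

# Assumption: the server substitutes the script's directory for this token
# before the script is executed.
SRC_DIRECTORY = r"${srcDirectory}"


def companion_path(file_name, src_directory=SRC_DIRECTORY):
    """Build the path of a file that lives in the same directory as the script."""
    return os.path.join(src_directory, file_name)


# Hypothetical companion file kept alongside the transform script.
lookup_file = companion_path("lookup_values.tsv")
```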

Access and Use the Run Properties File

The primary mechanism for communication between the LabKey Assay framework and the transform script is the Run Properties file. The ${runInfo} substitution token tells the script code where to find this file. The script file should contain a line like:

run.props = labkey.transform.readRunPropertiesFile("${runInfo}");

The run properties file contains three categories of properties:

1. Batch and run properties as defined by the user when creating an assay instance. These properties are of the format: <property name> <property value> <java data type>

for example,

gDarkStdDev 1.98223 java.lang.Double

An example Run Properties file to examine: runProperties.tsv

When the transform script is called these properties will contain any values that the user has typed into the "Batch Properties" and "Run Properties" sections of the import form. The transform script can assign or modify these properties based on calculations or by reading them from the raw data file from the instrument. The script must then write the modified properties file to the location specified by the transformedRunPropertiesFile property.

2. Context properties of the assay such as assayName, runComments, and containerPath. These are recorded in the same format as the user-defined batch and run properties, but they cannot be overwritten by the script.

3. Paths to input and output files. These are absolute paths that the script reads from or writes to. They are in a <property name> <property value> format without property types. The paths currently used are:

  • a. runDataUploadedFile: The assay result file selected and imported to the server by the user. This can be an Excel file (XLS, XLSX), a tab-separated text file (TSV), or a comma-separated text file (CSV).
  • b. runDataFile: The file produced after the assay framework converts the user imported file to TSV format. The path will point to a subfolder below the script file directory, with a path value similar to the example below. The AssayId_22\42 part of the directory path serves to separate the temporary files from multiple executions by multiple scripts in the same folder.
C:\labkey\files\transforms\@files\scripts\TransformAndValidationFiles\AssayId_22\42\runDataFile.tsv
  • c. AssayRunTSVData: This file path is where the result of the transform script will be written. It will point to a unique file name in an "assaydata" directory that the framework creates at the root of the files tree. NOTE: this property is written on the same line as the runDataFile property.
  • d. errorsFile: This path is where a transform or validation script can write out error messages for use in troubleshooting. It is not normally needed by an R script, because an R script usually writes errors to stdout, which the framework writes to a file named "<scriptname>.Rout".
  • e. transformedRunPropertiesFile: This path is where the script writes out the updated values of batch- and run-level properties that are listed in the runProperties file.
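
To illustrate how a script might update run-level properties, here is a hedged Python sketch. It assumes runProperties.tsv is tab-separated with the property name in the first column, and that the transformed properties file accepts one tab-separated name and value per line; that output layout is an assumption and is not specified in this topic. The helper names are hypothetical.

```python
import csv


def load_properties(run_properties_path):
    """Map each property name to the list of values that follow it on its line."""
    with open(run_properties_path, newline="") as handle:
        return {f[0]: f[1:] for f in csv.reader(handle, delimiter="\t") if f}


def write_transformed_properties(run_properties_path, updates):
    """Write updated values to the path given by transformedRunPropertiesFile.

    'updates' maps property names to new values. Only names that are already
    listed in the run properties file are accepted.
    """
    props = load_properties(run_properties_path)
    unknown = [name for name in updates if name not in props]
    if unknown:
        raise KeyError("Not listed in run properties: " + ", ".join(unknown))
    target = props["transformedRunPropertiesFile"][0]
    with open(target, "w", newline="") as handle:
        # Assumed layout: <property name> TAB <property value>
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name, value in updates.items():
            writer.writerow([name, value])
    return target
```

For example, write_transformed_properties(path, {"gDarkStdDev": "2.5"}) would record a new value for the gDarkStdDev property shown earlier. Context properties such as assayName cannot be overwritten by the script, so there is no point in including them.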

Choose the Input File for Transform Script Processing

From the runProperties.tsv, the transform script developer has two choices of the file to use as input to transform:

  • runDataUploadedFile: The "raw" data file as uploaded. This is stored as "runDataUploadedFile" and can be Excel, TSV, or CSV format.
  • runDataFile: The "preprocessed" TSV file that the system will attempt to infer by scanning that raw uploaded file. Generally, this file can successfully be inferred from any format when the column names in the first row match the expectations of the assay design.

As an example, the "runDataFile" would be the right choice for importing Excel data to a standard and straightforward assay design. The successful inference of a TSV from the Excel format means the script does not need to parse the Excel format, and the data is already in the "final" expected TSV format.

However, even when the "runDataFile" is successfully parsed, the script could still choose to read from and act upon the raw "runDataUploadedFile" if desired for any reason. For instance, if the original file is already in TSV format, the script could use either version.

If the data file cannot be preprocessed into a TSV, then the script developer must work with the originally uploaded "runDataUploadedFile" and provide the parsing and preprocessing into a TSV format. For instance, if the data includes a header "above" the actual data table, the script would need to skip that header and read the data into a TSV.
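
As a rough sketch of that last case, the Python function below copies an uploaded text file to the output location while dropping everything above the column-header row. It assumes the raw file is already tab-separated below its header and that the column-header row can be recognized by the name of its first column; the function name and the header_marker parameter are hypothetical.

```python
def strip_preamble(input_path, output_path, header_marker):
    """Copy input to output, skipping the lines above the column-header row.

    Returns the skipped lines so that any metadata in them can still be used.
    """
    preamble = []
    found = False
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            # The first line whose first column equals header_marker is
            # treated as the start of the data table.
            if not found and line.split("\t")[0].strip() == header_marker:
                found = True
            if found:
                dst.write(line)
            else:
                preamble.append(line.rstrip("\n"))
    if not found:
        raise ValueError("No header row starting with %r" % header_marker)
    return preamble
```

Here input_path would be the "runDataUploadedFile" value and output_path the output location given on the "runDataFile" line.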

A Python example that loads the original imported "raw" results file...

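# filePathRunProperties holds the path to the runProperties.tsv file.
filePathIn = filePathOut = None
with open(filePathRunProperties, "r") as fileRunProperties:
    for l in fileRunProperties:
        # Columns are tab-separated; splitting on tabs keeps paths with spaces intact.
        row = l.rstrip("\r\n").split("\t")
        if row[0] == "runDataUploadedFile":
            filePathIn = row[1]       # raw file as uploaded
        if row[0] == "runDataFile":
            filePathOut = row[3]      # output path for the transformed TSV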

… and one that loads the inferred TSV file:

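# filePathRunProperties holds the path to the runProperties.tsv file.
filePathIn = filePathOut = None
with open(filePathRunProperties, "r") as fileRunProperties:
    for l in fileRunProperties:
        # Columns are tab-separated; splitting on tabs keeps paths with spaces intact.
        row = l.rstrip("\r\n").split("\t")
        if row[0] == "runDataFile":
            filePathIn = row[1]       # inferred TSV (may not exist on disk)
            filePathOut = row[3]      # output path for the transformed TSV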

Note that regardless of whether the preprocessing is successful, the path of the "runDataFile" .tsv will be included in the runProperties.tsv file; if preprocessing failed, the file at that path will simply be missing. You can catch this scenario by saving script data for debugging. The "runDataFile" property also has two more columns after the path, the third being the full path to the "output" TSV file to use.
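
A script can also guard against the missing file directly. The following Python sketch, with a hypothetical helper name, prefers the preprocessed TSV when it exists on disk and otherwise falls back to the raw upload:

```python
import os


def choose_input(run_data_file, run_data_uploaded_file):
    """Return (path, is_preprocessed) for the file the script should read.

    The preprocessed TSV is used only if it actually exists; its path is
    listed in runProperties.tsv even when the server could not create it.
    """
    if run_data_file and os.path.isfile(run_data_file):
        return run_data_file, True
    return run_data_uploaded_file, False
```

The second element of the result tells the caller whether it still has to parse the original format itself.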

Pass Run Properties to Transform Scripts

Information on run properties can be passed to a transform script in two ways. You can put a substitution token into your script to identify the run properties file, or you can configure your scripting engine to pass the file path as a command line argument. See Transformation Script Substitution Syntax for a list of available substitution tokens.

For example, using Perl:

Option #1: Put a substitution token (${runInfo}) into your script and the server will replace it with the path to the run properties file. Here's a snippet of a Perl script that uses this method:

# Open the run properties file. Run or upload set properties are not used by
# this script. We are only interested in the file paths for the run data and
# the error file.

open my $reportProps, '<', '${runInfo}' or die "Cannot open run properties file: $!";

Option #2: Configure your scripting engine definition so that the file path is passed as a command line argument:

  • Go to (Admin) > Site > Admin Console.
  • Under Configuration, click Views and Scripting.
  • Select and edit the perl engine.
  • Add ${runInfo} to the Program Command field.
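
The same option can be used with other engines. As a minimal Python sketch, assuming the engine's Program Command has been configured so that the run properties path arrives as the first command line argument, a script could pick it up as follows. The function name and error message are hypothetical.

```python
import sys


def run_properties_path(argv=None):
    """Return the run properties file path passed on the command line."""
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        raise SystemExit("Expected the run properties file path as the first argument")
    # argv[0] is the script itself; argv[1] is the substituted ${runInfo} path.
    return argv[1]
```

With this approach the script text does not need to contain the token itself, since the path is supplied by the engine configuration.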
