Script Pipeline: Running R and Other Scripts in Sequence

[ Video Overview: File-based R Pipeline Scripts ] [ Tutorial Demo: RNASeq matrix processing ]

The "R pipeline" lets you run scripts in a managed environment, so you can run scripts and commands in a sequence -- essentially creating an assembly line of scripts, where the output of one script becomes the input for the next in the series. The pipeline supports R scripts, as well as any of the languages that can be configured for the server, including JavaScript, Perl, Python, SAS and others. Automating data processing using the pipeline lets you:

  • Simplify procedures and reduce errors
  • Standardize and reproduce analyses
  • Track inputs, script versions, and outputs

Pipeline jobs are defined as a sequence of "tasks", run in a specified order. For example, a job might include three tasks: (1) pass a raw data file to an R script for initial processing, (2) process the results with a Perl script, and (3) insert the processed results into an assay database.
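
For example, a .pipeline.xml file for such a three-task job might be sketched as follows (the module, task, and pipeline names here are hypothetical; the elements used are described in detail in the sections below):

<pipeline xmlns="http://labkey.org/pipeline/xml"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          name="threeStepJob" version="0.0">
    <description>Process raw data with R and Perl, then import the results into an assay.</description>
    <tasks>
        <!-- (1) initial processing with an R script -->
        <taskref ref="myModule:task:processRawData"/>
        <!-- (2) process the results with a Perl script -->
        <taskref ref="myModule:task:refineResults"/>
        <!-- (3) insert the results into an assay database -->
        <task xsi:type="AssayImportRunTaskType"/>
    </tasks>
</pipeline>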

Set Up

Before you use the script pipeline, confirm that your target script engine is enlisted with LabKey Server. For example, if you intend to use an R script, enlist the R engine as described in the topic Configure Scripting Engines.

Tasks

Tasks are defined in a LabKey Server module. They are file-based, so they can be created from scratch, cloned, exported, imported, renamed, and deleted. Tasks declare parameters, inputs, and outputs. Inputs may be files, parameters entered by users or via the API, a query, or a user-selected set of rows from a query. Outputs may be files, values, or rows inserted into a table. Tasks may also call other tasks.

Module File Layout

The module directory layout for sequence configuration files (.pipeline.xml), task configuration files (.task.xml), and script files (.r, .pl, etc.) has the following shape. (Note: the layout below follows the pattern for modules as checked into LabKey Server source control. Modules not checked into source control have a somewhat different directory pattern. For details see Map of Module Files.)

<module>
    resources
        pipeline
            pipelines
                job1.pipeline.xml
                job2.pipeline.xml
                job3.pipeline.xml
                ...
            tasks
                RScript.task.xml
                RScript.r
                PerlScript.task.xml
                PerlScript.pl
                ...

File Operation Tasks

Exec Task

An example command-line .task.xml file that takes .hk files as input and writes .cms2 files:

<task xmlns="http://labkey.org/pipeline/xml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:type="ExecTaskType" name="mytask" version="1.0">
    <exec>
        bullseye -s my.spectra -q ${q} -o ${output.cms2} ${input.hk}
    </exec>
</task>

Script Task

An example task configuration file that calls an R script:

<task xmlns="http://labkey.org/pipeline/xml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:type="ScriptTaskType"
      name="generateMatrix" version="0.0">
    <description>Generate an expression matrix file (TSV format).</description>
    <script file="RScript.r"/>
</task>

Parameters

Parameters, inputs, and outputs can be explicitly declared in the .task.xml file (or in the .pipeline.xml, if it includes an inline task).

<task xmlns="http://labkey.org/pipeline/xml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:type="ScriptTaskType"
      name="someTask" version="0.0">

    <inputs>
        <file name="input.txt" required="true"/>
        <text name="param1" required="true"/>
    </inputs>
    ...
</task>

If you do not provide explicit configurations, parameters are inferred from any dollar sign/curly braces tokens in your script. For example, see below: ${q}, ${output.cms2}, and ${input.hk}.

<task xmlns="http://labkey.org/pipeline/xml" name="mytask" version="1.0">
    <exec>
        bullseye -s my.spectra -q ${q} -o ${output.cms2} ${input.hk}
    </exec>
</task>

Inputs and Outputs

File inputs are identified by file extension. For example, the following configures the task to accept .txt files:

<inputs>
    <file name="input.txt"/>
</inputs>

File outputs are automatically named using the formula: input file base name + the file extension set at <outputs><file name="output.tsv">. For example, if the input file is "myData1.txt", the output file will be named "myData1.tsv".
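
For example, here is a sketch of a paired input/output declaration following the naming rule above:

<inputs>
    <file name="input.txt"/>
</inputs>
<outputs>
    <!-- an input "myData1.txt" yields an output "myData1.tsv" -->
    <file name="output.tsv"/>
</outputs>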

  • The task name must be unique (no other task with the same name). For example: <task xmlns="http://labkey.org/pipeline/xml" name="myUniqueTaskName">
  • An input must be declared, either implicitly or explicitly with XML configuration elements.
  • Input and output files must not have the same file extensions. For example, the following is not allowed, because .tsv is declared for both input and output:
<inputs>
    <file name="input.tsv"/>
</inputs>
<outputs>
    <file name="output.tsv"/> <!-- WRONG - input and output cannot share the same file extension. -->
</outputs>

Configure required parameters with the attribute 'required', for example:

<inputs>
    <file name="input.tsv"/>
    <text name="param1" required="true"/>
</inputs>

Control the output location (where files are written) using the attributes outputDir or outputLocation.
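
For example, a sketch (this assumes the outputDir attribute is accepted on the <file> element; check the pipeline XML schema for your server version):

<outputs>
    <!-- assumption: outputDir controls the directory where the output file is written -->
    <file name="output.tsv" outputDir="processed"/>
</outputs>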

Implicitly Declared Parameters, Inputs, and Outputs

Implicitly declared parameters, inputs, and outputs are allowed and identified by the dollar sign/curly braces syntax, for example, ${param1}.

  • Inputs are identified by the pattern: ${input.XXX} where XXX is the desired file extension.
  • Outputs are identified by the pattern: ${output.XXX} where XXX is the desired file extension.
  • All other patterns are treated as parameters: ${fooParam}, ${barParam}
For example, the following R script contains these implicit parameters:
  • ${input.txt} - Input files have 'txt' extension.
  • ${output.tsv} - Output files have 'tsv' extension.
  • ${skip-lines} - An integer indicating how many initial lines to skip.
# reads the input file and prints the contents to stdout
lines = readLines(con="${input.txt}")

# skip-lines parameter: convert to integer if possible
skipLines = as.integer("${skip-lines}")
if (is.na(skipLines)) {
    skipLines = 0
}

# number of lines in the file
lineCount = NROW(lines)

if (skipLines > lineCount) {
    cat("start index larger than number of lines")
} else {
    # start index
    start = skipLines + 1

    # print to stdout
    cat("(stdout) contents of file: ${input.txt}\n")
    for (i in start:lineCount) {
        cat(sep="", lines[i], "\n")
    }

    # print to ${output.tsv}
    f = file(description="${output.tsv}", open="w")
    cat(file=f, "# (output) contents of file: ${input.txt}\n")
    for (i in start:lineCount) {
        cat(file=f, sep="", lines[i], "\n")
    }
    flush(con=f)
    close(con=f)
}

Assay Database Import Tasks

The built-in task type AssayImportRunTaskType looks for TSV and XLS files output by the previous task. If it finds output files, it uses that data to update the database, importing into whatever assay runs tables you configure.

An example task sequence file with two tasks: (1) generate a TSV file, (2) import that file to the database: scriptset1-assayimport.pipeline.xml.

<pipeline xmlns="http://labkey.org/pipeline/xml"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          name="scriptset1-assayimport" version="0.0">
    <!-- The description text is shown in the Import Data selection menu. -->
    <description>Sequence: Call generateMatrix.r to generate a tsv file,
        import this tsv file into the database.</description>
    <tasks>
        <!-- Task #1: Call the task generateMatrix (= the script generateMatrix.r) in myModule -->
        <taskref ref="myModule:task:generateMatrix"/>
        <!-- Task #2: Import the output/results of the script into the database -->
        <task xsi:type="AssayImportRunTaskType">
            <!-- Target an assay by provider and protocol, -->
            <!-- where providerName is the assay type -->
            <!-- and protocolName is the assay design -->
            <!-- <providerName>General</providerName> -->
            <!-- <protocolName>MyAssayDesign</protocolName> -->
        </task>
    </tasks>
</pipeline>

The name attribute of the <pipeline> element must match the file name (minus the file extension), in this case 'scriptset1-assayimport'.

The elements providerName and protocolName determine which runs table is targeted.
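
For example, to import into the runs table of a General (GPAT) assay design named "MyAssayDesign", uncomment and fill in those elements from the example above:

<task xsi:type="AssayImportRunTaskType">
    <providerName>General</providerName>
    <protocolName>MyAssayDesign</protocolName>
</task>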

Pipeline Task Sequences

Pipelines consist of a configured sequence of tasks. A "job" is a pipeline instance with specific input and output files and parameters. Task sequences are defined in files with the extension ".pipeline.xml".

Note the task references, for example "myModule:task:generateMatrix". These are of the form <ModuleName>:task:<TaskName>, where <TaskName> refers to a task config file at /pipeline/tasks/<TaskName>.task.xml.

An example pipeline file: job1.pipeline.xml, which runs two tasks:

<pipeline xmlns="http://labkey.org/pipeline/xml"
          name="job1" version="0.0">
    <description>(1) Normalize and (2) generate an expression matrix file.</description>
    <tasks>
        <taskref ref="myModule:task:normalize"/>
        <taskref ref="myModule:task:generateMatrix"/>
    </tasks>
</pipeline>

Invoking Pipeline Sequences from the File Browser

Configured pipeline jobs/sequences can be invoked from the pipeline file browser by selecting one or more input files and clicking Import Data. The list of available pipeline jobs is populated from the .pipeline.xml files.

Overriding Parameters

The default UI provides a panel for overriding the job's default parameters, expressed in the following format:

<?xml version="1.0" encoding="UTF-8"?>
<bioml>
    <!-- Override default parameters here. -->
    <note type="input" label="pipeline, protocol name">geneExpression1</note>
    <note type="input" label="pipeline, email address">steveh@labkey.com</note>
</bioml>

Providing User Interface

You can override the default user interface by setting <analyzeURL> in the .pipeline.xml file.

<pipeline xmlns="http://labkey.org/pipeline/xml"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          name="geneExpMatrix-assayimport" version="0.0">
    <description>Expression Matrix: Process with R, Import Results</description>
    <!-- Overrides the default UI; the user will see myPage.view instead. -->
    <analyzeURL>/pipelineSample/myPage.view</analyzeURL>
    <tasks>
        ...
    </tasks>
</pipeline>

Invoking from JavaScript

The following example JavaScript invokes a pipeline job through LABKEY.Pipeline.startAnalysis().

Note the value of taskId: 'myModule:pipeline:generateMatrix'. This is of the form <ModuleName>:pipeline:<PipelineName>, referencing a file at /pipeline/pipelines/<PipelineName>.pipeline.xml.

function startAnalysis()
{
    var protocolName = document.getElementById("protocolNameInput").value;
    if (!protocolName) {
        alert("Protocol name is required");
        return;
    }

    var skipLines = document.getElementById("skipLinesInput").value;
    if (skipLines < 0) {
        alert("Skip lines >= 0 required");
        return;
    }

    // 'path' and 'files' are assumed to be defined by the surrounding page,
    // e.g. the pipeline directory path and the selected input file names.
    LABKEY.Pipeline.startAnalysis({
        taskId: "myModule:pipeline:generateMatrix",
        path: path,
        files: files,
        protocolName: protocolName,
        protocolDescription: "",
        jsonParameters: {
            'skip-lines': skipLines
        },
        saveProtocol: false,
        success: function() {
            window.location = LABKEY.ActionURL.buildURL("pipeline-status", "showList.view");
        }
    });
}

Execution Environment

When a pipeline job is run, a job directory is created, named after the job type, with a child directory inside it named after the protocol, for example, "create-matrix-job/protocol2". Log and output files are written to this child directory.

Also, while a job is running, a work directory is created, for example, "run1.work". This includes:

  • The parameter-replaced script.
  • A context 'task info' file with the server URL, the list of input files, etc.

If the job completes successfully, the work directory is cleaned up: any generated files are moved to their permanent locations, and the work directory is deleted.
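
For example, after a successful run of a job like the one above, the job directory might look like this (file names are illustrative):

create-matrix-job/
    protocol2/
        create-matrix-job.log    (pipeline job log)
        myData1.tsv              (generated output, moved here on success)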
