Analysis Pipelines
Execute CWL workflows with full provenance tracking and automated results management.
Overview
Libre Biotech integrates a pipeline execution system that runs Common Workflow Language (CWL) workflows on dedicated compute infrastructure. Pipelines are linked to the ISA framework — you submit an analysis run against a study, process, or project, and results are tracked with full provenance.
Available pipelines
The platform includes pre-configured pipelines for common genomics workflows:
| Pipeline | Description | Inputs |
|---|---|---|
| ONT Transcriptomics | Oxford Nanopore long-read RNA-Seq: IsoQuant → gffread → TransDecoder | FASTQ, reference genome |
| Mouse RNA-Seq | ONT PCR-cDNA long-read RNA-Seq: minimap2 → bambu → DESeq2 | Sequencing process (NanoporeQC run) |
| IsoSeq Annotation | PacBio IsoSeq post-clustering: pbmm2 → isoseq collapse → SQANTI3 | Clustered reads, reference genome |
| Functional Annotation | TransDecoder + Pfam (hmmscan) + SwissProt (DIAMOND) | Upstream transcriptomics or annotation run |
| JBrowse Tracks | Generate genome browser tracks from analysis outputs | Upstream annotation or transcriptomics run |
| PacBio CCS | Subreads to HiFi consensus reads | SubreadSet |
Datasets
Before running a pipeline, your input data must be registered as a dataset. Datasets are files on the compute server that have been validated, checksummed, and linked to your ISA workflow. Navigate to Compute → Datasets to manage them.
Getting data onto the compute server
| Method | Best for | How |
|---|---|---|
| SFTP Upload | Your own sequencing data, local files | Upload via SFTP to the compute server, then browse and register from the Browse Uploads tab |
| SRA/ENA Fetch | Public data from NCBI or EBI | Enter an accession (e.g. SRR12345678) on the Fetch from SRA/ENA tab — the system downloads it in the background |
| Manual Path | Data already on the compute server | Enter the absolute file path on the Manual Path tab |
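When scripting bulk fetches, it helps to sanity-check accessions before submitting them. The following sketch is illustrative only (the platform's own validation rules are not documented here); it relies on the public fact that run accessions from NCBI, EBI/ENA, and DDBJ use the prefixes SRR, ERR, and DRR followed by digits.

```python
import re

# Run accessions from NCBI (SRR), EBI/ENA (ERR), and DDBJ (DRR) share the
# pattern: three-letter prefix followed by at least six digits.
RUN_ACCESSION = re.compile(r"^(SRR|ERR|DRR)\d{6,}$")

def is_run_accession(text: str) -> bool:
    """Return True if text looks like an SRA/ENA run accession."""
    return bool(RUN_ACCESSION.match(text.strip()))

print(is_run_accession("SRR12345678"))       # True
print(is_run_accession("GCF_000001635.27"))  # False: an assembly, not a run
```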
Dataset provenance
Each dataset records:
- Source type — Where the data came from (sequencing run, SRA, upload, etc.)
- Source reference — Accession number, run folder name, or URL
- SHA-256 checksum — Computed at registration, verifiable at any time via the Revalidate button
- ISA links — Optional links to the sample and sequencing process that produced the data
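The checksum step above is standard SHA-256 hashing. As a minimal sketch (the function names here are illustrative, not the platform's internal API), registration and the Revalidate button both reduce to hashing the file in chunks and comparing digests:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large FASTQs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def revalidate(path: Path, recorded: str) -> bool:
    """Recompute the checksum and compare it to the one stored at registration."""
    return sha256_of(path) == recorded
```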
Submitting an analysis run
There are two ways to start a pipeline run:
- From an ISA page: Click the Run Analysis button on any Study or Investigation page. Your study is pre-selected in the submission form.
- From the Analyses page: Navigate to Compute → Analyses, select a pipeline, and click Start Run.
Either way, the submission form walks you through the same steps:
- Select the pipeline type
- Choose FASTQ sources — pick from the Registered Datasets tab (recommended) or the NanoporeQC Runs tab
- Select a reference genome (required for CWL workflows)
- Set the study and project for ISA context
- Click Submit Run
The run enters the queue and is picked up by the pipeline worker (runs every 2 minutes).
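Conceptually, a submission bundles the choices above into a single request. The field names below are hypothetical (the actual API schema is not documented here); the sketch only shows how the form's inputs might group into pipeline inputs and ISA context:

```python
# Hypothetical request payload for a pipeline run submission.
# All field names are illustrative, not the platform's real schema.
def build_run_request(pipeline: str, dataset_ids: list[int],
                      reference_id: int, study_id: int, project_id: int) -> dict:
    if not dataset_ids:
        raise ValueError("at least one FASTQ dataset is required")
    return {
        "pipeline": pipeline,
        "inputs": {
            "fastq_datasets": dataset_ids,
            "reference_genome": reference_id,
        },
        "isa_context": {"study": study_id, "project": project_id},
    }

payload = build_run_request("ont-transcriptomics", [42, 43], 7, 3, 1)
```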
Automatic ISA integration
When a pipeline run completes (or fails), the system automatically creates a data transformation process in the ISA framework. This process:
- Links to the analysis run with full pipeline details
- Records input files, output files, and source samples
- Shows live status with auto-refresh polling while the run is active
- Becomes part of your investigation's provenance chain
Access control
Pipeline runs are scoped to your groups:
- You can only select studies, processes, projects, datasets, and upstream runs that belong to groups you are a member of
- You can only view runs belonging to your groups
- The owner group must be a group you belong to
- The run is recorded under your user account
Platform administrators can see and submit runs against any group's data.
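The scoping rules above amount to a simple visibility filter. This sketch uses made-up types (`Run`, `visible_runs`) purely to make the rule concrete:

```python
from dataclasses import dataclass

@dataclass
class Run:
    id: int
    owner_group: str

def visible_runs(runs: list[Run], user_groups: set[str], is_admin: bool) -> list[Run]:
    """Admins see everything; everyone else sees only runs
    whose owner group is one of their own groups."""
    if is_admin:
        return list(runs)
    return [r for r in runs if r.owner_group in user_groups]
```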
Upstream dependencies
Some pipelines depend on the output of a previous run (e.g. Functional Annotation requires a completed Transcriptomics or Annotation run). When you select an upstream run:
- ISA context (study, investigation, process, project, group) is inherited automatically
- The worker checks the upstream run's status before starting your run
- If the upstream run is still `queued`, `running`, or `fetching`, your run waits
- If the upstream run `failed` or was `canceled`, your run is automatically marked as failed
- Independent runs (no upstream dependency) are always processed first
Run lifecycle
| Status | Meaning |
|---|---|
| `queued` | Waiting to be picked up by the pipeline worker |
| `running` | Workflow is executing on the compute server |
| `fetching` | Results are being collected from the compute server |
| `succeeded` | Workflow completed successfully, results available |
| `failed` | Workflow encountered an error — check logs |
| `canceled` | Run was canceled before completion |
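The lifecycle implies an ordering of states. The transition graph below is inferred from the table, not taken from the platform's source, so treat it as an assumption (for instance, whether `fetching` can transition to `failed` is a guess):

```python
# Assumed transition graph inferred from the status table; not authoritative.
TRANSITIONS = {
    "queued": {"running", "canceled"},
    "running": {"fetching", "failed", "canceled"},
    "fetching": {"succeeded", "failed"},
    "succeeded": set(),  # terminal
    "failed": set(),     # terminal
    "canceled": set(),   # terminal
}

def can_transition(current: str, new: str) -> bool:
    """Check whether a status change is allowed under the assumed graph."""
    return new in TRANSITIONS.get(current, set())
```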
Viewing results
On the analysis run page, you can:
- View output files — Browse and download all files produced by the workflow
- Read logs — View stdout/stderr from the workflow execution
- View RO-Crate — Inspect the Research Object Crate metadata (provenance package)
- Download all — Download the entire run output as a ZIP archive
- View in JBrowse — For applicable outputs, open directly in the genome browser
RO-Crate provenance
Each completed analysis run is packaged as an RO-Crate — a standardised research object that bundles:
- The CWL workflow definition
- Input parameters
- Output files
- Execution metadata (timing, software versions, container images)
- Links to ISA entities (study, process, samples)
RO-Crates are portable and can be inspected by any RO-Crate-compatible tool.
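Because the bundle is plain JSON-LD, you can inspect it with nothing but the standard library. Per the RO-Crate 1.1 specification, the metadata file is named `ro-crate-metadata.json`, the root dataset has `@id` `./`, and its `hasPart` lists the bundled files:

```python
import json
from pathlib import Path

def list_crate_parts(crate_dir: Path) -> list[str]:
    """List the files an RO-Crate's root dataset declares via hasPart."""
    meta = json.loads((crate_dir / "ro-crate-metadata.json").read_text())
    by_id = {entity["@id"]: entity for entity in meta["@graph"]}
    root = by_id["./"]  # the root dataset entity
    return [part["@id"] for part in root.get("hasPart", [])]
```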
CWL workflows
All pipelines use the Common Workflow Language (CWL) v1.2. CWL is an open standard supported by multiple execution engines (cwltool, Toil, Arvados, CWL-Airflow). This means:
- Workflows are portable — they can run on any CWL-compatible platform
- Workflow definitions are human-readable YAML/JSON
- Tools are containerised (Apptainer/Singularity) for reproducibility
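For orientation, a minimal CWL v1.2 workflow has this shape. The tool wrapper path and port names below are hypothetical, not taken from the platform's actual workflow definitions:

```yaml
cwlVersion: v1.2
class: Workflow

inputs:
  reads: File          # FASTQ input
  reference: File      # reference genome FASTA

steps:
  align:
    run: tools/minimap2.cwl   # hypothetical tool wrapper
    in:
      reads: reads
      reference: reference
    out: [bam]

outputs:
  aligned_bam:
    type: File
    outputSource: align/bam
```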
Audit trail
Every pipeline state transition is logged automatically. Administrators can view the full history of a run — including who triggered it, when each state change occurred, and any error messages — in the Admin Dashboard → Activity tab.
Data integrity
When results are fetched from the compute server, each file is verified against the SHA-256 checksums recorded in the file manifest. Any mismatches are logged as warnings. The RO-Crate metadata also records checksums for all output files.
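The manifest check described above is plain checksum comparison. As a sketch (the manifest format shown is an assumption, a mapping of relative paths to hex digests):

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest: dict[str, str], base: Path) -> list[str]:
    """Compare each file against its recorded SHA-256; return mismatched paths.

    `manifest` maps relative file paths to expected hex digests, mirroring
    the verification done when results are fetched from the compute server.
    """
    mismatches = []
    for rel_path, expected in manifest.items():
        actual = hashlib.sha256((base / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches
```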
Storage management and archiving
Compute server storage is finite. When you no longer need a dataset's raw files for active analysis, you can archive it:
- Go to the dataset's page and click Archive
- Delete the file from the compute server (via SFTP or SSH) to free disk space
- The dataset record is preserved with its full provenance: checksum, source reference, ISA links, and any pipeline runs that used it
Archived datasets are excluded from the pipeline submission form but remain visible in the dataset list and in the provenance chain of any analysis that used them.
What does FAIR require?
FAIR does not require storing every file forever — it requires that data can be found and re-obtained. When you archive a dataset:
- Public data (SRA/ENA) — the accession is the provenance. Anyone can re-download it. The checksum lets you verify you got the same file.
- Your own data — keep a copy on your local NAS or institutional storage. The dataset record's checksum lets you verify integrity if you re-upload later.
- Collaborator data — ensure the source is documented in the source reference field before archiving.
To restore an archived dataset, re-upload the file to the same path and click Restore. The system will re-verify the checksum to ensure data integrity.
SFTP access
To upload data files for pipeline runs:
- Connect via SFTP to the compute server (host and path shown on the Datasets registration page)
- Upload your files to the `uploads/` directory
- Go to Compute → Datasets → Register Dataset
- Use the Browse Uploads tab to find and select your file
- The system validates the file exists, computes a SHA-256 checksum, and registers it