Analysis Pipelines
Execute CWL workflows with full provenance tracking and automated results management.
Overview
Libre Biotech integrates a pipeline execution system that runs Common Workflow Language (CWL) workflows on dedicated compute infrastructure. Pipelines are linked to the ISA framework — you submit an analysis run against a study, process, or project, and results are tracked with full provenance.
Available pipelines
The platform includes pre-configured pipelines for common genomics workflows:
| Pipeline | Description | Inputs |
|---|---|---|
| ONT Transcriptomics | Oxford Nanopore long-read RNA-Seq: IsoQuant → gffread → TransDecoder | FASTQ, reference genome |
| Mouse RNA-Seq | ONT PCR-cDNA long-read RNA-Seq: minimap2 → bambu → DESeq2 | Sequencing process (NanoporeQC run) |
| IsoSeq Annotation | PacBio IsoSeq post-clustering: pbmm2 → isoseq collapse → SQANTI3 | Clustered reads, reference genome |
| Functional Annotation | TransDecoder + Pfam (hmmscan) + SwissProt (DIAMOND) | Upstream transcriptomics or annotation run |
| JBrowse Tracks | Generate genome browser tracks from analysis outputs | Upstream annotation or transcriptomics run |
| PacBio CCS | Subreads to HiFi consensus reads | SubreadSet |
Datasets
Before running a pipeline, your input data must be registered as a dataset. Datasets are files on the compute server that have been validated, checksummed, and linked to your ISA workflow. Navigate to Compute → Datasets to manage them.
Getting data onto the compute server
| Method | Best for | How |
|---|---|---|
| SFTP Upload | Your own sequencing data, local files | Upload via SFTP to the compute server, then browse and register from the Browse Uploads tab |
| SRA/ENA Fetch | Public data from NCBI or EBI | Enter an accession (e.g. SRR12345678) on the Fetch from SRA/ENA tab — the system downloads it in the background |
| Manual Path | Data already on the compute server | Enter the absolute file path on the Manual Path tab |
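When scripting bulk fetches, it helps to sanity-check accessions before submitting them. The following sketch is illustrative only (the platform's own validation rules are not documented here); it relies on the public fact that run accessions from NCBI, EBI/ENA, and DDBJ use the prefixes SRR, ERR, and DRR followed by digits.

```python
import re

# Run accessions from NCBI (SRR), EBI/ENA (ERR), and DDBJ (DRR) share the
# pattern: three-letter prefix followed by at least six digits.
RUN_ACCESSION = re.compile(r"^(SRR|ERR|DRR)\d{6,}$")

def is_run_accession(text: str) -> bool:
    """Return True if text looks like an SRA/ENA run accession."""
    return bool(RUN_ACCESSION.match(text.strip()))

print(is_run_accession("SRR12345678"))       # True
print(is_run_accession("GCF_000001635.27"))  # False: an assembly, not a run
```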
Dataset provenance
Each dataset records:
- Source type — Where the data came from (sequencing run, SRA, upload, etc.)
- Source reference — Accession number, run folder name, or URL
- SHA-256 checksum — Computed at registration, verifiable at any time via the Revalidate button
- ISA links — Optional links to the sample and sequencing process that produced the data
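The checksum step above is standard SHA-256 hashing. As a minimal sketch (the function names here are illustrative, not the platform's internal API), registration and the Revalidate button both reduce to hashing the file in chunks and comparing digests:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large FASTQs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def revalidate(path: Path, recorded: str) -> bool:
    """Recompute the checksum and compare it to the one stored at registration."""
    return sha256_of(path) == recorded
```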
Submitting an analysis run
There are two ways to start a pipeline run:
- From an ISA page: Click the Run Analysis button on any Study or Investigation page. Your study is pre-selected in the submission form.
- From the Analyses page: Navigate to Compute → Analyses, select a pipeline, and click Start Run.
Either way, the submission form walks you through the same steps:
- Select the pipeline type
- Choose FASTQ sources — pick from the Registered Datasets tab (recommended) or the NanoporeQC Runs tab
- Select a reference genome (required for CWL workflows)
- Set the study and project for ISA context
- Click Submit Run
The run enters the queue and is picked up by the pipeline worker (runs every 2 minutes).
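Conceptually, a submission bundles the choices above into a single request. The field names below are hypothetical (the actual API schema is not documented here); the sketch only shows how the form's inputs might group into pipeline inputs and ISA context:

```python
# Hypothetical request payload for a pipeline run submission.
# All field names are illustrative, not the platform's real schema.
def build_run_request(pipeline: str, dataset_ids: list[int],
                      reference_id: int, study_id: int, project_id: int) -> dict:
    if not dataset_ids:
        raise ValueError("at least one FASTQ dataset is required")
    return {
        "pipeline": pipeline,
        "inputs": {
            "fastq_datasets": dataset_ids,
            "reference_genome": reference_id,
        },
        "isa_context": {"study": study_id, "project": project_id},
    }

payload = build_run_request("ont-transcriptomics", [42, 43], 7, 3, 1)
```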
Automatic ISA integration
When a pipeline run completes (or fails), the system automatically creates a data transformation process in the ISA framework. This process:
- Links to the analysis run with full pipeline details
- Records input files, output files, and source samples
- Shows live status with auto-refresh polling while the run is active
- Becomes part of your investigation's provenance chain
Access control
Pipeline runs are scoped to your groups:
- You can only select studies, processes, projects, datasets, and upstream runs that belong to groups you are a member of
- You can only view runs belonging to your groups
- The owner group must be a group you belong to
- The run is recorded under your user account
Platform administrators can see and submit runs against any group's data.
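The scoping rules above amount to a simple visibility filter. This sketch uses made-up types (`Run`, `visible_runs`) purely to make the rule concrete:

```python
from dataclasses import dataclass

@dataclass
class Run:
    id: int
    owner_group: str

def visible_runs(runs: list[Run], user_groups: set[str], is_admin: bool) -> list[Run]:
    """Admins see everything; everyone else sees only runs
    whose owner group is one of their own groups."""
    if is_admin:
        return list(runs)
    return [r for r in runs if r.owner_group in user_groups]
```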
Upstream dependencies
Some pipelines depend on the output of a previous run (e.g. Functional Annotation requires a completed Transcriptomics or Annotation run). When you select an upstream run:
- ISA context (study, investigation, process, project, group) is inherited automatically
- The worker checks the upstream run's status before starting your run
- If the upstream run is still `queued`, `running`, or `fetching`, your run waits
- If the upstream run `failed` or was `canceled`, your run is automatically marked as failed
- Independent runs (no upstream dependency) are always processed first
Run lifecycle
| Status | Meaning |
|---|---|
| `queued` | Waiting to be picked up by the pipeline worker |
| `running` | Workflow is executing on the compute server |
| `fetching` | Results are being collected from the compute server |
| `succeeded` | Workflow completed successfully, results available |
| `failed` | Workflow encountered an error — check logs |
| `canceled` | Run was canceled before completion |
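The lifecycle implies an ordering of states. The transition graph below is inferred from the table, not taken from the platform's source, so treat it as an assumption (for instance, whether `fetching` can transition to `failed` is a guess):

```python
# Assumed transition graph inferred from the status table; not authoritative.
TRANSITIONS = {
    "queued": {"running", "canceled"},
    "running": {"fetching", "failed", "canceled"},
    "fetching": {"succeeded", "failed"},
    "succeeded": set(),  # terminal
    "failed": set(),     # terminal
    "canceled": set(),   # terminal
}

def can_transition(current: str, new: str) -> bool:
    """Check whether a status change is allowed under the assumed graph."""
    return new in TRANSITIONS.get(current, set())
```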
Viewing results
On the analysis run page, you can:
- View output files — Browse and download all files produced by the workflow
- Read logs — View stdout/stderr from the workflow execution
- View RO-Crate — Inspect the Research Object Crate metadata (provenance package)
- Download all — Download the entire run output as a ZIP archive
- View in JBrowse — For applicable outputs, open directly in the genome browser
RO-Crate provenance
Each completed analysis run is packaged as an RO-Crate — a standardised research object that bundles:
- The CWL workflow definition
- Input parameters
- Output files
- Execution metadata (timing, software versions, container images)
- Links to ISA entities (study, process, samples)
RO-Crates are portable and can be inspected by any RO-Crate-compatible tool.
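Because the bundle is plain JSON-LD, you can inspect it with nothing but the standard library. Per the RO-Crate 1.1 specification, the metadata file is named `ro-crate-metadata.json`, the root dataset has `@id` `./`, and its `hasPart` lists the bundled files:

```python
import json
from pathlib import Path

def list_crate_parts(crate_dir: Path) -> list[str]:
    """List the files an RO-Crate's root dataset declares via hasPart."""
    meta = json.loads((crate_dir / "ro-crate-metadata.json").read_text())
    by_id = {entity["@id"]: entity for entity in meta["@graph"]}
    root = by_id["./"]  # the root dataset entity
    return [part["@id"] for part in root.get("hasPart", [])]
```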
CWL workflows
All pipelines use the Common Workflow Language (CWL) v1.2. CWL is an open standard supported by multiple execution engines (cwltool, Toil, Arvados, CWL-Airflow). This means:
- Workflows are portable — they can run on any CWL-compatible platform
- Workflow definitions are human-readable YAML/JSON
- Tools are containerised (Apptainer/Singularity) for reproducibility
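For orientation, a minimal CWL v1.2 workflow has this shape. The tool wrapper path and port names below are hypothetical, not taken from the platform's actual workflow definitions:

```yaml
cwlVersion: v1.2
class: Workflow

inputs:
  reads: File          # FASTQ input
  reference: File      # reference genome FASTA

steps:
  align:
    run: tools/minimap2.cwl   # hypothetical tool wrapper
    in:
      reads: reads
      reference: reference
    out: [bam]

outputs:
  aligned_bam:
    type: File
    outputSource: align/bam
```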
Audit trail
Every pipeline state transition is logged automatically. Administrators can view the full history of a run — including who triggered it, when each state change occurred, and any error messages — in the Admin Dashboard → Activity tab.
Data integrity
When results are fetched from the compute server, each file is verified against the SHA-256 checksums recorded in the file manifest. Any mismatches are logged as warnings. The RO-Crate metadata also records checksums for all output files.
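The manifest check described above is plain checksum comparison. As a sketch (the manifest format shown is an assumption, a mapping of relative paths to hex digests):

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest: dict[str, str], base: Path) -> list[str]:
    """Compare each file against its recorded SHA-256; return mismatched paths.

    `manifest` maps relative file paths to expected hex digests, mirroring
    the verification done when results are fetched from the compute server.
    """
    mismatches = []
    for rel_path, expected in manifest.items():
        actual = hashlib.sha256((base / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches
```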
Storage management and archiving
Compute server storage is finite. When you no longer need a dataset's raw files for active analysis, you can archive it:
- Go to the dataset's page and click Archive
- Delete the file from the compute server (via SFTP or SSH) to free disk space
- The dataset record is preserved with its full provenance: checksum, source reference, ISA links, and any pipeline runs that used it
Archived datasets are excluded from the pipeline submission form but remain visible in the dataset list and in the provenance chain of any analysis that used them.
What does FAIR require?
FAIR does not require storing every file forever — it requires that data can be found and re-obtained. When you archive a dataset:
- Public data (SRA/ENA) — the accession is the provenance. Anyone can re-download it. The checksum lets you verify you got the same file.
- Your own data — keep a copy on your local NAS or institutional storage. The dataset record's checksum lets you verify integrity if you re-upload later.
- Collaborator data — ensure the source is documented in the source reference field before archiving.
To restore an archived dataset, re-upload the file to the same path and click Restore. The system will re-verify the checksum to ensure data integrity.
SFTP access
To upload data files for pipeline runs:
- Connect via SFTP to the compute server (host and path shown on the Datasets registration page)
- Upload your files to the `uploads/` directory
- Go to Compute → Datasets → Register Dataset
- Use the Browse Uploads tab to find and select your file
- The system validates the file exists, computes a SHA-256 checksum, and registers it