Libre Biotech

Analysis Pipelines

Execute CWL workflows with full provenance tracking and automated results management.

Overview

Libre Biotech integrates a pipeline execution system that runs Common Workflow Language (CWL) workflows on dedicated compute infrastructure. Pipelines are linked to the ISA framework — you submit an analysis run against a study, process, or project, and results are tracked with full provenance.

Available pipelines

The platform includes pre-configured pipelines for common genomics workflows:

Pipeline | Description | Inputs
ONT Transcriptomics | Oxford Nanopore long-read RNA-Seq: IsoQuant → gffread → TransDecoder | FASTQ, reference genome
Mouse RNA-Seq | ONT PCR-cDNA long-read RNA-Seq: minimap2 → bambu → DESeq2 | Sequencing process (NanoporeQC run)
IsoSeq Annotation | PacBio IsoSeq post-clustering: pbmm2 → isoseq collapse → SQANTI3 | Clustered reads, reference genome
Functional Annotation | TransDecoder + Pfam (hmmscan) + SwissProt (DIAMOND) | Upstream transcriptomics or annotation run
JBrowse Tracks | Generate genome browser tracks from analysis outputs | Upstream annotation or transcriptomics run
PacBio CCS | Subreads to HiFi consensus reads | SubreadSet

Datasets

Before running a pipeline, your input data must be registered as a dataset. Datasets are files on the compute server that have been validated, checksummed, and linked to your ISA workflow. Navigate to Compute → Datasets to manage them.

Getting data onto the compute server

Method | Best for | How
SFTP Upload | Your own sequencing data, local files | Upload via SFTP to the compute server, then browse and register from the Browse Uploads tab
SRA/ENA Fetch | Public data from NCBI or EBI | Enter an accession (e.g. SRR12345678) on the Fetch from SRA/ENA tab; the system downloads it in the background
Manual Path | Data already on the compute server | Enter the absolute file path on the Manual Path tab

Dataset provenance

Each dataset records:

  • Source type — Where the data came from (sequencing run, SRA, upload, etc.)
  • Source reference — Accession number, run folder name, or URL
  • SHA-256 checksum — Computed at registration, verifiable at any time via the Revalidate button
  • ISA links — Optional links to the sample and sequencing process that produced the data
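
The Revalidate check can be reproduced on your own copy of a file. A minimal sketch in Python (hashlib is in the standard library; the function names and chunk size are illustrative, not the platform's actual implementation):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in chunks so large FASTQ files
    do not have to fit in memory. Returns the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def revalidate(path: str, recorded_checksum: str) -> bool:
    """Compare a freshly computed digest against the checksum recorded
    at registration time (the check behind the Revalidate button)."""
    return sha256_of(path) == recorded_checksum.lower()
```

This is useful when keeping a local copy of an archived dataset: the recorded checksum lets you confirm your copy is byte-identical before re-uploading.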

Submitting an analysis run

There are two ways to start a pipeline run:

  • From an ISA page: Click the Run Analysis button on any Study or Investigation page. Your study is pre-selected in the submission form.
  • From the Analyses page: Navigate to Compute → Analyses, select a pipeline, and click Start Run.

Either way, the submission form walks you through the same steps:

  1. Select the pipeline type
  2. Choose FASTQ sources — pick from the Registered Datasets tab (recommended) or the NanoporeQC Runs tab
  3. Select a reference genome (required for CWL workflows)
  4. Set the study and project for ISA context
  5. Click Submit Run

The run enters the queue and is picked up by the pipeline worker, which polls every 2 minutes.

Automatic ISA integration

When a pipeline run completes (or fails), the system automatically creates a data transformation process in the ISA framework. This process:

  • Links to the analysis run with full pipeline details
  • Records input files, output files, and source samples
  • Shows live status with auto-refresh polling while the run is active
  • Becomes part of your investigation's provenance chain

Access control

Pipeline runs are scoped to your groups:

  • You can only select studies, processes, projects, datasets, and upstream runs that belong to groups you are a member of
  • You can only view runs belonging to your groups
  • The owner group must be a group you belong to
  • The run is recorded under your user account

Platform administrators can see and submit runs against any group's data.

Upstream dependencies

Some pipelines depend on the output of a previous run (e.g. Functional Annotation requires a completed Transcriptomics or Annotation run). When you select an upstream run:

  • ISA context (study, investigation, process, project, group) is inherited automatically
  • The worker checks the upstream run's status before starting your run
  • If the upstream run is still queued, running, or fetching, your run waits
  • If the upstream run failed or was canceled, your run is automatically marked as failed
  • Independent runs (no upstream dependency) are always processed first
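
The gating rules above amount to a small decision function. The sketch below uses the status names from the run lifecycle table; the function itself is an illustration of the worker's logic, not the platform's actual code:

```python
from typing import Optional

ACTIVE = {"queued", "running", "fetching"}      # upstream still in flight
TERMINAL_BAD = {"failed", "canceled"}           # upstream cannot produce input

def gate(upstream_status: Optional[str]) -> str:
    """Decide what the worker does with a queued run, given the status
    of its upstream dependency (None means an independent run)."""
    if upstream_status is None:
        return "start"      # no dependency: eligible immediately
    if upstream_status in ACTIVE:
        return "wait"       # dependency still queued/running/fetching
    if upstream_status in TERMINAL_BAD:
        return "fail"       # propagate the upstream failure/cancellation
    return "start"          # upstream succeeded: inputs are available
```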

Run lifecycle

Status | Meaning
queued | Waiting to be picked up by the pipeline worker
running | Workflow is executing on the compute server
fetching | Results are being collected from the compute server
succeeded | Workflow completed successfully; results available
failed | Workflow encountered an error; check the logs
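
A run's status only moves forward through this table. One way to picture the lifecycle (the transition map below is inferred from the statuses and the upstream-dependency rules, not taken from the platform's source):

```python
# Allowed forward transitions between run statuses. "queued" can go
# straight to "failed" when an upstream dependency fails or is canceled.
TRANSITIONS = {
    "queued":    {"running", "failed"},
    "running":   {"fetching", "failed"},
    "fetching":  {"succeeded", "failed"},
    "succeeded": set(),   # terminal
    "failed":    set(),   # terminal
}

def is_valid_transition(src: str, dst: str) -> bool:
    """True if a run may move directly from status src to status dst."""
    return dst in TRANSITIONS.get(src, set())
```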

Viewing results

On the analysis run page, you can:

  • View output files — Browse and download all files produced by the workflow
  • Read logs — View stdout/stderr from the workflow execution
  • View RO-Crate — Inspect the Research Object Crate metadata (provenance package)
  • Download all — Download the entire run output as a ZIP archive
  • View in JBrowse — For applicable outputs, open directly in the genome browser

RO-Crate provenance

Each completed analysis run is packaged as an RO-Crate — a standardised research object that bundles:

  • The CWL workflow definition
  • Input parameters
  • Output files
  • Execution metadata (timing, software versions, container images)
  • Links to ISA entities (study, process, samples)

RO-Crates are portable and can be inspected by any RO-Crate-compatible tool.
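
Because the crate is plain JSON-LD, it can also be inspected without special tooling. A minimal sketch that lists the files a crate's root dataset declares (the ro-crate-metadata.json filename and the @graph/hasPart layout come from the RO-Crate specification; the helper function is ours):

```python
import json
from pathlib import Path

def list_crate_files(crate_dir: str) -> list[str]:
    """Read ro-crate-metadata.json and return the identifiers of the
    entities the root dataset references via hasPart."""
    metadata = json.loads(
        Path(crate_dir, "ro-crate-metadata.json").read_text()
    )
    entities = {e["@id"]: e for e in metadata["@graph"]}
    # The metadata descriptor points at the root dataset via "about".
    root_id = entities["ro-crate-metadata.json"]["about"]["@id"]
    return [part["@id"] for part in entities[root_id].get("hasPart", [])]
```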

CWL workflows

All pipelines use the Common Workflow Language (CWL) v1.2. CWL is an open standard supported by multiple execution engines (cwltool, Toil, Arvados, CWL-Airflow). This means:

  • Workflows are portable — they can run on any CWL-compatible platform
  • Workflow definitions are human-readable YAML/JSON
  • Tools are containerised (Apptainer/Singularity) for reproducibility
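
For orientation, a minimal CWL v1.2 tool description looks like the following. This is an illustrative example, not one of the platform's pipelines:

```yaml
cwlVersion: v1.2
class: CommandLineTool
label: Count lines in a FASTQ file (illustrative example)
baseCommand: [wc, -l]
hints:
  DockerRequirement:
    dockerPull: debian:stable-slim   # run inside a container for reproducibility
inputs:
  reads:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
```

The same YAML runs unchanged under cwltool, Toil, or any other conformant engine, which is what makes the platform's workflows portable.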

Audit trail

Every pipeline state transition is logged automatically. Administrators can view the full history of a run — including who triggered it, when each state change occurred, and any error messages — in the Admin Dashboard → Activity tab.

Data integrity

When results are fetched from the compute server, each file is verified against the SHA-256 checksums recorded in the file manifest. Any mismatches are logged as warnings. The RO-Crate metadata also records checksums for all output files.
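
The fetch-time check amounts to comparing each output file's digest against its manifest entry. A sketch of that loop (the manifest format and function name are illustrative; the platform's manifest may differ):

```python
import hashlib
from pathlib import Path

def verify_manifest(result_dir: str, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths whose SHA-256 digest does not match the
    manifest entry - the files that would be logged as warnings."""
    mismatches = []
    for rel_path, expected in manifest.items():
        data = Path(result_dir, rel_path).read_bytes()
        if hashlib.sha256(data).hexdigest() != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```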

Storage management and archiving

Compute server storage is finite. When you no longer need a dataset's raw files for active analysis, you can archive it:

  1. Go to the dataset's page and click Archive
  2. Delete the file from the compute server (via SFTP or SSH) to free disk space
  3. The dataset record is preserved with its full provenance: checksum, source reference, ISA links, and any pipeline runs that used it

Archived datasets are excluded from the pipeline submission form but remain visible in the dataset list and in the provenance chain of any analysis that used them.

What does FAIR require?

FAIR (Findable, Accessible, Interoperable, Reusable) does not require storing every file forever — it requires that data can be found and re-obtained. When you archive a dataset:

  • Public data (SRA/ENA) — the accession is the provenance. Anyone can re-download it. The checksum lets you verify you got the same file.
  • Your own data — keep a copy on your local NAS or institutional storage. The dataset record's checksum lets you verify integrity if you re-upload later.
  • Collaborator data — ensure the source is documented in the source reference field before archiving.

To restore an archived dataset, re-upload the file to the same path and click Restore. The system will re-verify the checksum to ensure data integrity.

SFTP access

To upload data files for pipeline runs:

  1. Connect via SFTP to the compute server (host and path shown on the Datasets registration page)
  2. Upload your files to the uploads/ directory
  3. Go to Compute → Datasets → Register Dataset
  4. Use the Browse Uploads tab to find and select your file
  5. The system validates that the file exists, computes a SHA-256 checksum, and registers it

Custom workflows: If you need a workflow that isn't available as a pre-configured pipeline, contact the platform administrator. CWL workflows can be added to the system and made available to all users.