Libre Biotech

AI Readiness

How the platform makes research data consumable by AI/ML pipelines, with live adoption statistics and scoring details.

Libre Biotech is designed so your research data is structured for machine learning from day one. This page documents the AI-readiness features, how they are scored, and how each platform component contributes to both FAIR compliance and ML consumability.

Component overview

Each platform feature serves both FAIR data principles and AI/ML readiness. The cards below show live adoption statistics from the platform.

ISA Framework
FAIR: I, R

Structured investigation-study-assay hierarchy that ML parsers can traverse programmatically. Every investigation contains studies, which contain processes, which produce samples with annotations.

Details

The ISA (Investigation-Study-Assay) framework provides a standardised hierarchy for organising research metadata. Investigations group related studies, studies link to processes (lab work and analyses), and processes produce samples with ontology annotations. This structure means any ML pipeline can navigate from high-level project metadata down to individual sample features without custom parsing.

4 Investigations
Ontology Annotations
FAIR: F, I

Machine-readable vocabulary from 13 ontologies ensures consistent feature naming across datasets.

Details

Samples and other entities can be annotated with terms from 13 indexed ontologies (OBI, EFO, UBERON, NCBITaxon, UO, CLO, CHEBI, GO, SO, PATO, HP, ENVO, BAO) containing ~2.9M terms. Each annotation stores the ontology term ID (CURIE), label, and source, enabling ML pipelines to use standardised feature names rather than free-text labels. Annotations are organised by "slot" (e.g. organism, tissue, disease) and support both text values and numeric values with optional unit terms.

147 Annotated entities
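Because each annotation stores a CURIE, ML code can expand term IDs to full IRIs with a small prefix map. A minimal sketch (the map entries here follow the OBO Foundry PURL convention; in the ML-Ready export a map like this is provided as ontology_prefix_map):

```python
# Expand an ontology CURIE (e.g. "UBERON:0000955") to a full IRI.
# These two prefix entries follow the OBO Foundry PURL convention;
# in practice the map comes from the export's ontology_prefix_map.
ONTOLOGY_PREFIX_MAP = {
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand_curie(curie: str, prefix_map: dict[str, str]) -> str:
    """Split a CURIE into prefix and local ID, then join with the base IRI."""
    prefix, local_id = curie.split(":", 1)
    if prefix not in prefix_map:
        raise KeyError(f"Unknown ontology prefix: {prefix}")
    return prefix_map[prefix] + local_id

print(expand_curie("UBERON:0000955", ONTOLOGY_PREFIX_MAP))
# http://purl.obolibrary.org/obo/UBERON_0000955
```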
CWL Pipelines
FAIR: R

Containerised workflows with full parameter tracking provide computational reproducibility for any analysis.

Details

All analysis pipelines use Common Workflow Language (CWL) definitions executed via cwltool with Apptainer containers. Each analysis run records the workflow file, container image, input parameters (as JSON), and all output files. This means any result can be reproduced exactly — the same inputs, same container, same parameters will produce identical outputs. The platform tracks run status (queued, running, succeeded, failed) and links outputs back to the input samples.

74 Succeeded runs
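Because each run records the workflow file, container, and input parameters, the reproduction command can be reassembled mechanically. A sketch with assumed record field names (workflow_path and params are illustrative, not the platform's actual schema); cwltool's --singularity flag executes containers via Singularity/Apptainer:

```python
# Reassemble a cwltool command line from a recorded analysis run.
# Field names here are illustrative, not the platform's actual schema.
import json
import shlex

run_record = {
    "workflow_path": "workflows/rnaseq-qc.cwl",          # hypothetical path
    "params": {"reads": "sample_001.fastq.gz", "threads": 4},
}

def reproduction_command(record: dict) -> str:
    """Write the recorded inputs to a job file and build the cwltool call."""
    job_file = "job.json"
    with open(job_file, "w") as fh:
        json.dump(record["params"], fh, indent=2)
    # --singularity runs the workflow's containers via Singularity/Apptainer
    return shlex.join(["cwltool", "--singularity", record["workflow_path"], job_file])

print(reproduction_command(run_record))
# cwltool --singularity workflows/rnaseq-qc.cwl job.json
```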
RO-Crate Export
FAIR: F, A, I

Self-describing research object packages that include workflow definitions, inputs, outputs, and provenance metadata.

Details

Each completed analysis run can be exported as an RO-Crate (Research Object Crate) — a self-describing data package following the RO-Crate 1.1 specification. The package includes the CWL workflow definition, all input/output files, parameter JSON, container image reference, and W3C PROV-O provenance metadata in a single ZIP archive. This means an ML researcher can download one package and have everything needed to understand and reproduce the analysis.

107 RO-Crate archives
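A consumer can inspect a crate without unpacking the whole archive: per the RO-Crate 1.1 specification, the descriptor lives at ro-crate-metadata.json in the archive root. A sketch using a stand-in ZIP built in memory (real crates come from the platform export):

```python
# Peek at an RO-Crate's metadata descriptor without extracting the archive.
import io
import json
import zipfile

# Build a minimal stand-in crate for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ro-crate-metadata.json", json.dumps({
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [{"@id": "./", "@type": "Dataset", "name": "Example run"}],
    }))

# Read just the descriptor from the ZIP.
with zipfile.ZipFile(buf) as zf:
    metadata = json.loads(zf.read("ro-crate-metadata.json"))

# The root dataset entity has @id "./" per the spec.
root = next(e for e in metadata["@graph"] if e["@id"] == "./")
print(root["name"])  # Example run
```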
ISA-JSON/Tab Export
FAIR: I, A

Standard exchange formats (ISA-JSON, ISA-Tab) that feed directly into ML data loading pipelines.

Details

Investigations can be exported as ISA-JSON (following the ISA-JSON 1.0 specification) or ISA-Tab (tab-separated files in a ZIP). ISA-JSON includes full study metadata, protocols, materials (sources, samples, extracts), factor values, characteristic categories, assay data, and ontology source references. These formats are widely supported by bioinformatics tools and repositories (ENA, ArrayExpress, MetaboLights) and can be parsed by any JSON/TSV reader.

Always available
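An ML loader only needs to follow the nesting. A sketch over a heavily trimmed ISA-JSON fragment (real exports follow the ISA-JSON 1.0 specification and carry many more fields):

```python
# Walk a trimmed ISA-JSON document: investigation -> studies -> samples.
import json

isa_json = json.loads("""
{
  "title": "Mouse Transcriptomics",
  "studies": [
    {"title": "Brain study",
     "materials": {"samples": [{"name": "SAMPLE-001"}, {"name": "SAMPLE-002"}]}}
  ]
}
""")

# Collect every sample name across all studies.
sample_names = [
    s["name"]
    for study in isa_json["studies"]
    for s in study["materials"]["samples"]
]
print(sample_names)  # ['SAMPLE-001', 'SAMPLE-002']
```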
Sample Lineage
FAIR: R

Provenance chains link every derived sample back to its source, providing full context for training data curation.

Details

The process_input_samples table tracks parent-child relationships between samples across processes via the output_sample_id column. A tissue sample becomes an RNA extract, which becomes a library, which goes through sequencing — each step is a link in the chain. The API provides lineage queries (ancestors, descendants, full graph) via recursive CTEs capped at depth 10. This lets ML pipelines trace any derived feature back to the original biological material, filter by process category, or validate that samples share a common origin.

404 Lineage links
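The ancestor query can be sketched with a recursive CTE over process_input_samples, capped at depth 10 as the API does. The input-side column name used here (sample_id) is an assumption, since only output_sample_id is documented:

```python
# Recursive-CTE ancestor query over a toy process_input_samples table.
# The sample_id column name is assumed; only output_sample_id is documented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE process_input_samples (sample_id INT, output_sample_id INT);
    -- tissue(1) -> extract(2) -> library(3) -> sequenced(4)
    INSERT INTO process_input_samples VALUES (1, 2), (2, 3), (3, 4);
""")

rows = conn.execute("""
    WITH RECURSIVE ancestors(sample_id, depth) AS (
        SELECT sample_id, 1 FROM process_input_samples WHERE output_sample_id = ?
        UNION ALL
        SELECT p.sample_id, a.depth + 1
        FROM process_input_samples p
        JOIN ancestors a ON p.output_sample_id = a.sample_id
        WHERE a.depth < 10          -- depth cap, mirroring the API
    )
    SELECT sample_id, depth FROM ancestors ORDER BY depth
""", (4,)).fetchall()

print(rows)  # [(3, 1), (2, 2), (1, 3)]
```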
Quantitative Measurements
FAIR: I

Measurements annotated with Unit Ontology (UO) terms provide typed numeric features ready for ML feature vectors.

Details

Measurements are linked to samples via assays and store a measurement type (e.g. RQN, concentration, fragment_length), a numeric value, and an optional unit from the Unit Ontology (UO). This means ML pipelines get typed numeric columns with standardised units — no parsing "8.7 ng/uL" from free text. The ML-Ready export automatically discovers all measurement types across an investigation and creates one numeric column per type.

341 Measurements with units
REST API
FAIR: A

Programmatic access to all metadata, samples, annotations, and analysis outputs via authenticated endpoints.

Details

The REST API provides full CRUD access to investigations, processes, samples, and files. Additional endpoints support ISA-JSON export, ISA validation, provenance graphs, PROV-O export, sample lineage queries, and ML-Ready data export. Authentication uses API keys via the X-API-Key header. Rate limits are 300 requests/minute for authenticated users. See the API Reference for the complete endpoint list.

0 API keys issued
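A client staying under the 300 requests/minute limit can retry on HTTP 429 with backoff. This is a generic client-side pattern, not a documented platform behaviour; a stub stands in for the HTTP call so the sketch runs offline:

```python
# Retry-on-429 wrapper: a generic client-side backoff pattern.
import time

def get_with_retry(fetch, url, retries=3, wait=1.0):
    """Call fetch(url) until it returns a non-429 status or retries run out."""
    for attempt in range(retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        time.sleep(wait * (attempt + 1))   # linear backoff
    raise RuntimeError(f"Rate limited after {retries} attempts: {url}")

# Stub standing in for requests.get: rate-limited twice, then succeeds.
calls = iter([(429, ""), (429, ""), (200, "{}")])
status, body = get_with_retry(lambda url: next(calls), "/investigations/3", wait=0.0)
print(status)  # 200
```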
Sequencing QC
FAIR: R

Quality-controlled inputs across Illumina, Nanopore, and PacBio platforms ensure clean data enters ML pipelines.

Details

Three integrated QC dashboards (IlluminaQC, NanoporeQC, PacBioQC) track run quality metrics including yield, Q-scores, pass/fail rates, and instrument trends. QC runs can be linked to ISA processes, so ML pipelines can filter samples by sequencing quality before including them in training datasets. This prevents low-quality data from corrupting model training.

568 QC runs tracked
PROV-O Provenance
FAIR: F, R

W3C-standard provenance graphs (JSON-LD) that document the full history of every data transformation.

Details

The platform exports W3C PROV-O provenance as JSON-LD for both investigations and individual samples. The export includes prov:Activity nodes (processes and analysis runs), prov:Entity nodes (samples and files), and prov:Agent nodes (people), connected by prov:wasGeneratedBy, prov:used, prov:wasDerivedFrom, and prov:wasAssociatedWith relationships. This standard format can be loaded into any RDF store or knowledge graph.

Always available
Data Cards & Skill Files
FAIR: F, A, R

Auto-generated dataset cards and platform skill files that bridge human-readable and machine-readable metadata.

Details

Every investigation auto-generates a data card (YAML frontmatter + Markdown body) aggregating metadata, FAIR/AI-Ready scores, ontology stats, provenance depth, and API links — inspired by Hugging Face dataset cards. Workflow cards document CWL pipelines with execution stats. A platform-level skill file at /CLAUDE.md describes the entire platform for AI coding assistants. Data cards are bundled into ISA-Tab ZIP and ML-Ready ZIP exports, and ML-Ready ZIP includes a README describing columns, types, and loading code. Available via API at /api.php/v1/investigations/{id}/card and /api.php/v1/platform-card.

Always available
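A consumer can split a card into its YAML frontmatter and Markdown body with stdlib string handling. The card text below is illustrative; a real pipeline would hand the frontmatter to a YAML parser:

```python
# Split a "---"-delimited data card into frontmatter and body.
card = """---
title: Mouse Transcriptomics
fair_score: 82
---
# Mouse Transcriptomics

Auto-generated data card.
"""

def split_card(text: str) -> tuple[str, str]:
    """Return (frontmatter, body) for a card with leading '---' fences."""
    _, frontmatter, body = text.split("---\n", 2)
    return frontmatter.strip(), body.strip()

frontmatter, body = split_card(card)
print(frontmatter.splitlines()[0])  # title: Mouse Transcriptomics
```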

AI-Ready Score

Each investigation displays an AI-Ready Score card (0-100) alongside the FAIR Score in the investigation sidebar. The score measures how consumable the data is for ML pipelines across 8 independently scored dimensions.

Scoring dimensions

1. Machine-Readable Formats (max 12): Has studies (6 pts) and processes (6 pts), the minimum data needed for ISA-JSON or ML-Ready export. Source tables: studies, process_studies.
2. Structured Metadata (max 12): Has ontology annotations on samples (6 pts) and study factors defined (6 pts). Source tables: annotations, study_factors.
3. Data Provenance (max 12): Sample lineage depth (4-8 pts: depth 1-2 = 4 pts, depth 3+ = 8 pts) plus process count (2-4 pts). Source tables: process_input_samples (recursive CTE on output_sample_id, depth cap 10), process_studies.
4. Standard File Formats (max 12): Files with standard extensions (FASTQ, BAM, SAM, CRAM, CSV, TSV, GFF3, GFF, GTF, VCF, BED, BigWig, FASTA, JSON, XML); 1-4 files = 8 pts, 5+ = 12 pts. Source tables: process_files, files.
5. Computational Reproducibility (max 13): Succeeded CWL analysis runs (1-2 runs = 8 pts, 3+ = 13 pts). Source: analysis_runs WHERE status='succeeded'.
6. Quantitative Measurements (max 13): Measurements with Unit Ontology terms (1-4 = 8 pts, 5+ = 13 pts). Source: measurements JOIN assays WHERE unit_id IS NOT NULL.
7. Consistent Sample Labeling (max 12): >80% of labels match a common prefix pattern (3-6 pts) and >80% have descriptions (3-6 pts). Source: samples (regex heuristic on labels).
8. API Accessibility (max 13): Investigation visibility: public = 13 pts, group = 8 pts, private = 3 pts. Source: investigations.visibility.

Total maximum: 99 points (normalised to 0-100 scale).

Badge thresholds

Score     Badge       Colour
90-100    Excellent   Green
75-89     Good        Blue
50-74     Fair        Yellow
0-49      Needs Work  Red
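Normalisation and badge assignment can be sketched as follows (the exact rounding behaviour is an assumption):

```python
# Normalise the 99-point raw total to 0-100, then map to a badge tier.
def badge(raw_points: int, max_points: int = 99) -> str:
    """Round-to-nearest normalisation is assumed, not documented."""
    score = round(raw_points / max_points * 100)
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Needs Work"

print(badge(99))  # Excellent
print(badge(60))  # Fair
```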

Suggestions

Each missing criterion generates an actionable suggestion displayed beneath the score card. For example:

  • Metadata: Add ontology annotations to samples for structured metadata
  • Provenance: Add sample lineage (process_input_samples links) for provenance tracking
  • Reproducibility: Run CWL analysis pipelines for computational reproducibility
  • Accessibility: Set visibility to "public" for unrestricted programmatic access

How to improve your score

  • Machine-Readable Formats: Create at least one study and link processes to it.
  • Structured Metadata: Use the annotation panel on sample pages to add ontology terms (organism, tissue, disease). Define study factors (genotype, treatment) on the study page.
  • Data Provenance: Link samples as source materials using the "Link Sources" button on process pages. Build chains: tissue → extract → library → sequencing.
  • Standard File Formats: Attach FASTQ, BAM, CSV, GFF3, or VCF files to processes via the file upload panel.
  • Computational Reproducibility: Submit CWL analysis runs from the Compute → Analyses page and wait for them to succeed.
  • Quantitative Measurements: Record measurements with assays, selecting unit terms from the Unit Ontology (UO), e.g. nanogram, microliter, RQN.
  • Consistent Sample Labeling: Use a consistent naming pattern (e.g. MOUSE-BRAIN-001, MOUSE-BRAIN-002) and add descriptions to all samples.
  • API Accessibility: Set the investigation visibility to "public" on the investigation edit page.

ML-Ready Data Export

The ML-Ready export flattens all investigation data into a single samples × features matrix. This is the primary format for loading platform data into ML pipelines.

How it works

  1. Collects all samples across the investigation (via studies → processes → samples)
  2. Gathers ontology annotations for each sample, grouped by slot (organism, tissue, etc.)
  3. Gathers study factor values for each sample (genotype, treatment, etc.)
  4. Gathers quantitative measurements for each sample (RQN, concentration, etc.)
  5. Builds a provenance summary per sample using a recursive CTE on process_input_samples (depth capped at 10)
  6. Discovers all unique column names from the union of annotation slots, factor names, and measurement types
  7. Builds a flat rows matrix where each row is a sample and each column is a feature (nulls for missing values)
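Steps 6-7 can be sketched as a union of per-sample feature dicts followed by null-padded rows (the input shapes are illustrative):

```python
# Flatten per-sample feature dicts into a samples x features matrix.
per_sample = {
    "SAMPLE-001": {"annotation:organism": "Mus musculus", "measurement:RQN": 8.7},
    "SAMPLE-002": {"annotation:organism": "Mus musculus", "factor:genotype": "knockout"},
}

# Step 6: discover the union of all feature column names.
columns = sorted({k for feats in per_sample.values() for k in feats})

# Step 7: one row per sample, with None where a feature is missing.
rows = [
    [label] + [feats.get(col) for col in columns]
    for label, feats in per_sample.items()
]

print(["sample_label"] + columns)
print(rows[0])  # ['SAMPLE-001', 'Mus musculus', None, 8.7]
```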

Access methods

  • Web UI: Click the ML-Ready button on any investigation page; use the dropdown to choose CSV, JSON, or ZIP format.
  • REST API (JSON): GET /api.php/v1/investigations/{id}/ml-export
  • REST API (CSV): GET /api.php/v1/investigations/{id}/ml-export?format=csv
  • REST API (ZIP): GET /api.php/v1/investigations/{id}/ml-export?format=zip (CSV data plus README in a self-documenting ZIP bundle)

JSON schema

{
  "metadata": {
    "investigation_id": 3,
    "investigation_title": "Mouse Transcriptomics",
    "export_date": "2026-03-14",
    "sample_count": 42,
    "feature_count": 8,
    "ontology_prefix_map": {
      "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
      "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
    }
  },
  "columns": [
    {"name": "sample_id", "type": "identifier"},
    {"name": "sample_label", "type": "string"},
    {"name": "annotation:organism", "type": "ontology", "ontology_curie": "NCBITaxon:10090"},
    {"name": "annotation:anatomy", "type": "ontology", "ontology_curie": "UBERON:0000955"},
    {"name": "factor:genotype", "type": "categorical"},
    {"name": "measurement:RQN", "type": "numeric", "unit": "UO:0000186"}
  ],
  "rows": [
    [1, "SAMPLE-001", "Mus musculus", "brain", "wild-type", 8.7],
    [2, "SAMPLE-002", "Mus musculus", "liver", "knockout", 7.2]
  ],
  "provenance_summary": {
    "1": {"depth": 3, "process_chain": ["extraction", "sample_prep", "sequencing"]},
    "2": {"depth": 2, "process_chain": ["extraction", "sequencing"]}
  }
}

Column types

Type         Description                                     Prefix         Example
identifier   Unique integer ID                               (none)         sample_id
string       Free text                                       (none)         sample_label
ontology     Ontology term value with CURIE reference        annotation:    annotation:organism
categorical  Discrete category from study factors            factor:        factor:genotype
numeric      Numeric measurement with optional unit CURIE    measurement:   measurement:RQN

CSV format

The CSV export uses the column names from the JSON columns array as the header row. Null values are empty strings. Standard RFC 4180 CSV encoding.

sample_id,sample_label,annotation:organism,annotation:anatomy,factor:genotype,measurement:RQN
1,SAMPLE-001,Mus musculus,brain,wild-type,8.7
2,SAMPLE-002,Mus musculus,liver,knockout,7.2
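The empty-string nulls can be mapped back to None when loading with the stdlib csv module (data is inlined here for illustration; a real pipeline would read the downloaded file):

```python
# Read the CSV export, converting empty strings back to None
# to match the JSON export's nulls.
import csv
import io

csv_text = """sample_id,sample_label,measurement:RQN
1,SAMPLE-001,8.7
2,SAMPLE-002,
"""

reader = csv.DictReader(io.StringIO(csv_text))
records = [
    {k: (v if v != "" else None) for k, v in row.items()}
    for row in reader
]
print(records[1])  # {'sample_id': '2', 'sample_label': 'SAMPLE-002', 'measurement:RQN': None}
```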

Code examples

Python: load into pandas

import requests
import pandas as pd

API = "https://librebiotech.org/api.php/v1"
headers = {"X-API-Key": "YOUR_KEY"}

# Option 1: JSON
resp = requests.get(f"{API}/investigations/3/ml-export", headers=headers)
data = resp.json()
df = pd.DataFrame(data["rows"], columns=[c["name"] for c in data["columns"]])

# Option 2: CSV (simpler)
from io import StringIO
resp = requests.get(f"{API}/investigations/3/ml-export?format=csv", headers=headers)
df = pd.read_csv(StringIO(resp.text))

# Access provenance (from the JSON response parsed in Option 1)
provenance = data["provenance_summary"]
print(f"Sample 1 lineage depth: {provenance['1']['depth']}")
print(f"Process chain: {provenance['1']['process_chain']}")

R: load into tibble

library(httr)
library(jsonlite)
library(tibble)

api <- "https://librebiotech.org/api.php/v1"
key <- "YOUR_KEY"

resp <- GET(paste0(api, "/investigations/3/ml-export"),
            add_headers("X-API-Key" = key))
data <- fromJSON(content(resp, "text", encoding = "UTF-8"))

# fromJSON simplifies "rows" to a matrix and "columns" to a data frame,
# so the column names are available directly as data$columns$name
df <- as_tibble(data$rows, .name_repair = "minimal")
colnames(df) <- data$columns$name
print(df)

Python: scikit-learn pipeline

import requests
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

API = "https://librebiotech.org/api.php/v1"
headers = {"X-API-Key": "YOUR_KEY"}

resp = requests.get(f"{API}/investigations/3/ml-export", headers=headers)
data = resp.json()
df = pd.DataFrame(data["rows"], columns=[c["name"] for c in data["columns"]])

# Separate features by type
numeric_cols = [c["name"] for c in data["columns"] if c["type"] == "numeric"]
categorical_cols = [c["name"] for c in data["columns"] if c["type"] == "categorical"]
ontology_cols = [c["name"] for c in data["columns"] if c["type"] == "ontology"]

# Encode categoricals
for col in categorical_cols + ontology_cols:
    if col in df.columns:
        df[col] = LabelEncoder().fit_transform(df[col].fillna("unknown"))

# Scale numerics
if numeric_cols:
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols].fillna(0))

FAIR Score comparison

The AI-Ready Score complements the existing FAIR Score. Both are shown in the investigation sidebar:

  • Focus: FAIR Score covers data findability, accessibility, interoperability, and reusability; AI-Ready Score covers ML/AI consumability of the data.
  • Dimensions: FAIR has 4 (Findable, Accessible, Interoperable, Reusable); AI-Ready has 8 (formats, metadata, provenance, files, reproducibility, measurements, labeling, accessibility).
  • Scale: FAIR is 0-100 (average of 4 sub-scores); AI-Ready is 0-100 (sum of 8 dimension scores).
  • Checks: FAIR checks metadata completeness, DOI, contacts, licenses, protocols; AI-Ready checks ontology annotations, lineage depth, file formats, CWL runs, unit measurements, label consistency.
  • Audience: FAIR targets data managers, repository curators, compliance officers; AI-Ready targets ML engineers, AI researchers, computational biologists.

For AI/ML Researchers

If you are an AI/ML researcher looking to consume data from this platform, visit the dedicated For AI/ML Researchers page for API quickstart guides, endpoint documentation, code examples in Python/R/curl, and details on data formats and ontology coverage.