AI Readiness
How the platform makes research data consumable by AI/ML pipelines, with live adoption statistics and scoring details.
Libre Biotech is designed so your research data is structured for machine learning from day one. This page documents the AI-readiness features, how they are scored, and how each platform component contributes to both FAIR compliance and ML consumability.
Component overview
Each platform feature serves both FAIR data principles and AI/ML readiness. The cards below show live adoption statistics from the platform.
ISA Framework
Structured investigation-study-assay hierarchy that ML parsers can traverse programmatically. Every investigation contains studies, which contain processes, which produce samples with annotations.
The ISA (Investigation-Study-Assay) framework provides a standardised hierarchy for organising research metadata. Investigations group related studies, studies link to processes (lab work and analyses), and processes produce samples with ontology annotations. This structure means any ML pipeline can navigate from high-level project metadata down to individual sample features without custom parsing.
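The traversal described above can be sketched in a few lines. The dict shape here is illustrative, not the platform's exact export schema:

```python
# Sketch: walk an ISA-style hierarchy (investigation -> studies ->
# processes -> samples) without custom parsing. Key names are
# illustrative stand-ins for the real export fields.
def iter_samples(investigation):
    """Yield (study_title, process_name, sample) for every sample."""
    for study in investigation.get("studies", []):
        for process in study.get("processes", []):
            for sample in process.get("samples", []):
                yield study["title"], process["name"], sample

investigation = {
    "title": "Mouse Transcriptomics",
    "studies": [
        {"title": "Pilot", "processes": [
            {"name": "extraction", "samples": [{"label": "SAMPLE-001"}]},
        ]},
    ],
}

for study, process, sample in iter_samples(investigation):
    print(study, process, sample["label"])
```

Because every level is a plain list of objects, the same loop works for one investigation or a thousand.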
Ontology Annotations
Machine-readable vocabulary from 13 ontologies ensures consistent feature naming across datasets.
Samples and other entities can be annotated with terms from 13 indexed ontologies (OBI, EFO, UBERON, NCBITaxon, UO, CLO, CHEBI, GO, SO, PATO, HP, ENVO, BAO) containing ~2.9M terms. Each annotation stores the ontology term ID (CURIE), label, and source, enabling ML pipelines to use standardised feature names rather than free-text labels. Annotations are organised by "slot" (e.g. organism, tissue, disease) and support both text values and numeric values with optional unit terms.
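A CURIE such as NCBITaxon:10090 can be expanded to a full term IRI with a prefix map, as a minimal sketch (the map below mirrors the ontology_prefix_map shown later in the ML-export metadata):

```python
# Sketch: expand a CURIE ("NCBITaxon:10090") to an OBO PURL using a
# prefix map like the one in the ML-export metadata.
PREFIX_MAP = {
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand_curie(curie):
    prefix, local_id = curie.split(":", 1)
    return PREFIX_MAP[prefix] + local_id

print(expand_curie("NCBITaxon:10090"))
# http://purl.obolibrary.org/obo/NCBITaxon_10090
```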
CWL Pipelines
Containerised workflows with full parameter tracking provide computational reproducibility for any analysis.
All analysis pipelines use Common Workflow Language (CWL) definitions executed via cwltool with Apptainer containers. Each analysis run records the workflow file, container image, input parameters (as JSON), and all output files. This means any result can be reproduced exactly — the same inputs, same container, same parameters will produce identical outputs. The platform tracks run status (queued, running, succeeded, failed) and links outputs back to the input samples.
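As a sketch of what reproducing a run locally looks like: cwltool's `--singularity` flag selects the Singularity/Apptainer container runtime. The helper and file names below are illustrative, not part of the platform:

```python
# Sketch: build the cwltool command line for replaying a recorded run.
# Assumes cwltool's --singularity flag (Apptainer-compatible runtime);
# workflow/input file names come from the run record.
def cwltool_argv(workflow, inputs, use_containers=True):
    argv = ["cwltool"]
    if use_containers:
        argv.append("--singularity")
    return argv + [workflow, inputs]

print(cwltool_argv("workflow.cwl", "inputs.json"))
# ['cwltool', '--singularity', 'workflow.cwl', 'inputs.json']
```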
RO-Crate Export
Self-describing research object packages that include workflow definitions, inputs, outputs, and provenance metadata.
Each completed analysis run can be exported as an RO-Crate (Research Object Crate) — a self-describing data package following the RO-Crate 1.1 specification. The package includes the CWL workflow definition, all input/output files, parameter JSON, container image reference, and W3C PROV-O provenance metadata in a single ZIP archive. This means an ML researcher can download one package and have everything needed to understand and reproduce the analysis.
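Reading a crate needs only the standard library, since the metadata file sits at the archive root per the RO-Crate 1.1 layout. A minimal sketch (the entity IDs below are illustrative):

```python
# Sketch: read ro-crate-metadata.json from a crate ZIP and index the
# entities it describes. The tiny in-memory crate stands in for a real
# export download.
import io
import json
import zipfile

def crate_entities(zip_bytes):
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        meta = json.loads(zf.read("ro-crate-metadata.json"))
    return {e["@id"]: e.get("@type") for e in meta["@graph"]}

# Build a tiny illustrative crate to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ro-crate-metadata.json", json.dumps({
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {"@id": "./", "@type": "Dataset"},
            {"@id": "workflow.cwl", "@type": "ComputationalWorkflow"},
        ],
    }))

print(crate_entities(buf.getvalue()))
```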
ISA-JSON/Tab Export
Standard exchange formats (ISA-JSON, ISA-Tab) that feed directly into ML data loading pipelines.
Investigations can be exported as ISA-JSON (following the ISA-JSON 1.0 specification) or ISA-Tab (tab-separated files in a ZIP). ISA-JSON includes full study metadata, protocols, materials (sources, samples, extracts), factor values, characteristic categories, assay data, and ontology source references. These formats are widely supported by bioinformatics tools and repositories (ENA, ArrayExpress, MetaboLights) and can be parsed by any JSON/TSV reader.
Sample Lineage
Provenance chains link every derived sample back to its source, providing full context for training data curation.
The process_input_samples table tracks parent-child relationships between samples across processes via the output_sample_id column. A tissue sample becomes an RNA extract, which becomes a library, which goes through sequencing — each step is a link in the chain. The API provides lineage queries (ancestors, descendants, full graph) via recursive CTEs capped at depth 10. This lets ML pipelines trace any derived feature back to the original biological material, filter by process category, or validate that samples share a common origin.
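The depth-capped recursive query can be sketched with SQLite. The output_sample_id column name comes from the text above; input_sample_id and the single-table schema are assumptions for illustration:

```python
# Sketch: ancestor lookup over process_input_samples via a recursive CTE
# capped at depth 10, mirroring the lineage API. Minimal stand-in schema;
# input_sample_id is an assumed column name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE process_input_samples (
    input_sample_id INTEGER, output_sample_id INTEGER)""")
# tissue (1) -> RNA extract (2) -> library (3) -> sequenced sample (4)
conn.executemany("INSERT INTO process_input_samples VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4)])

def ancestors(sample_id):
    """Return [(ancestor_id, depth), ...] ordered nearest-first."""
    return conn.execute("""
        WITH RECURSIVE lineage(sample_id, depth) AS (
            SELECT input_sample_id, 1 FROM process_input_samples
             WHERE output_sample_id = ?
            UNION ALL
            SELECT p.input_sample_id, l.depth + 1
              FROM process_input_samples p
              JOIN lineage l ON p.output_sample_id = l.sample_id
             WHERE l.depth < 10
        )
        SELECT sample_id, depth FROM lineage ORDER BY depth""",
        (sample_id,)).fetchall()

print(ancestors(4))  # [(3, 1), (2, 2), (1, 3)]
```

The `depth < 10` guard is what keeps the query bounded even if the data ever contained a cycle.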
Quantitative Measurements
Measurements annotated with Unit Ontology (UO) terms provide typed numeric features ready for ML feature vectors.
Measurements are linked to samples via assays and store a measurement type (e.g. RQN, concentration, fragment_length), a numeric value, and an optional unit from the Unit Ontology (UO). This means ML pipelines get typed numeric columns with standardised units — no parsing "8.7 ng/uL" from free text. The ML-Ready export automatically discovers all measurement types across an investigation and creates one numeric column per type.
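The per-type column discovery can be sketched as a pivot over typed records. The record shape is illustrative; the UO:0000186 unit CURIE is taken from the export example later on this page:

```python
# Sketch: pivot typed measurement records into one numeric column per
# measurement type, as the ML-Ready export does. Record shapes are
# illustrative; the concentration unit is omitted here.
from collections import defaultdict

measurements = [
    {"sample_id": 1, "type": "RQN", "value": 8.7, "unit": "UO:0000186"},
    {"sample_id": 1, "type": "concentration", "value": 42.0, "unit": None},
    {"sample_id": 2, "type": "RQN", "value": 7.2, "unit": "UO:0000186"},
]

def pivot(measurements):
    cols = defaultdict(dict)
    for m in measurements:
        cols[f"measurement:{m['type']}"][m["sample_id"]] = m["value"]
    return dict(cols)

print(pivot(measurements))
```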
REST API
Programmatic access to all metadata, samples, annotations, and analysis outputs via authenticated endpoints.
The REST API provides full CRUD access to investigations, processes, samples, and files. Additional endpoints support ISA-JSON export, ISA validation, provenance graphs, PROV-O export, sample lineage queries, and ML-Ready data export. Authentication uses API keys via the X-API-Key header. Rate limits are 300 requests/minute for authenticated users. See the API Reference for the complete endpoint list.
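A client should expect HTTP 429 when it exceeds the 300 requests/minute limit. A minimal sketch of a retrying GET; honouring a Retry-After header is an assumption about the server's rate-limit response:

```python
# Sketch: authenticated GET with the X-API-Key header and a simple retry
# loop for HTTP 429 responses. The Retry-After fallback of 2 seconds is
# an arbitrary choice.
import time

API = "https://librebiotech.org/api.php/v1"

def api_get(path, api_key, get=None, retries=3):
    if get is None:  # default to requests.get, imported lazily
        import requests
        get = requests.get
    for _ in range(retries):
        resp = get(API + path, headers={"X-API-Key": api_key})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(int(resp.headers.get("Retry-After", 2)))
    raise RuntimeError(f"still rate-limited after {retries} attempts")
```

Passing `get` explicitly also makes the helper easy to test with a stubbed response.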
Sequencing QC
Quality-controlled inputs across Illumina, Nanopore, and PacBio platforms ensure clean data enters ML pipelines.
Three integrated QC dashboards (IlluminaQC, NanoporeQC, PacBioQC) track run quality metrics including yield, Q-scores, pass/fail rates, and instrument trends. QC runs can be linked to ISA processes, so ML pipelines can filter samples by sequencing quality before including them in training datasets. This prevents low-quality data from corrupting model training.
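The filtering step can be sketched as a join between samples and their linked QC runs. The field names (passed, mean_q_score) and record shapes are illustrative, not the dashboards' real schema:

```python
# Sketch: drop samples whose linked sequencing run failed QC before they
# enter a training set. All field names here are illustrative.
qc_runs = {
    "RUN-1": {"mean_q_score": 34.1, "passed": True},
    "RUN-2": {"mean_q_score": 18.9, "passed": False},
}
samples = [
    {"label": "SAMPLE-001", "run": "RUN-1"},
    {"label": "SAMPLE-002", "run": "RUN-2"},
]

clean = [s for s in samples if qc_runs[s["run"]]["passed"]]
print([s["label"] for s in clean])  # ['SAMPLE-001']
```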
PROV-O Provenance
W3C-standard provenance graphs (JSON-LD) that document the full history of every data transformation.
The platform exports W3C PROV-O provenance as JSON-LD for both investigations and individual samples. The export includes prov:Activity nodes (processes and analysis runs), prov:Entity nodes (samples and files), and prov:Agent nodes (people), connected by prov:wasGeneratedBy, prov:used, prov:wasDerivedFrom, and prov:wasAssociatedWith relationships. This standard format can be loaded into any RDF store or knowledge graph.
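Because the export is plain JSON-LD, simple inspection needs no RDF tooling. A sketch that tallies node types in a small illustrative graph:

```python
# Sketch: tally node types in a PROV-O JSON-LD export using only the
# standard library. The @graph content below is illustrative.
import json
from collections import Counter

doc = json.loads("""{
  "@context": {"prov": "http://www.w3.org/ns/prov#"},
  "@graph": [
    {"@id": "sample/1", "@type": "prov:Entity"},
    {"@id": "run/7", "@type": "prov:Activity"},
    {"@id": "sample/1", "prov:wasGeneratedBy": {"@id": "run/7"}}
  ]
}""")

counts = Counter(node.get("@type", "(edge only)") for node in doc["@graph"])
print(counts)
```

For graph queries, the same document can instead be loaded into any JSON-LD-aware RDF store.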
Data Cards & Skill Files
Auto-generated dataset cards and platform skill files that bridge human-readable and machine-readable metadata.
Every investigation auto-generates a data card (YAML frontmatter + Markdown body) that aggregates metadata, FAIR/AI-Ready scores, ontology stats, provenance depth, and API links, inspired by Hugging Face dataset cards. Workflow cards document CWL pipelines with execution stats, and a platform-level skill file at /CLAUDE.md describes the entire platform for AI coding assistants. Data cards are bundled into the ISA-Tab and ML-Ready ZIP exports; the ML-Ready ZIP also includes a README describing columns, types, and loading code. Cards are available via the API at /api.php/v1/investigations/{id}/card and /api.php/v1/platform-card.
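Splitting a card into its YAML frontmatter and Markdown body takes a few lines of string handling. A minimal sketch; the card text and field names below are illustrative:

```python
# Sketch: separate a data card's YAML frontmatter from its Markdown
# body. The frontmatter is delimited by "---" lines, as in Hugging Face
# dataset cards; the fields shown are illustrative.
def split_card(text):
    assert text.startswith("---\n"), "card must open with a frontmatter fence"
    frontmatter, _, body = text[4:].partition("\n---\n")
    return frontmatter, body.lstrip()

card = "---\ntitle: Mouse Transcriptomics\nai_ready_score: 87\n---\n\n# Overview\n"
fm, body = split_card(card)
print(fm)
```

The frontmatter string can then be handed to any YAML parser.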
AI-Ready Score
Each investigation displays an AI-Ready Score card (0-100) alongside the FAIR Score in the investigation sidebar. The score measures how consumable the data is for ML pipelines across 8 independently scored dimensions.
Scoring dimensions
| # | Dimension | Max | What it checks | Source tables |
|---|---|---|---|---|
| 1 | Machine-Readable Formats | 12 | Has studies (6 pts) and processes (6 pts) — the minimum data needed for ISA-JSON or ML-Ready export | studies, process_studies |
| 2 | Structured Metadata | 12 | Has ontology annotations on samples (6 pts) and study factors defined (6 pts) | annotations, study_factors |
| 3 | Data Provenance | 12 | Sample lineage depth (4-8 pts: depth 1-2 = 4 pts, depth 3+ = 8 pts) plus process count (2-4 pts) | process_input_samples (recursive CTE on output_sample_id, depth cap 10), process_studies |
| 4 | Standard File Formats | 12 | Files with standard extensions: FASTQ, BAM, SAM, CRAM, CSV, TSV, GFF3, GFF, GTF, VCF, BED, BigWig, FASTA, JSON, XML (1-4 files = 8 pts, 5+ = 12 pts) | process_files, files |
| 5 | Computational Reproducibility | 13 | Succeeded CWL analysis runs (1-2 runs = 8 pts, 3+ = 13 pts) | analysis_runs WHERE status='succeeded' |
| 6 | Quantitative Measurements | 13 | Measurements with Unit Ontology terms (1-4 = 8 pts, 5+ = 13 pts) | measurements JOIN assays WHERE unit_id IS NOT NULL |
| 7 | Consistent Sample Labeling | 12 | >80% labels match a common prefix pattern (3-6 pts) and >80% have descriptions (3-6 pts) | samples — regex heuristic on labels |
| 8 | API Accessibility | 13 | Investigation visibility: public = 13 pts, group = 8 pts, private = 3 pts | investigations.visibility |
Total maximum: 99 points (normalised to 0-100 scale).
Badge thresholds
| Score | Badge | Colour |
|---|---|---|
| 90-100 | Excellent | Green |
| 75-89 | Good | Blue |
| 50-74 | Fair | Yellow |
| 0-49 | Needs Work | Red |
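The normalisation and badge mapping can be sketched directly from the tables above. The rounding rule is an assumption; the platform may truncate instead:

```python
# Sketch: map a raw dimension sum (max 99) onto the 0-100 scale, then to
# a badge using the documented thresholds. Rounding is an assumption.
def normalise(raw, max_raw=99):
    return round(raw / max_raw * 100)

def badge(score):
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Needs Work"

print(badge(normalise(87)))  # Good
```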
Suggestions
Each missing criterion generates an actionable suggestion displayed beneath the score card. For example:
- Metadata: Add ontology annotations to samples for structured metadata
- Provenance: Add sample lineage (process_input_samples links) for provenance tracking
- Reproducibility: Run CWL analysis pipelines for computational reproducibility
- Accessibility: Set visibility to "public" for unrestricted programmatic access
How to improve your score
| To improve | Action |
|---|---|
| Machine-Readable Formats | Create at least one study and link processes to it |
| Structured Metadata | Use the annotation panel on sample pages to add ontology terms (organism, tissue, disease). Define study factors (genotype, treatment) on the study page |
| Data Provenance | Link samples as source materials using the "Link Sources" button on process pages. Build chains: tissue → extract → library → sequencing |
| Standard File Formats | Attach FASTQ, BAM, CSV, GFF3, or VCF files to processes via the file upload panel |
| Computational Reproducibility | Submit CWL analysis runs from the Compute → Analyses page and wait for them to succeed |
| Quantitative Measurements | Record measurements with assays, selecting unit terms from the Unit Ontology (UO) — e.g. nanogram, microliter, RQN |
| Consistent Sample Labeling | Use a consistent naming pattern (e.g. MOUSE-BRAIN-001, MOUSE-BRAIN-002) and add descriptions to all samples |
| API Accessibility | Set the investigation visibility to "public" on the investigation edit page |
ML-Ready Data Export
The ML-Ready export flattens all investigation data into a single samples × features matrix. This is the primary format for loading platform data into ML pipelines.
How it works
- Collects all samples across the investigation (via studies → processes → samples)
- Gathers ontology annotations for each sample, grouped by slot (organism, tissue, etc.)
- Gathers study factor values for each sample (genotype, treatment, etc.)
- Gathers quantitative measurements for each sample (RQN, concentration, etc.)
- Builds a provenance summary per sample using a recursive CTE on process_input_samples (depth capped at 10)
- Discovers all unique column names from the union of annotation slots, factor names, and measurement types
- Builds a flat rows matrix where each row is a sample and each column is a feature (nulls for missing values)
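The steps above can be sketched as a single flattening function (sample shapes are illustrative):

```python
# Sketch of the flattening steps: discover the union of feature names
# across samples, then emit one row per sample with None for missing
# values. Data shapes are illustrative.
def flatten(samples):
    columns = sorted({key for s in samples for key in s["features"]})
    rows = [
        [s["sample_id"], s["label"]] + [s["features"].get(c) for c in columns]
        for s in samples
    ]
    return ["sample_id", "sample_label"] + columns, rows

samples = [
    {"sample_id": 1, "label": "SAMPLE-001",
     "features": {"annotation:organism": "Mus musculus", "measurement:RQN": 8.7}},
    {"sample_id": 2, "label": "SAMPLE-002",
     "features": {"annotation:organism": "Mus musculus"}},
]
cols, rows = flatten(samples)
print(cols)
print(rows)
```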
Access methods
| Method | How |
|---|---|
| Web UI | Click the ML-Ready button on any investigation page. Use the dropdown to choose CSV, JSON, or ZIP format |
| REST API (JSON) | GET /api.php/v1/investigations/{id}/ml-export |
| REST API (CSV) | GET /api.php/v1/investigations/{id}/ml-export?format=csv |
| REST API (ZIP) | GET /api.php/v1/investigations/{id}/ml-export?format=zip — CSV data + README in a self-documenting ZIP bundle |
JSON schema
```json
{
  "metadata": {
    "investigation_id": 3,
    "investigation_title": "Mouse Transcriptomics",
    "export_date": "2026-03-14",
    "sample_count": 42,
    "feature_count": 8,
    "ontology_prefix_map": {
      "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
      "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
    }
  },
  "columns": [
    {"name": "sample_id", "type": "identifier"},
    {"name": "sample_label", "type": "string"},
    {"name": "annotation:organism", "type": "ontology", "ontology_curie": "NCBITaxon:10090"},
    {"name": "annotation:anatomy", "type": "ontology", "ontology_curie": "UBERON:0000955"},
    {"name": "factor:genotype", "type": "categorical"},
    {"name": "measurement:RQN", "type": "numeric", "unit": "UO:0000186"}
  ],
  "rows": [
    [1, "SAMPLE-001", "Mus musculus", "brain", "wild-type", 8.7],
    [2, "SAMPLE-002", "Mus musculus", "liver", "knockout", 7.2]
  ],
  "provenance_summary": {
    "1": {"depth": 3, "process_chain": ["extraction", "sample_prep", "sequencing"]},
    "2": {"depth": 2, "process_chain": ["extraction", "sequencing"]}
  }
}
```
Column types
| Type | Description | Prefix | Example |
|---|---|---|---|
| identifier | Unique integer ID | — | sample_id |
| string | Free text | — | sample_label |
| ontology | Ontology term value with CURIE reference | annotation: | annotation:organism |
| categorical | Discrete category from study factors | factor: | factor:genotype |
| numeric | Numeric measurement with optional unit CURIE | measurement: | measurement:RQN |
CSV format
The CSV export uses the column names from the JSON columns array as the header row. Null values are empty strings. Standard RFC 4180 CSV encoding.
```csv
sample_id,sample_label,annotation:organism,annotation:anatomy,factor:genotype,measurement:RQN
1,SAMPLE-001,Mus musculus,brain,wild-type,8.7
2,SAMPLE-002,Mus musculus,liver,knockout,7.2
```
Code examples
Python: load into pandas
```python
import requests
import pandas as pd
from io import StringIO

API = "https://librebiotech.org/api.php/v1"
headers = {"X-API-Key": "YOUR_KEY"}

# Option 1: JSON
resp = requests.get(f"{API}/investigations/3/ml-export", headers=headers)
data = resp.json()
df = pd.DataFrame(data["rows"], columns=[c["name"] for c in data["columns"]])

# Option 2: CSV (simpler)
resp = requests.get(f"{API}/investigations/3/ml-export?format=csv", headers=headers)
df = pd.read_csv(StringIO(resp.text))

# Access provenance (from the JSON export in Option 1)
provenance = data["provenance_summary"]
print(f"Sample 1 lineage depth: {provenance['1']['depth']}")
print(f"Process chain: {provenance['1']['process_chain']}")
```
R: load into tibble
```r
library(httr)
library(jsonlite)
library(tibble)

api <- "https://librebiotech.org/api.php/v1"
key <- "YOUR_KEY"

resp <- GET(paste0(api, "/investigations/3/ml-export"),
            add_headers("X-API-Key" = key))
data <- fromJSON(content(resp, "text", encoding = "UTF-8"))

# fromJSON() simplifies "columns" to a data frame, so the names are a
# plain character vector; mixed-type "rows" become a character matrix,
# so coerce numeric columns (e.g. measurement:RQN) afterwards as needed
df <- as_tibble(data$rows, .name_repair = "minimal")
colnames(df) <- data$columns$name
print(df)
```
Python: scikit-learn pipeline
```python
import requests
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

API = "https://librebiotech.org/api.php/v1"
headers = {"X-API-Key": "YOUR_KEY"}

resp = requests.get(f"{API}/investigations/3/ml-export", headers=headers)
data = resp.json()
df = pd.DataFrame(data["rows"], columns=[c["name"] for c in data["columns"]])

# Separate features by type
numeric_cols = [c["name"] for c in data["columns"] if c["type"] == "numeric"]
categorical_cols = [c["name"] for c in data["columns"] if c["type"] == "categorical"]
ontology_cols = [c["name"] for c in data["columns"] if c["type"] == "ontology"]

# Encode categoricals
for col in categorical_cols + ontology_cols:
    if col in df.columns:
        df[col] = LabelEncoder().fit_transform(df[col].fillna("unknown"))

# Scale numerics
if numeric_cols:
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols].fillna(0))
```
FAIR Score comparison
The AI-Ready Score complements the existing FAIR Score. Both are shown in the investigation sidebar:
| Aspect | FAIR Score | AI-Ready Score |
|---|---|---|
| Focus | Data findability, accessibility, interoperability, reusability | ML/AI consumability of the data |
| Dimensions | 4 (Findable, Accessible, Interoperable, Reusable) | 8 (formats, metadata, provenance, files, reproducibility, measurements, labeling, accessibility) |
| Scale | 0-100 (average of 4 sub-scores) | 0-100 (sum of 8 dimension scores) |
| Checks | Metadata completeness, DOI, contacts, licenses, protocols | Ontology annotations, lineage depth, file formats, CWL runs, unit measurements, label consistency |
| Audience | Data managers, repository curators, compliance officers | ML engineers, AI researchers, computational biologists |
For AI/ML Researchers
If you are an AI/ML researcher looking to consume data from this platform, visit the dedicated For AI/ML Researchers page for API quickstart guides, endpoint documentation, code examples in Python/R/curl, and details on data formats and ontology coverage.