AI Readiness
How the platform makes research data consumable by AI/ML pipelines, with live adoption statistics and scoring details.
Libre Biotech is designed so your research data is structured for machine learning from day one. This page documents the AI-readiness features, how they are scored, and how each platform component contributes to both FAIR compliance and ML consumability.
Component overview
Each platform feature serves both FAIR data principles and AI/ML readiness. The cards below show live adoption statistics from the platform.
ISA Framework
Structured investigation-study-assay hierarchy that ML parsers can traverse programmatically. Every investigation contains studies, which contain processes, which produce samples with annotations.
The ISA (Investigation-Study-Assay) framework provides a standardised hierarchy for organising research metadata. Investigations group related studies, studies link to processes (lab work and analyses), and processes produce samples with ontology annotations. This structure means any ML pipeline can navigate from high-level project metadata down to individual sample features without custom parsing.
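The traversal described above can be sketched in a few lines. The dict shape here is illustrative, not the platform's exact export schema:

```python
# Sketch: walk an ISA-style hierarchy (investigation -> studies ->
# processes -> samples) without custom parsing. Key names are
# illustrative stand-ins for the real export fields.
def iter_samples(investigation):
    """Yield (study_title, process_name, sample) for every sample."""
    for study in investigation.get("studies", []):
        for process in study.get("processes", []):
            for sample in process.get("samples", []):
                yield study["title"], process["name"], sample

investigation = {
    "title": "Mouse Transcriptomics",
    "studies": [
        {"title": "Pilot", "processes": [
            {"name": "extraction", "samples": [{"label": "SAMPLE-001"}]},
        ]},
    ],
}

for study, process, sample in iter_samples(investigation):
    print(study, process, sample["label"])
```

Because every level is a plain list of objects, the same loop works for one investigation or a thousand.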
Ontology Annotations
Machine-readable vocabulary from 13 ontologies ensures consistent feature naming across datasets.
Samples and other entities can be annotated with terms from 13 indexed ontologies (OBI, EFO, UBERON, NCBITaxon, UO, CLO, CHEBI, GO, SO, PATO, HP, ENVO, BAO) containing ~2.9M terms. Each annotation stores the ontology term ID (CURIE), label, and source, enabling ML pipelines to use standardised feature names rather than free-text labels. Annotations are organised by "slot" (e.g. organism, tissue, disease) and support both text values and numeric values with optional unit terms.
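A CURIE such as NCBITaxon:10090 can be expanded to a full term IRI with a prefix map, as a minimal sketch (the map below mirrors the ontology_prefix_map shown later in the ML-export metadata):

```python
# Sketch: expand a CURIE ("NCBITaxon:10090") to an OBO PURL using a
# prefix map like the one in the ML-export metadata.
PREFIX_MAP = {
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand_curie(curie):
    prefix, local_id = curie.split(":", 1)
    return PREFIX_MAP[prefix] + local_id

print(expand_curie("NCBITaxon:10090"))
# http://purl.obolibrary.org/obo/NCBITaxon_10090
```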
CWL Pipelines
Containerised workflows with full parameter tracking provide computational reproducibility for any analysis.
All analysis pipelines use Common Workflow Language (CWL) definitions executed via cwltool with Apptainer containers. Each analysis run records the workflow file, container image, input parameters (as JSON), and all output files. This means any result can be reproduced exactly — the same inputs, same container, same parameters will produce identical outputs. The platform tracks run status (queued, running, succeeded, failed) and links outputs back to the input samples.
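As a sketch of what reproducing a run locally looks like: cwltool's `--singularity` flag selects the Singularity/Apptainer container runtime. The helper and file names below are illustrative, not part of the platform:

```python
# Sketch: build the cwltool command line for replaying a recorded run.
# Assumes cwltool's --singularity flag (Apptainer-compatible runtime);
# workflow/input file names come from the run record.
def cwltool_argv(workflow, inputs, use_containers=True):
    argv = ["cwltool"]
    if use_containers:
        argv.append("--singularity")
    return argv + [workflow, inputs]

print(cwltool_argv("workflow.cwl", "inputs.json"))
# ['cwltool', '--singularity', 'workflow.cwl', 'inputs.json']
```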
RO-Crate Export
Self-describing research object packages that include workflow definitions, inputs, outputs, and provenance metadata.
Each completed analysis run can be exported as an RO-Crate (Research Object Crate) — a self-describing data package following the RO-Crate 1.1 specification. The package includes the CWL workflow definition, all input/output files, parameter JSON, container image reference, and W3C PROV-O provenance metadata in a single ZIP archive. This means an ML researcher can download one package and have everything needed to understand and reproduce the analysis.
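Reading a crate needs only the standard library, since the metadata file sits at the archive root per the RO-Crate 1.1 layout. A minimal sketch (the entity IDs below are illustrative):

```python
# Sketch: read ro-crate-metadata.json from a crate ZIP and index the
# entities it describes. The tiny in-memory crate stands in for a real
# export download.
import io
import json
import zipfile

def crate_entities(zip_bytes):
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        meta = json.loads(zf.read("ro-crate-metadata.json"))
    return {e["@id"]: e.get("@type") for e in meta["@graph"]}

# Build a tiny illustrative crate to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ro-crate-metadata.json", json.dumps({
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {"@id": "./", "@type": "Dataset"},
            {"@id": "workflow.cwl", "@type": "ComputationalWorkflow"},
        ],
    }))

print(crate_entities(buf.getvalue()))
```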
ISA-JSON/Tab Export
Standard exchange formats (ISA-JSON, ISA-Tab) that feed directly into ML data loading pipelines.
Investigations can be exported as ISA-JSON (following the ISA-JSON 1.0 specification) or ISA-Tab (tab-separated files in a ZIP). ISA-JSON includes full study metadata, protocols, materials (sources, samples, extracts), factor values, characteristic categories, assay data, and ontology source references. These formats are widely supported by bioinformatics tools and repositories (ENA, ArrayExpress, MetaboLights) and can be parsed by any JSON/TSV reader.
Sample Lineage
Provenance chains link every derived sample back to its source, providing full context for training data curation.
The process_input_samples table tracks parent-child relationships between samples across processes via the output_sample_id column. A tissue sample becomes an RNA extract, which becomes a library, which goes through sequencing — each step is a link in the chain. The API provides lineage queries (ancestors, descendants, full graph) via recursive CTEs capped at depth 10. This lets ML pipelines trace any derived feature back to the original biological material, filter by process category, or validate that samples share a common origin.
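The depth-capped recursive query can be sketched with SQLite. The output_sample_id column name comes from the text above; input_sample_id and the single-table schema are assumptions for illustration:

```python
# Sketch: ancestor lookup over process_input_samples via a recursive CTE
# capped at depth 10, mirroring the lineage API. Minimal stand-in schema;
# input_sample_id is an assumed column name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE process_input_samples (
    input_sample_id INTEGER, output_sample_id INTEGER)""")
# tissue (1) -> RNA extract (2) -> library (3) -> sequenced sample (4)
conn.executemany("INSERT INTO process_input_samples VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4)])

def ancestors(sample_id):
    """Return [(ancestor_id, depth), ...] ordered nearest-first."""
    return conn.execute("""
        WITH RECURSIVE lineage(sample_id, depth) AS (
            SELECT input_sample_id, 1 FROM process_input_samples
             WHERE output_sample_id = ?
            UNION ALL
            SELECT p.input_sample_id, l.depth + 1
              FROM process_input_samples p
              JOIN lineage l ON p.output_sample_id = l.sample_id
             WHERE l.depth < 10
        )
        SELECT sample_id, depth FROM lineage ORDER BY depth""",
        (sample_id,)).fetchall()

print(ancestors(4))  # [(3, 1), (2, 2), (1, 3)]
```

The `depth < 10` guard is what keeps the query bounded even if the data ever contained a cycle.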
Quantitative Measurements
Measurements annotated with Unit Ontology (UO) terms provide typed numeric features ready for ML feature vectors.
Measurements are linked to samples via assays and store a measurement type (e.g. RQN, concentration, fragment_length), a numeric value, and an optional unit from the Unit Ontology (UO). This means ML pipelines get typed numeric columns with standardised units — no parsing "8.7 ng/uL" from free text. The ML-Ready export automatically discovers all measurement types across an investigation and creates one numeric column per type.
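The per-type column discovery can be sketched as a pivot over typed records. The record shape is illustrative; the UO:0000186 unit CURIE is taken from the export example later on this page:

```python
# Sketch: pivot typed measurement records into one numeric column per
# measurement type, as the ML-Ready export does. Record shapes are
# illustrative; the concentration unit is omitted here.
from collections import defaultdict

measurements = [
    {"sample_id": 1, "type": "RQN", "value": 8.7, "unit": "UO:0000186"},
    {"sample_id": 1, "type": "concentration", "value": 42.0, "unit": None},
    {"sample_id": 2, "type": "RQN", "value": 7.2, "unit": "UO:0000186"},
]

def pivot(measurements):
    cols = defaultdict(dict)
    for m in measurements:
        cols[f"measurement:{m['type']}"][m["sample_id"]] = m["value"]
    return dict(cols)

print(pivot(measurements))
```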
REST API
Programmatic access to all metadata, samples, annotations, and analysis outputs via authenticated endpoints.
The REST API provides full CRUD access to investigations, processes, samples, and files. Additional endpoints support ISA-JSON export, ISA validation, provenance graphs, PROV-O export, sample lineage queries, and ML-Ready data export. Authentication uses API keys via the X-API-Key header. Rate limits are 300 requests/minute for authenticated users. See the API Reference for the complete endpoint list.
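A client should expect HTTP 429 when it exceeds the 300 requests/minute limit. A minimal sketch of a retrying GET; honouring a Retry-After header is an assumption about the server's rate-limit response:

```python
# Sketch: authenticated GET with the X-API-Key header and a simple retry
# loop for HTTP 429 responses. The Retry-After fallback of 2 seconds is
# an arbitrary choice.
import time

API = "https://librebiotech.org/api.php/v1"

def api_get(path, api_key, get=None, retries=3):
    if get is None:  # default to requests.get, imported lazily
        import requests
        get = requests.get
    for _ in range(retries):
        resp = get(API + path, headers={"X-API-Key": api_key})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(int(resp.headers.get("Retry-After", 2)))
    raise RuntimeError(f"still rate-limited after {retries} attempts")
```

Passing `get` explicitly also makes the helper easy to test with a stubbed response.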
Sequencing QC
Quality-controlled inputs across Illumina, Nanopore, and PacBio platforms ensure clean data enters ML pipelines.
Three integrated QC dashboards (IlluminaQC, NanoporeQC, PacBioQC) track run quality metrics including yield, Q-scores, pass/fail rates, and instrument trends. QC runs can be linked to ISA processes, so ML pipelines can filter samples by sequencing quality before including them in training datasets. This prevents low-quality data from corrupting model training.
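The filtering step can be sketched as a join between samples and their linked QC runs. The field names (passed, mean_q_score) and record shapes are illustrative, not the dashboards' real schema:

```python
# Sketch: drop samples whose linked sequencing run failed QC before they
# enter a training set. All field names here are illustrative.
qc_runs = {
    "RUN-1": {"mean_q_score": 34.1, "passed": True},
    "RUN-2": {"mean_q_score": 18.9, "passed": False},
}
samples = [
    {"label": "SAMPLE-001", "run": "RUN-1"},
    {"label": "SAMPLE-002", "run": "RUN-2"},
]

clean = [s for s in samples if qc_runs[s["run"]]["passed"]]
print([s["label"] for s in clean])  # ['SAMPLE-001']
```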
PROV-O Provenance
W3C-standard provenance graphs (JSON-LD) that document the full history of every data transformation.
The platform exports W3C PROV-O provenance as JSON-LD for both investigations and individual samples. The export includes prov:Activity nodes (processes and analysis runs), prov:Entity nodes (samples and files), and prov:Agent nodes (people), connected by prov:wasGeneratedBy, prov:used, prov:wasDerivedFrom, and prov:wasAssociatedWith relationships. This standard format can be loaded into any RDF store or knowledge graph.
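Because the export is plain JSON-LD, simple inspection needs no RDF tooling. A sketch that tallies node types in a small illustrative graph:

```python
# Sketch: tally node types in a PROV-O JSON-LD export using only the
# standard library. The @graph content below is illustrative.
import json
from collections import Counter

doc = json.loads("""{
  "@context": {"prov": "http://www.w3.org/ns/prov#"},
  "@graph": [
    {"@id": "sample/1", "@type": "prov:Entity"},
    {"@id": "run/7", "@type": "prov:Activity"},
    {"@id": "sample/1", "prov:wasGeneratedBy": {"@id": "run/7"}}
  ]
}""")

counts = Counter(node.get("@type", "(edge only)") for node in doc["@graph"])
print(counts)
```

For graph queries, the same document can instead be loaded into any JSON-LD-aware RDF store.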
Data Cards & Skill Files
Auto-generated dataset cards and platform skill files that bridge human-readable and machine-readable metadata.
Every investigation auto-generates a data card (YAML frontmatter + Markdown body) that aggregates metadata, FAIR/AI-Ready scores, ontology stats, provenance depth, and API links, inspired by Hugging Face dataset cards. Workflow cards document CWL pipelines with execution stats, and a platform-level skill file at /CLAUDE.md describes the entire platform for AI coding assistants. Data cards are bundled into the ISA-Tab and ML-Ready ZIP exports; the ML-Ready ZIP also includes a README describing columns, types, and loading code. Cards are available via the API at /api.php/v1/investigations/{id}/card and /api.php/v1/platform-card.
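Splitting a card into its YAML frontmatter and Markdown body takes a few lines of string handling. A minimal sketch; the card text and field names below are illustrative:

```python
# Sketch: separate a data card's YAML frontmatter from its Markdown
# body. The frontmatter is delimited by "---" lines, as in Hugging Face
# dataset cards; the fields shown are illustrative.
def split_card(text):
    assert text.startswith("---\n"), "card must open with a frontmatter fence"
    frontmatter, _, body = text[4:].partition("\n---\n")
    return frontmatter, body.lstrip()

card = "---\ntitle: Mouse Transcriptomics\nai_ready_score: 87\n---\n\n# Overview\n"
fm, body = split_card(card)
print(fm)
```

The frontmatter string can then be handed to any YAML parser.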
AI-Ready Score
Each investigation displays an AI-Ready Score card (0-100) alongside the FAIR Score in the investigation sidebar. The score measures how consumable the data is for ML pipelines across 8 independently scored dimensions.
Scoring dimensions
| # | Dimension | Max | What it checks | Source tables |
|---|---|---|---|---|
| 1 | Machine-Readable Formats | 12 | Has studies (6 pts) and processes (6 pts) — the minimum data needed for ISA-JSON or ML-Ready export | studies, process_studies |
| 2 | Structured Metadata | 12 | Has ontology annotations on samples (6 pts) and study factors defined (6 pts) | annotations, study_factors |
| 3 | Data Provenance | 12 | Sample lineage depth (4-8 pts: depth 1-2 = 4 pts, depth 3+ = 8 pts) plus process count (2-4 pts) | process_input_samples (recursive CTE on output_sample_id, depth cap 10), process_studies |
| 4 | Standard File Formats | 12 | Files with standard extensions: FASTQ, BAM, SAM, CRAM, CSV, TSV, GFF3, GFF, GTF, VCF, BED, BigWig, FASTA, JSON, XML (1-4 files = 8 pts, 5+ = 12 pts) | process_files, files |
| 5 | Computational Reproducibility | 13 | Succeeded CWL analysis runs (1-2 runs = 8 pts, 3+ = 13 pts) | analysis_runs WHERE status='succeeded' |
| 6 | Quantitative Measurements | 13 | Measurements with Unit Ontology terms (1-4 = 8 pts, 5+ = 13 pts) | measurements JOIN assays WHERE unit_id IS NOT NULL |
| 7 | Consistent Sample Labeling | 12 | >80% labels match a common prefix pattern (3-6 pts) and >80% have descriptions (3-6 pts) | samples — regex heuristic on labels |
| 8 | API Accessibility | 13 | Investigation visibility: public = 13 pts, group = 8 pts, private = 3 pts | investigations.visibility |
Total maximum: 99 points (normalised to 0-100 scale).
Badge thresholds
| Score | Badge | Colour |
|---|---|---|
| 90-100 | Excellent | Green |
| 75-89 | Good | Blue |
| 50-74 | Fair | Yellow |
| 0-49 | Needs Work | Red |
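The normalisation and badge mapping can be sketched directly from the tables above. The rounding rule is an assumption; the platform may truncate instead:

```python
# Sketch: map a raw dimension sum (max 99) onto the 0-100 scale, then to
# a badge using the documented thresholds. Rounding is an assumption.
def normalise(raw, max_raw=99):
    return round(raw / max_raw * 100)

def badge(score):
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Needs Work"

print(badge(normalise(87)))  # Good
```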
Suggestions
Each missing criterion generates an actionable suggestion displayed beneath the score card. For example:
- Metadata: Add ontology annotations to samples for structured metadata
- Provenance: Add sample lineage (process_input_samples links) for provenance tracking
- Reproducibility: Run CWL analysis pipelines for computational reproducibility
- Accessibility: Set visibility to "public" for unrestricted programmatic access
How to improve your score
| To improve | Action |
|---|---|
| Machine-Readable Formats | Create at least one study and link processes to it |
| Structured Metadata | Use the annotation panel on sample pages to add ontology terms (organism, tissue, disease). Define study factors (genotype, treatment) on the study page |
| Data Provenance | Link samples as source materials using the "Link Sources" button on process pages. Build chains: tissue → extract → library → sequencing |
| Standard File Formats | Attach FASTQ, BAM, CSV, GFF3, or VCF files to processes via the file upload panel |
| Computational Reproducibility | Submit CWL analysis runs from the Compute → Analyses page and wait for them to succeed |
| Quantitative Measurements | Record measurements with assays, selecting unit terms from the Unit Ontology (UO) — e.g. nanogram, microliter, RQN |
| Consistent Sample Labeling | Use a consistent naming pattern (e.g. MOUSE-BRAIN-001, MOUSE-BRAIN-002) and add descriptions to all samples |
| API Accessibility | Set the investigation visibility to "public" on the investigation edit page |
ML-Ready Data Export
The ML-Ready export flattens all investigation data into a single samples × features matrix. This is the primary format for loading platform data into ML pipelines.
How it works
- Collects all samples across the investigation (via studies → processes → samples)
- Gathers ontology annotations for each sample, grouped by slot (organism, tissue, etc.)
- Gathers study factor values for each sample (genotype, treatment, etc.)
- Gathers quantitative measurements for each sample (RQN, concentration, etc.)
- Builds a provenance summary per sample using a recursive CTE on process_input_samples (depth capped at 10)
- Discovers all unique column names from the union of annotation slots, factor names, and measurement types
- Builds a flat rows matrix where each row is a sample and each column is a feature (nulls for missing values)
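The steps above can be sketched as a single flattening function (sample shapes are illustrative):

```python
# Sketch of the flattening steps: discover the union of feature names
# across samples, then emit one row per sample with None for missing
# values. Data shapes are illustrative.
def flatten(samples):
    columns = sorted({key for s in samples for key in s["features"]})
    rows = [
        [s["sample_id"], s["label"]] + [s["features"].get(c) for c in columns]
        for s in samples
    ]
    return ["sample_id", "sample_label"] + columns, rows

samples = [
    {"sample_id": 1, "label": "SAMPLE-001",
     "features": {"annotation:organism": "Mus musculus", "measurement:RQN": 8.7}},
    {"sample_id": 2, "label": "SAMPLE-002",
     "features": {"annotation:organism": "Mus musculus"}},
]
cols, rows = flatten(samples)
print(cols)
print(rows)
```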
Access methods
| Method | How |
|---|---|
| Web UI | Click the ML-Ready button on any investigation page. Use the dropdown to choose CSV, JSON, or ZIP format |
| REST API (JSON) | GET /api.php/v1/investigations/{id}/ml-export |
| REST API (CSV) | GET /api.php/v1/investigations/{id}/ml-export?format=csv |
| REST API (ZIP) | GET /api.php/v1/investigations/{id}/ml-export?format=zip — CSV data + README in a self-documenting ZIP bundle |
JSON schema
```json
{
  "metadata": {
    "investigation_id": 3,
    "investigation_title": "Mouse Transcriptomics",
    "export_date": "2026-03-14",
    "sample_count": 42,
    "feature_count": 8,
    "ontology_prefix_map": {
      "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
      "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
    }
  },
  "columns": [
    {"name": "sample_id", "type": "identifier"},
    {"name": "sample_label", "type": "string"},
    {"name": "annotation:organism", "type": "ontology", "ontology_curie": "NCBITaxon:10090"},
    {"name": "annotation:anatomy", "type": "ontology", "ontology_curie": "UBERON:0000955"},
    {"name": "factor:genotype", "type": "categorical"},
    {"name": "measurement:RQN", "type": "numeric", "unit": "UO:0000186"}
  ],
  "rows": [
    [1, "SAMPLE-001", "Mus musculus", "brain", "wild-type", 8.7],
    [2, "SAMPLE-002", "Mus musculus", "liver", "knockout", 7.2]
  ],
  "provenance_summary": {
    "1": {"depth": 3, "process_chain": ["extraction", "sample_prep", "sequencing"]},
    "2": {"depth": 2, "process_chain": ["extraction", "sequencing"]}
  }
}
```
Column types
| Type | Description | Prefix | Example |
|---|---|---|---|
| identifier | Unique integer ID | — | sample_id |
| string | Free text | — | sample_label |
| ontology | Ontology term value with CURIE reference | annotation: | annotation:organism |
| categorical | Discrete category from study factors | factor: | factor:genotype |
| numeric | Numeric measurement with optional unit CURIE | measurement: | measurement:RQN |
CSV format
The CSV export uses the column names from the JSON columns array as the header row. Null values are empty strings. Standard RFC 4180 CSV encoding.
```csv
sample_id,sample_label,annotation:organism,annotation:anatomy,factor:genotype,measurement:RQN
1,SAMPLE-001,Mus musculus,brain,wild-type,8.7
2,SAMPLE-002,Mus musculus,liver,knockout,7.2
```
Code examples
Python: load into pandas
```python
import requests
import pandas as pd
from io import StringIO

API = "https://librebiotech.org/api.php/v1"
headers = {"X-API-Key": "YOUR_KEY"}

# Option 1: JSON
resp = requests.get(f"{API}/investigations/3/ml-export", headers=headers)
data = resp.json()
df = pd.DataFrame(data["rows"], columns=[c["name"] for c in data["columns"]])

# Option 2: CSV (simpler)
resp = requests.get(f"{API}/investigations/3/ml-export?format=csv", headers=headers)
df = pd.read_csv(StringIO(resp.text))

# Access provenance (from the JSON export in Option 1)
provenance = data["provenance_summary"]
print(f"Sample 1 lineage depth: {provenance['1']['depth']}")
print(f"Process chain: {provenance['1']['process_chain']}")
```
R: load into tibble
```r
library(httr)
library(jsonlite)
library(tibble)

api <- "https://librebiotech.org/api.php/v1"
key <- "YOUR_KEY"

resp <- GET(paste0(api, "/investigations/3/ml-export"),
            add_headers("X-API-Key" = key))
data <- fromJSON(content(resp, "text", encoding = "UTF-8"))

# fromJSON() simplifies "columns" to a data frame, so the names are a
# plain character vector; mixed-type "rows" become a character matrix,
# so coerce numeric columns (e.g. measurement:RQN) afterwards as needed
df <- as_tibble(data$rows, .name_repair = "minimal")
colnames(df) <- data$columns$name
print(df)
```
Python: scikit-learn pipeline
```python
import requests
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

API = "https://librebiotech.org/api.php/v1"
headers = {"X-API-Key": "YOUR_KEY"}

resp = requests.get(f"{API}/investigations/3/ml-export", headers=headers)
data = resp.json()
df = pd.DataFrame(data["rows"], columns=[c["name"] for c in data["columns"]])

# Separate features by type
numeric_cols = [c["name"] for c in data["columns"] if c["type"] == "numeric"]
categorical_cols = [c["name"] for c in data["columns"] if c["type"] == "categorical"]
ontology_cols = [c["name"] for c in data["columns"] if c["type"] == "ontology"]

# Encode categoricals
for col in categorical_cols + ontology_cols:
    if col in df.columns:
        df[col] = LabelEncoder().fit_transform(df[col].fillna("unknown"))

# Scale numerics
if numeric_cols:
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols].fillna(0))
```
FAIR Score comparison
The AI-Ready Score complements the existing FAIR Score. Both are shown in the investigation sidebar:
| Aspect | FAIR Score | AI-Ready Score |
|---|---|---|
| Focus | Data findability, accessibility, interoperability, reusability | ML/AI consumability of the data |
| Dimensions | 4 (Findable, Accessible, Interoperable, Reusable) | 8 (formats, metadata, provenance, files, reproducibility, measurements, labeling, accessibility) |
| Scale | 0-100 (average of 4 sub-scores) | 0-100 (sum of 8 dimension scores) |
| Checks | Metadata completeness, DOI, contacts, licenses, protocols | Ontology annotations, lineage depth, file formats, CWL runs, unit measurements, label consistency |
| Audience | Data managers, repository curators, compliance officers | ML engineers, AI researchers, computational biologists |
For AI/ML Researchers
If you are an AI/ML researcher looking to consume data from this platform, visit the dedicated For AI/ML Researchers page for API quickstart guides, endpoint documentation, code examples in Python/R/curl, and details on data formats and ontology coverage.