Libre Biotech
Viewing v0.1.0 — not the latest version. The content below (steps, materials, parameters, etc.) is what this procedure said at v0.1.0; newer versions may differ.

BLAST Species Call and Best-Hit Analysis

Computational procedure for calling species from a Sanger-sequenced barcode read using NCBI BLAST + BOLD Systems. Covers sequence trimming, database choice, interpretation of hit tables, and decision rules for species-level vs. genus-level calls. Closes the Sushi Truth pilot analytic loop: Sanger read → BLAST → species call → mislabeling determination. No wet-lab equipment or materials; a computer with internet access is sufficient. Written for a novice audience.

data_transformation
Procedure Details
Safety & Hazards
  • No physical hazards — computational procedure.
  • Data integrity. Record every BLAST query / result pair; species calls form the basis of downstream public data claims. Any error in this step propagates into the final mislabeling dataset.
  • Privacy. BLAST queries are submitted to NCBI servers. Do not submit privacy-sensitive sequences (e.g. unpublished human research sequences).
Preparation Notes

Inputs:

  • Sanger read in FASTA format (from procedure 71 Sanger Submission).
  • Chromatogram (AB1 file or image) for read-quality inspection.
  • Internet access for NCBI BLAST (blast.ncbi.nlm.nih.gov) and BOLD (boldsystems.org).

Database choice:

Database | When to use | Strengths | Weaknesses
BOLD Systems | Species ID from curated barcode genes (COI, ITS, rbcL) | Curated taxonomic data; high-confidence species assignments; includes cryptic/recent species | Smaller than NCBI; fewer non-barcode sequences
NCBI GenBank (nt) | Any sequence; cross-checking BOLD | Comprehensive; includes genomic sequences | Less curated; taxonomic labels sometimes wrong
NCBI RefSeq | When you need only the reference sequences | Curated subset | Smaller than nt

For COI Fish Barcoding (procedure 56): BOLD as primary source; NCBI nt as secondary confirmation.

Tools for read quality inspection:

  • FinchTV or 4Peaks (both free; FinchTV runs on Windows and macOS, 4Peaks on macOS only): view and trim AB1 chromatograms.
  • Biopython (Python library) — scriptable trimming if you prefer code.

Mental model: BLAST searches a reference database for the sequences most similar to your query, using a heuristic local alignment, and scores each match. High-quality matches give: (1) high % identity over (2) high query coverage with (3) a large gap to the next-best species hit. All three must hold for a confident species call.

Timing
  • Per sample (~10 min): view chromatogram → trim read → submit to BLAST → wait for results → interpret → record call.
  • Batch (~30 min for 10 samples): submit batch queries in parallel.
Protocol Parameters
Captured per-assay on each run; exported as ISA-Tab Parameter Value columns.

Name | Type | Required | Default | Unit | Description
identity_threshold_pct | number | | 98 | | Minimum % identity of top hit for species-level call. 98% standard for COI fish; adjust per taxon (95% for high-divergence groups, 99% for strict assignments).
coverage_threshold_pct | number | | 90 | | Minimum query coverage for species-level call. 90% standard; ensures that most of the read aligns to the database entry, not just a short region of it.
gap_to_next_min_pct | number | | 1 | | Minimum gap in % identity between top hit and next-best-species hit. 1% standard; separates a confident single-species call from an ambiguous one.
database_primary | text | | NCBI_nt | | Primary BLAST database. 'NCBI_nt' or 'BOLD'. For COI barcoding BOLD is preferred due to better curation; NCBI nt is the comprehensive backup.
quality_trim_threshold | number | | 20 | | Minimum Phred quality score for read-trimming. Q20 (99% accuracy) standard; Q30 stricter; don't go below Q15.
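The "Q20 (99% accuracy)" equivalence follows from the standard Phred definition, Q = -10 · log10(P_error). A minimal sketch of that conversion is shown here; the function name is illustrative and not part of any LibreBiotech tooling.

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base call accuracy (0..1).

    Uses the standard definition Q = -10 * log10(P_error).
    """
    error_probability = 10 ** (-q / 10)
    return 1.0 - error_probability


# Compare the thresholds named in the parameter table.
for q in (15, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q) * 100:.2f}% accuracy")
```

Each 10-point step in Q corresponds to a tenfold drop in the per-base error probability, so lowering the threshold from Q20 to Q15 roughly triples the expected error rate.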
Procedure Steps (Version 0.1.0)

Open the chromatogram (AB1 file) in FinchTV, 4Peaks, or similar viewer. Inspect the read quality.

Identify the high-quality region: continuous sharp peaks of single colour per position, with low baseline noise. Typical high-quality range: bases ~20 to ~800 (read quality degrades at both ends).

Trim the read to the high-quality region. In FinchTV: use the 'trim' tool. In Biopython: read the AB1 file with Bio.SeqIO (format "abi", or "abi-trim" for Biopython's built-in quality trimming), then slice the resulting SeqRecord. Target: remove the primer sequence at the start (first ~20 bp) and any Q<20 bases at the 3' end.
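For those who prefer code, the trimming target in this step can be sketched in plain Python. This is a simplified illustration, not the procedure's reference implementation: the function name and the made-up read are hypothetical, and it only walks back from the 3' end until it meets the first base at or above the threshold, so a noisy tail containing an isolated high-quality base would need a windowed approach instead. With Biopython, the inputs would typically be `record.seq` and `record.letter_annotations["phred_quality"]` from a record read with `SeqIO.read(path, "abi")`; that step is assumed and not exercised here.

```python
def trim_read(seq, quals, min_q=20, skip_5prime=20):
    """Trim a read and return (trimmed_seq, start, end).

    Coordinates are 0-based and end-exclusive. The first `skip_5prime`
    bases are dropped, then bases below `min_q` are removed from the 3' end.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must be the same length")
    start = min(skip_5prime, len(seq))
    end = len(seq)
    # Walk back from the 3' end while the last kept base is below threshold.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], start, end


# Made-up read: 20 noisy leading bases, 8 good bases, 4 poor trailing bases.
seq = "N" * 20 + "ACGTACGT" + "NNNN"
quals = [10] * 20 + [40] * 8 + [12, 8, 5, 3]
trimmed, start, end = trim_read(seq, quals)
print(trimmed, start, end)
```

Returning the coordinates alongside the sequence makes it easy to record exactly which part of the raw read was submitted to BLAST.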

Export the trimmed sequence as FASTA: a header line starting with '>' followed by the sequence on the next line(s). Format example:

>SampleID_Date
ATGCATGC...
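If the trimmed sequence is held in a script rather than a chromatogram viewer, the FASTA record can be written by hand. The sketch below is an assumption-laden illustration: the function name, the 70-character line width, and the sample ID are all hypothetical choices, not requirements of BLAST or BOLD.

```python
def to_fasta(sample_id, sequence, width=70):
    """Format one sequence as a FASTA record with wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{sample_id}\n" + "\n".join(lines) + "\n"


# Hypothetical sample ID in the SampleID_Date pattern; short width for display.
print(to_fasta("S01_2024-01-01", "ATGC" * 3, width=8), end="")
```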

Open NCBI BLAST: navigate to https://blast.ncbi.nlm.nih.gov/Blast.cgi and select 'Nucleotide BLAST'.

Paste the FASTA sequence into the query box. Optionally include the >SampleID header line.

For COI barcoding: under 'Choose Search Set', set the database to 'nucleotide collection (nr/nt)'. Optionally restrict by organism (e.g. 'bony fishes (taxid:7898)' for teleosts) to reduce off-target hits.

Click 'BLAST'. Results return in 30 seconds to 2 minutes depending on server load.

Examine the hit table: hits are sorted by E-value in ascending order, so the lowest (most significant) E-value is at the top. Key columns: % Identity, Query Coverage, E-value, Scientific Name.
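When a hit table is saved as text, the same columns can be read programmatically. The sketch below assumes the common 12-column tab-separated BLAST tabular layout (query ID, subject ID, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the accessions and values in the example are invented. Note that coverage here is computed per alignment from the query coordinates, which may differ slightly from the 'Query Coverage' figure shown on the results page when a hit has several aligned segments.

```python
import csv
import io

FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(tabular_text, query_len):
    """Parse tab-separated BLAST hits and sort them best-first."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, row))
        qstart, qend = int(rec["qstart"]), int(rec["qend"])
        hits.append({
            "subject": rec["sseqid"],
            "pident": float(rec["pident"]),
            "evalue": float(rec["evalue"]),
            "bitscore": float(rec["bitscore"]),
            # Percentage of the query covered by this single alignment.
            "query_cov": 100.0 * (abs(qend - qstart) + 1) / query_len,
        })
    # Lowest E-value first; ties broken by highest bit score.
    hits.sort(key=lambda h: (h["evalue"], -h["bitscore"]))
    return hits


# Invented two-row hit table for a 650 bp query.
example = (
    "q1\tACC0002.1\t97.50\t640\t16\t0\t1\t640\t10\t649\t0.0\t1090\n"
    "q1\tACC0001.1\t99.69\t650\t2\t0\t1\t650\t5\t654\t0.0\t1190\n"
)
hits = parse_hits(example, query_len=650)
for h in hits:
    print(h["subject"], h["pident"], round(h["query_cov"], 2), h["evalue"])
```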

Assess hit quality against decision rules: (1) top hit ≥98% identity; (2) query coverage ≥90%; (3) gap to next-best species ≥1% identity.
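The third criterion, the gap to the next-best species, is the one most easily misread from a hit table, because the rows directly below the top hit often belong to the same species. A minimal sketch of the calculation follows; the function name is hypothetical, the identity values are invented for illustration, and the sketch ranks species by their best % identity rather than by row order.

```python
def gap_to_next_species(hits):
    """Return (top_species, top_pident, gap) from (species, pident) pairs.

    `gap` is the difference in % identity between the best hit of the top
    species and the best hit of any other species, or None if every hit
    belongs to a single species.
    """
    if not hits:
        raise ValueError("no hits")
    best = {}
    for species, pident in hits:
        # Keep only the best identity seen for each species.
        best[species] = max(pident, best.get(species, 0.0))
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    top_species, top_pident = ranked[0]
    if len(ranked) == 1:
        return top_species, top_pident, None
    return top_species, top_pident, round(top_pident - ranked[1][1], 2)


# Invented hit list: the second row is the same species as the top hit.
example_hits = [
    ("Thunnus albacares", 99.7),
    ("Thunnus albacares", 99.5),
    ("Thunnus obesus", 98.2),
    ("Thunnus thynnus", 97.9),
]
print(gap_to_next_species(example_hits))
```

In this example the gap is measured against the 98.2% hit from a different species, not the 99.5% hit from the same one.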

If all three criteria pass: record as species-level call. The organism in row 1 is your call.

If only some of the criteria pass: downgrade to a genus-level call (if the top hit's genus matches the next several hits) or an ambiguous call.

Optionally, cross-check on BOLD: navigate to https://boldsystems.org → 'Identification' → paste FASTA → 'Species Level Barcode Records'. BOLD is more curated than GenBank for COI fish barcodes.

Record the call in LibreBiotech as a Sample annotation: top hit species, accession number, % identity, query coverage, gap to next species, database(s) consulted, your decision. For Sushi Truth pilot: also set mislabeled=true if the call differs from the claimed species on the Sample record.

For ambiguous or unexpected calls: flag for manual review and annotate the Sample with rationale. For pilot results, document the decision explicitly — every mislabeling claim must be traceable to a specific read and BLAST result.

Completion Notes

Expected outcome. A species-level, genus-level, or ambiguous call per sample, recorded in LibreBiotech as a Sample annotation with full justification (top hit, % identity, query coverage, gap to next hit).

Decision rules for species-level call:

  • Top hit ≥98% identity
  • Query coverage ≥90%
  • Gap to next-best species ≥1% identity
  • All three criteria must hold

Fall-backs:

  • If top hit is ≥95% but <98% identity, or query coverage is <90%: genus-level call only (sample identified to genus, not species).
  • If multiple species are tied within 1% identity: report as ambiguous; call to the lowest common taxonomic rank.
  • If top hit <95% identity: likely novel or mis-sequenced; flag for re-submission or re-sequencing.
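The decision rules and fall-backs above can be sketched as a single function. Two points are assumptions made for this sketch rather than something the procedure states: the order in which overlapping conditions are checked (re-sequencing flag first, then species-level, then ambiguous, then genus-level), and treating a missing gap (all hits from one species) as passing the gap criterion. The function name and return labels are illustrative.

```python
def classify_call(pident, query_cov, gap,
                  identity_threshold=98.0, coverage_threshold=90.0,
                  gap_min=1.0, genus_floor=95.0):
    """Classify a top hit using the decision rules and fall-backs.

    `gap` is the % identity gap to the next-best species, or None when
    all hits belong to one species (assumed here to satisfy the rule).
    """
    if pident < genus_floor:
        return "flag_for_resequencing"
    gap_ok = gap is None or gap >= gap_min
    if pident >= identity_threshold and query_cov >= coverage_threshold and gap_ok:
        return "species-level"
    if not gap_ok:
        return "ambiguous"
    return "genus-level"


# Invented values covering each outcome.
print(classify_call(99.7, 100.0, 1.5))
print(classify_call(96.5, 100.0, 2.0))
print(classify_call(99.7, 100.0, 0.3))
print(classify_call(93.0, 100.0, 5.0))
```

The default arguments mirror the protocol parameters identity_threshold_pct, coverage_threshold_pct, and gap_to_next_min_pct, so a per-taxon adjustment only needs different arguments.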

Record format in LibreBiotech:

  • Top-hit species + accession number.
  • % identity, query coverage, gap to next.
  • Database source (BOLD or NCBI).
  • Decision: species-level / genus-level / ambiguous.
  • For Sushi Truth pilot: mislabeled=true if the call differs from the claimed species.
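As an illustration of the record format, the fields listed above could be assembled as follows before being entered as a Sample annotation. The field names, the placeholder accession, and the function itself are hypothetical and do not reflect the actual LibreBiotech schema. The sketch also assumes that the mislabeled flag is set only for species-level calls and left unset (None) otherwise, since the procedure routes ambiguous calls to manual review.

```python
import json


def build_annotation(top_species, accession, pident, query_cov, gap,
                     database, decision, claimed_species=None):
    """Assemble one species-call record as a plain dictionary."""
    record = {
        "top_hit_species": top_species,
        "accession": accession,
        "pct_identity": pident,
        "query_coverage_pct": query_cov,
        "gap_to_next_pct": gap,
        "database": database,
        "decision": decision,
    }
    if claimed_species is not None:
        if decision == "species-level":
            # Compare names case-insensitively, ignoring stray whitespace.
            same = top_species.strip().lower() == claimed_species.strip().lower()
            record["mislabeled"] = not same
        else:
            record["mislabeled"] = None  # left for manual review
    return record


# Invented example; "XX000000.1" is a placeholder, not a real accession.
annotation = build_annotation(
    "Thunnus albacares", "XX000000.1", 99.7, 100.0, 1.5,
    "NCBI", "species-level", claimed_species="Thunnus thynnus",
)
print(json.dumps(annotation, indent=2))
```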

Troubleshooting.

Symptom | Likely cause | Fix
No hits at all | Read contains only primer or has quality issues | Re-inspect chromatogram; re-trim more aggressively; verify species is in BOLD/NCBI
All hits are same species at 100% | Sample is well-characterised | Confident species call
Top two hits tied within 1% | Cryptic species or very recent divergence | Report to genus; note in annotation
Top hit matches organism you didn't expect | Mislabeling (for pilot) OR contamination | Verify sample chain of custody; check NTC amplicon on gel
High % identity but short coverage | Primer-dimer carryover in read | Re-trim read more aggressively; re-submit to Sanger if needed
Many species within 1% identity | COI not diagnostic for your taxon | Consider multi-locus barcoding (ITS + rbcL for plants; multi-COI-region for fish)
References
  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 215(3):403–10. (Original BLAST paper.)
  2. Ratnasingham S, Hebert PDN (2007). BOLD: The Barcode of Life Data System. Mol Ecol Notes 7(3):355–64.
  3. NCBI BLAST: https://blast.ncbi.nlm.nih.gov
  4. BOLD Systems: https://boldsystems.org
  5. LibreBiotech procedure 71: Sanger Sequencing Submission (upstream).