BLAST Species Call and Best-Hit Analysis
Computational procedure for calling species from a Sanger-sequenced barcode read using NCBI BLAST + BOLD Systems. Covers sequence trimming, database choice, interpretation of hit tables, and decision rules for species-level vs. genus-level calls. Closes the Sushi Truth pilot analytic loop: Sanger read → BLAST → species call → mislabeling determination. No wet-lab equipment or materials; a computer with internet access is sufficient. Written for a novice audience.
Version History
Version 0.1.1 (latest)
Effective: 2026-04-20. Cross-reference fix: Sanger ref 71→75. All `procedure N` references converted to Markdown hyperlinks pointing at https://librebiotech.org/?action=show&id=N, enabling in-app click-through to referenced sibling procedures. Text content otherwise preserved.
Version 0.1.0
Effective: 2026-04-20. Initial release. Pure-computational atomic technique; no lab setup. BLAST algorithm (Altschul et al. 1990) and BOLD (Ratnasingham & Hebert 2007) are well-established field tools. Fresh original prose with practical decision rules for Sushi Truth pilot context.
Procedure Details
- No physical hazards — computational procedure.
- Data integrity. Record every BLAST query / result pair; species calls form the basis of downstream public data claims. Any error in this step propagates into the final mislabeling dataset.
- Privacy. BLAST queries are submitted to NCBI servers. Do not submit privacy-sensitive sequences (e.g. unpublished human research sequences).
Inputs:
- Sanger read in FASTA format (from procedure 75 Sanger Submission).
- Chromatogram (AB1 file or image) for read-quality inspection.
- Internet access for NCBI BLAST (blast.ncbi.nlm.nih.gov) and BOLD (boldsystems.org).
Database choice:
| Database | When to use | Strengths | Weaknesses |
|---|---|---|---|
| BOLD Systems | Species ID from curated barcode genes (COI, ITS, rbcL) | Curated taxonomic data; high-confidence species assignments; includes cryptic/recent species | Smaller than NCBI; fewer non-barcode sequences |
| NCBI GenBank (nt) | Any sequence; cross-checking BOLD | Comprehensive; includes genomic sequences | Less curated; taxonomic labels sometimes wrong |
| NCBI RefSeq | When you need only the reference sequences | Curated subset | Smaller than nt |
For COI Fish Barcoding (procedure 56): BOLD as primary source; NCBI nt as secondary confirmation.
Tools for read quality inspection:
- FinchTV / 4Peaks (free chromatogram viewers; platform support differs, e.g. 4Peaks is macOS-only): view and trim AB1 chromatograms.
- Biopython (Python library): scriptable trimming if you prefer code.
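For the scripted route, the following is a minimal sketch of end-trimming by Phred score. It assumes you already have the base calls and per-base quality scores as plain Python values (Biopython's AB1 parser is understood to expose these, but the sketch deliberately avoids depending on it). The function name `trim_by_quality` and its parameters are hypothetical, and it only trims from the two ends; it does not detect low-quality dips in the middle of a read.

```python
def trim_by_quality(seq, quals, min_q=20, primer_len=0):
    """Trim low-quality bases from both ends of a read.

    seq        -- base calls as a string
    quals      -- per-base Phred scores, same length as seq
    min_q      -- minimum Phred score to keep (illustrative default: Q20)
    primer_len -- number of leading bases to drop unconditionally (assumption)

    Returns (start, end, trimmed_seq) with seq[start:end] == trimmed_seq.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must be the same length")

    start = primer_len
    end = len(seq)

    # Walk in from the 5' end past low-quality bases.
    while start < end and quals[start] < min_q:
        start += 1

    # Walk in from the 3' end past low-quality bases.
    while end > start and quals[end - 1] < min_q:
        end -= 1

    return start, end, seq[start:end]


# Toy read with made-up quality values, for illustration only.
example_seq = "NNACGTACGTNN"
example_quals = [5, 8, 30, 35, 40, 40, 38, 36, 30, 25, 10, 4]
print(trim_by_quality(example_seq, example_quals))  # (2, 10, 'ACGTACGT')
```

Always compare the scripted result against the chromatogram for at least a few reads before trusting it for a whole batch.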
Mental model: BLAST searches a reference database for entries that align well with your query and ranks them by alignment score. A high-quality match shows (1) high % identity over (2) high query coverage, with (3) a large gap to the next-best species. All three must hold for a confident species call.
- Per sample (~10 min): view chromatogram → trim read → submit to BLAST → wait for results → interpret → record call.
- Batch (~30 min for 10 samples): submit batch queries in parallel.
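One way to handle a batch, offered here as an alternative to opening several browser tabs, is to combine the trimmed reads into a single multi-record FASTA text and paste that into the query box, since the NCBI web form accepts several FASTA records in one submission. The sketch below only builds that text; the helper names `to_fasta` and `build_batch` and the sample IDs are hypothetical.

```python
def to_fasta(sample_id, sequence, width=70):
    """Format one read as a FASTA record with wrapped sequence lines."""
    lines = [f">{sample_id}"]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


def build_batch(reads):
    """Concatenate {sample_id: sequence} into one multi-FASTA string."""
    return "".join(to_fasta(sid, seq) for sid, seq in reads.items())


# Placeholder sample IDs and toy sequences, for illustration only.
batch = build_batch({
    "SampleA_2026-04-20": "ATGCATGCATGC",
    "SampleB_2026-04-20": "GGCCTTAAGGCC",
})
print(batch)
```

Each record in the results is reported separately, so keep the sample ID in the header line to match hits back to samples.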
Protocol Parameters
Captured per-assay on each run; exported as ISA-Tab Parameter Value columns.
| Name | Type | Required | Default | Unit | Description |
|---|---|---|---|---|---|
| identity_threshold_pct | number | - | 98 | - | Minimum % identity of top hit for species-level call. 98% standard for COI fish; adjust per taxon (95% for high-divergence groups, 99% for strict assignments). |
| coverage_threshold_pct | number | - | 90 | - | Minimum query coverage for species-level call. 90% standard; ensures that most of the read aligns to the database entry, not just a short region. |
| gap_to_next_min_pct | number | - | 1 | - | Minimum gap in % identity between top hit and next-best-species hit. 1% standard; separates a confident single-species call from an ambiguous one. |
| database_primary | text | - | NCBI_nt | - | Primary BLAST database. 'NCBI_nt' or 'BOLD'. For COI barcoding BOLD is preferred due to better curation; NCBI nt is the comprehensive backup. |
| quality_trim_threshold | number | - | 20 | - | Minimum Phred quality score for read-trimming. Q20 (99% accuracy) standard; Q30 stricter; don't go below Q15. |
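The Phred thresholds quoted for quality_trim_threshold follow from the standard definition Q = -10 · log10(P), where P is the probability that a base call is wrong. A short sketch of the conversion (the function names are illustrative):

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy_pct(q):
    """Per-base call accuracy, in percent, for Phred score q."""
    return 100 * (1 - phred_to_error_prob(q))


for q in (15, 20, 30):
    print(f"Q{q}: {phred_to_accuracy_pct(q):.1f}% accuracy")
# Q15: 96.8% accuracy
# Q20: 99.0% accuracy
# Q30: 99.9% accuracy
```

At Q15 roughly 3 bases in 100 are expected to be miscalled, which by itself is enough to pull a true match below a 98% identity threshold.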
Procedure Steps (Version 0.1.1)
Open the chromatogram (AB1 file) in FinchTV, 4Peaks, or similar viewer. Inspect the read quality.
Identify the high-quality region: continuous sharp peaks of single colour per position, with low baseline noise. Typical high-quality range: bases ~20 to ~800 (read quality degrades at both ends).
Trim the read to the high-quality region. In FinchTV: use the 'trim' tool. In Biopython: use Bio.SeqIO.convert or SeqRecord slicing. Target: remove the primer sequence at the start (first ~20 bp) and any Q<20 bases at the 3' end.
Export the trimmed sequence as FASTA. Format example: >SampleID_Date\nATGCATGC...
Open NCBI BLAST: navigate to https://blast.ncbi.nlm.nih.gov/Blast.cgi and select 'Nucleotide BLAST'.
Paste the FASTA sequence into the query box. Optionally include the >SampleID header line.
For COI barcoding: set 'Choose Search Set' to 'nucleotide collection (nr/nt)'. Optionally restrict by organism (e.g. 'bony fishes (taxid:7898)' for teleosts) to reduce off-target hits.
Click 'BLAST'. Results return in 30 seconds to 2 minutes depending on server load.
Examine the hit table: hits are sorted by E-value in ascending order, so the most significant hit (lowest E-value) is in row 1. Key columns: % Identity, Query Coverage, E-value, Scientific Name.
Assess hit quality against decision rules: (1) top hit ≥98% identity; (2) query coverage ≥90%; (3) gap to next-best species ≥1% identity.
If all three criteria pass: record as species-level call. The organism in row 1 is your call.
If partial criteria only: downgrade to genus-level call (if top hit's genus matches the next several hits) or ambiguous call.
Optionally, cross-check on BOLD: navigate to https://boldsystems.org → 'Identification' → paste FASTA → 'Species Level Barcode Records'. BOLD is more curated than GenBank for COI fish barcodes.
Record the call in LibreBiotech as a Sample annotation: top hit species, accession number, % identity, query coverage, gap to next species, database(s) consulted, your decision. For Sushi Truth pilot: also set mislabeled=true if the call differs from the claimed species on the Sample record.
For ambiguous or unexpected calls: flag for manual review and annotate the Sample with rationale. For pilot results, document the decision explicitly — every mislabeling claim must be traceable to a specific read and BLAST result.
Expected outcome. A species-level, genus-level, or ambiguous call per sample, recorded in LibreBiotech as a Sample annotation with full justification (top hit, % identity, query coverage, gap to next hit).
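The "gap to next species" figure is the one value in the record that the hit table does not show directly. A minimal sketch of how it could be computed, assuming the hits have been copied into a list of (scientific name, % identity) pairs with the top hit first; the function name and the example values are invented for illustration and are not real BLAST results:

```python
def gap_to_next_species(hits):
    """Gap in % identity between the top hit and the best hit from any
    other species.

    hits -- list of (species_name, pct_identity), top hit first.
    Returns None when every hit belongs to the same species as the top hit.
    """
    if not hits:
        raise ValueError("hit list is empty")

    top_species, top_identity = hits[0]

    # Best identity among hits assigned to a different species.
    others = [identity for species, identity in hits[1:]
              if species != top_species]
    if not others:
        return None
    return top_identity - max(others)


# Made-up hit table for illustration.
hits = [
    ("Thunnus albacares", 99.7),
    ("Thunnus albacares", 99.5),
    ("Thunnus obesus", 98.2),
    ("Thunnus thynnus", 97.9),
]
print(round(gap_to_next_species(hits), 2))  # 1.5
```

Exact string comparison of names is an assumption here; subspecies labels or synonyms in the database would need normalising before this comparison is meaningful.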
Decision rules for species-level call:
- Top hit ≥98% identity
- Query coverage ≥90%
- Gap to next-best species ≥1% identity
- All three criteria must hold
Fall-backs:
- If top hit is ≥95% but <98% identity, or query coverage is <90%: genus-level call only (sample identified to genus, not species).
- If multiple species tied within 1%: report as ambiguous; call to the lowest common taxonomic rank.
- If top hit <95% identity: likely novel or mis-sequenced; flag for re-submission or re-sequencing.
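The decision rules and fall-backs above can be sketched as a single function. This is one possible reading, not a definitive implementation: the source does not say which fall-back wins when several apply, so the order used here (flag below 95% first, then ambiguous, then species or genus) is an assumption, as is treating a missing gap (all hits from one species) as passing the gap criterion. The check that the top hit's genus matches the next several hits is not implemented. Default thresholds mirror the protocol parameters; `genus_floor_pct` and the return labels are hypothetical names.

```python
def classify_call(identity_pct, coverage_pct, gap_pct,
                  identity_threshold_pct=98.0,
                  coverage_threshold_pct=90.0,
                  gap_to_next_min_pct=1.0,
                  genus_floor_pct=95.0):
    """Return 'species', 'genus', 'ambiguous', or 'flag_for_review'.

    gap_pct may be None when every hit belongs to the same species.
    """
    # Below the floor: likely novel or mis-sequenced, so no call is made.
    if identity_pct < genus_floor_pct:
        return "flag_for_review"

    # Several species within the minimum gap: cannot separate them.
    if gap_pct is not None and gap_pct < gap_to_next_min_pct:
        return "ambiguous"

    # All three criteria hold: species-level call.
    if (identity_pct >= identity_threshold_pct
            and coverage_pct >= coverage_threshold_pct):
        return "species"

    # Identity between the floor and the threshold, or low coverage.
    return "genus"


print(classify_call(99.5, 98.0, 1.5))  # species
print(classify_call(99.5, 98.0, 0.3))  # ambiguous
print(classify_call(96.0, 98.0, 2.0))  # genus
print(classify_call(93.0, 99.0, 5.0))  # flag_for_review
```

Treat the output as a first pass; ambiguous and flagged results still go to manual review as described in the procedure steps.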
Record format in LibreBiotech:
- Top-hit species + accession number.
- % identity, query coverage, gap to next.
- Database source (BOLD or NCBI).
- Decision: species-level / genus-level / ambiguous.
- For Sushi Truth pilot: `mislabeled=true` if the call differs from the claimed species.
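If calls are assembled in a script before being entered, the record format above could be captured in a small data class. Everything here is a sketch under assumptions: the field names are hypothetical and do not reflect an actual LibreBiotech schema, the accession is a placeholder, and the rule of comparing only the genus for genus-level calls and leaving `mislabeled` undetermined for ambiguous calls is one interpretation rather than something the procedure specifies.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SpeciesCall:
    sample_id: str
    claimed_species: str
    top_hit_species: str
    accession: str
    identity_pct: float
    coverage_pct: float
    gap_to_next_pct: Optional[float]
    database: str   # "BOLD" or "NCBI"
    decision: str   # "species", "genus", or "ambiguous"

    def mislabeled(self) -> Optional[bool]:
        """True/False when the call can be compared to the claim, else None."""
        claimed = self.claimed_species.strip().lower()
        called = self.top_hit_species.strip().lower()
        if self.decision == "species":
            return called != claimed
        if self.decision == "genus":
            # Assumption: compare only the genus (first word of the binomial).
            return called.split()[0] != claimed.split()[0]
        return None  # ambiguous calls are left for manual review

    def to_json(self) -> str:
        record = asdict(self)
        record["mislabeled"] = self.mislabeled()
        return json.dumps(record, indent=2)


# Illustrative values only; the accession is a placeholder, not a real record.
call = SpeciesCall(
    sample_id="SampleA_2026-04-20",
    claimed_species="Thunnus albacares",
    top_hit_species="Ruvettus pretiosus",
    accession="ACCESSION_PLACEHOLDER",
    identity_pct=99.6,
    coverage_pct=98.0,
    gap_to_next_pct=4.2,
    database="BOLD",
    decision="species",
)
print(call.to_json())
```

Keeping the raw numbers next to the decision is what makes each mislabeling claim traceable to a specific read and BLAST result.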
Troubleshooting.
| Symptom | Likely cause | Fix |
|---|---|---|
| No hits at all | Read contains only primer or has quality issues | Re-inspect chromatogram; re-trim more aggressively; verify species is in BOLD/NCBI |
| All hits are same species at 100% | Sample is well-characterised | Confident species call |
| Top two hits tied within 1% | Cryptic species or very recent divergence | Report to genus; note in annotation |
| Top hit matches organism you didn't expect | Mislabeling (for pilot) OR contamination | Verify sample chain of custody; check NTC amplicon on gel |
| High % identity but short coverage | Primer-dimer carryover in read | Re-trim read more aggressively; re-submit to Sanger if needed |
| Many species within 1% identity | COI not diagnostic for your taxon | Consider multi-locus barcoding (ITS + rbcL for plants; multi-COI-region for fish) |
References
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 215(3):403–10. (Original BLAST paper.)
- Ratnasingham S, Hebert PDN (2007). BOLD: The Barcode of Life Data System. Mol Ecol Notes 7(3):355–64.
- NCBI BLAST: https://blast.ncbi.nlm.nih.gov
- BOLD Systems: https://boldsystems.org
- LibreBiotech procedure 75: Sanger Sequencing Submission (upstream).