bakta
Tags: bacteria annotation genome functional-annotation taxonomy sample-scope
Rapid bacterial genome annotation.
This subworkflow uses Bakta to provide rapid, comprehensive annotation of bacterial genomes. It can either download and prepare the Bakta database on demand or use a pre-existing database. Each sample is processed individually, producing multiple output formats including GFF3, GenBank, protein sequences, nucleotide sequences, and a BLAST database.
Take
assembly: Channel<Record>
| Field | Description |
|---|---|
| meta | Groovy Record containing sample information |
| assembly | Assembled contigs in FASTA format |
database: Path?
download_bakta: Boolean
save_as_tarball: Boolean
proteins: Path?
prodigal_tf: Path?
replicons: Path?
| Name | Type | Description |
|---|---|---|
| database | Path? | Optional pre-existing Bakta database path |
| download_bakta | Boolean | Boolean flag to trigger automatic database download |
| save_as_tarball | Boolean | Boolean flag to save downloaded database as tarball |
| proteins | Path? | Optional trusted protein sequences for homology search |
| prodigal_tf | Path? | Optional Prodigal training file for improved gene prediction |
| replicons | Path? | Optional replicon sequences for plasmid identification |
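As an illustration of how the optional inputs above might reach the Bakta command line, here is a minimal Python sketch. The function `build_bakta_command`, the `results` output directory, and the file names are hypothetical. The flags `--db`, `--prefix`, `--output`, `--proteins`, `--prodigal-tf`, and `--replicons` are options of the Bakta CLI, but mapping each subworkflow input to the flag of the same name is an assumption, not something this page confirms about the `bakta_run` module.

```python
from pathlib import Path
from typing import List, Optional


def build_bakta_command(
    assembly: Path,
    database: Path,
    prefix: str,
    proteins: Optional[Path] = None,
    prodigal_tf: Optional[Path] = None,
    replicons: Optional[Path] = None,
) -> List[str]:
    """Assemble a Bakta command, adding a flag only for inputs that were provided.

    Hypothetical helper: the input-to-flag mapping is assumed for illustration.
    """
    cmd = ["bakta", "--db", str(database), "--prefix", prefix, "--output", "results"]
    optional_flags = [
        ("--proteins", proteins),
        ("--prodigal-tf", prodigal_tf),
        ("--replicons", replicons),
    ]
    for flag, value in optional_flags:
        # Inputs typed as Path? may be absent; skip their flags entirely.
        if value is not None:
            cmd += [flag, str(value)]
    # The assembly FASTA is the positional argument.
    cmd.append(str(assembly))
    return cmd


if __name__ == "__main__":
    # Made-up file names, used only to show the resulting command.
    command = build_bakta_command(
        Path("sample01.fna"),
        Path("bakta-db"),
        "sample01",
        proteins=Path("trusted.faa"),
    )
    print(" ".join(command))
```

Inputs left unset add nothing to the command, which mirrors the optional `Path?` typing in the table.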
Emit
Published
The sample_outputs and run_outputs emissions are aggregates of output files that will be published in the entry workflow.
sample_outputs
| Output | Description |
|---|---|
| embl | Annotations and sequences in EMBL format |
| faa | CDS/sORF amino acid sequences as FASTA |
| ffn | Feature nucleotide sequences as FASTA |
| fna | Replicon/contig DNA sequences as FASTA |
| gbff | Annotations and sequences in GenBank format |
| gff | Annotations and sequences in GFF3 format |
| hypotheticals_tsv | Further information on hypothetical protein CDS as tab-separated values |
| hypotheticals_faa | Hypothetical protein CDS amino acid sequences as FASTA |
| tsv | Annotations as simple human readable tab-separated values |
| txt | Broad summary of Bakta annotations |
| blastdb | A compressed tar.gz archive of BLAST+ databases of the contigs, genes, and proteins |
run_outputs
No run-scope outputs.
Downstream Inputs
The following emissions are meant to be used as inputs to downstream subworkflows.
annotations
| Output | Description |
|---|---|
| fna | Annotated nucleotide sequences in FASTA format |
| faa | Protein sequences in FASTA format |
| gff | Annotations in GFF3 format |
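To show one way a downstream step might consume the `gff` emission, the following Python sketch counts features by type in GFF3 text. The function `count_gff3_features` and the example rows are made up for illustration. It relies only on the GFF3 layout of nine tab-separated columns with the feature type in column 3, and assumes the file is uncompressed (a gzipped file would need to be opened with `gzip.open` in text mode).

```python
from collections import Counter
from typing import Iterable


def count_gff3_features(lines: Iterable[str]) -> Counter:
    """Count features by type (column 3) in GFF3 text."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            # Everything after this directive is embedded sequence, not features.
            break
        if not line or line.startswith("#"):
            # Skip blank lines, directives, and comments.
            continue
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows in GFF3 layout, not real Bakta output.
    example = [
        "##gff-version 3",
        "contig_1\tBakta\tregion\t1\t5000\t.\t+\t.\tID=contig_1",
        "contig_1\tProdigal\tCDS\t10\t400\t.\t+\t0\tID=ABC_00001",
        "contig_1\ttRNAscan-SE\ttRNA\t500\t575\t.\t-\t.\tID=ABC_00002",
        "contig_1\tProdigal\tCDS\t600\t900\t.\t-\t0\tID=ABC_00003",
        "##FASTA",
        ">contig_1",
    ]
    for feature_type, n in count_gff3_features(example).items():
        print(feature_type, n)
```

To run it on a real file, pass an open file handle, for example `count_gff3_features(open("sample.gff3"))`, where `sample.gff3` is a placeholder path.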
Module Composition
This subworkflow calls the following modules:
- bakta_download - Download the Bakta annotation database.
- bakta_run - Rapid and standardized annotation of bacterial genomes and plasmids.
Used By
This subworkflow is used by the following workflows:
- bactopia - Comprehensive bacterial analysis pipeline for complete genomic characterization.
- bakta - Rapid annotation of bacterial genomes and plasmids.
- staphopia - Comprehensive analysis pipeline for Staphylococcus aureus isolates.
Citations
If you use this in your analysis, please cite the following.
- Bactopia
  Petit III RA, Read TD. Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
- Bakta
  Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta - rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics 7(11) (2021)