bactopia_gather

Tags: validation download merging simulation metadata fastq sra ena art sample-scope

Search, validate, gather, and standardize input samples.

This subworkflow takes raw input samples through validation, standardization, and metadata collection. It accepts local FASTQ files, SRA/ENA accessions, NCBI assembly accessions, and assemblies. It can merge multiple sequencing runs, download remote data, and simulate reads from assemblies using ART.

The subworkflow uses explicit positional record fields for reads:

  • Input: record(meta, r1_files, r2_files, se_files, lr_files) with Set<Path> slots (pre-merge)
  • Output: record(meta, r1, r2, se, lr) with Path? slots (post-merge, consolidated)
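
The two record shapes listed above can be pictured with a small Python model. This is only an illustrative sketch, not the subworkflow's Nextflow code: the class names, the `consolidate` helper, the `meta["id"]` key, and the merged file names are all hypothetical, and no files are read or written. It shows how a slot holding a set of files (possibly empty, possibly containing nulls) could collapse into a single optional path.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Set


@dataclass
class GatherInput:
    """Pre-merge record: each read slot is a set of files (hypothetical model)."""
    meta: dict
    r1_files: Set[Optional[Path]] = field(default_factory=set)
    r2_files: Set[Optional[Path]] = field(default_factory=set)
    se_files: Set[Optional[Path]] = field(default_factory=set)
    lr_files: Set[Optional[Path]] = field(default_factory=set)


@dataclass
class GatherOutput:
    """Post-merge record: each read slot holds at most one file."""
    meta: dict
    r1: Optional[Path] = None
    r2: Optional[Path] = None
    se: Optional[Path] = None
    lr: Optional[Path] = None


def consolidate(files: Set[Optional[Path]], merged_name: str) -> Optional[Path]:
    """Collapse a slot into one optional path.

    Empty slot -> None; one file -> that file; several files -> a
    placeholder path standing in for the merged result (no I/O happens).
    """
    present = sorted(f for f in files if f is not None)
    if not present:
        return None
    if len(present) == 1:
        return present[0]
    return Path(merged_name)


def to_output(rec: GatherInput) -> GatherOutput:
    sample = rec.meta["id"]  # assumed key; the real meta fields may differ
    return GatherOutput(
        meta=rec.meta,
        r1=consolidate(rec.r1_files, f"{sample}_R1.fastq.gz"),
        r2=consolidate(rec.r2_files, f"{sample}_R2.fastq.gz"),
        se=consolidate(rec.se_files, f"{sample}.fastq.gz"),
        lr=consolidate(rec.lr_files, f"{sample}_lr.fastq.gz"),
    )


# Two paired-end runs for one sample; the single-end and long-read slots stay empty.
example = to_output(GatherInput(
    meta={"id": "sampleA"},
    r1_files={Path("run1_R1.fastq.gz"), Path("run2_R1.fastq.gz")},
    r2_files={Path("run1_R2.fastq.gz"), Path("run2_R2.fastq.gz")},
))
print(example)
```

In this model, slots with no usable files become `None`, which mirrors the `Path?` slots of the output record.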

Take

samples: Channel<Record>
Field     Description
meta      Groovy Record containing sample information
r1_files  Illumina R1 read files (Set, elements may be null)
r2_files  Illumina R2 read files (Set, elements may be null)
se_files  Single-end read files (Set, elements may be null)
lr_files  Long read files (ONT/PacBio) or assembly for simulation (Set, elements may be null)

Emit

Published

The sample_outputs and run_outputs emissions are aggregates of output files that will be published in the entry workflow.

sample_outputs

Output  Description
tsv     A tab-delimited metadata file describing the valid samples

run_outputs

Output  Description
csv     Aggregated metadata from all samples

Downstream Inputs

The following emissions are meant to be used as inputs to downstream subworkflows.

reads

Output  Description
r1      Illumina R1 reads (paired-end forward)
r2      Illumina R2 reads (paired-end reverse)
se      Single-end Illumina reads
lr      Long reads (ONT/PacBio)
fna     Assembly file for assembly-based samples

Module Composition

This subworkflow calls the following modules:

  • bactopia_gather - Search, validate, gather, or simulate input samples.
  • csvtk_concat - Concatenate multiple CSV or TSV files into a single table.
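
To make the concatenation step concrete, here is a minimal Python sketch of joining several delimited tables into one. It is not csvtk's implementation, and csvtk's handling of mismatched columns may differ. The function name `concat_tables` and the column names in the usage example are hypothetical.

```python
import csv
import io
from typing import Iterable, List


def concat_tables(tables: Iterable[str], delimiter: str = "\t") -> str:
    """Concatenate delimited tables (given as text) into one table.

    The header is the union of all column names in first-seen order;
    cells missing from a given table are left empty.
    """
    header: List[str] = []
    rows: List[dict] = []
    for text in tables:
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        for name in reader.fieldnames or []:
            if name not in header:
                header.append(name)
        rows.extend(reader)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=header, delimiter=delimiter,
        restval="", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Two per-sample metadata tables with made-up columns.
sample_a = "sample\truntype\nsampleA\tpaired-end\n"
sample_b = "sample\truntype\nsampleB\tsingle-end\n"
merged = concat_tables([sample_a, sample_b])
print(merged)
```

Passing `delimiter=","` handles comma-separated tables in the same way.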

Used By

This subworkflow is used by the following workflows:

  • bactopia - Comprehensive bacterial analysis pipeline for complete genomic characterization.
  • cleanyerreads - Quality control and optional host read removal from raw sequencing reads.
  • staphopia - Comprehensive analysis pipeline for Staphylococcus aureus isolates.
  • teton - Taxonomic classification and abundance profiling of metagenomic reads.

Citations

If you use this in your analysis, please cite the following.
