MAGMA
Here we briefly introduce the main outputs of a MAGMA pipeline run. Please note that some outputs are optional and are only generated when specific parameters are set.
Tutorials and Presentations
Tim Huepink and Lennert Verboven created an in-depth tutorial of the features of the variant calling in MAGMA:
We have also included a presentation (in PDF format) of the logic and workflow of the MAGMA pipeline, as well as posters that have been presented at conferences. Please refer to the docs folder.
Interpretation
The results directory produced by MAGMA is as follows:
/path/to/results_dir/
.
├── QC_statistics
├── analyses
└── vcf_files
QC Statistics Directory
In this directory you will find files related to the quality control carried out by the MAGMA pipeline. The structure is as follows:
/path/to/results_dir/QC_statistics
├── cohort
│   ├── fastq_validation
│   └── multiqc
│       └── multiqc_data
└── per_sample
    ├── coverage
    ├── fastqc
    └── mapping
cohort
Here you will find the joint.merged_cohort_stats.tsv file, which contains the QC statistics for all samples in the samplesheet and allows users to determine why certain samples failed to be incorporated in the cohort analysis steps.
In addition, you'll find the cohort-level MultiQC report generated from the per_sample/fastqc analysis and the FASTQ validation report in JSON format.
Understanding the joint.merged_cohort_stats.tsv file
To accurately interpret the joint.merged_cohort_stats.tsv file and assess the quality of each sample analysis, we recommend that users consult the following table, which summarizes all relevant parameters.
Parameter | Meaning |
---|---|
SAMPLE | Identifier of the sample being analyzed. |
AVG_INSERT_SIZE | Average size (in base pairs) of the DNA fragment between paired-end reads. |
MAPPED_PERCENTAGE | Percentage of reads successfully aligned to the reference genome. |
RAW_TOTAL_SEQS | Total number of raw reads sequenced, before any filtering or trimming. |
AVERAGE_BASE_QUALITY | Average Phred quality score of bases across all reads. |
MEAN_COVERAGE | Average number of times each base of the reference is covered by aligned reads. |
SD_COVERAGE | Standard deviation of the per-base coverage; indicates variation in coverage. |
MEDIAN_COVERAGE | Median coverage per base across the reference; less sensitive to outliers than the mean. |
MAD_COVERAGE | Median Absolute Deviation of coverage; measures the variability around the median. |
PCT_EXC_ADAPTER | Percentage of bases or reads excluded due to presence of adapter sequences. |
PCT_EXC_MAPQ | Percentage of bases or reads excluded due to low mapping quality (MAPQ). |
PCT_EXC_DUPE | Percentage of bases or reads marked as duplicates. |
PCT_EXC_UNPAIRED | Percentage of reads excluded because they are unpaired in paired-end data. |
PCT_EXC_BASEQ | Percentage of bases excluded due to low base quality. |
PCT_EXC_OVERLAP | Percentage of bases excluded due to overlapping paired-end reads. |
PCT_EXC_CAPPED | Percentage of bases excluded because they would have raised coverage above the coverage cap. |
PCT_EXC_TOTAL | Total percentage of excluded bases across all categories. |
PCT_1X | Percentage of target bases covered by at least 1 read. |
PCT_5X | Percentage of target bases covered by at least 5 reads. |
PCT_10X | Percentage of target bases covered by at least 10 reads. |
PCT_30X | Percentage of target bases covered by at least 30 reads. |
PCT_50X | Percentage of target bases covered by at least 50 reads. |
PCT_100X | Percentage of target bases covered by at least 100 reads. |
LINEAGES FREQUENCIES | Relative abundance of different detected lineages. |
MAPPED_NTM_FRACTION_16S | Fraction of 16S reads aligned to non-tuberculous mycobacteria (NTM). |
MAPPED_NTM_FRACTION_16S_THRESHOLD_MET | Whether the mapped NTM fraction in 16S exceeded the defined threshold (Boolean). |
COVERAGE_THRESHOLD_MET | Whether the sample met the minimum average coverage requirement (Boolean). |
BREADTH_OF_COVERAGE_THRESHOLD_MET | Whether the required genome breadth was achieved (Boolean). |
RELABUNDANCE_THRESHOLD_MET | Whether the relative abundance of target organism exceeded the required threshold (Boolean). |
ALL_THRESHOLDS_MET | Boolean indicating whether all quality and detection thresholds were met. |
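As a quick way to act on the table above, the following minimal sketch lists the samples that did not meet all thresholds, together with the individual checks they missed. It assumes the file is tab-separated with a header row using the column names listed above, and that Boolean columns are written as 1/0 or True/False; both points are assumptions, so check them against your own file. The names `failed_samples` and `is_true` are hypothetical helpers and not part of MAGMA.

```python
import csv
import io

# Threshold columns taken from the table above.
THRESHOLD_COLUMNS = [
    "MAPPED_NTM_FRACTION_16S_THRESHOLD_MET",
    "COVERAGE_THRESHOLD_MET",
    "BREADTH_OF_COVERAGE_THRESHOLD_MET",
    "RELABUNDANCE_THRESHOLD_MET",
]

# Assumption: Booleans are serialised as one of these strings.
TRUTHY = {"1", "true", "yes"}


def is_true(value):
    """Interpret a TSV cell as a Boolean (the encoding is an assumption)."""
    return str(value).strip().lower() in TRUTHY


def failed_samples(handle):
    """Map each failing sample to the list of threshold columns it missed."""
    failures = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if is_true(row.get("ALL_THRESHOLDS_MET", "")):
            continue
        # Only inspect threshold columns that are present in this file.
        missed = [c for c in THRESHOLD_COLUMNS if c in row and not is_true(row[c])]
        failures[row["SAMPLE"]] = missed
    return failures


# Made-up rows for illustration only; not real MAGMA output.
example = io.StringIO(
    "SAMPLE\tCOVERAGE_THRESHOLD_MET\tBREADTH_OF_COVERAGE_THRESHOLD_MET\tALL_THRESHOLDS_MET\n"
    "S1\t1\t1\t1\n"
    "S2\t0\t1\t0\n"
)
print(failed_samples(example))
```

For a real run, replace the in-memory example with `open(path, newline="")` pointing at your own joint.merged_cohort_stats.tsv.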
The interpretation of median (middle value) or mean (average) coverage depth should be guided by the following categorization:
Coverage Depth | Interpretation |
---|---|
≥ 50X | A genome is analysable; nothing should be missed |
≥ 20X to < 50X | A genome is analysable with nearly all minor variants detected |
≥ 10X to < 20X | A genome is analysable, but some minor variants may not be detected |
≥ 5X to < 10X | A genome is analysable, but some variants may not be detected |
< 5X | A sample is not analysable |
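The categories above can be expressed as a small lookup, sketched here for use when screening many samples at once. The function name `interpret_coverage` is hypothetical; the boundaries and wording come directly from the table, and the input can be either the MEDIAN_COVERAGE or the MEAN_COVERAGE value.

```python
def interpret_coverage(depth):
    """Return the interpretation for a median or mean coverage depth (in X)."""
    if depth < 0:
        raise ValueError("coverage depth cannot be negative")
    if depth >= 50:
        return "analysable; nothing should be missed"
    if depth >= 20:
        return "analysable with nearly all minor variants detected"
    if depth >= 10:
        return "analysable, but some minor variants may not be detected"
    if depth >= 5:
        return "analysable, but some variants may not be detected"
    return "not analysable"


# Example depths chosen to hit each side of the category boundaries.
for depth in (75.0, 20.0, 12.3, 4.9):
    print(depth, "->", interpret_coverage(depth))
```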
per_sample/coverage
Contains the GATK WGSMetrics outputs for each sample in the samplesheet.
per_sample/mapping
Contains the samtools flagstat and samtools stats outputs for each sample in the samplesheet.
Analysis Directory
/path/to/results_dir/analysis
├── cluster_analysis
├── drug_resistance
├── non-tuberculous_mycobacteria
├── phylogeny
├── spotyping
└── snp_distances
- Cluster Analysis
Contains files related to clustering based on 5SNP and 12SNP cutoffs, both including and excluding complex regions.
.figtree files: These can be imported directly into FigTree for visualisation.
- Drug Resistance
Organised based on the different types of variants as well as combined results:
/path/to/results_dir/analysis/drug_resistance
├── combined_resistance_summaries
├── combined_resistance_summaries_mixed_infection_samples
├── major_variants_xbs
├── minor_variants_lofreq
├── structural_variants_delly
└── tbprofiler_fastq
Each of the directories containing results for the different variant types (major | minor | structural) has text files that can be used to annotate the .treefiles produced by MAGMA in iTOL (https://itol.embl.de).
The combined resistance results file contains a per-sample drug resistance summary based on the WHO Catalogue of Mtb mutations (https://www.who.int/publications/i/item/9789240082410).
MAGMA also notes the presence of all variants in tier 1 and tier 2 drug resistance genes.
MAGMA will generate mixed infection reports and can optionally run tbprofiler on the fastq files for comparison purposes.
- Non-Tuberculous Mycobacteria (NTM)
Contains a brief report of NTM presence in the submitted samples, organised into cohort and per_sample subdirectories.
- Phylogeny
Contains the outputs of the IQTree phylogenetic tree construction.
:memo: By default we recommend that you use the ExDRIncComplex files as MAGMA was optimized to be able to accurately call positions on the edges of complex regions in the Mtb genome
- SNP distances
Contains the SNP distance tables in tsv format.
:memo: By default we recommend that you use the ExDRIncComplex files as MAGMA was optimized to be able to accurately call positions on the edges of complex regions in the Mtb genome
- Spotyping
Contains a spoligotyping pattern prediction using SpoTyping.
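To make the relationship between the SNP distance tables and the 5SNP and 12SNP cluster cutoffs more concrete, here is a minimal sketch of threshold-based single-linkage clustering: two samples end up in the same cluster if they are connected by a chain of pairs whose SNP distance is at or below the cutoff. This illustrates the general idea only. It is not taken from the MAGMA source, and whether MAGMA uses single linkage or an inclusive cutoff is an assumption here. The function name, the input layout (a dictionary of pairwise distances), and the example values are all hypothetical.

```python
def cluster_by_snp_cutoff(distances, cutoff):
    """Group samples linked by pairwise SNP distance <= cutoff (single linkage).

    distances: dict mapping (sample_a, sample_b) -> SNP distance.
    Returns a sorted list of sorted clusters, including singletons.
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= cutoff and root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for sample in list(parent):
        clusters.setdefault(find(sample), []).append(sample)
    return sorted(sorted(members) for members in clusters.values())


# Made-up distances for illustration only; not real MAGMA output.
example = {
    ("A", "B"): 3, ("B", "C"): 10, ("A", "C"): 11,
    ("A", "D"): 40, ("B", "D"): 42, ("C", "D"): 45,
}
print(cluster_by_snp_cutoff(example, 5))
print(cluster_by_snp_cutoff(example, 12))
```

With single linkage, sample C joins the A/B cluster at the 12-SNP cutoff through its distance of 10 to B, which shows how the looser cutoff can merge clusters that the stricter one keeps apart.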
vcf_files Directory
/path/to/results_dir/vcf_files
├── cohort
│   ├── combined_variant_files
│   ├── minor_variants
│   ├── multiple_alignment_files
│   ├── raw_variant_files
│   ├── snp_variant_files
│   └── structural_variants
└── per_sample
    ├── minor_variants
    ├── raw_variant_files
    └── structural_variants
- Combined variant files
Contains the cohort gVCFs based on major variants detected by the MAGMA pipeline
- Minor variants
Merged VCFs of all samples, generated by LoFreq
- Multiple alignment files
FASTA files for the generation of phylogenetic trees by IQTree
- Raw variant files
Unfiltered indels and SNPs detected by the MAGMA pipeline
- SNP variant files
Filtered SNPs detected by the MAGMA pipeline
- Structural variant files
Unfiltered structural variants detected by the MAGMA pipeline
Libraries Directory
Contains files related to FASTQ validation and FastQC analysis
Samples Directory
Contains VCF files for major|minor|structural variants for each individual sample