MAGMA

Here we briefly introduce the main outputs of a MAGMA pipeline run. Please note that some outputs are optional and are generated only when specific parameters are set.

Interpretation

Tutorials and Presentations

Tim Huepink and Lennert Verboven created an in-depth tutorial of the features of the variant calling in MAGMA:

We have also included a presentation (in PDF format) of the logic and workflow of the MAGMA pipeline, as well as posters that have been presented at conferences. Please refer to the docs folder.

Interpretation

The results directory produced by MAGMA is as follows:

/path/to/results_dir/
.
├── QC_statistics
├── analyses
└── vcf_files
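
As a quick sanity check after a run, the layout above can be verified programmatically. The following is a minimal sketch, not part of MAGMA itself: the function name `missing_subdirs` is hypothetical, and the expected directory names are simply copied from the tree above.

```python
from pathlib import Path

# Top-level subdirectories as listed in the tree above.
EXPECTED_SUBDIRS = ("QC_statistics", "analyses", "vcf_files")


def missing_subdirs(results_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that are absent from results_dir."""
    root = Path(results_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_subdirs("/path/to/results_dir")` returns an empty list when all three directories are present.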

QC Statistics Directory

In this directory you will find files related to the quality control carried out by the MAGMA pipeline. The structure is as follows:

/path/to/results_dir/QC_statistics
├── cohort
│   ├── fastq_validation
│   └── multiqc
│       └── multiqc_data
└── per_sample
    ├── coverage
    ├── fastqc
    └── mapping

cohort

Here you will find the joint.merged_cohort_stats.tsv file, which contains the QC statistics for all samples in the samplesheet and allows users to determine why certain samples failed to be incorporated in the cohort analysis steps.

In addition, you’ll find the cohort-level MultiQC report generated from the per_sample/fastqc analysis and the FASTQ validation report in JSON format.

Understanding the joint.merged_cohort_stats.tsv file

To accurately interpret the joint.merged_cohort_stats.tsv file and assess the quality of each sample analysis, we recommend that users consult the following table, which summarizes all relevant parameters.

Parameter	Meaning
SAMPLE	Identifier of the sample being analyzed.
AVG_INSERT_SIZE	Average size (in base pairs) of the DNA fragment between paired-end reads.
MAPPED_PERCENTAGE	Percentage of reads successfully aligned to the reference genome.
RAW_TOTAL_SEQS	Total number of raw reads sequenced, before any filtering or trimming.
AVERAGE_BASE_QUALITY	Average Phred quality score of bases across all reads.
MEAN_COVERAGE	Average number of times each base of the reference is covered by aligned reads.
SD_COVERAGE	Standard deviation of the per-base coverage; indicates variation in coverage.
MEDIAN_COVERAGE	Median coverage per base across the reference; less sensitive to outliers than the mean.
MAD_COVERAGE	Median Absolute Deviation of coverage; measures the variability around the median.
PCT_EXC_ADAPTER	Percentage of bases or reads excluded due to presence of adapter sequences.
PCT_EXC_MAPQ	Percentage of bases or reads excluded due to low mapping quality (MAPQ).
PCT_EXC_DUPE	Percentage of bases or reads marked as duplicates.
PCT_EXC_UNPAIRED	Percentage of reads excluded because they are unpaired in paired-end data.
PCT_EXC_BASEQ	Percentage of bases excluded due to low base quality.
PCT_EXC_OVERLAP	Percentage of bases excluded due to overlapping paired-end reads.
PCT_EXC_CAPPED	Percentage of bases excluded because they would have raised coverage above the coverage cap.
PCT_EXC_TOTAL	Total percentage of excluded bases across all categories.
PCT_1X	Percentage of target bases covered by at least 1 read.
PCT_5X	Percentage of target bases covered by at least 5 reads.
PCT_10X	Percentage of target bases covered by at least 10 reads.
PCT_30X	Percentage of target bases covered by at least 30 reads.
PCT_50X	Percentage of target bases covered by at least 50 reads.
PCT_100X	Percentage of target bases covered by at least 100 reads.
LINEAGES FREQUENCIES	Relative abundance of different detected lineages.
MAPPED_NTM_FRACTION_16S	Fraction of 16S reads aligned to non-tuberculous mycobacteria (NTM).
MAPPED_NTM_FRACTION_16S_THRESHOLD_MET	Whether the mapped NTM fraction in 16S exceeded the defined threshold (Boolean).
COVERAGE_THRESHOLD_MET	Whether the sample met the minimum average coverage requirement (Boolean).
BREADTH_OF_COVERAGE_THRESHOLD_MET	Whether the required genome breadth was achieved (Boolean).
RELABUNDANCE_THRESHOLD_MET	Whether the relative abundance of target organism exceeded the required threshold (Boolean).
ALL_THRESHOLDS_MET	Boolean indicating whether all quality and detection thresholds were met.
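
To see which samples were excluded from the cohort analysis and why, the threshold columns in the table above can be filtered with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `failed_thresholds` is hypothetical, the column names are taken from the table above, and the set of values treated as a passing Boolean (`TRUTHY`) is a guess that should be adjusted to match how your file encodes them.

```python
import csv
import io

# Assumed encodings of a passing Boolean value; adjust to match your file.
TRUTHY = {"1", "true", "yes", "t", "y"}


def _passed(value):
    """Return True if a cell holds one of the assumed passing values."""
    return (value or "").strip().lower() in TRUTHY


def failed_thresholds(tsv_text):
    """Map each sample that did not meet all thresholds to the checks it failed."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    failures = {}
    for row in reader:
        if _passed(row["ALL_THRESHOLDS_MET"]):
            continue
        # Collect the individual *_THRESHOLD_MET columns that did not pass.
        failures[row["SAMPLE"]] = [
            column
            for column, value in row.items()
            if column and column.endswith("_THRESHOLD_MET") and not _passed(value)
        ]
    return failures
```

The function takes the file contents as a string, for example the result of reading joint.merged_cohort_stats.tsv from the cohort directory with `pathlib.Path(...).read_text()`.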

The interpretation of median (middle value) or mean (average) coverage depth should be guided by the following categorization:

Coverage Depth	Interpretation
≥ 50X	A genome is analysable; nothing should be missed
≥ 20X to < 50X	A genome is analysable with nearly all minor variants detected
≥ 10X to < 20X	A genome is analysable, but some minor variants may not be detected
≥ 5X to < 10X	A genome is analysable, but some variants may not be detected
< 5X	A sample is not analysable
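
The categories above translate directly into a small lookup, which can be handy when annotating many samples at once. This is only an illustrative sketch: the function name `interpret_coverage` is hypothetical, and the returned strings paraphrase the table.

```python
def interpret_coverage(depth):
    """Return the interpretation from the table above for a coverage depth (in X)."""
    if depth >= 50:
        return "analysable; nothing should be missed"
    if depth >= 20:
        return "analysable with nearly all minor variants detected"
    if depth >= 10:
        return "analysable, but some minor variants may not be detected"
    if depth >= 5:
        return "analysable, but some variants may not be detected"
    return "not analysable"
```

The input would typically be the MEDIAN_COVERAGE or MEAN_COVERAGE value of a sample.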

per_sample/coverage

Contains the GATK WGSMetrics outputs for each of the samples in the samplesheet.

per_sample/mapping

Contains the FlagStat and samtools stats outputs for each of the samples in the samplesheet.

Analysis Directory

/path/to/results_dir/analysis
├── cluster_analysis
├── drug_resistance
├── non-tuberculous_mycobacteria
├── phylogeny
├── spotyping
└── snp_distances

Cluster Analysis

Contains files related to clustering based on 5 SNP and 12 SNP cutoffs, both including and excluding complex regions.

.figtree files: These can be imported directly into FigTree for visualisation.
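
To illustrate what a SNP cutoff means for clustering, the sketch below groups samples by single linkage: two samples end up in the same cluster if they are connected by a chain of pairs that each differ by no more than the cutoff. This is an assumption made for illustration only; it is not necessarily the method MAGMA uses. The function name `snp_clusters`, the input format, and the use of `<=` for the cutoff are all hypothetical.

```python
def snp_clusters(distances, cutoff):
    """Group samples into single-linkage clusters.

    distances: dict mapping (sample_a, sample_b) pairs to SNP distances.
    Two samples are linked when their distance is <= cutoff (an assumption).
    """
    parent = {}

    def find(sample):
        # Union-find lookup with path halving.
        parent.setdefault(sample, sample)
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]
            sample = parent[sample]
        return sample

    for (a, b), distance in distances.items():
        root_a, root_b = find(a), find(b)
        if distance <= cutoff and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for sample in list(parent):
        clusters.setdefault(find(sample), set()).add(sample)
    return sorted(sorted(members) for members in clusters.values())
```

With single linkage, a looser cutoff can only merge clusters, never split them, which is why the 12 SNP clusters always contain the 5 SNP clusters.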

Drug Resistance

Organised based on the different types of variants as well as combined results:

/path/to/results_dir/analysis/drug_resistance
├── combined_resistance_summaries
├── combined_resistance_summaries_mixed_infection_samples
├── major_variants_xbs
├── minor_variants_lofreq
├── structural_variants_delly
└── tbprofiler_fastq

Each of the directories containing results for the different variant types (major | minor | structural) has text files that can be used to annotate the .treefiles produced by MAGMA in iTOL (https://itol.embl.de).

The combined resistance results file contains a per-sample drug resistance summary based on the WHO Catalogue of Mtb mutations (https://www.who.int/publications/i/item/9789240082410).

MAGMA also notes the presence of all variants in tier 1 and tier 2 drug resistance genes.

MAGMA generates mixed infection reports and can optionally run tbprofiler on the FASTQ files for comparison purposes.

Non-Tuberculous Mycobacteria (NTM)

Contains a brief report of NTM presence in the submitted samples, organised into cohort and per_sample subdirectories.

Phylogeny

Contains the outputs of the IQTree phylogenetic tree construction.

:memo: By default, we recommend using the ExDRIncComplex files, as MAGMA was optimized to accurately call positions on the edges of complex regions in the Mtb genome.

SNP distances

Contains the SNP distance tables in tsv format.
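
A distance table can be scanned for closely related sample pairs with the standard library alone. The sketch below assumes a square matrix layout (a header row of sample names, then one row per sample starting with its name and followed by integer distances). That layout is an assumption, so check it against your own files first. The function name `close_pairs` is hypothetical.

```python
import csv
import io


def close_pairs(tsv_text, cutoff):
    """Return (sample_a, sample_b, distance) for pairs with distance <= cutoff.

    Assumes a square matrix: a header row of sample names (first cell is a
    label or empty) and one row per sample starting with its name.
    """
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    names = rows[0][1:]
    pairs = []
    for i, row in enumerate(rows[1:]):
        # Only read the upper triangle so each pair is reported once.
        for j in range(i + 1, len(names)):
            distance = int(row[j + 1])
            if distance <= cutoff:
                pairs.append((row[0], names[j], distance))
    return pairs
```

The function expects the table contents as a string, for example from `pathlib.Path(...).read_text()`.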

:memo: By default, we recommend using the ExDRIncComplex files, as MAGMA was optimized to accurately call positions on the edges of complex regions in the Mtb genome.

Spotyping

Contains a spoligotyping pattern prediction using SpoTyping.

`vcf_files` Directory

/path/to/results_dir/vcf_files
├── cohort
│   ├── combined_variant_files
│   ├── minor_variants
│   ├── multiple_alignment_files
│   ├── raw_variant_files
│   ├── snp_variant_files
│   └── structural_variants
└── per_sample
    ├── minor_variants
    ├── raw_variant_files
    └── structural_variants

Combined variant files

Contains the cohort gVCFs based on major variants detected by the MAGMA pipeline.

Minor variants

Merged VCFs of all samples, generated by LoFreq.

Multiple alignment files

FASTA files for the generation of phylogenetic trees by IQTree.

Raw variant files

Unfiltered indels and SNPs detected by the MAGMA pipeline.

SNP variant files

Filtered SNPs detected by the MAGMA pipeline.

Structural variant files

Unfiltered structural variants detected by the MAGMA pipeline.

Libraries Directory

Contains files related to FASTQ validation and FastQC analysis.

Samples Directory

Contains VCF files for major | minor | structural variants for each individual sample.