Random forest classifiers trained on simulated data enable accurate short read-based genotyping of structural variants in the alpha globin region at Chr16p13.3

In regions where reads don’t align well to a reference, it is generally difficult to characterize structural variation using short read sequencing. Here, we utilize machine learning classifiers and short sequence reads to genotype structural variants in the alpha globin locus on chromosome 16, a medically-relevant region that is challenging to genotype in individuals. Using models trained only with simulated data, we accurately genotype two hard-to-distinguish deletions in two separate human cohorts. Furthermore, population allele frequencies produced by our methods across a wide set of ancestries agree more closely with previously-determined frequencies than those obtained using currently available genotyping software.

The download may take a while, but once it's finished, you should have 14 gzipped fasta files in the haplotype_assemblies directory which you can then convert to bgzipped files and index with samtools: >for fasta in `ls HG*fa.gz`;do \ gunzip -c $fasta | bgzip -c > $fasta.bgzip;\ mv $fasta.bgzip$fasta; \ samtools faidx $fasta; \ done Once the HPGP genome fastas are downloaded into the appropriate directory and indexed, you can configure the Snakemake pipeline by editing the "config.yaml"file in the config directory, and create files of training features using the "sh.make_sim" script.

About the config file for running mlgenofeatures
Several useful parameters for simulating the six -thalassemia genotypes are in the config file located at config/config.yaml in the mlgenofeatures repository.These parameters can be tuned to the particular srWGS dataset you are analyzing, so the model will be trained with simulated data that looks like your real data.For example, the parameter "readlength" can be set to the the length of the reads, and "insertlength" can be set to the mean total length of your library inserts (from the start of the first read to the start of the second read), while "insertstddev" sets the standard deviation of insert lengths in the simulated dataset.
By default, mlgenofeatures will generate k-mer features from simulated read datasets with a variety of depths of coverage (ranging from 10x to 60x), but if you know specifically what depth of coverage your whole genome read sets have, you can tailor the simulated read coverage to be closer to your dataset's by editing the list following the line "coveragevals:" in the config file.The rest of the parameters in the config file shouldn't be altered unless you are creating a simulated dataset for a different part of the genome or with different background genome fastas.

Generating feature files to use in training
With the Snakefile located in the "workflow" subdirectory of the repository, the snakemake command can be used to create feature files for each of the six -thalassemia genotypes: >cd ../.. >for geno in (WTYP_WTYP, WTYP_AL37, WTYP_AL42, AL37_AL37, \ AL37_AL42, AL42_AL42); do snakemake -cores=16 features/combined_feature_file.$geno.txtdone These "combined feature" files can be combined into a single file for use in training classifiers by extracting the header from one of them and combining it with headerless versions of each of the six files:

Training random forest models
To use the features you generated in the previous step to train random forest genotyping models, install the mlgenotype python library, either from PyPi or from bioconda using anaconda.
To install mlgenotype with Python's pip installer, first create a virtual environment.Then use pip install to install the latest version of mlgenotype:
The "output/predictions" directory contains predicted genotypes for each model on the held out "test" samples whose features were passed to the script with the -test option.These output files were used to calculate the data plotted in Figure 1A.