A simplified workflow for the analysis of whole-genome sequencing data from Pristionchus pacificus mutant lines

Nematodes are attractive model systems to understand the genetic basis of various biological processes ranging from development to complex behaviors. In particular, mutagenesis experiments combined with whole-genome sequencing has been proven as one of the most effective methods to identify core players of multiple biological pathways. To enable experimentalists to apply such integrative genetic and bioinformatic analysis in the case of the satellite model organism Pristionchus pacificus, I present a simplified workflow for the analysis of whole-genome data from mutant lines and corresponding mapping panels. Individual components are based on well-maintained and widely used software packages and are extended by 50 lines of code for the analysis and visualization of allele frequencies. The effectiveness of this workflow is demonstrated by an application to recently generated data of a P. pacificus mutant line, where it reduced the number of candidate mutations from an initial set of 3,500 single nucleotide variants to ten.


Introduction
Free-living nematodes such as Caenorhabditis elegans and Pristionchus pacificus are powerful model systems to dissect the genetic architecture of various traits and to perform comparative studies across the whole biological spectrum ranging from development to neurobiology and behavior (Sommer 2009;Sternberg 2015;Moreno et al. 2017;Hong et al. 2019) . The potential of nematode systems for comparative biology was even further enhanced with the advent of high-throughput sequencing and genome editing (Lo et al. 2013;Nakayama et al. 2020) .
However, while reverse genetic approaches based on candidate genes are promising tools to investigate the evolution of gene function Sieriebriennikov et al. 2017;Okumura et al. 2017) , forward genetic studies in non-classical model organisms such as P. pacificus , are complicated by the fact that most research groups are usually small and have limited access to bioinformatic support to guide the analysis of whole-genome sequencing (WGS) data. Although relatively user-friendly analysis pipelines for mutant WGS data exist (Minevich et al. 2012;Etherington et al. 2014;Sun and Schneeberger 2015;Wachsman et al. 2017) , they have been developed in species-specific manners and the integration of new organisms is often as complex as setting up a new pipeline. Thus, the research community still needs simple and well documented workflows that are composed of software packages that can easily be installed so that biologists with minimal computational skills can carry out their own analysis of WGS data. Genome sequencing of C. elegans and P. pacificus mutants often revealed in the order of thousands of candidate mutations even after successive rounds backcrossing (Rae et al. 2012;Iatsenko et al. 2013) . Subsequent filtering for mutations with effects on protein sequences and additional mapping data can reduce the number of candidates by a factor of hundred, which is often sufficient for further manual inspection and experimental validation (Witte et al. 2015;Sieriebriennikov et al. 2020) . Mapping data can be generated from recombinant inbred lines (Witte et al. 2015;Kieninger et al. 2016) or pooled sequencing of progeny with a mutant phenotype (Schneeberger et al. 2009;Doitsidou et al. 2010) . In a previous study, we crossed a mutant line with a wildtype P. pacificus strain PS1843 (Washington) and sequenced a pool of progeny showing the mutant phenotype ( Figure 1). The candidate region showed a pronounced bias to the reference allele. In combination with WGS of the pure mutant line, this resulted in the identification of the P. pacificus ortholog of C. elegans nhr-1 as a regulator of mouthform plasticity (Sieriebriennikov et al. 2020) .
Here, I present a step-by-step interactive workflow for the analysis of WGS data from P. pacificus mutant lines with simultaneous mapping data. The effectiveness of this workflow is demonstrated by the reanalysis of the WGS data of the P. pacificus nhr-1 mutant (Sieriebriennikov et al. 2020) . This computational pipeline is intended as a basic workflow to generate first results, but individual components can be added or be replaced to optimize and tailor the workflow according to specific needs.

Figure 1 Generation of a mapping panel for pooled sequencing.
A recessive mutant allele can be mapped by crossing a mutant line to a wildtype isolate and screening the F2 progeny for individuals showing the mutant phenotype. Selecting these individuals for pooled sequencing establishes a mapping panel that shows an allelic bias towards the reference alleles in the candidate region, which is highlighted by the gray dotted lines.

Results and Discussion
To allow small research groups without extensive bioinformatic expertise to analyze their own genome data, I present a simplified workflow for the analysis of WGS data from P. pacificus mutants. The bioinformatic workflow ( Figure 2) is mostly composed of widely used and well-maintained software packages such as BWA and samtools and is extended by two custom source codes for estimating the allele frequencies in pooled sequencing data (Supplemental File: Source code 1) and visualization (Supplemental File: Source code 2). The first step of the analysis consists in the alignment of the raw sequencing data to the P. pacificus reference assembly. This step can be executed in parallel for the WGS data from pure mutant lines, the mapping panel, and the wildtype strain PS1843 (Washington) for which WGS data was generated previously as part of population genomic studies of P. pacificus (Rödelsperger et al. 2014;McGaughran et al. 2016) . In the second step, the alignments of the wildtype P. pacificus strain PS1843 serve for the identification of roughly 1,300,000 marker positions for which allele frequencies can be Figure 2 Computational workflow for mapping the candidate region. WGS data of more than 300 natural isolates of P. pacificus (Rödelsperger et al. 2014;McGaughran et al. 2016) can be obtained from the NCBI Sequence Read and European Nucleotide Archive. These data can be reprocessed to establish informative marker positions for the sequencing data of the pooled mapping panel. Raw read alignment is carried out by BWA and variant analysis is performed by bcftools and samtools, which are widely used and easy to install software packages for the analysis of next generation sequencing data Li and Durbin 2010;Li 2011) .
computed from the data of the mapping panel ( Figure 2). To this end, the samtools mpileup command generates a text file which contains aligned reference and non-reference nucleotides for any given position, which is subsequently parsed by custom perl script (Supplemental File: Source code 1). This yields a data table that can be plotted with the help of R (Supplemental File: Source code 2). In the case of the mapping data of the nhr-1 mutant line, this revealed a high frequency of reference alleles on large parts of P. pacificus chromosome I (5-16Mb). The causative gene, P. pacificus nhr-1 is located at position 5.7Mb on chromosome X (Sieriebriennikov et al. 2020) . With this candidate interval, variants of the pure mutant line can be filtered with the help of the bcftools programs. More precisely, in a first step, variants in the candidate interval are selected, subsequently background variants that are common between the nhr-1 mutant line and its genetic background are removed (Kieninger et al. 2016) , and the remaining variants are screened for missense mutations in protein-coding exons. In the case of the P. pacificus nhr-1 mutant data this reduced the number of candidate mutations from roughly 3,500 single nucleotide variants to candidate mutations (Table 2). This demonstrates the effectiveness of this workflow to identify reasonably small numbers of candidate mutations after analysis of mapping data and WGS of a mutant line.
The presented workflow represents a baseline approach to analyze WGS data from  (Schneeberger et al. 2009;Doitsidou et al. 2010) . However, I would still recommend to sequence the pure mutant line, because certain structural variations in the wildtype strain could inflate the number of variants in the candidate region. In addition, the sequencing data of the pure mutant line will be helpful to filter out candidates in a follow up suppressor screen. Further, if the marker density is high, restriction site associated DNA (RAD) sequencing can be applied to reduce the cost of sequencing and protocols for WGS of individual worms are available, which could provide a much better definition of the boundaries of the candidate interval. We have previously successfully applied such methods for mapping associations between genotypes and phenotypes and population genomic analysis (Witte et al. 2015;Falcke et al. 2018;Zhou et al. 2019) . Thus, there is considerable flexibility in the design of the underlying experiments which will eventually require the computational workflow to be readjusted. Nevertheless, the presented workflow should be suitable as a baseline approach for analyzing WGS data from mutant lines.

Sequence retrieval, indexing of the reference genome, and alignment
The P. pacificus reference assembly (version El Paco  ) was downloaded from http://pristionchus.org/download/ . In the first step, the fasta file was decompressed and an index for efficient searching was built using the BWA program (version 0.7.17-r1188) (Li and Durbin 2009 The option -o specifies the name of the output file and the option -t indicates the numbers of processor cores that should be used for parallel computing. Subsequently the alignments were converted from text format (.sam) into binary format (.bam), sorted, and indexed. Finally, temporary alignment files were removed to save disk space. >rm alignment_unsorted.bam alignment.sam

Identification of the candidate region
To identify marker positions, initial variant calls for the P. pacificus strains PS1843 (Washington) were generated with the program bcftools (version: 1.7) (Li 2011) .  >samtools mpileup -f genome.fa -l positions.txt mapping.bam > pileup_data.txt Source code 1 is then copied into a text file (count_allele_frequencies.pl) and is used to count the allele frequencies for each position in the pileup data. >perl count_allele_frequencies.pl pileup_data.txt > mapping_data.txt Within the same directory, source code 2 has to be executed within R to generate the plot of allele frequency data.

Identification of candidate mutations
Initial variant calling for the alignments of the nhr-1 mutant line and other samples are generated by bcftools. Here, the additional call of the bcftools view command filters for variants with a quality score > 20. The bcftools index command is needed for the following filtering procedures. Next, variants falling into the candidate region are extracted and the resulting compressed .vcf file is indexed to allow further filtering. The following steps extracts all variant positions that are shared with the genetic background ( nhr-40 mutant line (Kieninger et al. 2016) ) and the corresponding variants are extracted. Filtering the annotated variant file for missense mutations identifies ten candidate mutations (Table 2) including the non-synonymous mutation in the P. pacificus nhr-1 gene (Sieriebriennikov et al. 2020