RT Journal Article
SR Electronic
T1 Easy and Accurate Reconstruction of Whole HIV Genomes from Short-Read Sequence Data
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 092916
DO 10.1101/092916
A1 Chris Wymant
A1 François Blanquart
A1 Astrid Gall
A1 Margreet Bakker
A1 Daniela Bezemer
A1 Nicholas J. Croucher
A1 Tanya Golubchik
A1 Matthew Hall
A1 Mariska Hillebregt
A1 Swee Hoe Ong
A1 Jan Albert
A1 Norbert Bannert
A1 Jacques Fellay
A1 Katrien Fransen
A1 Annabelle Gourlay
A1 M. Kate Grabowski
A1 Barbara Gunsenheimer-Bartmeyer
A1 Huldrych F. Günthard
A1 Pia Kivelä
A1 Roger Kouyos
A1 Oliver Laeyendecker
A1 Kirsi Liitsola
A1 Laurence Meyer
A1 Kholoud Porter
A1 Matti Ristola
A1 Ard van Sighem
A1 Guido Vanham
A1 Ben Berkhout
A1 Marion Cornelissen
A1 Paul Kellam
A1 Peter Reiss
A1 Christophe Fraser
A1 The BEEHIVE Collaboration
YR 2016
UL http://biorxiv.org/content/early/2016/12/13/092916.abstract
AB Next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of rapid between- and within-host evolution may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by effectively aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to preprocess reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We use shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read data produced with the Illumina platform, for 65 existing publicly available samples and 50 new samples. We show the systematic superiority of mapping to shiver’s constructed reference over mapping the same reads to the standard reference HXB2: an average of 29 bases per sample are called differently, of which 98.5% are supported by higher coverage. We also provide a practical guide to working with imperfect contigs.