bioRxiv
New Results

A family-based phasing algorithm for sequence data

Mara Battagin, Serap Gonen, Roger Ros-Freixedes, Andrew Whalen, Gregor Gorjanc, John M Hickey
doi: https://doi.org/10.1101/504480
Affiliation (all authors): The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
For correspondence: mara.battagin@roslin.ed.ac.uk, serap.gonen@roslin.ed.ac.uk, roger.ros@roslin.ed.ac.uk, awhalen@roslin.ed.ac.uk, gregor.gorjanc@roslin.ed.ac.uk, john.hickey@roslin.ed.ac.uk

Abstract

This paper describes a family-based phasing algorithm for variable-coverage sequence data that first minimises phasing errors and then maximises the proportion of alleles phased. This algorithm is one of the essential tools underpinning an overall strategy for generating highly accurate sequence data on whole populations at low cost.

The algorithm is called AlphaFamSeq. It uses sequence data on the focal individual and at least two generations of ancestors to phase alleles. In the first step, AlphaFamSeq calculates allele probabilities using iterative peeling. In subsequent steps, the alleles are phased using heuristics that draw on the sequence data of parents, grandparents and progeny and, if available, of other families in the pedigree. AlphaFamSeq was tested on a range of simulated data sets.
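As a rough illustration of the two ideas in this paragraph (calling an allele only when its probability is high enough, and using relatives to decide which parent an allele came from), the following minimal Python sketch may help. It is not the AlphaFamSeq implementation: the function names, the missing-allele code 9, the default threshold of 0.98 and the homozygous-parent rule are all assumptions made for illustration, and iterative peeling itself is not shown.

```python
from typing import List, Tuple

MISSING = 9  # hypothetical code for an allele left unphased


def call_allele(prob_alt: float, threshold: float = 0.98) -> int:
    """Call an allele only when its probability clears the threshold.

    prob_alt is the probability that the allele is the alternative (1) allele.
    The default threshold is an illustrative value, not one from the paper.
    """
    if prob_alt >= threshold:
        return 1
    if prob_alt <= 1.0 - threshold:
        return 0
    return MISSING  # too uncertain: leave unphased rather than risk an error


def transmitted_allele(parent_genotype: int) -> int:
    """Allele a parent must transmit, given its count of alternative alleles.

    Only homozygous parents (0 or 2 copies) give an unambiguous answer.
    """
    if parent_genotype == 0:
        return 0
    if parent_genotype == 2:
        return 1
    return MISSING  # heterozygous or unknown parent: ambiguous


def phase_offspring(sire: List[int], dam: List[int]) -> Tuple[List[int], List[int]]:
    """Return (paternal, maternal) haplotypes implied by homozygous parents."""
    paternal = [transmitted_allele(g) for g in sire]
    maternal = [transmitted_allele(g) for g in dam]
    return paternal, maternal


if __name__ == "__main__":
    print(call_allele(0.995), call_allele(0.60))
    print(phase_offspring([0, 2, 1], [2, 1, 0]))
```

Raising the threshold in a sketch like this leaves more alleles unphased but makes wrong calls less likely, which mirrors the stated priority of minimising phasing errors before maximising the proportion of alleles phased.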

AlphaFamSeq gives low phasing error rates and, if there is sufficient sequence information and haplotype sharing amongst individuals, it can give a high yield of correctly phased alleles.

The allele threshold had a large effect on performance, whereas window size had only a small effect. When all individuals in a single family were sequenced at different coverages, the percentage of correctly phased alleles reached 90% of the possible maximum (98.9%) at ~1/6 of the maximum aggregate coverage. Adding sequence information from other related individuals increased the percentage of correctly phased alleles. Imputation performance was high across all allele frequencies (average correlation by marker of 0.94), except for a slight decrease at very low frequencies (MAF ≤ 0.01).
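The summary statistics named in this paragraph (percentage of correctly phased alleles, and average correlation by marker) could be computed along the following lines. This is a sketch under stated assumptions rather than the authors' evaluation code: the function names, the missing code 9, the choice of all alleles as the denominator and the skipping of monomorphic markers are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence

MISSING = 9  # hypothetical code for an unphased allele


def phasing_summary(true_alleles: Sequence[int], phased: Sequence[int]) -> Dict[str, float]:
    """Percentages of correct, wrong and unphased alleles, relative to all alleles."""
    total = len(true_alleles)
    called = [(t, p) for t, p in zip(true_alleles, phased) if p != MISSING]
    correct = sum(1 for t, p in called if t == p)
    wrong = len(called) - correct
    return {
        "pct_correct": 100.0 * correct / total,
        "pct_wrong": 100.0 * wrong / total,
        "pct_unphased": 100.0 * (total - len(called)) / total,
    }


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def mean_marker_correlation(true_geno: List[List[float]],
                            imputed: List[List[float]]) -> float:
    """Average of per-marker correlations (rows: individuals, columns: markers)."""
    correlations = []
    for m in range(len(true_geno[0])):
        r = pearson([row[m] for row in true_geno], [row[m] for row in imputed])
        if r == r:  # NaN is never equal to itself, so this skips undefined values
            correlations.append(r)
    if not correlations:
        return float("nan")
    return sum(correlations) / len(correlations)
```

Computing the correlation marker by marker, rather than pooling all genotypes, keeps rare variants from being swamped by common ones, which is relevant when reporting performance at low minor allele frequencies.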

Within an overall strategy for generating highly accurate sequence data on whole populations at low cost, the role of AlphaFamSeq is to provide very accurately phased haplotypes for focal individuals, that is, individuals whose haplotypes are very common in the population.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted December 21, 2018.

Subject Area

  • Genomics