New Results

Improved genome inference in the MHC using a population reference graph

Alexander Dilthey, Charles Cox, Zamin Iqbal, Matthew R. Nelson, Gil McVean
doi: https://doi.org/10.1101/006973

Abstract

In humans and many other species, much is known about the extent and structure of genetic variation, yet this information is typically not used when assembling novel genomes. Instead, reads are mapped against a single reference, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences with catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants. We apply the method to the 4.5 Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project. Using simulations, SNP genotyping, short-read and long-read data, we demonstrate how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
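The idea of reconstructing a sample as a path that may recombine between known haplotypes can be illustrated with a toy example. The sketch below is not the authors' model: it assumes the haplotypes are already aligned column by column, scores each column by a simple per-base match or mismatch instead of read or kMer evidence, and runs a standard Viterbi recursion. The function name `viterbi_mosaic` and the parameters `r` (switch probability per column) and `eps` (mismatch probability) are hypothetical choices made for illustration only.

```python
import math


def viterbi_mosaic(haplotypes, observed, r=0.01, eps=0.01):
    """Return the most likely haplotype index at each aligned column.

    Toy assumptions (not taken from the paper): all sequences have equal
    length, the hidden state is "which haplotype is being copied", the
    copy switches haplotype with probability r per column, and an
    observed base differs from the copied base with probability eps.
    """
    n = len(haplotypes)
    length = len(observed)
    if n < 2 or any(len(h) != length for h in haplotypes):
        raise ValueError("need >= 2 haplotypes, all as long as `observed`")

    log_stay = math.log(1.0 - r)
    log_switch = math.log(r / (n - 1))  # switch mass shared by the others
    log_match = math.log(1.0 - eps)
    log_mismatch = math.log(eps)

    def emit(h, i):
        # Log-probability of the observed base given haplotype h at column i.
        return log_match if haplotypes[h][i] == observed[i] else log_mismatch

    # Initialise with a uniform prior over haplotypes.
    score = [math.log(1.0 / n) + emit(h, 0) for h in range(n)]
    back = []  # back[i - 1][h] = best predecessor of state h at column i

    for i in range(1, length):
        new_score, pointers = [], []
        for h in range(n):
            # Compare staying on h with switching in from every other state.
            best_prev, best_val = h, score[h] + log_stay
            for g in range(n):
                if g != h and score[g] + log_switch > best_val:
                    best_prev, best_val = g, score[g] + log_switch
            new_score.append(best_val + emit(h, i))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


panel = ["AAAAAAAA", "CCCCCCCC"]
sample = "AAAACCCC"
path = viterbi_mosaic(panel, sample)
inferred = "".join(panel[h][i] for i, h in enumerate(path))
```

For this two-haplotype panel, the recovered path copies the first haplotype for four columns and the second for the remaining four, so `inferred` equals `sample`. A graph-based method generalises this picture by letting states be edges of a variation graph rather than whole aligned haplotypes.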

  • Abbreviations:
    A: Alignment
    Q=(q1,…,qNQ): Query sequence of NQ characters
    Aall(Q, G): Set of alignments between Q and G
    PRG: Population Reference Graph
    COV: Catalogue of Variation
    G: The specific PRG
    V: Set of vertices
    E: Set of edges
    Pn(e): Edge probability distribution at node n
    Psub: Set of all subpaths
    Ptraversal: Set of all subpaths, constrained to complete traversals
    Vm, vn: Two vertices
    e: One edge
    l(v): The level of vertex v
    L: Scaffold haplotype MSA length; last level of haplotype graph
    H(v): The set of scaffold haplotypes attached to v
    K(v): The set of kMer-edges attached to v
    cv: Current vertex
    r: “Recombination” parameter
    SN: Number of scaffold haplotypes
    Sn,i: i-th position (MSA) of haplotype n
    Oi: Set of kMers output from level i
    o(kMer): Sample count of the kMer
    x: Generic variable
    X: Additional variant specifiers
    suffix(v, r): Suffix function for vertex v of length r
    Q: Alignment query sequence
    NQ: Length of Q
    qi: Index for Q
    Q’: Aligned query sequence
    E’: Aligned edge sequence
    AL: Alignment length
    M: Alignment scoring matrix
    Zl: Number of nodes at level l
    node(l, z): Retrieve node z at level l
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
    Posted July 08, 2014.
    Subject Area

    • Genomics