The Vertebrate Codex Gene Breaking Protein Trap Library For Genomic Discovery and Disease Modeling Applications

The zebrafish is a powerful model to explore the molecular genetics and expression of the vertebrate genome. The gene break transposon (GBT) is a unique insertional mutagen that reports the expression of the tagged member of the proteome while generating Cre-revertible genetic alleles. This 1000+ locus collection represents novel codex expression data from the illuminated mRFP protein trap, with 36% and 87% of the cloned lines showcasing, to our knowledge, the first described expression of these genes at day 2 and day 4 of development, respectively. Analyses of 183 molecularly characterized loci indicate a rich mix of genes involved in diverse cellular processes from cell signaling to DNA repair. The mutagenicity of the GBT cassette is very high as assessed using both forward and reverse genetic approaches. Sampling over 150 lines for visible phenotypes after 5dpf shows a rate of discovery of embryonic phenotypes similar to that of ENU and retroviral mutagenesis. Furthermore, five cloned insertions were in loci with previously described phenotypes; embryos homozygous for each of the corresponding GBT alleles displayed strong loss of function phenotypes comparable to published mutants generated using other mutagenesis strategies (ryr1b, fras1, tnnt2a, edar and hmcn1). Using molecular assessment after positional cloning, to date nearly all alleles cause at least a 99% knockdown of the tagged gene. Interestingly, over 35% of the cloned loci represent 68 mutants in zebrafish orthologs of human disease loci, including nervous, cardiovascular, endocrine, digestive, musculoskeletal, immune and integument systems. The GBT protein trapping system enabled the construction of a comprehensive protein codex including novel expression annotation, identifying new functional roles of the vertebrate genome and generating a diverse collection of potential models of human disease.


Introduction
With the generation of more than 100 sequenced vertebrate genomes (Meadows & Lindblad-Toh, 2017), the current key question is how to determine the role(s) of uncharacterized gene products in specific biological and pathological processes. For example, genes associated with human disease are being discovered at a rapid rate. However, the biological functions underlying this linkage are often unclear (Kettleborough et al., 2013). Model system science using loss of function approaches has been essential to the annotation of the genome to date, including the discovery of novel processes and the biological mechanisms underlying disease (Stoeger, Gerlach, Morimoto, & Nunes Amaral, 2018).

Among vertebrates, Danio rerio (zebrafish) has emerged as an outstanding model organism amenable to both forward and reverse genetic approaches. In addition, the natural transparency of the zebrafish embryo and larvae enables the unprecedented ability to non-invasively collect a rich set of expression data for the proteome in the context of an entire living vertebrate. We describe here a collection of more than 1000 zebrafish lines made using the Protein Trap Gene-Breaking Transposon (GBT; Clark, Balciunas, et al., 2011) to develop such a codex for the comparative vertebrate genomics field (Meadows & Lindblad-Toh, 2017; Clark, Balciunas, et al., 2011).

The initial pGBT-RP 2.1 (RP2.1) vector has several features that efficiently cooperate to report gene sequence, expression and function (Clark, Balciunas, et al., 2011). Two main reporter components include a 5' protein trap and a 3' exon trap, with the entire cassette flanked by inverted terminal repeats (ITR) of the miniTol2 transposon to effectively deliver the transgene as single copy integrations into the zebrafish genome. In cases where RP2 integrates in the sense orientation of a transcription unit, the protein trap's splice acceptor overrides normal splicing of the transcription unit, creating a fusion between endogenous upstream exons and the monomeric RFP (mRFP) reporter sequences. The protein-trap domain in RP2.1 generates the expression profile, including subsequent protein localization and accumulation, when a functional in-frame fusion forms between the start codon-deficient mRFP reporter and the tagged protein. Mutagenesis is accomplished by the strong internal polyadenylation and putative border element, effectively truncating the endogenously tagged protein. The GBT mutagenesis system represented the first step toward a 'codex' of protein expression and functional annotation of the vertebrate genome (Clark, Balciunas, et al., 2011).

We report here the development of a series of GBT protein trap vectors, including versions to trap expression in each of the three potential reading frames. In addition, we modified the 3' exon trap to use a localized BFP rather than the more commonly used GFP so that these lines can be used more effectively in conjunction with other transgenic fish. We deployed these vectors at scale, generating over 1000 protein trap lines with visible mRFP expression at either 2dpf (end of embryogenesis) or 4dpf (larval stage), with 36% and 87% of the cloned lines showcasing, to our knowledge, the first described expression of these genes at these stages, respectively. We used forward and reverse genetic tests to assess the mutagenicity of these vectors, noting rates of visible phenotypes at 5dpf similar to those of ENU and retroviral screening tools. We re-isolated five previously described loci, and embryos homozygous for each of the corresponding GBT alleles displayed strong loss of function phenotypes comparable to these previously published mutants generated using other mutagenesis strategies (ryr1b, fras1, tnnt2a, edar and hmcn1). Molecular assessment after positional cloning shows that nearly all alleles cause at least a 99% knockdown of the tagged gene. Interestingly, over 35% of the cloned loci represent 68 mutants in zebrafish orthologs of human disease loci, including nervous, cardiovascular, endocrine, digestive, musculoskeletal, immune and integument systems. The GBT protein trapping system enabled the construction of a comprehensive protein codex including novel expression annotation, identifying new functional roles of the vertebrate genome and generating a diverse collection of potential models of human disease.

Mayo IACUC approved all protocols involving live vertebrate animals (A23107, A21710 and A34513).

The 234bp SalI to XhoI mini-intron fragment was isolated from pCR4-bactmIntron following digestion. The pGBT-RP7.1 plasmid was digested with XhoI so that the SalI to XhoI fragment was cloned between the gamma-crystallin promoter and nls tagBFP.

pGBT-RP7.1 was made by replacing a 501bp PstI to PstI fragment of pGBT-RP6.1 with a 480bp PstI to PstI fragment of pRP2.1. This changed the nucleotide sequence between the carp beta-actin splice acceptor to replicate the sequences in pGBT-RP2.1. pGBT-RP7.1 was never directly tested in zebrafish.

pGBT-RP5.1 was made by cloning a PCR product with the AUG-less mRFP into pre(-1)GBT-RP5.1.

The 698bp mRFP* PCR product was obtained by amplification of pGBT-R15 (Clark, Balciunas, et al., . Prior to cloning, the mRFP* PCR product was digested with EcoRI and SpeI to prepare the ends for subcloning into pre(-1)GBT-RP5.1 that was opened between the carp beta actin splice acceptor and the ocean pout terminator.

pre(-1)GBT-RP5.1 was made by cloning a 1.2kb SpeI to AvrII fragment from pGBT-PX (Sivasubbu et al., 2006) that contained the ocean pout terminator into the SpeI site of pre(-2)GBT-RP5.1. The resulting products were screened for the proper orientation of the ocean pout terminator relative to the carp beta actin splice acceptor.

pre(-2)GBT-RP5.1 was made by inserting an expression cassette to make a 3' poly(A) trap that makes blue lenses. A 1.15kb SpeI to BglII fragment from pKTol2gC-nlsTagBFP was cloned into pre(-3)GBT-RP5.1 that had been cut with AvrII and BglII. This moved the Xenopus gamma crystallin promoter

pUC57-I-SceI_LoxP_Splice contains a synthetic sequence (see below) cloned into pUC57 (Genscript). The scaffold contains an I-SceI site; loxP site; carp beta actin splice acceptor; cloning sites for mRFP, ocean pout terminator, and BFP lens cassettes; carp beta actin splice donor; loxP site; and an I-SceI site.

Fluorescent microscopy of mRFP reporter protein expression
Larvae were treated with 0.2 mM phenylthiocarbamide at 1 dpf to inhibit pigment formation. For Lightsheet microscopy, an LP 560 nm filter was used for excitation and an LP 585 nm filter for emission. The sagittal-, dorsal-, and ventral-oriented z-stacks of the mRFP expression were captured at 50x magnification using either an ApoTome microscope (Zeiss) with a 5x/0.25 NA dry objective (Zeiss) or a Lightsheet Z.1 microscope (Zeiss) with a 5x/0.16 NA dry objective. Each set of images was obtained from the same larva, and the images shown are composites of the maximum image projections of the z-stacks obtained from each direction.

Genomic DNA isolation
Genomic DNA was isolated from F1 fish tail biopsies to conduct next generation sequencing and from both WT and heterozygous larvae to manually perform the PCR-based mRFP linkage analysis.

Forward genetic screening with next-generation sequencing
Isolated genomic DNA (300-500ng) was digested with MseI and BfaI in parallel for 3 h at 37°C and heat inactivated for 10 min at 80°C. The digested samples from each enzyme were pooled with prealiquoted barcoded linker in individual wells. T4 DNA ligase (New England Biolabs, Inc.) was added, and the reaction mix was incubated for 2 h at 16°C.
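The parallel digestion step can be pictured with a small in silico sketch. It is illustrative only and is not the pipeline used for this work: the toy sequence, the placeholder ITR, and the helper names `cut_positions`, `digest` and `junction_fragment` are all assumptions made for the example. It relies only on the known recognition sites of the two enzymes (MseI cuts T^TAA, BfaI cuts C^TAG) and shows that each enzyme leaves a genomic flank of different length next to the transposon end.

```python
import re

# Recognition site and cut offset within the site.
# MseI cuts T^TAA and BfaI cuts C^TAG; both leave a 5'-TA overhang.
ENZYMES = {"MseI": ("TTAA", 1), "BfaI": ("CTAG", 1)}


def cut_positions(seq, enzyme):
    """Return the 0-based indices at which the enzyme cuts the top strand."""
    site, offset = ENZYMES[enzyme]
    # A lookahead pattern reports overlapping sites as well.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]


def digest(seq, enzyme):
    """Split seq into the fragments produced by a complete digest."""
    bounds = [0] + cut_positions(seq, enzyme) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def junction_fragment(seq, itr_end, enzyme):
    """Genomic flank between the end of the ITR and the first downstream cut."""
    for pos in cut_positions(seq, enzyme):
        if pos >= itr_end:
            return seq[itr_end:pos]
    return None  # no site downstream: this enzyme gives no usable junction


if __name__ == "__main__":
    itr = "GGGGCCCC"  # hypothetical stand-in for the 3'-ITR end, not the real sequence
    flank = "ACGTACGTCTAGGATCCGATTTAACGGA"  # made-up genomic flank
    toy = itr + flank
    for name in ENZYMES:
        frag = junction_fragment(toy, len(itr), name)
        print(name, len(frag), frag)
```

In this toy case the BfaI junction fragment is 9 bp and the MseI fragment is 21 bp, so an insertion whose flank is too short or too long with one enzyme may still be recovered with the other.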
The linker-mediated PCR was performed in two steps. In the first step, PCR was done with one primer specific to the 3'-ITR (5'-

Annotating human orthologues of GBT-tagged genes and disease-causing genes
The human orthologues of 192 cloned zebrafish genes were mainly collected by using a data mining tool, (https://genomevolution.org/CoGe/SynFind.pl). If a candidate was hit multiple times in those manual assessments, it was annotated as a human orthologue. The human phenotype data caused by mutations of 68 human orthologues were collected by using another data mining tool, BioMart (http://useast.ensembl.org/biomart/martview/cfe15ead83199a0b7c7997f5a4ce9e6b), supported by the Ensembl database.

The cloned genes with unpublished expression data were isolated by using "Gene Expression" tool of

The features of GBT constructs RP2 and RP8: capturing all three proteomic reading frames
In our previous study, we reported the intronic-based gene-breaking transposons (GBTs) as effective and

RP2.1 was designed to use one main reading frame, and some lines with expression were noted to include the use of a secondary splice acceptor (data not shown). To maximize genome coverage of this insertional mutagen, we created all three reading frames of the RP2 and RP8 vector series (Fig. 1).

4) 3' exon trap. These vectors also encode a 3' exon trap with preferential expression following intragenic

The function of this cassette complements the obligate, in-frame protein trapping effect and is used for both quality control during mutagenesis and for genotyping of more weakly expressing protein trap alleles (Fig. 1A). In the RP2 vector series, the nearly ubiquitous b-actin promoter drives expression of GFP. Expression of integrated GFP becomes detectable between early developmental stages such as

RP8 vector series includes all reading frames for the AUG-free mRFP reporter and a new 3' exon trap cassette with expression of tagBFP driven by the lens-specific gamma-crystallin promoter (Fig. 1B). Using the tissue-specific reporter system with BFP is helpful to easily detect F1 founder with weak

We generated more than eleven hundred independent lines by using all six constructs of the GBT system (Supplemental Table 1). We conducted an initial screen for expression of the mRFP fusion protein and showed that the RP2 and RP8 vector series, with all reading frames of the mRFP reporter protein, readily detect the distribution of the fusion proteins expressed from their own promoter in zebrafish (Supplemental Figure 1).

Annotation of protein localization and trafficking of the GBT strain collection
The ability to non-invasively obtain temporal and spatial expression pattern information is a key feature of these protein trap strains. Pilot data from our first RFP lines rapidly demonstrated this step was going to be a major bottleneck for our pipeline if we used standard documentation methods. Consequently, we

Table 1), and updated results are posted at zfishbook.

Our throughput for cloning the GBT lines using traditional molecular methods was clearly an initial bottleneck. To help address this, we deployed a rapid cloning process based on methods used to isolate

Table 1. Imaging the localization of transcripts and proteins at 4dpf is more difficult than at 2 dpf, because the accessibility of antisense RNA probes and antibodies into the larva's body is technically limited in both in situ hybridization and immunohistochemistry. Compared with the published gene expression data in ZFIN, zfishbook currently provides almost double the number of genes with expression data at 2 dpf and 14 times the number of genes with expression data at 4 dpf (Table 1). In addition, zfishbook provides novel expression data for 61 genes at any developmental stage (Fig. 2).

High knockdown efficiency of endogenous transcripts induced by RP2
We directly compared published transposon insertional mutant vector systems (Fig. 3). The range and

The pFT1 appears to be an improvement over these systems, in which the overall range and average

2016) and this manuscript) maintains a strong knockdown (1% or less read-through) in 26 lines tested. Though deployed here using a nearly random, transposon-based delivery platform, the GBT vector system is an effective insertional mutagen suitable for an array of otherincluding targeted integration -

Screening through 5 dpf
We conducted an initial forward genetic screen on embryos and early larvae of 179 RFP-positive GBT lines, identifying 12 recessive phenotypes, such as ryr1b, fras1, tnnt2a, edar and hmcn1, (

GBT alleles phenocopy known embryonic mutations
We tested the first five GBT lines in genes with known loss of function mutant phenotypes (ryr1b;

Gene ontology analysis of GBT-tagged loci
To assess the diversity of GBT loci molecularly characterized to date, we utilized the PANTHER

were tagged in the PANTHER Protein Class ontology. 168 of our cloned GBT alleles mapped in the PANTHER system with 21 types of Protein Classes (Table 2). 18% and 16% of the mapped GBT alleles are classified as nucleic acid binding (PC00171) and transcription factor (PC00218), respectively (Fig. 4). This result reveals that a quarter of the mapped genes possibly play a role in regulatory processes. Overall, however, the rich diversity of protein classes observed in our cloned traps suggests a large diversity will be represented by the overall collection and is consistent with the random nature of

(Table 3). Several human orthologues were provisionally annotated using BLASTP and a synteny analysis tool, SynFind. In a previous study comparing the list of human genes possessing at least one zebrafish orthologue with the 3,176 genes bearing morbidity descriptions that are listed in the OMIM database, 82% of morbid genes (2,601 genes) can be related to at least one zebrafish orthologue (Howe et al., 2013). Surprisingly, 67 genes (about 37%) of 183 annotated human orthologues are associated with human diseases involving multiple organ systems, including nervous, circulatory, endocrine, metabolic, digestive, musculoskeletal, immune, and integument systems (Fig. 5 and Table 3), and many are not established in rodents and zebrafish (Table 5).
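The proportions quoted in this section can be re-derived from the counts given in the text. The short sketch below does only that arithmetic; the `percent` helper is a name introduced for this example, and the allele counts derived from the PANTHER percentages are approximations rather than values reported in the tables.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# (count, total, percentage stated in the text)
reported = {
    "disease-associated human orthologues": (67, 183, 37),
    "OMIM morbid genes with a zebrafish orthologue": (2601, 3176, 82),
}

for label, (part, whole, stated) in reported.items():
    print(f"{label}: {percent(part, whole):.1f}% (stated: about {stated}%)")

# Approximate allele counts implied by the PANTHER percentages (168 mapped alleles).
# These are back-calculated estimates, not figures taken from Table 2.
mapped = 168
for label, share in {"nucleic acid binding": 18, "transcription factor": 16}.items():
    print(f"{label}: about {round(mapped * share / 100)} of {mapped} alleles")
```

Both stated percentages agree with the counts after rounding to the nearest whole number.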
The GBT protein-trap system provides a variety of potential human disease models, each with a revertible allele that can interchange between disease and healthy cellular, organ and physiological states.

However, in each case we were able to confirm transcription at the locus in wild-type animals, yielding new annotation for these loci in the zebrafish genome.

We know of two major potential biases that may yield non-random trapping coverage of the genome. First, the RP2.1 protein trap was initially designed around a single reading frame. Upon molecular analysis of our first lines, however, we discovered that RP2.1 encodes a second, alternative splice acceptor yielding protein trap expression from a second reading frame due to this alternative splicing event in a significant

The GBT system described here is a two-component, molecularly regulatable mutagenesis approach that offers the ability to test for the sufficiency of protein-encoding loci in regulated, tissue- and cell-specific applications.

Functional Diversity of the Trapped Proteins by the GBT system
To analyze the distribution of protein functions among the genes trapped by the GBT system, we performed GO analysis using the PANTHER protein classification. Although protein functions related to transcriptional regulatory processes, such as nucleic acid binding and transcription factors, represented one relatively common class of isolated genes, the protein GO analysis indicated that the GBT protein trap was a useful tool for capturing a wide range of protein functions in addition to cell fate regulators and related nuclear genes.

Table 2).

The completion of the zebrafish reference genome sequence has enabled many new discoveries, in particular the positional cloning of hundreds of genes from mutations affecting embryogenesis, behavior, physiology, and health and disease. However, a few poorly assembled regions remain (Howe et al., 2013). In molecular cloning of the GBT lines generated, we found that a surprising proportion of the sequenced insertions do not correspond to any predicted genes. Although we have not formally excluded that mRFP expression might, in some cases, be an artifact, the gene prediction data provided in genome databases reveal some prediction errors. These results suggest that the algorithms used to predict genes from genome databases have missed a significant number of genes. Protein trapping with the GBT system may be useful in identifying unsuspected novel genes, expression patterns and functions in vivo in real time.

Table 2.