Building Genomic Analysis Pipelines in a Hackathon Setting with Bioinformatician Teams: DNA-seq, Epigenomics, Metagenomics and RNA-seq

We assembled teams of genomics professionals to assess whether we could rapidly develop pipelines to answer biological questions commonly asked by biologists and others new to bioinformatics by facilitating analysis of high-throughput sequencing data. In January 2015, teams were assembled on the National Institutes of Health (NIH) campus to address questions in the DNA-seq, epigenomics, metagenomics and RNA-seq subfields of genomics. The only two rules for this hackathon were that either the data used were housed at the National Center for Biotechnology Information (NCBI) or would be submitted there by a participant in the next six months, and that all software going into the pipeline was open-source or open-use. Questions proposed by organizers, as well as suggested tools and approaches, were distributed to participants a few days before the event and were refined during the event. Pipelines were published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development (https://github.com/features/). The code was published at https://github.com/DCGenomics/ with separate repositories for each team, starting with hackathon_v001.


Genomic analysis leverages large datasets generated by sequencing technologies in order to gain better understanding of the genomes of humans and other species. Given its reliance on large datasets with complex interactions and its fairly regularized metadata, genomic analysis is an exemplar of "big data" science (1). Genomic analysis has shown great promise in finding actionable variants for rare diseases (2), as well as directing more specific clinical action for common diseases (3,4).

Due to its potential for significant clinical and basic science discoveries, genomics has drawn many newcomers from the biological and computational sciences, as well as investigators from new graduate programs in bioinformatics. While many of these investigators can run established pipelines on local, public, or combined datasets, most do not have the expertise or resources to establish and validate novel pipelines. Additionally, highly experienced genomic investigators often lack the resources necessary to generate and distribute pipelines with broader applicability outside their specific area of research. In this study, we aimed to assess whether we could close this gap by bringing genomics experts from around the world together to establish public pipelines that can be both used by newcomers to genomics and refined by other seasoned professionals.
(12). To further increase our confidence in the called variants, we added the MuTect algorithm (13) to those used by Cake. Xu et al. have compared somatic variation calling algorithms (14).

A first level of filtering is provided in the pipeline for the resulting VCF files containing called somatic variants. This was accomplished using the Cake filtering module. We kept most of the default parameters, and those that were changed are explained in Table 1. Once the VCF files were filtered, they were annotated using the ANNOVAR software, downloaded on January 5th, 2015 (32), with five different databases (RefGene, KnownGene, ClinVar, Cosmid and Cosmic), and the top 1% most deleterious CADD scores.

Table 1. Cake filtering parameters changed from their default values.

Parameter              Value    Reason
NORMAL_MIN_DEPTH       5        As the coverage of these samples was too low, we set the threshold to a less stringent value.
EMPTY_CONSEQ_FILTER    FALSE    VEP annotations were not used.

To compare the predicted somatic variations between two matched tumor-control samples, we created a module in our pipeline that used VCFtools to find shared and unique variations (33). Lastly, we combined these modules into a single pipeline using Bpipe (34), as well as a Unix Bash script. The full list of software used by the DNA-seq Team is included in Table 2.
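The shared-versus-unique comparison described above can be illustrated with a minimal Python sketch. This is not the team's VCFtools-based module; it is a simplified stand-in that assumes variants can be matched on chromosome, position, reference allele and alternate allele. The function names and the toy records are hypothetical.

```python
def read_variants(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF text lines."""
    variants = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per alternate allele
            variants.add((chrom, pos, ref, alt))
    return variants


def compare_variants(pair_a, pair_b):
    """Split two variant sets into shared, unique-to-A and unique-to-B."""
    return {
        "shared": pair_a & pair_b,
        "unique_a": pair_a - pair_b,
        "unique_b": pair_b - pair_a,
    }


# Toy records (hypothetical) standing in for two filtered tumor-control VCF files.
PAIR_A = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr2\t200\t.\tC\tT,G\t60\tPASS\t.",
]
PAIR_B = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t45\tPASS\t.",
    "chr2\t200\t.\tC\tT\t70\tPASS\t.",
    "chr3\t300\t.\tG\tA\t30\tPASS\t.",
]

result = compare_variants(read_variants(PAIR_A), read_variants(PAIR_B))
for name in ("shared", "unique_a", "unique_b"):
    print(name, sorted(result[name]))
```

Matching on exact allele strings is a deliberate simplification; variants that are represented differently by different callers (for example, unnormalized indels) would be counted as unique under this scheme.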

Epigenomics Team
We gathered data with the intention of modeling transcription (mRNA-seq) based on DNA methylation (RRBS or Bisulfite-seq) and histone states (ChIP-seq). To simplify analysis, we focused on marks associated with enhancers and their regulatory status: H3K27ac, H3K4me1, and H3K4me3. Ultimately, we required that included tissues have matching H3K27ac, RNA-seq and DNA methylation data for preliminary modeling. Data files were drawn from human cell lines and tissues in the NIH Epigenomic Roadmap that fit our criteria (outlined in Table 3), from the site's FTP mirror (36) using the rsync command with the -av option. All files were already aligned, and were required to have been generated using the hg19 reference genome. Several files were identified that appeared to be re-aligned or uncorrected versions of other downloaded files, and were removed from the analysis. Tissues acceptable for analysis were identified by using the data matrix view on the Roadmap site, as well as searching for non-partial datasets with the Data Grid view of the International Human Epigenome Consortium (IHEC) Data Portal (37). Samples were initially downloaded from ENCODE: Encyclopedia of DNA Elements (38), but were not included in initial analysis for reasons of uniformity and time.

1. Identify and quantify HERV sequences in assembled reads using blastn,
2. Identify and quantify HERV sequences in non-assembled reads from a human genome using search tools from NCBI's SRA Toolkit,
3. Identify and quantify HERV sequences in non-assembled reads from a human genome using a standard blastn search of reads in FASTA format.

Pipelines 4-6 repeat pipelines 1-3 to identify all viral sequences within a sample from a human bacterial microbiome.
For pipeline 1, whole genome sequence raw reads of human CEU NA12878 (39) were obtained from NCBI's SRA database and converted from SRA to FASTQ file format using the fastq-dump command provided by the SRA Toolkit with default filter settings. The resulting FASTQ file was moved to an Amazon Elastic Compute Cloud (EC2) node and assembled into contigs with the ABySS assembler (40, 41). Contigs were used as queries for blastn against a database consisting of all Retroviridae RefSeq genomes that was constructed using the makeblastdb command. For pipeline 2, the SRA files containing raw reads for NA12878 were used directly as a database for a SRA blast (blastn_vdb) using the Retroviridae RefSeq genomes as query. For pipeline 3, the FASTA file of non-assembled reads from pipeline 1 was used as a query for a blastn search against the database of Retroviridae RefSeq genomes from pipeline 1.

For pipelines 4-6, we used samples from the Human Microbiome Project (HMP) that had passed preliminary quality checks (42). The plan for pipeline 4 was the same as for pipeline 1, except for using SOAPdenovo2 for the assembly of the microbiome FASTQ reads. For pipeline 5, rather than start with SRA files, raw Illumina WGS reads in FASTQ format first were converted to SRA format using the FASTQ loader tool latf-load within the SRA Toolkit. Pipeline 6 follows pipeline 3, but uses the microbiome FASTA as its query and the total viral RefSeq genome as its database. Full details about the data and tools we used are located in

normal samples GSM1228202-GSM1228219) were also determined to be suitable. Additional tools used by the RNA-seq Team are listed in Table 6.

However, one of the main hurdles we encountered was finding an appropriate dataset that could be used to test and help create our pipeline.
Initially, we found a neuroblastoma dataset submitted to SRA that included both the exome and RNA-seq data for matched samples. However, the dataset was unusable because of corrupted files. The majority of our time at the hackathon was spent on finding an appropriate dataset and debugging issues with the publicly available files. This hindered our efforts to create a fully functional pipeline to achieve our goals. Due to a lack of a working dataset of both DNA-seq and RNA-seq from the same individuals, we were unable to write a module within our pipeline to find eQTLs. However, we were able to design a pipeline that would find somatic mutations for given matched samples using five different calling algorithms, filter and annotate mutations, and find shared and unique mutations between two matched sample pairs (see Methods for more details).

Epigenomics Team
We initially considered different scenarios in which a lab might utilize our proposed pipeline based upon their available datasets and how investigators might want to model their datasets. For example, investigators might want to find a relationship between DNA methylation levels and histone enrichment. Given the variation in epigenetic data available as part of publicly available datasets, we recognized the need for flexibility about which data components would be required to model different epigenetic relationships. Time limitations prevented us from generating a workflow for every potential scenario (for example, RNA-seq and ChIP-seq, or ChIP-seq and DNA methylation but no RNA-seq). Instead, we considered common questions that might interest a general epigenetics laboratory. Investigators with epigenetic data often want to understand how this data correlates to gene expression. Thus, we decided to focus our model on elucidating relationships between a given variable collection of epigenetic data and gene expression. A lab can use publicly available datasets to create a model with which to test their own epigenetic data.

Metagenomics Team
We considered several different options for metagenomics searches before settling on the final six pipelines. We spent significant time discussing the requirements for filtering or other QC steps on raw reads prior to running assembly (pipelines 1 and 4) or SRA blast steps. We also debated the relative merits of assembly for each task, eventually deciding to compare searches with and without assembly in different pipelines, which significantly added to our workload and may have contributed to the fact that only two pipelines were completed during the hackathon. With regard to assembly, our group considered using the established MG-RAST pipeline for a "brute force" blastn into raw reads (47), as well as using metAMOS (48), a comprehensive pipeline for metagenomics analysis, for comparison. Eventually we decided to forgo established pipelines, due to computational requirements within the timeframe of the hackathon and the desire to focus on developing novel workflows of our own.

For pipelines 1-3 we initially planned to use human genomic information from 1000 Genomes (24) but found the format (base genomic sequence with list of variants) more difficult to handle for our purposes than the human CEU NA12878 data that we eventually used. Likewise, for the microbiome task, we initially planned to apply our pipelines to several microbiome sample types, but eventually decided to focus on a skin microbiome, since skin samples tend to contain abundant viruses and multiple datasets may be available due to ease of sampling.
Our original goals were very ambitious, especially the task of determining RNA editing without DNA controls. We were interested in looking at the variants in paired cancer samples, and spent a fair amount of time finding an appropriate dataset. We decided to align the samples using HISAT and determine the types and counts of variants (particularly A-I transitions) that may suggest RNA editing. We also wanted to determine variants in genes and then possibly correlate the genes to Gene Ontology.

We discussed a variety of aligners, such as STAR (49) and HISAT (50), which are very fast. Information about the recently released HISAT program was shared via the Google Group prior to the start of the hackathon so everyone had a chance to review this program. We selected HISAT because of its speed, a decision partly driven by the time constraints of the hackathon. We encountered some technical difficulties processing the samples in HISAT, so we ran the dataset with the 3 pairs of tumor/normal using TopHat and Bowtie so that we had some results for downstream processing, while another team member continued to develop the HISAT portion of the pipeline.

We selected Bambino as our variant caller based on team members' past successes with using this tool. A collection of Python and Perl scripts was written to filter out unmapped and low-quality reads and, more importantly, multi-mapped reads that did not map to a unique locus within the genome, since Bambino would call each of these multiple alignments separately. The BAMs also had to be sorted by chromosome to prepare input for Bambino. Bambino was then used to generate a variant call table and a Perl script was created to filter for coverage on both strands and give a sparser table for downstream analysis. Another program counted the variant info. We then applied the Fisher's exact test to the data using R.
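The read-filtering step described above (dropping unmapped, low-quality and multi-mapped reads before variant calling) can be sketched in Python as follows. This is not the team's script. It assumes uncompressed SAM text, a hypothetical mapping-quality cutoff of 20, and an aligner that reports the optional NH tag (number of reported alignments); where NH is absent, only the FLAG and MAPQ checks apply.

```python
def keep_read(sam_line, min_mapq=20):
    """Return True for a mapped, confidently and uniquely aligned SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if flag & 0x4:       # FLAG bit 0x4: read is unmapped
        return False
    if flag & 0x100:     # FLAG bit 0x100: secondary alignment
        return False
    if mapq < min_mapq:  # assumed threshold for "low quality"
        return False
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:") and int(tag[5:]) > 1:
            return False     # read aligned to more than one locus
    return True


def filter_sam(lines, min_mapq=20):
    """Yield header lines unchanged and only the alignment lines that pass."""
    for line in lines:
        if line.startswith("@") or keep_read(line, min_mapq):
            yield line


# Toy SAM records (hypothetical) covering each rejection reason.
SAM = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",  # kept
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",                 # unmapped
    "r3\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",   # low MAPQ
    "r4\t0\tchr2\t300\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",  # multi-mapped
]

kept = list(filter_sam(SAM))
print("\n".join(kept))
```

Filtering before variant calling keeps each read from contributing to more than one locus, which is the concern raised above about multiple alignments being called separately.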
We discussed creating a command line to get gene ontologies of the set of genes, but were concerned about users keeping up-to-date versions of the GO database. Gene ontology could be determined by using web sites such as PANTHER (51) or DAVID (52, 53).

DNA-seq Team
We found the SRA website challenging to use for locating data, and the quality of available datasets was inconsistent. Although many data sources validate user-submitted files, a number of files had been improperly validated and thus were not usable. Some files were corrupted and could not be used, such as those in BioProject PRJNA76777 (54). Our pipeline needed datasets that had a paired tumor-normal sample from the same patient. For some datasets, paired samples were not available, and other datasets were marked as being paired, when in fact they were not, such as BioProject PRJNA217947 (55). In addition, we observed that multiple datasets in the SRA database were missing the header information required to create BAM files used by downstream analysis tools, such as BioProject PRJDB1903 (56). In other cases, SRA data were found to be malformed, and caused certain tools to crash. Specifically, files from BioProject PRJNA268172 (27) contained reads with differing length sequence and quality scores (e.g. 34 bases of sequences, 70 bases of quality information). Files with such mismatches cannot be used in SAMtools to convert to BAM files, as a difference in these field lengths is inconsistent with the SAM format specification (57).

We also encountered problems with upstream bioinformatics code quality, such as poor or incorrect documentation. The tools we employed had a variety of installation methods, and few were available for easy installation through a package manager. For example, core software, such as R version 3, was not available as a package from the operating system vendor. Installing from a third-party repository is not complex, but may be daunting to someone inexperienced in systems administration.
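The sequence/quality length mismatch described above is straightforward to screen for before alignment or format conversion. The following Python sketch is an illustration rather than part of the team's pipeline; it assumes plain four-line FASTQ records (no line-wrapped sequences), and the function name and toy reads are hypothetical.

```python
def find_length_mismatches(fastq_lines):
    """Yield (read_name, seq_len, qual_len) for records whose lengths differ.

    Assumes each record occupies exactly four lines:
    '@name', sequence, '+', quality string.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            yield header[1:], len(seq), len(qual)


# Toy records (hypothetical): one valid read and one mirroring the
# 34-base / 70-quality-score mismatch mentioned in the text.
GOOD = ["@read1", "ACGT", "+", "IIII"]
BAD = ["@read2", "A" * 34, "+", "I" * 70]

mismatches = list(find_length_mismatches(GOOD + BAD))
print(mismatches)
```

Running a check of this kind on downloaded files gives an early, explicit error report instead of a crash in a downstream tool.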

Epigenomics Team
When searching for epigenetic datasets that belong to a given cell type, we found that in many cases all of the necessary data were not available in one centralized location. Thus, we had to search through multiple websites and databases to find enough epigenetic data for a given cell type we wanted to model. In some cases, the metadata for a given file was either corrupt or unavailable. In other cases, the assembly used to align reads for a given set of files was not clearly indicated, so these files were discarded. When dealing with wiggle (wig) and bigWig files, sometimes the format of the file was inconsistent and needed to be edited on the fly.

Metagenomics Team
Technical difficulties generally were resolved quickly, but still hindered timely analysis within the hackathon context. For example, some Amazon EC2 nodes would suddenly become completely unresponsive for unexplained reasons, requiring that we shut down and re-initiate the nodes. By the end of the hackathon, results were only available from the pipelines that used the SRA BLAST, in part because the SRA BLAST took about an order of magnitude less computing time than the standard blast program. In both cases, many Amazon compute nodes were available, but only the SRA BLAST was able to handle the large volume of human genome and human microbiome read data efficiently. In contrast, a huge amount of the processing power available to the standard blast program (several tens of nodes) was simply wasted while the program waited for data.
It is important to recognize the difficulties of variant calling, especially with RNA-seq data. First, bias impacts genes expressed at lower levels. As gene expression itself varies from sample to sample, depth of coverage for any particular variant may differ. For instance, a variant in a sample with high gene expression would be called without difficulties, but may not be called in a sample that also carries the variant but whose expression is too low to call with confidence. Another source of variance lies within the heterogeneity of the tissue sample. Most tissue samples harbor multiple cell types, and not all of these cells will carry a somatic mutation. This problem is encountered in both DNA-seq and RNA-seq data, but results can be difficult to interpret on a per-variant basis when the fluctuation in overall coverage in gene expression is also considered. Thus, we decided to deal with overall global effects rather than selecting particular singular changes.

DNA-seq Team
The test dataset was downloaded from the SRA website. The SRA Toolkit utility called prefetch allows the user to download SRA data files, but we found it initially troublesome to use due to configuration and storage issues; by default, prefetch stores all files in user home directories, which are often limited in storage capacity. We therefore wrote a faster web-scraper script to download the files from the SRA website. Given our time limitations, we had to rely on the user-submitted aligned and trimmed files, but we recommend that files submitted to the SRA should be validated prior to upload.

Our pipeline was designed to find somatic mutations using five different algorithms, filter and annotate the mutations, and compare the predicted mutations between matched tumor-normal samples. However, due to time constraints and initial difficulties with finding an appropriate data set and software installation, we were unable to complete our analysis. A diagram of the final DNA-seq Team pipeline design is presented in Fig 3.

Epigenomics Team
We sought to rectify previously described inconsistencies in analyses by developing a more efficient, novel pipeline. Our pipeline uses RNA-seq counts, ChIP-seq peaks, and DNA methylation data in order to generate a model to predict relationships between gene expression and epigenetic data. These models can then be used to predict changes in gene expression with respect to changes in these epigenetic signals. Publicly available datasets can be utilized to generate a model, which investigators can then use to predict the state of the chromatin based on their own epigenetic data. The pipeline uses a combination of Python, R, and command line-based tools.

For each gene in a given cell type, epigenetic marks positioned locally to the gene are considered, as are distal enhancer elements that may also play a role in that gene's expression. To calculate the local epigenetic effects on transcription, an arbitrary distance on the 5' and 3' ends of a gene is binned into regions and the scores of epigenetic marks that reside in each of these bins are collected. The distal effect on transcription of a given gene is given by peak scores of enhancer elements that are at most one megabase (Mb) upstream or downstream of the gene.

The scores for each epigenetic mark and enhancer for a given gene are standardized and stored in a data matrix, where each row corresponds to a given gene for a given sample condition or cell type. Transcript gene counts generated from RNA-seq data are also stored. This pipeline generates a unique model for each gene in a given cell type by considering the gene count values as Y-values and each of the epigenetic scores as X-values. Corresponding coefficients are calculated for each X-value.
Investigators can use these coefficients to input a new set of epigenetic data and receive a testable hypothesis of predicted levels of expression for each gene based on the new epigenetic data. Over time, different datasets can be used to train a given model to make it more reliable. A diagram of the final Epigenomics Team pipeline is presented in
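The feature-building step described in this section (binning flanking regions, collecting distal enhancer scores, and standardizing) can be sketched in Python as follows. This is a simplified illustration, not the team's code. The flank size, bin size, use of summed scores per bin, and the treatment of each mark as a single coordinate are all assumptions; strand is ignored; and the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def bin_flank_scores(gene_start, gene_end, marks, flank=2000, bin_size=1000):
    """Sum mark scores in fixed-width bins covering both flanks of a gene.

    marks is a list of (position, score). Returns upstream bins followed
    by downstream bins.
    """
    n_bins = flank // bin_size
    upstream = [0.0] * n_bins
    downstream = [0.0] * n_bins
    for pos, score in marks:
        if gene_start - flank <= pos < gene_start:
            upstream[(pos - (gene_start - flank)) // bin_size] += score
        elif gene_end <= pos < gene_end + flank:
            downstream[(pos - gene_end) // bin_size] += score
    return upstream + downstream


def distal_enhancer_score(gene_start, gene_end, enhancers, max_dist=1_000_000):
    """Sum peak scores of enhancers within max_dist (1 Mb) of the gene."""
    return sum(score for pos, score in enhancers
               if gene_start - max_dist <= pos <= gene_end + max_dist)


def standardize(values):
    """Convert values to z-scores; a constant column maps to zeros."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


# Toy gene and signals (hypothetical coordinates and scores).
GENE_START, GENE_END = 10_000, 12_000
MARKS = [(8_500, 2.0), (9_500, 1.0), (9_700, 3.0), (12_500, 4.0), (20_000, 9.0)]
ENHANCERS = [(500_000, 5.0), (1_500_000, 7.0)]

local_row = bin_flank_scores(GENE_START, GENE_END, MARKS)
distal = distal_enhancer_score(GENE_START, GENE_END, ENHANCERS)
z_row = standardize(local_row)
print(local_row, distal, z_row)
```

Each gene then contributes one row of standardized X-values, which can be paired with its RNA-seq count as the Y-value when fitting the per-gene model.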

Metagenomics Team
Although the goals were similar across all six of our pipelines, differences in file formats and analysis approaches between the pipelines required the team to split their efforts rather than work together on a single pipeline. One result of this fragmentation was some lack of consistency in analytical methods (for example, choice of query versus database) between the pipelines. Moreover, due to time limitations of the hackathon only one assembly was completed: the ABySS assembly of the NA12878 human genome. Likewise, while we completed a versatile script for conversion of FASTQ files to SRA format with the latf-load tool, time allowed only for its demonstration on a single human microbiome sample.

Initially our plan included comparison of ERV sequence abundances between NA12878 genomes sequenced by several different sequencing technologies. Likewise, we initially planned to compare viral sequence abundances between several different microbiome sample types. Due to the complexity of these tasks, we decided to demonstrate our pipelines with a single sample type for each application: Illumina HiSeq 2000 reads from NA12878 and a single sample from the right retroauricular crease for the HMP application. Of the six pipelines that were planned and designed, we built four (pipelines 1-3 and 5).

We found that the most successful approach for searching a human genome for endogenous retroviruses was to use reads converted to SRA format (pipeline 2) via latf-load. The blastn in pipeline 2 was completed in 50-60 minutes. For pipeline 1, while an assembly of NA12878 was completed using ABySS within the time constraints of the hackathon, the blastn search using the assembled contigs to query the ERV database required excessive computational time; after more than 4 hours using 30 cores, the search still had not finished. In contrast, the blastn for pipeline 3 finished in 5 hours.
Part of the increased time for the blastn search in pipeline 3 may have been due to alteration of the FASTA database by merging of forward and reverse paired-ends.

Pipeline 5 includes a set of scripts that we developed to create a versatile pipeline for searching a human microbiome sample for all viruses. These scripts may be adjusted to conduct BLAST searches using other types of SRA files. A shell script downloads the relevant datasets for the assembled and non-assembled sequences from HMP as well as for total viral sequences from RefSeq. Scripts and a wrapper, written in R, were developed to convert FASTQ data to SRA format with the latf-load tool, convert the loaded data to .kar format, run a BLAST search with the blastn_vdb command, and parse the data into a viral-by-sample count matrix. The resulting sparse matrix may be normalized and handled in a way similar to previously published methods for sparse matrices of high-throughput 16S survey data (21).
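The final parsing step described above, turning BLAST hits into a viral-by-sample count matrix, can be illustrated with a short Python sketch. The team's implementation was written in R; this version is only a stand-in. It assumes tab-delimited BLAST output in which the viral genome identifier is in the first (query) column, as it would be when viral genomes are the query and reads are the database. The sample names, identifiers and the virus_column parameter are hypothetical.

```python
from collections import Counter


def count_hits(tabular_lines, virus_column=0):
    """Count hits per viral identifier in tab-delimited BLAST output lines."""
    counts = Counter()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        counts[line.rstrip("\n").split("\t")[virus_column]] += 1
    return counts


def build_count_matrix(hits_by_sample):
    """Return (viruses, samples, matrix) with one row per virus.

    Missing virus/sample combinations are filled with 0, which is what
    makes the resulting matrix sparse.
    """
    samples = sorted(hits_by_sample)
    viruses = sorted({v for counts in hits_by_sample.values() for v in counts})
    matrix = [[hits_by_sample[s][v] for s in samples] for v in viruses]
    return viruses, samples, matrix


# Toy hit tables (hypothetical identifiers) for two samples.
SAMPLE_1 = ["virusA\tread1\t98.0", "virusA\tread2\t95.5", "virusB\tread3\t91.0"]
SAMPLE_2 = ["virusB\tread9\t99.0"]

viruses, samples, matrix = build_count_matrix({
    "sample1": count_hits(SAMPLE_1),
    "sample2": count_hits(SAMPLE_2),
})
for virus, row in zip(viruses, matrix):
    print(virus, row)
```

Counting raw hits does not account for read depth or genome length, so normalization of the kind mentioned above would still be needed before comparing samples.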

RNA-seq Team
We developed and ran a Python script that reads a user-defined manifest file to extract the read sequence information from the SRA files, stores the data in FASTQ format, and launches the jobs to align the sequences using HISAT. Due to technical difficulties and time constraints, we decided to manually download and process a smaller set of 3 pairs of tumor/normal samples, as opposed to the set of 18 pairs we had initially considered. We aligned the sequences using HISAT to prepare the data for use in subsequent parts of the pipeline. The aligned SAM files were filtered to remove the unmapped, low-quality or ambiguous reads, such as reads that map at multiple different locations.

The filtered data were run through Bambino to create a variant call table in which each line contains a called variant at a particular location within the genome and the reference base at the same location. We counted nucleotide change variants in the tumor and normal samples and ran a Fisher's exact statistical test using R to identify potential RNA editing. We found no significant global changes of overrepresentation, but it is important to recognize the limitations of our small sample size and our focus on specific changes. RNA editing most likely only comprises a small number of A to G variants, and we would not be able to identify these changes by considering global total numbers as opposed to looking at each site's overall counts individually. This limitation does not affect overrepresentation in a global manner, but a small set of specific local changes might not be identified with this study design.

Before the end of the hackathon, we were able to use the initial Python script to download all 36 samples and launch the alignment tool jobs, but were able to complete fewer than 10 samples given the amount of time required to finish debugging.
However, when completed, this automation script will greatly simplify the process of accessing and launching of alignment jobs for RNA-seq datasets. A diagram of the final RNA-seq Team pipeline is
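The global comparison of variant counts described in this section can be illustrated with a two-sided Fisher's exact test on a 2x2 table. The team ran the test in R; the sketch below is a standard-library Python stand-in that follows the usual two-sided definition (summing the probabilities of all tables no more likely than the observed one). The table layout and the counts are hypothetical and are not results from the hackathon.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # with the row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum tables at most as likely as the observed one; the small relative
    # tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical counts: rows are tumor and normal, columns are
# A-to-G variants and all other nucleotide changes.
TUMOR = (120, 880)
NORMAL = (110, 890)

p = fisher_exact_two_sided(TUMOR[0], TUMOR[1], NORMAL[0], NORMAL[1])
print(f"two-sided p-value: {p:.4f}")
```

A test on pooled counts like this addresses only global overrepresentation; as noted above, a small number of site-specific editing events would need per-site counts to be detected.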

Feedback from hackathon participants was generally positive, and the enthusiasm that participants felt was evident during the event. Participants voluntarily stayed past the planned ending time each night, and many participants did not even want to take a break when lunch arrived. Even more than a week after the hackathon had ended, teams continued to communicate about and work on the problems, as well as this paper.

Every participant in the hackathon contributed not only to the research but also to drafting the paper. Each group appointed a lead writer, who worked closely with the librarian editor and coordinated with the other members of their team. Because each of the team members worked on different parts of the project, every individual wrote at least a portion of the sections of the paper covering their work. The use of Google Docs allowed multiple authors to work on the paper simultaneously and all changes to be reflected in real time. Google Docs' comment functionality also facilitated communication among authors. Once the writing was considered complete, the librarian editor organized and edited the draft in order to create a coherent and consistent paper, then returned this final draft to all authors for their approval. Though coordinating with so many authors is challenging, here we demonstrated that it is possible for a large group of individuals to contribute substantively to an article.

Participants reported that they appreciated having structured roles within the teams. Team leads were also important for the success of the team, though their presence was not necessary for the entire hackathon. For example, inclement weather on the second day prevented one of the team leads from attending, but the team still made progress on pipeline production.
Given that members of each team came from diverse backgrounds with experience working with a multitude of different data types and resources, the hackathon promoted innovation through team science and consensus-building. For example, it was essential that each pipeline utilize an appropriate test dataset, but many teams had difficulty with data that were located across multiple repositories or could not be used due to errors in metadata or formatting. Thus, teams had to brainstorm other datasets to use or create new ways to process the data. Because each problem encompassed technical challenges inherent in many biological fields, teams needed to consolidate ideas from each member. This not only allowed teams to transcend the difficult data landscape, but also fostered a strong learning environment.

Although the ultimate goal of the hackathon was to solve biological problems, participants emphasized that they appreciated this unique opportunity for career development and networking. Participants with strong backgrounds in computer science effectively mentored those who were less computationally savvy, and those with strong biology backgrounds were able to share insights with those who lacked this expertise. Additionally, the hackathon brought together individuals from different research communities who otherwise may have never met and created the potential for establishing new collaborations. In particular, participants early in their careers were able to meet prominent researchers in various fields and receive helpful training advice from the more senior participants. We anticipate that the participants will share their experiences upon returning to their respective institutes.

The organizers learned some valuable lessons from this event.
Surprisingly, although the organizers had kept the goals somewhat loosely structured, participants generally asked for more structure, particularly concerning datasets. In the future, the organizers intend to prepare videos for team members concerning the scientific directions of the projects prior to the event. Other informational materials distributed in advance of the event could help participants learn how to complete tasks that took time away from pipeline development, such as how to locate and download datasets. Specific attention will be paid to using the SRA SDK to process small parts of many genomes simultaneously. One team was unable to complete their pipeline, and other teams were affected by time constraints, so moving some of the preparatory work of locating and downloading datasets ahead of the event would help ensure that the teams had adequate time for more substantive work on the pipelines.

From an institutional perspective, the hackathon was also helpful as a means to test NCBI public data repositories. Over the course of this hackathon, several technical issues with respect to data storage, metadata and corruptions were illuminated. These issues as well as constructive feedback about how NCBI should host data were discussed directly with NCBI Director David Lipman.

Finally, we hope that this hackathon will help to stimulate the community to continue to improve these pipelines. We chose these topics and questions because they are of interest to many biologists and introductory bioinformaticians. Because the data are publicly available, investigators should be able to access the datasets from NCBI in order to replicate the work done in creating these pipelines. We encourage members of the community to extend, expand and alter these pipelines, which are licensed under a Creative Commons Attribution License (CC-BY).
We hope that the community will continue working with these pipelines to suit their needs and repost

National Center for Biotechnology Information. Whole exome sequencing in a case of sporadic multiple meningiomas 2014 [cited 2015 January 13].
National Center for Biotechnology Information. Transcriptome sequencing of human hepatocellular carcinoma (human) 2011 [cited 2015 January 13].