The great genome sequencing rush
Data is the new gold and, with the cost of sequencing steadily dropping, scientists are digging the genomic goldmine with relentless enthusiasm. Automation of the workflow, from DNA purification to data generation and analysis, has made high-throughput processing routine, unleashing possibilities far beyond what anyone could have expected just a few years ago. As a result, the field of genomics has expanded exponentially, affecting virtually every aspect of the biosciences (1).
One, a hundred, a thousand, a hundred thousand genomes
It would be fair to say that the applied genomics field began in 1990, with the official start of the Human Genome Project (HGP) (2). The effort lasted thirteen years, cost over $500 million, and involved more than twenty research centers and universities in the United States, the United Kingdom, Japan, France, Germany, Spain, and China. Due to its cost, the project attracted criticism from influential scientists (3,4). One of the major issues raised at the time was that, from a practical point of view, a reference genomic sequence is little more than an abstract concept: each individual carries a unique genomic setup consisting of millions of particular combinations of variants, ranging from Single Nucleotide Polymorphisms (SNPs) to large chromosomal rearrangements. Therefore, it makes no sense to talk about the sequence of the human genome. Although this criticism had a strong scientific rationale, it did not (and, at the time, could not) take into consideration that, driven by the HGP, sequencing technology would evolve so fast that the inconceivable would become routine: today anyone can have their own genome sequenced and obtain a personalized genomic analysis at a reasonable price. Thanks to many technological advancements, this issue has been addressed by a very effective brute-force approach. In 2007, a consortium was established to undertake the 1000 Genomes Project, a broad international collaboration aimed at sequencing the full genomes of 2,504 individuals from 26 human populations (5). The project aimed to provide a representation of human genetic complexity, offering geneticists a tool to understand the functional meaning of specific sequence variants. At its completion, it had mapped tens of millions of SNPs, millions of short insertions and deletions, and tens of thousands of large rearrangements commonly present in the human population.
Using these data, researchers obtained a much clearer picture of the human genomic landscape and used it as a platform for projects aimed directly at investigating the genetic causes of specific diseases. The past few years have seen the birth of many sequencing consortia focused on understanding the genetic causes of diseases as diverse as cancer and psychiatric conditions. With sequencing prices falling, consortia ceased to be the only players in the genomics field, and more actors joined the sequencing rush (e.g., direct-to-consumer genetic services), adding to our knowledge of the human genome and bringing the technology into a variety of clinical applications (6). The promise by a few companies of delivering, within the next few years, a technology able to sequence a human genome for less than $100 has many implications, especially in the medical field. It is now very likely that whole genome sequencing will shortly become a routine test for many patients.
The first non-human mammal to be sequenced was the mouse, owing to its value as an experimental model for human disease (7). The mouse genome project started almost at the same time as the HGP and was completed a year earlier (although the official article describing the effort was published a year later). By the time of its completion, the technology had progressed to the point where the cost of sequencing an entire genome had fallen by three to four orders of magnitude, and other species had been added to the list of sequenced organisms. In addition to classical model organisms such as zebrafish, Xenopus, and rat, economically relevant species, including the sheep, cow, and horse, had their genomes sequenced in the following years. Plant biologists did not stay idle either: after the genome of Arabidopsis, other plants such as grape, apple, and rice followed suit. Mirroring the 1000 Genomes Project, a large consortium of research centers launched the 1000 Plant Genome Project in 2008. Nothing, however, comes even close to the ambitious proposal recently unveiled by the Earth BioGenome Project (EBP), which plans to sequence all of the 1.5 million known eukaryotic species to establish the foundation for the young science of phylogenomics (8). When completed, the EBP will provide a framework for studying virtually any aspect of eukaryotic biology. Strikingly, the estimated cost of this effort is of the same order of magnitude as the budget initially devoted to the first sequencing of the human genome.
Despite the extraordinary progress seen in the field of genome sequencing in the last twenty years, there are still issues that need to be solved before massive sequencing becomes a technology as widespread as, say, PCR.
High-throughput DNA extraction. One challenge facing genomic scientists is obtaining DNA of reliable quality from large numbers of samples. Sequencing is just the last step of a complex workflow that starts with the dissolution and lysis of the primary sample material (e.g., a tumor biopsy or a few milliliters of blood) to allow the isolation of genomic DNA. DNA extraction is performed routinely in the vast majority of biomedical laboratories around the world, and it is safe to say that it is a rather standard procedure (9). The issue arises, however, when the number of samples to be processed is very high. Manual handling of large numbers of specimens usually means that the first ones to enter the extraction pipeline are processed differently than the last ones, leading to intrinsic variability in DNA quality between samples. This heterogeneity will, in turn, cause so-called batch effects: variations in the results that are not due to biological differences between the primary samples. Examples of batch effects are numerous, and it has been shown that the experimental conclusions of large genomics projects can be severely affected by these technical artifacts (10). Automating DNA extraction is one of the most effective ways to reduce batch effects, since it ensures reliable and uniform processing of specimens throughout the entire workflow. Even automated DNA extraction systems, however, may suffer from batch effects, especially if the extraction is performed by serial handling of small batches of samples rather than parallel processing of all specimens. A careful choice of the most appropriate automation system for high-throughput DNA isolation is therefore essential to ensure consistency in large-scale genomics projects.
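To make the serial-versus-parallel distinction concrete, the toy simulation below models a DNA-quality readout that drifts slightly with each successive serial batch, while parallel processing treats every sample identically. The drift and noise values are purely illustrative assumptions, not measurements from any real extraction platform:

```python
import random

random.seed(42)

def extract_serial(n_samples, batch_size, drift_per_batch=0.05):
    """Simulate DNA quality when samples are processed batch by batch.
    Each later batch is handled under slightly shifted conditions
    (reagent age, instrument warm-up, operator fatigue), modeled here
    as a small additive drift -- hypothetical numbers for illustration."""
    qualities = []
    for i in range(n_samples):
        batch = i // batch_size
        true_quality = random.gauss(1.0, 0.02)   # biological variation only
        qualities.append(true_quality - drift_per_batch * batch)
    return qualities

def extract_parallel(n_samples):
    """All samples processed at once: no batch-dependent drift."""
    return [random.gauss(1.0, 0.02) for _ in range(n_samples)]

serial = extract_serial(96, batch_size=24)   # four serial batches of 24
parallel = extract_parallel(96)

# Mean quality of the first vs last serial batch diverges, a classic
# batch effect; the parallel run shows only random noise
first = sum(serial[:24]) / 24
last = sum(serial[-24:]) / 24
print(round(first - last, 2))   # drift accumulated across serial batches
```

A downstream analysis comparing, say, cases in the first batch against controls in the last would mistake this processing drift for a biological signal, which is exactly why the text recommends parallel processing of all specimens.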
Library preparation. After the DNA has been extracted from the primary samples, the sequencing library has to be prepared. This step entails fragmentation of the DNA, followed by ligation of the adapters and amplification. Technical artifacts can be introduced at each of these steps: sub-optimal fragmentation may yield molecules that are either too short or too long for the selected sequencing approach; inefficient ligation may result in an overabundance of adapter dimers, which in turn interfere with sequencing; and unbalanced amplification will reduce library complexity and cause loss of information (11). All of these artifacts ultimately produce a skewed representation of the sample's DNA sequence and a loss of potentially critical information. High-throughput sequencing workflows are, again, susceptible to batch effects (10). Automated workflows from primary sample through DNA fragmentation and amplification setup are the preferred solution, since they reduce hands-on time and handle large numbers of samples in a very consistent fashion.
The most discussed limitation of the technology lies in handling the amount of data generated by the sequencing machines. It has been calculated that sequencing a human genome creates about 150 gigabytes (GB; one gigabyte being one billion bytes) of raw data, with more specialized applications (e.g., tumor sequencing) generating about two terabytes (one terabyte corresponding to a thousand gigabytes) per sample. A project like the EBP is going to produce about one exabyte, corresponding to one billion gigabytes. If this number seems daunting, consider that once sequencing becomes a routine clinical procedure, the amount of patient-related data generated over the next ten years will be even larger than the EBP database. All of these data need to be stored and shared for analysis, posing a formidable challenge. To address the issue, public repositories such as the Sequence Read Archive, hosted by the US National Center for Biotechnology Information (NCBI), and the European Nucleotide Archive (ENA), hosted by the European Molecular Biology Laboratory at the European Bioinformatics Institute (EMBL–EBI), are available for data storage (12), relieving single institutions of the need to build massive hardware infrastructures.
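The scale of these figures is easy to sanity-check with a little arithmetic; the sketch below simply re-derives the round numbers quoted above (these are the text's estimates, not precise measurements):

```python
# Back-of-envelope check of the storage figures quoted in the text.
# All values are the article's round estimates, not precise measurements.

GB = 1                       # work in gigabytes throughout
TB = 1_000 * GB              # 1 terabyte = 1,000 GB
EB = 1_000_000_000 * GB      # 1 exabyte  = 1 billion GB

human_genome = 150 * GB      # raw data per human whole genome
tumor_sample = 2 * TB        # specialized application, per sample

# A tumor sample produces roughly 13x the data of a standard genome
print(round(tumor_sample / human_genome, 1))   # 13.3

# If the EBP's ~1.5 million target species total roughly one exabyte,
# that averages out to about 667 GB of raw data per species
print(round(EB / 1_500_000))                   # 667
```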
The sheer analysis of this amount of data will be a significant challenge. Fortunately, another rapidly developing field is machine learning (a branch of artificial intelligence). Fueled by private and public funding, machine learning is now being actively applied to pharmacogenomics and genetic screening, and it will soon become a crucial tool for making sense of the rising ocean of genomic data (13). Cloud-based services such as those hosted by Illumina or Google are already routinely used in many laboratories to analyze complex genomic data sets. The enormous amount of publicly available information makes the perfect setup for refining machine learning algorithms and using them for applications as diverse as drug repositioning and patient categorization (14). A major limitation in the field is the lack of large collections of data manually annotated by medical experts that could be used to train the algorithms. Different approaches are being implemented to tackle this issue, and although not yet refined, they are being developed very aggressively. In the near future, machine learning algorithms are expected to become the primary tool for virtually all applied genomics analyses and to support physicians and geneticists in making critical decisions.
Sequencing: what’s next?
Genomics is undergoing a revolution that has few parallels in the history of science. The possibility of accessing and processing whole genome data with low effort and cost is opening almost infinite opportunities in basic and applied life science, paving the way for new applications (metagenomics, to name one, would not be possible without NGS technology). Thirty years ago, no one would have guessed how PCR would change the face of modern science. It is not a stretch of the imagination to suggest that the combination of high-throughput DNA isolation, sequencing, and machine learning will have an even more profound effect on biology and medicine in the years to come.
1. Park ST, Kim J (2016) Trends in Next-Generation Sequencing and a New Era for Whole Genome Sequencing. International Neurourology Journal 20(S2), pages S76–S83
2. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, pages 860–921
3. Letter from Christian B. Anfinsen https://profiles.nlm.nih.gov/ps/retrieve/ResourceMetadata/KKBBDT
4. Weis JH (1990) Usefulness of the Human Genome Project. Science (correspondence), 29 Jun
5. The 1000 Genomes Project – Nature collection https://www.nature.com/collections/dcfqmlgsrw
6. Lander ES (2011) Initial impact of the sequencing of the human genome. Nature 470, pages 187–197
7. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, pages 520–562
8. Lewin HA, et al. (2018) Earth BioGenome Project: Sequencing life for the future of life. PNAS 115(17), pages 4325–4333
9. Green MR, Sambrook J (2012) Molecular Cloning: A Laboratory Manual, 4th edition. Cold Spring Harbor Laboratory Press
10. PerkinElmer review on batch effects
11. Head SR, et al. (2014) Library construction for next-generation sequencing: Overviews and challenges. Biotechniques 56(2), page 61
12. Langmead B, Nellore A (2018) Cloud computing for genomic data analysis and collaboration. Nature Reviews Genetics 19, pages 208–219
13. Miller DD, Brown EW (2018) Artificial Intelligence in Medical Practice: The Question to the Answer? The American Journal of Medicine 131(2), pages 129–133
14. Ching T, et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society Interface 15(141)