Petascale Computing and Genomics
By Hsien-Hsien Lei, PhD, HUGO Matters Editor
Last week, I mentioned the use of petascale supercomputers to manage and analyze the overwhelming amount of genomic data being generated currently and into the foreseeable future. Last week was also the first time I’d ever heard the terms “petascale” and “petaflop.” I assume that I’m not the only one who hasn’t given much thought to the specifics of supercomputing so I’m sharing here what I’ve learned so far.
First, a couple of definitions:
- “peta” is one quadrillion (10^15)
- FLOPS stands for FLoating point OPerations per Second, which is a measure of a computer’s performance
- 1 petaflop is equal to 1000 teraflops or 1 quadrillion floating point operations per second
Trivia:
- According to Wikipedia, a simple calculator functions at about 10 FLOPS.
- Most personal computers process on the order of billions of calculations per second.
- One petabyte of data is equivalent to six billion digital photos. (Blue Waters)
- Google processes 20 petabytes of data per day (GenomeWeb)
- 1 petabyte = 1,024 terabytes
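To get a feel for these numbers, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above. The variable names are my own, and the binary (1,024-based) storage convention is an assumption carried over from the last bullet.

```python
# Back-of-the-envelope arithmetic for the definitions and trivia above.
# All variable names are illustrative; the inputs come from the bullets.

PETA = 10**15
TERA = 10**12

# 1 petaflop expressed in teraflops (decimal prefixes).
teraflops_per_petaflop = PETA // TERA  # 1000

# Storage is often counted in powers of two, which is where 1,024 comes from.
BYTES_PER_TERABYTE = 2**40
BYTES_PER_PETABYTE = 2**50
terabytes_per_petabyte = BYTES_PER_PETABYTE // BYTES_PER_TERABYTE  # 1024

# How long would a 10 FLOPS calculator need to match what a
# 1-petaflop machine does in a single second?
CALCULATOR_FLOPS = 10
SECONDS_PER_YEAR = 365 * 24 * 3600
years_for_calculator = PETA / CALCULATOR_FLOPS / SECONDS_PER_YEAR

# Average photo size implied by "one petabyte = six billion photos".
bytes_per_photo = BYTES_PER_PETABYTE / 6e9

print(f"1 petaflop = {teraflops_per_petaflop} teraflops")
print(f"1 petabyte = {terabytes_per_petabyte} terabytes")
print(f"Calculator time for one petaflop-second: {years_for_calculator:,.0f} years")
print(f"Implied photo size: {bytes_per_photo / 1000:,.0f} kB")
```

In other words, a pocket calculator would need on the order of three million years to do what a one-petaflop machine does in one second.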
Image: I See Your Petaflop and Raise You 19 More, Wired Science, February 2, 2009. Sequoia is the supercomputer planned by the Department of Energy and IBM that will be able to perform at the 20 petaflop level.
David A. Bader, author of Petascale Computing: Algorithms and Applications, explained in an interview with iTnews.com.au:
Computational science enables us to investigate phenomena where economics or constraints preclude experimentation, evaluate complex models and manage massive data volumes, model processes across interdisciplinary boundaries, and transform business and engineering practices.
Petascale computing runs on clusters of computers. An article in Cloud Computing Journal explains why:
The main benefits of clusters are affordability, flexibility, availability, high-performance and scalability. A cluster uses the aggregated power of compute server nodes to form a high-performance solution for parallel applications. When more compute power is needed, it can be simply achieved by adding more server nodes to the cluster.
In November 2009, it was announced that a four-year, $1 million project to study genomic evolution using petascale computers had been funded by the National Science Foundation’s PetaApps program. Researchers will first use GRAPPA, an open-source algorithm, to study genome rearrangements in Drosophila. From this analysis, new algorithms will be developed that have the potential to make sense of genome rearrangements, leading to better identification of microorganisms, the development of new vaccines, and a greater understanding of how microbial communities evolve along with biochemical pathways.
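For readers wondering what "studying genome rearrangements" looks like computationally, here is a minimal sketch of one simple, widely used measure: the breakpoint distance between two gene orders. This is not GRAPPA's implementation, and the function name and the toy gene orders are my own. Genes are assumed to be numbered with nonzero integers, with a minus sign marking a gene that sits on the opposite strand.

```python
def breakpoint_distance(order_a, order_b):
    """Count adjacencies in order_a that are not conserved in order_b.

    Both arguments are lists of signed, nonzero integers describing the
    same set of genes on a single linear chromosome.
    """
    if sorted(abs(g) for g in order_a) != sorted(abs(g) for g in order_b):
        raise ValueError("both genomes must contain the same genes")

    # Artificial start and end markers so the chromosome ends count too.
    cap = max(abs(g) for g in order_a) + 1

    def framed(order):
        return [0] + list(order) + [cap]

    # An adjacency (x, y) read on one strand is the same as (-y, -x)
    # read on the other strand, so store both forms.
    conserved = set()
    frame_b = framed(order_b)
    for left, right in zip(frame_b, frame_b[1:]):
        conserved.add((left, right))
        conserved.add((-right, -left))

    frame_a = framed(order_a)
    return sum((left, right) not in conserved
               for left, right in zip(frame_a, frame_a[1:]))


# Toy example: genes 2 to 4 have been inverted in the second genome.
reference = [1, 2, 3, 4, 5]
rearranged = [1, -4, -3, -2, 5]
print(breakpoint_distance(rearranged, reference))  # 2
```

A single inversion creates two breakpoints, one at each end of the flipped segment. Working out the most plausible series of such events across many genomes is what quickly becomes expensive enough to call for supercomputers.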
In 2011, the world’s most powerful supercomputer, Blue Waters, will come online. According to GenomeWeb, Blue Waters will contain more than 200,000 processing cores and can perform at multi-petaflop levels. A partnership between University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, IBM, and the Great Lakes Consortium for Petascale Computation, Blue Waters is supported by the National Science Foundation and the University of Illinois. Researchers can apply for time on Blue Waters from the National Science Foundation.
"I think petascale computing comes at a very good time for biology, especially genomics, which has to deal with … increasingly large data sets trying to do a lot of correlation between the data that’s held in several massive datasets," says Thomas Dunning, director of the NCSA at University of Illinois, Urbana-Champaign. "This is the time that biology is now going to need this kind of computing capability — and the good thing is that it’s going to be here."
~Petascale Coming Down the Pike, GenomeWeb, Jun 2009
Here’s a video of Saurabh Sinha, a University of Illinois assistant professor of computer science, talking about his research using NCSA’s supercomputers.
Genome-wide search for regulatory sequences in a newly sequenced genome: comparative genomics in the large divergence regime
Next topic for thought: cloud computing. More to come.
NB: HUGO President Prof. Edison T. Liu is currently attending the Bioinformatics of Genome Validation and Supercomputer Applications workshop at NCSA in Urbana, Illinois. I’m looking forward to hearing more about their discussions!
Do you have any knowledge to share with regard to petascale computing and genomics?
Movie – Naturally Obsessed, the making of a scientist
Naturally Obsessed: the making of a scientist is a documentary by Richard Rifkind and Carole Rifkind.
Mixing humor with heartbreak, the film tells a profoundly real yet intensely dramatic story about life in a molecular biology lab. “I want the viewer to stand in the shoes of a scientist at work in a lab, glimpse the world of research as it really is, and understand what it takes to fill an ample pipeline of future scientists,” says scientist turned filmmaker, Sloan-Kettering Institute Chairman Emeritus, Richard Rifkind.
For another behind-the-scenes look at the high pressure environment of a life sciences lab, I recommend Intuition by Allegra Goodman. (New York Times review)
(via Misha Angrist)
Scientific consortium maps the range of genetic diversity in Asia, and traces the genetic origins of Asian populations
by Dr. Vikrant Kumar, Genome Institute of Singapore
As an anthropologist, I have always wanted to know whether Asians, known for their extensive linguistic and ethnic diversity, also have a substantial level of genetic variation. In other words, do they have a common origin or multiple origins? Do the ancestors of the Negritos of the Philippines, Malaysia and Indonesia differ from those of their neighboring Asians? And what binds us more: language or geography? The recent paper published in Science by the HUGO Pan-Asian SNP Consortium, Mapping Human Genetic Diversity in Asia, answers these fundamental questions, which have been floating around for years.
To the best of my understanding, this is so far the only paper in which 73 populations scattered across 10 Asian countries are studied together, through a massive collaborative effort of scientists from 40 institutes, mostly in Asia. About 2,000 samples, covering almost the entire spectrum of linguistic and ethnic diversity, were genotyped for about 50,000 single nucleotide polymorphism markers. Some of the key findings of this paper are:
· East and Southeast Asians share a common origin.
· East Asians originated mainly from Southeast Asian populations, with minor contributions from Central-South Asian groups.
· The Negrito and non-Negrito populations of Asia descend from a common ancestor that entered the continent. This supports the hypothesis of one wave of migration into Asia, as opposed to two waves of migration from Africa.
· The Taiwan aborigines are derived from Austronesian populations. This stands in contrast to the suggestion that this island served as the ancestral “homeland” for Austronesian speaking populations throughout the Indo-Pacific.
· Genetic ancestry is highly correlated with linguistic affiliations as well as geography.
The paper stands out in its attempt to understand the peopling of Asia and the genetic relationships among its populations. In the process, it not only presents a fantastic genotype database but also provides vital clues to scientists in diverse fields, from linguistics to archaeology to human genetics. For example, it may be an interesting proposition for a human geneticist to examine whether East and Southeast Asians share more disease-associated risk alleles than expected. Likewise, it may be time for linguists to take another look at the “birthplace” of the Austronesian linguistic family. I hope the consortium continues its amazing endeavor, includes many more important and isolated populations from the whole of Asia, and moves beyond the analysis of single nucleotide polymorphisms to other kinds of variation, such as structural variation.
How Genome-Wide Variation within the Han Chinese Population Affects Study Design
By Brian Z. Ring, PhD, Director of Technology of YiGene Inc., and Principal Investigator of Applied Genomics
A scan of a rack of magazines these days will likely find at least one article discussing the emergence of China onto the world stage, its headline usually following the standard formula of “The (Sleeping/Red) (Dragon/Giant) (Awakes/Rises)”. Yet despite China’s growing prominence in world news, the genetics of the Han people, the dominant ethnic group in China and the largest ethnic group in the world, have been relatively poorly studied. The HapMap project, which characterized 45 Han individuals, not surprisingly revealed that this group varies distinctly from other ethnic groups, yet unveiled little about variation within the Han population. Other studies have been small in scale or have followed only paternal and maternal lineages through Y chromosome and mitochondrial studies. These have suggested interesting variation within the Han population, primarily along a north-south axis, but not much more.
Fortunately, two studies recently published in the American Journal of Human Genetics are shedding more light on genome-wide variation within the Han population. The studies, one led by Jin Li of Fudan University and the other by Jianjun Liu of the Genome Institute of Singapore, each independently examined hundreds of thousands of autosomal SNPs in samples collected from several regions of China (160K SNPs in 1,700 individuals and 350K SNPs in 6,000 individuals, respectively). By and large, the studies tell a similar story: while the Han Chinese population is comparatively uniform, significant variation exists, and it largely corresponds to the known north-south settlement of China by this ethnic group. This pattern, as measured by the Genome Institute of Singapore’s study, accounts for roughly 0.4% of the genetic variance within this population. Though the effect is small, both studies confirm that this variation could affect genome-wide association studies if a geographically diverse population is used without proper stratification. This information will help guide such studies and avoid false-positive candidates.
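To illustrate why unaddressed stratification can produce false positives, here is a small simulation sketch. Every number in it is hypothetical and deliberately exaggerated (the real north-south differences reported are far subtler), and the names are my own. The allele has no effect on disease at all. It is simply more common in the south, and the cases happen to be recruited mostly from the north.

```python
import random

# Hypothetical, exaggerated parameters chosen only to make the effect visible.
# The allele frequency depends on region alone, never on disease status.
ALLELE_FREQ = {"north": 0.30, "south": 0.50}

# Number of people sampled per (status, region); cases skew north.
SAMPLE_SIZES = {
    ("case", "north"): 14000,
    ("case", "south"): 6000,
    ("control", "north"): 6000,
    ("control", "south"): 14000,
}


def simulate_counts(seed=0):
    """Return {(status, region): (allele_count, total_alleles)}."""
    rng = random.Random(seed)
    counts = {}
    for (status, region), n_people in SAMPLE_SIZES.items():
        n_alleles = 2 * n_people  # two copies per person
        hits = sum(rng.random() < ALLELE_FREQ[region] for _ in range(n_alleles))
        counts[(status, region)] = (hits, n_alleles)
    return counts


def allele_frequency(counts, status, regions):
    """Observed allele frequency for one status, pooled over the given regions."""
    hits = sum(counts[(status, region)][0] for region in regions)
    total = sum(counts[(status, region)][1] for region in regions)
    return hits / total


def case_control_difference(counts, regions):
    return (allele_frequency(counts, "case", regions)
            - allele_frequency(counts, "control", regions))


counts = simulate_counts()
pooled_diff = case_control_difference(counts, ["north", "south"])
north_diff = case_control_difference(counts, ["north"])
south_diff = case_control_difference(counts, ["south"])

print(f"Pooled case-control difference: {pooled_diff:+.3f}")
print(f"North only:                     {north_diff:+.3f}")
print(f"South only:                     {south_diff:+.3f}")
```

Pooled across regions, the allele looks rarer in cases than in controls, although within each region there is essentially no difference. Analyzing each region separately, or matching cases and controls by geography, removes the artifact.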
Another important result from these studies is the flip side of the observed genetic variance: though statistically significant, it is nonetheless relatively small. This is encouraging for a variety of efforts underway in China that presume a relatively flat genetic landscape. Drug development has been carried out, for the most part, in Europe and the United States, and thus the clinical populations have consisted largely of people of European heritage. This raises concern about whether these results translate directly to other populations. To address this bias, as well as to lower costs in the increasingly cost-conscious pharmaceutical market, an increasing number of clinical trials are being carried out within China. If there were strong regional differences in the population’s genetic makeup, these studies would have to be carefully constructed to either properly stratify their cohorts or limit them to a genetically uniform region, especially in trials where there is a potentially significant genetic component to the response to the candidate compound.
While these two studies do not show that intra-ethnic variation within the Han population can be ignored in such studies, the variation appears to be small enough not to adversely affect current clinical trial strategies. Similarly, efforts to employ known associations between drug response and genetic variation in crafting public health care services are not disrupted by this new information. For example, cytochrome P450 2D6 is responsible for the metabolism of a wide variety of commonly prescribed drugs. However, mutant forms of this enzyme, though rare in the West, have been revealed by the HapMap project to affect roughly a third of Han Chinese. This is encouraging efforts to determine whether genotyping prior to prescribing affected drugs could lead to improved health care delivery and lower costs. If there were strong variation within the Han population, then strategies created with the HapMap data as their basis would not be applicable. Instead, these recent results will serve to improve ongoing studies and ensure meaningful results. Additional studies of genome-wide variation within the Han population, encompassing more regions of China, will further fill in the picture and allow an even more refined approach to future translational studies.
Dr. Brian Ring received his PhD in Molecular Biology from Cornell University and currently lives and works in Beijing, China.
YiGene is a Beijing-based personal genomics service company. YiGene is working with the Chinese CDC and other public institutions to determine the best methods and practices for translating genetic discoveries to the Chinese public health market.
Image: Wellcome Trust, L0004700, “The doctor is feeling the pulse of a woman patient. Her wrist is supported on a small red bolster. The doctor touches the pulse only with his finger-tips, without looking at the woman,” Watercolour By: Zhao Pei Qun
Genetic Diversity in Han Chinese
by HUGO Matters Editor Hsien-Hsien Lei, PhD
When I was in high school, a classmate of mine commented on the San Francisco Miss Chinatown competition.
“How can they tell the difference between all the girls?”
Perhaps I should send her a copy of two recent papers published in the American Journal of Human Genetics (AJHG) that found significant genetic and genomic diversity within the Han Chinese population.
The first paper by Jieming Chen et al. from the Genome Institute of Singapore sampled Chinese from ten provinces in addition to Beijing, Shanghai, and Singapore. They examined over 350,000 genome-wide autosomal SNPs and developed a genetic map of the Han Chinese. They found that:
- Within Guangdong province, genetic differentiation correlated with language.
- Genetic patterns correlated with north to south geographic orientations but not east to west. This finding is consistent with historical migration patterns.
- Metropolitan cities in China have experienced strong modern migration and are thus more difficult to tease apart genetically.
- Han Chinese individuals in Singapore are genetically closest to individuals from southern China.
- Spurious associations in genome-wide association studies (GWAS) can occur if population stratification in Chinese populations is not addressed. Geographic matching, however, can serve as a proxy for genetic matching.
"By investigating the genome-wide DNA variation, we can determine whether an anonymous person is a Chinese, what the ancestral origin of this person in China may be, and sometimes which dialect group of the Han Chinese this person may belong to," senior author Liu Jianjun, leader of the GIS Human Genetics Group, said in a statement. "More importantly, our study provides information for a better design of genetic studies in the search for genes that confer susceptibility to various diseases."
A second, smaller study by Xu Shuhua et al., also published in AJHG, confirmed the observations of the paper above. The researchers concluded that genetic differentiation between northern, central, and southern Han can lead to false-positive results in association studies.
NB: On a related note, these maps of China are a riot. They were created by people around China and depict their views of the various provinces.



