Petascale Computing and Genomics

March 1, 2010 · Posted in Research, Tools of the Genome Trade 

By Hsien-Hsien Lei, PhD, HUGO Matters Editor

Last week, I mentioned the use of petascale supercomputers to manage and analyze the overwhelming amount of genomic data being generated currently and into the foreseeable future. Last week was also the first time I’d ever heard the terms “petascale” and “petaflop.” I assume that I’m not the only one who hasn’t given much thought to the specifics of supercomputing so I’m sharing here what I’ve learned so far.

First, a couple of definitions:

  • “peta” means one quadrillion (10¹⁵)
  • FLOPS stands for FLoating point Operations Per Second, a measure of a computer’s performance
  • 1 petaflop is equal to 1,000 teraflops, or one quadrillion floating point operations per second (a quick back-of-envelope sketch follows this list)
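
To put those units in perspective, here is a minimal back-of-envelope sketch in Python. It uses only the definitions above; the 10-gigaflop figure for a desktop machine is a rough assumption added for comparison, not a quoted specification.

```python
# Back-of-envelope arithmetic for the prefixes defined above.
PETA = 10**15          # "peta" = one quadrillion
TERA = 10**12

petaflop = PETA        # 1 petaflop = 10**15 floating point operations per second
print(petaflop / TERA)                     # 1000.0 -> 1 petaflop = 1,000 teraflops

# How long would 10**18 floating point operations take?
ops = 10**18
print(ops / petaflop, "seconds")           # 1000.0 seconds at one petaflop
print(ops / (10 * 10**9) / 3600, "hours")  # ~27,800 hours at an assumed 10 gigaflops
```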

Trivia:

  • According to Wikipedia, a simple calculator functions at about 10 FLOPS.
  • A typical personal computer can perform on the order of billions of floating point operations per second (gigaflops).
  • One petabyte of data is equivalent to six billion digital photos. (Blue Waters)
  • Google processes 20 petabytes of data per day (GenomeWeb)
  • 1 petabyte = 1,024 terabytes (sanity-checked in the short sketch below)
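
A similar sanity check on the data-volume trivia, using the binary 1,024-based convention from the last bullet (the per-photo size simply falls out of the figures quoted above):

```python
# Rough arithmetic for the data-volume figures above (binary convention).
PETABYTE = 1024**5     # bytes; 1 petabyte = 1,024 terabytes
TERABYTE = 1024**4

print(PETABYTE / TERABYTE)                        # 1024.0 terabytes per petabyte

# If one petabyte holds six billion photos, how big is each photo?
photos = 6 * 10**9
print(PETABYTE / photos / 1024, "KB per photo")   # roughly 180 KB each

# Google's quoted 20 petabytes per day, expressed per second:
print(20 * PETABYTE / (24 * 3600) / 1024**3, "GB per second")  # roughly 240 GB/s
```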

Image: I See Your Petaflop and Raise You 19 More, Wired Science, February 2, 2009. Sequoia is the supercomputer planned by the Department of Energy and IBM that will be able to perform at the 20-petaflop level.

David A. Bader, author of Petascale Computing: Algorithms and Applications, explained in an interview with iTnews.com.au:

Computational science enables us to investigate phenomena where economics or constraints preclude experimentation, evaluate complex models and manage massive data volumes, model processes across interdisciplinary boundaries, and transform business and engineering practices.

Petascale computing runs on clusters of computers. An article in Cloud Computing Journal explains why:

The main benefits of clusters are affordability, flexibility, availability, high-performance and scalability. A cluster uses the aggregated power of compute server nodes to form a high-performance solution for parallel applications. When more compute power is needed, it can be simply achieved by adding more server nodes to the cluster.
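
To make that concrete, here is a minimal sketch of the scatter-and-aggregate pattern a cluster job typically follows, written with mpi4py (Python bindings for MPI). The workload is a placeholder sum of squares rather than a genomics computation, and it assumes the script is launched across nodes with mpirun.

```python
# Minimal scatter/aggregate sketch of cluster parallelism.
# Run with e.g.:  mpirun -n 4 python cluster_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the job
size = comm.Get_size()   # total number of processes (spread over the nodes)

if rank == 0:
    # The root process splits the workload into one chunk per process.
    data = list(range(1000000))
    chunks = [data[i::size] for i in range(size)]
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)              # each process receives its chunk
local_result = sum(x * x for x in chunk)          # placeholder "work" on that chunk
total = comm.reduce(local_result, op=MPI.SUM, root=0)  # aggregate back on the root

if rank == 0:
    print("sum of squares:", total)
```

Adding server nodes simply increases the number of processes, so each chunk shrinks and the job finishes sooner, which is the scalability the quote describes.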

In November 2009, a four-year, $1 million project supported by the National Science Foundation’s PetaApps program was awarded to study genomic evolution using petascale computers. Researchers will first use GRAPPA, an open-source software tool, to study genome rearrangements in Drosophila. From this analysis, new algorithms will be developed that have the potential to make sense of genome rearrangements, leading to better identification of microorganisms, the development of new vaccines, and a greater understanding of how microbial communities evolve along with their biochemical pathways.
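
GRAPPA’s actual methods (rearrangement distances and phylogeny reconstruction) are well beyond a blog snippet, but a much simpler, related quantity gives the flavor: the breakpoint distance between two gene orders. The sketch below is an illustrative stand-in under simplifying assumptions (linear genomes, no end caps), not GRAPPA’s algorithm.

```python
def breakpoint_distance(genome_a, genome_b):
    """Count adjacencies of genome_a that are not preserved in genome_b.

    Each genome is a list of signed integers: one integer per gene,
    with the sign giving the strand. Linear genomes, no end caps (a
    simplification of the textbook definition)."""
    preserved = set()
    for x, y in zip(genome_b, genome_b[1:]):
        preserved.add((x, y))
        preserved.add((-y, -x))   # the same adjacency read from the other strand
    return sum((x, y) not in preserved
               for x, y in zip(genome_a, genome_a[1:]))

# A single inversion of genes 2-3 creates two breakpoints:
print(breakpoint_distance([1, 2, 3, 4, 5], [1, -3, -2, 4, 5]))  # -> 2
```

Real analyses use far richer models and many more genomes, which is what makes them candidates for petascale machines.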

In 2011, Blue Waters, expected to be the world’s most powerful supercomputer, will come online. According to GenomeWeb, Blue Waters will contain more than 200,000 processing cores and will be able to perform at multi-petaflop levels. A partnership between the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications (NCSA), IBM, and the Great Lakes Consortium for Petascale Computation, Blue Waters is supported by the National Science Foundation and the University of Illinois. Researchers can apply to the National Science Foundation for time on Blue Waters.

"I think petascale computing comes at a very good time for biology, especially genomics, which has to deal with … increasingly large data sets trying to do a lot of correlation between the data that’s held in several massive datasets," says Thomas Dunning, director of the NCSA at University of Illinois, Urbana-Champaign. "This is the time that biology is now going to need this kind of computing capability — and the good thing is that it’s going to be here."

~Petascale Coming Down the Pike, GenomeWeb, Jun 2009

Here’s a video of Saurabh Sinha, a University of Illinois assistant professor of computer science, talking about his research using NCSA’s supercomputers.

Genome-wide search for regulatory sequences in a newly sequenced genome: comparative genomics in the large divergence regime

Next topic for thought: cloud computing. More to come.

NB: HUGO President Prof. Edison T. Liu is currently attending the Bioinformatics of Genome Validation and Supercomputer Applications workshop at NCSA in Urbana, Illinois. I’m looking forward to hearing more about their discussions!

Do you have any knowledge to share with regard to petascale computing and genomics?
