Protein Folding

Citizen science: Not just for birders any more. (DrKjaergaard/Wikimedia.org)

by Ursula Goodenough

The idea of citizen science goes back to 1900 with the Audubon Society's Christmas Bird Count. Other such projects have followed, like the current Community Collaborative Rain, Hail & Snow Network (CoCoRaHS), where data collected by volunteers is used for weather forecasting and climate studies.

But nowadays there's another way to participate in science projects where yes, you miss out on the fun of cataloguing birds and collecting rainfall, but you put to use all that processing capacity in your laptop that would otherwise sit idle. Basically, when you aren't running something else -- like when you're asleep -- you switch your computer over to a system that performs calculations for a specific task. Thousands of other computers are on the same project, allowing calculations to be made very rapidly; hence enormous amounts of information can be analyzed over reasonable periods of time, particularly if your computer carries hardware like an nVidia graphics card, whose processors are well suited to this kind of number-crunching.
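To make the idea concrete, here's a minimal sketch of how such a project divides its labor, with each worker process standing in for one volunteer's computer. Everything in it -- the work-unit splitting, the crunch function -- is a made-up illustration, not the actual BOINC or SETI@home code.

```python
# A toy version of the volunteer-computing loop: the "server" hands out work
# units, each client crunches one while otherwise idle, and the results flow
# back to be merged. All names here are illustrative stand-ins.

from multiprocessing import Pool

def make_work_units(data, n_units):
    """Split one big dataset into chunks sized for a single home computer."""
    size = len(data) // n_units
    return [data[i * size:(i + 1) * size] for i in range(n_units)]

def crunch(unit):
    """Stand-in for the real science code (e.g. scanning radio data)."""
    return sum(x * x for x in unit)  # placeholder computation

if __name__ == "__main__":
    data = list(range(1_000_000))            # the project's full dataset
    units = make_work_units(data, n_units=8)
    with Pool() as pool:                     # each worker plays one volunteer PC
        partials = pool.map(crunch, units)
    print("combined result:", sum(partials))
```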

Adam flagged for us a new platform for the SETI@home project, which has been operating since 1999. In this case, radio telescope data is downloaded and processed on thousands of personal computers, looking for unexplained regularities that might derive from extraterrestrials sending radio signals. None have yet been detected, but then, the project is young.

If extraterrestrial intelligence isn't your thing, it turns out that there are many other such projects on offer, some launched and others under development. The list is fascinating: one finds Malaria Control, Quake-Catcher, Milkyway@Home, FightAIDS@home, Mindmodeling@Home, and mystifying ones relating to mathematics, like Goldbach's Conjecture (which tests Goldbach's weak conjecture) and Ramsey@Home (which searches for new lower bounds of Ramsey numbers).

Most of these use the BOINC middleware system, originally developed at UC Berkeley for the SETI project and available to anyone for free. As of last week, BOINC reports 4,739,000 active computers worldwide processing an average of 4 petaFLOPS (peta = 10^15), faster than Cray's top-of-the-line supercomputer. The whole thing just blows my mind.

The most ambitious of these distributed computing clusters, masterminded at Stanford University, is called Folding@home; its goal is to predict the shapes of proteins. It was launched in 2000, and now harnesses >5 petaFLOPS from some 400,000 machines using a non-BOINC-based platform.

So why is such massive computation required to understand protein folding, and why do we care how a protein folds?

Available technologies now allow us to figure out the amino-acid sequence of a protein with great ease and accuracy. This is called its primary structure. Algorithms have also been developed that predict, with good accuracy, which of these amino acids adopt such local secondary structures as alpha-helices and beta-strands.
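As a cartoon of what such a secondary-structure algorithm does, here's a sketch that slides a window along a primary sequence and flags stretches whose average helix-forming tendency is high. The propensity numbers are illustrative placeholders, not the published parameters of any real predictor (Chou-Fasman-style methods work in roughly this spirit, though with much more care).

```python
# Sketch of the simplest kind of secondary-structure prediction: slide a
# window along the one-letter amino-acid string and flag stretches whose
# average helix propensity exceeds a threshold. Values are illustrative only.

HELIX_PROPENSITY = {   # placeholder numbers, not real published parameters
    "A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4,
    "G": 0.5, "P": 0.6, "V": 1.0, "K": 1.1,
}

def predict_helices(sequence, window=6, threshold=1.1):
    """Return (start, end) spans predicted to be alpha-helical."""
    spans = []
    for i in range(len(sequence) - window + 1):
        scores = [HELIX_PROPENSITY.get(aa, 1.0) for aa in sequence[i:i + window]]
        if sum(scores) / window > threshold:
            spans.append((i, i + window))
    return spans

print(predict_helices("MAELKEALMGPKVAEEL"))  # a made-up primary sequence
```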

All of this is essential to know. But what's really important to know is the overall shape of a protein, designated its tertiary structure or its fold, a shape adopted spontaneously via interactions between amino acids that are often far away from one another in the primary structure. This shape has everything to do with function, allowing an enzyme to bind to its substrate and a receptor to bind to its hormone and a membrane channel to open and close and a muscle protein to participate in contraction.

The shape can be resolved if the protein can be crystallized and analyzed by X-ray diffraction, but this is a tedious process and many proteins refuse to form good crystals. Moreover, it's often of particular interest to learn the shape of a mutant protein that is implicated in disease. So, if a researcher could take a primary sequence, normal or mutant, feed it into a computer algorithm, and quickly learn its predicted 3-D configuration, this would be a stupendous breakthrough.

The reason that it's so hard to come up with such an algorithm is that an average protein has some 300 amino acids, each potentially forming various kinds of bonds with any of the others -- hydrophobic interactions, hydrogen bonds, electrostatic interactions, and so on -- where the final set of bonds determines the final fold. Moreover, once a first set of bonds is in place, this influences the probability of the next set of possible bonds, which in turn influences the next set. Worst of all, folding happens on a millisecond timescale, while the underlying atomic motions occur many orders of magnitude faster, so a simulation must string together an astronomical number of tiny steps to capture a single folding event. So, for each primary sequence, a vast number of simulations must be rapidly deployed to figure out which interactions are thermodynamically the most probable and hence the most likely to occur spontaneously. By dividing the work among multiple processors, such calculations have become possible to make, yielding models that can be compared with real shapes, as ascertained by X-ray crystallography, to check how accurate the predictions are.
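Here's a toy version of that thermodynamic search, using the Metropolis Monte Carlo recipe: nudge the chain, keep the change if the energy drops, and occasionally keep an uphill move so the search can escape local traps. The chain, the energy function, and the temperature are all cartoon stand-ins -- the real projects run detailed molecular simulations with full force fields, not this sketch.

```python
# Toy Metropolis Monte Carlo "folding" of a 20-bead chain in 2D: propose a
# small random move, accept it if the energy drops, or with Boltzmann
# probability otherwise. The energy function is a cartoon, not a force field.

import math
import random

random.seed(0)
N, T = 20, 1.0                              # beads in the chain, "temperature"

def energy(coords):
    """Cartoon energy: bonded beads want distance 1; all pairs mildly attract."""
    e = 0.0
    for i in range(N - 1):                  # bond terms
        d = math.dist(coords[i], coords[i + 1])
        e += (d - 1.0) ** 2
    for i in range(N):                      # weak nonbonded attraction
        for j in range(i + 2, N):
            d = math.dist(coords[i], coords[j])
            e += 0.1 * ((1.0 / d) ** 12 - 2 * (1.0 / d) ** 6)
    return e

coords = [(float(i), 0.0) for i in range(N)]         # start fully extended
e = energy(coords)
for step in range(20_000):
    i = random.randrange(N)                          # pick a bead, nudge it
    old = coords[i]
    coords[i] = (old[0] + random.uniform(-0.2, 0.2),
                 old[1] + random.uniform(-0.2, 0.2))
    e_new = energy(coords)
    if e_new < e or random.random() < math.exp((e - e_new) / T):
        e = e_new                                    # accept the move
    else:
        coords[i] = old                              # reject, restore
print("final energy:", e)
```

Running many such searches in parallel, from different random starts, is (in cartoon form) how the workload gets spread across thousands of volunteer machines.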

Progress is definitely being made, and each time a model "works," information is obtained that informs the next round of analysis. So the stupendous breakthrough seems attainable, and the implications, albeit perhaps not as momentous as an extra-terrestrial radio signal, are nonetheless huge.

I find it thrilling to think of all those hundreds of thousands of laptops out there, churning out data that may someday prove central to understanding diseases like Alzheimer's or ALS, diseases where incorrect protein folding has everything to do with the tragic ensuing pathology. Citizen science is on a roll.

3:21 - February 15, 2010