Riccardo Sabatini: Can New Technology Decode The Biggest Data Set Of All? Scientist Riccardo Sabatini says we have the technology to read the human genome and predict things like height, eye color, age — all from a vial of blood.

GUY RAZ, HOST:

It's the TED Radio Hour from NPR. I'm Guy Raz. And on the show today, ideas about how big data is helping us understand our world and ourselves.

Do you think it's fair to say that human beings are basically, like, the equivalent of ones and zeros?

RICCARDO SABATINI: Well, we are - I guess many people will say that we are a little more. But in a way, we are an expression of how nature is playing his own game. We are, in a way, a biological representation of an underlying set of rules that works with numbers. And for me, it's one of the most beautiful complexities that you can ever think of. It's mesmerizing.

RAZ: This is Riccardo Sabatini. And by training, he is a physicist.

SABATINI: Yep. I'm a theoretical physicist by training, indeed.

RAZ: Riccardo works at a company called Human Longevity. And there, he applies physics to process, manage and understand one of the most complicated data sets out there, the human genome.

SABATINI: That's the most amazing thing that happened at the beginning of this millennia. We started to have access to the digital representation of our genome. We started to digitalize matter. We have digital representation of atoms and proteins. We are starting to digitalize life.

RAZ: And digitizing life, understanding the data that makes up our genome, means that we can study how to make our lives better and healthier. And it's something Riccardo has been working on for years now, as he explained on the TED stage.

SABATINI: So for me, everything started many, many years ago when I met the first 3-D printer. The concept was fascinating. A 3-D printer need three elements - a bit of information, some raw material, some energy - and it can produce any object that was not there before. Then I realized that I actually always knew a 3-D printer - and everyone does - it was my mom.

SABATINI: So my mom takes three elements - a bit of information is between my father and my mom, in this case; raw elements and energy in the same media - that is food - and after several months, produces me. And I was not existing before. Well, what amount of information takes to build and assemble a human? Is it much? Is it a little? How many thumb drives you can fill? OK. Now, you can run some numbers, and that happens to be quite an astonishing number.

So the number of atoms - the file that I will save in my thumb drive to assemble a little baby, it will actually fill an entire Titanic of thumb drives multiplied 2,000 times. This is the miracle of life. Every time you see, from now on, a pregnant lady, she's assembling the biggest amount of information that you will ever encounter. Forget big data. Forget anything you heard of. This is the biggest amount of information that exists.

RAZ: This is unbelievable. I mean, the amount of data it takes to create each human being - each and every single one of us will be the equivalent to 2,000 Titanics filled with thumb drives. Exactly. So could you even begin to compare the amount of data that we generate versus the amount of data that a computer generates?

SABATINI: Oh, no, I think. I mean, the order of magnitude of how complicated we are is something that will eat every single database that we know.

RAZ: Wow.

SABATINI: In year 2020, we believe that we will have sequenced several hundred million genomes. And at that point, YouTube will look like a small hard drive of a kid. I mean, the amount of data that we will require to map the human diversity will overrun every single database that we ever encountered before. It's the biggest data that you can ever think of.

RAZ: Yeah. So if each human is 2,000 Titanic ships filled with hard drives, how do you process that?

SABATINI: Yeah, the nice thing is that nature is much smarter than a theoretical physicist, so he found a language to embed and compress this complexity in a much compact form. And that's what is the DNA, the very fundamental part that give the instructions to make the 2,000 Titanic work.

RAZ: I got you. So the DNA is like a compressed version of those 2,000 Titanics?

SABATINI: So the DNA is much shorter, much more complex and much more compact. And in February, we decided to print it to actually show, in books, how large is an instruction manual, and it's about 3 billion letters.

RAZ: Wow.

SABATINI: And if you print them at character 6, happens to be 262,640 pages, the precise instruction manual to rebuild Craig Venter.

RAZ: Craig Venter, the famous geneticist, who was, I guess, one of the first to map the human genome.

SABATINI: Yes.

RAZ: So you actually printed out the 3 billion letters of Craig Venter's DNA?

SABATINI: Exactly.

RAZ: And, like, how many volumes did it take?

SABATINI: (Laughter) A hundred and seventy-five volumes of a 1,600 pages.

RAZ: Like encyclopedia-sized.
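
As a rough check on the figures quoted in this exchange, the short Python sketch below works out how many letters land on each printed page and how many pages each volume holds on average. Only the three input numbers come from the conversation; the function and variable names are illustrative assumptions, and the result is back-of-the-envelope arithmetic rather than a description of how the books were actually typeset.

```python
# Figures as quoted in the interview (approximate by the speakers' own account).
LETTERS = 3_000_000_000   # "about 3 billion letters"
PAGES = 262_640           # "262,640 pages"
VOLUMES = 175             # "a hundred and seventy-five volumes"


def book_stats(letters: int, pages: int, volumes: int) -> dict:
    """Return average letters per page and average pages per volume."""
    return {
        "letters_per_page": letters / pages,
        "pages_per_volume": pages / volumes,
    }


stats = book_stats(LETTERS, PAGES, VOLUMES)
print(f"letters per page: {stats['letters_per_page']:.0f}")
print(f"pages per volume: {stats['pages_per_volume']:.1f}")
```

The average pages per volume comes out somewhat below the 1,600 mentioned, which may simply mean that 1,600 was a rounded or maximum size for a volume rather than an exact count.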

SABATINI: So welcome on stage Dr. Craig Venter.

RAZ: And you wheeled these out, Craig Venter's genome...

SABATINI: Yeah, yeah.

RAZ: ...Onto the TED stage.

SABATINI: Exactly.

SABATINI: Not the man in his flesh. But for the first time in history, this is the genome of a specific human printed page-by-page, letter-by-letter, 262,000 pages of information, 450 kilogram. And now for the first time I can do something funny. I can actually poke inside it and read. So let me take some interesting book, like this one. Chromosome 14, book 132.

SABATINI: A-T-T-C-T-T-G-A-T-T. This human is lucky because if you will miss just two letters in this position, two letters over 3 billion, he will be condemned to a terrible disease - cystic fibrosis. We have no cure for it. We don't know how to solve it, and it's just two letter of difference for what we are.

So now that I have your attention, the next question is how do I read it? How do I make sense out of it? Well, for how good you can be assembling Swedish furnitures, this instruction manual is nothing you can crack in your life. And so...

SABATINI: ...We're going to use a technology called machine learning, OK? We build a machine and we train a machine - well, not exactly one machine, many, many machines - to try to understand what are the letters and what do they do. So we asked, can we read the books and predict your height?

Well, we actually can with 5 centimeters of precision. Can we predict the eye color? Yeah, we can, 80 percent accuracy.

RAZ: OK, just to break in for a sec - by using big data, Riccardo and his team can take a random sample of DNA, pick through billions of letters of genetic code and then predict the height, the eye color, all kinds of physical traits of the person that that DNA came from. They can even assemble those traits into a biologically accurate human face.

SABATINI: It's a little complicated because a human face is scattered around million of these letters. We had to learn and teach a machine what is a face and embed and compress it. So we take the real face of a subject and we run it in our algorithm, OK? The results that I show you right now, this is the prediction we have.

RAZ: So, Riccardo, you are showing this face on the TED stage. I was there. This computer-generated face of a woman, and it's side-by-side with her real face. And it was amazingly accurate.

SABATINI: Yeah. That's what we are up to, to trace the information from the books to the body. And this is the biggest challenge of the millennium.

SABATINI: So why do we do this? We do it because the same technology and the same approach, the machine learning of this code, is helping us to understand how we work, how your body work, how your body ages, how disease generate in your body, how your cancer grow and develop, how drugs work and if they work on your body. It's called personalized medicine. It is a particularly complicated challenge.

The more we will learn, every time we will be confronted with decisions that we never had to face before about life, about death, about parenting. This must be a global conversation. We must start to think the future we're building as a humanity without fear but with the understanding that the decisions that we will take in the next year will change the course of history forever.

RAZ: In the future, Riccardo says, the same technology that can predict a face from a DNA sample today could be used to predict and treat disease. In fact, some of that future is already here. You might have even heard about technology that allows anyone to sequence their genome and identify potential health problems. But as with any massive data set, interpreting that information is pretty tricky.

So you've had your genome sequenced, right? You've done this.

SABATINI: I did. I did. And it was an interesting time.

RAZ: What happened?

SABATINI: So, I mean, I had a couple of interesting results, some cardiovascular complications that are, let's say, present in my family but we've never been able to explain. And so there is this information, even for a professional. It's not something you should read by yourself because...

RAZ: Yeah.

SABATINI: ...At the beginning, I was, oh, my God, I'm going to die tomorrow.

RAZ: You freaked out.

SABATINI: (Laughter) Yeah.

RAZ: Yeah, it's understandable, right? I mean, there's all this information staring you right in the face.

SABATINI: But in reality, when you chat with the doctor and you are in a clinic, you discover that it's something that medicine developed lots of therapeutics around it. And I would rather prefer to know it than not because not means not taking actions and letting the roll of the dice know when or how your complications are growing. Now I have control of it.

RAZ: Yeah. So when you look down the road, like, five or 10 or 20 years from now - right? - how will the world be different? Because humans will finally have unlocked the ability to read and analyze and maybe even change the biggest source of data in the world, which is our genome.

SABATINI: So we survive and we have medicine that is amazing today and what we believe is amazing. But in 20 years, personalized medicine will really take the lead. It means every doctor and every pill that we will ever take, we'll know exactly if it will work or not for our genome, for our body.

And it will be so embedded in the mind of our future generation, that our kids will laugh at us on how we survive on a medicine that is not based on these assumptions. And it will be a cultural moment where precision medicine will be the verb and the past will look like we're in the caves, trying to understand how to switch on the fire.

RAZ: Data scientist Riccardo Sabatini. You can see his entire talk and how his team can predict faces from DNA at TED.com.

Copyright © 2016 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.

NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.