How the same technology driving facial and voice recognition is helping us rapidly identify novel therapies. The first in our Q&A series.
Abraham Heifets was working at IBM Research on high performance data processing when his career interest veered toward drug discovery. A big data guy, he found himself studying organic chemistry—specifically chemical synthesis planning—and eventually enrolling at the University of Toronto where he earned a PhD in computer science. At Toronto, Heifets met Izhar Wallach, who was writing computational biology and structural algorithms for a small pharmaceutical company.
Down the hall from their computational biology group was a machine learning group run by Geoffrey Hinton, a pioneering scientist known primarily for his work on artificial neural networks. Dr. Hinton’s group was busy inventing all of the core techniques underlying the revolution in machine learning and artificial intelligence.
Dr. Wallach and Dr. Heifets saw the promise of these machine learning techniques. At the same time, datasets that support these machine learning algorithms were starting to take off. So Dr. Heifets and Dr. Wallach co-founded Atomwise, and in 2015 moved their company to the San Francisco Bay Area. (They are currently partnering with Charles River on drug discovery research.)
Eureka spoke with Dr. Heifets recently as part of a multi-part series on the influence of AI in Drug Discovery. Here are his edited responses.
Eureka: Can you explain in layman’s terms how Atomwise’s software works?
AH: Atomwise’s software actually uses the same technology that underlies all of the current work in image recognition and speech recognition. These are the same kinds of approaches, the same neural networks that are the best and most accurate technology that we have for things like facial recognition. It’s why you can unlock your phone by looking at it. It underlies the vision system in self-driving cars. It underlies the speech recognition technology in things like Siri and Alexa. These AI techniques are all based on something called convolutional neural networks, and people are probably using them every day.
The critical piece is the insight into how to apply these techniques from these very different fields. Speech recognition and image recognition are very different fields from molecular recognition, but it turns out that there is a way of mapping the problem and finding a core underlying theme. An image is a two-dimensional grid of pixels and every pixel has red and green and blue color channels. Proteins are 3D, and so instead of setting up a 2D grid, we set up a 3D grid, and every grid point, instead of having red and green and blue, has carbon, oxygen, sulfur, nitrogen, etc., color channels. As soon as you do that encoding, the techniques which work in 2D for image recognition can be translated into 3D for molecular recognition.
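To make the grid encoding described above concrete, here is a minimal sketch in Python. It is an illustration only, not Atomwise's implementation: the function name `voxelize`, the four-element channel list, the grid size, the 1-angstrom resolution, and the toy coordinates are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical channel order, playing the role of red/green/blue in an image.
# A real system would likely track more atom types and richer features.
CHANNELS = ("C", "N", "O", "S")

def voxelize(atoms, origin=(0.0, 0.0, 0.0), size=8, resolution=1.0):
    """Place (element, x, y, z) atoms on a size^3 grid, one channel per element.

    Returns a dict mapping occupied voxel indices (i, j, k) to a list of
    per-channel atom counts. Empty voxels are omitted.
    """
    grid = defaultdict(lambda: [0] * len(CHANNELS))
    for element, x, y, z in atoms:
        if element not in CHANNELS:
            continue  # this toy example ignores elements without a channel
        # Convert each coordinate to an integer voxel index along its axis.
        idx = tuple(int((c - o) // resolution) for c, o in zip((x, y, z), origin))
        if all(0 <= i < size for i in idx):
            grid[idx][CHANNELS.index(element)] += 1
    return dict(grid)

# Made-up coordinates in angstroms, for illustration only.
atoms = [
    ("C", 1.2, 0.4, 3.9),
    ("O", 1.7, 0.1, 3.2),
    ("N", 5.0, 5.5, 2.1),
    ("H", 0.0, 0.0, 0.0),  # no hydrogen channel here, so this atom is skipped
]
grid = voxelize(atoms)
print(grid)
```

Each occupied voxel ends up with one count per element channel, the 3D analogue of a pixel's three color values. A 3D convolutional network could then take such a grid as input, much as a 2D network takes an image.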
Eureka: It sounds like a pretty simple deduction: go from 2D to 3D. Was this solution something that you figured out right away, that is, going from facial recognition to protein recognition?
AH: There are a lot of ideas that sound simple after the fact, but to recognize it for the first time is not necessarily obvious. To get it to work correctly, there’s a lot of hidden complexity and subtlety in how you set up these problems. For example, in the last paper that we published we actually did a deep dive into how you even tell whether these techniques are actually being predictive or not. It turns out there’s incredible depth and subtlety in how you set up the tests. I think what we showed was that the benchmarks that exist are likely to reward teaching to the test rather than learning something deep.
Eureka: When you say it took a lot to get it right, what kinds of tools helped you to structure these experiments?
AH: There are a number of underlying technologies which had to come together to enable the kind of techniques that we’re talking about here. One is fundamental algorithmic breakthroughs. Convolutional neural nets in a practical sense, even for image recognition, didn’t exist that long ago. The big breakthroughs and adoption for image recognition came only in 2012. So this is really new technology. But to support the subtle statistical approaches, you need massive data sets. The last time I checked the NIH’s PubChem database, I think there were 240 million chemical compounds in there. But that didn’t exist 20-30 years ago. If you have small data sets and powerful statistics, you are going to memorize the data set and you’re not actually going to generalize, so you have to match the power of the algorithm to the size and complexity of the data. Of course, not all data is good data. Ninety-eight percent of those data sets don’t pass our quality control filters. The old tenet of garbage in, garbage out holds: if you don’t carefully go through your data, you’re going to learn artifacts, you’re going to fool yourself. That’s part of the reason why we have a team of medicinal chemists and structural biologists, as well as machine learning and computer scientists.
Eureka: Can you give an example of noise vs. good data?
AH: Sure. Alexander Tropsha at UNC-Chapel Hill has written excellent papers on this. There is an example he gives where you’ll see two entries in a database. You’ll have a particular protein and a particular molecule with a particular measured binding affinity, which might be 8.241 nanomolar. Then you’ll see the same protein and the same molecule and the same 8.241 [measure] except it is millimolar. These are two measurements which are very precise, look like the same measurement, except they are off by a factor of a million. If you start digging into it, what you will find is that someone made a typo and hit the wrong key. Unless you are careful about controlling for that kind of thing, you get an answer which is half a million off from whatever the right answer was.
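A check for the kind of unit conflict described above might look like the following Python sketch. It is a hypothetical illustration, not Atomwise's or Dr. Tropsha's actual quality control procedure: the record layout, the names `flag_unit_conflicts` and `UNIT_TO_MOLAR`, and the sample entries are all assumptions.

```python
from collections import defaultdict
import math

# Conversion factors from each concentration unit to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def flag_unit_conflicts(records, rel_tol=1e-6):
    """Find protein/molecule pairs reported with the same number but different units.

    records: iterable of (protein, molecule, value, unit) tuples.
    Returns a list of ((protein, molecule), unit_a, unit_b, ratio), where
    ratio is how far apart the two entries are once converted to molar.
    """
    by_pair = defaultdict(list)
    for protein, molecule, value, unit in records:
        by_pair[(protein, molecule)].append((value, unit))

    flagged = []
    for pair, measurements in by_pair.items():
        for i, (v1, u1) in enumerate(measurements):
            for v2, u2 in measurements[i + 1:]:
                # Identical digits with different units suggest a data-entry slip.
                if u1 != u2 and math.isclose(v1, v2, rel_tol=rel_tol):
                    ratio = (v2 * UNIT_TO_MOLAR[u2]) / (v1 * UNIT_TO_MOLAR[u1])
                    flagged.append((pair, u1, u2, ratio))
    return flagged

# Made-up entries mirroring the example in the interview.
records = [
    ("proteinA", "mol1", 8.241, "nM"),
    ("proteinA", "mol1", 8.241, "mM"),
    ("proteinA", "mol2", 120.0, "nM"),
]
conflicts = flag_unit_conflicts(records)
print(conflicts)
```

Here the two `mol1` entries are flagged because they share the same digits but differ by a factor of about a million once converted to molar. If the two were simply averaged instead, the result would be about 500,000 times the nanomolar value, which is the "half a million off" error mentioned above.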
Eureka: Your technology has been described as a digital matchmaker that uses machine learning to help companies screen potential compounds much faster. What types of targets are you applying this to and are there particular disease areas that the technology is being used more extensively than others?
AH: We have projects running in every major therapeutic area. We’ve recently announced success in ischemic stroke and in Chagas disease. These are very different areas of application; one is neuroscience and one is an infectious disease. About 35% of our projects are in cancer, 22% are in infectious disease, 8% are in neurology, and then on down the list. I think what’s cool about these approaches is that we can work on previously undruggable targets. When you have new technologies, new things are possible. What used to be undruggable now falls under druggable.
We use a structure-based technique, so we need a structure of the protein. Historically, one of the challenges to structure-based techniques was the availability of X-ray crystal structures, because previous approaches using homology models sometimes suffered in the accuracy of their outputs. What we’ve been showing is that actually these statistical approaches are robust to homology models, and we’re getting really good results, even in the cases where we don’t have experimental X-ray crystal structures.
Eureka: What about repurposing drugs that are already being prescribed for certain conditions?
AH: The last time I looked at DrugBank, I think there were around 9,000 compounds that had been through Phase II or later, and so had some human safety data around them. There are billions of molecules that you can buy today. That means that the set of drugs or drug candidates that we have for repurposing is a really small fraction of the molecules that we have access to. So, I think if you ignore the vast majority of space that we can order and get compounds for in just a few weeks, you may be missing really useful drugs.
Eureka: One of the claimed advantages of using this kind of AI technology is that it is not just fast but highly accurate, too. How common a tool is it?
AH: It’s becoming—if it is not already—the standard for drug discovery. We’re currently working on over 250 projects with collaborators in over 20 countries all over the world.
Eureka: What do you think are the biggest challenges in using AI to help discover drugs?
AH: I think the burden of proof is always on the developers of the technology. The key to converting non-believers is demonstrating repeated success, demonstrating that it’s a reliable technology and that it provides value. It’s insufficient to have benchmarks. It’s insufficient to predict yesterday’s weather. You have to predict tomorrow’s weather, and get it right over and over and over again. I think in this field, whether it is AI or anything else, you have to be able to show a couple dozen successes when nobody knew what the answer was. I think that’s the biggest challenge in driving the adoption of AI for discovering drugs.
Eureka: What do you think will be the “Next Big Thing” in the application of AI in drug discovery?
AH: I am really excited about the work that is occurring in computational biology. Biology is a really tough problem, a really hard domain. There are a lot of smart people working very hard on a number of different techniques, from literature analysis to various kinds of multi-omics approaches, aiming to discover new therapeutic targets. I think finding that protein is the critical bottleneck and will be the “Next Big Thing” in creating new medicines.
Eureka: Lastly, robots are ubiquitous in entertainment. Who is your favorite robot?
AH: Robotic soccer players. Before I got involved in drug discovery I worked on the artificial intelligence and strategy system for Cornell’s RoboCup, an autonomous team of robots that were world champions in 2000 and 2001. If you think about entertainment, there are a lot of people around the world who get enjoyment from the World Cup. RoboCup is trying to field a team of humanoid robots that will play FIFA-rules soccer by 2050, and beat the human champion team.
Eureka: Do you play soccer?
AH: No, I’m a nerd. I couldn’t play soccer to save my life. But I do enjoy watching robots play soccer and have great appreciation for the work being done by RoboCup.
Thanks for tuning in. Our next Q&A in this series on AI in Drug Discovery will be with Ola Engkvist, PhD, a computational chemist and section head of the hit discovery department in AstraZeneca’s Discovery Sciences unit. You can follow our series here.