Biobanks offer a rich source of data for biomed startups, but the route to deriving maximum benefit from them is not always straightforward. We talk to Matt Brauer, Vice President of Data Science at Maze Therapeutics, about his experience with the UK Biobank, FinnGen and other consortia, and the role that open-sourcing data-management processes has played in delivering business success.
Hi Matt, great to speak to you. Can we start by talking a bit about the science behind what you do at Maze Therapeutics?
Sure! The premise of Maze is that we use genetics data throughout the entire drug discovery process – we’re making a concerted effort to not only find and validate drug targets, but also using the genetic variation data to build what is known as an allelic series.
This means that we’re looking for a range of genetic variants that change the risk of the disease. In a way, this is a bit like a dose–response curve, but what it means is that you’ve got a really strong signal that your target is somehow causative or central to the disease.
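To make the dose–response analogy concrete, here is a minimal illustrative sketch (not Maze's actual pipeline, and with entirely made-up variant names and effect sizes): an allelic series is a set of variants in one gene whose effects on disease risk span a range, and a monotonic trend from loss- to gain-of-function supports a causal, dose-dependent role for the target.

```python
# Hypothetical allelic series for one gene. Each entry is
# (variant_id, predicted functional impact, log-odds of disease) - illustrative numbers only.
variants = [
    ("p.Arg123Ter", "loss-of-function", -0.90),
    ("p.Gly45Asp",  "partial loss",     -0.40),
    ("p.Ala10Ala",  "neutral",           0.00),
    ("p.Glu77Lys",  "gain-of-function",  0.55),
]

# Sort by effect size; reading down the series is the genetic analogue
# of reading along a dose-response curve.
series = sorted(variants, key=lambda v: v[2])
for vid, impact, beta in series:
    print(f"{vid:12s} {impact:18s} beta={beta:+.2f}")
```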
And presumably you can then carry that forward into the drug discovery side?
That’s right – creating this allelic series gives us hypotheses about how the protein target operates, and how we might be able to modulate the course of the disease. Looking at variation in the protein structure also gives us clues into where we can best target it, and ultimately design drug molecules.
Moving into the clinic, if we can identify variation that allows us to recruit patients who are more likely to have rapidly progressing disease, for example, that gives us a real advantage in running our clinical trials. Variation really is the raw material that we work with.
I guess that means you need datasets that exhibit wide variation as well?
That’s a good point. In fact, it ties into one of the major challenges in drug discovery, that of variation within biobank data.
For many years we’ve been involved with a number of large-scale biobank datasets and consortia, including the UK Biobank and FinnGen. But one problem with European biobanks is that, from a global perspective, the European population is relatively small and not especially diverse, and those recruited have been very specifically selected. Sometimes this can be useful, as in Finland, for example, which has a ‘bottleneck’ population that has exposed some deleterious variations, but the number of variants present is very limited. To fulfill the promise of biobanks for drug discovery, we really need to move beyond the European genome.
How have you been trying to do that?
Recently we’ve joined Genes & Health, which contains data from British residents who are almost entirely of Bangladeshi and Pakistani descent. This will be important for us in resolving the difficulties caused by our over-reliance on European genomes.
We’re also involved in some disease-specific consortia, including one with the New York Genome Center focused on amyotrophic lateral sclerosis, and we’re supporting research at the University of Toronto on polycystic kidney disease and getting access to some of that data too.
That sounds interesting, but are there any other challenges you find when dealing with biobank data?
Yes, things aren’t perfect by any means. One problem is that the data is rather siloed and fragmented – there’s not enough longitudinal data, acquired over long periods of time, that’s correlated to genetics data. We’d love to be able to dive in deep, but that’s currently very difficult.
Another big challenge is the privacy aspect of biobank data – it’s not clear to me that this has really been ironed out from a legal standpoint, and that’s an unresolved issue when thinking about using data in the long term, for sure.
Thinking about fragmentation, do you find difficulties with cross-correlating biobank-derived datasets?
Yes, particularly when aligning phenotypes across different sources because there are different methods of measuring the same things – there are various ontologies and different vocabularies. For example, FinnGen is an aggregation of multiple biobanks, and each of them collects data in a slightly different way. So, a term used in one biobank doesn’t necessarily mean the same as a term from another, which makes the statistics very difficult.
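The harmonisation problem described above can be sketched in a few lines. This is a hypothetical illustration, not the actual FinnGen or UK Biobank vocabularies: each source's local term is mapped to a shared code (loosely modelled on ICD-10-style codes) before case counts are pooled, and unmappable terms are dropped rather than silently miscounted.

```python
# Hypothetical term mapping: (source, local_term) -> shared phenotype code.
# Sources, terms, and codes are illustrative only.
HARMONISED = {
    ("biobank_a", "myocardial infarction"): "I21",
    ("biobank_b", "heart attack"):          "I21",
    ("biobank_a", "t2d"):                   "E11",
    ("biobank_b", "type 2 diabetes"):       "E11",
}

def pool_cases(records):
    """Count cases per harmonised code, skipping terms with no mapping."""
    counts = {}
    for source, term in records:
        code = HARMONISED.get((source, term.lower()))
        if code is not None:
            counts[code] = counts.get(code, 0) + 1
    return counts

records = [("biobank_a", "Myocardial Infarction"),
           ("biobank_b", "heart attack"),
           ("biobank_b", "type 2 diabetes")]
print(pool_cases(records))  # {'I21': 2, 'E11': 1}
```

In practice this mapping step is where most of the statistical difficulty hides: if two biobanks' terms are mapped to the same code but were measured differently, the pooled counts inherit that inconsistency.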
In a similar way, different biobanks work in different computing environments, which can complicate matters. We’re actually pretty familiar with this, because we’ll make a discovery in, say, the UK Biobank, and we want to replicate it in FinnGen. We often have to work with summary statistics as opposed to the raw data.
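Working from summary statistics rather than raw genotypes typically means combining per-study effect estimates. A standard way to do this, sketched below with made-up numbers, is inverse-variance-weighted fixed-effect meta-analysis (the scheme used by common GWAS meta-analysis tools); the cohort labels are illustrative, not actual UK Biobank or FinnGen results.

```python
import math

def ivw_meta(estimates):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    estimates: list of (beta, standard_error) pairs, one per study.
    Returns the pooled (beta, standard_error).
    """
    weights = [1.0 / se**2 for _, se in estimates]
    pooled_beta = sum(w * b for (b, _), w in zip(estimates, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se

# Made-up effect sizes for one variant in two cohorts
cohort_1 = (0.12, 0.03)   # e.g. a discovery cohort's summary statistic
cohort_2 = (0.10, 0.05)   # e.g. a replication cohort's summary statistic
beta, se = ivw_meta([cohort_1, cohort_2])
print(f"pooled beta = {beta:.3f}, se = {se:.3f}")  # pooled beta = 0.115, se = 0.026
```

Note that this only works cleanly when the two studies' phenotypes and effect alleles are already aligned, which loops back to the harmonisation problem above.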
A final point is instrumental modalities. Let’s say we’re looking at cardiac MRI images – it’s very difficult to standardize images when they’re acquired on instruments from different manufacturers. The Allotrope Data Format is a good example of an attempt to fix this issue, but standards are always going to lag behind innovation, and I think we have to learn to live with that.
Even though data fragmentation remains an issue, is it fair to say that collaboration is resolving fragmentation within the research effort itself?
I’d agree with you, and I think that regarding the use of biobank data, the consortium approach to research is now dominating. In the past, I’d often be approached by people wanting to sell biobank data to us and expecting to make a quick buck from it. Even at the time, that displayed a rather naive understanding of the role that genetics data plays in drug discovery.
But thanks to the UK Biobank, FinnGen, and others, I don’t see much of that anymore – they showed how the collaborative approach can work, and now there’s a belief that if everybody works together, we can get some funding, and everybody can benefit.
It seems that you’ve been at the confluence of computational biology and data management for much of your career?
Yes, you’re right. After my Ph.D. in 2000 I started a postdoc with David Botstein at Stanford, and then later at Princeton, working on a so-called continuous culture of yeast, which was set up to be in permanent exponential growth. It was a beautiful experimental system, and it showed me for the first time how to manage and think about genome-scale data, and how to derive meaningful conclusions from that data.
That got me thinking about statistical rigor, because after all, pretty much everything in biology comes back to statistics. Tied into that is the notion of reproducible pipelines for data processing and analysis: if you can’t start with the same data and get precisely the same answer every time you run your pipeline, then there’s something wrong.
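The "same data in, same answer out" principle can be made testable in code. The sketch below is an illustrative toy, not any particular pipeline: fingerprint the input data together with the analysis parameters for the provenance record, run the analysis twice, and check that the outputs match exactly.

```python
import hashlib
import json

def fingerprint(data, params):
    """Stable hash of input data plus parameters, for provenance records."""
    payload = json.dumps({"data": data, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def pipeline(data, params):
    """Toy deterministic analysis: threshold the values, then average."""
    kept = [x for x in data if x >= params["min_value"]]
    return sum(kept) / len(kept)

data = [0.2, 0.8, 0.5, 0.9]
params = {"min_value": 0.5}

run1 = pipeline(data, params)
run2 = pipeline(data, params)
assert run1 == run2  # reproducible: identical inputs give identical outputs
print(fingerprint(data, params)[:12], run1)
```

Recording the fingerprint alongside each result means that, months later, you can verify exactly which data and settings produced a given answer.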
It’s interesting that you say that data-handling processes need to be locked-down, because don’t biotech startups need to be agile?
That’s an interesting point. I agree that for early-stage startups, the premium is on agility and moving quickly. For example, at Maze, we’re in phase one clinical trials now, which after just three years is pretty fast! What’s interesting is that at this point we’re transitioning to a new phase of our business, and we’re having to deal much more with regulation-driven processes.
That requires much more emphasis on solid processes and rigorous record-keeping, which for some people can be a challenge – scientists are often in a hurry to do their experiments, get their results, and move on to the next thing. But when you’re working with high-dimensional data in the clinical phase, you must make sure that all the important stuff is captured – you’ve got no choice. You’ve got to know what you’ve done at every step.
So how do you tackle that at Maze?
Maze has taken an approach that many biomed startups are now following – we open-source things as quickly as we can. This means that process development becomes shared, so we’re routinely challenged on the utility of what’s been created. The result is that we get better solutions than we would have done if it had all been kept in-house.
There’s a trap you can get into, especially in big pharma, where you think you can do it all yourself. Maybe you can, but what working at Genentech proved to me is that there’s a risk you converge on solutions that serve your needs but don’t become the state of the art further down the line. It doesn’t matter how smart your people are: if you’re only interacting with people from within your company, you’re missing out on a lot of other insights. But at Maze, by using a collaborative approach, we’re steering the state of the art, but not exclusively creating it.
That covers the process development part of the solution, but what about the data itself?
I’m glad you asked that, because in my experience most scientists have some desire to get their fingers in the data. Thinking back to my postdoc days, when you run a big experiment, there’s a period of time where you’re spending 24 hours a day trying things out and getting your head around the data. But if you hand the analysis over to someone else, you miss out on that process, and end up not really understanding the data.
As such, one of our main motivations at Maze is to build the tooling that allows the scientists that have done the actual work to easily visualize their data, to try out different dimensionality reduction methods for example, and to think about how it might be analyzed in a more rigorous fashion. That playing with the data is a key part of the discovery process. And thinking through to my current work, that’s what’s made our collaboration with Paradigm4 so enjoyable, because the REVEAL: Biobank module is pretty fast, it integrates with tools that the geneticists are really happy to use, and it means that the computational biology group is not a bottleneck for their analysis.
So, the bioinformatics team isn’t just about number crunching, but about helping identify the best processes too?
Part of the role of bioinformatics, as I see it, is to help build the right culture within a startup as it matures. One way of doing this is to facilitate record-keeping in as automated a fashion as possible. We’re not trying to get in the way of the science, but we need to make sure that it’s reproducible, and later in the pipeline, that it can be easily accessed.
Thinking about my experience at Maze, everyone now understands why we need to document certain things. They also understand how we can use that data later, and then maybe reuse it, while not losing track of the reasoning behind those initial insights. It’s not very useful if you just keep track of your current state of belief: you need to know how you got there, and how different data would give you a different belief. It’s taken a while, but I think that we’ve done a good job of putting this mindset in place – and the benefits will flow through to the company’s success, and ultimately to patients.
To finish off our discussion, if you had to pick three grand challenges in biotechnology, what would they be?
I’ve already covered the first one – getting round our reliance on the European genome. As for the other two, I think a big one is tackling neurodegenerative disease. As we’ve found from our work with amyotrophic lateral sclerosis, it’s very difficult to make headway, and we have huge amounts of risk from different sources.
The final challenge, I think, is getting medicines out to populations that have been underserved globally. Currently the cost of developing drugs is very high – and that will be especially true for precision medicine. If we can find a way to dramatically reduce that cost and get the medicines out to people who historically have not been able to get them, that would be a fantastic achievement.
It’s been great to have a chat, Matt – thanks for your insights into using biobanks and handling data management within a biotech startup.
Our conversation with Matt certainly highlighted for us the challenge faced by many scientists in engaging with biobank data and using it effectively. At Paradigm4, this is something we’ve thought deeply about, with the result being our REVEAL™: Biobank app. With the flexibility to adapt to your requirements out of the box, and the scalability to respond to whatever dataset you’re looking at, we think it’s worth considering for any project where you need to make the most of biobank data.
For more information about our biobank tools, or how other products in our portfolio could help transform your data analysis, contact us at firstname.lastname@example.org.
Dr Matt Brauer is currently Vice President of Data Science at Maze Therapeutics in San Francisco, having previously had roles at biotech research company Genentech, as well as research positions in genetics at Stanford University School of Medicine and Princeton University. His current focus is on using statistical and bioinformatics tools to extract insights from human genetics and functional genomics data, to advance understanding of how to more effectively treat patients with severe diseases.