Blog: The Cutting Edge

How to grow a biomed startup: The role of biobanks and open-source process development

Biobanks offer a rich source of data for biomed startups, but the route to deriving maximum benefit from them is not always straightforward. We talk to Matt Brauer, Vice President of Data Science at Maze Therapeutics, about his experience with the UK Biobank, FinnGen and other consortia, and the role that making data-management processes open-source has had in delivering business success.

Hi Matt, great to speak to you. Can we start by talking a bit about the science behind what you do at Maze Therapeutics?

Sure! The premise of Maze is that we use genetics data throughout the entire drug discovery process – we’re making a concerted effort not only to find and validate drug targets, but also to use genetic variation data to build what is known as an allelic series.

This means that we’re looking for a range of genetic variants that change the risk of the disease. In a way, this is a bit like a dose–response curve, but what it means is that you’ve got a really strong signal that your target is somehow causative or central to the disease.
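
To make the dose–response analogy concrete, here’s a minimal sketch of the idea – the numbers are invented and this is not Maze’s actual pipeline – checking that variants which disrupt a gene more strongly also shift disease risk more:

# Toy illustration of the allelic-series idea: if a gene is truly causal, variants
# that hit its function harder should shift disease risk further, a bit like a
# dose-response curve. All numbers below are invented for this example.
from scipy.stats import spearmanr

functional_impact = [0.1, 0.3, 0.5, 0.8, 1.0]    # benign -> severe loss of function
disease_odds_ratio = [1.0, 1.2, 1.6, 2.3, 3.1]   # estimated risk per variant

rho, p = spearmanr(functional_impact, disease_odds_ratio)
print(f"rank correlation = {rho:.2f} (p = {p:.3f})")   # a strong, monotonic signal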

And presumably you can then carry that forward into the drug discovery side?

That’s right – creating this allelic series gives us hypotheses about how the protein target operates, and how we might be able to modulate the course of the disease. Looking at variation in the protein structure also gives us clues into where we can best target it, and ultimately design drug molecules.

Moving into the clinic, if we can identify variation that allows us to recruit patients who are more likely to have rapidly progressing disease, for example, that gives us a real advantage in doing our clinical trials. Variation really is the raw material that we work with.

I guess that means you need datasets that exhibit wide variation as well?

That’s a good point. In fact, it ties into one of the major challenges in drug discovery, that of variation within biobank data.

For many years we’ve been involved in a number of large-scale biobank datasets and consortia including the UK Biobank and FinnGen. But one problem with European biobanks is that from a global perspective, the European population is relatively small, it’s not especially diverse, and those recruited have been very specifically selected. Sometimes this can be useful, as in Finland for example, which has a ‘bottleneck’ population that has exposed some deleterious variations, but the number of variants present is very limited. To fulfill the promise of biobanks for drug discovery, we really need to move beyond the European genome.

How have you been trying to do that?

Recently we’ve joined Genes & Health, which contains data from British residents who are almost entirely of Bangladeshi and Pakistani descent. This will be important for us in resolving the difficulties caused by our over-reliance on European genomes.

We’re also involved in some disease-specific consortia, including one with the New York Genome Center focused on amyotrophic lateral sclerosis, and we’re supporting research at the University of Toronto on polycystic kidney disease and getting access to some of that data too.

That sounds interesting, but are there any other challenges you find when dealing with biobank data?

Yes, things aren’t perfect by any means. One problem is that the data is rather siloed and fragmented – there’s not enough longitudinal data, acquired over long periods of time, that’s correlated to genetics data. We’d love to be able to dive in deep, but that’s currently very difficult.

Another big challenge is the privacy aspect of biobank data – it’s not clear to me that this has really been ironed out from a legal standpoint, and that’s an unresolved issue when thinking about using data in the long-term, for sure.

Thinking about fragmentation, do you find difficulties with cross-correlating biobank-derived datasets?

Yes, particularly when aligning phenotypes across different sources because there are different methods of measuring the same things – there are various ontologies and different vocabularies. For example, FinnGen is an aggregation of multiple biobanks, and each of them collects data in a slightly different way. So, a term used in one biobank doesn’t necessarily mean the same as a term from another, which makes the statistics very difficult.
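
As a purely illustrative sketch of that harmonization step – the terms and codes below are invented, not drawn from FinnGen or any real ontology – the first task is simply to map each source’s vocabulary onto a shared code before running any joint statistics:

# Illustrative sketch only: mapping biobank-specific phenotype terms onto a shared
# vocabulary before joint analysis. Terms and codes are invented for the example.
HARMONIZATION_MAP = {
    "biobank_a": {"myocardial infarction": "HEART_ATTACK",
                  "MI (self-reported)": "HEART_ATTACK"},
    "biobank_b": {"I21 acute myocardial infarction": "HEART_ATTACK"},
}

def harmonize(source: str, raw_term: str) -> str:
    """Return the shared phenotype code, or flag the term for manual review."""
    try:
        return HARMONIZATION_MAP[source][raw_term]
    except KeyError:
        return f"UNMAPPED::{source}::{raw_term}"

print(harmonize("biobank_a", "MI (self-reported)"))   # -> HEART_ATTACK
print(harmonize("biobank_b", "unstable angina"))      # -> UNMAPPED::biobank_b::unstable angina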

In a similar way, different biobanks work in different computing environments, which can complicate matters. We’re actually pretty familiar with this, because we’ll make a discovery, say, in the UK Biobank, and then want to replicate it in FinnGen. We often have to work with summary statistics as opposed to the raw data.

A final point is instrumental modalities. Let’s say we’re looking at cardiac MRI images – it’s very difficult to standardize images when they’re acquired on instruments from different manufacturers. The Allotrope Data Format is a good example of an attempt to fix this issue, but standards are always going to lag behind innovation, and I think we have to learn to live with that.

Despite data fragmentation remaining an issue, is it fair to say that collaboration is resolving fragmentation within the research effort itself?

I’d agree with you, and I think that regarding the use of biobank data, the consortium approach to research is now dominating. In the past, I’d often be approached by people wanting to sell biobank data to us and expecting to make a quick buck from it. Even at the time, that displayed a rather naive understanding of the role that genetics data plays in drug discovery.

But thanks to the UK Biobank, FinnGen, and others, I don’t see much of that anymore – they showed how the collaborative approach can work, and now there’s a belief that if everybody works together, we can get some funding, and everybody can benefit.

It seems that you’ve been at the confluence of computational biology and data management for much of your career?

Yes, you’re right. After my Ph.D. in 2000 I started a postdoc with David Botstein at Stanford, and then later at Princeton, working on a so-called continuous culture of yeast, which was set up to be in permanent exponential growth. It was a beautiful experimental system, and it showed me for the first time how to manage and think about genome-scale data, and how to derive meaningful conclusions from that data.

That got me thinking about statistical rigor, because after all, pretty much everything in biology comes back to statistics. Tied into that is the notion of reproducible pipelines for data processing and analysis: if you can’t start with the same data and get precisely the same answer every time you run your pipeline, then there’s something wrong.
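
One simple way to make that “same data in, same answer out” expectation testable – a sketch only, with placeholder file names and a stand-in processing step – is to fingerprint the pipeline’s inputs and outputs and compare runs:

# Illustrative sketch: checking that a pipeline is reproducible by hashing its
# output across repeated runs. File names and the pipeline step are placeholders.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Fingerprint a file so two runs can be compared byte-for-byte."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def run_pipeline(in_path: Path, out_path: Path) -> None:
    # Placeholder for the real analysis step; it must be deterministic
    # (fixed seeds, pinned software versions) for the check below to pass.
    out_path.write_text(in_path.read_text().upper())

input_file, output_file = Path("variants.txt"), Path("results.txt")
input_file.write_text("example input data\n")

run_pipeline(input_file, output_file)
first = sha256(output_file)
run_pipeline(input_file, output_file)
assert sha256(output_file) == first, "pipeline is not reproducible"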

It’s interesting that you say that data-handling processes need to be locked-down, because don’t biotech startups need to be agile?

That’s an interesting point. I agree that for early-stage startups, the premium is on agility and moving quickly. For example, at Maze, we’re in phase one clinical trials now, which after just three years is pretty fast! What’s interesting is that at this point we’re transitioning to a new phase of our business, and we’re having to deal much more with regulation-driven processes.

That requires much more emphasis on solid processes and rigorous record-keeping, which for some people can be a challenge – scientists are often in a hurry to do their experiments, get their results, and move on to the next thing. But when you’re working with high-dimensional data in the clinical phase, you must make sure that all the important stuff is captured – you’ve got no choice. You’ve got to know what you’ve done at every step.

So how do you tackle that at Maze?

Maze has taken an approach that many biomed startups are now following – we get things open-source as quickly as we can. This means that the process development becomes shared, so we’re routinely challenged on the utility of what’s been created. The result is that we get better solutions than we would have done if it had all been kept in-house.

There’s a trap you can fall into, especially in big pharma, where you think you can do it all yourself. Maybe you can, but what working at Genentech proved to me is that there’s a risk you converge on solutions that serve your needs but don’t become the state of the art further down the line. It doesn’t matter how smart your people are: if you’re only interacting with people from within your company, you’re missing out on a lot of other insights. But at Maze, by using a collaborative approach, we’re steering the state of the art, but not exclusively creating it.

That covers the process development part of the solution, but what about the data itself?

I’m glad you asked that, because in my experience most scientists have some desire to get their fingers in the data. Thinking back to my postdoc days, when you run a big experiment, there’s a period of time where you’re spending 24 hours a day trying things out and getting your head around the data. But if you hand the analysis over to someone else, you miss out on that process, and end up not really understanding the data.

As such, one of our main motivations at Maze is to build the tooling that allows the scientists that have done the actual work to easily visualize their data, to try out different dimensionality reduction methods for example, and to think about how it might be analyzed in a more rigorous fashion. That playing with the data is a key part of the discovery process. And thinking through to my current work, that’s what’s made our collaboration with Paradigm4 so enjoyable, because the REVEAL: Biobank module is pretty fast, it integrates with tools that the geneticists are really happy to use, and it means that the computational biology group is not a bottleneck for their analysis.
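
The kind of hands-on exploration Matt describes might look something like this minimal sketch, which uses random numbers as a stand-in for real measurements and ordinary PCA as the dimensionality reduction – purely illustrative, not Maze’s or Paradigm4’s actual tooling:

# Illustrative sketch: a quick dimensionality reduction so scientists can eyeball
# sample structure. Random numbers stand in for real genomics measurements.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 500))        # 200 samples x 500 features

embedding = PCA(n_components=2).fit_transform(samples)

plt.scatter(embedding[:, 0], embedding[:, 1], s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Quick look at sample structure")
plt.show()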

So, the bioinformatics team isn’t just about number crunching, but about helping identify the best processes too?

Part of the role of bioinformatics, as I see it, is to help build the right culture within a startup as it matures. One way of doing this is to facilitate record-keeping in as automated a fashion as possible. We’re not trying to get in the way of the science, but we need to make sure that it’s reproducible, and later in the pipeline, that it can be easily accessed.

Thinking about my experience at Maze, everyone now understands why we need to document certain things. They also understand how we can use that data later, and then maybe reuse it, while not losing track of the reasoning behind those initial insights. It’s not very useful if you just keep track of your current state of belief: you need to know how you got there, and how different data would give you a different belief. It’s taken a while, but I think that we’ve done a good job of putting this mindset in place – and the benefits will flow through to the company’s success, and ultimately to patients.

To finish off our discussion, if you had to pick three grand challenges in biotechnology, what would they be?

I’ve already covered the first one – getting round our reliance on the European genome. As for the other two, I think a big one is tackling neurodegenerative disease. As we’ve found from our work with amyotrophic lateral sclerosis, it’s very difficult to make headway, and we have huge amounts of risk from different sources.

The final challenge, I think, is getting medicines out to populations that have been underserved globally. Currently the cost of developing drugs is very high – and that will be especially true for precision medicine. If we can find a way to dramatically reduce that cost and get the medicines out to people who historically have not been able to get them, that would be a fantastic achievement.

It’s been great to have a chat Matt – thanks for your insights into using biobanks and handling data-management within a biotech startup.

Our conversation with Matt certainly highlighted for us the challenge faced by many scientists in engaging with biobank data and using it effectively. At Paradigm4, this is something we’ve thought deeply about, with the result being our REVEAL™: Biobank app. With the flexibility to adapt to your requirements out of the box, and the scalability to respond to whatever dataset you’re looking at, we think it’s worth considering for any project where you need to make the most of biobank data.

For more information about our biobank tools, or how other products in our portfolio could help transform your data analysis, contact us at lifesciences@paradigm4.com.

 

Biography

Dr Matt Brauer is currently Vice President of Data Science at Maze Therapeutics in San Francisco, having previously had roles at biotech research company Genentech, as well as research positions in genetics at Stanford University School of Medicine and Princeton University. His current focus is on using statistical and bioinformatics tools to extract insights from human genetics and functional genomics data, to advance understanding of how to more effectively treat patients with severe diseases.

Thirty years in genetics – A perspective from Brazil

The course of academic research rarely runs smoothly, and through her 30-year career in genetics, Professor Lygia V. Pereira has certainly seen plenty of challenges – but also lots of successes. We talk to her about the importance of genetic diversity, both to better serve the health needs of countries like Brazil and to improve our understanding of disease and health across all populations, and about why we need to make bioinformatics tools truly user-friendly.

 

Can we start by talking about your career path – how did you get into genomics?

My career path started off in a slightly unusual way, as I originally did a Bachelor’s degree in Physics at the Pontifícia Universidade Católica in Rio de Janeiro – computer engineering was the latest thing, and I was keen to follow that route. But I had also really enjoyed genetics in high school, so while I was doing calculus and quantum mechanics, I kept reading about genetic engineering and molecular biology.

The more I read, the more interested I became, so towards the end of my physics course I asked if I could switch my course credits, and I started doing an internship in a molecular biology lab, studying the genetics of Drosophila. I loved that – the lab life, extracting DNA – and I realized that these tools could be used to study any living thing. That was when I saw that human genetics was the career for me.

And where did that internship lead to next?

Around that time, which was the late 1980s, the then Director of the Human Genetics Department at Mount Sinai Medical Center in New York came to Brazil to give a talk about gene therapy. There was a lot of talk about how gene therapy was going to cure everything, and I was fascinated by that. I applied to Mount Sinai Graduate School, which at the time was affiliated with the City University of New York, and that meant that from 1989 to 1994 I was based in New York.

For my first rotation I worked on Niemann–Pick disease, which involves a defect in lysosomal storage. Then in the second rotation I went to Francisco Ramirez’s lab, and at that time they had just found a little piece of the gene involved in Marfan syndrome, which affects the body’s connective tissues. I thought that was really interesting, so I stayed with that for my Ph.D., which involved cloning the full-length cDNA, determining the genomic structure of the gene, and ultimately developing a knockout mouse model for Marfan syndrome.

And then you returned to Brazil?

Yes, and despite my original intentions, I’ve never left! I was originally going to stay in Brazil for a couple of years and then go elsewhere for a postdoc, but in 1996 I ended up getting a four-year grant to work at the University of São Paulo, on establishing in Brazil the methodology to create knockout mouse models, again for Marfan syndrome. The models I used with Ramirez’s group were great, but they had some limitations, so I worked on some different mutations. That work was eventually rewarded in 2001 when our group generated the first knockout mice here in Brazil. That was a proud moment – nationally it was a big contribution.

One of your passions has been improving the population diversity represented in cell lines and biobanks. How did that come about?

That all arose when we developed lines of human embryonic stem cells from blastocysts here in Brazil. We were thinking about compatibility, and asked ourselves: are these cell lines a better match to the Brazilian population? We collected HLA (Human Leukocyte Antigen) profiles for different lines from elsewhere in the world, and we crossed those with our bone marrow registry, for which there was HLA typing on hundreds of thousands of Brazilians. And we found out that no, our line was not a better match!

To understand that, we looked at the genomic ancestry of our lines, and we found that they had over 90% European ancestry – not representative of the admixture of Indigenous, African and European genomic ancestries found in our general population. Our hypothesis was that the embryos were all donated in private clinics, and that the socioeconomic class that has access to those clinics isn’t representative of the whole population.

As a result, we knew that if we wanted to capture the true genetic diversity of the Brazilian population, we needed to get out into the country and establish lines of induced pluripotent stem (iPS) cells instead. So that was what we did – we developed a collaboration with a longitudinal study of 15,000 Brazilians, who have been followed clinically for more than 10 years. And as expected, the genomic ancestry of these iPS cells was much more representative of the population – which of course should prove very useful when cell-based assays become robust enough to detect differential drug response between individuals, for example.

Did that lead on to your work on genetic diversity?

That’s right. In about 2016–17, the lack of diversity in human genomic data started to gain a lot more attention, with an article showing that about 80% of the genomic data out there related to populations of European descent. And we figured, if Brazil has anything to contribute to improving that figure, it’s with our population diversity. In fact, since 2014 there had been much discussion about doing a Brazilian genome project, but this global discussion seemed to me to provide the push that was needed.

I joined forces with two colleagues, Alexandre Pereira and Tábita Hünemeier, who were also at the University of São Paulo, and we sought funding to put this project together. Our plan was to take the 15,000-person longitudinal study I mentioned earlier, and basically add genomic data to that, to kind of jumpstart the program. And it took quite a while, but eventually we got the Ministry of Health on our side, and in the end, they were willing to fund sequencing of the whole cohort. We launched the project in December 2019 – it was my happiest Christmas ever!

And then Covid came along…

Indeed, it did, and priorities shifted completely. To cut a long story short, as I’m talking to you now in July 2022, we’ve sequenced 4,000 genomes, and a contract should be on the way to do 6,000 more. Our hope is that the program will now start to reaccelerate – we will get there!

What do you see as the ultimate benefits of this regional genomic database?

First and foremost, it should benefit the Brazilian population, because you should be able to develop polygenic risk scores that better reflect an admixed population like ours, rather than a European one. And at the same time, we may find novel variants that tell us things about human biology that can eventually be translated into new medicines or therapies. And this isn’t just the case for Brazil of course – there’s such a lot to be gained from looking at different genetic lineages across the globe.
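
At its core, a polygenic risk score is just a weighted sum of risk-allele counts, which is why the weights need to come from a population that resembles the one being scored. Here is a toy sketch with invented variants and effect sizes, purely to illustrate the arithmetic:

# Toy polygenic risk score: a weighted sum of risk-allele counts. The variant IDs,
# effect sizes and genotypes are invented; real scores use weights estimated in an
# appropriate reference population, which is the point about admixed populations.
effect_sizes = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}
genotype = {"rs0001": 2, "rs0002": 1, "rs0003": 0}   # risk-allele counts (0, 1 or 2)

prs = sum(effect_sizes[v] * genotype[v] for v in effect_sizes)
print(f"polygenic risk score: {prs:.2f}")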

Our proposal is that for the first two years the principal investigators from the cohort studies would have exclusive access to the data. After that, we’d like it to be accessible to Brazilian-led groups for two years, and only then would we open it to the rest of the world. That decision is of course in the hands of the government, but what I’d like to avoid is a situation where we generate the data and then the discoveries are made mainly by groups outside Brazil, solely because we lack the pace of research that comes when you have a critical mass of researchers and suppliers.

You raise an interesting point there, about who gets to make the discoveries based on genomic data. Why do you think that’s so important?

Partly it’s about fostering skills and collaboration in the country where the data was generated – our thinking in Brazil is that if groups here have exclusive access for a few years, that will kickstart capabilities to use that data, and hopefully promote the growth of a well-developed genomics and bioinformatics ‘ecosystem’ in this country.

But for me, it’s in part a more fundamental principle too. When I was at school, I was taught that as an ‘underdeveloped’ country, we export primary goods, and then we import the manufactured goods. This may be a feature of the Brazilian economy in many areas, but I’m determined not to see it replicated for human genome work too. I see this issue in some of the African initiatives as well – why should they go to the effort of producing the primary data, only to watch the developed world make the discoveries and bring the innovations that stem from it? We need to strike a balance that benefits all parties equally.

And speaking about processing the data, what are your thoughts on the tools that are now available?

Well, as I’ve mentioned I’m not really a bioinformatician, so I’ll give you an end-user’s perspective.

I think there are several challenges – having a platform that allows access globally while remaining secure and affordable is high on my wish-list! But also, I would like to see bioinformatics tools made more user-friendly than they are currently. I like to make an analogy with smartphone apps – they do some amazingly complex processes and offer a wealth of capabilities, but on the whole, they are a lot easier to use than many of the bioinformatics software platforms that are out there. Why should that be?

I think bioinformatics software companies need to think about the end-users a little bit more. They need to make sure their tools are not needlessly complex, and that they’re structured to allow the most popular tasks to be done easily and quickly. That way, I think we could get more scientists – and especially those without a bioinformatics background – using the almost infinite amount of data that’s out there. If they can do that, then I see potential to accelerate genomic discoveries dramatically.

Thank you for your time, Lygia – it’s been very interesting to talk to you and hear your inspiring story and unique perspective.

It’s clear to us that what we perceive as the scientific challenges in nations where genomics is well-funded are not necessarily the same as those faced by researchers in other countries. It’s renewed our focus on making sure that our tools truly help genomic scientists with their everyday tasks, and that we strive to ensure that they’re easy to use by those who aren’t experts in bioinformatics or coding.

For more information about our suite of tools for genomics research, contact us at lifesciences@paradigm4.com.

Author Bio

Professor Lygia V. Pereira is Director of the National Laboratory of Embryonic Stem Cells (LaNCE) in the Department of Genetics and Evolutionary Biology, at the University of São Paulo, Brazil. Her previous training was in Physics (Bachelor’s degree, 1989) and Molecular Biology (M.Sc., 1990), followed by a Ph.D. in human molecular genetics from the City University of New York (1994). Professor Pereira’s current research interests include studying animal models of Marfan Syndrome, X-chromosome inactivation, establishing new lines of human embryonic stem cells, and identifying human knockouts within the Brazilian Genome project.

References

  1. Moreira de Mello, J.C., Fernandes, G.R., Vibranovski, M.D. et al. Early X chromosome inactivation during human preimplantation development revealed by single-cell RNA-sequencing. Sci Rep 7, 10794 (2017). https://doi.org/10.1038/s41598-017-11044-z

Network science: big data, strong connections, powerful possibilities

As the volume and complexity of data available to researchers increases, the potential to extract valuable insights becomes greater. However, how datasets are structured, organized and made accessible is a critical consideration before their true value can be realized.

Few people are more enthusiastic about the power of network sciences to solve intractable data analysis problems than Dr. Ahmed Abdeen Hamed, Assistant Professor of Data Science and Artificial Intelligence at Norwich University, Northfield, Vermont, USA. We’re talking to Ahmed to get his thoughts on why this approach is proving so valuable in the medical sciences (and elsewhere), and to hear his opinions on the evolution of artificial intelligence (AI).

Can we start by defining what we mean by ‘network science’?

It’s essentially a way of analyzing data relating to complex systems, by thinking of them as a set of distinct players/elements (known as nodes) laid out as a map, with the connections between these elements being the links.

Doing this makes it possible to analyze such systems mathematically and graphically – which in turn enables us to spot patterns in the network and derive useful conclusions.
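
As a minimal illustration of the nodes-and-links idea – the entities and connections below are invented – a few lines of Python with the networkx library are enough to build such a map and ask which element sits at its center:

# Minimal illustration of nodes and links using networkx. The entities and their
# connections are invented purely for the example.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("drug_A", "gene_X"), ("drug_A", "disease_Y"),
    ("drug_B", "gene_X"), ("gene_X", "disease_Y"),
])

# One simple "pattern" to spot: which node is most central to the map?
centrality = nx.degree_centrality(G)
print(max(centrality, key=centrality.get))   # -> gene_X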

And that helps you to solve problems, right?

Absolutely, and not just any old problems! I’m particularly interested in intractable real-world problems – some are known as Millennium Prize Problems, for which a prize of one million dollars is pledged for a correct solution – the sort that companies and governments spend millions of dollars trying to solve (e.g., drug discovery and cybersecurity).

These problems are often challenging because the answer is hidden amongst vast amounts of multimodal data – like records of social media interactions, libraries of medical records from drug trials, or databases of chemical properties, or combinations of these. If a problem is intractable, no existing algorithm running on today’s computers can be expected to finish in a useful amount of time. A big issue has been that conventional methods of analyzing such complex datasets (on the computers we possess today) struggle, because the number of potential interactions scales exponentially with the size of the network.

Where I think Turing reduction can change this is by simplifying the problem so that it becomes solvable by known and novel computational algorithms in non-exponential time. This is exciting because our world is essentially large systems of complex networks, so there are simply masses of applications out there ready for investigation by network analysis.
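
A quick back-of-envelope calculation illustrates why brute force fails here: the number of possible pairwise links grows quadratically with the number of nodes, while the number of possible network configurations over those links grows exponentially.

# Back-of-envelope illustration of the combinatorial blow-up in network analysis.
for n in (10, 20, 40):
    links = n * (n - 1) // 2                 # possible pairwise links
    print(f"{n} nodes -> {links} possible links -> 2**{links} possible networks")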

Like your work on drug repurposing, for example?

Exactly, and this is a great example of network analysis in action, because it shows how it can speed up the process of finding a drug to treat a given condition. Designing new drugs based on understanding the biology works well, but it’s a long haul from discovery, through clinical trials, to market approval, whereas finding existing, FDA-approved drugs that can be repurposed for the disease in question has many advantages.

How do you go about creating a network based on drug data?

Well, I thought: what if we could search through the literature for every known drug molecule, and rank them based on their potential to treat a particular disease or condition? So, I teamed up with colleagues from pharma and academia (the authors of the paper TargetAnalytica: A Text Analytics Framework for Ranking Therapeutic Molecules in the Bibliome), and for over two years we worked on refining a molecule-ranking algorithm. We realized that the more specific a molecule is for targeting a tissue of a certain organ, cell type, cell line or gene, the greater its potential as a possible treatment.

We hit upon drug specificity as the best metric to use and applied it to a dataset of already published biomedical literature. The nodes in our network were the drug molecules – or, more strictly, ‘chemical entities’ – as the main players of the network. Then we linked them back to other entities such as genes, diseases, proteins, cell types, and so on. A connection was essentially formed when two players were mentioned together in the abstract of the same publication.
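
A rough sketch of that co-mention idea – a toy version only, with invented entities, and not the TargetAnalytica ranking algorithm itself – links any two entities that appear in the same abstract and counts how often they do:

# Toy sketch of building a co-mention network from abstracts: entities appearing
# in the same abstract get linked, with edge weights counting co-mentions.
from itertools import combinations
import networkx as nx

abstracts = [
    {"drug_A", "gene_X", "cell_line_1"},     # entities found in abstract 1
    {"drug_A", "gene_X"},                    # entities found in abstract 2
    {"drug_B", "gene_Y"},                    # entities found in abstract 3
]

G = nx.Graph()
for entities in abstracts:
    for a, b in combinations(sorted(entities), 2):
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

print(G["drug_A"]["gene_X"]["weight"])   # -> 2, mentioned together in two abstracts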

And more recently you’ve moved on to apply the principle to COVID-19, too?

That’s right – with the arrival of the pandemic, suddenly we had a need to find a treatment for this relatively unknown serious illness. I realized this was an ideal application for another network medicine tool, because of the sheer amount of information that has been and will continue to be generated on COVID-19.

Specifically, in the two years since the pandemic hit, there have been well over 150,000 publications related to COVID-19, and these publications contain everything you need to feed into a drug-repurposing algorithm. This biomedical literature is a goldmine of information! The more publications you analyze, the more elements end up being connected, whether from the same publication or from another publication. The complexity of the network evolves and refines dramatically as you increase the size of the data input.

Can you tell us about your research into COVID-19 so far?

Well, we’ve developed a drug-repurposing algorithm called CovidX, which we introduced in August 2020. At the time, we used it to identify and rank 30 possible drug candidates for repurposing, and also validated the ranking outcomes against evidence from clinical trials. This work was published in the Journal of Medical Internet Research.

Post-CovidX, we’ve been working on even bigger datasets, and have recently had an article published in MDPI’s Pharmaceutics about the interplay of some drugs that may present themselves as a COVID-19 treatment. This study was on a much bigger scale (it used a set of 114,000 publications) than CovidX. It certainly provides a way to stitch together all the evidence that’s been accumulated about treating the virus over the last two and a half years! I should also mention the role of clinical trial records, which mirror the map derived from the biomedical literature and validate its findings.

This all sounds very interesting, but on a practical level, how do you extract useful insights from these massively complex networks?

That’s a great question, because it strikes at the heart of what makes network science so powerful. I’ll use a simple analogy. 

Imagine you’re in a hospital, and you’re looking at the job roles of all the staff. Some members of staff interact with lots of people, like the receptionist or the ward manager, but others are more often involved as a close-knit group. For example, an anesthetist, surgeon and nurse would all be closely involved in an operation in the theater, pretty much every time. We call this strongly connected community a ‘clique’, and it tells us that this combination of roles does something important. Now, if we were to look at thousands of hospitals, we might find similar ‘cliques’ in each one, and overlaying them all enables us to pull out those roles that rank most highly in their association with a successful outcome for the patient. It’s basically about removing all the noise.

If we go back to our network on drug literature, our first task is to find all these ‘cliques’ amongst our drugs. We can then work out where these ‘cliques’ overlap, and this enables us to identify the drugs that are most important for a successful outcome. And that’s where you focus your effort.
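
To make the clique idea concrete, here’s a toy sketch – invented drugs, and not the published CovidX pipeline – that finds the tightly connected groups and ranks each node by how many of them it belongs to:

# Toy sketch of the clique idea: find tightly connected groups, then rank nodes
# by how many cliques they sit in. Drug names are invented for the example.
from collections import Counter
import networkx as nx

G = nx.Graph([
    ("drug_A", "drug_B"), ("drug_B", "drug_C"), ("drug_A", "drug_C"),  # clique 1
    ("drug_C", "drug_D"), ("drug_D", "drug_E"), ("drug_C", "drug_E"),  # clique 2
])

membership = Counter()
for clique in nx.find_cliques(G):          # maximal cliques
    if len(clique) >= 3:
        membership.update(clique)

# drug_C bridges both cliques, so it ranks highest.
print(membership.most_common())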

And you suggested earlier that network analysis has lots of other applications too?

Yes indeed! As long as you have a sufficiently large dataset and you’re able to identify the entities that you’re going to use to build your network map, then you can turn it to a myriad of problems. I love it, not just because the mathematics is beautiful, but because it allows us to reduce highly complex, intractable problems down to simpler problems that can be addressed without the exponential blow-up in time.

Applications are really broad – from genetics to ecology, and from sociology to economics. But to come back to pharma, what makes it such a fertile area for exploiting the power of network science is that everything is documented – whether that’s in published reports, internal documents, or electronic lab notebooks, the data is all there. And not only that, but the data is high-quality, too. So, in a way, the quality of the data has inspired the quality of the science, and obviously in the medical arena, this can help to save lives.

Coming back to your passion for problem-solving, what’s your opinion about how well the AI part of computer science is doing?

Well, this might be a bit controversial, but to be honest I’m a bit disappointed with the progress that’s been made since the benchmark set by Deep Blue in 1997. If we could teach a computer to play chess 25 years ago, then why – with the vastly greater computing power we’ve now got at our disposal – can we not develop a supercomputer that could solve some genuine problems? If we created an AI that specialized in, let’s say, cancer or Alzheimer’s to the same level as Deep Blue specialized in chess, could we find an effective treatment? Why is that proving so hard?

But do you think there’s light at the end of this tunnel?

Yes, I do, and it’s all to do with the possibilities opening up in quantum computing. Now I’m not a specialist in this area, but it’s clear to me that because it works in a very different way, it should allow us to tackle previously intractable problems – what’s known as ‘quantum supremacy’.

And what’s exciting is that this computing power should be accessible to programmers everywhere, because it’s being interfaced with regular programming languages like Microsoft’s Q# language. I envision that within the next five to ten years, we’ll see this computing revolution become a reality. From my perspective as a network analyst, I’m really hoping it does, because it should allow us to remove the current limit on the number of elements we can have in a network. I would certainly like to get my hands on a quantum computer once the technology has matured!

Ahmed, it’s been a pleasure to talk to you, and thank you for sharing with us the power and potential of network analysis.

It’s clear that there are challenges for pharmaceutical researchers, whether they are using network analysis or machine learning approaches. To reap the benefits of either approach, the fundamentals of data storage, flexible access, and computational/analytical tools must be considered. At Paradigm4, we are tackling the data storage issues through our novel elastic cloud file system, flexFS, for more resource-efficient data storage and sustainable computing, to help accelerate drug discovery efforts. For more information on how our technology can help to transform your data analysis, contact lifesciences@paradigm4.com. 

For more about the work of Dr. Hamed at Norwich University, read this article on the university’s website. Disclaimer: These views are Ahmed Abdeen Hamed’s own views and not those of Norwich University.

Bio

Dr. Ahmed Abdeen Hamed is Assistant Professor of Data Science and Artificial Intelligence at Norwich University, Northfield, VT, where since 2019 he has led the university’s data analytics courses and carried out research using algorithms and computational methods to address a wide range of global issues. Prior to this, he worked in the private sector, where he designed network-based drug-ranking algorithms, now patented under his name as the first inventor.

Tackling complexity: making sense of genomic data

We’re talking to Professor John Quackenbush from the Harvard T.H. Chan School of Public Health, who throughout his 30-year career has addressed big questions in human health by combining his twin passions of genomics and data analysis.

Can we start by discussing how you got into genomics?

Well, I’ve had an interesting career journey. I got my Ph.D. in theoretical physics in 1990, and I thought that’s what I was going to be doing for the rest of my life. But shortly after I had my degree, funding for physics research dried up because of the break-up of the Soviet Union and the end of the Cold War.

At the time, a friend who was finishing her Ph.D. in molecular biology was using polymerase chain reaction (PCR) to study hormonal regulation of gene expression in cockroaches. I started to help her analyze data, and I began to see that there was this whole universe of biology that had been invented since I was at high school – with data that people like me could analyze. And of course, this was when the human genome project was getting off the ground, so I applied for a five-year fellowship, and it all followed from there!

It’s been an interesting personal journey then?

I’ve been lucky to have been to all these truly outstanding places doing work at the forefront of biology and genomics. I used to joke about being a Harvard professor one day, and now I am chair of Harvard’s department of biostatistics – despite not being a trained statistician. 

But I think my journey indicates a point of fundamental importance in genomics and biomedical research in general – that there’s a need for a broad range of people and techniques so we can take massive quantities of data and turn that into knowledge, and then convert that knowledge into understanding. We definitely need people who can combine computational, statistical, and biological perspectives – and I’ve always been very keen to foster that ‘community’ of researchers myself.

Let’s talk about the science. How have you seen genomic science evolve since the 1990s?

Well, it’s been incredibly exciting to have been working in a field that has been transformed so fundamentally – and which is still undergoing massive change. For example, technology is advancing at such a pace that today we can assemble datasets that give us a foothold in addressing questions that were unanswerable even two or three years ago.

But what’s fascinating is that the field hasn’t evolved in the way that some originally thought it might. When the first human genome was sequenced 20 years ago, people were saying things like “now we’ve identified all the genes, we’ll be able to find the root cause of all human diseases”. But it wasn’t quite so simple. Biological systems are massively more complex than we imagined, and making sense of them requires enormous datasets. Even now, we’re only just starting to grapple with the vast amount of genetic variation in the human population.

There’s a study I always point to that illustrates this nicely, by a consortium called GIANT, which looked at the height of just over 250,000 people. They found that 697 genetic variants could explain about 20% of the variation in height, but to get to 21% they needed 2,000 variants, to get to 24% they needed 3,700, and to get to 29% they needed almost 10,000! Height is something we know has a genetic component, but it is controlled by many variants, most of which have very, very small effects.

So, what can you do with your computational methods that tackles that complexity?

We recognized early on that what distinguishes one cell type from another isn’t just an individual gene turning on or off, but the activation of coordinated programs of genes. Early studies of ‘gene expression’ looked for genes that were correlated in their expression levels in ways that differed between disease or physical states.

Unfortunately, this doesn’t explain why the genes are differentially expressed, and we’ve wrestled with approaches to answer that question over the years. A major turning point in my thinking came a few years ago, when I became aware of a 1997 paper by David Wolpert and William Macready, who showed that introducing domain-specific knowledge could dramatically improve the performance of complex optimization algorithms. Their paper is relevant to genomics because for the last 30 years, people have been trying to apply generic ‘black box’ algorithms to their data, and yes, this might work for a particular situation. But when you try to generalize the methods, they often fail, and this paper told me why. It told me that if we wanted to understand how genes were regulated, we should instead aim to understand what drives the process and use that as a starting point. Ultimately, this idea led us to develop computational ‘network methods’ that model how genes are controlled, and this has dramatically expanded our understanding of how and why diseases develop, progress, and respond to treatment.

And what can these network approaches tell you about gene regulation?

Well, it’s been an incredibly fruitful area of study. We start by guessing a network using what we know about the human genome, where the genes are, and where regulatory proteins called transcription factors can bind to the DNA. We then take other sources of data about protein interactions and correlated gene expression in different biological states and use advanced computational methods to optimize the structure of the network until it’s consistent with all the data. We can then compare networks to look for differences between different states, such as health and disease, that tell us what functions in the cell are activated in one or the other. 
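
Conceptually, the comparison step comes down to looking at how the edge weights of the two inferred networks differ. Here’s a toy sketch with random matrices standing in for the real networks – it is not the actual optimization method John’s group uses:

# Conceptual sketch only: once regulatory networks exist for two states (e.g.,
# healthy vs disease), compare edge weights to see which regulatory links change.
# The matrices are random stand-ins, not real inferred networks.
import numpy as np

rng = np.random.default_rng(1)
n_tfs, n_genes = 5, 8                      # transcription factors x target genes

healthy = rng.random((n_tfs, n_genes))     # edge weights, healthy-state network
disease = rng.random((n_tfs, n_genes))     # edge weights, disease-state network

diff = disease - healthy
tf_idx, gene_idx = np.unravel_index(np.abs(diff).argmax(), diff.shape)
print(f"Largest regulatory change: TF {tf_idx} -> gene {gene_idx}, "
      f"delta = {diff[tf_idx, gene_idx]:+.2f}")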

As an example of what such models can reveal, we built network models for 38 tissue types, and we found a three-tiered structure. First, there was a core set of regulatory processes which were essentially the same in every tissue. Then there was a periphery of processes that were often unique to every tissue. But there was a middle layer in which regulatory paths changed to activate the processes that made one tissue different from another tissue. And such an activating layer is where you want to look when you think about interventions to address the causes of disease.

To take that a step further, we’ve now created and catalogued nearly 200,000 gene regulatory networks, including many from drug response studies. Using that resource, we can ask whether there’s a particular drug that alters the regulation in a way which might make a disease cell more like a healthy cell. That could be a powerful shortcut to matching diseases to drugs.

It sounds like practical applications may not be far off. What prospects are there of using your approaches for personalized medicine?

I’d say it’s early days yet – there’s a big gap between using these tools in the research lab and ensuring they are reliable enough to be used in the clinic. But we’re committed to making the methods accessible, and so all our methods are open-source, and we try to make them available in multiple programming languages like R, Python, MATLAB, and C. So, any researcher can take our network methods and use them on their own applications! Our goal is to help move the field forward.

In terms of areas we’ve studied ourselves, my colleagues and I have looked at cancer, Alzheimer’s disease, chronic obstructive pulmonary disease, interstitial lung disease and asthma, amongst others. In addition, one of the underlying themes for us is the differences that exist between males and females in gene regulatory processes that can help us understand disease evolution and inform treatment.

Surely gender differences have been well-studied?

Well, kind of – we know there are differences, but we don’t know why. In fact, sex differences in disease are among the most understudied problems in biology. For example, in colon cancer we know that there are differences in disease risk, development, and response to therapy between the sexes. However, when we examine gene expression in tumors, there’s very little difference between male and female samples, and none that explains the clinical differences.

But, using our network inference and modeling approach, we do find differences in the regulatory networks – for example, those influencing drug metabolism. This of course allows us to identify possible ways in which we can treat disease in males and females differently to increase drug efficacy. This is something we’d never have been able to do before taking this integrated approach.

So how quickly can you run a network-building algorithm these days?

Pretty fast, and it’s getting faster. For example, in a study of 29 different tissues, we had to create 19,000 networks. When we ran the methods that we call PANDA and LIONESS to generate the initial networks for publication, it took six months. Then the reviewers asked us to make some changes, by which time we’d made some software improvements and had access to new hardware, so on the second run it took six weeks. Now we can do it in six days!

That’s an amazing change in turnaround time. But why do you conduct analyses on such a large scale?

Most experiments in science are based on hypothesis testing, meaning we have some evidence and analysis that suggests a particular factor is associated with some measurable outcome, so we design an experiment to see if the factor and the outcome are related. But in associating gene regulation across the genome with particular outcomes, it is hard to know where to look or what to test. In studying the inference and analysis of gene regulatory networks, there are so many ‘unknown unknowns’ that there’s value in simply taking your tools and seeing what you can discover. If you have a large sample size, you can have greater confidence in the things that you eventually find, and it makes it easier to tease out small but meaningful signals. 

An analogy that I like is that what we are doing is similar to what Galileo did when he turned his telescope on Jupiter. He took new technology and made observations that led him to conclude that the planet had moons, which indirectly helped to confirm the heliocentric model of the solar system. Our new technologies are the big datasets and the advanced computational tools that we have developed. Using these, we hope to identify genomic relationships that wouldn’t be found if we’d constrained ourselves to a handful of genes or simply looked at which ones change their expression. The deeper regulatory connections we find in networks across all 25,000 genes provide new insights that lead us to develop more targeted questions and testable hypotheses.

With your perspective of 30 years in the field, what excites you most about the future for analysis of genomic data?

Two things stand out for me. The first is multi-modal data, which is one thing I know Paradigm4 is tackling. It’s combining genomic data, imaging data, and patient metadata with computational methods to create multi-tiered models that encompass genetic variants, transcription, gene expression, methylation, microRNAs, and more. Bringing all these tools together gives us a better chance of extracting insights from the fantastically complex public datasets that are now available.

The second is single-cell sequencing, which I find very interesting. If you take a chunk of tissue and genetically profile it, you’re only ever going to uncover what happens on average. But if you take single-cell data like Paradigm4 has been doing, then you can probe what happens during the transition from a healthy state to a diseased state. That offers fantastic prospects for uncovering the mechanism of disease.

John, thanks very much for your time, and for giving us some fascinating perspectives on genomics and computational biology. 

Single-cell sequencing and integrative analytics are both areas of great importance and opportunity as we look towards the future of genomic data and precision medicine. At Paradigm4, we’re utilizing these methods to support scientists in extracting insights from datasets, through our purpose-built platform and suite of REVEAL apps, which help users to build a multimodal understanding of disease biology. The technology allows users to ask more questions quickly, through a scalable, cost-effective solution. For more information on how we can help to transform your data analysis, contact lifesciences@paradigm4.com.

Bio

John Quackenbush is Professor of Computational Biology & Bioinformatics, and Chair of the Department of Biostatistics at the Harvard T.H. Chan School of Public Health in Boston, Massachusetts. He also holds professorships at Brigham & Women’s Hospital and the Dana-Farber Cancer Institute. 

Prof. Quackenbush’s Ph.D. was in Theoretical Physics, but in 1992 he received a fellowship to work on the Human Genome Project. This led him through the Salk Institute, Stanford University, The Institute for Genomic Research (TIGR), and ultimately to Harvard in 2005. Prof. Quackenbush has published more than 300 scientific papers with over 84,000 citations; he was recognized in 2013 as a White House Open Science Champion of Change.

References

1 Wood, A., Esko, T., Yang, J. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet 46, 1173–1186 (2014). https://doi.org/10.1038/ng.3097

2 Wolpert, D.H. and Macready, W.G. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1, 67–82 (1997). https://doi.org/10.1109/4235.585893

The role of computational biology in drug discovery: an inside perspective

We talk to Dr. David De Graaf, CEO of biotech company Abcuro, and for over 20 years a leading light in the field of computational/systems biology.

Let’s start by talking about what you’re doing at Abcuro?

We’re focused on developing two clinical leads – one targeting an autoimmune muscle-wasting disease known as inclusion body myositis, and the other targeting cancer cells. In both cases, the targets are cytotoxic T cells that express a receptor known as KLRG1. For inclusion body myositis, we’re developing an anti-KLRG1 antibody that can selectively deplete the cytotoxic T cells present in muscle tissue, effectively removing the source of immune attack. For cancer, on the other hand, we want to turn on these cytotoxic T cells and direct them towards the tumor – which we’re doing using a different antibody. It’s challenging to extract value from cancer programs because, many times, there is no clear proof of efficacy until you’re in phase 2.

And how did you identify that KLRG1 target?

That’s a great question because it relates right back to bioinformatics. The founder of Abcuro is Dr. Steven Greenberg, who is a neurologist. He was being referred patients with inclusion body myositis, but he was becoming very frustrated, because there was absolutely nothing that could be done for them. So, he decided to change that. After taking time out to study bioinformatics at MIT, he came back and set up a small lab. He took muscle biopsies from these patients, cut out the invading T cells that were eating away at the muscle fiber, and used bioinformatics to find a marker in those cells that was selective for those T cells and spared the rest of the immune system. The target that he found was KLRG1, which is the one we’re using today to deplete these cytotoxic T cells.

Are you still doing computational biology today?

No, but we’d like to! The thing is, for a small biotech company like ours, drug discovery is not a big driver, because that’s not where the business value is. The reality is that drug pipelines don’t pay the bills, successful drugs do. And so, companies like ours must be selective, we have to focus on some very specific questions, and it doesn’t matter quite so much how long it takes to get answers. The important thing is that you work out exactly what’s going on. There’s no grand hypothesis in our discovery efforts which would be amenable to large-scale computational studies.

What do you think about companies offering specialist computational biology tools? Do they have a viable business model?

It’s very difficult. The problem with selling computational tools is that your business often becomes about the tool, but that’s not really what the customer cares about, is it? Instead, it should be about what you do with it and how you make a difference.

When I was at Selventa, I remember a trip that we made to a big pharma company, and they looked at the computational tool we’d created, and they said: “Hmm, it’s a great tool, it’s a very efficient and interesting way of analyzing data. But, to validate it, we want you to go through the data set that we’ve already analyzed”. 

Now this sort of situation nearly always ends in failure, because there are essentially two outcomes. The first one is you find exactly what they already found, and then they say: “that’s very impressive, you did it in three days rather than three months but, in reality, that difference in timescale doesn’t matter much to us – so no thanks”. Or even worse is if you find something new, and then they go on the defensive and say: “we’ve got great people here, are you telling us they’ve been doing it all wrong?”. So, either way, it becomes really hard to show the added value of what you’re providing.

In your opinion, what are the benefits of commercializing computational biology tools?

There are plenty. Top of the list is that, thanks to computational biology, we’ve got a lot better at defining diseases in terms of what’s going on at a molecular level. And that’s important, because patients don’t come into the clinic and complain that their PI3 kinase hurts. We need to continue to do that, and make connections through the whole value chain, from identifying symptoms, to understanding the molecular basis of disease, through to drug discovery. Now, that’s a grand aim, so you’d need to carve out something narrower there, but there are plenty of opportunities. For example, we don’t know why patients don’t always benefit as expected from gene therapy. And we don’t really know why people get autoimmune diseases, with a case in point being long Covid. There’s a lot to be gained if you’re prepared to focus on such questions.

Are there any other ways you see computational tools making a big difference?

When I was in big pharma, we often did the exact same assay 15 or 20 times, because it was always much easier to regenerate the data than it was to find the old data and assess the conditions used. That remains a huge inefficiency. I think that computational tools that annotate relevant data, and allow you to search across it, could really pay off.

Another opportunity is being able to take externally curated content and understand it in the context of your own experiments. Then there’s the concept of making a link between patients with a similar molecular phenotype, rather than talking about them purely in terms of an observational phenotype.

And finally, being able to extract value from studies that haven’t worked has massive potential. People don’t think about the fact that 99% of the money in the pharma industry is spent on things that end up going wrong. But I don’t believe it’s right to simply forget about that work – we should bring it together and get some insights from it. I think all these things have the potential to be huge drivers in accelerating drug discovery and development.

You implied an issue with data flow in your previous role, can you explain more about that and what it means for those wanting to commercialize computational biology tools?

Data flow is absolutely an issue, and I’ll give you another example from my time at Selventa. We’d worked with a big pharma company to analyze gene expression data from a couple of their clinical trials to help them decide whether to move particular candidates forward. At the time, they had a poorly defined mechanism of action, and didn’t understand why certain patients responded and others didn’t. Anyway, we crunched the numbers, we figured it out, it was a really nice, productive collaboration, and the CEO invited us over for a meeting, and we thought: “great, we’re in there, these guys want to do a deal with us”.

But it didn’t turn out like that. They took us out the night before, then the next day we talked business, and it all fell to pieces. We had this excruciating 45-minute conversation with the CEO, who spent the entire time apologizing, explaining that because we’d analyzed all their genetic and genomic data, they were going to be busy for the next two or three years dealing with it all. Although they wanted to retain a relationship, they weren’t going to have any new data for us. 

My conclusion is that it’s very difficult to sell a specialist computational tool to a pharma company unless they’ve got a continuous flow of data to feed into it – and not many do. To make it worth their while to construct that information infrastructure, they need to be able to make full use of it in the long-term.

It’s interesting you say that, especially as Paradigm4 has focused on translational medicine, where the data flow is huge and continuous. Proteomics is also an emerging gold mine of data! How would you say those challenges you mention can be tackled?

One approach is to refine your core pitch and make it something that pharma companies can’t do. The problem is that often, the one thing that they can’t do is to use these specialized tools. You end up building a service organization on top of your software offering, and those don’t scale particularly well. It’s very, very hard to extract value from them.

Another way is to provide your own content – for example, to curate and annotate all the data that’s in the public domain and sell it. Now, there may be a continuous need to have that data but, to be frank, the value of that is very small, because it’s not unique to a particular company.

In summary, I think it’s very hard to develop a good business model for these companies. In the research setting, customers will say: “this tool is fantastic, but we haven’t got the data flow to give us good return on investment”, and in the drug development setting, they say: “this tool is still fantastic, but we’re only going to use a little bit of it, so it’s not going to give us return on investment”.

So, it seems you’re saying that these software companies need to have a different business model to be viable in the long term?

Yes, and you can state it in three words. Analyze clinical data. That’s where the money is, and that’s where important decisions are made. It’s true that supporting research using your software platform can be a great way to build confidence. But you’re not going to make money until you’re in the clinic.

Fair enough, but what about the regulatory side?

That’s a good point, and it comes down to the scope of your analysis. Again, thinking about things from the pharma perspective, one of their worries is that if you look at more and more parameters, you’re going to find something that looks wrong. And that has a huge impact on your path through clinical regulation. For example, in a preclinical toxicology model I worked on, we saw induction of a single cytokine out of about 30. That cytokine then needed to be tracked through phase one, phase two and phase three, even though it was completely irrelevant. It even ended up on the label, despite there being absolutely zero evidence of any issue with it. Now, the way I’d approach that is not by saying we shouldn’t generate the data, but by saying upfront: “here are some very specific things we’ll be looking for”. That way, you eliminate the risk of automatically flagging up differences that ultimately don’t matter.

David, thank you for your time and for giving us some fascinating perspectives on computational biology. 

An interesting take on many aspects of the industry and one important point we’ve taken away is that many scientists struggle to use computational tools in their current form, often needing support from software developers. At Paradigm4, we work closely with our customers so that scientists can use the computational tools to their advantage, asking complex questions and assessing key biological hypotheses more efficiently and independently. Our platform is what we like to call ‘science-ready.’ We want to help scientists make a difference with their data. For more information on how our technology can help to transform your single-cell data analysis, contact lifesciences@paradigm4.com.

Bio

Dr. David De Graaf is CEO of Abcuro, a clinical-stage biotechnology company that he joined in late 2020, after a string of positions in the biotech sector, including CEO of Comet Therapeutics and Syntimmune, and leadership roles in systems and computational biology at Pfizer, AstraZeneca, Boehringer Ingelheim, Selventa and Apple Tree Partners. He holds a Ph.D. in genetics from the University of Illinois at Chicago.

Abcuro is developing antibody treatments for autoimmune diseases and cancers modulated by cytotoxic T and NK cells, including ABC008 for treatment of the degenerative muscle condition known as inclusion body myositis, and ABC015 for reactivating bodily defenses against tumors. 

The evolution of computational genomics: solving bioinformatics challenges

In this, the first of a series of opinion leader interviews, we talk with Martin Hemberg, Ph.D., Assistant Professor of Neurology, Brigham & Women’s Hospital and Member of the Faculty, Harvard Medical School, located at The Evergrande Center for Immunologic Diseases (a joint program of Brigham & Women’s Hospital and Harvard Medical School).

Dr. Hemberg’s field of research is computational genomics. He develops methods and models to help understand gene regulation. He also has a long history of collaborating with experimental groups, mainly in neurobiology, assisting them to analyze and interpret their data. 

Dr. Hemberg started out studying maths and physics but, towards the end of his undergraduate studies, decided that biology was going to be his route into research. As a result, he carried out his postgraduate and Ph.D. work in theoretical systems biology at Imperial College London under the supervision of Professor Mauricio Barahona. He was then a post-doc in the Kreiman lab at Boston Children’s Hospital, working on the analysis of ChIP-seq and RNA-seq data. In 2014, Dr. Hemberg moved to Cambridge, UK, to set up a research group at the Wellcome Sanger Institute with the goal of developing methods for analyzing single-cell RNA-seq data.

Following relocation to the USA in 2021 to establish a new research group in Boston, Dr. Hemberg is now embarking on the next stage of his career.


How did you first get involved in analyzing sequencing data?

With my background in theoretical systems biology, it was interesting to come across a team of ‘experimentalists’ in the lab next door to mine at Boston Children’s Hospital. It was just around the time when second-generation sequencing technology was becoming available, and my neighbors were optimizing their ChIP-seq and RNA-seq protocols but had no idea how they were going to analyze the data. So, I got involved, and worked on this for the duration of my post-doc contract.

Then, thinking back to my experience with single-cell analysis at Imperial – partly mathematical, not genome-wide, but focused on individual cells – it seemed clear to me that these two fields were going to converge. For my first Principal Investigator post (at the Wellcome Sanger Institute) I pitched ideas around systems biology, mathematical biology, and biophysical approaches using single-cell RNA-seq data. However, as we started to do this, it quickly became clear that we did not really understand the data, so we needed to switch to method development to enable us to process the data and separate the signal from the noise. This became the focus of the lab: developing novel methods to analyze single-cell data sets.


How did this develop, and what is driving your work now?

It has been an interesting journey. We have two types of projects. The first are what I would call ‘in-house’ projects, where we are working on methods and analyzing public datasets independently to improve analysis or come up with novel ways of approaching the data. The second are ‘applied’ projects, where we collaborate with external groups and help them to analyze their data. There is a really nice synergy here. Often, I will be talking with someone about their problems and realize that we have a new method that can offer a solution. It works the other way around, too, when we realize that a specific issue faced by one group is actually a generic problem, and we can then work to find a solution for everyone.

This work has now transferred to Boston, where I am setting up a new group to work using the same template – the same approach. I moved during 2021, and I can definitely advise colleagues that it is not a good idea to move continents with your family, change jobs and try to set up a new team during a pandemic! 

What we will eventually be doing is defining some of the key problems – from an informatics point of view, from a technology point of view and, just as importantly, from a biological point of view. We will be looking at where there are no good solutions currently, and working to develop methods and tools to solve these challenges and deal with the ever-increasing volume of data that is being produced. It is a very exciting prospect. Clearly, as data set size increases, you need to rely on more sophisticated software and data processing tools that can cope with the scale.


One of your most-cited papers sets out guidelines for the computational analysis of single-cell data. Why has this been so important?

Initially, we were thinking only about ourselves and our own research but, as we were working out how best to approach our data, I thought we should write it down and put it into a form where it might help others. In addition to the publication, we also formatted the information as an online course. The feedback has been great. Every time I am at a conference, people mention how they learned to handle their data using our principles.


What plans do you have to update this work?

It’s definitely a work in progress. We just hit 10,000 visitors/month on the course portal, so I’m absolutely committed to keeping it as current as possible. In the past, we updated only occasionally, as time allowed, but now we have a dedicated person who will be looking at it as part of their day-to-day responsibilities.  


What is your view regarding the best ways to organize the data?

I always tell my students, many of whom come from a biology background rather than mathematics or informatics, that computing courses usually include a module called ‘algorithms and data structures’. And there is a very good reason you put these things together. If you have the right data structure, the algorithm becomes trivial, whereas if you have the wrong data structure, the algorithm becomes a ton of work!

In addition, a more practical consideration is data set size. If you have 10 MB, you can store it on a hard drive and do a linear search to get what you want, but if you have 10 GB, you need much more efficient ways of accessing and querying your data. There is absolutely no doubt that this is one of the keys as you scale up to very large data sets – and since that scaling is what is really driving the field, it’s critical.
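As a minimal sketch of that point (gene names and record counts invented for illustration), compare the same query answered by a linear scan over a list with a lookup against an index built once up front:

import random
from collections import defaultdict

# One million hypothetical expression records.
records = [
    {"gene": f"GENE{i % 5000}", "expression": random.random()}
    for i in range(1_000_000)
]

# Wrong structure for repeated queries: every lookup walks the whole list, O(n).
def linear_lookup(gene):
    return [r for r in records if r["gene"] == gene]

# Better structure: build an index once, then each lookup is O(1) on average.
index = defaultdict(list)
for r in records:
    index[r["gene"]].append(r)

def indexed_lookup(gene):
    return index[gene]

assert linear_lookup("GENE42") == indexed_lookup("GENE42")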

What has been the impact of biobanks – and the expanded nature of data types being collected, computed and queried?

I see biobanks as simply increasing the number of opportunities available to us. On the one hand, the first papers to come out in a new field, or after a major development, most often leave potential for subsequent re-analysis and re-working of the data sets to cover angles that were not looked at by the original authors.

On the other hand, and perhaps more exciting still, is the opportunity to combine data sets and data types and see more than you could see with just one modality. That is really important.

Moreover, you can use these expanded data sets to validate any number of new hypotheses by going back and analyzing the biobank data without having to go out and do the experiments yourself. 

Can you summarize your view of the ‘grand challenges’ in bioinformatics today?

That’s a tough one. I’ll answer in two ways – at a high level, genomics is about understanding ‘genotype to phenotype’ mapping. How is information encoded, how is it read out, and how is it then stored in the genome? That will take a while to sort out. 

In terms of less ambitious goals – I’m very interested in the regulatory code, and I think we can make significant progress on this. It’s a case of understanding how the expression levels of different genes are adjusted, and how that information is encoded within the genome through promoters and enhancers.

Specifically, in terms of single-cell analysis, new technology development is really driving the field. I have to keep my ear to the ground to understand these new technologies and how they can help us solve the day-to-day problems we have. For example, I recall that when the costs of cell isolation and RNA sequencing started to fall, researchers took advantage: new protocols emerged where pooled samples were sequenced ‘in bulk’, and the results were deconvoluted to identify individuals. The amount and complexity of the data increased dramatically. In 2016, the problem did not exist; then, around 2018, the first publications came out relating to these new experiments, and we started looking at enhancing analysis methodology. What’s become obvious over the years is that a difficult computational challenge can be totally solved by a better assay – or, conversely, a new assay can throw up a new, interesting and challenging bioinformatics and computational problem. The moral is clear: you need to be nimble.
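To make the deconvolution idea concrete, here is a toy sketch, not the published demultiplexing algorithms, with invented donors, SNPs and allele fractions: each pooled cell is assigned to the donor whose known genotypes best match the alleles observed in that cell’s reads.

import numpy as np

# Hypothetical donor genotypes: rows = donors, columns = SNPs,
# values = alternate-allele dosage (0, 1 or 2 copies).
donor_genotypes = np.array([
    [0, 2, 1, 0],   # donor A
    [2, 0, 1, 2],   # donor B
])

# For one cell: observed alternate-allele fraction at each SNP (NaN = no reads).
cell_alt_fraction = np.array([0.9, 0.1, 0.5, np.nan])

def assign_donor(cell_obs, genotypes):
    """Return the index of the donor whose expected allele fractions are
    closest (in squared error) to the cell's observed fractions."""
    expected = genotypes / 2.0            # dosage -> expected allele fraction
    covered = ~np.isnan(cell_obs)         # ignore SNPs with no read coverage
    errors = ((expected[:, covered] - cell_obs[covered]) ** 2).sum(axis=1)
    return int(np.argmin(errors))

print(assign_donor(cell_alt_fraction, donor_genotypes))  # 1, i.e. donor B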

Martin, thank you for your time, and for giving us a most interesting perspective on single-cell data analysis.

For more information on how our technology can help to transform your data analysis, as well as information on our REVEAL: SingleCell app, contact lifesciences@paradigm4.com.

Author bio

Martin Hemberg, Ph.D., is Assistant Professor of Neurology at Brigham & Women’s Hospital and a Member of the Faculty at Harvard Medical School, located at The Evergrande Center for Immunologic Diseases (a joint program of Brigham & Women’s Hospital and Harvard Medical School). Prior to this role, Martin Hemberg was a CDF Group Leader at the Sanger Institute.