Blog: The Cutting Edge

Network science: big data, strong connections, powerful possibilities

As the volume and complexity of data available to researchers increase, the potential to extract valuable insights grows. However, how datasets are structured, organized, and accessed is a critical consideration before true value can be realized.

Few people are more enthusiastic about the power of network science to solve intractable data analysis problems than Dr. Ahmed Abdeen Hamed, Assistant Professor of Data Science and Artificial Intelligence at Norwich University, Northfield, Vermont, USA. We’re talking to Ahmed to get his thoughts on why this approach is proving so valuable in the medical sciences (and elsewhere), and to hear his opinions on the evolution of artificial intelligence (AI).

Can we start by defining what we mean by ‘network science’?

It’s essentially a way of analyzing data relating to complex systems, by thinking of them as a set of distinct players/elements (known as nodes) laid out as a map, with the connections between these elements being the links.

Doing this makes it possible to analyze such systems mathematically and graphically – which in turn enables us to spot patterns in the network and derive useful conclusions.
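
To make the idea concrete, here is a minimal sketch in Python using the networkx library; the nodes and links below are invented purely for illustration, not taken from any real dataset.

```python
# A tiny illustrative network: nodes are entities, edges are their connections.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("drugA", "geneX"), ("drugA", "diseaseY"),
    ("drugB", "geneX"), ("geneX", "diseaseY"),
])

# Once the system is represented as a graph, standard mathematical tools apply:
print(nx.degree_centrality(G))                    # which nodes are most connected
print(nx.shortest_path(G, "drugB", "diseaseY"))   # how elements relate indirectly
```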

And that helps you to solve problems, right?

Absolutely, and not just any old problems! I’m particularly interested in intractable real-world problems – some are Millennium Prize Problems, for which a prize of one million dollars is pledged for the correct solution of any one of them; others are the sort that companies and governments spend millions of dollars trying to solve (e.g., drug discovery and cybersecurity).

These problems are often challenging because the answer is hidden amongst vast amounts of multimodal data – records of social media interactions, libraries of medical records from drug trials, databases of chemical properties, or combinations of these. If a problem is truly intractable, no existing algorithm running on today’s computers can be expected to terminate with an answer in any reasonable time. A big issue with conventional methods of analyzing such complex datasets, on the computers we possess today, is that the number of potential interactions scales exponentially with the size of the network.

Where I think Turing reduction can change this is by simplifying the problem so that it becomes solvable by known and novel computational algorithms in non-exponential time. This is exciting because our world is essentially made up of large systems of complex networks, so there are simply masses of applications out there ready for investigation by network analysis.

Like your work on drug repurposing, for example?

Exactly, and this is a great example of network analysis in action, because it shows how it can speed up the process of finding a drug to treat a given condition. Designing new drugs based on an understanding of the biology works well, but it’s a long haul from discovery, through clinical trials, to market approval, whereas finding existing, FDA-approved drugs that can be repurposed for the disease in question has many advantages.

How do you go about creating a network based on drug data?

Well, I thought: what if we could search through the literature for every known drug molecule and rank them based on their potential to treat a particular disease or condition? So, I teamed up with colleagues from pharma and academia (the authors of the paper ‘TargetAnalytica: A Text Analytics Framework for Ranking Therapeutic Molecules in the Bibliome’), and for over two years we worked on refining a molecule-ranking algorithm. We realized that the more specific a molecule is in targeting a tissue of a certain organ, a cell type, a cell line or a gene, the greater its potential as a possible treatment.

We hit upon drug specificity as the best metric to use and applied it to a dataset comprising already-published biomedical literature. The nodes in our network were the drug molecules – or, more strictly, ‘chemical entities’ – as the main players of the network. Then we linked them to other entities such as genes, diseases, proteins and cell types. A connection essentially reflected two entities being mentioned in the abstract of the same publication, weighted by the distance between the mentions.
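
To illustrate the co-mention principle, here is a toy sketch – this is not the TargetAnalytica algorithm itself, and the entity names and positions are invented. It assumes the entities have already been extracted from each abstract (e.g., by named-entity recognition), and links two entities whenever they appear in the same abstract, keeping the closest co-mention as the edge weight.

```python
# Build a co-mention network from abstracts with pre-extracted entity mentions.
import itertools
import networkx as nx

# Toy input: each abstract is a list of (entity, position-in-abstract) pairs.
abstracts = [
    [("remdesivir", 3), ("ACE2", 10), ("COVID-19", 12)],
    [("remdesivir", 5), ("COVID-19", 7)],
]

G = nx.Graph()
for mentions in abstracts:
    for (a, pos_a), (b, pos_b) in itertools.combinations(mentions, 2):
        distance = abs(pos_a - pos_b)
        if G.has_edge(a, b):
            # Keep the closest co-mention seen so far as the edge weight.
            G[a][b]["weight"] = min(G[a][b]["weight"], distance)
        else:
            G.add_edge(a, b, weight=distance)

for u, v, data in G.edges(data=True):
    print(u, v, data["weight"])
```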

And more recently you’ve moved on to apply the principle to COVID-19, too?

That’s right – with the arrival of the pandemic, suddenly we had a need to find a treatment for this relatively unknown serious illness. I realized this was an ideal application for another network medicine tool, because of the sheer amount of information that has been and will continue to be generated on COVID-19.

Specifically, in the two years since the pandemic hit, there have been well over 150,000 publications related to COVID-19, and these publications contain everything you need to feed into a drug-repurposing algorithm. This biomedical literature is a goldmine of information! The more publications you analyze, the more elements become connected, whether within the same publication or across publications. The complexity of the network evolves and is refined dramatically as you increase the size of the data input.

Can you tell us about your research into COVID-19 so far?

Well, we’ve developed a drug-repurposing algorithm called CovidX, which we introduced in August 2020. At the time, we used it to identify and rank 30 possible drug candidates for repurposing, and also validated the ranking outcomes against evidence from clinical trials. This work was published in the Journal of Medical Internet Research.

Post-CovidX, we’ve been working on even bigger datasets, and have recently had an article published in MDPI Pharmaceutics about the interplay of some drugs that may present themselves as a COVID-19 treatment. This study was on a much bigger scale (using a set of 114,000 publications) than CovidX. It certainly provides a way to stitch together all the evidence that’s been accumulated about treating the virus over the last two and a half years! I should also mention the role of clinical trial records, which mirror the map derived from the biomedical literature and validate its findings.

This all sounds very interesting, but on a practical level, how do you extract useful insights from these massively complex networks?

That’s a great question, because it strikes at the heart of what makes network science so powerful. I’ll use a simple analogy. 

Imagine you’re in a hospital, looking at the job roles of all the staff. Some members of staff interact with lots of people, like the receptionist or the ward manager, but others are more often involved as a close-knit group. For example, an anesthetist, a surgeon and a nurse would all be closely involved in the operating theater, pretty much every time. We call this strongly connected community a ‘clique’, and it tells us that this combination of roles does something important. Now, if we were to look at thousands of hospitals, we might find similar ‘cliques’ in each one, and overlaying them all enables us to pull out the roles that rank most highly in their association with a successful outcome for the patient. It’s basically about removing all the noise.

If we go back to our network on drug literature, our first task is to find all these ‘cliques’ amongst our drugs. We can then work out where these ‘cliques’ overlap, and this enables us to identify the drugs that are most important for a successful outcome. And that’s where you focus your effort.
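
As a rough sketch of that clique idea in Python (with invented drug names and links – this is not the published CovidX ranking method), one could find the strongly connected groups and then rank drugs by how many overlapping cliques they belong to:

```python
# Find maximal cliques in a drug network and rank drugs by clique membership.
from collections import Counter
import networkx as nx

# Hypothetical co-occurrence network of drugs (nodes) and their links.
G = nx.Graph()
G.add_edges_from([
    ("drugA", "drugB"), ("drugB", "drugC"), ("drugA", "drugC"),  # one clique
    ("drugC", "drugD"), ("drugD", "drugE"), ("drugC", "drugE"),  # another clique
])

clique_membership = Counter()
for clique in nx.find_cliques(G):   # maximal cliques
    if len(clique) >= 3:            # ignore trivial pairs
        clique_membership.update(clique)

# Drugs appearing in the most (overlapping) cliques rank highest.
print(clique_membership.most_common())
```

In this toy example, drugC sits in both cliques, so it rises to the top of the ranking – the overlap is what signals importance.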

And you suggested earlier that network analysis has lots of other applications too?

Yes indeed! As long as you have a sufficiently large dataset and you’re able to identify the entities that you’re going to use to build your network map, you can turn it to a myriad of problems. I love it, not just because the mathematics is beautiful, but because it allows us to reduce highly complex, intractable problems down to simpler ones that can be addressed without the exponential blow-up in time.

Applications are really broad – from genetics to ecology, and from sociology to economics. But to come back to pharma, what makes it such a fertile area for exploiting the power of network science is that everything is documented – whether that’s in published reports, internal documents, or electronic lab notebooks, the data is all there. And not only that, but the data is high-quality, too. So, in a way, the quality of the data has inspired the quality of the science, and obviously in the medical arena, this can help to save lives.

Coming back to your passion for problem-solving, what’s your opinion about how well the AI part of computer science is doing?

Well, this might be a bit controversial, but to be honest I’m a bit disappointed with the progress that’s been made since the benchmark set by Deep Blue in 1997. If we could teach a computer to play chess 25 years ago, then why – with the vastly greater computing power we’ve now got at our disposal – can we not develop a supercomputer that could solve some genuine problems? If we created an AI that specialized in, let’s say, cancer or Alzheimer’s to the same level as Deep Blue specialized in chess, could we find an effective treatment? Why is that proving so hard?

But do you think there’s light at the end of this tunnel?

Yes, I do, and it’s all to do with the possibilities opening up in quantum computing. Now I’m not a specialist in this area, but it’s clear to me that because it works in a very different way, it should allow us to tackle previously intractable problems – what’s known as ‘quantum supremacy’.

And what’s exciting is that this computing power should be accessible to programmers everywhere, because it’s being interfaced with regular programming languages like Microsoft’s Q# language. I envision that within the next five to ten years, we’ll see this computing revolution become a reality. From my perspective as a network analyst, I’m really hoping it does, because it should allow us to remove the current limit on the number of elements we can have in a network. I would certainly like to get my hands on a quantum computer once the technology has matured!

Ahmed, it’s been a pleasure to talk to you, and thank you for sharing with us the power and potential of network analysis.

It’s clear that there are challenges for pharmaceutical researchers, whether they are using network analysis or machine learning approaches. To reap the benefits of either approach, the fundamentals of data storage, flexible access, and computational/analytical tools must be considered. At Paradigm4, we are tackling the data storage issues through our novel elastic cloud file system, flexFS, for more resource-efficient data storage and sustainable computing, to help accelerate drug discovery efforts. For more information on how our technology can help to transform your data analysis, contact lifesciences@paradigm4.com. 

For more about the work of Dr. Hamed at Norwich University, read this article on the university’s website. Disclaimer: These views are Ahmed Abdeen Hamed’s own views and not those of Norwich University.

Bio

Dr. Ahmed Abdeen Hamed is Assistant Professor of Data Science and Artificial Intelligence at Norwich University, Northfield, VT, where since 2019 he has led the university’s data analytics courses and carried out research using algorithms and computational methods to address a wide range of global issues. Prior to this, he worked in the private sector, where he designed network-based drug-ranking algorithms that are now patented with him as the first inventor.

Tackling complexity: making sense of genomic data

We’re talking to Professor John Quackenbush from the Harvard T.H. Chan School of Public Health, who throughout his 30-year career has addressed big questions in human health by combining his twin passions of genomics and data analysis.

Can we start by discussing how you got into genomics?

Well, I’ve had an interesting career journey. I got my Ph.D. in theoretical physics in 1990, and I thought that’s what I was going to be doing for the rest of my life. But shortly after I finished my degree, funding for physics research dried up because of the break-up of the Soviet Union and the end of the Cold War.

At the time, a friend who was finishing her Ph.D. in molecular biology was using polymerase chain reaction (PCR) to study hormonal regulation of gene expression in cockroaches. I started to help her analyze data, and I began to see that there was this whole universe of biology that had been invented since I was at high school – with data that people like me could analyze. And of course, this was when the human genome project was getting off the ground, so I applied for a five-year fellowship, and it all followed from there!

It’s been an interesting personal journey then?

I’ve been lucky to have been to all these truly outstanding places doing work at the forefront of biology and genomics. I used to joke about being a Harvard professor one day, and now I am chair of Harvard’s department of biostatistics – despite not being a trained statistician. 

But I think my journey indicates a point of fundamental importance in genomics and biomedical research in general – that there’s a need for a broad range of people and techniques so we can take massive quantities of data and turn that into knowledge, and then convert that knowledge into understanding. We definitely need people who can combine computational, statistical, and biological perspectives – and I’ve always been very keen to foster that ‘community’ of researchers myself.

Let’s talk about the science. How have you seen genomic science evolve since the 1990s?

Well, it’s been incredibly exciting to have been working in a field that has been transformed so fundamentally – and which is still undergoing massive change. For example, technology is advancing at such a pace that today we can assemble datasets that give us a foothold in addressing questions that were unanswerable even two or three years ago.

But what’s fascinating is that the field hasn’t evolved in the way that some originally thought it might. When the first human genome was sequenced 20 years ago, people were saying things like “now we’ve identified all the genes, we’ll be able to find the root cause of all human diseases”. But it wasn’t quite so simple. Biological systems are massively more complex than we imagined, and making sense of them requires enormous datasets. Even now, we’re only just starting to grapple with the vast amount of genetic variation in the human population.

There’s a study I always point to that illustrates this nicely, by a consortium called GIANT, which looked at the height of just over 250,000 people. They found that 697 genetic variants could explain 20% of the variation in human height, but to get to 21% they needed 2,000 variants, to get to 24% they needed 3,700, and to get to 29% they needed almost 10,000! Height is something we know has a genetic component, but it is controlled by many variants, most of which have very, very small effects.

So, what can you do with your computational methods that tackles that complexity?

We recognized early on that what distinguishes one cell type from another isn’t just an individual gene turning on or off, but the activation of coordinated programs of genes. Early studies of ‘gene expression’ looked for genes that were correlated in their expression levels in ways that differed between disease or physical states.
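
As a toy illustration of that early correlation-based approach – the gene names and expression values below are made up – the question is simply whether two genes co-vary differently in one state than in another:

```python
# Compare how two genes co-vary across samples in two different states.
import numpy as np

# Expression of two hypothetical genes across 5 healthy and 5 disease samples.
healthy = np.array([[1.0, 1.2, 0.9, 1.1, 1.0],    # geneA
                    [2.0, 2.3, 1.8, 2.2, 2.1]])   # geneB
disease = np.array([[1.0, 1.4, 0.8, 1.3, 0.9],
                    [2.2, 1.7, 2.4, 1.6, 2.3]])

r_healthy = np.corrcoef(healthy)[0, 1]
r_disease = np.corrcoef(disease)[0, 1]
print(f"geneA-geneB correlation: healthy={r_healthy:+.2f}, disease={r_disease:+.2f}")
```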

Unfortunately, this doesn’t explain why the genes are differentially expressed, and we’ve wrestled with approaches to answer that question over the years. A major turning point in my thinking came a few years ago, when I became aware of a 1997 paper by two workers at IBM, who showed that introducing domain-specific knowledge could dramatically improve performance of complex optimization algorithms. Their paper is relevant to genomics because for the last 30 years, people have been trying to apply generic ‘black box’ algorithms to their data, and yes, this might work for a particular situation. But when you try to generalize the methods, they often fail, and this paper told me why. The paper told me that if we wanted to understand how genes were regulated, we should instead aim to understand what drives the process and use that as a starting point. Ultimately, this idea led us to develop computational ‘network methods’ that model how genes are controlled, and this has dramatically expanded our understanding of how and why diseases develop, progress, and respond to treatment.

And what can these network approaches tell you about gene regulation?

Well, it’s been an incredibly fruitful area of study. We start by guessing a network using what we know about the human genome, where the genes are, and where regulatory proteins called transcription factors can bind to the DNA. We then take other sources of data about protein interactions and correlated gene expression in different biological states and use advanced computational methods to optimize the structure of the network until it’s consistent with all the data. We can then compare networks to look for differences between different states, such as health and disease, that tell us what functions in the cell are activated in one or the other. 
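
To illustrate just that final comparison step in the simplest possible terms – this is not PANDA or LIONESS, and the transcription factors, genes and edge weights below are invented – the idea is to look for regulatory edges whose weights change between two states:

```python
# Compare two inferred TF-by-gene regulatory networks edge by edge.
import numpy as np

tfs = ["TF1", "TF2"]
genes = ["geneA", "geneB", "geneC"]

# Hypothetical edge-weight matrices inferred for two conditions.
healthy = np.array([[0.9, 0.1, 0.4],
                    [0.2, 0.8, 0.3]])
disease = np.array([[0.3, 0.1, 0.9],
                    [0.2, 0.2, 0.7]])

diff = disease - healthy
for i, tf in enumerate(tfs):
    for j, gene in enumerate(genes):
        if abs(diff[i, j]) > 0.4:   # arbitrary threshold, purely for illustration
            print(f"{tf} -> {gene}: change of {diff[i, j]:+.1f}")
```

Edges that gain or lose weight between the two conditions point to the regulatory processes that differ between, say, health and disease.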

As an example of what such models can reveal, we built network models for 38 tissue types, and we found a three-tiered structure. First, there was a core set of regulatory processes which were essentially the same in every tissue. Then there was a periphery of processes that were often unique to every tissue. But there was a middle layer in which regulatory paths changed to activate the processes that made one tissue different from another tissue. And such an activating layer is where you want to look when you think about interventions to address the causes of disease.

To take that a step further, we’ve now created and catalogued nearly 200,000 gene regulatory networks, including many from drug response studies. Using that resource, we can ask whether there’s a particular drug that alters the regulation in a way which might make a disease cell more like a healthy cell. That could be a powerful shortcut to matching diseases to drugs.

It sounds like practical applications may not be far off. What prospects are there of using your approaches for personalized medicine?

I’d say it’s early days yet – there’s a big gap between using these tools in the research lab and ensuring they are reliable enough to be used in the clinic. But we’re committed to making the methods accessible, and so all our methods are open-source, and we try to make them available in multiple programming languages like R, Python, MATLAB, and C. So, any researcher can take our network methods and use them on their own applications! Our goal is to help move the field forward.

In terms of areas we’ve studied ourselves, my colleagues and I have looked at cancer, Alzheimer’s disease, chronic obstructive pulmonary disease, interstitial lung disease and asthma, amongst others. In addition, one of the underlying themes for us is the differences that exist between males and females in gene regulatory processes that can help us understand disease evolution and inform treatment.

Surely gender differences have been well-studied?

Well kind of – we know there are differences, but we don’t know why. In fact, sex differences in disease are among the most understudied problems in biology. For example, in colon cancer we know that there are differences in disease risk, development, and response to therapy between the sexes. However, when we examine gene expression in tumors, there’s very little difference between male and female samples and none that explain the clinical differences. 

But, using our network inference and modeling approach, we do find differences in the regulatory networks – for example, those influencing drug metabolism. This of course allows us to identify possible ways in which we can treat disease in males and females differently to increase drug efficacy. This is something we’d never have been able to do before taking this integrated approach.

So how quickly can you run a network-building algorithm these days?

Pretty fast, and it’s getting faster. For example, in a study of 29 different tissues, we had to create 19,000 networks. When we ran the methods we call PANDA and LIONESS to generate the initial networks for publication, it took six months. Then the reviewers asked us to make some changes, by which time we’d made some software improvements and had access to new hardware, so the second run took six weeks. Now we can do it in six days!

That’s an amazing change in turnaround time. But why do you conduct analyses on such a large scale?

Most experiments in science are based on hypothesis testing, meaning we have some evidence and analysis that suggests a particular factor is associated with some measurable outcome, so we design an experiment to see if the factor and the outcome are related. But in associating gene regulation across the genome with particular outcomes, it is hard to know where to look or what to test. In studying the inference and analysis of gene regulatory networks, there are so many ‘unknown unknowns’ that there’s value in simply taking your tools and seeing what you can discover. If you have a large sample size, you can have greater confidence in the things that you eventually find, and it makes it easier to tease out small but meaningful signals. 

An analogy I like is that what we are doing is similar to what Galileo did when he turned his telescope on Jupiter. He took new technology and made observations that led him to conclude that the planet had moons, which indirectly helped to confirm the heliocentric model of the solar system. Our new technologies are big data sets and the advanced computational tools that we have developed. Using these, we hope to identify genomic relationships that wouldn’t be found if we constrained ourselves to a handful of genes or simply looked at which ones change their expression. The deeper regulatory connections we find in networks across all 25,000 genes provide new insights that lead us to develop more targeted questions and testable hypotheses.

With your perspective of 30 years in the field, what excites you most about the future for analysis of genomic data?

Two things stand out for me. The first is multi-modal data, which is one thing I know Paradigm4 is tackling. It’s about combining genomic data, imaging data, and patient metadata with computational methods to create multi-tiered models that encompass genetic variants, transcription, gene expression, methylation, microRNAs, and more. Bringing all these tools together gives us a better chance of extracting insights from the fantastically complex public datasets that are now available.

The second is single-cell sequencing, which I find very interesting. If you take a chunk of tissue and genetically profile it, you’re only ever going to uncover what happens on average. But if you take single-cell data like Paradigm4 has been doing, then you can probe what happens during the transition from a healthy state to a diseased state. That offers fantastic prospects for uncovering the mechanism of disease.

John, thanks very much for your time, and for giving us some fascinating perspectives on genomics and computational biology. 

Single-cell sequencing and integrative analytics are both areas of great importance and opportunity as we look towards the future of genomic data and precision medicine. At Paradigm4, we’re utilizing these methods to support scientists to extract insights from datasets, through our purpose-built platform and suite of REVEAL apps which helps users to build a multimodal understanding of disease biology. The technology allows users to ask more questions quickly, through a scalable, cost-effective solution. For more information on how we can help to transform your data analysis, contact lifesciences@paradigm4.com. 

Bio

John Quackenbush is Professor of Computational Biology & Bioinformatics, and Chair of the Department of Biostatistics at the Harvard T.H. Chan School of Public Health in Boston, Massachusetts. He also holds professorships at Brigham & Women’s Hospital and the Dana-Farber Cancer Institute. 

Prof. Quackenbush’s Ph.D. was in Theoretical Physics, but in 1992 he received a fellowship to work on the Human Genome Project. This led him through the Salk Institute, Stanford University, The Institute for Genomic Research (TIGR), and ultimately to Harvard in 2005. Prof. Quackenbush has published more than 300 scientific papers, which have attracted over 84,000 citations; he was recognized in 2013 as a White House Open Science Champion of Change.

References

1 Wood, A., Esko, T., Yang, J. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet 46, 1173–1186 (2014). https://doi.org/10.1038/ng.3097

2 Wolpert, D. H. and Macready, W. G. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82 (1997). https://doi.org/10.1109/4235.585893

The role of computational biology in drug discovery: an inside perspective

We talk to Dr. David De Graaf, CEO of biotech company, Abcuro, and for over 20 years a leading light in the field of computational/systems biology.

Let’s start by talking about what you’re doing at Abcuro?

We’re focused on developing two clinical leads – one targeting an autoimmune muscle-wasting disease known as inclusion body myositis, and the other targeting cancer cells. In both cases, the targets are cytotoxic T cells that express a receptor known as KLRG1. For inclusion body myositis, we’re developing an anti-KLRG1 antibody that can selectively deplete the cytotoxic T cells present in muscle tissue, effectively removing the source of immune attack. For cancer, by contrast, we want to turn on these cytotoxic T cells and direct them towards the tumor – which we’re doing using a different antibody. It’s challenging to extract value from cancer programs because, in many cases, there is no clear proof of efficacy until you’re in phase 2.

And how did you identify that KLRG1 target?

That’s a great question because it relates right back to bioinformatics. The founder of Abcuro is Dr. Steven Greenberg, who is a neurologist. He was being referred patients with inclusion body myositis, but he was becoming very frustrated, because there was absolutely nothing that could be done for them. So, he decided to change that. After taking time out to study bioinformatics at MIT, he came back and set up a small lab. He took muscle biopsies from these patients, cut out the invading T cells that were eating away at the muscle fiber, and used bioinformatics to find a marker in those cells that was selective for those T cells and spared the rest of the immune system. The target that he found was KLRG1, which is the one we’re using today to deplete these cytotoxic T cells.

Are you still doing computational biology today?

No, but we’d like to! The thing is, for a small biotech company like ours, drug discovery is not a big driver, because that’s not where the business value is. The reality is that drug pipelines don’t pay the bills; successful drugs do. And so, companies like ours must be selective: we have to focus on some very specific questions, and it doesn’t matter quite so much how long it takes to get answers. The important thing is that you work out exactly what’s going on. There’s no grand hypothesis in our discovery efforts that would be amenable to large-scale computational studies.

What do you think about companies offering specialist computational biology tools? Do they have a viable business model?

It’s very difficult. The problem with selling computational tools is that your business often becomes about the tool, but that’s not really what the customer cares about, is it? Instead, it should be about what you do with it and how you make a difference.

When I was at Selventa, I remember a trip that we made to a big pharma company, and they looked at the computational tool we’d created, and they said: “Hmm, it’s a great tool, it’s a very efficient and interesting way of analyzing data. But, to validate it, we want you to go through the data set that we’ve already analyzed”. 

Now this sort of situation nearly always ends in failure, because there are essentially two outcomes. The first one is you find exactly what they already found, and then they say: “that’s very impressive, you did it in three days rather than three months but, in reality, that difference in timescale doesn’t matter much to us – so no thanks”. Or even worse is if you find something new, and then they go on the defensive and say: “we’ve got great people here, are you telling us they’ve been doing it all wrong?”. So, either way, it becomes really hard to show the added value of what you’re providing.

In your opinion, what are the benefits of commercializing computational biology tools?

There are plenty. Top of the list is that, thanks to computational biology, we’ve got a lot better at defining diseases in terms of what’s going on at a molecular level. And that’s important because patients don’t come into the clinic and complain that their PI3 kinase hurts. We need to continue to do that, and make connections through the whole value chain, from identifying symptoms, to understanding the molecular basis of disease, through to drug discovery. Now, that’s a grand aim, so you’d need to carve out something narrower, but there are plenty of opportunities. For example, we don’t know why patients don’t always benefit as expected from gene therapy. And we don’t really know why people get autoimmune diseases, with a case in point being long Covid. There’s a lot to be gained if you’re prepared to focus on such questions.

Are there any other ways you see computational tools making a big difference?

When I was in big pharma, we often did the exact same assay 15 or 20 times, because it was always much easier to regenerate the data than it was to find the old data and assess the conditions used. That remains a huge inefficiency. I think that computational tools that annotate relevant data, and allow you to search across it, could really pay off.

Another opportunity is being able to take externally curated content and understand it in the context of your own experiments. Then there’s the concept of making a link between patients with a similar molecular phenotype, rather than talking about them purely as a sort of observational phenotype.

And finally, being able to extract value from studies that haven’t worked has massive potential. People don’t think about the fact that 99% of the money in the pharma industry is spent on things that end up going wrong. But I don’t believe it’s right to simply forget about that work – we should bring it together and get some insights from it. I think all these things have the potential to be huge drivers in accelerating drug discovery and development.

You implied an issue with data flow in your previous role, can you explain more about that and what it means for those wanting to commercialize computational biology tools?

Data flow is absolutely an issue, and I’ll give you another example from my time at Selventa. We’d worked with a big pharma company to analyze gene expression data from a couple of their clinical trials to help them decide whether to move particular candidates forward. At the time, they had a poorly defined mechanism of action, and didn’t understand why certain patients responded and others didn’t. Anyway, we crunched the numbers, we figured it out, it was a really nice, productive collaboration, and the CEO invited us over for a meeting, and we thought: “great, we’re in there, these guys want to do a deal with us”.

But it didn’t turn out like that. They took us out the night before, then the next day we talked business, and it all fell to pieces. We had this excruciating 45-minute conversation with the CEO, who spent the entire time apologizing, explaining that because we’d analyzed all their genetic and genomic data, they were going to be busy for the next two or three years dealing with it all. Although they wanted to retain a relationship, they weren’t going to have any new data for us. 

My conclusion is that it’s very difficult to sell a specialist computational tool to a pharma company unless they’ve got a continuous flow of data to feed into it – and not many do. To make it worth their while to construct that information infrastructure, they need to be able to make full use of it in the long-term.

It’s interesting you say that, especially as Paradigm4 has focused on translational medicine, where the data flow is huge and continuous. Proteomics is also an emerging gold mine of data! How would you say those challenges you mention can be tackled?

One approach is to refine your core pitch and make it something that pharma companies can’t do. The problem is that often, the one thing that they can’t do is to use these specialized tools. You end up building a service organization on top of your software offering, and those don’t scale particularly well. It’s very, very hard to extract value from them.

Another way is to provide your own content – for example, to curate and annotate all the data that’s in the public domain and sell it. Now, there may be a continuous need to have that data but, to be frank, the value of that is very small, because it’s not unique to a particular company.

In summary, I think it’s very hard to develop a good business model for these companies. In the research setting, customers will say: “this tool is fantastic, but we haven’t got the data flow to give us good return on investment”, and in the drug development setting, they say: “this tool is still fantastic, but we’re only going to use a little bit of it, so it’s not going to give us return on investment”.

So, it seems you’re saying that these software companies need to have a different business model to be viable in the long term?

Yes, and you can state it in three words. Analyze clinical data. That’s where the money is, and that’s where important decisions are made. It’s true that supporting research using your software platform can be a great way to build confidence. But you’re not going to make money until you’re in the clinic.

Fair enough, but what about the regulatory side?

That’s a good point, and it comes down to the scope of your analysis. Again, thinking about things from the pharma perspective, one of their worries is that if you look at more and more parameters, you’re going to find something that looks wrong. And that has huge implications for your path through clinical regulation. For example, in a preclinical toxicology model I worked on, we saw induction of a single cytokine out of about 30. That cytokine then needed to be tracked through phase one, phase two and phase three – even though it was completely irrelevant. It even ended up on the label, despite there being absolutely zero evidence that there was any issue with it. Now, the way I’d approach that is not by saying we shouldn’t generate the data, but by saying upfront: “here are some very specific things we’ll be looking for”. That way, you eliminate the risk of automatically flagging up differences that ultimately don’t matter.

David, thank you for your time and for giving us some fascinating perspectives on computational biology. 

An interesting take on many aspects of the industry and one important point we’ve taken away is that many scientists struggle to use computational tools in their current form, often needing support from software developers. At Paradigm4, we work closely with our customers so that scientists can use the computational tools to their advantage, asking complex questions and assessing key biological hypotheses more efficiently and independently. Our platform is what we like to call ‘science-ready.’ We want to help scientists make a difference with their data. For more information on how our technology can help to transform your single-cell data analysis, contact lifesciences@paradigm4.com.

Bio

Dr. David De Graaf is CEO of Abcuro, a clinical-stage biotechnology company that he joined in late 2020, after having had a string of positions in the biotech sector, including CEO of Comet Therapeutics and Syntimmune, and leadership roles in systems and computational biology at Pfizer, AstraZeneca, Boehringer-Ingelheim, Selventa and Apple Tree Partners. He holds a Ph.D. in genetics from the University of Illinois at Chicago.

Abcuro is developing antibody treatments for autoimmune diseases and cancers modulated by cytotoxic T and NK cells, including ABC008 for treatment of the degenerative muscle condition known as inclusion body myositis, and ABC015 for reactivating bodily defenses against tumors. 

The evolution of computational genomics: solving bioinformatics challenges

In this, the first of a series of opinion leader interviews, we talk with Dr. Martin Hemberg, Ph.D., Assistant Professor of Neurology, Brigham & Women’s Hospital and Member of the Faculty, Harvard Medical School, located at The Evergrande Center for Immunologic Diseases (a joint program of Brigham & Women’s Hospital and Harvard Medical School).

Dr. Hemberg’s field of research is computational genomics. He develops methods and models to help understand gene regulation. He also has a long history of collaborating with experimental groups, mainly in neurobiology, assisting them to analyze and interpret their data. 

Dr. Hemberg started out studying mathematics and physics, but towards the end of his undergraduate studies decided that biology was going to be his route into research. As a result, he carried out his postgraduate and Ph.D. work in theoretical systems biology at Imperial College London under the supervision of Professor Mauricio Barahona. He was then a post-doc in the Kreiman lab at Boston Children’s Hospital, working on analyzing ChIP-seq and RNA-seq data. In 2014, Dr. Hemberg moved to Cambridge, UK, to set up a research group at the Wellcome Sanger Institute with the goal of developing methods for analyzing single-cell RNA-seq data.

Following relocation to the USA in 2021 to establish a new research group in Boston, Dr. Hemberg is now embarking on the next stage of his career.


How did you first get involved in analyzing sequencing data?

With my background in theoretical systems biology, it was interesting to come across a team of ‘experimentalists’ in the lab next door to mine at Boston Children’s Hospital. It was just around the time when second-generation sequencing technology was becoming available, and my neighbors were optimizing their ChIP-seq and RNA-seq protocols but had no idea how they were going to analyze the data. So, I got involved, and worked on this for the duration of my post-doc contract.

Then, thinking back to my experience with single-cell analysis at Imperial – partly mathematical, not genome-wide, just looking at individual cells – it seemed clear to me that these two fields were going to converge. For my first Principal Investigator post (at the Wellcome Sanger Institute), I pitched ideas around systems biology, mathematical biology, and biophysical approaches using single-cell RNA-seq data. However, as we started to do this, it quickly became clear that we did not really understand the data, so we needed to switch to method development to enable us to process the data and separate the signal from the noise. This became the focus of the lab: developing novel methods to analyze single-cell data sets.


How did this develop, and what is driving your work now?

It has been an interesting journey. We have two types of projects. The first I would call ‘in house’ projects, where we work on methods and analyze public datasets independently to improve analysis or come up with novel ways of approaching the data. The second group of projects is ‘applied’: we collaborate with external groups and help them analyze their data. There is a really nice synergy here. Often, I will be talking with someone about their problems and realize that we have a new method that can offer a solution. It works the other way around, too, when we realize that a specific issue faced by one group is actually a generic problem, and we can then work to find a solution for everyone.

This work has now transferred to Boston, where I am setting up a new group to work using the same template – the same approach. I moved during 2021, and I can definitely advise colleagues that it is not a good idea to move continents with your family, change jobs and try to set up a new team during a pandemic! 

What we will eventually be doing is defining some of the key problems – from an informatics point of view, from a technology point of view and, just as importantly, from a biological point of view – looking at where there are currently no good solutions, and working to develop methods and tools to solve these challenges and deal with the ever-increasing volume of data being produced. It is a very exciting prospect. Clearly, as data set size increases, you need to rely on more sophisticated software and data processing tools that can cope with the scale.


One of your most cited papers is a set of guidelines for the computational analysis of single-cell data. Why has this been so important?

Initially, we were thinking only about ourselves and our own research but, as we were working out how best to approach our data, I thought we should write it down and put it into a form where it might help others. In addition to the publication, we also formatted the information as an online course. The feedback has been great – every time I am at a conference, people mention how they learned to handle their data using our principles.


What plans do you have to update this work?

It’s definitely a work in progress. We just hit 10,000 visitors a month on the course portal, so I’m absolutely committed to keeping it as current as possible. In the past, we updated it only occasionally, as time allowed, but now we have a dedicated person who will be looking at it as part of their day-to-day responsibilities.


What is your view regarding the best ways to organize the data?

I always tell my students, many of whom have more of a biology background than experience in mathematics or informatics, that there is usually a module taught in computing courses called ‘algorithms and data structures’. And there is a very good reason those two things are taught together: if you have the right data structure, the algorithm becomes trivial, whereas if you have the wrong data structure, the algorithm becomes a ton of work!

In addition, a more practical consideration is data set size. If you have 10 MB, you can store it on a hard drive and do a linear search to get what you want, but if you have 10 GB, you need much more efficient ways of accessing and querying your data. There is absolutely no doubt that this is one of the keys as you scale up to very large data sets – which is really driving the field, so it’s critical.
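
As a toy illustration of that point – the records and values below are invented – the same query can be answered by scanning every record or, with the right structure (here a simple index), answered directly:

```python
# Linear scan versus indexed lookup over the same toy records.
records = [("geneA", 12.5), ("geneB", 3.1), ("geneC", 7.8)]

# Linear search: fine for a small file, but O(n) work per query.
def lookup_linear(name):
    for gene, value in records:
        if gene == name:
            return value
    return None

# Indexed lookup: build the index once, then each query is O(1).
index = {gene: value for gene, value in records}

def lookup_indexed(name):
    return index.get(name)

assert lookup_linear("geneB") == lookup_indexed("geneB") == 3.1
```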

What has been the impact of biobanks – and the expanded nature of data types being collected, computed and queried?

I see biobanks as simply increasing the number of opportunities available to us. On the one hand, the first papers to come out in a new field, or after a major development, most often leave potential for subsequent re-analysis and re-working of the data sets to cover angles that the original authors did not look at.

On the other hand, and perhaps more exciting still, is the opportunity to combine data sets and data types and see more than you could see with just one modality. That is really important.

Moreover, you can use these expanded data sets to validate any number of new hypotheses by going back and analyzing the biobank data without having to go out and do the experiments yourself. 

Can you summarize your view of the ‘grand challenges’ in bioinformatics today?

That’s a tough one, so I’ll answer in two ways. At a high level, genomics is about understanding ‘genotype to phenotype’ mapping: how is information encoded, how is it read out, and how is it stored in the genome? That will take a while to sort out.

In terms of less ambitious goals, I’m very interested in the regulatory code, and I think we can make significant progress on this. It’s a case of understanding how the instructions for adjusting the expression levels of different genes are encoded within the genome through promoters and enhancers.

Specifically, in terms of single-cell analysis, new technology development is really driving the field. I have to keep my ear to the ground to understand these new technologies and how they can help us solve the day-to-day problems we have. For example, I recall that when the costs of cell isolation and sequencing RNA started to fall, researchers took advantage, new protocols emerged where pooled samples were sequenced ‘in bulk’, and the results were deconvoluted to identify individuals. The amount and complexity of the data increased dramatically. In 2016, the problem did not exist, then around 2018 the first publications came out relating to these new experiments, and we started looking at enhancing analysis methodology. What’s become obvious over the years is that a difficult computational challenge can be totally solved by a better assay – or, conversely, a new assay can throw up a new, interesting and challenging bioinformatics and computational problem. The moral is clear: you need to be nimble.

Martin, thank you for your time, and for giving us a most interesting perspective on single-cell data analysis.

For more information on how our technology can help to transform your data analysis, as well as information on our REVEAL: SingleCell app, contact lifesciences@paradigm4.com.

Bio

Martin Hemberg, Ph.D., is Assistant Professor of Neurology at Brigham & Women’s Hospital and a Member of the Faculty at Harvard Medical School, based at The Evergrande Center for Immunologic Diseases (a joint program of Brigham and Women’s Hospital and Harvard Medical School). Prior to this role, he was a CDF Group Leader at the Wellcome Sanger Institute.