Blog: The Cutting Edge
We’re talking to Professor John Quackenbush from the Harvard T.H. Chan School of Public Health, who throughout his 30-year career has addressed big questions in human health by combining his twin passions of genomics and data analysis.
Can we start by discussing how you got into genomics?
Well, I’ve had an interesting career journey. I got my Ph.D. in theoretical physics in 1990, and I thought that’s what I was going to be doing for the rest of my life. But shortly after I had my degree, funding for physics research dried up because of the break-up of the Soviet Union and the end of the Cold War.
At the time, a friend who was finishing her Ph.D. in molecular biology was using polymerase chain reaction (PCR) to study hormonal regulation of gene expression in cockroaches. I started to help her analyze data, and I began to see that there was this whole universe of biology that had been invented since I was in high school – with data that people like me could analyze. And of course, this was when the Human Genome Project was getting off the ground, so I applied for a five-year fellowship, and it all followed from there!
It’s been an interesting personal journey then?
I’ve been lucky to have been to all these truly outstanding places doing work at the forefront of biology and genomics. I used to joke about being a Harvard professor one day, and now I am chair of Harvard’s department of biostatistics – despite not being a trained statistician.
But I think my journey illustrates a point of fundamental importance in genomics and biomedical research in general – that there’s a need for a broad range of people and techniques so we can take massive quantities of data and turn them into knowledge, and then convert that knowledge into understanding. We definitely need people who can combine computational, statistical, and biological perspectives – and I’ve always been very keen to foster that ‘community’ of researchers myself.
Let’s talk about the science. How have you seen genomic science evolve since the 1990s?
Well, it’s been incredibly exciting to have been working in a field that has been transformed so fundamentally – and which is still undergoing massive change. For example, technology is advancing at such a pace that today we can assemble datasets that give us a foothold in addressing questions that were unanswerable even two or three years ago.
But what’s fascinating is that the field hasn’t evolved in the way that some originally thought it might. When the first human genome was sequenced 20 years ago, people were saying things like “now we’ve identified all the genes, we’ll be able to find the root cause of all human diseases”. But it wasn’t quite so simple. Biological systems are massively more complex than we imagined, and making sense of them requires enormous datasets. Even now, we’re only just starting to grapple with the vast amount of genetic variation in the human population.
There’s a study I always point to that illustrates this nicely. A consortium called GIANT looked at the height of just over 250,000 people [1]. They found that 697 genetic variants could explain 20% of the variation in human height, but to get to 21% they needed 2,000, to get to 24% they needed 3,700, and to get to 29% they needed almost 10,000! Height is something we know has a genetic component, but it is controlled by many variants, most of which have very, very small effects.
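The diminishing returns in those figures can be made concrete with a little arithmetic. The short Python sketch below uses only the rounded numbers quoted above, with "almost 10,000" taken as exactly 10,000 (an assumption), and works out how many extra percentage points each additional 1,000 variants buys. The names `GIANT_STEPS` and `marginal_gain_per_thousand` are illustrative only.

```python
# Rounded figures quoted in the interview: (number of variants, % of height variation explained).
# "Almost 10,000" is approximated as 10,000 for illustration.
GIANT_STEPS = [(697, 20.0), (2000, 21.0), (3700, 24.0), (10000, 29.0)]


def marginal_gain_per_thousand(steps):
    """Return the extra % explained per 1,000 additional variants for each step."""
    gains = []
    for (n_prev, pct_prev), (n_next, pct_next) in zip(steps, steps[1:]):
        extra_variants = n_next - n_prev
        extra_pct = pct_next - pct_prev
        gains.append(1000.0 * extra_pct / extra_variants)
    return gains


if __name__ == "__main__":
    first_n, first_pct = GIANT_STEPS[0]
    print(f"First {first_n} variants: {1000.0 * first_pct / first_n:.1f}% per 1,000 variants")
    for (n_prev, _), (n_next, _), gain in zip(
        GIANT_STEPS, GIANT_STEPS[1:], marginal_gain_per_thousand(GIANT_STEPS)
    ):
        print(f"{n_prev} -> {n_next} variants: {gain:.2f}% per 1,000 variants")
```

Under these rounded figures, the first 697 variants contribute at a rate of roughly 29 percentage points per 1,000 variants, while every later step adds fewer than 2 percentage points per 1,000.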
So, what can you do with your computational methods that tackles that complexity?
We recognized early on that what distinguishes one cell type from another isn’t just an individual gene turning on or off, but the activation of coordinated programs of genes. Early studies of ‘gene expression’ looked for genes that were correlated in their expression levels in ways that differed between disease or physical states.
Unfortunately, this doesn’t explain why the genes are differentially expressed, and we’ve wrestled with approaches to answer that question over the years. A major turning point in my thinking came a few years ago, when I became aware of a 1997 paper by two researchers at IBM [2], who showed that introducing domain-specific knowledge could dramatically improve the performance of complex optimization algorithms. Their paper is relevant to genomics because for the last 30 years, people have been trying to apply generic ‘black box’ algorithms to their data, and yes, this might work for a particular situation. But when you try to generalize the methods, they often fail, and this paper told me why. It told me that if we wanted to understand how genes were regulated, we should instead aim to understand what drives the process and use that as a starting point. Ultimately, this idea led us to develop computational ‘network methods’ that model how genes are controlled, and this has dramatically expanded our understanding of how and why diseases develop, progress, and respond to treatment.
And what can these network approaches tell you about gene regulation?
Well, it’s been an incredibly fruitful area of study. We start by guessing a network using what we know about the human genome, where the genes are, and where regulatory proteins called transcription factors can bind to the DNA. We then take other sources of data about protein interactions and correlated gene expression in different biological states and use advanced computational methods to optimize the structure of the network until it’s consistent with all the data. We can then compare networks to look for differences between different states, such as health and disease, that tell us what functions in the cell are activated in one or the other.
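To illustrate just the final comparison step, here is a minimal Python sketch. It assumes each network has already been inferred and stored as a dictionary of edge weights keyed by (transcription factor, gene) pairs. The edge names, the weights, and the function `top_differential_edges` are all made up for illustration; this is not the group's published software.

```python
def top_differential_edges(healthy, disease, k=3):
    """Rank edges by absolute change in weight between two networks.

    Each network maps (transcription_factor, gene) -> edge weight.
    An edge missing from one network is treated as weight 0.0 (an assumption).
    """
    edges = set(healthy) | set(disease)
    diffs = {e: disease.get(e, 0.0) - healthy.get(e, 0.0) for e in edges}
    # Sort by magnitude of change, breaking ties by edge name for determinism.
    ranked = sorted(diffs.items(), key=lambda item: (-abs(item[1]), item[0]))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical toy networks, not real data.
    healthy = {("TF_A", "GENE_1"): 0.9, ("TF_A", "GENE_2"): 0.1, ("TF_B", "GENE_2"): 0.5}
    disease = {("TF_A", "GENE_1"): 0.2, ("TF_A", "GENE_2"): 0.1, ("TF_B", "GENE_3"): 0.6}
    for (tf, gene), delta in top_differential_edges(healthy, disease):
        print(f"{tf} -> {gene}: change in weight {delta:+.2f}")
```

Ranking by the size of the change, rather than by the raw weights, puts the regulatory connections that differ most between the two states at the top of the list.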
As an example of what such models can reveal, we built network models for 38 tissue types, and we found a three-tiered structure. First, there was a core set of regulatory processes which were essentially the same in every tissue. Then there was a periphery of processes that were often unique to every tissue. But there was a middle layer in which regulatory paths changed to activate the processes that made one tissue different from another tissue. And such an activating layer is where you want to look when you think about interventions to address the causes of disease.
To take that a step further, we’ve now created and catalogued nearly 200,000 gene regulatory networks, including many from drug response studies. Using that resource, we can ask whether there’s a particular drug that alters the regulation in a way which might make a disease cell more like a healthy cell. That could be a powerful shortcut to matching diseases to drugs.
It sounds like practical applications may not be far off. What prospects are there of using your approaches for personalized medicine?
I’d say it’s early days yet – there’s a big gap between using these tools in the research lab and ensuring they are reliable enough to be used in the clinic. But we’re committed to making the methods accessible, and so all our methods are open-source, and we try to make them available in multiple programming languages like R, Python, MATLAB, and C. So, any researcher can take our network methods and use them on their own applications! Our goal is to help move the field forward.
In terms of areas we’ve studied ourselves, my colleagues and I have looked at cancer, Alzheimer’s disease, chronic obstructive pulmonary disease, interstitial lung disease and asthma, amongst others. In addition, one of the underlying themes for us is the differences that exist between males and females in gene regulatory processes that can help us understand disease evolution and inform treatment.
Surely gender differences have been well-studied?
Well, kind of – we know there are differences, but we don’t know why. In fact, sex differences in disease are among the most understudied problems in biology. For example, in colon cancer we know that there are differences in disease risk, development, and response to therapy between the sexes. However, when we examine gene expression in tumors, there’s very little difference between male and female samples, and none that explains the clinical differences.
But, using our network inference and modeling approach, we do find differences in the regulatory networks – for example, those influencing drug metabolism. This of course allows us to identify possible ways in which we can treat disease in males and females differently to increase drug efficacy. This is something we’d never have been able to do before taking this integrated approach.
So how quickly can you run a network-building algorithm these days?
Pretty fast, and it’s getting faster. For example, in a study of 29 different tissues, we had to create 19,000 networks. When we ran the methods that we call PANDA and LIONESS to generate the initial networks for publication, it took six months. Then the reviewers asked us to make some changes, by which time we’d made some software improvements and had access to new hardware, so on the second run it took six weeks. Now we can do it in six days!
That’s an amazing change in turnaround time. But why do you conduct analyses on such a large scale?
Most experiments in science are based on hypothesis testing, meaning we have some evidence and analysis that suggests a particular factor is associated with some measurable outcome, so we design an experiment to see if the factor and the outcome are related. But in associating gene regulation across the genome with particular outcomes, it is hard to know where to look or what to test. In studying the inference and analysis of gene regulatory networks, there are so many ‘unknown unknowns’ that there’s value in simply taking your tools and seeing what you can discover. If you have a large sample size, you can have greater confidence in the things that you eventually find, and it makes it easier to tease out small but meaningful signals.
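The link between sample size and small signals follows from the standard error of a mean, which shrinks with the square root of the number of samples. The sketch below is a generic statistical illustration rather than anything specific to network inference; the effect size, standard deviation, and sample sizes are arbitrary assumptions, and `z_score` is a hypothetical helper name.

```python
import math


def z_score(effect, sd, n):
    """Signal-to-noise ratio of a mean shift `effect` estimated from n samples.

    The standard error of the mean is sd / sqrt(n), so z grows like sqrt(n).
    """
    return effect / (sd / math.sqrt(n))


if __name__ == "__main__":
    effect, sd = 0.05, 1.0  # hypothetical small effect, in units of one standard deviation
    for n in (100, 1_000, 10_000, 100_000):
        print(f"n = {n:>7}: z = {z_score(effect, sd, n):.2f}")
```

With these made-up numbers, the same effect sits well inside the noise at 100 samples (z of about 0.5) and stands about five standard errors clear at 10,000.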
An analogy that I like is that what we are doing is similar to what Galileo did when he turned his telescope on Jupiter. He took new technology and made observations that led him to conclude that the planet had moons and indirectly helped to confirm the heliocentric model of the solar system. Our new technologies are big datasets and the advanced computational tools that we developed. Using these, we hope to identify genomic relationships that wouldn’t be found if we’d constrained ourselves to a handful of genes or simply looked at which ones change their expression. The deeper regulatory connections we find in networks across all 25,000 genes provide new insights that lead us to develop more targeted questions and testable hypotheses.
With your perspective of 30 years in the field, what excites you most about the future for analysis of genomic data?
Two things stand out for me. The first is multi-modal data, which is one thing I know Paradigm4 is tackling. It’s combining genomic data, imaging data, and patient metadata with computational methods to create multi-tiered models that encompass genetic variants, transcription, gene expression, methylation, microRNAs, and more. Bringing all these tools together gives us a better chance of extracting insights from the fantastically complex public datasets that are now available.
The second is single-cell sequencing, which I find very interesting. If you take a chunk of tissue and genetically profile it, you’re only ever going to uncover what happens on average. But if you take single-cell data like Paradigm4 has been doing, then you can probe what happens during the transition from a healthy state to a diseased state. That offers fantastic prospects for uncovering the mechanism of disease.
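The point about averages can be shown with a toy calculation. In the Python sketch below, two hypothetical tissue samples have identical bulk (average) expression of a gene even though one is uniform and the other is a mix of two very different cell populations. All values and names are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical per-cell expression levels of one gene in two tissue samples.
uniform_tissue = [5.0] * 8              # every cell expresses the gene at the same level
mixed_tissue = [0.0] * 4 + [10.0] * 4   # half the cells are "off", half are strongly "on"


def bulk_profile(cells):
    """What bulk profiling effectively reports: the average over all cells."""
    return mean(cells)


def cell_spread(cells):
    """Cell-to-cell variability, visible only with single-cell measurements."""
    return pstdev(cells)


if __name__ == "__main__":
    for name, cells in (("uniform", uniform_tissue), ("mixed", mixed_tissue)):
        print(f"{name}: bulk = {bulk_profile(cells):.1f}, spread = {cell_spread(cells):.1f}")
```

Both samples report the same bulk value, and only the per-cell spread reveals that the second one contains two distinct populations.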
John, thanks very much for your time, and for giving us some fascinating perspectives on genomics and computational biology.
Single-cell sequencing and integrative analytics are both areas of great importance and opportunity as we look towards the future of genomic data and precision medicine. At Paradigm4, we use these methods to help scientists extract insights from their datasets through our purpose-built platform and suite of REVEAL apps, which help users build a multimodal understanding of disease biology. The technology is scalable and cost-effective, so users can ask more questions, more quickly. For more information on how we can help to transform your data analysis, contact email@example.com.
John Quackenbush is Professor of Computational Biology & Bioinformatics, and Chair of the Department of Biostatistics at the Harvard T.H. Chan School of Public Health in Boston, Massachusetts. He also holds professorships at Brigham & Women’s Hospital and the Dana-Farber Cancer Institute.
Prof. Quackenbush’s Ph.D. was in Theoretical Physics, but in 1992 he received a fellowship to work on the Human Genome Project. This led him through the Salk Institute, Stanford University, The Institute for Genomic Research (TIGR), and ultimately to Harvard in 2005. Prof. Quackenbush has more than 300 scientific papers with over 84,000 citations; he was recognized in 2013 as a White House Open Science Champion of Change.
1 Wood, A., Esko, T., Yang, J. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet 46, 1173–1186 (2014). https://doi.org/10.1038/ng.3097
2 Wolpert, D. H. & Macready, W. G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1, 67–82 (1997). https://doi.org/10.1109/4235.585893