Databases Unlock the Power of Medical Big Data

Heard recently at a SciDB tech-talk:

“Data in databases are used 100x more than data in files.” Ok. Not too controversial. “If we take ‘use’ (the number of papers published) to be a proxy for value derived…” Hmm, could that mean that databases increase the value of data by 100x? Maybe.

So what’s the solution to that? “Replace all file systems with /dev/null” (translation “delete the data”) might be one approach. It was somewhat tongue-in-cheek, but it certainly was cause for thought. This study by Vijay Gadepally (member of technical staff at MIT Lincoln Lab and SciDB user) elucidates the options. We think SciDB has an important role here. Check out the green dot below. Admittedly, we added that dot here at P4, but that is one goal of the Intel Science & Technology Center for Big Data program.

~ P4 Team


Vijay Gadepally, MIT Lincoln Laboratory



As a part of the Intel Science and Technology Center for Big Data, we at MIT are very interested in new technologies that can help alleviate the many challenges posed by the upcoming Internet of Things revolution. One key feature of the Internet of Things will be the machine speed at which information is collected. Today's big data is largely human-generated, at human rates (social media posts can reach ~1 Hz). Added to it will be vast quantities of information arriving at machine speed (500-1000 Hz). The Internet of Things promises to accelerate the rate at which we collect information and will undoubtedly require a new generation of technologies capable of indexing, storing, correlating and retrieving information in a timely manner. I recently calculated that if we were to add sensors on board all vehicles in the United States, this might generate nearly 200 exabytes of data per year! The current solution for such data volume is typically to put it in files and tuck it away on a large shared file system.
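To get a feel for what 200 exabytes per year means, here is a rough back-of-envelope sketch that works backward from that total to an average data rate per vehicle. The vehicle count (about 250 million) is an assumption made for illustration, not a figure from the original calculation, and the sketch spreads the data evenly over the whole year, which real vehicles would not do.

```python
# Back-of-envelope sketch: what per-vehicle data rate does a yearly total imply?
EXABYTE = 10**18                    # bytes, decimal (SI) definition
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years


def implied_rate_bytes_per_second(total_bytes_per_year, n_vehicles):
    """Average sustained data rate per vehicle implied by a yearly total."""
    return total_bytes_per_year / n_vehicles / SECONDS_PER_YEAR


total = 200 * EXABYTE      # the yearly estimate quoted in the text
n_vehicles = 250_000_000   # ASSUMPTION: rough US vehicle count, for illustration only

rate = implied_rate_bytes_per_second(total, n_vehicles)
print(f"Implied average rate: {rate / 1e3:.1f} kB/s per vehicle")
```

Under these assumptions the implied rate is a few tens of kilobytes per second per vehicle, which is plausible for a handful of sensors sampled at hundreds of hertz. Changing the assumed vehicle count scales the result proportionally.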

Another area where we are sure to see an explosion in the quantity (and quality) of data is the medical arena. From our perspective, this area serves as a surrogate for the larger Internet of Things movement. Digitization of healthcare records, collection of data from sensors, and widespread aggregation and sharing of research present great opportunities as well as great challenges. Just as in the larger Internet of Things movement, data is collected through a variety of modalities that require different solutions. There is a human-generated, structured component (such as demographic information) and there are unstructured components (such as notes and reports). In addition, a vast quantity of information is collected directly from sensors such as ECG (EKG) monitors, pulse oximeters, etc. With this in mind, we make use of the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset, a large, openly available dataset collected from patients admitted to an Intensive Care Unit between 2001 and 2008. This rich dataset has been the subject of a large body of research.

As we explored this dataset, we noticed an interesting trend. The clinical data (which sits in MySQL/PostgreSQL) has been the subject of hundreds of academic research papers. The waveform data (which sits in a filesystem), on the other hand, makes up nearly 90% of the volume of the MIMIC dataset but has been the subject of ~1 research article. What does this mean? Data in databases are used; data in files are not. In essence, for most applications, you need a database to extract the true value of your dataset. As a colleague at MIT pointed out, this leaves you with two options: 1) replace your filesystem with /dev/null, or 2) get your data into databases!

As you may have guessed, we chose option 2. We have taken the vast quantity of time series waveform data and put it into SciDB. Using SciDB, we have been able to get some simple medical analytics running and have had great success accessing and manipulating the data. In fact, we are in the process of recruiting a group of talented students at MIT to come up with the next big thing. We hope that these students, given valuable tools such as SciDB, can define the systems, develop the tools and demonstrate the value of medical data to usher in a new era of evidence-based medicine.