This post is part one of a two-part series on why you might care about an array database. Part two is a sample use case demonstrating how an array database really helps perform quantitative analytics on big data.
So, let’s start with the database part. After all, you could just store data in files. If your data needs to be shared among people or applications while preserving data integrity over time, if it needs to be rapidly searchable and selectable, if metadata is important, or if new sources of data get added into the mix, then a database is the way to go. Databases also simplify application development by eliminating the need for users or applications to know how data is stored. The world voted on this long ago. OK, so let’s say you need a database. Why does big data need an array database?
The bottom line is that if you want to use big data for driving discovery and new product creation, an array database is a critical addition to a company’s analytical arsenal because:
• Array databases support complex analytics without compromises
• Array databases support analytics-in-context
In a moment we will look at those points in more detail. But first, let’s define what we mean by an array. An array representation of data consists of dimensions and attributes. An n-dimensional SciDB array has dimensions (d1, d2, …, dn). Each combination of dimension values identifies a cell or element of the array, which can hold multiple data values called attributes (a1, a2, …, am). That’s the formal definition.
A three-dimensional example makes it easier to understand. Say you wanted to represent instrument measurements along a pipeline. You might have the dimensions of the array be time, position, and measurement ID, where time is the date/time stamp of the measurement (measured in seconds from some starting point), position is the linear placement of the instrument along the pipeline (measured in feet from the beginning of the pipeline) and measurement ID is the type of measurement (temperature, pressure, flow rate, etc.). In each cell of the array, you would place attributes. The obvious attribute here would be the instrument measurement corresponding to the unique combination of time, position and measurement ID, but you could have other attributes: serial number of the instrument, ambient temperature, etc.
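As a concrete sketch, here is the pipeline array modeled with NumPy, used as a stand-in for an actual array database; the array sizes, readings and serial number below are invented purely for illustration:

```python
import numpy as np

# Illustrative dimension sizes: 10 time steps, 5 positions along the
# pipeline, 3 measurement types (0=temperature, 1=pressure, 2=flow rate).
n_time, n_pos, n_meas = 10, 5, 3

# One attribute per cell: the instrument reading at
# (time, position, measurement ID). NaN marks empty (sparse) cells.
readings = np.full((n_time, n_pos, n_meas), np.nan)

# Record a temperature reading of 71.5 at time index 3, position index 2.
readings[3, 2, 0] = 71.5

# A second attribute (e.g., instrument serial number) is simply another
# array of the same shape, aligned cell-for-cell with the first.
serials = np.full((n_time, n_pos, n_meas), -1, dtype=np.int64)
serials[3, 2, 0] = 40721
```

Each unique (time, position, measurement ID) triple addresses one cell, and every attribute lives in a parallel array with the same shape.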
Now let’s look at what makes an array model work so well for big data.
1. Big data is about new types of data and complex analytics

Unlike traditional business data, big data is machine-generated (by things like GPS devices, industrial sensors, DNA sequencers, imaging equipment or algorithmic trading) or created by machines logging human actions (web browsing, tweeting or cell phone text messaging). This new data is sparse, and more of it is better than less. For a great discussion of these points, see Christopher Mims’s article, Why the only thing better than big data is bigger data. And that brings the need to accommodate new data types, massive scale, complex analytics and context. While these new data types can be stored in a traditional row store, a column store or files, they are intrinsically represented by multi-dimensional arrays and their two-dimensional subsets: matrices. So an array storage model is an easier way to store and retrieve this new data.
But these new data types keep flowing in, and that’s where scale comes into play. Things get challenging when the data is too big to store on one server, or the analysis required exceeds one server’s memory or the time allocated for the effort. If it stopped at just big scale then you would be all set, as there are plenty of database solutions that scale across multiple servers. But the combination of scale, the high value of complex analytics (the math that underpins things like predictive models, recommendation engines, personalized medicine, geo-targeting or industrial analytics) and the importance of context in developing those analytics reaches beyond traditional database approaches.
2. Complex analytics move insights far beyond traditional BI or SQL analytics

Let’s be clear about what we mean. We define business intelligence as data aggregations such as sum, count, average, min/max and variance. BI is incredibly valuable and, while scaling these reports to big data is not trivial, there are plenty of solutions available. Increasingly, though, BI reports are table stakes. Competitive advantages lie in complex analytics: predictive models, clustering, principal component analysis and survival analysis on big data. Businesses have long recognized this and used complex analytics. Marketing response models, for example, have been in use for the past 50 years. What’s changed? The opportunity to exploit this at massive scale.
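To make the distinction concrete, here is a minimal Python sketch contrasting a BI-style aggregation with one representative complex-analytics technique, principal component analysis via the SVD. The data is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # 1000 observations, 5 variables

# BI-style aggregation: simple per-column sums and averages, computable
# in a single pass and easy to split across servers.
totals = X.sum(axis=0)
means = X.mean(axis=0)

# Complex analytics: PCA needs the whole (centered) matrix at once for
# an SVD or covariance computation, not just running per-row aggregates.
Xc = X - means
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s**2 / (len(X) - 1)  # variance along each component
```

The aggregation step parallelizes trivially; the SVD is the kind of whole-matrix operation that an array database with native math is built to run at scale.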
3. Challenges in doing complex analytics at scale

And that’s where the compromises begin. Conventional complex-analytics solutions are limited to one server’s memory. So if your data is bigger than that (and, you’ll recall, that’s what we meant by “big data”) you have a problem. Running the analysis on a single server often takes longer than the allotted time. You might sample data to get around this: select a subset of the big data so it fits within your memory constraints, extract that data from your database (or files, if you don’t have a database), transform it into a math package’s data structures and load it onto your analytics server (ETL). If all goes well, your compromises will be (i) lower accuracy from working with a small part of your data (why were you storing all this data?), (ii) increased time to perform the analysis (because you had to do all that ETL) and (iii) wasted productivity of your analytics team while they wait for the ETL to finish.
Then again, maybe your analysis started out fitting in memory but subsequent steps exceeded the computational resources available. Or, if you’re using parallel processing on distributed data, you are limited to problems that lend themselves to “divide and conquer” (as in MapReduce), where you can work on subsets of the data independently, in parallel. MapReduce makes some problems easy, but increasingly, the important questions don’t lend themselves to divide and conquer.
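A small Python sketch shows why divide and conquer covers some computations and not others; the data and partition sizes here are arbitrary:

```python
import statistics

data = list(range(1, 101))            # toy dataset: 1..100
part1, part2 = data[:30], data[30:]   # two unequal partitions

# A sum is divide-and-conquer friendly: partial sums over partitions
# combine into the exact global sum.
assert sum(part1) + sum(part2) == sum(data)

# A median is not: the median of per-partition medians is generally NOT
# the global median, so the partitions can't be processed independently.
med_of_meds = statistics.median(
    [statistics.median(part1), statistics.median(part2)]
)
true_med = statistics.median(data)    # the two values disagree here
```

Operations like windows, joins across dimensions and matrix decompositions behave like the median case: they need coordination across partitions, which is where a simple map-reduce split breaks down.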
Today’s reality is that most organizations have settled for the constraints associated with traditional approaches: less accurate results, missed signals, expensive proprietary hardware, lots of programming, lost productivity or longer turnaround time. Big data discovery requires complex analytics that scale beyond one server’s resources, on commodity hardware. And that’s where an array database with built-in complex math comes in. But before we get to that, there is one other important factor in play.
4. Data’s real value will come from context

Context brings together different sources of data to provide a richer and more holistic view. Context means caring about the temporal and spatial order of things, like what happened before and after a particular event, or which locations are nearby. Context encompasses metadata about how, when and where data are acquired or derived.
So why is context so critical for big data? Unlike business transactions where we primarily store data to retrieve a transaction or build summary reports about what happened, we store big data for its predictive abilities. Why did it happen and what could happen?
Complex analytics is about looking at changes in a system over time, location, price, position, or any other ordered dimension. To do this, we need to find the next and previous values, or a neighborhood of values, along whatever dimensions we are analyzing. Complex analytics is also about looking at those changes relative to the changes in other people, devices and information.
Now certainly a traditional database can store time stamps and find previous or subsequent records in time. The issue is cost. How easy is it to program, what does it cost to maintain that program, and how compute-intensive is that query as the stored data grows or as the data is sharded across multiple servers? What if we need to select records within a range or neighborhood? (A so-called “window select” is a very common requirement for complex analyses.) And how would performance change if we need to look at a different window size, or select windows across multiple dimensions at once? This is where an array database can be 50 to 100x faster.
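The window-select idea can be sketched with NumPy slicing, again as a stand-in for an array database; the array shape and values below are made up for illustration:

```python
import numpy as np

# readings[time, position]: a toy 2-D slice of the pipeline array.
readings = np.arange(100.0).reshape(10, 10)

# "Window select": every value within 2 steps of cell (t=5, p=5) along
# BOTH dimensions at once is a single slice in the array model.
t, p, w = 5, 5, 2
window = readings[t - w : t + w + 1, p - w : p + w + 1]  # 5x5 neighborhood

# Changing the window size, or windowing over more dimensions, just
# changes the slice bounds; there is no self-join over a timestamp
# column, because neighbors are adjacent in storage.
wider = readings[t - 3 : t + 4, p - 3 : p + 4]           # 7x7 neighborhood
```

In a row store the same neighborhood query typically becomes a range predicate plus sorting, or a self-join, which is what makes it expensive as data grows.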
In summary, if you want to store your machine-generated data because multiple applications or data scientists will need to rapidly explore and mine it over time, then you want a database. If you have big data then you care about context and an array database makes context-based queries 50 to 100x faster. If you need to run complex analytics on big data, then compared to a traditional database, an array database with native math is faster to program, saves time by avoiding ETL and executes complex math fast on commodity hardware. The array database wins.
Want to learn more? Read our sample use case that highlights the points above and demonstrates how an array database really helps perform quantitative analytics on big data.