This post is part two of a two-part series on why you might care about an array database. Part one, “Why an Array Database?” is recommended reading.
Now, a word for the experts out there. We realize that quantitative analysis is complex, and we've simplified this use case for the sake of presenting an example to a broader audience. You'll have to cut us some slack. OK, now back to the use case.
A hedge fund wants to develop an algorithmic trading model that finds and exploits short-term market inefficiencies. Its intent is to identify stocks that track each other, find the first mover, and execute trades before the market fully equalizes on the other stocks. The data is stock trade tick information: time, symbol, price, and volume. As you can imagine, there's a lot of it. Trade ticks are ordered in time, though they can arrive at irregular intervals. Symbols can be ordered as well (say, alphabetically); the order itself does not mean much, but making symbol an array coordinate makes access faster.
Let’s set up a two-dimensional array; let one dimension be time and another be symbol. For each time-symbol pair, the array would store the attributes of price and volume. There are alternative schemas, of course; that's up to the user, who can easily change things later. For now, let's go with the two-dimensional time-symbol array.
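To make the schema concrete, here is a minimal sketch in Python with NumPy (used purely for illustration; this is not SciDB's query language). The symbols, sizes, and sample values are made up:

```python
import numpy as np

# Illustrative dimensions: symbol (ordered alphabetically) and time.
symbols = ["AAPL", "IBM", "MSFT", "ORCL"]
n_times, n_symbols = 6, len(symbols)

# Each (time, symbol) cell stores two attributes: price and volume.
ticks = np.zeros((n_times, n_symbols),
                 dtype=[("price", "f8"), ("volume", "i8")])

# Writing one cell: a hypothetical tick for IBM at time step 3.
ibm = symbols.index("IBM")
ticks[3, ibm] = (181.25, 500)

print(ticks[3, ibm]["price"])   # -> 181.25
```

The point is that time and symbol are array coordinates, while price and volume are attributes stored at each coordinate pair.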
Did you see what happened when we set up the array? In other databases, time and symbol are fields of a record. Rather than burying the inherent order of time and symbol in the rows or columns of a table, an array storage model promotes that buried order (or context) into primary coordinates of the system. Why does that matter? With a traditional database, one common way to accelerate data access is to maintain two copies of your data, one organized by time then symbol and the other by symbol then time, at the cost of doubling your storage. (Yes, an alternative is to use two indices, but that only reduces access time for a single symbol and time; analytics will focus on ranges, not discrete values.)
With an array database, time and symbol are coordinate dimensions of the array, so it is very easy to select data along a single dimension or along several at once. More importantly, because the inherent order is preserved in the way the data is stored, accessing the information is much faster than with the alternative approaches. Admittedly, if our hedge fund knew in advance how it wanted to access the data, it could bear the cost and tune the database accordingly. But for ad hoc analysis, it won't know what kinds of questions will be asked, so data access ought to be fast without tuning in advance. Array storage models give it that flexibility.
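Selecting along coordinate dimensions can be sketched with NumPy slicing (again, an illustration of the access pattern, not SciDB syntax; the data here is random):

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, n_symbols = 1000, 50

# Price matrix: time on one axis, symbol on the other (synthetic data).
prices = rng.random((n_times, n_symbols)) * 100

# Range selections along either coordinate dimension are just slices:
one_symbol_history = prices[:, 7]            # every tick for symbol 7
one_time_snapshot  = prices[250, :]          # every symbol at time 250
window             = prices[100:200, 10:20]  # a 2-D subrange of both

print(one_symbol_history.shape, one_time_snapshot.shape, window.shape)
```

Because the order of time and symbol is baked into the storage layout, range selections like these read contiguous regions rather than chasing index lookups row by row.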
But our hedge fund wants to do something that will really burden traditional databases. It wants to know which other stocks are highly correlated with IBM. This involves looking at every tick of every stock symbol, comparing it to previous ticks for the same symbol over some time interval (that's a window of values) to infer a trend in the stock price, and then comparing that trend to the same measurement for IBM. That's a lot of multidimensional selects. The speed difference will be significant, but there is another advantage to the array database.
What our hedge fund really wants to do is compare the movement of every stock to every other stock. Yikes! This is getting computationally intense. And this is where the array database really shines. If we take the two-dimensional array of time and symbol and select only the price attribute, we have a matrix of stock prices with symbols on one dimension and time on the other. Multiplying this matrix by its transpose (together with two other, less computationally challenging steps) produces the correlation matrix, showing precisely what the hedge fund wants: an understanding of which stocks behave similarly.
This is just one example of many valuable analyses based on linear algebra. Now, that matrix multiply can be quite compute-intensive, depending on how granular your data is and how long a period you want to analyze. With SciDB, matrix multiplication executes fast and in parallel, and performance can be improved simply by adding more commodity hardware. And the really cool thing is that the program for multiplying these matrices, which are distributed over several servers, is just one line of code in SciDB. No one has to worry about where the data sits or how to coordinate the analysis across multiple servers; SciDB takes care of that behind the scenes.
There is one other way the array storage model wins here. Because the data is stored in exactly the format you need to perform the matrix math, no manipulation is necessary to get it into the shape required by your analytics engine. And if your database also offers in-database math, as SciDB does, then there is no need to move the data at all. Just do the math; don't worry about moving the data or about how much memory is on your server.