Windowing operations (aggregate functions applied over a rolling subset of data) are useful in many applications. For example, a rolling average smooths over short-term fluctuations and so reveals long-term trends. Sensors used to study global warming track average temperature changes over time; the scientists studying this data care about year-over-year changes, not daily or even monthly fluctuations.
SciDB’s native array data model intuitively and efficiently stores time series data, enabling fast access and in-place complex analytics. Time can be represented at any resolution (days, hours, microseconds, and so on). Because SciDB provides true multidimensional storage, you can store other relevant context alongside time as additional dimensions in the same array. Typical array dimensions might include location data, experimental parameters, stock symbols, and phenotypic data.
How Do Windowing Operations Work in SciDB?
Windowing operations cut out a contiguous section of your data and perform some aggregation, such as a median or sum, over that section. The windowing operators then slide that window across your data along a numeric dimension such as time. Paradigm4 provides a variety of aggregate functions (including median, average, MAD, and others) that work over data windows.
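The sliding-window idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not SciDB syntax: the function name `rolling`, its parameters, and the sample readings are all hypothetical, and the choice to shrink the window at the ends of the data is an assumption of this sketch.

```python
from statistics import mean, median


def rolling(values, before, after, aggregate):
    """Apply `aggregate` to a window around each position of `values`.

    The window for position i covers indices i-before .. i+after,
    clipped at the ends of the list (an assumption of this sketch),
    so windows near the edges contain fewer cells.
    """
    result = []
    for i in range(len(values)):
        lo = max(0, i - before)
        hi = min(len(values), i + after + 1)
        result.append(aggregate(values[lo:hi]))
    return result


# Hypothetical readings along a time dimension, with one spike.
readings = [10, 12, 50, 11, 13, 12, 14]

smoothed_mean = rolling(readings, 1, 1, mean)
smoothed_median = rolling(readings, 1, 1, median)
```

Swapping the `aggregate` argument changes the statistic without touching the sliding logic. In this sample the median window suppresses the spike at the third reading, while the mean window spreads it over its neighbors.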
Keep in mind that windowing aggregations operate over data that is logically close together (e.g., trades that occur milliseconds apart, or climate measurements taken in the same geographic area). Ideally, you want this logically related data to be close together on the database’s physical storage, so that logical neighbors become physical neighbors. A relational database can accomplish this by sorting on a column, but it can establish such locality only for the first variable in the ORDER BY clause. Traditional relational databases cannot simultaneously preserve adjacency for more than one variable.
SciDB, by contrast, stores data in multidimensional arrays. Any variable can be made into a dimension, which ensures that data values that are logically adjacent for that variable will be stored physically close to each other. In short, SciDB preserves the locality of your data along multiple dimensions.
Another requirement for efficient windowing on a distributed database system is keeping one window’s worth of data together on a single SciDB instance. SciDB provides a feature called chunking, which allows you to do this. For each array, the chunking feature lets you declare parameters that yield contiguous, rectilinear “chunks” of data. The idea is to choose parameters so that each individual chunk fits in the memory of an individual SciDB instance.
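To make the idea of rectilinear chunks concrete, here is a minimal Python sketch that maps a cell's coordinates to the chunk containing it. The helper `chunk_id` and the chunk lengths are hypothetical and chosen for illustration; the sketch does not model how SciDB assigns chunks to instances.

```python
def chunk_id(cell, chunk_lengths):
    """Map a cell's coordinates to the coordinates of the chunk holding it.

    Each dimension is cut into runs of `chunk_lengths[d]` consecutive
    indices, so chunks are contiguous, rectilinear blocks of cells.
    """
    return tuple(c // length for c, length in zip(cell, chunk_lengths))


# Hypothetical chunking: 4 time steps by 3 elevation levels per chunk.
chunk_lengths = (4, 3)

cell = (5, 4)                           # (time index, elevation index)
neighbours = [(5, 5), (6, 4), (6, 5)]   # adjacent in time, elevation, or both

home = chunk_id(cell, chunk_lengths)
same_chunk = [chunk_id(n, chunk_lengths) == home for n in neighbours]
```

Here the cell and all three neighbors fall in the same chunk, so a small window around the cell can be computed from that chunk alone. A cell on a chunk edge would not be so lucky, which is the case overlap is meant to handle.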
SciDB also provides a feature called overlap, which lets a windowing operation continue smoothly even when the window spans a chunk boundary. Data near chunk edges is duplicated in the adjacent chunks.
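The effect of overlap can be demonstrated with a small Python sketch over one dimension. It is a conceptual model rather than SciDB's implementation: the function names, the dictionary layout of a chunk, and the sample values are all assumptions made for illustration. Each chunk carries copies of up to `overlap` cells from its neighbors, and the rolling mean for a chunk is computed from that chunk's cells alone.

```python
from statistics import mean


def split_with_overlap(values, chunk_length, overlap):
    """Split `values` into chunks of `chunk_length` cells, each padded
    with up to `overlap` cells copied from its neighbouring chunks."""
    chunks = []
    for start in range(0, len(values), chunk_length):
        end = min(start + chunk_length, len(values))
        lo = max(0, start - overlap)
        hi = min(len(values), end + overlap)
        chunks.append(
            {"start": start, "end": end, "offset": lo, "cells": values[lo:hi]}
        )
    return chunks


def rolling_mean_per_chunk(chunks, radius):
    """Centred rolling mean that reads only each chunk's own cells."""
    result = []
    for chunk in chunks:
        cells, offset = chunk["cells"], chunk["offset"]
        for i in range(chunk["start"], chunk["end"]):
            lo = max(0, i - radius - offset)
            hi = min(len(cells), i + radius + 1 - offset)
            result.append(mean(cells[lo:hi]))
    return result


def rolling_mean_global(values, radius):
    """Reference result computed with access to the whole array."""
    return [
        mean(values[max(0, i - radius): i + radius + 1])
        for i in range(len(values))
    ]


values = [3, 8, 1, 9, 4, 7, 2, 6, 5, 10]  # hypothetical measurements

with_overlap = rolling_mean_per_chunk(split_with_overlap(values, 4, 1), 1)
without_overlap = rolling_mean_per_chunk(split_with_overlap(values, 4, 0), 1)
reference = rolling_mean_global(values, 1)
```

In this sketch, `with_overlap` matches `reference` because the overlap is at least as large as the window radius, while `without_overlap` differs at the chunk boundaries, where windows are cut short.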
In SciDB, by defining the right set of dimensions and by judicious use of the chunking and overlap features, you ensure that all the data for a single window aggregation resides on a single instance of your distributed SciDB installation. This makes these kinds of computations fast.
SciDB can also compute these rolling aggregations over two or more dimensions. Let’s look at a multidimensional example.
Consider a SciDB array where the dimensions are elevation and time, and the attribute is PPB (parts per billion of some airborne toxin).
There are fluctuations in the data from moment to moment and from place to place. You could smooth over these fluctuations in three ways:
- A rolling average over the time dimension. This will smooth over moment-to-moment fluctuations. For each (time=t, elevation=a) pair, the result shows the average PPB for that exact elevation a at various moments near t.
- A rolling average over the elevation dimension. This will smooth over point-to-point fluctuations. For each (time=t, elevation=a) pair, the result shows the average PPB for that exact time t at various elevations near a.
- A rolling average over both time and elevation. This will smooth over moment-to-moment and point-to-point fluctuations. For each (time=t, elevation=a) pair, the result shows the average PPB for all the measurements that were close in time to t and close in elevation to a. How close? That depends on the size of the window in each dimension.
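The three options above can be sketched with a single Python function whose window radius is set per dimension. This is a plain-Python illustration, not SciDB query syntax; the function `rolling_mean_2d`, the nested-list layout, and the PPB values are hypothetical. A radius of 0 in a dimension means no smoothing along it.

```python
from statistics import mean


def rolling_mean_2d(grid, time_radius, elevation_radius):
    """Rolling mean over a 2-D grid where grid[t][a] is the PPB reading
    at time index t and elevation index a.

    The window around (t, a) spans t +/- time_radius and
    a +/- elevation_radius, clipped at the array edges
    (an assumption of this sketch).
    """
    n_time, n_elev = len(grid), len(grid[0])
    result = []
    for t in range(n_time):
        row = []
        for a in range(n_elev):
            t_lo, t_hi = max(0, t - time_radius), min(n_time, t + time_radius + 1)
            a_lo, a_hi = max(0, a - elevation_radius), min(n_elev, a + elevation_radius + 1)
            window = [
                grid[tt][aa]
                for tt in range(t_lo, t_hi)
                for aa in range(a_lo, a_hi)
            ]
            row.append(mean(window))
        result.append(row)
    return result


# Hypothetical PPB readings: rows are time steps, columns are elevations.
ppb = [
    [4, 6, 8],
    [2, 10, 6],
    [6, 2, 4],
]

over_time = rolling_mean_2d(ppb, 1, 0)       # smooth along time only
over_elevation = rolling_mean_2d(ppb, 0, 1)  # smooth along elevation only
over_both = rolling_mean_2d(ppb, 1, 1)       # smooth along both dimensions
```

For the center cell, `over_time` averages the three readings at that elevation, `over_elevation` averages the three readings at that time, and `over_both` averages all nine readings in the 3-by-3 window.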
For more detail on how you might represent this data in SciDB and query against it, see the next post, Windowing Operations in Paradigm4, How-To.