Windowing Operations over Timeseries Data, Part 2

This post is a continuation of the previous post, Windowing Operations over Timeseries Data in Paradigm4.

Repeated from part 1 of the post:The SciDB array, showing elevation and time as dimensions
2-dimensional SciDB Array

Now, let’s look at how this array could be represented in SciDB, as well as some actual windowing queries in AFL (Array Functional Language).

How Can You Represent Data in a SciDB Array?

After you loaded in the data, the schema of your array might look like this:

window_example <toxicConcentration: int64>
[elevation=0:2499, 2500, 0, time=1325376000:*, 1000000, 100]

The attribute is “toxicConcentration,” representing parts per billion of the toxin. The dimensions are “elevation” and “time.” Elevation ranges from 0 feet to 2500 feet above sea level. Time is actually represented as an offset in seconds from the epoch starting on January 1, 1970. We are assuming that the data set contains times during the calendar year 2012, so the starting value for the time dimension is set to the offset for January 1, 2012.

For each dimension, the following parameters are specified: lower value, upper value, chunk size, and overlap size. For this example, we are using a single chunk for the elevation dimension, and no overlap. For time, the chunk size is 1 million, and we are using an overlap value of 100. That overlap will suffice because the window size along the time dimension will be smaller than 100. The optimum setting for all of these values is quite dependent on the actual data set; its density, sparseness, and skew. A skewed data set has areas where the number of data values could be very high, or dense, and other areas where the data is very sparse. The optimum settings will also depend on the size and shape of the windows you plan to use in your windowing operations.

Let’s say that you want to find the rolling average over the elevation dimension (between 150 and 200 feet above sea level), for the exact elevation, where the time is within a minute. You would run the following query:

SELECT avg(toxicConcentration) into Results
FROM window_example
(PARTITION BY time 30 preceding and 30 following,
elevation 0 preceding and 0 following);

Let’s examine the query:

  • The results are stored to a new array—created on-the-fly—called Results.
  • For the time dimension, we have a window of 60 seconds—30 before
    and 30 after—around each cell.
  • SciDB requires that we include all the dimensions when specifying the window. Thus, we set the window in the elevation dimension to 0, so that we are only windowing in the time dimension.

Next, let’s compute the rolling average over both time and elevation. The 2-dimensional query is very similar to the 1-d query, except that we specify non-zero values for the elevation dimension. The following example keeps the 1-minute time window, and adds a 20-foot elevation window:

SELECT avg(toxicConcentration) into Results_2D
FROM window_example
(PARTITION BY time 30 preceding and 30 following,
elevation 10 preceding and 10 following);

If this was useful, Please share it!
Tweet about this on TwitterShare on LinkedInShare on Google+Share on FacebookEmail this to someonePrint this page

Subscribe for Newsletter