Aggregation splits data into subsets, computes summary statistics on each subset, and reports the results in a conveniently summarized form.
The aggregate function is one of the most capable functions in the scidb package. The package overloads R’s standard aggregate function for SciDB arrays, using reasonably standard R syntax to cover most SciDB aggregation operators including aggregate, window, and variable_window. (The regrid and cumulate functions separately implement additional SciDB aggregation operators.)
The aggregate function extends the default capabilities of many SciDB aggregation operators to allow grouping by SciDB array dimensions, array attributes, other SciDB arrays, and combinations of all three.
The SciDB aggregate operator computes summary statistics for one or more attributes of a SciDB array, grouped by zero or more dimensions of the array. SciDB computes aggregates grouped along array dimensions very quickly.
The aggregate function in the scidb package extends the basic SciDB aggregation capability to implement aggregation grouped by SciDB array dimensions, array attributes, auxiliary SciDB arrays, or combinations of all three.
With the unpack=TRUE option set, the aggregate function presents its output summarized in a data frame-like unpacked 1-d SciDB array. This is the most convenient format for display and manipulation in R. With unpack=FALSE (the default), grouped aggregates are instead returned in multi-dimensional SciDB arrays. This form is useful if the aggregate result is to be subsequently joined with other SciDB arrays. See the examples below.
The scidb package tries to use standard R syntax as much as possible. For convenience, the package interprets several standard R summary statistic functions supplied in the FUN argument, such as mean, sd, min, max, var, length, and prod, as SciDB aggregate expressions. The aggregate function also accepts any valid SciDB aggregate expression as a character string. See below for examples.
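To make the idea of translating an R function name into a SciDB aggregate expression concrete, here is a minimal sketch in plain R. It is not the package's implementation: the helper `scidb_agg_expr` and the lookup table `r_to_scidb` are hypothetical names, and apart from mean mapping to avg (which this document states), the individual mappings are assumptions.

```r
# Hypothetical lookup table from R function names to SciDB aggregate names.
# Only mean -> avg is stated in this document; the rest are assumptions.
r_to_scidb <- c(mean = "avg", sd = "stdev", min = "min", max = "max",
                var = "var", length = "count", prod = "prod")

# Hypothetical helper (not part of the scidb package): build an aggregate
# expression string that applies one aggregate to each named attribute.
scidb_agg_expr <- function(fun, attributes) {
  op <- r_to_scidb[[fun]]
  paste(sprintf("%s(%s)", op, attributes), collapse = ", ")
}

expr <- scidb_agg_expr("mean", c("Petal_Length", "Petal_Width"))
print(expr)
```

A string built this way has the same shape as the explicit aggregate expressions that can be passed as a character string in place of an R function.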
Dimensionality of Aggregation Results
The SciDB aggregate function can return results of variable array dimensionality, depending on the number of grouping level combinations involved. The precise dimensions of the output may not be known until run time. For this reason, SciDB may place aggregation results into a huge sparse array, which avoids the need to compute precise coordinate upper bounds. Consequently, the results of aggregations can be huge sparse arrays whose data reside near the coordinate system origin at zero.
When users materialize results to R, only the portion of the sparse output array containing data is returned. The examples show a simple way to explicitly bound the coordinate system to the output size of the aggregation if you need to do that.
Many examples below use the iris data set. The following code listing uploads iris to SciDB, assigning the result to the R variable x, a data frame-like scidbdf object. The underlying SciDB array has a single coordinate axis named “row” with 150 rows, and five SciDB array attributes corresponding to the variables in the iris data. Note that the variable names change to conform to SciDB array attribute naming conventions (for example, Petal.Length becomes Petal_Length).
Example 1: “Grand” aggregates
Omit the by function argument to compute a grand aggregate that averages all values, grouped across array attributes. Note that we use the standard R mean function instead of supplying an explicit SciDB aggregation expression. The package interprets this for us as the SciDB avg aggregate function. Also note that we limit the aggregation to the first four attributes with x[,1:4], because the fifth attribute (“Species”) is a string value, and the average of strings is not defined. The result of this example is returned as a huge sparse 1-d SciDB data frame-like array (see the discussion above). We explicitly materialize the result to R with [ ].
We can equivalently use an explicit SciDB aggregation expression instead of mean:
Note that aggregation expressions can rename their results:
Example 2: Grouped by dimension coordinates
We use a 3-d array for this example. Define the 5 x 4 x 3 array A with a single double-precision-valued array attribute val and dimension names i, j, and k as follows:
Compute sums grouped by dimensions j and k, returned as a 2-d SciDB array:
Alternatively, we can unpack the 2-d output into a data.frame with unpack=TRUE:
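As a local point of comparison for this example, the same grouping can be sketched on an ordinary R array. This is a base R analog and not a scidb call. The contents of A (the values 1 to 60), the zero-based coordinate labels, and the column name `val_sum` are assumptions made for illustration.

```r
# 5 x 4 x 3 array with dimensions named i, j, k. The values 1..60 are an
# illustrative assumption; the text does not specify the contents of A.
A <- array(as.double(1:60), dim = c(5, 4, 3),
           dimnames = list(i = 0:4, j = 0:3, k = 0:2))

# Sum over i within each (j, k) group: a 4 x 3 matrix,
# analogous to a 2-d grouped result.
S <- apply(A, c(2, 3), sum)

# "Unpacked" form: one row per (j, k) group, analogous to unpack=TRUE.
df <- as.data.frame(as.table(S), responseName = "val_sum")
print(S)
print(head(df))
```

The matrix S keeps the grouping dimensions as axes, which is the convenient shape for joining with other arrays. The data frame df lists each group on its own row, which is easier to display.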
Example 3: Group by array attributes
The aggregate function for SciDB arrays supports groups defined by distinct array attribute values, similarly to the standard R aggregate function syntax. Aggregation by array attributes proceeds by converting the specified array attributes to categorical variables that enumerate distinct attribute values, and then grouping along the resulting categories. Consider the iris data used in Example 1. Compute the average Petal_Length, grouped by the distinct values of the “Species” attribute:
The returned result includes the new “Species_index” array attribute that enumerates the distinct values of “Species.”
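The two steps described above (enumerate the distinct attribute values, then group along the resulting categories) can be sketched locally in base R. This is an analog on the original R data, not the scidb listing. The output column names are illustrative, and base R numbers the categories from 1, which may differ from how SciDB numbers “Species_index.”

```r
# Step 1: convert the attribute to a categorical variable that
# enumerates its distinct values (1, 2, 3 in base R).
species_index <- as.integer(factor(iris$Species))

# Step 2: group along the resulting categories and average
# Petal.Length within each one.
avg <- tapply(iris$Petal.Length, species_index, mean)

# Report the index next to the original label.
res <- data.frame(Species_index    = as.integer(names(avg)),
                  Species          = levels(factor(iris$Species)),
                  Petal_Length_avg = as.numeric(avg))
print(res)
```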
Example 4: Group by an auxiliary SciDB array
This example most closely matches standard R aggregation behavior. Summary statistics are computed across groups defined by an auxiliary, usually 1-d, SciDB array. The following example creates an auxiliary grouping array that divides flowers into two groups of different petal widths and then computes the average petal length for each group. Note that supplying the by argument as a list is optional in the SciDB case.
This is identical to the standard R syntax (!):
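A minimal base R sketch of such a grouped aggregate follows. It assumes a hypothetical split of the flowers at a petal width of 1; the threshold and the name `group` are illustrative assumptions, not values taken from this document.

```r
# Hypothetical grouping vector: TRUE for wider petals, FALSE otherwise.
# The threshold of 1 is an illustrative assumption.
group <- iris$Petal.Width > 1

# Standard R aggregation: mean petal length within each group.
# In base R the by argument must be a list.
res <- aggregate(iris$Petal.Length, by = list(group = group), FUN = mean)
print(res)
```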
Moving window aggregation
The aggregate function supports moving-window aggregates along coordinate systems or successive (sparse) data values.
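The difference between the two kinds of window can be illustrated on a small sparse vector in plain R. This is a conceptual sketch rather than SciDB's implementation: empty cells are represented by NA, each window covers the current cell plus one following cell, and a missing neighbor at the end contributes zero. All of these are assumptions chosen for the illustration.

```r
# Sparse 1-d data: NA marks an empty cell (an assumption of this sketch).
x <- c(3, NA, NA, 5, NA, 2, 7)

# Window along the coordinate system: the current cell plus the cell at
# the next coordinate, whether or not that cell holds a value.
coord_sum <- sapply(seq_along(x), function(i) {
  if (is.na(x[i])) return(NA_real_)
  sum(x[i:min(i + 1, length(x))], na.rm = TRUE)
})

# Window along successive data values: the current value plus the next
# nonempty value, however far away its coordinate is.
idx  <- which(!is.na(x))
vals <- x[idx]
value_sum <- x
value_sum[idx] <- vals + c(vals[-1], 0)

print(rbind(x, coord_sum, value_sum))
```

At coordinate 1 the coordinate-based window sees only the value 3, because the next cell is empty. The value-based window reaches ahead to the next stored value, 5.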
Example 5: Moving window aggregation along the coordinate system
Consider the first six rows and only the numeric attributes of the iris data. The following example computes the rolling sum of pairs of rows across the data:
Note that SciDB’s moving window aggregates assume a zero boundary condition for (non-existing) values past the end of the array when applying sums. (See discussion below.)
Compare with an alternative convolution-based approach on the original R data using R’s filter function. The output of this approach differs in a few ways: the filter function returns a time series object, and its boundary condition differs from SciDB’s. The filter function assumes NA values outside the bounds of the array, which results in a different last row.
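A sketch of the convolution-based approach in base R is shown here, together with one way to emulate a zero boundary condition locally. The zero-padding step is an illustration of the boundary behavior described above, not something the scidb package does on the R side.

```r
# First six rows, numeric columns only.
x <- as.matrix(iris[1:6, 1:4])

# Rolling sum of each row with the next one. stats::filter treats values
# past the end as NA, so the last row of the result is NA.
y_na <- stats::filter(x, c(1, 1))

# Emulate a zero boundary condition: append a row of zeros before
# filtering, then drop the padded row.
y_zero <- stats::filter(rbind(x, 0), c(1, 1))[1:6, ]

print(y_na)
print(y_zero)
```

The two results agree on the first five rows. In the last row, y_na is NA while y_zero equals the last row of x, since the missing neighbor contributes zero to the sum.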
Window boundary conditions
Example 6: Windows along successive data values
Use ‘variable_window’ to perform moving window aggregates over data values in a single dimension specified by the ‘by’ argument. Moving window aggregates along data values are restricted to a single array dimension. The following example considers a sparse version of the iris data, filtered to include numeric values greater than 2.3 across all attributes.