SciDB accelerates linear algebra operations—the basis for statistical and machine learning computing such as correlation, covariance, and singular value decompositions (SVD)—because SciDB has been built from the ground up to store and process arrays efficiently. Consequently, data analysis with SciDB scales seamlessly to trillions of data points without the need to re-write your analytic code, or manually distribute your data. This blog post explains one benchmark that shows how SciDB’s is pushing the frontier for in-database scalable linear algebra.
One important metric for assessing a system’s scalability is called “speed-up” scalability. The idea is that, ultimately, there is a limit to how big a single computer can be; so if your data set is large enough, it can take seemingly forever to calculate your result, even on a high-end server stuffed with the maximum amount of memory. Given the upper bounds on single-box compute power, we’ve always relied on clusters of computers to take on the biggest tasks. But there is also an economic argument. High-end computers are very expensive. But with the right software you can build a system with the computational power of a very big box by combining the power of many small ones.
So how does SciDB scale? To measure speed-up scalability, we start with a large data volume, and measure how long it takes to complete a fixed workload or set of queries using a single computer. For example, you might start with 50 Gigabytes of array data, and measure how long it takes to multiply the matrix by itself on one computer. Then we repeatedly double the amount of compute power available—first step is to add a second computer to the cluster, second step is to add two more—and at each scale factor measure how long it takes to compute the same multiply result.
In this test, the input is a dense, 2D array of 50,000 x 5,000 double precision values. The linear algebra operation we benchmarked was a simple matrix multiply, because multiply is a basic building block for many statistical analyses. This means that we produced a result matrix that was 50,000 x 50,000 in size. The input size was small, only about 2G, with the output size being 20G. SciDB also supports operations on sparse arrays, as well as many of the analytic OLAP style operations common to SQL.
Note that the results are shown on a log/log scale. This is because, although we are holding the data size and hence the total computation fixed, each step along the X axis (compute power) represents a doubling of the computational power, which should result in a halving of the Y axis value (total time).
What’s going on under the hood? Well, the challenge of scaling linear algebra has been the subject of intense research within computer science for decades. Over time, this community—exemplified by the ScaLAPACK team at Oak Ridge National Laboratories—have settled on a collection of algorithmic best practices; block-partitioned algorithms, mesh topologies, peer-to-peer message passing, and extremely efficient per-node execution libraries. Frameworks like Hadoop’s Map/Reduce or SQL’s set-theoretic algebras are nowhere to be found; rejected by this community as being too slow or too unwieldy. But because SciDB began with a clean sheet, and uses block-partitioned arrays as our building blocks, we are able to adopt this community’s best-practices as our own. Although these routines require a complicated data distribution, SciDB takes care of all of this automatically so that no user knowledge or interaction is needed to get the performance gains associated with ScaLAPACK.
While we could also use ScaLAPACK on sparse matrices, we can achieve faster performance with implementations that take advantage of the sparsity. So we have written our own routines for linear algebra operations, such as spgemm (for sparse matrix multiplication) and tsvd (for truncated SVD).
The bottom line
With SciDB, you can add nodes incrementally, and cost-effectively, to achieve faster response times.
This means that complex analytics—even for matrices too large to fit in-memory on a single computer—can run fast enough to support interactive, exploratory analytics in-database.