MAC™—the key to fast range selects and joins

When it comes to getting value out of big data, investigations need to focus on ranges—not single points—of data. Unlike other DBMSs, SciDB is aware the natural order of data and exploits that order to accelerate range selects and joins—the fundamental building blocks of data-driven discovery. This is why SciDB excels at getting value out of data in applications such as earth sciences, quantitative finance, bioinformatics, consumer marketing, e-commerce, and sensor analytics. All this is driven by what we call MAC technology.

Multidimensional Array Clustering makes sure that

  • data that are close to each other in the user-defined coordinate system are stored in the same chunk in the storage media
  • data are stored in the storage media in the same order as in the coordinate system.

In contrast, many systems (HDFS for example) distribute data randomly—and that makes all the difference.

So what’s the big deal with MAC? Why should you care?

Keep in mind that we are talking about big data, so the data is distributed over many “chunks” on a cluster of computers. Figure 1 shows an example of how this works. Let’s see how MAC speeds up the core building blocks of analytics.

Analytics on big data is about looking at what happened over a range of ordered values. In bioinformatics, this could mean selecting genetic variant data over a range of genome probe locations and patient groups. In earth sciences, this could mean selecting a cube of volume over some period of time. In finance this could mean selecting and aggregating market data over a range of time and exchanges.

By clustering adjacent data next to each other in the storage media, MAC reduces the number of reads required to select a range of data along a coordinate axis. Reads always have overhead and inevitably access more data than actually requested. So minimizing reads means reducing wasted time and wasted data movement. That’s important because range selections are the building blocks of data exploration.

So now imagine you want to join two arrays along the same coordinate axis. Since the data are stored in rectilinear chunks, it is easier for the system to locate pairs of chunks that need to be joined; and there are fewer comparisons that need to be performed when joining these two chunks. Again; MAC means better performance.

Matrix multiplication, for example, can be thought of as a join of two arrays on one axis followed by an aggregation. MAC allows SciDB to minimize data movement among nodes and efficiently multiply pairs of matching chunks. Thus MAC facilitates fast matrix multiplications, SVDs, thresholded correlations, and other linear algebra operations on big data.

What’s this thing overlap thing about?

SciDB also employs optional user-settable chunk overlaps to speed up windowed queries. Figure 2 shows an example of how this works. The key point is that by specifying an overlap section of each chunk, window queries that straddle what would otherwise be a chunk boundary, still only need to access a single chunk—provided that the window is smaller than or equal to the specified overlap.

Moreover, SciDB transparently manages the overlap behind the scenes. In this way, SciDB turns window queries into embarrassingly parallel operations that require no special programming on the part of the developer.

Like with the basic multidimensional array clustering, overlap can produce substantial speed improvements. While the overlap makes no difference for window queries that are within a chunk, it can make a huge difference for queries that straddle chunk boundaries. This is a common situation when a sliding window is used to aggregate data. We’ve seen situations where other systems gave nearly comparable performance to SciDB on queries that fall within a chunk; but when the window straddles two chunks, these other systems just never get there, while SciDB returns results in constant time—irrespective of whether the window crosses a chunk boundary.

How is this different?

While a few other systems offer clustering along two or three dimensions, these products fall short of the full MAC that SciDB provides. Also many of these other systems don’t offer linear algebra. As far as we know, no other database management systems employ overlap. Bottom line is that multidimensional array clustering is the way to roll if you need to do analytics on big data. Your results may vary but it’s not uncommon to see 50-100x improvements over traditional approaches.

MAC figure 1 Bioinformatics

Figure 1. Multidimensional Array Clustering reduces reads, speeds queries.  SciDB provides superior performance by collocating data in the storage media that are adjacent to each other in the coordinates of an array. This example data is from a single two-dimensional array with coordinates SAMPLE and GENE, and two attributes RNASEQ and METHY in each cell.

 

  • Different attributes are stored separately so a query that uses RNASEQ does not need to scan the data for METHY.
  • Data are split into rectilinear chunks (the dotted lines in the figure). Chunks are assigned to different SciDB instances based on their location in the array, using a simple hash function. The hash function allows SciDB to quickly find the right chunk for any piece of data based on the coordinates.
  • Data in each chunk are stored in a contiguous region, minimizing the number of reads needed to access a chunk.
  • Data are compressed using run-length encoding (RLE), which shrinks the size and allows for faster performance.
  • The locations of empty cells are encoded using a special bitmask (EBM) that is also RLE compressed.
  • Coordinate values themselves are not stored, but are recomputed when needed from the EBM.
MAC figure 2 Quant Finance

Figure 2. Optional overlap speeds sliding window queries. SciDB speeds window calculations through an optional chunk overlap feature. This example is from a single two-dimensional array with coordinates STOCK and TIME, and two attributes PRICE and VOLUME in each cell.

 

  • Users can specify an optional overlap of chunks, so that data in the overlap regions are replicated in the logically adjacent chunks.
  • Overlap is maintained automatically by the database.
  • This makes windowing queries embarrassingly parallel—even when the window straddles chunks.
  • The overlap uses slightly more storage space but gives faster performance—a tradeoff at the user’s discretion.