Parallel Array Computations With SciDB and R

Bryan Lewis is Paradigm4's Chief Data Scientist and an active R evangelist.

Data science is a rapidly evolving discipline. New algorithms, new and larger-scale problems, and new ways to manage and work with data are introduced almost daily. One tool that remains indispensable to data scientists is the R language.

R is a very popular and capable open source environment for data analysis and computation. R’s popularity stems in part from its expressive programming language, superb facilities for publication-quality graphics, and thousands of high-quality community-contributed analysis packages for applications including genomics, econometrics, geospatial analysis, and many others.

SciDB is an open source, scalable database that stores data in distributed, multidimensional arrays, an approach that fits naturally with analytics environments like R that largely compute with data in array form. While R excels at in-memory analytics, SciDB is designed for parallel processing of terabytes of data distributed across computational clusters.

The SciDB-R package lets R work with SciDB in several ways:

  1. By using a familiar data-frame interface similar to standard table/cursor database interfaces.
  2. Through the SciDB array class for R, which defines n-dimensional dense or sparse arrays in R that are backed by SciDB data. The arrays behave in most ways like standard R arrays, but computation is performed in parallel on distributed array data by the SciDB engine.

The second approach is unique to SciDB. It is inspired by, and similar to, the “bigmemory” package for R. Unlike bigmemory, however, array computations are performed in parallel on distributed data. The SciDB array class also supports n-dimensional sparse and dense arrays.

The SciDB array class for R currently exposes a limited but useful and fast-growing set of operations, including basic linear algebra and some matrix decompositions. In some cases, it lets existing R programs operate in parallel on huge distributed arrays without modification. It can be an easy way for a data scientist using R to extend existing algorithms to very large-scale data without recoding.
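To make the "without modification" idea concrete, here is a minimal sketch in plain R. It runs entirely in memory and does not use the SciDB-R package. The function `power_iteration` is a hypothetical name chosen for illustration; it touches its matrix argument only through `ncol()` and `%*%`. If a SciDB-backed array supports those two operations (an assumption here, not something confirmed above), the same function could in principle be handed such an array, with the matrix-vector product computed by the database.

```r
# Power iteration written only in terms of generic matrix operations.
# A is any square, matrix-like object that supports ncol() and %*%.
# (Hypothetical helper for illustration; not part of any package.)
power_iteration <- function(A, iterations = 200, tol = 1e-10) {
  n <- ncol(A)
  x <- rep(1 / sqrt(n), n)           # unit-length starting vector
  lambda <- 0
  lambda_new <- 0
  for (i in seq_len(iterations)) {
    y <- as.numeric(A %*% x)         # the only step that computes with A
    lambda_new <- sum(x * y)         # Rayleigh quotient estimate
    x <- y / sqrt(sum(y * y))        # renormalize the iterate
    if (abs(lambda_new - lambda) < tol) break
    lambda <- lambda_new
  }
  list(value = lambda_new, vector = x)
}

# In-memory demonstration with an ordinary symmetric R matrix.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
result <- power_iteration(A)

# Compare against base R's dense eigensolver.
reference <- eigen(A)$values[1]
```

Because the algorithm never indexes into `A` or copies it, the only requirement it places on the array class is a matrix-vector product. That is the kind of code most likely to carry over unchanged to a distributed array.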

Of course, SciDB also provides the other benefits one expects from working with a database, including multiuser access to data with transactional ACID integrity and sophisticated data management features such as no-overwrite data versioning, fault tolerance, and data replication. Data stored in SciDB arrays benefit from these features automatically.

Paradigm4 is dedicated to making large-scale analytics as easy as possible for R users. We think the SciDB-R package provides a very natural way to work with SciDB from R. Contact us to try it out and let us know what you think and what else you’d like to see in the package. 

     
