SciDB—the array database management system
Supports parallel processing without parallel programming, freeing data scientists to analyze more and program less.
- SciDB is a full ACID database management system that stores data in multidimensional arrays with strongly typed attributes (aka fields) within each cell.
- Arrays are the natural way to organize, store, and retrieve ordered or multifaceted data.
- SciDB’s unique Multidimensional Array Clustering (MAC™) gives it an unfair speed advantage for multidimensional selections and joins.
- Select any two dimensions of your array and you have a matrix—represented in exactly the format you need to run complex analytics that drive predictive models.
- A distributed massively parallel processing architecture lets you store and access as much data as you need by scaling out on commodity hardware.
- In-database linear algebra means you spend more time analyzing and less time moving data to a math software package. Since the math runs on distributed data, you’ll never have to sample or select a subset of data to fit in available memory on a single computer.
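The idea of picking two dimensions of an array to get a matrix can be illustrated in plain Python. This is only a sketch, not SciDB code: the function `matrix_slice`, the dictionary-of-cells layout, and the sample dimensions (time, sensor, channel) are all assumptions made up for illustration.

```python
def matrix_slice(cells, row_dim, col_dim, fixed):
    """Build a 2-D matrix (list of lists) from a multidimensional array.

    cells   -- {coordinate tuple: value}, a hypothetical in-memory array
    row_dim -- index of the dimension to use as matrix rows
    col_dim -- index of the dimension to use as matrix columns
    fixed   -- {dimension index: coordinate} pinning every other dimension
    """
    picked = {}
    for coord, value in cells.items():
        if all(coord[d] == v for d, v in fixed.items()):
            picked[(coord[row_dim], coord[col_dim])] = value
    rows = sorted({r for r, _ in picked})
    cols = sorted({c for _, c in picked})
    # Cells absent from a sparse array show up as None.
    return [[picked.get((r, c)) for c in cols] for r in rows]


# A small 3-D array with dimensions (time, sensor, channel).
cells = {
    (t, s, k): 100 * t + 10 * s + k
    for t in range(2)
    for s in range(3)
    for k in range(2)
}

# Rows = time, columns = sensor, with channel pinned to 1.
m = matrix_slice(cells, row_dim=0, col_dim=1, fixed={2: 1})
```

Here `m` is a 2 x 3 matrix ready for matrix math; choosing different values for `row_dim` and `col_dim` yields a different matrix from the same array.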
MAC (Multidimensional Array Clustering) is the key to super-fast range selections, aggregates, and joins. MAC uses two principles to accomplish these goals: (1) data that are close to each other in the user-defined coordinate system are stored in the same chunk, and (2) data are arranged in the storage media in the same order as in the coordinate system. MAC is so important we gave it a page unto itself.
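The two MAC principles can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SciDB's storage engine: the chunk size `CHUNK_SHAPE` and the helpers `chunk_of`, `build_chunks`, and `range_select` are hypothetical names invented for the example.

```python
from collections import defaultdict

CHUNK_SHAPE = (4, 4)  # assumed chunk length along each dimension


def chunk_of(coord, chunk_shape=CHUNK_SHAPE):
    """Principle 1: nearby coordinates map to the same chunk."""
    return tuple(c // s for c, s in zip(coord, chunk_shape))


def build_chunks(cells, chunk_shape=CHUNK_SHAPE):
    """Group cells by chunk and keep each chunk in coordinate order (principle 2)."""
    chunks = defaultdict(list)
    for coord, value in cells.items():
        chunks[chunk_of(coord, chunk_shape)].append((coord, value))
    return {cid: sorted(items) for cid, items in sorted(chunks.items())}


def range_select(chunks, lo, hi, chunk_shape=CHUNK_SHAPE):
    """Return cells with lo <= coord <= hi, scanning only chunks that overlap the box."""
    lo_chunk, hi_chunk = chunk_of(lo, chunk_shape), chunk_of(hi, chunk_shape)
    touched, result = 0, []
    for cid, items in chunks.items():
        if all(l <= c <= h for c, l, h in zip(cid, lo_chunk, hi_chunk)):
            touched += 1
            result.extend(
                (coord, value)
                for coord, value in items
                if all(l <= x <= h for x, l, h in zip(coord, lo, hi))
            )
    return result, touched


# An 8 x 8 array split into four 4 x 4 chunks.
cells = {(i, j): i * 8 + j for i in range(8) for j in range(8)}
chunks = build_chunks(cells)
selected, touched = range_select(chunks, (0, 0), (3, 3))
```

In this toy setup the range selection reads a single chunk out of four, which is the intuition behind fast range selections: the query box determines which chunks are relevant before any cell is examined.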
Array Data Model
Geo-spatial data, scientific data, financial feeds, sensor data, sequencing data, time-series data, and other highly faceted data do not fit neatly or efficiently into tables, the data model used in relational databases. SciDB’s native multidimensional array data model is designed from the ground up for ordered, highly dimensional, multifaceted data. Data is never overwritten, so you can record and access data corrections and updates over time.
Dramatic storage and operational benefits stem from the array data model. SciDB is designed to handle both dense and sparse arrays efficiently, and the storage savings grow with the number of dimensions and attributes. Math operations run directly on the native data format. Partitioning data along each coordinate of an array facilitates fast joins and access along any dimension, thereby speeding up clustering, array operations, and population selection.
Need to support concurrent users, reads, and writes? That’s what database management systems were built to do. Try that with files and you are looking for trouble: forked files, corrupted data, inconsistent results. Databases solve these problems with ACID technology, so you can curate once and analyze many. ACID guarantees that transactions are all or nothing (Atomicity), all users see the same valid data (Consistency), transactions don’t interfere with each other (Isolation), and committed data is not lost (Durability). ACID guarantees make for repeatable results across multiple users and facilitate collaboration on shared data. SciDB combines full ACID guarantees with versioned, no-overwrite array storage. When using versioned arrays, write transactions in SciDB create new versions of the array rather than modifying pre-existing versions.
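The no-overwrite behavior described above can be modeled with a small Python class. This is a conceptual sketch only: `VersionedArray` and its methods are hypothetical, and copying the whole array for every version is done for clarity, not as a claim about how SciDB stores versions.

```python
class VersionedArray:
    """Toy no-overwrite array: every write creates a new version."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty array

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def write(self, updates):
        """Apply all updates or none; return the new version number."""
        new = dict(self._versions[-1])  # earlier versions stay untouched
        for coord, value in updates.items():
            if not isinstance(value, float):
                # Failing here leaves no partial version behind.
                raise TypeError(f"cell {coord} needs a float, got {value!r}")
            new[coord] = value
        self._versions.append(new)
        return self.latest_version

    def read(self, coord, version=None):
        """Read a cell from the latest version, or from an older one."""
        snapshot = self._versions[-1 if version is None else version]
        return snapshot.get(coord)


arr = VersionedArray()
v1 = arr.write({(0, 0): 1.0})
v2 = arr.write({(0, 0): 2.0, (0, 1): 5.0})  # a correction plus a new cell

original = arr.read((0, 0), version=v1)  # value before the correction
current = arr.read((0, 0))               # value after the correction
```

Because version `v1` is never modified, an analysis run against it gives the same answer before and after the correction lands, which is what makes results repeatable.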
Distributed MPP Architecture
Get cost-effective scaling of data management and analytics with SciDB’s shared-nothing, massively parallel processing (MPP) architecture. Scale out on tens to thousands of commodity-hardware nodes in a cloud or on premises. No need for big and expensive high-performance computers or costly database appliances. Hit the memory limit on a scale-up architecture and you’ll need a new system. With SciDB, just add more nodes.
SciDB moves analytics to the data, eliminating time-intensive ETL processes. Arrays are the natural way to store data for linear algebra operations (like SVD and covariance), which saves time moving, organizing, and preparing data for matrix math. SciDB then performs massively parallel linear algebra, without parallel programming, on commodity hardware clusters. The result: analytical workflows scale to hundreds of billions of data elements without turning the analysis into a programming science project. SciDB eliminates the tedium of manual data distribution and lets you leverage R and Python coding skills. Program faster, and shorten your “ask-to-answer” loops with SciDB’s scalable, cost-effective in-database analytics.
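To see why math on distributed data does not require sampling, consider covariance. The sketch below is plain Python rather than SciDB code, and the helper names are assumptions for illustration. Each chunk is reduced to a small summary that could be computed on the node holding it, and only the summaries are combined.

```python
def partial_sums(chunk):
    """Summarize one chunk of (x, y) pairs as (count, sum_x, sum_y, sum_xy)."""
    n = len(chunk)
    sum_x = sum(x for x, _ in chunk)
    sum_y = sum(y for _, y in chunk)
    sum_xy = sum(x * y for x, y in chunk)
    return n, sum_x, sum_y, sum_xy


def combine(partials):
    """Add per-chunk summaries element-wise; the order of chunks does not matter."""
    return tuple(sum(values) for values in zip(*partials))


def covariance(partials):
    """Sample covariance from combined summaries, using every data point."""
    n, sum_x, sum_y, sum_xy = combine(partials)
    if n < 2:
        raise ValueError("need at least two data points")
    return (sum_xy - sum_x * sum_y / n) / (n - 1)


# Two chunks standing in for data held on two different nodes.
chunks = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]
cov = covariance([partial_sums(chunk) for chunk in chunks])
```

The summary-based formula is chosen here because it is short; it can lose precision on large, poorly scaled data, and a production system would likely use a more numerically stable method.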
Open source software reduces your costs. SciDB Community Edition is provided under an open-source license. Developers can implement custom operators, aggregates and other extensions to the SciDB codebase. SciDB runs on existing commodity hardware or in the cloud, delivering cost-effective analytics without the need for expensive appliances or high performance computers.
Scale out by adding commodity hardware nodes to your cluster. Commodity hardware means affordability and flexibility. Proprietary hardware locks you into a vendor and comes with a high cost of ownership. SciDB lets you take advantage of the best price/performance in the industry.
Programmable from R & Python
Leverage existing workforce skill sets rather than learning new development tools. Analysts and developers can harness the power of SciDB arrays and distributed processing from familiar R and Python interfaces.