SciDB helps data scientists; bioinformatics and clinical informatics researchers; and quants and analysts in a variety of industries.
Life Sciences & Healthcare Informatics
It is hard to overstate the impact that massive amounts of digital data will have on life sciences, medicine, and healthcare, where meaningful discoveries and breakthrough treatments depend on the ability to rapidly and reproducibly convert measured data into knowledge and better outcomes. The ability to effectively analyze data is particularly significant in translational medicine, creating unprecedented opportunities for pharmaceuticals and biotechnology. The massive quantities of data produced by experimental biology, high-throughput screening platforms, genomics, clinical trials, and adverse-event and response studies can substantially improve our capacity to develop the next generation of drugs or to properly guide their applicability with personalized medicine. We see data providers, researchers, clinicians, and instrument and sensor manufacturers struggling with both data management and analytics. SciDB has been designed to support the special needs of data-intensive scientific computing applications.
A scientific computational database makes data readily accessible and supports data sharing from a curated master while preserving data integrity.
Data is too often kept in thousands of files, making data inventories, retrieval, integration, sharing, and version control difficult. At one institution, five researchers needed three meetings just to find the data for a new analytical project: files containing gene expression data had been archived, and the metadata and clinical data were each stored elsewhere. In many organizations, multiple versions of datasets proliferate, leading to tracking problems, errors, and an inability to reproduce results when an inappropriate or outdated version is unintentionally used. Metadata such as experimental variables and data provenance are often encoded in filenames, making the metadata inaccessible for searching and selection. SciDB’s scientific database provides data management and data access throughout the data lifecycle, supporting the ability to reanalyze data with new analytical techniques as well as the need for reproducibility and compliance.
A scientific computational database supports growing experimental complexity and data volumes.
Experiments, assays, high-throughput screening, GWAS, metagenomic studies, drug-retargeting studies, and cohort and outcome studies are all getting ever larger and more complex. SciDB helps researchers manage this complexity. Experimental variables, samples, subjects, and diverse data sources and types map quite naturally to dimensions in SciDB’s multi-dimensional array data model. Multi-dimensional data selection and cohort selection become intuitive and fast with SciDB’s multi-dimensional storage, independent of the data size. Public data sets like TCGA, ICGC, and CGHub are both proliferating and growing in size, and many organizations need to integrate public data with their proprietary data to augment their own data and to validate findings. With SciDB, researchers can load and readily join public and proprietary data for analysis.
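As a loose illustration of the dimension-modeling idea above, here is a small NumPy sketch. The gene/sample/timepoint layout and all values are hypothetical, and NumPy uses bare positional axes where SciDB declares named dimensions, but it shows how cohort selection reduces to array slicing:

```python
import numpy as np

# Hypothetical expression data indexed by (gene, sample, timepoint).
# Each axis plays the role of a named dimension in an array database.
n_genes, n_samples, n_timepoints = 1000, 50, 4
rng = np.random.default_rng(0)
expression = rng.normal(size=(n_genes, n_samples, n_timepoints))

# "Cohort selection" becomes slicing: samples 10-19 at the final
# timepoint, restricted to the first 100 genes.
cohort = expression[:100, 10:20, -1]
print(cohort.shape)  # (100, 10)
```

In an array database the same selection is expressed over named dimensions rather than integer positions, but the underlying access pattern is the same.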
A scientific computational database supports interactive data exploration and faster ask-to-answer iterations.
Typical research workflows involve finding and extracting data from files or a data warehouse, then transforming and loading it into a separate mathematical or statistical computing package. With SciDB, complex analytics are performed in the database. Moreover, SciDB’s math (correlation, principal component analysis, clustering, and network analysis) scales transparently, so researchers do not have to worry about computations failing or taking hours to complete as data volumes grow. Informatics researchers can explore data and iterate through hypotheses faster with streamlined research workflows and faster computations. With more facile data management and faster, bigger math, SciDB enables researchers to work with the full spectrum of data to accelerate discovery.
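To make the principal component analysis mentioned above concrete, here is a minimal NumPy sketch of the underlying math on hypothetical data. SciDB runs the equivalent linear algebra in-database across a cluster; this single-machine version only illustrates the computation itself:

```python
import numpy as np

# Hypothetical data: 200 observations of 10 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))

# PCA via singular value decomposition of the centered data.
Xc = X - X.mean(axis=0)                  # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                      # top two principal directions
scores = Xc @ components.T               # project observations onto them
print(scores.shape)  # (200, 2)
```

The point of in-database analytics is that this same decomposition keeps working as `X` grows past a single machine's memory, without the researcher exporting the data first.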
SciDB in quant finance: 100:1 speed improvement
What could you do with a 100x speedup? How about 1000x, just by adding more commodity hardware?
- Hedge funds improve returns by backtesting trade models against real market order activity.
- Market makers increase margins by building and backtesting trading algorithms that don’t move the market against the trade.
Faster performance allows users to validate more models sooner and exploit market changes. The common thread in both these applications is the so-called “as-of join”: find the previous value in a sparse time series and carry that value forward to the current time. This database operation is also sometimes referred to as the “inexact temporal join” or “NA.LOCF” (not available, last observation carried forward). This simple concept is a building block for, among other things, constructing and consolidating order books, but it is computationally intensive when applied to big data.
SciDB excels at this query because of its spatial clustering of data and its massively parallel architecture. Spatial clustering ensures that finding the previous or next value in a time series requires minimal reads, usually just one. The massively parallel architecture allows SciDB users to speed up execution simply by adding more hardware. Order book construction and aggregation typically runs in hours or overnight on another industry-leading solution but runs in seconds to minutes with SciDB on inexpensive commercial off-the-shelf hardware. Read our blog on order book building and consolidation to learn more.
Arrays are a natural fit for multidimensional financial trade data. Timestamp, symbol, volume, price, and exchange make natural dimensions for SciDB’s array storage. SciDB’s multidimensional storage turns sliding range selections and aggregates into constant-time operations, independent of the database size. In contrast, relational databases struggle, requiring multiple passes over the data, or multiple stored copies of the data, to perform these basic operations. Fast sliding range selections and aggregates are the building blocks required to:
- Analyze tick data at different time resolutions (read our Technical Brief)
- Build the National Best Bid and Offer (NBBO) book
- Aggregate order books across multiple exchanges (read our blog post and recreate our example)
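The as-of join described above can be sketched in a few lines of pandas, which ships the operation as `merge_asof`. The quote and trade data here are entirely hypothetical, and pandas is only a stand-in used to show the semantics, not how SciDB executes it:

```python
import pandas as pd

# Hypothetical quotes: a sparse time series of bid prices.
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-02 09:30:00",
                            "2024-01-02 09:30:05",
                            "2024-01-02 09:30:12"]),
    "bid": [100.0, 100.1, 100.3],
})

# Hypothetical trades arriving between quote updates.
trades = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-02 09:30:03",
                            "2024-01-02 09:30:07",
                            "2024-01-02 09:30:15"]),
    "size": [50, 25, 75],
})

# As-of join: for each trade, carry forward the most recent quote
# at or before the trade time (last observation carried forward).
asof = pd.merge_asof(trades, quotes, on="time")
print(asof["bid"].tolist())  # [100.0, 100.1, 100.3]
```

Each trade picks up the bid that was in force when it executed, which is exactly the carry-forward behavior order book construction relies on.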
These are the inputs to analytic models, which can be developed directly in SciDB without having to extract, transform, and load the data into a separate analytical package. No ETL means fast ask-to-answer loops, better models, and more profitable trades. Finally, when it comes to backtesting these models against historic data, SciDB’s sliding range and window aggregates can be used to validate and assess the strength of these models rapidly and over time. Fast backtesting means you will quickly spot the market changes that turn previously profitable models into underperformers, and that turn previously underperforming models into winners.
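The sliding window aggregates used in backtesting can be illustrated with a short pandas sketch on hypothetical minute-bar prices (again, a single-machine stand-in for what SciDB computes in parallel):

```python
import pandas as pd

# Hypothetical minute-bar closing prices.
prices = pd.Series([100.0, 101.0, 99.5, 100.5, 102.0, 101.5])

# A 3-bar sliding mean: the first two entries are NaN because the
# window is not yet full.
rolling_mean = prices.rolling(window=3).mean()
print(rolling_mean.tolist())
```

The same rolling machinery yields moving-average signals, rolling volatility, and the other window aggregates a backtest evaluates over time; the database-side advantage is keeping these windows fast as the history grows.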
Improve products, services, and reliability
Manufacturers, miners, utilities, and fleet operators collect and analyze data across a spectrum of sources, sampling frequencies, volumes, and relevance. Sensor, laboratory, maintenance, and utilization data forms a rich ecosystem of orthogonal and mineable analytical opportunities, with the potential to drive improvements in product and service quality, on-time delivery, asset maintenance and utilization, and employee safety. Industrial data is ordered and highly dimensional – far more complex than the flat, fact-oriented data that conventional databases were designed for. SciDB’s massively scalable windowing functions and complex analytics exploit this inherent structure. Uniquely, SciDB provides both ACID guarantees and full DBMS functionality within an in-database analytics environment that eliminates time-consuming data extraction overhead for engineers and analysts. SciDB’s shared-nothing, massively parallel processing (MPP) distributed database runs on tens to thousands of commodity-hardware nodes in the cloud or on-premises. Robust array archiving ensures that laboratory and maintenance records that need to be updated with corrections or additions are never overwritten – critical functionality for pharmaceutical and other highly regulated manufacturers and service providers.
Innovative products faster to market
Property and casualty insurers are developing innovative personalized products and value-added services. These products and services leverage new data sources – from telematics devices in cars that measure driver behavior and engine conditions to GPS devices that collect location data. Trip-based insurance pricing will integrate current road, traffic and weather conditions, drivers’ trip profiles, fuel pricing and road accident history, enabling insurers to offer consumers flexible and attractively priced products.
SciDB’s native array model is designed to deal with the massive volumes of time, location, and sensor data collected at high frequency sampling rates, generated by millions of sensors and mobile devices. Scalable complex analytics lets analysts and data scientists build pricing and risk models entirely in-database. SciDB enables true ad hoc data exploration, helping insurers surface new market opportunities and pricing inefficiencies.
Find. Keep. Grow.
As businesses compete to find, keep, and grow customer relationships, the race is on to identify behaviors and traits that predispose customers to accept an offer or click on an ad. E-commerce businesses depend on micro-personalization to make the right offer, at the right time, to the right customer; and on fraud detection to prevent unauthorized transactions.
The underlying mathematical algorithms for these use cases are based on linear algebra techniques (e.g. principal component analysis, singular value decomposition, clustering, general linear models) applied to big data. SciDB has an unfair advantage for these applications because it can run these math functions directly in the database on massive datasets spread across a cluster—without moving the data. SciDB’s array data model is the natural way to store sparse multi-dimensional data for these underlying analytic techniques because it makes for fast ad hoc data exploration and stores the data in the format needed for these complex mathematical techniques.
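As an illustration of the sparse linear algebra named above, here is a SciPy sketch of a truncated singular value decomposition on a hypothetical sparse customer-by-product interaction matrix. The matrix, its density, and its dimensions are all made up; the point is only that the decomposition operates directly on a sparse representation:

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical sparse data: 500 customers x 100 products, with
# about 2% of entries nonzero (most customers touch few products).
interactions = sparse_random(500, 100, density=0.02,
                             random_state=42).tocsr()

# Truncated SVD keeps only the k strongest latent factors, the core
# step in recommendation and anomaly-detection pipelines.
U, s, Vt = svds(interactions, k=5)
print(U.shape, s.shape, Vt.shape)  # (500, 5) (5,) (5, 100)
```

Running this factorization where the data lives, rather than exporting the matrix to a separate analytics package, is the advantage the paragraph above describes.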
Faster data exploration and advanced complex math on big data mean better targeting, more appropriate offers, and more revenue.