We’re delighted with the response to our webinar, Analyze More, Program Less: Using SciDB for Computational Finance. We didn’t have time to answer all the questions submitted through the chat window. We answer the overflow now.
- Can you show some of the C++ examples?
We will post some examples soon.
- Are min/max values stored for every array chunk so that when processing range queries, chunks can be skipped over?
Yes, the SciDB engine can skip over chunks when performing range queries.
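As a sketch of how this looks in practice, the AFL `between()` operator restricts a query to a coordinate range, and the engine reads only the chunks that overlap that range. The array name, attribute, and dimensions below are hypothetical:

```
-- Hypothetical 2-D array of closing prices, chunked 1000 x 100:
CREATE ARRAY prices <close:double> [stock=0:99999,1000,0, day=0:9999,100,0];

-- Range query over stocks 0..999 and days 500..599; chunks whose
-- coordinate bounds lie entirely outside this box are never read:
between(prices, 0, 500, 999, 599);
```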
- Are Vertica-like compressed array columns supported? Or something analogous for arrays?
Yes, SciDB uses vertical partitioning and run-length encoding.
- How can we run SciDB on Hadoop?
You can load data stored in HDFS into SciDB. Instructions are here.
If you are asking about running SciDB inside the MapReduce framework, the answer is no. SciDB already offers massive parallelism and distributed storage. MapReduce is a framework for implementing roll-your-own solutions to embarrassingly parallel problems.
By contrast, SciDB is a scalable, parallel, distributed DBMS with built-in capabilities for complex analytics, both embarrassingly parallel and not. Imposing Hadoop on SciDB would sacrifice much of SciDB’s value: ACID database semantics, complex analytics, and the automatic data distribution and parallelism that Hadoop makes you manage by hand (mapping this, reducing that, and so on). One of SciDB’s great conveniences is its support for complex, exploratory (that is, ad-hoc) analytics at scale; adding MapReduce to the picture would disrupt that workflow.
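For the HDFS case, the usual pattern is to copy the file out of HDFS and then bulk-load it into an existing array. A minimal sketch, in which the paths and array name are hypothetical and the optional format and instance-id arguments to `load()` vary by release:

```
# Copy the file out of HDFS to the local filesystem (shell):
hdfs dfs -get /data/ticks.csv /tmp/ticks.csv

-- Then, in AFL, load it into an existing SciDB array; parallel-load
-- variants take extra instance-id and format arguments:
load(ticks, '/tmp/ticks.csv');
```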
- Does the regrid() operator build additional indexes, or does it actually store the processed arrays?
Like all SciDB operators, regrid() produces a result array. Using regrid() on an existing array does not alter that array in any way. And remember, SciDB does not store indexes; indexes are implicit in the way array cells are organized in memory or on disk.
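A hedged sketch of the point above, with hypothetical array and attribute names: `regrid()` returns a coarsened result array, and `store()` is needed if you want to keep that result.

```
-- Average a hypothetical 1000 x 1000 array A over 100 x 100 blocks,
-- producing a 10 x 10 result; A itself is not modified:
regrid(A, 100, 100, avg(val));

-- Persist the result as a new array:
store(regrid(A, 100, 100, avg(val)), A_coarse);
```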
- How fast is SciDB in exporting (as opposed to importing) data?
As mentioned during the webinar, SciDB importing is fast because it can occur in parallel on every SciDB instance in the cluster. SciDB exporting can also occur in parallel across the entire cluster, so it is also fast.
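As an illustration, the AFL `save()` operator writes an array out. The path and format string below are assumptions, and the instance-id convention varies by release (a negative instance id selects special behavior such as saving on the coordinator or on every instance in parallel):

```
-- Export a hypothetical array A as CSV; with a parallel instance id
-- (commonly -2), each instance writes its own chunks to a local file:
save(A, '/tmp/A.csv', -2, 'csv');
```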
- How is the computational performance of SciDB when handling terabytes of data? How is the memory management?
We have users with SciDB databases containing petabytes of data. And because SciDB is horizontally scalable, computation on larger data volumes can be accommodated with larger clusters, which are easy and economical to create because SciDB runs on commodity hardware.
- Is it possible to append data to an array while querying that array?
You can submit two such operations simultaneously, but the SciDB engine will complete one operation before beginning the other.
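A sketch of the append pattern, with hypothetical array names: stage the new data, then `insert()` it into the target. Because SciDB arrays are versioned, a concurrent scan sees a consistent version, either before or after the insert.

```
-- Append newly staged cells into the target array; the engine
-- serializes this with any concurrent query on the same array:
insert(new_ticks, ticks);

-- A concurrent reader scans a consistent version of ticks:
scan(ticks);
```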
- What is the maximum suggested number of dimensions for arrays? 10? 100?
For dense data, there really is no limit. But high-dimensional arrays tend to be very sparse, and in very sparse arrays memory consumption is dominated by chunk headers rather than array cells, which hurts performance. We have customers who have worked successfully with 15-dimensional sparse arrays.
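For concreteness, here is a hypothetical 4-dimensional sparse array of option quotes. In sparse arrays, choosing chunk sizes so that each chunk holds a healthy number of non-empty cells keeps the chunk-header overhead described above from dominating:

```
CREATE ARRAY quotes
  <bid:double, ask:double>
  [stock=0:99999,1000,0, expiry_day=0:3999,100,0,
   strike=0:9999,1000,0, ms=0:86399999,3600000,0];
```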
That’s all for now. Stay tuned for more webinar follow-up, including code examples. And if you missed the webinar, watch this space. When the video recording of the webinar is available, we’ll put it on our website and announce it with a blog post.