Product Tip: Loading Data Into SciDB from HDFS

Some customers have inquired about loading files from HDFS. Current versions of SciDB support this.

Remember that HDFS is not a file format; it is a distributed file system, and it can store ordinary CSV files. Although files in HDFS are not directly visible to the local Linux file system, you can work around this by using STDIN: pipe the contents of the CSV file directly into loadcsv.py, SciDB's script for parallel loading of CSV files.

Here’s an example:

$ hadoop dfs -cat /user/hduser/test.csv | loadcsv.py \
    -a "test" \
    -s "<attr1:string,attr2:int64,attr3:double>[i=0:*,100000,0]"
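
One note on the Hadoop side of the pipe: on newer Hadoop releases the hadoop dfs command is deprecated in favor of hdfs dfs (or hadoop fs). The SciDB side of the pipeline is unchanged; only the command that writes the file to STDOUT differs:

$ hdfs dfs -cat /user/hduser/test.csv | loadcsv.py \
    -a "test" \
    -s "<attr1:string,attr2:int64,attr3:double>[i=0:*,100000,0]"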


Let’s look at the parts of the example in detail.

  • hadoop dfs -cat
    • The Hadoop command that writes the contents of a file to STDOUT.
  • /user/hduser/test.csv
    • The CSV file in the HDFS file system.
  • | loadcsv.py
    • Pipes the contents of test.csv into loadcsv.py.
  • -a "test"
    • The -a switch precedes the name of the SciDB array that is the target of the load operation.
  • -s "<attr1:string,attr2:int64,attr3:double>[i=0:*,100000,0]"
    • The -s switch precedes the schema for the target array.
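
Because loadcsv.py reads the CSV from STDIN in this example, the same approach works with anything that can write CSV to standard output, not just hadoop dfs -cat. For instance, if the file is stored gzipped in HDFS, you can decompress it on the fly. This is a minimal sketch; the file name test.csv.gz is illustrative, and the array and schema are the same as above:

$ hadoop dfs -cat /user/hduser/test.csv.gz | gunzip -c | loadcsv.py \
    -a "test" \
    -s "<attr1:string,attr2:int64,attr3:double>[i=0:*,100000,0]"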

The script loadcsv.py capitalizes on SciDB’s parallelism in the following ways:

  • It automatically determines the number of SciDB instances, n, in the cluster.
  • It partitions the to-be-loaded CSV file into n nonoverlapping fragments, delivered through named pipes.
  • In parallel, it converts each of the n CSV fragments into dense-load-format (DLF) named pipes.
  • In parallel, it loads the n DLF streams to produce a single SciDB array distributed across the cluster.

Note that all of this partitioning and distribution happens automatically when you issue the command that pipes the test.csv file from HDFS into loadcsv.py; you do not need to manage the details yourself. If you run the script in verbose mode (with the -v switch of loadcsv.py), you will see evidence of all this automatic parallelism. For example, the following transcript shows the result of loading a CSV file from HDFS into an array on a four-instance SciDB cluster:

$ hadoop dfs -cat /user/hduser/test.csv | loadcsv.py -a "test" -s "<attr1:string,attr2:int64,attr3:double>[i=0:*,100000,0]" -x -v
Parsing chunk size from provided load schema.
Using chunk size of 100000 for load array based on load array schema definition.
Getting SciDB configuration information.
"/opt/scidb/13.2/bin/iquery" -c localhost -p 1239 -o csv -aq "list('instances')"
This SciDB installation has 4 instance(s).
Creating CSV fragment FIFOs.
"/tmp/stdin.csv_0000" created.
"/tmp/stdin.csv_0001" created.
"/tmp/stdin.csv_0002" created.
"/tmp/stdin.csv_0003" created.
Creating DLF fragment FIFOs.
mkfifo "/data/scidb/poc/000/0/stdin.csv.dlf"
mkfifo "/data/scidb/poc/000/1/stdin.csv.dlf"
mkfifo "/data/scidb/poc/000/2/stdin.csv.dlf"
mkfifo "/data/scidb/poc/000/3/stdin.csv.dlf"
Starting CSV splitting process.
"/opt/scidb/13.2/bin/splitcsv" -n 4 -c 100000 -s 0 -o "/tmp/stdin.csv"
Starting CSV distribution and conversion processes.
cat "/tmp/stdin.csv_0000" | "/opt/scidb/13.2/bin/csv2scidb" -c 100000 -f 0 -n 4 -d "," -o "/data/scidb/poc/000/0/stdin.csv.dlf"
cat "/tmp/stdin.csv_0001" | "/opt/scidb/13.2/bin/csv2scidb" -c 100000 -f 100000 -n 4 -d "," -o "/data/scidb/poc/000/1/stdin.csv.dlf"
cat "/tmp/stdin.csv_0002" | "/opt/scidb/13.2/bin/csv2scidb" -c 100000 -f 200000 -n 4 -d "," -o "/data/scidb/poc/000/2/stdin.csv.dlf"
cat "/tmp/stdin.csv_0003" | "/opt/scidb/13.2/bin/csv2scidb" -c 100000 -f 300000 -n 4 -d "," -o "/data/scidb/poc/000/3/stdin.csv.dlf"
Removing "test" array.
"/opt/scidb/13.2/bin/iquery" -c localhost -p 1239 -anq "remove(test)"
Creating "test" array.
"/opt/scidb/13.2/bin/iquery" -c localhost -p 1239 -nq "CREATE ARRAY test <attr1:string,attr2:int64,attr3:double>[i=0:*,100000,0]"
Loading data into "test" array (may take a while for large input files). 1-D load only since no target array name was provided.
"/opt/scidb/13.2/bin/iquery" -c localhost -p 1239 -anq "load(test, 'stdin.csv.dlf', -1, 'text', 0)"
Performing cleanup tasks.
Removing CSV fragment FIFOs.
"/tmp/stdin.csv_0000" removed.
"/tmp/stdin.csv_0001" removed.
"/tmp/stdin.csv_0002" removed.
"/tmp/stdin.csv_0003" removed.
Removing DLF fragment FIFOs.
rm -f "/data/scidb/poc/000/0/stdin.csv.dlf"
rm -f "/data/scidb/poc/000/1/stdin.csv.dlf"
rm -f "/data/scidb/poc/000/2/stdin.csv.dlf"
rm -f "/data/scidb/poc/000/3/stdin.csv.dlf"
Total Elapsed Time: 1.704 seconds.
Success: Data Loaded.
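
Once the script reports success, you can spot-check the new array with iquery. This is a minimal sketch; it assumes the connection settings shown in the transcript above (-c localhost -p 1239) and the array name test from the example:

$ iquery -c localhost -p 1239 -aq "show(test)"                 # confirm the schema
$ iquery -c localhost -p 1239 -aq "aggregate(test, count(*))"  # count the loaded cells
$ iquery -c localhost -p 1239 -aq "between(test, 0, 4)"        # inspect the first five cells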
