A few weeks back at talpasolutions I needed, purely for performance-testing purposes, to import a .tsv
file (roughly 600,000 lines of time-stamped sensor data in 15 columns) into Redis to see how it
handles the data (especially memory-wise). As it turned out, the import is feasible with a single (long) one-line CLI
command.
Using the pipe mode (redis-cli --pipe),
this is what I came up with (full command and corresponding .tsv
see below ... also for Redis data types other than streams):
awk -F '\t' 'FNR > 1 {print "XADD some_stream "($1=="" ? "NA" : $1)" column_1 "($2=="" ? "NA" : $2)" column_2 "($3=="" ? "NA" : $3)" ... \n"}' /path/to/data.tsv | redis-cli --pipe
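A handy way to sanity-check what this emits is to drop the trailing `| redis-cli --pipe` and inspect the generated commands first. A minimal sketch with a hypothetical two-column sample file:

```shell
# hypothetical two-column sample: header row plus one data row
printf 'ts\tcolumn_1\n1549878790\thello\n' > /tmp/preview.tsv

# same awk pattern as above, minus the redis-cli pipe
awk -F '\t' 'FNR > 1 {print "XADD some_stream "($1=="" ? "NA" : $1)" column_1 "($2=="" ? "NA" : $2)}' /tmp/preview.tsv
# prints: XADD some_stream 1549878790 column_1 hello
```

Once the generated commands look right, append `| redis-cli --pipe` again to run the actual import.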
Import Choices
Field Delimiter
If your field delimiter is a tab, then use:
awk -F '\t' ...
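A quick way to see the effect of the delimiter choice, on a hypothetical one-line input:

```shell
# with a tab delimiter, awk sees three fields
printf 'a\tb\tc,d\n' | awk -F '\t' '{print NF}'   # prints 3
# with a comma delimiter, the same line has only two fields
printf 'a\tb\tc,d\n' | awk -F ',' '{print NF}'    # prints 2
```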
Header Row
If you have a header row to ignore on import, add this to your awk
statement:
... FNR > 1 ...
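`FNR` is awk's per-file line counter, so `FNR > 1` simply skips the first line. A sketch with a hypothetical header plus two data rows:

```shell
# the header row (FNR == 1) is skipped; only 100 and 101 are printed
printf 'ts\tcolumn_1\n100\ta\n101\tb\n' | awk -F '\t' 'FNR > 1 {print $1}'
```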
String Qualifier and White Spaces in Cell Values
This was one of the trickiest parts, as in the beginning I often got the error message ERR Protocol error: unbalanced quotes in request.
If your .csv
cell value does not already contain double quotes
you can use either variant and "try around" a little:
... column_1 "$2" ...
... column_1 \""$2"\" ...
If the value already has quotes around the string you must use the first option; otherwise you will run into the above error message again.
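The difference between the two variants, sketched on hypothetical values (in the inline protocol that redis-cli --pipe reads here, every unquoted space starts a new argument, which is why values with spaces need the wrapping quotes):

```shell
# variant 1: the value is inserted as-is; fine for values without spaces,
# but a value with spaces would be split into several arguments
echo 'somevalue' | awk '{print "SET some_key "$0}'
# prints: SET some_key somevalue

# variant 2: awk wraps the value in escaped double quotes,
# keeping a value with spaces together as one argument
echo 'value with spaces' | awk '{print "SET some_key \""$0"\""}'
# prints: SET some_key "value with spaces"
```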
Empty Cells
To avoid misinterpretation by redis-cli
due to empty cells, and thus missing parameters, use a "shorthand if"
(ternary operator) to check for empty cells and fill them with e.g. "NA":
column_2 "($3=="" ? "NA" : $3)"
(Valid) JSON strings
If your .csv
happens to have JSON-like strings in some cells ... it depends ;-> on:
- whether these are already valid, i.e. have double quotes around the keys
- the string qualifier in the .csv
- whether you want to store them (immediately) as valid JSON strings (and thus directly parsable in the code which fetches them from Redis)
- or whether you can afford to store them "relaxed"/JSON5-style and deal with it later in your code
In the original file I had, there were no quotes around the JSON keys. For the worst case, that your string qualifier is double quotes and you need to store the strings in a valid way, you will need to escape the quotes around the JSON keys in your .csv:
... "{\"key_1\": [5.1234567,15.1234567], \"key_2\": 5.000000, \"key_3\": 1, \"key_4\":-1}" ...
At least I did not come up with anything else in the CLI
command per se (thus circumventing the need to clean up your raw data ... but maybe this thread or this one will enlighten you more than me).
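If you do want to handle it inside the awk step rather than in the raw data, one option (a sketch, not what I used, assuming the cell contains no other special characters) is to escape the existing double quotes with gsub before printing, so the generated command keeps balanced quotes:

```shell
# escape every " in the cell as \" so the outer quoting stays balanced
echo '{"key_1": 1}' | awk '{gsub(/"/, "\\\\\""); print "SET some_key \""$0"\""}'
# prints: SET some_key "{\"key_1\": 1}"
```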
Other Gotchas
New Line character
Don't forget to print a \n
at the end of the awk
command:
... \n"}' /path/to/data.tsv | redis-cli --pipe
Extra (Redis Streams only): CSV Timestamp as Redis Stream ID
If the timestamp from your .csv
column is not unique but you want to use it as the Redis stream ID,
you will need to come up with some kind of index variable (or similar) for the import. (The timestamps in the .tsv
I had were unique and sorted in ascending order.)
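One sketch of such an index variable: count occurrences per timestamp in an awk array and append the count as the sequence part of the stream ID (XADD accepts explicit ms-seq IDs as long as they are strictly increasing). Input and stream name here are hypothetical:

```shell
# hypothetical input with a duplicated timestamp in column 1;
# seq[$1] counts how often each timestamp has been seen so far
printf '100\ta\n100\tb\n101\tc\n' |
awk -F '\t' '{s = seq[$1]++; print "XADD some_stream "$1"-"s" v "$2}'
# emits the unique stream IDs 100-0, 100-1, 101-0
```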
Actual Import
Sample Data
Here is a simplified version:
ts column_1 column_2 column_3 column_4 column_5
1549878790 "{\"key_1\": [0.1234567,10.1234567], \"key_2\": 0.000000, \"key_3\": 1, \"key_4\":-1}" true 0 -5 "Some Other Long String 1"
1549878791 "{\"key_1\": [1.1234567,11.1234567], \"key_2\": 1.000000, \"key_3\": 1, \"key_4\":-1}" false 0 "Some Other Long String 2"
1549878792 "{\"key_1\": [2.1234567,12.1234567], \"key_2\": 2.000000, \"key_3\": 1, \"key_4\":-1}" false 0 -44 "Some Other Long String 3"
1549878793 "{\"key_1\": [3.1234567,13.1234567], \"key_2\": 3.000000, \"key_3\": 1, \"key_4\":-1}" false 0 -1234 "Some Other Long String 4"
1549878794 "{\"key_1\": [4.1234567,14.1234567], \"key_2\": 4.000000, \"key_3\": 1, \"key_4\":-1}" false 0 0 "Some Other Long String 5"
1549878795 "{\"key_1\": [5.1234567,15.1234567], \"key_2\": 5.000000, \"key_3\": 1, \"key_4\":-1}" 123 -166 "Some Other Long String 6"
1549878796 "{\"key_1\": [6.1234567,16.1234567], \"key_2\": 6.000000, \"key_3\": 1, \"key_4\":-1}" true 125 -125 "Some Other Long String 7"
1549878797 "{\"key_1\": [7.1234567,17.1234567], \"key_2\": 7.000000, \"key_3\": 1, \"key_4\":-1}" false 0 571
1549878798 "{\"key_1\": [8.1234567,18.1234567], \"key_2\": 8.000000, \"key_3\": 1}" false 0 43375 "Some Other Long String 9"
1549878799 "{\"key_1\": [9.1234567,19.1234567], \"key_2\": 9.000000, \"key_3\": 1}" false 1 10125 "Some Other Long String 10"
Sample Import
If you have (finally) sorted out everything from above, the actual import (here with XADD
) is straightforward and, as always with Redis, very fast:
docker volume create redis_csv_import
docker run -v redis_csv_import:/data --name some-redis -d redis:5.0.3 redis-server --appendonly yes
docker cp redis_import_fake.tsv some-redis:/tmp
docker exec -it some-redis /bin/bash
# All commands now within the container
# to see if file is in the container
# ls -al /tmp/
# to check if awk recognizes your columns correctly
# awk -F'\t' '{print NF; exit}' /tmp/redis_import_fake.tsv
# for troubleshooting the format of your raw data
# cat -A /tmp/redis_import_fake.tsv
# do import
awk -F '\t' 'FNR > 1 {print "XADD some_stream \""($1=="" ? "NA" : $1)"\" column_1 "($2=="" ? "NA" : $2)" column_2 \""($3=="" ? "NA" : $3)"\" column_3 \""($4=="" ? "NA" : $4)"\" column_4 \""($5=="" ? "NA" : $5)"\" column_5 "($6=="" ? "NA" : $6)" \n"}' /tmp/redis_import_fake.tsv | redis-cli --pipe
# verify import
redis-cli XRANGE some_stream - +
1) 1) "1549878790-0"
2) 1) "column_1"
2) "{\"key_1\": [0.1234567,10.1234567], \"key_2\": 0.000000, \"key_3\": 1, \"key_4\":-1}"
3) "column_2"
4) "true"
5) "column_3"
6) "0"
7) "column_4"
8) "-5"
9) "column_5"
10) "Some Other Long String 1"
2) 1) "1549878791-0"
2) 1) "column_1"
2) "{\"key_1\": [1.1234567,11.1234567], \"key_2\": 1.000000, \"key_3\": 1, \"key_4\":-1}"
3) "column_2"
4) "false"
5) "column_3"
6) "0"
7) "column_4"
8) "NA"
9) "column_5"
10) "Some Other Long String 2"
...
The CLI
statement also works for other Redis data types, e.g. HMSET
:
# do import
awk -F '\t' 'FNR > 1 {print " HMSET some_hash:ts:"($1=="" ? "NA" : $1)" column_1 "($2=="" ? "NA" : $2)" column_2 "($3=="" ? "NA" : $3)" column_3 "($4=="" ? "NA" : $4)" column_4 "($5=="" ? "NA" : $5)" column_5 "($6=="" ? "NA" : $6)" \n"}' /tmp/redis_import_fake.tsv | redis-cli --pipe
# verify import
redis-cli keys "*"
1) "some_hash:ts:1549878791"
2) "some_hash:ts:1549878797"
3) "some_hash:ts:1549878798"
4) "some_hash:ts:1549878790"
5) "some_hash:ts:1549878792"
6) "some_hash:ts:1549878796"
7) "some_hash:ts:1549878799"
8) "some_hash:ts:1549878793"
9) "some_hash:ts:1549878794"
10) "some_hash:ts:1549878795"
Further Information
Redis Docs:
https://redis.io/topics/mass-insert
https://redis.io/topics/streams-intro
https://redis.io/commands/xadd
https://redis.io/commands/hmset
Gist which inspired my import approach:
https://gist.github.com/arsperger/6f246f21279edf3cd03ba2bee19daaef
JSON strings:
https://stackoverflow.com/questions/21065225/redis-command-line-set-value-containing-double-quotes
https://stackoverflow.com/questions/9637517/parsing-relaxed-json-without-eval