Data Exploration In BASH
Sometimes it is necessary to do some initial exploration in huge data files like a .csv
without a GUI: Sometimes you might only have a terminal window but no GUI application like Excel or an Editor available or sometimes the files are just so big that it takes ages for them to load and/or work with them in a GUI application.
This is why here are some basic to medium tricks on how to explore and navigate through a huge tab-separated .tsv
file with bash
only.
Be aware that some commands may vary depending on your operating system and your bash
version. The commands here have been tested on Ubuntu
and Mac OS
.
The Sample Data
Use cat
or less
(paged view not filling up your terminal) to see the whole file.
cat ~/Documents/sample_data.csv
ID TIME NAME TYPE VALUE
0 1556668800 A_SUB1 bool TRUE
1 1556668801 B int 10
2 1556668802 C int 200
3 1556668803 D int 3000
4 1556668804 E string
5 1556668805 F string a string
6 1556668806 G float 0.1
7 1556668807 H float 0.2
8 1556668808 I unknown
9 1556668809 A_SUB1_SUB2 bool TRUE
10 1556668810 B int 20
11 1556668811 C int 300
12 1556668812 D int 4000
13 1556668813 E string another string
14 1556668814 F string
15 1556668815 G float 1.1
16 1556668816 H float 1.2
17 1556668817 I
18 1556668818 A_SUB1 bool FALSE
You can also use cat -A
to show all (non-printing) characters. This file here is tab-separated.
Get First n
Lines
head -n 2 ~/Documents/sample_data.csv
ID TIME NAME TYPE VALUE
0 1556668800 A_SUB1 bool TRUE
Get Last n
Lines
tail -n 2 ~/Documents/sample_data.csv
17 1556668817 I
18 1556668818 A_SUB1 bool FALSE
Get Exact Line n
head -n 5 ~/Documents/sample_data.csv | tail -n 1
3 1556668803 D int 3000
Find First m
Substring Matches
grep -m 2 155666881 ~/Documents/sample_data.csv
10 1556668810 B int 20
11 1556668811 C int 300
Find Exact Match
grep -w A_SUB1 ~/Documents/sample_data.csv
0 1556668800 A_SUB1 bool TRUE
18 1556668818 A_SUB1 bool FALSE
# not exact and also finding 'A_SUB1_SUB2'
grep A_SUB1 ~/Documents/sample_data.csv
0 1556668800 A_SUB1 bool TRUE
9 1556668809 A_SUB1_SUB2 bool TRUE
18 1556668818 A_SUB1 bool FALSE
Find All Lines With Certain Values
grep -P "bool\tFALSE" ~/Documents/sample_data.csv
18 1556668818 A_SUB1 bool FALSE
grep -P "bool\tTRUE" ~/Documents/sample_data.csv
0 1556668800 A_SUB1 bool TRUE
9 1556668809 A_SUB1_SUB2 bool TRUE
In Mac OS you need to use grep -e
instead.
Find All Lines With Empty Value Column
grep -P "\t__aSyNcId_<_FJIlzVJn__quot; ~/Documents/sample_data.csv
4 1556668804 E string
14 1556668814 F string
17 1556668817 I
In Mac OS again you need to use grep -e
instead.
Count All Lines
wc -l ~/Documents/sample_data.csv
20 ~/Documents/sample_data.csv
Count Specific Lines
grep A_SUB1 ~/Documents/sample_data.csv | wc -l
3
Print Certain Columns Only
awk -F '\t' '{print $1}' ~/Documents/sample_data.csv
ID
0
1
...
awk -F '\t' 'FNR > 1 {print $1}' ~/Documents/sample_data.csv
0
1
2
...
Use F '\t'
for tab-separated files and FNR > 1
to ignore the header row.
Sort By Certain Row
# 3rd row with duplicates
sort -k 3,3 ~/Documents/sample_data.csv
0 1556668800 A_SUB1 bool TRUE
18 1556668818 A_SUB1 bool FALSE
9 1556668809 A_SUB1_SUB2 bool TRUE
10 1556668810 B int 20
1 1556668801 B int 10
11 1556668811 C int 300
2 1556668802 C int 200
...
# 3rd row without duplicates
sort -u -k 3,3 ~/Documents/sample_data.csv
0 1556668800 A_SUB1 bool TRUE
9 1556668809 A_SUB1_SUB2 bool TRUE
1 1556668801 B int 10
2 1556668802 C int 200
...
Count Unique Cell Values
awk -F '\t' 'FNR > 1 {print $3}' ~/Documents/sample_data.csv | sort | uniq -c | sort -n
1 A_SUB1_SUB2
2 A_SUB1
2 B
2 C
...
# for occurences in descending order
awk -F '\t' 'FNR > 1 {print $3}' ~/Documents/sample_data.csv | sort | uniq -c | sort -nr
Use F '\t'
for tab-separated files and FNR > 1
to ignore the header row.
Summary
Of course there are even more tricks you can do in order to "get to know" your data and certainly from some point on you won't be using bash
for working on it (especially when it comes to maths and calculations which is not very pleasant/possible in a shell) ... but for some initial exploration without long loading times bash
can help you getting started.
Bonus
For not only quickly navigating through your data but also through your terminal these shortcuts are also (very) useful and great time savers:
ctrl
+a
: Jump to beginning of line.ctrl
+e
: Jump to end of line.ctrl
+w
: Delete one word to the left.ctrl
+u
: Delete complete line to the left.