Data Exploration in Bash

Sometimes it is necessary to do some initial exploration of huge data files like a .csv without a GUI: you might only have a terminal window and no GUI application like Excel or an editor available, or the files are simply so big that they take ages to load and work with in a GUI application.

So here are some basic to intermediate tricks for exploring and navigating a huge tab-separated file with bash only. (The sample file below has a .csv extension, but its columns are actually separated by tabs.)

Be aware that some commands may vary depending on your operating system and bash version. The commands here have been tested on Ubuntu and macOS.

The Sample Data

Use cat or less (a paged view that does not flood your terminal) to see the whole file.

cat ~/Documents/sample_data.csv

ID	TIME	NAME	TYPE	VALUE
0	1556668800	A_SUB1	bool	TRUE
1	1556668801	B	int	10
2	1556668802	C	int	200
3	1556668803	D	int	3000
4	1556668804	E	string	
5	1556668805	F	string	a string
6	1556668806	G	float	0.1
7	1556668807	H	float	0.2
8	1556668808	I		unknown
9	1556668809	A_SUB1_SUB2	bool	TRUE
10	1556668810	B	int	20
11	1556668811	C	int	300
12	1556668812	D	int	4000
13	1556668813	E	string	another string
14	1556668814	F	string	
15	1556668815	G	float	1.1
16	1556668816	H	float	1.2
17	1556668817	I		
18	1556668818	A_SUB1	bool	FALSE

You can also use cat -A to show all non-printing characters, which is how you can verify that this file is tab-separated.
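A minimal sketch of what cat -A shows, using a small hypothetical preview file in /tmp (GNU cat; on macOS/BSD use cat -et for the same markers):

```shell
# create a tiny tab-separated preview file (hypothetical path)
printf 'ID\tTIME\n0\t1556668800\n' > /tmp/preview.tsv

# GNU cat: -A renders tabs as ^I and end-of-line as $
cat -A /tmp/preview.tsv
```

Each tab appears as ^I and each line end as $, so a stray space before a tab or a missing column becomes immediately visible.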

Get First n Lines

head -n 2 ~/Documents/sample_data.csv

ID	TIME	NAME	TYPE	VALUE
0	1556668800	A_SUB1	bool	TRUE

Get Last n Lines

tail -n 2 ~/Documents/sample_data.csv

17	1556668817	I		
18	1556668818	A_SUB1	bool	FALSE

Get Exact Line n

head -n 5 ~/Documents/sample_data.csv | tail -n 1

3	1556668803	D	int	3000
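An equivalent way to print exactly one line is sed with a line address; a sketch on a small throwaway file:

```shell
# hypothetical throwaway file with five lines
printf 'line1\nline2\nline3\nline4\nline5\n' > /tmp/lines_demo.txt

# -n suppresses default output; '5p' prints only line 5
sed -n '5p' /tmp/lines_demo.txt
```

Note that sed counts the header as line 1, so "data row n" is line n+1 in the file.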

Find First m Substring Matches

grep -m 2 155666881 ~/Documents/sample_data.csv

10	1556668810	B	int	20
11	1556668811	C	int	300

Find Exact Match

grep -w A_SUB1 ~/Documents/sample_data.csv

0	1556668800	A_SUB1	bool	TRUE
18	1556668818	A_SUB1	bool	FALSE

# not exact and also finding 'A_SUB1_SUB2'
grep A_SUB1 ~/Documents/sample_data.csv

0	1556668800	A_SUB1	bool	TRUE
9	1556668809	A_SUB1_SUB2	bool	TRUE
18	1556668818	A_SUB1	bool	FALSE

Find All Lines With Certain Values

grep -P "bool\tFALSE" ~/Documents/sample_data.csv

18	1556668818	A_SUB1	bool	FALSE

grep -P "bool\tTRUE" ~/Documents/sample_data.csv

0	1556668800	A_SUB1	bool	TRUE
9	1556668809	A_SUB1_SUB2	bool	TRUE

macOS ships a grep without -P support; use a literal tab in the pattern instead, e.g. grep $'bool\tFALSE' (bash expands $'\t' to a real tab character).

Find All Lines With Empty Value Column

grep -P "\t$" ~/Documents/sample_data.csv

4	1556668804	E	string	
14	1556668814	F	string	
17	1556668817	I

Again, on macOS use a literal tab instead of -P, e.g. grep $'\t$'.

Count All Lines

wc -l ~/Documents/sample_data.csv

20 ~/Documents/sample_data.csv
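wc -l counts the header row too. If you only want the number of data rows, you can skip the first line with tail -n +2 before counting; a sketch on a hypothetical small file:

```shell
# hypothetical file: one header line plus two data rows
printf 'HEADER\nrow1\nrow2\n' > /tmp/count_demo.txt

# start output at line 2 (i.e. skip the header), then count
tail -n +2 /tmp/count_demo.txt | wc -l
```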

Count Specific Lines

grep A_SUB1 ~/Documents/sample_data.csv | wc -l

3
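grep can also count by itself with -c, which saves the wc pipe; note that it counts matching lines, not individual occurrences. A sketch on a throwaway file:

```shell
# hypothetical file of names
printf 'A_SUB1\nB\nA_SUB1_SUB2\nA_SUB1\n' > /tmp/grep_count_demo.txt

# -c prints the number of matching lines instead of the lines themselves
grep -c A_SUB1 /tmp/grep_count_demo.txt
```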

Print Certain Columns Only

awk -F '\t' '{print $1}' ~/Documents/sample_data.csv

ID
0
1
...

awk -F '\t' 'FNR > 1 {print $1}' ~/Documents/sample_data.csv

0
1
2
...

Use -F '\t' for tab-separated files and FNR > 1 to skip the header row.
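awk can also print several columns at once; a minimal sketch with a hypothetical two-row file:

```shell
# hypothetical tab-separated file: header plus one data row
printf 'ID\tNAME\tVALUE\n0\tA_SUB1\tTRUE\n' > /tmp/awk_demo.tsv

# print columns 2 and 3, joined by a tab in the output
awk -F '\t' 'FNR > 1 {print $2 "\t" $3}' /tmp/awk_demo.tsv
```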

Sort By Certain Column

# by 3rd column, keeping duplicates
sort -k 3,3 ~/Documents/sample_data.csv

0	1556668800	A_SUB1	bool	TRUE
18	1556668818	A_SUB1	bool	FALSE
9	1556668809	A_SUB1_SUB2	bool	TRUE
10	1556668810	B	int	20
1	1556668801	B	int	10
11	1556668811	C	int	300
2	1556668802	C	int	200
...

# by 3rd column, without duplicates
sort -u -k 3,3 ~/Documents/sample_data.csv

0	1556668800	A_SUB1	bool	TRUE
9	1556668809	A_SUB1_SUB2	bool	TRUE
1	1556668801	B	int	10
2	1556668802	C	int	200
...
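Be aware that sort compares lexicographically by default, so e.g. 3000 sorts before 400. For numeric data use -n, sketched here on a hypothetical one-column file:

```shell
# hypothetical one-column file of numbers
printf '3000\n400\n20\n' > /tmp/sort_demo.txt

# default sort is lexicographic: 20, 3000, 400
sort /tmp/sort_demo.txt

# -n sorts numerically: 20, 400, 3000
sort -n /tmp/sort_demo.txt
```

For a specific column, append n to the key instead, e.g. sort -t $'\t' -k 2,2n.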

Count Unique Cell Values

awk -F '\t' 'FNR > 1 {print $3}' ~/Documents/sample_data.csv | sort | uniq -c | sort -n

1 A_SUB1_SUB2
2 A_SUB1
2 B
2 C
...

# for occurrences in descending order
awk -F '\t' 'FNR > 1 {print $3}' ~/Documents/sample_data.csv | sort | uniq -c | sort -nr

As above, -F '\t' sets the tab separator and FNR > 1 skips the header row.

Summary

Of course there are many more tricks for "getting to know" your data, and at some point you will stop using bash to work on it (especially for maths and calculations, which are unpleasant or outright impossible in a shell). But for some initial exploration without long loading times, bash can help you get started.

Bonus

For quickly navigating not only your data but also your terminal, these shortcuts are very useful and great time savers:

  • ctrl+a: Jump to beginning of line.
  • ctrl+e: Jump to end of line.
  • ctrl+w: Delete one word to the left.
  • ctrl+u: Delete complete line to the left.