
Jamie Skipworth


Technology Generalist | Software & Data


Data Bashing

I’ve spent a long time tinkering with data - moving it around, mangling it and generally trying to make it a bit more useful. There are plenty of fancy and expensive tools out there, like IBM InfoSphere, Informatica, Talend and Ab Initio, that you can use to do this sort of stuff. So far, so boring.

But did you know you can do some of this fancy data bashing & query stuff on the Linux command line? “No way!”, I hear you say (or more likely, “no shit!”). Yep - you can filter, sort, join, aggregate and otherwise dick-around with data in something like bash as much as you like. All you need is an intermediate understanding of a handful of GNU command line tools.

So guess what I’m about to do? Yep, I’m going to demo a few ways you can mangle data on the command line. These tricks are great ways to explore raw data, clean it up and generally get familiar with its structure before it gets loaded into whatever data store your platform gods have chosen for you.

The Data

I’m using tables from the Microsoft AdventureWorks database (yawn - I know). I grabbed the data from here. I’m only using 4 files from it to keep things simple:

  • Customer.csv - A list of customer IDs
  • Person.csv - A list of people
  • Product.csv - A list of products
  • SalesOrderHeader.csv - Summary sales data for customers

Some of this data has mixed encodings and inconsistent delimiters, which is really annoying but allows me to use some cool tricks to make it cleaner. I’ve also manually added header records to make them easier to read.

You can get the files I’m using, with headers, from GitHub here.

The Tools

These are the commands I’ll be using (mostly):

  • awk - pattern-directed scanning and processing language
  • uniq - report or filter out repeated lines in a file
  • sort - sorts lines of text files
  • head - extracts n lines from the top of a file or stream
  • wc - counts characters, lines or words
  • join - relational database operator
  • nl - line numbering filter
  • iconv - character set conversion
  • file - determine file type
  • cat - list contents of files
  • sed - stream editor

Get Bashing

Normally when I get lumped with strange new files, I want to know a few things:

  • What’s the file encoding?
  • What are the field and record delimiters?
  • How many fields are in each record?
  • How many records are there?

To answer these questions I’ll poke and prod these files in various ways (don’t worry, they love it).

Encoding

Files can use various different encodings to represent the data within them. The encoding used depends on things like what language is being represented or what system it comes from. Unicode encodings such as UTF-8, UTF-16 and UTF-32 can also begin with an optional byte-order-mark (BOM) sequence that tells you exactly which encoding (and byte order) is being used.
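If you’re curious you can peek at those BOM bytes directly. Here’s a quick sketch using xxd (assuming you have it installed - any hex dumper will do):

$ head -c 4 Person.csv | xxd

A little-endian UTF-16 BOM shows up as ff fe, a big-endian one as fe ff, and a UTF-8 BOM (which is optional) as ef bb bf.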

To determine the encoding used by a particular file, you can use the file command. This is what I get if I execute it on all my files:

$ file *
Customer.csv: ASCII text, with CRLF line terminators
Product.csv:  ASCII text, with CRLF line terminators
Person.csv:   Little-endian UTF-16 Unicode English text, with very long lines, with CRLF line terminators
SalesOrderHeader.csv:	ASCII text, with CRLF line terminators

Easy! Straight away we can see that everything is ASCII except for Person.csv which is UTF-16. We’ll need to convert this to UTF-8 (a superset of ASCII) so we can use the rest of our tools on it.

For converting files between different encodings, we can use iconv.

$ iconv -f utf-16 -t utf-8 Person.csv > Person_utf8.csv

If we run file on the output of this command we can see it’s now UTF-8.

$ file Person_utf8.csv
Person_utf8.csv: UTF-8 Unicode English text, with very long lines, with CRLF line terminators

Delimiters

I want all these files to be tab-delimited, so they’re all more-or-less the same format. I cat the file and pass the output to head -2 | tail -1 which essentially gives me the first non-header row of the file (the last of the first 2 rows).

$ cat Customer.csv | head -2 | tail -1
1        934    1    AW00000001    3F5AE95E-B87D-4AED-95B4-C3797AFCB74F    2004-10-13 11:15:07.263

OK, so I can see white-space between fields, but is it tabs, spaces or something else? If we give cat the -A option to display all non-printables as control characters we can see what delimiters are really being used. NOTE - for a list of what these control characters mean, Wikipedia has a reference.

$ cat -A Customer.csv | head -2 | tail -1
1^I^I934^I1^IAW00000001^I3F5AE95E-B87D-4AED-95B4-C3797AFCB74F^I2004-10-13 11:15:07.263^M$

So we can see the fields are tab-delimited (^I) and records are terminated with a carriage-return + line-feed pair - the CRLF reported by file above. cat -A shows the carriage return as ^M and marks the line-feed at the end of each line with $.
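As an aside, if those Windows-style carriage returns ever get in the way, you could strip them out with something like this (Customer_unix.csv is just a made-up output name - I’m leaving the CRLFs alone for the rest of this post):

$ tr -d '\r' < Customer.csv > Customer_unix.csv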

Let’s do SalesOrderHeader.csv and Product.csv. Note - I use fold below to wrap long lines purely for aesthetic reasons.

$ cat -A SalesOrderHeader.csv | head -2 | tail -1 | fold -w 80
43659^I8^I2011-05-31 00:00:00^I2011-06-12 00:00:00^I2011-06-07 00:00:00^I5^I0^IS
O43659^IPO522145787^I10-4020-000676^I29825^I279^I5^I985^I985^I5^I16281^I105041Vi
84182^I^I20565.6206^I1971.5149^I616.0984^I23153.2339^I^I{79B65321-39CA-4115-9CBA
-8FE0903E12E6}^I2011-06-07 00:00:00^M$

$ cat -A Product.csv | head -2 | tail -1 | fold -w 80
1^IAdjustable Race^IAR-5381^I0^I0^I^I1000^I750^I.0000^I.0000^I^I^I^I^I0^I^I^I^I^
I^I1998-06-01 00:00:00.000^I^I^I694215B7-08F7-4C0D-ACB1-D734BA44C0C8^I2004-03-11
 10:01:36.827^M$

Nice, they’re already tab-delimited. Nothing to do here.

Let’s do Person_utf8.csv.

$ cat -A Person_utf8.csv | head -2 | tail -1 | fold -w 80
2+|EM+|0+|+|Terri+|Lee+|Duffy+|+|1+|+|<IndividualSurvey xmlns="http://schemas.mi
crosoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYT
D>0</TotalPurchaseYTD></IndividualSurvey>+|D8763459-8AA8-47CC-AFF7-C9079AF79033+
|2008-01-24 00:00:00&|^M$

Hmmm, this one is a bit weird. It looks like we have field delimiters of +| and a record terminator of &|. No drama, we can just use sed to transform it.

$ sed 's/+|/\t/g;s/&|//g' Person_utf8.csv > Person_utf8_tab.csv

This command uses regular expressions to replace every occurrence of +| with a tab, and removes the &| record terminator (because records are CRLF terminated anyway). This is essentially 2 sed commands in one:

  • s/+|/\t/g - (g)lobal (s)earch & replace of +| to tabs
  • s/&|//g - (g)lobal (s)earch & replace of &| with nothing

You can head the output file to check it worked OK.
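For example, something like this (output not shown) should now display ^I between the fields instead of +|:

$ head -2 Person_utf8_tab.csv | cat -A | fold -w 80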

Fields

One final sanity check I normally do is to make sure that each file has a consistent number of fields, which is easily done in awk.

$ awk -F'\t' '{print FILENAME,NF}' Customer.csv Person_utf8_tab.csv SalesOrderHeader.csv | uniq
Customer.csv 7
Person_utf8_tab.csv 13
SalesOrderHeader.csv 26

The variables FILENAME and NF are built-in to awk. The NF one stores the number of fields in the current record, based on the delimiter specified by the -F option. I’m piping the output into uniq to get a unique field count for each file name. If you get more than one row per file then that file contains records with an inconsistent number of fields, and you’ll have to dig around to fix it. Dodgy output might look like this:

Customer.csv 7
Person_utf8_tab.csv 13
SalesOrderHeader.csv 26
SalesOrderHeader.csv 23

This would indicate that SalesOrderHeader.csv has records that are 26 fields wide and 23 fields wide.
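To hunt down the offending records you could use an awk one-liner along these lines, swapping in whatever field count the file is supposed to have - it prints the line number and width of every row that doesn’t match:

$ awk -F'\t' 'NF != 26 { print NR": "NF" fields" }' SalesOrderHeader.csv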

Simple Queries

Right, so I have the files nice and clean. Now I can start asking them questions!

How many customers are there?

Very boring…

$ tail -n +2 Customer.csv | wc -l
       19820

I’m assuming here Customer.csv contains only unique records. If you’re worried there might be duplicate CustomerIDs you could run the following:

$ tail -n +2 Customer.csv | cut -f1 | wc -l
   19820
$ tail -n +2 Customer.csv | cut -f1 | sort | uniq -d | wc -l
       0

These two commands will give you the following:

  • The total number of records in the file
  • The number of duplicated CustomerIDs (sort | uniq -d - the sort matters because uniq only spots adjacent duplicates).

In this case I have 19,820 records and no duplicates.
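If you want a second opinion, you can compare that total against a de-duplicated count - if the two numbers match, every CustomerID is unique:

$ tail -n +2 Customer.csv | cut -f1 | sort -u | wc -l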

How many products are made on-site?

The field “MakeFlag” indicates which products are made on-site. I need to find out which field number this is, so I’ll interrogate the header row.

$ head -1 Product.csv | tr '\t' '\n' | nl
     1  ProductID
     2  ProductName
     3  ProductNumber
     4  MakeFlag
     5  FinishedGoodsFlag
     6  Color
     7  SafetyStockLevel
     8  ReorderPoint
     9  StandardCost
    10  ListPrice
    11  Size
    12  SizeUnitMeasureCode
    13  WeightUnitMeasureCode
    14  Weight
    15  DaysToManufacture
    16  ProductClass
    17  Line
    18  Style
    19  ProdSubcatID
    20  ProdModelID
    21  SellStartDt
    22  SellEndDt
    23  DisconDt
    24  rowguid
    25  ModifiedDate

The tr command translates tabs into new-lines, and nl gives us pretty line numbers. The “MakeFlag” field is field number 4.
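I’ll be using this header-numbering trick again later, so if you like you could wrap it up in a throwaway shell function (the name fields is just something I’ve made up):

$ fields() { head -1 "$1" | tr '\t' '\n' | nl; }
$ fields Product.csv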

$ awk -F'\t' 'BEGIN{ cntr=0 } { if( $4 == "1" ){ cntr+=1 } } END{ print cntr }' Product.csv
239

awk executes the main code block once for each line of the input, but you can also run code in BEGIN{} and END{} blocks before the first record and after the last one, respectively. Here awk initialises the cntr variable to 0, increments it whenever it sees that field 4 is a 1, and prints the value of cntr once it has finished processing the records.
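As an aside, awk treats an unset variable as 0 (or an empty string) the first time it’s used, so the BEGIN block isn’t strictly necessary. A more compact version of the same count might look like this - the +0 just forces a 0 to print if nothing matched:

$ awk -F'\t' '$4 == "1" { cntr++ } END{ print cntr+0 }' Product.csv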

So here I have 239 products manufactured on-site.

More Advanced Queries

These queries are a bit more silly.

If I want to, I can use relational joins between files. Joins require use of the join command (gasp!), which expects the input files to be sorted on their join keys. So let’s sort them now.

Oh, quick note: because my files have header rows, I’ll have to discard them before the sort using tail -n +2 (i.e. start output at line 2) or they’ll end up in the data somewhere. Also, I’m using a bash-only way of defining the tab character: -t$'\t'. This bash syntax is explained here.
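If you want to convince yourself that $'\t' really does expand to a tab, you can bounce it through cat -A:

$ printf '%s\n' $'\t' | cat -A
^I$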

I know Customer.csv and Person_utf8_tab.csv files have their CustomerID key as the first column, so I sort on that.

$ tail -n +2 Customer.csv        | sort -t$'\t' -k1,1  > Customer_sorted.csv
$ tail -n +2 Person_utf8_tab.csv | sort -t$'\t' -k1,1  > Person_utf8_tab_sorted.csv

I don’t know which field number in SalesOrderHeader.csv is CustomerID, so I’ll quickly find out:

$ head -1 SalesOrderHeader.csv | tr '\t' '\n' | nl | grep Customer
    11    CustomerID

OK cool, it’s field number 11.

$ tail -n +2 SalesOrderHeader.csv | sort -t$'\t' -k11,11  > SalesOrderHeader_sorted.csv

Done, sorted!

What are the customers’ names?

The Customer_sorted.csv file contains customer IDs but no names, so we need to join it to the Person_utf8_tab_sorted.csv file to get them. Because the files are formatted nicely and already sorted, I can just issue the following join command.

$ join -t$'\t' -1 1 -2 1 -o 0,2.5,2.7 Customer_sorted.csv Person_utf8_tab_sorted.csv > Customer_names.csv

The options to join are a little funky. This basically says join the 2 specified files on their first fields (-1 1 -2 1), and output the join key plus the name fields from the Person file (-o 0,2.5,2.7).
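If the -o syntax looks cryptic, here’s a toy example using bash process substitution and some made-up rows - 0 means the join key, and 2.2 means “field 2 of file 2”:

$ join -t$'\t' -1 1 -2 1 -o 0,2.2 <(printf '1\tAAA\n2\tBBB\n') <(printf '1\txxx\n2\tyyy\n')
1    xxx
2    yyy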

Here’s a snippet of the output:

1 Ken Sánchez
10 Michael Raheem
100 Lolan Song
101 Houman Pournasseh
102 Zheng Mu
103 Ebru Ersan
104 Mary Baker
105 Kevin Homer
106 John Kane
107 Christopher Hill

Who are the top 10 customers?

Let’s get crazy and aggregate total sales by customer ID.

First I’ll aggregate the sales summary data and re-sort it:

$ awk -F'\t' '{ cust_sum[$11]+=$23 } END{ for( cust in cust_sum ) { printf( "%s\t%s\n", cust, cust_sum[cust] ) } }' SalesOrderHeader_sorted.csv | sort -t$'\t' -k1,1 > SalesOrderHeader_sorted_agg.csv

This awk command essentially creates a {key: value} map of CustomerIDs and their total sales. For each record it adds a new CustomerID to the map (if it isn’t already there) and increments its sales value, and it only outputs the results once processing has finished, in the END block. Note that awk’s for( cust in cust_sum ) loop visits the keys in no particular order, so the output needs to be re-sorted before it can be joined.
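As an aside, if you’re definitely on GNU awk you could skip the external sort by telling gawk to walk the array in sorted key order via PROCINFO["sorted_in"]. I’m sticking with plain sort here to stay portable, but a rough sketch would be:

$ gawk -F'\t' '{ cust_sum[$11]+=$23 } END{ PROCINFO["sorted_in"] = "@ind_str_asc"; for( cust in cust_sum ) printf( "%s\t%s\n", cust, cust_sum[cust] ) }' SalesOrderHeader_sorted.csv > SalesOrderHeader_sorted_agg.csv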

The easy way to join the two results is by using the Customer_names.csv file I created earlier:

$ join -t$'\t' -1 1 -2 1 -o 0,1.2,1.3,2.2 Customer_names.csv SalesOrderHeader_sorted_agg.csv > Customer_sales.csv

Or if you’re feeling adventurous you could pipe the two joins together:

$ join -t$'\t' -1 1 -2 1 -o 0,2.5,2.7 Customer_sorted.csv Person_utf8_tab_sorted.csv | join -t$'\t' -1 1 -2 1 -o 0,1.2,1.3,2.2 - SalesOrderHeader_sorted_agg.csv > Customer_sales.csv

To get the top-10 customers it’s as simple as performing a numeric, reversed (-nr) sort on the sales total (-k4) and heading on the output:

$ sort -t$'\t' -k4 -nr Customer_sales.csv | head -10

12301 Hannah Clark 14691.4
12132 Taylor Jones 14690.2
12308 Hannah Lee 14662.5
12131 Kelli Chander 14658.9
12300 Colleen She 14633.2
12321 Madison Miller 14603.3
12124 Taylor Smith 14581.2
12307 Colleen Tang 14556.4
12296 Hannah Garcia 14546.9
11433 Jenny Chen 14265.2

Big spenders. These guys get on the Christmas card list. Nice.

Similarly, to get the bottom-10 customers I can use tail (or change the sort direction):

$ sort -t$'\t' -k4 -nr Customer_sales.csv | tail -10

20178 Phillip Gonzalez 4.409
20146 Ronald Sanchez 4.409
20142 Ronald Lopez 4.409
20141 Ronald Fernandez 4.409
19516 Emma Peterson 4.409
19481 Emma Morris 4.409
17820 Dana Navarro 4.409
17819 Dana Romero 4.409
17818 Candice Ma 4.409
17649 Sandra Huang 4.409

This bunch probably aren’t worth the postage!

So that’s how you effectively write the following SQL query using GNU tools in bash.

SELECT
     c.CustomerID
  ,  p.FirstName
  ,  p.LastName
  ,  SUM( h.TotalSales ) AS TotalSales
FROM
  CUSTOMER c
INNER JOIN
  PERSON p
  ON( c.CustomerID = p.PersonID )
INNER JOIN
  SalesOrderHeader h
  ON( c.CustomerID = h.CustomerID )
GROUP BY
     c.CustomerID
  ,  p.FirstName
  ,  p.LastName
ORDER BY
  SUM( h.TotalSales ) DESC 
  -- ORDER BY DESC for highest value customers first, 
  -- ORDER BY ASC for lowest-value customers first.

To be honest, SQL is probably way easier!