Profile Photo

Jamie Skipworth

Technology Generalist | Software & Data


Not many people have heard of Clickhouse, myself included until very recently. It has a weird name that doesn’t really give you any sense of what it is or what it does, but that’s most application names these days (obligatory link to ‘Pokemon or Big Data?’).

So what is it and where did it come from? It’s an open-source, column-based OLAP RDBMS, originally created by Yandex (the Russian equivalent of Google) to perform web analytics, hence the ‘click’ part of Clickhouse. It can deal with absolutely monstrous data volumes; At the time of writing Yandex use Clickhouse to store around 13 trillion records, with 20 billion daily events. That’s mad.

I’m not going to regurgitate the detailed history, but you can find it here if you fancy a read. They should’ve called it Clickmonster, Click-cluthulu (Clickthulu?), or something.

Play Time

I’m not going to attempt to create any super-fancy benchmarks. Other people have already done this, and have done a much better job than I would have. All I’m going to do here is get some data and throw it into Clickhouse so I can run a few basic queries on it just to get a feel for it. I’ll also do a quick comparison of these same queries in Postgres, just to illustrate the performance difference between columnar OLAP RDBMSs and row-based OLTP ones.

I’m not going to use any fancy cloud platforms to do this. I know a lot of them have tools and products designed to do all of this stuff, either by point-and-click or writing massive YAML files. It’s not sexy, but it’s not always fun, nor free. I’ll be using GNU command line tools again.


I’ve decided to use S.M.A.R.T. data from the hard drives in the BackBlaze datacentre. I’ve downloaded the data from 2014-2018, and unzipped them so that each year has its own directory.

Now, the issue with the data is that each file contains a different number of columns, so one if the first things I’m going to do is unify the file format before loading. Call me weird, but I find this bit fun.

Fixing the Data

Initially when I started this I didnt know that each year could have different column layout. If I wanted to load any of this data into a database, I needed to ensure that all the data was in the same format.

I wanted to create a consistant file format for all the data, meaning I needed to create a new format that contained all the columns from all the files. So the first thing I did was to extract all the header rows from the data files.

$ for year in {2014..2018}; do head -q -n1 ${year}/* | sort -u > header_${year}; done
$ find . -name 'header*'

The next step was to create a new file containing all the columns from all the headers - a single unified header format, or an UBER HEADER! To start I merged all the headers into one file, with each column name on a new line.

$ for year in {2014..2018}; do \
     cat header_${year} | tr -t ',' '\n' > header_${year}_rows; \
$ head ./header_2017_rows

There you go, column names on new lines.

To create the UBER HEADER I needed to concatenate each of these individual header files together, deduplicate them, convert them to a tab-separated TSV file, and output to a new file named uber_header_flat.

$ cat header_201[45678]_rows | sort -u | sort -n | tr  -t '\n' '\t' | tr -d '\r' > uber_header_flat
$ cat -A uber_header_flat | head -c 60
capacity_bytes  date    failure model   serial_number   smart_10_nor%

This gives me a TSV file of all the columns available in all the files. This is my UBER HEADER, but what’s next? The data had to match this new format before it could be loaded anywhere.

I needed to modify all the data files I had so that they had the same unified format. Sound like a pain in the arse? Correct! But, that’s where awk comes in handy. My data was organised into directories like this:

$  tree -d
├── 2014
├── 2015
├── 2016
├── 2017
└── 2018

Each directory contains a years’ worth of data - hundreds of files, and I wasn’t going to rework them manually so I created an awk script to reformat them (named transformer.awk).

The script expects the following as inputs:

  • An optional tab-separated list of source input columns, otherwise it reads the header row of the input file.
  • A tab-separated list of destination output columns.
  • The file to operate on

It maps input columns to output columns, outputting a blank field if the mapping fails for any of the input columns. It assumes all input files have header rows.

Here’s an example of what this beast does. Given this input data:


If you run the following on that data:

awk -F, -f transformer.awk -v output_cols=lastname,firstname,dob my_test_file.csv

The resulting output will look like this, allowing you to reformat a file any way you like:


All of the data files needed to be transformed into a single unified format as specified by the UBER HEADER. I had the script but there’s a lot of data, so I parallelised the work using xargs.

$ mkdir all_data/
$ find 201[45678] -name '*.csv' | \
  xargs -t -P8  -I _FILE_ bash -c \
  'awk -F, -f transformer.awk \
  -v output_cols="$(cat uber_header_flat)" \
  -v RS='"'\r\n'"' \
  -v OFS=\\t \
  -v null_value="\\\N" _FILE_ > all_data/$(basename _FILE_)'

That will find all the CSV data files I have and transform them into the desired column layout. Using xargs I’m running 8 of these processes in parallel to speed it up a bit. In the end I’ll end up with 1278 files in the all_data directory.

Running Clickhouse & Loading Data

So far this blog post doesn’t feel much like anything to do with Clickhouse, all I’ve been doing is screwing around with data! I know, but we’re here now.

Time to spin-up ClickHouse and create a table matching the UBER HEADER format. I used the ClickHouse docker image, and followed the instructions here.

$ docker run -d --name clickhouse_db_test --ulimit nofile=262144:262144 --volume /Users/jamieskipworth/clickhouse_db_test:/var/lib/clickhouse yandex/clickhouse-server

$ docker run -it --rm --link clickhouse_db_test:clickhouse-server yandex/clickhouse-client --host clickhouse-server --query="CREATE DATABASE so;"

That starts the ClickHouse server and creates a new database. Next, I created a new table that matches the UBER HEADER format.

I can execute this SQL (stored in create_table.sql) by running the following:

$ docker run -it --rm --link clickhouse_db_test:clickhouse-server yandex/clickhouse-client --host clickhouse-server --query="$(cat create_table.sql)"

Now, to load the data. Remember from the steps above, I should have a new directory named all_data containing the reformatted files. To load these I can use tail with the -n +2 option to skip header rows in each of the files.

$ tail -q -n +2 all_data/* | \
   docker run -i --rm --link clickhouse_db_test:clickhouse-server yandex/clickhouse-client --host clickhouse-server --query="INSERT INTO so.data_all FORMAT TabSeparated"

Let’s just double-check all the data got loaded.

$ docker run -it --rm \
  --link clickhouse_db_test:clickhouse-server yandex/clickhouse-client \
  --host clickhouse-server
ClickHouse client version (official build).
Connecting to clickhouse-server:9000 as user default.
Connected to ClickHouse server version 19.14.3 revision 54425.

09e25071240a :) select count(1) from so.data_all;

SELECT count(1)
FROM so.data_all

│ 77280092 │

1 rows in set. Elapsed: 0.364 sec. Processed 77.28 million rows, 154.56 MB (212.11 million rows/s., 424.21 MB/s.)

09e25071240a :)

Looks like all 77m rows have loaded. Let’s run some SQL queries! By the way, did you notice how fast that count ran?

Querying The Data

OK, so I’ve put together some really simple queries that I’ll run both on Clickhouse and on Postgres just to give you an idea of the performance difference. I feel a bit guilty because really it’s sort of an unfair comparison, but what the heck.

Query: At what temperature did most drives fail?

  , sum(failure) 
where smart_194_raw is not null 
group by smart_194_raw 
order by smart_194_raw asc;


52 rows in set. Elapsed: 1.140 sec. Processed 77.28 million rows, 1.39 GB (67.80 million rows/s., 1.22 GB/s.)

0.00s user 0.01s system 0% cpu 1:23.28 total

Query: What are the most popular drive models by quarter?

  , toStartOfQuarter(date) as qtr
  , count(1) as num 
group by model, qtr 
order by qtr


911 rows in set. Elapsed: 1.443 sec. Processed 77.28 million rows, 2.01 GB (53.54 million rows/s., 1.39 GB/s.)

0.01s user 0.01s system 0% cpu 1:50.28 total

Query: What’s the average drive up-time before failure?

    Avg((smart_9_raw / 24)) as uptime
where smart_9_raw is not null 
and failure > 0
order by uptime asc;


1 rows in set. Elapsed: 0.653 sec. Processed 77.28 million rows, 835.12 MB (118.37 million rows/s., 1.28 GB/s.)

0.00s user 0.01s system 0% cpu 1:25.28 total

Query: What drive model has the longest up-time on average?

    Avg((smart_9_raw / 24)) as uptime
  , model
where smart_9_raw is not null 
and failure > 0
group by model
order by uptime asc;


76 rows in set. Elapsed: 0.845 sec. Processed 77.28 million rows, 1.21 GB (91.41 million rows/s., 1.43 GB/s.)

0.00s user 0.00s system 0% cpu 1:23.66 total

Look at those stats. Straight away it’s clear Clickhouse performance is way ahead of postgres, by several orders of magnitude.

Performance chart

In postgres these queries are taking an average of about 1m 30s, but in Clickhouse they take only a second or less. That’s BigQuery-levels of performance on your laptop, without opening your wallet. There’s not much not to like there.