
Jamie Skipworth


Technology Generalist | Software & Data


Fun with concurrency and Go.

You’d be surprised how deep a hole you can dig with a fork(). Sounds like a weird thing to say, but if you’ve ever written concurrent code then you can understand some of the headaches/nightmares it can cause. You start off with a simple enough idea, but then you either need better performance or it needs to handle multiple things at once.

So you reach for your nearest threading or multiprocessing library (or, if you’re really hardcore, maybe just the fork() call). And just like that you’ve opened up a whole new world of potential angst. Some questions you might scream quietly to yourself include:

Do I use threads or processes? How do I manage state between them? How do I communicate with them? What’s the deal with shared memory & variables? What happens if a thread/process hangs or dies? What will happen if data is corrupted? What time is it? How long can my dog go without feeding?

Enter Go

With “traditional” concurrent programming you would normally answer these questions by dumping a sack of locks, mutexes, semaphores or whatever into your code so you can safely (ha!) change shared data. This should prevent other wayward threads or processes breaking down the door and replacing all your furniture with turnips, or something else just as unpredictable.

If you’re a Node or JS developer (or have a mild neurological disorder) you might instead write hundreds of callbacks and code that jumps around like a rabbit receiving shock therapy. I suppose that’s not really fair; Node is asynchronous/event-driven, a slightly different beast, but I like to take a swipe at it whenever I can. Anyway, whatever.

Getting back on track - enter Go, the systems programming language that (allegedly) makes all this nastiness go away. I’ve been reading articles about Go for years and they nearly always mention how wonderful and easy it makes concurrency. It also has some pretty decent ‘oomph’ behind it, having been co-created (at Google) by Ken Thompson, who’s one of the fathers of Unix, B (the direct predecessor of C), UTF-8 and other groundbreaking tech. Despite this, I’m a cynical git and I never take anyone’s word for anything, so I decided to teach myself some of the basics to see if all of this was just marketing bull or genuine praise. Good news: I like it!

Goroutines

In Go you don’t have to concern yourself with the low-level drudgery of threads or processes (well, you can if you really want to). All you need to do is write functions that do stuff, and then invoke them using the go statement like this: go doStuff(). Boom - doStuff() is now running concurrently.

Go calls these “goroutines”, which are a bit like lightweight threads managed by the runtime. These are then multiplexed onto a smaller number of “real” threads. More deets here, but essentially they’re threads, but better. Have a look at the example below:
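The original listing isn’t reproduced here, so what follows is a minimal sketch of the kind of program being described rather than the exact code. The body of doStuff() and both sleep durations are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// doStuff pretends to work for a moment, then reports back.
// (What it actually does is made up for this sketch.)
func doStuff() {
	time.Sleep(10 * time.Millisecond) // simulate a little work
	fmt.Println("doStuff: finished")
}

func main() {
	fmt.Println("main: calling doStuff()")
	go doStuff() // starts doStuff concurrently; main does not wait for it

	// Uncomment to give doStuff() time to finish before main() returns.
	// time.Sleep(100 * time.Millisecond)
}
```

Sleeping is only a quick way to see the effect. It guesses how long the goroutine needs, which is why the demos further down wait with a WaitGroup or a channel instead.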

If you run the code above as-is, chances are it’ll print main: calling doStuff() and then exit. That’s because the go statement doesn’t wait for the goroutine to finish: main() returns before doStuff() does, and when main() returns the whole program is terminated. Uncomment the time.Sleep() line to fix it.

Channels

Go solves the hassle of managing communications between goroutines by using channels. Channels are baked right into the language and can be thought of as conduits into which you can chuck messages to other goroutines.

All you need to do is create one, then throw something down it (or expect something to be thrown at you). They’re incredibly easy to use, which bodes well for me and my small brain. So how do you use a channel?
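The original listing is missing here too, so this is a minimal sketch under my own assumptions: the greet function, the messages channel and the message text are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// greet sends a single message down the channel it is given.
// The chan<- type means this function may only send on it.
func greet(messages chan<- string) {
	messages <- "hello from a goroutine"
}

func main() {
	messages := make(chan string) // create an unbuffered channel of strings

	go greet(messages) // throw something down it from another goroutine

	msg := <-messages // blocks until greet() has sent something
	fmt.Println(msg)
}
```

Notice there’s no time.Sleep() this time. The receive in main() does the waiting, so the program can’t exit before the goroutine has delivered its message.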

Obviously this is a very simple example, but it demonstrates how easy it is to use channels. A few important points about channels:

  • Data flows in the direction of the arrow operator, e.g. receiver <- sender
  • By default (that is, on an unbuffered channel) sends and receives block until the other end is ready. This means, for example, that a goroutine receiving from a channel will block until another goroutine sends something.
  • Senders should close() channels once they’re done with them. This usually only matters when a receiver needs to know that nothing more is coming, for example to end a range loop over the channel.
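The three points above can be seen together in one small sketch. The produce and sum functions are hypothetical names invented for this example, not anything from the demo repository.

```go
package main

import "fmt"

// produce sends the integers 1..n and then closes the channel
// to signal that nothing more is coming.
func produce(n int, out chan<- int) {
	for i := 1; i <= n; i++ {
		out <- i // blocks until the receiver is ready to take the value
	}
	close(out) // by convention the sender closes, not the receiver
}

// sum receives values until the channel is closed and returns their total.
func sum(in <-chan int) int {
	total := 0
	for v := range in { // the loop ends once the channel is closed and drained
		total += v
	}
	return total
}

func main() {
	numbers := make(chan int)
	go produce(5, numbers)
	fmt.Println("sum:", sum(numbers)) // prints "sum: 15"
}
```

If produce() forgot the close(), the range loop in sum() would wait forever and the runtime would report a deadlock.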

Some Demos

Ok, that’s enough copying from the Go documentation. To help show how all this makes a difference, I’ve written 4 similar but different word-counting programs in Go.

The first one starts with no concurrency and no channels - dead simple. The ones after that implement channels in different ways and show common techniques used to manage them. It’s worth noting that performance doesn’t necessarily improve when using channels as there is some overhead when using them.

I’ve put the word counting logic into demo packages numbered 1-4; the code below only shows main and the functions it calls. Examine these packages to see more detail about how channels are used. As always, the code might be wrong and/or ugly, but cut me some slack. You can find the repository HERE.

I’ll be counting the words from posts and comments extracted from StackOverflow’s bitcoin forum. I’ve stripped away all the XML using html-xml-utils, if you want to do something similar.

Word Count v1 - Nothing Fancy

To give you a feel for Go, the code below performs a simple word count of the given text files, one after another. Nothing fancy, just plain old function calls and returns.
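The repository listing isn’t reproduced here, so this is a rough sketch of a serial word count, not the actual wc_1.go. The name wordCounter is borrowed from the log output; countWords and the decision to read each file fully into memory are my assumptions, and os.ReadFile needs Go 1.16 or later.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// countWords returns the number of whitespace-separated words in text.
func countWords(text string) int {
	return len(strings.Fields(text))
}

// wordCounter reads a whole file into memory and counts its words.
func wordCounter(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return countWords(string(data)), nil
}

func main() {
	start := time.Now()
	for _, path := range os.Args[1:] { // one file after another, in serial
		n, err := wordCounter(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%d\twords in %s\n", n, path)
	}
	fmt.Println("Total elapsed time", time.Since(start))
}
```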

You can find the file on GitHub here.

The output looks a little like this:

$ go run wc_1.go data/comments.txt data/posts.txt
20180518 14:58:53.041 - wordCounter started for data/comments.txt
20180518 14:58:53.639 - wordCounter finished in 597.724568ms for data/comments.txt
20180518 14:58:53.639 - 1610620	words in data/comments.txt
20180518 14:58:53.639 - wordCounter started for data/posts.txt
20180518 14:58:55.035 - wordCounter finished in 1.396105383s for data/posts.txt
20180518 14:58:55.035 - 4264001	words in data/posts.txt
20180518 14:58:55.035 - Total elapsed time 1.993948022s

It takes about 2 seconds to count 34 MB of data containing about 5.9 million words.

Word Count v2 - Goroutines

In this version of the same code I run the wordCount function as goroutines (lightweight threads) using the go statement. The goroutines run concurrently, each one reading its file and counting the words before printing the results out.

I’m using a WaitGroup here which lets me know when my goroutines are finished. When I start a goroutine I Add() to the WaitGroup. Once all the goroutines are launched, I call Wait() which blocks until each goroutine calls Done().
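The Add()/Done()/Wait() dance might look something like this. It’s a sketch under my own assumptions rather than the real wc_2.go: countFiles and countWords are hypothetical names, and for simplicity the results are collected into a slice and printed by main() instead of by each goroutine.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"sync"
	"time"
)

// countWords returns the number of whitespace-separated words in text.
func countWords(text string) int {
	return len(strings.Fields(text))
}

// countFiles counts every file in its own goroutine and returns the counts
// in the same order as paths. A file that cannot be read keeps a count of 0.
func countFiles(paths []string) []int {
	counts := make([]int, len(paths))
	var wg sync.WaitGroup

	for i, path := range paths {
		wg.Add(1) // one more goroutine to wait for
		go func(i int, path string) {
			defer wg.Done() // signal completion however we return
			data, err := os.ReadFile(path)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			counts[i] = countWords(string(data)) // each goroutine writes only its own slot
		}(i, path)
	}

	wg.Wait() // blocks until every goroutine has called Done()
	return counts
}

func main() {
	start := time.Now()
	paths := os.Args[1:]
	for i, n := range countFiles(paths) {
		fmt.Printf("%d\twords in %s\n", n, paths[i])
	}
	fmt.Println("Total elapsed time", time.Since(start))
}
```

Because every goroutine writes to a different element of counts, no lock is needed here. Sharing a single counter between them would be a different story.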

You can find the file on GitHub here.

Here’s the output:

$ go run wc_2.go data/comments.txt data/posts.txt
20180518 14:58:58.699 - wordReader started for data/posts.txt
20180518 14:58:58.699 - wordReader started for data/comments.txt
20180518 14:58:59.303 - 1610620	words in data/comments.txt (603.74191ms)
20180518 14:59:00.131 - 4264001	words in data/posts.txt (1.432480769s)
20180518 14:59:00.131 - Total elapsed time 1.432572708s

Checking the output we can see it’s about 30% faster than running in serial. The total elapsed time is now simply the time taken by the slowest file, posts.txt.

Word Count v3 - A Channel

A wild channel appears! In this version I’m using a single channel named wordCountResult into which the results are sent. The WordCounter goroutines run concurrently, and once done they send their results into the wordCountResult channel. We receive on this channel to get the results and print them out. I use a WaitGroup in order to close the results channel once the goroutines have finished.
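Here’s a sketch of that pattern, with the usual caveat that it isn’t the real wc_3.go. The channel name wordCountResult comes from the description above; the result type, the countAll function and the exact shape of wordCounter are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"sync"
)

// result pairs a file with its word count (a hypothetical type for this sketch).
type result struct {
	path  string
	words int
	err   error
}

// wordCounter counts the words in one file and sends the outcome down results.
func wordCounter(path string, results chan<- result, wg *sync.WaitGroup) {
	defer wg.Done()
	data, err := os.ReadFile(path)
	if err != nil {
		results <- result{path: path, err: err}
		return
	}
	results <- result{path: path, words: len(strings.Fields(string(data)))}
}

// countAll starts one counter per path and returns the channel the results arrive on.
func countAll(paths []string) <-chan result {
	wordCountResult := make(chan result)
	var wg sync.WaitGroup

	for _, path := range paths {
		wg.Add(1)
		go wordCounter(path, wordCountResult, &wg)
	}

	// Close the channel once every counter has sent its result,
	// so that the receiving loop knows when to stop.
	go func() {
		wg.Wait()
		close(wordCountResult)
	}()

	return wordCountResult
}

func main() {
	for r := range countAll(os.Args[1:]) {
		if r.err != nil {
			fmt.Fprintln(os.Stderr, r.err)
			continue
		}
		fmt.Printf("%d\twords in %s\n", r.words, r.path)
	}
}
```

The Wait() has to live in its own goroutine. If main() called it directly before receiving, the counters would block on their sends, nobody would ever call Done(), and the whole thing would deadlock.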

You can find the file on GitHub here.

Output:

$ go run wc_3.go data/comments.txt data/posts.txt
20180518 14:59:03.938 - wordReader started for data/comments.txt
20180518 14:59:03.938 - wordReader started for data/posts.txt
20180518 14:59:04.544 - 1610620	words in data/comments.txt
20180518 14:59:04.544 - wordCounter finished in 606.706369ms for data/comments.txt
20180518 14:59:05.366 - wordCounter finished in 1.428154775s for data/posts.txt
20180518 14:59:05.366 - 4264001	words in data/posts.txt
20180518 14:59:05.366 - Total elapsed time 1.428251929s

It’s taking about the same amount of time as before.

Word Count v4 - Many Channels

Channels everywhere! Here the logic changes a bit. Now each file has a WordReader and a WordCounter function. The WordReader is assigned a channel into which words are sent. The WordCounter receives the words, counts them, and outputs the results. WaitGroups are used to close channels once they’re no longer needed.
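A sketch of that reader/counter split is below. It’s my own approximation, not the real wc_4.go: countStream and countText are hypothetical helpers, and since each channel here has exactly one sender I close it directly with defer rather than via a WaitGroup, which is only used to wait for all the files.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"sync"
)

// wordReader sends each word from r down the words channel, then closes
// the channel to tell the counter that nothing more is coming.
func wordReader(r io.Reader, words chan<- string) {
	defer close(words)
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		words <- scanner.Text() // one channel send per word
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}

// wordCounter counts words until the channel is closed, then reports the total.
func wordCounter(words <-chan string, total chan<- int) {
	n := 0
	for range words {
		n++
	}
	total <- n
}

// countStream wires one reader to one counter and waits for the total.
func countStream(r io.Reader) int {
	words := make(chan string)
	total := make(chan int)
	go wordReader(r, words)
	go wordCounter(words, total)
	return <-total
}

// countText is a convenience wrapper for in-memory text.
func countText(text string) int {
	return countStream(strings.NewReader(text))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println(countText("no files given so count this instead"), "words in the built-in sample")
		return
	}
	var wg sync.WaitGroup
	for _, path := range os.Args[1:] {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			f, err := os.Open(path)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			defer f.Close()
			fmt.Printf("%d\twords in %s\n", countStream(f), path)
		}(path)
	}
	wg.Wait() // wait for every file's pipeline to finish
}
```

Every word makes its own trip through an unbuffered channel in this design, which is worth keeping in mind when looking at the timings.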

You can find the file on GitHub here.

Output:

$ go run wc_4.go data/comments.txt data/posts.txt
20180518 14:59:08.791 - wordReader started for data/comments.txt
20180518 14:59:08.791 - wordReader started for data/posts.txt
20180518 14:59:08.791 - wordCounter started for data/comments.txt
20180518 14:59:08.791 - wordCounter started for data/posts.txt
20180518 14:59:10.055 - wordReader finished in 1.263682107s for data/comments.txt
20180518 14:59:10.055 - wordCounter finished in 1.263708429s for data/comments.txt
20180518 14:59:10.055 - 1610620	words in data/comments.txt
20180518 14:59:11.856 - wordReader finished in 3.06511103s for data/posts.txt
20180518 14:59:11.856 - wordCounter finished in 3.065200817s for data/posts.txt
20180518 14:59:11.856 - 4264001	words in data/posts.txt
20180518 14:59:11.856 - Total elapsed time 3.065248876s

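Much slower this time: about 3 seconds, roughly twice as long as v2 and v3. The likely culprit is that every single word now makes its own trip through a channel. Like I said, you’re not always guaranteed better performance with channels; it depends how you use them.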

But! We’ve created the beginnings of a pipeline, so we can send words to different functions for different processing. If we wanted to we could create different functions to calculate word frequency, or do sentiment analysis.

The End

It’s pretty cool, and pretty easy!