You’d be surprised how deep a hole you can dig with a fork(). Sounds like a weird thing to say, but if you’ve ever written concurrent code then you can understand some of the headaches/nightmares it can cause. You start off with a simple enough idea, but then you either need better performance or it needs to handle multiple things at once.
So you reach for your nearest threading or multiprocessing library (or if you’re really hardcore maybe just the fork() call). And just like that you’ve opened up a whole new world of potential angst. Some questions you might scream quietly to yourself include:
Do I use threads or processes? How do I manage state between them? How do I communicate with them? What’s the deal with shared memory & variables? What happens if a thread/process hangs or dies? What will happen if data is corrupted? What time is it? How long can my dog go without feeding?
With “traditional” concurrent programming you would normally answer these questions by dumping a sack of locks, mutexes, semaphores or whatever into your code so you can safely (ha!) change shared data. This should prevent other wayward threads or processes breaking down the door and replacing all your furniture with turnips, or something else just as unpredictable.
If you’re a Node or JS developer (or have a mild neurological disorder) you might instead write hundreds of callbacks and code that jumps around like a rabbit receiving shock therapy. I suppose that’s not really fair; Node is asynchronous/event-driven, a slightly different beast, but I like to take a swipe at it whenever I can. Anyway, whatever.
Getting back on track - enter Go, the systems programming language that (allegedly) makes all this nastiness go away. I’ve been reading articles about Go for years and they almost always mention how wonderful and easy it makes concurrency. It also has some pretty decent ‘oomph’ behind it, having been co-created by Ken Thompson (whilst at Google), who’s one of the fathers of Unix, C, UTF-8 and other groundbreaking tech. Despite this, I’m a cynical git and I never take anyone’s word for anything, so I decided to teach myself some of the basics to see if all of this was just marketing bull or genuine praise. Good news: I like it!
In Go you don’t have to concern yourself with the low-level drudgery of threads or processes (well, you can if you really want to). All you need to do is write functions that do stuff, and then invoke them using the go statement like this: go doStuff(). Boom - doStuff() is now running concurrently.
Go calls these “goroutines”, which are a bit like light-weight threads managed by the runtime. These are then multiplexed onto a smaller number of “real” OS threads. More deets here, but essentially they’re threads, but better. Have a look at the example below:
If you run the code above as-is, the chances are that it’ll say main: calling doStuff() then exit. That’s because go doesn’t wait for the routine to finish, and main() exits before doStuff() does, and when main() exits then everything is terminated. Uncomment the time.Sleep() line to fix it.
Go solves the hassle of managing communications between goroutines by using channels. Channels are baked right into the language and can be thought of as conduits into which you can chuck messages to other goroutines.
All you need to do is create one, then throw something down it (or expect something to be thrown at you). They’re incredibly easy to use, which bodes well for me and my small brain. So how do you use a channel?
Obviously this is a very simple example, but it demonstrates how easy it is to use channels. A few important points about channels:
- Data flows in the direction of the arrow operator, eg: receiver <- sender
- Sends and receives block until the other end is ready. A receive on an empty channel, for example, will block until a sender puts something into it.
- Senders should close() channels once they’re done with them. This usually only matters when you have receivers waiting on a channel.
Ok, that’s enough copying from the Go documentation. To help show how all this makes a difference, I’ve written 4 similar but different word-counting programs in Go.
The first one starts with no concurrency and no channels - dead simple. The ones after that implement channels in different ways and show common techniques used to manage them. It’s worth noting that performance doesn’t necessarily improve when using channels as there is some overhead when using them.
I’ve put the word counting logic into demo packages numbered 1-4; the code below only shows main and the methods called. Examine these packages to see more detail about how channels are used. As always, the code might be wrong and/or ugly, but cut me some slack. You can find the repository HERE.
I’ll be counting the words from posts and comments extracted from StackOverflow’s bitcoin forum. I’ve stripped away all the XML using html-xml-utils, if you want to do something similar.
Word Count v1 - Nothing Fancy
To give you a feel for go, the code below performs a simple word count of any given text files in serial. Nothing fancy, just plain old function calls and returns.
The output looks a little like this:
$ go run wc_1.go data/comments.txt data/posts.txt
20180518 14:58:53.041 - wordCounter started for data/comments.txt
20180518 14:58:53.639 - wordCounter finished in 597.724568ms for data/comments.txt
20180518 14:58:53.639 - 1610620 words in data/comments.txt
20180518 14:58:53.639 - wordCounter started for data/posts.txt
20180518 14:58:55.035 - wordCounter finished in 1.396105383s for data/posts.txt
20180518 14:58:55.035 - 4264001 words in data/posts.txt
20180518 14:58:55.035 - Total elapsed time 1.993948022s
It takes about 2 seconds to count 34 MB of data containing about 5 million words.
Word Count v2 - Go Routines
In this version of the same code I run the wordCount function as goroutines (lightweight threads) using the go statement. The goroutines run concurrently, reading the files and aggregating the words before printing the results out.
I’m using a WaitGroup here which lets me know when my goroutines are finished. When I start a goroutine I Add() to the WaitGroup. Once all the goroutines are launched, I call Wait(), which blocks until each goroutine calls Done().
Here’s the output:
$ go run wc_2.go data/comments.txt data/posts.txt
20180518 14:58:58.699 - wordReader started for data/posts.txt
20180518 14:58:58.699 - wordReader started for data/comments.txt
20180518 14:58:59.303 - 1610620 words in data/comments.txt (603.74191ms)
20180518 14:59:00.131 - 4264001 words in data/posts.txt (1.432480769s)
20180518 14:59:00.131 - Total elapsed time 1.432572708s
Checking the output we can see it’s about 30% faster than running in serial.
Word Count v3 - A Channel
A wild channel appears! In this version I’m using a single channel named wordCountResult into which the results are sent. The WordCounter goroutines run concurrently, and once done they send their results into the wordCountResult channel. We receive on this channel to get the results and print them out. I use a WaitGroup in order to close the results channel once the goroutines have finished.
$ go run wc_3.go data/comments.txt data/posts.txt
20180518 14:59:03.938 - wordReader started for data/comments.txt
20180518 14:59:03.938 - wordReader started for data/posts.txt
20180518 14:59:04.544 - 1610620 words in data/comments.txt
20180518 14:59:04.544 - wordCounter finished in 606.706369ms for data/comments.txt
20180518 14:59:05.366 - wordCounter finished in 1.428154775s for data/posts.txt
20180518 14:59:05.366 - 4264001 words in data/posts.txt
20180518 14:59:05.366 - Total elapsed time 1.428251929s
It’s taking about the same amount of time as before.
Word Count v4 - Many Channels
Channels everywhere! Here the logic changes a bit. Now each file has a WordReader and a WordCounter function. The WordReader is assigned a channel into which words are sent. The WordCounter receives the words, counts them, and outputs the results. WaitGroups are used to close channels once they’re no longer needed.
$ go run wc_4.go data/comments.txt data/posts.txt
20180518 14:59:08.791 - wordReader started for data/comments.txt
20180518 14:59:08.791 - wordReader started for data/posts.txt
20180518 14:59:08.791 - wordCounter started for data/comments.txt
20180518 14:59:08.791 - wordCounter started for data/posts.txt
20180518 14:59:10.055 - wordReader finished in 1.263682107s for data/comments.txt
20180518 14:59:10.055 - wordCounter finished in 1.263708429s for data/comments.txt
20180518 14:59:10.055 - 1610620 words in data/comments.txt
20180518 14:59:11.856 - wordReader finished in 3.06511103s for data/posts.txt
20180518 14:59:11.856 - wordCounter finished in 3.065200817s for data/posts.txt
20180518 14:59:11.856 - 4264001 words in data/posts.txt
20180518 14:59:11.856 - Total elapsed time 3.065248876s
Much slower this time. Like I said, you’re not always guaranteed better performance with channels; it depends on how you use them.
But! We’ve created the beginnings of a pipeline, so we can send words to different functions for different processing. If we wanted to we could create different functions to calculate word frequency, or do sentiment analysis.
It’s pretty cool, and pretty easy!