Profile Photo

Jamie Skipworth


Technology Generalist | Software & Data


Geographic Functions in BigQuery

This is a little post to demo a couple of the dead-handy geographic functions that are available in Google Cloud’s BigQuery. I was working on a quick little proof-of-concept project recently where the team and I had to track a bunch of telemetry data on a map in Tableau. The data itself came from a handful of different sources, updated every minute.

Bog standard DataFlow pipeline

It was a pretty standard GCP solution - a DataFlow pipeline to ingest the data from PubSub, apply some transformations and perform basic data quality checks, before loading the data into BigQuery. There’s some other databasey stuff in there around data modelling and conformance so that data could be queried easier and faster, but I’ll skip over that for now.

Anyway, with that context out of the way, here’s what some of the data looks like:

$ bq query --nouse_legacy_sql 'select * from `demo_data.vehicle_location` order by datetime asc limit 10'
Waiting on bqjob_r1ec51cdd166d0331_0000016ca6ff2f99_1 ... (0s) Current status: DONE
+------------+-------------+----------------------------------------------+---------------------+
|    lat     |     lon     |                  vehicle_id                  |      datetime       |
+------------+-------------+----------------------------------------------+---------------------+
|  47.529275 |   8.5836066 | iAunO67VXtaJxkQ7qWUQwHamNncPmAWY1MnfpokbvQ0= | 2018-12-21 09:35:06 |
| 47.5292416 |   8.5835483 | iPJpheeZYJWqRkLtnG80Q7++PJpKDBnxE8gO9EPkpfg= | 2018-12-21 09:43:32 |
| 47.5291816 |    8.583465 | DCHwikOUMjXEpMo+0evQ5dNa+3l5FI69mibfsWz6T20= | 2018-12-21 09:55:51 |
| 47.5289516 |   8.5836016 | hMKZxU1OwA0gLgaLMUxAPrx2aWhyZ8FC0avnbq+5CDA= | 2018-12-21 09:56:21 |
|   47.52914 |     8.58345 | dDabbr6vedzq2XRR1119wUxmSBFa31hi/0FJ15Upc5k= | 2018-12-21 10:04:32 |
|  47.529085 |    8.583485 | knNUPX8a1bLUPEni5GKDJtJv0Gi9441z5LyUt0Py41s= | 2018-12-21 10:21:51 |
| 47.5291133 |   8.5834633 | AC63CI3G5jmzEr4ME+l21B3D1eB8okKqGTZDg6n/kwg= | 2018-12-21 10:49:10 |
| -32.755805 | 151.7629183 | mQ0q7QqmbF7i9VkimU3ulgLU5At0gTx3NbSLJu6MD+Q= | 2019-01-07 13:11:37 |
|   47.52906 |   8.5836383 | D8gLebEzZsSoxPlZmBJTuDLUiGTrnX/OpPVV2oJp9cc= | 2019-01-29 16:17:36 |
| 47.5292516 |   8.5833783 | g/ZeZZvppsHbU6rPAXAISkHEdrQgOPwo1OIoQ7hqPXI= | 2019-01-30 10:00:01 |
+------------+-------------+----------------------------------------------+---------------------+

Pretty bare-bones - just latitude, longitude, and the timetamp the position was recorded. A subset of the same data visualised on a map looks like this:

Raw data on a map

Each of the those points represents the position of a single vehicle at a particular point in time, so what we’re looking at is all positions for our entire timeframe. It’s hard to see the path itself, so wouldn’t it be cool if we could draw lines between the points?

BigQuery to the rescue! (actually it’s just standard SQL, but hey). We can use a window function to generate a sequence number that shows us the order in which each point was visited.

$ bq query --nouse_legacy_sql '
  SELECT
    *
  , ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY datetime ASC) as path_sequence
  FROM `demo_data.vehicle_location`
  ORDER BY vehicle_id ASC, datetime ASC
'
Waiting on bqjob_r5dee10ce107faf3c_0000016ca71efdd6_1 ... (0s) Current status: DONE
+-------------+-------------+----------------------------------------------+---------------------+---------------+
|     lat     |     lon     |                  vehicle_id                  |      datetime       | path_sequence |
+-------------+-------------+----------------------------------------------+---------------------+---------------+
|  -32.703225 |   151.50795 | +AL6H40LJIJOkaJm6HiTM09uDG+Vw48N4RGkqVrTnXg= | 2019-07-17 18:06:49 |             1 |
|  -32.703225 |   151.50795 | +AL6H40LJIJOkaJm6HiTM09uDG+Vw48N4RGkqVrTnXg= | 2019-07-18 00:16:03 |             2 |
|   -32.89035 | 151.7159316 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 22:18:17 |             1 |
|   -32.89035 | 151.7159316 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 22:20:45 |             2 |
|   -32.89035 | 151.7159316 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 22:23:17 |             3 |
|   -32.89035 | 151.7159316 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 23:53:21 |             4 |
|   -32.89056 | 151.7160766 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 23:55:44 |             5 |
| -32.8901983 |   151.71603 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 23:57:44 |             6 |
| -32.8901583 | 151.7160466 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 23:58:44 |             7 |
|   -32.89017 | 151.7159733 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-17 23:59:59 |             8 |
| -32.8901733 | 151.7159783 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-18 00:01:29 |             9 |
| -32.8902566 | 151.7160683 | +CXU2Y31Iz9Bb4EzquWXNeuUg1pVa4CrTTatzHSFiFw= | 2019-07-18 00:03:29 |            10 |

The sequence number is applied by ascending datetime for each vehicle. With that sequence number, we can now tell Tableau to plot a path based on that sequence:

Path on the map

I’ve added a colour gradient to better show the path taken, so the blue line gets darker the closer to the end of the journey. Now we can see more easily where a vehicle’s journey started and ended.

One of the aims of the project was to improve safety, and speed was one of the things we wanted to capture - Was a vehicle driving too fast? Where are vehicles slowing down a lot (indicating congestion where prangs are more likely).

Since we didn’t have speed in our data feeds, we had to work it out. Luckily we didn’t have to figure-out all the maths behind doing this, BigQuery has this stuff built-in! What we did was:

  • Create a GEOGPOINT (geographic point) for each lat/lon position.
  • Use the LAG window function to get values about previous positions for each new record.
  • Use ST_DISTANCE to calculate the distance between the current and previous GEOGPOINTs

With all of that worked out we can then do things like work out average speed over a particular time period. Here’s some SQL that does all of this for us:

WITH geodata as (
  SELECT 
  *
    -- Generate the sequence number for our journey path
  , ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY datetime ASC) as path_sequence
    -- Convert lat/lon to a GEOGPOINT
  , ST_GEOGPOINT(lon, lat) as geopoint
    -- Create a 5 minute interval. We'll use this later
  , CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(datetime, MINUTE))/60/5 AS INT64) as interval_5min
  FROM `demo_data.vehicle_location`
)
, lag_geodata as (
  SELECT
    *
    -- Get the previous GEOGPOINT based on the path_sequence for a vehicle
  , LAG( geopoint ) OVER( PARTITION BY vehicle_id ORDER BY path_sequence) as prev_geopoint
    -- Get the previous datetime
  , LAG( datetime ) over( PARTITION BY vehicle_id ORDER BY path_sequence) AS prev_update_datetime
  FROM geodata
)
, dist_geodata as (
  SELECT
    *
    -- Calculate the distance between this point and the previous one
  , ST_DISTANCE( geopoint, prev_geopoint ) as distance_from_prev
    -- Get the difference between this datetime and the previous one, in minutes
  , TIMESTAMP_DIFF(datetime,  prev_update_datetime, MINUTE) as datetime_diff_min
  FROM lag_geodata
)
, calc_speed as (
  SELECT
      *
      -- Calculate the speed
    , ROUND(SAFE_DIVIDE(distance_from_prev, (datetime_diff_min * 60)), 2) as speed_kmh
  FROM dist_geodata
)
SELECT 
   *
   -- Calculate the average speed over 5 mintes
 , ROUND(AVG(speed_kmh) OVER( PARTITION BY vehicle_id, TIMESTAMP_TRUNC(datetime, DAY) ORDER BY interval_5min), 2) as avg_speed_5min
FROM calc_speed
;

That generates some extra columns I’ve selected below. The LAG window functions give me the value of previous records based on the sort order for datetime. With those values I can calculate the speed.

$ cat query.sql | bq query --nouse_legacy_sql
Waiting on bqjob_r6c53763986cf9788_0000016cabf1f0ae_1 ... (0s) Current status: DONE
+----------+------------------+----------+-------------------+--------------------+-------------+-----------+--------------
|  veh_id  |     geopoint     | path_seq |      datetime     |   prev_datetime    | dst_frm_prv | speed_kmh | avg_spd_5min|
+----------+------------------+----------+-------------------+--------------------+-------------+-----------+--------------
| 8q7fP6ZO | POINT(145.300... |        1 | 20190717 21:58:51 |               NULL |        NULL |      NULL |        NULL |
| 8q7fP6ZO | POINT(145.300... |        2 | 20190717 23:48:29 |  20190717 21:58:51 |         0.0 |       0.0 |         0.0 |
| 8q7fP6ZO |  POINT(145.30... |        3 | 20190717 23:54:20 |  20190717 23:48:29 | 23.16932271 |      0.08 |         0.2 |
| 8q7fP6ZO |  POINT(145.30... |        4 | 20190717 23:55:33 |  20190717 23:54:20 | 35.32941018 |      0.59 |         0.2 |
| 8q7fP6ZO | POINT(145.300... |        5 | 20190717 23:56:54 |  20190717 23:55:33 |   8.1486017 |      0.14 |         0.2 |
| 8q7fP6ZO | POINT(145.300... |        6 | 20190717 23:58:03 |  20190717 23:56:54 | 6.144019011 |       0.1 |        0.18 |
| 8q7fP6ZO |  POINT(145.30... |        7 | 20190718 00:00:33 |  20190717 23:58:03 |  9.98023782 |      0.08 |        0.08 |
| 8q7fP6ZO | POINT(145.300... |        8 | 20190718 00:00:43 |  20190718 00:00:33 |  3.90262255 |      NULL |        0.08 |
| 8q7fP6ZO | POINT(145.222... |        9 | 20190718 00:34:35 |  20190718 00:00:43 |  8641.47028 |      4.36 |        1.12 |
| 8q7fP6ZO |   POINT(145.2... |       10 | 20190718 00:35:44 |  20190718 00:34:35 | 2.386375125 |      0.04 |        1.12 |
| 8q7fP6ZO |   POINT(145.2... |       11 | 20190718 00:37:44 |  20190718 00:35:44 |         0.0 |       0.0 |        1.12 |
| 8q7fP6ZO |  POINT(145.22... |       12 | 20190718 00:39:11 |  20190718 00:37:44 |  5.16311328 |      0.09 |        2.28 |

Now we can plug that data into Tableau. If we create a dual-axis map, we can plot the speed data on top of the journey path. Here I’ve coloured the speed points so those that are more red are faster. The speeds we’re dealing with here are very low since these are delivery vehicles, but you get the idea.

Path with speed on the map

This is all very simple stuff. Clearly there’s a lot more that could be done with this data, and a mapping tool other than Tableau would probably be a better option, but given the short timeframe and tools provided I think it’s pretty neat!

There are a ton of other BigQuery geographic functions and types available, so check them out.