The Easy Way to Use MaxMind GeoIP with Redshift
by TJ Murphy, @teej_m on Twitter

This article was originally published on Towards Data Science on Jan 18, 2019.
It always starts with an innocent observation. “We get a lot of traffic from Boston,” your boss remarks. You naturally throw out a guess or two and discuss why that might be. Until your boss drops the bomb —
“Can you dig into that?”
Darn it. You walked right into that one.
Now you’re in a predicament. You know Google Analytics has traffic by geographic location, but that’s not gonna cut it. If you want to report on those retention rates, lifetime values, or repeat behaviors by geo, you need something you can query with SQL, something that lives in your data warehouse. But you don’t have anything like that. You know there’s user IP addresses in your log data, you just need to turn them into locations. But Redshift doesn’t have a way to do that.
What you need is geolocation using IPs, aka GeoIP. The place folks commonly start is MaxMind , mostly because it’s the first Google result for “GeoIP”. Together we will use their IP-to-City dataset to enrich our log data and determine what city and country our users are from. We will use MaxMind data because it’s reliable and robust. Also it’s free. One less thing to bother your boss about.
So off we go to download MaxMind's GeoLite2 City data. Upon opening the zip, we find a number of CSV files, the most important among them being GeoLite2-City-Blocks-IPv4.csv. If we peek inside, we see rows pairing networks with geoname IDs, latitudes, and longitudes.
Right away we notice a problem — this data has something that looks like an IP but has a slash and an extra number at the end. This is an IP network represented in CIDR notation, and it represents a range of IPs. It's composed of an IP, a slash, and a number after the slash called the prefix length. It's like how you might describe a street of physical addresses in New York City by saying "the 500 block of West 23rd Street."
If we had the network X.Y.Z.0/24, that would mean "every IP that starts with X.Y.Z. and has any number between 0 and 255 at the end." In other words, any IP between X.Y.Z.0 and X.Y.Z.255. So if we observed a user with the IP X.Y.Z.95, it would fall in the network X.Y.Z.0/24 and thus is located in the geoname_id of 6252001. The prefix length can be any number between 1 and 32, and the smaller the number, the wider the network.
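To make the prefix arithmetic concrete: a prefix length of 24 leaves 32 - 24 = 8 host bits, so a /24 covers 2^8 = 256 addresses. A quick illustrative query, not tied to any of our tables:

-- Number of addresses covered by a given prefix length.
SELECT POW(2, 32 - 24)::INT AS addresses_in_a_slash_24  -- 256
     , POW(2, 32 - 16)::INT AS addresses_in_a_slash_16; -- 65536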
If this MaxMind table were in Redshift, how would we join to it? Redshift doesn't include any handy network address types like modern Postgres, or INET functions like MySQL. Instead, we will use knowledge of the math behind IPs to do the work ourselves.
You can think of an IP as a fancy representation of a really big number. The IP 1.2.3.4 is really just 16,909,060 under the hood. Similarly, IP networks are just ranges of really big numbers. The network 1.2.3.0/24 is a range that starts with 16,909,056 and ends with 16,909,311. We will use this to our advantage. To do so, we need a way to convert IPs and IP networks to numbers.
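The conversion is just base-256 place value, one octet per byte. You can check the arithmetic for 1.2.3.4 yourself:

-- 1*256^3 + 2*256^2 + 3*256 + 4 = 16777216 + 131072 + 768 + 4
SELECT 1 * POW(256, 3) + 2 * POW(256, 2) + 3 * POW(256, 1) + 4 * POW(256, 0) AS ip_as_integer; -- 16909060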
Using the geoip2-csv-converter tool provided by MaxMind, we will add the integer range representation of each network to our CSV.
# Download the converter tool from Github
curl -o geoip2-csv-converter-v1.1.0-darwin-amd64.tar.gz \
  -L https://github.com/maxmind/geoip2-csv-converter/releases/download/v1.1.0/geoip2-csv-converter-v1.1.0-darwin-amd64.tar.gz
tar xvf geoip2-csv-converter-v1.1.0-darwin-amd64.tar.gz

# Use the tool to add integer range columns to our CSV
./geoip2-csv-converter-v1.1.0/geoip2-csv-converter \
  -block-file=GeoLite2-City-CSV_20190108/GeoLite2-City-Blocks-IPv4.csv \
  -output-file=GeoLite2-City-Blocks-IPv4-with-ranges.csv \
  -include-cidr \
  -include-integer-range

# Compress the result for a faster upload and COPY
gzip GeoLite2-City-Blocks-IPv4-with-ranges.csv
Note that I'm using the 20190108 version. MaxMind updates this data set weekly, so your version may be different.

After uploading our modified CSV to S3, we can COPY it into Redshift.
-- Create Maxmind IP to geo table in Redshift
-- https://dev.maxmind.com/geoip/geoip2/geoip2-city-country-csv-databases/
CREATE TABLE maxmind_ipv4_to_geo (
  network VARCHAR
, network_start_integer BIGINT
, network_last_integer BIGINT
, geoname_id INTEGER
, registered_country_geoname_id INTEGER
, represented_country_geoname_id INTEGER
, is_anonymous_proxy BOOLEAN
, is_satellite_provider BOOLEAN
, postal_code VARCHAR
, latitude NUMERIC
, longitude NUMERIC
, accuracy_radius INTEGER
);

-- Copy our CSV into the Maxmind table
COPY maxmind_ipv4_to_geo
FROM 's3://path/to/your/data/GeoLite2-City-Blocks-IPv4-with-ranges.csv.gz'
CREDENTIALS '...YOUR_CREDENTIALS_GO_HERE...'
CSV
GZIP
IGNOREHEADER AS 1
;
Now let's write a function to convert IPs to really big numbers. Here's a simple one written in SQL. We will call it f_inet_aton after inet_aton, the classic C library function that does the same thing. "inet" stands for "internet" and "aton" means "Address TO Number". Systems folks like to keep things short and sweet.
-- Create a function to convert IPs to numbers.
-- Named after the classic inet_aton C function.
CREATE FUNCTION f_inet_aton(VARCHAR)
RETURNS BIGINT IMMUTABLE AS $$

  SELECT (
    SPLIT_PART($1, '.', 1)::BIGINT * POW(256, 3) +
    SPLIT_PART($1, '.', 2)::BIGINT * POW(256, 2) +
    SPLIT_PART($1, '.', 3)::BIGINT * POW(256, 1) +
    SPLIT_PART($1, '.', 4)::BIGINT * POW(256, 0)
  )::BIGINT

$$ LANGUAGE sql;
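A quick sanity check against our earlier example:

SELECT f_inet_aton('1.2.3.4') AS ip_as_integer; -- 16909060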
The last thing we need is to load the MaxMind CSV that contains a lookup from geoname_id to an actual place on earth. We will gzip it for speed, upload the GeoLite2-City-Locations-en.csv.gz file to S3, and COPY it into a table.
Some quick pointers on Redshift optimization. For a small, commonly joined dimension table like this, I recommend DISTSTYLE ALL. This makes a copy of the table on every node in your cluster, eliminating a data transfer step during joins. I also define our join column as a SORTKEY to speed things up.
-- Create Maxmind geo to city table in Redshift
CREATE TABLE maxmind_geo_to_city (
  geoname_id INTEGER
, locale_code VARCHAR
, continent_code VARCHAR
, continent_name VARCHAR
, country_iso_code VARCHAR
, country_name VARCHAR
, subdivision_1_iso_code VARCHAR
, subdivision_1_name VARCHAR
, subdivision_2_iso_code VARCHAR
, subdivision_2_name VARCHAR
, city_name VARCHAR
, metro_code VARCHAR
, time_zone VARCHAR
, is_in_european_union BOOLEAN
)
DISTSTYLE ALL
SORTKEY (geoname_id)
;

-- Copy our CSV into the Maxmind table
COPY maxmind_geo_to_city
FROM 's3://path/to/my/data/GeoLite2-City-Locations-en.csv.gz'
CREDENTIALS '...YOUR_CREDENTIALS_GO_HERE...'
CSV
GZIP
IGNOREHEADER AS 1
;
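Before moving on, it's worth a quick spot-check that both loads worked. Row counts vary by release, and the Boston filter is just an example, but you should see a few million network rows and a handful of matching cities:

-- Rough sanity checks after the COPYs.
SELECT COUNT(*) FROM maxmind_ipv4_to_geo;
SELECT * FROM maxmind_geo_to_city WHERE city_name = 'Boston' LIMIT 5;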
Now we have everything we need to enrich our logs with location data using GeoIP. Almost everything.
Redshift Is Hard
If we join our new MaxMind GeoIP tables to our log data, we will immediately run into a problem. Suppose I have some bare bones access logs and try to calculate the top 50 regions by traffic.
CREATE TABLE access_logs (
  ip_address VARCHAR,
  created_at_utc TIMESTAMP,
  user_id INTEGER
);
If you ran this query, you're going to have a bad time. Your query will run for minutes and you'll start to sweat. Meanwhile your Redshift admin will be hunting for the individual who took down her cluster. Don't be that person.
SELECT mm_city.city_name
     , mm_city.country_name
     , COUNT(*) AS traffic
FROM access_logs AS log
LEFT JOIN maxmind_ipv4_to_geo AS mm_geo
  ON f_inet_aton(log.ip_address) BETWEEN mm_geo.network_start_integer
                                     AND mm_geo.network_last_integer
LEFT JOIN maxmind_geo_to_city AS mm_city
  ON mm_geo.geoname_id = mm_city.geoname_id
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 50
;
What's wrong with this query? A quick peek at the EXPLAIN plan, the list of steps Redshift takes to execute the query, tells all.
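If you want to see it for yourself, prefix the query with EXPLAIN. The exact costs and row estimates will vary, but the red flag is a nested loop step, which shows up in the plan as something like "XN Nested Loop Left Join":

-- Ask Redshift for the plan without running the query.
EXPLAIN
SELECT COUNT(*)
FROM access_logs AS log
LEFT JOIN maxmind_ipv4_to_geo AS mm_geo
  ON f_inet_aton(log.ip_address) BETWEEN mm_geo.network_start_integer
                                     AND mm_geo.network_last_integer;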
Suppose you ran an ice cream shop with millions of delicious flavors. What if each customer in line had to taste-test every single flavor before they could choose one? That's what happens if we attempt to join our log data (customers) to the MaxMind data (flavors) directly using our BETWEEN join (taste-test). This results in a nested loop join, one of the quickest ways to make a database cry.
To speed up our ice cream shop, we are going to organize it into distinct sections — the chocolates over here, the vanillas over there, and a special spot for the minty flavors. Laid out this way, customers head to the one section that matches their preference. Once there they taste-test a minuscule number of flavors in comparison to before.
Creating an optimized GeoIP lookup table
We will build a new table in Redshift to replace maxmind_ipv4_to_geo, one that's optimized for GeoIP lookups. We will organize it into distinct sections using the first half of each IP and place each network into its proper section. Some networks are wide enough that they will go into more than one section. These sections act almost like a database index, allowing Redshift to narrow down which networks to check for each IP. Using a little bit-twiddling magic, we take our table and convert it into one that's fast and optimized.
-- Create a helper table to allow us to enumerate from 0-255.
-- This is a quick hack to generate a table with just the numbers 0 to 255 in it.
CREATE TEMPORARY TABLE range_0_to_255 AS
SELECT ROW_NUMBER() OVER () - 1 AS num
FROM maxmind_geo_to_city
ORDER BY 1
LIMIT 256
;

-- Create the special lookup table
CREATE TABLE maxmind_ipv4_lookup
DISTSTYLE ALL
AS
SELECT octet_1.num || '.' || octet_2.num AS first_16_bits
     , mm.network_start_integer
     , mm.network_last_integer
     , mm.network
     , mm.geoname_id
FROM maxmind_ipv4_to_geo mm
JOIN range_0_to_255 octet_1
  ON octet_1.num BETWEEN (mm.network_start_integer & f_inet_aton('255.0.0.0')) >> 24
                     AND (mm.network_last_integer & f_inet_aton('255.0.0.0')) >> 24
JOIN range_0_to_255 octet_2
  ON octet_2.num BETWEEN (mm.network_start_integer & f_inet_aton('0.255.0.0')) >> 16
                     AND (mm.network_last_integer & f_inet_aton('0.255.0.0')) >> 16
;
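To demystify the bit-twiddling: ANDing an IP's integer with 255.0.0.0 keeps only the top byte, and shifting right by 24 bits turns that byte back into the first octet; the 0.255.0.0 mask and a 16-bit shift do the same for the second octet. A quick check using our earlier example:

-- Pull the first two octets of 1.2.3.4 (16909060) back out of the integer.
SELECT (f_inet_aton('1.2.3.4') & f_inet_aton('255.0.0.0')) >> 24 AS octet_1 -- 1
     , (f_inet_aton('1.2.3.4') & f_inet_aton('0.255.0.0')) >> 16 AS octet_2 -- 2
;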
With our lookup table created, we can take another crack at our analysis. In our query, we swap out maxmind_ipv4_to_geo for maxmind_ipv4_lookup and add a new join condition. We extract the first half of each IP with the regex REGEXP_SUBSTR(log.ip_address, '\\d+\\.\\d+') and match it to its appropriate section in the table, mm_geo.first_16_bits. Then we check which network it belongs to using the integer representations of the IP and the networks. With this optimization, our query returns quickly, with no nested loop join in sight!
SELECT mm_city.city_name
     , mm_city.country_name
     , COUNT(*) AS traffic
FROM access_logs AS log
LEFT JOIN maxmind_ipv4_lookup AS mm_geo
  ON REGEXP_SUBSTR(log.ip_address, '\\d+\\.\\d+') = mm_geo.first_16_bits
  AND f_inet_aton(log.ip_address) BETWEEN mm_geo.network_start_integer
                                      AND mm_geo.network_last_integer
LEFT JOIN maxmind_geo_to_city AS mm_city
  ON mm_geo.geoname_id = mm_city.geoname_id
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 50
;
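If you're unsure what that regex is doing, a quick check with a made-up IP shows it grabs everything up through the second octet, exactly the format of first_16_bits:

SELECT REGEXP_SUBSTR('1.2.3.4', '\\d+\\.\\d+') AS first_16_bits; -- '1.2'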
And with that, you're ready to start your analysis. You can use this lookup table to join to any other table in Redshift that has an IP column. Just remember to always include the join on first_16_bits, as that's the magic behind the approach.
This approach is inspired by a paper I read about optimizing IP lookups in a network appliance. I unfortunately can’t find the paper anymore. This post is a result of porting that solution to Redshift and iterating to reduce it to something simple and performant.
Thanks to the dbt Slack for inspiring this post, Julian Ganguli for working with some early code, and to Nick James for reading an early draft.