The Apache Nutch community has been hard at work developing an open source web crawler. Nutch is a mature, production-ready web crawler powering data acquisition, search and discovery for a broad spectrum of organizations across an even broader spectrum of use cases. The Nutch 1.x branch enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are great for batch processing.
This post documents how reverse geolocation features were added to Nutch via MaxMind’s GeoIP2-java API, making good use of server IP addresses acquired within a Nutch crawl. Readers will take away:
- insight into why geocoding is appealing in today’s markets,
- practical code examples from the Nutch 1.x branch, showing how to use the GeoIP2-java API in order to geocode based on server IPs.
What’s so Appealing about Reverse Geocoding IP Addresses?
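Strictly speaking, reverse geocoding starts with a point location (latitude and longitude) and uses it to obtain user-friendly geolocation information, such as country and city. In this post the term is used a little more loosely: the starting point is a server IP address, from which both the coordinates and the descriptive location data are looked up.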
As part of its travels, Nutch determines a large number of web server IP addresses. With the help of GeoIP2 Databases and GeoIP2 Precision Services from MaxMind, Nutch is in a position not only to determine the latitude and longitude of those IP addresses, but to provide the type of geolocation information required by a large list of use cases, including website visitor segmentation by country for traffic management and content localization.
In short, from a Nutch perspective, being able to determine a web server's location from its IP address was certainly a worthwhile reverse geocoding exercise. Once we identified the resources provided by MaxMind’s suite of GeoIP2 Databases and GeoIP2 Precision Services, we had all we needed to proceed. You can follow the various code samples peppered throughout the article to learn just how we did it.
MaxMind’s GeoIP2-java and NUTCH-1660
Over at Nutch development headquarters (that is, the community mailing lists), we tend to like the occasional code sample as well as well-grounded conversation. This section details a number of Java code excerpts (mostly) from the Nutch 1.x geoip indexing filter. For those accustomed to the popular Jira software project management tool, you can check out the development and correspondence trail, which takes the form of NUTCH-1660. So without further ado, let’s dive in.
Obtaining a Server IP Address
A critical component of the Nutch geoip plugin is the provisioning and availability of a server IP address. In code sample 1, we create an actual InetSocketAddress by using the sockHost and sockPort objects obtained from our Nutch HTTP protocol implementation. The sockAddr variable represents an immutable object used by sockets for binding, connecting, etc.
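As a rough illustration of the step just described, the sketch below builds an InetSocketAddress from a host and a port using only the Java standard library. The class and method names are hypothetical and chosen for this example; in Nutch, sockHost and sockPort come from the HTTP protocol implementation rather than from method arguments.

```java
import java.net.InetSocketAddress;

final class SocketAddressSketch {

  // Builds the socket address for the server we are about to connect to.
  // sockHost and sockPort mirror the names used in the text; here they are
  // plain arguments (an assumption made to keep the example self-contained).
  static InetSocketAddress createSockAddr(String sockHost, int sockPort) {
    // This constructor attempts to resolve the host name to an IP address.
    return new InetSocketAddress(sockHost, sockPort);
  }

  public static void main(String[] args) {
    // A literal IP address is used so the example does not depend on DNS.
    InetSocketAddress sockAddr = createSockAddr("127.0.0.1", 8080);
    System.out.println("resolved: " + !sockAddr.isUnresolved());
    System.out.println("port:     " + sockAddr.getPort());
  }
}
```

Because an InetSocketAddress is immutable, the same instance can safely be handed to the socket for connecting and kept around for later inspection.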
Code sample 2 shows how we use the sockAddr accessor methods, e.g. sockAddr.getAddress().getHostAddress(), to obtain a String representation of the server IP address. This value is obtained after a successful connection has been made to the server. The rest of the sample checks that a sockAddr value is present and that we actually wish to store the server IP address. It should be noted that by default, as a courtesy measure within Nutch, this configuration setting is OFF. Once we verify that a value is present, we add it to the list of HttpHeaders associated with the page for which the request was made.
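The logic described above can be sketched roughly as follows. This is not the Nutch source: the header name, the boolean flag and the use of a plain Map in place of the Nutch header structure are all assumptions made for illustration.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashMap;
import java.util.Map;

final class ServerIpHeaderSketch {

  // Hypothetical header name, chosen for this example only.
  static final String IP_HEADER = "_ip_";

  // Adds the server IP to the headers only when the opt-in setting is
  // enabled and the socket address was actually resolved.
  static void recordServerIp(Map<String, String> headers,
                             InetSocketAddress sockAddr,
                             boolean storeIpAddress) {
    if (!storeIpAddress || sockAddr == null || sockAddr.getAddress() == null) {
      return; // setting is off, or there is no address to record
    }
    headers.put(IP_HEADER, sockAddr.getAddress().getHostAddress());
  }

  // Convenience wrapper: builds the address and returns the resulting headers.
  static Map<String, String> headersFor(String host, int port, boolean storeIpAddress) {
    Map<String, String> headers = new LinkedHashMap<>();
    recordServerIp(headers, new InetSocketAddress(host, port), storeIpAddress);
    return headers;
  }

  public static void main(String[] args) {
    System.out.println(headersFor("127.0.0.1", 80, true));  // IP recorded
    System.out.println(headersFor("127.0.0.1", 80, false)); // setting off
  }
}
```

Keeping the flag check first means no address data is touched at all unless the operator has explicitly opted in.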
Nutch is a highly configurable piece of software. The overwhelming majority of that configuration is driven by and inherited from Hadoop’s Configuration. Nutch adds to, and at times overrides, this configuration via the Nutch-specific NutchConfiguration. Nutch configuration therefore inherits the convention of defining names and values with an optional description. In code sample 3 we see three configuration properties encoded as XML, which are specific to the Nutch reverse geocoding implementation. These values are read at runtime using the conf object, similar to what is displayed in code sample 2. All three of the geoip plugin properties are documented. With a little effort, the geoip plugin can be adapted to use the GeoIP2 Databases instead of the GeoIP2 Precision Services. By default the plugin is configured to use the insightsService.
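To make the name/value/description convention concrete, here is a minimal sketch that embeds a Hadoop-style XML fragment and reads it into a map using only the JDK's XML parser. The property names index.geoip.userid and index.geoip.licensekey appear in this post; the name index.geoip.usage and the license key value are assumptions for illustration, and the real plugin reads its settings through Hadoop's Configuration rather than the hand-rolled parser shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class GeoIpConfigSketch {

  // Illustrative configuration in the name/value/description layout.
  static final String SAMPLE_XML =
      "<configuration>"
      + "<property>"
      + "<name>index.geoip.usage</name>"          // assumed property name
      + "<value>insightsService</value>"
      + "<description>Which GeoIP2 source to use.</description>"
      + "</property>"
      + "<property>"
      + "<name>index.geoip.userid</name>"
      + "<value>12345</value>"                    // dummy default from the post
      + "<description>MaxMind user id.</description>"
      + "</property>"
      + "<property>"
      + "<name>index.geoip.licensekey</name>"
      + "<value>your-license-key</value>"         // placeholder, not a real key
      + "<description>MaxMind license key.</description>"
      + "</property>"
      + "</configuration>";

  // Parses every <property> element into a name -> value entry.
  static Map<String, String> parse(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList props = doc.getElementsByTagName("property");
      Map<String, String> values = new LinkedHashMap<>();
      for (int i = 0; i < props.getLength(); i++) {
        Element prop = (Element) props.item(i);
        String name = prop.getElementsByTagName("name").item(0).getTextContent().trim();
        String value = prop.getElementsByTagName("value").item(0).getTextContent().trim();
        values.put(name, value);
      }
      return values;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse configuration", e);
    }
  }

  public static void main(String[] args) {
    parse(SAMPLE_XML).forEach((name, value) -> System.out.println(name + " = " + value));
  }
}
```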
One final Nutch configuration property which needs to be adapted in order to use the geoip plugin is shown in code sample 4. The property lists, as a regular expression, the plugins Nutch loads at runtime, and index-geoip must be added to that expression. This registers the plugin so that it takes part in the reverse geocoding exercise within the Nutch crawl.
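The effect of adding index-geoip to the plugin expression can be checked with a few lines of Java. The expression below is an illustrative value, not a recommended plugin list, and the sketch assumes that each plugin id is matched in full against the expression.

```java
import java.util.regex.Pattern;

final class PluginIncludesSketch {

  // Illustrative value only; a real deployment will list different plugins.
  static final String PLUGIN_INCLUDES =
      "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)";

  // Returns true when the whole plugin id matches the includes expression.
  static boolean isIncluded(String pluginId, String includesRegex) {
    return Pattern.compile(includesRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    System.out.println("index-geoip included: "
        + isIncluded("index-geoip", PLUGIN_INCLUDES));
    System.out.println("index-more included:  "
        + isIncluded("index-more", PLUGIN_INCLUDES));
  }
}
```

A quick check like this is a cheap way to catch a misplaced parenthesis or pipe before starting a crawl.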
Within the actual geoip implementation we are then required to read in the previously defined configuration properties. Code sample 5 displays logic for making a check to see which GeoIP2 service we wish to use before creating a WebServiceClient object via the WebServiceClient.Builder constructor. The constructor takes both the index.geoip.userid and index.geoip.licensekey configuration property values as arguments. This allows us to use (in this case) a particular user account with which to execute the reverse geocoding. You will see that a dummy default value of 12345 has been associated with the index.geoip.userid configuration property value.
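The configuration check described above might look roughly like the sketch below. To keep it runnable without extra dependencies, a plain Map stands in for the Nutch conf object and the MaxMind client is not actually constructed; the class name and the index.geoip.usage property name are assumptions. With GeoIP2-java on the classpath, the client would be created along the lines of new WebServiceClient.Builder(userId, licenseKey).build().

```java
import java.util.HashMap;
import java.util.Map;

final class GeoIpSettingsSketch {

  final String usage;
  final int userId;
  final String licenseKey;

  private GeoIpSettingsSketch(String usage, int userId, String licenseKey) {
    this.usage = usage;
    this.userId = userId;
    this.licenseKey = licenseKey;
  }

  // Reads the three plugin settings, falling back to defaults when absent.
  static GeoIpSettingsSketch from(Map<String, String> conf) {
    String usage = conf.getOrDefault("index.geoip.usage", "insightsService"); // assumed name
    int userId = Integer.parseInt(conf.getOrDefault("index.geoip.userid", "12345"));
    String licenseKey = conf.getOrDefault("index.geoip.licensekey", "");
    return new GeoIpSettingsSketch(usage, userId, licenseKey);
  }

  // True when the web service, rather than a local database, should be used.
  boolean usesInsightsService() {
    return "insightsService".equalsIgnoreCase(usage);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("index.geoip.licensekey", "your-license-key"); // placeholder value
    GeoIpSettingsSketch settings = from(conf);
    if (settings.usesInsightsService()) {
      // This is the point at which the web service client would be built
      // from settings.userId and settings.licenseKey.
      System.out.println("Would build web service client for user " + settings.userId);
    }
  }
}
```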
The final code excerpt we show in this article gives an indication of the expressiveness and functionality available within the MaxMind GeoIP2-java API. Code sample 6 shows part of a method, createDocFromInsightsService, which accepts three arguments: the server IP we obtained in code sample 1; a NutchDocument object, a Nutch data structure suitable for indexing purposes, e.g. into Apache Solr; and the WebServiceClient object we created in code sample 5 above. The InsightsResponse we create can then be used to add an array of reverse geocoded data, such as City (city-level data associated with an IP address), Continent (data for the continent record associated with an IP address), Country (data for the country record associated with an IP address), etc. Much more functionality exists within the Nutch geoip plugin, and a wealth of information on what else is offered within the GeoIP2-java API is provided by the project documentation.
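The general shape of that mapping step, from a geolocation response to indexable fields, can be sketched without the MaxMind library. Everything named below is hypothetical: GeoRecord stands in for the handful of values read from an InsightsResponse, a Map stands in for the NutchDocument, and the field names and sample values are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GeoFieldsSketch {

  // Hypothetical stand-in for the parts of a geolocation response used here.
  static final class GeoRecord {
    final String city;
    final String continent;
    final String countryIsoCode;
    final Double latitude;
    final Double longitude;

    GeoRecord(String city, String continent, String countryIsoCode,
              Double latitude, Double longitude) {
      this.city = city;
      this.continent = continent;
      this.countryIsoCode = countryIsoCode;
      this.latitude = latitude;
      this.longitude = longitude;
    }
  }

  // A lookup may not return every value, so missing ones are skipped.
  static void addIfPresent(Map<String, Object> doc, String field, Object value) {
    if (value != null) {
      doc.put(field, value);
    }
  }

  // Flattens the record into document fields keyed by illustrative names.
  static Map<String, Object> toDocFields(String serverIp, GeoRecord geo) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("ip", serverIp);
    addIfPresent(doc, "cityName", geo.city);
    addIfPresent(doc, "continentName", geo.continent);
    addIfPresent(doc, "countryIsoCode", geo.countryIsoCode);
    if (geo.latitude != null && geo.longitude != null) {
      doc.put("latLon", geo.latitude + "," + geo.longitude);
    }
    return doc;
  }

  public static void main(String[] args) {
    // 192.0.2.1 is a documentation-only address; the values are made up.
    GeoRecord geo = new GeoRecord(null, "Europe", "GB", 51.5, -0.1);
    System.out.println(toDocFields("192.0.2.1", geo));
  }
}
```

Skipping absent values keeps empty fields out of the index, which matters because the level of detail available can vary from one IP address to the next.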
This blog post displays how the Apache Nutch project currently uses MaxMind’s GeoIP2-java API to reverse geocode server IPs. The result is a wealth of information, driving search, discovery and understanding of server networks and the nature of the content they serve.
In closing, it is important to state that liberal software licensing practices are significant enabling factors in driving software development of this nature. All of the source code provided within this article is under Nutch source code management and control at The Apache Software Foundation. Both the MaxMind GeoIP2-java API and the Apache Nutch source code are licensed under the Apache License, Version 2.0, a permissive software license which drives open source development and collaboration.
About our contributor: Dr. Lewis John McGibbney is an Engineering Applications Software Engineer at NASA’s Jet Propulsion Laboratory in Pasadena, California. He’s an advocate of open source, a keen believer in community over code and a mentor to others new to the open source spectrum. He considers as his home the Apache Software Foundation, where he is a member and an avid contributor. Follow him on Twitter @hectorMcSpector