Monthly Archives: April 2015

Reverse Geocoding for the Masses – Apache Nutch

The Apache Nutch community has been hard at work developing an open source web crawler. Nutch is a mature, production-ready web crawler powering data acquisition, search and discovery for a broad spectrum of organizations across an even broader spectrum of use cases. The Nutch 1.x branch enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

This post documents how reverse geolocation features were added to Nutch via MaxMind’s GeoIP2-java API, making good use of the server IP addresses acquired during a Nutch crawl. Readers will take away:

  • insight into why geocoding is appealing in today’s markets,
  • practical code examples from the Nutch 1.x branch, showing how to use the GeoIP2-java API to geocode based on server IPs.
