The problem
You have a large (>1 million records) dataset you need to geocode.
Background
Many customers come to us with a large dataset they need geocoded as a one-off project.
Large volumes are no problem, but by considering a few things before you start you can ensure the project runs smoothly.
Step by step guide to geocoding a dataset of several million locations
Understand how to call the OpenCage Geocoding API
We do our best to keep our geocoding API as simple as possible, so that things "just work" by default. Nevertheless, we offer several optional parameters that may be useful for your specific situation. Take two minutes to read the list and decide whether any of them apply to your case. For example, if you know you only want results in Canada, using countrycode=ca tells us not to return non-Canadian results. Before you start geocoding, invest a few minutes in understanding our service, and please see the various best practices for using our API.
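As a rough illustration of how optional parameters are passed, the sketch below only builds a request URL and does not send it. The helper name build_request_url, the placeholder key, and the example address are assumptions made for this example; check the API reference for the full parameter list.

```python
from urllib.parse import urlencode

# Forward geocoding endpoint of the OpenCage Geocoding API (JSON format).
API_URL = "https://api.opencagedata.com/geocode/v1/json"


def build_request_url(query, api_key, **optional_params):
    """Return the full request URL for one geocoding query (hypothetical helper)."""
    params = {"q": query, "key": api_key}
    params.update(optional_params)
    # urlencode takes care of escaping spaces, commas and non-ASCII characters.
    return API_URL + "?" + urlencode(params)


# YOUR-API-KEY is a placeholder and the address is made up for illustration.
url = build_request_url(
    "150 Elgin Street, Ottawa",
    "YOUR-API-KEY",
    countrycode="ca",  # only return results in Canada
    limit=1,           # only the best match is needed
)
print(url)
```

In a real project you would most likely use one of our libraries rather than building URLs by hand; the point here is only that optional parameters are ordinary query parameters.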
Test on a small subset of locations
Choose a small subset of your dataset, say 1,000 entries or so, to test with and make sure everything is working well. Ideally, choose the subset randomly so that it truly reflects the makeup of the full dataset. Use this smaller test dataset to get everything working, and move on to the full dataset only once you are confident that everything works.
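One possible way to draw such a random subset is sketched below. It assumes the records are already loaded into a list; the function name and the stand-in data are invented for the example.

```python
import random


def pick_test_subset(records, size=1000, seed=42):
    """Return a reproducible random sample of at most `size` records."""
    rng = random.Random(seed)  # fixed seed so the same subset is picked on every run
    return rng.sample(records, min(size, len(records)))


# Stand-in data; in practice `records` would be loaded from your own file.
records = [f"address {i}" for i in range(50_000)]
subset = pick_test_subset(records)
print(len(subset))
```

Fixing the seed means you can rerun the test on exactly the same entries after changing your code.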
Clean your data
Make sure to remove any duplicates. If you are forward geocoding (address to coordinates), we have a detailed guide on steps you can take to clean up your queries and thus give us the best possible chance of answering your requests correctly.
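A minimal sketch of duplicate removal follows. It assumes that queries differing only in letter case or spacing count as duplicates, which may or may not be right for your data; the function names are made up for the example.

```python
def normalize(query):
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(query.split()).casefold()


def remove_duplicates(queries):
    """Keep the first occurrence of each query, preserving the original order."""
    seen = set()
    unique = []
    for query in queries:
        key = normalize(query)
        if key and key not in seen:  # the `key and` test also drops empty queries
            seen.add(key)
            unique.append(" ".join(query.split()))
    return unique


queries = [
    "10 Downing Street, London",
    "10  downing street,  london ",
    "",
    "Berlin",
]
print(remove_duplicates(queries))
```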
Add a unique identifier to each entry
Give each entry some sort of identifier. Don't just use the address to be geocoded, as it may not be unique and your process may modify it.
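If your data has no natural identifier, a simple sequential number is often enough. The sketch below assumes each entry is a dictionary, and the field name "id" is an arbitrary choice for the example.

```python
def add_identifiers(rows, id_field="id"):
    """Return copies of the rows with a sequential identifier added to each."""
    labelled = []
    for number, row in enumerate(rows, start=1):
        if id_field in row:
            # Refuse to silently overwrite a field that already exists.
            raise ValueError(f"row {number} already has a field named {id_field!r}")
        labelled.append({id_field: number, **row})
    return labelled


rows = [
    {"address": "1 Main Street, Springfield"},
    {"address": "1 Main Street, Springfield"},  # same address, different record
]
labelled = add_identifiers(rows)
print(labelled)
```

The two records share an address but can still be told apart by their identifiers.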
Use the right tool for calling the API
Just because a programming language is good for data analysis does not mean it is good for making millions of API requests. Based on several years of experience, our advice is to use a scripting language like Python, Perl, PHP, or Ruby for making your API requests. We have libraries for many different programming languages, and we have a command line utility specifically for geocoding large CSV files.
We advise AGAINST using a language like Stata or MATLAB for anything beyond very small datasets (around 10,000 records). These are great languages for evaluating data, but they are not good languages for requesting data. Our strong recommendation is to use a scripting language to query our API and store the data locally; you can then use the language of your choice to evaluate it. For Stata specifically, please see some of the common issues that come up.
Understand how long the process will take
There are several things you can do to speed up your geocoding. The main decision is whether to structure your code to make requests in parallel or whether running them in series is fine. We have a command line tool for geocoding large files and example scripts for making parallel requests in Python (see the "Running many parallel queries" section of our tutorial), Node.js, Ruby, and PHP.
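To give a feel for the trade-off, here is a small sketch that estimates running time and runs requests through a thread pool. It is not one of the example scripts mentioned above; geocode_one stands in for your own API call, and the rate of 10 requests per second is an assumed figure, so use whatever rate your plan allows.

```python
from concurrent.futures import ThreadPoolExecutor


def estimate_hours(total_requests, requests_per_second):
    """Rough running time if requests are sent at a steady rate."""
    return total_requests / requests_per_second / 3600


def geocode_in_parallel(entries, geocode_one, workers=4):
    """Geocode (identifier, query) pairs with a small pool of threads.

    `geocode_one` is a hypothetical function that takes a query string and
    returns its result; swap in your own API call.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input entries.
        results = list(pool.map(lambda entry: geocode_one(entry[1]), entries))
    return {entry[0]: result for entry, result in zip(entries, results)}


def fake_geocode(query):
    """Stand-in for a real request so the sketch runs without network access."""
    return query.upper()


entries = [(1, "berlin"), (2, "paris"), (3, "ottawa")]
print(geocode_in_parallel(entries, fake_geocode))

# Three million requests at an assumed 10 requests per second.
print(round(estimate_hours(3_000_000, 10), 1), "hours")
```

At the assumed rate, three million requests take a bit over three days, which is why it is worth deciding on parallel versus serial requests before you start.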
Log any errors
Despite the best intentions, any process that handles millions of data points will have moments when things don't go smoothly. Make sure your code is robust. What happens when the internet connection is bad? What happens if we are unable to geocode your query? The first step in dealing with such situations is to know how often they occur. Hopefully the number of problems is small enough to be negligible, but the only way to know is to log each failure. At a minimum, we suggest you log the unique identifier of the data point (see "Add a unique identifier to each entry" above) and any non-success response codes the API returns.
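A minimal sketch of such logging, using Python's standard logging module, might look like this. The fetch function is hypothetical: it is assumed to send one query and return a (status_code, results) pair, so adapt it to however your own code calls the API.

```python
import logging

# Logs go to stderr here; pass filename="..." to basicConfig to write a file instead.
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("geocoding")


def geocode_and_log(entry_id, query, fetch):
    """Return the results for one entry, or None after logging the failure."""
    try:
        status_code, results = fetch(query)
    except OSError as error:  # network problems such as timeouts or dropped connections
        logger.error("id=%s network error: %s", entry_id, error)
        return None
    if status_code != 200:
        logger.error("id=%s status=%s", entry_id, status_code)
        return None
    if not results:
        logger.warning("id=%s no results found", entry_id)
        return None
    return results


def failing_fetch(query):
    """Stand-in that always reports a server error."""
    return 500, []


print(geocode_and_log(17, "nowhere in particular", failing_fetch))
```

Counting the logged lines afterwards tells you how often each kind of failure occurred and which entries to retry.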
Run your script in the background
It's always best to run long-running processes in the background, or in such a way that they are not lost if the computer crashes or is mistakenly turned off. How to do that depends on your operating system and programming language, but the important point is that long-running processes can be interrupted unexpectedly.
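Whatever way you choose to run the script, it helps if it can pick up where it left off. The sketch below shows one possible approach, assuming results are appended to a CSV file keyed by identifier; geocode_one, the file layout, and the function names are all assumptions made for the example.

```python
import csv
import os
import tempfile


def load_done_ids(output_path):
    """Return the identifiers already present in the output file, if it exists."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path, newline="", encoding="utf-8") as handle:
        return {row[0] for row in csv.reader(handle) if row}


def process(entries, geocode_one, output_path):
    """Geocode (identifier, query) pairs, skipping those finished in an earlier run."""
    done = load_done_ids(output_path)
    with open(output_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for entry_id, query in entries:
            if str(entry_id) in done:
                continue
            lat, lng = geocode_one(query)
            writer.writerow([entry_id, lat, lng])
            handle.flush()  # hand each row to the operating system straight away


calls = []


def fake_geocode(query):
    """Stand-in for a real request; records which queries were sent."""
    calls.append(query)
    return 0.0, 0.0


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "results.csv")
    entries = [(1, "Berlin"), (2, "Paris")]
    process(entries[:1], fake_geocode, path)  # first run stops after one entry
    process(entries, fake_geocode, path)      # restart: entry 1 is skipped
print(calls)
```

After the restart only the unfinished entry is requested again, so an interruption costs little time and no extra requests for finished entries.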
Final thoughts
By following these steps, the chances that your project will run smoothly increase greatly. If in doubt, feel free to ask us any questions you may have. We are here to help.