You have a large (>1 million records) dataset you need to geocode.
Many customers come to us with a large dataset they need geocoded as a one-off project.
Large volumes are no problem, but by considering a few things before you start you can ensure the project runs smoothly.
Step-by-step guide to processing a dataset of several million locations
1. Test on a small subset of locations
Choose a small subset of your dataset, say 1,000 entries or so, and test with it to make sure everything is working well. Ideally, choose the subset randomly so that it truly reflects the makeup of the full dataset.
Use this smaller test dataset to get everything working. Move on to the full dataset only once you're confident the whole process works.
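One way to draw such a random test subset is sketched below. It assumes the records are already loaded into a Python list; the function name sample_records and the fixed seed are illustrative choices, not part of any OpenCage tooling.

```python
import random

def sample_records(records, size=1000, seed=42):
    """Return a reproducible random subset of `records` (at most `size` items)."""
    rng = random.Random(seed)  # fixed seed so the same test subset can be re-created
    return rng.sample(records, min(size, len(records)))

# Made-up data standing in for the full dataset
all_records = [f"address {i}" for i in range(5000)]
test_subset = sample_records(all_records, size=1000)
```

Fixing the seed means a problem found during testing can be reproduced later with exactly the same subset.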
2. Add a unique identifier to each entry
Give each entry some sort of identifier. Don't use the address to be geocoded as the identifier: it may not be unique, and your process may modify it.
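As a rough illustration, identifiers can be as simple as a sequential counter assigned before any cleaning takes place. The helper add_ids and the "rec-" prefix below are hypothetical; any scheme that stays unique and stable will do.

```python
def add_ids(addresses, prefix="rec"):
    """Pair each address with a sequential identifier that does not depend on its text."""
    return [
        {"id": f"{prefix}-{n:07d}", "address": address}
        for n, address in enumerate(addresses, start=1)
    ]

# Two identical addresses still receive distinct identifiers
rows = add_ids([
    "10 Downing St, London",
    "10 Downing St, London",
    "Unter den Linden 1, Berlin",
])
```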
3. Understand how to call the OpenCage Geocoding API
We do our best to keep our geocoding API as simple as possible, so that things "just work" by default. Nevertheless, we also offer several optional parameters that may be useful for your specific situation. Take two minutes to read the list and decide if they might apply to your case.
As an example, if you know you only want results in Australia, adding the optional parameter countrycode=au to your request will let us know not to return non-Australian results.
Please see the various best practices for using our API.
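To make the idea of optional parameters concrete, here is a minimal sketch that only builds a request URL and does not send it. It assumes the v1 JSON endpoint and the optional parameters countrycode, limit and no_annotations; please check the API reference for the current list before relying on any of them. The helper build_request_url and the placeholder key are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint for JSON responses; confirm against the API reference
API_ENDPOINT = "https://api.opencagedata.com/geocode/v1/json"

def build_request_url(query, api_key, **optional):
    """Combine the required parameters (q, key) with any optional ones."""
    params = {"q": query, "key": api_key}
    params.update(optional)
    return f"{API_ENDPOINT}?{urlencode(params)}"

url = build_request_url(
    "Sydney Opera House",
    "YOUR-API-KEY",       # placeholder, not a real key
    countrycode="au",     # restrict results to Australia
    limit=1,              # assumed parameter: return only the best match
    no_annotations=1,     # assumed parameter: skip extra annotation data
)
```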
4. Clean your data
Make sure to remove any duplicates. If you are forward geocoding (address to coordinates) we have a detailed guide on steps you can take to clean up your queries, and thus give us the best possible chance to answer your requests correctly.
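Duplicate removal could look something like the sketch below. It treats two addresses as the same query when they differ only in letter case or spacing, which is an assumption you may want to tighten or loosen for your data. The function name deduplicate and the row layout (id plus address) are illustrative.

```python
def deduplicate(rows):
    """Group rows whose addresses match after normalising case and whitespace."""
    groups = {}
    for row in rows:
        key = " ".join(row["address"].lower().split())
        group = groups.setdefault(key, {"query": row["address"].strip(), "ids": []})
        group["ids"].append(row["id"])  # remember every record that shares this query
    return list(groups.values())

rows = [
    {"id": "a", "address": "10 Downing St, London"},
    {"id": "b", "address": " 10 downing st,  london "},
    {"id": "c", "address": "Unter den Linden 1, Berlin"},
]
unique_queries = deduplicate(rows)
```

Each distinct query is then sent once, and the result can be copied back to every identifier listed in its group.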
5. Use the right tool for calling the API
Just because a programming language is good for data analysis, that does not mean it is good for making millions of API requests.
Based on several years of experience our advice is to use a scripting language like Python, Perl, PHP, or Ruby for making your API requests. We have libraries for many different programming languages.
We advise AGAINST using a language like Stata or MATLAB for anything beyond very small datasets (around 10,000 records). These are great languages for evaluating data, but they are not good languages for requesting data. Our strong recommendation is to use a scripting language to query our API and store the data locally, where you can then use the language of your choice to evaluate it.
For Stata specifically, please see some of the common issues that come up.
6. Understand how long the process will take
There are several things you can do to speed up your geocoding. The main thing is to decide if it makes sense to structure your code so as to make requests in parallel or if you are fine just running things in series.
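A back-of-the-envelope estimate, plus a sketch of what running in parallel can look like, is shown below. The rate of 10 requests per second is purely illustrative, and geocode_stub is a hypothetical placeholder for the real API call. What request rate and how many simultaneous requests you can use depends on your plan.

```python
from concurrent.futures import ThreadPoolExecutor

def estimated_hours(total_requests, requests_per_second):
    """Rough running time, ignoring retries and network delays."""
    return total_requests / requests_per_second / 3600

def geocode_stub(query):
    """Hypothetical placeholder for the real API call."""
    return {"query": query, "status": 200}

def run_in_parallel(queries, workers=4):
    """Process queries with a small pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(geocode_stub, queries))

# 2 million requests at an assumed 10 requests per second: about 55.6 hours
hours = estimated_hours(2_000_000, 10)
results = run_in_parallel(["query one", "query two", "query three"])
```

If the estimate comes out at a few hours, running in series is usually simpler; if it comes out at days, the extra complexity of parallel requests may be worth it.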
7. Log any errors
Despite everyone's best efforts, any job that handles millions of data points will occasionally run into problems.
Make sure your code is robust. What happens when the internet connection is bad? What happens if we are unable to geocode your query?
The first step to dealing with such situations is to know how often they occur. Hopefully the number of problems is small enough to be negligible, but the only way to know is to log each failure. As a minimum we suggest you log the unique identifier of the data point (see step 2 above) and any non-success response codes the API returns.
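The sketch below shows one possible shape for such logging. It assumes each record is a dictionary with an id (see step 2) and an address, and that your own geocode function returns a status code and a result. The fake_geocode function only simulates failures for the example and is not how the real API behaves.

```python
import csv
import io

def process(rows, geocode, log_writer):
    """Geocode each row; log the id and the reason for every failure."""
    results = {}
    for row in rows:
        try:
            status, result = geocode(row["address"])
        except OSError as exc:  # network-level problems such as dropped connections
            log_writer.writerow([row["id"], "network", str(exc)])
            continue
        if status != 200:
            log_writer.writerow([row["id"], status, "non-success status"])
        elif result is None:
            log_writer.writerow([row["id"], status, "no result"])
        else:
            results[row["id"]] = result
    return results

def fake_geocode(address):
    """Simulated geocoder used only for this example."""
    if address == "offline":
        raise ConnectionError("connection dropped")
    if address == "nowhere":
        return 200, None
    if address == "too many":
        return 429, None
    return 200, (51.5, -0.12)

rows = [
    {"id": "rec-1", "address": "London"},
    {"id": "rec-2", "address": "offline"},
    {"id": "rec-3", "address": "nowhere"},
    {"id": "rec-4", "address": "too many"},
]
log_buffer = io.StringIO()  # a real run would write to a file instead
results = process(rows, fake_geocode, csv.writer(log_buffer))
log_lines = log_buffer.getvalue().splitlines()
```

Counting the lines in the log afterwards tells you at a glance whether the failure rate is negligible or needs attention.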
8. Run your script in the background
It's always best to run long-running processes in the background, or in such a way that they are not lost if the computer crashes or is mistakenly turned off. How to do that will depend on exactly which operating system and programming language you are using, but the important point is that long-running processes can be unexpectedly interrupted.
Following these steps greatly increases the chance that your project will run smoothly. If in doubt, feel free to ask us any questions you may have. We are here to help.
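Whatever mechanism you use to keep the script running, it can also help to make the script itself safe to restart. The sketch below is one possible approach, not a requirement: it writes each result to disk as soon as it arrives and skips identifiers that are already in the output file. The function names and the CSV layout are illustrative assumptions.

```python
import csv
import os
import tempfile

def load_done_ids(path):
    """Read the identifiers already present in the output file, if it exists."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[0] for row in csv.reader(handle) if row}

def run_resumable(rows, geocode, path):
    """Geocode only rows not yet in the output file; return how many were processed."""
    done = load_done_ids(path)
    processed = 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            if row["id"] in done:
                continue  # finished in an earlier run
            lat, lng = geocode(row["address"])
            writer.writerow([row["id"], lat, lng])
            handle.flush()  # hand each result to the operating system straight away
            processed += 1
    return processed

rows = [{"id": "rec-1", "address": "a"}, {"id": "rec-2", "address": "b"}]
output_path = os.path.join(tempfile.mkdtemp(), "results.csv")
first_run = run_resumable(rows, lambda address: (0.0, 0.0), output_path)
second_run = run_resumable(rows, lambda address: (0.0, 0.0), output_path)
```

Here the second call finds both identifiers in the file and does no further work, which is the behaviour you want after an unexpected interruption.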