The problem
You have a large (>1 million records) dataset you need to geocode.
Background
Many customers come to us with a large dataset they need geocoded as a one-off project.
Large volumes are no problem, but considering a few things before you start will help ensure the project runs smoothly.
Step-by-step guide to processing a dataset of several million locations
1. Test on a small subset of locations
Choose a small subset of your dataset, say 1,000 entries or so, to test
with and make sure everything is working well. Ideally you should choose
the subset randomly, so that it truly reflects the makeup of the full
dataset.
Use this smaller test dataset to get everything working, and move on to
the full dataset only once you're confident everything works.
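For example, if your data lives in a CSV file, pulling a random test subset might look like this minimal Python sketch (the filenames are placeholders):

    import csv
    import random

    # Assumed filenames -- adjust to match your project.
    with open("full_dataset.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # A random sample reflects the makeup of the full dataset better than,
    # say, the first 1,000 rows, which may all come from one region or source.
    sample = random.sample(rows, k=1000)

    with open("test_subset.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(sample)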
2. Add a unique identifier to each entry
Give each entry some sort of identifier. Don't just use the address to be
geocoded, as it may not be unique and your process may modify it.
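A simple running number is enough. As a rough Python sketch, continuing with the placeholder filenames from above:

    import csv

    with open("test_subset.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # A running number is sufficient; it just has to be unique and must
    # not change while you process the data.
    for i, row in enumerate(rows, start=1):
        row["id"] = i

    with open("test_subset_with_ids.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)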
3. Understand how to call the OpenCage Geocoding API
We do our best to keep our geocoding API as simple as possible, so that
things "just work" by default. Nevertheless, we also offer
several
optional parameters
that may be useful for your specific situation. Take two minutes to
read the list and decide if they might apply to your case.
As an example, if you know you only want results in Australia, adding
countrycode=au
tells us not to return non-Australian results.
Please see the various
best practices
for using our API.
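For instance, a single request restricted to Australia might look like this minimal Python sketch using the requests library (the API key and query are placeholders):

    import requests

    API_KEY = "YOUR-API-KEY"  # placeholder -- use your own key
    API_URL = "https://api.opencagedata.com/geocode/v1/json"

    params = {
        "q": "Sydney Opera House",  # the address or place to geocode
        "key": API_KEY,
        "countrycode": "au",  # only return results in Australia
        "limit": 1,           # we only want the best match
        "no_annotations": 1,  # smaller responses if you don't need annotations
    }
    data = requests.get(API_URL, params=params, timeout=10).json()
    if data["results"]:
        best = data["results"][0]
        print(best["formatted"], best["geometry"])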
4. Clean your data
Make sure to remove any duplicates.
If you are forward geocoding (address to coordinates), we have a
detailed guide on steps you can take to
clean up your queries,
and thus
give us the best possible chance to answer your requests correctly.
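Deduplication can be as simple as grouping entries by a normalised form of the address, so each distinct address is sent to the API only once. A sketch (the "address" and "id" column names are assumptions):

    import csv

    with open("test_subset_with_ids.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Group ids by a normalised form of the address: collapse whitespace
    # and lowercase, so trivially different duplicates fall together.
    by_address = {}
    for row in rows:
        key = " ".join(row["address"].split()).lower()
        by_address.setdefault(key, []).append(row["id"])

    print(f"{len(rows)} entries, {len(by_address)} distinct addresses to geocode")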
5. Use the right tool for calling the API
Just because a programming language is good for data analysis does not
mean it is good for making millions of API requests.
Based on several years of experience, our advice is to use a scripting
language like Python, Perl, PHP, or Ruby for making your API requests.
We have
libraries for many different programming languages.
We advise AGAINST using a language like Stata or MATLAB for
anything beyond very small (under 10,000 records) datasets.
These are great languages for evaluating data, but they are not good
languages for requesting data. Our strong recommendation is to use a
scripting language to query our API and store the data locally,
where you can then use the language of your choice to evaluate it.
If you are using Stata specifically, please see
some of the common issues
that come up.
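Putting it together, the query-and-store-locally approach might look like this Python sketch (filenames, column names, and the API key are placeholders):

    import csv
    import requests

    API_KEY = "YOUR-API-KEY"  # placeholder
    API_URL = "https://api.opencagedata.com/geocode/v1/json"

    with open("test_subset_with_ids.csv", newline="") as f, \
            open("results.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "lat", "lng", "confidence"])
        for row in csv.DictReader(f):
            params = {"q": row["address"], "key": API_KEY,
                      "limit": 1, "no_annotations": 1}
            data = requests.get(API_URL, params=params, timeout=10).json()
            if data["results"]:
                best = data["results"][0]
                writer.writerow([row["id"],
                                 best["geometry"]["lat"],
                                 best["geometry"]["lng"],
                                 best["confidence"]])
            else:
                writer.writerow([row["id"], "", "", ""])  # nothing found

Once the results are stored locally, you can load them into Stata, MATLAB, R, or whatever tool you prefer for the analysis itself.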
6. Understand how long the process will take
There are several things you can do to
speed up your geocoding.
The main decision is whether to structure your code to make requests in
parallel, or whether running them in series is fast enough for your
needs.
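If you do go parallel, a small thread pool is usually enough; keep it small enough to stay within the rate limit of your plan. A minimal sketch (the entry layout and API key are placeholders):

    import concurrent.futures
    import requests

    API_KEY = "YOUR-API-KEY"  # placeholder
    API_URL = "https://api.opencagedata.com/geocode/v1/json"

    def geocode(entry):
        # entry is assumed to be a dict with "id" and "address" keys
        params = {"q": entry["address"], "key": API_KEY,
                  "limit": 1, "no_annotations": 1}
        data = requests.get(API_URL, params=params, timeout=10).json()
        return entry["id"], data

    entries = [
        {"id": 1, "address": "Sydney Opera House"},
        {"id": 2, "address": "Bondi Beach, Sydney"},
    ]

    # A few worker threads are usually plenty; keep the pool small enough
    # to stay within the requests-per-second limit of your plan.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        for entry_id, data in pool.map(geocode, entries):
            print(entry_id, data["status"]["code"])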
7. Log any errors
Despite all well-intentioned efforts, any process that involves millions
of data points will have times when things don't work smoothly.
Make sure your code is robust. What happens when the internet connection
is bad? What happens if we are unable to geocode your query?
The first step to dealing with such situations is to know how often they
occur. Hopefully the number of problems is small enough to be negligible,
but the only way to know is to log each failure. As a minimum we suggest you
log the unique identifier of the data point (see step 2 above)
and any non-success
response codes
the API returns.
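As a sketch, failure logging around each request might look like this (the entry layout and filenames are assumptions; 402 and 429 are examples of non-success codes the API can return):

    import logging
    import requests

    logging.basicConfig(filename="geocoding_failures.log",
                        level=logging.WARNING)

    API_KEY = "YOUR-API-KEY"  # placeholder
    API_URL = "https://api.opencagedata.com/geocode/v1/json"

    def geocode_logged(entry):
        # entry is assumed to be a dict with "id" and "address" keys
        try:
            response = requests.get(API_URL, timeout=10,
                                    params={"q": entry["address"],
                                            "key": API_KEY, "limit": 1})
        except requests.RequestException as exc:
            # Network trouble: log the id so the entry can be retried later.
            logging.warning("id=%s network error: %s", entry["id"], exc)
            return None
        if response.status_code != 200:
            # e.g. 402 (quota exceeded) or 429 (too many requests)
            logging.warning("id=%s HTTP %s", entry["id"], response.status_code)
            return None
        data = response.json()
        if not data["results"]:
            # A valid response, but we could not geocode the query.
            logging.warning("id=%s no result for %r",
                            entry["id"], entry["address"])
            return None
        return data["results"][0]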
8. Run your script in the background
It's always best to run long-running processes in the background or in
such a way that they are not lost if the computer crashes or is mistakenly
turned off. How to do that will depend on exactly which operating system
and programming language you are using, but the key point is to
remember that long-running processes can be unexpectedly interrupted.
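A useful complement on the code side is to write results incrementally and make the script resumable, so an interrupted run can pick up where it left off. A rough Python sketch, reusing the placeholder filenames from the earlier steps:

    import csv
    import os

    IN_FILE = "test_subset_with_ids.csv"  # assumed input from step 2
    OUT_FILE = "results.csv"              # assumed output from step 5

    # Collect the ids a previous (possibly interrupted) run already wrote.
    done = set()
    if os.path.exists(OUT_FILE):
        with open(OUT_FILE, newline="") as f:
            done = {row["id"] for row in csv.DictReader(f)}

    with open(IN_FILE, newline="") as f, open(OUT_FILE, "a", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["id", "lat", "lng"])
        if not done:
            writer.writeheader()
        for row in csv.DictReader(f):
            if row["id"] in done:
                continue  # already geocoded in a previous run
            result = {"id": row["id"], "lat": "", "lng": ""}
            # Fill in lat/lng by calling the API as in the earlier sketches.
            writer.writerow(result)
            out.flush()  # flush each row so a crash loses at most one entry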
Final thoughts
By following these steps, the chances that your project will run smoothly increase greatly. If in doubt, feel free to ask us any questions you may have. We are here to help.