Introduction to OpenStreetMap: How to leverage Open Data and ensure “High-Quality” Data.

May 25, 2023

In this article, we explore how to use OpenStreetMap to extract geospatial data and how to improve the quality of that data. We cover the key steps of data extraction from OpenStreetMap: selecting the area of interest, collecting the data, cleaning it, and preparing it for future use.

Introduction

Data is generated everywhere and all the time, whenever you purchase your dinner, book a train ticket or visit a website. Most of the time, this data is held by companies or institutions that do not share it. Data is said to be open when those actors decide to make it publicly available and reusable without legal or business restrictions. Open data is more often shared by governmental institutions than by companies. They expect that the data made available will help drive value creation and GDP: with open data, businesses can create more products and services, more rapidly than before, while saving costs. This in turn drives more business, creates new jobs and increases GDP. Open data also helps improve decision making, whether for institutions or companies.

The market size of open data is constantly growing. The main domains for which open data is released are geography and meteorology, economy and business, traffic and transportation, and social information.

Some key figures that highlight the importance of Open Data include:

· France has over 350 000 open data sets, the USA (according to data.gov) more than 250 000.

· An estimated €1.7 billion in cost savings for EU public administrations.

· Reduction of corruption by 10 percent.

(To better understand the impact of using open data, you can visit Open Data Impact.)

Presentation of OpenStreetMap

Project presentation

OpenStreetMap (OSM) is a collaborative online mapping platform that has become a go-to resource for developers and businesses around the world, including companies such as Amazon and Uber. Launched in 2004, OSM is a free, open-source alternative to proprietary mapping services like Google Maps or Bing Maps. With over 9 billion elements, including roads, buildings, and points of interest, it’s a massive repository of geographical information that is constantly growing. In fact, as of 2021, OSM contains more than 95 million kilometers of mapped roads, making it the largest open-source mapping project in the world. To put that in perspective, that’s enough road length to circle the Earth more than 2300 times! OSM works in the same way as Wikipedia: a global community of volunteers, along with private companies, adds and modifies information on the platform in real time.

Overpass API

Overpass API keeps a copy of the main OSM database and provides it for searching on the Web. It frequently updates the data by observing the changes happening in the OpenStreetMap database. The Overpass API code is open. Therefore, anyone can hold a read-only, searchable and updated version of the OpenStreetMap database, if they have sufficient hardware.

We will not enter the details of the OpenStreetMap data model. You can refer to the documentation for that. We provide a simple description of the main OpenStreetMap concepts: tags, objects (node, way, relation) and areas.

· Tags consist of key/value pairs and carry the semantic data of OpenStreetMap, for instance name: “Gare de Lyon”, railway: “station”. They can be either classifying (restricted values, for example the key railway can only take specific values) or describing (fixed key, any value, for example the name). An OpenStreetMap object cannot have the same key twice.

· A node consists of an id, coordinates and tags. It is the only OSM object having coordinates.

· A way consists of an id, a sequence of nodes and tags. The sequence of nodes creates geometry (by linking nodes). Ways can be connected if they share a common node. The same node can appear several times in the sequence.

· Relations are more complex than nodes and ways. They were created to represent turn restrictions, borders etc. They are less relevant to our study.

· Areas are not OSM objects but can be derived from closed ways and relations.
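To make these concepts concrete, here is a minimal Python sketch of nodes and ways. The class names, field names, ids and coordinates are illustrative assumptions, not the OSM schema itself. Storing tags in a dictionary mirrors the rule that an object cannot hold the same key twice.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A node: an id, coordinates and tags."""
    id: int
    lat: float
    lon: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """A way: an id, an ordered sequence of node ids and tags."""
    id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)

    def is_closed(self) -> bool:
        # A way loops back on itself when its first and last node are the same.
        return len(self.node_ids) > 2 and self.node_ids[0] == self.node_ids[-1]


# Hypothetical ids and approximate coordinates, for illustration only.
station = Node(id=1, lat=48.8443, lon=2.3744,
               tags={"name": "Gare de Lyon", "railway": "station"})
building_outline = Way(id=2, node_ids=[10, 11, 12, 10], tags={"building": "yes"})
street_segment = Way(id=3, node_ids=[10, 11, 12], tags={"highway": "residential"})
```

Only the closed way (`building_outline`) could give rise to an area; the open one describes a line.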

Overpass API provides a search language (OverpassQL) to query OSM objects. It works like any other web API: to get OSM data, you send an HTTP request containing a query written in OverpassQL. For more details on this language, please refer to the Overpass API documentation. The language can be a bit disorienting, even if you are used to traditional query languages. Fortunately, the Overpass turbo website lets you test Overpass API queries and see the results on an interactive map. We strongly recommend trying your queries on a small geographical area using Overpass turbo before running them at scale from your scripts.
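As a rough sketch of what such a request can look like from a script, the Python snippet below wraps a query in an HTTP POST using only the standard library. The endpoint URL is an assumption (a commonly used public instance; the article does not name one), and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a public Overpass instance; swap in your own endpoint if needed.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_overpass_request(query: str, url: str = OVERPASS_URL) -> urllib.request.Request:
    """Wrap an OverpassQL query in a POST request, sent as the form field `data`."""
    body = urllib.parse.urlencode({"data": query}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def run_overpass_query(query: str, timeout: float = 60.0) -> dict:
    """Send the query and decode the answer (needs network access).

    The query must start with [out:json]; for the response to be JSON.
    """
    request = build_overpass_request(query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes it easy to inspect or log what is sent before running a large extraction.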

Build and enrich restaurants map data using OSM

Use case description

To illustrate our study, we build a map database of restaurants in Ile-de-France (the French region to which Paris belongs). You could try to scrape popular booking or review websites. However, using OpenStreetMap is much simpler if you want to avoid some painful web scraping work.

Restaurant data changes frequently. We expect the periodic refresh of the OpenStreetMap database to capture newly opened restaurants, recently closed ones etc.

Data collection

We rely on the tag system (see above for more information) to perform the data extraction. To determine which tags to use, websites like TagInfo gather comprehensive data about tags and present statistics that help identify the commonly used ones. For instance, we can see that restaurants are labeled with the tag “amenity=restaurant”.

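Using Overpass and the tag identified above, the following query collects all restaurants within an area. It is shown here for Paris; the same query covers the whole of Ile-de-France once the area filter targets the region instead: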

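[out:json];
// Resolve the search area by name
area["name"="Paris"]->.searchArea;
// Collect restaurants mapped as nodes, ways or relations
(
  node["amenity"="restaurant"](area.searchArea);
  way["amenity"="restaurant"](area.searchArea);
  rel["amenity"="restaurant"](area.searchArea);
);
// Return each element, with a center point for ways and relations
out center;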

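In the JSON output, each node is represented in the following format. With “out center”, ways and relations carry their coordinates in a nested “center” object instead of top-level “lat” and “lon” fields: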

{
    "type": <type>,
    "id": <id>,
    "lat": <lat>,
    "lon": <lon>,
    "tags": {
        "amenity": "restaurant",
        "name": <name>,
        ...
    }
}

*Results of the query on Overpass Turbo (Paris restaurants)*
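A small Python sketch of how the returned elements could be flattened into uniform records is given below. The output field names (`osm_type`, `osm_id`) and the sample response are hypothetical; the only assumption about the Overpass output is that nodes expose “lat” and “lon” directly while ways and relations returned with “out center” expose them under “center”.

```python
def element_to_record(element: dict) -> dict:
    """Flatten one Overpass JSON element into a record with a single position."""
    # Nodes carry lat/lon at the top level; ways and relations carry them
    # in the nested "center" object produced by `out center`.
    position = element if element["type"] == "node" else element.get("center", {})
    tags = element.get("tags", {})
    return {
        "osm_type": element["type"],
        "osm_id": element["id"],
        "lat": position.get("lat"),
        "lon": position.get("lon"),
        "name": tags.get("name"),
        "tags": tags,
    }


# Made-up response for illustration; real ids, names and coordinates will differ.
sample_response = {
    "elements": [
        {"type": "node", "id": 101, "lat": 48.85, "lon": 2.35,
         "tags": {"amenity": "restaurant", "name": "Chez Exemple"}},
        {"type": "way", "id": 202, "center": {"lat": 48.86, "lon": 2.34},
         "nodes": [1, 2, 3, 1], "tags": {"amenity": "restaurant"}},
    ]
}

records = [element_to_record(element) for element in sample_response["elements"]]
```

Records without a name keep `None` in that field, which makes missing information easy to count later.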

Data Quality improvement

Raw data quality assessment

The extracted restaurants carry around 450 distinct features (tag keys), and reading through all of them would be painful. Most of these features appear on only 10 to 100 restaurants. To gain some insight into them without reading each one, we apply clustering methods to the feature labels.

We proceed in several steps:

· Computing the embeddings of the feature labels using popular NLP models (transformers).

· Computing a distance matrix (pairwise cosine distance between feature embeddings).

· Getting clusters using hierarchical clustering.
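The last two steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the 2-D vectors are toy stand-ins for transformer embeddings, the feature labels are examples, and the average linkage and the 0.3 distance threshold are arbitrary choices.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def distance_matrix(vectors):
    """Pairwise cosine distances between all vectors."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


def hierarchical_clusters(dist, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per item and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy embeddings standing in for the output of an NLP model.
labels = ["addr:street", "addr:city", "opening_hours", "opening_hours:covid19"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

clusters = hierarchical_clusters(distance_matrix(embeddings), threshold=0.3)
named_clusters = [[labels[i] for i in cluster] for cluster in clusters]
```

With real embeddings, a library implementation of hierarchical clustering would be the more practical choice; the loop above only shows the mechanics.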

The quality of the clusters was assessed using the silhouette score and the sum of squared errors. There are probably better clustering methods. However, our objective is to get a quick overview of the features, not to produce a clustering benchmark, and for that purpose hierarchical clustering gave us satisfying results. The results can be visualized below.

*9 out of the 19 identified clusters using Hierarchical Clustering*

Data cleansing: redundancy removal

OpenStreetMap is a collaborative tool, so we need to check whether the same restaurant appears several times in the dataset.

In our case, we extracted nodes and ways from the OpenStreetMap database. A way often defines a polyline or polygon and therefore contains a sequence of nodes. Since some nodes are included in ways and we want to merge nodes and ways into the same structure, we need to remove the possible redundancy. We cannot keep only nodes or only ways, because restaurant tags can be found on both element types.

The solution we chose is to remove nodes that are contained in ways. This removes 12 735 nodes (for 1 205 ways). After merging the remaining ways and nodes, we have 15 382 identified restaurants. Official sources agree that there are a little more than 18 000 restaurants in Ile-de-France. This figure includes fast food outlets, which OpenStreetMap tags separately, so 15 382 is a plausible estimate of the number of restaurants around Paris.
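One possible reading of this rule, sketched in Python below, is to drop every node whose id is referenced by one of the extracted ways. This is an assumption about how “contained” is defined (id membership rather than a spatial test), and the function name and sample elements are hypothetical. It relies on ways in the Overpass JSON output listing their node ids under “nodes”.

```python
def drop_nodes_contained_in_ways(elements):
    """Keep all ways, and only the nodes that no way references by id."""
    ways = [e for e in elements if e["type"] == "way"]
    referenced_ids = {node_id for way in ways for node_id in way.get("nodes", [])}
    nodes = [
        e for e in elements
        if e["type"] == "node" and e["id"] not in referenced_ids
    ]
    return nodes + ways


# Made-up elements: node 2 belongs to way 100, node 9 stands alone.
sample_elements = [
    {"type": "node", "id": 2, "lat": 48.85, "lon": 2.35},
    {"type": "node", "id": 9, "lat": 48.90, "lon": 2.30,
     "tags": {"amenity": "restaurant"}},
    {"type": "way", "id": 100, "nodes": [1, 2, 3, 1],
     "tags": {"amenity": "restaurant"}},
]

deduplicated = drop_nodes_contained_in_ways(sample_elements)
```

The set lookup keeps the filter linear in the number of elements, which matters little here but helps on larger regions.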

Data augmentation: enriching addresses

Reverse geocoding is the process of translating geographic coordinates into a readable address format. It can be a valuable tool for enriching address data, providing additional information such as zip codes, neighborhoods, and cities, which can improve the quality of analysis and predictive models.

In our case, we use the API Adresse of data.gouv.fr: for each restaurant, we query it with the coordinates and use the returned address to fill in the missing address information.
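A minimal Python sketch of such a call is shown below. The endpoint URL and the response fields read here (“label”, “postcode”, “city” inside a GeoJSON feature) are assumptions based on the public API Adresse service and should be checked against its current documentation; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public reverse-geocoding endpoint of the API Adresse.
API_ADRESSE_REVERSE_URL = "https://api-adresse.data.gouv.fr/reverse/"


def build_reverse_url(lat: float, lon: float) -> str:
    """Build the reverse-geocoding URL for one pair of coordinates."""
    return API_ADRESSE_REVERSE_URL + "?" + urllib.parse.urlencode({"lon": lon, "lat": lat})


def extract_address(geojson: dict):
    """Return the best match as a small dict, or None when the result is empty."""
    features = geojson.get("features", [])
    if not features:
        return None  # the API found no address near these coordinates
    properties = features[0].get("properties", {})
    return {key: properties.get(key) for key in ("label", "postcode", "city")}


def reverse_geocode(lat: float, lon: float, timeout: float = 10.0):
    """Query the API and parse its answer (needs network access)."""
    with urllib.request.urlopen(build_reverse_url(lat, lon), timeout=timeout) as response:
        return extract_address(json.load(response))
```

Returning `None` for an empty result makes failed enrichments explicit, so they can be counted and inspected afterwards.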

Results

Out of 15 382 restaurants, we managed to enrich most addresses (see charts below). When enrichment failed, the cause was an empty result from the API Adresse. After looking at some examples, we found that the information about those restaurants was very scarce and that their addresses were either unavailable or inaccurate. Thankfully, these restaurants only amount to 0.5% of the total restaurants retrieved.

Conclusion

OpenStreetMap is a powerful, easy-to-use and open-source tool. It does not require much setup thanks to the Overpass API and the associated public endpoints. This article aims to provide a hands-on introduction to OSM and how you can use it in your future projects.

In addition to point information, OSM offers a multitude of detailed geographic data such as public transport routes, buildings, waterways and parks. This data can be used for a variety of use cases such as urban planning, accessibility analysis and many others.

At Sia Partners, we work extensively with Open Data because we think that your business can get a lot of value from it. If you want to know more about Open Data or our studies, check out our article, “Open data in France, an opportunity for fine-scale prediction”, to explore the world of Open Data in France, and please visit the Heka website for more content.