GeoFile: How to Transform OpenStreetMap Data into GeoJSON Using GDAL

GeoFile is a series dedicated to looking at geographical data, its features, and uses. In today's article, we're going to show you how to convert OSM data to GeoJSON and import it into a Compose for MongoDB deployment.

In the last GeoFile article, we looked at what OpenStreetMap (OSM) is and how to import data from it into a Compose PostgreSQL deployment. We also touched on querying that data via the hstore column, which stored supplemental, non-standardized data like amenities, cuisines, and place descriptions. Using only the command line tool osm2pgsql, we could import OSM data, set up an indexed hstore column, and get tables created from OSM's map layers that we could then query. Unfortunately, it's not that easy if we want to convert OSM data to GeoJSON.

In this article, we'll look at converting OSM data into usable GeoJSON and importing it into Compose for MongoDB. What we mean by usable will become clear below, but it requires some setup using the command line tool ogr2ogr: specifying the keys we want in an OSM_CONFIG_FILE used by the tool, and converting only the OSM layer we want to run queries on. We'll take you through the process step by step, so let's get started ...

Setting things up

Ogr2ogr is a command line tool within the Geospatial Data Abstraction Library (GDAL) for converting geospatial data between formats, with options to reproject coordinates, trim attributes, and a whole host of advanced customizations.

To use the tool, you'll first have to download the source or the binaries. If you're on a Mac, we prefer installing GDAL from Homebrew by running brew install gdal in the terminal.

Once that's installed, let's head over to OSM to select the dataset we're going to use. On the OSM website, select the area of North Los Angeles, California, or whichever area you prefer. It really doesn't matter which area you select for this tutorial since we'll only be using keys that are common across OSM. You have the option of exporting the map visible in your browser's window or manually selecting an area. We've manually selected the area as shown below.

Once your area has been selected, we'll have to use one of OSM's alternative export sources since our map exceeds the number of nodes we can export directly. Click the Overpass API link, which allows more nodes and will automatically start downloading a file of OSM data named map. Depending on your selection, the file may take a while to download because of its size - ours was 1.28 GB. Once the file has downloaded, you'll have to add the .osm file extension to it.
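The rename is a one-liner, assuming the export landed in your current directory under the default name map:

```shell
# The Overpass API export arrives as a bare file named "map";
# give it the .osm extension so GDAL's OSM driver will recognize it.
mv map map.osm
```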

So, now that we have GDAL and ogr2ogr installed and the OSM data downloaded, we can transform it to GeoJSON.

Transforming OSM to GeoJSON

Using ogr2ogr, we now have the ability to transform our OSM data into something MongoDB can use. There are some hiccups that you might run into, however, and we'll show you how to get around some of the common ones.

When running ogr2ogr, we have to specify the type of data we're converting to, the file we want our transformed data to be saved into, and the OSM file that has all the data.

ogr2ogr -f GeoJSON map.geojson map.osm  

It's as simple as that. But this simplicity has a hiccup: running the command will give you the following error:

ERROR 6: GeoJSON driver doesn't support creating more than one layer  

The GeoJSON driver can't write more than one layer into a single file, so it won't automatically create separate sets of data for the various OSM layers; you have to specify an individual layer. To view the layers that are available, run ogrinfo map.osm. This will give you something like the following:

Had to open data source read-only.  
INFO: Open of `map.osm'  
      using driver `OSM' successful.
1: points (Point)  
2: lines (Line String)  
3: multilinestrings (Multi Line String)  
4: multipolygons (Multi Polygon)  
5: other_relations (Geometry Collection)  

For now, we'll just transform the points layer, which holds the points of interest from the OSM map. The tool will automatically convert OSM's spatial coordinates to CRS84, the coordinate reference system used by GeoJSON, so we won't have to worry about changing the coordinate system later. Now, we'll run the ogr2ogr command again, but specify the points layer like:

ogr2ogr -f GeoJSON map.geojson map.osm points  

After running the command, we'll have a file called map.geojson containing a GeoJSON FeatureCollection with an array of features holding our points of interest and their coordinates. It's about 6.6 MB - much less than the 1.28 GB we started with.

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "properties": { "osm_id": "10537875", "name": null, "barrier": null, "highway": "traffic_signals", "ref": null, "address": null, "is_in": null, "place": null, "man_made": null, "other_tags": null }, "geometry": { "type": "Point", "coordinates": [ -118.2897534, 34.1710594 ] } },
...
{ "type": "Feature", "properties": { "osm_id": "4856744809", "name": null, "barrier": null, "highway": "traffic_signals", "ref": null, "address": null, "is_in": null, "place": null, "man_made": null, "other_tags": "\"traffic_signals\"=>\"signal\"" }, "geometry": { "type": "Point", "coordinates": [ -118.4736755, 34.2355165 ] } }
]
}

Depending on what you're looking for, you may not need to transform the data further and can import it into MongoDB now. However, non-standardized data like amenities, traffic signals, etc. are not automatically included in the feature properties. For example, in the sample above, look at the other_tags key.

"other_tags": "\"traffic_signals\"=>\"signal\""

It contains a string with the key traffic_signals and the value signal. We could put this into MongoDB as-is, but if we wanted to search within other_tags, we'd have to run a text search, and we could only index the other_tags key itself, not traffic_signals. This could end up consuming a lot of resources. We can avoid all that by modifying ogr2ogr's OSM_CONFIG_FILE, the configuration file for the tool's OSM driver, so that the keys we need appear directly in the feature properties.

If you're on a Mac and installed GDAL using Homebrew, the OSM driver's configuration file osmconf.ini will typically live under your Homebrew prefix, for example in /usr/local/share/gdal. Opening this file, you'll see the following:

#
# Configuration file for OSM import
#

# put here the name of keys for ways that are assumed to be polygons if they are closed
# see http://wiki.openstreetmap.org/wiki/Map_Features
closed_ways_are_polygons=aeroway,amenity,boundary,building,craft,geological,historic,landuse,leisure,military,natural,office,place,shop,sport,tourism  
...

The part we're most interested in is the section starting with [points] since we're transforming the points layer from OSM to GeoJSON. The configuration for that will have the following:

[points]
# common attributes
osm_id=yes  
osm_version=no  
osm_timestamp=no  
osm_uid=no  
osm_user=no  
osm_changeset=no

# keys to report as OGR fields
attributes=name,barrier,highway,ref,address,is_in,place,man_made  
# keys that, alone, are not significant enough to report a node as a OGR point
unsignificant=created_by,converted_by,source,time,ele,attribution  
# keys that should NOT be reported in the "other_tags" field
ignore=created_by,converted_by,source,time,ele,note,openGeoDB:,fixme,FIXME  
# uncomment to avoid creation of "other_tags" field
#other_tags=no
# uncomment to create "all_tags" field. "all_tags" and "other_tags" are exclusive
#all_tags=yes

The attributes line under "keys to report as OGR fields" is where we can add fields that will appear as keys in the feature properties. Since we'll be looking for amenities in North Los Angeles, we'll need the amenity keys and values. Additionally, we'll need the cuisine keys and values, because when we look for restaurants we'll want to know the cuisine. Right now, these keys are buried in the other_tags string.

A word of caution, however: you shouldn't modify the original OSM driver configuration file unless you really need to. Instead, we can set up our own custom configuration file and tell ogr2ogr to use that.

To set up a custom OSM_CONFIG_FILE, copy the contents of the original osmconf.ini into a new file saved somewhere on your system. You can also grab the contents of the file linked from the OSM driver's webpage. We'll save the contents into a file called customOSMconfig.ini.
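As a minimal sketch of that copy step, assuming a Homebrew install where osmconf.ini sits under the Homebrew prefix (adjust the source path to wherever the file actually lives on your system):

```shell
# Copy the stock OSM driver configuration to a working copy we can edit freely.
# The source path is an assumption; locate yours with:
#   find "$(brew --prefix)" -name osmconf.ini
cp "$(brew --prefix)/share/gdal/osmconf.ini" ~/customOSMconfig.ini
```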

In that file, the only part we'll change is under [points] in the part that says attributes. We'll modify it to look like:

# keys to report as OGR fields
attributes=name,ref,address,amenity,cuisine  

These are the only attributes we'll need for our purposes. If we needed more, or wanted to keep the original attributes, we could simply append whatever else we need.

Now that the customized configuration file has been saved, we can transform the points layer again and see what our GeoJSON looks like. First, we have to tell ogr2ogr where to find the file: if we set a global OSM_CONFIG_FILE environment variable, GDAL will automatically use it as the OSM driver configuration file. (GDAL tools also accept the setting per invocation via --config OSM_CONFIG_FILE /path/to/file.)

export OSM_CONFIG_FILE=/path/to/file/customOSMconfig.ini  

With that in place, we can run ogr2ogr again:

ogr2ogr -f GeoJSON map.geojson map.osm points  

Now, if we look at the output, we get a roughly 5 MB file with only the feature properties we added in our configuration file:

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "properties": { "osm_id": "10537875", "name": null, "ref": null, "address": null, "amenity": null, "cuisine": null, "other_tags": "\"highway\"=>\"traffic_signals\"" }, "geometry": { "type": "Point", "coordinates": [ -118.2897534, 34.1710594 ] } },
...

Running ogrinfo on our GeoJSON file will give us a much clearer overview of the keys it contains. We'll use the -sql flag to select the first point by its osm_id.

ogrinfo map.geojson -sql "SELECT * FROM OGRGeoJSON WHERE osm_id = '10537875'"

Layer name: OGRGeoJSON  
Geometry: Point  
Feature Count: 1  
Extent: (-118.289753, 34.171059) - (-118.289753, 34.171059)  
Layer SRS WKT:  
GEOGCS["WGS 84",  
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        TOWGS84[0,0,0,0,0,0,0],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.0174532925199433,
        AUTHORITY["EPSG","9108"]],
    AUTHORITY["EPSG","4326"]]
Geometry Column = _ogr_geometry_  
osm_id: String (0.0)  
name: String (0.0)  
ref: String (0.0)  
address: String (0.0)  
amenity: String (0.0)  
cuisine: String (0.0)  
other_tags: String (0.0)  
OGRFeature(OGRGeoJSON):0  
  osm_id (String) = 10537875
  name (String) = (null)
  ref (String) = (null)
  address (String) = (null)
  amenity (String) = (null)
  cuisine (String) = (null)
  other_tags (String) = "highway"=>"traffic_signals"
  POINT (-118.2897534 34.1710594)

If we want to see all the types of amenities that our GeoJSON data has, we could query ogrinfo again using SQL like:

ogrinfo map.geojson -sql "SELECT DISTINCT amenity FROM OGRGeoJSON"

Layer name: OGRGeoJSON  
OGRFeature(OGRGeoJSON):0  
  amenity (String) = restaurant

OGRFeature(OGRGeoJSON):1  
  amenity (String) = (null)

OGRFeature(OGRGeoJSON):2  
  amenity (String) = fast_food

OGRFeature(OGRGeoJSON):3  
  amenity (String) = parking
...
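If you just want the number of distinct amenities rather than the full listing, OGR's SQL dialect also supports aggregate functions; a sketch against the same file:

```shell
# -q suppresses the layer metadata so only the query result is printed.
ogrinfo -q map.geojson -sql "SELECT COUNT(DISTINCT amenity) FROM OGRGeoJSON"
```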

This will give us 80 amenities found within our GeoJSON file. There's a lot we could find out just by running SQL queries with ogrinfo. But let's see if we can make our GeoJSON file smaller by keeping only names, amenities, and cuisines.

ogr2ogr -f GeoJSON map_optimized.geojson map.geojson -select name,amenity,cuisine  

This takes our 5 MB GeoJSON file down to 2.7 MB and keeps our 17,265 points intact.

...
"features": [
{ "type": "Feature", "properties": { "name": null, "amenity": null, "cuisine": null }, "geometry": { "type": "Point", "coordinates": [ -118.2897534, 34.1710594 ] } },
{ "type": "Feature", "properties": { "name": null, "amenity": null, "cuisine": null }, "geometry": { "type": "Point", "coordinates": [ -118.3217314, 34.184282 ] } },
...

We can optimize our data further using SQL by selecting only the documents where amenity or cuisine has at least one value. The reason is that amenities that are restaurants sometimes lack the cuisine tag, and vice versa; so rather than requiring both keys to have values, we keep any document where at least one of them does. There's a caveat here as well: some places (take a name like Starbucks, for example) don't always have amenity or cuisine tags at all. This could be an issue if we wanted very accurate results, because we'd have to locate them by name rather than by these tags.

ogr2ogr -f GeoJSON map_no_nulls.geojson map_optimized.geojson -sql "SELECT * FROM OGRGeoJSON WHERE amenity IS NOT NULL OR cuisine IS NOT NULL"  

That takes us from 2.7 MB down to 329 KB and 1,765 points, which look like:

...
"features": [
{ "type": "Feature", "properties": { "name": "Hometown Buffet", "amenity": "restaurant", "cuisine": "american" }, "geometry": { "type": "Point", "coordinates": [ -118.3305106, 34.190943 ] } },
{ "type": "Feature", "properties": { "name": "McDonald's", "amenity": "fast_food", "cuisine": "burger" }, "geometry": { "type": "Point", "coordinates": [ -118.6031003, 34.1807258 ] } },
...

Since our GeoJSON file has been filtered down to only include the data we need, let's import our GeoJSON into Compose for MongoDB ...

Importing to Compose for MongoDB

Importing our data is now much easier to manage since we've filtered it down significantly. With a smaller file, we won't have as much overhead when indexing or querying over data we're not interested in. To import the data, you can either use mongoimport or Studio 3T, which we recently reviewed.

As it stands, to import the GeoJSON feature collection into MongoDB, you'd have to modify the file and keep only the documents in the features array. But why fiddle with your data manually if you don't have to? Instead, we can use the command line JSON processor jq to select the documents in the features array and pipe them into Compose for MongoDB using mongoimport.
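Once you have jq installed, it's worth previewing what that selection will hand to mongoimport before running the import; for example:

```shell
# How many documents will be piped to mongoimport?
jq '.features | length' map_no_nulls.geojson
# Peek at the first document in the stream.
jq '.features[0]' map_no_nulls.geojson
```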

You can download and install jq for your platform, but if you're on macOS and using Homebrew, just use brew install jq. Once it's installed, we just have to use the tool with mongoimport like:

jq -c '.features[]' map_no_nulls.geojson | mongoimport --host aws-us-west-2-portal.2.dblayer.com --port 11111 --db los_angeles --collection amenities --ssl --sslAllowInvalidCertificates -u user -p mypass --drop

And now you will have your processed JSON data inside your Compose for MongoDB database to start querying.
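With the documents imported, a 2dsphere index on each document's geometry field is what makes MongoDB's geospatial operators like $near and $geoWithin usable. A sketch from the mongo shell, reusing the placeholder host and credentials from the import command above:

```shell
# Create a 2dsphere index so geospatial queries can use the GeoJSON geometry.
mongo --host aws-us-west-2-portal.2.dblayer.com --port 11111 \
      --ssl --sslAllowInvalidCertificates -u user -p mypass \
      los_angeles --eval 'db.amenities.createIndex({ geometry: "2dsphere" })'
```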

That's GDAL folks

We've taken you on a tour of how to use GDAL's ogr2ogr command line tool to transform and filter OSM data into usable, manageable GeoJSON. GDAL has other powerful tools for transforming other kinds of geospatial data into formats that various GIS applications and databases can use. We've only scratched the surface of ogr2ogr, but it gives you the ability to convert data between many formats, and the tool has other features that may be useful for your application. Next time, we'll take the data we got from North Los Angeles and see how we can use Compose for MongoDB and Mapbox together to make an interactive map of the area.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

Image attribution: Andrew Neel

Abdullah Alger
Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA.
