Geocoding in NYC from R

Edit 12 Feb 2018: Thanks to twitter I learned about being able to call GeoSearch via the rmapzen R package; see the end of this post for more info.

Geocoding

In GIS and various other fields, Geocoding is a common operation. Not familar with Geocoding? Check out this great definition from Google:

Geocoding is the process of converting addresses (like a street address) into geographic coordinates (like latitude and longitude), which you can use to place markers on a map, or position the map.

Reverse geocoding is the process of converting geographic coordinates into a human-readable address.

There are lots of services that functionally do the queries and spatial overlays to get users the information they need. Lots are free, or free up to a certain number of queries, or depending on the type of use (e.g., check out the Google geocoding API, there are OpenStreetMap tools (e.g., Nominatim, and there is a MapQuest Geocoding API). A tricky thing is that these often serve out the best approximations and aggregations of data for large areas, whether it be globally, nationwide, etc.

An opensource geocoding project called Pelias had been developed by the former mapping and navigation company, Mapzen, which was dedicated to opendata and opensource tools. Unfortunately the company shut down recently. However, because their tools were open source, others could pick them up and deploy them fairly readily (realizing it’s often not trivial, but if staff/expertise and infrastructure exist, it’s doable). Thus, the NYC Planning Labs contingent of the NYC Department of City Planning spun up an instance of Pelias, using official NYC data, currently in Beta, called GeoSearch. (Note: still being in Beta, at the time of writing, GeoSearch is noted as not yet supported by the NYC Department of City Planning)

Being in NYC, sometimes it’s helpful to have an accurate geocoding service. It’s even better if I can use it from tools I already leverage, such as R. Thus, I was curious to see if I could use the NYC GeoSearch tool from R. Here’s my coarse and quick attempt…

Calling on NYC GeoSearch through with R

Spatial tools in R have come a long way in recent years. For this work, since it deals with vector data (points), I went right for the sf package. This has been developed pretty recently (i.e., in the last few years) - if you’re using sp, it’s worth a look at sf for additional speed and functionality. If you haven’t used sf it before, you can install it as you would any other package install.packages("sf"). Then you can load the package via library(sf), or call the package explicitly (i.e., sf::[function_name]) as I’ve doen below.

The function st_read lets you easily read in GeoJSON data, which is what GeoSearch returns. Here’s an initial test, with the address for the Museum of the City of New York, just as a place I’ve been meaning to check out. I just followed the format from an example URL on the API Docs page. As you’ll see, it throws an error.

sf::st_read("https://geosearch.planninglabs.nyc/v1/search?text=1220 Fifth Ave, New York, NY")

#GDAL Error 1: HTTP error code : 502Cannot open data source https://geosearch.planninglabs.nyc/v1/search?text=1220 Fifth Ave, New York, NY

I remembered from some prevoius work that that spaces can cause problems when calling URLs from R. I can’t recall how I came to this conclusion, but ultimately, spaces had to be replaced with ‘%20’ per ASCII encoding. Thus, this works:

sf::st_read("https://geosearch.planninglabs.nyc/v1/search?text=1220%20Fifth%20Ave,%20New%20York,%20NY")
## Reading layer `OGRGeoJSON' from data source `https://geosearch.planninglabs.nyc/v1/search?text=1220%20Fifth%20Ave,%20New%20York,%20NY' using driver `GeoJSON'
## Simple feature collection with 1 feature and 32 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -73.95187 ymin: 40.79249 xmax: -73.95187 ymax: 40.79249
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

We can even plot it on a quick interactive map using the mapview package. If you click on the point, you’ll see all of the tabular information that comes with the point location. Some things are related to the geocoder/geocoding process, while things prefaced by ‘pad_’ are from the Property Address Directory. For example, ‘pad_bbl’ is the Borough, Block and Lot number for the property, and ‘pad_bin’ is the Building ID Number.

mapview::mapview(sf::st_read("https://geosearch.planninglabs.nyc/v1/search?text=1220%20Fifth%20Ave,%20New%20York,%20NY"))
## Reading layer `OGRGeoJSON' from data source `https://geosearch.planninglabs.nyc/v1/search?text=1220%20Fifth%20Ave,%20New%20York,%20NY' using driver `GeoJSON'
## Simple feature collection with 1 feature and 32 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -73.95187 ymin: 40.79249 xmax: -73.95187 ymax: 40.79249
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

If we can automate replacement of spaces with the %20, that is hugely helpful for dealing with any amount of addresses, given how many spaces one would encounter. A quick search yielded this suggestion on StackOverflow: you can just use the ‘URLencode’ function in base R, as below.

sf::st_read(URLencode("https://geosearch.planninglabs.nyc/v1/search?text=1220 Fifth Ave, New York, NY"))

So now lets try more than one address. I’ll start by making a vector of addresses - some landmarks of NYC, including the aformentioned one. (The others are the Booklyn Museum of Art, Snug Harbor Cultural Center, and the Bronx Zoo.)

addresses <- c("1220 Fifth Ave, New York, NY", "200 Eastern Pkwy, Brooklyn, NY 11238", "1000 Richmond Terrace, Staten Island, NY 10301", "2300 Southern Blvd, Bronx, NY 10460")

Next, we need can encode the addresses for the URL; lapply is quick for this type of work, but splits the vector into a list, so the rest of the function (do.call(rbind...) combines it back into a object (a matrix in this case).

urladdresses <- do.call(rbind, lapply(addresses, URLencode))

Next, we can transform all of the addresses to be part of appropriate URLs, and then go through them via the lapply to import the results using sf::st_read. Again, we’re using do.call(rbind... to bring results back together. You’ll see the outputs of st_read; if ou would rather not, you can use sf::read_sf instead with the same results.

search_urls <- paste("https://geosearch.planninglabs.nyc/v1/search?text=", urladdresses, sep="")
geocoded_results <- do.call(rbind, lapply(search_urls, sf::st_read))

You can inspect the resulting object using functions like str, or View (and you can easily see details of the resulting object). And just as before, you can use

mapview::mapview(geocoded_results)

Interestingly, you’ll notice a bunch of points at Snug Harbor Cultural Center, which presumably share the same address, but are distinct buildings.

We can put all of the above into a function too:

nycgeocode <- function(addresses){
  urladdresses <- do.call(rbind, lapply(addresses, URLencode))
  search_urls <- paste("https://geosearch.planninglabs.nyc/v1/search?text=", 
                       urladdresses, sep="")
  geocoded_results <- do.call(rbind, lapply(search_urls, sf::st_read))
  return(geocoded_results)
  }

misc_addresses <- nycgeocode(addresses)

Some Known Challenges and Opportunities for working with GeoSearch from R

At this point, I’ve gone through the basics of getting addresses geocoded in R using the NYC GeoSearch API. There are challenges I haven’t quite addressed yet, but worth considering in further development.

  • First, it’s worth noting that you can Reverse Geocode too, as per some twitter correspondence.

  • I experimented with a few different addresses. One in particular was the Staten Island Zoo, 614 Broadway, Staten Island, NY 10310. Surprisingly, it returned 614 Broadway locations in Brooklyn and Manhattan as well as Staten Island. A look at the query indicates the borough and postal code were correctly parsed. GeoSearch does note that not all parameters in Pelias are implemented, thus it might be a limitation at this point, though I’ll be curious to dive in further. Note, users can refine searches based on things like proximity to a point, or within a bounding box. I’m still somewhat new to APIs myself, and readily admit that I could just be missing something too.

  • I tried a combination of very clear and direct addresses, as above, as well as some vague ones, like “1 Vanderbilt”. Some different fields were provided/not returned for the specific queries, which caused issues in the rbind. For example, with specific addresses, ‘match_type’ was a field returned that did not come back with the very vague addresses.

  • Regardless of everything else, if the system is given a query for something fairly vague, amidst a query for more specific places, a user would need to parse through which of the options given for a vague location they were looking for. Perhaps this could be done on a map, graphically, using the mapedit package. I don’t have another clean way of doing that, but it’s probably something that would need to be thought through if further developing this work in R.

  • One would probably want to include functionality to easily join the spatial results with the original addresses provided. For example, if a user has a data table with a column of addresses, along with other fields, it might be helpful to do the geocode and have a result returned with the original fields as well as lat/long, bbl, bin, etc.

Closing Note

Overall, this was an interesting exercise to spend a few hours on. I’m excited about the NYC specific geocoder, and as always in trying things out in R, I learned a thing or two. Feel free to reach out with other thoughts on this.

Additional Info, added 12 Feb 2018

Thanks to Twitter, I learned of the rmapzen R package, which is already set up to use Pelias-based geocoding systems. Below is some sample code for a single address… I’ll be curious to explore this more myself. At time of writing, for application with the NYC GeoSearch system, you’ll need to use a dev-version from GitHub, as per the install.packages line (commented out), which lets you use a custom api.

# devtools::install_github("tarakc02/rmapzen", ref = "devel") 
library(rmapzen)
## Warning: package 'rmapzen' was built under R version 3.5.3
# Specify API
options(RMAPZEN_SEARCH_HOST = "geosearch.planninglabs.nyc")

#Specify Mapzen key (NA in this case)
Sys.setenv("MAPZEN_KEY" = NA) 

#Search
#mz_search("1220 Fifth Ave, New York, NY")