August 12, 2020 by Matt Bartholomew

Making a Better Map Part 2: ggplot and Tigris

R Tigris ggplot

Continuing Where we Left Off

In Part One of this series we worked on getting data using the TidyCensus package to recreate the map below from a recent New York Times Story. Now we’re going to actually do the mapping bit.

It Floats…kind of

So this should be pretty straightforward: we’ve got a dataframe with shapefiles called bay_area_housing. It’s got one column with a shapefile of each census tract in the bay area and another with a ratio of dwellings where more than one person resides per room. Using the geom_sf() function we can make a basic map with just a couple lines of code.

ggplot(data = bay_area_housing) + 
  geom_sf(color = NA, aes(geometry = geometry, fill = ratio_over_1))

It’s not pretty. At all. And it’s nothing like the map we are trying to replicate. In no particular order, if we want to recreate the map from the NYT, we’re going to need:
1) Roads
2) The landmass outside of our chosen area
3) City Names
4) A better color palette
5) Water. And that water should be blue!

So how do we get all of that?

Enter Tigris

By an almost obscene stroke of luck (and by luck I mean the hard work of the authors), the Tigris library can help us get almost all of these. Seriously, getting all of the data we need for roads, landmass and water is just this simple:

library(tigris)

my_state = 'CA'
my_county = c('San Francisco','San Mateo','Santa Cruz','Santa Clara','Alameda','Contra Costa','Los Angeles')

roads_data = primary_secondary_roads(my_state) %>% 
  filter(RTTYP %in% c('U','S','I')) #this will limit the roads to just highways. 
landmass_data = counties(cb = TRUE, state = my_state)
water_data = area_water(my_state,my_county)

Let’s see what we can muster now with these supporting layers. Note that I’m moving my data arguments into each individual geom_sf() call since we’re pulling from more than one data frame. Also since these layers will stack on top of eachother, the order that we feed them into ggplot is very important.

my_map = ggplot() +
  theme_void()  + #whiteout
  geom_sf(data = landmass_data, fill = '#e6e6e6', color = NA,aes(geometry = geometry)) + #grey landmass
  geom_sf(data = bay_area_housing, color = NA, aes(geometry = geometry, fill = ratio_over_1)) + #our census map
  geom_sf(data = water_data, fill = '#cce6ff',color = NA, aes(geometry = geometry)) + #add water
  theme(panel.background = element_rect(fill = '#cce6ff', color = NA)) + #blue background to mesh with water
  geom_sf(data = roads_data, color = 'white', aes(geometry = geometry)) + #add roads
  NULL

my_map

Crap. We’ve got what’s probably a pretty decent map, but it’s for the whole state. This kind of makes sense actually, since some of our calls to Tigris only specify a state and not a county, and ggplot doesn’t know that we only care to frame the parts of the data that are included in our bay_area_housing layer. Luckily there is a way around this. Using the coord_sf() function we can set x and y limits based on our original dataframe. This is kind of tricky since shapefiles don’t behave the way a lot of other R objects do but we can extract them like so using the st_bbox() function contained in the sf library.

limits = st_bbox(bay_area_housing$geometry)
limits

##       xmin       ymin       xmax       ymax 
## -123.01392   36.85065 -121.20823   38.09988

Now lets try again with our new limited frame.

my_map = my_map + 
  coord_sf(xlim = c(as.numeric(limits$xmin),as.numeric(limits$xmax)), 
           ylim = c(as.numeric(limits$ymin),as.numeric(limits$ymax)))

my_map

Last Touches

Okay! This is starting to look more professional. We definitely need to get a color scheme that’s a bit more expressive and get our legend a bit more legible. There’s lots of palettes in the viridis library that can make your maps much more legible and accessible but in the spirit of recreating the original map, I’m going to hand pick a gradient to match the levels in the original, while also getting my legend cleaned up.

my_map = my_map + 
  labs(fill = 'Share of housing\nunits with more\nthan one person\nper room') + #label legend
  theme(legend.justification = c(0,0), legend.position = c(0.02,0.04)) + # move legend to bottom left
  scale_fill_gradient(low = '#ffff99', high = '#b10026', na.value = '#ffffcc' )  # custom fill palette

my_map

Very nice. All that’s left to do is add our cities and we’ll be all set. I found that one area where Tigris isn’t great is getting simple points for the locations of cities (it really just deals with polygons), but there is a maps package that has built in data for US cities. Lets bring in that data and add in the last layer of our map. In order to keep things from getting too crowded we’ll add a cutoff for cities that are under a certain size.

library(maps)
cities_data = us.cities %>% 
  filter(country.etc %in% my_state & pop > 75000)  %>% 
  mutate(fixed_name = str_replace(name,country.etc,'')) # this will drop the state name from the name column
  
my_map + 
  geom_text(color = 'black',data = cities_data ,check_overlap = TRUE ,aes(x = long, y = lat, label = fixed_name))

Wow…lets just take minute to recall our journey here. Nicely done!

We Hunger for More Maps

Ok! We’ve recreated the original map. Remember though, one of the goals of this project was to be able to make a tool that we can use to create similar visualizations for ANYWHERE in the US.

Well…if we’re gonna be running the same code several times over, this sounds like a job for a function. This next part is gonna involve a lot of code we’ve already gone over but the basic idea is to make two functions: one to pull the data we need to make the map and one to actually render it. First the data puller: This is going to let us grab our underlying info from the TidyCensus, Tigris and Maps package.

pull_data = function(my_state, my_county){
  
  acs = load_variables(2018, 'acs5', cache = TRUE) 
  
  concepts = acs %>% 
    filter(concept == 'TENURE BY OCCUPANTS PER ROOM')
  
  concepts_string = concepts %>% 
    select(1) %>% 
    pull() # this creates a character vector of just the concepts we want
  
  
  raw =  get_acs(geography = 'tract', variables = concepts_string, year = 2018, state = my_state, 
                 county = my_county,geometry = TRUE, cache = TRUE)
  
  fixed_data = concepts %>% 
    select(name, label) %>% 
    right_join(raw, by = c('name' = 'variable')) %>% 
    select(2,3,5,7) %>% 
    spread(key = label, value = estimate, -4)
  
  names(fixed_data) = make.names(names(fixed_data))
  
  fill_data = fixed_data %>% 
    mutate(
      total_under_1 = Estimate..Total..Owner.occupied..0.50.or.less.occupants.per.room + Estimate..Total..Owner.occupied..0.51.to.1.00.occupants.per.room + Estimate..Total..Renter.occupied..0.50.or.less.occupants.per.room + Estimate..Total..Renter.occupied..0.51.to.1.00.occupants.per.room,
      total_over_1 = Estimate..Total..Owner.occupied..1.01.to.1.50.occupants.per.room + Estimate..Total..Owner.occupied..1.51.to.2.00.occupants.per.room + Estimate..Total..Owner.occupied..2.01.or.more.occupants.per.room + Estimate..Total..Renter.occupied..1.01.to.1.50.occupants.per.room + Estimate..Total..Renter.occupied..1.51.to.2.00.occupants.per.room + Estimate..Total..Renter.occupied..2.01.or.more.occupants.per.room
    ) %>% 
    select(GEOID,geometry ,total_under_1, total_over_1, Estimate..Total) %>% 
    mutate(ratio_over_1 = total_over_1/ Estimate..Total)

  roads_data = primary_secondary_roads(my_state) %>% 
    filter(RTTYP %in% c('U','S','I'))
  
  counties_data = counties(cb = TRUE)
  water_data = area_water(my_state,my_county)
  
  cities_data = us.cities %>% 
    filter(country.etc %in% my_state)  %>% 
    mutate(fixed_name = str_replace(name,country.etc,'')) # this will drop the state name from the name column
  
  return(list(fill = fill_data, roads = roads_data, counties = counties_data, water =  water_data, cities = cities_data))
}

Second is the actual mapper. I’ve got two extra parameters in here to make this more flexible for various geographies. One allows you to specify a population cutoff for which cities will be included on the map. Another lets you specify which corner of the map to stick the legend.

mapper = function(input_data, city_size_floor = 100000000, legend_position = 'bottom left'){ #accepts list object from pull_data function
  map_frame = input_data$fill
  limits = st_bbox(map_frame$geometry)
  roads = input_data$roads
  county = input_data$counties
  water = input_data$water
  cities = input_data$cities %>% 
    filter(pop > city_size_floor & long > as.numeric(limits$xmin) & long < as.numeric(limits$xmax) &
             lat>(as.numeric(limits$ymin) & lat < as.numeric(limits$ymax))) 
  
  legend_placer = function(position = 'bottom left'){
    if(position == 'bottom left'){
      ret = (list(c(0,0),c(0.02,0.04)))
    }else if(position == 'top left'){
      ret =(list(c(0,1),c(0.02,0.96)))
    }else if (position == 'bottom right'){
      ret=(list(c(1,0),c(0.98,0.04)))
    }else if(position == 'top right'){
      ret=(list(c(1,1),c(0.98,0.96)))
    }else(stop('invalid input: must supply one of: bottom left, bottom right, top left, top right'))
    return(theme(legend.justification = ret[[1]], legend.position = ret[[2]]))
  }
  
  ggplot() +
    theme_void()  + #whiteout
    legend_placer(legend_position) +
    labs(fill = 'Share of housing\nunits with more\nthan one person\nper room')+ #label legend
    geom_sf(data = county, alpha = 1, fill = '#e6e6e6', color = NA,aes(geometry = geometry)) + #grey landmass
    geom_sf(data = map_frame, alpha = 1,color = NA, aes(geometry = geometry, fill = ratio_over_1)) + #our census map
    scale_fill_gradient(low = '#ffff99', high = '#b10026', na.value = '#ffffcc' ) + # custom fill palette
    geom_sf(data = water, fill = '#cce6ff',color = NA, aes(geometry = geometry)) + #add water
    theme(panel.background = element_rect(fill = '#cce6ff', color = NA)) + #blue background to mesh with water
    geom_sf(data = roads, color = 'white', aes(geometry = geometry)) + #add roads
    geom_text(color = 'black',data = cities,check_overlap = TRUE ,aes(x = long, y = lat, label = fixed_name)) + #add city names
    coord_sf(xlim = c(as.numeric(limits$xmin),as.numeric(limits$xmax)), ylim = c(as.numeric(limits$ymin),as.numeric(limits$ymax))) + #set framing(very important)
    NULL
}

So lets give it a whirl: let’s start with New York City:

nyc = pull_data('NY',c('Bronx','New York','Queens','Kings'))
mapper(nyc, legend_position = 'top left')

How about my hometown?

atx = pull_data('TX',c('Travis','Williamson','Hays','Caldwell','Bastrop'))
mapper(atx, 40000, legend_position = 'top left')

I think we’ve done it. If you want to see maps for your own city or region, feel free to copy my functions and run them locally. If you find a way to improve on them let me know!

August 11, 2020 by Matt Bartholomew

Making a Better Map Part 1: Getting Data with TidyCensus

R TidyCensus

Recreating a Visualization

One of the many perks of having Data Science skills is being able to interact with statistics and graphs that you see in the news in a much deeper way. This weekend’s New York Times included an interesting graphic on housing density and overcrowding in the Bay Area:

My goal in this project is to 1) Recreate the original visualization using census data. 2) Build a function to generate similar maps for any region of the country.

Setup

Fortunately the graphic cites its sources. I’m going to use tidycensus package to grab the base data and the ggplot2 library for my visual output. Since I’m sure I’ll also be needing dplyr and som other common tools to boot, I’ll bring in the whole tidyverse package as well.

library(tidycensus)
library(tidyverse)

Getting to the data

The good news is that the tidycensus package makes it a snap to get recent census data without even having to write and load any csv files to your working directory. The bad news is that census data is extremely complex and detailed and we will have to do a bit of sleuthing to find the inputs we’ll need to recreate the graphic. For this part of the project I found Kyle Walker and Matt Herman’s tidycensus vignettes to be of incredible help.

You’ll want to have a census API key in order to access the data. It’s free and easy to register so no worries there. Tidycensus has a built in function for storing your key.

census_api_key("YOUR API KEY GOES HERE")

We’ll be using the 2018 American Community Suervey (ACS) data since that’s the most recent available. Let’s have a look at what data is available.

acs = load_variables(2018, 'acs5', cache = TRUE) 
summary(acs)

##      name              label             concept         
##  Length:26996       Length:26996       Length:26996      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

That’s …. a lot of categories: nearly 27000 rows! After some extensive sleuthing and googling to get a better sense of census jargon I concluded that the TENURE BY OCCUPANTS PER ROOM category is what we want. The job now is to query that subset of categories for the geographic area we want to map. Since our goal is to recreate the New York Times map, I’m going to use a list of bay area counties. In our get_acs() call we’ll want to specify that we want tract level data with geometry shapefiles included since this is ultimately going to power our map.

my_state = 'CA'
my_county = c('San Francisco','San Mateo','Santa Cruz','Santa Clara','Alameda','Contra Costa')

  
  concepts = acs %>% 
    filter(concept == 'TENURE BY OCCUPANTS PER ROOM')
  
  concepts_string = concepts %>% 
    select(1) %>% 
    pull() # this creates a character vector of just the concepts we want
  
raw =  get_acs(geography = 'tract', variables = concepts_string, year = 2018, state = my_state, 
              county = my_county,geometry = TRUE, cache = TRUE)

Transforming and Aggregating

Now we have our raw data! It’s got some issues though: First there’s multiple rows for each Census tract that we will want to aggreage in order to make our map. Also the return from the census API includes our variable codes but does not contain friendly labels that our human brains can easily parse. I’ll consolidate every tract ID using the spread() function and then join our original list of concepts back in for easier reading. I’m also going to clean up the column names a bit using the make_names() function that is built into base R.

fixed_data = concepts %>% 
  select(name, label) %>% 
  right_join(raw, by = c('name' = 'variable')) %>% 
  select(2,3,5,7) %>% 
  spread(key = label, value = estimate, -4)
  
names(fixed_data) = make.names(names(fixed_data))

That gives us a nicer dataframe to work with. The last step is to do some basic arithmetic with our columns to figure out the ratio of households with more than one occupant per room. I just used basic dplyr mutates for this, which is a little inelegant but gets the job done. You may find a better solution exists.

fill_data = fixed_data %>% 
  mutate(
    total_under_1 = Estimate..Total..Owner.occupied..0.50.or.less.occupants.per.room + Estimate..Total..Owner.occupied..0.51.to.1.00.occupants.per.room +    Estimate..Total..Renter.occupied..0.50.or.less.occupants.per.room + Estimate..Total..Renter.occupied..0.51.to.1.00.occupants.per.room,
    total_over_1 = Estimate..Total..Owner.occupied..1.01.to.1.50.occupants.per.room + Estimate..Total..Owner.occupied..1.51.to.2.00.occupants.per.room +    Estimate..Total..Owner.occupied..2.01.or.more.occupants.per.room + Estimate..Total..Renter.occupied..1.01.to.1.50.occupants.per.room + Estimate..Total..Renter.occupied..1.51.to.2.00.occupants.per.room + Estimate..Total..Renter.occupied..2.01.or.more.occupants.per.room
    ) %>% 
select(GEOID,geometry ,total_under_1, total_over_1, Estimate..Total) %>% 
mutate(ratio_over_1 = total_over_1/ Estimate..Total)

head(fill_data)

## # A tibble: 6 x 6
##   GEOID                  geometry total_under_1 total_over_1 Estimate..Total
##   <chr>        <MULTIPOLYGON [°]>         <dbl>        <dbl>           <dbl>
## 1 0600~ (((-122.2469 37.88544, -~          1285           12            1297
## 2 0600~ (((-122.2574 37.8431, -1~           849            6             855
## 3 0600~ (((-122.2642 37.84, -122~          2375           66            2441
## 4 0600~ (((-122.2618 37.84179, -~          1718           32            1750
## 5 0600~ (((-122.2694 37.84811, -~          1575           47            1622
## 6 0600~ (((-122.2681 37.84414, -~           653            4             657
## # ... with 1 more variable: ratio_over_1 <dbl>

Now we’ve got a data frame with one row for each census tract that contains a shapefile for the contours of the tract, plus a calculation of the percentage of housing units that have more than one occupant per room. We’ll cover making the map using ggplot in Part Two.

Older Newer