Akos Furton    About Me    Github    LinkedIn    Email Me    Resume    Archive

Warehouse Mapping and Traveling Salesman

Analyze geospatial data within the R and ggplot2 ecosystem to develop efficient routing networks. Apply the Traveling Salesman Problem to minimize the total distance between a large set of locations and visualize the results on an interactive map.

Even though the Christmas season has come and gone, consumers across America look back fondly on new presents and memories shared with family. As American retailers report Christmas sales, we take a look into their distribution networks, a hidden yet vitally important piece in the complex network that allows customers to receive their gifts during the holidays.

In 2015, Amazon reported 107 Billion, Walmart 482 Billion, and Target 74 Billion USD in revenue. Such large sales figures can only be supported by an extensive logistics network that can efficiently transport large volumes of product from manufacturer to customer.

In this post, we assume that Santa and his reindeer must visit each of Amazon’s, Walmart’s, and Target’s warehouses to distribute Christmas presents to all of America. We will first conduct some exploratory data analysis on the warehouses in the dataset, then plot the most efficient route to all of the warehouses and finally show what is the quickest way for each consumer to pick up his or her packages.


We first begin by loading a number of packages to visualize geospatial data. The popular ggplot2 package contains a number of extensions that allow it to display coordinates on a map background. The TSP package allows for a user to input a set of GPS coordinates and then calculates the most efficient route that passes through all of the points. By minimizing distance, delivery companies (such as Rudolph Inc.) do not waste extra time by traveling across a suboptimal route. The SP package will allow us to calculate the fastest route from our location to the nearest warehouse. Finally, the Leaflet package allows us to convert our visualization into an interactive HTML file where we can explore the dataset.

options(digits = 15)

The dataset used for the analysis can be found here. We load the data using the read.csv module to convert the values into an R dataframe. While examining the data we notice that the square foot attribute should be a number but is read as a factor. Therefore, we convert it into a numeric value.

warehouses <- read.csv('warehouse_stats_w_latlon.csv')

warehouses$sq_ft <- as.numeric(as.character(warehouses$sq_ft))

Exploratory Data Analysis

The first thing when encountered with a new dataset is to perform some exploratory data analysis. We start by viewing the first 5 rows of the dataset. We see that it contains attributes for the retailer that owns the warehouse, the type of warehouse, and the warehouse address. The dataset also contains the facility’s size and opening date. Finally, we convert the address into a latitude and longitude coordinate pair. However, this was hard to obtain for certain warehouses so their latitude and longitude values are NA.


##   retailer                type company_id
## 1   target general merchandise      T-580
## 2   target general merchandise      T-588
## 3   target general merchandise     T-0553
## 4   target general merchandise     T-0555
## 5   target general merchandise     T-0593
##                                                location
## 1           6175 Greenbrier Rd, Madison, Alabama, 35756
## 2                25 N 75th Ave, Phoenix, Arizona, 85043
## 3       14750 Miller Avenue, Fontana, California, 92336
## 4 2050 East Beamer Street, Woodland, California, 957776
## 5          3880 Zachary Ave, Shafter, California, 93263
##    country   sq_ft yr_open         lat           lon
## 1 USA 1357500    2000 34.64499950  -86.84391439
## 2 USA 1530700    2002 33.44837487 -112.22108920
## 3 USA 1423000    1987 34.11206400 -117.48052900
## 4 USA 1862000    1988          NA            NA
## 5 USA 2100000    2003 35.44491959 -119.18513490

Since we have data for multiple retailers, we would like to see whether the companies have similar distribution strategies. By using a table to calculate frequencies and then converting the aggregated counts into a barchart, we can see that Amazon has more warehouses than Walmart. Even though Walmart has revenues that are over 4x greater than those of Amazon, the online retailer has many more facilities. We hypothesize that Walmart’s warehouses are larger whereas Amazon has many more small warehouses to ensure fast delivery.


We explore the above hypothesis by finding the average warehouse size for each retailer. To aggregate data, R has a function called tapply that can group data and return summary statistics. We are only interested in the median warehouse size, but the function can return an array of summary statistics.

As we predicted, Amazon’s warehouses are about half of the size of Walmart’s warehouses. Interestingly, Target has even larger warehouses than Walmart. Clearly, the companies have differing priorities with regards to distribution networks.

barplot(tapply(warehouses$sq_ft, warehouses$retailer, 
		median, na.rm = TRUE))

We can also break the data into each specific type of warehouse for each retailer. The table below returns the median warehouse size. Amazon has extremely small Prime Now and Delivery Sort warehouses in order to deliver exceptionally small lead times to its customers. Prime Now promises same day delivery for customers, a tactic that is only possible with small warehouses located close to population centers.

Additionally, Amazon’s general merchandise warehouses are significantly smaller than Walmart and Target’s warehouses. This is primarily because Amazon stocks goods closer to population centers in order to achieve faster lead times. In contrast, Walmart and Target do not have stores in all areas and have a lead time buffer in terms of their stores. Amazon in contrast ships directly to customers so it must place its warehouses closer to customers.

   by = list(warehouses$retailer, warehouses$type), 
   FUN = median, na.rm = TRUE)

##       Group.1             Group.2       x
## 1      target              closed 1225000
## 2     walmart              closed  414000
## 3      amazon       delivery sort   64000
## 4     walmart    domestic-inbound  104380
## 5      target          e-commerce  785400
## 6     walmart          e-commerce 1001310
## 7     walmart              export  191700
## 8     walmart             fashion  893700
## 9      amazon                food  161400
## 10     target                food  430000
## 11    walmart                food  875100
## 12    walmart            footwear  236100
## 13     amazon general merchandise  820400
## 14 sam's club general merchandise   70000
## 15     target general merchandise 1500000
## 16    walmart general merchandise 1200000
## 17     target             inbound 1900000
## 18    walmart     inbound-foreign 2200000
## 19    walmart             optical   38918
## 20    walmart            pharmacy   70900
## 21     amazon           prime now   36400
## 22    walmart         print/photo  390000
## 23     target             returns  190000
## 24    walmart             returns  222500
## 25     amazon                sort  268500
## 26    walmart                tire  388560

We also notice that some of the data contains NA for either latitude or longitude. We need both to calculate distances or plot the warehouses. The complete.cases function subsets the data so only values with both latitude and longitude remain.

complete.latlon <- subset(warehouses, 
	complete.cases(warehouses$lat) == TRUE)

Plotting Data on a Map

We wish to get a visual representation of the geospatial distribution of our data. By plotting the warehouses on a map of the United States, we can identify patterns in location.

First, plot the map base layer that contains outlines of each of the states, and then add points to represent each warehouse. Finally, group the 4 retailers by color and add a legend to identify which color corresponds to which retailer.

For each retailer, we notice that there exist dense clusters on both the East and West coasts. Walmart tends to place a warehouse on the outskirts of each larger city in the US. Amazon targets only the largest of cities and places many smaller warehouses within each city, ensuring efficient deliveries. While Walmart’s warehouses are relatively evenly distributed according to the United States population, Amazon is clustered only in the largest cities. We also see that no warehouses for any retailer exist in Idaho, Montana, or the Dakotas. These areas are sparsely populated with large distances so retailers choose to serve them through farther warehouses.

state.map <- map_data("state")
ggplot() + geom_polygon(data = state.map, aes(x=long,y=lat,
		   group=group), colour = "grey", fill = NA) +
  geom_point(data = complete.latlon, 
  	aes(x = lon, y = lat, color = retailer))

Traveling Salesman (Santa) Problem

To visit each of the warehouses in the United States, it is critical that Santa take the most efficient route. By using the ETSP function in the TSP package, we can calculate the shortest route. The function requires a matrix of X/Y coordinates. We can use the Euclidian distance (“as the crow flies”) to travel between each warehouse.

The solve_TSP function applies a set of heuristics to solve the Traveling Salesman problem. The two_opt method seeks to minimize backtracking or path crossing.

latlon.m <- ETSP(cbind(complete.latlon$lat, complete.latlon$lon))

tour <- solve_TSP(latlon.m, method = "two_opt")

By plotting the most efficient tour, we can see the route that Santa would take to deliver packages to all warehouses in the United States. However, without a background of the United States, the entire plot does not make much sense.

plot(latlon.m, tour)

When we examine the order of the warehouses in the Traveling Salesman Tour, we convert the tour order into integers and then view the warehouses corresponding to the index. By calling the index, we see that the tour starts in Texas, winding around Fort Worth.

route.order <- (as.integer(tour))

## [1] 222 Commercial Street, Sunnyvale, California,  94085-4508    
## [2] 1700 Montague Expy, San Jose, California,  95131             
## [3] 38811 Cherry Street, Newark, California,  94560-4939         
## [4] 990 Beecher St, San Leandro, California,  94577              
## [5] 250 Utah Avenue, South San Francisco, California,  94080-6801
## [6] 888 Tennessee Street, San Francisco, California,  94112      
## 451 Levels: 1 Centerpoint Blvd, New Castle. Delaware. USA. 19720 ...

Now that there is a solution to the shortest path that passes through each warehouse, we reorder the data frame to be in the same order as the tour. We sequence each of the indices in the tour and match them to the row index. Since the values of the indices will now match, the data frame is in the order of the efficient delivery.

complete.latlon$index <- seq(1, nrow(complete.latlon))
complete.latlon <- complete.latlon[match(route.order, 

As we saw above, the plot of the tour itself was confusing and just looked like a bunch of lines that zig-zagged across a plane. To fix this issue, we want to plot the Traveling Salesman Tour on a map of the United States. First, we load the geospatial coordinates to map the state boundaries. We overlay the state data as polygons to create a blank map of the continental United States. Finally, we overlay the path created by the Traveling Salesman on top of the states. We only plot those points that are located inside our viewable area, that is the Lower 48.

We can now see that the optimal route starts in Texas, winds its way across the South, then works up the East Coast, then across the midwest. After that, the route cuts across the center of the country before winding up the West Coast and finally finishing in Nevada. This route makes visual and intuitive sense because it does not contain any crosses or overlaps.

state.map <- map_data("state")
ggplot() + geom_polygon(data = state.map, 
                        colour = "grey", fill = NA) +
                                  aes(x=lon, y=lat), col="red") +
                                  xlim(-125,-65) + ylim(25,50) +

Now that all of the packages have been delivered to a warehouse, we wish to know what is the closest facility to pick up the gifts. We need to first geocode a given address by converting it into latitude and longitude coordinates. By using the geocode function, we can convert the text into GPS using the Google Maps API. (Note: Google Maps API is limited to 2500 calls per day.) The location returns a latitude and longitude pair.

my.loc <- "1600 Pennsylvania Ave. Washington DC"
loc.coords <- geocode(my.loc)

##           lon        lat
## 1 -77.0365298 38.8976763

For our Warehouse data, we make use of the Spatial Points library to note that the latitude and longitude values are not independent points but rather a coordinate pair. By passing the SpatialPoints function a data frame of coordinate pairs in longitude/latitude format, we now have a set of GPS points we can use to calculate great circle distances. Distances over large scales are not the typical x/y coordinate Euclidian distance. Because Earth is a sphere, the shortest distance between points looks like an arc when plotted on a 2 dimensional surface such as a map or screen. These routes are called great circles and the coordinate system WGS84 is used to calculate great circle distances.

By using the function spDists, we can calculate the distance from our specified point to any of the warehouses. We append this distance onto the warehouse data frame. Because we specify longlat to be True, the function calculates Great Circle distance instead of euclidian straight line distance. The distance column of the complete.latlon data now contains the separation between The White House and each individual warehouse.

Finally, we return the index of the minimum distance of all of the computed distances to display the closest warehouse to the White House.

warehouses.coords <- SpatialPoints(coords = data.frame(
	lon = complete.latlon$lon, 
lat = complete.latlon$lat), 
proj4string = CRS("+proj=longlat +datum=WGS84"))

my.coords <- SpatialPoints(coords = data.frame(loc.coords), 
	proj4string = CRS("+proj=longlat +datum=WGS84"))

complete.latlon$distance <- spDists(warehouses.coords, 
		my.coords, longlat = T)

closest.warehouse <- complete.latlon[which.min(complete.latlon$distance),]


##     retailer      type company_id
## 453   amazon prime now       UVA1
##                    location
## 453 5617 Industrial Dr, Suite A, Springfield, Virginia,  22151- 4410
##   country  sq_ft yr_open     lat    lon          index
## 453 USA 126900    2015 38.79845428 -77.1696701   441
##            distance
## 453 15.966158074277

Santa could directly fly the 16 kilometers between the White House and the Amazon Food warehouse that is closest. However, for the President to access the location he would need his Secret Service to drive. Fortunately, ggmap contains the route function that takes two locations, the start and end points, and returns the Google Maps routing between the points. Route returns a dataframe that contains the latitude and longitude coordinates of each of the turns required to travel from point A to point B (White House to warehouse).

Once we have our routing information, we display it on a map to visualize the President’s journey to pick up his or her package. By adding a geom_path layer to the map, we can visualize the routing via the order of the coordinates in the route.

warehouse.address <- as.character(closest.warehouse$location)

route.df <- route(my.loc, warehouse.address, structure = 'route')

qmap(my.loc, zoom = 11) + geom_path(data = route.df, 
	aes(x = lon, y = lat), color = "red", 
size = 1.5, lineend = "round")

Finally, we use the Leaflet package to transform our static geospatial visualization created earlier into an interactive map. We first add our latitude and longitude data frame and the map images onto the map. We color each warehouse by its retailer, so that Orange refers to Amazon, Blue refers to Walmart, Green refers to Sam’s Club, and Red refers to Target. Afterwards, for each warehouse marker on our map, we add a popup that displays its status. Therefore, we can interactively view each facility’s function, owner, and size. It is extremely easy to see where warehouse clusters are located, especially those within large cities in America.

dynamic.map <- leaflet(complete.latlon)
dynamic.map <- addTiles(dynamic.map)

pal <- colorFactor(c('orange','green''red','blue'), 
				domain = complete.latlon$retailer)

dynamic.map <- addCircleMarkers(dynamic.map, 
        popup = paste("Retailer:", complete.latlon$retailer,"<br>",
          'Type:', complete.latlon$type, '<br>',
          'Square Feet:', complete.latlon$sq_ft, '<br>',
          'Opened:', complete.latlon$yr_open,'<br>',
          'Distance from the White House:', round(complete.latlon$distance,3), 'km'),
          color = ~pal(retailer), stroke = FALSE, fillOpacity = 0.7)


We can explore each retailer’s distribution and logistics network on the Leaflet map. We notice again Amazon’s clusters around large cities and Walmart’s more even spread of distribution centers on the outskirts of large and medium sized population centers.

You can explore the map for yourself here.

comments powered by Disqus