The project was created in association with the MUSA 801 Practicum at the University of Pennsylvania, taught by Ken Steif, Karl Dailey, and Michael Fichman. We would like to thank Matthew Harris for providing guidance and support to help us develop the model. All products from this class should be considered proofs of concept and works in progress.
This document begins with the Introduction, then dives into data exploration and feature engineering. After laying out the concepts and the data, we move on to the model building and model validation sections.
Transportation congestion is an enormous threat to our economic prosperity and way of life. Whether it takes the form of cars or trucks stalled in traffic or cargo stuck at overwhelmed seaports, congestion costs America an estimated $200 billion a year. Beyond these immediate costs, transportation delay and unreliability have begun to chip away at one of our nation’s most important economic assets: businesses’ freedom of location and their ability to quickly reach customers across the city. In particular, when it comes to avoiding traffic congestion, the question of when and where to hold an event deserves attention. While a great deal of study focuses on the negative externalities of congestion, few cities have been able to launch strategies to answer this question. Predictive modeling opens a new window and offers an innovative approach to do so. Our aim is to provide insight into the degree to which built-environment characteristics of the road network, and their interaction in space and time, can be used as predictors of jam occurrence. If successful, we hope to provide transportation professionals and policy makers with a new planning tool for informing and targeting their preparation and planning for an event.
For this study, we train a space/time traffic congestion forecasting model using space/time traffic jam observations in Louisville, KY provided by the Waze Connected Citizens project. Predictions from this model are then input into a proof-of-concept application aimed at helping cities like Louisville manage traffic congestion.
A planned event is a public activity with a scheduled time and location whose performance is greatly affected by transportation system operations, since event staging drives travel demand and congestion. Event planning has become increasingly sensitive to traffic congestion because of its effects on people’s quality of life, movement efficiency, and travel experience. In a recent national survey, delay caused by traffic congestion was people’s top concern when attending an event.
One group with a strong interest in understanding traffic congestion around special events is transportation engineers and planners, who are in charge of day-of-event traffic management. Before an event is scheduled, an hourly prediction of congestion illuminates the best time and location to hold it. On the event day, engineers and planners are responsible for monitoring and maintaining traffic flow through their jurisdiction. With a prediction of day-of-event traffic congestion, they can better develop and review traffic management plans to accommodate anticipated fluctuations in traffic demand.
Presently, planners site an event or pursue interventions based on past congestion occurrences or general knowledge from experience, without an accurate, systematic numeric prediction of traffic congestion for the selected time and location. Our approach seeks to predict where, when, and to what degree traffic congestion is likely to occur based on the characteristics of each site and the weather at the scheduled time, thereby giving transportation engineers and planners the means to be more proactive in their interventions and traffic operations.
Figure 1: Road congestion during the 2015 Kentucky Derby in Louisville (Source: The Courier-Journal, April 28, 2015)
Attribute | Description |
---|---|
id | Jam ID. The unique id of each jam report |
pub_utc_date | The UTC time of the jam’s occurrence |
speed | Current average speed on jammed segments, in meters per second |
length | Jam length in meters |
delay | Delay of jam compared to free flow speed in seconds |
road type | Road type of the jammed segments |
level | Traffic congestion level (0 = free flow; 5 = blocked) |
blocking alter id | Whether the jam is connected to a blocking alert or not |
latitude | Latitude of the jam report |
longitude | Longitude of the jam report |
The first step of the prediction is to aggregate the jam data to the Louisville road network. We do so by creating 2,075 hexagonal grid cells with a 500-meter radius (about 216,506 square meters each) that cover the primary, secondary, and tertiary roads in the City of Louisville, and spatially aggregating the jam data to the grid cells. As a result, 1,946 grid cells have jam data attached. We calculated the average jam length/speed/delay for each grid cell and took it as the dependent variable in this study. All additional data we collect is aggregated to these grid cells. Figure 2 shows the aggregated jam length by grid cell, indicating that the urban arterials connecting the inner city with the outskirts, and the east-west corridors, are most likely to be congested.
In detail, we created the grid cell system through four steps:
Figure 2: 2018 Average jam length in meters by grids
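The scripts for these steps are not reproduced here, but a minimal sketch of how such a hexagonal fishnet could be built and the jams aggregated with the sf package is shown below. louisvilleRoads and jam_points are placeholder objects in a projected CRS (UTM 16N, meters); the project’s actual grid-generation code may differ.

library(sf)
library(dplyr)

# Hedged sketch: ~500 m hexagons (500 m between opposite edges, ~216,506 m^2 each)
# laid over the road network, keeping only cells that intersect a road.
fishnet <- st_make_grid(louisvilleRoads, cellsize = 500,
                        square = FALSE) %>%   # square = FALSE gives hexagons
  st_sf() %>%
  mutate(fishnet_id = row_number(),
         f_area_m2  = as.numeric(st_area(.)))

fishnet <- fishnet[louisvilleRoads, ]

# spatially join jam reports to cells and average jam length per cell
jam_by_cell <- st_join(jam_points, fishnet) %>%
  st_set_geometry(NULL) %>%
  group_by(fishnet_id) %>%
  summarise(avg_length = mean(length, na.rm = TRUE))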
Our modeling strategy attempts to predict the hourly jam length for each grid cell as a function of time of day, weather, and the built environment. To do so, we develop three predictive machine learning frameworks/models.
(1) Prophet time-series model
Prophet is a forecasting procedure developed by Facebook for time series data, based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
It works best with time series that have strong seasonal effects and several seasons of historical data. In our data, traffic jams show a strong seasonal pattern, repeating by time of day, week, month, and year, and they are also greatly affected by holidays and events. That is why the Prophet model is theoretically appropriate for our purpose.
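As a rough illustration, a Prophet model for a single grid cell could be fit as sketched below. cell_df is a placeholder for one cell’s hourly jam-length series, and holidays refers to the events dataframe built later in this report; the project’s actual Prophet configuration may differ.

library(prophet)
library(dplyr)

# Hedged sketch: Prophet expects a dataframe with columns 'ds' (timestamp)
# and 'y' (the value to forecast).
prophet_df <- cell_df %>%
  transmute(ds = local_time, y = length)

m <- prophet(prophet_df,
             daily.seasonality  = TRUE,
             weekly.seasonality = TRUE,
             holidays           = holidays)

# forecast the next 24 hours (freq is given in seconds for hourly steps)
future   <- make_future_dataframe(m, periods = 24, freq = 60 * 60)
forecast <- predict(m, future)
plot(m, forecast)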
(2) Random forest model
Random forests, or random decision forests, are a machine learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
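A minimal sketch of how such a model could be fit on the final panel with the randomForest package is given below; the predictor list is illustrative and train_df is an assumed training split, not the project’s exact specification.

library(randomForest)

# Hedged sketch: regress hourly jam length on a few temporal, weather, and
# built-environment predictors from the data dictionary below.
set.seed(801)
rf_fit <- randomForest(
  length ~ hour + weekend + peak + temperature + precipProbability +
    tot_blng_cnt + traffic_signal + junction + freeway,
  data = train_df,
  ntree = 500,
  importance = TRUE)

# inspect which predictors the forest relies on most
varImpPlot(rf_fit)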
(3) Mixed-effects model
Though the Prophet model is theoretically appropriate, it cannot readily accommodate spatial features, because spatial features, such as building density, do not vary over time. To incorporate the spatial features, we develop the mixed-effects model, a statistical model containing both fixed effects and random effects.
Fixed effects refer to variables that we expect to have an effect on the dependent variable, such as the built environment (e.g., building density and the distribution of intersections) and the weather (e.g., rainy or cloudy conditions).
Random effects refer to factors that we cannot control experimentally and whose influence on the prediction we are unsure of. In our model, we have two random effects. The first is the temporally lagged dependent variable (the next hour’s jam length), since traffic jams are clustered in time and space. The second is the jam count when a grid cell contains a freeway, since the count of jam reports for freeway grid cells is much larger than for non-freeway grid cells. We think these two features may influence the jam, but we are not sure how or to what degree, so we treat them as random effects.
After model training and validation, the mixed-effects model proved to be more robust, reasonable, and reliable. In this report, we briefly demonstrate the models but focus mostly on the mixed-effects model.
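A minimal sketch of a mixed-effects fit with the lme4 package is shown below. The fixed-effect terms are drawn from the data dictionary; the random terms are expressed as random intercepts over binned versions of the lagged jam length and the freeway jam count (lag_length, freeway_count, and the binned helper columns are hypothetical names, and the project’s exact specification may differ).

library(lme4)
library(dplyr)

# Hedged sketch, not the project's exact model. 'train_df' is an assumed
# training split; 'lag_length' and 'freeway_count' are assumed columns holding
# the temporally lagged jam length and the freeway jam-report count.
train_df <- train_df %>%
  mutate(lag_length_bin    = ntile(lag_length, 10),     # bin the lagged length
         freeway_count_bin = ntile(freeway_count, 10))  # bin the freeway count

me_fit <- lmer(
  length ~ peak + weekend + temperature + precipProbability +   # fixed effects
    tot_blng_cnt + traffic_signal + junction +
    (1 | lag_length_bin) + (1 | freeway_count_bin),             # random intercepts
  data = train_df)

summary(me_fit)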
For our analysis, we primarily use open data so that other municipalities with similar open data repositories are able to reproduce our analysis.
Our goal is to provide decision-makers with insight into the built-environment and regulatory factors that might indicate traffic congestion. Accordingly, the dataset we created contains six categories of variables:
Label | Description | Type | Source |
---|---|---|---|
Dependent variable | |||
length | The length of the jam | Numeric | Waze |
Temporal variables | |||
local_time | Louisville local time in ISO 8601 format | Factor | Waze |
year | The year of the jam | Numeric | Waze |
month | The month of the jam | Numeric | Waze |
day | The day of the jam | Numeric | Waze |
weekday | What weekday the jam is | Factor | Waze |
weekend | Is it a weekend or not | Dummy | Waze |
peak | Is it the peak hour or not | Dummy | Waze |
hour | The hour of the jam | Numeric | Waze |
Weather variables | |||
icon | The summary of the weather of the day | Factor | DarkskyAPI |
precipProbability | The probability of precipitation occurring | Numeric | DarkskyAPI |
temperature | The air temperature in degrees Fahrenheit | Numeric | DarkskyAPI |
humidity | The relative humidity | Numeric | DarkskyAPI |
pressure | The sea-level air pressure in millibars | Numeric | DarkskyAPI |
windSpeed | The wind speed in miles per hour | Numeric | DarkskyAPI |
snow | A snowy time | Dummy | DarkskyAPI |
hurri | A time with hurricane | Dummy | DarkskyAPI |
Hrain | A time with heavy rain | Dummy | DarkskyAPI |
Foggy | A time with fog | Dummy | DarkskyAPI |
Holiday variable | |||
holiday | What holiday it is | Factor | holidayAPI |
Built-environment variables | |||
freeway | Does the grid cover a freeway or not | Dummy | OSM |
f_area_m2 | The area of the grid cell in square meters | Numeric | OSM |
parking_ct2 | The count of parking | Numeric | OSM |
off_ct | The count of office buildings | Numeric | OSM |
ind_ct | The count of industrial land uses | Numeric | OSM |
comm_ct | The count of commercial buildings | Numeric | OSM |
res_ct | The count of residential buildings | Numeric | OSM |
ret_ct | The count of retail buildings | Numeric | OSM |
tot_blng_cnt | The count of all buildings | Numeric | OSM |
turning_cnt | The count of the turns | Numeric | OSM |
roundabt_ct | The count of roundabouts | Numeric | OSM |
sign_ct | The count of stop signs | Numeric | OSM |
cross_ct | The count of crossways | Numeric | OSM |
toll_ct | The count of tolls | Numeric | OSM |
traffic_signal | The count of traffic signals | Numeric | OSM |
junction | The count of junctions | Numeric | OSM |
Others | |||
fishnet_id | The unique ID of the grid | Numeric | Self-create |
level | The level of the jam | Factor | Waze |
count | The count of the jam report | Numeric | Waze |
The final dataset we create includes jam length observations for each grid cell location (n = 2075) for each hour of the day (n = 24) for each day of the year (n = 365). However, in our original dataset, some space/time units record no jams. To address this, we imputed those missing space/time observations with a jam length of zero to complete the dataset.
##### imputation ####
df <-
d1_1%>% #d1_1 is the full jam dataframe (Final_df) containing all the information
dplyr::select(fishnet_id) %>% #we use it just to get the fishnet ids
group_by(fishnet_id)%>%
summarise(count=n())%>%
dplyr::select(-count)%>%
split(.$fishnet_id)%>%
map_df(~data_frame(fishnet_id = .$fishnet_id[1],
local_time =seq(as.POSIXct("2018-01-01 00:00:00", tz ="UTC"),
as.POSIXct("2018-12-31 23:00:00", tz="UTC"),
by = "hour") )) %>%
mutate(local_time = with_tz(local_time,"America/Louisville"))
# credits: https://stackoverflow.com/questions/40227494/how-to-fill-in-missing-dates-in-range-by-group
#clean up weather
#join weather to main df
# bringing the jams parameters
df_1<-
d1_1 %>%
group_by(fishnet_id, year,month, day, hour, Direction, freeway) %>%
mutate(count = n()) %>%
summarise_if(is.numeric, mean) %>%
ungroup() %>%
dplyr::select(fishnet_id, count,length, level, freeway, D_north, D_south,
D_east, D_west,year ,month, day, hour) %>%
mutate( local_time= as.Date(paste(year, month,day, sep='-')), #creating local_time variable to complete the join
H = sprintf("%02d0000", hour),
H = format(strptime(H, format="%H%M%S"), format = "%H:%M:%S"),
local_time = as.POSIXct(paste(local_time,H, sep=" "),tz="America/Louisville",
format= "%Y-%m-%d %H:%M:%S")) %>%
dplyr::select(-H,-year,-month,-day,-hour)
df_2 <-
left_join(df, df_1, by=c("fishnet_id","local_time")) %>%
replace(., is.na(.), 0) %>% #imputing the NAs created by the join as 0
mutate( weekend = weekend(local_time),
peak = peakHour(local_time)) %>% #creating weekend and peak hour variables
dplyr::select(-local_time)
#adding in holidays
#adding in built-characteristics
data.frame('Before' = c(319648, 100),
'After' = c(4670400, 100),
row.names = c('Data rows', 'Dates covered')) %>%
kable(., caption = "Table 3: Before and After the Imputation") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F)
Before | After | |
---|---|---|
Data rows | 319648 | 4670400 |
Dates covered | 100 | 100 |
We wrangled the original jam data to flag peak-hour versus off-peak-hour jams, as well as weekday versus weekend jams.
weekend <- function(ds) {
dates <- as.Date(ds)
weekday<- weekdays(dates)
as.numeric((weekday=="Sunday")|(weekday=="Saturday"))
}
peakHour <- function(ds) {
dates <- as.POSIXct(ds)
hour <- hour(dates)
weekday <- weekend(ds)
as.numeric(((hour >= 7 & hour <= 10) | (hour >= 18 & hour <= 20)) &
(weekday == 0))
}
We wrangled the original weather data to define rainy, cloudy, and clear conditions. For the weather data, we use the Dark Sky API to collect hourly weather for the City of Louisville for the year 2018. In detail, we first create a dataframe of hourly timestamps, then create a dataframe with the requisite columns for the weather data. After that, we make requests using the Dark Sky API key to obtain the full year’s weather data.
With the whole year’s hourly weather data, we define rainy, cloudy, and clear conditions using the attributes we collected.
#OBTAIN WEATHER DATA FROM DARK SKY API
#LOAD API KEYS
rwunderground::set_api_key("YOUR_API_KEY")
alerts(set_location(territory = "Kentucky", city = "Louisville"))
darksky_api_key("YOUR_DARKSKY_API_KEY")
#CREATE A DATFRAME FOR ALL HOURS IN A DAY FOR THE WHOLE YEAR OF 2018
#create dataframe of hourly timestamps
dateDF <- seq(as.POSIXct("2018-01-01 00:00:00", tz = "America/Louisville"),
as.POSIXct("2018-12-31 23:00:00", tz = "America/Louisville"),
by = "hour") %>%
as.data.frame()
#rename the timestamp column name
names(dateDF)[1] <- paste("dateTimeObject")
#create a column for dates in string/character format
dateDF <- dateDF %>%
mutate(dateString = format.Date(dateTimeObject))
#CREATE A DATAFRAME WITH THE REQUISITE COLUMNS FOR THE WEATHER DATA
#make a request for one day to get the columns for weather data
weatherRequest<-as.data.frame(get_forecast_for(38.201033, -85.651970,
"2018-02-01T12:00:00") [["hourly"]])
#make a dataframe with weather columns
weather <- head(weatherRequest,0)
#GET THE WEATHER DATA FOR THE ENTIRE YEAR
#use 'for' loop to loop though every day of the year
for (i in seq(1, length(dateDF$dateString), by=24)) {
#code to get the date/timestamp in the darksky API format
#get the date from the timestamp dataframe
date <- dateDF$dateString[i]
#separate the date and time
dateSeparated <- paste(date, sep=" ")
#subset the date from the separated date string
dateAPI <- stringi::stri_sub(dateSeparated, 1, 10)
#subset the time from the separated date string
timeAPI <- stringi::stri_sub(dateSeparated, 12, 19)
#create a vector of the date and time strings
dateVector <- c(dateAPI, timeAPI)
#join the two elements of the date vector using "T"
joinedAPI <- paste(dateVector, collapse="T")
#get the daily weather by hour data and assign it to a dataframe
weatherDaily <- as.data.frame(get_forecast_for(38.201033, -85.651970,
joinedAPI) [["hourly"]])
#smartbind it to the weather dataframe created earlier
weather <- smartbind(weather, weatherDaily)
}
#convert any NA values to zero
weather[is.na(weather)] <- 0
df<- Final_df %>%
filter(fishnet_id == 3206)%>%
dplyr::select(-jam_id, -X.x, -Unique_ID, -f_area_m2, -jam_id_1, -uuid,
-id, -latitude, -longitude, -order, -irregularity_id,
-coordinate_type_id, -value, -name) %>%
mutate(precipProbability = if_else(is.na(precipProbability),0,precipProbability),
rain = if_else(precipProbability > 0.1,1,0),
poured = if_else(precipProbability > 0.5,1,0),
temperature = if_else(is.na(temperature),64,temperature),
hot = if_else(temperature > 90,1,0),
humidity = if_else(is.na(humidity),0.75,humidity),
pressure = if_else(is.na(pressure),1018.5,pressure),
windSpeed = if_else(is.na(windSpeed),2.445,windSpeed),
windy = if_else(windSpeed > 25,1,0),
cloudCover = if_else(is.na(cloudCover),0.552,cloudCover),
clear_sky = if_else(cloudCover <= 0.1,1,0),
partly_cloudy = if_else(cloudCover > 0.1 & cloudCover <= 0.25,1,0),
mostly_cloudy = if_else(cloudCover > 0.25 & cloudCover <= 0.75,1,0),
cloudy = if_else(cloudCover > 0.75,1,0),
visibility = if_else(is.na(visibility),7.429,visibility),
clear = if_else(visibility >= 8,1,0),
med_viz = if_else(visibility < 8 & visibility >= 4,1,0),
low_viz = if_else(visibility < 4,1,0),
Holiday = if_else(is.na(Holiday),0,1),
pub_utc_date = ymd_hms(as.character(pub_utc_date)),
local_time = with_tz(pub_utc_date,"America/Louisville")
)
To wrangle the built-environment variables, we first obtain OpenStreetMap (OSM) features, aggregate them by grid cell, and join them to the grid cell dataframe. We use ‘for’ loops to obtain and join the OSM features, looping through the list of column names for the features that will be joined to the fishnet. Please see the code in the following block.
#OBTAIN DATA FROM OPEN STREET MAP (OSM)
#CREATE BUFFER AROUND EACH FISHNET GRID CELL
#set buffer area (in meters)
buffer_area <- 200
#create a buffer aroud each fishnet grid cell
fishnetBuffer <- st_buffer(fishnet, buffer_area)
#CREATE LIST OF THE OSM FEATURES THAT WILL BE OBTAINED USING THE 'OSMDATA' R PACKAGE
#list for OSM keys
key_lst = list("highway", "highway", "highway", "highway", "highway", "highway",
"barrier",
"amenity",
"building",
"landuse", "landuse", "landuse", "landuse")
#list for OSM values
val_lst = list("turning_circle", "mini_roundabout", c("stop", "give_way"), "crossing", "traffic_signal", "motorway_junction",
"toll_booth",
c("parking", "parking_entrance", "parking_space"),
"office",
"industrial", "commercial", "residential", "retail")
#list for column names that will be used for the OSM features joined to the fishnet dataframe
col_lst = list("turning_ct", "roundabt_ct", "sign_ct", "cross_ct", "traffic_signal", "junction",
"toll_ct")
#list for column names that will be used for the OSM features joined to the fishnetBuffer dataframe
col_lst_buffer = list("parking_ct",
"off_ct",
"ind_ct", "comm_ct", "res_ct", "ret_ct")
#OBTAIN OSM FEATURES, AGGREGATE BY FISHNET GRID CELL AND JOIN TO FISHNET DATAFRAME
#IN ORDER TO AVOID API QUERY TIME OUT, PLEASE RUN THIS CODE WHEN OSM API TRAFFIC IS LOW. OTHERWISE, RUN THE CODE FOR EACH OSM FEATURE SEPARATELY WITHOUT USING THE 'FOR' LOOP.
#use 'for' loop to obtain and join OSM features
#initialize the output dataframe so each feature joined in the loop is retained
fishnetOSM <- fishnet
#loop through the length of the column names list for features that will be joined to the fishnet
for (i in 1:length(col_lst)) {
#get key from key_list
key_inp = key_lst[[i]]
#get value from val_list
val_inp = val_lst[[i]]
#get osm data using the appropriate key and value
osm_data <- add_osm_feature(opq = osm_cityBB, key=key_inp, value=val_inp)
#convert osm dataframe to a sf object
osm_data_sf <- osmdata_sf(osm_data)
#get geometry for osm sf (point geometry)
osm_data_pt <- st_geometry(osm_data_sf$osm_points)
#set the coordinate reference system - UTM 16N NAD83 meters
osm_data_pt <-
osm_data_pt %>%
st_sf() %>%
st_as_sf(., coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(., crs = 32616)
#clip osm data to jefferson county
osm_data_clipped <- osm_data_pt[jeffersonCounty, ]
#aggregate OSM data to each fishnet grid cell
osm_data_sum <- st_intersection(fishnet, osm_data_clipped) %>%
group_by(fishnet_id) %>%
summarise(count = n())
#drop geometry
osm_data_sum_df <- osm_data_sum %>%
st_set_geometry(NULL)
#join aggregated OSM data back to the accumulating fishnetOSM dataframe
fishnetOSM <- fishnetOSM %>%
left_join(., osm_data_sum_df, by = "fishnet_id")
}
#use 'for' loop to change column names for newly added OSM features
#the first two columns are the fishnet ID and fishnet area, so the i-th joined feature sits in column i + 2
for (i in 1:length(col_lst)) {
#obtain column name for the OSM feature
col_nam = col_lst[[i]]
#change the name of the column
names(fishnetOSM)[i + 2] <- paste(col_nam)
}
#check column names
names(fishnetOSM)
#remove osm data frames and free up memory space
rm(osm_data)
rm(osm_data_clipped)
rm(osm_data_pt)
rm(osm_data_sf)
rm(osm_data_sum)
rm(osm_data_sum_df)
#OBTAIN OSM FEATURES, AGGREGATE BY FISHNETBUFFER GRID CELL AND JOIN TO FISHNETBUFFER DATAFRAME
#IN ORDER TO AVOID API QUERY TIME OUT, PLEASE RUN THIS CODE WHEN OSM API TRAFFIC IS LOW. OTHERWISE, RUN THE CODE FOR EACH OSM FEATURE SEPARATELY WITHOUT USING THE 'FOR' LOOP.
#use 'for' loop to obtain and join OSM features
#loop through the length of the column names list for features that will be joined to the fishnetBuffer
for (i in 1:length(col_lst_buffer)) {
#get key from key_lst (the buffer features follow the first length(col_lst) entries in the lists)
key_inp = key_lst[[i + length(col_lst)]]
#get value from val_lst
val_inp = val_lst[[i + length(col_lst)]]
#get osm data using the appropriate key and value
osm_data <- add_osm_feature(opq = osm_cityBB, key=key_inp, value=val_inp)
#convert osm dataframe to a sf object
osm_data_sf <- osmdata_sf(osm_data)
#get geometry for osm sf (point geometry)
osm_data_pt <- st_geometry(osm_data_sf$osm_points)
#set the coordinate reference system - UTM 16N NAD83 meters
osm_data_pt <-
osm_data_pt %>%
st_sf() %>%
st_as_sf(., coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(., crs = 32616)
#clip osm data to jefferson county
osm_data_clipped <- osm_data_pt[jeffersonCounty, ]
#aggregate OSM data to each buffered fishnet grid cell
osm_data_sum <- st_intersection(fishnetBuffer, osm_data_clipped) %>%
group_by(fishnet_id) %>%
summarise(count = n())
#drop geometry
osm_data_sum_df <- osm_data_sum %>%
st_set_geometry(NULL)
#join aggregated OSM data back to fishnet
fishnetBuffer <- fishnetBuffer %>%
left_join(., osm_data_sum_df, by = "fishnet_id")
}
#use 'for' loop to change column names for newly added OSM features
#the first two columns are the fishnet ID and fishnet area, so the i-th joined feature sits in column i + 2
for (i in 1:length(col_lst_buffer)) {
#obtain column name for the OSM feature
col_nam = col_lst_buffer[[i]]
#change the name of the column
names(fishnetBuffer)[i + 2] <- paste(col_nam)
}
#check column names
names(fishnetBuffer)
#remove osm data frames and free up memory space
rm(osm_data)
rm(osm_data_clipped)
rm(osm_data_pt)
rm(osm_data_sf)
rm(osm_data_sum)
rm(osm_data_sum_df)
#OBTAIN AND ADD OSM BUILDING DATA (TOTAL NUMBER OF BUILDINGS) TO FISHNETBUFFER DATAFRAME
#PLEASE NOTE THAT THE OSM BUILDING DATA IS A VERY LARGE FILE AND THAT IT TAKES AN EXTREMELY LONG TIME TO DOWNLOAD THE DATA USING THE OSM OVERPASS QUERY. THEREFORE, WE RECOMMEND DOWNLOADING THE OSM BUILDING DATA SHAPEFILE FROM THE OSM GEOFABRIK WEBSITE (https://download.geofabrik.de/north-america.html) AND LOADING THE DATASET AS A SHAPEFILE LOCALLY.
#load building OSM shapefile using 'sf' package
buildings_kentucky <- read_sf("kentuckyOSM/gis_osm_buildings_a_free_1.shp")
#set the coordinate reference system - UTM 16N NAD83 meters
buildings_kentucky <- buildings_kentucky %>%
st_sf() %>%
st_as_sf(., coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(., crs = 32616) %>%
st_sf()
#clip OSM building data to Jefferson County
building_clipped <- buildings_kentucky[jeffersonCounty, ]
#aggregate OSM building data to each buffered fishnet grid cell
building_clipped_sum <- st_intersection(fishnetBuffer, building_clipped) %>%
group_by(fishnet_id) %>%
summarise(count = n())
#drop geometry
building_clipped_sum_df <- building_clipped_sum %>%
st_set_geometry(NULL)
#join aggregated building data back to fishnetBuffer
fishnetBuffer <- fishnetBuffer %>%
left_join(., building_clipped_sum_df, by = "fishnet_id")
#get column number for the newly added OSM building data feature in the fishnetBuffer dataframe
blng_col_num = ncol(fishnetBuffer) - 1
#change column name for the OSM building data feature
names(fishnetBuffer)[blng_col_num]<-paste("tot_blng_cnt")
#check column names in fishnetBuffer dataframe
names(fishnetBuffer)
#JOIN OSM FEATURES AGGREGATED TO FISHNETBUFFER TO FISHNET DATAFRAME
#drop geometry for fishnetBuffer dataframe
fishnetBuffer_df <- fishnetBuffer %>%
st_set_geometry(NULL)
#select relevant columns (i.e., only the OSM features) in the fishnetBuffer and join to fishnet by fishnet ID.
fishnetOSM <- fishnetBuffer_df %>%
dplyr::select(1, 3:blng_col_num) %>%
left_join(., fishnetOSM, by = "fishnet_id")
#convert all NA values to zero
fishnetOSM[is.na(fishnetOSM)] <- 0
We obtain holiday and event dates and join them to the dataframe.
Events <- data_frame(
holiday = 'Kentucky Derby Fest',
ds = as.Date(c('2018-04-26','2018-04-27','2018-04-28','2018-04-29',
'2018-04-30','2018-05-01','2018-05-02','2018-05-03',
'2018-05-04')),
lower_window = 0,
upper_window = 1
)
OEvents <- data_frame(
holiday = 'Abbey Road Music Fest',
ds = as.Date(c('2018-05-23','2018-05-24','2018-05-25')),
lower_window =0,
upper_window=1
)
OEvents2 <- data_frame(
holiday = 'Boat show',
ds = as.Date(c('2018-01-24','2018-01-25','2018-01-26',
'2018-01-27')),
lower_window =0,
upper_window=1
)
OEvents3 <- data_frame(
holiday = 'tailSpin Ale Fest',
ds = as.Date(c('2018-02-17')),
lower_window =0,
upper_window=1
)
holidays <- bind_rows(Events, OEvents,OEvents2,OEvents3)
rm(Events, OEvents, OEvents2,OEvents3)
#this has cummulative data from holiday csv and the newly created events holiday dataframe above.
hol <- as.data.frame(
read_csv("F:/BACKUP/MUSA801_Practicum/Dataset/Holiday2018.csv")%>%
mutate( ds= as.Date(substr(time, 1, 10)))%>%
dplyr::select(-time,-X1) %>%
subset(Holiday != "NA")%>%
group_by(Holiday, ds)%>%
summarise(count=n())%>%
mutate(lower_window = 0,
upper_window = 1)%>%
rename(holiday= `Holiday`)%>%
dplyr::select(-count)
)%>%
rbind(holidays)
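The join of these dates back onto the hourly panel is not shown above; one way it could be done is sketched below (a hedged sketch that assumes the panel still carries a local_time column at this point in the workflow; the project’s actual join may differ).

library(dplyr)

# Hedged sketch: attach holiday/event labels to the hourly panel by date,
# using the 'hol' dataframe built above.
df_2 <- df_2 %>%
  mutate(ds = as.Date(local_time)) %>%
  left_join(hol %>% dplyr::distinct(ds, holiday), by = "ds") %>%
  mutate(holiday = if_else(is.na(holiday), "None", holiday)) %>%
  dplyr::select(-ds)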
In this section, we first look at the distribution of the dependent variable, jam length. Figure 3 shows the distribution of jam length after the imputation (left) and for the original, non-zero records (right). These graphs give us a big picture of the distribution of our dependent variable.
Figure 3: The statistical distribution of jam length
img3 <- rasterGrob(as.raster(readPNG("jam length_non 0.png")), interpolate = FALSE)
img4 <- rasterGrob(as.raster(readPNG("jam length_imputate 0.png")), interpolate = FALSE)
grid.arrange(img4, img3, ncol = 2)
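The two panels above are loaded as saved images; a minimal sketch of how the same histograms could be drawn directly from the data is given below (df_2 is the imputed panel and d1_1 the original jam records, as used earlier; the bin count is arbitrary).

library(ggplot2)
library(gridExtra)

# Hedged sketch of Figure 3: jam length with imputed zeros (left) and for the
# original, non-zero jam records (right).
p_imputed <- ggplot(df_2, aes(x = length)) +
  geom_histogram(bins = 50) +
  labs(title = "With imputed zeros", x = "Jam length (m)", y = "Count") +
  theme_minimal()

p_original <- ggplot(d1_1, aes(x = length)) +
  geom_histogram(bins = 50) +
  labs(title = "Original records", x = "Jam length (m)", y = "Count") +
  theme_minimal()

grid.arrange(p_imputed, p_original, ncol = 2)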
To understand how jam length varies by time of day, we choose June 2018 as an example and visualize the variation in an animated chart. The animated Figure 4 suggests including the “weekday” and “peak/off-peak hour” variables.
First, weekdays and weekends have clearly different jam patterns. As shown in Figure 4, weekdays (Monday to Friday) have conspicuous morning and afternoon peak hours, and generally small jam lengths from 0:00 AM to 6:00 AM. Saturday and Sunday, in comparison, have a high prevalence of jams from 0:00 AM to 6:00 AM and almost no morning peak.
Second, the daily pattern clearly shows peak hours. Weekdays have two peak periods: the morning peak generally runs from 7:00 AM to 10:00 AM, and the afternoon peak from 6:00 PM to 8:00 PM. Weekends do not have a clear morning peak, but starting from 12:00 PM, their jam length increases steadily.
library(gganimate)
## New data
aggregated <-
jams_June %>%
mutate(day = wday(pub_utc_date, label = TRUE),
month = month(pub_utc_date, label = TRUE),
num = day(pub_utc_date),
year = year(pub_utc_date),
hour = hour(pub_utc_date)) %>%
mutate(string = glue("{day}, {month} {num}, {year}")) %>%
mutate(time = format(pub_utc_date, '%A, %B %d')) %>%
group_by(day, hour) %>%
summarise(length = mean(length),
delay = mean(delay),
speed = mean(speed))
## Animating it
animate <-
ggplot(aggregated,
aes(hour, length, group = day, color = day)) +
geom_line(size = 2) +
geom_segment(aes(xend = 24, yend = length), linetype = 2, colour = 'grey') +
geom_text(aes(x = 24.1, label = day), hjust = 0, show.legend = FALSE) +
scale_colour_manual(values = colour_function(7),
name = "day",
guide = guide_days) +
labs(x = "Time (in hours)", y = "Jam Length (in meters)") +
theme_hor() +
theme(legend.position = 'bottom') +
transition_reveal(hour)
## Saving
anim_save("length.gif", animation = animate,
height = 500, width = 800, nframes = 120, fps = 5,
start_pause = 2, end_pause = 20)
Figure 4: Average hourly jam length for each weekday
Next, an interactive chart (Figure 5) shows the jam length variation by time of day in more detail, separating the data by month. From the interactive chart, we can see that the peak and off-peak hours of each month are similar. Thus, it is safe to use the same morning peak and afternoon peak hours for the whole year. The similar pattern between months also validates the jam’s monthly seasonality.
Figure 5: 2018 hourly jam length for each weekday, by month
# The csv file (byhourJoinedFishWoutDirWoutFriWay.csv) is our final dataset, we aggregate it by month, weekday and hour to make the interactive plot.
# noDir3<-read.csv("byhourJoinedFishWoutDirWoutFriWay.csv") %>%
# group_by(month, weekday, hour) %>%
# summarize(sum_length = sum(length),
# sum_count = sum(count))
p2 <- read.csv("noDir3.csv") %>%
plot_ly(
x = ~hour,
y = ~sum_length,
frame = ~month,
color = ~weekday,
colors = c("#21BFD7", "#24ABDF", "#2897E7","#2C84F0","#5464F3","#7D45F7","#A626FB"),
size = ~sum_count,
type = 'scatter',
mode = 'lines+markers',
showlegend = T
)
p2
In the previous section, we saw how jam length varies across hours. Here, we zoom in to a specific month, June 2018, to see how jam length varies across days. Figure 6 shows the jam length variation over the 30 days of June 2018. Through the graph, we observe that the daily trend is quite similar across the month, meaning the daily seasonality is strong. Meanwhile, there are peaks at some specific moments, which means other effects are also impacting traffic performance.
data2<-read.csv("byhourJoinedFishWoutDirWoutFriWay.csv") %>%
group_by(month, weekday) %>%
summarize(sum_length = sum(length)) %>%
ungroup()
table(data2$weekday)
level_order <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday')
data2 %>%
ggplot(.) +
geom_line(aes(x = ordered(weekday, level = level_order),
y = sum_length, group = 1,
color = factor(month)) )+
facet_wrap(~month)+
labs(title="JAM LENGTH BY THE WEEKDAY OF MONTH",
x="WEEKDAY",
y="JAM LENGTH") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Figure 6: 2018 June daily jam length variance
Overall, the most congested day is Monday. The other days, except Sunday, have similar summed jam lengths, and Sunday is the least congested day in terms of summed jam length. However, this plot is biased by how many jam reports were recorded on each day: the more non-imputed jam reports a day has, the more congested it appears. Even so, we can offer a guess about the Sunday nadir: as a weekend and worship day, Sunday traffic is likely to be relatively smooth.
Figure 7: Sum of jam length for each weekday
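The code behind Figure 7 is not shown above; a minimal sketch of how it could be reproduced from the final dataset is given below (it reuses the level_order vector defined earlier; the project’s actual plotting code may differ).

library(dplyr)
library(ggplot2)

# Hedged sketch of Figure 7: total jam length summed by weekday.
read.csv("byhourJoinedFishWoutDirWoutFriWay.csv") %>%
  group_by(weekday) %>%
  summarize(sum_length = sum(length)) %>%
  ggplot(aes(x = ordered(weekday, levels = level_order), y = sum_length)) +
  geom_col() +
  labs(title = "SUM OF JAM LENGTH BY WEEKDAY",
       x = "WEEKDAY", y = "JAM LENGTH (M)") +
  theme_minimal()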
Observing the data by weekday and by month, we notice that most months have a similar pattern, with some peaks and outliers. Specifically, from Monday to Wednesday, most months show a rising trend in traffic congestion, with increasing jam length. Thursday and Friday jams vary by month: some months are more congested on those days, while others show a falling trend. On weekends, nearly all months are more congested on Saturday than on Sunday.
Figure 8 shows a similar but still varied jam trend across the twelve months of the year, which suggests that, beyond temporal seasonality, other features, such as spatial features, also affect traffic performance in Louisville.
Figure 8: 2018 Jam length by the weekday of month
From the above exploration, we find that jams have strong seasonality. Now, we aggregate the months into seasons and explore the weather characteristics of each season. Our exploration is based on the premise that traffic jams are influenced by weather; some conditions, like heavy rain and snow, have a negative effect on traffic operations:
In Figure 9, the y-axis shows different weather conditions and the x-axis shows the average jam length recorded under each condition. As the summary graph shows, in Louisville, spring and summer have very little extreme weather that might heavily influence traffic. However, there is relatively frequent heavy rain in the winter and snow in the fall. Figure 9 indicates that we should include weather as a predictor in our model.
Figure 9: Impact of climate
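The seasonal aggregation behind Figure 9 is not shown; a minimal sketch is given below (it assumes the hourly panel carries the month and icon columns from the data dictionary, and the season grouping is illustrative; the project’s actual summary may differ).

library(dplyr)
library(ggplot2)

# Hedged sketch of Figure 9: average jam length by weather condition and season.
df_2 %>%
  mutate(season = case_when(month %in% c(12, 1, 2) ~ "Winter",
                            month %in% 3:5         ~ "Spring",
                            month %in% 6:8         ~ "Summer",
                            TRUE                   ~ "Fall")) %>%
  group_by(season, icon) %>%
  summarize(avg_length = mean(length, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(x = avg_length, y = icon)) +
  geom_col() +
  facet_wrap(~season) +
  labs(x = "Average jam length (m)", y = "Weather condition") +
  theme_minimal()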
The animated map in Figure 10 shows the distribution of jams in June 2018 across the City of Louisville’s primary, secondary, and tertiary roads. The map conveys three pieces of information:
This indicates that, when making predictions, we should take spatial variation into consideration.
Figure 10: 2018 June jam length animation across Louisville
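The animation code for this map is not included above; a rough sketch with gganimate is given below (jam_points_june and louisvilleRoads are placeholder sf objects with the columns noted in the comments; the project’s actual map code may differ).

library(ggplot2)
library(gganimate)
library(sf)

# Hedged sketch of an hourly animated jam map for June 2018. 'jam_points_june'
# is an assumed sf point layer of June jam reports with 'length' and 'hour'
# columns; 'louisvilleRoads' is an assumed road layer in the same CRS.
map_anim <- ggplot() +
  geom_sf(data = louisvilleRoads, color = "grey80", size = 0.3) +
  geom_sf(data = jam_points_june, aes(color = length), size = 0.8) +
  scale_color_viridis_c(name = "Jam length (m)") +
  labs(title = "Louisville jams, June 2018 - hour {frame_time}") +
  theme_void() +
  transition_time(hour)

anim_save("jam_map.gif", animation = map_anim,
          height = 500, width = 800, nframes = 120, fps = 5)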