The project was created in association with the MUSA 801 Practicum at the University of Pennsylvania, taught by Ken Steif, Karl Dailey, and Michael Fichman. We would like to thank Matthew Harris for providing guidance and support to help us develop the model. All products from this class should be considered proofs of concept and works in progress.
This document begins with the Introduction, then dives into data exploration and feature engineering. After laying out the concepts and the data, we move on to the model building and model validation sections.
Transportation congestion is an enormous threat to our economic prosperity and way of life. Whether it takes the form of cars or trucks stalled in traffic or cargo stuck at overwhelmed seaports, congestion costs America an estimated $200 billion a year. Beyond these immediate costs, transportation delay and unreliability have begun to chip away at one of our nation’s most important economic assets: businesses’ freedom of location and their ability to quickly reach customers across the city. In particular, when it comes to avoiding traffic congestion, the question of when and where to hold an event deserves attention. While a great deal of study focuses on the negative externalities of congestion, few cities have been able to launch strategies to answer this question. Predictive modeling opens a new window and offers an innovative approach to do so. Our aim is to provide insight into the degree to which built-environment characteristics of the road network, and their interaction in space and time, can be used as predictors of jam occurrence. If successful, we hope to provide transportation professionals and policy makers with a new planning tool for informing and targeting their preparation and planning for an event.
For this study, we train a space/time traffic congestion forecasting model using space/time traffic jam observations in Louisville, KY provided by the Waze Connected Citizens project. Predictions from this model are then input into a proof-of-concept application aimed at helping cities like Louisville manage traffic congestion.
A planned event is a public activity with a scheduled time and location whose performance is greatly affected by transportation system operations, since event staging drives travel demand and congestion. Event planning has become increasingly sensitive to traffic congestion because of its effects on people’s quality of life, movement efficiency, and travel experience. In a recent national survey, delay caused by traffic congestion was people’s top concern when attending an event.
One group with a strong interest in understanding traffic congestion around special events is transportation engineers and planners, who are in charge of day-of-event traffic management. Before an event is scheduled, an hourly prediction of congestion illuminates the best time and location to hold it. On the event day, engineers and planners are responsible for monitoring and maintaining traffic flow through their jurisdiction. With a prediction of day-of-event traffic congestion, they can better develop and review traffic management plans to accommodate anticipated fluctuations in traffic demand.
Presently, planners site an event or pursue interventions based on past congestion occurrences or general knowledge from experience, without an accurate, systematic numeric prediction of traffic congestion for the selected time and location. Our approach seeks to predict where, when, and to what degree traffic congestion is likely to occur based on the characteristics of each site and the weather at the scheduled time, thereby giving transportation engineers and planners the means to be more proactive in their interventions and traffic operations.
Figure 1: Road congestion during the 2015 Kentucky Derby in Louisville (Source: The Courier-Journal, April 28, 2015)
Attribute | Description |
---|---|
id | Jam ID. The unique id of each jam report |
pub_utc_date | The UTC time of the jam’s occurrence |
speed | Current average speed on jammed segments, in meters per second |
length | Jam length in meters |
delay | Delay of jam compared to free flow speed in seconds |
road type | Road type of the jammed segments |
level | Traffic congestion level (0 = free flow; 5 = blocked) |
blocking alter id | Whether the jam is connected to a blocking alert or not |
latitude | Latitude of the jam report |
longitude | Longitude of the jam report |
The first step of the prediction is to aggregate the jam data to the Louisville road network. We do so by creating 2,075 hexagonal grid cells with a 500-meter radius (about 216,506 square meters each) that cover the primary, secondary, and tertiary roads in the City of Louisville, and spatially aggregating the jam data to the grid cells. As a result, 1,946 grid cells have jam data attached. We calculated the average jam length/speed/delay for each grid cell and took it as the dependent variable in this study. All additional data we collect is aggregated to these grid cells. Figure 2 shows the aggregated jam length by grid cell, indicating that the urban arterials connecting the inner city with the outskirts, and the east-west corridors, are most likely to be congested.
In detail, we created the grid cell system through four steps:
Figure 2: 2018 Average jam length in meters by grids
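The scripts for these steps are not reproduced here, but a minimal sketch of how such a hexagonal fishnet could be built and the jams aggregated with the sf package is shown below. louisvilleRoads and jam_points are placeholder objects in a projected CRS (UTM 16N, meters); the project’s actual grid-generation code may differ.

library(sf)
library(dplyr)

# Hedged sketch: ~500 m hexagons (500 m between opposite edges, ~216,506 m^2 each)
# laid over the road network, keeping only cells that intersect a road.
fishnet <- st_make_grid(louisvilleRoads, cellsize = 500,
                        square = FALSE) %>%   # square = FALSE gives hexagons
  st_sf() %>%
  mutate(fishnet_id = row_number(),
         f_area_m2  = as.numeric(st_area(.)))

fishnet <- fishnet[louisvilleRoads, ]

# spatially join jam reports to cells and average jam length per cell
jam_by_cell <- st_join(jam_points, fishnet) %>%
  st_set_geometry(NULL) %>%
  group_by(fishnet_id) %>%
  summarise(avg_length = mean(length, na.rm = TRUE))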
Our modeling strategy attempts to predict the hourly jam length for each grid cell as a function of time of day, weather, and the built environment. To do so, we develop three predictive machine learning frameworks/models.
(1) Prophet time-series model
Prophet is a forecasting procedure developed by Facebook for time series data, based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
It works best with time series that have strong seasonal effects and several seasons of historical data. In our data, traffic jams show a strong seasonal pattern, repeating by time of day, week, month, and year, and they are also greatly affected by holidays and events. That is why the Prophet model is theoretically appropriate for our purpose.
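As a rough illustration, a Prophet model for a single grid cell could be fit as sketched below. cell_df is a placeholder for one cell’s hourly jam-length series, and holidays refers to the events dataframe built later in this report; the project’s actual Prophet configuration may differ.

library(prophet)
library(dplyr)

# Hedged sketch: Prophet expects a dataframe with columns 'ds' (timestamp)
# and 'y' (the value to forecast).
prophet_df <- cell_df %>%
  transmute(ds = local_time, y = length)

m <- prophet(prophet_df,
             daily.seasonality  = TRUE,
             weekly.seasonality = TRUE,
             holidays           = holidays)

# forecast the next 24 hours (freq is given in seconds for hourly steps)
future   <- make_future_dataframe(m, periods = 24, freq = 60 * 60)
forecast <- predict(m, future)
plot(m, forecast)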
(2) Random forest model
Random forests, or random decision forests, are a machine learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
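A minimal sketch of how such a model could be fit on the final panel with the randomForest package is given below; the predictor list is illustrative and train_df is an assumed training split, not the project’s exact specification.

library(randomForest)

# Hedged sketch: regress hourly jam length on a few temporal, weather, and
# built-environment predictors from the data dictionary below.
set.seed(801)
rf_fit <- randomForest(
  length ~ hour + weekend + peak + temperature + precipProbability +
    tot_blng_cnt + traffic_signal + junction + freeway,
  data = train_df,
  ntree = 500,
  importance = TRUE)

# inspect which predictors the forest relies on most
varImpPlot(rf_fit)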
(3) Mixed-effects model
Though the Prophet model is theoretically appropriate, it cannot readily accommodate spatial features, because spatial features, such as building density, do not vary over time. To incorporate the spatial features, we develop the mixed-effects model, a statistical model containing both fixed effects and random effects.
Fixed effects refer to variables that we expect to have an effect on the dependent variable, such as the built environment (e.g., building density and the distribution of intersections) and the weather (e.g., rainy or cloudy conditions).
Random effects refer to factors that we cannot control experimentally and whose influence on the prediction we are unsure of. In our model, we have two random effects. The first is the temporally lagged dependent variable (the next hour’s jam length), since traffic jams are clustered in time and space. The second is the jam count when a grid cell contains a freeway, since the count of jam reports for freeway grid cells is much larger than for non-freeway grid cells. We think these two features may influence the jam, but we are not sure how or to what degree, so we treat them as random effects.
After model training and validation, the mixed-effects model proved to be more robust, reasonable, and reliable. In this report, we briefly demonstrate the models but focus mostly on the mixed-effects model.
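A minimal sketch of a mixed-effects fit with the lme4 package is shown below. The fixed-effect terms are drawn from the data dictionary; the random terms are expressed as random intercepts over binned versions of the lagged jam length and the freeway jam count (lag_length, freeway_count, and the binned helper columns are hypothetical names, and the project’s exact specification may differ).

library(lme4)
library(dplyr)

# Hedged sketch, not the project's exact model. 'train_df' is an assumed
# training split; 'lag_length' and 'freeway_count' are assumed columns holding
# the temporally lagged jam length and the freeway jam-report count.
train_df <- train_df %>%
  mutate(lag_length_bin    = ntile(lag_length, 10),     # bin the lagged length
         freeway_count_bin = ntile(freeway_count, 10))  # bin the freeway count

me_fit <- lmer(
  length ~ peak + weekend + temperature + precipProbability +   # fixed effects
    tot_blng_cnt + traffic_signal + junction +
    (1 | lag_length_bin) + (1 | freeway_count_bin),             # random intercepts
  data = train_df)

summary(me_fit)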
For our analysis, we primarily use open data so that other municipalities with similar open data repositories are able to reproduce our analysis.
Our goal is to provide decision-makers with insight into the built-environment and regulatory factors that might indicate traffic congestion. Accordingly, the dataset we created contains six categories of variables:
Label | Description | Type | Source |
---|---|---|---|
Dependent variable | |||
length | The length of the jam | Numeric | Waze |
Temporal variables | |||
local_time | Louisville local time in ISO 8601 format | Factor | Waze |
year | The year of the jam | Numeric | Waze |
month | The month of the jam | Numeric | Waze |
day | The day of the jam | Numeric | Waze |
weekday | What weekday the jam is | Factor | Waze |
weekend | Is it a weekend or not | Dummy | Waze |
peak | Is it the peak hour or not | Dummy | Waze |
hour | The hour of the jam | Numeric | Waze |
Weather variables | |||
icon | The summary of the weather of the day | Factor | DarkskyAPI |
precipProbability | The probability of precipitation occurring | Numeric | DarkskyAPI |
temperature | The air temperature in degrees Fahrenheit | Numeric | DarkskyAPI |
humidity | The relative humidity | Numeric | DarkskyAPI |
pressure | The sea-level air pressure in millibars | Numeric | DarkskyAPI |
windSpeed | The wind speed in miles per hour | Numeric | DarkskyAPI |
snow | A snowy time | Dummy | DarkskyAPI |
hurri | A time with hurricane | Dummy | DarkskyAPI |
Hrain | A time with heavy rain | Dummy | DarkskyAPI |
Foggy | A time with fog | Dummy | DarkskyAPI |
Holiday variable | |||
holiday | What holiday it is | Factor | holidayAPI |
Built-environment variables | |||
freeway | Does the grid cover a freeway or not | Dummy | OSM |
f_area_m2 | The area of the grid cell in square meters | Numeric | OSM |
parking_ct2 | The count of parking | Numeric | OSM |
off_ct | The count of office buildings | Numeric | OSM |
ind_ct | The count of industrial land uses | Numeric | OSM |
comm_ct | The count of commercial buildings | Numeric | OSM |
res_ct | The count of residential buildings | Numeric | OSM |
ret_ct | The count of retail buildings | Numeric | OSM |
tot_blng_cnt | The count of all buildings | Numeric | OSM |
turning_cnt | The count of the turns | Numeric | OSM |
roundabt_ct | The count of roundabouts | Numeric | OSM |
sign_ct | The count of stop signs | Numeric | OSM |
cross_ct | The count of crossways | Numeric | OSM |
toll_ct | The count of tolls | Numeric | OSM |
traffic_signal | The count of traffic signals | Numeric | OSM |
junction | The count of junctions | Numeric | OSM |
Others | |||
fishnet_id | The unique ID of the grid | Numeric | Self-create |
level | The level of the jam | Factor | Waze |
count | The count of the jam report | Numeric | Waze |
The final dataset we create includes jam length observations for each grid cell location (n = 2075) for each hour of the day (n = 24) for each day of the year (n = 365). However, in our original dataset, some space/time units record no jams. To address this, we imputed those missing space/time observations with a jam length of zero to complete the dataset.
##### imputation ####
df <-
d1_1%>% #d1_1 is the full jam dataframe (Final_df) containing all the information
dplyr::select(fishnet_id) %>% #we use it just to get the fishnet ids
group_by(fishnet_id)%>%
summarise(count=n())%>%
dplyr::select(-count)%>%
split(.$fishnet_id)%>%
map_df(~data_frame(fishnet_id = .$fishnet_id[1],
local_time =seq(as.POSIXct("2018-01-01 00:00:00", tz ="UTC"),
as.POSIXct("2018-12-31 23:00:00", tz="UTC"),
by = "hour") )) %>%
mutate(local_time = with_tz(local_time,"America/Louisville"))
# credits: https://stackoverflow.com/questions/40227494/how-to-fill-in-missing-dates-in-range-by-group
#clean up weather
#join weather to main df
# bringing the jams parameters
df_1<-
d1_1 %>%
group_by(fishnet_id, year,month, day, hour, Direction, freeway) %>%
mutate(count = n()) %>%
summarise_if(is.numeric, mean) %>%
ungroup() %>%
dplyr::select(fishnet_id, count,length, level, freeway, D_north, D_south,
D_east, D_west,year ,month, day, hour) %>%
mutate( local_time= as.Date(paste(year, month,day, sep='-')), #creating local_time variable to complete the join
H = sprintf("%02d0000", hour),
H = format(strptime(H, format="%H%M%S"), format = "%H:%M:%S"),
local_time = as.POSIXct(paste(local_time,H, sep=" "),tz="America/Louisville",
format= "%Y-%m-%d %H:%M:%S")) %>%
dplyr::select(-H,-year,-month,-day,-hour)
df_2 <-
left_join(df, df_1, by=c("fishnet_id","local_time")) %>%
replace(., is.na(.), 0) %>% #imputing the NAs created by the join as 0
mutate( weekend = weekend(local_time),
peak = peakHour(local_time)) %>% #creating weekend and peak hour variables
dplyr::select(-local_time)
#adding in holidays
#adding in built-characteristics
data.frame('Before' = c(319648, 100),
'After' = c(4670400, 100),
row.names = c('Data rows', 'Dates covered')) %>%
kable(., caption = "Table 3: Before and After the Imputation") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F)
Before | After | |
---|---|---|
Data rows | 319648 | 4670400 |
Dates covered | 100 | 100 |
We wrangled the original jam data to flag peak-hour versus off-peak-hour jams, as well as weekday versus weekend jams.
weekend <- function(ds) {
dates <- as.Date(ds)
weekday<- weekdays(dates)
as.numeric((weekday=="Sunday")|(weekday=="Saturday"))
}
peakHour <- function(ds) {
dates <- as.POSIXct(ds)
hour <- hour(dates)
weekday <- weekend(ds)
as.numeric(((hour >= 7 & hour <= 10) | (hour >= 18 & hour <= 20)) &
(weekday == 0))
}
We wrangled the original weather data to define rainy, cloudy, and clear conditions. For the weather data, we use the Dark Sky API to collect hourly weather for the City of Louisville for the year 2018. In detail, we first create a dataframe of hourly timestamps, then create a dataframe with the requisite columns for the weather data. After that, we make requests using the Dark Sky API key to obtain the full year’s weather data.
With the whole year’s hourly weather data, we define rainy, cloudy, and clear conditions using the attributes we collected.
#OBTAIN WEATHER DATA FROM DARK SKY API
#LOAD API KEYS
rwunderground::set_api_key("YOUR_API_KEY")
alerts(set_location(territory = "Kentucky", city = "Louisville"))
darksky_api_key("YOUR_DARKSKY_API_KEY")
#CREATE A DATFRAME FOR ALL HOURS IN A DAY FOR THE WHOLE YEAR OF 2018
#create dataframe of hourly timestamps
dateDF <- seq(as.POSIXct("2018-01-01 00:00:00", tz = "America/Louisville"),
as.POSIXct("2018-12-31 23:00:00", tz = "America/Louisville"),
by = "hour") %>%
as.data.frame()
#rename the timestamp column name
names(dateDF)[1] <- paste("dateTimeObject")
#create a column for dates in string/character format
dateDF <- dateDF %>%
mutate(dateString = format.Date(dateTimeObject))
#CREATE A DATAFRAME WITH THE REQUISITE COLUMNS FOR THE WEATHER DATA
#make a request for one day to get the columns for weather data
weatherRequest<-as.data.frame(get_forecast_for(38.201033, -85.651970,
"2018-02-01T12:00:00") [["hourly"]])
#make a dataframe with weather columns
weather <- head(weatherRequest,0)
#GET THE WEATHER DATA FOR THE ENTIRE YEAR
#use 'for' loop to loop though every day of the year
for (i in seq(1, length(dateDF$dateString), by=24)) {
#code to get the date/timestamp in the darksky API format
#get the date from the timestamp dataframe
date <- dateDF$dateString[i]
#separate the date and time
dateSeparated <- paste(date, sep=" ")
#subset the date from the separated date string
dateAPI <- stringi::stri_sub(dateSeparated, 1, 10)
#subset the time from the separated date string
timeAPI <- stringi::stri_sub(dateSeparated, 12, 19)
#create a vector of the date and time strings
dateVector <- c(dateAPI, timeAPI)
#join the two elements of the date vector using "T"
joinedAPI <- paste(dateVector, collapse="T")
#get the daily weather by hour data and assign it to a dataframe
weatherDaily <- as.data.frame(get_forecast_for(38.201033, -85.651970,
joinedAPI) [["hourly"]])
#smartbind it to the weather dataframe created earlier
weather <- smartbind(weather, weatherDaily)
}
#convert any NA values to zero
weather[is.na(weather)] <- 0
df<- Final_df %>%
filter(fishnet_id == 3206)%>%
dplyr::select(-jam_id, -X.x, -Unique_ID, -f_area_m2, -jam_id_1, -uuid,
-id, -latitude, -longitude, -order, -irregularity_id,
-coordinate_type_id, -value, -name) %>%
mutate(precipProbability = if_else(is.na(precipProbability),0,precipProbability),
rain = if_else(precipProbability > 0.1,1,0),
poured = if_else(precipProbability > 0.5,1,0),
temperature = if_else(is.na(temperature),64,temperature),
hot = if_else(temperature > 90,1,0),
humidity = if_else(is.na(humidity),0.75,humidity),
pressure = if_else(is.na(pressure),1018.5,pressure),
windSpeed = if_else(is.na(windSpeed),2.445,windSpeed),
windy = if_else(windSpeed > 25,1,0),
cloudCover = if_else(is.na(cloudCover),0.552,cloudCover),
clear_sky = if_else(cloudCover <= 0.1,1,0),
partly_cloudy = if_else(cloudCover > 0.1 & cloudCover <= 0.25,1,0),
mostly_cloudy = if_else(cloudCover > 0.25 & cloudCover <= 0.75,1,0),
cloudy = if_else(cloudCover > 0.75,1,0),
visibility = if_else(is.na(visibility),7.429,visibility),
clear = if_else(visibility >= 8,1,0),
med_viz = if_else(visibility < 8 & visibility >= 4,1,0),
low_viz = if_else(visibility < 4,1,0),
Holiday = if_else(is.na(Holiday),0,1),
pub_utc_date = ymd_hms(as.character(pub_utc_date)),
local_time = with_tz(pub_utc_date,"America/Louisville")
)
To wrangle the built-environment variables, we first obtain OpenStreetMap (OSM) features, aggregate them by grid cell, and join them to the grid cell dataframe. We use ‘for’ loops to obtain and join the OSM features, looping through the list of column names for the features that will be joined to the fishnet. Please see the code in the following block.
#OBTAIN DATA FROM OPEN STREET MAP (OSM)
#CREATE BUFFER AROUND EACH FISHNET GRID CELL
#set buffer area (in meters)
buffer_area <- 200
#create a buffer aroud each fishnet grid cell
fishnetBuffer <- st_buffer(fishnet, buffer_area)
#CREATE LIST OF THE OSM FEATURES THAT WILL BE OBTAINED USING THE 'OSMDATA' R PACKAGE
#list for OSM keys
key_lst = list("highway", "highway", "highway", "highway", "highway", "highway",
"barrier",
"amenity",
"building",
"landuse", "landuse", "landuse", "landuse")
#list for OSM values
val_lst = list("turning_circle", "mini_roundabout", c("stop", "give_way"), "crossing", "traffic_signal", "motorway_junction",
"toll_booth",
c("parking", "parking_entrance", "parking_space"),
"office",
"industrial", "commercial", "residential", "retail")
#list for column names that will be used for the OSM features joined to the fishnet dataframe
col_lst = list("turning_ct", "roundabt_ct", "sign_ct", "cross_ct", "traffic_signal", "junction",
"toll_ct")
#list for column names that will be used for the OSM features joined to the fishnetBuffer dataframe
col_lst_buffer = list("parking_ct",
"off_ct",
"ind_ct", "comm_ct", "res_ct", "ret_ct")
#OBTAIN OSM FEATURES, AGGREGATE BY FISHNET GRID CELL AND JOIN TO FISHNET DATAFRAME
#IN ORDER TO AVOID API QUERY TIME OUT, PLEASE RUN THIS CODE WHEN OSM API TRAFFIC IS LOW. OTHERWISE, RUN THE CODE FOR EACH OSM FEATURE SEPARATELY WITHOUT USING THE 'FOR' LOOP.
#use 'for' loop to obtain and join OSM features
#initialize the output dataframe so each feature joined in the loop is retained
fishnetOSM <- fishnet
#loop through the length of the column names list for features that will be joined to the fishnet
for (i in 1:length(col_lst)) {
#get key from key_list
key_inp = key_lst[[i]]
#get value from val_list
val_inp = val_lst[[i]]
#get osm data using the appropriate key and value
osm_data <- add_osm_feature(opq = osm_cityBB, key=key_inp, value=val_inp)
#convert osm dataframe to a sf object
osm_data_sf <- osmdata_sf(osm_data)
#get geometry for osm sf (point geometry)
osm_data_pt <- st_geometry(osm_data_sf$osm_points)
#set the coordinate reference system - UTM 16N NAD83 meters
osm_data_pt <-
osm_data_pt %>%
st_sf() %>%
st_as_sf(., coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(., crs = 32616)
#clip osm data to jefferson county
osm_data_clipped <- osm_data_pt[jeffersonCounty, ]
#aggregate OSM data to each fishnet grid cell
osm_data_sum <- st_intersection(fishnet, osm_data_clipped) %>%
group_by(fishnet_id) %>%
summarise(count = n())
#drop geometry
osm_data_sum_df <- osm_data_sum %>%
st_set_geometry(NULL)
#join aggregated OSM data back to the accumulating fishnetOSM dataframe
fishnetOSM <- fishnetOSM %>%
left_join(., osm_data_sum_df, by = "fishnet_id")
}
#use 'for' loop to change column names for newly added OSM features
#the first two columns are the fishnet ID and fishnet area, so the i-th joined feature sits in column i + 2
for (i in 1:length(col_lst)) {
#obtain column name for the OSM feature
col_nam = col_lst[[i]]
#change the name of the column
names(fishnetOSM)[i + 2] <- paste(col_nam)
}
#check column names
names(fishnetOSM)
#remove osm data frames and free up memory space
rm(osm_data)
rm(osm_data_clipped)
rm(osm_data_pt)
rm(osm_data_sf)
rm(osm_data_sum)
rm(osm_data_sum_df)
#OBTAIN OSM FEATURES, AGGREGATE BY FISHNETBUFFER GRID CELL AND JOIN TO FISHNETBUFFER DATAFRAME
#IN ORDER TO AVOID API QUERY TIME OUT, PLEASE RUN THIS CODE WHEN OSM API TRAFFIC IS LOW. OTHERWISE, RUN THE CODE FOR EACH OSM FEATURE SEPARATELY WITHOUT USING THE 'FOR' LOOP.
#use 'for' loop to obtain and join OSM features
#loop through the length of the column names list for features that will be joined to the fishnetBuffer
for (i in 1:length(col_lst_buffer)) {
#get key from key_lst (the buffer features follow the first length(col_lst) entries in the lists)
key_inp = key_lst[[i + length(col_lst)]]
#get value from val_lst
val_inp = val_lst[[i + length(col_lst)]]
#get osm data using the appropriate key and value
osm_data <- add_osm_feature(opq = osm_cityBB, key=key_inp, value=val_inp)
#convert osm dataframe to a sf object
osm_data_sf <- osmdata_sf(osm_data)
#get geometry for osm sf (point geometry)
osm_data_pt <- st_geometry(osm_data_sf$osm_points)
#set the coordinate reference system - UTM 16N NAD83 meters
osm_data_pt <-
osm_data_pt %>%
st_sf() %>%
st_as_sf(., coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(., crs = 32616)
#clip osm data to jefferson county
osm_data_clipped <- osm_data_pt[jeffersonCounty, ]
#aggregate OSM data to each buffered fishnet grid cell
osm_data_sum <- st_intersection(fishnetBuffer, osm_data_clipped) %>%
group_by(fishnet_id) %>%
summarise(count = n())
#drop geometry
osm_data_sum_df <- osm_data_sum %>%
st_set_geometry(NULL)
#join aggregated OSM data back to fishnet
fishnetBuffer <- fishnetBuffer %>%
left_join(., osm_data_sum_df, by = "fishnet_id")
}
#use 'for' loop to change column names for newly added OSM features
#the first two columns are the fishnet ID and fishnet area, so the i-th joined feature sits in column i + 2
for (i in 1:length(col_lst_buffer)) {
#obtain column name for the OSM feature
col_nam = col_lst_buffer[[i]]
#change the name of the column
names(fishnetBuffer)[i + 2] <- paste(col_nam)
}
#check column names
names(fishnetBuffer)
#remove osm data frames and free up memory space
rm(osm_data)
rm(osm_data_clipped)
rm(osm_data_pt)
rm(osm_data_sf)
rm(osm_data_sum)
rm(osm_data_sum_df)
#OBTAIN AND ADD OSM BUILDING DATA (TOTAL NUMBER OF BUILDINGS) TO FISHNETBUFFER DATAFRAME
#PLEASE NOTE THAT THE OSM BUILDING DATA IS A VERY LARGE FILE AND THAT IT TAKES AN EXTREMELY LONG TIME TO DOWNLOAD THE DATA USING THE OSM OVERPASS QUERY. THEREFORE, WE RECOMMEND DOWNLOADING THE OSM BUILDING DATA SHAPEFILE FROM THE OSM GEOFABRIK WEBSITE (https://download.geofabrik.de/north-america.html) AND LOADING THE DATASET AS A SHAPEFILE LOCALLY.
#load building OSM shapefile using 'sf' package
buildings_kentucky <- read_sf("kentuckyOSM/gis_osm_buildings_a_free_1.shp")
#set the coordinate reference system - UTM 16N NAD83 meters
buildings_kentucky <- buildings_kentucky %>%
st_sf() %>%
st_as_sf(., coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(., crs = 32616) %>%
st_sf()
#clip OSM building data to Jefferson County
building_clipped <- buildings_kentucky[jeffersonCounty, ]
#aggregate OSM building data to each buffered fishnet grid cell
building_clipped_sum <- st_intersection(fishnetBuffer, building_clipped) %>%
group_by(fishnet_id) %>%
summarise(count = n())
#drop geometry
building_clipped_sum_df <- building_clipped_sum %>%
st_set_geometry(NULL)
#join aggregated building data back to fishnetBuffer
fishnetBuffer <- fishnetBuffer %>%
left_join(., building_clipped_sum_df, by = "fishnet_id")
#get column number for the newly added OSM building data feature in the fishnetBuffer dataframe
blng_col_num = ncol(fishnetBuffer) - 1
#change column name for the OSM building data feature
names(fishnetBuffer)[blng_col_num]<-paste("tot_blng_cnt")
#check column names in fishnetBuffer dataframe
names(fishnetBuffer)
#JOIN OSM FEATURES AGGREGATED TO FISHNETBUFFER TO FISHNET DATAFRAME
#drop geometry for fishnetBuffer dataframe
fishnetBuffer_df <- fishnetBuffer %>%
st_set_geometry(NULL)
#select relevant columns (i.e., only the OSM features) in the fishnetBuffer and join to fishnet by fishnet ID.
fishnetOSM <- fishnetBuffer_df %>%
dplyr::select(1, 3:blng_col_num) %>%
left_join(., fishnetOSM, by = "fishnet_id")
#convert all NA values to zero
fishnetOSM[is.na(fishnetOSM)] <- 0
We obtain holiday and event dates and join them to the dataframe.
Events <- data_frame(
holiday = 'Kentucky Derby Fest',
ds = as.Date(c('2018-04-26','2018-04-27','2018-04-28','2018-04-29',
'2018-04-30','2018-05-01','2018-05-02','2018-05-03',
'2018-05-04')),
lower_window = 0,
upper_window = 1
)
OEvents <- data_frame(
holiday = 'Abbey Road Music Fest',
ds = as.Date(c('2018-05-23','2018-05-24','2018-05-25')),
lower_window =0,
upper_window=1
)
OEvents2 <- data_frame(
holiday = 'Boat show',
ds = as.Date(c('2018-01-24','2018-01-25','2018-01-26',
'2018-01-27')),
lower_window =0,
upper_window=1
)
OEvents3 <- data_frame(
holiday = 'tailSpin Ale Fest',
ds = as.Date(c('2018-02-17')),
lower_window =0,
upper_window=1
)
holidays <- bind_rows(Events, OEvents,OEvents2,OEvents3)
rm(Events, OEvents, OEvents2,OEvents3)
#this has cummulative data from holiday csv and the newly created events holiday dataframe above.
hol <- as.data.frame(
read_csv("F:/BACKUP/MUSA801_Practicum/Dataset/Holiday2018.csv")%>%
mutate( ds= as.Date(substr(time, 1, 10)))%>%
dplyr::select(-time,-X1) %>%
subset(Holiday != "NA")%>%
group_by(Holiday, ds)%>%
summarise(count=n())%>%
mutate(lower_window = 0,
upper_window = 1)%>%
rename(holiday= `Holiday`)%>%
dplyr::select(-count)
)%>%
rbind(holidays)
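The join of these dates back onto the hourly panel is not shown above; one way it could be done is sketched below (a hedged sketch that assumes the panel still carries a local_time column at this point in the workflow; the project’s actual join may differ).

library(dplyr)

# Hedged sketch: attach holiday/event labels to the hourly panel by date,
# using the 'hol' dataframe built above.
df_2 <- df_2 %>%
  mutate(ds = as.Date(local_time)) %>%
  left_join(hol %>% dplyr::distinct(ds, holiday), by = "ds") %>%
  mutate(holiday = if_else(is.na(holiday), "None", holiday)) %>%
  dplyr::select(-ds)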
In this section, we first look at the distribution of the dependent variable, jam length. Figure 3 shows the distribution of jam length after the imputation (left) and for the original, non-zero records (right). These graphs give us a big picture of the distribution of our dependent variable.
Figure 3: The statistical distribution of jam length
img3 <- rasterGrob(as.raster(readPNG("jam length_non 0.png")), interpolate = FALSE)
img4 <- rasterGrob(as.raster(readPNG("jam length_imputate 0.png")), interpolate = FALSE)
grid.arrange(img4, img3, ncol = 2)
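The two panels above are loaded as saved images; a minimal sketch of how the same histograms could be drawn directly from the data is given below (df_2 is the imputed panel and d1_1 the original jam records, as used earlier; the bin count is arbitrary).

library(ggplot2)
library(gridExtra)

# Hedged sketch of Figure 3: jam length with imputed zeros (left) and for the
# original, non-zero jam records (right).
p_imputed <- ggplot(df_2, aes(x = length)) +
  geom_histogram(bins = 50) +
  labs(title = "With imputed zeros", x = "Jam length (m)", y = "Count") +
  theme_minimal()

p_original <- ggplot(d1_1, aes(x = length)) +
  geom_histogram(bins = 50) +
  labs(title = "Original records", x = "Jam length (m)", y = "Count") +
  theme_minimal()

grid.arrange(p_imputed, p_original, ncol = 2)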
To understand how jam length varies by time of day, we choose June 2018 as an example and visualize the variation in an animated chart. The animated Figure 4 suggests including the “weekday” and “peak/off-peak hour” variables.
First, weekdays and weekends have clearly different jam patterns. As shown in Figure 4, weekdays (Monday to Friday) have conspicuous morning and afternoon peak hours, and generally small jam lengths from 0:00 AM to 6:00 AM. Saturday and Sunday, in comparison, have a high prevalence of jams from 0:00 AM to 6:00 AM and almost no morning peak.
Second, the daily pattern clearly shows peak hours. Weekdays have two peak periods: the morning peak generally runs from 7:00 AM to 10:00 AM, and the afternoon peak from 6:00 PM to 8:00 PM. Weekends do not have a clear morning peak, but starting from 12:00 PM, their jam length increases steadily.
library(gganimate)
## New data
aggregated <-
jams_June %>%
mutate(day = wday(pub_utc_date, label = TRUE),
month = month(pub_utc_date, label = TRUE),
num = day(pub_utc_date),
year = year(pub_utc_date),
hour = hour(pub_utc_date)) %>%
mutate(string = glue("{day}, {month} {num}, {year}")) %>%
mutate(time = format(pub_utc_date, '%A, %B %d')) %>%
group_by(day, hour) %>%
summarise(length = mean(length),
delay = mean(delay),
speed = mean(speed))
## Animating it
animate <-
ggplot(aggregated,
aes(hour, length, group = day, color = day)) +
geom_line(size = 2) +
geom_segment(aes(xend = 24, yend = length), linetype = 2, colour = 'grey') +
geom_text(aes(x = 24.1, label = day), hjust = 0, show.legend = FALSE) +
scale_colour_manual(values = colour_function(7),
name = "day",
guide = guide_days) +
labs(x = "Time (in hours)", y = "Jam Length (in meters)") +
theme_hor() +
theme(legend.position = 'bottom') +
transition_reveal(hour)
## Saving
anim_save("length.gif", animation = animate,
height = 500, width = 800, nframes = 120, fps = 5,
start_pause = 2, end_pause = 20)
Figure 4: Average hourly jam length for each weekday
Next, an interactive chart (Figure 5) shows the jam length variation by time of day in more detail, separating the data by month. From the interactive chart, we can see that the peak and off-peak hours of each month are similar. Thus, it is safe to use the same morning peak and afternoon peak hours for the whole year. The similar pattern between months also validates the jam’s monthly seasonality.
Figure 5: 2018 hourly jam length for each weekday, by month
# The csv file (byhourJoinedFishWoutDirWoutFriWay.csv) is our final dataset, we aggregate it by month, weekday and hour to make the interactive plot.
# noDir3<-read.csv("byhourJoinedFishWoutDirWoutFriWay.csv") %>%
# group_by(month, weekday, hour) %>%
# summarize(sum_length = sum(length),
# sum_count = sum(count))
p2 <- read.csv("noDir3.csv") %>%
plot_ly(
x = ~hour,
y = ~sum_length,
frame = ~month,
color = ~weekday,
colors = c("#21BFD7", "#24ABDF", "#2897E7","#2C84F0","#5464F3","#7D45F7","#A626FB"),
size = ~sum_count,
type = 'scatter',
mode = 'lines+markers',
showlegend = T
)
p2
In the previous section, we saw how jam length varies across hours. Here, we zoom in to a specific month, June 2018, to see how jam length varies across days. Figure 6 shows the jam length variation over the 30 days of June 2018. Through the graph, we observe that the daily trend is quite similar across the month, meaning the daily seasonality is strong. Meanwhile, there are peaks at some specific moments, which means other effects are also impacting traffic performance.
data2<-read.csv("byhourJoinedFishWoutDirWoutFriWay.csv") %>%
group_by(month, weekday) %>%
summarize(sum_length = sum(length)) %>%
ungroup()
table(data2$weekday)
level_order <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday')
data2 %>%
ggplot(.) +
geom_line(aes(x = ordered(weekday, level = level_order),
y = sum_length, group = 1,
color = factor(month)) )+
facet_wrap(~month)+
labs(title="JAM LENGTH BY THE WEEKDAY OF MONTH",
x="WEEKDAY",
y="JAM LENGTH") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Figure 6: 2018 June daily jam length variance
Overall, the most congested day is Monday. The other days, except Sunday, have similar summed jam lengths, and Sunday is the least congested day in terms of summed jam length. However, this plot is biased by how many jam reports were recorded on each day: the more non-imputed jam reports a day has, the more congested it appears. Even so, we can offer a guess about the Sunday nadir: as a weekend and worship day, Sunday traffic is likely to be relatively smooth.
Figure 7: Sum of jam length for each weekday
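The code behind Figure 7 is not shown above; a minimal sketch of how it could be reproduced from the final dataset is given below (it reuses the level_order vector defined earlier; the project’s actual plotting code may differ).

library(dplyr)
library(ggplot2)

# Hedged sketch of Figure 7: total jam length summed by weekday.
read.csv("byhourJoinedFishWoutDirWoutFriWay.csv") %>%
  group_by(weekday) %>%
  summarize(sum_length = sum(length)) %>%
  ggplot(aes(x = ordered(weekday, levels = level_order), y = sum_length)) +
  geom_col() +
  labs(title = "SUM OF JAM LENGTH BY WEEKDAY",
       x = "WEEKDAY", y = "JAM LENGTH (M)") +
  theme_minimal()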
Observing the data by weekday and by month, we notice that most months have a similar pattern, with some peaks and outliers. Specifically, from Monday to Wednesday, most months show a rising trend in traffic congestion, with increasing jam length. Thursday and Friday jams vary by month: some months are more congested on those days, while others show a falling trend. On weekends, nearly all months are more congested on Saturday than on Sunday.
Figure 8 shows a similar but still varied jam trend across the twelve months of the year, which suggests that, beyond temporal seasonality, other features, such as spatial features, also affect traffic performance in Louisville.
Figure 8: 2018 Jam length by the weekday of month
From the above exploration, we find that jams have strong seasonality. Now, we aggregate the months into seasons and explore the weather characteristics of each season. Our exploration is based on the premise that traffic jams are influenced by weather; some conditions, like heavy rain and snow, have a negative effect on traffic operations:
In Figure 9, the y-axis shows different weather conditions and the x-axis shows the average jam length recorded under each condition. As the summary graph shows, in Louisville, spring and summer have very little extreme weather that might heavily influence traffic. However, there is relatively frequent heavy rain in the winter and snow in the fall. Figure 9 indicates that we should include weather as a predictor in our model.
Figure 9: Impact of climate
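The seasonal aggregation behind Figure 9 is not shown; a minimal sketch is given below (it assumes the hourly panel carries the month and icon columns from the data dictionary, and the season grouping is illustrative; the project’s actual summary may differ).

library(dplyr)
library(ggplot2)

# Hedged sketch of Figure 9: average jam length by weather condition and season.
df_2 %>%
  mutate(season = case_when(month %in% c(12, 1, 2) ~ "Winter",
                            month %in% 3:5         ~ "Spring",
                            month %in% 6:8         ~ "Summer",
                            TRUE                   ~ "Fall")) %>%
  group_by(season, icon) %>%
  summarize(avg_length = mean(length, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(x = avg_length, y = icon)) +
  geom_col() +
  facet_wrap(~season) +
  labs(x = "Average jam length (m)", y = "Weather condition") +
  theme_minimal()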
The animated map in Figure 10 shows the distribution of jams in June 2018 across the City of Louisville’s primary, secondary, and tertiary roads. The map conveys three pieces of information:
This indicates that, when making predictions, we should take spatial variation into consideration.
Figure 10: 2018 June jam length animation across Louisville
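The animation code for this map is not included above; a rough sketch with gganimate is given below (jam_points_june and louisvilleRoads are placeholder sf objects with the columns noted in the comments; the project’s actual map code may differ).

library(ggplot2)
library(gganimate)
library(sf)

# Hedged sketch of an hourly animated jam map for June 2018. 'jam_points_june'
# is an assumed sf point layer of June jam reports with 'length' and 'hour'
# columns; 'louisvilleRoads' is an assumed road layer in the same CRS.
map_anim <- ggplot() +
  geom_sf(data = louisvilleRoads, color = "grey80", size = 0.3) +
  geom_sf(data = jam_points_june, aes(color = length), size = 0.8) +
  scale_color_viridis_c(name = "Jam length (m)") +
  labs(title = "Louisville jams, June 2018 - hour {frame_time}") +
  theme_void() +
  transition_time(hour)

anim_save("jam_map.gif", animation = map_anim,
          height = 500, width = 800, nframes = 120, fps = 5)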