GLOBAL PROGRAM
           RESILIENT HOUSING




Detecting
Urban Clues
for Road Safety
Leveraging Big Data
and Machine Learning




DECEMBER 2021

Table of Contents


List of Figures and Tables
Acknowledgments
Objective, Audience and Structure
Abbreviations
Introduction
PART 1: The Demand for Data to Assess Risks and Conduct Safety Assessments
             1.1 Conventional Tools for Road Safety Assessment
                   Data Requirements for Traffic and Road Safety Assessment Tools
                   Key Challenges with Current Approaches to Road Safety Analysis
PART 2: Big Data and Machine Learning to Strengthen Road Safety in Transport Projects
             2.1 New Data (and Big Data) in Road Safety Analysis
                   How to Access Big Data
                   Key Considerations for Selecting the “Right” Big Data Source
             2.2 Machine Learning in Road Safety Analysis
                   How to Use Machine Learning
                   Key Considerations for Using Machine Learning
             2.3 Big Data, Machine Learning and the Future of Road Safety Assessments
PART 3: Case Studies: Applying Big Data and Machine Learning to Assess Road Safety
             3.1 Objectives of the Case Studies
             3.2 Methodology
             3.3 Case Study 1: Bogotá, Colombia
             3.4 Case Study 2: Padang, Indonesia
             3.5 Findings
Conclusion
Annex 1: Most Relevant Big Data Types for Road Safety Analysis
Annex 2: Overview of Big Data Sources
Annex 3: Hotspots and Heatmaps: Uncovering Data Patterns for Road Safety
Annex 4: Classes Detected Using Mapillary Vistas Dataset in RIC Model and Input Classes for the RRE Model
Annex 5: Average Precision of the Bounding Box Detection and Classification
Glossary of Terms
References




List of Figures


Figure 1: Road safety is a serious concern in low- and middle-income countries
Figure 2: Potential applications of big data and ML in road safety projects
Figure 3: Street view and OSM
Figure 4: Hotspot analysis of major crashes reported by Waze application users
Figure 5: ML lifecycle
Figure 6: Categories of ML and the tasks they can perform
Figure 7: ANN structure
Figure 8: ML algorithms and street view
Figure 9: Labeling a crosswalk in Padang, Indonesia using the Computer Vision Annotation Tool (CVAT)
Figure 10: Framework for automatic road safety analysis and management powered by ML
Figure 11: Training phase for road safety segment analysis using ML
Figure 12: Deployment phase to predict road safety
Figure 13: RIC and RRE applied to predict road segment risk
Figure 14: Image segmentation in Bogotá
Figure 15: Six study areas and crash frequency in Bogotá
Figure 16: Confusion matrix showing the accuracy of the RRE model
Figure 17: Road risk prediction in Bogotá
Figure 18: Road risk prediction in Padang




List of Tables


Table 1: Overview of common road safety assessment tools
Table 2: Overview of data requirements for common road safety assessment tools
Table 3: SWOT analysis of using big data in road safety analysis
Table 4: Overview of potential big data sources for road safety assessments
Table 5: Categories of ML and algorithms
Table 6: ML and DL algorithms
Table 7: Frequently used ML techniques for road safety analysis
Table 8: SWOT analysis of using ML in road safety analysis
Table 9: Potential applications of big data and ML in road safety analysis
Table 10: Data used for case study in Bogotá, Colombia
Table 11: Data used for case study in Padang, Indonesia




Acknowledgments


This Guidance Note was prepared by a team from the Global Program for Resilient Housing at the
World Bank. The team was led by Sarah Elizabeth Antos (Data Scientist) and Luis Miguel Triveno
Chan Jan (Senior Urban Development Specialist). Overall managerial support was provided by
Francis Ghesquiere (Practice Manager, Urban EAP) and Radoslaw Czapski (Senior Transport Specialist).
The core team included Jessica Gosling-Goldsmith, Charles Wang, Bushra Syed Shafat Ali, and
Sebastian Anapolsky.
The Global Program for Resilient Housing supports safe and resilient housing by creating new,
cost-saving tools to evaluate homes from the air and the street to help identify those vulnerable to
natural and health hazards. While the program focuses on housing, it developed a methodology to
extract urban clues from street view imagery with multiple applications including those related to
urban mobility and road safety.
The note incorporates valuable input and review from Holly Krambeck (Program Manager), Said
Dahdah (Lead Transport Specialist), Satoshi Ogita (Senior Transport Specialist), Veronica Ines Raffo
(Senior Infrastructure Specialist), Li Qu (Senior Transport Specialist), and Glenn S. Morgan (ESF
Consultant).
During the drafting of this note several industry experts were interviewed. The team would like to
express gratitude for the external inputs of: Anthony Germanchev (Principal Professional Leader,
Advanced Technologies Lab, Australian Road Research Board), David Hynd (Chief Scientist, TRL),
Monica Olyslagers (Safe Cities and Innovation Specialist, iRAP), Professor George Yannis (National
Technical University of Athens), and Spencer Rigler (Account Director, TRL).
Design was done by Xavier Conesa.
This note would not have been possible without generous support from the Global Road Safety
Facility and UK Aid.




Objective, Audience and Structure


The purpose of this Guidance Note is to provide concrete guidance on how big data and machine
learning (ML) can be leveraged in road safety analysis. The document presents opportunities to use
these new technologies to improve current methods for data collection and analysis for various road
safety assessments.
This Guidance Note provides a practical guide for using new data sources and analytical methods
for road safety analysis in different types of projects that may impact road infrastructure or
risk-related factors. Road safety practitioners, project managers, researchers, international
development organizations, data scientists, and government agencies responsible for road safety
assessments, transportation management, and infrastructure development would also find this document
useful to understand how these new technologies can be implemented for various road safety
assessment procedures and requirements.
This document consists of three parts. Part 1 provides an overview of existing approaches and tools
for road safety assessment and identifies opportunities to improve these using new technologies
such as big data and ML. Part 2 provides an overview of these new technologies and concrete
guidance on how they can be integrated into transport projects for road safety analysis. Part 3 presents
case studies on two regions of interest – Bogotá, Colombia and Padang, Indonesia – to demonstrate
how ML can be implemented to evaluate road safety. The document concludes with recommendations
for using big data and ML in road safety assessments in the future.




Abbreviations


ADB	     Asian Development Bank
API	     Application Programming Interface
DDP	     Development Data Partnership
DL	      Deep Learning
DRIVER	 Data for Road Incident Visualization, Evaluation and Reporting
FSI	     Fatalities and Serious Injuries
GRSF	    Global Road Safety Facility (World Bank)
IoT	     Internet of Things
iRAP	    International Road Assessment Programme
ITS	     Intelligent Transport System
LMICs	   Low- and Middle-Income Countries
ML	      Machine Learning
OSM	     OpenStreetMap
RIC	     Road Information Collector
ROI	     Region of Interest
RRE	     Road Risk Evaluator
RSA	     Road Safety Audit
RSI	     Road Safety Inspection
RSIA	    Road Safety Impact Assessment
RSO	     Road Safety Observatory
SDGs	    Sustainable Development Goals
UAV	     Unmanned Aerial Vehicle




Introduction


Transportation services and infrastructure connect people, businesses, and places. They allow
citizens to access opportunities, such as jobs, education, health services, and recreation, and enable
the movement and distribution of goods. As a result, transport services and infrastructure are key to
the economic development of cities and regions.1
While the development of transportation systems and infrastructure is vital to economic growth,
it is also important to evaluate and mitigate its potential negative externalities and costs to
society.2 According to the World Health Organization (WHO), around 1.25 million people are killed on the
world’s roads every year and between 20 and 50 million are seriously injured. These costs are
disproportionately higher in low- and middle-income countries (LMICs), which are estimated to endure 93
percent of the world’s fatalities on the road, despite having 60 percent of the world’s vehicles (figure
1).3 According to a 2019 study of select countries, road crashes cost World Bank client countries an
estimated 7 percent to 22 percent of their GDP over a 24-year period.4
Road fatalities and injuries are predictable and preventable.5 Research indicates that roughly 70
percent of serious crashes are due to simple and unintentional errors of perception or judgement.6
The most vulnerable road users are pedestrians, bicyclists, and motorcyclists, accounting for more
than 50 percent of reported fatalities in LMICs.7 Effective transport planning and management
carefully considers and incorporates measures to address safety risks.8 Speed reductions and the
design of infrastructure to promote safer streets have demonstrated clear results in Colombia and
India. In Bogotá, Colombia, the speed management program resulted in a 21 percent decrease in
traffic fatalities compared to the average for the three preceding years (2015-18).9 In India, Pune has
become a regional leader in complete streets, in which streets are designed for all users, rather than
only for cars; pedestrians, cyclists, motorists, and transit riders are given safe access with the
complete streets approach.10
The United Nations (UN) launched its second Decade of Action for Road Safety in 2020 to address
the road safety objectives of its Sustainable Development Goals (SDGs). These include SDG 3.6,
which seeks to reduce deaths and injuries from road crashes by 50 percent, and SDG 11, which
focuses on making cities and human settlements inclusive, safe, resilient, and sustainable.

1
  World Bank, Mobile Metropolises: Urban Transport Matters: An IEG Evaluation of the World Bank Group’s Support for Urban
Transport (Washington, DC: World Bank, 2017).
2
  World Bank, Making Roads Safer (Washington, DC: World Bank, 2014).
3
  WHO (World Health Organization), Global Status Report on Road Safety 2018 (Geneva: World Health Organization, 2018), 4.
4
  World Bank, The High Toll of Traffic Injuries: Unacceptable and Preventable (Washington, DC: World Bank, 2017).
5
  Makhtar Diop, “All Road Deaths Are Preventable. We Can Make It Happen,” World Bank, accessed May 14, 2021,
https://blogs.worldbank.org/transport/all-road-deaths-are-preventable-we-can-make-it-happen
6
  International Transport Forum, Zero Road Deaths and Serious Injuries: Leading a Paradigm Shift to a Safe System (Paris: OECD
Publishing, 2016). https://doi.org/10.1787/9789282108055-en
7
  World Bank, Good Practice Note on Road Safety (Washington, DC: World Bank, 2019). https://pubdocs.worldbank.org/
en/648681570135612401/Good-Practice-Note-Road-Safety.pdf
8
  International Transport Forum, “Best Practice for Urban Road Safety: Case Studies,” International Transport Forum Policy
Papers, no. 76 (2020).
9
  International Transport Forum, “Best Practice for Urban Road Safety: Case Studies.”
10
   Institute for Transportation and Development Policy, “Pune, India Wins 2020 Sustainable Transport Award,” last modified
June 27, 2019, https://www.itdp.org/2019/06/27/pune-india-wins-2020-sustainable-transport-award/


The World Bank hosts the Global Road Safety Facility (GRSF) to provide funding, knowledge, and
technical assistance to help developing countries create safer roads. The Facility addresses road
safety issues across a wide range of projects, from infrastructure design and vehicle safety to
traffic law enforcement, post-crash response systems, data collection, and institutional strengthening.
Since its inception in 2006, the Facility has disbursed a total of USD 44.6 million to improve road
safety in 64 countries.

FIGURE 1: Road safety is a serious concern in low- and middle-income countries
93% of road fatalities occur in low- and middle-income countries, despite these countries having
60 percent of the world’s vehicles.
SOURCE: Original figure for this publication, based on data from WHO.

It is important, and often required, to incorporate road safety management procedures in transport
projects to identify and mitigate risks in a timely manner. Governments, international development
organizations, and other agencies have established various tools and systems to facilitate road
safety analysis. However, the absence of valid, representative data presents significant challenges to
developing a good understanding of road safety risks and reducing crash fatalities and injuries
through data-driven, evidence-based interventions.11
New technologies such as big data and machine learning (ML) provide promising opportunities to
improve existing data sources and methods for road safety analysis. From analyzing anonymized GPS
data to understand traffic flows in the Philippines to partnering with data providers that crowdsource
information about crash sites in Kenya, governments, road safety practitioners, and other
stakeholders are adopting innovative approaches to identify, monitor, and mitigate fatalities and
injuries in high-risk areas.12 Unsupervised learning techniques have been applied in Lima, Peru, using
records of different crash types to identify safe areas along routes and safer pedestrian pathways,
decreasing the likelihood of pedestrians suffering an accident.13 The Urban Traffic Modeling and
Control project at the National University of Medellín has been using deep learning (DL) techniques
to classify traffic and identify motorbike usage. In Cartagena, Colombia, data mining and ML
algorithms were used to analyze road records and predict the severity of crashes using classification
algorithms.14 Figure 2 provides an overview of the potential uses of big data and ML in road safety
analysis that will be discussed in this note.

11
   World Bank, Guide for Road Safety Opportunities and Challenges: Low and Middle Income Country Profiles (Washington, DC:
2020). https://openknowledge.worldbank.org/handle/10986/33363

FIGURE 2:   Potential applications of big data and ML in road safety projects
BIG DATA OR SPATIAL DATA SOURCE     MACHINE/DEEP LEARNING
Street view imagery                 Identify road conditions, barriers, crosswalks, pedestrian paths,
                                    street signs, traffic lights
Satellite and aerial imagery        Delineate road curvature, complex intersections, road gradient;
                                    provide car and truck count
Internet of Things                  Analyze vehicle and population movement
Incident reports                    Identify road crash patterns and develop prediction models
Natural phenomena                   Find patterns in weather and time of day
Social media                        Extract traffic or road condition data
SOURCE: Original figure for this publication.




12
   World Bank, “Open Traffic Data to Revolutionize Transport,” last modified December 19, 2016, https://www.worldbank.
org/en/news/feature/2016/12/19/open-traffic-data-to-revolutionize-transport; Guadalupe Bedoya Arguelles, et al., “Smart and
Safe Kenya Transport (SMARTTRANS)” (Washington, DC: World Bank, 2019), https://documents1.worldbank.org/curated/
en/723411574361015073/pdf/Smart-and-Safe-Kenya-Transport-SMARTTRANS.pdf
13
   Jesús Lovón-Melgarejo et al., “Identification of Risk Zones for Road Safety through Unsupervised Learning Algorithms,” in
16th LACCEI International Multi-Conference for Engineering, Education, and Technology: Innovation in Education and Inclusion,
http://www.laccei.org/LACCEI2018-Lima/full_papers/FP413.pdf
14
   Holman Ospina-Mateus et al., “Using Data-Mining Techniques for the Prediction of the Severity of Road Crashes in
Cartagena, Colombia,” in Applied Computer Sciences in Engineering, eds. J. Figueroa-García et al., vol. 1052 (2019): 309-20,
https://doi.org/10.1007/978-3-030-31019-6_27


PART 1:
The Demand for Data to Assess Risks and Conduct
Safety Assessments

Road safety practitioners utilize a variety of data-driven tools and methods to evaluate road safety
risks and determine mitigation measures across different stages of road and infrastructure
development projects. Comprehensive road safety evaluation tools and procedures require both crash
and non-crash data to identify issues and measure their associated risks. The variety, quantity, and
quality of the data available are important determinants of which tool is used to measure and analyze
various road safety indicators.
This section provides an overview of the most widely used road safety assessment tools and their
data requirements. A brief description of these road safety assessment procedures and tools can be
found in table 1. This brief review of existing approaches informs the suggestions for improving data
collection and analysis for road safety evaluation procedures through big data and machine learning (ML).

1.1 Conventional Tools for Road Safety Assessment

Road safety risks arise from the interaction of many different elements: road and roadside
design and engineering, travel speeds, the extent and type of road use, road user behavior, vehicle
safety features (both active and passive), and post-crash response. The Safe System approach addresses
all of these interactive elements in an integrated manner and emphasizes sharing accountability with
designers and users of the road network to achieve road safety targets.15
The primary purpose of road safety assessment procedures is to identify risks in existing or
planned infrastructure developments. Road safety practitioners utilize a wide range of tools for this
purpose. Some of these can be purchased commercially, while others are provided, and occasionally
mandated, by local governments. Organizations providing financial support for international
development projects may also create their own tools for road safety analysis, such as the “Simplified
Methodology” by the World Bank.16 In general, road safety assessment tools tend to comprise checklists
for evaluating the safety of road networks at different stages of a road project’s lifecycle. Some tools,
such as the Austroads Road Safety Audit tool, provide guidelines for conducting road safety audits
at all the stages of a road project, while other tools like iRAP have guidelines for only some stages
(such as during preparation and post-construction). Tools may also need to be adapted or customized
depending on the type of project or the project location.
A comprehensive approach to managing road safety and reducing crash risk generally requires a
combination of reactive and proactive approaches across some or all stages of a road’s lifecycle.17
Reactive approaches rely on historical crash data to identify high-risk regions and risk factors. Proactive
approaches aim to identify and address potential risks before a project is implemented or crashes occur.

15
   Tony Bliss and Jeanne Breen, “Meeting the Management Challenges of the Decade of Action for Road Safety,“ IATSS Res., 35
(2012): 48–55. https://doi.org/10.1016/j.iatssr.2011.12.001
16
   World Bank, Innovative Road Safety Risk Assessment Tool with Automated Image Analysis Technology (Washington, DC: World
Bank, 2021).
17
   World Road Association, “Road Safety Manual: Infrastructure Management Tools,” accessed May 10, 2021,
https://roadsafety.piarc.org/en/planning-design-operation-infrastructure-management/management-tools


Reactive approaches are often the starting point for road safety analysis and rely on some form
of crash-based identification. Crash data-based risk assessments may involve evaluating one or
several of the following criteria: infrastructure, users, speeds, vehicle standards and post-crash trauma
care. This approach requires that risk factors be constantly monitored and assessed throughout the
project lifecycle.
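As a rough illustration of crash-based identification, the sketch below counts historical crashes per
road segment and lists the segments with the most crashes. It is a minimal example rather than a
method prescribed by this note: the record layout, the `segment_id` field, and the sample data are
all hypothetical, and a real analysis would typically also account for traffic volume, segment length,
and crash severity.

```python
from collections import Counter


def rank_segments_by_crashes(crash_records, top_n=3):
    """Count crashes per road segment and return the top_n (segment, count) pairs."""
    counts = Counter(record["segment_id"] for record in crash_records)
    # most_common() orders segments from the highest to the lowest crash count.
    return counts.most_common(top_n)


# Hypothetical crash records; field names and values are illustrative only.
crashes = [
    {"segment_id": "A", "severity": "fatal"},
    {"segment_id": "B", "severity": "minor"},
    {"segment_id": "A", "severity": "serious"},
    {"segment_id": "C", "severity": "minor"},
    {"segment_id": "A", "severity": "minor"},
    {"segment_id": "B", "severity": "serious"},
]

hotspots = rank_segments_by_crashes(crashes, top_n=2)
print(hotspots)  # [('A', 3), ('B', 2)]
```

A simple count like this only reflects where crashes have already been recorded, which is why it
depends on the availability and quality of crash data.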
Recently, the focus has shifted toward using more proactive approaches, with a wide range of
tools being developed for this purpose. These are especially useful in the absence of crash data, and
often involve surveys of existing roads for road infrastructure risk or assessment of other criteria to
obtain subjective estimates of road infrastructure risk. Some common tools for proactive road risk
assessments are discussed below.
Road Safety Impact Assessments (RSIA) are designed to estimate the potential effects of planned
road or traffic developments, or any other interventions that may significantly affect transport
conditions and risks to road users. The procedure is often conducted at the planning stage to assess
the possible impacts of different schematic designs before the most appropriate design is audited and
selected for implementation.
Road Safety Audits (RSA) are generally used to analyze a road project, or any other type of project
which affects road users. An independent, qualified team reports on the project’s crash potential
and safety performance for all kinds of road users. Road safety audits
can be conducted at various stages in the project lifecycle including planning, preliminary design,
detailed design and pre-opening or post-construction stages. However, an audit is most cost-effective
when it is applied to a road or traffic design before construction to ensure that safety is fully
integrated into all elements of the project’s infrastructure, with minimal risk of redesign or physical rework.
Road Safety Inspections (RSI) involve a systematic evaluation of an existing road or section of road
by a team of seasoned experts. They are conducted on-site to determine potential hazards, faults
and deficiencies that could contribute to serious crashes.18 RSIs are more comprehensive than RSAs
and are usually conducted post construction to identify further interventions to improve road safety
and inform future projects.
Road Assessment Programmes (RAP) entail a comprehensive review of existing roads and road
networks. Most RAPs, such as the EuroRAP, usRAP and iRAP, use a star-rating approach to provide
a relative and comparable measure of the safety level of road networks all around the world. RAPs
are highly comprehensive, detailed, and costly. They are usually commissioned by national or local
governments to evaluate extensive road networks as an ad-hoc project to determine safety
interventions and inform further infrastructure development. Therefore, RAPs are either utilized at the
preparation stage of a project to determine project scope, design, and other key requirements for
pre-appraisal and construction, or they are conducted to assess the impact of major infrastructure
development projects during the post-project operations phases.




18
   Phil Allan, “Road Safety Inspections” (presentation, Road Safety Seminar, World Road Association, Lomé, Togo: October
2006). https://www.piarc.org/ressources/documents/actes-seminaires06/c31-togo06/8718,2-PIARC_Oct06_Allan.pdf


TABLE 1:   Overview of common road safety assessment tools
TYPE OF           WHEN TO USE           WHEN TO USE                 RELATIVE COST       DATA              EXAMPLES OF TOOLS
ASSESSMENT        (PROJECT STAGE)       (PROJECT ACTIVITY)          (HIGH, MEDIUM,      REQUIREMENTS
                                                                    LOW, DEPENDS)       (HIGH, MEDIUM,
                                                                                        LOW, DEPENDS)

Crash data-based  Preparation,          Pre-Planning and Design,    Depends, low-cost   Depends           Crash frequency, crash risk factors,
risk assessment   Implementation,       Monitoring and Evaluation,  models are                            crash severity analysis
                  Post-Project          Error Correction and        available
                  Operations            Hazard Elimination

Road Safety       Preparation           Pre-Planning and Design     Low                 Low
Impact Assessment
(RSIA)

Road Safety       Preparation,          Planning and Design,        Medium to High      Medium/Depends    iRAP Road Safety Audit Toolkit,
Audit (RSA)       Implementation        Construction and                                                  Austroads Road Safety Audit Toolkit
                                        Pre-Opening                                                       (currently unavailable),
                                                                                                          ADB Road Safety Audit Toolkit

Road Safety       Implementation,                                   High                High              iRAP
Inspection (RSI)  Post-Project
                  Operations

Road Assessment   Preparation,          Planning and Design,        High                High              iRAP, EuroRAP, usRAP
Program (RAP)     Post-Project          Independent Assessment
                  Operations

SOURCE: Modified from Remote Project Supervision and Construction Management of IPF Projects (Washington, DC: World Bank, 2020).




Data Requirements for Traffic and Road Safety Assessment Tools

One or more types of road safety assessments may be conducted at once or at different phases
of a project. Table 2 summarizes the assessment methods, objectives, and their data requirements.
Assessments prepared early in a project’s lifecycle may help to identify and evaluate potential traffic
and road safety risks that may arise from the project activities and/or their implementation. Such
assessments are intended to help mobilize appropriate resources, analyze risks in detail, and identi-
fy and adopt the most appropriate mitigation measures. During the project preparation stage, more
in-depth assessments to identify and evaluate potential traffic and road safety risks may need to be
conducted. The assessments should consider Safe System principles to ensure that all opportunities
to minimize risks have been realized.19
Since the key objectives of these assessments (i.e., identifying risk elements and estimating crash
exposure, likelihood, and severity for different road users) are complex and not standardized,
the scoring system is subjective. This can complicate comparisons between sites, especially when
these have been assessed by different individuals or teams. Such scoring is therefore usually most
suitable for comparing options at a single site, identifying sources of risk, and identifying solutions,
rather than for comparing different sites.




19
     Tony Bliss and Jeanne Breen, “Meeting the Management Challenges of the Decade of Action for Road Safety.”


TABLE 2:   Overview of data requirements for common road safety assessment tools
METHOD                    OBJECTIVES                          DATA REQUIREMENTS

Crash data-based          Estimate risk using Fatalities and  • Crash data from the previous 3–5 years or estimated from data available
risk assessment           Serious Injuries (FSI) crash data     from similar roads in the country
                          to reflect road infrastructure,     • Assessment of vehicle standards (safe vehicles)
                          users, and speed factors. This is   • Post-crash trauma care (response time, quality of attention)
                          evaluated with vehicle standards
                          and post-crash care.

Road Safety Audit (RSA)   Identify safety concerns. It        Analysis of project designs and interventions: specialists assess road
(performed by an          audits the safety of the specific   options, such as intersections, signs, crossings; design standards, and the
independent team of       design of the chosen scheme.        relationship of this intervention to main network. Main data needed includes:
specialists)                                                  • Scheme plans
                                                              • Crash and FSI data
                                                              • Traffic mix and volumes
                                                              • Road features (e.g., design elements, such as bypasses, cycle routes,
                                                                junction improvements, installation of traffic signals, roundabouts, traffic
                                                                calming, bend realignment, safety fence schemes and pedestrian crossing
                                                                facilities)

Road Safety Impact        Assess the impact of each of the    The evaluation of each alternative is based on several factors, some of which
Assessment (RSIA)         planning options on the safety      include:
(performed by members     performance of the current road     • The scheme objectives
of the project design     network. It estimates the impact    • Crash and FSI data
team with road design     of possible schemes on safety for   • Traffic mix and volumes
and road safety           an entire geographic area at the    • Road features
auditing experience)      strategic level.                    • Categorization of roads and streets of that network

Safe System               Assess how closely road design      The core of this SSA approach is the “Safe System Matrix” framework which
Assessment (SSA)          and operation align with the Safe   is essentially a risk assessment. The assessment is done by scoring the risk
                          System objectives, and to clarify   exposure, likelihood and severity from 0–4. The Austroads approach can be
                          which elements need to be           used to perform this type of assessment. Data needed include:
                          modified to achieve closer          • Traffic mix and volumes
                          alignment with these objectives.    • Road features
                                                                                                                    SOURCE: Road Safety GPN.
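
The Safe System Matrix scoring described in Table 2 can be illustrated with a short sketch. It assumes, in line with the Austroads approach mentioned above, that the exposure, likelihood, and severity scores (each from 0 to 4) are multiplied for each crash type and that the products are then summed for the site. The crash type names and the example scores are hypothetical and serve only to show the arithmetic.

```python
def ssa_score(matrix):
    """Return per-crash-type products and their total for a Safe System Matrix.

    `matrix` maps a crash type to a tuple (exposure, likelihood, severity),
    each scored from 0 to 4. Multiplying the three scores is an assumption
    based on the Austroads approach; it is not prescribed by this Note.
    """
    products = {}
    for crash_type, (exposure, likelihood, severity) in matrix.items():
        for value in (exposure, likelihood, severity):
            if not 0 <= value <= 4:
                raise ValueError(f"{crash_type}: scores must be between 0 and 4")
        products[crash_type] = exposure * likelihood * severity
    return products, sum(products.values())


# Hypothetical scores for one design option at a single site.
option_a = {
    "run-off-road": (3, 2, 4),
    "head-on": (3, 1, 4),
    "intersection": (2, 3, 3),
    "pedestrian": (1, 2, 4),
}

products, total = ssa_score(option_a)
print(products)
print("Total:", total)
```

Because the scores are assigned by assessors, totals of this kind are best used to compare design options at the same site rather than to rank different sites.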




Key Challenges with Current Approaches to Road Safety Analysis

Since data is the cornerstone of all road safety assessments, the availability of high-quality,
reliable data is key to extracting useful, actionable insights and improving road safety conditions.
Without quality information, it is difficult to estimate crash locations and crash types, at-risk
individuals and groups, and key risk factors influencing exposure to risk, crash involvement, crash
severity, and post-crash outcomes. Meeting data requirements for road safety assessments can be a
challenge for various reasons, such as the lack of open data or the cost of data collection.
Data-scarce countries and regions can lack the crash data or road ratings needed to identify
risk factors. Governments often lack adequate and reliable data to identify road safety
risks and perform road safety assessments. In addition, road crashes tend to be underreported,
especially in LMICs. There may also be significant gaps in the data in terms of geographic or temporal
coverage, or the data may be missing important variables and categories. Access to data can also be
limited for certain data types, or the process of obtaining the data may be too complex, costly, and
time-consuming.




Collecting data on road safety attributes through manual detection or special equipment can be
expensive, time-consuming, and complex.20 Budgeting for data collection can be a challenge. In
these cases, data is most often estimated through existing road designs or by local transportation
agencies. The most cost-effective method for data collection is the installation of cameras and sensors
that record street imagery, speed information, and other data. Images and video are then analyzed
by road safety experts to identify relevant attributes, assess road conditions, and identify potential
risks. Commissioning equipment and hiring resources to manually collect data on road features and
design may be a hindrance, especially for smaller-scale projects where the opportunity to benefit
from economies of scale is low.
In addition to the quality and availability of data, preparing and analyzing road safety data can also
be costly, resource-intensive, and technically demanding. Most road safety assessments require
data to be combined from various sources, which often involves aggregating, cleaning and preparing
the data. Additional resources and specialist expertise may be necessary for this process, and also to
analyze the data and extract useful insights using methods such as clustering and developing spatial
models. Conventional statistical techniques can also be limited in their ability to identify complex
correlations and underlying factors that may contribute to road safety risks across various projects.
The purpose of this Guidance Note is to identify new methods for the collection and analysis of
road safety data that could overcome the limitations of existing approaches, and also improve
their efficacy in identifying risks and opportunities to mitigate crashes. Conducting road safety
assessments is a required component of most road investment and infrastructure development proj-
ects. Advanced technologies such as big data and ML have the potential to not only supplement
existing methods, but also significantly reduce costs while improving the efficacy of road safety as-
sessments in identifying risks and opportunities to mitigate crashes.
The following section explains how big data and ML can be practically implemented by road safety
practitioners for various road safety assessment procedures. It introduces these methods and provides
an overview of big data sources and ML techniques that are useful for road safety assessments.
Part 2 also discusses best practices and key considerations that are vital to implementing these new
methods effectively. A framework for integrating these technologies in road safety assessments is
also proposed, and Part 3 demonstrates how this framework can be applied in LMICs through two
original case studies.




20
   OECD (Organisation for Economic Co-operation and Development)/ITF (International Transport Forum), Big Data and
Transport: Understanding and Assessing Options (Paris: OECD/ITF, 2015),
https://www.itf-oecd.org/sites/default/files/docs/15cpb_bigdata_0.pdf


PART 2:
Big Data and Machine Learning to Strengthen Road
Safety in Transport Projects

Governments, road safety practitioners, international development organizations, and road safety
advocates such as the Global Road Safety Facility are keen to use new technologies, such as big
data and ML, in data collection and analysis for road safety to overcome the limitations of existing
approaches. As these technologies become more sophisticated and accessible, a growing body of
research indicates their potential to complement, and eventually even surpass, conventional methods.
The usefulness of big data and ML in road safety and other transport and infrastructure projects
has been widely demonstrated over the past few years. For example, a World Bank task team de-
veloped an open data platform in 2015 based on a pilot in Cebu City, Philippines, which sourced data
from a taxi company to generate insights for traffic management.21 Another team has developed a
“Simplified Methodology” to implement ML in video analysis to extract data on road attributes. The
new tool was piloted across over 500 kilometers of road in Mozambique and Liberia in 2019.22 The
World Bank, in collaboration with the Philippines government, has also launched the Data for Road
Incident Visualization Evaluation and Reporting (DRIVER) system to facilitate data sharing for road
safety analysis. This free web-based, open-source platform connects traffic crash data from multiple
agencies through a standardized reporting system. DRIVER also provides tools to geo-spatially an-
alyze road crash data, predict blackspots, estimate the economic costs of crashes, and evaluate the
effectiveness of various interventions to support investments and policymaking for improved road
safety.23
Road safety practitioners are increasingly turning to data partnerships to obtain crash, traffic,
and other types of data for road safety analysis. For example, in Kenya, the WHO estimates that up
to 75 percent of crashes go unreported.24 SmarTTrans, a collaboration between the Kenyan government
and the World Bank, has worked to fill this gap by bringing together crash information both
from administrative records and from bystander crash reports on Twitter.25 In addition, the team
has leveraged the Development Data Partnership (DDP) to access Waze API data and Uber congestion and
speed information for all 6,200 km of Nairobi’s road network. Using all of these data sources, the
smarTTrans team is creating near real-time analytics to facilitate the identification of crash hotspots,
speeding, and congestion patterns.




21
   World Bank, Open Traffic: Easing Urban Congestion (Washington, DC: World Bank, n.d.),
https://olc.worldbank.org/system/files/WBG_BD_CS_OpenTraffic_1.pdf
22
   World Bank, Innovative Road Safety Risk Assessment Tool with Automated Image Analysis Technology (Washington, DC: World
Bank, 2019).
23
   World Bank, GRSF DRIVER Completion Report (Washington, DC: World Bank, 2019),
https://documents1.worldbank.org/curated/en/245151560919065747/pdf/Data-for-Road-Incident-Visualization-Evaluation-and-Reporting-Lowing-the-Barriers-to-Evidence-Based-Road-Safety-Management-in-Resource-Constrained-Countries.pdf
24
   WHO, Global Status Report on Road Safety 2018.
25
   Sveta Milusheva et al., “Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a
Resource for Urban Planning,” PLoS ONE 16, 2 (2021),
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244317


2.1 New Data (and Big Data) in Road Safety Analysis

Big data is generally understood as extremely large datasets that are generated by a wide range of
data sources, including machines, sensors, and other Internet of Things (IoT) devices. Big data can
also be captured over the internet through social media and other types of applications, especially
those that track locational or transactional data.
The large volume of such data is one of many characteristics that make big data especially useful
for road safety and other applications in transport and infrastructure development. For example,
big data can be generated at immense velocity, especially as more such data is collected in real time
and for large populations. It also occurs in a variety of data formats, from structured databases to
unstructured text documents, emails, video, audio, stock ticker data, and financial transactions. Big
data is also characterized by a high degree of variability since data flows can change over time,
depending on seasons, off-peak hours, or availability of collection methods across an entire population
under study. Table 3 provides a SWOT analysis of the use of big data in road safety analysis.
For transport, the increasing use of personal mobile devices and vehicle sensors to collect traffic
and location data presents a significant opportunity to augment traditional sources of transport
data. Annex 1 discusses the most relevant big data types for road safety analysis. It also provides
guidance on the potential applications of these sources for evaluating road safety, and the advantages
and disadvantages of each source. The following sections discuss how big data can be used for the
various road safety assessment methods and tools discussed in Part 1.




TABLE 3:   SWOT analysis of using big data in road safety analysis
STRENGTHS

• Recent and broad geographic coverage allows researchers to dive deeper into transport issues and get a
  comprehensive and current picture of risks.
• Can help obtain real-time data and track up-to-the-minute changes in traffic flows and other important variables.
• May be faster and easier to obtain and process, compared to manual collection.
• Can offer higher spatial and temporal resolution than conventional sources.
• Can be more affordable and easier to scale.
• Vast quantities of data can limit bias from outliers and other sources of “noise” since data gets aggregated
  across vast populations.
• Can help improve data quality since it often covers a large geographic and/or temporal scope, also allowing for
  comparison against “control” datasets and scenarios.

WEAKNESSES

• Requires investment in expertise, software and computing power to store, access and process big data.
• Availability of data can vary significantly by geography and context.
• Coverage can be inconsistent or exclude important segments of the population.
• Most big data sources are not set up to support road safety assessments; the data was often collected for other
  purposes and is repurposed for road safety analysis. This can lead to the data being biased, incomplete and/or
  difficult to incorporate in road safety analysis.
• Need to consider the interoperability of different datasets (i.e., how easy it is to combine different datasets
  for complex road safety assessment models).
• Changes in privacy laws and other relevant policies can impact quality, consistency and coverage of data.

OPPORTUNITIES

• Provides an alternative approach to road safety data collection and analysis that may complement or supplement
  traditional approaches or datasets. For example, big data sources may be able to collect more accurate crash data.
• Big data analysis can uncover new dynamics, complex behavioral patterns and relationships, and correlations that
  conventional statistical methods and data may not be able to detect.
• Growing interest in autonomous vehicles is generating more data about road systems, vehicles, and vulnerable users
  that can be integrated into road safety analysis.
• Rising momentum for the creation of a “big data platform” where data providers can sell or share data.

THREATS

• Privacy concerns – data should be de-identified and anonymized before use.
• Data providers may be reluctant to share data.
• Governments, local municipalities, and other stakeholders must invest in technological infrastructure to support
  big data collection and analysis.
• Need to enforce quality control to limit risk of data bias.
• Licensing constraints – most private companies, such as Google, provide limited licenses for data use.
                                                                                                            SOURCE: Original table for this publication.



Big data, especially when combined with ML, which is discussed in the following section, can
enhance the capabilities of current systems and road safety assessment tools. The increasing use
of IoT devices, which range from smartphones to vehicle sensors, as well as Intelligent Transport
Systems (ITS), is making it possible to collect, access and utilize real-time data about a large range of
variables that are relevant to road safety analysis. This includes traffic flows, crash sites, peak tim-
ings, travel times and road usage by pedestrians, bicyclists, and motorists. The availability of such ex-
tensive data creates new possibilities for crash risk modelling, especially to predict the outcomes of
various types of road safety interventions as well as possible impacts of road infrastructure projects.
As mobile phone use rises globally, smartphones have become a prominent source of big data,
though there are many other sources to consider. In addition to the location and velocity of road
travelers collected passively through mobile devices, transportation projects can take advantage of
street view, aerial, and satellite imagery, traffic monitoring systems, and connected vehicles for road
safety analysis, as well as crowdsourced data provided by the community through mobile devices.26
Annex 2 provides an overview of the most relevant and accessible big data sources for road safety
analysis. Road safety practitioners are advised to look for relevant local and regional data providers
based on the region(s) of interest that concern their project(s). As big data infrastructure advances
globally and new companies and startups begin data collection for various purposes, it is likely that
the list of available big data sources will expand significantly in coming years.


26 Alex Neilson et al., “Systematic Review of the Literature on Big Data in the Transportation Domain: Concepts and
Applications,” Big Data Res. 17 (2019): 35-44. https://doi.org/10.1016/j.bdr.2019.03.001

Street view imagery can complement or potentially substitute for manual or commissioned road
surveys to collect data on road safety attributes for various types of assessments. For example, street
view imagery can help obtain baseline data for RSIA more quickly and cheaply, especially if the data
is not already readily available. By applying ML algorithms to street view images, road attributes and
other data that are important for road safety assessments can be detected. Similarly, there may be
instances where satellite or aerial imagery, such as that collected by an unmanned aerial vehicle
(UAV) or drone, can be analyzed to detect road or road user attributes. Figure 3 shows the same
crosswalk as mapped in OpenStreetMap (OSM) and as seen in satellite and street view imagery. ML is
discussed in greater detail in the next section.

FIGURE 3:   Street view and OSM
Road safety data, such as road markings and signs, types of road users, and designated paths for
vulnerable users, can be extracted from images. Each image and its relevant attributes are geolocated for
further analysis. In this instance, the crosswalk identified in OSM can be verified in street view imagery.




                                             SOURCE: Original figure for this publication derived from OSM, Mapillary, and Maxar Technologies.
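
Because both the mapped features and the attributes detected in imagery are geolocated, the verification shown in Figure 3 can be approximated by a simple distance check. The sketch below is illustrative only: the feature identifiers, coordinates, attribute labels, and the 15-meter matching threshold are hypothetical, and a real workflow would take detections from an ML model and mapped features from an OSM extract.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def verify_features(mapped, detected, max_dist_m=15.0):
    """For each mapped feature, return the distance in meters to the nearest
    detection of the same type within `max_dist_m`, or None if there is none."""
    results = {}
    for feature_id, (ftype, lat, lon) in mapped.items():
        best = None
        for dtype, dlat, dlon in detected:
            if dtype != ftype:
                continue
            dist = haversine_m(lat, lon, dlat, dlon)
            if dist <= max_dist_m and (best is None or dist < best):
                best = dist
        results[feature_id] = best
    return results


# Hypothetical mapped crosswalks (for example, taken from an OSM extract).
mapped = {
    "osm-1": ("crosswalk", 4.65000, -74.08000),
    "osm-2": ("crosswalk", 4.66000, -74.09000),
}
# Hypothetical attributes detected in geolocated street view images.
detected = [
    ("crosswalk", 4.65005, -74.08003),
    ("stop_sign", 4.66000, -74.09000),
]

results = verify_features(mapped, detected)
for feature_id, dist in results.items():
    status = f"verified ({dist:.1f} m away)" if dist is not None else "not verified"
    print(feature_id, status)
```

Features that cannot be verified are candidates for a manual check, since either the map or the imagery may be out of date.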



Mobile applications and telematics can provide data related to vehicle movement to identify road
infrastructure risks. This data includes current and historical average speeds along road segments as
well as irregularities, like traffic jams and incidents. This data is useful for most proactive road safety
assessment tools, including RSIA, RSA, and RSI. It can be geographically visualized and analyzed, such
as through heatmaps or hotspot analysis, as shown in figure 4 (see Annex 3 for additional examples and
descriptions). Telematics data has also been used to assess driver behavior, facilitate the prediction of
crash-prone locations, and create geographic visualizations, as discussed in interviews with researchers
at the ARRB and Professor George Yannis from the National Technical University of Athens. However,
data privacy is an especially important concern when it comes to the use of telematics data.27


27
   Anthony Germanchev (Principal Professional Leader, Advanced Technologies Lab, Australian Road Research Board) and Professor
George Yannis (School of Civil Engineering, National Technical University of Athens), in discussion with the authors, April 2021.


FIGURE 4:   Hotspot analysis of major crashes reported by Waze application users

     Bogotá, Colombia
     Waze Major Crash
         Cold Spot - 95% Confidence
         Cold Spot - 90% Confidence
         Not Significant
         Hot Spot - 90% Confidence
         Hot Spot - 95% Confidence
         Hot Spot - 99% Confidence




                                                       SOURCE: Original figure for this publication (data provided by Waze App; learn more at waze.com).
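
The confidence classes in the Figure 4 legend resemble the output of a Getis-Ord Gi* hotspot analysis. The Note does not state which statistic was used, so the sketch below assumes Gi* purely for illustration. It also assumes that crash reports have already been counted per cell of a regular grid, that each cell's neighborhood is the cell itself plus its adjacent cells with equal weights, and that the usual two-sided z-score thresholds (1.645, 1.960, 2.576) define the 90, 95, and 99 percent classes. The counts are invented.

```python
import math


def gi_star_grid(grid):
    """Return Getis-Ord Gi* z-scores for a rectangular grid of counts.

    The neighborhood of a cell is the cell itself plus its adjacent cells
    (including diagonals), each with a weight of 1.
    """
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    z = [[0.0] * cols for _ in range(rows)]
    if n < 2:
        return z
    mean = sum(values) / n
    std = math.sqrt(max(sum(v * v for v in values) / n - mean ** 2, 0.0))
    for r in range(rows):
        for c in range(cols):
            local_sum, w = 0.0, 0
            for rr in range(max(r - 1, 0), min(r + 2, rows)):
                for cc in range(max(c - 1, 0), min(c + 2, cols)):
                    local_sum += grid[rr][cc]
                    w += 1
            # With weights of 1, sum(w) and sum(w^2) both equal the neighborhood size.
            denom = std * math.sqrt((n * w - w * w) / (n - 1))
            if denom > 0:
                z[r][c] = (local_sum - mean * w) / denom
    return z


def classify(z_score):
    """Map a z-score to a legend class like those shown in Figure 4."""
    for threshold, level in ((2.576, "99%"), (1.960, "95%"), (1.645, "90%")):
        if z_score >= threshold:
            return f"Hot Spot - {level} Confidence"
        if z_score <= -threshold:
            return f"Cold Spot - {level} Confidence"
    return "Not Significant"


# Invented crash counts: a 3x3 cluster of high counts in one corner of a 6x6 grid.
crash_counts = [[10 if r < 3 and c < 3 else 1 for c in range(6)] for r in range(6)]
z_scores = gi_star_grid(crash_counts)
labels = [[classify(z) for z in row] for row in z_scores]
print(labels[1][1])
```

In practice this kind of analysis is usually run in GIS software or a spatial statistics library, which also handles irregular geometries and distance-based weights.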



Mobile applications are helping overcome underreporting of road crashes by crowdsourcing incident
reports. For example, in Kenya, road crashes have been shown to be largely underreported,
especially in areas where incident reporting mechanisms are lacking or underdeveloped.28 Navigation
applications such as Waze are providing a valuable new source of crash and traffic data by allowing
users to report incidents through their smartphone applications. Each incident report submitted by
a user is geolocated and timestamped, which allows it to be combined with other geospatial data to
identify segments of a road that are experiencing major or minor crashes, light to standstill traffic
jams, or hazardous conditions (hazards on the road or on the shoulder, weather alerts, or dangerous
road surfaces). Additionally, social media platforms like Twitter are used by many people on the
ground to report on crashes and traffic conditions and can be leveraged using machine learning
algorithms to produce additional data on crashes, as was done by the smarTTrans team in Nairobi.29
Lastly, mobile application data can be generated in real time to assist with monitoring, or collected
and analyzed over time to develop models.
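
Since each crowdsourced report carries a location and a timestamp, a common first step is to count reports by area, time of day, and incident type. The sketch below is a minimal illustration under stated assumptions: the field names (`lat`, `lon`, `time`, `type`), the incident type labels, and the 0.01-degree grid cell are hypothetical and do not reflect the schema of Waze or any other provider.

```python
import math
from collections import Counter
from datetime import datetime


def bin_reports(reports, cell_deg=0.01):
    """Count reports per (grid row, grid column, hour of day, incident type)."""
    counts = Counter()
    for report in reports:
        row = math.floor(report["lat"] / cell_deg)
        col = math.floor(report["lon"] / cell_deg)
        hour = datetime.fromisoformat(report["time"]).hour
        counts[(row, col, hour, report["type"])] += 1
    return counts


# Hypothetical incident reports.
reports = [
    {"lat": -1.2841, "lon": 36.8233, "time": "2021-03-01T08:15:00", "type": "major_crash"},
    {"lat": -1.2876, "lon": 36.8291, "time": "2021-03-01T08:40:00", "type": "major_crash"},
    {"lat": -1.3012, "lon": 36.7804, "time": "2021-03-01T17:05:00", "type": "hazard"},
]

counts = bin_reports(reports)
for key, count in sorted(counts.items()):
    print(key, count)
```

Counts of this kind can then be joined to road segments or passed to a hotspot analysis. Several users may report the same incident, so duplicate reports may need to be merged before counting.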

28
   Guadalupe Bedoya Arguelles, et al., “Smart and Safe Kenya Transport (SMARTTRANS).”
29
   Sveta Milusheva et al., “Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a
Resource for Urban Planning.”

A growing number of countries and regions are focusing on developing a big data infrastructure
to collect official incident reports. Collecting comprehensive and accurate information about road
incidents is an important objective for government transportation agencies. There is growing interest
in gathering and analyzing the information in big data formats to provide deeper and more
comprehensive insight into road safety risks and the impact of different interventions. The collection of
real-time data would also be beneficial for this purpose, for which collecting, storing, and analyzing
the information as big data would be most realistic and feasible.

How to Access Big Data

Big data for road safety generally falls into two categories: public sector and private sector.
Traditionally, governments have collected and provided data for road safety analysis, such as police reports
of crash incidents. However, alternative sources are becoming increasingly available as mobile apps
are used to crowdsource reports of roadside incidents and companies aggregate traffic speeds from
proprietary mobile applications. Data quality from such sources can vary significantly by location,
with certain sources being more effective, reliable, and better developed in some regions than in
others. Road safety practitioners are advised to use the list provided in Annex 2 as a starting point
and find the most relevant data providers for the region(s) of interest that their project focuses on.
This Guidance Note focuses on big data sources that are most easily and readily accessible for road
safety analysis. Different sources require different approaches to obtaining relevant data quickly and
efficiently. It is important to understand the licensing restrictions that accompany each source. For
example, even though a dataset is crowdsourced, it may have licensing restrictions. It is best to con-
sult a legal advisor and the data provider to clarify terms of use when necessary.
Public sector. Governments can collect, manage, and share data relating to transport, infrastructure,
and mobility. Many governments, whether at the national or the municipal level, are
establishing open data platforms where datasets can be accessed by running a simple search query.
Such platforms have already been created in the Philippines as well as in Australia and the United
States.30 In other instances, particularly where the data infrastructure is not as advanced, data may
have to be requested through the relevant department. It is often possible to obtain datasets from
government sources, such as crash histories or road sensor data, that are extensive enough
to be processed as big data in road safety analysis.
The World Bank’s Road Safety Observatories (RSO) initiative also has the potential to become an
important source of government-generated big data in the future. The Observatories provide a for-
mal network of government representatives to share and exchange road safety data and experience
in order to improve road safety throughout the region. The World Bank established its first RSO in
Latin America (OISEVI), before introducing the initiative in Africa (ARSO) and Asia-Pacific (APRSO).
By enhancing road safety data and information systems, the Observatories play a pivotal role in help-
ing countries monitor, evaluate, and develop more impactful road safety policies and interventions.31

30
   Australian BITRE (Bureau of Infrastructure and Transport Research Economics), “Australian Road Deaths Database
(ARDD),” Australian BITRE, updated May 13, 2021, https://data.gov.au/data/dataset/australian-road-deaths-database; ODPH
(Open Data Philippines), “Open Data Philippines,” ODPH, accessed June 3, 2021, https://data.gov.ph/; US NHTSA (United
States National Highway Traffic Safety Administration), “Data,” US NHTSA, accessed May 28, 2021,
https://www.nhtsa.gov/data
31
   World Bank, “Better Data for Safer Roads: The Powerful Mission of Road Safety Observatories,” last modified November 5,
2020, https://www.worldbank.org/en/news/video/2020/11/05/better-data-for-safer-roads-the-powerful-mission-of-road-safety-observatories

In other cases, publicly available datasets with a global reach may be considered. A good example
of this is OSM, which offers freely available geographic data generated by volunteers who trace satellite
images around the world to create and update the map, which consists of road networks (detailing road
types, bridges, tunnels, direction of traffic flow), among other features. OSM data can be combined
with other datasets for road safety analysis. While OSM provides an overview of the road geometry,
the recency and accuracy of the data require validation. Due to variability in quality and coverage,
OSM data should be considered a starting point and is not recommended for detailed assessments.
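
One common way to retrieve OSM road data for a study area is the Overpass API, a public read-only query service for OSM. The sketch below builds an Overpass QL query for roads and mapped pedestrian crossings inside a bounding box. The bounding box is hypothetical, the endpoint shown is one public instance with its own usage limits, and the `fetch` helper is included only to show how the query could be sent; it is not called here.

```python
import json
import urllib.parse
import urllib.request

# One public Overpass instance; usage limits apply.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_query(south, west, north, east, timeout_s=25):
    """Build an Overpass QL query for roads and crossings in a bounding box."""
    bbox = f"{south},{west},{north},{east}"
    return (
        f"[out:json][timeout:{timeout_s}];"
        "("
        f'way["highway"]({bbox});'
        f'node["highway"="crossing"]({bbox});'
        ");"
        "out geom;"
    )


def fetch(query):
    """Send the query to the Overpass endpoint and return the list of OSM elements."""
    payload = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(OVERPASS_URL, data=payload, timeout=60) as response:
        return json.load(response)["elements"]


# Hypothetical bounding box (south, west, north, east).
query = build_query(-1.30, 36.80, -1.28, 36.83)
print(query)
```

Given the variability in OSM quality noted above, results retrieved this way should be validated against other sources before being used in an assessment.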
Private sector. Mobility datasets are generated through ride-hailing services, delivery services, so-
cial media, and other mobile applications that collect user location and movement. Companies in the
transportation and logistics sector use smartphone applications to digitize their operations and take
advantage of higher quality, real-time data to improve efficiency as well. Other companies provide
telematics software to track vehicle movement and safety features. Companies and start-ups investing
in autonomous vehicle research are providing valuable sources of big data for road safety analysis.
Some companies also provide APIs that allow developers to access these datasets (often on a limited
basis). However, proprietary or commercial data may have to be purchased in some instances, or
data partnerships may need to be established to access such data. It is also crucial to understand how the
data is licensed and can be legally used for different types of analysis. For example, Google restricts
digitizing and tracing information as well as using applications to analyze and extract information
from street view images, although annotation and labelling is permitted.32
Data Partnership Agreements. Road safety practitioners can access various datasets for road safety
analysis through data partnership agreements with companies. Practitioners can directly contact
companies to request data relevant to road safety and, upon signing a licensing agreement, receive
the data. Practitioners can also leverage data sharing platforms such as the Development Data Part-
nership (DDP), which is accessible to practitioners affiliated with certain international development

32
  Google, “Google Maps, Google Earth, and Street View,” accessed May 14, 2021,
https://about.google/brand-resource-center/products-and-services/geo-guidelines/


organizations. DDP is a formal collaboration of private sector companies and select international
organizations to use third-party data in research and international development.33
The Waze for Cities program is one example of a data sharing agreement that can be leveraged
through direct contact with the company or, if accessible, through DDP. The program allows cities
to utilize data standards designed by Waze for closure and incident reporting to reduce data frag-
mentation and promote transport and government data aggregation. It now has more than 500 glob-
al partners including city, state and country government agencies, nonprofits and first responders.
Another example of a possible data provider for road safety analysis is Moovit, an app focused on
public transport that offers Mobility as a Service (MaaS) solutions for cities, providing personalized apps,
payment solutions, real-time transit information, and other analytics.
In many cases, data providers help local governments by exchanging data. For example, the city
of Tokyo in Japan has partnered with a private firm to develop a smartphone compatible app, Zen-
ryoku Annai!. The app analyzes nearly 360 million observations every second to generate real-time
information on the shortest and least-congested travel routes. A similar intelligent transport system
(ITS) in Denmark, Copenhagen Connecting, was implemented to promote transport sustainability
through real-time digital traffic control and weather adaptation options. Road safety practitioners
should consider seeking the support of local governments to establish data partnership agreements,
particularly if the datasets are not accessible through DDP.
Data marketplaces. Business leaders are keen to explore the value of the big data they collect as a
tradable commodity. This has given rise to data marketplaces, which are essentially online platforms
dedicated to the buying and selling of data. These marketplaces can provide a more cost-effective
source of data than other ways of acquiring it. Dedicated marketplaces for traffic and
transport data have also emerged in recent years, although their coverage of LMICs tends to be low.
As part of its efforts to establish an artificial intelligence tool for road safety analysis (called Ai-
RAP), iRAP is seeking to establish a data marketplace where public and private data providers can
trade data for road safety analysis. The data marketplace will focus on three types of data products,
according to Monica Olyslagers (Safe Cities and Innovation Specialist at iRAP), who was interviewed
for this Guidance Note.34 The first is raw datasets that need to be processed to extract relevant in-
formation. The second is datasets that have been at least partially cleaned up and processed by data
providers or Ai-RAP and are ready to be plugged into road safety assessments. The third is pre-
pared-for-purpose datasets that are specifically commissioned for road safety assessments in differ-
ent types of projects. This data marketplace model is currently being piloted in Africa, as part of a
project to set up a regional road safety observatory there in collaboration with the World Bank.
The new data marketplace will initially focus on aggregating and trading conventional datasets.
However, the project team plans to bring on big data providers and incorporate ML in the Ai-RAP tool
to allow for more sophisticated analysis in road safety assessment procedures. Road safety practitioners
are advised to search data marketplaces as a lower-cost alternative to commissioning data
collection for their projects.




33
     Development Data Partnership, https://datapartnership.org/
34
     Monica Olyslagers (Safe Cities and Innovation Specialist, iRAP), in discussion with the authors, April 2021.


Key Considerations for Selecting the “Right” Big Data Source

This section provides an overview of how different big data sources can be used. The data sources
covered in table 4 for each method or assessment type should be viewed as guides, rather than
concrete, all-inclusive lists. The most appropriate choice of data sources should ultimately be determined
by considering the costs and benefits of each source. Factors that may be useful to
consider for this purpose are discussed toward the end of this section. It is also worth noting that
while big data may not be a feasible alternative to conventional data for every project or assessment
(if only at present), it can still complement and supplement current approaches or be used to validate
their outcomes and analyses.35

TABLE 4:   Overview of potential big data sources for road safety assessments
TYPE OF DATA REQUIRED       METHODS IT IS USED FOR        POTENTIAL BIG DATA SOURCE               EXAMPLES

Crash data from 3–5 years   Methods I, V and VI           Government                              Government portal or contact
                                                          Mobile applications and telematics      Waze
                                                          Crowdsourced                            Waze
Operating speeds            Methods II to IV              Mobile applications and telematics      Mapbox, Waze
Road features (road         Methods III, V, VI, and VII   Street view imagery                     Mapillary
markings, signs, traffic                                  Crowdsourced                            OSM
calming measures, etc.)                                   Aerial and satellite imagery            Maxar, UAV
Road type (urban road,      Methods III, V, VI, and VII   Street view imagery                     Mapillary
pedestrian area, etc.)                                    Crowdsourced                            OSM
                                                          Aerial and satellite imagery            Maxar, UAV
                                                          Mobile applications                     Orbital Insight
Vehicle fleet mean speed    Methods III to VII            Mobile applications and telematics      Mapbox, Waze
Traffic flow                Methods IV to VII             Traffic imagery                         Mapillary
                                                          Aerial and satellite imagery            Maxar, UAV
                                                          Mobile applications and telematics      Mapbox, Waze
                                                                                               SOURCE: Original table for this publication.



As a broader variety of big data sources become available, road safety practitioners are advised to
carefully consider the trade-offs involved when collecting data from various sources. The factors
noted below do not provide an exhaustive list. Some factors may be more relevant to some projects
than others, while additional considerations may be required for certain projects. In some cases, data
from existing sources may not be available and will need to be collected using cameras, sensors, and/
or other tools.
It is worth noting that many of these factors are also interrelated. For example, the types and
quantity of data required could impact the costs of obtaining and processing it. Costs can also vary
by region, as can the availability of resources to process and analyze the data. This list may be
used in tandem with Annex 2, which provides an overview of the most relevant big data sources
for road safety analysis as well as their relative costs, data attributes and formats, and possible
limitations.




35
  Holly Krambeck, Magreth Kakoko, and Mireille Raad, Using Computer Vision to Automatically Detect Road Features for Road
Safety Audits and Assessments: Inception Report (Washington, DC: World Bank, 2019).


•	 Type of road safety assessment or procedure. As discussed in Part 1, a broad range of tools and
   procedures are used for road safety assessments. Each tool has its own specific data require-
   ments. It is important to consider these before determining appropriate big data sources to com-
   plement analysis.
•	 Context/Region(s) of Interest. The types and variety of big data sources available can vary great-
   ly from region to region, country to country, or even different provinces or localities within the
   same country. For example, Waze crowdsourced crash data is especially useful for urban regions
   that are more densely populated compared to rural regions.
•	 Type of data required. As more big data sources become available for road and traffic data, road
   safety practitioners should carefully consider which variables and data types are most relevant to their
   model before selecting a source. For example, Google offers a number of APIs that may be useful
   for road safety analysis. These include Google Maps, Google Traffic, and Google Street View. It is
   also important to consider the quantity, duration, and extensiveness of the data required. For example,
   some data sources include time-series information, while others do not. Some may include specific
   road features or road user data, while others may just be focused on traffic flows.
•	 Data formats. Big data is collected, stored, and transmitted in a wide range of formats. It is
   important to consider the usability of available big data formats as well as their interoperability
   with other types of data. Since many big data sources that are currently available are not custom
   designed for road safety analysis, it may be necessary to invest in resources and skilled expertise
   to extract, aggregate, clean, and convert the data into a format that can be combined with other
   data and/or used with analytical tools and models.
•	 Cost. Given the size of big datasets, costs can arise from accessing, storing, handling, process-
   ing, and analyzing the data. The cost may be in the form of data licenses, software licenses or
   equipment (if the data is being collected specifically for the project at hand). Besides the cost of
   obtaining the data, it is also important to consider the cost of using it, such as by acquiring the
   necessary expertise, software tools and processing power for analysis. Annex 2 discusses the
   relative costs associated with using different big data sources.
•	 Resources required to make data usable. In addition to relevant data sources and the costs that
   may be associated with accessing them, other resources could also be required to utilize the data
   in road safety assessment and analysis. This includes technical skills and expertise required to
   handle and analyze the data.
•	 Time constraints. Data can be obtained faster from some big data sources than from
   others. For example, open data platforms allow you to run a search query and instantly obtain
   relevant datasets. Other avenues, such as data sharing agreements, may take longer to deliver
   the required data. It is important to consider the project timeframe to determine which data
   source may be more useful for road safety analysis at a given stage.
•	 Licensing constraints. Any official and legitimate data source is accompanied by licensing reg-
   ulations that outline the terms of use of the provided dataset. Big data sources are no exception.
   Different data sources have different licensing agreements associated with them. Some, such as
   open data platforms, may have minimal licensing restrictions. Others, such as APIs and data-
   sets obtained through data partnership agreements, can have more restrictive terms of use. It
   is important to carefully consider these limitations before choosing a source. Road safety prac-
   titioners are advised to consult legal advisors or the data provider to fully understand licensing
   restrictions associated with different big data sources to avoid legal ramifications.


2.2 Machine Learning in Road Safety Analysis

ML is a branch of artificial intelligence. It involves creating algorithms that “learn” patterns, trends
and behaviors from data and improve accuracy over time without further programming. As figure 5
illustrates, the lifecycle of an ML model can be typically divided into two phases: training and deploy-
ment. In the training phase, training data is fed into the algorithm to obtain a trained model. In the
deployment phase, new input data is fed into the trained algorithm (or model) to predict the output.
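The two phases can be sketched in a few lines of code. The example below is only an illustration under simple assumptions: the "algorithm" is an ordinary least-squares line, the data values are made up, and the function names `train` and `predict` are chosen for this sketch.

```python
def train(xs, ys):
    """Training phase: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept  # this pair is the "trained model"


def predict(model, x):
    """Deployment phase: apply the trained model to new input data."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: traffic volume (thousand vehicles/day) vs. crashes/year.
volumes = [1.0, 2.0, 3.0, 4.0]
crashes = [3.0, 5.0, 7.0, 9.0]

model = train(volumes, crashes)        # training phase
new_prediction = predict(model, 5.0)   # deployment phase on unseen input
```

The same separation holds for more complex algorithms: training is done once on historical data, and the resulting model is then reused on new inputs.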

FIGURE 5:   ML lifecycle
[Diagram: training data is used to train the algorithm, producing a trained model; new input data is then fed into the trained model, which outputs a prediction.]
                                                                                               SOURCE: Modified from https://randomtrees.com/data-science



As shown in figure 6, ML algorithms can be divided into three categories: supervised learning,
unsupervised learning, and reinforcement learning. The specific tasks they are capable of and the
corresponding algorithms that are most widely used for this purpose are also listed in table 5. One
significant difference between these categories is the format and source of training data.

FIGURE 6:   Categories of ML and the tasks they can perform
[Diagram: machine learning branches into three categories, each shown with example tasks.
 Supervised learning: classification (fraud detection, image classification, customer retention, diagnostics) and regression (weather forecasting, advertising popularity predictions, estimating life expectancy, market forecasting).
 Unsupervised learning: dimensionality reduction (meaningful compression, structure discovery, feature elicitation, big data visualization) and clustering (recommendations, customer segmentation, targeted marketing).
 Reinforcement learning: real-time decisions, game, robot navigation, skill acquisition.]
                             SOURCE: Modified from https://towardsdatascience.com/coding-deep-learning-for-beginners-types-of-machine-learning-b9e651e1ed9d




Supervised learning is a family of algorithms that learn from previous data to map an input (X) to an
output (Y). For example, a supervised learning algorithm can be used to predict the risk level or crash
frequency (Y) of a road segment given its characteristics (X). “Supervised” means the training data is
labelled (i.e., the training data consists of X-Y pairs, where the Y values are called labels).
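A minimal sketch of this idea, assuming a k-nearest neighbors classifier and a handful of made-up labelled road segments, is shown below. The features (speed limit and number of lanes), the labels, and the function name `knn_predict` are illustrative assumptions only.

```python
from collections import Counter
import math

# Made-up labelled training data: X = (speed limit in km/h, number of lanes),
# Y = risk level. Values are illustrative only.
training_data = [
    ((30, 1), "low"),
    ((40, 2), "low"),
    ((50, 2), "low"),
    ((80, 2), "high"),
    ((90, 3), "high"),
    ((100, 4), "high"),
]


def knn_predict(x, data, k=3):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    by_distance = sorted(data, key=lambda pair: math.dist(x, pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


risk_slow_street = knn_predict((35, 1), training_data)
risk_fast_road = knn_predict((95, 3), training_data)
```

In practice the features would normally be rescaled first, since the raw speed values dominate the distance calculation here.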
Unsupervised learning algorithms find structures in a dataset in order to group or cluster data points
based on their similarity. As the name suggests, these algorithms do not require “supervision” or
human intervention in the training phase. This means that, unlike supervised learning, the training
data for unsupervised learning algorithms has no labels (Y). These algorithms learn to group X based
on similar characteristics. The most common unsupervised learning task is clustering. For example,
given the characteristics of a road segment, an unsupervised learning algorithm can assign it to
a group of similar segments. It does not need to understand what the group represents
to complete this task.
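The clustering task can be illustrated with a plain K-means sketch. The segment data below is made up, no labels are supplied, and using the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Made-up, unlabelled segments: (mean speed km/h, pedestrian crossings per km).
segments = [(30, 8), (85, 1), (35, 7), (90, 0), (28, 9), (95, 1)]
groups, centres = kmeans(segments, k=2)
```

The algorithm returns group numbers only. Interpreting a group (for example, as "slow streets with many crossings") remains a task for the analyst.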
Reinforcement learning trains a software agent to make decisions that maximize rewards from
interactions with an external environment.36 As opposed to supervised learning and unsupervised
learning, which require training data to be prepared before training, reinforcement learning gener-
ates the training data during the training phase. The data is generated when the agent interacts with
the environment. For example, reinforcement learning can be used to train an agent to control traffic
lights based on traffic conditions.
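The traffic light example can be sketched with tabular Q-learning on a toy environment. Everything about the environment below is an assumption made for illustration: two states (which approach has the longer queue), two actions (which approach gets the green light), a reward of +1 or -1, and queues that change at random.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_STATES = 2   # state 0: north-south queue is longer; state 1: east-west queue is longer
N_ACTIONS = 2  # action 0: green for north-south; action 1: green for east-west
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2


def step(state, action):
    """Toy environment: reward +1 for serving the longer queue, otherwise -1."""
    reward = 1.0 if action == state else -1.0
    next_state = random.randrange(N_STATES)  # queues change at random
    return reward, next_state


q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
state = 0
for _ in range(5000):
    # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: q_table[state][a])
    reward, next_state = step(state, action)
    # Q-learning update: move the estimate toward reward + discounted best next value.
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (reward + GAMMA * best_next - q_table[state][action])
    state = next_state

# Greedy policy learned from the table: the preferred action in each state.
policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(N_STATES)]
```

Note that no dataset is prepared in advance: every (state, action, reward) sample is produced by the agent's own interaction with the environment during training.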

TABLE 5:   Categories of ML and algorithms*
                            ALGORITHMS                 TASKS
Supervised Learning         SVM, DT, RF, KNN, ANN      Classification
                                                       Regression
Unsupervised Learning       K-means, PCA, ANN          Clustering
                                                       Dimensionality Reduction
Reinforcement Learning      Q-Learning, DQN            Robotics/Decision-making
*The algorithms listed in this table are not exhaustive.
SVM: support vector machine
DT: decision trees
RF: random forest
KNN: k-nearest neighbors
ANN: artificial neural networks
PCA: principal component analysis
DQN: deep Q-network, which includes an ANN in its algorithm
                                                    Source: Original table for this publication.



Artificial neural networks (ANNs) are a family of ML algorithms inspired by the human
brain. ANNs are among the most versatile ML algorithms: they can be used for supervised learning,
unsupervised learning, and reinforcement learning. As shown in figure 7, an ANN structures the data and
the computation in layers. Every layer adds depth to the algorithm, so a network with more
layers is “deeper”. ANNs with many layers are called deep neural networks (deep ANNs or DNNs).
ML algorithms that use deep ANNs are called deep learning (DL) algorithms. From this
perspective, ML algorithms can be divided into conventional ML and DL (table 6).
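The layered computation can be sketched as a forward pass through a small network with three inputs, one hidden layer, and one output. The hidden layer size (four neurons), the sigmoid activation, and all weight values are assumptions made for this illustration; the weights are untrained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then a sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical, untrained weights chosen only to make the example concrete.
hidden_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.9], [0.2, 0.2, 0.2]]
hidden_biases = [0.0, 0.1, -0.1, 0.0]
output_weights = [[0.7, -0.3, 0.5, 0.1]]
output_biases = [0.05]


def forward(inputs):
    hidden = dense_layer(inputs, hidden_weights, hidden_biases)  # hidden layer
    return dense_layer(hidden, output_weights, output_biases)    # output layer


outputs = forward([1.0, 0.5, -1.0])  # three inputs -> one output
```

A deeper network would simply chain more `dense_layer` calls between the input and the output, and training would adjust the weights rather than fixing them by hand.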




36
     This agent is a piece of software that makes a decision based on the environment.


FIGURE 7:   ANN structure
[Diagram: three inputs (Input 1, Input 2, Input 3) in the input layer connect to a hidden layer, which connects to a single output (Output 1) in the output layer.]
                                                                     SOURCE: Original figure for this publication.



TABLE 6:   ML and DL algorithms
                                    CONVENTIONAL ML*                          DL

Supervised Learning                 SVM, DT, RF, KNN, shallow ANN             Deep ANN
Unsupervised Learning               K-means, PCA                              Deep ANN
Reinforcement Learning (RL)         RL without deep ANN                       RL with deep ANN
*The conventional ML algorithms listed in this table are not exhaustive.
                                                            SOURCE: Original table for this publication.



Most ML algorithms are conventional ML algorithms. Conventional supervised learning algorithms,
such as support vector machine (SVM), can be used for classification or regression, for example,
to classify the risk level of a road segment based on its characteristics. Conventional unsupervised
learning algorithms, such as K-means clustering, automatically identify spatial patterns in
datasets, which can be applied to locate clusters or areas with recurring road crashes. Conventional
ML works well for small, low-dimensional datasets. DL, by contrast, is a subset of ML that learns
complex patterns from high-dimensional data (e.g., images) and from large quantities of data (e.g., big data).
Supervised, unsupervised, and reinforcement learning algorithms that use deep ANNs belong
to the deep learning category. DL’s first successful applications were in computer vision.
For example, image classification is a supervised learning task that uses deep neural networks to
classify images into different classes (e.g., cars, pedestrians, etc.).

How to Use Machine Learning

The use of ML methods in road safety analyses is being widely explored.37 As ML methods become
more advanced, economical, and accessible, their potential applications in various disciplines continue to
grow and become more feasible. In road safety analyses, ML has great potential to overcome the limita-
tions of traditional statistical models in crash analysis and crash probability modeling. The applications
of ML in road safety analyses are discussed under three categories: conventional ML, DL, and reinforce-
ment learning, as listed in table 7. It should be noted that some reinforcement learning algorithms using
deep ANN belong to DL, but all reinforcement learning techniques are discussed separately.

37
  Philippe Barbosa Silva, Michelle Andrade, and Sara Ferreira, “Machine Learning Applied to Road Safety Modeling: A
Systematic Literature Review,” Journal of Traffic and Transportation Engineering (English Edition), 7, no. 6, (2020),
https://www.sciencedirect.com/science/article/pii/S2095756420301410


TABLE 7:   Frequently used ML techniques for road safety analysis*
ML CATEGORIES           SUBCATEGORIES           ALGORITHMS                      TASKS                                               EXAMPLES

Conventional ML         Supervised Learning     SVM, DT, RF, KNN, shallow ANN   Classification                                      Predict risk level based on road characteristics.
                                                                                Regression                                          Crash frequency prediction based on road characteristics.
                        Unsupervised Learning   K-means                         Clustering                                          Group road segments by characteristics similarity; group drivers based on their driving behaviors.
                                                PCA                             Dimensionality Reduction                            Identify critical factors of road safety.

DL                      Supervised Learning     CNN                             Image Classification/Object Detection/Segmentation  Detect road features from images.
                        Unsupervised Learning   GAN                             Clustering/Dimensionality Reduction                 Find the hidden features related to road safety from map and satellite images of the road environments.

Reinforcement Learning  N/A                     Q-Learning, DQN                 Robotics/Decision-making                            Control traffic lights based on traffic conditions.
*The algorithms and examples listed in this table are not exhaustive.
CNN: convolutional neural network, a type of deep ANN
GAN: generative adversarial networks, a type of deep ANN
                                                                                                               SOURCE: Original table for this publication.



A growing body of research explores various ML techniques to predict the probability of road
crashes and assess their severity by training on historical datasets that encompass diverse fac-
tors. Conventional ML algorithms are the most frequently used ML algorithms for this purpose.
They are summarized in table 7. ML-based approaches to road safety analysis can be used to comple-
ment, supplement or even potentially substitute conventional road safety assessments.
Conventional supervised learning algorithms learn functions that take vectors of variables as in-
put to predict the output. Most conventional supervised learning algorithms that are frequently
used in data science have been used in road safety analyses, including but not limited to: decision
trees (DT), random forest (RF), support vector machine (SVM), k-nearest neighbors (KNN), and artifi-
cial neural networks (ANN).38 It should be noted that there is no “best” algorithm. Determining which
algorithm may be most appropriate for an ML-based road safety analysis is essentially a data science
problem for which there are usually no set rules. An algorithm may perform well on one dataset but
poorly on another. It is common practice for data scientists to try different algorithms in order to
find a suitable one for a specific problem. When using the aforementioned conventional supervised
learning algorithms for road safety assessments, the problem is often framed as a classification or
regression problem, in which the output (Y) of the ML algorithm is either a class (e.g., risk level or
severity: low, moderate, substantial or high) or a scalar (e.g., crash probability, crash frequency) and
the input (X) to the ML algorithm could be any parameter (including but not limited to weather, time,
road factors, human factors, etc.) that is related to the output.
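The practice of trying several algorithms on the same problem can be sketched as follows. The example frames crash frequency as a regression problem and compares three simple candidates by leave-one-out error. The data values, the candidate models, and the function names are assumptions made for illustration, not a recommended model set.

```python
def mean_model(train_x, train_y):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def linear_model(train_x, train_y):
    """Least-squares line through the training points."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)


def nearest_neighbour_model(train_x, train_y):
    """Predict the target of the closest training point."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def leave_one_out_error(fit, xs, ys):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (model(xs[i]) - ys[i]) ** 2
    return total / len(xs)


# Made-up data: X = traffic volume (thousand vehicles/day), Y = crashes per year.
volume = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
crashes = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

candidates = {"mean": mean_model, "linear": linear_model, "1-NN": nearest_neighbour_model}
errors = {name: leave_one_out_error(fit, volume, crashes) for name, fit in candidates.items()}
best = min(errors, key=errors.get)
```

On this nearly linear toy data the straight-line model wins, but a different dataset could favor a different candidate, which is why the comparison is repeated for each problem.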
Conventional unsupervised learning algorithms are mainly used for clustering and dimensional-
ity reduction purposes. In road safety analyses, K-means can be used for grouping tasks that help
find clustering patterns in the data. For example, it can be used to group road segments by similar
characteristics or group drivers based on their driving behaviors, so that dangerous road segments
or drivers can be identified based on the similarity. In another example of unsupervised learning ap-

38
     Silva, Andrade, and Ferreira, “Machine Learning Applied to Road Safety Modeling: A Systematic Literature Review.”


plication, principal component analysis is used for reducing the dimensions of input data to identify
the most critical factors that affect road safety.
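A minimal sketch of principal component analysis is given below, assuming made-up observations and using power iteration to find the first principal component. The factor names and values are hypothetical, and in practice the factors would usually be standardized first because PCA is sensitive to scale.

```python
import math


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector


# Made-up factors per segment: (mean speed, traffic volume, number of street trees).
# Speed and volume vary together strongly; the tree count barely varies.
observations = [
    (30.0, 10.0, 5.0),
    (50.0, 20.0, 5.1),
    (70.0, 30.0, 4.9),
    (90.0, 40.0, 5.0),
]
component = first_principal_component(observations)
```

The size of each entry in `component` indicates how strongly that factor contributes to the main direction of variation in the data. Whether that direction is also the most relevant one for road safety still has to be checked against crash outcomes.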
DL has been applied in various disciplines and achieved impressive performance. DL technologies
have progressed significantly over the past few years, especially in image analysis and computer
vision, the method’s first successful application. The core technique in this domain is the deep convolutional
neural network (CNN), which is the state-of-the-art approach for object detection, semantic
segmentation, and instance segmentation of images. Object detection is a task in which, given an
image, the model outputs a bounding box for each detected object (figure 8). Semantic segmentation is a
task in which, given an image, the model classifies every pixel into predefined classes (e.g., road lane,
traffic light, etc.). Instance segmentation is a task in which, given an image, the model groups the pixels
belonging to each individual instance of an object.
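The difference between these outputs can be made concrete with plain data structures. The sketch below does not run a CNN; it only shows what the results of object detection and semantic segmentation might look like and how they could be post-processed. The labels and confidence values echo figure 8, while the pixel coordinates, class list, and mask are hypothetical.

```python
from collections import Counter

# Object detection output: one (label, confidence, bounding box) per object.
# Boxes are hypothetical pixel coordinates (x_min, y_min, x_max, y_max).
detections = [
    ("Car", 0.98, (410, 300, 520, 380)),
    ("Truck", 0.92, (530, 280, 640, 385)),
    ("Person", 0.72, (250, 310, 280, 400)),
    ("Commercial sign", 0.45, (700, 120, 790, 160)),
]


def filter_detections(items, min_confidence):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in items if d[1] >= min_confidence]


confident = filter_detections(detections, min_confidence=0.5)
counts = Counter(label for label, _, _ in confident)

# Semantic segmentation output: one class index per pixel (tiny 3x4 "image").
CLASSES = ["background", "road lane", "traffic light"]
segmentation_mask = [
    [0, 0, 2, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
pixels_per_class = Counter(CLASSES[c] for row in segmentation_mask for c in row)
```

Instance segmentation would add a second mask of the same shape that stores an instance number per pixel, so that two adjacent cars, for example, could be told apart.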

FIGURE 8:   ML algorithms and street view
After applying an object detection algorithm to a street view image, a bounding box surrounds each predicted object, which also
contains a confidence level for each prediction.
[Image: street view of Bogotá, Colombia, with bounding boxes labelled by class and confidence, e.g., Logo 90%, Buildings 85%, Window 72%, Commercial sign 85%, Street sign 69%, Door 96%, Person 96%, Car 98%, Truck 92%, Merchandise 83%.]
                                                                                                                              SOURCE: World Bank Global Program for Resilient Housing.



DL-based image analysis has been successfully used in various industries for applications ranging
from facial recognition to autonomous driving. It has great potential to be used in road safety
analysis to automatically analyze images and infer road attributes that are relevant to road safety
assessments. Large sets of images with annotations such as road lanes, traffic lights, speed limit
signs, and pedestrians can be compiled for training deep CNNs so that they learn to recognize these
objects through images that the models have not previously encountered. If successful, this approach
should equip the model to detect road attributes at a regional scale.
The detected information can then be used for safety and risk analysis. For example, if the DL model
can infer the road segment characteristics (e.g., number of lanes, terrain type, road markings and
signs, and pedestrian, bicycling, and motorcycling facilities), the inferred information can readily be
used as input for various road safety assessment tools. This would allow the process of detection and
analysis to become fully, or at least significantly, automated and scalable at a low cost.
DL can also provide a lower-risk alternative to manual detection of certain road attributes and
other important variables in road safety analysis. For example, a team used imagery from Baidu
Street View to provide a practical, automated alternative to the manual detection of street cracks,
which can be labor-intensive, hazardous, and difficult to conduct on a large scale. The authors used the
DeepLabv3+ network model, a DL neural network, to develop an automated road crack identification
system and demonstrated its practicality as a method for generating information about road cracks
faster, more accurately, and at lower cost than manual detection.39
Reinforcement learning is widely used to design intelligent control and decision-making systems.
In road safety and traffic management, reinforcement learning is most commonly employed to develop
intelligent signal control algorithms. A typical reinforcement learning-based traffic light system
makes decisions based on specific input traffic parameters, such as the length of time for which
vehicles wait at the intersection, the cumulative delay caused by waiting at the intersection, and the
length of time for which the light stays green for each signal head. The output of the system is the
next color of the light and the length of time for which it should remain switched on. Designing traffic
systems using reinforcement learning can help save time and improve safety.
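
To make the idea concrete, the sketch below trains a tabular Q-learning agent on a toy two-approach
intersection. It is a deliberately simplified illustration, not a description of any deployed system:
the state is just the number of queued vehicles on each approach, the action only chooses which approach
gets green for one fixed time step (rather than a variable green duration), and the arrival rates,
discharge rate, queue cap, and learning parameters are all assumed values.

```python
import random

ACTIONS = ("NS_GREEN", "EW_GREEN")  # which approach gets green next


def step(queues, action, rng):
    """Advance the toy intersection by one time step."""
    ns, ew = queues
    # Up to two vehicles leave the approach that has green (assumed rate).
    if action == "NS_GREEN":
        ns = max(0, ns - 2)
    else:
        ew = max(0, ew - 2)
    # At most one vehicle arrives per approach (assumed arrival probabilities);
    # queues are capped at 5 to keep the state space small.
    ns = min(5, ns + (rng.random() < 0.6))
    ew = min(5, ew + (rng.random() < 0.3))
    reward = -(ns + ew)  # fewer waiting vehicles is better
    return (ns, ew), reward


def train(episodes=200, steps=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a table mapping (state, action) to estimated long-run value."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(steps):
            # Epsilon-greedy: mostly exploit, occasionally explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            next_state, reward = step(state, action, rng)
            # Standard Q-learning update.
            best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
    return q


def best_action(q, state):
    """Return the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


q_table = train()
print(best_action(q_table, (0, 0)))
```

A realistic controller would use a richer state (waiting times, cumulative delay, elapsed green time)
and would be trained and validated in a traffic simulator before any field use.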

Key Considerations for Using Machine Learning

Road safety can be evaluated explicitly using rule-based reasoning systems. However, developing
such systems can be complex if there are many input variables. Compared with rule-based evaluation
systems, ML algorithms are data-driven and do not require rules to be developed by hand; they can
therefore be relatively inexpensive to implement and are better suited to high-dimensional inputs.
As a broader spectrum of ML algorithms becomes available, road safety practitioners are advised to
carefully consider the trade-offs involved when applying them to road safety analysis. This section
discusses various factors that should be considered before deciding to use an ML algorithm for road safety
analysis in a project. Again, this is not an exhaustive list. Some factors may be more relevant to
some projects than others, while additional considerations may be required for certain projects. It is
worth noting that many of these factors are also interrelated. For example, the feasibility of using ML
for a project can be affected by time and budget constraints, the availability of data and the anticipated
resource intensiveness of the data preparation process. Table 8 provides a SWOT analysis of the use of
ML in road safety analysis.




39 Min Zhang et al., “Research on Baidu Street View Road Crack Information Extraction Based on Deep Learning Method,”
Journal of Physics: Conference Series, no. 1616 (2020). https://iopscience.iop.org/article/10.1088/1742-6596/1616/1/012086/pdf


TABLE 8:   SWOT analysis of using ML in road safety analysis

STRENGTHS
• Offers tools and techniques to process big data that may be more precise compared to traditional methods.
• Especially effective for feature learning, parameter optimization, and processing large amounts of big data.
• ML algorithms tend to perform better than traditional statistical techniques in cases where high-dimensional and
  highly nonlinear data is involved.
• As the technology develops, novel techniques create new opportunities to understand complex relationships between
  multiple, interrelated variables and predict outcomes with greater accuracy.
• ML algorithms can be improved continuously as more data is generated or made available for training.

WEAKNESSES
• Algorithms can be limited in their applicability; models may not perform well on data that is different from the
  training data’s distribution.
• Large amounts of data are needed to train the models and yield more accurate models, which may be difficult in
  data-scarce contexts.
• Some ML algorithms (e.g., ANN) work like a black box and can be hard to interpret; therefore, an ML algorithm usually
  requires thorough validation and test processes before it can be deployed in the real environment and assist
  decision-making.
• The technology still needs further development before it can be mainstreamed for use in road safety assessments.

OPPORTUNITIES
• May eliminate the need for manual coding of road safety data in the future, making the process less labor-intensive
  and time-consuming.
• Possible to train models on data from one location or for one purpose and use them for another.
• Provides a powerful method for complex crash risk modelling and other types of predictive analytics in road safety.
• As the technology develops, a platform powered by ML could be used across geographies for road assessments.
• As more and more data is generated and collected every day, this could potentially be analyzed with ML algorithms to
  discover new patterns and insights.

THREATS/CHALLENGES
• Requires specialist expertise, tools, and knowledge, which may limit its usefulness in some contexts, especially in
  developing countries.
• May require additional investment in computer power and analytical software.
• Complexity of ML algorithms can make them difficult to implement and analyze.
• Ethical considerations, such as bias in ML systems.
• As a data-driven approach, ML relies on high-quality data for training. Significant bias in the training data could
  lead to the failure of model training. Quality control of training data could be difficult, especially when
  annotating the data requires professional knowledge.

SOURCE: Original table for this publication.



Feasibility with project objectives and client requirements. Before deciding to use ML for any proj-
ect, it must be ascertained if ML is suitable for the project. Some ML algorithms, such as neural net-
works, are not interpretable. They work like a black box. Clients may not have confidence in using
them for significant decision-making unless their predictions can be sufficiently validated.
Preparing data to train ML algorithms. ML is a data-driven approach. Therefore, as with any da-
ta-related project, it is important to plan the data collection and preparation process. To facilitate this
process, make sure to have clearly defined the inputs and outputs of the model at the outset of the
project. Section 2.1 provides guidance on how to select data sources, especially where big data may
be involved. During the training stage, an ML team may often find that the available data is insufficient
to train a model with satisfactory performance. In this case, more data needs to be collected. In terms
of data preparation, teams should be aware of the need to aggregate, clean, and annotate data before
it can be used for ML modelling. Annotation of data is especially necessary for supervised learning
algorithms and entails manually identifying an object, drawing a box or polygon around it, and giving
it a label such as “pothole” or “crosswalk” (figure 9).




FIGURE 9:   Labeling a crosswalk in Padang, Indonesia using the Computer Vision Annotation Tool (CVAT)




                                                                                    SOURCE: World Bank Global Program for Resilient Housing.



Teams are advised to incorporate a quality control process to ensure data being used for any ML
model, especially test data, is of good quality and truly valid and representative of the population
or situation under study. For an ML-based project, steps include: (i) identifying data required for the
model; (ii) data collection, cleaning, annotation; (iii) trial and error training; (iv) validation; (v) deploy-
ment. It is advisable to estimate the duration of these tasks, their expected complexity and potential
challenges (which can vary by context and availability of resources such as expertise and processing
power) before deploying ML in any project. This helps determine if ML is feasible, how it compares to
traditional methods and how incorporating ML can impact project timelines. It is worth noting that
once deployed in the production environment, ML can significantly accelerate the whole process.
For example, DL-based image analysis can greatly reduce the time needed to collect data for
road risk estimation.
A challenge for most ML algorithms is generalization, or how well a model performs on
test data (also called unseen data). Models may not perform well on unseen data whose distribution
differs from that of the training data. For example, a model that is trained on images collected on
rural roads in an arid climate may not achieve the same level of performance on images of urban
roads in another country. The transferability of the model depends on how similar the features in
the images are. Therefore, before training ML algorithms, it is prudent to consider the diversity of
the training data, especially in terms of where, how and when it was collected. It is worth noting that
some researchers have found that artificial intelligence and ML algorithms can be easily and accu-
rately applied to different types of urban networks within the same city.40
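
One simple way to probe generalization is to hold out all data from one location during training and
evaluate on it afterwards. The sketch below does this with a toy one-feature classifier. The locations,
feature values, labels, and the midpoint-threshold "model" are all hypothetical and stand in for a real
dataset and a real ML model.

```python
def split_by_location(samples, held_out_location):
    """Train on every location except one; test on the held-out one."""
    train = [s for s in samples if s["location"] != held_out_location]
    test = [s for s in samples if s["location"] == held_out_location]
    return train, test


def fit_threshold(samples):
    """Toy model: threshold at the midpoint between the two class means."""
    high = [s["x"] for s in samples if s["label"] == "high"]
    low = [s["x"] for s in samples if s["label"] == "low"]
    mean_high = sum(high) / len(high)
    mean_low = sum(low) / len(low)
    threshold = (mean_high + mean_low) / 2
    high_above = mean_high > threshold
    return lambda x: "high" if (x > threshold) == high_above else "low"


def accuracy(predict, samples):
    """Share of samples whose predicted label matches the true label."""
    correct = sum(1 for s in samples if predict(s["x"]) == s["label"])
    return correct / len(samples)


# Hypothetical data: the same feature is shifted upward in city_b.
samples = [
    {"location": "city_a", "x": 1.0, "label": "low"},
    {"location": "city_a", "x": 2.0, "label": "low"},
    {"location": "city_a", "x": 8.0, "label": "high"},
    {"location": "city_a", "x": 9.0, "label": "high"},
    {"location": "city_b", "x": 6.0, "label": "low"},
    {"location": "city_b", "x": 7.0, "label": "low"},
    {"location": "city_b", "x": 12.0, "label": "high"},
    {"location": "city_b", "x": 13.0, "label": "high"},
]

train_set, test_set = split_by_location(samples, "city_b")
model = fit_threshold(train_set)
train_accuracy = accuracy(model, train_set)
held_out_accuracy = accuracy(model, test_set)
print(train_accuracy, held_out_accuracy)
```

In this toy case the model is perfect on the city it was trained on but no better than chance on the
held-out city, because the feature distribution has shifted. A gap of this kind on real data would
suggest that more diverse training data is needed before the model is applied elsewhere.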
To determine if using ML fits a budget or can even deliver a cost advantage, it is important to
understand associated costs. Costs of using ML can arise from the hiring of experts to develop and
program models, as well as from the data collection and preparation process (which includes cleaning
and annotation). The cost of storing data (on local hardware or on the cloud) should also be accounted
for, especially if the inputs involve big data. Depending on the model and quantity of data being input,
and especially if a DL model is employed, teams may also need to invest in additional computational
resources (graphics processing unit-equipped local computers or nodes on the cloud). Front-end and
back-end systems may also need to be established for automatic analysis services.

40 Apostolos Ziakopoulos and George Yannis, “Using AI for Spatial Predictions of Driver Behavior” (presentation, ITF
International Transport Forum Roundtable on Artificial Intelligence in Road Traffic Crash Prevention, 2021).
https://www.nrso.ntua.gr/geyannis/conf/cp450-using-ai-for-spatial-predictions-of-driver-behavior/

Deploying ML algorithms requires specialized expertise, often in the form of dedicated team
members who are ML experts. This need may be addressed by hiring experts and managing the process
internally or by acquiring resources externally. An in-house, “do-it-yourself” approach ensures more
control over every aspect of the process, which may be especially important where significant cus-
tomization or trial and error may be required. However, this approach requires labor and time, and
may be more costly in the long run. Using an external resource or tool, on the other hand, may be a
faster option but can come at the expense of some visibility and control over the development of the
model. It is important to consider these trade-offs to ensure the team is adequately resourced to use
ML effectively in the project.

2.3 Big Data, Machine Learning and the Future of Road Safety Assessments

Artificial intelligence presents many exciting possibilities for automation and analysis in trans-
port and infrastructure development. ML is increasingly used for road safety analysis. ML’s inher-
ent capability of managing uncertainties in data and models makes it extremely suitable for solving
road safety related issues. Uncertainty is a defining element of crash risk modelling and, in fact, a
source of complexity that has thus far limited the usefulness of traditional statistical models. More-
over, ML algorithms such as deep ANN can capture nonlinear patterns in data, making them the first
choice for processing road safety big data. Table 9 provides a summary of possible applications of big
data and ML in road safety analysis given the current state of the technologies.

TABLE 9:   Potential applications of big data and ML in road safety analysis

POTENTIAL APPLICATION: Estimating Road Infrastructure Risk
HOW BIG DATA CAN HELP: Video and photo images, APIs, satellite imagery and/or crowdsourced images
HOW ML CAN HELP:
• Process images to evaluate road attributes
• Identify road features that could cause crashes
• Identify risk factors contributing to crash occurrence
• Identify safety conditions in infrastructure

POTENTIAL APPLICATION: Traffic Flows
HOW BIG DATA CAN HELP: APIs, aerial imagery, open-source traffic data, road sensor data, wireless technology, street
cameras, GPS data, mobile devices, real-time traffic data
HOW ML CAN HELP:
• Process images to classify vehicles, identify congestion hotspots, vehicle detection, or speeds
• Assess traffic flows
• Develop risk maps
• Map the safety performance and Star Rating
• Traffic flows prediction

POTENTIAL APPLICATION: Crash Risk Assessment
HOW BIG DATA CAN HELP: Meteorology data, geo-located crash data, video and photo images, APIs, open-source traffic
data, road sensor data, historical crash data, crowdsourced crash data (e.g., Waze)
HOW ML CAN HELP:
• Create crash prediction models
• Develop risk maps
• Analyze different conflict scenarios and high-risk behavior

POTENTIAL APPLICATION: Incident Reporting/Crash Data
HOW BIG DATA CAN HELP: Video recording, crash data, photo images, crowdsourced data (Google Maps, Waze)
HOW ML CAN HELP:
• Identify hotspots through clustering techniques

POTENTIAL APPLICATION: Analyzing Crash Severity
HOW BIG DATA CAN HELP: Video and photo images, sensor data
HOW ML CAN HELP:
• Process images to evaluate road attributes
• Develop crash prediction models

SOURCE: Original table for this publication.




Combining big data and ML can provide an integrated framework for automatic road safety analy-
sis and management. This framework, demonstrated in figure 10, employs platforms (such as Mapil-
lary) to provide geo-tagged street level imagery for inputs to the DL model to infer useful information
(e.g., road characteristics). The DL-inferred data is then combined with multi-source big datasets
(e.g., region-specific historical crash data) for better analysis and management of road safety.

FIGURE 10:   Framework for automatic road safety analysis and management powered by ML

[Diagram. Geo-tagged street level images are passed to a deep learning model for image analysis, which produces
DL inferred information (lanes, shoulder, street lighting, pedestrians crossing, …). This is combined with (big) data
sources (APIs, in-house data, third-party data, …) and complementary information (road curvature, historical crash,
baseline fatalities, …) and used in methods/tools (iRAP Star Rating Score, RSSAT, RSA, RSIA, SSA, ML models, …).]

SOURCE: Original figure for this publication.



At present, much of the research and innovation in the use of ML for advanced road safety and risk
modelling is being driven by universities and other research institutions. As other stakeholders,
such as road safety practitioners, governments, developers of road safety tools and international
organizations such as the World Bank look to apply ML in their projects, there is an opportunity to
create dedicated tools that would harness big data and ML for road safety analysis. Such applications
have the potential to reduce the risk of human error and allow road safety assessments to be mostly,
if not fully, automated.
The following section presents practical examples of how big data and ML can assess urban road
safety. It applies an integrated framework introduced in section 2.3 to explore the opportunities
and limitations of new data sources and assess the ML models. To evaluate the robustness of the
proposed framework, the Integrated Framework for Road Risk Prediction was applied in two cities
that differ in size, region, and data availability: Bogotá, Colombia, a rapidly urbanizing
metropolis in Latin America, and Padang, Indonesia, a secondary city in East Asia. The study found
that ML applied to street view imagery identified relevant road (and road user) characteristics to gen-
erate a model that predicts road risk with 72.5 percent accuracy in Bogotá. This framework was ap-
plied in Padang to test its replicability; preliminary results are encouraging for its potential to predict
road safety for areas with limited crash data. The section concludes with a reflection and guidance
for replicability.




PART 3
Case Studies: Applying Big Data and Machine
Learning to Assess Road Safety

3.1 Objectives of the Case Studies

This section presents how the Integrated Framework for Road Risk Prediction can be applied in two
different cities of interest: Bogotá, Colombia and Padang, Indonesia. The study examines how useful
ML is in evaluating road safety and how easily the integrated framework can be replicated. All code
is freely available for other teams to use and develop further.41
The objectives of the case studies are to:
1.	 Learn how well big data and ML can be used to identify road features, estimate road safety, cate-
    gorize road segments based on their risk level, and identify high-risk segments.
2.	 Evaluate the utility of several big data sources that are freely available for road safety analysis in
    diverse geographic areas.42
3.	 Assess the replicability of the proposed approach.
Located on two different continents, the selected locations offer an opportunity to apply the frame-
work on paved, urban roads in contrasting environments, particularly related to data availability
and usability. For example, the government of Bogotá has made significant efforts to increase crash
data collection and dissemination. It offers a public online portal showing the location of each
crash over the past year. In addition, there was high coverage for data derived
from mobile phones, such as crowd-reported crashes. In contrast, information on the crash locations
for Padang could not be found online, and methods for data collection are largely manual or paper
based.43 In addition, mobile application data was scarce for crowdsourced crash reports. As a result,
Padang offers the opportunity to explore the utility of ML when data coverage is limited.




41 The code for the Integrated Framework for Road Risk Prediction is open source and accessible on GitHub:
https://github.com/datapartnership/IntegratedFrameworkForRoadSafety. However, some datasets require partnership with
DDP to access.
42 Freely available meaning at no cost; however, some data sources are not publicly available and require a license.
43 World Bank, Indonesia Public Expenditure Review 2020: Spending for Better Results (Washington, DC: World Bank, 2020).
https://openknowledge.worldbank.org/handle/10986/33954


     BOGOTÁ AND PADANG: BACKGROUND AND CONTEXT
     With a population of more than 7 million, the capital
     district of Bogotá is Colombia’s largest city. As a crit-
     ical economic hub with a growing population, Bogotá
     stands out as one of the most congested cities in the
     world.44 The government has prioritized road safety
     and achieved significant gains over the past few de-
     cades, reducing the city’s traffic fatality rate by more
     than 60 percent between 1996 and 2006 alone.45
     More recent interventions during the UN Decade of
     Action for Road Safety include establishing a National
     Road Safety Plan and a National Road Safety Agency (Agencia Nacional de Seguridad Vial) featuring
     a National Road Safety Observatory in collaboration with the World Bank.46 In addition, in
     2017, the city’s government launched “Vision Zero,” which aimed to implement a range of speed
     management strategies to eliminate pedestrian and driver fatalities. The program has delivered
     measurable results, such as a 27 percent reduction in fatalities across corridors where speed limits
     have been introduced, and further interventions are planned to sustain its impact.47 Despite these
     initiatives and road safety improvements in Bogotá, challenges remain, and new policies would
     benefit from timely and affordable analytics on road safety.
     Padang is the capital of the Indonesian province
     of West Sumatra with a population of around 1
     million. The government of Indonesia introduced
     various initiatives to address road safety during the
     UN Decade of Action for Road Safety. Established in
     2011, the National Road Safety Master Plan achieved
     a 10 percent reduction in annual road fatalities be-
     tween 2013 and 2016. However, data collection and
     management systems that rely on manual screen-
     ing significantly challenge the country’s progress in
     road performance and safety.48 Initiatives such as the establishment of the Integrated Road Asset
     Management System and the World Bank’s new Asia-Pacific Road Safety Observatory present a
     valuable opportunity for the country to improve its road safety data systems.49 For this case study
     in Padang, crash data was scarce from alternative sources. Therefore, it offers the opportunity to
     explore the utility of the pre-trained ML models in a new region with limited data coverage.


44 INRIX 2018 Global Traffic Scorecard. In 2018, drivers lost 272 hours in road congestion.
45 ODI (Overseas Development Institute), “Bogotá,” ODI: Think Change. Accessed October 12, 2021, from
https://odi.org/en/about/features/bogot%C3%A1/
46 World Bank, Colombia - Programmatic Productive and Sustainable Cities Development Policy Loans (Washington, DC: World
Bank, 2020).
http://documents.worldbank.org/curated/en/426591583968971309/Colombia-Programmatic-Productive-and-Sustainable-Cities-Development-Policy-Loans
47 Darío Hidalgo and Claudia Adriazola-Steil, “Bogotá’s Vision Zero Road Safety Plan Is Saving Lives,” TheCityFix, last modified
September 26, 2019,
https://thecityfix.com/blog/bogotas-vision-zero-road-safety-plan-saving-lives-dario-hidalgo-claudia-adriazola-steil/
48 World Bank, Indonesia Public Expenditure Review 2020: Spending for Better Results.
49 DT Global, “Indonesia: Establishment of Integrated Road Asset Management Systems,” accessed October 4, 2021,
https://dt-global.com/projects/irams-dc


3.2 Methodology

The ML-based framework implemented in these case studies was developed to provide a quick screen
to evaluate road safety. The framework ascertains road characteristics traditionally collected or an-
notated to provide a road safety prediction. ML models were developed specifically for this frame-
work during these case studies, one to extract road characteristics from street view images and one
to determine road risk based on the derived road characteristics. To do so, first, the models needed
to be trained to extract road characteristics and determine the road risk based on crash data. Then
the models could be applied to make predictions in new areas without crash data. The framework
therefore had two phases: first the training phase to train the models (figure 11), and then
the deployment phase to make new predictions with the models (figure 12). Each phase had
three steps, and both phases began with data collection and preparation. OpenStreetMap (OSM), Waze,
and Mapillary were used to develop this framework (additional examples of these datasets and relat-
ed analysis can be found in Annex 3).
The OSM road network provided the foundation for analysis. It is freely available and scalable. OSM uses
lines to represent roads and points to represent links among the roads. In OSM, the geometric road lines
are split into road segments (called ways) that are connected by the points (called nodes). No
modifications were made to the OSM geometry to maintain its synchronicity with other big datasets
referencing OSM ways and nodes.



The Waze crash data consists of coordinates representing the location of Waze application users at the
moment they see and report a crash.50 The Waze crash points were joined to the nearest OSM road
segment (within 20 meters). Because OSM road segments vary in length, the crash frequency (crashes per
meter) was calculated for each road segment to normalize the crash counts. Since a single crash can
generate multiple reports, this frequency indicates crash trends rather than exact crash counts. To
identify road segments with more frequent crashes per meter, the segments were split into high and low
risk based on their crash frequency.
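
The crash-to-segment join and the per-meter normalization can be sketched as follows. This is a minimal
illustration under several assumptions that are not stated in the text: coordinates are already in a
projected system measured in meters, each road segment is a single straight line between two points,
and the high/low cutoff is the median crash frequency. The segment identifiers and coordinates are
hypothetical.

```python
import math
from statistics import median


def point_segment_distance(p, a, b):
    """Distance in meters from point p to the straight segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def crash_frequency(segments, crashes, max_distance=20.0):
    """Assign each crash to its nearest segment, then return crashes per meter."""
    counts = {seg_id: 0 for seg_id in segments}
    for crash in crashes:
        nearest_id, nearest_d = None, float("inf")
        for seg_id, (a, b) in segments.items():
            d = point_segment_distance(crash, a, b)
            if d < nearest_d:
                nearest_id, nearest_d = seg_id, d
        # Crashes farther than max_distance from every segment are dropped.
        if nearest_id is not None and nearest_d <= max_distance:
            counts[nearest_id] += 1
    return {
        seg_id: counts[seg_id] / math.dist(a, b)
        for seg_id, (a, b) in segments.items()
    }


def label_risk(frequencies):
    """Split segments into high and low risk (median cutoff is an assumption)."""
    cutoff = median(frequencies.values())
    return {s: ("high" if f > cutoff else "low") for s, f in frequencies.items()}


# Hypothetical segments (end points in meters) and crash report locations.
segments = {
    "way_1": ((0, 0), (100, 0)),
    "way_2": ((0, 50), (50, 50)),
    "way_3": ((0, 200), (200, 200)),
}
crashes = [(10, 5), (60, -8), (20, 45), (300, 300), (90, 3)]

frequencies = crash_frequency(segments, crashes)
risk = label_risk(frequencies)
print(frequencies, risk)
```

With real OSM and Waze data, a spatial library and a spatial index would normally replace the nested
loop, and the cutoff between high and low risk would need to be chosen and documented for the study area.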
                                                   Mapillary was used to obtain street view images, which were primari-
                                                   ly collected by the World Bank’s Global Program for Resilient Housing.
                                                   Since many images are captured along a street, and many images can
                                                   be linked to a single road segment, the image closest to the centroid of
                                                   the road segment was selected. The radius for this selection was with-
                                                   in three meters of the centroid. This approach standardizes the image
                                                   selection and classification: one image represents the scene of one road
                                                   segment. For each OSM road segment, a street view image taken near
                                                   the centroid of the segment was downloaded using Mapillary API v4.
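
A minimal sketch of the image selection rule, under stated assumptions: coordinates are planar and in meters, the centroid is taken as the length-weighted centroid of the way (the source does not say how the centroid was computed), and the image ids and positions are hypothetical. No Mapillary API call is shown.

```python
import math

def polyline_centroid(coords):
    """Length-weighted centroid of a way given as [(x, y), ...] in meters."""
    total = cx = cy = 0.0
    for (ax, ay), (bx, by) in zip(coords, coords[1:]):
        seg_len = math.hypot(bx - ax, by - ay)
        cx += seg_len * (ax + bx) / 2.0
        cy += seg_len * (ay + by) / 2.0
        total += seg_len
    if total == 0.0:
        return coords[0]
    return (cx / total, cy / total)

def pick_image(coords, images, radius=3.0):
    """Return the id of the image nearest the centroid, or None if none lies within radius."""
    centroid = polyline_centroid(coords)
    best_id, best_d = None, radius
    for image_id, position in images.items():
        d = math.dist(position, centroid)
        if d <= best_d:
            best_id, best_d = image_id, d
    return best_id

# Hypothetical 40 m way with four candidate images along it.
way = [(0.0, 0.0), (40.0, 0.0)]
images = {
    "img_1": (5.0, 0.5),
    "img_2": (19.0, 1.0),
    "img_3": (21.5, 0.0),
    "img_4": (20.0, 2.5),
}
print(pick_image(way, images))                   # img_2
print(pick_image(way, {"img_far": (0.0, 0.0)}))  # None
```

Returning None when no image falls inside the radius leaves that segment without a representative image.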
  SOURCE: Original examples for this publication
  based on data from OSM, Waze, and Mapillary.
Copyright OpenStreetMap contributors, Microsoft,
   Esri Community Maps contributors. Basemap
    from Esri, HERE, Garmin, METI/NASA, USGS.



50
     Data provided by Waze App. Learn more at waze.com.


The Training Phase
The training phase had two core steps powered by ML: extracting information from street view images
and predicting risk level from the extracted data. Each of these steps had an ML model at its core that
needed to be trained on data. Together with an initial data preparation step, the training phase
therefore had three steps (figure 11).
Step 1. Select the region of interest and prepare data
A generalized polygon of the region of interest was used to collect data from OSM, Waze, and Mapil-
lary. The road network database was prepared, and the street view images closest to the centroid of
the road segment were downloaded as inputs for the models.

FIGURE 11: Training phase for road safety segment analysis using ML
[Flow diagram. (Big) data sources: OSM and Waze supply the road network, which forms the road network
(crash frequency) database; Mapillary supplies geo-tagged street level images. Image analysis by the
deep learning model Road Information Collector (RIC) yields DL inferred information (lanes, shoulder,
street lighting, pedestrians crossing, …). The neural network classifier Road Risk Evaluator (RRE)
outputs low risk or high risk.]
SOURCE: Original figure for this publication.




Step 2. Develop ML model for identifying road characteristics
The first custom ML model developed for this case study was the Road Information Collector (RIC),
shown in figure 11. It is a deep convolutional neural network, Mask R-CNN, which can classify and
count objects detected in images.51 The RIC model was trained with images from the updated Map-
illary Vistas Dataset (initially released in 2017), which provides detailed characteristics for types of
road markings and barriers, traffic lights and signs, and vulnerable road users such as pedestrians,
motorcyclists, and bicyclists.52 Other identifiable characteristics include flat terrain, which charac-
terizes road gradient, and the presence of potholes, which could indicate paved, urban road quality.
The RIC takes street view images as the input and can detect more than 100 classes of objects as the
output (for a complete list of the features the RIC model detects, refer to Annex 4). The model can



51
     Kaiming He et al., “Mask R-CNN,” 2017 IEEE International Conference on Computer Vision (2017): 2980-2988.
52
     G. Neuhold et al., “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes,” 2017 IEEE International
Conference on Computer Vision (ICCV) (2017): 5000-5009, doi: 10.1109/ICCV.2017.534


detect and classify some road features better than others (for the precision score in detecting and
classifying the objects, see Annex 5).
Step 3. Develop ML model for evaluating road risk
The second ML model developed was the Road Risk Evaluator (RRE). The RRE is a neural network
classifier with two hidden layers; each has 50 neurons. The RRE was trained using paired data for
each road segment, the road attributes from the RIC and the assigned road risk from the road net-
work database. Similar work was conducted by a team using a neural network to predict the crash
frequency of road segments.53
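
The architecture described above can be sketched in plain Python. This is an illustration only: the 106-element input width comes from the Bogotá case study later in this note, while the ReLU and sigmoid activations, the random initialization, and all function names are assumptions, since the source does not specify them. The weights here are untrained, so the predicted probability carries no meaning.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer):
    """Apply one fully connected layer to the input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def build_rre(n_inputs=106, hidden=(50, 50), seed=0):
    """Two hidden layers of 50 neurons each, followed by a single output neuron."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def predict_high_risk_probability(model, features):
    """Forward pass: hidden layers use ReLU, the output neuron uses a sigmoid."""
    x = list(features)
    for layer in model[:-1]:
        x = relu(dense(x, layer))
    return sigmoid(dense(x, model[-1])[0])

def count_parameters(model):
    return sum(len(w) * len(w[0]) + len(b) for w, b in model)

model = build_rre()
features = [0] * 106   # hypothetical count vector from the RIC
features[0] = 1
p = predict_high_risk_probability(model, features)
print(count_parameters(model))  # 7951
```

A network of this size has fewer than 8,000 trainable parameters.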
The Deployment Phase
Once the two ML models are trained, they can be added to an automated workflow in the deployment
phase. This means the trained ML models can now predict the risk level for any road segment with
the required input data – a street view image. Crash data is not required in the deployment phase.
FIGURE 12: Deployment phase to predict road safety
[Flow diagram. For the region of interest, OSM supplies the road network and Mapillary supplies street
level images for each road segment. Image analysis by the deep learning model Road Information
Collector (RIC) yields DL inferred information (lanes, shoulder, street lighting, pedestrians
crossing, …). The neural network classifier Road Risk Evaluator (RRE) outputs low risk or high risk.]
SOURCE: Original figure for this publication.



The deployment phase uses three steps to predict risk within an automated workflow (figure 12).
Step 1. Select the region of interest and download data
For the selected region of interest, the code will download the road network from OSM and calculate
the centroid of each road segment. The code will then download from Mapillary API a street view
image taken near the centroid of the road segment.




53
     Qiang Zeng et al., “Rule Extraction from an Optimized Neural Network for Traffic Crash Frequency Modeling,” Accident
Analysis & Prevention 97 (2016): 87-95.


Step 2. Identify road characteristics
For each road segment, the downloaded image will be fed into the RIC to extract road characteristics.
For each image, the RIC will output the numbers of detected objects for each class (refer to Annex 4
for classes). These numbers are put together to form a vector for each image.
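
A minimal sketch of how per-class detection counts can be assembled into a fixed-order vector. The three class names are the ones shown in figure 13, written here with double hyphens as an assumed normalization; the full class list is in Annex 4, and the function name is hypothetical.

```python
from collections import Counter

# Illustrative subset of classes; the real vector has one position per class in Annex 4.
CLASSES = [
    "construction--flat--road",
    "object--vehicle--car",
    "object--vehicle--motorcycle",
]

def to_count_vector(detected_labels, classes=CLASSES):
    """Count detections per class and return the counts in a fixed class order."""
    counts = Counter(detected_labels)
    return [counts[name] for name in classes]

figure_13_detections = [
    "construction--flat--road",
    "object--vehicle--car",
    "object--vehicle--motorcycle",
]
print(to_count_vector(figure_13_detections))                              # [1, 1, 1]
print(to_count_vector(["object--vehicle--car", "object--vehicle--car"]))  # [0, 2, 0]
```

Keeping the class order fixed matters because each position in the vector always has to refer to the same class when it is passed to the RRE.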
Step 3. Evaluate road risk
Each vector produced by the RIC will be fed into the RRE to calculate the risk level: high or low. To illus-
trate the automated workflow of the deployment phase, figure 13 shows the risk prediction for a road
segment. The RIC detected a flat road, car, and motorcycle; therefore, the RRE predicted the road seg-
ment as low risk. This framework requires no historical crash data to identify high- or low-risk roads.
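
The three deployment steps can be composed as in the sketch below. It is a hedged illustration: `find_image`, `ric`, and `rre` are placeholder callables standing in for the image lookup and the two trained models, and the toy rule used for `rre` is arbitrary. The 0/1 output encoding follows the RRE description in the Bogotá case study.

```python
def predict_segment_risk(segment_ids, find_image, ric, rre):
    """Run image lookup, RIC, and RRE for each segment and return {segment_id: label}."""
    results = {}
    for seg_id in segment_ids:
        image = find_image(seg_id)          # Step 1: image near the segment centroid
        if image is None:
            results[seg_id] = "no data"     # no usable image, so no prediction
            continue
        vector = ric(image)                 # Step 2: per-class object counts
        results[seg_id] = "high" if rre(vector) == 1 else "low"  # Step 3
    return results

# Placeholder inputs for illustration only; these are not the trained models.
images = {"s1": {"counts": [1, 1, 1]}, "s2": {"counts": [1, 4, 2]}}

def find_image(seg_id):
    return images.get(seg_id)

def ric(image):
    return image["counts"]

def rre(vector):
    return 1 if sum(vector) > 5 else 0      # arbitrary stand-in rule

results = predict_segment_risk(["s1", "s2", "s3"], find_image, ric, rre)
print(results)  # {'s1': 'low', 's2': 'high', 's3': 'no data'}
```

Passing the models in as callables keeps the workflow independent of how the RIC and RRE are implemented, and no crash data appears anywhere in the loop.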

FIGURE 13: RIC and RRE applied to predict road segment risk
[Street view image. RIC detections: construction--flat--road x1, object--vehicle--car x1,
object--vehicle--motorcycle x1. RRE risk level: Low.]
SOURCE: Original figure for this publication, based on data from Mapillary and annotated with classifications from the model.



The two case studies presented illustrate the training and deployment phases.
The training phase was conducted in Bogotá, Colombia, where data was collected to train the RRE,
while the RIC was trained on the Mapillary Vistas Dataset. The models were then applied in the
deployment phase to predict the risk level for each road segment in Bogotá.
The second case study was in Padang, Indonesia. The RIC and RRE models trained in the previous
case study were applied directly (i.e., without re-training) in a deployment phase to predict road risk
in Padang. This demonstrates that, ideally, there is no need to re-run the training phase for future
applications since the RIC and RRE are already trained.




3.3 Case Study 1: Bogotá, Colombia

The Training Phase

Step 1. Select the region of interest and prepare data
In Bogotá, a road network database was created to prepare training data for the ML models. First, a
generalized polygon of the region was used to retrieve roads from OSM and six months of crash re-
ports from Waze (July–December 2020). The crashes were joined to the nearest OSM road segment
within 20 meters. The crash frequency, in crashes per meter, was calculated, and road segments were
divided into high risk (crash frequency >0.5) and low risk (crash frequency <=0.5) in the road network
database. A crash frequency of 1 therefore represents one reported crash per meter of road segment over
the six months of Waze data collected. Street view imagery was downloaded using the Mapillary API to collect images
close to the centroid of each road segment. Table 10 provides an overview of the data sources for this
case study.
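
As a worked illustration of the labeling rule, the sketch below applies the 0.5 crashes-per-meter threshold from this step. The report counts and segment lengths are hypothetical, as is the function name.

```python
HIGH_RISK_THRESHOLD = 0.5  # crashes per meter, as used in this case study

def risk_label(crash_reports, length_m, threshold=HIGH_RISK_THRESHOLD):
    """Return (crash frequency, label) for one road segment."""
    if length_m <= 0:
        raise ValueError("segment length must be positive")
    frequency = crash_reports / length_m
    label = "high" if frequency > threshold else "low"
    return frequency, label

print(risk_label(90, 120.0))  # (0.75, 'high')
print(risk_label(60, 120.0))  # (0.5, 'low')
print(risk_label(0, 80.0))    # (0.0, 'low')
```

A segment exactly at the threshold is labeled low risk, matching the "<=0.5" condition.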
TABLE 10:   Data used for case study in Bogotá, Colombia
DATA SOURCES           ATTRIBUTES                                             REMARKS
ROAD NETWORK
OSM                    Road network (road segment length)                     Provided through an open license.
CRASHES
Waze                   Road alerts (crashes reported by users, coordinates)   Obtained through DDP.
ROAD CHARACTERISTICS
Mapillary              Street view image detections (crosswalk, curb,         Selection of image annotation tags used
(images and tags)      guard rail, human, marking, pothole, sidewalk, sign,   in study; more available through Mapillary
                       streetlight, traffic sign, utility pole)               Traffic Sign and Vistas. Multiple detections
                                                                              per image are possible.
                                                                                             SOURCE: Original table for this publication.



Step 2. Develop ML model for identifying road characteristics
The RIC was developed and trained to perform instance segmentation. It is a deep convolutional
neural network that identified the classes, or objects in the image, and provided the count of these
classifications. The model was trained using the Mapillary Vistas Dataset using a total of 124 classes
(Annex 4).54 The resulting output is a count of the detected objects in each class (the bounding boxes
are shown in figure 14), represented as a vector of integers.

                                     Training data: Mapillary Vistas Dataset (124 classes)
                                     Input: Street view image near the centroid of a road segment
                                     Output: A vector of integers (each element represents the
                                     count of detected objects that belong to a class)


Figure 14 depicts the RIC in action on an image from Bogotá. The bounding boxes surrounding each
object in the image indicate classes the model identified. Confidence levels are provided next to the
name of the object segmented by the bounding box. The closer the confidence level is to 1, the higher
the confidence in the prediction. Looking at the center of the image, the bicyclist was identified with
0.5 confidence, and other vulnerable road users were recognized, such as a motorcyclist (0.84) and
pedestrian (0.75). Vehicles were segmented with high confidence for the bus (0.7), motorcycle (0.88),


54
     G. Neuhold et al., “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes.”


and car (0.99). The RIC segmented traffic signs, support and utility poles, flat road, and road mark-
ings as well.

FIGURE 14:   Image segmentation in Bogotá




                                                      SOURCE: Original figure for this publication, based on data from Mapillary.



The sample image shows favorable results for image segmentation. The performance of the RIC mod-
el in terms of the average precision of the bounding box detection and classification for each class
is provided in Annex 5. In the next step, road attribute data extracted through the RIC were inputs
for the prediction model to link the road characteristics with the likelihood of a crash in the road
networks examined.
Step 3. Develop the ML model RRE for evaluating road risk
To develop the RRE, six study areas in Bogotá, Colombia were selected to reduce computational load.
These study areas were drawn to include a wide variety of neighborhoods (poor, rich) and placed
throughout the city. They also contain high and low crash frequency road segments and comprehen-
sive street view image coverage. Figure 15 shows the six study areas along with the crash risk from
the road network database, high risk (crash frequency >0.5) and low risk (crash frequency <=0.5).




The low- and high-risk road segments in these areas were the training data for the model. Based on
the segment risk derived from the road network database and the characteristics for each road segment
derived from the RIC, the model was trained to evaluate a road segment as high or low risk.

    Training data: The following input-output pairs obtained from
    road segments in six study areas in Bogotá, Colombia.
    Input: A vector of integers, which is the output of RIC*
    Output: 0 (low risk) or 1 (high risk)

* Only 106 out of 124 classes are used as the input to RRE. A total of 18 classes irrelevant to road
characteristics, such as sky, bird, etc., were removed from the vector before entering into the RRE.

In searching for an optimal architecture of the neural network, the number of layers and neurons were
tested for the best performance. Testing showed that more layers or neurons do not significantly
improve the performance on this dataset. The RRE was used to evaluate whether a road segment was low
or high risk based on a street view image.

FIGURE 15: Six study areas and crash frequency in Bogotá
[Map legend. Crash per meter: 0.5 - 3.2; 0.0 - 0.5]
SOURCE: Original figure for this publication, based on data from OSM and data provided by the Waze App.
Learn more at waze.com.


Overall performance of the ML

The RRE correctly classified 70 percent of the true low-risk road segments and 75 percent of the true
high-risk road segments (figure 16). The mean accuracy and F1-score were both 72.5 percent. The closer
the accuracy and F1-score are to 100 percent, the better the performance of the model. For a binary
classification, random guessing is correct about 50 percent of the time, which makes these results
promising. These results suggest the model would perform well in contexts similar to Bogotá. If needed,
there would be potential to fine-tune the model for increased accuracy and precision in other areas.

FIGURE 16: Confusion matrix showing the accuracy of the RRE model

                        Prediction: Low     Prediction: High
True value: Low         0.7                 0.3
True value: High        0.25                0.75

SOURCE: Original figure for this publication.




 TIPS FOR INTERPRETING ML PERFORMANCE
 The performance of an ML model can be evaluated using accuracy, precision, recall, and the F1-score. These are derived by counting
 the correct predictions (true positives and true negatives) and incorrect predictions (false positives and false negatives).

           accuracy = correct predictions / all predictions

           precision = true positives / (true positives + false positives)

           recall = true positives / (true positives + false negatives)

           F1-score = 2*((precision * recall) / (precision + recall))

 A confusion matrix shows how well the model performed in predicting road risk through a comparative chart of the true positives,
 true negatives, false positives, and false negatives.
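
 The formulas above can be checked against the rates in figure 16 with a short sketch. It assumes, purely for illustration, 100 road segments in each true class (the source reports rates, not counts), and it averages the two per-class F1-scores; the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Figure 16 rates applied to an assumed 100 low-risk and 100 high-risk segments.
high = binary_metrics(tp=75, fp=30, fn=25, tn=70)  # high risk treated as the positive class
low = binary_metrics(tp=70, fp=25, fn=30, tn=75)   # low risk treated as the positive class
macro_f1 = (high["f1"] + low["f1"]) / 2

print(round(high["accuracy"], 3))  # 0.725
print(round(macro_f1, 3))          # 0.725
```

 Under this equal-class-size assumption, both the accuracy and the averaged F1-score round to the 72.5 percent reported for the RRE.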



Bogotá Results

Following the three-step workflow of the deployment phase described in section 3.2, road risk was
predicted for the entire road network in Bogotá. In total, 98,488 images were processed to make the
predictions shown in figure 17. Road segments without an image within 3 meters were not predicted.
Overall, high crash frequency from Waze and high-risk predictions exhibited similarity along some
segments, particularly on arterial roads; however, the model tended to moderately overpredict high risk.

FIGURE 17: Road risk prediction in Bogotá
[Map legend. Risk level: High; Low; No data]
SOURCE: Original figure for this publication, based on data from Mapillary, OSM and Waze.




3.4 Case Study 2: Padang, Indonesia

The Deployment Phase

The model that was built in Bogotá was applied in Padang. Similar to Bogotá, the road network was
accessed through OSM, and street view images were downloaded using the Mapillary API. Waze
crash data was joined to the OSM road network to compare with risk predictions. Padang had limited
geospatial crash data to validate the model. Table 11 provides a description of the datasets.

TABLE 11:   Data used for case study in Padang, Indonesia
DATA SOURCES         ATTRIBUTES                                             REMARKS
ROAD NETWORK
OSM                  Road network (road segment length)                     Provided through an open license.
CRASHES
Waze                 Road alerts (crashes reported by users, coordinates)   Obtained through DDP.
ROAD CHARACTERISTICS
Mapillary            Street view image detections (crosswalk, curb,         Selection of image annotation tags in study;
(images and tags)    guard rail, human, marking, pothole, sidewalk, sign,   more available through Mapillary Traffic Sign
                     streetlight, traffic sign, utility pole)               and Vistas. Multiple detections per image are
                                                                            possible.
                                                                                          SOURCE: Original table for this publication.



Padang Results

In Padang, preliminary results pointed to the framework’s potential in scanning roads for safety.
Figure 18 shows predictions where arterial road segments were predominantly designated as high
risk (red lines), while residential and tertiary road segments were a mix of low and high risk.

FIGURE 18: Road risk prediction in Padang
[Map legend. Risk level: High; Low]
SOURCE: Original figure for this publication, based on data from OSM and data provided by the Waze App. Learn more at waze.com.
Drone imagery provided by the World Bank Global Program for Resilient Housing.



In general, where there were crashes reported by Waze, high-risk road segments were predicted.
These preliminary results were encouraging; however, verifying the results was difficult because
there was not sufficient data. While the deployment of the framework in Padang requires further
validation with more data, ML-based approaches such as this show promise for initial road safety
scans.

3.5 Findings

The Integrated Framework for Road Risk Prediction demonstrates the strength of ML to identify road
segment safety with substantial accuracy (72.5 percent) in Bogotá. Preliminary results in Padang
support replicating the framework with further validation in other areas. Using advanced ML tech-
niques, the framework applied a streamlined approach that relied on road characteristics and crash
frequency to determine crash risk in the training phase. Then the ML models applied in the deploy-
ment phase could predict road risk based on road characteristics without historical crash data.
The alternative data sources used to train the models were robust – thousands of annotations,
high-resolution images, and crash data joined to extensive road networks – and of suitable quality for
the models to provide a road safety scan. To identify road characteristics, the RIC was trained using
the Mapillary Vistas Dataset, which has a breadth and depth of annotations from different contexts,
providing geographic diversity. The RRE was trained using a pairing of the road characteristics and a
road network database created from OSM road segments and Waze crash data. OSM road segments


offered global scalability and were sufficient for a coarse assessment in these case studies. Waze data
availability was dependent on the area (and the users of the app). Given the potential for duplicate
crash reports, Waze data was not relied on for accurate crash data in Bogotá; instead, it was used to
identify crash patterns of high- and low-risk road segments.
The framework is not suitable for detailed road assessments. However, it can be applied to screen
roads for safety without historical crash data if the RIC model is enhanced with more training data
and calibrated for the local street view context; the RRE model can be modified and enhanced with
fine-grained training data. It is replicable in other areas with the following recommendations, which
are applicable for developing other ML-based frameworks for road safety.
Incorporate training data to fine-tune the model for a specific location. Typically, ML models trained
on data collected from one region do not work well for a new region. This is called domain shift: the
testing data has a different distribution than the training data. In this case, including data collected
from the new region in the training phase will usually help. It is important to evaluate the data and
consider any influences the collection method may have on the potential to introduce bias into the
project. For example, if local crash data is introduced to train the RRE, it would help validate and
potentially improve the model’s application in the location of interest. Both RIC and RRE can be con-
tinually trained with newly obtained data so that the knowledge learned from previous data can be
carried on for new regions while the model is still applicable to the previous regions.
It is essential to ensure that models are based on sufficient, high-quality training data. In general,
at least a few thousand annotations are recommended to identify objects from images with simple
context, depending on the characteristics of the object. Whether the street view images are obtained
through big data platforms such as Mapillary or collected by the team, street view imagery covering
different geographical regions makes the trained object detection model, like the RIC, more robust.
Since street level images capture the visual scene (road characteristics and road users) at a single
point in time, it is important to consider these implications when using a snapshot of that time of day,
day of week, and season. Relatedly, a road characteristic may be covered or occluded in a street view
image; for instance, when a passing truck blocks a sign. Imagery collected at short intervals,
such as every two meters, permits greater flexibility to analyze the road scene and predict risk using
the RIC and RRE. OSM road networks require review for recency and accuracy, and possibly editing
to ensure suitable quality and coverage in other areas. If high-quality, granular crash data shows a
clear pattern that supports more risk classes, three classes could be predicted: for example, high,
medium, and low risk.




Conclusion


Big data and ML offer promising opportunities to improve current road safety assessment proce-
dures for sustainable development. Road safety assessments are often required for new transport
and infrastructure developments to be approved or as part of their monitoring and evaluation once
they are completed. However, conducting road safety assessment procedures can be expensive and
time-consuming. Alternative data sources and ML can optimize this process by identifying patterns
using complex predictive models. The Integrated Framework for Road Safety offers one approach
using street view imagery that can be accessed through Mapillary or collected by the team to provide
a road safety scan. With further training, this framework has the potential to provide detailed road
safety assessments, mitigating the need for manual annotations (or years of historical crash data). In
addition to the pilots and studies conducted by the researchers and representatives of road safety or-
ganizations interviewed for this note, there are many ML models contributing to road safety efforts,
which typically outperform statistical models in predicting road safety.55
Integrate alternative data sources and ML into road safety assessments with care. Finding valid,
representative data can be a significant challenge in evaluating risks and reducing crash fatalities and
injuries through data-driven, evidence-based interventions. Teams can directly partner with private
companies and data providers to retrieve alternative sources of data. And data sharing platforms, such
as DDP, offer streamlined solutions. However, commercial data sources are not typically established
to collect data for road safety analysis, and their data may be inadequate for road safety assessment
methods and procedures. Data can be biased, incomplete, and challenging to synchronize with con-
ventional analytical tools. The implications of collecting and analyzing big data using ML require thor-
ough consideration. Data privacy and security are central concerns; data needs to be de-identified and
anonymized and stored according to institutional guidelines.56 Data and models need to be screened
for biases that can affect their outcomes. For example, imbalanced access to smartphones or social
media may amplify gender or community bias.57 Teams can adhere to best practices and data policies
and make their ML models and results transparent and openly shared. Resources such as “A Frame-
work for Understanding Sources of Harm throughout the Machine Learning Life Cycle” and “The
Ethics of Artificial Intelligence” may be helpful for teams implementing ML in their projects.58
The approach used for the case studies in this note can be extended to evaluate specific measures
of road safety. For example, while the framework uses the crash frequency and may identify the
number of relevant road users in a street view image, it does not thoroughly consider the number of
(vulnerable) road users nor does it consider the probability of a crash causing fatalities or serious
injuries. The approach could also be extended using complementary data such as road geometry, traffic
flow, traffic volume, traffic speed, weather, season, and other factors affecting visibility along the
road or road surface conditions. The case studies illustrate the potential of big data and ML to reduce
the manual inspection of roadways and provide road safety insight where otherwise the information
is in short supply, thereby contributing to safer roads.


55
   Philippe Silva, Michelle Andrade, and Sara Ferreira, “Machine Learning Applied to Road Safety Modeling: A Systematic
Literature Review,” Journal of Traffic and Transportation Engineering 7, no. 6 (2020): 775-790,
https://doi.org/10.1016/j.jtte.2020.07.004
56
   World Bank, World Development Report 2021: Data for Better Lives (Washington, DC: World Bank, 2021).
doi:10.1596/978-1-4648-1600-0
57
   World Bank, Use of AI Technology to Support Data Collection for Project Preparation and Implementation: A ‘Learning-by-doing’
Process (Washington, DC: World Bank, 2021).
58
   Harini Suresh and John Guttag, “A Framework for Understanding Sources of Harm throughout the Machine Learning Life
Cycle” in Proceedings of Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘21),
https://doi.org/10.1145/3465416.3483305; Nick Bostrom and Eliezer Yudkowsky, “The Ethics of Artificial Intelligence,” in The
Cambridge Handbook of Artificial Intelligence, ed. Keith Frankish and William M. Ramsey (Cambridge: Cambridge University
Press, 2014): 316-334.

For big data to be fully leveraged for road safety analysis, governments, road safety advocates, and
international development organizations will want to consider investing in platforms and tools
that specialize in collecting and analyzing data for road safety. Ongoing efforts to establish regional
road safety data observatories provide an opportunity to gather data providers and create a data
marketplace specifically for road safety analysis, especially where alternative or traditional sources
are scarce. Government regulations and initiatives to encourage private companies to share data could
further integrate big data in international development projects, including road safety. It is essential
for key stakeholders in road safety assessment to collaborate closely with pioneers of these
technologies to realize their potential in road safety analysis.59 The Artificial Intelligence in Road
Traffic Crash Prevention Roundtable, hosted by the International Transport Forum (ITF) in early 2021,
is an example of one such opportunity. Conversations with World Bank team leaders and transport
specialists reveal high demand for a single, easy-to-use tool to access and use big data for road
safety analysis. There is potential to automate some of the processing and analysis for which
specialist expertise is currently required, and initiatives such as Ai-RAP and the World Bank
Simplified Methodology suggest that practical, scalable solutions could be a reality soon.60 As big
data and ML become more accessible, and as their adoption accelerates worldwide, road safety
practitioners, governments, road safety advocates, and international organizations can unlock their
immense potential to improve the quality and efficiency of road safety assessments.




59
   Subasish Das and Greg P. Griffin, “Investigating the Role of Big Data in Transportation Safety,” Transportation Research
Record 2674, no. 6 (2020): 244–52, https://doi.org/10.1177/0361198120918565
60
   Monica Olyslagers (Safe Cities and Innovation Specialist, iRAP) and Satoshi Ogita (Senior Transport Specialist, World Bank),
in discussion with the authors, April 2021.


ANNEX 1:
Most Relevant Big Data Types for Road Safety Analysis

DATA COLLECTION: Street view imagery
POTENTIAL SOURCES: Apple Look Around; Google Street View; KartaView; Mapillary; collected by team
POTENTIAL APPLICATIONS: Identify road attributes for road safety assessments.
ADVANTAGES: Provides objective evidence of conditions in the field. Can be used in regions where government data is not available.
LIMITATIONS: Coverage is incomplete, particularly in rural and low-income areas. Licensing restrictions for ML application.

DATA COLLECTION: Mobile applications and telematics
POTENTIAL SOURCES: Mobile application data; telematic companies; rideshare companies
POTENTIAL APPLICATIONS: Identify vehicle movement, traffic flows and road use by various types of users for crash risk identification and road safety assessments.
ADVANTAGES: App data is usually low cost and current. Telematic data could show risky driving behavior.
LIMITATIONS: Coverage is lighter in rural areas or cities where use of app is low. Often requires data sharing agreements with private companies.

DATA COLLECTION: Crowdsourced
POTENTIAL SOURCES: Waze; delivery drivers; OSM; social media
POTENTIAL APPLICATIONS: Obtain crash data and information related to road use, such as types of road users and their relative density at a specific location. Can help to identify road risks.
ADVANTAGES: Can supplement government data, particularly if incidents are underreported or government provided road networks are unavailable.
LIMITATIONS: Requires app use in the region of interest. Needs coordination and resources to collect reports from delivery drivers. Data quality may be low. Social desirability bias can occur, where users feel inclined to share specific types of information to reinforce a positive or negative perspective.

DATA COLLECTION: Government
POTENTIAL SOURCES: Government transport agencies; road safety observatories
POTENTIAL APPLICATIONS: Most frequently used to obtain crash data, including statistics related to crash severity, crash frequency as well as fatalities and injuries statistics.
ADVANTAGES: Data often has many attributes or details that have been manually added. Data often has been collected for many years in the same manner, allowing for temporal analysis.
LIMITATIONS: Data can be messy (human error). Data often not shared.

DATA COLLECTION: Aerial and satellite imagery
POTENTIAL SOURCES: Earth observation agencies; private companies
POTENTIAL APPLICATIONS: Identify road attributes for road safety assessments.
ADVANTAGES: Covers large geographic area.
LIMITATIONS: Requires balancing the cost with recency and granularity of imagery.

DATA COLLECTION: Meteorological sensors
POTENTIAL SOURCES: Meteorological agencies; local universities; private companies
POTENTIAL APPLICATIONS: Review weather conditions that may affect road safety, such as crashes.
ADVANTAGES: Infer driving conditions (i.e., if road surface conditions are not available in government crash data).
LIMITATIONS: There are varying levels of granularity.

SOURCE: Original table for this publication.




ANNEX 2:
Overview of Big Data Sources

STREET VIEW IMAGERY
ATTRIBUTES (shared by the sources in this group): Requires processing to derive physical features related to road safety, such as: crosswalks, speedbumps, painted lines, roads, road shoulders, sidewalks, streetlights, traffic signs and others specific to region of interest.

DATA: Apple Look Around
ACCESS: Early stages; contact company
RESOLUTION AND FORMAT: Image
COST: N/A
COMMENTS: Offers extremely limited geographic coverage.

DATA: Google Street View
ACCESS: Not accessible according to license
RESOLUTION AND FORMAT: 360 photos must be at least 4K (image)
COST: N/A
COMMENTS: Global coverage is fairly extensive.

DATA: KartaView
ACCESS: Open license
RESOLUTION AND FORMAT: Depends on camera (image)
COST: Free
COMMENTS: Images are free, though image processing is required (see street view training data); global coverage is variable.

DATA: Mapillary
ACCESS: Publicly available
RESOLUTION AND FORMAT: Depends on camera (image)
COST: Free
COMMENTS: Images are free, though image processing is required (see street view training data); global coverage is variable.

DATA: Collected by team
ACCESS: Requires permission and coordination with local government
RESOLUTION AND FORMAT: Depends on camera (image or video)
COST: High
COMMENTS: Collection every two meters recommended for images. Images or video require processing; see street view training data.

STREET VIEW TRAINING DATA

DATA: Mapillary Traffic Sign
ACCESS: Attribution-NonCommercial-ShareAlike 4.0 International License
ATTRIBUTES: Traffic signs
RESOLUTION AND FORMAT: Resolution can be very high or very low. The model performs best on images with the same resolution level of the training dataset. (image)
COST: Free
COMMENTS: More than 300 traffic sign classes covering six continents.

DATA: Mapillary Vistas
ACCESS: Attribution-NonCommercial-ShareAlike 4.0 International License
ATTRIBUTES: Physical features related to road: crosswalks, speedbumps, painted lines, roads, road shoulders, sidewalks, streetlights, traffic signs (others possible)
RESOLUTION AND FORMAT: Resolution can be very high or very low. The model performs best on images with the same resolution level of the training dataset. (image)
COST: Free
COMMENTS: Coverage spans six continents.

DATA: Annotation by team
ACCESS: Hire a team
ATTRIBUTES: Physical features related to road, specific to region of interest: crosswalks, speedbumps, painted lines, roads, road shoulders, sidewalks, streetlights, traffic signs (others possible)
COST: High
COMMENTS: Consider collaborating with stakeholders in a region of interest to label images using a Computer Vision Annotation Tool (CVAT) or a labeling team with training. 2,000 labels per class is recommended for a simple classification.

DATA: World Bank’s GRSF Road Risk Assessment software±
ACCESS: Open source
ATTRIBUTES: Physical features related to road: road grade and curvature, pedestrian crossings, delineation, roadside severity, lane width, and number of lanes
COST: Free
COMMENTS: Video analysis produces a richer dataset. Piloted in Liberia and Mozambique.

± The software is included in this section as video training data is limited in World Bank countries. Contact Satoshi Ogita (World Bank),
for access.




MOBILE APPLICATIONS AND TELEMATICS

DATA: Grab
ACCESS: Contact company
ATTRIBUTES: Contact company
RESOLUTION AND FORMAT: N/A
COST: N/A
COMMENTS: Coverage offered in Cambodia, Indonesia, Malaysia, Myanmar, Philippines, Singapore, Thailand, Vietnam.

DATA: HERE
ACCESS: Not accessible according to standard license
ATTRIBUTES: Traffic: current and historical speeds, jams, crashes, road closures and road construction
RESOLUTION AND FORMAT: Every minute (text, number)
COST: N/A
COMMENTS: Detailed road network coverage in more than 200 countries and comprehensive traffic speeds in more than 80 countries.

DATA: Mapbox Movement
ACCESS: Contact company
ATTRIBUTES: Movement: activity index; driving activity index available in select locations
RESOLUTION AND FORMAT: Aggregated daily or monthly at 100 meter resolution (text, number)
COST: N/A

DATA: Mapbox Traffic
ACCESS: Contact company
ATTRIBUTES: Traffic (typical speed): each road segment, identified by a start and end node, has 2,016 typical speed predictions (7 days × 24 hours × 12 five-minute periods)
RESOLUTION AND FORMAT: Typical speed per road segment in five-minute increments over a week (text, number)
COST: N/A
COMMENTS: Available through Enterprise plan; licensed annually for specific geographic region.

DATA: Moovit
ACCESS: Contact company
ATTRIBUTES: Urban transit (public and on-demand)
RESOLUTION AND FORMAT: Contact company
COST: N/A

DATA: Ola Cabs
ACCESS: Contact company
ATTRIBUTES: Travel time and potholes
RESOLUTION AND FORMAT: Contact company
COST: N/A
COMMENTS: Coverage provided in India.

DATA: Orbital Insight
ACCESS: Contact company
ATTRIBUTES: Foot traffic: time of day, day of week, velocity (stationary, walking), dwell time
RESOLUTION AND FORMAT: Each minute; 2019 to present (text, number)
COST: N/A
COMMENTS: Foot traffic using mobile location data in region of interest, subject to data availability per country.

DATA: TomTom
ACCESS: Contact company
ATTRIBUTES: Traffic: current and historical speeds, jams, crashes, road closures and road construction
RESOLUTION AND FORMAT: Every minute per road segment (text, number)
COST: Free to Medium
COMMENTS: Global coverage is variable.

DATA: Uber Movement
ACCESS: Contact company
ATTRIBUTES: Traffic: travel times between zones, average speed per segment and traffic density
RESOLUTION AND FORMAT: Average travel time, average speeds per hour, time of day or quarter of year (text, number)
COST: Free
COMMENTS: Limited geographic coverage to a selection of major cities. Currently no API.

DATA: Unacast
ACCESS: Contact company
ATTRIBUTES: Human movement
RESOLUTION AND FORMAT: Coordinates, horizontal accuracy, timestamp, time zone (text, number)
COST: N/A

DATA: Veraset
ACCESS: Contact company
ATTRIBUTES: Human movement
RESOLUTION AND FORMAT: Coordinates, horizontal accuracy, timestamp (text, number)
COST: N/A
COMMENTS: Veraset Movement covers 150 countries.




DATA: Waze
ACCESS: Contact company to become a partner
ATTRIBUTES: Traffic (alerts, jams, irregularities): major and minor crashes; severity of congestion or irregularities; current and typical speed on jammed segments; coordinates, road segment (start and end node), street name; road type; driving direction (NSEW); turn type; alerts (construction, road closure and weather)
RESOLUTION AND FORMAT: Every minute; location provided as coordinates, road segment, street name (text, number)
COST: Free for partners
COMMENTS: Includes weather alerts and major and minor crashes by application users; see Waze under Crowdsourced section.

DATA: WhereIsMyTransport
ACCESS: Contact company
ATTRIBUTES: Informal transit network
RESOLUTION AND FORMAT: Determined in collaboration with team
COST: Medium to High
COMMENTS: Specializes in producing informal transit data according to General Transit Feed Specifications (GTFS). Supports team in collecting and processing data in exchange for the team covering in-field costs of data collection and facilitating engagement with local transport authorities.

CROWDSOURCED

DATA: OSM
ACCESS: Open license
ATTRIBUTES: Road segments (road type, length) and road features
RESOLUTION AND FORMAT: Centerline of road segments, referred to as ways and relations (text, number)
COST: Free
COMMENTS: May include additional road attributes: lanes, name, smoothness, surface, speed limit, and width, and other information such as overtaking permitted or lighting.

DATA: Twitter
ACCESS: API
ATTRIBUTES: Road incidents tweeted
RESOLUTION AND FORMAT: User-dependent; can be associated with a place or location (text, number)
COST: Free to Medium
COMMENTS: Price dependent on account type and data volume.

DATA: Waze
ACCESS: Contact company to become a partner
ATTRIBUTES: Road incidents reported using app
RESOLUTION AND FORMAT: Every minute; location provided as coordinates, road segment, street name (text, number)
COST: Free for partners

DATA: Delivery drivers
ACCESS: Coordinated by team
ATTRIBUTES: Road incidents reported using app
RESOLUTION AND FORMAT: Depends on collection (text, number)
COST: High

GOVERNMENT

DATA: Government or road safety observatory
ACCESS: Government contact or open data platform
ATTRIBUTES: Incidents (date, time, severity, type)
RESOLUTION AND FORMAT: XY coordinate per incident (text, number)
COST: Free to Low
COMMENTS: Processing requires standard GIS software such as ArcGIS (paid) or QGIS (free). Storage is small, typically <1GB per urban area over multiple years.
ATTRIBUTES: Road segments (type, width, speed limit)
RESOLUTION AND FORMAT: Road segments (text, number)
COST: Low
ATTRIBUTES: Traffic lights (intersection type)
RESOLUTION AND FORMAT: XY coordinate per traffic light (text, number)
COST: Low
COMMENTS: May include intersection type (pedestrian, bicyclist, for example)




REMOTE SENSING

DATA: Maxar Technologies
ACCESS: Contact company
ATTRIBUTES: Elevation and roads
RESOLUTION AND FORMAT: Less than 1m (image)
COST: High
COMMENTS: Requires processing to derive road networks.

DATA: Orbital Insight
ACCESS: Contact company
ATTRIBUTES: Car and truck count; roads
RESOLUTION AND FORMAT: Car and truck count: high resolution, 2013 to present; roads: medium resolution, 2016 to present (image, number)
COST: N/A
COMMENTS: Car and truck count derived from satellite imagery. Limited Geospatial Intelligence Platform credits to derive roads in region of interest; not for routable road networks; not suitable for narrow roads in urban areas or dirt or mountainous roads in rural areas.

DATA: Security or traffic cameras
ACCESS: Collected by team or through external resource
ATTRIBUTES: Traffic density and volume
RESOLUTION AND FORMAT: Depends on camera (image or video)
COST: Medium to High

DATA: Unmanned aerial vehicle (UAV)
ACCESS: Collected by team
ATTRIBUTES: Elevation, roads, traffic density and volume
RESOLUTION AND FORMAT: Depends on camera (image or video)
COST: Medium to High
COMMENTS: Recent research suggests traffic density and volume are possible to calculate.

METEOROLOGICAL SENSORS

DATA: OpenWeather
ACCESS: Contact company
ATTRIBUTES: Weather (weather type, temperature, wind speed and direction, cloud coverage; rain and snow volume by hour and per 3 hours)
RESOLUTION AND FORMAT: 40-year historical archive for any coordinates by the hour; or by city or 1 km, 5 km, 10 km or customized grid (text, number)
COST: Low
COMMENTS: Price is economical for the 40-year history of a single coordinate or city. Contact provider for details on pricing and to download many locations.

DATA: Tomorrow.io
ACCESS: Contact company
ATTRIBUTES: Weather (weather type, temperature and humidity; wind speed, direction, gust; precipitation type, intensity; snow and ice accumulation; visibility, moon phase)
RESOLUTION AND FORMAT: 500m radius with precipitation recordings as low as 30 feet off the ground; time steps range from one day to one minute (text, number)
COST: N/A

SOURCE: Original table for this publication.




ANNEX 3:
Hotspots and Heatmaps: Uncovering Data Patterns
for Road Safety

Data visualizations are provided for the case study regions using alternative data sources, such as
OSM, Mapbox, and Waze, as well as a select government dataset.

Bogotá, Colombia

Temporal data visualizations show road safety patterns between years, seasons, months, weeks,
days, and times of day. The Waze crash data used to train the ML model covered a period of six
months, from July through December 2020. It was anticipated that the pandemic would affect the
number of Waze crash reports, and potentially traffic patterns, as crashes reported by the
government noticeably decreased compared to prior years (figure 3.1). The government dataset revealed
fewer incidents starting in March 2020, suggesting that the number of crashes was affected by the
pandemic, though it is worth noting that the speed limit was also reduced from 60 km/h to 50 km/h
in May 2020 (figure 3.2). With this in mind, the Waze data was used to identify road safety trends.
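
The temporal groupings described above can be sketched in a few lines of Python. This is a minimal
illustration and not the workflow used for the case studies: the function names and the sample
timestamps are hypothetical, and the sketch assumes that each crash report carries an ISO 8601
timestamp.

```python
from collections import Counter
from datetime import datetime


def count_by_month(timestamps):
    """Count crash reports per (year, month)."""
    return Counter((t.year, t.month) for t in timestamps)


def count_by_weekday_hour(timestamps):
    """Count crash reports per (weekday, hour); Monday is weekday 0."""
    return Counter((t.weekday(), t.hour) for t in timestamps)


# Hypothetical sample reports for illustration only (not real data).
reports = [datetime.fromisoformat(s) for s in (
    "2020-07-03T08:15:00",
    "2020-07-03T08:40:00",
    "2020-07-18T17:05:00",
    "2020-08-01T23:30:00",
)]

monthly = count_by_month(reports)
weekly = count_by_weekday_hour(reports)
print(monthly[(2020, 7)])  # number of reports in July 2020
print(weekly[(4, 8)])      # number of reports on Fridays between 08:00 and 08:59
```

Counts keyed by month can feed a chart such as figure 3.2, while counts keyed by weekday and hour
could support a time-of-day heatmap.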

FIGURE 3.1:   Road crashes with damage, injury or death in Bogotá, 2016–2020

YEAR      WITH DAMAGE      WITH INJURY      WITH DEATH
2016      23,530           10,412           567
2017      23,775           10,096           536
2018      22,606           11,857           485
2019      21,260           11,799           491
2020      12,874            8,015           371

          SOURCE: Original figure for this publication, based on data from Datos Abiertos Secretaría Distrital de Movilidad.



FIGURE 3.2:   Road crashes per month in Bogotá, 2016–2020
[Radial chart: months January through December around the circle, one ring per year from 2016 to 2020.
Legend classes for road crashes per month: ≤ 1,500; ≤ 2,000; ≤ 2,500; ≤ 3,000; ≤ 3,256.]
          SOURCE: Original figure for this publication, based on data from Datos Abiertos Secretaría Distrital de Movilidad.




Hotspot analysis groups crash locations to determine statistically significant clusters of crashes.
Government and Waze datasets were analyzed over the same six-month window (figure 3.3). In both
datasets, similar hotspots were found near Avenida Boyacá and Calle 6 along Avenida Norte-Quito-Sur
(NQS), the highway in the south. Overall, the Waze data produced more hotspots than the government
dataset, possibly because some minor incidents captured by Waze went unreported to the police.
For example, minor collisions reported on Waze cluster farther north in the city, and this cluster
does not appear in the government data. Instead, clusters of government-reported crashes with only
damage (no injury or fatality) appear in a central band. Hotspot results depend on the approach
used, which can vary in the clustering method and in the size, shape, and search area of
neighboring hotspots.
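
To make the idea of a statistically significant cluster concrete, the sketch below computes the Getis-Ord Gi* statistic, one common basis for hotspot maps, on a small grid of crash counts. It is only an illustration under stated assumptions: the grid, the counts, and the choice of each cell plus its adjacent cells as the neighborhood are hypothetical, and the source does not state which settings were used for figure 3.3.

```python
import math


def getis_ord_gi_star(grid):
    """Return Gi* z-scores for a 2-D list of counts.

    Assumption: binary weights over each cell and its (up to) eight
    adjacent cells, with the cell itself included in its neighborhood.
    """
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    if std == 0:
        raise ValueError("all cells have the same count; Gi* is undefined")

    z = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            local_sum = 0.0
            w = 0  # number of cells in the neighborhood
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        local_sum += grid[rr][cc]
                        w += 1
            # With binary weights, sum(w) and sum(w^2) both equal w.
            denom = std * math.sqrt((n * w - w * w) / (n - 1))
            z[r][c] = 0.0 if denom == 0 else (local_sum - mean * w) / denom
    return z


# Hypothetical crash counts per grid cell (not real data).
counts = [
    [9, 8, 1, 0, 1, 0],
    [8, 9, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]
z_scores = getis_ord_gi_star(counts)
```

A z-score above roughly 1.645, 1.96, or 2.576 corresponds to a hot spot at 90, 95, or 99 percent confidence under a one-sided normal approximation. Changing the cell size or the neighborhood changes which cells pass these thresholds, which is why results vary with the approach.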

FIGURE 3.3:     Hotspot analysis of government and Waze crash data in Bogotá, July–December 2020
[Six map panels: Government (all crashes); Government (death or injury); Government (damage only); Waze (all crashes*); Waze (major); Waze (minor). Legend: cold spot confidence 99%, 95%, 90%; not significant; hot spot confidence 90%, 95%, 99%.]
*Includes major and minor crashes, as well as those not categorized as either type.
SOURCE: Original figure for this publication, based on data from Datos Abiertos Secretaría Distrital de Movilidad and the Waze App. Learn more at waze.com. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.




As with other alternative data sources derived from mobile devices and apps, Waze crash reports
are influenced by where users are located, which affects where and when crashes are reported.
While Waze data distinguishes major and minor incidents, it does not include the additional crash
details typically available from an official source, such as type, severity, class, and reason.
Users can validate reports (e.g., with a thumbs up), which provides a confidence and reliability
rating, and can flag false reports, but Waze data may still contain duplicates. Deduplication was
not conducted for this analysis because the study focused on relative crash patterns.

Temporal patterns emerge when major crashes are aggregated by day of the week and hour of the day
(figure 3.4). In Bogotá, major crash reports increased between 6 and 7 p.m., with the most crashes
in this window reported on Fridays. Fewer incidents occurred on Sundays.
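
The aggregation by day of week and hour can be sketched as follows. The timestamps are hypothetical examples rather than Waze records, and the sketch assumes each report carries a local-time ISO 8601 timestamp, which may differ from the actual Waze data format.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def crashes_by_day_and_hour(timestamps):
    """Count reports per (weekday, hour) from ISO 8601 local-time strings."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        counts[(DAYS[dt.weekday()], dt.hour)] += 1
    return counts


def busiest_slot(counts):
    """Return the ((weekday, hour), count) pair with the most reports."""
    return max(counts.items(), key=lambda item: item[1])


# Hypothetical report times (not real Waze data).
sample_reports = [
    "2020-07-03T18:05:00",  # Friday
    "2020-07-03T18:40:00",  # Friday
    "2020-07-03T18:55:00",  # Friday
    "2020-07-05T10:15:00",  # Sunday
    "2020-07-06T18:20:00",  # Monday
]
slot_counts = crashes_by_day_and_hour(sample_reports)
peak = busiest_slot(slot_counts)
```

Each (weekday, hour) count corresponds to one cell of a chart like figure 3.4. Because duplicates inflate counts, this kind of table is better read for relative patterns than for absolute numbers.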

FIGURE 3.4:   Major crashes reported on Waze in Bogotá, July–December 2020
[Chart: day of week (Mon–Sun) by hour of day (0–23). Legend classes for number of crashes: ≤ 100; ≤ 200; ≤ 300; ≤ 400; ≤ 507.]
SOURCE: Original figure for this publication, based on data provided by the Waze App. Learn more at waze.com.




Spatial and temporal analysis can be combined to identify areas that exhibit patterns over time and
warrant closer inspection. This is valuable when human movement or behavior changes during the
examined period, for example because of a pandemic, road construction, or updated speed limits.
Emerging hotspot analysis identifies clusters of crashes that are consistent over time as well as
ones that are intensifying or diminishing (figure 3.5).61 In this example, crashes were analyzed by
week. Intensifying hotspots were statistically significant hotspots in 90 percent of the weeks
analyzed, including the final week, with the intensity of clustering increasing over time.
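
As a rough illustration of how a location might be labeled intensifying or diminishing, the sketch below applies a Mann-Kendall trend test to a series of weekly hotspot z-scores for one location. The thresholds, labels, and classification rule are simplified assumptions for illustration and do not reproduce the full set of categories in the tool referenced in the footnote.

```python
import math


def mann_kendall_z(series):
    """Mann-Kendall trend statistic as a z-score (no tie correction)."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def classify_location(weekly_z, hot_threshold=1.645, trend_threshold=1.645):
    """Simplified, hypothetical labeling rule for one location.

    weekly_z holds one hotspot z-score per week, oldest first.
    """
    hot_weeks = [z > hot_threshold for z in weekly_z]
    share_hot = sum(hot_weeks) / len(hot_weeks)
    if share_hot >= 0.9 and hot_weeks[-1]:
        trend = mann_kendall_z(weekly_z)
        if trend > trend_threshold:
            return "intensifying hot spot"
        if trend < -trend_threshold:
            return "diminishing hot spot"
        return "persistent hot spot"
    return "other"


# Hypothetical weekly z-scores for three locations.
rising = [2.0, 2.1, 2.3, 2.2, 2.6, 2.8, 3.0, 3.1, 3.4, 3.6]
falling = list(reversed(rising))
steady = [2.0, 2.5, 2.1, 2.4, 2.0, 2.6, 2.2, 2.3, 2.1, 2.4]
```

The 90 percent share and the requirement that the final week be a hotspot follow the description above; the trend test then separates locations whose clustering is strengthening from those where it is weakening or stable.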

FIGURE 3.5:    Emerging hotspot analysis of Waze crashes in Bogotá, July–December 2020
SOURCE: Original figure for this publication, based on data provided by the Waze App. Learn more at waze.com. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.




61 For a complete list of definitions, see “How Emerging Hot Spot Analysis Works”:
https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/learnmoreemerging.htm


If interventions or investments target a specific road, more geographically detailed information is
required to make decisions. Hotspot analysis applied to road segments visualizes statistically
significant crash frequencies along roads, as shown in figure 3.6.
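
Before a segment-level hotspot analysis can run, crash points need to be attributed to road segments. The sketch below shows one simple way to do this, assigning each crash to its nearest segment within a distance cutoff. The segment geometry, coordinates (assumed to be in a projected system with units of meters), and cutoff are hypothetical, and the source does not describe how crashes were matched to segments.

```python
import math
from collections import Counter


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def count_crashes_per_segment(crashes, segments, max_distance):
    """Assign each crash to its nearest segment if within max_distance."""
    counts = Counter({seg_id: 0 for seg_id in segments})
    for crash in crashes:
        seg_id, dist = min(
            ((sid, point_segment_distance(crash, a, b))
             for sid, (a, b) in segments.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            counts[seg_id] += 1
    return counts


# Hypothetical segments and crash locations (not real data).
road_segments = {
    "north": ((0.0, 0.0), (0.0, 10.0)),
    "east": ((0.0, 0.0), (10.0, 0.0)),
}
crash_points = [(0.5, 5.0), (0.2, 8.0), (6.0, 0.3), (50.0, 50.0)]
segment_counts = count_crashes_per_segment(crash_points, road_segments, 1.0)
```

The resulting per-segment counts, ideally normalized by segment length, are the kind of input a segment-level hotspot statistic would then test for significance.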

FIGURE 3.6:   Hotspot analysis using Waze crash frequencies in Bogotá, July–December 2020
[Map of road segments. Legend: hot spot confidence 99%, 95%, 90%; not significant.]
SOURCE: Original figure for this publication, based on data provided by OSM and the Waze App. Learn more at waze.com




Padang, Indonesia

Heatmaps visualize the density of crashes. While Waze data was sparse in Padang, some spatial
patterns could be detected. A heatmap shows at least three distinct areas of high crash density that
could be further examined during a site inspection (figure 3.7).
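
A heatmap of this kind is typically a kernel density surface. The sketch below evaluates a Gaussian kernel density on a coarse grid; the points, grid, and bandwidth are hypothetical, and the source does not state which kernel or bandwidth was used for figure 3.7.

```python
import math


def kernel_density(points, xs, ys, bandwidth):
    """Gaussian kernel density evaluated at every (x, y) grid node.

    Returns a list of rows, one row per value in ys.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for y in ys:
        row = []
        for x in xs:
            total = 0.0
            for px, py in points:
                d2 = (x - px) ** 2 + (y - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        grid.append(row)
    return grid


def densest_node(grid):
    """Return (row, column) of the grid node with the highest density."""
    return max(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )


# Hypothetical crash coordinates in kilometers (not real data).
crash_xy = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (4.0, 4.0)]
axis = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
density = kernel_density(crash_xy, axis, axis, bandwidth=0.5)
```

With sparse data, the bandwidth matters: a small value shows isolated points, while a large value smooths several crashes into a single high-density area.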

FIGURE 3.7: Heatmap of crashes reported using the Waze app in Padang, April 2019–July 2021
[Map. Legend: high density (yellow) to low density (black).]
SOURCE: Original figure for this publication, based on data provided by the Waze App. Learn more at waze.com. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.




Road safety assessments may require the operating speeds of road segments. Mapbox collects this
data from mobile devices and provides typical speeds per road segment in 5-minute increments.
In Padang, Mapbox speeds were visualized for a Thursday from 5:00 p.m. to 6:00 p.m. (figure 3.8).
Because speed limits were sparsely recorded in OSM, the OSM road type was used to group roads into
minor and major roads as a proxy for low and high speed limits, and minor roads are drawn with
thinner lines than major roads. Average speeds were typically slower near intersections, shown in
pink (<25 km/h), than on major roads, shown in purple (25–50 km/h). High-speed road segments
exceeding 50 km/h are found heading north and south along Jalan By Pass. Identifying high-speed
road segments with Mapbox data supports road safety assessments and the implementation of speed
management or traffic calming measures.
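
The step from 5-minute typical speeds to the speed classes on the map can be sketched as below. The data layout (one list of 288 five-minute values per segment for a single day), the segment names, and the speeds are hypothetical and do not reflect Mapbox's actual delivery format; only the 25 and 50 km/h class breaks come from the figure legend.

```python
SLOTS_PER_HOUR = 12  # 5-minute increments


def mean_speed(day_profile, start_hour, end_hour):
    """Average the non-missing 5-minute speeds between two hours of one day."""
    window = day_profile[start_hour * SLOTS_PER_HOUR:end_hour * SLOTS_PER_HOUR]
    observed = [s for s in window if s is not None]
    return sum(observed) / len(observed) if observed else None


def speed_class(speed_kmh):
    """Map a speed to the classes used in the figure legend."""
    if speed_kmh is None:
        return "no data"
    if speed_kmh <= 25.0:
        return "low"
    if speed_kmh <= 50.0:
        return "medium"
    return "high"


# Hypothetical day profiles in km/h (not real Mapbox data).
profiles = {
    "residential_street": [20.0] * 288,
    "arterial": [40.0] * 288,
    "bypass": [45.0] * 204 + [58.0] * 12 + [45.0] * 72,
    "unmapped": [None] * 288,
}
classes = {
    seg: speed_class(mean_speed(profile, 17, 18))
    for seg, profile in profiles.items()
}
```

Segments that fall in the high class during the chosen window are candidates for closer review when planning speed management or traffic calming.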

FIGURE 3.8:   Mapbox typical speeds in Padang on Thursday, 5:00 p.m. to 6:00 p.m.
[Map of road segments. Legend, speed (km/h): 50.1–64.8; 25.1–50.0; 0.1–25.0; no data.]
SOURCE: Original figure for this publication, based on data provided by Mapbox. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.




ANNEX 4:
Classes Detected Using Mapillary Vistas Dataset in
RIC Model and Input Classes for the RRE Model

All classes listed were detected using the Mapillary Vistas Dataset. Classes in bold were the input
for the RRE Model.

animal--bird
animal--ground-animal
construction--barrier--ambiguous
construction--barrier--concrete-block
construction--barrier--curb
construction--barrier--fence
construction--barrier--guard-rail
construction--barrier--other-barrier
construction--barrier--road-median
construction--barrier--road-side
construction--barrier--separator
construction--barrier--temporary
construction--barrier--wall
construction--flat--bike-lane
construction--flat--crosswalk-plain
construction--flat--curb-cut
construction--flat--driveway
construction--flat--parking
construction--flat--parking-aisle
construction--flat--pedestrian-area
construction--flat--rail-track
construction--flat--road
construction--flat--road-shoulder
construction--flat--service-lane
construction--flat--sidewalk
construction--flat--traffic-island
construction--structure--bridge
construction--structure--building
construction--structure--garage
construction--structure--tunnel
human--person--individual
human--person--person-group
human--rider--bicyclist
human--rider--motorcyclist
human--rider--other-rider
marking--continuous--dashed
marking--continuous--solid
marking--continuous--zigzag
marking--discrete--ambiguous
marking--discrete--arrow--left
marking--discrete--arrow--other
marking--discrete--arrow--right
marking--discrete--arrow--split-left-or-straight
marking--discrete--arrow--split-right-or-straight
marking--discrete--arrow--straight
marking--discrete--crosswalk-zebra
marking--discrete--give-way-row
marking--discrete--give-way-single
marking--discrete--hatched--chevron
marking--discrete--hatched--diagonal
marking--discrete--other-marking
marking--discrete--stop-line
marking--discrete--symbol--bicycle
marking--discrete--symbol--other
marking--discrete--text
marking-only--continuous--dashed
marking-only--discrete--crosswalk-zebra
marking-only--discrete--other-marking
marking-only--discrete--text
nature--mountain
nature--sand
nature--sky
nature--snow
nature--terrain
nature--vegetation
nature--water
object--banner
object--bench
object--bike-rack
object--catch-basin
object--cctv-camera
object--fire-hydrant
object--junction-box
object--mailbox
object--manhole
object--parking-meter
object--phone-booth
object--pothole
object--sign--advertisement
object--sign--ambiguous
object--sign--back
object--sign--information
object--sign--other
object--sign--store
object--street-light
object--support--pole
object--support--pole-group
object--support--traffic-sign-frame
object--support--utility-pole
object--traffic-cone
object--traffic-light--general-single
object--traffic-light--pedestrians
object--traffic-light--general-upright
object--traffic-light--general-horizontal
object--traffic-light--cyclists
object--traffic-light--other
object--traffic-sign--ambiguous
object--traffic-sign--back
object--traffic-sign--direction-back
object--traffic-sign--direction-front
object--traffic-sign--front
object--traffic-sign--information-parking
object--traffic-sign--temporary-back
object--traffic-sign--temporary-front
object--trash-can
object--vehicle--bicycle
object--vehicle--boat
object--vehicle--bus
object--vehicle--car
object--vehicle--caravan
object--vehicle--motorcycle
object--vehicle--on-rails
object--vehicle--other-vehicle
object--vehicle--trailer
object--vehicle--truck
object--vehicle--vehicle-group
object--vehicle--wheeled-slow
object--water-valve
void--car-mount
void--dynamic
void--ego-vehicle
void--ground
void--static
void--unlabeled




ANNEX 5:
Average Precision of the Bounding Box Detection
and Classification

An Average Precision (AP) score closer to 100 indicates a better performance in correctly detecting and classifying an object. AP
scores equal to zero mean that no data is available.
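
For readers unfamiliar with the metric, the sketch below shows one simple way an AP score on a 0 to 100 scale can be computed for a single class. It assumes each detection has already been matched against ground truth (for example by an overlap threshold) and uses the non-interpolated form of AP; the source does not state which AP variant or matching rule was used, so this is an illustration rather than the evaluation procedure behind this annex.

```python
def average_precision(detections, num_ground_truth):
    """Non-interpolated AP for one class, scaled to 0-100.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of labeled objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0  # no labeled objects to evaluate against
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            # Precision among the top `rank` detections.
            precision_sum += true_positives / rank
    return 100.0 * precision_sum / num_ground_truth


# Hypothetical detections for one class (not results from the model).
example = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
ap = average_precision(example, num_ground_truth=4)
```

In this form, the score rises when correct detections are ranked ahead of incorrect ones and when fewer labeled objects are missed.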




Glossary of Terms


Big Data                   Large data sets that require significant processing power and/or complex
                           computational techniques to reveal patterns, trends, and correlations.
Development Data           A partnership between international organizations and companies, created to
Partnership (DDP)          facilitate the use of third-party data in research and international development.
Deep Learning (DL)         A branch of artificial intelligence that involves creating algorithms for deep
                           artificial neural networks, inspired by the human brain, to learn complex patterns
                           from large quantities of high-dimensional data.
Fatalities and Serious     A metric of those killed or seriously injured in a traffic crash which is used to
Injuries (FSI)             monitor traffic safety performance. Fatalities are defined as those who die within
                           30 days of the crash.
Intelligent Transport      The collection, analysis, and transmission of transportation, vehicle, and
System (ITS)               infrastructure data that informs users with real-time updates and improves future
                           operations and predictions.
Internet of Things (IoT)   Devices that are connected to the internet to send and/or receive data.
Machine Learning (ML)      Method to systematically derive patterns, identify trends, and draw conclusions
                           from data with minimal human intervention.
Neural Network             A set of connected algorithms typically organized in three layers: input layer,
                           hidden layer(s), and an output layer.
Road Crash                 The collision of a vehicle with another entity, such as a car, bicycle, stationary
                           object, pedestrian, or animal, that causes injury or damage to one or more of the
                           entities on a road or road-related area.
Road Safety                System to reduce risks to road users, preventing death or injury.
Road Safety                Systematic review of the current road or traffic scheme to identify hazardous
Assessments                areas.
Road Safety Audit (RSA)    Independent, systematic evaluation of the modification or addition to the road or
                           traffic scheme to determine the crash potential and safety performance for all
                           road users.
Road Safety Impact         The safety performance ranking of planned road construction or modification
Assessment (RSIA)          design schemes and their effect on the surrounding road network.
Road Safety Observatory    A regional network of government representatives that facilitates the sharing and
(RSO)                      exchange of road safety data and expertise. The World Bank operates RSOs in
                           Latin America (OISEVI), Africa (ARSO), and Asia-Pacific (APRSO).
Safe System                An approach to road safety that integrates principles for safer vehicles, safer
                           roads, and safer users to eliminate death and serious injuries.
Supervised Learning        A machine learning task using labeled data to train the model with input-output
                           pairs.
Unsupervised Learning      A machine learning technique that extracts patterns from unlabeled data. For
                           example, grouping or clustering data with similar attributes.
Vulnerable Road Users      Individuals at higher risk when using the road because they do not have the
                           protection of an enclosed vehicle, such as pedestrians, motorcyclists, bicyclists,
                           and those on animals or animal-drawn carts.




This guidance note offers a practical introduction to integrating big data and machine learning
in road safety evaluations. It outlines data requirements for several road safety assessments,
provides a convenient overview of relevant big data sources, and explains machine learning
fundamentals for the application of these advanced technologies, specifically for road safety.
The note proposes an Integrated Framework for Road Safety, which takes the reader step-by-step
through a machine learning workflow to evaluate road risk, using case studies in Bogotá, Colombia
and Padang, Indonesia.

The Integrated Framework for Road Safety uses machine learning to identify road characteristics
from street view images and predict road segment risk based on those identifiable characteristics.
As a result, road segment risk was predicted with 72.5 percent accuracy in Bogotá.

While the preliminary results in Padang were encouraging, additional data is required to verify
the performance in a new context. However, the workflow illustrated through these case studies
shows potential for replicability. All code for the Integrated Framework for Road Safety is free
and publicly available for repurposing and refining to local context through a link provided in
the note.

The framework exemplifies current capabilities to reduce the reliance on manual image annotations
and highlights the potential to conduct a road safety scan without years of historical crash data.
The increasing availability of big data and the growing use of machine learning models for road
safety point to rapidly evolving technological solutions that have immense capacity to improve
the quality and efficiency of road safety assessments in developing countries.



