GLOBAL PROGRAM RESILIENT HOUSING

Detecting Urban Clues for Road Safety
Leveraging Big Data and Machine Learning

DECEMBER 2021

Table of Contents

List of Figures and Tables
Acknowledgments
Objective, Audience and Structure
Abbreviations
Introduction
PART 1: The Demand for Data to Assess Risks and Conduct Safety Assessments
  1.1 Conventional Tools for Road Safety Assessment
    Data Requirements for Traffic and Road Safety Assessment Tools
    Key Challenges with Current Approaches to Road Safety Analysis
PART 2: Big Data and Machine Learning to Strengthen Road Safety in Transport Projects
  2.1 New Data (and Big Data) in Road Safety Analysis
    How to Access Big Data
    Key Considerations for Selecting the “Right” Big Data Source
  2.2 Machine Learning in Road Safety Analysis
    How to Use Machine Learning
    Key Considerations for Using Machine Learning
  2.3 Big Data, Machine Learning and the Future of Road Safety Assessments
PART 3: Case Studies: Applying Big Data and Machine Learning to Assess Road Safety
  3.1 Objectives of the Case Studies
  3.2 Methodology
  3.3 Case Study 1: Bogotá, Colombia
  3.4 Case Study 2: Padang, Indonesia
  3.5 Findings
Conclusion
Annex 1: Most Relevant Big Data Types for Road Safety Analysis
Annex 2: Overview of Big Data Sources
Annex 3: Hotspots and Heatmaps: Uncovering Data Patterns for Road Safety
Annex 4: Classes Detected Using Mapillary Vistas Dataset in RIC Model and Input Classes for the RRE Model
Annex 5: Average Precision of the Bounding Box Detection and Classification
Glossary of Terms
References

List of Figures

Figure 1: Road safety is a serious concern in low- and middle-income countries
Figure 2: Potential applications of big data and ML in road safety projects
Figure 3: Street view and OSM
Figure 4: Hotspot analysis of major crashes reported by Waze application users
Figure 5: ML lifecycle
Figure 6: Categories of ML and the tasks they can perform
Figure 7: ANN structure
Figure 8: ML algorithms and street view
Figure 9: Labeling a crosswalk in Padang, Indonesia using the Computer Vision Annotation Tool (CVAT)
Figure 10: Framework for automatic road safety analysis and management powered by ML
Figure 11: Training phase for road safety segment analysis using ML
Figure 12: Deployment phase to predict road safety
Figure 13: RIC and RRE applied to predict road segment risk
Figure 14: Image segmentation in Bogotá
Figure 15: Six study areas and crash frequency in Bogotá
Figure 16: Confusion matrix showing the accuracy of the RRE model
Figure 17: Road risk prediction in Bogotá
Figure 18: Road risk prediction in Padang

List of Tables

Table 1: Overview of common road safety assessment tools
Table 2: Overview of data requirements for common road safety assessment tools
Table 3: SWOT analysis of using big data in road safety analysis
Table 4: Overview of potential big data sources for road safety assessments
Table 5: Categories of ML and algorithms
Table 6: ML and DL algorithms
Table 7: Frequently used ML techniques for road safety analysis
Table 8: SWOT analysis of using ML in road safety analysis
Table 9: Potential applications of big data and ML in road safety analysis
Table 10: Data used for case study in Bogotá, Colombia
Table 11: Data used for case study in Padang, Indonesia

Acknowledgments

This Guidance Note was prepared by a team from the Global Program for Resilient Housing at the World Bank. The team was led by Sarah Elizabeth Antos (Data Scientist) and Luis Miguel Triveno Chan Jan (Senior Urban Development Specialist). Overall managerial support was provided by Francis Ghesquiere (Practice Manager, Urban EAP) and Radoslaw Czapski (Senior Transport Specialist). The core team included Jessica Gosling-Goldsmith, Charles Wang, Bushra Syed Shafat Ali, and Sebastian Anapolsky.

The Global Program for Resilient Housing supports safe and resilient housing by creating new, cost-saving tools to evaluate homes from the air and the street to help identify those vulnerable to natural and health hazards. While the program focuses on housing, it developed a methodology to extract urban clues from street view imagery with multiple applications including those related to urban mobility and road safety.

The note incorporates valuable input and review from Holly Krambeck (Program Manager), Said Dahdah (Lead Transport Specialist), Satoshi Ogita (Senior Transport Specialist), Veronica Ines Raffo (Senior Infrastructure Specialist), Li Qu (Senior Transport Specialist), and Glenn S. Morgan (ESF Consultant).

During the drafting of this note several industry experts were interviewed.
The team would like to express gratitude for the external inputs of: Anthony Germanchev (Principal Professional Leader, Advanced Technologies Lab, Australian Road Research Board), David Hynd (Chief Scientist, TRL), Monica Olyslagers (Safe Cities and Innovation Specialist, iRAP), Professor George Yannis (National Technical University of Athens), and Spencer Rigler (Account Director, TRL). Design was done by Xavier Conesa.

This note would not have been possible without generous support from the Global Road Safety Facility and UK Aid.

Objective, Audience and Structure

The purpose of this Guidance Note is to provide concrete guidance on how big data and machine learning (ML) can be leveraged in road safety analysis. The document presents opportunities to use these new technologies to improve current methods for data collection and analysis for various road safety assessments.

This Guidance Note provides a practical guide for using new data sources and analytical methods for road safety analysis in different types of projects that may impact road infrastructure or risk-related factors. Road safety practitioners, project managers, researchers, international development organizations, data scientists, and government agencies responsible for road safety assessments, transportation management, and infrastructure development would also find this document useful to understand how these new technologies can be implemented for various road safety assessment procedures and requirements.

This document consists of three parts. Part 1 provides an overview of existing approaches and tools for road safety assessment and identifies opportunities to improve these using new technologies such as big data and ML. Part 2 provides an overview of these new technologies and concrete guidance on how they can be integrated into transport projects for road safety analysis.
Part 3 presents case studies on two regions of interest (Bogotá, Colombia and Padang, Indonesia) to demonstrate how ML can be implemented to evaluate road safety. The document concludes with recommendations for using big data and ML in road safety assessments in the future.

Abbreviations

ADB: Asian Development Bank
API: Application Programming Interface
DDP: Development Data Partnership
DL: Deep Learning
DRIVER: Data for Road Incident Visualization, Evaluation and Reporting
FSI: Fatalities and Serious Injuries
GRSF: Global Road Safety Facility (World Bank)
IoT: Internet of Things
iRAP: International Road Assessment Programme
ITS: Intelligent Transport System
LMICs: Low- and Middle-Income Countries
ML: Machine Learning
OSM: OpenStreetMap
RIC: Road Information Collector
ROI: Region of Interest
RRE: Road Risk Evaluator
RSA: Road Safety Audit
RSI: Road Safety Inspection
RSIA: Road Safety Impact Assessment
RSO: Road Safety Observatory
SDGs: Sustainable Development Goals
UAV: Unmanned Aerial Vehicle

Introduction

Transportation services and infrastructure connect people, businesses, and places. They allow citizens to access opportunities, such as jobs, education, health services, and recreation, and enable the movement and distribution of goods. As a result, transport services and infrastructure are key to the economic development of cities and regions.[1]

While the development of transportation systems and infrastructure is vital to economic growth, it is also important to evaluate and mitigate their potential negative externalities and costs to society.[2] According to the World Health Organization (WHO), around 1.25 million people are killed on the world’s roads every year and between 20 and 50 million are seriously injured.
These costs are disproportionately higher in low- and middle-income countries (LMICs), which are estimated to endure 93 percent of the world’s fatalities on the road, despite having 60 percent of the world’s vehicles (figure 1).[3] According to a 2019 study of select countries, road crashes cost World Bank client countries an estimated 7 percent to 22 percent of their GDP over a 24-year period.[4]

Road fatalities and injuries are predictable and preventable.[5] Research indicates that roughly 70 percent of serious crashes are due to simple and unintentional errors of perception or judgement.[6] The most vulnerable road users are pedestrians, bicyclists, and motorcyclists, who account for more than 50 percent of reported fatalities in LMICs.[7]

Effective transport planning and management carefully considers and incorporates measures to address safety risks.[8] Speed reductions and the design of infrastructure to promote safer streets have demonstrated clear results in Colombia and India. In Bogotá, Colombia, the speed management program resulted in a 21 percent decrease in traffic fatalities compared to the average for the three preceding years (2015-18).[9] In India, Pune has become a regional leader in complete streets, an approach in which streets are designed for all users rather than only for cars, so that pedestrians, cyclists, motorists, and transit riders all have safe access.[10]

The United Nations (UN) launched its second Decade of Action for Road Safety in 2020 to address the road safety objectives of its Sustainable Development Goals (SDGs). These include SDG 3.6, which seeks to reduce deaths and injuries from road crashes by 50 percent, and SDG 11, which focuses on making cities and human settlements inclusive, safe, resilient, and sustainable.

[1] World Bank, Mobile Metropolises: Urban Transport Matters: An IEG Evaluation of the World Bank Group’s Support for Urban Transport (Washington, DC: World Bank, 2017).
[2] World Bank, Making Roads Safer (Washington, DC: World Bank, 2014).
[3] WHO (World Health Organization), Global Status Report on Road Safety 2018 (Geneva: World Health Organization, 2018), 4.
[4] World Bank, The High Toll of Traffic Injuries: Unacceptable and Preventable (Washington, DC: World Bank, 2017).
[5] Makhtar Diop, “All Road Deaths Are Preventable. We Can Make It Happen,” World Bank, accessed May 14, 2021, https://blogs.worldbank.org/transport/all-road-deaths-are-preventable-we-can-make-it-happen
[6] International Transport Forum, Zero Road Deaths and Serious Injuries: Leading a Paradigm Shift to a Safe System (Paris: OECD Publishing, 2016). https://doi.org/10.1787/9789282108055-en
[7] World Bank, Good Practice Note on Road Safety (Washington, DC: World Bank, 2019). https://pubdocs.worldbank.org/en/648681570135612401/Good-Practice-Note-Road-Safety.pdf
[8] International Transport Forum, “Best Practice for Urban Road Safety: Case Studies,” International Transport Forum Policy Papers, no. 76 (2020).
[9] International Transport Forum, “Best Practice for Urban Road Safety: Case Studies.”
[10] Institute for Transportation and Development Policy, “Pune, India Wins 2020 Sustainable Transport Award,” last modified June 27, 2019, https://www.itdp.org/2019/06/27/pune-india-wins-2020-sustainable-transport-award/

The World Bank hosts the Global Road Safety Facility (GRSF) to provide funding, knowledge, and technical assistance to help developing countries create safer roads. The Facility addresses road safety issues across a wide range of projects, from infrastructure design and vehicle safety to traffic law enforcement, post-crash response systems, data collection, and institutional strengthening. Since its inception in 2006, the Facility has disbursed a total of USD 44.6 million to improve road safety in 64 countries.

FIGURE 1: Road safety is a serious concern in low- and middle-income countries
93% of road fatalities occur in low- and middle-income countries, despite these countries having 60 percent of the world’s vehicles.
SOURCE: Original figure for this publication, based on data from WHO.

It is important, and often required, to incorporate road safety management procedures in transport projects to identify and mitigate risks in a timely manner. Governments, international development organizations, and other agencies have established various tools and systems to facilitate road safety analysis. However, the absence of valid, representative data presents significant challenges to developing a good understanding of road safety risks and reducing crash fatalities and injuries through data-driven, evidence-based interventions.[11]

[11] World Bank, Guide for Road Safety Opportunities and Challenges: Low and Middle Income Country Profiles (Washington, DC: 2020). https://openknowledge.worldbank.org/handle/10986/33363

New technologies such as big data and machine learning (ML) provide promising opportunities to improve existing data sources and methods for road safety analysis. From analyzing anonymized GPS data to understand traffic flows in the Philippines to partnering with data providers that crowdsource information about crash sites in Kenya, governments, road safety practitioners, and other stakeholders are adopting innovative approaches to identify, monitor, and mitigate fatalities and injuries in high-risk areas.[12] Unsupervised learning techniques have been applied in Lima, Peru, using records of different crash types to identify safe areas along routes and safer pedestrian pathways, decreasing the likelihood of pedestrians suffering an accident.[13] The Urban Traffic Modeling and Control project at the National University of Medellín has been using deep learning (DL) techniques to classify traffic and identify motorbike usage.
In Cartagena, Colombia, data mining and ML classification algorithms were used to analyze road records and predict the severity of crashes.[14] Figure 2 provides an overview of the potential uses of big data and ML in road safety analysis that will be discussed in this note.

FIGURE 2: Potential applications of big data and ML in road safety projects

| BIG DATA OR SPATIAL DATA SOURCE | MACHINE/DEEP LEARNING |
|---|---|
| Street view imagery | Identify road conditions, barriers, crosswalks, pedestrian paths, street signs, traffic lights |
| Satellite and aerial imagery | Delineate road curvature, complex intersections, road gradient; provide car and truck count |
| Internet of Things | Analyze vehicle and population movement |
| Incident reports | Identify road crash patterns and develop prediction models |
| Natural phenomena | Find patterns in weather and time of day |
| Social media | Extract traffic or road condition data |

SOURCE: Original figure for this publication.

[12] World Bank, “Open Traffic Data to Revolutionize Transport,” last modified December 19, 2016, https://www.worldbank.org/en/news/feature/2016/12/19/open-traffic-data-to-revolutionize-transport; Guadalupe Bedoya Arguelles, et al., “Smart and Safe Kenya Transport (SMARTTRANS)” (Washington, DC: World Bank, 2019), https://documents1.worldbank.org/curated/en/723411574361015073/pdf/Smart-and-Safe-Kenya-Transport-SMARTTRANS.pdf
[13] Jesús Lovón-Melgarejo et al., “Identification of Risk Zones for Road Safety through Unsupervised Learning Algorithms,” in 16th LACCEI International Multi-Conference for Engineering, Education, and Technology: Innovation in Education and Inclusion, http://www.laccei.org/LACCEI2018-Lima/full_papers/FP413.pdf
[14] Holman Ospina-Mateus et al., “Using Data-Mining Techniques for the Prediction of the Severity of Road Crashes in Cartagena, Colombia,” in Applied Computer Sciences in Engineering, eds. J. Figueroa-García et al., vol. 1052 (2019): 309-20, https://doi.org/10.1007/978-3-030-31019-6_27

PART 1: The Demand for Data to Assess Risks and Conduct Safety Assessments

Road safety practitioners utilize a variety of data-driven tools and methods to evaluate road safety risks and determine mitigation measures across different stages of road and infrastructure development projects. Comprehensive road safety evaluation tools and procedures require both crash and non-crash data to identify issues and measure their associated risks. The variety, quantity, and quality of the data available are important determinants of which tool is used to measure and analyze road safety indicators.

This section provides an overview of the most widely used road safety assessment tools and their data requirements. A brief description of these road safety assessment procedures and tools can be found in table 1. This brief review of existing approaches informs the suggestions for improving data collection and analysis for road safety evaluation procedures through big data and machine learning (ML).

1.1 Conventional Tools for Road Safety Assessment

Road safety risks arise from the interaction of many different elements: the road and roadside design and engineering, travel speeds, the extent and type of road use, road user behavior, vehicle safety features (both active and passive), and post-crash response. The Safe System approach addresses all of these interactive elements in an integrated manner and emphasizes sharing accountability with designers and users of the road network to achieve road safety targets.[15]

The primary purpose of road safety assessment procedures is to identify risks in existing or planned infrastructure developments. Road safety practitioners utilize a wide range of tools for this purpose. Some of these can be purchased commercially, while others are provided, and occasionally mandated, by local governments.
Organizations providing financial support for international development projects may also create their own tools for road safety analysis, such as the “Simplified Methodology” by the World Bank.[16]

In general, road safety assessment tools tend to comprise checklists for evaluating the safety of road networks at different stages of a road project’s lifecycle. Some tools, such as the Austroads Road Safety Audit tool, provide guidelines for conducting road safety audits at all the stages of a road project, while other tools like iRAP have guidelines for only some stages (such as during preparation and post-construction). Tools may also need to be adapted or customized depending on the type of project or the project location.

A comprehensive approach to managing road safety and reducing crash risk generally requires a combination of reactive and proactive approaches across some or all stages of a road’s lifecycle.[17] Reactive approaches rely on historical crash data to identify high-risk regions and risk factors. Proactive approaches aim to identify and address potential risks before a project is implemented or crashes occur.

[15] Tony Bliss and Jeanne Breen, “Meeting the Management Challenges of the Decade of Action for Road Safety,” IATSS Res., 35 (2012): 48–55. https://doi.org/10.1016/j.iatssr.2011.12.001
[16] World Bank, Innovative Road Safety Risk Assessment Tool with Automated Image Analysis Technology (Washington, DC: World Bank, 2021).
[17] World Road Association, “Road Safety Manual: Infrastructure Management Tools,” accessed May 10, 2021, https://roadsafety.piarc.org/en/planning-design-operation-infrastructure-management/management-tools

Reactive approaches are often the starting point for road safety analysis and rely on some form of crash-based identification. Crash data-based risk assessments may involve evaluating one or several of the following criteria: infrastructure, users, speeds, vehicle standards and post-crash trauma care.
This approach requires that risk factors be constantly monitored and assessed throughout the project lifecycle.

Recently, the focus has shifted toward using more proactive approaches, with a wide range of tools being developed for this purpose. These are especially useful in the absence of crash data, and often involve surveys of existing roads for road infrastructure risk or assessment of other criteria to obtain subjective estimates of road infrastructure risk. Some common tools for proactive road risk assessments are discussed below.

Road Safety Impact Assessments (RSIA) are designed to estimate the potential effects of planned road or traffic developments, or any other interventions that may significantly affect transport conditions and risks to road users. The procedure is often conducted at the planning stage to assess the possible impacts of different schematic designs before the most appropriate design is audited and selected for implementation.

Road Safety Audits (RSA) are generally used to analyze a road project, or any other type of project which affects road users. An independent, qualified team reports on the project’s crash potential and safety performance for all kinds of road users. Road safety audits can be conducted at various stages in the project lifecycle including planning, preliminary design, detailed design and pre-opening or post-construction stages. However, an audit is most cost-effective when it is applied to a road or traffic design before construction to ensure that safety is fully integrated into all elements of the project’s infrastructure, with minimal risk of redesign or physical rework.

Road Safety Inspections (RSI) involve a systematic evaluation of an existing road or section of road by a team of seasoned experts.
They are conducted on-site to determine potential hazards, faults and deficiencies that could contribute to serious crashes.[18] RSIs are more comprehensive than RSAs and are usually conducted post construction to identify further interventions to improve road safety and inform future projects.

Road Assessment Programmes (RAP) entail a comprehensive review of existing roads and road networks. Most RAPs, such as the EuroRAP, usRAP and iRAP, use a star-rating approach to provide a relative and comparable measure of the safety level of road networks all around the world. RAPs are highly comprehensive, detailed, and costly. They are usually commissioned by national or local governments to evaluate extensive road networks as an ad-hoc project to determine safety interventions and inform further infrastructure development. Therefore, RAPs are either utilized at the preparation stage of a project to determine project scope, design, and other key requirements for pre-appraisal and construction, or they are conducted to assess the impact of major infrastructure development projects during the post-project operations phases.

[18] Phil Allan, “Road Safety Inspections” (presentation, Road Safety Seminar, World Road Association, Lomé, Togo: October 2006). https://www.piarc.org/ressources/documents/actes-seminaires06/c31-togo06/8718,2-PIARC_Oct06_Allan.pdf

TABLE 1: Overview of common road safety assessment tools

| TYPE OF ASSESSMENT | WHEN TO USE (PROJECT STAGE) | WHEN TO USE (PROJECT ACTIVITY) | RELATIVE COST (HIGH, MEDIUM, LOW, DEPENDS) | DATA REQUIREMENTS (HIGH, MEDIUM, LOW, DEPENDS) | EXAMPLES OF TOOLS |
|---|---|---|---|---|---|
| Crash data-based risk assessment | Preparation, Implementation, Post-Project Operations | Pre-Planning and Design, Monitoring and Evaluation, Error Correction and Hazard Elimination | Depends, low-cost models are available | Depends | Crash frequency, crash risk factors, crash severity analysis |
| Road Safety Impact Assessment (RSIA) | Preparation | Pre-Planning and Design | Low | Low | |
| Road Safety Audit (RSA) | Preparation, Implementation | Planning and Design, Construction and Pre-Opening | Medium to High | Medium/Depends | iRAP Road Safety Audit Toolkit, Austroads Road Safety Audit Toolkit (currently unavailable), ADB Road Safety Audit Toolkit |
| Road Safety Inspection (RSI) | Implementation, Post-Project Operations | | High | High | iRAP |
| Road Assessment Program (RAP) | Preparation, Post-Project Operations | Planning and Design, Independent Assessment | High | High | iRAP, EuroRap, usRAP |

SOURCE: Modified from Remote Project Supervision and Construction Management of IPF Projects (Washington, DC: World Bank, 2020).

Data Requirements for Traffic and Road Safety Assessment Tools

One or more types of road safety assessments may be conducted at once or at different phases of a project. Table 2 summarizes the assessment methods, objectives, and their data requirements. Assessments prepared early in a project’s lifecycle may help to identify and evaluate potential traffic and road safety risks that may arise from the project activities and/or their implementation. Such assessments are intended to help mobilize appropriate resources, analyze risks in detail, and identify and adopt the most appropriate mitigation measures.
During the project preparation stage, more in-depth assessments to identify and evaluate potential traffic and road safety risks may need to be conducted. The assessments should consider Safe System principles to ensure that all opportunities to minimize risks have been realized.[19]

Since the key objectives of these assessments (i.e., identifying risk elements and estimating crash exposure, likelihood, and severity for different road users) are complex and not standardized, the scoring system is subjective. This can complicate comparisons between sites, especially when these have been assessed by different individuals or teams. Such scoring is, therefore, usually most suitable for comparing options at a single site, identifying sources of risk and identifying solutions, rather than for comparing different sites.

[19] Tony Bliss and Jeanne Breen, “Meeting the Management Challenges of the Decade of Action for Road Safety.”

TABLE 2: Overview of data requirements for common road safety assessment tools

| METHOD | OBJECTIVES | DATA REQUIREMENTS |
|---|---|---|
| Crash data-based risk assessment | Estimate risk using Fatalities and Serious Injuries (FSI) crash data to reflect road infrastructure, users, and speed factors. This is evaluated with vehicle standards and post-crash care. | Crash data from the previous 3–5 years or estimated from data available from similar roads in the country; assessment of vehicle standards (safe vehicles); post-crash trauma care (response time, quality of attention) |
| Road Safety Audit (RSA) (performed by an independent team of specialists) | Identify safety concerns. It audits the safety of the specific design of the chosen scheme. | Analysis of project designs and interventions: specialists assess road options, such as intersections, signs, crossings; design standards, and the relationship of this intervention to main network. Main data needed includes: scheme plans; crash and FSI data; traffic mix and volumes; road features (e.g., design elements, such as bypasses, cycle routes, junction improvements, installation of traffic signals, roundabouts, traffic calming, bend realignment, safety fence schemes and pedestrian crossing facilities) |
| Road Safety Impact Assessment (RSIA) (performed by members of the project design team with road design and road safety auditing experience) | Assess the impact of each of the planning options on the safety performance of the current road network. It estimates the impact of possible schemes on safety for an entire geographic area at the strategic level. | The evaluation of each alternative is based on several factors, some of which include: the scheme objectives; crash and FSI data; traffic mix and volumes; road features; categorization of roads and streets of that network |
| Safe System Assessment (SSA) | Assess how closely road design and operation align with the Safe System objectives, and clarify which elements need to be modified to achieve closer alignment with these objectives. | The core of this SSA approach is the “Safe System Matrix” framework, which is essentially a risk assessment. The assessment is done by scoring the risk exposure, likelihood and severity from 0–4. The Austroads approach can be used to perform this type of assessment. Data needed include: traffic mix and volumes; road features |

SOURCE: Road Safety GPN.

Key Challenges with Current Approaches to Road Safety Analysis

Since data is the cornerstone of all road safety assessments, the availability of high quality, reliable data is key to extracting useful, actionable insights and improving road safety conditions. Without quality information, it is difficult to estimate crash locations and crash types, at-risk individuals and groups, and key risk factors influencing exposure to risk, crash involvement, crash severity, and post-crash outcomes.
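As an illustration of the Safe System Matrix scoring summarized in table 2, the sketch below combines exposure, likelihood and severity scores (each 0–4) for a set of crash types. The note does not say how the three scores are combined; multiplying them and summing across crash types is an assumption made here for illustration, and the class name, function name, crash-type labels and scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping

MAX_SCORE = 4  # each component is scored from 0 to 4, as stated in table 2


@dataclass(frozen=True)
class MatrixCell:
    """Scores for one crash type in a Safe System Matrix (hypothetical structure)."""
    exposure: int
    likelihood: int
    severity: int

    def risk(self) -> int:
        """Return the combined score, assumed here to be the product of the three."""
        for name in ("exposure", "likelihood", "severity"):
            value = getattr(self, name)
            if not 0 <= value <= MAX_SCORE:
                raise ValueError(f"{name} must be between 0 and {MAX_SCORE}, got {value}")
        return self.exposure * self.likelihood * self.severity


def total_risk(matrix: Mapping[str, MatrixCell]) -> int:
    """Sum the combined scores over all crash types in the matrix."""
    return sum(cell.risk() for cell in matrix.values())


# Made-up scores for two crash types at a single site.
site = {
    "pedestrian": MatrixCell(exposure=3, likelihood=2, severity=4),
    "intersection": MatrixCell(exposure=2, likelihood=2, severity=1),
}
print(total_risk(site))  # 3*2*4 + 2*2*1 = 28
```

Because the individual scores are subjective, a total like this is best read as a way to compare design options at one site rather than to rank different sites.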
Meeting data requirements for road safety assessments can be a challenge for various reasons, such as the lack of open data, or data collection costs. There can be a lack of adequate crash data or road ratings in data scarce countries and regions for identifying risk factors. Governments often lack adequate and reliable data to identify road safety risks and perform road safety assessments. In addition, road crashes tend to be underreported, especially in LMICs. There may also be significant gaps in the data in terms of geographic or temporal coverage, or the data may be missing important variables and categories. Access to data can also be limited for certain data types, or the process of obtaining the data may be too complex, costly, and time-consuming.

Collecting data on road safety attributes through manual detection or special equipment can be expensive, time-consuming, and complex.20 Budgeting for data collection can be a challenge. In these cases, data is most often estimated through existing road designs or by local transportation agencies. The most cost-effective method for data collection is the installation of cameras and sensors that record street imagery, speed information, and other data. Images and video are then analyzed by road safety experts to identify relevant attributes, assess road conditions, and identify potential risks. Commissioning equipment and hiring resources to manually collect data on road features and design may be a hindrance, especially for smaller-scale projects where the opportunity to benefit from economies of scale is low.

In addition to the quality and availability of data, preparing and analyzing road safety data can also be costly, resource-intensive, and technically demanding. Most road safety assessments require data to be combined from various sources, which often involves aggregating, cleaning and preparing the data.
Additional resources and specialist expertise may be necessary for this process, and also to analyze the data and extract useful insights using methods such as clustering and developing spatial models. Conventional statistical techniques can also be limited in their ability to identify complex correlations and underlying factors that may contribute to road safety risks across various projects.

The purpose of this Guidance Note is to identify new methods for the collection and analysis of road safety data that could overcome the limitations of existing approaches, and also improve their efficacy in identifying risks and opportunities to mitigate crashes. Conducting road safety assessments is a required component of most road investment and infrastructure development projects. Advanced technologies such as big data and ML have the potential to not only supplement existing methods, but also significantly reduce costs while improving the efficacy of road safety assessments in identifying risks and opportunities to mitigate crashes.

The following section explains how big data and ML can be practically implemented by road safety practitioners for various road safety assessment procedures. It introduces these methods and provides an overview of big data sources and ML techniques that are useful for road safety assessments. Part 2 also discusses best practices and key considerations that are vital to implementing these new methods effectively. A framework for integrating these technologies in road safety assessments is also proposed, and Part 3 demonstrates how this framework can be applied in LMICs through two original case studies.
20 OECD (Organisation for Economic Co-operation and Development)/ITF (International Transport Forum), Big Data and Transport: Understanding and Assessing Options (Paris: OECD/ITF, 2015), https://www.itf-oecd.org/sites/default/files/docs/15cpb_bigdata_0.pdf

PART 2: Big Data and Machine Learning to Strengthen Road Safety in Transport Projects

Governments, road safety practitioners, international development organizations, and road safety advocates such as the Global Road Safety Facility are keen to use new technologies, such as big data and ML, in data collection and analysis for road safety to overcome the limitations of existing approaches. As these technologies become more sophisticated and accessible, a growing body of research indicates their potential to complement, and eventually even surpass, conventional methods.

The usefulness of big data and ML in road safety and other transport and infrastructure projects has been widely demonstrated over the past few years. For example, a World Bank task team developed an open data platform in 2015 based on a pilot in Cebu City, Philippines, which sourced data from a taxi company to generate insights for traffic management.21 Another team has developed a “Simplified Methodology” to implement ML in video analysis to extract data on road attributes. The new tool was piloted across over 500 kilometers of road in Mozambique and Liberia in 2019.22

The World Bank, in collaboration with the Philippines government, has also launched the Data for Road Incident Visualization Evaluation and Reporting (DRIVER) system to facilitate data sharing for road safety analysis. This free web-based, open-source platform connects traffic crash data from multiple agencies through a standardized reporting system.
DRIVER also provides tools to geo-spatially analyze road crash data, predict blackspots, estimate the economic costs of crashes, and evaluate the effectiveness of various interventions to support investments and policymaking for improved road safety.23

Road safety practitioners are increasingly turning to data partnerships to obtain crash, traffic, and other types of data for road safety analysis. For example, in Kenya, the WHO estimates that up to 75 percent of crashes go unreported.24 SmarTTrans, a collaboration between the Kenyan government and the World Bank, has worked to fill this gap by bringing together crash information both from administrative records and from bystander crash reports from Twitter.25 In addition, the team has leveraged the Development Data Partnership (DDP) to access the Waze API and Uber congestion and speed information for all 6,200 km of Nairobi’s road network. Using all data sources, the smarTTrans team is creating near real-time analytics to facilitate the identification of crash hotspots, speeding, and congestion patterns.

21 World Bank, Open Traffic: Easing Urban Congestion (Washington, DC: World Bank, n.d.), https://olc.worldbank.org/system/files/WBG_BD_CS_OpenTraffic_1.pdf
22 World Bank, Innovative Road Safety Risk Assessment Tool with Automated Image Analysis Technology (Washington, DC: World Bank, 2019).
23 World Bank, GRSF DRIVER Completion Report (Washington, DC: World Bank, 2019), https://documents1.worldbank.org/curated/en/245151560919065747/pdf/Data-for-Road-Incident-Visualization-Evaluation-and-Reporting-Lowing-the-Barriers-to-Evidence-Based-Road-Safety-Management-in-Resource-Constrained-Countries.pdf
24 WHO, Global Status Report on Road Safety 2018.
25 Sveta Milusheva et al., “Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning,” PLoS ONE 16, 2 (2021), https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244317

2.1 New Data (and Big Data) in Road Safety Analysis

Big data is generally understood as extremely large datasets that are generated by a wide range of data sources, including machines, sensors, and other Internet of Things (IoT) devices. Big data can also be captured over the internet through social media and other types of applications, especially those that track locational or transactional data.

The large volume of such data is one of many characteristics that make big data especially useful for road safety and other applications in transport and infrastructure development. For example, big data can be generated at immense velocity, especially as more such data is collected real-time and for large populations. It also occurs in a variety of data formats, from structured databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions. Big data is also characterized by a high degree of variability since data flows can change over time, depending on seasons, off-peak hours, or availability of collection methods across an entire population under study. Table 3 provides a SWOT analysis of the use of big data in road safety analysis.

For transport, the increasing use of personal mobile devices and vehicle sensors to collect traffic and location data presents a significant opportunity to augment traditional sources of transport data. Annex 1 discusses the most relevant big data types for road safety analysis. It also provides guidance on the potential applications of these sources for evaluating road safety, and the advantages and disadvantages of each source.
The following sections discuss how big data can be used for the various road safety assessment methods and tools discussed in Part 1.

TABLE 3: SWOT analysis of using big data in road safety analysis

STRENGTHS
• Recent and broad geographic coverage allows researchers to dive deeper into transport issues and get a comprehensive and current picture of risks.
• Can help obtain real-time data and track up-to-the-minute changes in traffic flows and other important variables.
• May be faster and easier to obtain and process, compared to manual collection.
• Can offer higher spatial and temporal resolution than conventional sources.
• Can be more affordable and easier to scale.
• Vast quantities of data can limit bias from outliers and other sources of “noise” since data gets aggregated across vast populations.
• Can help improve data quality since it often covers a large geographic and/or temporal scope, also allowing for comparison against “control” datasets and scenarios.

WEAKNESSES
• Requires investment in expertise, software and computing power to store, access and process big data.
• Availability of data can vary significantly by geography and context.
• Coverage can be inconsistent or exclude important segments of the population.
• Most big data sources are not set up to support road safety assessments; the data was often collected for other purposes and is repurposed for road safety analysis. This can lead to the data being biased, incomplete and/or difficult to incorporate in road safety analysis.
• Need to consider the interoperability of different datasets (i.e., how easy it is to combine different datasets for complex road safety assessment models).
• Changes in privacy laws and other relevant policies can impact quality, consistency and coverage of data.

OPPORTUNITIES
• Provides an alternative approach to road safety data collection and analysis that may complement or supplement traditional approaches or datasets. For example, big data sources may be able to collect more accurate crash data.
• Big data analysis can uncover new dynamics, complex behavioral patterns and relationships, and correlations that conventional statistical methods and data may not be able to detect.
• Growing interest in autonomous vehicles is generating more data about road systems, vehicles, and vulnerable users that can be integrated into road safety analysis.
• Rising momentum for the creation of a “big data platform” where data providers can sell or share data.

THREATS
• Privacy concerns: data should be de-identified and anonymized before use.
• Data providers may be reluctant to share data.
• Governments, local municipalities, and other stakeholders must invest in technological infrastructure to support big data collection and analysis.
• Need to enforce quality control to limit risk of data bias.
• Licensing constraints: most private companies, such as Google, provide limited licenses for data use.

SOURCE: Original table for this publication.

Big data, especially when combined with ML, which is discussed in the following section, can enhance the capabilities of current systems and road safety assessment tools. The increasing use of IoT devices, which range from smartphones to vehicle sensors, as well as Intelligent Transport Systems (ITS), is making it possible to collect, access and utilize real-time data about a large range of variables that are relevant to road safety analysis. This includes traffic flows, crash sites, peak timings, travel times and road usage by pedestrians, bicyclists, and motorists.
The availability of such extensive data creates new possibilities for crash risk modelling, especially to predict the outcomes of various types of road safety interventions as well as possible impacts of road infrastructure projects.

As mobile phone use rises globally, smartphones have become a prominent source of big data, though there are many other sources to consider. In addition to the location and velocity of road travelers collected passively through mobile devices, transportation projects can take advantage of street view, aerial, and satellite imagery, traffic monitoring systems, and connected vehicles, as well as crowdsourced data provided by the community through mobile devices, for road safety analysis.26

Annex 2 provides an overview of the most relevant and accessible big data sources for road safety analysis. Road safety practitioners are advised to look for relevant local and regional data providers based on the region(s) of interest that concern their project(s). As big data infrastructure advances globally and new companies and startups begin data collection for various purposes, it is likely that the list of available big data sources will expand significantly in coming years.

26 Alex Neilson et al., “Systematic Review of the Literature on Big Data in the Transportation Domain: Concepts and Applications,” Big Data Res. 17 (2019): 35-44. https://doi.org/10.1016/j.bdr.2019.03.001

Street view imagery can complement or potentially substitute for manual or commissioned road surveys to collect data on road safety attributes for various types of assessments. For example, street view imagery can help obtain baseline data for RSIA more quickly and cheaply, especially if the data is not already readily available. ML algorithms applied to street view images can detect road attributes and other data that are important for road safety assessments.
Similarly, there may be instances where satellite imagery or aerial imagery, such as that collected by an unmanned aerial vehicle (UAV) or drone, can be analyzed to detect road or road user attributes. Figure 3 shows the same crosswalk as mapped in OpenStreetMap (OSM) and as visible in satellite and street view imagery. ML is discussed in greater detail in the next section.

FIGURE 3: Street view and OSM
Road safety data, such as road markings and signs, types of road users, and designated paths for vulnerable users, can be extracted from images. Each image and its relevant attributes are geolocated for further analysis. In this instance, the crosswalk identified in OSM can be verified in street view imagery.
SOURCE: Original figure for this publication derived from OSM, Mapillary, and Maxar Technologies.

Mobile applications and telematics can provide data related to vehicle movement to identify road infrastructure risks. This data includes current and historical average speeds along road segments as well as irregularities, like traffic jams and incidents. This data is useful for most proactive road safety assessment tools, including RSIA, RSA, and RSI. It can be geographically visualized and analyzed, such as through heatmaps or hotspot analysis as shown in figure 4 (see Annex 3 for additional examples and descriptions).

Telematics data has also been used to assess driver behavior, facilitate the prediction of crash-prone locations, and create geographic visualizations, as discussed in interviews with researchers at the ARRB and Professor George Yannis from the National Technical University of Athens. However, data privacy is an especially important concern when it comes to the use of telematics data.27

27 Anthony Germanchev (Principal Professional Leader, Advanced Technologies Lab, Australian Road Research Board) and Professor George Yannis (School of Civil Engineering, National Technical University of Athens), in discussion with the authors, April 2021.
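
The heatmap and hotspot idea mentioned above can be illustrated with a minimal sketch. It assumes incident reports are available as plain (latitude, longitude) pairs, bins them into a regular grid, and flags cells whose report count is unusually high relative to other occupied cells. The function names, the 0.01-degree cell size, and the 1.645 threshold are hypothetical choices made for illustration. This is a simplified stand-in and not a formal spatial statistic such as Getis-Ord Gi*, which also accounts for neighboring cells.

```python
from collections import Counter
from math import floor
from statistics import mean, pstdev


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer (row, col) grid cell; cell_deg is in degrees."""
    return (floor(lat / cell_deg), floor(lon / cell_deg))


def hotspot_cells(reports, cell_deg=0.01, z_threshold=1.645):
    """Return {cell: z_score} for cells with unusually many reports.

    reports: iterable of (lat, lon) pairs (assumed input format).
    z_threshold: hypothetical default; 1.645 roughly matches a 90 percent
    two-sided confidence level under a normal approximation.
    Only cells containing at least one report enter the statistics.
    """
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in reports)
    values = list(counts.values())
    if not values:
        return {}
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every occupied cell has the same count, so nothing stands out.
        return {}
    scores = {cell: (n - mu) / sigma for cell, n in counts.items()}
    return {cell: z for cell, z in scores.items() if z >= z_threshold}
```

In practice the flagged cells would be joined back to road segments and checked against official crash records before drawing any conclusion, since crowdsourced reports are denser where more users are present.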
FIGURE 4: Hotspot analysis of major crashes reported by Waze application users, Bogotá, Colombia
[Map of Waze major crash reports. Legend: cold spots at 95% and 90% confidence; not significant; hot spots at 90%, 95%, and 99% confidence.]
SOURCE: Original figure for this publication (data provided by Waze App; learn more at waze.com).

Mobile applications are helping overcome underreporting of road crashes by crowdsourcing incident reports. For example, in Kenya, road crashes have been shown to be largely underreported, especially in areas where incident reporting mechanisms are lacking or underdeveloped.28 Navigation applications such as Waze are providing a valuable new source of crash and traffic data by allowing users to report incidents through their smartphone applications. Each incident report submitted by a user is geolocated and timestamped, which allows it to be combined with other geospatial data to identify segments of a road that are experiencing major or minor crashes, light to standstill traffic jams, or hazardous conditions (hazards on the road or on the shoulder, weather alerts or dangerous road surfaces). Additionally, social media platforms like Twitter are used by many people on the ground to report on crashes and traffic conditions and can be leveraged using machine learning algorithms to produce additional data on crashes, as was done by the smarTTrans team in Nairobi.29 Lastly, mobile application data can be generated in real time to assist with monitoring, or collected and analyzed over time to develop models.

A growing number of countries and regions are focusing on developing a big data infrastructure to collect official incident reports. Collecting comprehensive and accurate information about road incidents is an important objective for government transportation agencies. There is growing interest in gathering and analyzing the information in big data formats to provide deeper and more comprehensive insight into road safety risks and the impact of different interventions. Real-time data collection would also be beneficial for this purpose, and collecting, storing, and analyzing the information as big data would be the most realistic and feasible way to do so.

28 Guadalupe Bedoya Arguelles et al., “Smart and Safe Kenya Transport (SMARTTRANS).”
29 Sveta Milusheva et al., “Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning.”

How to Access Big Data

Big data for road safety generally falls into two categories: public sector and private sector. Traditionally, governments have collected and provided data for road safety analysis, such as police reports of crash incidents. However, alternative sources are becoming increasingly available as mobile apps are used to crowdsource reports of roadside incidents and companies aggregate traffic speeds from proprietary mobile applications. Often data quality from such sources can vary significantly by location, with certain sources being more effective, reliable, and better developed in some regions compared to others. Road safety practitioners are advised to use the list provided in Annex 2 as a starting point and find the most relevant data providers for the region(s) of interest that their project focuses on.

This Guidance Note focuses on big data sources that are most easily and readily accessible for road safety analysis. Different sources require different approaches to obtaining relevant data quickly and efficiently. It is important to understand the licensing restrictions that accompany each source. For example, even though a dataset is crowdsourced, it may have licensing restrictions. It is best to consult a legal advisor and the data provider to clarify terms of use when necessary.

Public sector.
Governments can collect, manage, and share data relating to transport, infrastructure, and mobility. Many governments, whether at the national level or even local municipalities, are establishing open data platforms where datasets can be accessed by running a simple search query. Such platforms have already been created in the Philippines as well as in Australia and the United States.30 In other instances, particularly where the data infrastructure is not as advanced, data may have to be requested through the relevant department. It is often possible to obtain datasets relating to crash histories or collected by road sensors from government sources which are extensive enough to be processed as big data in road safety analysis.

The World Bank’s Road Safety Observatories (RSO) initiative also has the potential to become an important source of government-generated big data in the future. The Observatories provide a formal network of government representatives to share and exchange road safety data and experience in order to improve road safety throughout the region. The World Bank established its first RSO in Latin America (OISEVI), before introducing the initiative in Africa (ARSO) and Asia-Pacific (APRSO). By enhancing road safety data and information systems, the Observatories play a pivotal role in helping countries monitor, evaluate, and develop more impactful road safety policies and interventions.31

In other cases, publicly available datasets with a global reach may be considered.
A good example of this is OSM, which offers freely available geographic data generated by volunteers who trace satellite images around the world to create and update the map consisting of road networks (detailing road types, bridges, tunnels, direction of traffic flow), among other features. OSM data can be combined with other datasets for road safety analysis. While OSM provides an overview of the road geometry, the recency and accuracy of the data require validation. Due to variability in quality and coverage, OSM data should be considered a starting point and is not recommended for detailed assessments.

30 Australian BITRE (Bureau of Infrastructure and Transport Research Economics), “Australian Road Deaths Database (ARDD),” Australian BITRE, updated May 13, 2021, https://data.gov.au/data/dataset/australian-road-deaths-database; ODPH (Open Data Philippines), “Open Data Philippines,” ODPH, accessed June 3, 2021, https://data.gov.ph/; US NHTSA (United States National Highway Traffic Safety Administration), “Data,” US NHTSA, accessed May 28, 2021, https://www.nhtsa.gov/data
31 World Bank, “Better Data for Safer Roads: The Powerful Mission of Road Safety Observatories,” last modified November 5, 2020, https://www.worldbank.org/en/news/video/2020/11/05/better-data-for-safer-roads-the-powerful-mission-of-road-safety-observatories

Private sector. Mobility datasets are generated through ride-hailing services, delivery services, social media, and other mobile applications that collect user location and movement. Companies in the transportation and logistics sector use smartphone applications to digitize their operations and take advantage of higher quality, real-time data to improve efficiency as well. Other companies provide telematics software to track vehicle movement and safety features. Companies and start-ups investing in autonomous vehicle research are providing valuable sources of big data for road safety analysis. Some companies also provide APIs that allow developers to access these datasets (often on a limited basis). However, proprietary or commercial data may have to be purchased in some instances, or data partnerships need to be established to access such data. It is also crucial to understand how the data is licensed and can be legally used for different types of analysis. For example, Google restricts digitizing and tracing information as well as using applications to analyze and extract information from street view images, although annotation and labelling is permitted.32

Data Partnership Agreements. Road safety practitioners can access various datasets for road safety analysis through data partnership agreements with companies. Practitioners can directly contact companies to request data relevant to road safety and, upon signing a licensing agreement, receive the data. Practitioners can also leverage data sharing platforms such as the Development Data Partnership (DDP), which is accessible to practitioners affiliated with certain international development organizations. DDP is a formal collaboration of private sector companies and select international organizations to use third-party data in research and international development.33

32 Google, “Google Maps, Google Earth, and Street View,” accessed May 14, 2021, https://about.google/brand-resource-center/products-and-services/geo-guidelines/

The Waze for Cities program is one example of a data sharing agreement that can be leveraged through direct contact with the company or, if accessible, through DDP. The program allows cities to utilize data standards designed by Waze for closure and incident reporting to reduce data fragmentation and promote transport and government data aggregation. It now has more than 500 global partners including city, state and country government agencies, nonprofits and first responders.
Another example of a possible data provider for road safety analysis is Moovit, an app focused on public transport that offers Mobility as a Service (MaaS) solutions for cities, providing personalized apps, payment solutions, real-time transit information, and other analytics.

In many cases, data providers help local governments by exchanging data. For example, the city of Tokyo in Japan has partnered with a private firm to develop a smartphone-compatible app, Zenryoku Annai! The app analyzes nearly 360 million observations every second to generate real-time information on the shortest and least-congested travel routes. A similar intelligent transport system (ITS) in Denmark, Copenhagen Connecting, was implemented to promote transport sustainability through real-time digital traffic control and weather adaptation options. Road safety practitioners should consider seeking the support of local governments to establish data partnership agreements, particularly if the datasets are not accessible through DDP.

Data marketplaces. Business leaders are keen to explore the value of the big data they collect as a tradable commodity. This has given rise to data marketplaces, which are essentially online platforms dedicated to the buying and selling of data. These marketplaces can provide a more cost-effective source of data compared to other data mining techniques. Dedicated marketplaces for traffic and transport data have also emerged in recent years, although their coverage of LMICs tends to be low.

As part of its efforts to establish an artificial intelligence tool for road safety analysis (called Ai-RAP), iRAP is seeking to establish a data marketplace where public and private data providers can trade data for road safety analysis.
The data marketplace will focus on three types of data products, according to Monica Olyslagers (Safe Cities and Innovation Specialist at iRAP), who was interviewed for this Guidance Note.34 The first is raw datasets that need to be processed to extract relevant information. The second is datasets that have been at least partially cleaned up and processed by data providers or Ai-RAP and are ready to be plugged into road safety assessments. The third is prepared-for-purpose datasets that are specifically commissioned for road safety assessments in different types of projects. This data marketplace model is currently being piloted in Africa, as part of a project to set up a regional road safety observatory there in collaboration with the World Bank.

The new data marketplace will initially focus on aggregating and trading conventional datasets. However, the project team plans to bring on big data providers and incorporate ML in the Ai-RAP tool to allow for more sophisticated analysis in road safety assessment procedures. Road safety practitioners are advised to search data marketplaces as a lower-cost alternative to commissioning data collection for their projects.

33 Development Data Partnership, https://datapartnership.org/
34 Monica Olyslagers (Safe Cities and Innovation Specialist, iRAP), in discussion with the authors, April 2021.

Key Considerations for Selecting the “Right” Big Data Source

This section provides an overview of how different big data sources can be used. The data sources covered in table 4 for each method or assessment type should be viewed as guides, rather than concrete, all-inclusive lists. The most appropriate choice of data sources should eventually be determined by considering the costs and benefits of each source. A list of factors that may be useful to consider for this purpose is discussed toward the end of this section.
It is also worth noting that while big data may not be a feasible alternative to conventional data for every project or assessment (if only at present), it can still complement and supplement current approaches or be used to validate their outcomes and analyses.35

TABLE 4: Overview of potential big data sources for road safety assessments

| TYPE OF DATA REQUIRED | WHICH METHODS IT’S USED FOR | POTENTIAL BIG DATA SOURCE | EXAMPLES |
| Crash data from 3–5 years | Methods I, V and VI | Government | Government portal or contact |
| | | Mobile applications and telematics | Waze |
| | | Crowdsourced | Waze |
| Operating speeds | Methods II to IV | Mobile applications and telematics | Mapbox, Waze |
| Road features (road markings, signs, traffic calming measures, etc.) | Methods III, V, VI, and VII | Street view imagery | Mapillary |
| | | Crowdsourced | OSM |
| | | Aerial and satellite imagery | Maxar, UAV |
| Road type (urban road, pedestrian area, etc.) | Methods III, V, VI, and VII | Street view imagery | Mapillary |
| | | Crowdsourced | OSM |
| | | Aerial and satellite imagery | Maxar, UAV |
| | | Mobile applications | Orbital Insight |
| Vehicle fleet mean speed | Methods III to VII | Mobile applications and telematics | Mapbox, Waze |
| Traffic flow | Methods IV to VII | Traffic imagery | Mapillary |
| | | Aerial and satellite imagery | Maxar, UAV |
| | | Mobile applications and telematics | Mapbox, Waze |

SOURCE: Original table for this publication.

As a broader variety of big data sources become available, road safety practitioners are advised to carefully consider the trade-offs involved when collecting data from various sources. The factors noted below do not provide an exhaustive list. Some factors may be more relevant to some projects than others, while additional considerations may be required for certain projects. In some cases, data from existing sources may not be available and will need to be collected using cameras, sensors, and/or other tools.

It is worth noting that many of these factors are also interrelated.
For example, the types and quantity of data required could impact costs of obtaining and processing it. Costs can also vary by region, as can the availability of resources to process and analyze the data. This list may be used in tandem with Annex 2, which provides an overview of the most relevant big data sources for road safety analysis as well as their relative costs, data attributes and formats, and possible limitations.

35 Holly Krambeck, Magreth Kakoko, and Mireille Raad, Using Computer Vision to Automatically Detect Road Features for Road Safety Audits and Assessments: Inception Report (Washington, DC: World Bank, 2019).

• Type of road safety assessment or procedure. As discussed in Part 1, a broad range of tools and procedures are used for road safety assessments. Each tool has its own specific data requirements. It is important to consider these before determining appropriate big data sources to complement analysis.
• Context/Region(s) of Interest. The types and variety of big data sources available can vary greatly from region to region, country to country, or even different provinces or localities within the same country. For example, Waze crowdsourced crash data is especially useful for urban regions that are more densely populated compared to rural regions.
• Type of data required. As more big data sources become available for road and traffic data, road safety practitioners should carefully consider which variables and data types are most relevant to their model before selecting a source. For example, Google offers a number of APIs that may be useful for road safety analysis. These include Google Maps, Google Traffic, and Google Street View. It is important to consider the quantity, duration, and extensiveness of the data required. For example, some data sources include time-series information, others do not. Some may include specific road features or road user data, while others may just be focused on traffic flows.
• Data formats.
Big data is collected, stored, and transmitted in a wide range of formats. It is important to consider the usability of available big data formats as well as their interoperability with other types of data. Since many big data sources that are currently available are not custom designed for road safety analysis, it may be necessary to invest in resources and skilled expertise to extract, aggregate, clean, and convert the data into a format that can be combined with other data and/or used with analytical tools and models.
• Cost. Given the size of big datasets, costs can arise from accessing, storing, handling, processing, and analyzing the data. The cost may be in the form of data licenses, software licenses or equipment (if the data is being collected specifically for the project at hand). Besides the cost of obtaining the data, it is also important to consider the cost of using it, such as by acquiring the necessary expertise, software tools and processing power for analysis. Annex 2 discusses the relative costs associated with using different big data sources.
• Resources required to make data usable. In addition to relevant data sources and the costs that may be associated with accessing them, other resources could also be required to utilize the data in road safety assessment and analysis. This includes technical skills and expertise required to handle and analyze the data.
• Time constraints. Some big data sources are faster to access and obtain data from compared to others. For example, open data platforms allow you to run a search query and instantly obtain relevant datasets. Other avenues, such as data sharing agreements, may take longer to deliver the required data. It is important to consider the project timeframe to determine which data source may be more useful for road safety analysis at a given stage.
• Licensing constraints.
Any official and legitimate data source is accompanied by licensing regulations that outline the terms of use of the provided dataset. Big data sources are no exception. Different data sources have different licensing agreements associated with them. Some, such as open data platforms, may have minimal licensing restrictions. Others, such as APIs and datasets obtained through data partnership agreements, can have more restrictive terms of use. It is important to carefully consider these limitations before choosing a source. Road safety practitioners are advised to consult legal advisors or the data provider to fully understand licensing restrictions associated with different big data sources to avoid legal ramifications.

2.2 Machine Learning in Road Safety Analysis

ML is a branch of artificial intelligence. It involves creating algorithms that “learn” patterns, trends and behaviors from data and improve accuracy over time without further programming. As figure 5 illustrates, the lifecycle of an ML model can typically be divided into two phases: training and deployment. In the training phase, training data is fed into the algorithm to obtain a trained model. In the deployment phase, new input data is fed into the trained algorithm (or model) to predict the output.

FIGURE 5: ML lifecycle
[Figure: training data is used to train the algorithm, producing a trained model; new input data is then fed to the trained model to produce a prediction.]
SOURCE: Modified from https://randomtrees.com/data-science

As shown in figure 6, ML algorithms can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning. The specific tasks they are capable of and the corresponding algorithms that are most widely used for this purpose are also listed in table 5. One significant difference between these categories is the format and source of training data.
FIGURE 6: Categories of ML and the tasks they can perform
[Figure: supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), and reinforcement learning, each shown with example applications.]
SOURCE: Modified from https://towardsdatascience.com/coding-deep-learning-for-beginners-types-of-machine-learning-b9e651e1ed9d

Supervised learning is a family of algorithms that learn from previous data to map an input (X) to an output (Y). For example, a supervised learning algorithm can be used to predict the risk level or crash frequency (Y) of a road segment given its characteristics (X). “Supervised” means the training data is labelled (i.e., the training data should be pairs of X-Y, where Y is usually called labels).

Unsupervised learning algorithms find structures in a dataset in order to group or cluster data points based on their similarity. As the name suggests, these algorithms do not require “supervision” or human intervention in the training phase. This means that, unlike supervised learning, the training data for unsupervised learning algorithms has no labels (Y). These algorithms learn to group X based on similar characteristics. The most common unsupervised learning task is clustering. For example, given the characteristics of a road segment, an unsupervised learning algorithm can classify it into a group of similar segments. It does not need to understand the characteristics that the group represents to complete this task.
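The contrast between labelled and unlabelled training data can be illustrated with a small sketch. The example below is not taken from the case studies: the road segments, their two features (operating speed and number of lanes) and the "low"/"high" risk labels are hypothetical, and the two algorithms (k-nearest neighbors and K-means) are written out by hand only to keep the example self-contained.

```python
import math
import random

# Hypothetical road segments: (mean operating speed in km/h, number of lanes).
# Both the values and the risk labels are invented for illustration.
segments = [(30, 1), (35, 2), (40, 2), (80, 3), (90, 4), (85, 4)]
labels = ["low", "low", "low", "high", "high", "high"]


def knn_predict(x, train_x, train_y, k=3):
    """Supervised: predict Y for x by majority vote of the k nearest labelled points."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def kmeans(points, k=2, iterations=10, seed=0):
    """Unsupervised: assign each point to one of k clusters without using labels."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: attach every point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment


# Supervised learning needs the X-Y pairs; unsupervised learning only needs X.
print(knn_predict((38, 2), segments, labels))
print(kmeans(segments, k=2))
```

Note that the cluster numbers returned by `kmeans` carry no meaning of their own; interpreting a cluster as "high risk" would still require a practitioner to inspect the segments it contains.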
Reinforcement learning trains a software agent to make decisions that maximize rewards from interactions with an external environment.36 As opposed to supervised learning and unsupervised learning, which require training data to be prepared before training, reinforcement learning generates the training data during the training phase. The data is generated when the agent interacts with the environment. For example, reinforcement learning can be used to train an agent to control traffic lights based on traffic conditions.

TABLE 5: Categories of ML and algorithms*
CATEGORY | ALGORITHMS | TASKS
Supervised Learning | SVM, DT, RF, KNN, ANN | Classification, Regression
Unsupervised Learning | K-means, PCA, ANN | Clustering, Dimensionality Reduction
Reinforcement Learning | Q-Learning, DQN | Robotics/Decision-making
*The algorithms listed in this table are not exhaustive.
SVM: support vector machine; DT: decision trees; RF: random forest; KNN: k-nearest neighbors; ANN: artificial neural networks; PCA: principal component analysis; DQN: deep Q-network, which includes an ANN in its algorithm.
SOURCE: Original table for this publication.

Artificial neural networks (ANN) are a family of ML algorithms inspired by the human brain. ANN is the most versatile ML algorithm: it can be used for supervised learning, unsupervised learning, and also reinforcement learning. As shown in figure 7, ANN structures the data and the computation in different layers. Every layer adds more depth to the algorithm; the more layers a network has, the “deeper” it is. Such ANNs are called deep neural networks, deep ANN, or DNN. ML algorithms that use deep ANN are called deep learning (DL) algorithms. Therefore, from another perspective, ML algorithms can be divided into conventional ML and DL (table 6).

36 This agent is a piece of software that makes a decision based on the environment.
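To make the idea of layers concrete, the sketch below passes three input values through a small network and then through a deeper one. It is illustrative only: the weights are arbitrary numbers rather than trained values, the sigmoid activation is an assumption, and the helper names `dense` and `forward` are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn; more layers means a deeper network."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Arbitrary, untrained weights chosen only to show the shapes involved.
hidden_1 = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])  # 3 inputs -> 2 neurons
hidden_2 = ([[0.4, 0.6], [-0.7, 0.2]], [0.0, 0.0])              # 2 -> 2 neurons
output = ([[1.0, -1.0]], [0.0])                                  # 2 -> 1 output

shallow = [hidden_1, output]           # input, one hidden layer, output
deep = [hidden_1, hidden_2, output]    # same network with an extra hidden layer

prediction = forward([1.0, 0.5, 0.2], shallow)
print(prediction)
print(forward([1.0, 0.5, 0.2], deep))
```

Training, which is not shown here, consists of adjusting the weights and biases so that the outputs match the labels in the training data.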
FIGURE 7: ANN structure
[Figure: an input layer (Input 1, Input 2, Input 3), a hidden layer, and an output layer (Output 1).]
SOURCE: Original figure for this publication.

TABLE 6: ML and DL algorithms
CATEGORY | CONVENTIONAL ML* | DL
Supervised Learning | SVM, DT, RF, KNN, shallow ANN | Deep ANN
Unsupervised Learning | K-means, PCA | Deep ANN
Reinforcement Learning (RL) | RL without deep ANN | RL with deep ANN
*The conventional ML algorithms listed in this table are not exhaustive.
SOURCE: Original table for this publication.

Most ML algorithms are conventional ML. Conventional supervised learning algorithms, such as support vector machine (SVM), can be used for classification or regression, for example to classify the risk level of a road segment based on its characteristics. Conventional unsupervised learning algorithms, such as K-means clustering, automatically identify spatial patterns in datasets, which can be applied to locate clusters or areas with recurring road crashes. Conventional ML works well for small, low-dimensional datasets. Meanwhile, DL is a subset of ML that learns complex patterns from high-dimensional data (e.g., images) and from large quantities of data (e.g., big data). Supervised, unsupervised, and reinforcement learning algorithms that use deep ANN belong to the deep learning category. DL’s first successful application was in computer vision. For example, image classification is a supervised learning task that utilizes deep neural networks to classify images into different classes (e.g., cars, pedestrians, etc.).

How to Use Machine Learning

The use of ML methods in road safety analyses is being widely explored.37 As ML methods become more advanced, economical, and accessible, their potential applications in various disciplines continue to grow and become more feasible. In road safety analyses, ML has great potential to overcome the limitations of traditional statistical models in crash analysis and crash probability modeling.
The applications of ML in road safety analyses are discussed under three categories: conventional ML, DL, and reinforcement learning, as listed in table 7. It should be noted that some reinforcement learning algorithms using deep ANN belong to DL, but all reinforcement learning techniques are discussed separately.

37 Philippe Barbosa Silva, Michelle Andrade, and Sara Ferreira, “Machine Learning Applied to Road Safety Modeling: A Systematic Literature Review,” Journal of Traffic and Transportation Engineering (English Edition), 7, no. 6, (2020), https://www.sciencedirect.com/science/article/pii/S2095756420301410

TABLE 7: Frequently used ML techniques for road safety analysis*
ML CATEGORIES | SUBCATEGORIES | ALGORITHMS | TASKS | EXAMPLES
Conventional ML | Supervised Learning | SVM, DT, RF, KNN, shallow ANN | Classification | Predict risk level based on road characteristics.
Conventional ML | Supervised Learning | SVM, DT, RF, KNN, shallow ANN | Regression | Crash frequency prediction based on road characteristics.
Conventional ML | Unsupervised Learning | K-means | Clustering | Group road segments by characteristics similarity; group drivers based on their driving behaviors.
Conventional ML | Unsupervised Learning | PCA | Dimensionality Reduction | Identify critical factors of road safety.
DL | Supervised Learning | CNN | Image Classification/Object Detection/Segmentation | Detect road features from images.
DL | Unsupervised Learning | GAN | Clustering/Dimensionality Reduction | Find the hidden features related to road safety from map and satellite images of the road environments.
Reinforcement Learning | N/A | Q-Learning, DQN | Robotics/Decision-making | Control traffic lights based on traffic conditions.
*The algorithms and examples listed in this table are not exhaustive.
CNN: convolutional neural network, a type of deep ANN; GAN: generative adversarial networks, a type of deep ANN.
SOURCE: Original table for this publication.

A growing body of research explores various ML techniques to predict the probability of road crashes and assess their severity by training on historical datasets that encompass diverse factors.
Conventional ML algorithms are the most frequently used ML algorithms for this purpose. They are summarized in table 7. ML-based approaches to road safety analysis can be used to complement, supplement or even potentially substitute conventional road safety assessments.

Conventional supervised learning algorithms learn functions that take vectors of variables as input to predict the output. Most conventional supervised learning algorithms that are frequently used in data science have been used in road safety analyses, including but not limited to: decision trees (DT), random forest (RF), support vector machine (SVM), k-nearest neighbors (KNN), and artificial neural networks (ANN).38 It should be noted that there is no “best” algorithm. Determining which algorithm may be most appropriate for an ML-based road safety analysis is essentially a data science problem for which there are usually no set rules. One algorithm may perform well for a dataset, but badly for another. It is common practice for data scientists to try different algorithms in order to find a suitable one for a specific problem. When using the aforementioned conventional supervised learning algorithms for road safety assessments, the problem is often framed as a classification or regression problem, in which the output (Y) of the ML algorithm is either a class (e.g., risk level or severity: low, moderate, substantial or high) or a scalar (e.g., crash probability, crash frequency) and the input (X) to the ML algorithm could be any parameter (including but not limited to weather, time, road factors, human factors, etc.) that is related to the output.

Conventional unsupervised learning algorithms are mainly used for clustering and dimensionality reduction purposes. In road safety analyses, K-means can be used for grouping tasks that help find clustering patterns in the data.
For example, it can be used to group road segments by similar characteristics or group drivers based on their driving behaviors, so that dangerous road segments or drivers can be identified based on the similarity. In another application of unsupervised learning, principal component analysis is used to reduce the dimensions of input data and identify the most critical factors that affect road safety.

38 Silva, Andrade, and Ferreira, “Machine Learning Applied to Road Safety Modeling: A Systematic Literature Review.”

DL has been applied in various disciplines and achieved impressive performance. DL technologies have progressed significantly over the past few years, especially in image analysis and computer vision, the method’s first successful application. The core technique in this domain is the deep convolutional neural network (CNN), which is the state-of-the-art approach for object detection, semantic segmentation, and instance segmentation of images. Object detection is a task in which, given an image, the model outputs a bounding box for each detected object (figure 8). Semantic segmentation is a task in which, given an image, the model classifies every pixel into predefined classes (e.g., road lane, traffic light, etc.). Instance segmentation is a task in which, given an image, the model groups the pixels belonging to each individual instance of an object.

FIGURE 8: ML algorithms and street view
After applying an object detection algorithm to a street view image, a bounding box surrounds each predicted object, along with a confidence level for each prediction.
[Figure: street view image with labelled bounding boxes, for example Door 96%, Car 98%, Truck 92%, Person 96%, Buildings 85%, Commercial sign 45%.]
BOGOTÁ, COLOMBIA. SOURCE: World Bank Global Program for Resilient Housing.
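Because each predicted object comes with a confidence level, a common post-processing step is to discard low-confidence predictions before counting or mapping objects. The sketch below shows one way this might look. The list of detections is hypothetical (loosely modelled on the labels in figure 8), and the 0.7 threshold and the function name `count_confident` are assumptions for illustration, not values used in the case studies.

```python
from collections import Counter

# Hypothetical detector output: (class label, confidence between 0 and 1).
detections = [
    ("Door", 0.96), ("Car", 0.98), ("Car", 0.69), ("Truck", 0.92),
    ("Person", 0.96), ("Person", 0.81), ("Person", 0.72),
    ("Commercial sign", 0.85), ("Commercial sign", 0.45),
]


def count_confident(detections, threshold=0.7):
    """Count detections per class, keeping only those at or above the threshold."""
    return Counter(label for label, score in detections if score >= threshold)


counts = count_confident(detections)
print(counts)

# A stricter threshold keeps fewer, more reliable detections.
print(count_confident(detections, threshold=0.9))
```

Choosing the threshold is a trade-off: a higher value reduces false detections but may miss real objects, so it is usually tuned on validation data.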
DL-based image analysis has been successfully used in various industries for applications ranging from facial recognition to autonomous driving. It has great potential to be used in road safety analysis to automatically analyze images and infer road attributes that are relevant to road safety assessments. Large sets of images with annotations such as road lanes, traffic lights, speed limit signs, and pedestrians can be compiled for training deep CNNs so that they learn to recognize these objects in images that the models have not previously encountered.

If successful, this approach should equip the model to detect road attributes at a regional scale. The detected information can then be used for safety and risk analysis. For example, if the DL model can infer the road segment characteristics (e.g., number of lanes, terrain type, road markings and signs, and pedestrian, bicycling, and motorcycling facilities), the inferred information can readily be used as input for various road safety assessment tools. This would allow the process of detection and analysis to become fully, or at least significantly, automated and scalable at a low cost.

DL can also provide a lower-risk alternative to manual detection of certain road attributes and other important variables in road safety analysis. For example, a team used imagery from Baidu Street View to provide a practical, automated alternative to the manual detection of street cracks, which can be labor-intensive, hazardous, and difficult to conduct on a large scale. The authors use the Deeplabv3+ network model, a DL neural network, to develop an automated road crack identification system and demonstrate its practicality as a method to generate faster, more accurate and efficient information about road cracks at lower cost compared to manual detection.39

Reinforcement learning is widely used to design intelligent control and decision-making systems.
In road safety and traffic management, reinforcement learning is most commonly employed to develop intelligent signal control algorithms. A typical reinforcement learning-based traffic light system makes decisions based on specific input traffic parameters, such as the length of time for which vehicles wait at the intersection, the cumulative delay caused by waiting at the intersection, and the length of time for which the light stays green for each signal head. The output of the system is the next color of the light and the length of time for which it should remain switched on. Designing traffic systems using reinforcement learning helps save time and improve safety standards.

Key Considerations for Using Machine Learning

Road safety can be evaluated explicitly using rule-based reasoning systems. However, developing such systems can be complex if there are many input variables. Compared with rule-based evaluation systems, ML algorithms are data-driven and don’t require developing rules; therefore, they are relatively inexpensive to implement. ML algorithms are more suitable for high-dimensional inputs.

As a broader spectrum of ML algorithms become available, road safety practitioners are advised to carefully consider the trade-offs involved when applying them to road safety analysis. This section discusses various factors that must be considered before deciding to use an ML algorithm for road safety analysis in a project. Again, this is not an exhaustive list. Some factors may be more relevant to some projects than others, while additional considerations may be required for certain projects.

It is worth noting that many of these factors are also interrelated. For example, the feasibility of using ML for a project can be affected by time and budget constraints, the availability of data and the anticipated resource intensiveness of the data preparation process. Table 8 provides a SWOT analysis of the use of ML in road safety analysis.
39 Min Zhang et al., “Research on Baidu Street View Road Crack Information Extraction Based on Deep Learning Method,” Journal of Physics: Conference Series, no. 1616 (2020). https://iopscience.iop.org/article/10.1088/1742-6596/1616/1/012086/pdf

TABLE 8: SWOT analysis of using ML in road safety analysis

STRENGTHS
• Offers tools and techniques to process big data that may be more precise compared to traditional methods.
• Especially effective for feature learning, parameter optimization, and processing large amounts of big data.
• ML algorithms tend to perform better than traditional statistical techniques in cases where high-dimensional and highly nonlinear data is involved.
• As the technology develops, novel techniques create new opportunities to understand complex relationships between multiple, interrelated variables and predict outcomes with greater accuracy.
• ML algorithms can be improved continuously as more data is generated or made available for training.

WEAKNESSES
• Algorithms can be limited in their applicability; models may not perform well on data that is different from the training data’s distribution.
• Large amounts of data are needed to train the models and yield more accurate models, which may be difficult in data-scarce contexts.
• Some ML algorithms (e.g., ANN) work like a black box and can be hard to interpret; an ML algorithm therefore usually requires thorough validation and testing before it can be deployed in the real environment and assist decision-making.
• The technology still needs further development before it can be mainstreamed for use in road safety assessments.

OPPORTUNITIES
• May eliminate the need for manual coding of road safety data in the future, making the process less labor-intensive and time consuming.
• Possible to train models in one location or for one purpose and use them for another.
• Provides a powerful method for complex crash risk modelling and other types of predictive analytics in road safety.
• As the technology develops, a platform powered by ML could be used across geographies for road assessments.
• As more and more data is generated and collected every day, this could potentially be analyzed with ML algorithms to discover new patterns and insights.

THREATS/CHALLENGES
• Requires specialist expertise, tools, and knowledge, which may limit its usefulness in some contexts, especially in developing countries.
• May require additional investment in computing power and analytical software.
• Complexity of ML algorithms can make them difficult to implement and analyze.
• Ethical considerations, such as bias in ML systems.
• As a data-driven approach, ML relies on high-quality data for training. Significant bias in the training data could lead to the failure of model training. Quality control of training data could be difficult, especially when annotating the data requires professional knowledge.

SOURCE: Original table for this publication.

Feasibility with project objectives and client requirements. Before deciding to use ML for any project, it must be ascertained if ML is suitable for the project. Some ML algorithms, such as neural networks, are not interpretable. They work like a black box. Clients may not have confidence in using them for significant decision-making unless their predictions can be sufficiently validated.

Preparing data to train ML algorithms. ML is a data-driven approach. Therefore, as with any data-related project, it is important to plan the data collection and preparation process. To facilitate this process, make sure the inputs and outputs of the model are clearly defined at the outset of the project. Section 2.1 provides guidance on how to select data sources, especially where big data may be involved. It is common that, during the training stage, an ML team may find the data is not enough to train a model with satisfactory performance.
In this case, more data needs to be collected. In terms of data preparation, teams should be aware of the need to aggregate, clean and annotate data before it can be used for ML modelling. Annotation of data is especially necessary for supervised learning algorithms and entails manually identifying an object, drawing a box or polygon around it, and giving it a label such as “pothole” or “crosswalk” (figure 9).

FIGURE 9: Labeling a crosswalk in Padang, Indonesia using the Computer Vision Annotation Tool (CVAT)
SOURCE: World Bank Global Program for Resilient Housing.

Teams are advised to incorporate a quality control process to ensure data being used for any ML model, especially test data, is of good quality and truly valid and representative of the population or situation under study. For an ML-based project, steps include: (i) identifying data required for the model; (ii) data collection, cleaning, annotation; (iii) trial and error training; (iv) validation; (v) deployment. It is advisable to estimate the duration of these tasks, their expected complexity and potential challenges (which can vary by context and availability of resources such as expertise and processing power) before deploying ML in any project. This helps determine if ML is feasible, how it compares to traditional methods and how incorporating ML can impact project timelines. It is worth noting that once deployed in the production environment, ML can significantly accelerate the whole process. For example, DL-based image analysis can greatly reduce the time needed to collect data for road risk estimation.

A challenge for most ML algorithms is generalization, or how well a model performs on test data (also called unseen data). Models may not perform well on unseen data that is different from the training data’s distribution.
For example, a model that is trained on images collected on rural roads in an arid climate may not achieve the same level of performance on images of urban roads in another country. The transferability of the model depends on how similar the features in the images are. Therefore, before training ML algorithms, it is prudent to consider the diversity of the training data, especially in terms of where, how and when it was collected. It is worth noting that some researchers have found that artificial intelligence and ML algorithms can be easily and accurately applied to different types of urban networks within the same city.40

To determine if using ML fits a budget or can even deliver a cost advantage, it is important to understand associated costs. Costs of using ML can arise from the hiring of experts to develop and program models, as well as from the data collection and preparation process (which includes cleaning and annotation). The cost of storing data (on local hardware or on the cloud) should also be accounted for, especially if the inputs involve big data. Depending on the model and quantity of data being input, and especially if a DL model is employed, you may also need to invest in additional computational resources (graphics processing unit-equipped local computers or nodes on the cloud). Front-end and back-end systems may also need to be established for automatic analysis services.

40 Apostolos Ziakopoulos and George Yannis, “Using AI for Spatial Predictions of Driver Behavior” (presentation, ITF International Transport Forum Roundtable on Artificial Intelligence in Road Traffic Crash Prevention, 2021). https://www.nrso.ntua.gr/geyannis/conf/cp450-using-ai-for-spatial-predictions-of-driver-behavior/

Deploying ML algorithms requires specialized expertise, often in the form of dedicated team members who are ML experts. This need may be addressed by hiring experts and managing the process internally or by acquiring resources externally.
An in-house, “do-it-yourself” approach ensures more control over every aspect of the process, which may be especially important where significant customization or trial and error may be required. However, this approach requires labor and time, and may be more costly in the long run. Using an external resource or tool, on the other hand, may be a faster option but can come at the expense of some visibility and control over the development of the model. It is important to consider these trade-offs to ensure the team is adequately resourced to use ML effectively in the project.

2.3 Big Data, Machine Learning and the Future of Road Safety Assessments

Artificial intelligence presents many exciting possibilities for automation and analysis in transport and infrastructure development. ML is increasingly used for road safety analysis. ML’s inherent capability of managing uncertainties in data and models makes it extremely suitable for solving road safety related issues. Uncertainty is a defining element of crash risk modelling and, in fact, a source of complexity that has thus far limited the usefulness of traditional statistical models. Moreover, ML algorithms such as deep ANN can capture nonlinear patterns in data, making them the first choice for processing road safety big data. Table 9 provides a summary of possible applications of big data and ML in road safety analysis given the current state of the technologies.
TABLE 9: Potential applications of big data and ML in road safety analysis
POTENTIAL APPLICATIONS | HOW BIG DATA CAN HELP | HOW ML CAN HELP
Estimating Road Infrastructure Risk | Video and photo images, APIs, satellite imagery and/or crowdsourced images | Process images to evaluate road attributes; identify road features that could cause crashes; identify risk factors contributing to crash occurrence; identify safety conditions in infrastructure
Traffic Flows | APIs, aerial imagery, open-source traffic data, road sensor data, wireless technology, street cameras, GPS data, mobile devices, real-time traffic data | Process images to classify vehicles, identify congestion hotspots, vehicle detection, or speeds; assess traffic flows; develop risk maps; map the safety performance and Star Rating; traffic flows prediction
Crash Risk Assessment | Meteorology data, geo-located crash data, video and photo images, APIs, open-source traffic data, road sensor data, historical crash data, crowdsourced crash data (e.g., Waze) | Create crash prediction models; develop risk maps; analyze different conflict scenarios and high-risk behavior
Incident Reporting/Crash Data | Video recording, crash data, photo images, crowdsourced data (Google Maps, Waze) | Identify hotspots through clustering techniques
Analyzing Crash Severity | Video and photo images, sensor data | Process images to evaluate road attributes; develop crash prediction models
SOURCE: Original table for this publication.

Combining big data and ML can provide an integrated framework for automatic road safety analysis and management. This framework, demonstrated in figure 10, employs platforms (such as Mapillary) to provide geo-tagged street level imagery for inputs to the DL model to infer useful information (e.g., road characteristics). The DL-inferred data is then combined with multi-source big datasets (e.g., region-specific historical crash data) for better analysis and management of road safety.
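The combination step described above can be sketched as a simple join between DL-inferred road attributes and crash records. This is a minimal illustration under stated assumptions: the segment identifiers, attribute names, crash records and the function `build_feature_table` are all hypothetical and are not the data structures used in the Integrated Framework for Road Risk Prediction.

```python
from collections import Counter

# Hypothetical attributes inferred by a DL model for each road segment.
inferred_attributes = {
    "seg-001": {"lanes": 2, "street_lighting": True, "pedestrian_crossing": True},
    "seg-002": {"lanes": 4, "street_lighting": False, "pedestrian_crossing": False},
    "seg-003": {"lanes": 1, "street_lighting": True, "pedestrian_crossing": False},
}

# Hypothetical historical crash records, already matched to a segment.
crash_records = [
    {"segment_id": "seg-002", "year": 2019},
    {"segment_id": "seg-002", "year": 2020},
    {"segment_id": "seg-001", "year": 2020},
]


def build_feature_table(inferred, crashes):
    """Join inferred attributes with crash counts, one row per road segment."""
    crash_counts = Counter(record["segment_id"] for record in crashes)
    table = []
    for segment_id in sorted(inferred):
        row = {"segment_id": segment_id}
        row.update(inferred[segment_id])
        # Counter returns 0 for segments with no recorded crashes.
        row["crash_count"] = crash_counts[segment_id]
        table.append(row)
    return table


feature_table = build_feature_table(inferred_attributes, crash_records)
for row in feature_table:
    print(row)
```

A table of this shape, with road attributes as inputs (X) and crash counts as the output (Y), is the kind of input a supervised risk model would be trained on. In practice, matching crashes to segments requires geospatial processing that is omitted here.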
FIGURE 10: Framework for automatic road safety analysis and management powered by ML
[Figure: geo-tagged street level images are analyzed by a deep learning model to produce DL-inferred information (lanes, shoulder, street lighting, pedestrian crossings, …). This is combined with complementary information (road curvature, historical crashes, baseline fatalities, …) from (big) data sources (APIs, in-house data, third-party data, …) and passed to methods and tools (iRAP Star Rating Score, RSSAT, RSA, RSIA, SSA, ML models).]
SOURCE: Original figure for this publication.

At present, much of the research and innovation in the use of ML for advanced road safety and risk modelling is being driven by universities and other research institutions. As other stakeholders, such as road safety practitioners, governments, developers of road safety tools and international organizations such as the World Bank look to apply ML in their projects, there is an opportunity to create dedicated tools that would harness big data and ML for road safety analysis. Such applications have the potential to reduce the risk of human error and allow road safety assessments to be mostly, if not fully, automated.

The following section presents practical examples of how big data and ML can assess urban road safety. It applies the integrated framework introduced in section 2.3 to explore the opportunities and limitations of new data sources and assess the ML models. To evaluate the robustness of the proposed framework, the Integrated Framework for Road Risk Prediction was applied in two cities that differ in size, region, and data availability: Bogotá, Colombia, a rapidly urbanizing metropolis in Latin America, and Padang, Indonesia, a secondary city in East Asia. The study found that ML applied to street view imagery identified relevant road (and road user) characteristics to generate a model that predicts road risk with 72.5 percent accuracy in Bogotá.
This framework was applied in Padang to test its replicability; preliminary results are encouraging for its potential to predict road safety for areas with limited crash data. The section concludes with a reflection and guidance for replicability.

PART 3
Case Studies: Applying Big Data and Machine Learning to Assess Road Safety

3.1 Objectives of the Case Studies

This section presents how the Integrated Framework for Road Risk Prediction can be applied in two different cities of interest: Bogotá, Colombia and Padang, Indonesia. The study examines how useful ML is in evaluating road safety and how easily the integrated framework can be replicated. All code is freely available for other teams to use and develop further.41

The objectives of the case studies are to:
1. Learn how well big data and ML can be used to identify road features, estimate road safety, categorize road segments based on their risk level, and identify high-risk segments.
2. Evaluate the utility of several big data sources that are freely available for road safety analysis in diverse geographic areas.42
3. Assess the replicability of the proposed approach.

Located on two different continents, the selected locations offer an opportunity to apply the framework on paved, urban roads in contrasting environments, particularly related to data availability and usability. For example, the government of Bogotá has made significant efforts to increase crash data collection and dissemination. The government offers a public online portal showing the location of each crash over the past year. In addition, there was high coverage for data derived from mobile phones, such as crowd-reported crashes. In contrast, information on the crash locations for Padang could not be found online, and methods for data collection are largely manual or paper based.43 In addition, crowdsourced crash reports from mobile applications were scarce.
As a result, Padang offers the opportunity to explore the utility of ML when data coverage is limited.

41 The code for the Integrated Framework for Road Risk Prediction is open source and accessible on GitHub: https://github.com/datapartnership/IntegratedFrameworkForRoadSafety. However, some datasets require partnership with DDP to access.
42 Freely available meaning at no cost; however, some data sources are not publicly available and require a license.
43 World Bank, Indonesia Public Expenditure Review 2020: Spending for Better Results (Washington, DC: World Bank, 2020). https://openknowledge.worldbank.org/handle/10986/33954

BOGOTÁ AND PADANG: BACKGROUND AND CONTEXT

With a population of more than 7 million, the capital district of Bogotá is Colombia’s largest city. As a critical economic hub with a growing population, Bogotá stands out as one of the most congested cities in the world.44 The government has prioritized road safety and achieved significant gains over the past few decades, reducing the city’s traffic fatality rate by more than 60 percent between 1996 and 2006 alone.45 More recent interventions during the UN Decade of Action for Road Safety include establishing a National Road Safety Plan and a National Road Safety Agency (Agencia Nacional de Seguridad Vial) featuring a National Road Safety Observatory in collaboration with the World Bank.46 In addition, in 2017, the city’s government launched “Vision Zero,” which aimed to implement a range of speed management strategies to eliminate pedestrian and driver fatalities. The program has delivered measurable results, such as a 27 percent reduction in fatalities across corridors where speed limits have been introduced, and further interventions are planned to sustain its impact.47 Despite these initiatives and road safety improvements in Bogotá, challenges remain, and new policies would benefit from timely and affordable analytics on road safety.
Padang is the capital of the Indonesian province of West Sumatra, with a population of around 1 million. The government of Indonesia introduced various initiatives to address road safety during the UN Decade of Action for Road Safety. Established in 2011, the National Road Safety Master Plan achieved a 10 percent reduction in annual road fatalities between 2013 and 2016. However, data collection and management systems that rely on manual screening significantly challenge the country’s progress in road performance and safety.48 Initiatives such as the establishment of the Integrated Road Asset Management System and the World Bank’s new Asia-Pacific Road Safety Observatory present a valuable opportunity for the country to improve its road safety data systems.49 For this case study in Padang, crash data from alternative sources was scarce. Therefore, it offers the opportunity to explore the utility of the pre-trained ML models in a new region with limited data coverage.

44 INRIX 2018 Global Traffic Scorecard. In 2018, drivers lost 272 hours in road congestion.
45 ODI (Overseas Development Institute), “Bogotá,” ODI: Think Change. Accessed October 12, 2021, from https://odi.org/en/about/features/bogot%C3%A1/
46 World Bank, Colombia - Programmatic Productive and Sustainable Cities Development Policy Loans (Washington, DC: World Bank, 2020). http://documents.worldbank.org/curated/en/426591583968971309/Colombia-Programmatic-Productive-and-Sustainable-Cities-Development-Policy-Loans
47 Darío Hidalgo and Claudia Adriazola-Steil, “Bogotá’s Vision Zero Road Safety Plan Is Saving Lives,” TheCityFix, last modified September 26, 2019, https://thecityfix.com/blog/bogotas-vision-zero-road-safety-plan-saving-lives-dario-hidalgo-claudia-adriazola-steil/
48 World Bank, Indonesia Public Expenditure Review 2020: Spending for Better Results.
49 DT Global, “Indonesia: Establishment of Integrated Road Asset Management Systems,” accessed October 4, 2021, https://dt-global.com/projects/irams-dc

3.2 Methodology

The ML-based framework implemented in these case studies was developed to provide a quick screen to evaluate road safety. The framework ascertains road characteristics traditionally collected or annotated to provide a road safety prediction. Two ML models were developed specifically for this framework during these case studies: one to extract road characteristics from street view images and one to determine road risk based on the derived road characteristics. To do so, first, the models needed to be trained to extract road characteristics and determine the road risk based on crash data. Then the models could be applied to make predictions in new areas without crash data. Therefore, there were two phases in this framework: first the training phase to train the models (figure 11), and then the deployment phase to make new predictions with the models (figure 12). Each phase had three steps, and both phases began with data collection and preparation.

OpenStreetMap (OSM), Waze, and Mapillary were used to develop this framework (additional examples of these datasets and related analysis can be found in Annex 3).

The OSM road network provided the foundation for analysis. It is freely available and scalable. OSM uses lines to represent roads and points to represent links among the roads. In OSM, the geometric road lines are split into road segments (called ways) that are connected by the points (called nodes). No modifications were made to the OSM geometry to maintain its synchronicity with other big datasets referencing OSM ways and nodes.

The Waze crash data consists of coordinates representing the location where users of the Waze application are when they see and report a crash.50 The Waze crash points were joined to the nearest OSM road segment (within 20 meters).
For each road segment, the crash frequency, or crash per meter, was calculated to normalize the frequency of crashes. Since OSM road segments vary in length and there could be multiple reports per crash, calculating the crash frequency provided crash trends. To identify road segments with more frequent crashes per meter, the crash frequency was split into high and low risk.

Mapillary was used to obtain street view images, which were primarily collected by the World Bank’s Global Program for Resilient Housing. Since many images are captured along a street, and many images can be linked to a single road segment, the image closest to the centroid of the road segment was selected. The radius for this selection was within three meters of the centroid. This approach standardizes the image selection and classification: one image represents the scene of one road segment. For each OSM road segment, a street view image taken near the centroid of the segment was downloaded using Mapillary API v4.

SOURCE: Original examples for this publication based on data from OSM, Waze, and Mapillary. Copyright OpenStreetMap contributors, Microsoft, Esri Community Maps contributors. Basemap from Esri, HERE, Garmin, METI/NASA, USGS.

50 Data provided by Waze App. Learn more at waze.com.

The Training Phase

The training phase included two steps powered by ML: one to extract information from street view images and one to predict risk level from the extracted data. Each of these steps had an ML model at its core that needed to be trained on data. Together with data preparation, this made three steps in the training phase.

Step 1. Select the region of interest and prepare data

A generalized polygon of the region of interest was used to collect data from OSM, Waze, and Mapillary. The road network database was prepared, and the street view images closest to the centroid of the road segment were downloaded as inputs for the models.
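The data preparation described above (joining crash reports to the nearest road segment within 20 meters, then computing crashes per meter) can be sketched as follows. This is a minimal illustration, not the code from the project repository. It assumes coordinates are already projected to a planar system in meters and that each road segment is a single straight line, whereas OSM ways are generally polylines. The function names, the input layout, and the sample data are hypothetical; the 0.5 threshold is the one reported for Bogotá in section 3.3.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment a-b (planar meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: both endpoints coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to the endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def label_segments(segments, crashes, max_join_m=20.0, threshold=0.5):
    """Return {segment_id: (crashes_per_meter, "high" or "low")}.

    segments: {segment_id: (start_xy, end_xy)} (hypothetical layout)
    crashes:  list of (x, y) crash report locations
    """
    counts = {seg_id: 0 for seg_id in segments}
    for crash in crashes:
        # Find the nearest segment; keep the report only if it is close enough.
        seg_id, dist = min(
            ((sid, point_segment_distance(crash, a, b)) for sid, (a, b) in segments.items()),
            key=lambda item: item[1],
        )
        if max_join_m >= dist:
            counts[seg_id] += 1
    labels = {}
    for seg_id, (a, b) in segments.items():
        length = math.dist(a, b)
        freq = counts[seg_id] / length if length > 0 else 0.0
        labels[seg_id] = (freq, "high" if freq > threshold else "low")
    return labels


# Illustrative data only: two segments and eight crash reports.
segments = {"way_a": ((0, 0), (10, 0)), "way_b": ((0, 100), (50, 100))}
crashes = [(2, 3), (5, -4), (8, 1), (1, 2), (9, 0), (4, 1), (25, 110), (500, 500)]
labels = label_segments(segments, crashes)
print(labels)
```

In this toy example, six reports fall on the 10-meter segment (0.6 crashes per meter, labeled high), one falls on the 50-meter segment (0.02, labeled low), and the report far from both segments is discarded. A production workflow would more likely use a geospatial library with a spatial index rather than this brute-force search.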
FIGURE 11: Training phase for road safety segment analysis using ML
[Figure: OSM (road network), Waze (crash frequency), and Mapillary (geo-tagged street level images) feed a road network database. The deep learning Road Information Collector (RIC) analyzes images to infer information such as lanes, shoulder, street lighting, and pedestrians crossing. The neural network classifier Road Risk Evaluator (RRE) outputs low risk or high risk.]
SOURCE: Original figure for this publication.

Step 2. Develop ML model for identifying road characteristics

The first custom ML model developed for this case study was the Road Information Collector (RIC), shown in figure 11. It is a deep convolutional neural network, Mask R-CNN, which can classify and count objects detected in images.51 The RIC model was trained with images from the updated Mapillary Vistas Dataset (initially released in 2017), which provides detailed characteristics for types of road markings and barriers, traffic lights and signs, and vulnerable road users such as pedestrians, motorcyclists, and bicyclists.52 Other identifiable characteristics include flat terrain, which characterizes road gradient, and the presence of potholes, which could indicate paved, urban road quality. The RIC takes street view images as the input and can detect more than 100 classes of objects as the output (for a complete list of the features the RIC model detects, refer to Annex 4). The model can detect and classify some road features better than others (for the precision score in detecting and classifying the objects, see Annex 5).

51 Kaiming He et al., “Mask R-CNN,” 2017 IEEE International Conference on Computer Vision (2017): 2980-2988.
52 G. Neuhold et al., “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes,” 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 5000-5009, doi: 10.1109/ICCV.2017.534

Step 3. Develop ML model for evaluating road risk

The second ML model developed was the Road Risk Evaluator (RRE).
The RRE is a neural network classifier with two hidden layers; each has 50 neurons. The RRE was trained using paired data for each road segment: the road attributes from the RIC and the assigned road risk from the road network database. Similar work was conducted by a team using a neural network to predict the crash frequency of road segments.53

The Deployment Phase

Once the two ML models are trained, they can be added to an automated workflow in the deployment phase. This means the trained ML models can now predict the risk level for any road segment with the required input data: a street view image. Crash data is not required in the deployment phase.

FIGURE 12: Deployment phase to predict road safety
[Figure: for a region of interest, OSM provides the road network and Mapillary provides a street level image for each road segment. The deep learning Road Information Collector (RIC) analyzes images to infer information such as lanes, shoulder, street lighting, and pedestrians crossing. The neural network classifier Road Risk Evaluator (RRE) outputs low risk or high risk.]
SOURCE: Original figure for this publication.

The deployment phase uses three steps to predict risk within an automated workflow (figure 12).

Step 1. Select the region of interest and download data

For the selected region of interest, the code will download the road network from OSM and calculate the centroid of each road segment. The code will then download from the Mapillary API a street view image taken near the centroid of the road segment.

53 Qiang Zeng et al., “Rule Extraction from an Optimized Neural Network for Traffic Crash Frequency Modeling,” Accident Analysis & Prevention 97 (2016): 87-95.

Step 2. Identify road characteristics

For each road segment, the downloaded image will be fed into the RIC to extract road characteristics. For each image, the RIC will output the numbers of detected objects for each class (refer to Annex 4 for classes). These numbers are put together to form a vector for each image.

Step 3. Evaluate road risk

Each vector produced by the RIC will be fed into the RRE to calculate the risk level: high or low. To illustrate the automated workflow of the deployment phase, figure 13 shows the risk prediction for a road segment. The RIC detected a flat road, car, and motorcycle; based on these detections, the RRE predicted the road segment as low risk. This framework requires no historical crash data to identify high- or low-risk roads.

FIGURE 13: RIC and RRE applied to predict road segment risk
[Figure: RIC detections: construction--flat--road x 1, object--vehicle--car x 1, object--vehicle--motorcycle x 1. RRE risk level: Low.]
SOURCE: Original figure for this publication, based on data from Mapillary and annotated with classifications from the model.

The two case studies presented illustrate the training and deployment phases. The training phase was conducted in Bogotá, where data was collected to train the ML model RRE, while the RIC model was trained on the Mapillary Vistas Dataset. Then the models were applied in the deployment phase to predict the risk level for each road segment in Bogotá, Colombia. The second case study was in Padang, Indonesia. The RIC and RRE models trained in the previous case study were applied directly (i.e., without re-training) in a deployment phase to predict road risk in Padang. This demonstrates that, ideally, there is no need to re-run the training phase for future applications since the RIC and RRE are already trained.

3.3 Case Study 1: Bogotá, Colombia

The Training Phase

Step 1. Select the region of interest and prepare data

In Bogotá, a road network database was created to prepare training data for the ML models. First, a generalized polygon of the region was used to retrieve roads from OSM and six months of crash reports from Waze (July–December 2020). The crashes were joined to the nearest OSM road segment within 20 meters.
The crash frequency, or crash per meter, was calculated and road segments were divided into high risk (crash frequency > 0.5) and low risk (crash frequency ≤ 0.5) in the road network database. For example, a crash frequency of 1 represents one reported crash per meter of road over the six months of Waze data collected. Street view imagery was downloaded using the Mapillary API to collect images close to the centroid of each road segment. Table 10 provides an overview of the data sources for this case study.

TABLE 10: Data used for case study in Bogotá, Colombia

Columns: DATA | SOURCES | ATTRIBUTES | REMARKS

ROAD NETWORK
Source: OSM
Attributes: Road network (road segment length)
Remarks: Provided through an open license.

CRASHES
Source: Waze
Attributes: Road alerts (crashes reported by users, coordinates)
Remarks: Obtained through DDP.

ROAD CHARACTERISTICS
Source: Mapillary (images and tags)
Attributes: Street view image detections (crosswalk, curb, guard rail, human, marking, pothole, sidewalk, sign, streetlight, traffic sign, utility pole)
Remarks: Selection of image annotation tags used in study; more available through Mapillary Traffic Sign and Vistas. Multiple detections per image are possible.

SOURCE: Original table for this publication.

Step 2. Develop ML model for identifying road characteristics

The RIC was developed and trained to perform instance segmentation. It is a deep convolutional neural network that identified the classes, or objects in the image, and provided the count of these classifications. The model was trained on the Mapillary Vistas Dataset with a total of 124 classes (Annex 4).54 The resulting output is a count of the classes identified by the bounding boxes, shown in figure 14, which is represented as a vector of integers.

Training data: Mapillary Vistas Dataset (124 classes)
Input: Street view image near the centroid of a road segment
Output: A vector of integers (each element represents the count of detected objects that belong to a class)

Figure 14 depicts the RIC in action on an image from Bogotá.
The bounding boxes surrounding each object in the image indicate classes the model identified. Confidence levels are provided next to the name of the object segmented by the bounding box. The closer the confidence level is to 1, the higher the confidence in the prediction. Looking at the center of the image, the bicyclist was identified with 0.5 confidence, and other vulnerable road users were recognized, such as a motorcyclist (0.84) and pedestrian (0.75). Vehicles were segmented with high confidence for the bus (0.7), motorcycle (0.88), and car (0.99). The RIC segmented traffic signs, support and utility poles, flat road, and road markings as well.

54 G. Neuhold et al., “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes.”

FIGURE 14: Image segmentation in Bogotá
SOURCE: Original figure for this publication, based on data from Mapillary.

The sample image shows favorable results for image segmentation. The performance of the RIC model in terms of the average precision of the bounding box detection and classification for each class is provided in Annex 5. In the next step, road attribute data extracted through the RIC were inputs for the prediction model to link the road characteristics with the likelihood of a crash in the road networks examined.

Step 3. Develop the ML model RRE for evaluating road risk

To develop the RRE, six study areas in Bogotá, Colombia were selected to reduce computational load. These study areas were drawn to include a wide variety of neighborhoods (poor, rich) and placed throughout the city. They also contain high and low crash frequency road segments and comprehensive street view image coverage. Figure 15 shows the six study areas along with the crash risk from the road network database, high risk (crash frequency > 0.5) and low risk (crash frequency ≤ 0.5).

FIGURE 15: Six study areas and crash frequency in Bogotá

The low- and high-risk road segments in these areas were the training data for the model.
Based on the segment risk derived from the road network database and the characteristics for each road segment derived from the RIC, the model was trained to evaluate a road segment as high or low risk.

Training data: The following input-output pairs obtained from road segments in six study areas in Bogotá, Colombia.
Input: A vector of integers, which is the output of RIC*
Output: 0 (low risk) or 1 (high risk)
* Only 106 out of 124 classes are used as the input to RRE. A total of 18 classes irrelevant to road characteristics, such as sky, bird, etc., were removed from the vector before entering into the RRE.

In searching for an optimal architecture of the neural network, the number of layers and neurons were tested for the best performance. Testing showed that more layers or neurons do not significantly improve the performance on this dataset. The RRE was used to evaluate whether a road segment was low or high risk based on a street view image.

[Figure 15 legend: crash per meter, 0.0 - 0.5 and 0.5 - 3.2]
SOURCE: Original figure for this publication, based on data from OSM and data provided by the Waze App. Learn more at waze.com.

Overall performance of the ML model

Predictions of low-risk road segments were correct 70 percent of the time, and predictions of high-risk road segments were correct 75 percent of the time (figure 16). The mean accuracy and F1-score were both 72.5 percent. The closer the accuracy and F1-score are to 100 percent, the better the performance of the model. For comparison, a random guess in a binary classification is correct about 50 percent of the time, which makes these results promising. These results suggest the model would perform well in contexts similar to Bogotá. If needed, there would be potential to fine-tune the model for increased accuracy and precision in other areas.

FIGURE 16: Confusion matrix showing the accuracy of the RRE model
[Figure: true value (rows) by prediction (columns). True low: 0.7 predicted low, 0.3 predicted high. True high: 0.25 predicted low, 0.75 predicted high.]
SOURCE: Original figure for this publication.
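As a worked check, the reported 72.5 percent figures can be reproduced from the four values in figure 16. The sketch below is illustrative only. It assumes that each row of the matrix is normalized by the true class (each row sums to 1) and that both classes carry equal weight, because the report gives proportions rather than segment counts. Under those assumptions, the mean of the per-class accuracies and the macro-averaged F1-score can be computed directly; the variable and function names are hypothetical.

```python
# Row-normalized confusion matrix from figure 16:
# outer key = true class, inner key = predicted class.
# Assumption: both classes are weighted equally (counts are not reported).
matrix = {
    "low": {"low": 0.70, "high": 0.30},
    "high": {"low": 0.25, "high": 0.75},
}
classes = ["low", "high"]


def per_class_scores(matrix, positive):
    """Precision, recall, and F1 when `positive` is treated as the positive class."""
    others = [c for c in classes if c != positive]
    tp = matrix[positive][positive]
    fn = sum(matrix[positive][c] for c in others)  # positives predicted as another class
    fp = sum(matrix[c][positive] for c in others)  # other classes predicted as positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Mean of the diagonal: average share of correct predictions per true class.
mean_accuracy = sum(matrix[c][c] for c in classes) / len(classes)

# Macro F1: unweighted mean of the per-class F1-scores.
macro_f1 = sum(per_class_scores(matrix, c)[2] for c in classes) / len(classes)

print(round(mean_accuracy, 3), round(macro_f1, 3))
```

Under these assumptions both values round to 0.725, which is consistent with the 72.5 percent reported. If the true class distribution were imbalanced, overall accuracy and F1 would differ from these figures, so the calculation should be repeated with segment counts where they are available.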
TIPS FOR INTERPRETING ML PERFORMANCE

The performance of an ML model can be evaluated using accuracy, precision, recall, and the F1-score. These are derived by counting the correct predictions (true positives and true negatives) and incorrect predictions (false positives and false negatives).

accuracy = correct predictions / all predictions
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives)
F1-score = 2 * (precision * recall) / (precision + recall)

A confusion matrix shows how well the model performed in predicting road risk through a comparative chart of the true positives, true negatives, false positives, and false negatives.

Bogotá Results

Following the three-step workflow of the deployment phase described in section 3.2, road risk was predicted for the entire road network in Bogotá. In total, 98,488 images were processed to make the predictions shown in figure 17. Road segments without an image within 3 meters of the segment centroid were not predicted. Overall, high crash frequency from Waze and high-risk predictions exhibited similarity along some segments, particularly on arterial roads; however, the model tended to moderately overpredict high risk.

FIGURE 17: Road risk prediction in Bogotá
[Figure legend: risk level High, Low, No data]
SOURCE: Original figure for this publication, based on data from Mapillary, OSM and Waze.

3.4 Case Study 2: Padang, Indonesia

The Deployment Phase

The model that was built in Bogotá was applied in Padang. Similar to Bogotá, the road network was accessed through OSM, and street view images were downloaded using the Mapillary API. Waze crash data was joined to the OSM road network to compare with risk predictions. Padang had limited geospatial crash data to validate the model. Table 11 provides a description of the datasets.
TABLE 11: Data used for case study in Padang, Indonesia

Columns: DATA | SOURCES | ATTRIBUTES | REMARKS

ROAD NETWORK
Source: OSM
Attributes: Road network (road segment length)
Remarks: Provided through an open license.

CRASHES
Source: Waze
Attributes: Road alerts (crashes reported by users, coordinates)
Remarks: Obtained through DDP.

ROAD CHARACTERISTICS
Source: Mapillary (images and tags)
Attributes: Street view image detections (crosswalk, curb, guard rail, human, marking, pothole, sidewalk, sign, streetlight, traffic sign, utility pole)
Remarks: Selection of image annotation tags used in study; more available through Mapillary Traffic Sign and Vistas. Multiple detections per image are possible.

SOURCE: Original table for this publication.

Padang Results

In Padang, preliminary results pointed to the framework’s potential in scanning roads for safety. Figure 18 shows predictions where arterial road segments were predominantly designated as high risk (red lines). Residential areas were interspersed with low- and high-risk road segments. These patterns, high-risk predictions along arterial roads and a mix of low- and high-risk predictions along residential and tertiary road segments, were found across most of the network.

FIGURE 18: Road risk prediction in Padang
[Figure legend: risk level High, Low]
SOURCE: Original figure for this publication, based on data from OSM and data provided by the Waze App. Learn more at waze.com. Drone imagery provided by the World Bank Global Program for Resilient Housing.

In general, where there were crashes reported by Waze, high-risk road segments were predicted. These preliminary results were encouraging; however, verifying the results was difficult because there was not sufficient data. While the deployment of the framework in Padang requires further validation with more data, ML-based approaches such as this show promise for initial road safety scans.

3.5 Findings

The Integrated Framework for Road Risk Prediction demonstrates the strength of ML to identify road segment safety with substantial accuracy (72.5 percent) in Bogotá.
Preliminary results in Padang support replicating the framework with further validation in other areas. Using advanced ML techniques, the framework applied a streamlined approach that relied on road characteristics and crash frequency to determine crash risk in the training phase. Then the ML models applied in the deployment phase could predict road risk based on road characteristics without historical crash data.

The alternative data sources used to train the models were robust (thousands of annotations, high-resolution images, and crash data joined to extensive road networks) and of suitable quality for the models to provide a road safety scan. To identify road characteristics, the RIC was trained using the Mapillary Vistas Dataset, which has a breadth and depth of annotations from different contexts, providing geographic diversity. The RRE was trained using a pairing of the road characteristics and a road network database created from OSM road segments and Waze crash data. OSM road segments offered global scalability and were sufficient for a coarse assessment in these case studies. Waze data availability was dependent on the area (and the users of the app). Given the potential for duplicate crash reports, Waze data was not relied on for accurate crash data in Bogotá; instead, it was used to identify crash patterns of high- and low-risk road segments.

The framework is not suitable for detailed road assessments. However, it can be applied to screen roads for safety without historical crash data if the RIC model is enhanced with more training data and calibrated for the local street view context; the RRE model can be modified and enhanced with fine-grained training data. It is replicable in other areas with the following recommendations, which are applicable for developing other ML-based frameworks for road safety.

Incorporate training data to fine-tune the model for a specific location.
Typically, ML models trained on data collected from one region do not work well for a new region. This is called domain shift: the testing data has a different distribution than the training data. In this case, including data collected from the new region in the training phase will usually help. It is important to evaluate the data and consider whether the collection method could introduce bias into the project. For example, if local crash data is introduced to train the RRE, it would help validate and potentially improve the model’s application in the location of interest. Both RIC and RRE can be continually trained with newly obtained data so that the knowledge learned from previous data can be carried on for new regions while the model is still applicable to the previous regions.

It is essential to ensure that models are based on sufficient, high-quality training data. In general, at least a few thousand annotations are recommended to identify objects from images with simple context, depending on the characteristics of the object. Whether the street view images are obtained through big data platforms such as Mapillary or collected by the team, street view imagery covering different geographical regions makes the trained object detection model, like the RIC, more robust. Since street level images capture the visual scene (road characteristics and road users) at a single point in time, it is important to consider that each image is a snapshot of a particular time of day, day of week, and season. Relatedly, a road characteristic may be covered or occluded in a street view image; for instance, when a passing truck blocks a sign. Imagery collected at short intervals, such as every two meters, permits greater flexibility to analyze the road scene and predict risk using the RIC and RRE.

OSM road networks require review for recency and accuracy, and possibly editing to ensure suitable quality and coverage in other areas.
If high-quality, granular crash data shows a clear pattern of more risk classes, three classes could be predicted: for example, high, medium, and low risk.

Conclusion

Big data and ML offer promising opportunities to improve current road safety assessment procedures for sustainable development. Road safety assessments are often required for new transport and infrastructure developments to be approved or as part of their monitoring and evaluation once they are completed. However, conducting road safety assessment procedures can be expensive and time-consuming. Alternative data sources and ML can optimize this process by identifying patterns using complex predictive models. The Integrated Framework for Road Safety offers one approach using street view imagery that can be accessed through Mapillary or collected by the team to provide a road safety scan. With further training, this framework has the potential to provide detailed road safety assessments, mitigating the need for manual annotations (or years of historical crash data). In addition to the pilots and studies conducted by the researchers and representatives of road safety organizations interviewed for this note, there are many ML models contributing to road safety efforts, which typically outperform statistical models in predicting road safety.55

Integrate alternative data sources and ML into road safety assessments with care. Finding valid, representative data can be a significant challenge in evaluating risks and reducing crash fatalities and injuries through data-driven, evidence-based interventions. Teams can directly partner with private companies and data providers to retrieve alternative sources of data. Data sharing platforms, such as DDP, also offer streamlined solutions. However, commercial data sources are not typically established to collect data for road safety analysis, and their data may be inadequate for road safety assessment methods and procedures.
Data can be biased, incomplete, and challenging to synchronize with conventional analytical tools. The implications of collecting and analyzing big data using ML require thorough consideration. Data privacy and security are central concerns; data needs to be de-identified and anonymized and stored according to institutional guidelines.56 Data and models need to be screened for biases that can affect their outcomes. For example, imbalanced access to smartphones or social media may amplify gender or community bias.57 Teams can adhere to best practices and data policies and make their ML models and results transparent and openly shared. Resources such as “A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle” and “The Ethics of Artificial Intelligence” may be helpful for teams implementing ML in their projects.58

The approach used for the case studies in this note can be extended to evaluate specific measures of road safety. For example, while the framework uses the crash frequency and may identify the number of relevant road users in a street view image, it does not thoroughly consider the number of (vulnerable) road users nor does it consider the probability of a crash causing fatalities or serious injuries. The approach could also be extended using complementary data such as road geometry, traffic flow, traffic volume, traffic speed, weather, season, and other factors affecting visibility along the road or road surface conditions. The case studies illustrate the potential of big data and ML to reduce the manual inspection of roadways and provide road safety insight where otherwise the information is in short supply, thereby contributing to safer roads.

55 Philippe Silva, Michelle Andrade, and Sara Ferreira, “Machine Learning Applied to Road Safety Modeling: A Systematic Literature Review,” Journal of Traffic and Transportation Engineering 7, no. 6 (2020): 775-790, https://doi.org/10.1016/j.jtte.2020.07.004
56 World Bank, World Development Report 2021: Data for Better Lives (Washington, DC: World Bank, 2021). doi:10.1596/978-1-4648-1600-0
57 World Bank, Use of AI Technology to Support Data Collection for Project Preparation and Implementation: A ‘Learning-by-doing’ Process (Washington, DC: World Bank, 2021).
58 Harini Suresh and John Guttag, “A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle” in Proceedings of Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘21), https://doi.org/10.1145/3465416.3483305; Nick Bostrom and Eliezer Yudkowsky, “The Ethics of Artificial Intelligence,” in The Cambridge Handbook of Artificial Intelligence, ed. Keith Frankish and William M. Ramsey (Cambridge: Cambridge University Press, 2014): 316-334.

For big data to be fully leveraged for road safety analysis, governments, road safety advocates, and international development organizations will want to consider investing in platforms and tools that specialize in collecting and analyzing data for road safety. Ongoing efforts to establish regional road safety data observatories provide an opportunity to gather data providers and create a data marketplace specifically for road safety analysis, especially where alternative or traditional sources are scarce. Government regulations and initiatives to encourage private companies to share data could further integrate big data in international development projects, including road safety.
It is essential for key stakeholders in road safety assessment to collaborate closely with pioneers of these technologies to realize their potential in road safety analysis.59 The Artificial Intelligence in Road Traffic Crash Prevention Roundtable, hosted by the International Transport Forum (ITF) in early 2021, is an example of one such opportunity.

Conversations with World Bank team leaders and transport specialists reveal high demand for a single, easy-to-use tool to access and utilize big data for road safety analysis. There is potential to automate some of the processing and analysis for which specialist expertise is currently required, and initiatives such as Ai-RAP and the World Bank Simplified Methodology suggest that practical, scalable solutions could be a reality soon.60 As big data and ML become more accessible, and as their adoption accelerates worldwide, road safety practitioners, governments, road safety advocates, and international organizations can unlock their immense potential to improve the quality and efficiency of road safety assessments.

59 Subasish Das and Greg P. Griffin, “Investigating the Role of Big Data in Transportation Safety,” Transportation Research Record 2674, no. 6 (2020): 244–52, https://doi.org/10.1177/0361198120918565
60 Monica Olyslagers (Safe Cities and Innovation Specialist, iRAP) and Satoshi Ogita (Senior Transport Specialist, World Bank), in discussion with the authors, April 2021.

ANNEX 1: Most Relevant Big Data Types for Road Safety Analysis

Street view imagery
  Potential sources: Apple Look Around; Google Street View; KartaView; Mapillary; collected by team.
  Applications: Identify road attributes for road safety assessments.
  Potential advantages: Provides objective evidence of conditions in the field. Can be used in regions where government data is not available.
  Limitations: Coverage is incomplete, particularly in rural and low-income areas. Licensing restrictions for ML application.

Mobile applications and telematics
  Potential sources: Mobile application data; telematic companies; rideshare companies.
  Applications: Identify vehicle movement, traffic flows and road use by various types of users for crash risk identification and road safety assessments.
  Potential advantages: App data is usually low cost and current. Telematic data could show risky driving behavior.
  Limitations: Coverage is lighter in rural areas or cities where use of app is low. Often requires data sharing agreements with private companies.

Crowdsourced
  Potential sources: Waze; delivery drivers; OSM; social media.
  Applications: Obtain crash data and information related to road use, such as types of road users and their relative density at a specific location. Can help to identify road risks.
  Potential advantages: Can supplement government data, particularly if incidents are underreported or government-provided road networks are unavailable.
  Limitations: Requires app use in the region of interest. Needs coordination and resources to collect reports from delivery drivers. Data quality may be low. Social desirability bias can occur, where users feel inclined to share specific types of information to reinforce a positive or negative perspective.

Government
  Potential sources: Government transport agencies; road safety observatories.
  Applications: Most frequently used to obtain crash data, including statistics related to crash severity, crash frequency as well as fatalities and injuries statistics.
  Potential advantages: Data often has many attributes or details that have been manually added. Data often has been collected for many years in the same manner, allowing for temporal analysis.
  Limitations: Data can be messy (human error). Data often not shared.

Aerial and satellite imagery
  Potential sources: Earth observation agencies; private companies.
  Applications: Identify road attributes for road safety assessments.
  Potential advantages: Covers large geographic area.
  Limitations: Requires balancing the cost with recency and granularity of imagery.

Meteorological sensors
  Potential sources: Meteorological agencies; local universities; private companies.
  Applications: Review weather conditions that may affect road safety, such as crashes.
  Potential advantages: Infer driving conditions (i.e., if road surface conditions are not available in government crash data).
  Limitations: There are varying levels of granularity.

SOURCE: Original table for this publication.

ANNEX 2: Overview of Big Data Sources

STREET VIEW IMAGERY

Attributes (all street view imagery sources): Requires processing to derive physical features related to road safety, such as: crosswalks, speedbumps, painted lines, roads, road shoulders, sidewalks, streetlights, traffic signs and others specific to region of interest.

Apple Look Around
  Access: Early stages; contact company
  Resolution and format: Image
  Cost: N/A
  Comments: Offers extremely limited geographic coverage.

Google Street View
  Access: Not accessible according to license
  Resolution and format: 360 photos must be at least 4K (image)
  Cost: N/A
  Comments: Global coverage is fairly extensive.

KartaView
  Access: Open license
  Resolution and format: Depends on camera (image)
  Cost: Free
  Comments: Images are free, though image processing is required (see street view training data); global coverage is variable.

Mapillary
  Access: Publicly available
  Resolution and format: Depends on camera (image)
  Cost: Free
  Comments: Images are free, though image processing is required (see street view training data); global coverage is variable.

Collected by team
  Access: Requires permission and coordination with local government
  Resolution and format: Depends on camera (image or video)
  Cost: High
  Comments: Collection every two meters recommended for images. Images or video require processing; see street view training data.

STREET VIEW TRAINING DATA

Mapillary Traffic Sign
  Access: Attribution-NonCommercial-ShareAlike 4.0 International License
  Attributes: Traffic signs
  Resolution and format: Resolution can be very high or very low. The model performs best on images with the same resolution level of the training dataset. (image)
  Cost: Free
  Comments: More than 300 traffic sign classes covering six continents.

Mapillary Vistas
  Access: Attribution-NonCommercial-ShareAlike 4.0 International License
  Attributes: Physical features related to road: crosswalks, speedbumps, painted lines, roads, road shoulders, sidewalks, streetlights, traffic signs (others possible)
  Resolution and format: Resolution can be very high or very low. The model performs best on images with the same resolution level of the training dataset. (image)
  Cost: Free
  Comments: Coverage spans six continents.

Annotation by team
  Access: Hire a team
  Attributes: Physical features related to road, specific to region of interest: crosswalks, speedbumps, painted lines, roads, road shoulders, sidewalks, streetlights, traffic signs (others possible)
  Cost: High
  Comments: Consider collaborating with stakeholders in a region of interest to label images using a Computer Vision Annotation Tool (CVAT) or a labeling team with training. 2,000 labels per class is recommended for a simple classification.

World Bank’s GRSF Road Risk Assessment software±
  Access: Open source
  Attributes: Physical features related to road: road grade and curvature, pedestrian crossings, delineation, roadside severity, lane width, and number of lanes
  Cost: Free
  Comments: Video analysis produces a richer dataset. Piloted in Liberia and Mozambique.

± The software is included in this section as video training data is limited in World Bank countries. Contact Satoshi Ogita (World Bank) for access.

MOBILE APPLICATIONS AND TELEMATICS

Grab
  Access: Contact company
  Attributes: Contact company
  Resolution and format: N/A
  Cost: N/A
  Comments: Coverage offered in Cambodia, Indonesia, Malaysia, Myanmar, Philippines, Singapore, Thailand, Vietnam.

HERE
  Access: Not accessible according to standard license
  Attributes: Traffic: current and historical speeds, jams, crashes, road closures and road construction
  Resolution and format: Every minute (text, number)
  Cost: N/A
  Comments: Detailed road network coverage in more than 200 countries and comprehensive traffic speeds in more than 80 countries.

Mapbox
  Access: Contact company
  Attributes: Movement
  Resolution and format: Aggregated daily or monthly at 100 meter resolution (text, number)
  Cost: N/A
  Comments: Movement activity index; driving activity index available in select locations.

Mapbox Traffic
  Access: Contact company
  Attributes: Traffic (typical speed): each road segment, identified by a start and end node, has 2,016 typical speed predictions (7 days × 24 hours × 12 five-minute periods)
  Resolution and format: Typical speed per road segment in five-minute increments over a week (text, number)
  Cost: N/A
  Comments: Available through Enterprise plan; licensed annually for specific geographic region.

Moovit
  Access: Contact company
  Attributes: Urban transit (public and on-demand)
  Resolution and format: Contact company
  Cost: N/A

Ola Cabs
  Access: Contact company
  Attributes: Travel time and potholes
  Resolution and format: Contact company
  Cost: N/A
  Comments: Coverage provided in India.

Orbital Insight
  Access: Contact company
  Attributes: Foot traffic: time of day, day of week, velocity (stationary, walking), dwell time
  Resolution and format: Each minute; 2019 to present (text, number)
  Cost: N/A
  Comments: Foot traffic using mobile location data in region of interest, subject to data availability per country.

TomTom
  Access: Contact company
  Attributes: Traffic: current and historical speeds, jams, crashes, road closures and road construction
  Resolution and format: Every minute per road segment (text, number)
  Cost: Free to Medium
  Comments: Global coverage is variable.

Uber Movement
  Access: Contact company
  Attributes: Traffic: travel times between zones, average speed per segment and traffic density
  Resolution and format: Average travel time, average speeds per hour, time of day or quarter of year (text, number)
  Cost: Free
  Comments: Limited geographic coverage to a selection of major cities. Currently no API.

Unacast
  Access: Contact company
  Attributes: Human movement
  Resolution and format: Coordinates, horizontal accuracy, timestamp, time zone (text, number)
  Cost: N/A

Veraset
  Access: Contact company
  Attributes: Human movement
  Resolution and format: Coordinates, horizontal accuracy, timestamp (text, number)
  Cost: N/A
  Comments: Veraset Movement covers 150 countries.

Waze
  Access: Contact company to become a partner
  Attributes: Traffic (alerts, jams, irregularities): major and minor crashes; severity of congestion or irregularities; current and typical speed on jammed segments; coordinates, road segment (start and end node), street name; road type; driving direction (NSEW); turn type; alerts (construction, road closure and weather)
  Resolution and format: Every minute; location provided as coordinates, road segment, street name (text, number)
  Cost: Free for partners
  Comments: Includes weather alerts and major and minor crashes by application users; see Waze under Crowdsourced section.

WhereIsMyTransport
  Access: Contact company
  Attributes: Informal transit network
  Resolution and format: Determined in collaboration with team
  Cost: Medium to High
  Comments: Specializes in producing informal transit data according to General Transit Feed Specifications (GTFS). Supports team in collecting and processing data in exchange for the team covering in-field costs of data collection and facilitating engagement with local transport authorities.

CROWDSOURCED

OSM
  Access: Open license
  Attributes: Road segments (road type, length) and road features
  Resolution and format: Centerline of road segments, referred to as ways and relations (text, number)
  Cost: Free
  Comments: May include additional road attributes: lanes, name, smoothness, surface, speed limit, and width, and other information such as overtaking permitted or lighting.

Twitter
  Access: API
  Attributes: Road incidents tweeted
  Resolution and format: User-dependent; can be associated with a place or location (text, number)
  Cost: Free to medium
  Comments: Price dependent on account type and data volume.

Waze
  Access: Contact company to become a partner
  Attributes: Road incidents reported using app
  Resolution and format: Every minute; location provided as coordinates, road segment, street name (text, number)
  Cost: Free for partners

Delivery drivers
  Access: Coordinated by team
  Attributes: Road incidents reported using app
  Resolution and format: Depends on collection (text, number)
  Cost: High

GOVERNMENT

Government or road safety observatory
  Access: Government contact or open data platform
  Incidents (date, time, severity, type)
    Resolution and format: XY coordinate per incident (text, number)
    Cost: Free to Low
    Comments: Processing requires standard GIS software such as ArcGIS (paid) or QGIS (free). Storage is small, typically less than 1 GB per urban area over multiple years.
  Road segments (type, width, speed limit)
    Resolution and format: Road segments (text, number)
    Cost: Low
  Traffic lights (intersection type)
    Resolution and format: XY coordinate per traffic light (text, number)
    Cost: Low
    Comments: May include intersection type (pedestrian, bicyclist, for example).

REMOTE SENSING

Maxar Technologies
  Access: Contact company
  Attributes: Elevation and roads
  Resolution and format: Less than 1m (image)
  Cost: High
  Comments: Requires processing to derive road networks.

Orbital Insight
  Access: Contact company
  Attributes: Car and truck count; roads
  Resolution and format: Car and truck count: high resolution, 2013 to present; roads: medium resolution, 2016 to present (image, number)
  Cost: N/A
  Comments: Car and truck count derived from satellite imagery. Limited Geospatial Intelligence Platform credits to derive roads in region of interest; not for routable road networks; not suitable for narrow roads in urban areas or dirt or mountainous roads in rural areas.

Security or traffic cameras
  Access: Collected by team or through external resource
  Attributes: Traffic density and volume
  Resolution and format: Depends on camera (image or video)
  Cost: Medium to High

Unmanned aerial vehicle (UAV)
  Access: Collected by team
  Attributes: Elevation, roads, traffic density and volume
  Resolution and format: Depends on camera (image or video)
  Cost: Medium to High
  Comments: Recent research suggests traffic density and volume are possible to calculate.

METEOROLOGICAL SENSORS

OpenWeather
  Access: Contact company
  Attributes: Weather (weather type, temperature, wind speed and direction, cloud coverage; rain and snow volume by hour and per 3 hours)
  Resolution and format: 40-year historical archive for any coordinates by the hour; or by city or 1 km, 5 km, 10 km or customized grid (text, number)
  Cost: Low
  Comments: Price is economical for the 40-year history of a single coordinate or city. Contact provider for details on pricing and to download many locations.

Tomorrow.io
  Access: Contact company
  Attributes: Weather (weather type, temperature and humidity; wind speed, direction, gust; precipitation type, intensity; snow and ice accumulation; visibility, moon phase)
  Resolution and format: 500m radius with precipitation recordings as low as 30 feet off the ground; time steps range from one day to one minute (text, number)
  Cost: N/A

SOURCE: Original table for this publication.

ANNEX 3: Hotspots and Heatmaps: Uncovering Data Patterns for Road Safety

Data visualizations are provided for the case study regions using alternative data sources, such as OSM, Mapbox, and Waze, as well as a select government dataset.

Bogotá, Colombia

Temporal data visualizations show road safety patterns across years, seasons, months, weeks, days, and times of day. The Waze crash data used to train the ML model covered a period of six months, from July through December 2020. It was anticipated that the pandemic would affect the number of Waze crash reports, and potentially traffic patterns, as crashes reported by the government noticeably decreased compared to prior years (figure 3.1). The government dataset revealed fewer incidents starting in March 2020, suggesting that the number of crashes was affected by the pandemic, though it is worth noting that the speed limit was also reduced from 60 km/h to 50 km/h in May 2020 (figure 3.2). With this in mind, the Waze data was used to identify road safety trends.
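The temporal groupings described above (by month and year, and by day of week and hour of day) amount to counting report timestamps by a derived key. The following is a minimal sketch in Python, not the workflow used for the figures; the timestamps are made up for illustration and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical crash report timestamps (ISO 8601, local time); not real data.
REPORTS = [
    "2020-07-03T18:15:00",
    "2020-07-03T18:40:00",
    "2020-07-05T09:05:00",
    "2020-08-14T18:55:00",
    "2020-08-16T02:30:00",
]

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def count_by_month(timestamps):
    """Count reports per (year, month), as in a crashes-per-month chart."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        counts[(t.year, t.month)] += 1
    return counts


def count_by_weekday_hour(timestamps):
    """Count reports per (day of week, hour of day)."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        counts[(DAY_NAMES[t.weekday()], t.hour)] += 1
    return counts


monthly = count_by_month(REPORTS)
weekly = count_by_weekday_hour(REPORTS)
busiest_slot, busiest_count = weekly.most_common(1)[0]
print(busiest_slot, busiest_count)
```

Comparing the monthly counts of one year against the same months in earlier years is one simple way to see a drop such as the one reported from March 2020.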
FIGURE 3.1: Road crashes with damage, injury or death in Bogotá, 2016–2020
[Chart values for 2016, 2017, 2018, 2019 and 2020. With damage: 23,530; 23,775; 22,606; 21,260; 12,874. With injury: 10,412; 10,096; 11,857; 11,799; 8,015. With death: 567; 536; 485; 491; 371.]
SOURCE: Original figure for this publication, based on data from Datos Abiertos Secretaría Distrital de Movilidad.

FIGURE 3.2: Road crashes per month in Bogotá, 2016–2020
[Chart: road crashes per month, January to December, for each year from 2016 to 2020. Legend classes: ≤ 1,500; ≤ 2,000; ≤ 2,500; ≤ 3,000; ≤ 3,256.]
SOURCE: Original figure for this publication, based on data from Datos Abiertos Secretaría Distrital de Movilidad.

Hotspot analysis groups crash locations to determine statistically significant clusters of crashes. Government and Waze datasets were analyzed during the same six-month window (figure 3.3). Between the two datasets, similar hotspots were found near Avenida Boyacá and Calle 6 along the highway in the south, Avenida Norte-Quito-Sur (NQS). Overall, Waze had more hotspots than the government dataset. Some minor road incidents captured by Waze may have gone unreported to the police. This trend can be seen in minor collisions clustering further north in the city, a cluster that does not appear in the government data. Instead, clusters of government-reported crashes with only damage (no injury or fatality) appear in a central band. Approaches to identifying hotspots vary, including in the clustering method and the size, shape, and search area of neighboring hotspots.

FIGURE 3.3: Hotspot analysis of government and Waze crash data in Bogotá, July–December 2020
[Maps, six panels: Government (all crashes); Government (death or injury); Government (damage only); Waze (all crashes*); Waze (major); Waze (minor). Legend: cold spot confidence 99%, 95%, 90%; not significant; hot spot confidence 90%, 95%, 99%.]
*Includes major and minor crashes, as well as those not categorized as either type.
SOURCE: Original figure for this publication, based on data from Datos Abiertos Secretaría Distrital de Movilidad and the Waze App. Learn more at waze.com. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.

As with other alternative sources of data derived from mobile devices and apps, Waze crash reports are influenced by the location of the users, which affects where and when the crashes are reported. While Waze data distinguishes major and minor incidents, the dataset does not include the additional crash details typically obtained from an official source, such as type, severity, class, and reason. Even though users can validate reports (e.g., thumbs up) to provide a confidence and reliability rating and can flag false reports, there is potential for duplication in Waze data. Deduplication was not conducted for this analysis because this study was interested in relative crash patterns.

Temporal patterns become identifiable when major crashes are aggregated by the day of the week and hour of the day (figure 3.4). In Bogotá, major crash reports increased between 6 and 7 p.m., with the most crashes during this window on Friday. Fewer incidents occurred on Sunday.

FIGURE 3.4: Major crashes reported on Waze in Bogotá, July–December 2020
[Chart: major crash reports by day of week (Mon to Sun) and hour of day (0 to 23). Legend classes: ≤ 100; ≤ 200; ≤ 300; ≤ 400; ≤ 507.]
SOURCE: Original figure for this publication, based on data provided by the Waze App. Learn more at waze.com.

Spatial and temporal analysis can be combined to identify areas for closer inspection that exhibit patterns over time. This is valuable given human movement or behavioral changes, including the effects of a pandemic, road construction, or updated speed limits, during the examined period. Emerging hotspot analysis reviews clusters of crashes that are consistent over time and ones that are intensifying or diminishing (figure 3.5).61 In this example, each week was analyzed.
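Hotspot analysis of the kind described in this annex scores each location against its neighbors. The text does not name the statistic used; the sketch below assumes a Getis-Ord Gi* z-score with binary weights on a small grid of made-up crash counts, and maps z-scores to the 90, 95 and 99 percent confidence levels shown in the figure legends. The function names, the grid and the neighborhood definition are illustrative assumptions, not the configuration used for the figures.

```python
import math


def gi_star(counts, neighbors):
    """Return a Getis-Ord Gi* z-score per cell using binary weights.

    counts: dict mapping cell id -> crash count
    neighbors: dict mapping cell id -> list of cell ids in its
               neighborhood, including the cell itself
    """
    n = len(counts)
    mean = sum(counts.values()) / n
    std = math.sqrt(sum(v * v for v in counts.values()) / n - mean ** 2)
    scores = {}
    for cell, nbrs in neighbors.items():
        w = len(nbrs)  # sum of binary weights (also the sum of squared weights)
        local_sum = sum(counts[j] for j in nbrs)
        spread = std * math.sqrt((n * w - w ** 2) / (n - 1))
        scores[cell] = (local_sum - mean * w) / spread
    return scores


def confidence_label(z):
    """Map a z-score to a hot or cold spot label (two-sided thresholds)."""
    for threshold, level in ((2.576, "99%"), (1.960, "95%"), (1.645, "90%")):
        if abs(z) >= threshold:
            return ("Hot Spot " if z > 0 else "Cold Spot ") + level
    return "Not significant"


# Made-up 5 x 5 grid: one crash per cell, except a 2 x 2 cluster with ten each.
SIZE = 5
counts = {(r, c): 1 for r in range(SIZE) for c in range(SIZE)}
for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    counts[cell] = 10

# Assumed neighborhood: the cell itself plus every cell that touches it.
neighbors = {
    (r, c): [(r + dr, c + dc)
             for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if 0 <= r + dr < SIZE and 0 <= c + dc < SIZE]
    for (r, c) in counts
}

scores = gi_star(counts, neighbors)
labels = {cell: confidence_label(z) for cell, z in scores.items()}
print(labels[(0, 0)], "|", labels[(4, 4)])
```

Changing the neighborhood (for example, a wider search distance) changes which cells are flagged, which is why the choice of clustering method and search area matters when comparing hotspot maps.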
Intensifying hotspot areas were statistically significant hotspots for 90 percent of the weeks analyzed, including the final week, with hotspot intensity increasing over time.

FIGURE 3.5: Emerging hotspot analysis of Waze crashes in Bogotá, July–December 2020
SOURCE: Original figure for this publication, based on data provided by the Waze App. Learn more at waze.com. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.

61 For a complete list of definitions, see “How Emerging Hot Spot Analysis Works”: https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/learnmoreemerging.htm

If interventions or investments target a specific road, more geographically detailed information is required to make decisions. Hotspot analysis applied to road segments visualizes statistically significant crash frequencies along roads, as shown in figure 3.6.

FIGURE 3.6: Hotspot analysis using Waze crash frequencies in Bogotá, July–December 2020
[Map legend: hot spot confidence 99%, 95%, 90%; not significant.]
SOURCE: Original figure for this publication, based on data provided by OSM and the Waze App. Learn more at waze.com.

Padang, Indonesia

Heatmaps visualize the density of crashes. While Waze data was sparse in Padang, some spatial patterns could be detected. A heatmap shows at least three distinct areas of high crash density that could be further examined during a site inspection (figure 3.7).

FIGURE 3.7: Heatmap of crashes reported using the Waze app in Padang, April 2019–July 2021
[Map legend: high density (yellow) to low density (black).]
SOURCE: Original figure for this publication, based on data provided by the Waze App. Learn more at waze.com. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.

Road safety assessments may require operating speeds of road segments. Mapbox collects this data from mobile devices and provides typical speeds per road segment in 5-minute increments. In Padang, Mapbox speeds were visualized for a Thursday from 5:00 p.m. to 6:00 p.m. (figure 3.8).
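Annex 2 describes the Mapbox typical-speed product as 2,016 predictions per road segment (7 days × 24 hours × 12 five-minute periods). The following is a minimal sketch of how such a weekly profile could be indexed to extract a single hour, as was done for the Thursday evening window. The Monday-first ordering, the flat list layout and all names are assumptions for illustration; they do not describe the Mapbox file format.

```python
PERIODS_PER_HOUR = 12   # five-minute periods in one hour
HOURS_PER_DAY = 24
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # assumed ordering

PERIODS_PER_DAY = HOURS_PER_DAY * PERIODS_PER_HOUR
PERIODS_PER_WEEK = len(DAYS) * PERIODS_PER_DAY  # 7 x 24 x 12


def period_index(day, hour, minute):
    """Position of a five-minute period within the weekly profile."""
    return DAYS.index(day) * PERIODS_PER_DAY + hour * PERIODS_PER_HOUR + minute // 5


def mean_speed(profile, day, start_hour, end_hour):
    """Average typical speed between two whole hours on the same day."""
    first = period_index(day, start_hour, 0)
    last = first + (end_hour - start_hour) * PERIODS_PER_HOUR
    window = profile[first:last]
    return sum(window) / len(window)


# Made-up weekly profile for one road segment: 40 km/h throughout, slowing to
# 20 km/h during the Thursday 5:00 p.m. to 6:00 p.m. hour.
profile = [40.0] * PERIODS_PER_WEEK
start = period_index("Thu", 17, 0)
for i in range(start, start + PERIODS_PER_HOUR):
    profile[i] = 20.0

thursday_peak = mean_speed(profile, "Thu", 17, 18)
thursday_morning = mean_speed(profile, "Thu", 8, 9)
print(PERIODS_PER_WEEK, thursday_peak, thursday_morning)
```

Averaging the twelve five-minute values gives one speed per segment for the hour, which can then be grouped into display classes for mapping.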
Because speed limits were sparsely noted in OSM, the OSM road type was used to group roads into minor and major roads as a proxy for a low or high speed limit; minor roads are visualized with thinner lines than major roads. The average speed typically slowed near intersections, shown in pink (below 25 km/h), when compared to major roads in purple (25–50 km/h). High-speed road segments exceeding 50 km/h are found heading north and south along Jalan By Pass. Identifying road segments with high speeds using Mapbox supports road safety assessments and the implementation of speed management or traffic calming measures.

FIGURE 3.8: Mapbox typical speeds in Padang on Thursday, 5:00 p.m. to 6:00 p.m.
[Map legend: speed (km/h) 50.1–64.8; 25.1–50.0; 0.1–25.0; no data.]
SOURCE: Original figure for this publication, based on data provided by Mapbox. Basemap provided by Esri, HERE, Garmin, METI/NASA, USGS.

ANNEX 4: Classes Detected Using Mapillary Vistas Dataset in RIC Model and Input Classes for the RRE Model

All classes listed were detected using the Mapillary Vistas Dataset. Classes in bold were the input for the RRE Model.

animal--bird
animal--ground-animal
construction--barrier--ambiguous
construction--barrier--concrete-block
construction--barrier--curb
construction--barrier--fence
construction--barrier--guard-rail
construction--barrier--other-barrier
construction--barrier--road-median
construction--barrier--road-side
construction--barrier--separator
construction--barrier--temporary
construction--barrier--wall
construction--flat--bike-lane
construction--flat--crosswalk-plain
construction--flat--curb-cut
construction--flat--driveway
construction--flat--parking
construction--flat--parking-aisle
construction--flat--pedestrian-area
construction--flat--rail-track
construction--flat--road
construction--flat--road-shoulder
construction--flat--service-lane
construction--flat--sidewalk
construction--flat--traffic-island
construction--structure--bridge
construction--structure--building
construction--structure--garage
construction--structure--tunnel
human--person--individual
human--person--person-group
human--rider--bicyclist
human--rider--motorcyclist
human--rider--other-rider
marking--continuous--dashed
marking--continuous--solid
marking--continuous--zigzag
marking--discrete--ambiguous
marking--discrete--arrow--left
marking--discrete--arrow--other
marking--discrete--arrow--right
marking--discrete--arrow--split-left-or-straight
marking--discrete--arrow--split-right-or-straight
marking--discrete--arrow--straight
marking--discrete--crosswalk-zebra
marking--discrete--give-way-row
marking--discrete--give-way-single
marking--discrete--hatched--chevron
marking--discrete--hatched--diagonal
marking--discrete--other-marking
marking--discrete--stop-line
marking--discrete--symbol--bicycle
marking--discrete--symbol--other
marking--discrete--text
marking-only--continuous--dashed
marking-only--discrete--crosswalk-zebra
marking-only--discrete--other-marking
marking-only--discrete--text
nature--mountain
nature--sand
nature--sky
nature--snow
nature--terrain
nature--vegetation
nature--water
object--banner
object--bench
object--bike-rack
object--catch-basin
object--cctv-camera
object--fire-hydrant
object--junction-box
object--mailbox
object--manhole
object--parking-meter
object--phone-booth
object--pothole
object--sign--advertisement
object--sign--ambiguous
object--sign--back
object--sign--information
object--sign--other
object--sign--store
object--street-light
object--support--pole
object--support--pole-group
object--support--traffic-sign-frame
object--support--utility-pole
object--traffic-cone
object--traffic-light--general-single
object--traffic-light--pedestrians
object--traffic-light--general-upright
object--traffic-light--general-horizontal
object--traffic-light--cyclists
object--traffic-light--other
object--traffic-sign--ambiguous
object--traffic-sign--back
object--traffic-sign--direction-back
object--traffic-sign--direction-front
object--traffic-sign--front
object--traffic-sign--information-parking
object--traffic-sign--temporary-back
object--traffic-sign--temporary-front
object--trash-can
object--vehicle--bicycle
object--vehicle--boat
object--vehicle--bus
object--vehicle--car
object--vehicle--caravan
object--vehicle--motorcycle
object--vehicle--on-rails
object--vehicle--other-vehicle
object--vehicle--trailer
object--vehicle--truck
object--vehicle--vehicle-group
object--vehicle--wheeled-slow
object--water-valve
void--car-mount
void--dynamic
void--ego-vehicle
void--ground
void--static
void--unlabeled

ANNEX 5: Average Precision of the Bounding Box Detection and Classification

An Average Precision (AP) score closer to 100 indicates a better performance in correctly detecting and classifying an object. AP scores equal to zero mean that no data is available.

Glossary of Terms

Big Data: Large data sets that require significant processing power and/or complex computational techniques to reveal patterns, trends, and correlations.
Development Data Partnership (DDP): A partnership between international organizations and companies, created to facilitate the use of third-party data in research and international development.
Deep Learning (DL): A branch of artificial intelligence that involves creating algorithms for deep artificial neural networks, inspired by the human brain, to learn complex patterns from high dimensional and large quantities of data.
Fatalities and Serious Injuries (FSI): A metric of those killed or seriously injured in a traffic crash which is used to monitor traffic safety performance. Fatalities are defined as those who die within 30 days of the crash.
Intelligent Transport System (ITS): The collection, analysis, and transmission of transportation, vehicle, and infrastructure data that informs users with real-time updates and improves future operations and predictions.
Internet of Things (IoT): Devices that are connected to the internet to send and/or receive data.
Machine Learning (ML): Method to systematically derive patterns, identify trends, and make conclusions from data with minimal human intervention.
Neural Network: A set of connected algorithms typically organized in three layers: input layer, hidden layer(s), and an output layer.
Road Crash: The collision of a vehicle with another entity, such as a car, bicycle, stationary object, pedestrian, or animal, that causes injury or damage to one or more of the entities on a road or road-related area.
Road Safety: System to reduce risks to road users, preventing death or injury.
Road Safety Assessments: Systematic review of the current road or traffic scheme to identify hazardous areas.
Road Safety Audit (RSA): Independent, systematic evaluation of the modification or addition to the road or traffic scheme to determine the crash potential and safety performance for all road users.
Road Safety Impact Assessment (RSIA): The safety performance ranking of planned road construction or modification design schemes and their effect on the surrounding road network.
Road Safety Observatory (RSO): A regional network of government representatives that facilitates the sharing and exchange of road safety data and expertise.
The World Bank operates RSOs in Latin America (OISEVI), Africa (ARSO), and Asia-Pacific (APRSO).
Safe System: An approach to road safety that integrates principles for safer vehicles, safer roads, and safer users to eliminate death and serious injuries.
Supervised Learning: A machine learning task using labeled data to train the model with input-output pairs.
Unsupervised Learning: A machine learning technique that extracts patterns from unlabeled data. For example, grouping or clustering data with similar attributes.
Vulnerable Road Users: Individuals at a higher risk using the road because they do not have the protection of an enclosed vehicle, such as pedestrians, motorcyclists, bicyclists, and those on animals or animal drawn carts.

References

Allan, Phil. “Road Safety Inspections.” (presentation, Road Safety Seminar, World Road Association, Lomé, Togo: October 2006). https://www.piarc.org/ressources/documents/actes-seminaires06/c31-togo06/8718,2-PIARC_Oct06_Allan.pdf

Australian BITRE (Bureau of Infrastructure and Transport Research Economics). “Australian Road Deaths Database (ARDD).” Australian BITRE. Updated May 13, 2021. https://data.gov.au/data/dataset/australian-road-deaths-database

Bedoya Arguelles, Guadalupe, Svetoslava Petkova Milusheva, Arianna Legovini, and Sarah Elizabeth Williams. “Smart and Safe Kenya Transport (SMARTTRANS).” Washington, DC: World Bank, 2019. https://documents1.worldbank.org/curated/en/723411574361015073/pdf/Smart-and-Safe-Kenya-Transport-SMARTTRANS.pdf

Bliss, Tony, and Jeanne Breen. “Meeting the Management Challenges of the Decade of Action for Road Safety.” IATSS Res. 35 (2012): 48–55. https://doi.org/10.1016/j.iatssr.2011.12.001

Bostrom, Nick and Eliezer Yudkowsky. “The Ethics of Artificial Intelligence.” In The Cambridge Handbook of Artificial Intelligence, edited by Keith Frankish and William M. Ramsey, 316-334. Cambridge: Cambridge University Press, 2014.

Das, Subasish and Greg P. Griffin. “Investigating the Role of Big Data in Transportation Safety.” Transportation Research Record 2674, no. 6 (2020): 244–52. https://doi.org/10.1177/0361198120918565

Diop, Makhtar. “All Road Deaths Are Preventable. We Can Make It Happen.” World Bank. Accessed May 14, 2021. https://blogs.worldbank.org/transport/all-road-deaths-are-preventable-we-can-make-it-happen

DT Global. “Indonesia: Establishment of Integrated Road Asset Management Systems.” Accessed October 4, 2021. https://dt-global.com/projects/irams-dc

Google. “Google Maps, Google Earth, and Street View.” Accessed May 14, 2021. https://about.google/brand-resource-center/products-and-services/geo-guidelines/

He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask R-CNN.” 2017 IEEE International Conference on Computer Vision (2017): 2980-2988.

Hidalgo, Darío and Claudia Adriazola-Steil. “Bogotá’s Vision Zero Road Safety Plan Is Saving Lives.” TheCityFix. Last modified September 26, 2019. https://thecityfix.com/blog/Bogotas-vision-zero-road-safety-plan-saving-lives-dario-hidalgo-claudia-adriazola-steil/

Institute for Transportation and Development Policy. “Pune, India Wins 2020 Sustainable Transport Award.” Last modified June 27, 2019. https://www.itdp.org/2019/06/27/pune-india-wins-2020-sustainable-transport-award/

International Transport Forum. “Best Practice for Urban Road Safety: Case Studies.” International Transport Forum Policy Papers, no. 76 (2020).

International Transport Forum. Zero Road Deaths and Serious Injuries: Leading a Paradigm Shift to a Safe System. Paris: OECD Publishing, 2016. https://doi.org/10.1787/9789282108055-en

Krambeck, Holly, Magreth Kakoko, and Mireille Raad. Using Computer Vision to Automatically Detect Road Features for Road Safety Audits and Assessments: Inception Report. Washington, DC: World Bank, 2019.

Lovón-Melgarejo, Jesús, Alonso Tenorio-Trigoso, Manuel Castillo-Cara, and Daniel Miranda. “Identification of Risk Zones for Road Safety through Unsupervised Learning Algorithms.” In 16th LACCEI International Multi-Conference for Engineering, Education, and Technology: Innovation in Education and Inclusion, Lima, Peru, July 2018. http://www.laccei.org/LACCEI2018-Lima/full_papers/FP413.pdf

Milusheva, Sveta, Robert Marty, Guadalupe Bedoya, Sarah Williams, Elizabeth Resor, and Arianna Legovini. “Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning.” PLoS ONE 16, no. 2 (2021). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244317

Neilson, Alex, Indratmo, Ben Daniel, Stevanus Tjandra. “Systematic Review of the Literature on Big Data in the Transportation Domain: Concepts and Applications.” Big Data Res. 17 (2019): 35-44. https://doi.org/10.1016/j.bdr.2019.03.001

Neuhold, G., T. Ollmann, S. R. Bulò, and P. Kontschieder. “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes.” 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 5000-5009. doi: 10.1109/ICCV.2017.534

ODI (Overseas Development Institute). “Bogotá.” ODI: Think Change. Accessed October 12, 2021. https://odi.org/en/about/features/bogot%C3%A1/

ODPH (Open Data Philippines). “Open Data Philippines.” ODPH. Accessed June 3, 2021. https://data.gov.ph/

OECD (Organisation for Economic Co-operation and Development)/ITF (International Transport Forum). Big Data and Transport: Understanding and Assessing Options. Paris: OECD/ITF, 2015. https://www.itf-oecd.org/sites/default/files/docs/15cpb_bigdata_0.pdf

Ospina-Mateus, Holman, Leonardo Augusto Quintana Jiménez, Francisco José López-Valdés, Natalie Morales-Londoño, and Katherinne Salas-Navarro. “Using Data-Mining Techniques for the Prediction of the Severity of Road Crashes in Cartagena, Colombia.” In Applied Computer Sciences in Engineering. Edited by J. Figueroa-García, M. Duarte-González, S. Jaramillo-Isaza, A. Orjuela-Cañon, Y. Díaz-Gutierrez, 309-20. Cham: Springer, 2019. https://doi.org/10.1007/978-3-030-31019-6_27

Silva, Philippe Barbosa, Michelle Andrade, and Sara Ferreira. “Machine Learning Applied to Road Safety Modeling: A Systematic Literature Review.” Journal of Traffic and Transportation Engineering 7, no. 6 (2020): 775-790. https://doi.org/10.1016/j.jtte.2020.07.004

Suresh, Harini and John Guttag. “A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle.” In Proceedings of Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ‘21), Association for Computing Machinery, New York, October 2021. https://doi.org/10.1145/3465416.3483305

US NHTSA (United States National Highway Traffic Safety Administration). “Data.” US NHTSA. Accessed May 28, 2021. https://www.nhtsa.gov/data.

WHO (World Health Organization). Global Status Report on Road Safety 2018. Geneva: WHO, 2018.

World Bank. “Better Data for Safer Roads: The Powerful Mission of Road Safety Observatories.” Last modified November 5, 2020. https://www.worldbank.org/en/news/video/2020/11/05/better-data-for-safer-roads-the-powerful-mission-of-road-safety-observatories

World Bank. Colombia - Programmatic Productive and Sustainable Cities Development Policy Loans. Washington, DC: World Bank, 2020. http://documents.worldbank.org/curated/en/426591583968971309/Colombia-Programmatic-Productive-and-Sustainable-Cities-Development-Policy-Loans

World Bank. GRSF DRIVER Completion Report. Washington, DC: World Bank, 2019. https://documents1.worldbank.org/curated/en/245151560919065747/pdf/Data-for-Road-Incident-Visualization-Evaluation-and-Reporting-Lowing-the-Barriers-to-Evidence-Based-Road-Safety-Management-in-Resource-Constrained-Countries.pdf

World Bank. Good Practice Note: Road Safety. Washington, DC: World Bank, 2019. https://pubdocs.worldbank.org/en/648681570135612401/Good-Practice-Note-Road-Safety.pdf

World Bank.
Guide for Road Safety Opportunities and Challenges: Low and Middle Income Country Pro- files. Washington, DC: 2020. https://openknowledge.worldbank.org/handle/10986/33363 World Bank. Indonesia Public Expenditure Review 2020: Spending for Better Results. Washington, DC: World Bank, 2020. https://openknowledge.worldbank.org/handle/10986/33954 World Bank. Innovative Road Safety Risk Assessment Tool with Automated Image Analysis Technology. Washington, DC: World Bank, 2019. Word Bank. Making Roads Safer. Washington, DC: World Bank, 2014. World Bank. Mobile Metropolises: Urban Transport Matters: An IEG Evaluation of the World Bank Group’s Support for Urban Transport. Washington, DC: World Bank, 2017. World Bank. “Open Traffic Data to Revolutionize Transport.” Last modified December 19, 2016. https://www.worldbank.org/en/news/feature/2016/12/19/open-traffic-data-to-revolutionize-transport World Bank. Open Traffic: Easing Urban Congestion. Washington, DC: World Bank, n.d. https://olc.worldbank.org/system/files/WBG_BD_CS_OpenTraffic_1.pdf World Bank. The High Toll of Traffic Injuries: Unacceptable and Preventable. Washington, DC: World Bank, 2017. World Bank. Use of AI Technology to Support Data Collection for Project Preparation and Implementa- tion: A ‘Learning-by-doing’ Process. Washington, DC: World Bank, 2021. World Bank. World Development Report 2021: Data for Better Lives. Washington, DC: World Bank, 2021. doi:10.1596/978-1-4648-1600-0 World Road Association. “Road Safety Manual: Infrastructure Management Tools.” Accessed May 10, 2021. https://roadsafety.piarc.org/en/planning-design-operation-infrastructure-management/manage- ment-tools 67 Zeng, Qiang, Helai Huang, Xin Pei, S.C. Wong, and Mingyun Gao. “Rule Extraction from an Op- timized Neural Network for Traffic Crash Frequency Modeling.” Accident Analysis & Prevention 97 (2016): 87-95. doi: 10.1016/j.aap.2016.08.017 Zhang, Min, Yang Liu, Shaohua Luo, Siyan Gao. 
“Research on Baidu Street View Road Crack Infor- mation Extraction Based on Deep Learning Method.” Journal of Physics: Conference Series no. 1616 (2020). https://iopscience.iop.org/article/10.1088/1742-6596/1616/1/012086/pdf Ziakopoulos, Apostolos and George Yannis. “Using AI for Spatial Predictions of Driver Behavior.” (ITF) International Transport Forum Roundtable on Artificial Intelligence in Road Traffic Crash Pre- vention, (presentation, February 2021). https://www.nrso.ntua.gr/geyannis/conf/cp450-using-ai-for-spatial-predictions-of-driver-behavior/ 68 This guidance note offers a practical introduc- While the preliminary results in Padang were en- tion to integrating big data and machine learn- couraging, additional data is required to verify ing in road safety evaluations. It outlines data the performance in a new context. However, the requirements for several road safety assess- workflow illustrated through these case studies ments, provides a convenient overview of rel- shows potential for replicability. All code for the evant big data sources, and explains machine Integrated Framework for Road Safety is free and learning fundamentals for the application of publicly available for repurposing and refining to these advanced technologies, specifically for local context through a link provided in the note. road safety. The note proposes an Integrated The framework exemplifies current capabilities Framework for Road Safety, which takes the to reduce the reliance on manual image anno- reader step-by-step through a machine learning tations and highlights the potential to conduct workflow to evaluate road risk, using case stud- a road safety scan without years of historical ies in Bogotá, Colombia and Padang, Indonesia. crash data. 
The increasing availability of big The Integrated Framework for Road Safety uses data and the growing use of machine learning machine learning to identify road characteris- models for road safety point to rapidly evolving tics from street view images and predict road technological solutions that have immense ca- segment risk based on those identifiable char- pacity to improve the quality and efficiency of acteristics. As a result, road segment risk was road safety assessments in developing coun- predicted with 72.5 percent accuracy in Bogotá. tries. 69
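As a rough illustration of the second stage of such a workflow (predicting segment risk from characteristics already detected in images), the sketch below trains a very simple supervised classifier on labeled road segments and reports its accuracy on held-out segments. It is not the note's model: the nearest-centroid classifier is a stand-in chosen for brevity, and the feature names, counts, and risk labels are all hypothetical values invented for this example.

```python
from collections import defaultdict

# Hypothetical per-segment features: counts of object classes that an image
# model might detect along a road segment. All values are invented.
# Feature order: [crosswalks, street_lights, traffic_lanes]
TRAIN = [
    ([3, 5, 2], "low"),
    ([4, 6, 2], "low"),
    ([2, 4, 2], "low"),
    ([0, 1, 4], "high"),
    ([1, 0, 5], "high"),
    ([0, 2, 4], "high"),
]
# Held-out segments used only to measure accuracy.
TEST = [
    ([3, 4, 2], "low"),
    ([0, 1, 5], "high"),
    ([1, 1, 4], "high"),
    ([1, 2, 4], "low"),
]


def fit_centroids(samples):
    """Supervised step: average the feature vectors of each labeled class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


def accuracy(centroids, samples):
    """Share of segments whose predicted label matches the true label."""
    correct = sum(predict(centroids, f) == label for f, label in samples)
    return correct / len(samples)


centroids = fit_centroids(TRAIN)
score = accuracy(centroids, TEST)
print(f"accuracy: {score:.1%}")
```

Accuracy here is simply the fraction of held-out segments classified correctly, which is the sense in which a figure such as 72.5 percent can be read. A real application would use far more segments, a stronger model, and features produced by an image recognition step rather than typed in by hand.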