BIG DATA innovation challenge Pioneering approaches to data-driven development © 2016 International Bank for Reconstruction and Development / The World Bank 1818 H Street NW, Washington DC 20433 Telephone: 202-473-1000; Internet: www.worldbank.org This work is a product of the staff of The World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries. Nothing herein shall constitute or be considered to be a limitation upon or waiver of the privileges and immunities of The World Bank, all of which are specifically reserved. Rights and Permissions The material in this work is subject to copyright. Because The World Bank encourages dissemination of its knowledge, this work may be reproduced, in whole or in part, for non-commercial purposes as long as full attribution to this work is given. Any queries on rights and licenses, including subsidiary rights, should be addressed to the Office of the Publisher, The World Bank, 1818 H Street NW, Washington, DC 20433, USA; fax: 202-522-2422; e-mail: pubrights@worldbank.org Front Cover: Big data analytics offers potential for groundbreaking initiatives in development, ranging from (clockwise from bottom left): climate-smart agriculture in Latin America; boosting access to financial services in Africa; improving congestion in the Philippines, and linking crime to urban infrastructure in Colombia. 
Foreword

Dear Readers,

We are truly delighted to share with you this compilation of an almost two-year journey, on which we embarked with 14 courageous World Bank innovators. Belonging to such a data-rich organization as ours, we realized in the summer of 2014 that a growing trend in the private sector – big data – could really enhance the way we achieve our results. Big data capitalizes on the vast sources of data available to us today, from cellphone records to taxi GPS data, satellite images or new media. These reams of data, when anonymized and analyzed, can provide us with a wealth of information, such as the travel patterns of individuals in a crowded city or the path of a flu virus.

So we started to look for big data ideas within the Bank and found a few colleagues who were trying out new initiatives. We learned that many faced similar challenges and needs, such as lack of access to certain types of data, a desire for world-class data science expertise, lack of storage and computational capacity, a desire for good practices on handling privacy, opportunities for peer-to-peer learning, and platforms and norms for sharing data and software. There was also a need for seed and growth funding to kick-start the high-potential, but often high-risk, big data initiatives.

We realized the importance of creating a big data program with strong projects that could inspire others. Following up on this instinct, we launched the Big Data Innovation Challenge – not knowing what to expect, as this was a concept so new to our staff. We weren’t even sure if we would get more than 10 proposals, and were pleasantly surprised to receive 131 innovative ideas, from which we chose 14 winners.
And what a diverse group of winners they were, from tracking rural electrification in India using satellite imagery, to analyzing taxi hailing services in the Philippines to reduce congestion, to using cellphones to measure road conditions in Belarus, and analyzing social media content in Brazil to understand citizens’ political sentiment. In November 2014, the winners all received seed funding to test their pilot ideas, as well as technical help from data scientists from the newly established Big Data team in the Bank’s Innovation Labs.

Our journey with these teams has continued during the course of several months, during which we have seen ideas tried, failed, refined and field-tested. Each story became a personal quest for us and we are proud that many of these are now ready for scale-up and adoption in operations. This publication profiles stories from the Challenge winning teams and finalists, showing how data on an unprecedented scale has the potential to be transformational in its effects. We hope you will find these stories as inspiring as we have.

Adarsh Desai
Program Manager, Innovation Labs
The World Bank Group

Contents

Introduction
At-a-Glance: Innovation Challenge winners and top finalists

Winners and Top Finalists

Mining Big Data for Climate-Smart Agriculture
    Erick C.M. Fernandes, Daniel Jiménez, Andy Jarvis and Sylvain J. Delerce
Big Data for Financial Inclusion
    Sven Harten
Improving Road Investments through Mobile Data
    Kai Kaiser
Securing Property Rights through Geo-spatial Data
    Kathrine Kelm
Open Traffic: Easing Urban Congestion
    Holly Krambeck
Observing People’s Feelings about State Institutions
    Victoria L. Lemieux
Monitoring Rural Electrification from Space
    Kwawu Mensan Gaba
Mapping Poverty by Satellite
    David Newhouse
Understanding How Infrastructure Affects Crime
    Camila Rodríguez, Andrés Villaveces
Revamping Road Condition Monitoring with Smartphones
    Wei Winnie Wang

Projects to Watch

Targeting Poverty by Predicting Poverty
    Melissa Adelman
Assessing Whether Markets are Working for the Poor
    Alvaro S. Gonzalez
From Cellphone Data to Poverty Maps
    Marco Hernandez Ore
Testing Cellphone-Derived Measures of Income and Inequality
    Tariq Khokhar
Satellite-Based Yield Measurement
    Talip Kilic
Understanding Individual Travel Patterns in African Cities
    Nancy Lozano Gracia, Talip Kilic

Key Lessons
Glossary
Acknowledgements

Introduction

Big data can sound remote and lacking a human dimension, with few obvious links to development or to the lives of the poor. Concepts such as anti-poverty targeting, market access or rural electrification seem far more relevant – and easier to grasp. And yet some of today’s most groundbreaking initiatives in these areas rely on big data.

This publication profiles these and more, showing how data on an unprecedented scale has the potential to improve lives in unprecedented ways. The featured case stories illustrate the diverse range of big data applications in development. For the World Bank, with twin goals of ending extreme poverty and boosting shared prosperity, big data is big news – and this is just the beginning.

What is big data and why does it matter?

Big data is an umbrella term used to describe the constantly increasing flows of data emitted from connected individuals and things, as well as a new generation of approaches being used to deliver insight and value from these data flows. It is said that more data has been generated in the past two years alone than in all previous years combined. While most of the attention given to big data has focused on the developed world, the rapid diffusion of technologies such as the internet, cellphones, ground sensors and satellites – to name a few – is driving big data innovation in the developing world. And while data flows in the developing world are typically smaller and less diverse than in the developed world, they still present incredible opportunities for data scientists, economists and statisticians to use big data to enhance or supplement traditional analytical approaches.

Unlike traditional sources of development data, such as household surveys, which address specific research questions, big data is usually produced in the course of some other activity (such as making a cellphone call). This, along with the size and complexity of some datasets, requires different research methods. Big data analytics is the emerging set of tools and methods to manage and analyze this explosive growth of digital information. It includes data science methods like machine learning, predictive analytics, and visualization. These methods open significant potential for drawing on real-time information to address development challenges – potential that can’t be ignored. To this end, the World Bank’s Innovation Labs, housed in the Leadership, Learning and Innovation Vice Presidency, launched its Innovations in Big Data Analytics program in November 2014.

[Photo caption: Analysis of climate and yield data can tell farmers what and when to plant for anticipated weather conditions]

The World Bank’s Big Data program

The Big Data program brings together data scientists, social scientists and sector specialists in a work program with two main objectives:

• To accelerate organizational capabilities in big data analytics for use in research and operations – to help the World Bank Group (WBG) better work towards ending extreme poverty and boosting shared prosperity.
• To position WBG as a leader in the use of big data solutions in development.

The program aims to scale early pilots into projects that solve significant development challenges, and to establish best practices for using big data analytics to steer evidence-driven development. To mainstream and embed big data analytics across the organization, the program:

• provides data science technical assistance to projects with high potential to demonstrate early value through big data
• connects big data practitioners through workshops, training, knowledge events and online communities to foster collaboration, knowledge flows and learning-by-doing between data scientists, sector specialists and external entities
• works with internal technology and service providers to develop big data technologies and tools that meet evolving needs
• develops training, learning events and knowledge products exploring big data use in sectors from agriculture and health to energy and disaster risk
• builds partnerships to develop collective capacities and strengthen the WBG’s role as a leader in big data
• incentivizes the use of big data through innovation challenges to solve development problems.

Taking up the Challenge

Launched in September 2014, the Big Data Innovation Challenge has been key in encouraging big data approaches. Exceeding all expectations, it attracted 131 innovative proposals and awarded 14 with funding and expertise to enable big data analytics in their projects. The winning initiatives cover an exciting range, from using satellite imagery to improve poverty mapping, to mining social media data to understand political sentiment, or cellphone data to increase the use of banking services. Others promote traffic flows or accountable road building, anticipate crop yields, predict violent crime and promote registration of land rights.

This publication profiles 16 extraordinary initiatives from the Challenge winning teams and finalists. The case stories examine the application of big data analytics and how it can help achieve project goals. By demonstrating the impact, use and value of big data in development, they show that you don’t need to be a data scientist to understand how these approaches can improve people’s lives.

Work now continues beyond the Challenge to take several of the projects to the next stage of growth. The Big Data program is also currently engaged with several WBG Global Practices to accelerate their progress in big data. These engagements involve a range of activities to deliver technical assistance, knowledge and learning, and essential resources to operationalize big data capabilities in each practice.

A rich learning process

Innovative paths inevitably involve hurdles. Each case story revealed useful lessons, and several testify to the value of perseverance. However, the potential rewards of big data make the effort worthwhile. Key themes include the need to capture, prepare and store data meticulously – and to plan enough time to do so. Successful big data solutions involve approaching existing situations from new angles or combining previously unrelated data sources – but these novel data approaches must be tested, validated and adapted for mainstream use.

Despite the central role of computational power, the human element remains vital to the success of these projects. Big data analysis can be enhanced by traditional research techniques, such as socioeconomic surveys. Several stories stress the need to invest in partnerships, or to combine human and computational power for optimum results. Big data analytics is a team sport: Effective collaboration between data experts, technologists and business sector specialists is also crucial.

Putting big data into action for development

The Big Data program will share these lessons widely, supporting early adopters and helping transition big data to mainstream use by providing advice, infrastructure, expertise and resources. We hope the exciting approaches in this publication will demystify big data in development and motivate its applicability to development challenges. Above all, we hope it will inspire you to put big data into action in your own work.

These case stories demonstrate that big data can improve development effectiveness and help World Bank operations achieve results through solutions with better evidence, efficiency, awareness, understanding and forecasting. At the same time, we hope others outside the World Bank will find inspiration from these experiences in applying big data approaches for development. Ultimately, big data analytics can be an accelerator for ending poverty and boosting shared prosperity.

[Photo caption: In Bogotá, big data analysis has identified urban features and times of day associated with violent crime]

For more information on big data and updates on any of these case stories, email innovation@worldbank.org. To download the World Bank’s 2014 report ‘Big Data in Action for Development,’ visit http://bit.ly/bigdatainaction

Big Data Innovation Challenge At-a-Glance: Winners and Top Finalists

Mining Big Data for Climate-Smart Agriculture
Erick C.M. Fernandes, Daniel Jiménez, Andy Jarvis and Sylvain J. Delerce
Climate change is making seasonal planting decisions unreliable, so researchers are comparing long-term data on harvests and climate, to identify climate sequences favorable or unfavorable for cropping in specific areas. By matching these with forecasts, farmers can tell what and when to plant for the anticipated conditions.

Big Data for Financial Inclusion
Sven Harten
By combining big data analytics and socio-economic research in Africa, the team established a statistical customer profile for users of digital financial services. They then identified matching profiles among phone customers not using such services, and mapped their locations. These insights are helping financial companies reach previously unbanked people.

Improving Road Investments through Mobile Data
Kai Kaiser
To track nationally-financed projects to improve local road networks in the Philippines, this project developed OpenRoads, an interactive multi-media portal and set of digital tools. The platform links official road data with crowd-sourced geo-tagged video and image data from mobile devices, to increase accountability in road network investment projects.

Securing Property Rights through Geo-spatial Data
Kathrine Kelm
Using unmanned aerial vehicles (drones), this project recorded imagery which is processed into maps and 3D computer-generated landscape models. Communities can use these to identify property boundaries and work with authorities to register their land. The process helps secure land rights in developing countries in a fast, cost-effective way.

Open Traffic: Easing Urban Congestion
Holly Krambeck
To help traffic management agencies in developing countries monitor real-time traffic conditions and mitigate congestion, this project developed an open-source platform, Open Traffic, for collecting, visualizing and analyzing traffic speed data derived from taxi drivers’ smartphones.

Observing People’s Feelings about State Institutions
Victoria L. Lemieux
By conducting sentiment analysis of tweets made during civil unrest around the 2014 Soccer World Cup in Brazil, researchers found that people protest when feeling deprived relative to some external standard – in this case, spending on the tournament. Their analytical method increases understanding of the relationship between civil unrest and citizens’ sentiment.

Monitoring Rural Electrification from Space
Kwawu Mensan Gaba
Through analyzing two decades of satellite images for nightly light output from India’s 600,000 villages, this project developed a novel data-intensive strategy to improve the monitoring of rural electricity provision. The data is accessible via an online visualization platform to help optimize electrification planning.

Mapping Poverty by Satellite
David Newhouse
To generate inexpensive, timely poverty estimates, this project examined how well satellite indicators contribute to poverty prediction, and how this depends on the type of prediction model. When compared with Sri Lankan census data, high-resolution satellite indicators track poverty very well, and have potential to improve traditional poverty maps.

Understanding How Infrastructure Affects Crime
Camila Rodríguez, Andrés Villaveces
Latin America has above-average crime rates and unplanned cities with high inequality. To define the association between crime and infrastructure, this project drew on rich existing data about Bogotá in Colombia. Through Risk Terrain Modeling, the team identified specific urban features and times of day associated with assault and homicide.

Revamping Road Condition Monitoring with Smartphones
Wei Winnie Wang
This project developed an app called RoadLab, which uses crowdsourced data from accelerometers in smartphones carried in moving vehicles to evaluate the roughness of road surfaces. RoadLab gives national road agencies comprehensive and frequent information on road surface condition, to help them manage assets cost-effectively.

Projects to Watch – results still pending

Targeting Poverty by Predicting Poverty
Melissa Adelman
Targeting errors are common in development programs. By applying machine learning techniques to data-sets used for targeting poverty, this project seeks to improve methodologies for identifying the poor.

Assessing Whether Markets are Working for the Poor
Alvaro S. Gonzalez
Many markets in which the poor transact are volatile and fragmented. This project analyses existing micro-level price data to provide near-real-time information on how well markets are working, to help policymakers improve them for the poor.

From Cellphone Data to Poverty Maps
Marco Hernandez Ore
Outdated poverty maps in many developing countries limit governments’ ability to make effective anti-poverty decisions. Using anonymized cellphone call records in Guatemala, this project aims to create a tool to produce inexpensive, near-real-time poverty maps.

Testing Cellphone-Derived Measures of Income and Inequality
Tariq Khokhar
Many countries’ official measures of poverty are outdated and inconsistent. This project evaluates techniques that use cellphone call records to offer timely and complete poverty estimates, and examines how these techniques can enhance government workflows.

Satellite-Based Yield Measurement
Talip Kilic
Through trials in Uganda, this project is testing a pioneering approach which relates satellite-based data to plot-level ground measures of yields. This enables future yield predictions, which can inform better policymaking to help farmers improve productivity.

Understanding Individual Travel Patterns in African Cities
Nancy Lozano Gracia, Talip Kilic
This project combined personal interviews and analysis of big data from smartphones’ GPS sensors. It aims to capture accurately and affordably the route, purpose, mode and cost of individuals’ travels, to help improve urban transport and land-use planning.

Mining Big Data for Climate-Smart Agriculture
Erick C.M. Fernandes (pictured), Daniel Jiménez, Andy Jarvis and Sylvain J. Delerce

[Photo caption: Data analysis can reveal the combination of climate factors resulting in high or low crop yields in a specific region. Photo: Neil Palmer]

[Pull quote: Data-driven agronomy will help increase the sector’s capacity to adapt to climate change]

SUMMARY

Climate change means traditional calendar-based decisions about what and when to plant are no longer reliable for farmers. New approaches are urgently required to provide relevant and timely information for decision making. In Colombia, the International Center for Tropical Agriculture (CIAT) has combined long-term data on rice harvests and climate patterns, and analyzed it to provide essential information for crop planting decisions.

This project draws on the Colombian experience. The team partnered with rice-growing experts from Argentina, Brazil, Chile and Uruguay, helped them prepare cropping and climate data, and held a workshop to analyze their data. The technique combines the two databases on harvests and weather, and relates each harvest to the corresponding climate sequence for approximately 120 days between sowing and reaping.
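The linking step described above, relating each harvest to the weather between sowing and reaping, can be illustrated with a minimal Python sketch. It is not CIAT's code. The record layouts, the field names (municipality, sowing_date, harvest_date, tmax_c, rain_mm) and all sample values are hypothetical, chosen only to show how one harvest event might be paired with its daily climate sequence.

```python
from datetime import date, timedelta
from statistics import mean


def build_weather_index(records):
    """Index daily weather records by (municipality, day) for fast lookup."""
    return {(r["municipality"], r["day"]): r for r in records}


def climate_sequence(harvest, weather_index):
    """Return the daily weather records from the sowing date up to,
    but not including, the harvest date. Missing days are skipped."""
    n_days = (harvest["harvest_date"] - harvest["sowing_date"]).days
    sequence = []
    for offset in range(n_days):
        day = harvest["sowing_date"] + timedelta(days=offset)
        record = weather_index.get((harvest["municipality"], day))
        if record is not None:
            sequence.append(record)
    return sequence


def summarize(sequence, variable):
    """Average one weather variable over a climate sequence."""
    values = [r[variable] for r in sequence]
    return mean(values) if values else None


# Invented sample data: 200 days of weather for one municipality.
weather = [
    {
        "municipality": "Saldaña",
        "day": date(2015, 1, 1) + timedelta(days=i),
        "tmax_c": 30.0,
        "rain_mm": float(i % 5),
    }
    for i in range(200)
]
weather_index = build_weather_index(weather)

# Invented harvest event with a 120-day growing period.
sowing = date(2015, 1, 10)
harvest = {
    "municipality": "Saldaña",
    "sowing_date": sowing,
    "harvest_date": sowing + timedelta(days=120),
    "yield_t_ha": 5.4,
}

sequence = climate_sequence(harvest, weather_index)
print(len(sequence), summarize(sequence, "tmax_c"))
```

Each harvest event then carries a sequence of daily records that can be summarized or passed to whatever analysis follows. A real pipeline would also need to handle missing weather days and inconsistent dates, which this sketch simply skips.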
The data is analyzed to unravel underlying correlations between climate factors and yield variability. This enables identification of climate sequences that are favorable or unfavorable for cropping in specific areas. By matching these with seasonal forecasts, the system can advise farmers on what variety of rice to plant for the anticipated conditions, and when to plant it. To support field data capture by farmers, the project also created a web platform and an Android app. The tool can be applied to any crop or location, providing farmers with data-driven information to help them decide what, when and where to plant.

CHALLENGE

Greater weather variability due to climate change means calendar references are increasingly outdated in helping farmers make the right planting decisions. New approaches are urgently required to provide them with relevant information and enhance their resilience to climate variability. Recent climate change analysis suggests Latin America and the Caribbean could have a future climate more suitable for growing rice – the world’s most important food crop in terms of energy consumed by humanity. In Colombia rainfall and temperature extremes have changed differently in each region, and the climate is increasingly unpredictable, resulting in national average rice yields falling from six to five tonnes per hectare in less than five years (without noticeable changes in soil or crop management). Faced with an increasingly volatile climate, farmers need new methods for deciding when and what to plant.

[Pull quote: This would foster data-driven agronomy and increase the sector’s capacity to adapt to climate change]

INNOVATION

CIAT’s technique combines and analyzes two country-wide databases, covering commercial harvests and weather. Data on harvests has been collected by the Colombian National Rice-Growers’ Federation for almost 20 years, covering variables such as yield, grain humidity, sowing and harvest dates, cultivar, municipality and cropping system (irrigated or upland rice). Data from the Colombian National Meteorology Institute provided daily records of five variables: maximum and minimum temperature, precipitation, relative humidity and solar radiation.

By combining these two databases, each individual harvest event can be related to a corresponding climate sequence for around 120 days between sowing and harvest. Data is analyzed using machine learning techniques such as Artificial Neural Networks (ANN), random forest and clustering, to reveal the combination of climatic factors that result in high or low yields in a specific region.

Based on the harvest events database for two municipalities, analysis of the 120 days from sowing to harvest for the five climate variables revealed 17 different climatic sequences. By matching these against the yields obtained under each sequence and evaluating the relevance of specific climate indicators against the growing phases of the crop, CIAT identified favorable and non-favorable climatic situations. Projecting these onto seasonal forecasts, the system can advise farmers on optimum sowing dates and rice varieties, given the weather ahead. By providing farmers with data-driven information for planting decisions, the tool can compensate for the increasing irrelevance of traditional knowledge due to climate variability.

Expanding the project reach

In pilot tests, analysis of crop data together with weather data generated advice against planting rice that season, due to projected adverse climate conditions. Farmers received the planting advice via the growers’ association. Those who ignored it harvested nothing and lost considerable inputs, while those who followed it saved seed, labor, fertilizer and water. The historical data allowed insight at a site-specific level. For example, in Saldaña – where rice is grown year-round, due to irrigation policies – the analysis showed that rice yields were limited mainly by solar radiation during the grain-ripening stage. This implies that farmers can boost yields by aligning sowing dates with sunnier seasons, or choosing crop varieties resilient to low solar radiation.

With this success suggesting other countries and crops could benefit from the same approach, the project worked to scale it up in Latin America. The team visited Argentina, Uruguay and Brazil to seek partnerships with agricultural associations and assess data availability. CIAT then helped partners with data collection and preparation for five months before holding a workshop in Uruguay to teach them to prepare and analyze their own data. The training involved participants using the system for themselves, to ensure they could replicate the process in their own contexts.

Building a real-time system

Equally important was the move from using only historic data on weather and yields, to collecting real-time data and further automating the analysis to build a real-time system. To enable field data capture by farmers for factors such as soil type and crop management techniques, the team developed an Android app, using open-source software and an external provider to build the app. Supporting a wide range of Android versions, it can capture GPS coordinates using one button. Stored in a cloud, the data is instantly available and can be exported to any format for reuse. The app is linked to a web platform which generates personalized reports, real-time information, interactive graphs and mapping, allowing for optimum site-specific crop management. The system will be further refined by increased use of machines such as drones or inexpensive wireless sensors for data capture. This allows high frequency measuring, without requiring effort from farmers and eliminating the margin of human error.

RESULTS

The workshop teams spent most of the time preparing their data: standardizing and cleaning it, and relating crop data with weather series. This reflects the reality of work with data. Each team completed at least one analysis using random forest models that allowed them to identify the most relevant factors behind yield variability, and evaluate the relationships between input and output variables.

The group quickly grasped the approach. Some teams achieved satisfying results which coincided with previous work, while others obtained unexpected results due to incomplete and insufficiently processed datasets. All experienced the problems that occur while handling real data (such as outliers or variables out of the range when cleaning data). This generated a rich debate around how best to capture data. The limited capacity of participants’ personal computers restricted the exercise to small datasets using only ANN and random forest techniques, prompting CIAT to run the analysis on its more powerful server facilities, with whole datasets. Exploratory results using information from one rice mill showed that previous land use also influences rice production.

The app is being finalized after initial user feedback reported that it was too slow loading and saving information, and did not work on all platforms (iOS/Apple, Windows, etc.). This caused some frustration among users (who wanted quick responses to problems, as if from a service provider rather than a research project).

The system could potentially be adapted to almost any crop, and the team aims to pilot it in Africa in 2016. It is hoped the workshop will form the basis for the establishment of a community of practitioners in big data in agriculture. This would foster a culture of data-driven agronomy and increase the sector’s capacity to adapt to climate change.

LESSONS LEARNED

The scaling-up and move towards real-time data capture revealed that data quality is of paramount importance to the reliability of the system, but human relationships through local partnerships and a user community will also be crucial to its success.

• Work specifically to create a positive user experience
User experience is central to success. Technical issues undermine user trust, making the adoption process harder. A tool must offer sufficient services to engage users, and be easy to operate, otherwise it will not be used.

• Invest in regional and local partnerships
Partnerships on the ground are key to scaling up – particularly with agricultural organizations, which can source data from numerous locations and deliver recommendations to many farmers efficiently. It’s important to gain credibility with partners, so they will share information: Open data-sharing is still in its infancy in many places and primary data holders often have legitimate concerns about how information they share will be used.

• Capture, prepare and store data meticulously
High-quality input data is essential for accurate results. Agricultural data is readily available, but there is big potential to modernize the methods used to capture, store and share it – for example, through machine-based data sourcing. The data preparation step (addressing missing data, outliers, correlated variables, etc.) must be completed, and organizations should evolve towards cloud-based technologies, so data is centralized and always available.

• Create a community of practitioners
Networking will be powerful in promoting uptake and refinement of these methods. CIAT hopes to support a user community for data mining techniques in agriculture, similar to the ‘R community’ – a global network of more than 2 million users and developers who voluntarily contribute technical expertise to maintain and extend the ‘R software’ language. This approach demonstrates how a tool can be continuously improved.

www.open-aeps.org:8080/

[Photo caption: Teams at the project workshop learned data analysis techniques to identify the factors behind yield variability]

Big Data for Financial Inclusion
Sven Harten

[Photo caption: Analysis of mobile phone data can help increase subscribers’ use of banking services, boosting their economic resilience]

SUMMARY

Access to financial services is essential to development efforts. Across Africa, financial inclusion remains below potential – partly due to the challenge for financial institutions in developing products for the low-income mass market. This project uses available data in an innovative way to help providers offer affordable financial services to previously unbanked people.

By combining big data analytics and socio-economic research in Ghana, Uganda and Zambia, the team created a powerful tool to increase adoption and use of digital financial services (DFS). This enabled them to establish a statistical customer profile for an active DFS user. They then searched the big-data set to identify matching profiles among phone-service customers who are not yet DFS users, and mapped their locations. These are the customers most likely to become active users of DFS.

This intelligence can be used for product development and targeted marketing campaigns to increase the supply of financial services to the previously unbanked. It also offers valuable insights for the World Bank’s drive for full global financial inclusion by 2020. In Ghana, the use of project findings has already led to the financial inclusion of more than 70,000 people.
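The profile-matching idea in the summary above can be illustrated with a small Python sketch. This is not the project team's method, which the report describes only as statistical and machine learning modeling. As a stand-in, the sketch builds a simple statistical profile (mean and standard deviation per feature) from active DFS users and ranks non-users by their standardized distance from that profile. The feature names, the distance threshold and all sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical usage features derived from call detail records.
FEATURES = ("calls_per_day", "avg_call_minutes", "contacts")


def build_profile(active_users):
    """Summarize active DFS users as (mean, std) per feature.
    A zero standard deviation is replaced by 1.0 to avoid division by zero."""
    profile = {}
    for feature in FEATURES:
        values = [user[feature] for user in active_users]
        profile[feature] = (mean(values), pstdev(values) or 1.0)
    return profile


def distance_to_profile(subscriber, profile):
    """Euclidean distance in z-score space between a subscriber and the profile."""
    return sqrt(
        sum(((subscriber[f] - m) / s) ** 2 for f, (m, s) in profile.items())
    )


def likely_adopters(non_users, profile, max_distance):
    """Return ids of non-users close to the profile, nearest first."""
    scored = sorted(
        (distance_to_profile(user, profile), user["id"]) for user in non_users
    )
    return [user_id for dist, user_id in scored if dist <= max_distance]


# Invented sample data.
active_users = [
    {"calls_per_day": 10, "avg_call_minutes": 4, "contacts": 40},
    {"calls_per_day": 12, "avg_call_minutes": 5, "contacts": 50},
    {"calls_per_day": 8, "avg_call_minutes": 3, "contacts": 30},
]
non_users = [
    {"id": "A", "calls_per_day": 11, "avg_call_minutes": 4.5, "contacts": 45},
    {"id": "B", "calls_per_day": 3, "avg_call_minutes": 1, "contacts": 5},
]

profile = build_profile(active_users)
print(likely_adopters(non_users, profile, max_distance=2.0))
```

In this toy data, subscriber A behaves much like the active users and is flagged, while B is not. A production approach would more likely train a classifier on many more variables and validate it against observed adoption.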
CHALLENGE
Across Africa, the coverage of formal financial services is poor, with low-income and rural customers largely excluded. Yet their need for affordable financial services is well documented. Financial inclusion increases resilience to economic shocks and helps grow small and medium-sized enterprises. The International Finance Corporation (IFC) is committed to helping create 600 million bank accounts in the developing world.

Through the Partnership for Financial Inclusion – a joint initiative with the MasterCard Foundation – the IFC is working to expand microfinance and advance digital financial services (DFS) in Sub-Saharan Africa. The growth of DFS offers huge potential to increase financial inclusion in developing countries, especially given the proliferation of mobile phone use. With high mobile phone penetration across the continent, many microfinance institutions, banks and mobile network operators (MNOs) are developing DFS. However, although many customers have registered for these services, only a minority use them regularly. Financial institutions also lack information about potential customers, which products address their needs and how to provide access to those products.

Big data offers the opportunity to mine existing information about mobile phone and DFS users to help MNOs and DFS providers to deliver products and services to previously excluded customers. Using a company's DFS transactions database and call detail records, the project team sought to characterize active mobile DFS users and understand what drives inactivity. They hoped to identify behavior patterns among customers and use that information to stimulate better use of DFS and identify potential new users.

INNOVATION
The project used big data from MNOs and financial institutions in Ghana, Uganda and Zambia to calculate profiles for users of mobile financial services. Using statistical predictors derived from the usage patterns of the current active network user base, the team first identified which MNO customers are highly likely to become active DFS users. As mobile phone data does not contain socio-economic or demographic information, they also designed classic surveys to achieve more complete profiling of users and non-users of financial services. Potential users can then be targeted by marketing campaigns.

Big data analysis
The study started with a big data analysis of call detail records (CDRs) covering one MNO per market, each with an average of 4 million mobile subscribers. Six months of CDRs and DFS transaction records, nearly two terabytes in size, were extracted from the MNOs' servers. The team segmented users into three categories:

• Voice only
• Registered but inactive DFS users
• Active DFS users.

These segments showed very distinct patterns of voice calls, social network structures and geographical mobility. Active DFS users make on average almost twice as many phone calls as non-users. These calls also last significantly longer. They also send and receive the most text messages and have a much larger social network. They are therefore the high-value customers and early adopters each MNO seeks to attract.

The research found that many telecoms-only customers had a demographic profile similar to these highly active DFS users, indicating a strong correlation between high users of telecoms services and the potential to be an active DFS user. The team therefore scored all telecoms subscribers according to the extent to which users are similar to the profile of highly active DFS users. Using machine learning techniques, they modeled the 15 most powerful variables (such as the number and length of calls or number of call contacts) which predict whether a subscriber is likely to become a DFS user. Based on the findings, the team compiled maps showing the actual distribution of DFS users, the distribution of predicted adopters, and districts with highest concentrations of likely adopters.

Maps (district-level adoption rate; predicted adoption rate; top target districts): Through big data analysis, the team mapped the distribution of current, predicted and likely users of digital financial services in Ghana

Classic socio-economic profiling interviews
To strengthen the profiles of different types of DFS customer, the team also carried out a socio-economic study in Ghana:

• Demographic profile
Based on the average three-monthly volume of voice, text and data usage, network subscribers were organized into high, mid and low users. A random selection of 500 from each segment was interviewed by phone. The results indicated that mobile phone users are 61 percent male and relatively young (45 percent under 35), with good literacy and access to financial services (66 percent have a bank account). There is an apparent gender effect at all three levels of mobile activity: women switch less between providers and have lower mobile activity levels on the network they use. This suggests opportunity for MNOs that develop marketing strategies targeting women.

• Use of Digital Financial Services
Ninety-five percent of MNO subscribers, including users over 55, are aware of DFS. However, there is a gap between awareness and usage, especially among low-activity subscribers. There is also great disparity between male and female subscribers, with low-active male network users having the same level of DFS usage as high-active female network users. The gap between awareness and usage decreases the younger the users are. The key reasons why so many MNO subscribers are not using DFS relate to understanding and products. Twenty-eight percent of non-users declared they have no need for DFS, which suggests they need explanations of how DFS can help their financial management. MNOs should also consider whether they have the right products for these customers. Twenty-three percent of customers reported having no money to use with DFS, reinforcing the need for customer education, as even with irregular incomes, many could still benefit from DFS.

RESULTS
In this study, big data was used to discover the profile of those MNO voice customers most likely to become regular DFS users. Research then identified socio-economic groups that fit these profiles but were not using DFS. It is reasonable to expect that a combination of targeted marketing and the provision of DFS relevant to these profiles should result in significantly increased active usage of DFS.

Comparing the findings from the socio-economic survey with big data analysis, the team found that infrequent users of voice calls are also more likely never to have used DFS. While younger people are the most active users of voice, they also have the largest share of registered yet inactive DFS accounts. This suggests improvements are needed in DFS products, as mobile-savvy younger customers are ignoring them despite using their phones regularly. An important finding was that DFS users display greater network usage and loyalty to an MNO than non-DFS users.

The socio-economic research showed high potential for growth, given that nearly half of voice subscribers have never used DFS. In particular, the youth segment and infrequent female voice users are high potential target groups who could be approached with tailored products and communication strategies. Appealing to these consumers could lead to increased use of DFS and ultimately to greater financial inclusion.

Findings from big data analysis are already helping MNOs promote both the use of DFS and their telecoms business. In Ghana, use of information from this study has so far led to the financial inclusion of more than 70,000 additional people. The MNO in Ghana called the list of potential customers directly to promote DFS. These calls were far more effective than previous indiscriminate efforts to attract customers.

The team now hopes to develop more sophisticated metrics of phone use, such as social network structure, behavioral traits, geographic segmentation and mobility analysis. Statistical analysis could also identify key traits that differentiate groups, such as voice versus DFS users or active versus inactive DFS users. Ultimately the researchers aim to construct a model to compute the likelihood that each voice subscriber will sign up for DFS or become an active user.

LESSONS LEARNED
The combination of big data analysis and classic research methods offers valuable lessons in getting the most from big data:

• Enhance big data analysis with traditional research techniques
Despite the current hype, big data is ultimately 'only' a new (albeit very rich) data source, and it does not make other data obsolete. Talking to people remains a powerful source of information. For example, big data analysis offers huge potential to support financial inclusion, but only by enhancing it with consumer profile research is it possible to target customers with precision.

• Plan for imperfect data
It's rare to have perfectly extractable data. There may be gaps or reliability issues, especially if data sources span different machines or archives. Different data (such as call details and DFS transactions) usually sit on different servers, which can complicate and prolong data extraction. Before starting the full extraction, request samples of data from all sources, to ensure they can be unified without problems. Even with the best preparation, extraction often takes longer than anticipated, so include potential delays in project planning.

Big data can help the World Bank reach its target of full global financial inclusion by 2020

Improving Road Investments through Mobile Data
Kai Kaiser

The OpenRoads platform links official road data with crowd-sourced geo-coded mobile phone imagery to validate the state of the Philippines' road network

The Philippines' 180,000km of local roads are vital for linking people to social and economic opportunities

SUMMARY
Local roads that give 'last-mile' access to a destination are essential for promoting inclusive growth. But improvements in local road networks typically depend on national financing and involve numerous decentralized projects, making it hard for policymakers and citizens to ensure the development of efficient road networks. In response, this project developed OpenRoads, an interactive multi-media portal and set of digital tools to track nationally-financed projects to improve local road networks in the Philippines. The platform links official road data with crowd-sourced geo-coded video and image data from mobile devices, to validate the state of the country's extensive road network. It can be used by all stakeholders working to promote better road investments, including ordinary citizens. Leveraging low-cost open-source mapping technology and mobile pictures and videos, it organizes a wealth of rich geo-tagged imagery and feedback contributions, turning fragmented data into timely information. By geo-tagging, processing and analyzing the imagery, it captures the state of local road networks and road investment projects. This allows greater digital transparency and feedback, improving results and value for money for public investments in last-mile rural road access.
CHALLENGE
Beyond national highway networks, local roads that give last-mile access to a destination are vital for linking people to economic and social opportunities. But improvements in local road networks typically depend on national financing and involve large and diverse portfolios of decentralized projects. This makes it hard for policymakers and citizens to ensure that such programs are effective in prioritizing, delivering and maintaining efficient road networks.

In the Philippines only 31,000km (15 percent) of an estimated 210,000km of road are designated national roads. Improving last-mile access means upgrading the remaining 180,000km of local roads – a distance halfway to the moon. Several government programs worth almost US$1 billion in 2016 target local roads, ranging from tourism to farm-to-market schemes. However, with patronage politics playing a key role in the country, it is hard to ensure that the right roads get built at the right cost. Traditional monitoring and evaluation systems for road projects cannot quickly and cost-effectively improve transparency and feedback.

Incomplete local road maps compound the problem, but geo-tagged photo and video data offer a reliable approach to mapping and assessing the Filipino local road network. In recent years, government agencies' collection of such data on public infrastructure programs (including through crowd-sourcing) has exploded. There is a need to effectively leverage this data overflow to provide real-time information on the road network and improvement projects. However, the data is so unstructured that it cannot be systematically analyzed.

In response, this project sought to implement new data mining and evaluation protocols, develop a clear reference model for the road network, and ensure that image and video data can be referenced to this model. With the 2013-16 national budget for improving local last-mile roads at over US$3 billion, the value for money in advancing transparency and feedback could be significant.

INNOVATION
The project developed OpenRoads, an interactive multi-media portal and set of digital tools to track nationally-financed projects to improve local road networks. The platform turns fragmented data into timely information, organizing a wealth of rich geo-tagged image, video and feedback contributions over time. Its premise is that visible, mapped information on existing road networks and the physical and financial progress of investment projects will enhance transparency and accountability for public investments in last-mile roads.

OpenRoads leverages innovative, low-cost open-source mapping technology and mobile picture and video imagery. By geo-tagging, processing and analyzing the imagery, it captures the state of local road networks and road investment projects. The platform maps both projects and the road network assets these investments are seeking to improve. OpenRoads Mapping and Network Analytics builds on OpenStreetMap protocols, an open-source global mapping platform. This provides for systematic data analytics and validation, as well as public disclosure of information.

Joining up disparate data sources
Various Filipino government agencies are financing or implementing local roads projects. OpenRoads is not designed to duplicate these agencies' tracking systems, but rather to provide a platform for bringing this data together and augmenting it, particularly with rich geo-tagged information. For example, the Department of Public Works and Highways electronic project lifecycle system provides monthly summaries on the financial and physical completion of an estimated 20,000 projects per year. While many focus on the national road network, others benefit local roads. OpenRoads augments this information with locational information, as well as videos and images. The system can also scour image and video data from various dates to show the lifecycle of road segments and projects.

The platform allows all stakeholders wanting better roads to create the missing digital road map using Routeshoot, a mobile video application that works on basic smartphones. After simple training, people can upload these movies and maps onto OpenRoads. The platform provides a comprehensive overview of government investments in local roads. It can be used by citizens concerned with the use of tax money, development and private sector partners promoting investments, or government bodies showcasing public infrastructure delivery. OpenRoads offers project updates and virtual tours at different points in time. Users can view national and local road networks, see all mapped nationally-financed local road projects and provide feedback by project or points on the map. This enables stakeholder collaboration in improving road networks.

Measuring program performance
The platform's Geostore stores mapped images and videos over time, allowing dynamic project-based review and updates by responsible agencies. Government bodies can map and geotag all projects, which oversight agencies can track systematically. Users can measure program performance by project or local government. The mobile app assigns point- or track-mapping to photos or videos, which geo-processing tools convert to diagrams summarizing project information (such as road length, surface and quality). Users can validate the quality of budgeted road projects against observed progress on the ground. The platform also offers road mapping tools to assess project connections and the extent of local road networks. Through its Dashboard, OpenRoads allocates scores to project portfolios based on the availability of physical and financial information and the extent of basic project mapping.

By bringing projects to life through its dynamic digital maps, OpenRoads deepens information exchanges, allowing stakeholders to assess the road network and prioritize the right improvement projects.

RESULTS
OpenRoads is already making a tangible difference to the Philippines' road network. Local roads in Palawan Province have been surveyed using Routeshoot, and the imagery geo-processed to rapidly summarize their condition by segment. The provincial government is collaborating with 24 cities and municipalities to complete comprehensive road mapping across the province. The Commission on Audit has engaged local civil society to conduct a citizens' audit of more than 200 Farm-to-Market roads across the country. OpenRoads' Geostore was used to manage the process, resulting in a People's Audit Report based on data analytics and visualization. The Tourism Road Infrastructure Program is upgrading over 4,000km of roads across more than 450 projects and 1,000 contracts. The Department of Tourism is using Routeshoot and geo-processing to conduct a rapid appraisal and validation of all these roads.

Under Kalsada, a new public investment financing program, provinces are eligible for national government financing for up to two provincial roads under the 2016 budget. A prerequisite is that agencies submit their proposals to OpenRoads using Routeshoot videos. Over 70 provinces have been trained and are submitting projects. Worth nearly US$150 million in 2016, the Kalsada program is set to increase in value six-fold in 2017. This 'No map, no money' approach from sponsoring agencies underscores the importance of accountability for taxpayers' money. The platform's approach could in future be applied to other decentralized infrastructure investments or special programs, such as re-greening.

OpenRoads is no substitute for major institutional reforms to deliver sustained effectiveness in last-mile roads investments, but better information and stakeholder engagement and feedback enable the country to check whether the right roads are being built at the right time and cost.

LESSONS LEARNED
OpenRoads shows how big data can bring stakeholders together to promote transparency, but technology cannot substitute for institutional reform to guarantee accountability.

• Complement big data with non-digital approaches to deliver change
Online roads transparency is no silver bullet. Digital technology needs 'analog' complements, such as institutions that link transparency to accountability, to improve service delivery.

• Use big data to promote dialog
OpenRoads promotes a multi-stakeholder conversation about how to improve planning, budgeting and implementation of decentralized road investment programs. Citizens can provide constructive feedback on progress in last-mile access for their communities. Mapping reframes conversations about road investments so they resonate with politicians, policymakers and citizens alike. Goals such as providing every community with road access can be costed, and choices evaluated.

• Be prepared for evolution in system usage
The Kalsada program institutionalized the OpenRoads platform, linking the mandatory use of geo-tags to receiving performance-based grants. However, this also meant that OpenRoads increasingly needed to serve as a project management system, rather than just a disclosure platform. The major lesson from this evolution was to ensure that the technology was both scalable and able to accommodate specific program requirements (e.g., tracking budget release requirements).

• Make road project geo-tagging comprehensive and mandatory
The only way to avoid ghost roads is to ensure all local road projects are mapped. It was critical to underscore the need for OpenRoads coverage to be comprehensive. The Commission on Audit is enforcing this principle in the Philippines. Engaging Supreme Audit Institutions in geo-tagging must be an important part of advancing the notion that 'no road project should be a state secret'.

OpenRoads enables the country to check whether its roads are being built in the right place, at the right time and the right cost

Using Geo-spatial Data to Secure Property Rights
Kathrine Kelm

Ultra-light commercial drones record aerial imagery that can be quickly processed into accurate, cost-effective maps

The orthophoto produced from the drone and the software on the tablet are used to gather property information from local residents in Kosovo. The property boundary information is then updated on the orthophoto

SUMMARY
Property rights are critical to economic growth and social stability, yet almost 75 percent of the world's population lacks access to formal systems to register their land rights. In a new approach to recording property rights quickly and cheaply, this project used Unmanned Aerial Vehicles (UAVs), commonly known as drones. These record imagery which is processed into high-resolution orthophotographs (aerial photographs corrected so that they are as free of distortion as a map). The process generates accurate, cost-effective and up-to-date maps and 3D computer-generated landscape models in a fraction of the time of conventional aerial surveys. In Kosovo, the team used a UAV to map villages where the men were killed in the Balkan conflicts and the women lack formal property title. They are now using the new maps to help the women define their property boundaries and officially register their rights. They also deployed the UAV successfully in the fast-growing city of Ferizaj, to support a government program for unregistered land owners to legalize their property rights.
The initiative can be scaled up globally, especially to secure land rights in developing countries. BIG DATA innovation challenge 19 CHALLENGE the world’s population who lack secure property Almost three-quarters of the world’s population rights. lack affordable access to formal systems to register and secure their land rights. Poor This mapping approach could be used to people, including indigenous and vulnerable complete or update cadastral maps. By groups, are disproportionately affected. Even facilitating the registration of land rights, it where affordable formal systems exist, data would allow poor people to acquire a tangible quality often remains low, yet property rights asset, which they could use directly or as are critical to economic growth and social collateral to invest in other assets. Vital for stability. Without secure rights, land remains economic and social inclusion, this security of underdeveloped and underutilized. Access to tenure would help people escape the vicious secure land rights eliminates threat of eviction, cycle of poverty. increases investment and improves agricultural productivity. Women’s property rights have been shown to improve children’s health and INNOVATION education, foster inclusive family decision- UAVs offer a new approach to producing making and reduce domestic violence. accurate, cost-effective and up-to-date maps and 3D computer-generated landscape models. There is urgent need to develop ways to identify More commonly known as drones, commercial and record land rights information far more UAVs are small and ultra-light, facilitating an quickly and cheaply than conventional methods. affordable mapping service through a process New technology is enabling the local capture of that takes days or weeks from planning to high-resolution geo-spatial data and processing product, rather than months or years. These into accurate maps. 
Combined with open-source maps can easily be disseminated and used by software programs, this provides huge potential citizens, local government, utility companies for a more cost-effective and inclusive approach and businesses, among others. to securing property rights. The project therefore aimed to produce faster, cheaper spatial data To refine the identification of land plots by UAV, using processed imagery from Unmanned the team built on a 2014 World Bank project Aerial Vehicles (UAVs) to help the majority of which successfully used UAVs for mapping The approach significantly reduces the cost and timescale of high-quality cadastral mapping, and empowers local communities to participate 20 BIG DATA innovation challenge and citizen engagement in defining property For their first UAV deployment, the team worked rights in Albania. They used drones to collect to support the property rights of women in aerial imagery and produce high-resolution Krushe e Madhe, where most of the men orthophotographs – aerial photographs and boys were killed in 1999 in the Balkan geometrically corrected for topographic relief, conflicts. The women have slowly rebuilt their lens distortion and camera tilt (‘orthorectified’), lives and have organized several agricultural so that their scale is uniform and there are cooperatives. However, official data from the no distortions. Unlike an uncorrected aerial Kosovo Cadastral Agency shows less than 10 photograph, an orthophotograph can be used percent of properties in Krushe e Madhe are to accurately measure distances. The team registered in a female name. This prevents is now combining the orthophotos with open- women from using their land as an effective source, customizable registration software to economic asset – in particular, as collateral record property rights information from citizens. for credit. 
The time, cost and complexity of Together, UAVs and the new software offer an conventional cadastral registration, along with innovative and cost-effective property mapping poor knowledge about their rights, often exclude and registration toolkit. the women from the benefits of registration. The process results in geo-referenced maps with To map Krushe e Madhe’s property boundaries, boundaries and information on ownership and the team used a Sensefly eBee fixed-wing UAV, use for each property. These can be distributed owned by the World Bank’s Innovation Labs. to each owner and the local community, The drone carries an 18 megapixel camera and allowing land owners to easily identify and verify flies within remote control contact at an altitude the boundaries of their properties. Although the of around 100 meters. It covers a predefined maps do not constitute formal registration, the area, prepared by setting survey ‘ground control team also works with government agencies on points’ (marked with spray paint) that ensure how the field information can be integrated into the highest accuracy of the maps. In a week, the the official cadaster or registration database. team completed 25 flights covering 12 square kilometers and processed the images locally, Securing women’s property rights with the help of the Kosovo Cadastral Agency, To scale the Albania pilot up to operational into high-resolution maps of around three levels, the team turned to Kosovo, where the centimeters per pixel. World Bank is helping the government produce a national cadaster system. Working with the Mapping cityscapes Kosovo Cadastral Agency, the team began The UAV was also deployed successfully in to integrate UAVs into the national mapping a fast-changing urban context. In the past program. This would significantly reduce the two decades many cities in Kosovo have cost and duration of cadaster development, and experienced rapid, unplanned expansion facilitate informed planning. 
The process would also empower local communities to participate. resulting in informal settlements, illegal constructions and chaotic development.

In response, the government recently introduced a program for land owners to legalize their property rights. To facilitate property registration in the city of Ferizaj, the team spent a day carrying out six flights, covering three square kilometers in a total of three hours’ flying time. The data were processed in 24 hours using two local high-end desktop computers, resulting in orthophoto maps with 1.9 cm resolution from which land owners can easily identify their property.

RESULTS
The project showed that UAV technology supported by customized open-source software can produce accurate, cost-effective and up-to-date maps and ownership information for the registration of property rights. This approach significantly reduces the cost and timescale of high-quality cadastral mapping activities, and empowers local communities to participate by identifying and verifying boundaries on the maps. In Krushe e Madhe, the team is using the new maps to help the women define their property boundaries, and working with officials to develop a system for completing official registration using the maps and community information on property ownership. The maps and digital elevation model produced in Ferizaj will be made available free of charge to citizens who are participating in the legalization program, as well as to the municipality and other authorities via the national Geoportal (which offers web access to maps and other geospatial information).

The project also proved the versatility of the UAV approach. While flying near the construction site for a new national highway, the team responded to a spontaneous request for assistance from a local official. The road crew had recently found an archaeological site, but existing aerial imagery and maps provided no evidence of it. Using the UAV, the team was able to plan, fly and process a high-resolution 3D map of the area in less than 24 hours. This provided accurate information for rerouting the road and preserving the cultural heritage site.

Other examples of UAVs’ potential include utility inventories, supervision of major infrastructure contracts, post-conflict or disaster response assessment, recording the rights of indigenous and vulnerable communities, and road engineering. The initiative can be scaled up globally, especially to secure land rights in developing countries. The results will also inform the global discussion on how to help people currently unable to register their land rights and how to build sustainable systems to identify the way land is used.

LESSONS LEARNED
The project confirmed the potential of UAVs for collecting geospatial information. Paradoxically, the use of big data and powerful technology can empower people at community levels.

• Use drones for a wide range of aerial imaging requirements
UAVs are becoming a more commonly accepted tool for producing high-resolution mapping products for targeted areas. They support ‘fit-for-purpose’ mapping principles, which hold that land administration should be designed to meet the needs of people and their relationship to land, support security of tenure for all, and sustainably manage land use and natural resources.

• Empower local communities through new technology and big data tools
Although a certain level of resources and capabilities is needed to process and manage big data such as the high-resolution geospatial information produced by UAVs, the project demonstrates how this new technology allows a more decentralized approach to traditional mapping. The potential to empower communities and local government to visualize their environment and make informed decisions can be significantly increased by using these new tools and technology.

Drones enable production of accurate orthophotos in less than 24 hours for applications such as supervising infrastructure contracts, disaster response assessment and mapping archaeological sites (above, in Kosovo)

Open Traffic: Easing Urban Congestion
Stephanie Debere
With thanks to Task Team Leader Holly Krambeck

Urban traffic congestion affects poorer people disproportionately, as they have longer commutes and suffer more from the health effects of pollution

The Open Traffic platform can generate travel time survey data without the cost of manual fieldwork and analysis

SUMMARY
Congestion has known negative impacts on economic growth and can exacerbate urban air pollution and greenhouse gas emissions. Effectively addressing congestion requires accurate traffic speed and flow data, but resource-constrained transport agencies struggle to collect these data, as modern tools tend to be financially and technically out of reach. In response, the Open Traffic program leverages open-source software and innovative partnerships to substantially reduce the cost of traditional traffic data collection and analysis, while simultaneously improving the quality. The first scalable, open-source program of its kind, the project built on work with the Cebu City Government in the Philippines to develop an open-source platform for collecting, visualizing and analyzing traffic speed data derived from taxi drivers’ smartphones.
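As a rough illustration of what "traffic speed data derived from smartphones" can mean in practice, the sketch below estimates a vehicle's average speed from two timestamped GPS fixes and averages several such observations. It is a minimal sketch, not the platform's actual code: the function names, the use of the haversine great-circle formula and the sample coordinates are all assumptions made for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def speed_kmh(fix_a, fix_b):
    """Average speed between two (lat, lon, datetime) fixes of one vehicle.

    Hypothetical helper: speed is distance divided by elapsed time.
    """
    lat1, lon1, t1 = fix_a
    lat2, lon2, t2 = fix_b
    hours = (t2 - t1).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("fixes must be in increasing time order")
    return haversine_km(lat1, lon1, lat2, lon2) / hours


def mean_speed_kmh(speeds):
    """Average of many single-vehicle speed observations."""
    if not speeds:
        raise ValueError("no observations")
    return sum(speeds) / len(speeds)


if __name__ == "__main__":
    # Made-up fixes two minutes apart, for illustration only.
    a = (10.3157, 123.8854, datetime(2016, 1, 4, 8, 0, 0))
    b = (10.3200, 123.8900, datetime(2016, 1, 4, 8, 2, 0))
    print(round(speed_kmh(a, b), 1), "km/h")
```

Straight-line distance between fixes understates the distance driven on curved roads, so a real system would measure along the mapped road segment instead.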
Using GPS data from an on-demand taxi service, Open Traffic successfully analyzed peak-hour congestion, travel time reliability and corridor vulnerability across 10 Southeast Asian cities, and has prepared travel time analyses for select origin-destination pairs. This analysis would not previously have been possible without substantial time and resources. It shows that the next generation of congestion management solutions will leapfrog the capital-intensive approaches of the past, enabling traffic management agencies to make affordable, evidence-based planning decisions. Open Traffic has now been deployed in Cebu City for live testing.

CHALLENGE
Urban traffic congestion affects poorer people disproportionately. They generally have longer commutes than the affluent and suffer more from the health effects of higher pollution, as many work outside. Congestion also generates excess greenhouse gas emissions, and it is often the poorest people who live in areas most vulnerable to climate change. Time lost in traffic jams also has a significant negative impact on urban GDP growth.

In many developing countries, decisions about traffic signal timing plans, public transit provision, roadway infrastructure, emergency traffic management and travel demand management are made without observed, quantified congestion or travel-time data. Such data is costly to collect and can also require substantial technical expertise to analyze. This causes avoidable congestion, as well as unnecessary fuel consumption.

In higher-income countries, transport agencies rely on a combination of manual survey methods and installed physical sensors – underground detector loops, pneumatic tubes, laser-based sensors, cameras and Bluetooth device detectors. However, these require initial capital outlays, ongoing maintenance and technical expertise beyond the capacity of poorer cities. They can also record data only in places where they are deployed – select corridors during select time periods. There is an urgent need for a viable, inexpensive alternative to traditional travel-time and congestion data collection and analysis. This would allow resource-constrained agencies to make evidence-based decisions to promote traffic flow.

INNOVATION
The project leveraged three trends to develop a traffic management system reliant on GPS data instead of fixed-location equipment: growth in global smartphone usage, the emergence of taxi-hailing app companies and increased use of open-source software.

Over a third of the world’s population is expected to have a smartphone by 2017. This has inadvertently created a new source of traffic data, derived from handset GPS signals and Wi-Fi pings. Viewed as traffic probes, smartphones can create a sensor network that is not restricted to specific corridors, is continuously updated in real time, requires no maintenance and provides a level of sampling unachievable through manual methods or equipment-based sensors.

Recently, international smartphone-based taxi-hailing app services have also emerged. These companies maintain databases of millions of urban GPS points, often spanning hundreds of cities across many countries. The project combined smartphone ‘sensors’ and taxi GPS databases to develop a single cloud-based traffic management application, which could support services in numerous cities simultaneously. By using open-source software, it offers unprecedented economies of scale in capturing and analyzing traffic data.

Linking disparate data sources
The initiative builds on a successful pilot in Cebu City, where the team created an open-source platform that uses GPS data generated by taxi drivers’ smartphones to derive meaningful statistics for traffic planning. The platform, called Open Traffic, is a graphical user interface allowing government agencies to easily query and visualize stored traffic statistics derived from GPS data collected from drivers’ phones.

The team partnered with Malaysia-based Grab, the largest taxi-hailing app company in Southeast Asia, to further develop and pilot the smartphone ‘traffic sensor’ approach. Through the partnership, traffic management agencies in Malaysia, Singapore, Indonesia, Vietnam, the Philippines and Thailand will have access to anonymized traffic data generated by 250,000 vehicles in Grab’s fleet, free of charge for at least the two-year pilot and scaling-up phases of the project.

The platform uses OpenStreetMap (OSM), a global geographic dataset populated by volunteers without cost or licensing requirements. The map may be freely updated and improved by transport agencies and others, using open-source editing tools. Its ‘Highway’ feature includes all OSM mapped roads, from unpaved rural tracks to expressways, covering much of the planet. Drawing on the OSM Highway, Open Traffic links average traffic speed calculations to OSM road segments via several steps:

• Open Traffic downloads the relevant portion of the global OSM map.
• It prepares the map sections by assigning virtual ‘detectors’ to every approach where road map segments intersect.
• It calculates the travel speed for a single vehicle traversing a road segment between two detectors, as the distance between the two detectors divided by the time the vehicle spent traveling between them.

From raw data to travel times
The estimated travel time for each road segment on a given trip is stored on a server. Neither raw GPS data nor information associated with a particular vehicle is retained. Data are stored as counts of observations for every hour of the day (how many at speeds of, for example, five, six or seven kilometers per hour, etc.). These stored values can be queried to calculate average traffic speed for different time specifications (a specific day, a specific hour each day, etc.) for single or multiple road segments.

Open Traffic can query the database of stored travel times by road segment to generate a map of average travel speeds for selected time periods. It also facilitates travel-time queries between select origin and destination pairs, either for automatically generated routes (based on the shortest path) or manually defined ones. A ‘confidence indicator’ is provided, based on the number of observations used to derive the travel time and average speed estimates.

RESULTS
Using GrabTaxi’s data from 10 major Southeast Asian cities, the team tested whether Open Traffic’s analytical results made intuitive sense. They used the platform to observe weekday peak and non-peak travel patterns in each city. These peak-hour graphs mostly reflected expectations about urban traffic, with travel speeds highest at night and slowest during commuting times. The results meant Open Traffic could be used to monitor the efficacy of congestion mitigation measures.

The team also tested the platform’s suitability for examining peak-traffic duration and variation, conducting travel-time surveys and understanding how externalities and traffic interventions affect traffic speed. Tests successfully examined the predictability of congestion, checking the consistency of expected travel times between origin and destination pairs along key corridors.

The Open Traffic platform could also be used to generate inputs such as travel-time survey data for traditional transportation planning, without the cost of fieldwork, encoding and analysis. The team compared manual survey data in Cebu City to Open Traffic data, and found that the Open Traffic results provide less variation between road segments. This is unsurprising, as the traditional survey represents only a single sample, whereas the Open Traffic dataset represents thousands of samples over the same time period.

These results show that the next generation of congestion management solutions will leapfrog the capital-intensive approaches of the past. They will enable traffic management agencies to make better, evidence-based decisions about traffic signal timings, public transport, road infrastructure, emergency traffic management and travel demand management.

The Open Traffic platform has now been deployed in Cebu City for live testing. Next steps include development of a methodology for optimizing traffic signal timing plans using GPS data instead of traditional sensors, as well as a standardized methodology for estimating the cost of congestion (in terms of fuel usage, greenhouse gas emissions and economic impact). Discussions are underway with Grab on launching the platform in other cities.

LESSONS LEARNED
Open Traffic illustrates that much of big data’s potential lies in combining existing disparate sources of information in unprecedented and innovative ways.

• Keep seeking the potential in new combinations of data
By bringing together existing independent data sources, the team was able to mine a rich new source of information. To exploit big data fully, it will be important to remain alert to the potential of combining disparate and even seemingly unrelated sources of data for analysis.

• Think about data storage from the beginning
Big data analytics can generate unwieldy amounts of data. Open Traffic data was initially stored as city-specific files in Amazon ‘buckets’, a cloud-storage service which facilitates automated uploading and downloading of data directly from and into applications and databases. However, this required Grab to set up specialized buckets for the project, and the process and file sizes soon became unwieldy. On the team’s recommendation, Grab began aggregating its global data as a single stream, rather than individual files, using Amazon’s Kinesis service, which can upload real-time data streams from multiple sources.

Observing People’s Feelings About State Institutions
Victoria L. Lemieux

To explore links between citizens’ feelings and civil unrest, the project analyzed tweets made during protests around the 2014 Soccer World Cup in Brazil

To help identify people’s sentiments, the team analyzed historical tweets and depicted the top 500 word-frequency terms in a text-cloud

SUMMARY
Where citizens lack means of voicing discontent with institutions and governments, they can embrace extreme forms of protest that quickly escalate. To explore the relationship between citizens’ feelings about governance institutions, their trust in government, and civil unrest, this project conducted sentiment analysis of tweets made during protests surrounding the 2014 Soccer World Cup in Brazil. The team harvested 11 million real-time tweets and used visual analytics techniques to extract search terms for the collection of relevant historic tweets. They used a visual analytics tool to carry out sentiment and text analysis, then followed a structured approach to explore how Brazilian citizens felt about their state institutions, how these feelings connected to their sentiments about government and politicians, and how such sentiments translated into collective behaviors.
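To give a feel for what sentiment classification produces, the sketch below combines separate positive and negative strength scores for one tweet into a single value and a readable label. It is a minimal sketch under stated assumptions, not the team's tool: the input ranges (positive strength 1 to 5, negative strength -1 to -5) and the function names are assumptions for illustration. Adding the two scores gives nine possible outcomes, from negative 4 through neutral to positive 4.

```python
def single_scale(positive, negative):
    """Combine two strength scores into one value between -4 and 4.

    Assumed inputs: positive runs from 1 (none) to 5 (strong),
    negative runs from -1 (none) to -5 (strong).
    """
    if not 1 <= positive <= 5:
        raise ValueError("positive strength must be between 1 and 5")
    if not -5 <= negative <= -1:
        raise ValueError("negative strength must be between -5 and -1")
    return positive + negative


def label(score):
    """Turn a combined score into a category name such as 'negative 3'."""
    if score == 0:
        return "neutral"
    polarity = "positive" if score > 0 else "negative"
    return f"{polarity} {abs(score)}"


if __name__ == "__main__":
    # Hypothetical scores for three tweets: (positive, negative).
    for pos, neg in [(1, -4), (3, -3), (5, -1)]:
        print(pos, neg, "->", label(single_scale(pos, neg)))
```

A single combined value is convenient for charting sentiment over time, at the price of hiding tweets that are strongly positive and strongly negative at once.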
The analysis showed citizens expressing negative sentiment about the national government’s low investment in services such as education, health and water, relative to lavish spending on the World Cup. The negative tweets were forms of social protest that may have led to other forms of protest, including demonstrations. These findings supported the relative deprivation theory of what causes social protest (the theory that people feel grievance leading to protest when they feel deprived relative to some external standard). They suggest domestic policy priorities for Brazil’s government and resonate with wider policy discussions on wealth inequality.

CHALLENGE
Civil protest has been described as moving from discontent of the populace, to politicization of that discontent, to actualization of frustration as aggression against the state. Such unrest can lead to fragile states, corruption, terrorism and economic impediments. It can be causally linked to poverty, but dissatisfaction with institutions can also be relevant. In countries where citizens lack means of voicing discontent with institutions and governments, they can embrace extreme forms of protest that quickly escalate – as demonstrated by the so-called ‘Arab Spring’.

Typical methods for studying internal conflict include qualitative case studies and econometric analyses. However, case studies can be dangerous and costly, and econometric approaches can overlook localized conditions and the evolving dynamics of protest. There is a need for granular information about local conditions and the causal chain of events, without the dangers and costs of case study approaches. Analysis of microblog data has the potential to deliver this information. This project sought to discover whether such analysis could uncover how citizens feel about their institutions and government, and how these sentiments translate into collective behavior.

INNOVATION
The team chose to analyze Twitter postings in Brazil during the 2014 Soccer World Cup. The country has relatively high levels of inequality and high social media use. It is the world’s second-biggest Twitter user (with roughly 41.2 million tweeters). Costing up to an estimated US$14 billion, the World Cup was the country’s then-largest and most expensive sporting event. During and afterwards, it sparked public protests in several Brazilian cities.

To explore whether the protests related to citizens’ trust in state institutions, the project produced an innovative visual analytics tool and a novel analytic methodology. Visual analytics combines human reasoning with machine reasoning through an interactive visual interface. This ‘mixed initiative’ approach overcomes the limitations of each type of reasoning. Human analysts can detect subtleties of humor or satire that a computer might miss, and computers can rapidly process and manipulate volumes of data that humans cannot.

How do you feel?
Through sentiment analysis and text analysis (machine reasoning) of a large-scale collection of tweets, the project tracked general public distress and trust in institutions, hypothesizing that increased negative sentiment signaled declining public trust in government. The team collected publicly available Twitter data in two phases. The first was an initial ‘big picture’ harvest of approximately 11 million tweets. The sample was then analyzed using Natural Language Processing and a visual analysis tool that clusters documents together by determining key themes in each. Collections of texts are displayed like a galaxy of stars, in which each star is a single document, and clusters represent document similarity. Through this, the team identified tweets on political opinion and analyzed them to define ‘naturalistic’ search terms representing key concepts (such as government and institutions) underlying the study. These search terms were then used to harvest historical tweets for the 2014 World Cup period.

The team performed sentiment classification of the harvested tweets using SentiStrength, a tool which classifies text in terms of positive and negative sentiment. It can also compute a single evaluation based on both positive and negative classifications. The project used this single-sentiment approach, resulting in nine sentiment categories: positive or negative with a magnitude of 1 to 4, and neutral. The results of the sentiment classification were visually presented as horizon charts, allowing analysts to see how sentiment varies in polarity and intensity over time. From a basic text analysis of historical tweets, they depicted the top 500 word-frequency terms in a text-cloud, as well as the use of single terms over time.

Translating sentiment into behavior
Results of the sentiment classification and text analyses were then represented visually, so analysts could detect patterns in the data (human reasoning). They used a structured analytic methodology, aided by the tool’s interactive features, such as the ability to search, sort and see data from different views. The sentiment analysis enabled the team to observe patterns in the data such as high negative or positive sentiment toward a particular institution, increasing or decreasing positive or negative sentiment over a period, or correlation of negative sentiment about an institution with rising negative sentiment about government as a whole. They explored correlations between patterns of sentiment, government policies and observed citizen behavior (such as protests).

Using the visual analytics tool, the team carried out a pair analysis to make basic observations from the data. This analysis pairs a subject-matter expert and a visual analytics expert, combining contextual knowledge with technical expertise. The tool was also used for an analysis of competing hypotheses, a methodology which analyzes the degree to which evidence supports the relative likelihood of alternative hypotheses. From background literature, the team identified 68 hypotheses on the relationship between citizen trust and social protest. The analysis drew tentative conclusions about the relative likelihood of each by trying to disprove rather than prove each hypothesis.

RESULTS
By connecting the visual analysis to hypotheses on the relationship between citizen trust and social protest, the study found support for the relative deprivation theory of social protest. The theory holds that protest arises when an individual or group lacks something that another group has and to which they feel entitled. Deprivation is felt in relation to some external standard, not in absolute terms.

The methodology showed that around the 2014 World Cup, Brazilians expressed negative sentiment about low investment in services such as education, health and water, relative to spending on the tournament. At state level, water was a key issue, with tweeters criticizing investment relative to politicians’ spending on priorities such as campaign financing. The negative tweets themselves constituted social protest, as well as forming part of larger politicized groups using protest hashtags and calling for other forms of protest (such as demonstrations).

While media reports and subsequent studies of the 2014 World Cup protests generally focused on single immediate causes, the analysis showed that the protests sprang from a range of long-standing grievances, coupled with relative deprivation triggered by spending on the World Cup and campaign financing. This sense of deprivation fueled sentiments that activated protest.

The project offers an innovative approach to investigating social and political issues. The methodology has already been used in a World Bank evaluation of higher education in Brazil, enabling analysts to link how citizens feel about higher education to existing hypotheses about education in developing countries. Evidence from online media data suggests further avenues for research using complementary methodologies, such as surveys or comparative analysis with other data sources. The approach could also support the development of an observatory of citizen sentiment to inform investment decisions around strengthening institutions. It could ultimately lead to predictive models for negative sentiment towards particular institutions.

LESSONS LEARNED
The project demonstrated the potential of big data analytics to contribute to knowledge about development, as well as for evaluating development outcomes.

• Combine human and computational power for optimum results
Visual analytics are useful for unpicking complex socio-political issues, such as those surrounding citizen trust. Alongside computational methods, analysis of online social data requires human input, supported by interactive visual interfaces.

• Build strong partnerships
Development of this approach required close collaboration with experts in fields such as Brazilian society, history and government. Using ‘design thinking’ techniques, the project team partnered with experts to test assumptions and obtain feedback on approaches and results. This improved the project outcome.

• Use big data analytics alongside other research approaches
Big data analytics makes observations possible from a distance, both in terms of space and time, and the project provided insight into Brazilians’ thoughts, as expressed naturalistically. In contrast, surveys require researchers to spend time on the ground, and may prime citizens with questions that do not reveal their own thoughts. However, this type of big data analytics does not offer a representative sample, and is best used to complement other approaches to understanding development issues.

• Protect individuals’ privacy
Twitter users do not consider use of their data for research when they tweet. Many tweets sent impulsively – especially during social protest – contain candid or critical remarks. Take care to protect citizens from potential harm, such as retaliation for expression of opinion or lawsuits for defamation.

Monitoring Rural Electrification from Space
Kwawu Mensan Gaba

The India.Nightlights platform visualizes data from satellite images, so each point on the map represents the light output of a specific village at a specific time

The approach shows the great potential of satellite-based monitoring to radically transform electrification planning

SUMMARY
Electricity is essential to human wellbeing worldwide, yet 1.2 billion people still live without it. Key to improving service provision is accurate tracking of the availability and supply of electricity at local level. By collecting and analyzing a unique historical archive of nighttime satellite imagery, this project developed a novel data-intensive strategy to improve the monitoring of electricity provision to rural areas across the developing world. Drawing on a multi-terabyte image archive spanning over 8,000 nights since 1993, the team used computationally intensive methods to extract and analyze patterns of light output observed nightly over all 600,000 villages in India. This pioneering dataset paints a dynamic portrait of rural energy access over two decades, enabling observation of how access to electricity has expanded, and identification of villages that remain dark. It also enables the detection of power supply irregularities.
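One simple way to picture the detection of supply irregularities from nightly light output is to flag nights on which a village is much darker than usual. The sketch below is a minimal illustration under stated assumptions, not the team's method: the function name, the threshold of two standard deviations and the sample values are all hypothetical.

```python
from statistics import mean, pstdev


def flag_dark_nights(nightly_light, k=2.0):
    """Return the indices of nights with unusually low light output.

    A night is flagged when its value falls more than k standard
    deviations below the mean of the whole series. Both the rule and
    the default k are illustrative assumptions.
    """
    if not nightly_light:
        return []
    mu = mean(nightly_light)
    sigma = pstdev(nightly_light)
    if sigma == 0:
        return []  # perfectly steady output, nothing to flag
    threshold = mu - k * sigma
    return [i for i, value in enumerate(nightly_light) if value < threshold]


if __name__ == "__main__":
    # Made-up light readings for one village over ten nights.
    series = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 0.0]
    print(flag_dark_nights(series))
```

Real satellite readings are also dimmed by cloud cover and moonlight, so a practical approach would need to account for those before treating a dark night as a power cut.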
These insights are particularly useful in rural and remote regions where traditional monitoring is difficult.

To make the data accessible to governments, power companies, regulatory agencies and other users, the team developed an online visualization platform, India.Nightlights. The site allows users to see how light output has evolved over two decades, from state level right down to individual villages. Ultimately, the approach could help optimize electrification planning according to an accurate understanding of electrical supplies on the ground.

CHALLENGE
In much of the world, access to electricity is uneven and irregular, undermining development and welfare. Rural electrification and lighting improvement projects are high on the development agenda. These are regularly monitored and evaluated, but there have been no mechanisms to track the sustainability of electrification schemes after projects end, or to identify easily and precisely who has electricity and who does not.

Data processing technologies are now enabling new ways to monitor access to electricity. Night lights data measured by satellite has been a useful resource for the development community for several years. However, the complexity of accessing, processing and manipulating this data has been a barrier to widespread use. While analysts have previously examined summaries or subsets of historical nighttime lights data, there has been no systematic effort to study the entire raw nightly data stream. This stream reveals the distribution of electricity at high resolution over the last two decades.

In 2011, a team from the University of Michigan, the US National Oceanic and Atmospheric Administration and the World Bank Group’s Energy and Extractives Global Practice began to explore how to use night lights data in a scalable, systematic way. Their early work focused on validating (‘ground-truthing’) the relationship between satellite-detected light output and the use and availability of electricity in several hundred villages across Senegal, Mali and Vietnam. The next step was to develop a strategy to exploit the detailed information from the full archive of nighttime satellite imagery to improve the monitoring of electricity supply around the world.

INNOVATION
The team’s pilot studies explored the use of night light data to monitor rural electrification in countries with low electrification rates, such as Mali and Senegal. This was expanded to Vietnam, which has near-universal electrification. Following the Big Data Innovation Challenge, the team refined its approach and scaled it up to look at all of India, a country with a high density of villages and a major rural electrification program. This project took two parallel tracks, in close collaboration:

• Mapping India’s power supply
With its high density of villages and a flagship national electrification strategy, India was an ideal country to assess the validity and reliability of a satellite-based approach for monitoring rural electrification over time. The first step was to evaluate the large-scale electrification program, launched in 2005 to bring power to over 100,000 villages.

The team acquired the complete historical archive of nighttime satellite imagery from the Defense Meteorological Satellite Program, run by the National Oceanic and Atmospheric Administration. This has taken pictures of the Earth every night for over 20 years, creating an archive of multiple terabytes of high-resolution image data. Using geographic information systems (GIS) and data processing tools, the team analyzed the nightly light signatures of India’s 600,000 villages (identified by geographical coordinates). The resulting dataset of almost 5 billion observations represents the most comprehensive database known describing electricity access and variability. It enables new analysis, exploration of signal-processing techniques and generation of data visualizations that better capture space and time patterns in electricity distribution.

Drawing on official electrification program records, the project linked newly electrified villages to their nighttime light signatures, covering around 8,000 nights during a 21-year period (1993-2013). This enabled verification of improvements to electrical supply, and identification of potential implementation problems.

The approach is a departure from prior research on nighttime lights. Most analysis uses annual composite images, which describe the average brightness of a locality over a calendar year. Yet in India and elsewhere, day-to-day variability in access to electricity is a far larger concern. By applying statistical and machine learning techniques, the team developed new methods to visualize patterns of supply disruptions. One objective is to use variability in light output data to identify regional instability in power supply, increased incidences of power cuts, and indications of electrical supply problems as they occur.

• Creating visual tools
Building on the insights gained from the satellite data study of India, the team developed an online toolkit to provide power companies, regulatory agencies and relevant partners with geo-referenced maps and satellite imagery depicting current patterns and recent trends in electricity supply. It visualized electrification trends on a web platform, India.Nightlights.io.

The open-source platform comprises a pipeline to process massive amounts of data, an application programming interface that enables technical partners to query light output at village, district, region or state levels across India, and a dashboard map to allow users to explore light output trends. The platform offers high-level overviews or enables users to compare villages, plot trends and share data. Freely accessible from any part of the world, it has the potential to be a powerful tool in driving rapid electrification.

Each point on the map of India represents the light output of a specific village at a specific time. At district level, users can filter to view villages that have participated in India’s electrification program and see changes in light output, which can be used to complement research about electrification in the country.

The platform was tested by various users involved in expanding electricity supply, including private firms, universities, regional governments, non-governmental organizations and development partners. It was then refined ahead of a public launch at the World Economic Forum in Davos, Switzerland, in January 2016.

RESULTS
The project demonstrated that nighttime satellite imagery can be reliably used to detect the use of electricity in the developing world, even in rural contexts where electricity use is characterized by low power loads, small numbers of dispersed users, limited infrastructure and erratic service provision.

India.Nightlights presents an online platform that enables visualizations and interactive exploration of the night lights data over India. The team now wants to refine the platform, build new capabilities and generate nuanced reports to meet the myriad needs of potential country-level beneficiaries. It also wants to see how this approach could be replicated across the developing world.

The platform shows the great potential of satellite-based monitoring to radically transform rural electrification planning and assessment. Drawing on satellite-based data will sharpen program targeting in the village selection process, improve implementation assessment and allow ongoing monitoring by interested parties after projects have officially closed.

Next steps include promoting adoption of the tool and exploring where to focus future electrification efforts. The team is looking at where electrification has been successful, what other variables are related to faster electrification, and whether other development indicators can be added to the platform’s dashboard. The tool could be used in poverty analysis, as the geo-referenced data can be combined with other databases, drawing more correlations between electricity access and development outcomes.

LESSONS LEARNED
The project’s success rests on the importance of validation when pioneering big data approaches, and of persevering when faced with hurdles.

• Validate novel data approaches thoroughly
Rigorous validation or ‘ground-truthing’ is imperative to establish confidence in novel data sources and new methods. Because the team was able to demonstrate in its earlier work in Senegal, Mali and Vietnam the strong correlation between light outputs captured in satellite imagery and electricity supply on the ground, the project generated confidence in its approach.

• Take the long view
Perseverance is essential. The overall process took five years to reach the stage of publicly launching the web platform. The team had to overcome several hurdles in the process, in particular data availability and the processing required to ensure quality data.

• Partnerships are crucial for success
Working with other organizations brings valuable insight from new perspectives, and can generate solutions from unexpected quarters. In Vietnam, for example, the agency that ran the survey also provided access to electricity consumption data not publicly available. For the visualization platform, the team ran a competition to select the website developer, having benefited from free technical assistance from the GIS mapping software company and contributions from other World Bank global practices to formulate the terms of reference.

www.india.nightlights.io

Mapping Poverty by Satellite
David Newhouse

Preliminary results show that in Sri Lanka, high-resolution satellite indicators such as roof type (top) and density of cars (below) track poverty, based on census estimates, extremely well

Satellite-based indicators improved the accuracy of poverty prediction in Sri Lanka (which has few existing poverty indicators) more than in Pakistan (which has many)

SUMMARY
Poverty must be located accurately if development interventions are to be effectively targeted and monitored. However, expensive data collection and processing mean national poverty estimates in developing countries are often outdated. In response, this project explored the use of indicators derived from satellite data to predict geographic variations in poverty. The first component examined how well publicly available low-resolution satellite indicators such as nighttime lights and land type contribute to poverty prediction, and how this depends on the method used to build the prediction model.
When satellite indicators were applied to the models in Pakistan, which is unusual in that it can generate district poverty estimates from a detailed household survey, they did not improve the accuracy of predictions. The rich survey information meant satellite-based indicators contributed nothing new. However, in Sri Lanka, which is more typical in generating poverty estimates from a census (meaning fewer indicators), even freely available satellite indicators improved the accuracy of predictions. In both cases, the team found models selected using an innovative statistical technique called Lasso to work best for predicting poverty at a more local level, with sizeable benefits when there are many variables. When high-resolution satellite data indicators (such as cars, built-up area, shadows, roof type and road type) were combined with Sri Lankan census data, preliminary results showed that satellite indicators track regional differences in poverty extremely well. They demonstrate that high-resolution satellite imagery is a valuable complement to household survey data, with potential to help generate more accurate and updated local poverty maps and refine targeting in development initiatives.

CHALLENGE
Development interventions can be more effectively targeted and monitored if poverty can be located more precisely. However, long lags in processing and report-writing mean that recent national estimates of poverty in developing countries are often several years old. In addition, local-level estimates require census data that is expensive and collected infrequently. Big data, in the form of satellite imagery, has so far been largely untapped by policymakers wanting to understand where exactly the poorest people live.

Little is known about which satellite-based indicators help predict poverty, and there is uncertainty around the best way to build a prediction model. Although numerous models have been developed, there has been little rigorous comparison of different approaches.

This project aimed to assess different approaches to poverty mapping, as well as the extent to which high-resolution satellite imagery can be used to generate more accurate poverty estimates. Incorporating satellite data into poverty mapping is a first step towards using the wealth of non-traditional data generated daily to predict poverty more effectively.

Satellite-based data analysis is particularly attractive because it can see a complete picture of a particular area, unlike, for example, mobile phone-based analysis, which typically captures only a subset of phone users. Satellite data can also be collected frequently at fine geographic levels, even in conflict areas not conducive to surveys. Using this data to better understand poverty can help development practitioners target interventions and evaluate their effectiveness more accurately. Satellite-enhanced maps would be a key step towards the goal of real-time estimates of how pockets of poverty are evolving.

Satellite imagery gives insights into local-level factors such as the scale of urbanization, infrastructure and natural resources

INNOVATION
The study involved two stages:
• The first compared different methods of generating poverty prediction models, in both Pakistan (which has existing data for many poverty indicators) and Sri Lanka (where fewer are available).
• The team then examined how high-resolution satellite data indicators are correlated with poverty predictions based on the 2011 Census, for local administrative areas in Sri Lanka.

Assessing poverty prediction models
To examine different approaches to developing poverty prediction models, the team applied out-of-sample validation techniques to household data from Pakistan and Sri Lanka. A randomly selected portion of the sample was repeatedly withheld when generating the prediction model, and accuracy was assessed by comparing extrapolated poverty rates from the prediction model to actual poverty rates in those withheld areas. This technique was used to compare the accuracy of models derived using manual selection, stepwise regression and Lasso-based procedures. The team also augmented the set of prediction variables with publicly available low-resolution satellite data to see whether this improved traditional poverty mapping techniques.

Applying higher-resolution imagery
In the second stage, the team purchased high-resolution (0.5m per pixel) satellite imagery covering approximately 5 percent of Sri Lanka, including both rural and urban areas, and containing roughly 1,400 local administrative divisions. They used multispectral imagery (multiple images taken at varying spectrum wavelengths) to capture variations in roof texture and surface material, enabling far more accurate identification of possible correlates of income. Novel methods are also emerging to detect smaller objects such as cars from such imagery. These additional predictors have not yet been tried in poverty mapping models. Because some indicators, such as car traffic, can change rapidly as economic growth occurs, these additional predictors could pave the way towards more frequent poverty estimates in the future.

Contractors produced pan-sharpened mosaics (merging several smaller scenes) of the raw high-resolution imagery. The team then worked with experts to develop detection algorithms to identify possible poverty predictors that can be extracted from high-resolution satellite data. These include built-up area, building and car density, type of roofing, amounts of shadow, road type and agricultural land-use. Using open-source image processing algorithms, the team also calculated whether buildings were more rectangular or had more chaotic angles (indicating higher poverty) and constructed indicators such as the share of paved roads or built-up area. They assessed the density of each feature in each local district division and correlated the satellite-based measures with poverty estimates from the 2011 census data. This provided a measure of which indicators correlate most strongly with predicted poverty and other measures of economic welfare from the census. The process demonstrates the potential of high-resolution satellite data to capture new indicators correlated to poverty.

RESULTS
The assessment of poverty prediction models showed that the number of poverty indicators affects the performance of different models and the usefulness of adding satellite data to the indicators. In Pakistan, with many potential indicators, the team found that Lasso models outperform both discretionary and stepwise models. However, Lasso and stepwise models give comparable results in Sri Lanka, where the set of indicators is smaller. The accuracy of the prediction model also depends considerably on the poverty threshold. In Sri Lanka, models were better able to predict the bottom 40 percent than the bottom 10 percent, but in Pakistan the reverse was true. In Sri Lanka, including publicly available satellite data made poverty predictions more accurate, but in Pakistan, the satellite data makes predictions slightly less accurate. When the satellite data is included in Sri Lanka, the Lasso models significantly outperform the manual and stepwise models.

Overall, the team found Lasso-based models are preferred for generating poverty predictions, and that the benefits can be sizeable when the pool of candidate variables is large, as in the case of Pakistan and of Sri Lanka when satellite indicators are included. There is strong interest in testing different poverty modeling approaches and the research highlighted the value of using publicly-available satellite data to generate small-area estimates of poverty in contexts where census data is limited.

Understanding pictures of poverty
For the Sri Lanka study using high-resolution imagery, preliminary results show that indicators track regional differences in poverty, based on estimates from the census, extremely well. When looking at the full sample, the key indicators relate to building density and urbanization, including the number of buildings, a vegetation index, and in rural areas, the share of roads that are paved, shadow, and type of roof. But when predicting variation in poverty in local areas within urban areas, the number of cars and an abstract measure of rectangular buildings become strong predictors as well.

These preliminary results demonstrate that satellite-based data is a valuable complement to household survey data, strengthening the case for investing in high-resolution imagery to monitor poverty more generally, as well as project impacts. This approach is the first step of an exciting research agenda. Imagery can deliver new insights related to a variety of development challenges, such as the scale of urbanization, infrastructure and the state of natural resources. Much more work is needed to explore which indicators best track local variations in poverty in a variety of contexts. These might include building density, roads, agricultural land or forest cover. More analysis is also needed to better understand the tradeoff between the quality and cost of the imagery on the one hand, and its benefits in terms of predicting local variation in poverty on the other. Eventually, satellite-based imagery could also be a valuable tool in improving the measurement of inequality, monitoring development projects and ‘nowcasting’ poverty rates.

LESSONS LEARNED
As the price of high-resolution imagery continues to fall and coverage improves, satellite-based data will become an increasingly useful source of information about welfare in developing countries.

• Drive research into mainstreaming the use of satellite imagery in poverty measurement
Most poverty economists are unaware of satellite technology’s potential to improve small area estimates. There is not yet sufficient evidence to mainstream the use of satellite imagery to improve poverty maps based on census data, but this project has increased awareness. Beyond poverty prediction, satellite imagery can help efforts to better understand poverty. For example, road network estimates can indicate whether new road construction benefits the poor.

• Allocate sufficient resources to sourcing raw satellite imagery
Navigating the market for high-resolution satellite imagery and developing relationships with vendors and processing experts can take longer than anticipated. Ensure adequate project funding and planning to allow for the time this can take.

• Watch for future potential in satellite imagery
The accuracy and timeliness of satellite-derived poverty maps will continue to improve. More work is needed to see which satellite-based measures best predict poverty, both spatially and across time. There is broad scope for fruitful collaboration between poverty economists and geo-spatial image experts.

Understanding How Infrastructure Affects Crime
Camila Rodríguez and Andrés Villaveces

Using geo-coded data to generate risk terrain models, the project identified specific urban features associated with violent crime in Bogotá

Risk terrain models help diagnose why crimes have clustered at certain places and forecast where they are likely to occur

SUMMARY
Latin America is highly urbanized, with above-average crime rates.
Its cities are typically unplanned, with high socioeconomic inequality, yet the association between crime and infrastructure has not been clearly defined or quantified. Colombia’s capital, Bogotá, collects considerable geo-coded data on urban infrastructure and has reliable geo-coded information on population and crime. The recent development of the world’s largest Bus Rapid Transit system has led to the modification of infrastructure in several parts of Bogotá. These changes present an opportunity for studying the association between crime and infrastructure. Drawing on rich data, this project quantified the occurrence of crimes in relation to specific characteristics in the built environment. Through risk terrain modeling (RTM), the team identified locations near public hospitals, schools, drugstores and bus stations as being associated with assault and homicide. The modeling also revealed peak times of day for crime, and predicted areas of the city more likely to experience future crime. Combined with local stakeholder perspectives, RTM analyses can reliably suggest action to reduce crime associated with particular environmental factors. The methods are widely applicable in other locations and for other crimes.

CHALLENGE
Latin America is one of the world’s most urbanized regions, with crime levels higher than the global average. Urban violence and crime disproportionately affect young, economically active populations, which are the region’s largest segment. Yet the effects of urban characteristics on crime in Latin American cities are little studied. Understanding of this relationship could inform urban planning that helps deter crime.

Among initiatives to address Bogotá’s transport problems has been the creation of the world’s largest Bus Rapid Transit (BRT) system. The development of the system’s trunk lines includes several modifications to the cityscape. This project sought to evaluate the association between crime and infrastructure in Bogotá using the BRT trunk routes as an intervention and comparing them with other areas. Carried out in collaboration with Rutgers University, the study aimed to generate reliable estimates of risk for different crimes around BRT stations and throughout Bogotá. If it could quantify crime variations and their relationship with population flows through the different bus stations, this would help city authorities understand the impact of urban infrastructure modifications in relation to crime. Such information could inform future urban planning to help create environments that are non-conducive or even inhibitive to crime. The study also sought to highlight and estimate risks of crime in 18 different urban features, including schools, libraries, tourist attractions, bridges and clinics. These findings would have implications for cities elsewhere in the world, especially those with similar BRT systems.

Clusters of crime can be partially explained by landscape features that attract criminal behavior at certain times

INNOVATION
To capture Bogotá’s overall characteristics, the team drew on rich infrastructure data, including land-use information, service network data (gas, water, sewage) and city block and street audit information captured via previously validated tools. Geo-coded crime data from 2012 then allowed them to estimate the correlation of homicide and assaults with overall city infrastructure features, land-use patterns, socio-economic level, cadastral information, socio-economic survey data and BRT or non-BRT sections of the city. The BRT analyses included data on population flows through stations to understand crime variations related to population density at specific times of the day.

The crime data were analyzed in different ways. The team first conducted a Nearest Neighbor analysis to assess clustering within the distribution of crimes. The results suggested that the distribution of crimes in Bogotá is significantly clustered. Kernel density mapping was then used to identify where, at more localized places within the study area, the highest concentrations of incidents of crime occur. Using hotspot analysis, the team found many micro-level locations around which crimes cluster. This patterning was statistically significant.

Drawing on the findings from these exercises and combining them with the infrastructural data for Bogotá, the project generated 16 different risk terrain models (RTMs) for assault incidents across the city. This modeling process uses a specific algorithm that identifies relationships between different layers of data and correlates them with crime using count regression models, which are then linked to places on a digitized map. The approach represents spatial influences of crime risk factors as common geographic units, then combines separate map layers (one per risk) to produce maps showing the intensity of all risk factors at every location throughout a landscape: the ‘risk terrain’. Risk terrain maps show where conditions are most conducive to crime. They help diagnose why crimes have clustered at certain places, and can help forecast where they are likely to occur in the future.

The 16 risk terrain models covered:

• The entire city for one calendar year
City-wide risk terrain models suggest that different economic strata (Low, Medium and High) of Bogotá correlate with crime incident locations, but each stratum influences assault and homicide differently. Low stratum blocks have an increased likelihood of both homicide and assault. Proximity to drug stores and medical clinics presented consistently high risk, with at least 50 percent greater likelihood of either assault or homicide. Places close to hospitals, public schools or BRT stations also presented significantly higher risks of assault or homicide compared to places lacking the defining landscape features of these public facilities (such as large pedestrian access routes).

• Different economic strata for one calendar year
To understand temporal and spatial incentives to criminal behavior within each stratum, the team performed hourly-based RTM analyses. A heat map summarizing the incidence of crimes by hour and day of the week indicated that violent crime clusters at certain times, peaking from Saturday evening until Sunday dawn.

• Highest concentration of assault incidents
Several RTM analyses explore these peak times in greater detail. The results suggest that proximity to drugstores and medical clinics increases the risk of assault and homicide significantly in low stratum sections of Bogotá. In medium stratum sections, proximity to public schools increases the likelihood of assault. Overall, crimes tend to occur mostly in the evenings.

• Peak and off-peak hours
Hourly-based risk terrain models were carried out by stratum, for peak and off-peak hours on the BRT network. These suggest that the risk of assault more than doubles near medical clinics during peak hours in lower stratum sections. In medium stratum sections, the likelihood of assault increases near private schools. The likelihood of homicide more than quadruples near BRT stations during peak hours in low stratum sections. The results suggest that crimes occur in different places at different times of the day.

• Gender-based maps for one calendar year
Gender-based risk terrain models were generated for each economic stratum to analyze the locations of assaults on female and male victims. The results suggest that proximity to medical clinics and drugstores in low stratum sections almost doubles the likelihood of assault for both male and female victims. In medium stratum sections, being near tourist attractions increases the risk of assault on men, while proximity to private schools increases the likelihood of women being assaulted more than fivefold.

RESULTS
The results indicated that incidents of both homicide and assault cluster at certain places. This can be partially explained by certain features of the landscape that attract criminal behavior at certain times, linked to changing density of people. Such crime patterns occur beyond random chance and are statistically significant.

Findings were presented to key local and national stakeholders in Bogotá, who fed back useful information for contextualizing the results. They suggested the clustering of homicide incidents during peak hours near BRT stations in low stratum sections was related to known criminal activities organized by certain ‘street vendors’ near a few BRT stations. They also explained that many people go to drugstores to buy prescription drugs unobtainable at hospitals. Offenders near these locations may see these people with cash as potential targets. Drug micro-trafficking may also be a factor, particularly in lower stratum sections.

RTM offers insights into the spatial dynamics of crime and how to mitigate key factors leading to criminal behavior, for example, by environmental modifications that improve passive surveillance and law enforcement activities. The approach can be applied to other crimes, such as residential or commercial robbery and vehicle thefts. The team has encouraged various entities to perform their own RTM analyses. Local stakeholders have already used the methodology to create maps of drug micro-trafficking in five Colombian cities for the Ministry of Justice. Their insider perspective increases the reliability and practical value of the RTM analyses.

LESSONS LEARNED
The project clearly showed how big data analytics reveals correlations which inform strategies to reduce crime.

• Draw on big data to inform decision-making
RTM is useful as a reliable diagnostic tool that can evaluate patterns of crime and orient prevention and enforcement activities more efficiently, including towards specific areas and at specific times. Being cross-sectional, the methodology established certain correlations, rather than causal associations, but in doing so it highlighted certain features of the environment that otherwise would not easily be detected by simpler statistical modeling.

• Use big data analytics to assess disparate data sources
Big data approaches are useful for integrating disparate but complementary information to provide a very rich environment that can be analyzed to respond to key questions, in this case about crime. Analyzing big data can also establish emergent patterns of correlation or association which can be related to specific outcomes of interest.

• Involve local stakeholders to help explain findings
Insider perspectives and content analysis with local stakeholders help identify potentially plausible (though not causal) explanations for crime, given their knowledge of the context. Working with local stakeholders is important for generating explanatory hypotheses.
Stakeholders helped interpret the risk terrain maps, suggesting that the clustering of homicides near bus stations was linked to known criminal activity

Revamping Road Condition and Safety Monitoring with Smartphones
Wei Winnie Wang

Well-kept roads are needed to connect people to public amenities and reduce travel time, vehicle operation costs and crash risks

The RoadLab app uses accelerometers in smartphones to evaluate the roughness of road surfaces and identify major damage

RoadLab visualizes road condition in Google Maps

SUMMARY
Road agencies constantly face the challenge of developing cost-effective asset management strategies despite limited resources and modest understanding of road infrastructure conditions from users’ perspectives. This project piloted an innovative solution through an app called RoadLab, which uses the accelerometers in smartphones carried in moving vehicles. The app automatically evaluates the roughness of road surfaces and identifies major damage, such as potholes. It also allows road users to manually submit reports of road accidents or safety hazards, along with precise GPS information. The crowdsourced data is analyzed to extract information on the condition of the road surfaces users travel over. Developed in collaboration with the Belarus authorities, RoadLab gives road agencies comprehensive and frequent information on road surface condition over wide areas. This enables them to prioritize investments effectively for maintaining infrastructure, and to assess road surfaces before and after maintenance work. By promoting citizen engagement and enabling road agencies to respond more effectively to users’ concerns, this approach also enhances government accountability. Built with the aim of scaling up, RoadLab can easily be adapted to other countries and has the potential to become a key input for road network management worldwide.
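As a rough illustration of how accelerometer readings from a moving vehicle could be summarized into a road-condition rating, the sketch below groups readings into road segments and rates each one. The 100-meter segment length and the rule of discarding readings taken below 30 km per hour follow the project’s own description of the app. Everything else is an assumption made for illustration: the use of root-mean-square vertical acceleration as the roughness measure, the threshold values, the input layout and all function names. The app itself is described as using regression models tied to the International Roughness Index, which this sketch does not attempt to reproduce.

```python
import math
from collections import defaultdict

SEGMENT_LENGTH_M = 100.0  # segment length reported by the project
MIN_SPEED_KMH = 30.0      # slower readings are discarded, as the project describes

# Hypothetical upper limits on RMS vertical acceleration (m/s^2).
# Road agencies would set their own values.
THRESHOLDS = [(0.5, "excellent"), (1.0, "good"), (2.0, "fair")]


def classify(rms_accel):
    """Map a roughness value to one of four condition labels."""
    for limit, label in THRESHOLDS:
        if rms_accel < limit:
            return label
    return "poor"


def summarize_segments(readings):
    """Aggregate readings by road segment.

    readings: iterable of (distance_m, speed_kmh, vertical_accel_ms2), where
    vertical_accel_ms2 is assumed to have gravity already removed.
    Returns {segment_index: {"mean_speed_kmh", "rms_accel", "condition"}}.
    """
    buckets = defaultdict(list)
    for distance_m, speed_kmh, accel in readings:
        if speed_kmh < MIN_SPEED_KMH:
            continue  # low-speed readings are treated as unreliable
        buckets[int(distance_m // SEGMENT_LENGTH_M)].append((speed_kmh, accel))

    summary = {}
    for segment, rows in sorted(buckets.items()):
        rms = math.sqrt(sum(a * a for _, a in rows) / len(rows))
        mean_speed = sum(s for s, _ in rows) / len(rows)
        summary[segment] = {
            "mean_speed_kmh": mean_speed,
            "rms_accel": rms,
            "condition": classify(rms),
        }
    return summary


# Toy readings (invented values, not project data).
readings = [
    (10.0, 50.0, 0.2), (60.0, 52.0, -0.3),    # segment 0: smooth surface
    (120.0, 20.0, 3.0),                       # too slow, discarded
    (150.0, 45.0, 2.5), (180.0, 47.0, -2.5),  # segment 1: rough surface
]
result = summarize_segments(readings)
for segment, stats in result.items():
    print(segment, stats["condition"], round(stats["rms_accel"], 3))
```

Keeping the thresholds in one configurable table mirrors the report’s point that what counts as a poor surface varies across agencies and countries.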
CHALLENGE
Well-kept roads connect people to public amenities and reduce travel time, vehicle operation costs and crash risks. In order to maintain road networks, government agencies must develop cost-effective asset management strategies, but many have only limited resources and poor understanding of road infrastructure conditions from road users’ perspectives. Potholes, rutted surfaces and missing manhole covers are among the hazards identified by road authorities for assessing road surface condition and informing decisions on the maintenance of road assets. However, collecting information on such hazards through conventional methods is costly and time-consuming, involving engineers physically identifying locations that require maintenance or examining surface roughness using road surface profilers. This requires significant resources in both time and labor, especially for large road networks.

As a result, road agencies often pay inadequate attention to road asset management. To maintain safe, efficient road networks and to improve infrastructure, the authorities need more affordable, uniform and immediate data on road condition. One approach is to harness road users to report the surface conditions and safety issues they encounter, simply by using an app in their smartphones as they travel by car. Automatic data collection with accelerometers in smartphones can give road agencies large amounts of information from the road users’ perspective. This can facilitate quicker and more effective decisions on road asset management. However, existing road-reporting apps faced challenges affecting the reliability of their readings, such as what parameters to include in the evaluation algorithm, and the accuracy of their vertical acceleration detection. The need was for a reliable smartphone app that addressed these common issues and provided valuable reference for future app development.

INNOVATION
The project developed a smartphone app called ‘RoadLab,’ which in effect harnesses moving vehicles as probes that detect real-time road conditions by using smartphone accelerometers to monitor and report the roughness of travel over stretches of road. The app was developed in close collaboration with the national road management agency in Belarus, alongside the World Bank-supported development of the country’s Traffic and Road Safety Coordination Center.

Learning from previous approaches
Through a review of existing road-reporting apps, the team identified key factors often overlooked but affecting the accuracy of road roughness estimations. These included the position of smartphones (especially if changed during driving) and the vehicle’s speed and suspension type. They realized that the filtering and smoothing of raw data would need to be conducted carefully through machine learning to accurately differentiate bumps from abrupt braking, a swing of the vehicle or the user moving their phone. In addition, repeated reporting of extraordinary vertical acceleration values at the same location by multiple users could be used as a supplemental tool to the data processing and filtering model to identify abnormal road conditions.

To develop the app, the team divided roads into 100-meter segments, with GPS coordinates at the start and end of each. The system gathers and analyzes data from smartphones, and calculates the vertical acceleration and average speed within each segment. Regression models then link the road surface condition with vertical acceleration and speed, and estimate the roughness in line with the global International Roughness Index, commonly used for measuring the roughness of road surfaces. This way, RoadLab estimates can be directly linked with existing road roughness measurements for comparison and updates. Reflecting the practice of Belarus road management agencies, RoadLab automatically categorizes road surface condition as excellent, good, fair or poor. It allows road agencies to set these threshold values themselves, given that what constitutes a poor surface or a major bump is highly subjective, and standards are also likely to vary across road agencies and countries. This allows maximum adaptation to local contexts.

RoadLab users must place the smartphone on a stable surface in a moving vehicle, such as the dashboard or mounted vertically in a cradle on the windshield. The app will automatically detect phone positions and strength of GPS signals and remind the user to place the device correctly in order to obtain reliable data.

A global solution
Several adaptations were made as a result of field tests to verify road conditions reported by the app. It was found that when the vehicle speed is lower than 30 km per hour, smartphone accelerometer readings are less sensitive to road surface, therefore such readings are discarded to avoid false reporting. Similarly, when the smartphone is put in a pocket or directly on the seat of a moving vehicle, the correlation between accelerometer readings and road conditions and bumps is not accurate, so these readings too are excluded. These adaptations also help prevent overloading road agencies with raw data.

Although developed in Belarus, the app was designed as a global solution for road asset management. It was built with parameters easily adapted to suit other countries.

RESULTS
Belarus road agencies have readily adopted the RoadLab app to screen road surface conditions. The approach is much more cost-effective than traditional road-monitoring methods, and the authorities are currently working on integrating it into their own road asset management database and the system of the Traffic and Road Safety Coordination Center.

The app was tested by engineers from the Belarus road management agency, before being launched to the general public in the capital Minsk through a public campaign, including posters distributed to car-owners’ clubs, whose members are keen to improve road conditions. Within 10 days, the team was able to collect useful road surface information for 3,000 km of road. Analysis of this data compared with the International Roughness Index showed that the estimation from the smartphone app was reasonable. Despite the limited sample data size, the exercise clearly demonstrated the value of a big data approach to road surface analysis.

Following the pilot, minor refinements are being made to the app, such as inclusion of a chart showing measured road surface for the last 10 km surveyed. A future option to make RoadLab more attractive to the general public might be to combine the standalone app with other apps that are more practical for travelers, such as navigation systems.

RoadLab gives road agencies the tools to transform their approach to road asset management. It delivers comprehensive and frequent information on road surface conditions over wide areas, enabling them to prioritize investments effectively for maintaining road infrastructure and to assess road surfaces before and after maintenance work. By promoting citizen engagement and enabling road agencies to respond more effectively to road users’ concerns, the app also enhances government accountability.

The team is currently disseminating this innovative approach to other countries and regions for replication. With increasing usage of smartphones, they expect the initiative to grow into a key input for road network management worldwide.

LESSONS LEARNED
The RoadLab project underlined the importance of close client consultation and of thinking in new ways to see links and potential in existing resources.

• Consult clients throughout the project cycle
Successful client engagement was key to the project’s success. By engaging and consulting with the Belarus road authorities, the team benefited from valuable end-user inputs, as well as developing the authorities’ knowledge and ownership of the approach, making them willing and able to sustain it.

• Teach how to fish instead of giving fish
The project showed the importance of building clients’ technical capacity and skills, so innovative approaches can be sustained in the long run and have real impact on communities.

• Approach existing situations from new angles
Researchers have used standalone accelerometers in moving vehicles to evaluate road roughness for decades. By linking wide smartphone ownership, people’s use of phones for navigation, and the fact they all have accelerometers embedded anyway, a picture emerges of how phones could be used to extract information useful for road agencies. Through thinking more widely about GPS usage, the team is also now adding a tracing function to the app, so road networks can automatically be mapped digitally.

BIG DATA Projects to watch
Several of the Innovation Challenge winners and finalists are still awaiting the full results of their big data projects. But they all offer valuable examples of ways in which big data approaches can help improve development programs, and are worth watching for their results in coming months.

Targeting Poverty by Predicting Poverty
Melissa Adelman

Targeting errors are common in development programs, undermining their efficacy.
By applying machine learning techniques to datasets commonly used for targeting poverty, this project seeks to improve methodologies for identifying the poor, as well as who will benefit most from particular interventions. With targeting integral to so many programs, improvements could have large impacts on development outcomes.

CHALLENGE
Targeted transfers are central to nearly every anti-poverty program, yet misclassifications are common, reducing program efficacy. Targeting poverty is also more than simply targeting the poor. Optimal targeting requires consideration both of who is poor and who is likely to benefit from an intervention. Someone for whom the expected benefit is large may be selected ahead of someone slightly poorer but for whom the expected benefit is small.

This project aims to improve methodologies for identifying the poor – in particular, by testing whether machine learning techniques can reduce targeting errors. These tools can also be applied to predicting the degree of benefit for different types of recipient, and to estimate the targeting method which maximizes poverty reduction. With targeting integral to so many programs, even small improvements could have large impacts on development outcomes.

INNOVATION
The project views targeting as the sort of simple prediction problem which machine learning tools are designed to address. Existing targeting rules typically predict low consumption via a Proxy Means Test, a model which combines household characteristics, such as roof material or television ownership, to predict poverty. These models are deeply linear, yet the real world they predict is not. Household characteristics may interact in extremely complex ways: An earth floor may indicate poverty for a piece laborer but not for a farmer, yet a simple linear model would skip over the difference.

Machine learning provides structured methods to search over a very wide set of functions to maximize a model’s predictive power. Some methods, such as Lasso techniques, are similar to linear regression but allow for much greater complexity. Others are very different and may better reflect the real world. Decision trees, for example, allow the impact of some characteristics to depend on others – for example, asking whether a person was a piece laborer or a farmer, and then only for the laborer basing the prediction on whether they had an earth floor.

The project applied machine learning tools to several existing programs and one new experiment, to compare the targeting outcomes to those of the actual target method used. This will show whether machine learning tools can improve targeting. From their results, the team aims to create an improved methodology that can be applied broadly across data sets already used for targeting.

Real-time Assessments: How Markets Are Working for the Poor
Alvaro S. Gonzalez

Research suggests that many markets in which the poor transact are volatile, fragmented and suffer weak competition, reducing people’s ability to escape poverty. This project analyzes existing micro-level price data to provide policymakers with near-real-time information on how well markets are working, so they can implement policies to improve them for the poor.

CHALLENGE
Improving poor households’ participation in markets – as buyers, sellers or producers – is central to combating poverty, but participation depends on well-functioning and efficient markets. The team’s prior research in Nigeria found correlations between poverty and low market efficiency. Many markets in which the poor transact are volatile, badly integrated and characterized by weak competition. Based on analysis of price and poverty data, these inefficiencies are likely to result in lower incomes, lower investment, higher risk aversion and suppressed supply responses.

The root causes of these inefficiencies are unclear, as is precisely how they affect poverty outcomes. Micro-analysis of market trends and conditions would help governments understand these inefficiencies, and evaluate the effects on poor people of attempts to mitigate them, such as changes to price supports, trade regimes, exchange rate or interest rate policies and infrastructure. The dualism that exists in many developing economies – a rising (usually urban) core with a poor periphery – may also be due to how markets function. This too has yet to be understood.

INNOVATION
To increase understanding of markets in which the poor operate, this project analyzes largely unexploited, micro-level price data, to assess how well markets function. The team assessed monthly price data on hundreds of commodities, collected to estimate inflation indexes across the world. To geographically pinpoint markets serving poorer regions, the team combined spatially disaggregated price data with publicly available satellite lights data. This combined data can be used to track trends and conditions in markets and can alert policymakers to changes that may negatively affect the poor. The price data is updated monthly, allowing the team to monitor how markets function in near-real-time.

This analysis can also monitor market trends and conditions in the poorest regions, and assess the impact of reforms and changes in policies or the market. It will inform governments and development partners on how well the economic environment in which the poor transact is helping them raise their incomes and escape poverty.

Governments can act to make markets work better for the poor – but they need the right information to do so.

From Cellphone Data to Poverty Maps
Marco Hernandez Ore

Accurate poverty maps require expensive, time-consuming surveys, meaning many developing countries only produce them every five or 10 years, often with a long time-lag.
This limits governments’ ability to make effective anti-poverty decisions. Using anonymized cellphone call records in Guatemala, this project aims to create a tool to produce inexpensive, near-real-time poverty maps and predictions based on phone users’ behavioral patterns.

CHALLENGE
Accurate poverty maps are critical for effective anti-poverty initiatives. They help policymakers understand poverty and design better interventions, as national level indicators often hide important regional differences. However, they require expensive, time-consuming surveys, meaning many developing countries only produce poverty maps every five or 10 years, often with a long time-lag. This limits governments’ ability to make effective decisions.

The lack of information on poverty is a key policy challenge in Guatemala. The country’s latest poverty figures were produced in 2011, meaning policies are designed with limited and outdated data. With low government revenue and more than half of the population living in poverty, it is vital that resources are well-targeted. This project addresses Guatemala’s urgent need for affordable, up-to-date poverty data to inform decisions and monitor progress.

INNOVATION
The wide use of cellphones in developing economies offers an opportunity to enhance poverty mapping. Cellphones generate large datasets which can be analyzed using machine learning techniques to reveal users’ behavioral patterns. These can be used to estimate poverty.

The project aimed to create a tool to produce inexpensive, rapidly updated poverty maps and to forecast trends in poverty across geographical regions. Using anonymized cellphone call records gathered by Movistar Guatemala, the country’s largest cellphone provider, the team developed algorithms to extract and analyze users’ behavioral patterns as a proxy to estimate poverty. The call records allowed them to compute three sets of variables to characterize human behavior: Phone service consumption, social networks and mobility. These behavioral features could predict regional poverty data, easily visualized as maps. For example, longer traveling patterns or smaller social networks could be correlated to higher or lower income levels respectively.

The validity of the approach is being assessed by analyzing the similarity between the predicted social indicators and Guatemala’s 2011 poverty map. This approach aims to allow governments to use call records to produce low-cost, near-real-time information on poverty and to forecast trends.

Testing Cellphone-Derived Measures of Income and Inequality
Tariq Khokhar

In many developing countries, official measures of poverty and inequality – vital to responsive policy design and implementation – are produced with a multi-year time lag and inconsistent coverage. This project evaluates techniques that use Call Detail Records (CDRs) to offer more timely and complete estimates of poverty and inequality. It also examines how these techniques can be incorporated into government workflows.

CHALLENGE
Many developing country governments operate in resource-constrained environments with significant gaps and lags in data needed for effective policy design and service delivery. Official measures of poverty and inequality are currently produced with a multi-year time lag and have varying levels of coverage across and within countries. More timely, complete and disaggregated socioeconomic measures are urgently needed.

Techniques are emerging for using Call Detail Records (CDRs) to offer more comprehensive and timely estimates of poverty and inequality. This project aims to evaluate these techniques and explore how they could be incorporated into the routine work of government agencies. Such knowledge could form the basis of fundamental improvements in key data availability for developing countries.

INNOVATION
The project takes a two-part approach, first working with the Colombian National Statistics Office to identify opportunities for integrating CDR-derived techniques for measuring poverty into routine data work, then evaluating the technical performance and methodological issues around using CDRs as proxy measures for wealth. The team will examine how well CDR-derived models perform, including in comparison with official sources, and whether they can be valuable in environments with sparse calibration data. They will also assess the impact of biases such as infrastructure development and mobile phone penetration on CDR-derived measures of wealth and inequality.

The project will produce papers identifying opportunities for CDR-derived measures to be incorporated into the workflow of government statistical programs, and presenting the evaluation of CDR-based techniques for estimating key socioeconomic variables. All software developed will be published as open-source code intended for re-use by others. The project’s long-term goal is to provide governments with new tools and methods for estimating socioeconomic variables in their countries.

Satellite-Based Yield Measurement
Talip Kilic

Through trials in Uganda, this project is testing a novel approach to derive reliable data on crop productivity from satellite imagery. The technique relates satellite-based data to plot-level ground measures of yields. This enables future yield predictions, which can inform better policymaking to help farmers improve productivity.

CHALLENGE
Reliable data on crop productivity is essential for policy decisions that will improve agricultural yields and reduce poverty. Traditional approaches to measuring yields and productivity (such as household surveys) are resource-intensive and difficult to implement, particularly for smallholder systems. However, pioneering techniques using data from satellite imagery now offer more accurate, timely and affordable agricultural statistics. To validate the approach, this project tested satellite-based yield predictions against results on the ground for 900 maize plots in Uganda, a country highly dependent on smallholder agriculture.

INNOVATION
The team used the Scalable Satellite-based Crop Yield Mapper, a statistical approach newly developed at Stanford University which relates satellite data to plot-level ground measures of yields in order to make future yield predictions. The outlines of each maize plot were captured via handheld GPS devices and used in conjunction with satellite imagery. The project took objective and subjective measures of soil fertility (through conventional analysis of subsamples, and farmer reporting) and maize variety (through DNA fingerprinting of leaf and grain samples, and farmer assessment assisted by photographic prompts). Questionnaires were also submitted to each household. The combined satellite and field datasets provide an unprecedented opportunity for testing the ability of satellites to improve and predict yield measurement in smallholder systems.

The project is the first to test yield estimation in smallholder production via high-resolution satellite imagery against farmer self-reported harvest and objective ground research into actual yields. The approach could be scaled up across different crops and regions. Uganda is the first of several countries in Sub-Saharan Africa in which the team plans to validate this satellite-based remote sensing approach.

Understanding Individual Travel Patterns in African Cities
Nancy Lozano Gracia, Talip Kilic

This project combined face-to-face and phone interviews with analysis of big data from sensor-embedded smartphones in Dar es Salaam, Tanzania. It aims to capture accurately and affordably the route, purpose, travel mode and cost of individuals’ journeys within the city, to support well-informed urban transport and land-use planning.

CHALLENGE
To make informed, coordinated decisions on transport investments and land-use planning, policymakers need reliable information about how individuals move around cities and the constraints they face. These decisions affect the locations of households and businesses, influencing residents’ quality of life and economic opportunities.

Traditional methods for understanding individuals’ travel patterns (purpose, mode and cost) involve collecting travel diaries kept by respondents and supervised for completion by field staff over an extended period. These methods are resource-intensive, subject to recall error and demanding on respondents.

To support well-informed urban planning, this project aimed to develop a tool that could capture accurately and affordably the route, purpose, travel mode and cost for every trip respondents made within a city over a period of time.

INNOVATION
The team sought to create a dataset highly informative about individuals’ travel patterns as a function of their socioeconomic background, the purpose of their travels and the associated costs. They combined face-to-face interviews with analysis of big data from sensor-embedded smartphones and follow-up phone interviews, to assess individual travel patterns in Dar es Salaam, Tanzania (although the methodology could be applied to other cities).

After developing sensors and software that were installed in GPS-enabled smartphones, the team selected a random sub-sample of respondents to the World Bank’s 2013-14 Measuring Living Standards Survey. Each had already taken part in a face-to-face interview covering their socioeconomic background and travel patterns. They were supplied with a smartphone able to collect and transmit the time and GPS location of individual movements at one-minute intervals for a one-month period. To encourage continued participation, respondents were told they could keep the phones after the study. A total of 533 people from 300 households took part.
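
To make the data collection described above more concrete, the sketch below shows one way a log of one-minute GPS fixes could be split into trips and measured. It is an illustration under stated assumptions, not the project's software: the `Fix` record, the `split_trips` and `trip_length_m` helpers, and the 50-meter and 5-minute thresholds are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Fix:
    """One GPS reading (hypothetical record layout)."""
    minute: int  # minutes since logging started
    lat: float   # latitude in degrees
    lon: float   # longitude in degrees


def haversine_m(a: Fix, b: Fix) -> float:
    """Great-circle distance between two fixes, in meters."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp = p2 - p1
    dl = math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def split_trips(fixes, min_move_m=50.0, max_gap_min=5):
    """Group consecutive moving fixes into trips.

    A trip continues while each step covers at least `min_move_m` meters
    and the two fixes are at most `max_gap_min` minutes apart.
    Both thresholds are assumed values for illustration.
    """
    trips, current = [], []
    for prev, cur in zip(fixes, fixes[1:]):
        moving = haversine_m(prev, cur) >= min_move_m
        close_in_time = cur.minute - prev.minute <= max_gap_min
        if moving and close_in_time:
            if not current:
                current.append(prev)  # the trip starts at the last resting fix
            current.append(cur)
        elif current:
            trips.append(current)     # movement stopped: close the trip
            current = []
    if current:
        trips.append(current)
    return trips


def trip_length_m(trip):
    """Total path length of a trip, summed over consecutive fixes."""
    return sum(haversine_m(a, b) for a, b in zip(trip, trip[1:]))
```

Trips detected this way would still lack purpose, mode and cost, which is why records of this kind need to be validated and completed through interviews with respondents.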
Journey records from their phones were then validated via follow-up phone interviews every three days, covering the origin, destination, route, purpose and cost of each trip.

Initial data analysis is now underway, focusing on understanding the key determinants for how people choose modes of transport. The team aims to combine project data with cellphone call records to assess the feasibility of using lower-cost phone data for informing transport planning. Subsequent analysis will also examine how the travel patterns recorded compare to those based on traditional data sources, such as the Living Standards survey.

Key lessons from the challenge winners

Despite the diversity of the winning solutions, several overarching lessons emerge. These can help big data practitioners harness the wealth of information generated by everyday activity, and use it to promote development.

Spot the opportunities
• Much of big data’s potential lies in combining disparate and even seemingly unrelated sources of information in innovative ways for analysis. This creates rich new approaches – as in Bogotá, where the team combined existing data sources into a potent dataset from which to predict crime. The climate-smart agriculture project merged climate and yield data, while Open Traffic combined smartphone data and taxi company databases.
• Approach existing situations from new angles. Researchers have used standalone accelerometers in moving vehicles to evaluate road quality for decades, but the RoadLab team linked wide smartphone ownership, the use of phones for navigation and the fact phones all have accelerometers, to extract information useful for road improvement.
• Seek opportunities in emerging technology. Poverty mapping in Sri Lanka and electrification monitoring in India show the untapped potential of satellite imagery. Ultra-light drones were used for high-resolution cadastral mapping in Kosovo, and will help record agricultural data in Latin America.

Quality data means quality results
• High-quality input data is essential for accurate results. Even where data is readily available, there is often significant potential to modernize the methods used to capture, store and share it – for example, through machine-based data sourcing.
• Prepare and store data meticulously (addressing gaps, outliers, correlated variables, etc.). Use cloud-based storage technologies so data is centralized and always available.
• It’s rare to have perfectly extractable data, so plan for imperfection. The financial inclusion project struggled to unify different data sources on different servers. Even with the best preparation, extraction often takes longer than anticipated, so include potential delays in project planning.
• Consider data storage from the beginning. Open Traffic’s data was initially kept as city-specific files in cloud storage, but the file sizes soon became unwieldy, so the data was aggregated as a single stream, using a cloud-based service which uploads real-time data streams from multiple sources.
• Respect privacy and protect individuals from potential harm resulting from use of their data (such as retaliation for expression of opinion on social media).
• Focus on creating a positive experience for users of big data tools. Technical issues undermine user trust, making the adoption process harder. As the climate-smart agriculture team found, a tool must offer sufficient services to engage users and be easy to operate, otherwise it will not be used.

Human input still matters
• Combine human and computational power for optimum results. The sentiment analysis in Brazil showed that alongside technological methods, successful analysis of complex socio-political issues using online social data requires human interpretation.
• Teach how to fish instead of giving fish. Projects such as the RoadLab app in Belarus showed the importance of building clients’ technical capacity and skills, so innovative approaches can be sustained in the long run and have real impact on communities.
• Enhance other research approaches with big data analytics. Big data makes observations possible from a distance, both in terms of space and time, and offers naturalistic insight into people’s true attitudes. However, big data analytics does not always offer a representative sample of views, and is therefore best used to complement other approaches to understanding development issues.
• Enhance big data analytics with other research approaches. Despite the current excitement about big data, it is ultimately ‘only’ a new (albeit very rich) data source, and does not make other data obsolete. As the financial inclusion research showed, talking to people remains a powerful source of information. In the Philippines, OpenRoads delivers online transparency, but institutions that link transparency to accountability are still needed to improve service delivery.

The power of partnerships
• Partnerships are crucial for success. In Latin America, working with agricultural organizations was key to scaling up data-driven agronomy. The development and use of the sentiment analysis tool in Brazil required close collaboration with experts in fields such as society, history and government.
• Gain credibility with partners, to assuage legitimate concerns about how information they share will be used.
• Create communities of practitioners – networking will be powerful in promoting uptake and refinement of big data methods. In Latin America, CIAT hopes to support a user community for data-mining techniques in agriculture.
• Involve local stakeholders to bring valuable insight from new perspectives. In Bogotá, local stakeholders helped identify potentially plausible explanations for crime. The OpenRoads platform relies on stakeholder input, from government agencies to ordinary citizens.
• Empower local communities through big data tools. Although resources and capabilities are needed to process and manage big data, projects such as the drone mapping in Kosovo demonstrate how new technology can help communities and local government make informed decisions.
• Use big data to promote dialog. Platforms such as India.Nightlights, OpenRoads and RoadLab promote a multi-stakeholder conversation. The Bogotá crime research and Kosovar mapping work also brought together stakeholders to work jointly towards progress.

Back new approaches with reliable evidence
• Rigorous validation or ‘ground-truthing’ is essential to establish confidence in novel data sources and methods. Earlier work establishing the strong correlation between lights in satellite imagery and electricity supply on the ground generated confidence in the India.Nightlights project approach.
• The Sri Lanka poverty mapping illustrates the broad scope for fruitful collaboration between poverty economists and geo-spatial image experts – but hard evidence is needed for the benefits that big data outputs can provide.
• Validation of big data models is more complicated than survey-based techniques. The data involved is not collected for the purpose to which it is being put, so inspection may not be revealing and pre-testing not possible. But, at a minimum, in-sample validation techniques (such as a holdout test set) should be used.
• Persevere. Several projects succeeded only because their teams persevered when faced with obstacles. The overall India.Nightlights process took five years to reach the stage of publicly launching the web platform, overcoming hurdles such as data processing requirements en route.

Glossary

Artificial neural networks
In machine learning, artificial neural networks (ANNs) are models inspired by biological neural networks and used to estimate functions that depend on a large number of inputs and which are generally unknown. ANNs comprise complex systems of interconnected ‘neurons’ which exchange messages and are able to model a system by means of a training algorithm. The connections have numeric weights that can be adjusted, making neural networks adaptive to inputs and capable of learning.

Big data
Big data is an umbrella term used to describe the constantly increasing flows of data emitted from connected individuals and things, as well as a new generation of approaches being used to deliver insight and value from these data flows. Sources include technologies such as the internet, mobile phones, ground sensors and satellites.

Big data analytics
Big data analytics is the emerging set of tools and methods to manage and analyze the explosive growth of digital information. It includes visualization, machine learning techniques and algorithms.

Cluster analysis
Cluster analysis or clustering is used in machine learning to group (or cluster) a set of objects so that they are more similar to each other than to those in other groups. It can be achieved by various algorithms that differ in their notion of what constitutes a cluster and how best to find them. Popular notions of clusters include groups with small distances between members, dense areas of data space or particular statistical distributions.

Decision tree analysis
A decision tree is a machine learning tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs and utility. It is one way to display an algorithm.

Hotspot mapping
Hotspot mapping visualizes the geographic incidence of socioeconomic data, such as crime. One of the most widely used techniques for generating hotspot maps as smooth continuous surfaces is kernel density estimation (see below). Instead of mapping the location of individual events, hotspot mapping highlights areas with above average incidence of events. These are known as ‘hotspots’.

Kernel density estimation
In statistics, kernel density estimation is a method for estimating the probability density function of a random variable. It makes inferences about the statistical population, based on a finite data sample.

Least absolute shrinkage and selection operator (Lasso)
In machine learning, Lasso is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It selects variables by imposing a penalty on the selection of additional variables when fitting the model.

Machine learning techniques
Machine learning involves the construction of algorithms that can learn from and make predictions about data. Rather than following static program instructions, these algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions.

Manual selection prediction models
Manual selection models are those built by researchers (as opposed to derived from machine learning) for statistical prediction.

Natural language processing
Natural language processing is a field of computer science related to human-computer interaction. It links computational linguistics and human (natural) languages, and aims to enable computers to derive meaning from human or natural language input.

Nearest neighbor analysis
Nearest neighbor analysis (also known as proximity search, similarity search or closest point search) is an optimization problem for finding closest (or most similar) points. It attempts to measure distributions according to whether they are clustered, random or regular.

Random forest technique
Random forest is a machine learning tool for classification, regression and other tasks. It constructs a multitude of decision trees to leverage the power of multiple alternative analyses, randomization strategies and ensemble learning to produce accurate models, variable ranking and detailed data reporting. It can spot outliers and data anomalies, display clusters, predict outcomes, identify predictors, discover data patterns and replace missing values.

Regression analysis
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).
• Count regression
  Count regression analysis involves modeling using count data, a statistical data type in which the observations can take only non-negative integer values and where these integers arise from counting rather than ranking.
• Stepwise regression
  Stepwise regression is the step-by-step construction of a regression model that involves automatic selection of predictive variables. It can involve trying out one independent variable at a time and including it in the regression model if it is statistically significant, or including all potential independent variables in the model and eliminating those that are not statistically significant (or a combination of both methods).

Acknowledgments

With warm appreciation to the Innovation Labs’ Big Data Team, for their hard work and commitment in organizing the Big Data Innovation Challenge and supporting the winning projects:

Adarsh Desai, Program Manager
Trevor Monroe, Operations Officer
Andrew Whitby, Data Scientist
Bruno Sanchez Nuno, Data Scientist
Luda Bujoreanu, Operations Advisor
Kiwako Sakamoto, Data Analyst

With thanks also to the team’s Collaborators for their ongoing support:

Amparo Ballivan, Lead Economist, Development Data Group (DECDG)
Isabelle Huynh, Senior Operations Officer, Transport and ICT Global Practice (GTIDR)
Malar Veerappan, Senior Data Scientist, DECDG
Rajan Bhardvaj, Lead IT Officer, Information and Technology Solutions (ITS)

Special thanks to Publication Coordinator Norma Garza, Knowledge and Learning / Open Contracting and Extractives Governance

These case stories demonstrate that big data can improve development effectiveness and help World Bank operations achieve results through better evidence, efficiency, awareness, understanding and forecasting… Ultimately, big data analytics can be an accelerator for ending poverty and boosting shared prosperity
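
As a small illustration of the glossary entries on hotspot mapping and kernel density estimation, the sketch below estimates a density from a handful of one-dimensional event positions using a Gaussian kernel. It is a minimal example under stated assumptions, not a method taken from any of the projects: the function name `gaussian_kde`, the Gaussian kernel, the fixed bandwidth and the sample values are all chosen here for illustration.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x.

    Each observation contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical event positions along a street, in kilometers.
events_km = [1.0, 1.1, 1.2, 5.0]
density = gaussian_kde(events_km, bandwidth=0.5)

# Positions where the estimate is high relative to the rest are 'hotspots'.
hotspot_candidates = [x / 10 for x in range(0, 61) if density(x / 10) > 0.3]
```

A smaller bandwidth gives a spikier surface that follows individual events, while a larger one smooths them into broader areas, so the choice of bandwidth shapes which hotspots appear.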