Digital Pulse
An exploration of non-traditional data for entrepreneurship ecosystem diagnostics

Credit: Photo by Zak Sakata on Unsplash (license allows free use of images).

This volume is a product of the staff of the World Bank Group. The World Bank Group refers to its member institutions: The World Bank (International Bank for Reconstruction and Development), the International Finance Corporation (IFC), and the Multilateral Investment Guarantee Agency (MIGA), which are separate and distinct legal entities, each organized under its respective Articles of Agreement. We encourage use for educational and non-commercial purposes. The findings, interpretations, and conclusions expressed in this volume do not necessarily reflect the views of the Directors or Executive Directors of the respective institutions of the World Bank Group or the governments they represent. The World Bank Group does not guarantee the accuracy of the data included in this work.

Rights and Permissions
This work is a product of the staff of the World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of the World Bank, its Board of Executive Directors, or the governments they represent. Nothing herein shall constitute or be considered to be a limitation upon or waiver of the privileges and immunities of the World Bank, all of which are specifically reserved.

Table of Contents
Acknowledgements 4
1 Executive Summary 5
2 Exploration of alternative data sources 8
2.1 What is alternative data and web scraping? 9
2.2 Prior work on alternative data collection methods 9
2.3 Identification and prioritization of data sources for web scraping 10
2.4 Extraction of data from private sector big data sources and public websites 13
2.5 Extraction of company-level metadata from private sector sources or public websites 18
3 Data analysis and Machine learning 24
3.1 Named Entity Recognition and Classification for Entity Extraction 25
3.2 Sentiment Analysis on social media data 30
3.3 Topic modelling 33
3.4 Thematic classification 39
3.5 Network Visualization 42
4 Data quality and mitigation 47
4.1 Record Linkage between Data Sources 47
4.2 Data Management and Storage 49
4.3 Adjusting for Biases & Testing Quality of alternative data sources 51
4.4 Limitations of the approach 54
5 The ethics of web scraping 55
6 Conclusion and Looking ahead 57
Bibliography 59

List of Tables
Table 1. Strengths and Weaknesses of Traditional and Alternative data sources 9
Table 2. Criteria for Prioritizing Data Sources for Web Scraping 11
Table 3. Illustrative example: Indonesian startup profile using online data 13
Table 4. Summary of Machine Learning Use cases and Potential value-add for this report 24
Table 5. Example entities found on GnB Accelerator's website 28
Table 6. Example NER output for GnB Accelerator 28
Table 7. Sample output using VADER for Sentiment Analysis 31
Table 8. Sample output using VADER for Sentiment Analysis using slang words and emoticons 31
Table 9. Sample output for VADER Sentiment Analysis on scraped data 32
Table 10. Outputs for Topic Modelling on scraped data 36
Table 11. Examples of Entities per cluster derived from Thematic Classification on scraped data 40
Table 12. Output for Thematic Classification for scraped data 41
Table 13. Proposed Initial criteria for selecting components of the potential technology stack 50

List of Figures
Figure 1. Pilot pipeline for Named Entity Recognition 27
Figure 2. Number of entities identified per company using NER 28
Figure 3. Pilot pipeline for Topic Modelling 35
Figure 4. Screenshot of Interactive Tool for visualizing LDA results 38
Figure 5. Probability Distribution of a startup belonging to each of the topics 39
Figure 6. Network Visualization for Indonesia: Full Entrepreneurship Ecosystem 43
Figure 7. Observations based on the Network Visualization of Indonesia 44
Figure 8. Network Visualization for Indonesia: Investment Flow Ecosystem Map 45
Figure 9. Network Visualization for Indonesia: Geographic Ecosystems Map 45
Figure 10. Record Linkage for Traditional and Alternative data sources (Source: Salganik 2017) 48
Figure 11. Potential technology stack for Data Management and Storage (within the World Bank environment) 49
Figure 12. Macro and Micro checks for data quality 52
Figure 13. Depiction of data calibration 54

Acknowledgements
This publication was funded by the 'Digital Industries and Skills Development Sharing: The Korean Experience' grant, financed by the Korean Trust Fund for ICT4D at the World Bank and implemented under the Digital Entrepreneurship Program (DEP), Global Knowledge and Learning project. The Innovation Policy Platform (IPP, www.innovationpolicyplatform.org) was developed by the World Bank Group (WBG) and the Organisation for Economic Co-operation and Development (OECD) as a global resource for knowledge, learning, indicators/data, and communities of practice on the design, implementation, and evaluation of innovation policies around the world. It was one of the core external knowledge products offered by the Finance, Competitiveness & Innovation (FCI) global practice and served to raise awareness of FCI's product portfolio, facilitate global engagement and advocacy, and build staff skills. The platform was retired in 2019.

Several people provided input and contributed to the report. Najy Benhassine and Denis Medvedev provided overall guidance. The project was led by Prasanna Lal Das with support from Adela Antic. Ma. Regina Paz Saquido Onglao is the principal author of the report, with contributions from Prasanna Lal Das. Alberto Sanchez Rodelgo (IMF) and Romulo Cabeza (ILO) were the peer reviewers.

1 Executive Summary
The ongoing data revolution has made a prodigious amount of new data and analytical tools available to researchers. This data is available at higher frequency and at a much more granular level than traditional data collected through fieldwork and surveys. This creates new opportunities for research but also raises significant questions about the usefulness, reliability, and quality of such data. Proponents of big data research have called for the development of new analytical techniques and tools to take advantage of these new opportunities, while others have cautioned against their seductive power. In the current note we provide an initial examination of the usefulness of such data in the context of entrepreneurship ecosystem diagnostics.

The World Bank is currently updating the methodology it uses to assess entrepreneurship ecosystems, in particular the Digital Entrepreneurship Ecosystem Diagnostic (DEED) framework.1 One of the features of the new methodology is an 'all data' approach that seeks to blend standard data sources (surveys, official data) with online data (open or proprietary). Entrepreneurship ecosystems are fluid environments containing complex interactions and relationships between entities. Most current ecosystem assessments rely on secondary sources of data that are generally based on small samples and often do not include information about entity relationships and networks.2
Such methodologies are also generally expensive to repeat. This leads to significant data gaps, including gaps in coverage and timeliness. Primary data, when collected, is seldom global and is generally gathered infrequently. Almost no assessment methodologies utilize so-called 'big data' sources, and very few reuse or combine data from non-traditional sources such as online platforms. Most methodologies also focus on a specific set of actors within the entrepreneurship ecosystem (the firms, or investors, or government agencies, or intermediaries) but almost never all of them. This means that the findings of such ecosystem assessments are often high-level, not comparable over time and geographies, and not necessarily actionable.

1 The DEED toolkit relies on a framework used to assess 1) the current environment, 2) strengths & successes, 3) weaknesses & barriers, and 4) opportunities for growth across the six domains of an entrepreneurship ecosystem identified by the Babson Entrepreneurship Ecosystem Project: policy, financial capital, markets, culture, human capital, and supports.
2 Exceptions include the following reports: (i) the World Bank Ecosystem Connections Mapping Project, in collaboration with GERN, Endeavor, and other institutions, which maps connections between key actors within startup ecosystems around the world; (ii) Endeavor Insight's report on "The Power of Entrepreneur Networks", focusing on how founder networks have accelerated New York City's tech sector growth; and (iii) Startup Genome's Global Startup Ecosystem Report, which uses company and founder data to generate Local and Global Connectedness index measures.

The assessment challenges have been further exacerbated in the digital economy, in which economic activity, including entrepreneurship, is difficult to measure fully using traditional indicators. Digital entrepreneurs, whether in start-ups or within incumbent firms, face several new and different challenges (and opportunities) compared with 'traditional' entrepreneurs. Digital businesses also generate new types of data, and this data exhaust can be a powerful way to measure conditions that are unique to digital entrepreneurship ecosystems. The approach described in the current note tries to address these measurement challenges in entrepreneurship ecosystem assessments using alternative data and related techniques. The note describes:
• New data sources and data collection techniques, covering:
  o Basic definitions
  o A review of related literature on exploring new data sources
  o Identification and prioritization of data sources
  o Examples of scraped data and their presentation
• Natural language processing (NLP), visualization, and machine learning techniques, including:
  o Named entity recognition and classification for entity extraction
  o Sentiment analysis
  o Topic modeling
  o Thematic classification
  o Network visualization
• Data quality issues and mitigations, covering:
  o Linkages between different sources
  o Data management and storage
  o Biases in data and quality testing
  o Limitations of web scraping
• A brief discussion of the ethics of web scraping

The examples used in the note are drawn from developing countries such as Senegal, Kenya, and Indonesia, which tend to be 'data poor' and where the proposed approaches may have the greatest potential but also face the most significant challenges, given their relatively low level of digital development.
Please note that the current report is designed as a data science practitioner guide and assumes a degree of technical familiarity with the subject matter. We should also clarify that we do not propose that alternative data should replace, or is 'superior' to, other data sources; the purpose of the current note is to provide a technical examination of specific data sources and the tools available to utilize them.

It is also important to consider the sustainability and reproducibility of web scraping when incorporating such data into the research methodology. Many sites have begun to close themselves off to scrapers; while this is not yet widespread, it may affect some projects more forcefully than others.

The code associated with the work below is available at https://github.com/mrpsonglao/Machine-Learning-Pilots. Note that we have scrubbed the code of any personally identifiable information to make it fit for public use.

2 Exploration of alternative data sources
The modern world is awash in data. As a recent World Bank report on data-driven development3 pointed out, in just one second people send 2.7 million emails, watch 75,000 videos on YouTube, and transmit almost 60,000 gigabytes of data. In that one second, individual airplanes generate 10 GB of data, and connected cars gather even more data about everything from weather and traffic conditions to every driving action and the response of other vehicles on the road. This data, as has been documented by the World Bank4 and others, is an important source of economic growth and a means of delivering public services.

The research community, after initial skepticism, has gradually warmed to the benefits of such data. In the US, for instance, the Bureau of Labor Statistics increasingly uses 'big data' to track the economy.5 Examples of such data include apparel prices gathered directly from large department stores, vehicle prices gathered directly from private sector aggregators, and drug prices sourced from pharmacy chains. In Canada, government statisticians have started collecting price data online. Statistical agencies in New Zealand, Norway, and the Netherlands also gather sales data through checkout scanners in stores. Similarly, the Billion Prices Project, seeded at MIT, began by scraping data from online sellers at scale. The drivers for such work include the need for more frequent, cheaper, and timelier data that are relevant for policymakers and allow researchers to fill data gaps. Access to such data at a large scale also lets researchers test non-probability sampling methods (Section 4.3 Adjusting for Biases & Testing Quality of alternative data sources provides more details on non-probability sampling methods).

In this section we provide a broad definition of alternative data and describe the tools and techniques we employed to scrape data from online sources to support entrepreneurship ecosystem assessments in Senegal, Kenya, and Indonesia. The steps included:
1. Identify and prioritize data sources for web scraping;
2. Extract a list of ecosystem actors (e.g., companies, accelerators, incubators) from private sector big data sources or public websites;
3. Extract actor-level metadata (e.g., year founded, address, number of employees, connections with other actors) from private sector big data sources or public websites.

3 Harnessing data technologies for development, https://openknowledge.worldbank.org/handle/10986/30437
4 Internet of Things – the new government to business platform, http://documents.worldbank.org/curated/en/610081509689089303/Internet-of-things-the-new-government-to-business-platform-a-review-of-opportunities-practices-and-challenges
5 Government economists turn to big data to track the economy, https://www.wsj.com/articles/government-economists-turn-to-big-data-in-estimating-inflation-11556622001

The compilation of all sample outputs for this section can be accessed in a shared view-only Google Drive folder.6

2.1 What is alternative data and web scraping?
In the current note, we use the term 'alternative data' to refer to online, digital data either published in a consumable open data format or otherwise available for scraping. Such data includes open data sets such as the ones published by the World Bank at http://data.worldbank.org, social data on platforms such as Twitter, general online content such as that on http://worldbank.org, and data made available by proprietary resources such as https://www.telegeography.com/. Web scraping or data scraping refers to the automated 'copying' of the content of a website into a database, typically through a bot or a web crawler.

2.2 Prior work on alternative data collection methods
There are several examples of previous successful work that leverages both traditional and alternative data sources, as documented by Blumenstock, Cadamuro, and On (2015); Olson (1996, 1999); Beskow, Sandler, and Weinberger (2006); and Ginsberg et al. (2009). Other examples include the combination of Facebook and survey data by Burke and Kraut (2014) and the research by Ansolabehere and Hersh (2012) on US voting patterns using proprietary data. The table below, inspired by Salganik (2017), highlights the strengths and weaknesses of traditional and alternative data sources.

Table 1.
Strengths and Weaknesses of Traditional and Alternative data sources

Traditional data (e.g., surveys, interviews)
Strengths:
• Custom-made for the research problem
• In-depth
• Good for opinion- and perception-related questions
Weaknesses:
• Usually narrow in scope
• Usually expensive and suffers from funding and/or time constraints
• Infrequent or not timely
• Lack of scale with respect to geographic coverage
• Lack of coverage with respect to ecosystem actors
• Publicly available data usually lacks granularity

Alternative data
Strengths:
• Big, which allows minimizing of random error or noise during modelling
• Provides real-time estimates
• Substantially cheaper
• Usually provides more granular data
Weaknesses:
• Digital biases, such as non-representative and systemic biases
• Sparse or incomplete data
• Possible drifting, especially for social media platforms, such as population drift (change in user base), behavioral drift (change in how users use the platform), and system drift (change in the system itself)
• Algorithmically confounded, that is, user behavior is affected by the engineering goals of the systems
• Some may contain sensitive data, which is a potential risk for data ownership and legal use issues

6 The view-only Google Drive folder can be accessed here: https://drive.google.com/drive/folders/1VW07ZUisEhcH1yQnt9Vg7fQlvfU1XJjr?usp=sharing

The animating idea behind such work is that combining traditional and alternative data allows researchers to produce a larger, richer, and more complete database than using one or the other. Using both traditional and alternative methods affords researchers the benefits of both types of data sources: the in-depth and custom-made nature of traditional data sources together with the scale, speed, and granularity of alternative data sources. It also mitigates the weaknesses of each one.
• Alternative data sources and traditional data can complement each other by filling each one's data gaps.
• Alternative data sources can augment the sampling frame for traditional surveys by providing a potential list of respondents.
• Traditional data can supplement alternative data sources by providing representative data with which to check or triangulate the alternative data.
Salganik (2017) provides more detail about the ideas above and introduces the concept of "enriched asking".

2.3 Identification and prioritization of data sources for web scraping
Alternative data comes in many shapes and formats and from a variety of sources, including social media, websites, IoT, and others. Depending on the research question, the first step is to develop a list of criteria to prioritize certain sources and data types over others.
For the work on entrepreneurship ecosystem diagnostics, the team decided to focus specifically on online data and shortlisted the sources based on the following criteria:
• Accessibility – Listed both public and private/proprietary datasets
• Scope – Listed both global and country-specific data sources for Indonesia, Vietnam, Kenya, Senegal, and Nigeria
• Granularity – Listed both country-level and company-level data sources

Metadata per data source, such as the following, was also noted down for analysis and prioritization purposes:7
• Goal or intention for web scraping (e.g., for extracting lists of companies, for extracting ecosystem actor-level metadata)
• Potential extraction methodology, data type, and notes
• Important notes and potential issues when extracting data from the source

To identify the data sources for the current demonstration project, the team conducted desk research, including online searches (Google) and parsing through relevant entrepreneurship-related documents and toolkits, to shortlist potential data sources to scrape. The team employed an iterative and test-heavy approach in extracting data from these sources. To strategically test web scraping across these sources, the team further shortlisted data sources for initial web scraping based on two main criteria: extraction priority and extraction difficulty. The shortlisting criteria and sub-criteria are further detailed in the table below.

7 For more details, please refer to this comment-only Google spreadsheet for the list of data sources considered and their corresponding attributes based on the criteria above: https://docs.google.com/spreadsheets/d/1nKkgaueEdiym1fmXUYdUy2Ctmh4_YDk-eVpG1PkQC4U/edit#gid=739685063

Table 2. Criteria for Prioritizing Data Sources for Web Scraping

Criterion: Extraction Priority (1 = highest to 5 = lowest)
Assessed priority of extracting the data for this specific dataset. General rule of thumb:
• 1 (highest) = Public, global datasets
• 2 = Public, region/country-specific datasets with ecosystem actor-level data
• 3 = Public, region/country-specific datasets with country-level data
• 4 = Relevant but proprietary sources
• 5 (lowest) = Other data sources for exploration
The sub-criteria used are as follows:
• Source Accessibility (Public vs Proprietary) – There are terms-of-use limitations for proprietary sources, which usually require a fee or partnership to access the granular data. Public datasets allow us to freely download the data and use it for research/analysis.
• Geographic Scope (Global vs Country-specific) – A good mix is best for web scraping, since:
  o Global websites are preferred because they allow web scraping to be scalable: the same code can be used to get data for more countries and companies within the same website domain (given that company/country pages usually have the same HTML structure within the same website). These can be used as the baseline or starting point of any web scraping activity.
  o On the other hand, country-specific websites tend to be localized and usually contain more unknown company data, so scraping these can augment and widen the scope of the scraped global websites.
• Granularity (Country-level vs Actor-level) – Ecosystem actor-level data is preferred, since there are already many existing resources and datasets for country-level data. Ecosystem actor-level data, on the other hand, is hard to find while allowing us to generate interesting in-depth insights.

Criterion: Extraction Difficulty (1 = Easy to 5 = Hard)
Assessed difficulty of implementing the extraction for this specific dataset. General rule of thumb:
• 1 (Easy) = Minimal coding required. Data is already in a table-structured, easily parsable format (e.g., single JSON endpoint, CSV or Excel file).
• 2 = Some coding required. Data can be pulled via a well-structured API, which allows creation of reusable code that is applicable across different countries/companies.
• 3 (Intermediate) = Intermediate coding required. Scripts specific to websites with somewhat structured data need to be written to extract data from the page source.
• 4 = Intensive coding required. Need to set up web crawlers to get slightly structured website data, or would need to extract data regularly from PDFs.
• 5 (Hard) = Method for extracting data is unclear / for exploration. Or, data pull is disallowed due to the owner's recent decisions (e.g., Facebook).
The sub-criteria used are as follows:
• Extraction data type – The type of the data to be extracted (e.g., JSON, HTML) affects ease of extraction. For example, PDFs are harder to extract data from than CSV files or JSON endpoints.
• Extraction method – The extraction methods available (e.g., API, data download, website pull limits) affect ease of extraction, since they determine the difficulty and complexity of the scraping code required.
• Notes / Potential Issues – API rate limits and other notes will affect the scalability and frequency of use of the web scraping code.

To illustrate the results of data collection through web scraping, here is a sample company profile of an Indonesian startup compiled by scraping diverse sources. Note that the team has not implemented data quality checks for the sample profile below. For more details on data quality, please refer to Section 4 Data quality and mitigation below.

Table 3. Illustrative example: Indonesian startup profile using online data

Basic Profile
• Name: Bukalapak
• Actor Type: startup
• HQ Location: Indonesia
• Description: Situs Jual Beli Online Mudah Dan Terpercaya
• Description (detailed): Bukalapak - Place of selling / buying the most comfortable & safe online with Payment System which ensures buyers and sellers 100% risk free online scams. Bukalapak.com, Sell Buy Easy & Reliable.
• Founding Date: September 2011
• Estimated Number of Employees: 1,500
• Industry: Marketplaces, E-Commerce
• Company website: https://bukalapak.com/
• Related Articles: http://endeavorindonesia.org/id/bukalapak-raih-penghargaan-bergengsi-dari-jokowi/

Founder data
• Co-founder & CTO: Nugroho Herucahyono
• Co-founder & CEO: Achmad Zaky

Investor data
• Venture (November 2017): Undisclosed
• Series B (February 2015): Emtek, Queensbridge Venture Partners, 500 Startups
• Series A (September 2012): Gree Ventures

Social Media accounts and statistics
• Facebook: https://www.facebook.com/bukalapak
• Instagram: https://www.instagram.com/bukalapak
• LinkedIn: https://www.linkedin.com/company/pt-bukalapak-com
• Twitter: https://twitter.com/bukalapak

2.4 Extraction of data from private sector big data sources and public websites
Data extraction begins after the data sources have been identified and tested. For the current demonstration, the team extracted sample lists of companies, accelerators, and incubators from a diverse set of sources to show the kind of company data that can be extracted per source type.
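As a concrete illustration of what such an extraction script can look like, here is a minimal sketch for a directory site such as Startups List (described in Subsection 2.4.1 below). The URL is real, but the CSS selectors and class names are hypothetical and would need to be adapted to the site's actual HTML; the report's own scripts are in the GitHub repository noted in the Executive Summary.

import csv

import requests
from bs4 import BeautifulSoup

# Target directory page; inspect its page source before choosing selectors.
URL = "http://nigeria.startups-list.com/"

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
# Assume each startup is rendered as a card-like element; "div.startup" is a
# placeholder selector, not the site's documented markup.
for card in soup.select("div.startup"):
    name_tag = card.find("h1") or card.find("h2")
    link_tag = card.find("a", href=True)
    desc_tag = card.find("p")
    rows.append({
        "name": name_tag.get_text(strip=True) if name_tag else "",
        "website": link_tag["href"] if link_tag else "",
        "description": desc_tag.get_text(strip=True) if desc_tag else "",
    })

# Persist the scraped list so it can feed the later analysis steps.
with open("startups_list_nigeria.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "website", "description"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Extracted {len(rows)} startup records")

The same pattern (download page, parse structure, write a tidy table) scales to other directory-style sources; only the selectors change.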
Here are the results of the sample web scraping for lists of companies, accelerators, and incubators. Exploration of alternative data sources Page | 13 2.4.1 A data collector or directory website Data source Startups List, a website which contains global country-specific listings (http://nigeria.startups-list.com/). Here’s how the website looks like: Implementation We were able to extract data on 251 startups located in Nigeria using & Results Python scripts on the website’s page source. Relevant Startup name, description, website, logo (link to image), and keywords. extracted fields Strategic use of Having the startup website allows us to build another scraping layer by scraped data extracting data (e.g., text, contact details, images, etc.) from the startups’ respective websites. Here is what the sample output looks like. For the full output, please see 2018-07-10 - Nigeria Startups List.csv. Startup Name Description URL Keywords Logo URL 1 Truppr The AirBnB for FITFAM Truppr is a social https://w Truppr - https://d1qb2nb5cznatu. tool that helps sport lovers and fitness ww.trupp fitness cloudfront.net/startups/i/ enthusiasts organise and find teammates r.com/ personal 390218- for their activity of choice in cities around health 64243e8459fac2a0764c5 the world. We help people stay fit and corporate 93ddcdc9608- well through a: - Simplified process of wellness active thumb_jpg.jpg?buster=1 organising amateur sporting/active ... lifestyle 398884019 2 RubiQube Location-based app recommendation http://ww RubiQube - https://d1qb2nb5cznatu. RubiQube® is a (cloud-based) mobile w.therubi cloud cloudfront.net/startups/i/ applications discovery and aggregator qube.co computing 311337- that seeks to connect locally developed m android 3dabab1d944fc169b7529 mobile apps (HTML 5 apps) with their application aa64f974de6- target market using a location based app platforms app thumb_jpg.jpg?buster=1 stores 387363971 Exploration of alternative data sources Page | 14 recommendation system in the app store. The application is available ... 3 ChopUp Mobile Social Gaming for Africa Chopup http://ww ChopUp - https://d1qb2nb5cznatu. is a social platform that allows mobile w.chopup social games cloudfront.net/startups/i/ game players to interact based on in- .me social media 90872- game achievements. The following are platforms aae89861281498b4fc2ba features of the platform: - Targeted mobile games 9ca37847637- exclusively at mobile devices (not virtual thumb_jpg.jpg?buster=1 excluding feature phones) - Social profiles currency 371473075 for each user - Realtime ... 2.4.2 An embedded map Data source Afrilabs – has a map of African accelerators / hubs (http://www.afrilabs.com/afrilabs-passport/). Here’s a screenshot of how the interactive map looks like: Implementation We were able to extract data on 57 accelerators, incubators, or hubs in & Results Africa using Python scripts in the map’s underlying code. Relevant Company address, description, geocoordinates, city, state, country, and extracted fields postal code. Strategic use of Having the startup geocoordinates gives us flexibility in conducting scraped data geospatial analysis on the scraped company metadata. This will be a potential analysis dimension when doing network analysis. Here is what the sample output looks like, filtered down to a few columns since the original dataset has many columns. For the full output, please see 2018-07-10 - Afrilabs List of Accelerators or Hubs.csv. 
Startup Name Address Description Latitude Longitude City State Country Exploration of alternative data sources Page | 15 ActivSpac Cefam Rd, ActivSpaces is an open 4.1515548 9.2327857 Buea Southwest Cameroon es Buea, collaboration space, Cameroon innovation hub and startup incubator for African techies. Established in 2009, ActivSpaces was one of the earliest African coworking spaces to provide free and open access to members actively pursuing technology-based ventures. Based in Buea, Cameroon. AkiraChix Kenyatta AkiraChix is a not for -0.2849853 36.0693113 Nakuru Nakuru Kenya Avenue, profit organisation that County Nakuru, aims to inspire and Kenya develop a successful force of women in technology who will change Africa’s future. 2.4.3 Google Search via search nearby places Data source Google Places API, “Nearby Search” endpoint (API documentation here: https://developers.google.com/places/web-service/search) Implementation We were able to extract data on 60 establishments located near Nigeria. & Results Specifically, we scraped all establishments on Google Places API which is within a 50km-radius from Nigeria’s capital, Abuja. Relevant Establishment name, geocoordinates, Google place_id, opening hours, extracted fields photo (link), Google user rating, type of establishment (e.g., hotel, restaurant, lodging) Strategic use of Aside from providing accurate geocoordinates, using Google Places API scraped data allows us to expand the diversity in company types as well as data types extracted by including photos, user ratings, and opening hours in the mix. The “place_id” field also allows us to pull greater detail on the business/establishment using another Google API endpoint. Here is what the sample output looks like, filtered down to a few columns since the original dataset has many columns. For the full output, please see 2018-07-10 - Nigeria Google Places API - Nearby Places endpoint.csv. Exploration of alternative data sources Page | 16 Google Name Geocoordinates Google Place ID Places rating Entity type vicinity {'location': {'lat': 9.0428389, 'lng': ['bank', 102 7.523834399999998}, 'viewport': 'finance', ChIJpS5lz- Yakubu World {'northeast': {'lat': 9.043978930291502, 'point_of_inter ELThAR7Hh_IdCw1 4.7 Gowon Bank 'lng': 7.525221030291501}, 'southwest': est', HI Crescent, {'lat': 9.041280969708497, 'lng': 'establishment' Abuja 7.522523069708496}}} ] {'location': {'lat': 9.0756033, 'lng': ['bank', 7.478640400000001}, 'viewport': 'finance', Ademola Ecoba {'northeast': {'lat': 9.0769522802915, ChIJbe1WOPkKThA 'point_of_inter Adetokunb 5 nk 'lng': 7.479989380291503}, 'southwest': RbRjyUqnO7xo est', o Crescent, {'lat': 9.074254319708496, 'lng': 'establishment' Abuja 7.477291419708498}}} ] {'location': {'lat': 9.0580188, 'lng': ['bank', Interc 7.486050700000001}, 'viewport': 'finance', ontin {'northeast': {'lat': 9.059367780291502, ChIJM8qNp6gLThA 'point_of_inter Abuja ental 'lng': 7.487399680291503}, 'southwest': RdPRXvFMXWRo est', Bank {'lat': 9.056669819708498, 'lng': 'establishment' 7.484701719708498}}} ] 2.4.4 Google Search via text search on places Data source Google Places API, “Text Search” endpoint (API documentation here: https://developers.google.com/places/web-service/search) Implementation We were able to extract data on 24 startups/accelerators/hubs located in & Results Indonesia and 53 of which in Kenya. We did this by specifying a keyword list and querying Google Places API using those keywords while limiting the results to a specific country (e.g., Indonesia, Kenya). 
The keyword list used is: ["accelerator", "hub", "startup", "business", "company", "incubator"] Relevant Similar to #3 above, but with the added metadata of keyword used for extracted fields source, and country used to restrict search results. Strategic use of Having the keyword mapped to each result allows us to easily group scraped data results together as an additional analysis dimension. Here is what the sample output looks like, filtered down to a few columns since the original dataset has many columns. For the full output, please see 2018-07-10 - Consolidated Google Places API - Text Search endpoint.csv, 2018- 07-10 - Kenya Google Places API - Text Search endpoint.csv, and Exploration of alternative data sources Page | 17 2018-07-10 - Indonesia Google Places API - Text Search endpoint.csv. Google Name of Google Places Entity Country Address Geocoordinates Keyword Place ID Rating Entity Type {'location': {'lat': -1.2813276, 'lng': 36.8177021}, ['university', 'viewport': {'northeast': ChIJTeyB Kenya Kemu Hub, 'point_of_int {'lat': -1.280026070107278, BdMQLx Methodist Kenya Koinange hub 4.3 erest', 'lng': 36.81895152989272}, gRxKyG9 University St, Nairobi 'establishme 'southwest': {'lat': - h36RbY nt'] 1.282725729892722, 'lng': 36.81625187010728}}} {'location': {'lat': 0.0466226, 'lng': 37.6554979}, Meru 'viewport': {'northeast': ['university', Njuri ChIJ3axa Institute {'lat': 'point_of_int Ncheke yeQhiBcR Of Kenya 0.04797242989272221, business 5 erest', Street, kkvyRGtL Business 'lng': 37.65684772989272}, 'establishme Meru 65w Studies 'southwest': {'lat': nt'] 0.04527277010727779, 'lng': 37.65414807010728}}} {'location': {'lat': - Murang'a 0.7163028, 'lng': University Murang'a 37.1476829}, 'viewport': ['university', College, ChIJ7Rm University {'northeast': {'lat': - 'point_of_int Muranga, bKXOYKB of Kenya 0.7142872701072778, 'lng': business 4 erest', MURANGA gRBhWo Technolo 37.14864597989273}, 'establishme TOWN q4lTZ5c gy 'southwest': {'lat': - nt'] (fomer Fort 0.7169869298927222, 'lng': Hall) 37.14594632010728}}} 2.5 Extraction of company-level metadata from private sector sources or public websites As proof of concept, the extracted company-level metadata from a diverse set of sources -- ranging from Google search results to social media data -- to concretely illustrate the kind of data we can extract per source type. What was not explored in this proof of concept is indirect data source discovery by leveraging existing knowledge graphs such as Wikipedia which link related entities and individuals with one another by design. For example, we can use Wikipedia pages (or LinkedIn) specific to an entrepreneur and use the links in each page to identify firms related to this entrepreneur (e.g., he/she may be a founder for Company A, employee for Company B, co-founder with Individual C, etc.). Leveraging these sites allow us to easily connect firms and individuals in the entrepreneurship ecosystem. We suggest exploring this idea for future proof of concepts. Exploration of alternative data sources Page | 18 Here are the results of the sample web scraping for company-level metadata. For this analysis, the team focused on accelerators/incubators in Indonesia identified in the previous section, specifically “GnB Accelerator”. 
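The Google Places API calls used in the preceding and following subsections (Text Search to discover places, Place Details to enrich a known place_id) can be chained in a short script. The sketch below follows the public Places Web Service endpoints; the API key is a placeholder, and the keyword loop mirrors the keyword list described above.

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder; a real Places API key is required
BASE = "https://maps.googleapis.com/maps/api/place"

def text_search(query):
    """Query the Places 'Text Search' endpoint, e.g. 'accelerator in Indonesia'."""
    params = {"query": query, "key": API_KEY}
    resp = requests.get(f"{BASE}/textsearch/json", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", [])

def place_details(place_id):
    """Pull richer metadata (website, phone number, reviews) for one place_id."""
    params = {"place_id": place_id, "key": API_KEY}
    resp = requests.get(f"{BASE}/details/json", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("result", {})

if __name__ == "__main__":
    # Keywords taken from the list above; country restriction via the query text.
    for keyword in ["accelerator", "hub", "startup", "incubator"]:
        for place in text_search(f"{keyword} in Indonesia"):
            details = place_details(place["place_id"])
            print(place.get("name"), "|", details.get("website", "n/a"))

Note that the API is rate-limited and billed per request, which is one reason extraction difficulty and pull limits were tracked as prioritization sub-criteria in Table 2.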
2.5.1 Google Places API Data source Google Places API, “Place Details” endpoint (API documentation here: https://developers.google.com/places/web-service/details) Implementation We were able to extract metadata on the 3 accelerators/hubs located in & Results Indonesia, by inputting their unique place_id which was extracted through Google Places API (see above section) Relevant This provides more details compared to the Google Place API endpoints extracted fields used in extracting company listings above. Additional fields include: contact details / phone number, website link, and more details in Google user reviews Strategic use of Getting company contact details will help when contacting the company scraped data for an interview or FGD. Having the startup website allows us to build another scraping layer by extracting data (e.g., text, contact details, images, etc.) from the startups’ respective websites. Here is what the sample output looks like, filtered down to one row and a few columns of the original dataset since the original dataset has many columns. For the full output, please see 2018-07-10 - Indonesia Google Places API - Place Details endpoint.csv. Entity Name GnB Accelerator Address Metropolitan Tower, Jl. R. A. Kartini, Kav. 14, RT.10/RW.4, Cilandak Bar., Cilandak, Kota Jakarta Selatan, Daerah Khusus Ibukota Jakarta 12310, Indonesia Geocoordinates {'location': {'lat': -6.292881999999999, 'lng': 106.784808}, 'viewport': {'northeast': {'lat': - 6.291533019708496, 'lng': 106.7861569802915}, 'southwest': {'lat': -6.294230980291501, 'lng': 106.7834590197085}}} Google Place ID ChIJPZLvV9rxaS4R6uvDTPMej_Y Google Places [{'author_name': 'Yeli Risna', 'author_url': Reviews 'https://www.google.com/maps/contrib/102120318279663282381/reviews', 'profile_photo_url': 'https://lh4.googleusercontent.com/- Ynu7tWTlj3I/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7ogOyKkvKtEul3N3wuvuOChCuI-yg/s128- c0x00000000-cc-rp-mo/photo.jpg', 'rating': 2, 'relative_time_description': '7 months ago', 'text': '', 'time': 1512631153}] Entity Types ['point_of_interest', 'establishment'] Website https://gnb.ac/ Exploration of alternative data sources Page | 19 2.5.2 Google Search Data source Google Custom Search API (API documentation here: https://developers.google.com/custom-search/json-api/v1/using_rest) Implementation We were able to extract detailed data on the top 10 Google search results & Results for “GnB Accelerator” and “MAD Incubator”, respectively, with the results restricted to the location of Indonesia for localized search results. Relevant Google search result title, text snippet, link, and rich snippet information extracted fields such as sublinks and images (see here for more details: https://developers.google.com/custom-search/docs/snippets). The number of Google search results for that keyword is also returned. Strategic use of This easily gives us additional data sources specific to the company by scraped data feeding the Google search result links into the extraction pipeline. The number of Google search results for that keyword can also serve as a proxy indicator for company digital presence. Here is what the sample output looks like, filtered down to a few columns of the original dataset since the original dataset has many columns. For the full output, please see 2018- 07-10 - Indonesia Mad Incubator Google Search API.csv and 2018- 07-10 - Indonesia GnB Accelerator Google Search API.csv. Search Snippet Title Search Result Snippet Text Search Result Link URL GnB Accelerator innovative technology companies. 
GnB is a https://gnb.ac/ – Local Identity, collaborative program between Global Opportunity Japanese IT company Infocom Corporation and Fenox Venture Capital from Silicon ... 6 Startup Indonesia 6 Sep 2017 ... Untuk penyelenggaraan kali ini, GnB https://id.techinasia.com/6- di GnB Accelerator Accelerator gaet 6 startup dari berbagai startup-di-gnb-accelerator-batch- Batch Ketiga 2017 3 latar belakang, masing-masing dengan keunikan model bisnis ... GnB Accelerator 5 Sep 2017 ... Program GnB Accelerator https://dailysocial.id/post/gnb- Batch Ketiga mengumumkan enam startup terpilih menjadi peserta accelerator-batch-ketiga- Umumkan Enam umumkan-enam-startup-terpilih Startup Terpilih ... batch ketiga dan berhak mengikuti program selama tiga bulan ... GnB Accelerator - GnB Accelerator, South Jakarta. 1738 likes · 15 talking https://www.facebook.com/gnbac Home | Facebook about this · 30 were here. celerator/ We're a startup accelerator in Jakarta, Indonesia. We offer... Exploration of alternative data sources Page | 20 GnB Accelerator | Learn about working at GnB Accelerator. Join LinkedIn https://www.linkedin.com/compan LinkedIn today for free. See who y/gnb-accelerator you know at GnB Accelerator, leverage your professional network, and get hired. 2.5.3 Social Media - Twitter Data source Twitter API (API documentation here: https://developer.twitter.com/en/docs) Implementation We were able to extract detailed Twitter status data on the company “GnB & Results Accelerator” using: keyword search for the phrase “GnB Accelerator” which pulls relevant tweets from the past 7 days containing that keyword direct pull of tweets from GnB Accelerator’s public user timeline Relevant We are able to extract data on the user who posted the status, as well as extracted fields status-level fields such as created time, text, location, interactions (replies to user / statuses), retweets, URLs, user mentions, hashtags used, place/coordinates used, contributors. Strategic use of Twitter data, and social media data in general, is ripe for natural language scraped data processing analysis such as sentiment analysis and topic modelling. This allows us to extract online intelligence not commonly found in traditional datasets. Here is what the sample output looks like, filtered down to one row and a few columns of the original dataset since the original dataset has many columns. For the full output, please see 2018-07-10 - Indonesia GnB Accelerator Twitter API.csv Created Tue Aug 23 08:26:10 +0000 2016 timestamp Tweet Text RT @VCInsiderNews: Why Japan’s Leading IT Firm Decides to Invest in Indonesia https://t.co/FkC5qlTQNB @GnBAccelerator #startup https://t.co… Was FALSE retweeted by GnB’s followers? 
Source URL Twitter Web Client User {"created_at": "Fri Mar 04 07:42:38 +0000 2016", "favourites_count": 1, "followers_count": 46, "friends_count": 115, "id": 705659703041208321, "id_str": "705659703041208321", "lang": "en", "listed_count": 1, "location": "Jakarta Capital Region", "name": "GnBAccelerator", "profile_background_color": "000000", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_banner_url": Exploration of alternative data sources Page | 21 "https://pbs.twimg.com/profile_banners/705659703041208321/1475842390", "profile_image_url": "http://pbs.twimg.com/profile_images/784366082509254660/zWOhEKJg_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/784366082509254660/zWOhEKJg_normal.jpg", "profile_link_color": "7FDBB6", "profile_sidebar_border_color": "000000", "profile_sidebar_fill_color": "000000", "profile_text_color": "000000", "screen_name": "GnBAccelerator", "statuses_count": 5, "url": "https://t.co/uiBJe1D2V7"} URLs linked in [URL(URL=https://t.co/FkC5qlTQNB, ExpandedURL=http://goo.gl/mMoIHm)] Tweet body Who are the [User(ID=732181521235247105, ScreenName=vcinsidernews), User(ID=705659703041208321, users ScreenName=GnBAccelerator)] mentioned? What were [Hashtag(Text='startup')] the hashtags used? Retweeted {"created_at": "Mon Aug 22 05:03:51 +0000 2016", "favorite_count": 1, "hashtags": [{"text": "startup"}], "id": history 767588069935505408, "id_str": "767588069935505408", "lang": "en", "media": [{"display_url": "pic.twitter.com/aSvbDRCrPQ", "expanded_url": "https://twitter.com/VCInsiderNews/status/767588069935505408/photo/1", "id": 767587950859137026, "media_url": "http://pbs.twimg.com/media/CqcFLKjUMAIftJu.jpg", "media_url_https": "https://pbs.twimg.com/media/CqcFLKjUMAIftJu.jpg", "sizes": {"large": {"h": 1536, "resize": "fit", "w": 2048}, "medium": {"h": 900, "resize": "fit", "w": 1200}, "small": {"h": 510, "resize": "fit", "w": 680}, "thumb": {"h": 150, "resize": "crop", "w": 150}}, "type": "photo", "url": "https://t.co/aSvbDRCrPQ"}], "retweet_count": 1, "source": "Twitter Web Client", "text": "Why Japan\u2019s Leading IT Firm Decides to Invest in Indonesia https://t.co/FkC5qlTQNB @GnBAccelerator #startup https://t.co/aSvbDRCrPQ", "urls": [{"expanded_url": "http://goo.gl/mMoIHm", "url": "https://t.co/FkC5qlTQNB"}], "user": {"created_at": "Mon May 16 12:10:52 +0000 2016", "description": "We are an online magazine featuring in-depth stories of today\u2019s investors & entrepreneurs.", "favourites_count": 9, "followers_count": 370, "friends_count": 180, "geo_enabled": true, "id": 732181521235247105, "id_str": "732181521235247105", "lang": "en", "listed_count": 44, "location": "Kuala Lumpur City", "name": "VC Insider News", "profile_background_color": "000000", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_banner_url": "https://pbs.twimg.com/profile_banners/732181521235247105/1513671620", "profile_image_url": "http://pbs.twimg.com/profile_images/932920167624990720/0hkeqo0Q_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/932920167624990720/0hkeqo0Q_normal.jpg", "profile_link_color": "000000", "profile_sidebar_border_color": "000000", "profile_sidebar_fill_color": "000000", "profile_text_color": "000000", "screen_name": "vcinsidernews", 
"statuses_count": 245, "url": "https://t.co/pTErhHsfzy"}, "user_mentions": [{"id": 705659703041208321, "id_str": "705659703041208321", "name": "GnBAccelerator", "screen_name": "GnBAccelerator"}]} 2.5.4 Company website page source and visible text Data source Company’s own website page source, visible text, and language Implementation We were able to pull the entire page source of GnB Accelerator’s website & Results using Python scripts. We also extracted the visible text of each Indonesian website as a pre- processing step for Natural Language Processing (NLP) and other machine Exploration of alternative data sources Page | 22 learning techniques. To do this, we removed all HTML tags and trailing/internal whitespaces. Last, we also identified the language used for each website. This will help us adjust the natural language processing techniques used on the text to account for language differences. Strategic use of Pulling the websites’ entire page source allows us to extract all relevant text scraped data and links within the page in an automated fashion. Website data is ripe for natural language processing analysis such as Named entity recognition (NER) to get lists of other company names, startups, partners associated with each company, as well as topic modelling to identify topics/themes associated with each company. This allows us to extract online intelligence not commonly found in traditional datasets. Here is what the sample text snippet looks like from the visible text of the company website of GnB accelerator. For the full output, please see 2018-07-10 - Indonesia GnB Accelerator Website data.html or 2018-07-12 - Indonesia Companies Website data - Visible Text.csv. Home Startups What’s On Contact Us Apply Accelerator First global acccelerator in Indonesia dedicated to progress and innovation that brings together the people, the funding, and the partners that drive business velocity We invest in talented and passionate early stage startups of all backgrounds, helping them to create innovative technology companies. GnB is a collaborative program between Japanese IT company Infocom Corporation and Fenox Venture Capital from Silicon Valley. It is a global network dedicated to progress and innovation that brings together the people, the funding, and the partners that drive business velocity. World Experts Join Forces INFOCOM CORPORATION Infocom Corporation, a subsidiary of Teijin, is a leader in IT systems and operation management services that provide diverse IT solutions and healthcare IT for dozens of pharmaceutical companies and thousands of hospitals. ... Exploration of alternative data sources Page | 23 3 Data analysis and Machine learning As described above, alternative data tends to be larger and more heterogenous than data available through typical official statistical channels or gathered through surveys or fieldwork. A different set of analytical tools has thus been recently developed to derive value from such datasets. As proof of concept, the team focused on machine learning techniques and tools to demonstrate how to concretely derive intelligence and insights relevant to digital entrepreneurship from various types of scraped data. 
Note that for this proof of concept, the team did the following:
• Implemented a pre-trained model, such as Named Entity Recognition (see Section 3.1 below) and VADER (Valence Aware Dictionary for sEntiment Reasoning) for sentiment analysis (see Section 3.2 below), for the purpose of illustrating these techniques;
• Built a model using the available data, such as Latent Dirichlet allocation (LDA) for topic modelling (see Section 3.3 below), and then applied the generative LDA model for thematic classification (see Section 3.4 below); or
• Visualized the collected data through network visualization (see Section 3.5 below).

The use cases below demonstrate how alternative data can complement standard data sources, if used carefully and in the appropriate context.

Table 4. Summary of Machine Learning Use cases and Potential value-add for this report

Section 3.1 Named Entity Recognition and Classification for Entity Extraction
• Use case: Extract related companies/entities per accelerator from their website data
• Data source: Raw website page source (from respective company websites)
• Potential value-add: Productivity and speed gains. Named Entity Recognition (NER) can be used to extract relevant entities from website data, which leads to productivity and speed gains when parsing through large chunks of text for relevant data. Standard data sources can then be used to check the quality of the data extracted.

Section 3.2 Sentiment Analysis on social media data
• Use case: Determine polarity (positive/negative/neutral) of tweets related to each company
• Data source: Social media data (Twitter)
• Potential value-add: New metrics. A potentially useful new metric is general sentiment or "pulse" regarding a certain topic or entity, which we can derive using sentiment analysis to determine the polarity (i.e., positive/negative/neutral) of a given text. We can then check if this new metric strongly correlates with any of the existing standard metrics, and derive insights from patterns uncovered.

Section 3.3 Topic modelling and Section 3.4 Thematic classification
• Use cases: Extract general topics for startups in Nigeria; group companies into clusters based on their topic association scores
• Data source: Extracted company metadata from an online data collector / directory
• Potential value-add: Knowledge discovery and compact representation. We can use topic modelling to automatically extract topics (represented through relevant word clusters) from various texts. We can then group entities into clusters based on their topic association scores through thematic classification. Subject matter experts can then be tapped and consulted to verify if the resulting topics and entity clusters make sense.

Section 3.5 Network Visualization
• Use case: Map various entrepreneurship ecosystem actors with one another based on relationships or connections (e.g., investor-investee connection)
• Data source: All data sources used above (including dummy data)
• Potential value-add: New data. By collecting relationship data between ecosystem actors (such as investor-investee relationships), we can leverage this new data to create network visualizations which allow us to map various entrepreneurship ecosystem actors with one another and look for patterns (e.g., how central an actor is, whether there is clustering present in the ecosystem). We can then check if these patterns are aligned with our knowledge of the entrepreneurship ecosystem based on the standard DEED framework.

In the subsections below, we describe the methodology used, the data source used, the results of the analysis, and possible next steps or improvements for each case, as well as the strategic benefit of implementing the chosen methodology. The compilation of all sample outputs for this section can be accessed in a shared view-only Google Drive folder.8

8 The view-only Google Drive folder can be accessed here: https://drive.google.com/open?id=1YXA218oOwUvB65JBGlH_h4PcMMh9KIsZ

3.1 Named Entity Recognition and Classification for Entity Extraction
Data source used. We used Python scripts to pull the homepage source from 3 websites. Note that the company names and websites were extracted via text search on the Google Places API, with the search restricted to Indonesia.
• 'GnB Accelerator': 'https://gnb.ac/'
• 'Mad Incubator': 'http://www.incubator.com.my/'
• 'The Accelerator': 'http://www.accelerator.co.id/'
See Subsection 2.5.4 Company website page source and visible text above for a snippet of the scraped data.

3.1.1 Methodology
Here is a visual representation of the methodology used to extract entities (such as persons, organizations, and locations) from the websites' page source for the sample Indonesian companies:

Figure 1. Pilot pipeline for Named Entity Recognition

3.1.2 Results
We extracted a total of 357 related entities from the website text of the 3 Indonesian companies. The distribution per entity type is as follows:

Figure 2. Number of entities identified per company using NER

Here are some examples found per entity type for the Indonesian company "GnB Accelerator":

Table 5. Example entities found on GnB Accelerator's website
• Facility: The Bridge, Wall Street, Y Combinator
• Geo-political entity (GPE): Indonesia, Japan, Asia, Jakarta, Singapore, China
• Geo-Social-Political group (GSP): US
• Location: Southeast Asia
• Organization: Fenox Venture Capital, Infocom
• Person: Joshua Kevin, Adamas Belva Syah Devara CEO

Here is a snippet of the output dataframe from NER for GnB Accelerator (entity, label):

Table 6. Example NER output for GnB Accelerator
• Adamas Belva Syah Devara CEO: PERSON
• Alfatih Timur CEO: PERSON
• Bridestory: GPE
• CEO Appsocially Willson Cuaca: ORGANIZATION
• CEO Bridestory Katsuhiro Okamura: ORGANIZATION
• CEO Bridestory Kevin Mintaraga: ORGANIZATION
• CEO Fenox Venture Capital Anis Uzzaman: ORGANIZATION
• CEO Fenox Venture Capital Kentaro Hashimoto: ORGANIZATION
• CEO Intangible Communications Peter: ORGANIZATION
• CEO Intangible Communications Toshihisa Wanami: ORGANIZATION

3.1.3 Next steps
Here are some immediate next steps (beyond the scope of this initial proof of concept) to improve model performance as well as the insights extracted:
• Further refine the model to remove false positives from identified entities by creating an ensemble model which combines the initial results (generated using NLTK9) with other open-source NER libraries such as StanfordNERTagger10 and Polyglot11.
• Enable NER with multi-language support using Polyglot (especially since not all websites are in English).
• Build a semi-automated process to tag subtypes per entity and their relationship with the company (e.g., partner, mentor, founder, etc.).
3.1.3 Next steps
Here are some immediate next steps (beyond the scope of this initial proof of concept) to improve model performance as well as the insights extracted:
• Further refine the model to remove false positives from the identified entities by creating an ensemble model which combines the initial results (generated using NLTK, the Natural Language Toolkit, the dominant Python package for natural language processing; see https://www.nltk.org/) with other open-source NER libraries such as StanfordNERTagger (https://nlp.stanford.edu/software/CRF-NER.html#Download) and Polyglot (http://polyglot.readthedocs.io/en/latest/Installation.html).
• Enable NER with multi-language support using Polyglot (especially since not all websites are in English).
• Build a semi-automated process to tag subtypes per entity and their relationship with the company (e.g., partner, mentor, founder, etc.).
• Mix NER with supervised/semi-supervised machine learning techniques, which could improve its outputs, particularly for entrepreneurship-related text.

3.1.4 Strategic benefit of implementing this methodology
NER allows us to automate the extraction of persons, organizations, and other entities related to each company, ultimately allowing us to build a network or ecosystem of actors surrounding each company. These small networks – wherein one company is at the center – can then be merged to generate a bigger, area- or country-wide ecosystem mapping of entities. This compiled network can then be used to:
• Augment the entity data collected via surveys;
• Implement more robust network analysis, since company nodes already have metadata extracted through web scraping (see Task B above). This additional metadata can be used for thematic clustering and other network analysis techniques to supplement the network analysis conducted through other assessments.

3.2 Sentiment Analysis on social media data

Data source used. We used the Twitter API to pull detailed Twitter status data on the Indonesian company “GnB Accelerator” using:
• a keyword search for the phrase “GnB Accelerator”, which pulls relevant tweets from the past 7 days containing that keyword; and
• a direct pull of tweets from GnB Accelerator’s public user timeline, with Twitter username @GnBAccelerator (https://twitter.com/GnBAccelerator).
For a snippet of the scraped Twitter data, please see subsection 2.5.3 Social Media - Twitter above.

3.2.1 Methodology
We used the VADER (Valence Aware Dictionary for sEntiment Reasoning) model, a well-known and often-used model for sentiment analysis on social media text, created by C.J. Hutto and Eric Gilbert of the Georgia Institute of Technology (for the original research paper, see http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf; for an accessible introduction, see http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).

The VADER model takes sentences as input and outputs four sentiment metrics for each sentence. Take, for example, the sentence “The food is good and the atmosphere is nice.”

Table 7. Sample output using VADER for Sentiment Analysis
VADER sentiment metric | Definition | Score for example sentence
Positive (pos) | Proportion of the sentence/text that falls under the positive lexicon | 45%
Neutral (neu) | Proportion of the sentence/text that falls under the neutral lexicon | 55%
Negative (neg) | Proportion of the sentence/text that falls under the negative lexicon | 0%
Compound | Sum of all lexicon ratings, standardized to range between -1 and 1 | 69%

The VADER model works well with social media text since it also considers slang and informal speech, such as multiple punctuation marks, acronyms, emoticons, capitalization, and word context. Each word in the lexicon is assigned a sentiment rating such that positive words have a positive value and negative words have a negative value. Note that “more positive” words have a higher rating, as seen when comparing “great” (3.1) to “good” (1.9).
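As an illustration of how these metrics can be computed, here is a minimal sketch using the open-source vaderSentiment package; the pilot's actual code may have used a different wrapper (for instance, NLTK's built-in VADER implementation), so treat this as a sketch rather than the implementation.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Alternative: from nltk.sentiment.vader import SentimentIntensityAnalyzer
# (which requires nltk.download("vader_lexicon") first)

analyzer = SentimentIntensityAnalyzer()

sentence = "The food is good and the atmosphere is nice."
scores = analyzer.polarity_scores(sentence)

# 'scores' is a dict with the four metrics defined in Table 7:
# 'pos', 'neu', 'neg' (proportions) and 'compound' (standardized to [-1, 1]),
# roughly matching the 45% / 55% / 0% / 69% values shown above.
print(scores)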
To further illustrate the VADER model, here are examples of its usage on sentences with slang words and emoticons:

Table 8. Sample output using VADER for Sentiment Analysis using slang words and emoticons
• “:) and :D” – Compound 79%, Negative 0%, Neutral 12%, Positive 88%
• (blank text) – Compound 0%, Negative 0%, Neutral 0%, Positive 0%
• “Today sux” – Compound -36%, Negative 71%, Neutral 29%, Positive 0%
• “Today kinda sux! But I'll get by, lol” – Compound 22%, Negative 20%, Neutral 53%, Positive 27%
• “Very bad movie.” – Compound -58%, Negative 66%, Neutral 25%, Positive 0%
• “VERY BAD movie!” – Compound -76%, Negative 74%, Neutral 27%, Positive 0%

3.2.2 Results
We were able to compute the VADER sentiment metrics of the 15 sample tweets related to GnB Accelerator. Here is a snippet of the output dataframe:

Table 9. Sample output for VADER Sentiment Analysis on scraped data
• “Applications for GnBAccelerator, SE Asia’s first multinational startup accelerator, are now available online at https://t.co/ZZABui0Dxq .” – Negative 0%, Neutral 100%, Positive 0%, Compound 0%
• “I'm giving out shout out to @ahlijasa as #StartupWorldCupChampion #INDONESIA regional finale!” – Negative 0%, Neutral 80%, Positive 20%, Compound 40%
• “Just a few days left to apply! Seize the opportunity to get started with your business idea now at… https://t.co/G7QwHNCBi8” – Negative 0%, Neutral 85%, Positive 15%, Compound 48%
• “KILAS INFO @tabloidpulsa EDISI 391 | Huawei - 3 - GnB Accelerator - Asia IoT Bussines Platform | Cc.… https://t.co/TDta9NJrMq” – Negative 0%, Neutral 100%, Positive 0%, Compound 0%
• “RT @VCInsiderNews: Why Japan’s Leading IT Firm Decides to Invest in Indonesia https://t.co/FkC5qlTQNB @GnBAccelerator #startup https://t.co…” – Negative 0%, Neutral 100%, Positive 0%, Compound 0%
• “RT @VentureShire: Why Japan’s leading IT firm decides to invest in Indonesia @GnBAccelerator @FenoxVC #Infocom https://t.co/R3jt8s6ltx” – Negative 0%, Neutral 100%, Positive 0%, Compound 0%

In general, GnB Accelerator’s tweets are mostly neutral or slightly positive; no negative tweets were found in the generated sample. The high neutrality scores make sense, since most of the sampled tweets are retweets of GnB Accelerator-related articles/posts by news agencies, which tend to have neutral-sounding headlines.

3.2.3 Next steps
Here are some immediate next steps (beyond the scope of this initial proof of concept) to improve model performance as well as the insights extracted:
• The current code only works with English-language tweets. We need to implement sentiment analysis with multi-language support using Polyglot or other solutions.
• Identify themes, entities, or keywords which generate high sentiment scores. To do this, we can segment the tweets into highly positive and highly negative tweets, and then identify the top keywords prominent in each segment.
• Possibly derive a company-level indicator for social media sentiment using the aggregated scores of tweets related to each company (a simple sketch of such an indicator is shown at the end of this section).
• Compare the various sentiment analysis methods and how well they perform on entrepreneurship-related text, given each method's pros and cons.

3.2.4 Strategic benefit of implementing this methodology
Evaluating the general sentiment of companies, entities, and topics on social media would be an interesting new dimension of analysis when it comes to digital entrepreneurship. This could also lead to the development of new indicators which can augment some of the intangible DEED domains, such as Culture (e.g., attitudes).
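To illustrate the company-level indicator suggested in the next steps above, here is a small, hypothetical sketch that aggregates tweet-level compound scores per company. The dataframe columns and sample tweets are illustrative only and are not drawn from the pilot data.

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Hypothetical dataframe of scraped tweets: one row per tweet, tagged with the
# company it refers to (column names are illustrative, not from the pilot code).
tweets = pd.DataFrame({
    "company": ["GnB Accelerator", "GnB Accelerator", "Some Startup"],
    "text": [
        "Applications for the accelerator are now available online.",
        "Just a few days left to apply! Seize the opportunity now.",
        "Very bad experience with their support.",
    ],
})

tweets["compound"] = tweets["text"].apply(
    lambda t: analyzer.polarity_scores(t)["compound"]
)

# A simple company-level sentiment indicator: the mean compound score per company.
company_sentiment = tweets.groupby("company")["compound"].mean()
print(company_sentiment)

Mean compound score is only one possible aggregation; medians or shares of strongly positive/negative tweets could be tested against standard metrics before settling on an indicator.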
3.3 Topic modelling

Data source used. We used the extracted data on 251 startups based in Nigeria from the website http://nigeria.startups-list.com. Specifically, we focused on analyzing the brief descriptions of all startups. Topic modelling works best on a large set of same-language text data with multiple rows or entries; for simplicity, we chose the largest scraped English-language dataset from the previous section, which is the Nigeria startups list dataset. See 2018-07-10 - Nigeria Startups List.csv for the scraped data on the Nigeria startups. For a snippet of the scraped data, please refer to subsection 2.4.1 A data collector or directory website above.

3.3.1 Methodology
For this note, we used one of the most often-used models for topic modelling – the Latent Dirichlet Allocation (LDA) model (for more information, see https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). LDA allows us to extract N topics from a set of documents, wherein each topic is defined by a set of keywords which are strongly associated with that topic. Note that this method requires some interpretation on the part of the analyst with the help of subject matter experts, since the model requires N as an input – that is, the analyst sets the number of topics (N) that the LDA model will look for. For the purposes of this pilot, we picked N = 3 by manually checking the diversity of the topics generated using N = 3, 4, and 5. For this specific dataset, N = 3 seems to work best. (A minimal code sketch of this pipeline follows the results below.)

The lambda (λ) parameter is important to tune when extracting the most relevant keywords per topic from the trained LDA model. When calculating the relevance or importance of a word in a topic, λ (with 0 ≤ λ ≤ 1) can be interpreted as the reverse weight given to the overall frequency of a given word in the corpus. That is, if λ = 1, we do not care about how rare the word is in the corpus; alternatively, if λ = 0, the relevance of each word is inversely proportional to its overall frequency in the corpus.

Also, topic modelling requires heavy text pre-processing before the data can be input to train the model, to account for multiple versions of the same word/idea (e.g., “discourage” vs “discouraging”) and commonly occurring but non-descriptive words in the language (e.g., “a”, “the”, “an” for English).

Here is a visual representation of the methodology implemented:
Figure 3. Pilot pipeline for Topic Modelling

3.3.2 Results
We were able to extract 3 topics using the company descriptions of the 251 scraped Nigerian startups. The 3 identified topics are as follows, with their corresponding top 30 most relevant keywords. Note that the keywords were derived by setting λ = 0.6, as suggested by Sievert and Shirley in their paper “LDAvis: A method for visualizing and interpreting topics”.

Table 10. Outputs for Topic Modelling on scraped data (top 30 most relevant keywords per topic, using λ = 0.6)
Topic 1: Online student / education products
Topic 2: Mobile-based services for estate, commerce, search, etc.
Topic 3: Social media network and marketing

We also built an interactive tool for interpreting the results of the trained LDA model on the Nigeria startups data (see the figure below); for more details, refer to the paper by Sievert and Shirley, “LDAvis: A method for visualizing and interpreting topics”. One of the interesting features of this tool is that it can visualize the size and possible overlaps among topics (notice the bubbles on the left of the figure).

Figure 4. Screenshot of Interactive Tool for visualizing LDA results
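For reference, here is a minimal sketch of the pipeline in Figure 3 using scikit-learn; the pilot may have used a different library, and the sample descriptions, pre-processing, and parameter choices below are illustrative only (N = 3 follows the pilot's choice). An interactive view similar to Figure 4 can be produced with the open-source pyLDAvis package.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in for the scraped startup descriptions
# (in the pilot these come from the Nigeria startups list dataset).
descriptions = [
    "Online platform providing video lectures and books for students",
    "Mobile app for real estate listings and property search",
    "Social network for marketing and connecting local businesses",
    # ... one entry per startup
]

# Basic pre-processing: lowercase, strip English stop words, drop very rare terms.
vectorizer = CountVectorizer(stop_words="english", lowercase=True, min_df=1)
doc_term_matrix = vectorizer.fit_transform(descriptions)

# N = 3 topics, as chosen manually for the pilot.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term_matrix)

# Print the top keywords per topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_terms)}")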
3.3.3 Next steps
Here are some next steps (beyond the scope of this initial proof of concept) to improve model performance as well as the insights extracted:
• The current code only works with English-language text. We need to implement topic modelling and keyword extraction with multi-language support using Polyglot or other solutions.
• To get more robust results, we would need more text as input for the Latent Dirichlet Allocation (LDA) model. This can be achieved by pulling the website text for all startups in the list and then applying topic modelling to this bigger text base.
• Experiment with changing the value of N (the number of topics) as well as lambda (λ) for extracting the top relevant keywords for each topic.

3.3.4 Strategic benefit of implementing this methodology
Used properly, this methodology allows us to automatically and efficiently describe a vast set of text by grouping its elements into topics and extracting relevant keywords per topic. This can easily be extended to open-ended survey responses and other qualitative data, whose textual content is usually analyzed through manual methods. This methodology also lends itself to thematic classification, by bucketing the inputted startup data into the topics extracted using this method.

3.4 Thematic classification

As a follow-up analysis, we used the same data source as the previous methodology (topic modelling) as well as the results of topic modelling. Specifically, we applied the trained Latent Dirichlet Allocation (LDA) model to the entire Nigerian startup list dataset and derived scores for each startup indicating which topic it is most closely assigned to.

3.4.1 Methodology
We ran the trained LDA model (see the previous methodology section) on each Nigerian startup’s company description. Running the model outputs three scores for each startup – one score per topic – wherein the sum of all 3 scores is 1.00 per startup. In other words, it returns the probability distribution of a startup belonging to each of the topics. We then classified the startups by assigning each to a topic if its score for that topic is beyond a certain threshold. For the purposes of this note, we used a threshold of 0.97 based on histograms of the topic scores (see the plots below). Notice the clean cutoff for each topic histogram at approximately 0.97 (denoted by the black dotted line in the histograms). A short code sketch of this assignment step follows Table 11 below.

Figure 5. Probability Distribution of a startup belonging to each of the topics

3.4.2 Results
Using this method, we were able to classify the Nigerian startups into the three topics, with the distribution shown in the table below. Note that out of the 251 startups, 89 were not assigned to any cluster since their topic scores are all below our threshold of 0.97 – that is, their topic association was not high enough to merit being assigned to any topic.

Table 11. Examples of Entities per cluster derived from Thematic Classification on scraped data
Cluster 1: Online student / education products – 53 companies
Cluster 2: Mobile-based services for estate, commerce, search, etc. – 46 companies
Cluster 3: Social media network and marketing – 60 companies
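Continuing the illustrative scikit-learn sketch from Section 3.3 (it reuses the lda, doc_term_matrix, and descriptions objects defined there), the classification step described in 3.4.1 can be expressed as follows; the 0.97 threshold is the pilot's choice, while the variable names are illustrative.

import numpy as np

# Topic probabilities per startup: shape (n_startups, 3); each row sums to 1.
doc_topic = lda.transform(doc_term_matrix)

THRESHOLD = 0.97  # cutoff used in the pilot, based on the topic-score histograms

assignments = []
for probs in doc_topic:
    best_topic = int(np.argmax(probs))
    if probs[best_topic] >= THRESHOLD:
        assignments.append(f"Topic {best_topic + 1}")
    else:
        assignments.append("Unassigned")  # topic association not strong enough

for text, probs, label in zip(descriptions, doc_topic, assignments):
    print(label, np.round(probs, 2), text[:40])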
The startup names, descriptions, and respective topic-level scores can all be seen in the file 2018-07-12 - Nigeria Startups – Thematic classification.csv. Here is a snippet of the output, with each startup assigned to a topic based on the 97% threshold:

Table 12. Output for Thematic Classification for scraped data
• Friendite – Assigned Topic 3 (probabilities: Topic 1 = 1%, Topic 2 = 1%, Topic 3 = 97%). Description: “African Dating Site FrienDite.com helps African connect with loved ones, helps you mingle, find your soul mate and fall in love easily. We help Africans improve a better marriage and a better love connection. Friendite - social media online dating social network media match making”
• Estatenode – Assigned Topic 2 (probabilities: Topic 1 = 1%, Topic 2 = 97%, Topic 3 = 1%). Description: “Search for Real-estate listings around you. We provide a more convenient and effective way for property seekers to discover their desired property through the up-to-date property information available on our database, available for free, accessible 24 hours a day to anyone with web access and far more complete ... Estatenode - mobile real estate”
• Educandlab – Assigned Topic 1 (probabilities: Topic 1 = 97%, Topic 2 = 1%, Topic 3 = 1%). Description: “Learn | Learn smarter | Learn better. Educandlab provides access to education with personalized experience based on the future ambition of our student through: 1) a video lecture platform, 2) simulation of key concepts in a field and 3) easy access to books. Educandlab - education edutainment k 12 education”

3.4.3 Next steps
Here are some next steps (beyond the scope of this initial proof of concept) to improve model performance as well as the insights extracted:
• The current code only works with English-language text. We need to implement topic modelling and keyword extraction with multi-language support using Polyglot or other solutions.
• Confirm or improve the accuracy of the topic modelling and thematic classification models by comparing the results against a manually generated clustering/grouping of the same set of startups.

3.4.4 Strategic benefit of implementing this methodology
We can use this methodology to easily find thematic groupings among entities that have a lot of textual data associated with them. Keeping the trained model also gives us comparative data for longitudinal surveys: future survey assessments may include new, unseen startups, and reusing the trained model allows us to classify these new startups into the topics extracted in the baseline survey.

3.5 Network Visualization

3.5.1 Methodology
We scraped online data on startups, investors, incubators, accelerators, associations, and mentors in Indonesia and generated an interactive network visualization from this data. When interpreting these visualizations, please consider the following caveats:
• Only scraped online data was used to generate these visualizations, for simplicity. Other data sources (e.g., survey data, proprietary data, official data) were not used for these visualizations (beyond the scope of this initial proof of concept).
• Due to time limitations, we were not able to check the scraped online data for bias, nor were we able to recalibrate the data to reflect more accurate estimates (beyond the scope of this initial proof of concept).
• For illustrative purposes, we filled in a few columns with dummy information to generate the visualizations.
Some examples of our use of dummy data include initializing reasonable15 random16 values for variables with missing data, including: o org/company size (bubble size for the network diagrams) o investment amount (connection line width for the network diagrams) o geocoordinates (for the map), etc. Here are some biases to consider when interpreting the following charts: • Bias for sources for inferable relationship data - e.g., program/accelerator pages which explicitly state “mentor” and “members”, directories which connect investors to startups • Bias for investor data linked to startups - Investor data primarily seeded from startup data 15Note that what we mean by "reasonable" differs for each context (e.g., for geocoordinates, these should be found on the Indonesia land mass). 16We deliberately chose to generate random dummy data to fill in missing data (rather than imputation) with the purpose of quickly showing how the network could possibly look like given differing values across different ecosystem actors. Data analysis and Machine learning Page | 42 • Disconnected circles - we expect to add more connections as we scrape secondary sources of data (company websites, articles, social media accounts) • Deduplication not yet done (beyond the scope of this initial proof of concept) so there’s a small redundant set here (but not so much – the majority are well known startups/companies) 3.5.2 Results Here are the resulting ecosystem network visualizations based on the scraped data and dummy data as described above. We were able to generate three network visualizations: • Full Entrepreneurship Ecosystem • Zooming in - Investment Flow Ecosystem Map • Geographic ecosystems map Figure 6. Network Visualization for Indonesia: Full Entrepreneurship Ecosystem17 Here are a few observations we can make from the sample visualization (again, filled in by dummy data), guided by the DEED framework for digital entrepreneurship: 17 The interactive version can be viewed here: https://embed.kumu.io/f1fccb919b6bb50e5e259c16b21533a8. Data analysis and Machine learning Page | 43 Figure 7. Observations based on the Network Visualization of Indonesia Visualization Observation Some Clustering present. There is some clustering visually present in the Indonesia ecosystem, primarily composed of association members or startup program participants. The rest of the ecosystem does not display much clustering. Sub-ecosystem with low Accumulation/Allocation Barriers. This sub-ecosystem with relatively higher density and degree of connections is composed of investors, incubators, accelerators, and firms. Notice that these are primarily the key institutions for access to finance and social capital. Sub-ecosystem with high Accumulation/Allocation Barriers. The sub-ecosystem with relatively lower density and degree of connections is composed of unconnected firms. This implies that these firms are not part of an association, nor do they have investors, accelerators, or incubators who are mentoring or investing in them. Data analysis and Machine learning Page | 44 Figure 8. Network Visualization for Indonesia: Investment Flow Ecosystem Map18 A few interesting things to notice in the diagram above: • Some startups/companies tend to attract more investors (spaghetti mass in middle). • At the fringes, there are some startups/companies with relatively few investors connected. Figure 9. 
Network Visualization for Indonesia: Geographic Ecosystems Map19 18 You can view the interactive version here: https://embed.kumu.io/72ba81ed159e47a1681e43ac6bb2be04 19Interactive version here: https://embed.kumu.io/da7538cd4c0a3c3ba94834c31e9cacbd. Caveat: Only 100 of companies/investor data is here due to rendering difficulties. Data analysis and Machine learning Page | 45 3.5.3 Next Steps Here are some next steps (beyond the scope of this initial proof of concept) to improve on the insights extracted: • Clustering of actor types in a region • Investment flow patterns (line thickness) specific to locations • Change in clustering and flow patterns over time • Indonesian firm vis-a-vis global market (foreign investors to Indonesian firms, or Indonesia firms with international market) • Easily compare ecosystems with other countries (for instance, we can make “ecosystem typologies” based on patterns across different countries) • See change in investment flow and ecosystem over time It is also important to ensure that our network measures and analysis are robust and sensitive to missing data by selecting appropriate centrality measures based on suggestions from existing research in this area. For instance, there have been empirical tests which look at the correlation between calculated centrality measures and the actual centrality measures by simulating missing data. They have found that there are some centrality measures – such as in-degree centrality and simple eigenvector centrality – whose resulting measures are relatively stable despite having a low sampling level such as 50% missing data (Costenbader & Valente 2003). Data analysis and Machine learning Page | 46 4 Data quality and mitigation The quality of alternative data is a source of significant concern for researchers. Almost none of this data is gathered for research purposes, is representative in any way, follows any international standards, is consistent with other online sources, or provides any assurance about quality. It is thus critical to apply a high bar for quality when using such resources. The following section describes suggested mitigation approaches to: • record linkage between the diverse sets of sources, • manage data and its storage, and • adjust for biases and test the quality of alternative data sources. 4.1 Recording Linkage between Data Sources One of the key challenges in combining diverse sets of traditional and alternative data sources is the problem of record linkage, which has two main subproblems: (a) maintaining a data structure which can accept different data from different sources and (b) matching records about the same actor from different data sources. Maintaining a data structure which can accept different data from different sources. This is a common problem and is typically addressed by using a NoSQL Database, which is a flexible data structure which can accept generally any document structure (compared to the frequently-used SQL database which requires a certain structure before accepting data). Matching records about the same actor from different data sources. 
To do this, we can implement several techniques and checks, such as: ● Data processing involving fuzzy matching, which allows us to approximately detect matches across records from different data sources; ● Triangulation of indicator data collected across different data sources; ● Handling data discrepancies through a combination of semi-automated checks guided by internally-defined criteria (such as source reliability as defined by subject matter experts, recency of data collected, and frequency of value among all sources considered) and manually checking a random sample of the records to ensure proper handling of edge cases; and ● Exploration of probabilistic models for record linkage (such as fastLink) which allows a mixed approach where the user provides input to update the model. To illustrate the record linkage process, see the schematic below based on earlier research by Ansolabehere and Eitan Hersh (2012) which also combined traditional and alternative data sources ( the figure is from Salganik 2017). Figure 10. Record Linkage for Traditional and Alternative data sources (Source: Salganik 2017) Using the data collection process to our advantage. We can also structure the data collection process in such a way that it will be easier for us to do record linkage later on. In particular, we can start with some “seed sources” which contains certain data on ecosystem actors and institutions such as their name, website (if any), social media accounts, and the like. Typically, these seed sources are online data directories which aggregate information from various sources. This initial round provides us with URLs and keywords for search engines which can feed into the next round of web scraping, while ensuring that some of the data scraped are definitely linked to that particular actor or institution. To illustrate the benefits of this method, notice that the data we have found for Bukalapak in the illustrative example above comes from the first round of web scraping. We can then use the links from the “Company Website”, “Facebook”, “Instagram”, Twitter”, and “LinkedIn” fields to scrape more information which can be linked to Bukalapak. We can also add whitelisted or blacklisted sources for the scraping process, to filter dubious or less credible links and avoid scraping them. Data quality and mitigation Page | 48 4.2 Data Management and Storage Once the web scraping exercise reaches a point of wide-scale implementation, it is important to support this with an appropriate data management and storage technology stack to build a long-term data asset which will consolidate all collected data and metrics. This data asset will potentially grow more valuable over time, as more features, countries, and data sources are collected consistently over time. One potential is that this asset unlocks powerful analyses and comparisons of the same indicators across different countries and over time. It is important to leverage easy-to-use and flexible templates and tools for data analysis, visualization, and dissemination to enable ease in data sharing with both internal and external stakeholders. The example below shows a combination of World Bank Group and open-source tools / technology. Figure 11. Potential technology stack for Data Management and Storage (within the World Bank environment) Here are a few notes which may be useful in developing the data management framework: • Build a minimum viable product (MVP). 
While this data asset may be a critical output, there are a lot of design choices which will be discovered along the way. Hence, it is good practice to implement a lean, agile methodology when developing this infrastructure by starting with a lean prototype with low investment and iterating on this based on regular stakeholder feedback. • Leverage existing organization tools as much as possible. If the data management infrastructure is within the context of an organization, leveraging existing tools and Data quality and mitigation Page | 49 services is key to the long-term sustainability of the data management framework. This will avoid redundancy with and allow piggybacking on existing organization tools and processes at a lower cost. • Closely collaborate with key organization units. In different organizations this might include groups as disparate as research, technology, information security, policy and others. Close collaboration with all of them can help ensure that the proposed data management framework can be easily integrated within existing organization tools and services. In addition, here are some additional criteria to select the components and tools of the technology stack: Table 13. Proposed Initial criteria for selecting components of the potential technology stack Suggested Criteria Rationale Leverages existing To be consistent with the organization tool suite and best practices, and to organization data assets minimize infrastructure and data tool costs where possible. and software licenses as much as possible Enabling & Flexible. Flexibility is key to adoption. Pinning this down at an early stage is particularly Allows ease and flexibility of crucial, since one of the hardest issues when introducing new data tools is use for both technical and adoption of these by the target users (considering their technical skills, comfort non-technical users with new tools, etc.). Free and/or open-source As much as possible, data tools should be preferably free and/or open-source to ensure flexibility and avoids long-term funding commitments. Cloud-based. (Especially Security and data backup are outsourced to industry-standard tools. Industry- for the data storage and standard tools can ensure that the data is fault-tolerant, requires low management tools) maintenance, is always accessible, and durable & distributed geographically. Web browser-based. Ensures that the data tool is always up-to-date. Increases chances of user adoption since minimal/no installation steps required. (Potential downside: increased reliance on good internet connection to upload, pull, analyze data.) Has user login / Survey/interview/FGD data also has some confidential aspects, so this can’t be authentication enabled publicly published. Different internal teams and external stakeholders should have different access levels to the consolidated database. Allows user collaboration Members within survey teams usually need to collaborate to finalize reports and outputs. Easily allow pulling in For instance, the DEED methodology has special emphasis on sourcing / external data via data exploring data on TCdata360. Most sample surveys support their results using upload or API pull data from WBG or other external institutions. Data quality and mitigation Page | 50 Suggested Criteria Rationale Leverage off-the-shelf Survey Monkey, AWS, and Microsoft Azure technology stacks are some tools APIs of data tools when which have off-the-shelf APIs which allow ease in sharing data across data tools. 
possible Flexible data analysis and Offer a customizable tool that can be used to assess a particular ecosystem with storytelling tools which ease at any point in time and to respond to specific client requests. can generate interactive, shareable data stories/visualizations Conditional access for Instead of having the whole tool password protected, there could be tiers of the public access tool that are open to the public, or where the public is even encouraged to contribute and modify directly the content. Third party access/ In the overall data architecture, there could be value in having some layers not crowdsourcing/ wikis for only open to third party providers, but also explicitly adopting a the tool. crowdsourcing/wiki approach. 4.3 Adjusting for Biases & Testing Quality of alternative data sources While the proposed approaches have a lot of potential upside, they also contain several limitations and weaknesses, such as handling bias and data ownership issues. It is thus crucial to compare the extracted indicator data from alternative methods against official data sources (e.g., census data, household survey microdata) to (a) check for data quality and credibility and (b) test and adjust for biases. The following are the overarching guidelines to consider: ● Greater value can be obtained by combining traditional and alternative data sources. Traditional data plays a pivotal role in assessing and recalibrating the quality, validity, and accuracy of the scraped data collected (and show possible biases), for all steps of the process. ● Leverage existing domain knowledge to ensure relevance and actionability. We will have close coordination with subject matter experts and country survey teams every step of the way to get feedback and ensure relevance and applicability of the results to policymakers. ● Transparency in data collection and analysis. Showing how the data was collected and the metrics extracted can foster feedback, research replicability, and further interest and investigation. ● Continuously check data and re-calibrate algorithms even in production. Algorithms are never really “done” since the entrepreneurship ecosystem as well as the digital ecosystem dynamically changes over time. Continuous recalibration and Data quality and mitigation Page | 51 updating of the algorithm vis-a-vis latest traditional data will keep the data and metrics relevant, accurate, and of quality. 4.3.1 Checks for data quality and credibility Comparing traditional and alternative data sources will help spot glaring differences, possible biases, and observe underlying patterns for these biases. We can check for quality along two levels, namely (1) macro or aggregated data and (2) micro or on the actor level. Figure 12. Macro and Micro checks for data quality We can implement the following quality checks at the micro-level and macro-level: ● Whitelist or blacklist certain online data sources based on credibility and advice from subject matter experts; ● Identify “gold standard” data sources among the available sources (typically census data or household survey data) to serve as the “ground truth” for the data comparisons. ● Triangulate indicator data (either granular or aggregated data) collected across different data sources. For example, triangulate the results of sites with unsure credibility against those which are identified as credible, and check the overlap or similarity of results returned. 
● Identify data discrepancies among the data compared, and handle these through a combination of semi-automated checks guided by internally-defined criteria (such as source reliability as defined by subject matter experts, recency of data collected, and frequency of value among all sources considered) and manually checking a random sample of the records to ensure proper handling of edge cases. ● Work with legal teams to confirm and clarify the terms of use of public sources of data before proceeding with wide and long-term data scraping. 4.3.2 Testing and adjusting for biases Note that alternative data sources commonly suffer from nonrepresentative and digital bias, and you cannot expect these sources to be accurate at the onset. It is thus important to leverage existing traditional data sources and use these to calibrate and adjust the collected Data quality and mitigation Page | 52 data from alternative sources. Traditional data and subject matter expertise will play a pivotal role for this process. To merge the two data sources, the alternative data sources need to be checked for bias and adjusted using methods suitable for non-probability samples (which is often the case for alternative data sources and methods) such as: • Post-stratification using auxiliary information about population strata (which are assumed to be mutually exclusive and exhaustive groups). This requires fulfillment of "homogeneous response propensities within groups" assumption wherein there should be little variation in the response propensity and outcome among the homogeneous groups formed. • Multi-level regression wherein we estimate outcomes per group without enough (or zero) respondents by pooling together estimates from people in very similar groups. • Other methods to handle non-probability samples include: o Sample matching (Ansolabehere and Rivers 2013; Bethlehem 2015) o Propensity score weighting (Lee 2006; Schonlau et al. 2009) o Calibration (Lee and Valliant 2009) • Some specific methodologies for adjusting for different types of biases: o Adjust for population bias via reweighting by population segment (e.g., segmented by location, industry, firm size, firm age) o Adjust for selection bias via propensity score matching between survey and non- survey data (test: propensity of subject being in the survey data) o Adjust for activity bias (esp. for social media datasets, search datasets) by clustering data based on participant activity (e.g., recency, frequency) We can then calibrate and reweight data from alternative data sources to adjust for bias. To test the reliability of the adjusted metrics, we can compare the adjusted data against corresponding traditional data as baseline (if available) and/or feedback from subject matter experts. Data quality and mitigation Page | 53 Figure 13. Depiction of data calibration 4.4 Limitations of the approach Here are a few caveats for and limitations of this approach: • Care must be taken when interpreting results gathered from non-traditional sources. Data gathered from online sources tend to suffer from some bias, especially depending on the data collection methods of that online source. For instance, global sources such as Pitchbook and Crunchbase may have incomplete data on African countries compared to their American counterparts. This may lead to overrepresenting some subset of the digital entrepreneurship ecosystem “population” whereas underrepresenting another subset. 
o To mitigate this, we can explore a mix of global and local data sources to complement one another, and to use triangulation to check for discrepancies in data collected among the different data sources. • The richness of the results greatly depends on the available non-traditional sources per country. The quality of the data collected largely depends on the quality of the data from the non-traditional sources, so the results must always be taken with a grain of salt”. Also, it is possible that there are data-poor countries which will have inadequate data sources to implement this approach. • Refinement of the methodology and data collected requires some manual checking by subject matter experts. The quality of the data and the robustness of the methodology can be developed and further refined over time through feedback from subject matter experts. Data quality and mitigation Page | 54 5 The ethics of web scraping While the techniques described above have great potential for research, questions inevitably arise about the propriety of scraping data without permission from website users and from entities described on such websites. Typical concerns include the following – • Technical. Web scraping can place undue demands on websites and slow down their performance • Permission. Some sites often explicitly prohibit scraping but do not have the technical resources to enforce it (see this related court ruling) • Deception. Very often web scrapers do not identify themselves correctly to sites they are scraping from • Reuse. Scrapers may not always have the permission to reuse the data they harvest • Awareness. It is sometimes the case that web owners are unaware of the technical possibility of scraping and may be giving away their data out of ignorance We propose the following mitigations – • Technical. Scrapers must take care to not over-burden websites; scraping should ideally be infrequent or at off-peak hours and respect the technical infrastructure limitations of the source sites • Technical. Scrapers must use the website API if the source website provides it • Permission. Scrapers must first carefully review the terms and conditions of all websites they plan to scrape and not scrape content from websites that prohibit it even if they don’t possess the technical means to enforce it. Some sites present a clear robots.txt message; others do not but state their objections through terms and conditions • Identification. Scrapers must always identify themselves clearly and honestly. Inserting such information in code headers is easy and standardized. The information should also ideally include contact information • Reuse. Scrapers must refer to the terms and conditions and respect the conditions for reuse. Intellectual property and trademark laws typically dictate how a website’s content may be used; in any case scrapers should credit all information and as much as possible use it in a non-rivalrous fashion It is also important to consider the sustainability and reproducibility of web scraping when incorporating such data into the research methodology. Many sites have begun to close themselves off to scrapers and while not widespread this may apply more forcefully to some projects than others. There are also cases wherein public APIs have been closed off for public use (e.g., Facebook, Instagram) or have had changes in access rights (e.g., AngelList), deprecation of API methods, rate limit changes, monetization strategies, among others. 
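As a minimal sketch of the technical, permission, and identification mitigations proposed above, a scraper can check a site's robots.txt and announce itself through a descriptive User-Agent header before requesting any page. The target site, bot name, and contact details below are placeholders, and this sketch does not replace reviewing a site's terms and conditions.

import requests
from urllib import robotparser

# Check the site's robots.txt before scraping (hypothetical target site).
robots = robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

target_url = "https://example.com/startups"
# Identify the scraper honestly, including contact information.
user_agent = "EcosystemResearchBot/0.1 (contact: research-team@example.org)"

if robots.can_fetch(user_agent, target_url):
    response = requests.get(target_url, headers={"User-Agent": user_agent}, timeout=30)
    print(response.status_code)
else:
    print("Scraping disallowed by robots.txt - skip this URL.")

In practice this check would be complemented by rate limiting (e.g., pausing between requests or scraping at off-peak hours) and by preferring an official API whenever the source site provides one.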
It is therefore important to keep in mind long-term sustainability and reproducibility when identifying which sources and techniques to implement at scale when establishing good initial foundations for the methodology, while being aware of the potential future deprecation and changes in data accessibility. The ethics of web scraping Page | 56 6 Conclusion and Looking ahead The note provides a description of tools to both gather and analyze data from alternative, digital sources and apply them to answer some of the research and measurement questions related to entrepreneurship ecosystem assessments. The description above shows the value of such resources but also describes their limitations and a few mitigation approaches. In general, the report demonstrates that such data can be a powerful complement to standard data sources, if used carefully and in the appropriate context, such as the following applications explored in this report: • Productivity and speed gains. Techniques such as Named Entity Recognition (NER) can be used to extract relevant entities from website data, which leads to productivity and speed gains when parsing through large chunks of text for relevant data. Standard data sources can then be used to check the quality of the data extracted. • Knowledge discovery and compact representation. Techniques such as topic modelling can be used to automatically extract topics (represented through relevant word clusters) from various texts. We can then group entities into clusters based on their topic association scores through thematic classification. Subject matter experts can then be tapped and consulted to verify if the resulting topics and entity clusters make sense. • New metrics. A potentially useful new metric is general sentiment or “pulse” regarding a certain topic or entity, which we can derive using sentiment analysis to determine the polarity (i.e., positive/negative/neutral) of a given text. We can then check if this new metric strongly correlates with any of the existing standard metrics, and derive insights from patterns uncovered. • New data. By collecting relationship data between ecosystem actors (such as investor- investee relationships), we can create network visualizations which allow us to map various entrepreneurship ecosystem actors with one another and look for patterns (e.g., how central an actor is, if there are clustering present in the ecosystem). We can then check if these patterns are aligned with our knowledge of the entrepreneurship ecosystem based on the standard DEED framework. It is important for researchers to also consider a few additional issues and caveats if they would like to include alternative data in their methodology. These include – • Data and computational infrastructure. Alternative data sources require sophisticated data and computational infrastructure to be scaled beyond small pilots. Projects or organizations thus need to make appropriate investments in their infrastructure. • Policies and guidelines. Many organizations still do not have appropriate policies or guidelines in place for the use of alternative data. Recent experience has highlighted the numerous ethical, social, and other challenges associated with the gathering and use of such data. It is thus important for organizations to develop appropriate mechanisms and policies governing some of the techniques discussed. • Partnerships. 
As the volume and variety of alternative data sources grows, it is impossible for most organizations to develop either the infrastructure or the skills to gather and manage such data. Data partnerships or collaboratives can offer a way forward in such situations. • Skills. Data science is a fast-developing area and organizations should consider programs to develop and nurture the capacity of staff to use the techniques described above. Otherwise organizations face the risk of a wall between their data science teams and subject matter experts. • Sustainability and long-term reproducibility. Changes and deprecation of API and general data access over time have been observed across various data sources such as Facebook, Instagram, AngelList, and the like. To mitigate this risk, it is important to establish good initial foundations for any methodology involving alternative data. Conclusion and Looking ahead Page | 58 Bibliography Ansolabehere, Stephen, & Hersh, Eitan. (2012). “Validation: What Big Data Reveal About Survey Misreporting and the Real Electorate.” Political Analysis 20 (4): 437–59. doi:10.1093/pan/mps023. Beskow, Laura M., Sandler, Robert S., & Weinberger, Morris. (2006). “Research Recruitment Through US Central Cancer Registries: Balancing Privacy and Scientific Issues.” American Journal of Public Health 96 (11): 1920–26. doi:10.2105/AJPH.2004.061556. Blumenstock, Joshua E., Cadamuro, Gabriel, and On, Robert. (2015). “Predicting Poverty and Wealth from Mobile Phone Metadata.” Science 350 (6264): 1073–6. doi:10.1126/science.aac4420. Costenbader, E., & Valente, T. W. (2003). The stability of centrality measures when networks are sampled. Elsevier B.V. Retrieved from https://www.bebr.ufl.edu/sites/default/files/Costenbader%20and%20Valente%20- %202003%20- %20The%20stability%20of%20centrality%20measures%20when%20networks.pdf Endeavor Insight. 2014. The Power of Entrepreneur Networks: How New York City Became the Role Model for Other Urban Tech Hubs. http://www.nyctechmap.com/nycTechReport.pdf. Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. (2009). “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–14. doi:10.1038/nature07634. Groves, Robert M. (2004). Survey Errors and Survey Costs. Hoboken, NJ: Wiley. ———. (2006). “Nonresponse Rates and Nonresponse Bias in Household Surveys.” Public Opinion Quarterly 70 (5): 646–75. doi:10.1093/poq/nfl033. ———. (2011). “Three Eras of Survey Research.” Public Opinion Quarterly 75 (5): 861–71. doi:10.1093/poq/nfr057. Judson, D. H. (2007). “Information Integration for Constructing Social Statistics: History, Theory and Ideas Towards a Research Programme.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 170 (2): 483–501. doi:10.1111/j.1467- 985X.2007.00472.x. Olson, Janice A. (1996). “The Health and Retirement Study: The New Retirement Survey.” Social Security Bulletin 59: 85. http://heinonline.org/HOL/Page?handle=hein.journals/ssbul59&id=87&div=13&collecti on=journals. Olson, Janice A. (1999). “Linkages with Data from Social Security Administrative Records in the Health and Retirement Study.” Social Security Bulletin 62: 73. http://heinonline.org/HOL/Page? handle=hein.journals/ssbul62&id=207&div=25&collection=journals Salganik, Matthew J. (2017). Bit by Bit: Social Research in the Digital Age. Princeton, NJ: Princeton University Press. Startup Genome LLC. (2018). 
Global Startup Ecosystem Report 2018: Succeeding in the New Era of Technology. Retrieved from https://startupgenome.com/download- report/?file=2018 Bibliography Page | 60