Policy Research Working Paper 7398

Nowcasting Prices Using Google Trends: An Application to Central America

Skipper Seabold (American University) and Andrea Coppola (The World Bank)
Macroeconomics and Fiscal Management Global Practice Group
August 2015

JEL codes: E31, C55, C8
Keywords: Macroeconomic modeling and statistics, Inflation, Big Data
Corresponding author e-mail: jsseabold@gmail.com. This is a preliminary draft.

Abstract

The objective of this study is to assess the possibility of using Internet search keyword data for forecasting price series in Central America, focusing on Costa Rica, El Salvador, and Honduras. The Internet search data come from Google Trends. The paper introduces these data and discusses some of the challenges inherent in working with them in the context of developing countries. A new index of consumer search behavior is introduced for these countries using Google Trends data covering a two-week period during a single month. For each country, the study estimates one-step-ahead forecasts for several dozen price series for food and consumer goods categories. The study finds that the addition of the Internet search index improves forecasting over benchmark models in about 20 percent of the series. The paper discusses the reasons for the varied success and potential avenues for future research.

This paper is a product of the Macroeconomics and Fiscal Management Global Practice Group, produced by the Research Support Team. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

1 Introduction

It is a well-recognized problem that policy makers must make decisions before all data about the current economic environment are available. Given this reality, there is considerable interest in short-term forecasting and nowcasting using intra-period data releases. For example, the forecaster can provide an estimate of GDP this quarter using other data that are available at a monthly frequency. This technique is called nowcasting, or predicting the present. Giannone et al. [2008] lay out three tenets of nowcasting. First, many data series are used. Second, nowcasts are updated as intra-period data become available. Finally, nowcasting "bridges" higher frequency data releases with the nowcast of the lower frequency series of interest. This study is similar in spirit to that of Giannone et al. [2008]. However, while Giannone et al. [2008] are concerned with nowcasting GDP using a large number of economic data series, this paper nowcasts price series using Internet search keyword data from Google Trends (http://google.com/trends). Furthermore, we do not attempt to "bridge" higher frequency data with lower frequency data explicitly as part of a model. The Google Trends data are not all systematically available at a higher frequency than the series we wish to forecast.
Instead, we are more concerned with the efficient aggregation of many series to help improve our nowcasts.

There are three main contributions of this study. First, it focuses on the countries of Central America. Almost the entirety of the nowcasting literature focuses on developed countries, with one notable exception in Carrière-Swallow and Labbé [2013]. Second, this is a large-scale study, approaching the problem of nowcasting with Google Trends from a data mining perspective rather than one solely grounded in economic theory. This approach gives us insights that will be useful for forecasters who wish to pursue similar ends. Third, we introduce methods from the statistical learning literature to compute the Google Trends keyword search index that are not yet commonly used in forecasting studies. Given the large number of series included in this study, we rely heavily on automatic model identification procedures. Despite this potential shortcoming, we find that Google Trends can improve our ability to forecast certain series. These findings are notable and may be worth pursuing in more detail.

The outline of the paper is as follows. Section 2 reviews some of the literature on nowcasting and the use of Google Trends data in forecasting. Section 3 introduces the data and discusses the challenges of working with Google Trends data for the countries of Central America. Section 4 explains the framework used for forecasting and evaluating forecasts. Section 5 discusses the results of this exercise and assesses the usefulness of Google Trends data in forecasting price series for Central American countries. Section 6 concludes, noting several paths for continuing research. While that section deals specifically with ideas for future research, there are notes about ongoing research throughout the paper.
2 Literature Review

There is a growing literature using Internet search keyword data, and Google Trends in particular, for forecasting and nowcasting. Ettredge et al. [2005] were the first to use search engine keyword data to aid in forecasting. They found keyword-based searches to be helpful in predicting the number of unemployed workers in the United States. The use of Google Trends data, specifically, in forecasting yet-to-be-released macroeconomic series goes back to Choi and Varian [2009, 2012]. They find that Google Trends data help to forecast initial unemployment claims, automobile sales, and consumer confidence in the United States. Since then, there have been numerous efforts to use Google Trends data in forecasting. Schmidt and Vosen [2012] use search data related to the "cash for clunkers" program to improve forecasts for private consumption in France, Germany, Italy, and the United States. Guzman [2011] uses Google search data to estimate inflation expectations. Suhoy [2009] estimates accurate probabilities of downturn in early 2007 using Google search category data for Israel. The author also finds improvements in estimates of private consumption by employing the search data.

Early results on using Google Trends data as a proxy for consumer sentiment are promising. Traditionally, studies have made use of survey-based sentiment data to provide leading indicators of series of interest. However, these data are not always available, especially in developing countries. Vosen and Schmidt [2011] show that Google Trends outperforms the University of Michigan Consumer Sentiment Index and the Conference Board Consumer Confidence Index in predicting private consumption in the United States. One study which is very relevant to our present effort is that of Carrière-Swallow and Labbé [2013]. The authors look at the benefits of using Google Trends data in the context of a developing country, Chile.
They develop an index of consumer interest in automobile purchases and find that it outperforms benchmark specifications that take advantage of the IMACEC index of consumer activity. We will use a similar framework to the one employed in that study in what follows.

3 Data

This section first describes the raw data and then the transformations that are made to each series before estimation. A subsection is dedicated to addressing some of the challenges inherent in working with the search query data from Google in emerging market countries.

For each of Costa Rica, El Salvador, and Honduras (see Footnote 2), there are two categories of series that we will forecast: data on aggregate consumer prices and their component series, and staple food price data. We obtained the consumer price data from the statistical office of each country. See Appendix A for details. The raw series are in levels and are not seasonally adjusted. The food price data were obtained from the Global Information and Early Warning System on Food and Agriculture (GIEWS) from the Food and Agriculture Organization of the United Nations (FAO). The types of food that are available from GIEWS are particular to each country. We obtained every available series. Appendix A gives, for each country, the series names, appropriate region, and the units for which we have data available. These series are not seasonally adjusted.

To augment our forecasts, we have obtained Google Trends data on a number of search keywords. These keywords were chosen ex ante with the belief that they contain relevant information that will allow us to use them as a proxy for consumer behavior and beliefs. Obtaining real-time insights into consumer behavior allows us to better predict price changes, all other things equal. In some sense, the Trends data take the place of traditional consumer-sentiment surveys. The keywords that we have chosen are listed in Table 1.
Each individual Google Trends series is relative, not an absolute measure of search volume. That is, the period in which the search interest in a keyword is highest within the dates of inquiry receives a value of 100. All other periods for an individual series are measured relative to this highest period. There is, therefore, no sense of how many people were searching for a term, and the terms themselves are not comparable with each other. Furthermore, changes in Internet penetration and the use of Google, in particular, do not matter.

The following transformation is made to each price series before estimation to go from levels to month-over-month percentage changes:

x_t = (p_t - p_{t-1}) / p_{t-1} × 100    (1)

No series has been seasonally adjusted prior to downloading.

Footnote 2: We could not acquire sufficient data on food prices or on search keywords for Belize, so it is omitted from discussion. Earlier versions of this paper contained every other country in Central America. However, given some of the data challenges discussed below, we chose to narrow our interest to three countries. We chose Costa Rica and El Salvador because they generally have good data availability from Google Trends. The quality of the data for Honduras, on the other hand, was found to be rather poor, so we included it to learn more about how the models perform under adverse data conditions.

Table 1: Search keywords used in the forecasting for Costa Rica (cr), Honduras (hn), and El Salvador (sv): arroz, azucar, carne, caro, cerdo, combustible, cuesta, "diesel -vin", frijoles, gas, gasolina, inflacion, ingresos, maiz, pago, pan, precio, precios, propano, salario, sueldo, trigo. Every keyword is used for all three countries, with the exception of "propano", which is available for only two of them. We found that the search term "diesel -vin" was more reliable in returning searches related to diesel fuel rather than the actor Vin Diesel. All analysis is based on this term.
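As a minimal sketch of the transformation in (1), assuming monthly price levels held in a pandas Series (the values here are illustrative, not from the paper's data):

```python
import pandas as pd

# Hypothetical monthly price levels (illustrative values only).
prices = pd.Series(
    [100.0, 102.0, 101.0, 104.0],
    index=pd.period_range("2013-01", periods=4, freq="M"),
)

# Equation (1): month-over-month percentage change.
x = prices.pct_change() * 100
```

The first observation of the transformed series is undefined (NaN) by construction, so one observation is lost at the start of each estimation sample.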
Therefore, many of the series exhibit some degree of seasonality, sometimes quite strong. As discussed below, we will attempt to model the seasonality explicitly when present. A few of the GIEWS price data series contain missing observations. The missing observations were replaced using simple linear interpolation before applying this transformation.

The Google Trends data are transformed as follows. Some of the search terms are available at weekly frequencies while other series are only available at monthly frequencies. For those that are available at a weekly frequency, we take the maximum value in each month to be the value for that month. This differs from the approach of Vosen and Schmidt [2011] and Carrière-Swallow and Labbé [2013], who aggregate the weekly data into monthly series by taking the monthly average of the indicators. Since the data are relative, we do not wish to first smooth them in this way. This could mask potentially important, short-lived events. Further transformations to the Trends data are described in the next subsection.

3.1 Challenges in Using Google Trends Data

Several challenges present themselves when working with the Google Trends data in a developing country context. First, as pointed out by Carrière-Swallow and Labbé [2013], Google Trends historical data are not constant over time. Within the same 24-hour period, the results will be the same. However, from day to day the results can be different. Indeed, not only do the values change, but on one day monthly data may be returned, while on another day biweekly or weekly data may be returned for the same keyword search. It is unclear what exactly is driving these differences – whether different normalizations, sampling considerations, or something else – but for practical purposes we can treat the data as being recorded with sampling error, with the same consequences. For the present study, we collected data on all of the keywords for ten days over a period of one month.
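The monthly aggregation described above, taking the maximum weekly value within each month rather than the average, can be sketched as follows (hypothetical values; `weekly` stands in for one keyword's Trends series):

```python
import pandas as pd

# Hypothetical weekly Trends observations for a single keyword.
weekly = pd.Series(
    [10, 40, 25, 5, 60, 15],
    index=pd.to_datetime(["2013-01-06", "2013-01-13", "2013-01-20",
                          "2013-02-03", "2013-02-10", "2013-02-17"]),
)

# Take the maximum value within each month, preserving short-lived spikes
# that a monthly average would smooth away.
monthly = weekly.groupby(weekly.index.to_period("M")).max()
```

Taking the maximum rather than the mean is the design choice motivated in the text: because the data are relative, averaging can mask brief but important spikes in search interest.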
Figures 1 and 2 show the sampling error for two representative series collected during this period. These series are chosen to be representative of all of the series used and show the two most salient features for the purposes of this study.

Figure 1: Results for 10 days during the study period for the "precios" keyword in Costa Rica. The dark line is the average. The gray bands are minimum and maximum observed values for that month over the study period.

Figure 2: Results for 10 days during the study period for the "caro" keyword in El Salvador. The dark line is the average. The gray bands are minimum and maximum observed values for that month over the study period.

First, the sampling error is evident in both figures. Figure 1 is in some sense a best-case scenario. Variability is very large for the first two years of the sample but becomes quite a bit more stable after this initial uncertainty. Figure 2, on the other hand, shows high sampling variability throughout the entire period. We will assume that the signal of each series can be well approximated by its average, and we use the average when referring to the series for a single keyword in what follows unless otherwise indicated.

The second thing to note in Figures 1 and 2 is that many of the observations for a single draw of the Google Trends data are exactly zero. These zero observations present two difficulties in particular – one conceptual and one practical. First, conceptually, these zeros suggest a lack of signal where presumably there should be some. As we collect more daily samples of the data, this problem becomes less severe, again assuming that the signal is well approximated by the mean. However, this problem does not disappear.
Looking at the early parts of both series, there are still observations which are zero even at the mean.

Second, as a practical problem, some of the Google Trends data contain strong seasonal components. Studies such as Carrière-Swallow and Labbé [2013] alleviate the effects of seasonality in the Trends data by using year-over-year percent changes for them as well as for the series to forecast. However, if the base year is zero, we would lose this entire year of data. We employed several techniques in an attempt to overcome these problems, which we will now describe.

The Google Trends data can be written more formally as X_{i,j,t}, where i represents the vintage – a downloaded sample on a particular day, j represents a particular keyword, and t represents the weekly, bi-weekly, or monthly observation of each keyword. The first task is to deal with the i vintage, or sample, index. We took the mean and the median of all the samples. This leaves us with either

X_{j,t} = (1/I) Σ_i X_{i,j,t}

for the mean, where I is the total number of samples taken, or

X_{j,t} = med_i(X_{i,j,t})

for the median.

After handling the sampling dimension, we apply transformations to smooth the data for each keyword and attempt to better identify the signal from the noise, given the nature of the search data. Here, we take several different approaches. First, we apply a simple exponential smoothing model with additive errors to the data. Following the notation of Makridakis et al. as used in Hyndman et al. [2002], this model can be written

l_t = α y_t + (1 - α) l_{t-1}    (2)

We choose to fix α = 0.5. Results typical of this smoothing can be seen in Figures 3 and 4. We include both the forecastable part of the series and the unsystematic "surprise" part of the series.

We also tried smoothing the results by applying the Christiano-Fitzgerald (CF) band-pass filter [Christiano and Fitzgerald, 2003]. The CF filter starts from the (false) assumption that the underlying data obey a unit root process.
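The smoothing recursion (2) with α fixed at 0.5 can be sketched as below; initializing the level at the first observation is our assumption, since the text does not specify the initialization:

```python
def ses(y, alpha=0.5):
    """Simple exponential smoothing, eq. (2): l_t = alpha*y_t + (1-alpha)*l_{t-1}.

    Returns the one-step-ahead fitted (forecastable) part and the
    unsystematic "surprise" part, y_t - l_{t-1}.
    """
    level = y[0]  # assumed initialization at the first observation
    fitted, surprise = [], []
    for obs in y:
        fitted.append(level)           # forecastable part
        surprise.append(obs - level)   # "surprise" part
        level = alpha * obs + (1 - alpha) * level
    return fitted, surprise
```

For example, `ses([10, 20, 20])` returns fitted values `[10, 10, 15]` and surprises `[0, 10, 5]`, the two components plotted in the figures' top and bottom panes respectively.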
Using this assumption, the CF filter provides an approximation to an optimal band-pass filter as follows:

ĉ_t = B_0 y_t + B_1 y_{t+1} + ... + B_{T-1-t} y_{T-1} + B̃_{T-t} y_T + B_1 y_{t-1} + ... + B_{t-2} y_2 + B̃_{t-1} y_1    (3)

where B_j = (sin(jb) - sin(ja)) / (πj) for j ≥ 1, B_0 = (b - a)/π, a = 2π/p_u, b = 2π/p_l, and B̃_k = -(1/2) B_0 - Σ_{j=1}^{k-1} B_j. The parameters p_u and p_l denote the cut-offs for the cycles for the high and low frequency elements, respectively. We remove all stochastic cycles at a periodicity lower than 3 months and higher than 12 months. This has the effect of both smoothing the series and removing long-term seasonality. The results of applying the CF filter to our two selected series can be seen in Figures 5 and 6.

One notable advantage of techniques such as exponential smoothing and the CF filter is that they provide us with real-time estimates at the ends of our series, so that we do not need to truncate our observed series at the beginning or the end, as would be necessary if we used simple moving averages, seasonal differences, or another filter such as the Baxter-King (see Footnote 3).

Footnote 3: Of course, we could estimate a model, forecast and backcast, and then apply a filter that truncates, using these extra data points. However, this is another form of uncertainty that we would like to avoid introducing. Instead, we prefer to use only the information we have.

Figure 3: Smoothed results for the average of the "precios" keyword in Costa Rica. The top pane contains the original series and the smoothed, in-sample forecasted series. The forecasted series is labeled A, N, N, indicating additive errors, no trend, and no seasonality according to the Hyndman et al. [2002] taxonomy. The bottom pane contains the unsystematic or "surprise" component of the series.
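The CF filtering step can be sketched with the implementation in statsmodels; the synthetic series below is an assumption standing in for an averaged keyword, and `low=3, high=12` matches the 3-to-12-month band retained in the text:

```python
import numpy as np
from statsmodels.tsa.filters.cf_filter import cffilter

rng = np.random.default_rng(0)
# Synthetic monthly series standing in for an averaged Trends keyword.
y = 50 + 10 * np.sin(2 * np.pi * np.arange(120) / 12) + rng.normal(0, 5, 120)

# Retain stochastic cycles with periods between 3 and 12 months.
# drift=False: the bounded, relative Trends data have no deterministic drift.
cycle, trend = cffilter(y, low=3, high=12, drift=False)
```

As noted above, the asymmetric CF filter returns estimates for every observation, so unlike the Baxter-King filter, `cycle` has the same length as the input and the ends of the series are not truncated.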
Figure 4: Smoothed results for the average of the "caro" keyword in El Salvador. The top pane contains the original series and the smoothed, in-sample forecasted series. The forecasted series is labeled A, N, N, indicating additive errors, no trend, and no seasonality according to the Hyndman et al. [2002] taxonomy. The bottom pane contains the unsystematic or "surprise" component of the series.

Figure 5: Smoothed results for the average of the "precios" keyword in Costa Rica. The smoothed series is computed using the Christiano-Fitzgerald filter with all stochastic cycles at a periodicity lower than 3 months and higher than 12 months removed.

Figure 6: Smoothed results for the average of the "caro" keyword in El Salvador. The smoothed series is computed using the Christiano-Fitzgerald filter with all stochastic cycles at a periodicity lower than 3 months and higher than 12 months removed.

4 Methodology

To nowcast a series at a particular point in time, we produce an estimate of the series before that variable has been observed but when other contemporaneous variables in our information set have been observed. For instance, we might use data available to us now to get an estimate for economic growth or inflation before official statistics are released. As a concrete example, suppose that in mid-April 2014 we have either a few weeks of Google Trends data or perhaps some preliminary monthly estimate of a search term, but we do not yet know the current inflation.
Lags in publication of inflation could mean that we only have estimates for inflation through March or even February 2014. If a policymaker is interested in knowing inflation today, we would nowcast at a monthly horizon of h_m ≥ 1.

Our strategy for this exercise is as follows. For each series in each country, we will compare nowcasts using Google Trends data and one-step-ahead forecasts from a best-effort ARIMA model to some benchmark models to assess whether the information available from Google Trends data improves our forecasting ability. We now introduce our benchmark models. In the following subsection, we discuss what we mean by a "best effort" ARIMA model.

4.1 Benchmark Models

Five simple models are estimated to provide a baseline for the candidate models described below. The estimated baseline models are the simple mean of the series, the median of the series, the value of the series in the previous period, an AR(1) model, and an A, A, N exponential smoothing model. This exponential smoothing model, written in its recursive form, is given by

l_t = α y_t + (1 - α)(l_{t-1} + b_{t-1})    (4)
b_t = β (l_t - l_{t-1}) + (1 - β) b_{t-1}

where l_t and b_t are the level and growth rate, respectively, and the parameters along with the initial states are estimated as described in section 3.1. This model is otherwise known as Holt's linear method with additive errors and is equivalent to an ARIMA(0, 2, 2) model [Hyndman et al., 2008]. Our one-step-ahead point forecasts are given by

ŷ_{t+1} = (1/t) Σ_{i=1}^{t} y_i    (5a)
ŷ_{t+1} = median({y_i}), i = 1, ..., t    (5b)
ŷ_{t+1} = y_t    (5c)
ŷ_{t+1} = ρ̂ y_t + ε_t    (5d)
ŷ_{t+1} = l_t + b_t    (5e)

where ε_t ∼ N(0, σ²) in (5d). We choose the baseline model for each series based on mean squared error (MSE).

Table 2: The total number of series for which each benchmark model is deemed the best by the MSE criterion.

    Benchmark Model    Total
    ar                 23
    ets                4
    mean               11
    median             20
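The one-step-ahead benchmark forecasts (5a)-(5d) can be sketched as follows; the ETS benchmark (5e) is omitted for brevity, and estimating the AR(1) coefficient by least squares without an intercept is our simplification:

```python
import numpy as np

def benchmark_forecasts(y):
    """One-step-ahead benchmark forecasts (5a)-(5d) given the history y."""
    y = np.asarray(y, dtype=float)
    # AR(1) coefficient by least squares of y_t on y_{t-1} (no intercept).
    rho = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
    return {
        "mean": y.mean(),               # (5a)
        "median": float(np.median(y)),  # (5b)
        "naive": y[-1],                 # (5c), previous-period value
        "ar1": rho * y[-1],             # (5d), future shock set to zero
    }
```

Each series' benchmark is then the entry of this kind with the lowest expanding-window MSE.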
MSE is defined as usual:

MSE = (1/T) Σ_{t=1}^{T} (Ŷ_t - Y_t)²

where Ŷ_t is our forecast estimate, Y_t is the true observation at time t, and T is the total number of observations. To compute the MSE for the benchmarks, we start with two years of data and compute one-step-ahead forecasts using each benchmark model until time T - 1, where T is the last period for which we have data that we wish to forecast. We then choose the model that has the best performance in all periods as the benchmark model for that series. Table 2 presents an overview of which benchmark model is best in an MSE sense. The AR(1) and median are preferred most often. The benchmark for the individual series is presented with the full results in Section 5 for ease of comparison.

4.2 Forecasting and Nowcasting Models

To attempt to improve over these baseline models, we first estimate a possibly seasonal autoregressive integrated moving-average (ARIMA) model for each monthly series:

ϕ(L)^p ϕ_12(L^12)^P (1 - L)^d (1 - L^12)^D (y_t - µ) = θ(L)^q θ_12(L^12)^Q ε_t    (6)

where y_t is the series we wish to forecast, ε_t follows a white noise process, L is the lag operator (L^i y_t = y_{t-i}), ϕ(L)^p = (1 - φ_1 L - ... - φ_p L^p) is the non-seasonal polynomial of order p in the lag operator that describes the autoregressive component of the model, and ϕ_12(L^12)^P = (1 - φ_{12,1} L^12 - ... - φ_{12,P} L^{12P}) is the seasonal polynomial of order P in the lag operator that describes the seasonal autoregressive component of the model. The polynomial of order q that denotes the non-seasonal MA component of the model is θ(L)^q, and likewise the seasonal MA component of order Q is denoted θ_12(L^12)^Q. The non-seasonal and seasonal orders of differencing are denoted d and D, respectively. We use the auto.arima function from the forecast package in R (see Footnote 4) for order identification for each series.
See Hyndman and Khandakar [2008] for more information on the model identification procedure (see Footnote 5). The auto.arima automatic model identification procedure allows parameters to be zero, so in principle, for example, the model is only differenced or includes a seasonal component when it is appropriate.

To test whether there is information in the Google Trends data that will help us forecast each series, we use a possibly seasonal ARIMAX model where the Trends data are used as an exogenous variable (see Footnote 6). The seasonal ARIMAX model estimated is specified

ϕ(L)^p ϕ_12(L^12)^P (1 - L)^d (1 - L^12)^D (y_t - β x_t) = θ(L)^q θ_12(L^12)^Q ε_t    (9)

where everything is as in (6) and x_t contains the Google Trends index that we describe in the next section. The addition of this term allows us to model the information contained in the Google Trends data as a time-varying mean.

Footnote 4: We used the 5.4 development version obtained from https://github.com/robjhyndman/forecast/

Footnote 5: We also performed order identification using the AUTOMDL procedure from X-13ARIMA-SEATS [Staff, 2013] as well as using (seasonal) unit root tests to identify the order of (seasonal) differencing and then using the Bayesian information criterion (BIC) to select the best model. None of the procedures used produced identical results, nor did any procedure do unambiguously better than any other. The auto.arima function was the most computationally performant and is thus the basis for the results below. We used the default arguments for this function.

Footnote 6: This model is sometimes referred to as a regression model with ARMA errors. Ignoring seasonality, it may be written

y_t = β x_t + z_t    (7)
ϕ(L) z_t = θ(L) ε_t

This is to contrast it with the ARMAX model, which is written

ϕ(L)^p y_t = X_t β + θ(L)^q ε_t    (8)

4.3 Index Construction

In order to incorporate the information from the various Google Trends search keywords, it is desirable to synthesize the information in all of the Google Trends data into something more manageable.
Formerly, authors used the Google Insights search categories data. These data are used in many of the studies referenced in Section 2. However, previously such an index from Google Insights was usually not available outside of large, developed countries, so studies such as Carrière-Swallow and Labbé [2013] estimate their own. The advantage of having an index is mainly parsimony of information. Indeed, such an index may be of interest in its own right. Furthermore, in September 2012 Google merged some features of Google Insights with Trends and discontinued the aggregate search categories entirely (see Footnote 7).

Footnote 7: http://insidesearch.blogspot.com/2012/09/insights-into-what-world-is-searching.html One may only speculate that it was discontinued because this task is very difficult to automate.

To solve the keyword aggregation problem, Carrière-Swallow and Labbé [2013] create an index from multiple search terms by use of an expanding linear regression model described below. Other approaches rely on factor analysis techniques for dimension reduction, such as unweighted least squares [Vosen and Schmidt, 2011] or principal components analysis [Stock and Watson, 2002]. These methods assume that there are some underlying, unobserved common factors for all of the series. We describe our use of statistical learning techniques for variable selection below.

We took several approaches to constructing our search indices. First, we applied the linear index approach of Carrière-Swallow and Labbé [2013]. This is a common approach in the literature and is an attractive choice mainly for its simplicity. Let X be our matrix of year-over-year percent changes for the Google Trends terms. We construct an index I_t for these terms, for each series y_t that we wish to forecast, in the following way. In each period, we estimate the weights β̂ by using the observations up to time t - 1 and fitting a linear model
y_t = α + β X_t + ε_t

The index for period t is

I_t = E[β̂ | y_{t-1}, X_{t-1}] X_t

Given that y and X contain monthly percent changes, we can interpret I_t as the linear combination of search terms which best explains the series that we are forecasting, in a linear least squares sense. The expanding nature of the construction of the index allows the factors in the trends that explain the changes in our price series to change over time. This is certainly something we might be interested in, given the heterogeneous character of the included terms. Figure 7 contains an example of an index created using the expanding linear OLS. That is, this is the last out-of-sample fitted value of each index created for a single price series.

We anticipate two potential problems with this approach for the current exercise, and we construct this linear index using two other methods from the statistical learning literature. Both of these techniques were implemented using the scikit-learn Python package [Pedregosa et al., 2011]. First, we have a high number of variables relative to the number of observations, especially in the early years of the index. To improve the degrees of freedom of our fit, we are interested in obtaining sparse models. To this end, we applied the lasso technique introduced in Tibshirani [1996] (see Footnote 8). The lasso is a penalized least squares method that allows both continuous shrinkage and variable selection through the imposition of an L1 penalty on the regression coefficients β. That is, the coefficients are pushed both towards and to zero when appropriate. The optimization problem for the lasso is

min_β (1/2n) ‖y - Xβ‖²_2 + α ‖β‖_1

where α is chosen via K-folds cross-validation with K = 5 and the L_p norm is defined ‖x‖_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p}. Figure 8 contains an example of an index created using the expanding lasso linear model. The fit is much more conservative than the linear OLS fit, given the sparse nature of the solution.
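The expanding-window index construction with the lasso can be sketched as below on simulated data; the 24-observation warm-up mirrors the two-year startup used elsewhere in the paper, and `LassoCV` chooses α by 5-fold cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
T, k = 80, 20
X = rng.normal(0, 1, (T, k))   # stand-in for keyword percent changes
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(0, 0.5, T)  # series to forecast

# At each t, fit on observations through t-1 only, then evaluate the
# estimated combination at the period-t keyword values: I_t = X_t beta_hat.
index = np.full(T, np.nan)
for t in range(24, T):
    model = LassoCV(cv=5).fit(X[:t], y[:t])
    index[t] = model.predict(X[t:t + 1])[0]
```

Because only past observations enter each fit, every index value is an out-of-sample prediction, matching the expanding construction described above.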
Footnote 8: We also considered the more general LARS estimator introduced by Efron et al. [2004]. The results of this estimator were comparable, though slightly worse than the lasso. It should also be noted that we ran the computationally efficient LARS algorithm variant for the lasso solution path.

Figure 7: Linear OLS index created for the CPI series from Costa Rica. Displays some evidence of overfitting.

Figure 8: Lasso model index created for the CPI series from Costa Rica. Conservative fit; does not vary much.

Both Zou and Hastie [2005] and Tibshirani [1996] point out that the lasso may not perform well empirically in the cases where the number of variables is higher than the number of observations (see Footnote 9), there are groups of variables with high pairwise correlation, or there are high correlations between all predictors. These are all possible concerns for our keywords from Google Trends. To account for these issues, we employ the elastic net estimator of Zou and Hastie [2005]. The elastic net estimator uses a linear combination of the L1 penalty of the lasso and the L2 penalty of ridge regression [Hoerl and Kennard, 1970]. The objective function of the elastic net is

min_β (1/2n) ‖y - Xβ‖²_2 + αρ ‖β‖_1 + (α(1 - ρ)/2) ‖β‖²_2

where α and ρ are chosen via K-folds cross-validation with K = 5. Using both the lasso and the elastic net, we compute the index in the same way as the linear OLS index, except that the β coefficients are obtained from the two new estimators. Figure 9 contains an example of an index created using the expanding elastic net model. The fit is somewhere between the high-variance OLS model and the conservative lasso model. In the following section we describe the empirical results of using these indices.
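The elastic net objective above matches scikit-learn's parameterization, with ρ exposed as `l1_ratio`. A sketch that cross-validates both α and ρ on simulated, deliberately correlated regressors:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
n, k = 60, 15
X = rng.normal(0, 1, (n, k))
X[:, 1] = X[:, 0] + rng.normal(0, 0.05, n)   # high pairwise correlation
y = X[:, 0] + rng.normal(0, 0.5, n)

# K = 5 folds choose alpha and the L1/L2 mixing parameter rho jointly.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
```

The fitted `model.coef_` then plays the role of β̂ in the expanding index construction, in place of the OLS or lasso coefficients.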
5 Forecasting Results

Our hypothesis is that there is additional information in our transformations of the Google Trends data that allows improved nowcasts of the series of interest, before their respective data releases, relative to an ARMA model and our respective benchmarks. To test this hypothesis we compute one-step-ahead forecasts using (6) and (9) and compare them to the chosen models from (5). Just as for the benchmarks, we start with two years of monthly data and then estimate expanding window models until time T − 1, where T is the last period for which we have data for the series we wish to forecast and for which we have T Google Trends index values. At each time t in t = 24, ..., T − 1 we recompute the order of the seasonal ARIMA(X) model as described above. This is to emulate what a practitioner would do in any given period. For each forecast (and nowcast) we compute the one-step-ahead forecast error

$$\hat{e}_{k,t+1} \equiv y_{k,t+1} - E_t[\hat{y}_{k,t+1}] \qquad (10)$$

for model k.

Footnote 9: This is not the case in the current analysis, though we do have the case where the number of variables is only slightly smaller than the number of observations in the early periods of our index construction.

Figure 9: Elastic net model index created for the CPI series from Costa Rica. Somewhere in between the high-variance OLS and the low-variance lasso.

We compute the relative MSE for each series combination method defined in section 4.3. That is, for the original data $X_{i,j,t}$ we computed the results for each of the reduction methods over the i sampling dimension – mean, median, applying the CF filter after taking the mean, and ETS smoothing after taking the mean – and for each i reduction we also computed the three linear indices over the j keywords – linear OLS, lasso, and elastic net. We found first that the linear OLS trend performed unambiguously the worst.
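The forecast error in (10) and the relative MSE used throughout the results tables can be expressed compactly. This is our own sketch (the helper names are ours, not the paper's): a relative MSE below 1 means the candidate model beat the benchmark.

```python
import numpy as np

def one_step_errors(actual, forecasts):
    """Forecast error e_hat_{k,t+1} = y_{k,t+1} - E_t[y_hat_{k,t+1}], eq. (10)."""
    return np.asarray(actual, dtype=float) - np.asarray(forecasts, dtype=float)

def relative_mse(model_errors, benchmark_errors):
    """MSE of the candidate model divided by MSE of the benchmark.
    Values below 1 indicate the candidate beat the benchmark."""
    return np.mean(np.square(model_errors)) / np.mean(np.square(benchmark_errors))

# Toy illustration: a candidate whose errors are half the benchmark's
# in magnitude has a relative MSE of 0.25.
e_model = one_step_errors([1.0, 2.0, 3.0], [0.5, 2.5, 2.5])
e_bench = one_step_errors([1.0, 2.0, 3.0], [0.0, 3.0, 2.0])
print(relative_mse(e_model, e_bench))  # -> 0.25
```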
We were unable to beat both the benchmark model and the ARMA model even once, regardless of the smoothing technique applied. This is not wholly surprising given that we did not apply any variable selection to the keywords beforehand. Inclusion of inappropriate keywords appears to have led to overfitting and poor out-of-sample performance.

Moving to the lasso and the elastic net, for each estimator the ETS smoothed results performed best. Between the two estimators, the elastic net performed marginally better, beating both the benchmarks and the ARMA models in a few more cases. Again, this is not wholly surprising given the documented better empirical performance of the elastic net estimator when there are high pairwise correlations among the regressors. Due to the large number of results, Table 3 contains only the results for the ETS smoothed data and the trend computed via the elastic net estimator.

Using our best performing method, the ARMAX model outperforms the benchmark in 28% of the cases, or for 16 of the 58 series. In each of these cases, the ARMAX also outperforms the ARMA model. The ARMA model fares slightly worse against the benchmark, outperforming it in 22% of the series, or for 13 of the 58 series. Moreover, the ARMA model is the best model, beating the ARMAX, in only 7 of these cases. The food price series appear to be particularly difficult to forecast. If we consider only the consumer price series, the ARMAX model is the best model in 24% of the cases, while the ARMA model is the best in only 14%. The difficulty in forecasting food prices is likely due to the food price crisis during the period: the US dollar price fluctuations over the time under consideration were driven largely by events external to the countries of Central America.
These results, though only partially successful, indicate that there may be some benefit to exploring the further use of Google Trends data in forecasting economic series in Central America. We use the concluding section to speculate on some of the reasons for this success or lack thereof and to give suggestions for future research.

Table 3: Results for each series using the ETS smoothed data and the elastic net estimator. Relative MSE (1) is the MSE for the expanding window ARMA model versus the benchmark model given in column 6. Relative MSE (2) is the MSE for the elastic net model versus the benchmark in column 6. A relative MSE less than 1 indicates that the proposed model beat the benchmark.

Country  Series  Relative MSE (1)  Relative MSE (2)  N   Benchmark
cr       food01  1.04644           1.09103           94  ar
cr       food02  1.00313           1.09517           94  median
cr       food03  1.08606           0.977452          94  mean
cr       food04  1.0755            1.06406           94  median
cr       food05  1.32068           1.32068           38  median
cr       food06  1.01586           1.04028           94  ar
cr       food07  1.08315           1.1684            94  mean
cr       food08  1.08143           1.17251           94  mean
cr       food09  0.886634          1.20473           57  median
cr       infl01  0.980986          1.09914           94  ar
cr       infl02  1                 1.00755           63  mean
cr       infl03  0.993006          2.13919           63  median
cr       infl04  1.08064           1.08917           63  mean
cr       infl05  0.981035          1.11166           95  ets
cr       infl06  0.840378          0.808711          63  ar
cr       infl07  1.03188           1.00261           63  ar
cr       infl08  1.03867           0.860798          63  ar
cr       infl09  1.0767            1.01002           63  median
cr       infl10  1.11333           1.23137           63  ets
cr       infl11  1.10306           1.15871           63  median
cr       infl12  1.17467           1.18608           63  ets
cr       infl13  1.14892           1.35561           63  ets
cr       infl14  1.0491            1.04081           63  ar
hn       food01  1.11477           1.23652           94  median
hn       food02  1.06082           1.02454           94  mean
Table 3 (continued)

Country  Series  Relative MSE (1)  Relative MSE (2)  N   Benchmark
hn       food03  1.02453           1.22845           56  ar
hn       food04  1.05794           1.02891           56  ar
hn       food05  1.50041           1.41338           56  median
hn       food06  1.06789           1.34548           56  ar
hn       food07  1.02272           1.19456           56  ar
hn       food08  1.02289           1.02308           56  ar
hn       infl01  1.14475           1.07673           94  ar
hn       infl02  0.954699          1.13202           94  ar
sv       food01  1.32029           1.34929           69  ar
sv       food02  0.981255          1.17397           69  ar
sv       food03  0.999156          1.4412            69  ar
sv       food04  1.00938           1.37656           69  ar
sv       food05  1.0587            1.22251           69  median
sv       food06  1.01563           1.2812            69  ar
sv       food07  1.14023           1.42201           69  median
sv       food08  1.26369           1.31265           69  median
sv       food09  1.08291           1.07086           69  median
sv       food10  1.05501           1.22104           69  ar
sv       food11  1                 0.998684          69  median
sv       food12  1.24149           1.24152           69  median
sv       infl01  1.44499           1.33092           34  median
sv       infl02  0.986749          0.834845          34  ar
sv       infl03  1.10362           1.51751           34  median
sv       infl04  0.97742           0.97742           34  median
sv       infl05  1.11776           1.15664           34  ar
sv       infl06  1                 0.999088          34  mean
sv       infl07  1.29662           0.986124          34  ar
sv       infl08  0.948536          0.948536          34  median
sv       infl09  1.0029            1.00293           34  mean
sv       infl10  1.00788           1.13624           34  mean
sv       infl11  1.18043           1.03241           34  mean
sv       infl12  1.03771           1.01186           34  median
sv       infl13  1.04429           1.29648           34  mean

6 Conclusion

In this paper, we studied the possibility of using Internet search keyword data to nowcast price changes in Central America. We gathered price data for Costa Rica, El Salvador, and Honduras.
We also identified several search keywords and downloaded data for them from Google Trends over a period of weeks. We tried several aggregation, smoothing, and linear index construction methods for these Internet search data and were partially successful in improving nowcasts for Costa Rica and El Salvador, the countries for which the search data were of higher quality.

As part of the exercise, we were able to identify several important points for practitioners who wish to forecast using high-dimensional Internet search keyword time series. First, variable selection is of utmost importance. Many, if not most, of the successful forecasting studies that use Internet search keyword data are based on some theory of consumer behavior. This may be the idea that consumers use the Internet to do research before the purchase of a consumer durable, as in Carrière-Swallow and Labbé [2013], or to search for jobs, unemployment, and welfare, as in Choi and Varian [2009]. In the absence of a strong model of consumer behavior, one should incorporate some kind of variable selection mechanism. Naively including a large number of search keyword terms in a model for a search index, in the hope that the coefficients on unimportant terms will be small, leads to very poor results. However, by employing variable selection methods from the statistical learning literature, we were able to substantially improve all of our forecasts and to beat both the ARMA models and the benchmarks in several instances.

The second takeaway is the importance of order identification in ARIMA modeling. This is perhaps not a surprise for any forecaster, but the successful results here using automatic techniques are encouraging. If a forecaster were to focus on fewer series and apply the Box-Jenkins methodology rather than relying on automatic model selection procedures, it might be possible to outperform the benchmark models further.

Finally, this study suggests several avenues for further research.
We might consider further estimators such as TS-LARS, a LARS estimator written explicitly with time-series data in mind [Gelper and Croux, 2008]. It allows selection of distributed lags and ranking of predictors. Ranking of predictors will be of particular interest to those who use an exercise such as the one in this paper to generate ideas about consumer behavior and to search for keywords and categories that help forecast price changes. One might also explore using dynamic linear models, or a structural model in which the Internet search information stands in explicitly for some aspect of the theoretical model. There are also a number of different smoothing techniques and variable selection methods that might be explored.

In conclusion, the study of the manifestation of consumer sentiment via Internet search behavior is still very much in its infancy. It certainly presents a number of challenges, but the potential insights and use cases are varied and exciting. It may be tempting to dismiss this excitement as hype. All the same, it is difficult to deny the possible benefits of real-time consumer sentiment to future economics research and forecasting studies.

References

Yan Carrière-Swallow and Felipe Labbé. Nowcasting with Google Trends in an emerging market. Journal of Forecasting, 32(4):289–298, 2013.

Hyunyoung Choi and Hal Varian. Predicting initial claims for unemployment benefits. Technical Report, 2009.

Hyunyoung Choi and Hal Varian. Predicting the present with Google Trends. Economic Record, 88(s1):2–9, 2012.

Lawrence J Christiano and Terry J Fitzgerald. The band pass filter. International Economic Review, 44(2):435–465, 2003.

Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Michael Ettredge, John Gerdes, and Gilbert Karuga. Using web-based search data to predict macroeconomic statistics. Communications of the ACM, 48(11):87–92, 2005.
Sarah Gelper and Christophe Croux. Least angle regression for time series forecasting with many predictors. FBE Research Report KBI 0801, 2008.

Domenico Giannone, Lucrezia Reichlin, and David Small. Nowcasting: The real-time informational content of macroeconomic data. Journal of Monetary Economics, 55(4):665–676, 2008.

Giselle Guzman. Internet search behavior as an economic forecasting tool: The case of inflation expectations. Journal of Economic and Social Measurement, 36(3):119–167, 2011.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with exponential smoothing: the state space approach. Springer, 2008.

Rob J Hyndman and Yeasmin Khandakar. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3), 2008.

Rob J Hyndman, Anne B Koehler, Ralph D Snyder, and Simone Grose. A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3):439–454, 2002.

Spyros Makridakis, SC Wheelwright, and Rob J Hyndman. Forecasting: methods and applications. 1998.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Torsten Schmidt and Simeon Vosen. Using internet data to account for special events in economic forecasting. Ruhr Economic Paper, (382), 2012.

Time Series Research Staff. X-13ARIMA-SEATS Reference Manual. Statistical Research Division, U.S. Census Bureau, 1.1 edition, 2013.

James H Stock and Mark W Watson. Forecasting using principal components from a large number of predictors.
Journal of the American Statistical Association, 97(460):1167–1179, 2002.

Tanya Suhoy. Query indices and a 2008 downturn: Israeli data. Research Department, Bank of Israel, 2009.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

Simeon Vosen and Torsten Schmidt. Forecasting private consumption: survey-based indicators vs. Google Trends. Journal of Forecasting, 30(6):565–578, 2011.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

A Appendix

This appendix contains extra information on the variables used in this study. Table A1 provides full series names for the abbreviations used for the forecasted series. More information is provided in Table A2.

Table A1: This table contains the abbreviation used in the main tables and the full series name.

Country  Abbreviation  Series
cr       food01        Costa Rica, National Average, Beans (black), Retail...
cr       food02        Costa Rica, National Average, Beans (black), Wholes...
cr       food03        Costa Rica, National Average, Beans (red), Retail, ...
cr       food04        Costa Rica, National Average, Beans (red), Wholesal...
cr       food05        Costa Rica, National Average, Maize (white), Retail...
cr       food06        Costa Rica, National Average, Maize (white), Wholes...
cr       food07        Costa Rica, National Average, Rice (first quality),...
cr       food08        Costa Rica, National Average, Rice (second quality)...
cr       food09        Costa Rica, National Average, Wheat (flour), Retail...
cr       infl01        cpi
cr       infl02        cpi alc
cr       infl03        cpi clothes
cr       infl04        cpi comm
cr       infl05        cpi core
cr       infl06        cpi educ
cr       infl07        cpi entertain
cr       infl08        cpi food
cr       infl09        cpi health
cr       infl10        cpi household
cr       infl11        cpi housing
cr       infl12        cpi misc
cr       infl13        cpi restaurant
cr       infl14        cpi trans
hn       food01        Honduras, National Average, Beans (red), Wholesale,...
hn       food02        Honduras, National Average, Maize (white), Wholesal...
hn       food03        Honduras, San Pedro Sula, Beans (red), Wholesale, (...
hn       food04        Honduras, San Pedro Sula, Maize (white), Wholesale,...
hn       food05        Honduras, San Pedro Sula, Rice (second quality), Wh...
hn       food06        Honduras, Tegucigalpa, Beans (red), Wholesale, (USD...
hn       food07        Honduras, Tegucigalpa, Maize (white), Wholesale, (U...
hn       food08        Honduras, Tegucigalpa, Rice (second quality), Whole...
hn       infl01        cpi
hn       infl02        cpi food
sv       food01        El Salvador, San Salvador, Beans (red), Retail, (US...
sv       food02        El Salvador, San Salvador, Beans (red), Wholesale, ...
sv       food03        El Salvador, San Salvador, Beans (red, seda), Retai...
sv       food04        El Salvador, San Salvador, Beans (red, seda), Whole...
sv       food05        El Salvador, San Salvador, Maize (white), Retail, (...
sv       food06        El Salvador, San Salvador, Maize (white), Wholesale...
sv       food07        El Salvador, San Salvador, Rice, Retail, (USD/Kg)
sv       food08        El Salvador, San Salvador, Rice, Wholesale, (USD/Kg)
sv       food09        El Salvador, San Salvador, Sorghum (Maicillo), Reta...
sv       food10        El Salvador, San Salvador, Sorghum (Maicillo), Whol...
sv       food11        El Salvador, San Salvador, Wheat (flour), Retail, (...
sv       food12        El Salvador, San Salvador, Wheat (flour), Wholesale...
sv       infl01        cpi
sv       infl02        cpi alc
sv       infl03        cpi clothes
sv       infl04        cpi comm
sv       infl05        cpi educ
sv       infl06        cpi entertain
sv       infl07        cpi food
sv       infl08        cpi furniture
sv       infl09        cpi health
sv       infl10        cpi house fuel
sv       infl11        cpi misc
sv       infl12        cpi restaurant
sv       infl13        cpi trans

Table A2: Full information for all of the food price series used throughout the study.
Country  Region            Series                            Units
CR       National Average  Beans (black), Retail             (USD/Kg)
CR       National Average  Beans (black), Wholesale          (USD/Kg)
CR       National Average  Beans (red), Retail               (USD/Kg)
CR       National Average  Beans (red), Wholesale            (USD/Kg)
CR       National Average  Maize (white), Retail             (USD/Kg)
CR       National Average  Maize (white), Wholesale          (USD/Kg)
CR       National Average  Rice (first quality), Retail      (USD/Kg)
CR       National Average  Rice (second quality), Retail     (USD/Kg)
CR       National Average  Wheat (flour), Retail             (USD/Kg)
HN       National Average  Beans (red), Wholesale            (USD/Kg)
HN       National Average  Maize (white), Wholesale          (USD/Kg)
HN       San Pedro Sula    Beans (red), Wholesale            (USD/Kg)
HN       San Pedro Sula    Maize (white), Wholesale          (USD/Kg)
HN       San Pedro Sula    Rice (second quality), Wholesale  (USD/Kg)
HN       Tegucigalpa       Beans (red), Wholesale            (USD/Kg)
HN       Tegucigalpa       Maize (white), Wholesale          (USD/Kg)
HN       Tegucigalpa       Rice (second quality), Wholesale  (USD/Kg)
SV       San Salvador      Beans (red), Retail               (USD/Kg)
SV       San Salvador      Beans (red), Wholesale            (USD/Kg)
SV       San Salvador      Beans (red, seda), Retail         (USD/Kg)
SV       San Salvador      Beans (red, seda), Wholesale      (USD/Kg)
SV       San Salvador      Maize (white), Retail             (USD/Kg)
SV       San Salvador      Maize (white), Wholesale          (USD/Kg)
SV       San Salvador      Rice, Retail                      (USD/Kg)
SV       San Salvador      Rice, Wholesale                   (USD/Kg)
SV       San Salvador      Sorghum (Maicillo), Retail        (USD/Kg)
SV       San Salvador      Sorghum (Maicillo), Wholesale     (USD/Kg)
SV       San Salvador      Wheat (flour), Retail             (USD/Kg)
SV       San Salvador      Wheat (flour), Wholesale          (USD/Kg)

Table A3: Sources for the CPI data for each country considered in the study.

Country  Series        Source
CR       All CPI Data  Banco Central de Costa Rica
HN       All CPI Data  Banco Central de Honduras
SV       All CPI Data  Banco Central de Reserva de El Salvador

Table A3 lists the sources for the CPI data used for each country. The food price series were all obtained from FAO-GIEWS.