WPS7398
Policy Research Working Paper 7398
Nowcasting Prices Using Google Trends
An Application to Central America
Skipper Seabold
Andrea Coppola
Macroeconomics and Fiscal Management Global Practice Group
August 2015
Abstract
The objective of this study is to assess the possibility of using Internet search keyword data for forecasting price series in Central America, focusing on Costa Rica, El Salvador, and Honduras. The Internet search data comes from Google Trends. The paper introduces these data and discusses some of the challenges inherent in working with it in the context of developing countries. A new index is introduced for consumer search behavior for these countries using Google Trends data covering a two-week period during a single month. For each country, the study estimates one-step-ahead forecasts for several dozen price series for food and consumer goods categories. The study finds that the addition of the Internet search index improves forecasting over benchmark models in about 20 percent of the series. The paper discusses the reasons for varied success and potential avenues for future research.
This paper is a product of the Macroeconomics and Fiscal Management Global Practice Group. It is part of a larger effort
by the World Bank to provide open access to its research and make a contribution to development policy discussions
around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors
may be contacted at jsseabold@gmail.com.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Produced by the Research Support Team
Nowcasting Prices Using Google Trends: An
Application to Central America
Skipper Seabold, American University
Andrea Coppola, The World Bank∗
JEL codes: E31, C55, C8
Keywords: Macroeconomic modeling and statistics, Inﬂation, Big Data
∗ Corresponding author e-mail: jsseabold@gmail.com. This is a preliminary draft.
1 Introduction
It is a well recognized problem that policy makers must make decisions before all data
about the current economic environment are available. Given this reality, there is consid-
erable interest in short-term forecasting and nowcasting using intra-period data releases.
For example, the forecaster can provide an estimate of GDP this quarter using other
data that are available at a monthly frequency. This technique is called nowcasting, or
predicting the present. Giannone et al. [2008] lay out three tenets of nowcasting. First,
many data series are used. Second, nowcasts are updated as intraperiod data become
available. Finally, nowcasting “bridges” higher frequency data releases with the now-
cast of the lower frequency series of interest. This study is similar in spirit to that of
Giannone et al. [2008]. However, while Giannone et al. [2008] are concerned with now-
casting GDP using a large number of economic data series, this paper nowcasts price
series using Internet search keyword data from Google Trends.¹ Furthermore, we do not
attempt to “bridge” higher frequency data with lower frequency data explicitly as part
of a model. The Google Trends data are not all systematically available at a higher
frequency than the series we wish to forecast. Instead, we are more concerned with the
eﬃcient aggregation of many series to help improve our nowcasts.
There are three main contributions of this study. First, it focuses on the countries of
Central America. Almost the entirety of the nowcasting literature focuses on developed
countries with one notable exception in Carrière-Swallow and Labbé [2013]. Second, this
is a large scale study, approaching the problem of nowcasting with Google Trends from
a data mining perspective rather than one solely grounded in economic theory. This
approach gives us insights that will be useful for forecasters who wish to pursue similar
ends. Third, we introduce methods from the statistical learning literature to compute
the Google Trends keyword search index that are not yet commonly used in forecasting
studies.
Given the large number of series included in this study, we rely heavily on automatic
model identiﬁcation procedures. Despite this potential shortcoming, we ﬁnd that Google
Trends can improve our ability to forecast certain series. These ﬁndings are notable
and may be worth pursuing in more detail. The outline of the paper is as follows.
Section 2 reviews some of the literature for nowcasting and the use of Google Trends
data in forecasting. Section 3 introduces the data and includes a section that discusses
the challenges of working with Google Trends data for the countries of Central America.
Section 4 explains the framework used for forecasting and evaluating forecasts. Section 5
¹ http://google.com/trends
discusses the results of this exercise and assesses the usefulness of Google Trends data
in forecasting price series for Central American countries. Section 6 concludes, noting
several paths for continuing research. While this section deals speciﬁcally with ideas for
future research there are notes about ongoing research throughout the paper.
2 Literature Review
There is a growing literature that is using Internet search keyword data and Google
Trends, in particular, for forecasting and nowcasting. Ettredge et al. [2005] were the
ﬁrst to use search engine keyword data to aid in forecasting. They found keyword-
based searches to be helpful in predicting the number of unemployed workers in the
United States. The use of Google Trends data, speciﬁcally, in forecasting yet to be
released macroeconomic series goes back to Choi and Varian [2009, 2012]. They ﬁnd that
Google Trends data help to forecast initial unemployment claims, automobile sales, and
consumer confidence in the United States. Since then, there have been numerous efforts
to use Google Trends data in forecasting. Schmidt and Vosen [2012] use search data
related to the “cash for clunkers” program to improve forecasts for private consumption
in France, Germany, Italy, and the United States. Guzman [2011] uses Google search
data to estimate inﬂation expectations. Suhoy [2009] estimates accurate probabilities of
downturn in early 2007 using Google search category data for Israel. The author also
ﬁnds improvements in estimates of private consumption by employing the search data.
Early results on using Google Trends data as a proxy for consumer sentiment are
promising. Traditionally, studies have made use of survey-based sentiment data to pro-
vide leading indicators of series of interest. However, this data is not always available,
especially in developing countries. Vosen and Schmidt [2011] show that Google Trends
outperforms The University of Michigan Consumer Sentiment Index and the Confer-
ence Board Consumer Conﬁdence Index in predicting private consumption in the United
States. One study which is very relevant to our present effort is that of Carrière-Swallow
and Labbé [2013]. The authors look at the benefits of using Google Trends data in the
context of a developing country, Chile. They develop an index of consumer interest in
automobile purchases and find that it outperforms benchmark specifications that take
advantage of the IMACEC index of consumer activity. We will use a similar framework
to the one employed in that study in what follows.
3 Data
This section ﬁrst describes the raw data and then the transformations that are made
to each series before estimation. A subsection is dedicated to addressing some of the
challenges inherent in working with the search query data from Google in emerging
market countries. For each of Costa Rica, El Salvador, and Honduras,² there are two
categories of series that we will forecast – data on aggregate consumer prices and their
component series and staple food price data.
We obtained the consumer price data from the statistical oﬃce of each country. See
Appendix A for details. The raw series are in levels and are not seasonally adjusted.
The food price data was obtained from the Global Information and Early Warning
System on Food and Agriculture (GIEWS) from the Food and Agriculture Organization
of the United Nations (FAO). The types of food that are available from GIEWS are
particular to each country. We obtained every available series. Appendix A gives, for
each country, the series names, appropriate region, and the units for which we have data
available. These series are not seasonally adjusted.
To augment our forecasts, we have obtained Google Trends data on a number of
search keywords. These keywords were chosen ex ante with the belief that they contain
relevant information that will allow us to use them as a proxy for consumer behavior and
beliefs. Obtaining real-time insights into consumer behavior allows us to better predict
price changes all other things equal. In some sense, the Trends data takes the place of
traditional consumer-sentiment surveys. The keywords that we have chosen are listed in
Table 1.
Each individual Google Trends series is a relative and not an absolute measure of search
volume. That is, the period in which the search interest of a keyword is highest within
the dates of inquiry receives a value of 100. All other periods for an individual series
are measured relative to this highest period. There is, therefore, no sense of how many
people were searching for a term and the terms themselves are not comparable with each
other. Furthermore, changes in Internet penetration and the use of Google, in particular,
do not matter.
The following transformation is made to each price series before estimation to go from
² We could not acquire sufficient data on food prices or on search keywords for Belize, so it is omitted from discussion. Earlier versions of this paper contained every other country in Central America.
However, given some of the data challenges discussed below, we chose to narrow our interest to three
countries. We chose Costa Rica and El Salvador because they generally have good data availability from
Google Trends. The quality of the data for Honduras, on the other hand, was found to be rather poor,
so we included it to learn more about how the models perform under adverse data conditions.
Search Keywords
cr hn sv
arroz x x x
azucar x x x
carne x x x
caro x x x
cerdo x x x
combustible x x x
cuesta x x x
diesel -vin x x x
frijoles x x x
gas x x x
gasolina x x x
inflacion x x x
ingresos x x x
maiz x x x
pago x x x
pan x x x
precio x x x
precios x x x
propano x x
salario x x x
sueldo x x x
trigo x x x
Table 1: The keywords that are used in the forecasting. We found that the search term
“diesel -vin” was more reliable in returning searches related to diesel fuel rather than the
actor Vin Diesel. All analysis is based on this term.
levels to month-over-month percentage changes

x_t = \frac{p_t - p_{t-1}}{p_{t-1}} \times 100 \qquad (1)
No series has been seasonally adjusted prior to downloading. Therefore, many of the
series exhibit some degree of seasonality, at times strong. As discussed below, we will
attempt to model the seasonality explicitly when present.
A few of the GIEWS price data series contain missing observations. The missing
observations were replaced using simple linear interpolation before applying this trans-
formation.
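In code, the interpolation and the transformation in equation (1) can be combined roughly as follows (a numpy sketch; the function name is ours, not from the paper):

```python
import numpy as np

def to_pct_change(p):
    """Fill missing observations by simple linear interpolation, then
    convert levels to month-over-month percentage changes, as in eq. (1)."""
    p = np.asarray(p, dtype=float)
    idx = np.arange(len(p))
    missing = np.isnan(p)
    # linear interpolation over the missing points
    p[missing] = np.interp(idx[missing], idx[~missing], p[~missing])
    # x_t = (p_t - p_{t-1}) / p_{t-1} * 100
    return (p[1:] - p[:-1]) / p[:-1] * 100.0
```

For example, a series (100, missing, 110) is first filled to (100, 105, 110) and then transformed to percentage changes.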
The Google Trends data are transformed as follows. Some of the search terms are
available at weekly frequencies while other series are only available at monthly frequen-
cies. For those that are available at a weekly frequency, we take the maximum value in
each month to be the value for that month. This diﬀers from the approach of Vosen and
Schmidt [2011] and Carrière-Swallow and Labbé [2013] who aggregate the weekly data
into monthly series by taking the monthly average of the indicators. Since the data are
relative, we do not wish to ﬁrst smooth them in this way. This could mask potentially
important, short-lived events. Further transformations to the Trends data are described
in the next subsection.
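A minimal sketch of this month-max aggregation in pure Python (the function name and the ISO date format are illustrative assumptions, not the paper's code):

```python
from collections import defaultdict

def monthly_max(dates, values):
    """Aggregate weekly Trends observations to monthly frequency by taking
    the maximum value observed within each month (dates are 'YYYY-MM-DD')."""
    buckets = defaultdict(list)
    for d, v in zip(dates, values):
        buckets[d[:7]].append(v)  # group on the 'YYYY-MM' prefix
    return {month: max(vs) for month, vs in sorted(buckets.items())}
```

Taking the maximum rather than the mean preserves short-lived spikes in the relative index, which is the stated rationale for departing from the averaging used in earlier studies.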
3.1 Challenges in Using Google Trends Data
Several challenges present themselves when working with the Google Trends data in a
developing country context. First, as pointed out by Carrière-Swallow and Labbé [2013],
Google Trends historical data are not constant over time. Within the same 24-hour
period, the results will be the same. However, from day to day the results can differ.
Indeed, not only do the values change, but on one day monthly data may be returned,
and on another, biweekly or weekly data for the same keyword search. It is unclear what
exactly is driving these differences (different normalizations, sampling considerations,
or something else), but for practical purposes we can treat the data as being recorded
with sampling error, with the same consequences. For the present study, we collected data on
all of the keywords for ten days over a period of one month. Figures 1 and 2 show the
sampling error for two representative series collected during this period.
These series are chosen to be representative of all of the series used and show the two
most salient features for the purposes of this study. First, the sampling error is evident
Figure 1: Results for 10 days during the study period for the “precios” keyword in Costa
Rica. The dark line is the average. The gray bands are minimum and maximum observed
values for that month over the study period.
Figure 2: Results for 10 days during the study period for the “caro” keyword in El
Salvador. The dark line is the average. The gray bands are minimum and maximum
observed values for that month over the study period.
in both ﬁgures. Figure 1 is in some sense a best case scenario. Variability is very large
for the ﬁrst two years of the sample but becomes quite a bit more stable after this initial
uncertainty. Figure 2, on the other hand, shows high sampling variability throughout the
entire period. We will assume that the signal of each series can be well approximated by
its average and use the average when referring to the series for a single keyword in what
follows unless otherwise indicated.
The second thing to note in figures 1 and 2 is that many of the observations for a
single draw of the Google Trends data are exactly zero. These zero observations present
two diﬃculties in particular – one conceptual and one practical. First, conceptually, these
zeros suggest a lack of signal where presumably there should be some. As we collect more
daily samples of the data, this problem becomes less severe, again assuming that the
signal is well approximated by the mean. However, this problem does not disappear.
Looking at the early parts of both series, there are still observations which are zero even
at the mean.
Second, as a practical problem, some of the Google Trends data contain strong seasonal components. Studies such as Carrière-Swallow and Labbé [2013] alleviate the effects
of seasonality in the trends data by using year-over-year percent changes for them as well
as the series to forecast. However, if the base year is zero, we would lose this entire year
of data.
We employed several techniques in an attempt to overcome these problems, which
we will now describe. The Google Trends data can be written more formally as X_{i,j,t},
where i represents the vintage (a downloaded sample on a particular day), j represents
a particular keyword, and t represents the weekly, bi-weekly, or monthly observation of
each keyword. The first task is to deal with the i vintage, or sample, index. We took the
mean and the median of all the samples. This leaves us with either

\bar{X}_{j,t} = \frac{1}{I} \sum_i X_{i,j,t}

for the mean, where I is the total number of samples taken, or

X_{j,t} = \operatorname{med}_i(X_{i,j,t})

for the median.
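In array terms, collapsing the vintage dimension can be sketched as follows (shapes are illustrative assumptions: 22 keywords matches Table 1, while 10 vintages and 120 months are stand-ins):

```python
import numpy as np

# X[i, j, t]: vintage i, keyword j, period t
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(10, 22, 120))  # 10 daily samples of 22 keywords

X_mean = X.mean(axis=0)          # \bar{X}_{j,t}: collapse vintages by the mean
X_median = np.median(X, axis=0)  # med_i X_{i,j,t}: ... or by the median
```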
After handling the sampling dimension, we apply transformations to smooth the data
for each keyword and attempt to better identify the signal from the noise, given the
nature of the search data. Here, we take several diﬀerent approaches. First, we apply
a simple exponential smoothing model with additive errors to the data. Following the
notation of Makridakis et al. as used in Hyndman et al. [2002], this model can be written
l_t = \alpha y_t + (1 - \alpha) l_{t-1} \qquad (2)

We choose to fix α = 0.5. Results typical of this smoothing can be seen in figures 3 and
4. We include both the forecastable part of the series and the unsystematic “surprise”
part of the series.
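A sketch of equation (2) with α = 0.5, returning both the smoothed level and the unsystematic "surprise" component shown in the figures (the function name and the l_0 = y_0 initialization are our assumptions):

```python
def exp_smooth(y, alpha=0.5):
    """Simple exponential smoothing with additive errors, eq. (2):
    l_t = alpha * y_t + (1 - alpha) * l_{t-1}, initialized at l_0 = y_0.
    Returns the smoothed level and the 'surprise' component y_t - l_{t-1}."""
    level, surprise = [], []
    l = y[0]
    for yt in y:
        surprise.append(yt - l)          # one-step forecast error
        l = alpha * yt + (1 - alpha) * l  # updated level
        level.append(l)
    return level, surprise
```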
We also tried smoothing the results by applying the Christiano-Fitzgerald (CF) band-
pass ﬁlter [Christiano and Fitzgerald, 2003]. The CF ﬁlter starts from the (false) assump-
tion that the underlying data obeys a unit root process. Using this assumption, the CF
ﬁlter provides an approximation to an optimal band-pass ﬁlter as follows
\hat{c}_t = B_0 y_t + B_1 y_{t+1} + \cdots + B_{T-1-t} y_{T-1} + \tilde{B}_{T-t} y_T + B_1 y_{t-1} + \cdots + B_{t-2} y_2 + \tilde{B}_{t-1} y_1 \qquad (3)

where B_j = \frac{\sin(jb) - \sin(ja)}{\pi j} for j \geq 1, B_0 = \frac{b - a}{\pi}, a = \frac{2\pi}{p_u}, b = \frac{2\pi}{p_l}, and \tilde{B}_k = -\frac{1}{2} B_0 - \sum_{j=1}^{k-1} B_j. The parameters p_u and p_l denote the cut-offs for the cycles for the high
and low frequency elements, respectively. We remove all stochastic cycles at a periodicity
lower than 3 months and higher than 12 months. This has the eﬀect of both smoothing
the series and removing long-term seasonality. The results of applying the CF ﬁlter to
our two selected series can be seen in ﬁgures 5 and 6.
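A direct numpy implementation of the reconstructed weights in equation (3) might look as follows, computed for interior points only (a sketch under our reading of the formula, not the authors' code; the end points have their own simpler forms):

```python
import numpy as np

def cf_filter(y, p_l=3, p_u=12):
    """Christiano-Fitzgerald band-pass filter (random-walk version, eq. 3),
    removing stochastic cycles with periodicity below p_l or above p_u."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    a, b = 2.0 * np.pi / p_u, 2.0 * np.pi / p_l
    # ideal band-pass weights B_j
    B = np.empty(T)
    B[0] = (b - a) / np.pi
    j = np.arange(1, T)
    B[1:] = (np.sin(j * b) - np.sin(j * a)) / (np.pi * j)

    def B_tilde(k):
        # adjusted end-point weight: -B_0/2 - sum_{j=1}^{k-1} B_j
        return -B[0] / 2.0 - B[1:k].sum()

    c = np.zeros(T)
    for t in range(1, T - 1):  # 0-indexed interior points
        lead = sum(B[k] * y[t + k] for k in range(1, T - 1 - t))
        lag = sum(B[k] * y[t - k] for k in range(1, t))
        c[t] = (B[0] * y[t] + lead + B_tilde(T - 1 - t) * y[-1]
                + lag + B_tilde(t) * y[0])
    return c
```

Because the adjusted end weights force the weights to sum to zero, a constant series is filtered to exactly zero, which is a convenient sanity check.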
One notable advantage of techniques such as exponential smoothing and the CF ﬁlter
is that they provide us with real-time estimates at the ends of our series so that we do not
need to truncate our observed series at the beginning or the end as would be necessary
if we used a simple moving average, seasonal differences, or another filter such as the
Baxter-King.³
³ Of course, we could estimate a model, forecast and backcast, and then apply a filter that truncates, using these extra data points. However, this is another form of uncertainty that we would like to avoid introducing. Instead, we prefer to use only the information we have.
Figure 3: Smoothed results for the average of the “precios” keyword in Costa Rica.
The top pane contains the original series and the smoothed, in-sample forecasted series.
The forecasted series is labeled A, N, N indicating additive errors, no trend, and no
seasonality according to the Hyndman et al. [2002] taxonomy. The bottom pane contains
the unsystematic or “surprise” component of the series.
Figure 4: Smoothed results for the average of the “caro” keyword in El Salvador. The
top pane contains the original series and the smoothed, in-sample forecasted series. The
forecasted series is labeled A, N, N indicating additive errors, no trend, and no season-
ality according to the Hyndman et al. [2002] taxonomy. The bottom pane contains the
unsystematic or “surprise” component of the series.
Figure 5: Smoothed results for the average of the “precios” keyword in Costa Rica. The
smoothed series is computed using the Christiano-Fitzgerald ﬁlter with all stochastic
cycles at a periodicity lower than 3 months and higher than 12 months removed.
Figure 6: Smoothed results for the average of the “caro” keyword in El Salvador. The
smoothed series is computed using the Christiano-Fitzgerald ﬁlter with all stochastic
cycles at a periodicity lower than 3 months and higher than 12 months removed.
4 Methodology
To nowcast a series at a particular point in time, we produce an estimate of the series
before that variable has been observed but when other contemporaneous variables in our
information set have been observed. For instance, we might use data available to us now
to get an estimate for economic growth or inﬂation before oﬃcial statistics are released.
As a concrete example, suppose that in mid-April 2014, we have either a few weeks of
Google Trends data or perhaps some preliminary monthly estimate of a search term, but
we do not yet know the current inflation. Lags in publication of inflation could mean
that we only have estimates for inflation through March or even February 2014. If a
policymaker is interested in knowing inflation today, we would nowcast at a monthly
horizon of h_m \geq 1.
Our strategy for this exercise is as follows. For each series in each country we will
compare nowcasts using Google Trends data and one-step ahead forecasts from a best
eﬀort ARIMA model to some benchmark models to assess if the information available
from Google Trends data improves our forecasting ability. We now introduce our bench-
mark models. In the following subsection, we discuss what we mean by a “best eﬀort”
ARIMA model.
4.1 Benchmark Models
Five simple models are estimated to provide a baseline for the candidate models described
below. The estimated baseline models are the simple mean of the series, the median of
the series, the value of the series in the previous period, an AR(1) model, and an A, A, N
exponential smoothing model. This exponential smoothing model written in its recursive
form is given by
l_t = \alpha y_t + (1 - \alpha)(l_{t-1} + b_{t-1})
b_t = \beta (l_t - l_{t-1}) + (1 - \beta) b_{t-1} \qquad (4)
where lt and bt are the level and growth rate, respectively, and the parameters along
with the initial states are estimated as described in section 3.1. This model is otherwise
known as Holt’s linear method with additive errors and is equivalent to an ARIMA
(0, 2, 2) model [Hyndman et al., 2008]. Our one-step ahead point forecasts are given by
Benchmark Results
Benchmark Model Total
ar 23
ets 4
mean 11
median 20
Table 2: The total number of series for which each benchmark model is deemed the best
by the MSE criterion.
\hat{y}_{t+1} = \frac{1}{t} \sum_{i=1}^{t} y_i \qquad (5a)

\hat{y}_{t+1} = \operatorname{med}(\{y_i\}), \; i = 1, \ldots, t \qquad (5b)

\hat{y}_{t+1} = y_t \qquad (5c)

\hat{y}_{t+1} = \hat{\rho} y_t + \epsilon_t \qquad (5d)

\hat{y}_{t+1} = l_t + b_t \qquad (5e)

where \epsilon_t \sim N(0, \sigma^2) in (5d). We choose the baseline model for each series based on
mean squared error (MSE). MSE is defined as usual

\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} (\hat{Y}_t - Y_t)^2

where \hat{Y}_t is our forecast estimate, Y_t is the true observation at time t, and T is the
total number of observations.
To compute the MSE for the benchmarks, we start with two years of data and compute
one-step ahead forecasts using each benchmark model until time T − 1 where T is the
last period for which we have data that we wish to forecast. We then choose the model
that has the best performance in all periods as the benchmark model for that series.
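The expanding-window benchmark comparison can be sketched in pure Python as follows (function and benchmark names are ours; three of the five benchmarks are shown for brevity):

```python
import statistics

def one_step_mse(y, forecast, start=24):
    """Expanding-window one-step-ahead MSE: forecast y[t] from the
    history y[:t], starting after two years (24 months) of data."""
    errors = [(forecast(y[:t]) - y[t]) ** 2 for t in range(start, len(y))]
    return sum(errors) / len(errors)

benchmarks = {
    "mean":   lambda h: sum(h) / len(h),
    "median": lambda h: statistics.median(h),
    "naive":  lambda h: h[-1],          # previous-period value
}

def best_benchmark(y):
    """Pick the benchmark with the lowest one-step-ahead MSE for a series."""
    return min(benchmarks, key=lambda name: one_step_mse(y, benchmarks[name]))
```

On a steadily trending series, for example, the previous-period ("naive") benchmark dominates the mean and median, which lag the trend.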
Table 2 presents an overview of which benchmark model is best in an MSE sense. The
AR(1) and median are preferred the most often. The benchmark for the individual series
is presented with the full results in section 5 for ease of comparison.
4.2 Forecasting and Nowcasting Models
To attempt to improve over these baseline models, we first estimate a possibly seasonal
autoregressive integrated moving average (ARIMA) model for each monthly series.
\varphi(L)^p \varphi_{12}(L^{12})^P (1 - L)^d (1 - L^{12})^D (y_t - \mu) = \theta(L)^q \theta_{12}(L^{12})^Q \epsilon_t \qquad (6)

where y_t is the series we wish to forecast, \epsilon_t follows a white noise process, L is the
lag operator (L^i y_t = y_{t-i}), \varphi(L)^p = (1 - \phi_1 L - \cdots - \phi_p L^p) is the non-seasonal polynomial
of order p in the lag operator that describes the autoregressive component of the model,
and \varphi_{12}(L^{12})^P = 1 - \phi_{12,1} L^{12} - \cdots - \phi_{12,P} L^{12P} is the seasonal polynomial of order P in
the lag operator that describes the seasonal autoregressive component of the model. The
polynomial of order q that denotes the non-seasonal MA component of the model is
\theta(L)^q, and likewise the seasonal MA component of order Q is denoted \theta_{12}(L^{12})^Q. The
non-seasonal and seasonal orders of differencing are denoted d and D, respectively.
We use the auto.arima function from the forecast package in R⁴ for order identification
for each series. See Hyndman and Khandakar [2008] for more information on the model
identification procedure.⁵ The auto.arima automatic model identification procedure
allows parameters to be zero, so in principle, for example, the model is only differenced
or includes a seasonal component when it is appropriate.
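To illustrate the flavor of this kind of automatic order selection, here is a toy BIC-based AR(p) search in Python (a hypothetical stand-in; auto.arima's stepwise search over full seasonal ARIMA specifications is far more general):

```python
import numpy as np

def ar_bic(y, p):
    """BIC of an AR(p) model fit by least squares; a toy illustration of
    information-criterion-based order selection (not auto.arima itself)."""
    y = np.asarray(y, dtype=float)
    Y = y[p:]
    lags = np.column_stack([y[p - k:-k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(len(Y)), lags])   # intercept + p lags
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sigma2 = ((Y - X @ beta) ** 2).mean()
    n = len(Y)
    return n * np.log(sigma2) + (p + 1) * np.log(n)

# choose the AR order with the lowest BIC, e.g.:
# best_p = min(range(1, 7), key=lambda p: ar_bic(y, p))
```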
To test whether there is information in the Google Trends data that will help us
forecast each series, we use a possibly seasonal ARIMAX model where the Trends data
is used as an exogenous variable.6
The seasonal ARIMAX model estimated is speciﬁed
⁴ We used the 5.4 development version obtained from https://github.com/robjhyndman/forecast/
⁵ We also performed order identification using the AUTOMDL procedure from X-13ARIMA-SEATS [Staff, 2013] as well as using (seasonal) unit root tests to identify the order of (seasonal) differencing and then using the Bayesian information criterion (BIC) to select the best model. None of the procedures used produced identical results, nor did any procedure do unambiguously better than any other. The auto.arima function was the most computationally performant and is thus the basis for the results below. We used the default arguments for this function.
⁶ This model is sometimes referred to as a regression model with ARMA errors. Ignoring seasonality, it may be written

y_t = \beta' x_t + z_t, \quad \varphi(L) z_t = \theta(L) \epsilon_t \qquad (7)

This is to contrast it with the ARMAX model, which is written

\varphi(L)^p y_t = X_t \beta + \theta(L)^q \epsilon_t \qquad (8)
\varphi(L)^p \varphi_{12}(L^{12})^P (1 - L)^d (1 - L^{12})^D (y_t - \beta' x_t) = \theta(L)^q \theta_{12}(L^{12})^Q \epsilon_t \qquad (9)
where everything is as in (6) and xt contains the Google Trends Index that we de-
scribe in the next section. The addition of this term allows us to model the information
contained in the Google Trends data as a time-varying mean.
4.3 Index Construction
In order to incorporate the information from the various Google Trends search keywords,
it is desirable to synthesize the information in all of the Google Trends data into some-
thing more manageable. Formerly, authors used the Google Insights search categories
data. This data is used in many of the studies referenced in section 2. However, pre-
viously such an index from Google Insights was usually not available outside of large,
developed countries, so studies such as Carrière-Swallow and Labbé [2013] estimate their
own. The advantage of having an index is mainly parsimony of information. Indeed, such
an index may be of interest in its own right. Furthermore, in September 2012 Google
merged some features of Google Insights with Trends and discontinued the aggregate
search categories entirely.7
To solve the keyword aggregation problem, Carrière-Swallow and Labbé [2013] create
an index from multiple search terms by use of an expanding linear regression model
described below. Other approaches rely on factor analysis techniques for dimension reduc-
tion such as unweighted least squares [Vosen and Schmidt, 2011] or principal components
analysis [Stock and Watson, 2002]. These methods assume that there are some underly-
ing, unobserved common factors for all of the series. We describe our use of statistical
learning techniques for variable selection below.
We took several approaches to constructing our search indices. First, we applied the
linear index approach of Carrière-Swallow and Labbé [2013]. This is a common approach
in the literature and is an attractive choice mainly for its simplicity. Let X be our matrix
of year-over-year percent changes for the Google Trends terms. We construct an index It
for these terms, for each series yt that we wish to forecast in the following way. In each
period, we estimate the weights \hat{\beta} by using the observations up to time t - 1 and fitting
a linear model
⁷ http://insidesearch.blogspot.com/2012/09/insights-into-what-world-is-searching.html One may only speculate that it was discontinued because this task is very difficult to automate.
y_t = \alpha + \beta X_t + \epsilon_t

The index for period t is

I_t = \hat{E}[\beta \mid y_{t-1}, X_{t-1}] X_t
Given that y and X contain monthly percent changes, we can interpret It as the
linear combination of search terms which best explains the series that we are forecasting,
in a linear least squares sense. The expanding nature of the construction of the index
allows for the factors in the trends that explain the changes in our price series to change
over time. This is certainly something we might be interested in given the heterogeneous
character of the included terms. Figure 7 contains an example of an index created using
the expanding linear OLS. That is, this is the last out-of-sample ﬁtted value of each index
created for a single price series.
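The expanding OLS index can be sketched as follows (numpy; the function name is ours, and start=24 mirrors the two-year initialization used for the benchmarks):

```python
import numpy as np

def expanding_ols_index(y, X, start=24):
    """Expanding-window linear index in the spirit of Carriere-Swallow and
    Labbe [2013]: fit beta on data through t-1, then I_t = [1, X_t] @ beta_hat."""
    T = len(y)
    index = np.full(T, np.nan)
    for t in range(start, T):
        Z = np.column_stack([np.ones(t), X[:t]])     # intercept + regressors
        beta, *_ = np.linalg.lstsq(Z, y[:t], rcond=None)
        index[t] = np.concatenate(([1.0], X[t])) @ beta  # out-of-sample fit
    return index
```

Because each I_t is an out-of-sample fitted value, the index at time t uses only information available through t - 1, which is what makes it usable in a nowcasting exercise.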
We anticipate two potential problems with this approach for the current exercise,
and we construct this linear index using two other methods from the statistical learning
literature. Both of these techniques were implemented using the scikit-learn Python
package [Pedregosa et al., 2011]. First, we have a high number of variables relative to
the number of observations, especially in the early years of the index. To improve the
degrees of freedom of our ﬁt, we are interested in obtaining sparse models. To this end, we
applied the lasso technique introduced in Tibshirani [1996].⁸ The lasso is a penalized least
squares method that allows both continuous shrinkage and variable selection through the
imposition of an L1 −penalty on the regression coeﬃcients β . That is, the coeﬃcients are
pushed both towards and to zero when appropriate. The optimization function for the
lasso is

\min_{\beta} \frac{1}{2n} \| y - X\beta \|_2^2 + \alpha \| \beta \|_1

where \alpha is chosen via K-fold cross-validation with K = 5 and the L_p-norm is
defined \| x \|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. Figure 8 contains an example of an index created using
the expanding lasso linear model. The ﬁt is much more conservative than the linear OLS
ﬁt given the sparse nature of the solution.
⁸ We also considered the more general LARS estimator introduced by Efron et al. [2004]. The results of this estimator were comparable, though slightly worse than the lasso. It should also be noted that we ran the computationally efficient LARS algorithm variant for the lasso solution path.
Figure 7: Linear OLS index created for CPI series from Costa Rica. Displays some
evidence of overﬁtting.
Figure 8: Lasso model index created for CPI series from Costa Rica. The fit is conservative and does not vary much.
Both Zou and Hastie [2005] and Tibshirani [1996] point out that the lasso may not
perform well empirically in the cases where the number of variables is higher than the
number of observations,⁹ there are groups of variables with high pairwise correlation, or
there are high correlations between all predictors. These are all possible concerns for our
keywords from Google Trends. To account for these issues, we employ the elastic net
estimator of Zou and Hastie [2005].
The elastic net estimator combines the L1-penalty of the lasso with the L2-penalty of
ridge regression [Hoerl and Kennard, 1970]. The objective function of the elastic net is

\[
\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha\rho \lVert \beta \rVert_1 + \frac{\alpha(1 - \rho)}{2} \lVert \beta \rVert_2^2
\]

where α and ρ are chosen via K-fold cross-validation with K = 5. Using both the
lasso and the elastic net, we compute the index in the same way as the linear OLS
index, except that the β coefficients are obtained from the two new estimators. Figure 9
contains an example of an index created using the expanding elastic net model. The fit
lies somewhere between the high-variance OLS model and the conservative lasso model.
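For reference, scikit-learn's `ElasticNetCV` implements exactly this objective, with ρ exposed as `l1_ratio`. The sketch below uses synthetic data and illustrative names, including a deliberately correlated keyword pair of the kind the elastic net is meant to handle:

```python
# Sketch of the elastic net fit with alpha and rho chosen by 5-fold CV.
# scikit-learn's objective matches the one above:
#   1/(2n) ||y - Xb||^2 + alpha*rho*||b||_1 + alpha*(1 - rho)/2 * ||b||^2
# (rho is called l1_ratio). Data are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
X = rng.normal(size=(96, 40))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=96)   # highly correlated pair
y = X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=96)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5).fit(X, y)
index_value = enet.predict(X[-1:])[0]    # fitted index value for the last period
```

Unlike the pure lasso, the L2 component lets groups of correlated keywords enter the index together rather than forcing an arbitrary choice among them.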
In the following section we describe the empirical results of using these indices.
5 Forecasting Results
Our hypothesis is that our transformations of the Google Trends data contain additional
information that improves nowcasts of the series of interest, before their respective data
releases, relative to an ARMA model and our respective benchmarks. To test this
hypothesis, we compute one-step-ahead forecasts using (6) and (9) and compare them to
the chosen models from (5). Just as for the benchmarks, we start with two years of
monthly data and then estimate expanding-window models until time T − 1, where T is
the last period for which we have data for the series we wish to forecast and for which we
have T Google Trends index values. At each time t in t = 24, . . . , T + 1 we recompute
the order of the seasonal ARIMA(X) model as described above. This emulates what a
practitioner would do in any given period. For each
forecast (and nowcast) we compute the one-step ahead forecast error
⁹ This is not the case in the current analysis, though we do have the case where the number of variables
is only slightly smaller than the number of observations in the early periods of our index construction.
[Figure 9: Elastic net model index created for the CPI series from Costa Rica (series shown: cpi, ElasticNet GT Index; 2006–2013). The fit lies between the high-variance OLS and the low-variance lasso.]
\[
\hat{e}_{k,t+1} \equiv y_{k,t+1} - E_t[\hat{y}_{k,t+1}] \qquad (10)
\]
for model k . We compute the relative MSE for each series combination method
deﬁned in section 4.3. That is, for the original data Xi,j,t we computed the results for
each of reduction methods of the i sampling dimension – mean, median, applying the CF
ﬁlter after taking the mean, and ETS smoothing after taking the mean – and for each i
reduction we also computed the three linear indices over the j keywords – linear OLS,
lasso, and elastic net. We found ﬁrst that the linear OLS trend preformed unambiguously
the worst. We were we unable to beat both the benchmark model and the ARMA model
even once regardless of the smoothing technique that we applied. This is not wholly
suprising given that we did not apply any variable selection of the keywords beforehand.
Inclusion of inappropriate keywords appears to have led to overﬁtting and poor out of
sample performance.
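The comparison reported below reduces, for each series, to a ratio of mean squared one-step-ahead forecast errors. A minimal sketch, with the forecast arrays as synthetic stand-ins:

```python
# Relative MSE of a candidate model's one-step-ahead forecasts against a
# benchmark, as reported in Table 3. Values below 1 mean the candidate won.
import numpy as np

def relative_mse(y, f_candidate, f_benchmark):
    e_cand = y - f_candidate         # one-step-ahead errors, as in eq. (10)
    e_bench = y - f_benchmark
    return np.mean(e_cand ** 2) / np.mean(e_bench ** 2)

rng = np.random.default_rng(2)
actual = rng.normal(size=70)
f_armax = actual + rng.normal(scale=0.1, size=70)   # more accurate candidate
f_bench = actual + rng.normal(scale=0.2, size=70)   # noisier benchmark

rel = relative_mse(actual, f_armax, f_bench)
```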
Moving to the lasso and the elastic net, for each estimator the ETS-smoothed results
performed best. Between the two estimators, the elastic net performed marginally
better, beating both the benchmarks and the ARMA models in a few more cases. Again,
this is not wholly surprising given the documented better empirical performance of the
elastic net estimator when there are high pairwise correlations among the regressors.
Due to the large number of results, Table 3 contains only the results of the ETS-smoothed
data and the trend computed via the elastic net estimator.
Using our best performing method, the ARMAX model outperforms the benchmark
in 28% of the cases, or for 16 out of the 58 series. In each of these cases, the ARMAX
also outperforms the ARMA model. The ARMA model fares only slightly better versus
the benchmark, outperforming it in 22% of the series, or for 13 out of 58 series.
However, the ARMA model is the best model versus the ARMAX model in only 7 of
these cases. The food price series appear to be particularly difficult to forecast. If we
consider only the consumer price series, then the ARMAX model is the best model in 24%
of the cases, while the ARMA model is the best in only 14% of the cases. The difficulty
in forecasting food prices is likely due to the food price crisis during the period. The
US dollar price fluctuations during the time under consideration were due largely to
events external to the countries of Central America.
These results, though only partially successful, indicate that there may be some
benefit to further exploring the use of Google Trends data in forecasting economic series in
Central America. We use the concluding section to speculate on some of the reasons for
this success or lack thereof and give suggestions for future research.
Table 3: The results for each series using the ETS
smoothed data and the elastic net estimator. Relative
MSE (1) is the MSE for the expanding window ARMA
model versus the benchmark model given in column 6.
Relative MSE (2) is the MSE for the elastic net model
versus the benchmark in column 6. A relative MSE less
than 1 indicates that the proposed model beat the
benchmark.
Country Series Relative MSE (1) Relative MSE (2) N Benchmark
cr food01 1.04644 1.09103 94 ar
cr food02 1.00313 1.09517 94 median
cr food03 1.08606 0.977452 94 mean
cr food04 1.0755 1.06406 94 median
cr food05 1.32068 1.32068 38 median
cr food06 1.01586 1.04028 94 ar
cr food07 1.08315 1.1684 94 mean
cr food08 1.08143 1.17251 94 mean
cr food09 0.886634 1.20473 57 median
cr inﬂ01 0.980986 1.09914 94 ar
cr inﬂ02 1 1.00755 63 mean
cr inﬂ03 0.993006 2.13919 63 median
cr inﬂ04 1.08064 1.08917 63 mean
cr inﬂ05 0.981035 1.11166 95 ets
cr inﬂ06 0.840378 0.808711 63 ar
cr inﬂ07 1.03188 1.00261 63 ar
cr inﬂ08 1.03867 0.860798 63 ar
cr inﬂ09 1.0767 1.01002 63 median
cr inﬂ10 1.11333 1.23137 63 ets
cr inﬂ11 1.10306 1.15871 63 median
cr inﬂ12 1.17467 1.18608 63 ets
cr inﬂ13 1.14892 1.35561 63 ets
cr inﬂ14 1.0491 1.04081 63 ar
hn food01 1.11477 1.23652 94 median
hn food02 1.06082 1.02454 94 mean
hn food03 1.02453 1.22845 56 ar
hn food04 1.05794 1.02891 56 ar
hn food05 1.50041 1.41338 56 median
hn food06 1.06789 1.34548 56 ar
hn food07 1.02272 1.19456 56 ar
hn food08 1.02289 1.02308 56 ar
hn inﬂ01 1.14475 1.07673 94 ar
hn inﬂ02 0.954699 1.13202 94 ar
sv food01 1.32029 1.34929 69 ar
sv food02 0.981255 1.17397 69 ar
sv food03 0.999156 1.4412 69 ar
sv food04 1.00938 1.37656 69 ar
sv food05 1.0587 1.22251 69 median
sv food06 1.01563 1.2812 69 ar
sv food07 1.14023 1.42201 69 median
sv food08 1.26369 1.31265 69 median
sv food09 1.08291 1.07086 69 median
sv food10 1.05501 1.22104 69 ar
sv food11 1 0.998684 69 median
sv food12 1.24149 1.24152 69 median
sv inﬂ01 1.44499 1.33092 34 median
sv inﬂ02 0.986749 0.834845 34 ar
sv inﬂ03 1.10362 1.51751 34 median
sv inﬂ04 0.97742 0.97742 34 median
sv inﬂ05 1.11776 1.15664 34 ar
sv inﬂ06 1 0.999088 34 mean
sv inﬂ07 1.29662 0.986124 34 ar
sv inﬂ08 0.948536 0.948536 34 median
sv inﬂ09 1.0029 1.00293 34 mean
sv inﬂ10 1.00788 1.13624 34 mean
sv inﬂ11 1.18043 1.03241 34 mean
sv inﬂ12 1.03771 1.01186 34 median
sv inﬂ13 1.04429 1.29648 34 mean
6 Conclusion
In this paper, we studied the possibility of using Internet search keyword data to nowcast
price changes in Central America. We gathered price data for Costa Rica, El Salvador,
and Honduras. We also identiﬁed several search keywords and downloaded data for them
from Google Trends over a period of weeks. We tried several aggregation, smoothing,
and linear index construction methods for this Internet search data and were partially
successful in improving nowcasts for Costa Rica and for El Salvador, countries for which
the search data were of higher quality.
As part of the exercise, we were able to identify several important points for practi-
tioners who wish to forecast using high-dimensional Internet search keyword time series.
First, variable selection is of utmost importance. Many, if not most, of the successful
forecasting studies that use Internet search keyword data are based on some theory of
consumer behavior. This may be the idea that consumers use the Internet to do research
before the purchase of a consumer durable, as in Carrière-Swallow and Labbé [2013], or
searches for jobs, unemployment, and welfare, as in Choi and Varian [2009]. In the
absence of a strong model of consumer behavior, one should incorporate some kind of
variable selection mechanism. Naively including a large number of search keyword terms
into a model for a search index with the hope that the coeﬃcients on unimportant terms
will be small leads to very poor results. However, by employing some variable selection
methods from the statistical learning literature we were able to substantially improve all
of our forecasts and beat both the ARMA models and benchmarks in several instances.
The second takeaway is the importance of order identiﬁcation in ARIMA modeling.
This is perhaps not a surprise for any forecaster, but the successful results here using
automatic techniques are encouraging. If a forecaster were to focus on fewer series and
apply the Box-Jenkins methodology rather than relying on automatic model selection
procedures it might be possible to outperform the benchmark models further.
Finally, this study suggests several avenues for further research. We might consider
additional estimators such as TS-LARS, a LARS estimator written explicitly with
time-series data in mind [Gelper and Croux, 2008]. It allows selection of distributed
lags and ranking of predictors. Ranking of predictors will be of particular interest for
those who use an exercise such as the one in this paper to generate ideas about consumer
behavior and the search for keywords and categories that help forecast price changes.
One might also explore using dynamic linear models or a structural model where the
Internet search information stands in explicitly for some aspect of the theoretical model.
There are also a number of different smoothing techniques and variable selection methods
that might be explored.
In conclusion, the study of the manifestation of consumer sentiment via Internet search
behavior is very much still in its infancy. It certainly presents a number of challenges,
but the potential insights and use cases are varied and exciting. It may be tempting to
dismiss this excitement as hype. All the same, it is diﬃcult to deny the possible beneﬁts
of real-time consumer sentiment to future economics research and forecasting studies.
References
Yan Carrière-Swallow and Felipe Labbé. Nowcasting with Google Trends in an emerging
market. Journal of Forecasting, 32(4):289–298, 2013.
Hyunyoung Choi and Hal Varian. Predicting initial claims for unemployment beneﬁts.
Technical Report, 2009.
Hyunyoung Choi and Hal Varian. Predicting the present with Google Trends. Economic
Record, 88(s1):2–9, 2012.
Lawrence J Christiano and Terry J Fitzgerald. The band pass filter. International
Economic Review, 44(2):435–465, 2003.
Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle
regression. The Annals of Statistics, 32(2):407–499, 2004.
Michael Ettredge, John Gerdes, and Gilbert Karuga. Using web-based search data to
predict macroeconomic statistics. Communications of the ACM, 48(11):87–92, 2005.
Sarah Gelper and Christophe Croux. Least angle regression for time series forecasting
with many predictors. FBE Research Report KBI 0801, 2008.
Domenico Giannone, Lucrezia Reichlin, and David Small. Nowcasting: The real-time
informational content of macroeconomic data. Journal of Monetary Economics, 55(4):
665–676, 2008.
Giselle Guzman. Internet search behavior as an economic forecasting tool: The case of
inﬂation expectations. Journal of Economic and Social Measurement, 36(3):119–167,
2011.
Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with
exponential smoothing: the state space approach. Springer, 2008.
Rob J Hyndman and Yeasmin Khandakar. Automatic time series forecasting: the forecast
package for R. Journal of Statistical Software, 26(3), 2008.
Rob J Hyndman, Anne B Koehler, Ralph D Snyder, and Simone Grose. A state space
framework for automatic forecasting using exponential smoothing methods. Interna-
tional Journal of Forecasting, 18(3):439–454, 2002.
31
Spyros Makridakis, SC Wheelwright, and Rob J Hyndman. Forecasting: methods and
applications. 1998.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
Torsten Schmidt and Simeon Vosen. Using internet data to account for special events in
economic forecasting. Ruhr Economic Paper, (382), 2012.
Time Series Research Staﬀ. X-13ARIMA-SEATS Reference Manual. Statistical Research
Division U.S. Census Bureau, 1.1 edition, 2013.
James H Stock and Mark W Watson. Forecasting using principal components from a
large number of predictors. Journal of the American Statistical Association, 97(460):
1167–1179, 2002.
Tanya Suhoy. Query indices and a 2008 downturn: Israeli data. Research Department,
Bank of Israel, 2009.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.
Simeon Vosen and Torsten Schmidt. Forecasting private consumption: Survey-based
indicators vs. Google Trends. Journal of Forecasting, 30(6):565–578, 2011.
Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–
320, 2005.
A Appendix
This Appendix contains extra information on the variables used in this study. Table A1
provides full series names for the abbreviations used for the forecasted series. More
information is provided in Table A2.
Table A1: This table contains the abbreviation used in
the main tables and the full series name.
Country Abbreviation Series
cr food01 Costa Rica, National Average, Beans (black), Retail...
cr food02 Costa Rica, National Average, Beans (black), Wholes...
cr food03 Costa Rica, National Average, Beans (red), Retail, ...
cr food04 Costa Rica, National Average, Beans (red), Wholesal...
cr food05 Costa Rica, National Average, Maize (white), Retail...
cr food06 Costa Rica, National Average, Maize (white), Wholes...
cr food07 Costa Rica, National Average, Rice (ﬁrst quality),...
cr food08 Costa Rica, National Average, Rice (second quality)...
cr food09 Costa Rica, National Average, Wheat (ﬂour), Retail...
cr inﬂ01 cpi
cr inﬂ02 cpi alc
cr inﬂ03 cpi clothes
cr inﬂ04 cpi comm
cr inﬂ05 cpi core
cr inﬂ06 cpi educ
cr inﬂ07 cpi entertain
cr inﬂ08 cpi food
cr inﬂ09 cpi health
cr inﬂ10 cpi household
cr inﬂ11 cpi housing
cr inﬂ12 cpi misc
cr inﬂ13 cpi restaurant
cr inﬂ14 cpi trans
hn food01 Honduras, National Average, Beans (red), Wholesale,...
hn food02 Honduras, National Average, Maize (white), Wholesal...
hn food03 Honduras, San Pedro Sula, Beans (red), Wholesale, (...
hn food04 Honduras, San Pedro Sula, Maize (white), Wholesale,...
hn food05 Honduras, San Pedro Sula, Rice (second quality), Wh...
hn food06 Honduras, Tegucigalpa, Beans (red), Wholesale, (USD...
hn food07 Honduras, Tegucigalpa, Maize (white), Wholesale, (U...
hn food08 Honduras, Tegucigalpa, Rice (second quality), Whole...
hn inﬂ01 cpi
hn inﬂ02 cpi food
sv food01 El Salvador, San Salvador, Beans (red), Retail, (US...
sv food02 El Salvador, San Salvador, Beans (red), Wholesale, ...
sv food03 El Salvador, San Salvador, Beans (red, seda), Retai...
sv food04 El Salvador, San Salvador, Beans (red, seda), Whole...
sv food05 El Salvador, San Salvador, Maize (white), Retail, (...
sv food06 El Salvador, San Salvador, Maize (white), Wholesale...
sv food07 El Salvador, San Salvador, Rice, Retail, (USD/Kg)
sv food08 El Salvador, San Salvador, Rice, Wholesale, (USD/Kg)
sv food09 El Salvador, San Salvador, Sorghum (Maicillo), Reta...
sv food10 El Salvador, San Salvador, Sorghum (Maicillo), Whol...
sv food11 El Salvador, San Salvador, Wheat (ﬂour), Retail, (...
sv food12 El Salvador, San Salvador, Wheat (ﬂour), Wholesale...
sv inﬂ01 cpi
sv inﬂ02 cpi alc
sv inﬂ03 cpi clothes
sv inﬂ04 cpi comm
sv inﬂ05 cpi educ
sv inﬂ06 cpi entertain
sv inﬂ07 cpi food
sv inﬂ08 cpi furniture
sv inﬂ09 cpi health
sv inﬂ10 cpi house fuel
sv inﬂ11 cpi misc
sv inﬂ12 cpi restaurant
sv inﬂ13 cpi trans
Table A2: Full information for all of the food price series
used throughout the study.
Country Region Series Units
CR National Average Beans (black), Retail (USD/Kg)
CR National Average Beans (black), Wholesale (USD/Kg)
CR National Average Beans (red), Retail (USD/Kg)
CR National Average Beans (red), Wholesale (USD/Kg)
CR National Average Maize (white), Retail (USD/Kg)
CR National Average Maize (white), Wholesale (USD/Kg)
CR National Average Rice (ﬁrst quality), Retail (USD/Kg)
CR National Average Rice (second quality), Retail (USD/Kg)
CR National Average Wheat (ﬂour), Retail (USD/Kg)
HN National Average Beans (red), Wholesale (USD/Kg)
HN National Average Maize (white), Wholesale (USD/Kg)
HN San Pedro Sula Beans (red), Wholesale (USD/Kg)
HN San Pedro Sula Maize (white), Wholesale (USD/Kg)
HN San Pedro Sula Rice (second quality), Wholesale (USD/Kg)
HN Tegucigalpa Beans (red), Wholesale (USD/Kg)
HN Tegucigalpa Maize (white), Wholesale (USD/Kg)
HN Tegucigalpa Rice (second quality), Wholesale (USD/Kg)
SV San Salvador Beans (red), Retail (USD/Kg)
SV San Salvador Beans (red), Wholesale (USD/Kg)
SV San Salvador Beans (red, seda), Retail (USD/Kg)
SV San Salvador Beans (red, seda), Wholesale (USD/Kg)
SV San Salvador Maize (white), Retail (USD/Kg)
SV San Salvador Maize (white), Wholesale (USD/Kg)
SV San Salvador Rice, Retail (USD/Kg)
SV San Salvador Rice, Wholesale (USD/Kg)
SV San Salvador Sorghum (Maicillo), Retail (USD/Kg)
SV San Salvador Sorghum (Maicillo), Wholesale (USD/Kg)
SV San Salvador Wheat (ﬂour), Retail (USD/Kg)
SV San Salvador Wheat (ﬂour), Wholesale (USD/Kg)
Country Series Source
CR All CPI Data Banco Central de Costa Rica
HN All CPI Data Banco Central de Honduras
SV All CPI Data Banco Central de Reserva de El Salvador
Table A3: Sources for the CPI data for each country considered in the study.
Table A3 lists the sources for the CPI data used for each country. The food price
series were all obtained from FAO-GIEWS.