
Board of Governors of the Federal Reserve System
International Finance Discussion Papers
Number 838, August 2005 (Screen Reader Version)
NOTE: International Finance Discussion Papers are preliminary materials circulated to stimulate discussion and critical comment. References in publications to International Finance Discussion Papers (other than an acknowledgment that the writer has had access to unpublished material) should be cleared with the author or authors. Recent IFDPs are available on the Web at http://www.federalreserve.gov/pubs/ifdp/. This paper can be downloaded without charge from the Social Science Research Network electronic library at http://www.ssrn.com/.
Abstract:
This paper discusses the econometric methodology of general-to-specific modeling, in which the modeler simplifies an initially general model that adequately characterizes the empirical evidence within his or her theoretical framework. Central aspects of this approach include the theory of reduction, dynamic specification, model selection procedures, model selection criteria, model comparison, encompassing, computer automation, and empirical implementation. This paper thus reviews the theory of reduction, summarizes the approach of general-to-specific modeling, and discusses the econometrics of model selection, noting that general-to-specific modeling is the practical embodiment of reduction. This paper then summarizes fifty-seven articles key to the development of general-to-specific modeling.
Keywords: cointegration, conditional models, data mining, diagnostic testing, dynamic specification, econometric methodology, encompassing, equilibrium correction models, error correction models, exogeneity, general-to-specific modeling, model comparison, model design, model evaluation, model selection, non-nested hypotheses, PcGets, PcGive, reduction, specific-to-general modeling
JEL classification: C1, C5
This paper focuses on a central method for selecting useful empirical models, called general-to-specific modeling. In this method, the modeler simplifies an initially general model that adequately characterizes the empirical evidence within his or her theoretical framework. While the methodological, statistical, and empirical foundations for general-to-specific modeling have been laid down over the last several decades, a burst of activity has occurred in the last half-dozen years, stimulated in fair part by Hoover and Perez's (1999a) development and analysis of a computer algorithm for general-to-specific modeling. The papers discussed herein detail how the subject has advanced to its present stage of success and convey the promise of these developments for future empirical research. The remainder of this overview motivates the interest in general-to-specific modeling and summarizes the structure of the subsequent sections (Sections 2-5).
Economists have long sought to develop quantitative models of economic behavior by blending economic theory with data evidence. The task has proved an arduous one because of the nature of the economy modeled, the economic theory, and the data evidence. The economy is a complicated, dynamic, nonlinear, simultaneous, high-dimensional, and evolving entity; social systems alter over time; laws change; and technological innovations occur. Thus, the target is not only a moving one; it behaves in a distinctly nonstationary manner, both evolving over time and being subject to sudden and unanticipated shifts. Economic theories are highly abstract and simplified; and they also change over time, with conflicting rival explanations sometimes coexisting. The data evidence is tarnished: economic magnitudes are inaccurately measured and subject to substantive revisions, and many important variables are not even observable. The data themselves are often time series where samples are short, highly aggregated, heterogeneous, time-dependent, and inter-dependent. Econometric modeling of economic time series has nevertheless strived to discover sustainable and interpretable relationships between observed economic variables. This paper focuses on general-to-specific modeling, in which the modeler simplifies an initially general model that adequately characterizes the empirical evidence within his or her theoretical framework. This method has proved useful in practice for selecting empirical economic models.
The difficulties of empirical modeling are well reflected in the slowness of empirical progress, providing plenty of ammunition for critics. However, part of the problem may be internal to the discipline, deriving from inappropriate modeling methods. The "conventional" approach insists on a complete theoretical model of the phenomena of interest prior to data analysis, leaving the empirical evidence as little more than quantitative clothing. Unfortunately, the complexity and nonstationarity of economies make it improbable that anyone--however brilliant--could deduce a priori the multitude of quantitative equations characterizing the behavior of millions of disparate and competing agents. Without a radical change in the discipline's methodology, empirical progress seems doomed to remain slow.
The situation is not as bleak as just described, for two reasons. First, the accumulation of knowledge is progressive, implying that one does not need to know all the answers at the start. Otherwise, no science could have advanced. Although the best empirical model at any given time may be supplanted later, it can provide a springboard for further discovery. Data-based model selection need not raise serious concerns: this implication is established below and is demonstrated by the actual behavior of model selection algorithms.
Second, inconsistencies between the implications of any conjectured model and the observed data are often easy to detect. The ease of model rejection worries some economists, yet it is also a powerful advantage by helping sort out which models are empirically adequate and which are not. Constructive progress may still be difficult because "we don't know what we don't know, and so we cannot know how best to find out what we don't know". The dichotomy between model destruction and model construction is an old one in the philosophy of science. While critical evaluation of empirical evidence is a destructive use of econometrics, it can also establish a legitimate basis for empirical models.
To undertake empirical modeling, one must begin by assuming a probability structure for the data, which is tantamount to conjecturing the data generation process (DGP). Because the economic mechanism is itself unknown, the relevant probability structure is also unknown, so one must proceed iteratively: conjecture the DGP, develop the associated probability theory, use that theory for modeling empirical evidence, and revise the starting point when the results do not match consistently. The development of econometric theory highlights this iterative progression: from stationarity assumptions, through integrated-cointegrated systems, to general nonstationary mixing processes, as empirical evidence revealed the inadequacy of each earlier step. Further developments will undoubtedly occur, leading to a still more useful foundation for empirical modeling. See Hendry (1995a) for an extensive treatment of progressive research strategies.
Having postulated a reasonable probability basis for the DGP, including the procedures used for data measurement and its collection, the next issue concerns what classes of model might be useful. The theory of reduction (discussed in Section 2) explains how empirical models arise and what their status is, noting that they are not facsimiles of the DGP. Specifically, empirical models describe the behavior of a relatively small set of variables--often from one to several hundred--and never the many millions of distinct variables present in most economies.
A key concept here is that of the local DGP, which is the probability mechanism in the space of those variables under analysis. The theory of reduction shows how the local DGP arises as a simplification of a vastly more general DGP involving millions of variables. The usefulness of a given local DGP depends on it capturing sustainable links, which in turn depends partly on the theoretical framework and partly on data accuracy. An econometric model cannot do better than capture the salient characteristics of its corresponding local DGP. The extent to which the model does capture those characteristics depends both on its specification at least embedding the local DGP and on the goodness of its selection.
There are thus two distinct conceptual steps in modeling, albeit ones closely related in practice. First, specify a useful information set for a "well-behaved" local DGP. Second, select a "good" empirical model of that local DGP.
A viable methodology for empirical modeling is an integral component of achieving the second step. Despite the controversy surrounding every aspect of econometric methodology, the "LSE" (or London School of Economics) approach has emerged as a leading methodology for empirical modeling; see Hendry (1993) for an overview. One of the LSE approach's main tenets is general-to-specific modeling, sometimes abbreviated as Gets. In general-to-specific modeling, empirical analysis starts with a general statistical model that captures the essential characteristics of the underlying dataset, i.e., that general model is congruent. Then, that general model is reduced in complexity by eliminating statistically insignificant variables, checking the validity of the reductions at every stage to ensure congruence of the finally selected model.
The papers discussed below articulate many reasons for adopting a general-to-specific approach. First amongst these reasons is that general-to-specific modeling implements the theory of reduction in an empirical context. Section 2 summarizes the theory of reduction, and Section 3 discusses general-to-specific modeling as the empirical analogue of reduction.
General-to-specific modeling also has excellent characteristics for model selection, as documented in Monte Carlo studies of automatic general-to-specific modeling algorithms. Hoover and Perez (1999a) were the first to evaluate the performance of general-to-specific modeling as a general approach to econometric model building. To analyze the general-to-specific approach systematically, Hoover and Perez mechanized the decisions in general-to-specific modeling by coding them in a computer algorithm. In doing so, Hoover and Perez also made important advances in practical modeling.
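To appreciate Hoover and Perez's contributions to general-to-specific modeling, consider the most basic steps that such an algorithm follows: ascertain that the general model is congruent; eliminate a variable (or variables) that satisfies the simplification criteria; check that the simplified model remains congruent; and repeat the elimination and checking steps until no remaining variable can be eliminated.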
Pagan (1987) and other critics have argued that the outcome of general-to-specific modeling may depend on the simplification path chosen--that is, on the order in which variables are eliminated and on the data transformations adopted--and so the selected model might vary with the investigator. Many reduction paths certainly could be considered from an initial general model.
Hoover and Perez (1999a) turned this potential drawback into a virtue by exploring many feasible paths and seeing which models result. When searches do lead to different model selections, encompassing tests can be used to discriminate between these models, with only the surviving (possibly non-nested) specifications retained. If multiple models are found that are both congruent and encompassing, a new general model can be formed from their union, and the simplification process is then re-applied. If that union model re-occurs, a final selection among the competing models can be made by using (say) information criteria. Otherwise, a unique, congruent, encompassing reduction has been located.
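The multiple-path idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Hoover and Perez's algorithm or PcGets: it omits the congruence (mis-specification) tests and the encompassing tests, starts one path from each initially insignificant regressor, and uses the Schwarz criterion (BIC) to choose among the terminal models. The function names, the 1.96 critical value, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients, t-ratios, and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    dof = X.shape[0] - X.shape[1]
    cov = rss / dof * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov)), rss

def simplify(y, X, keep, t_crit):
    """Follow one path: repeatedly drop the least significant regressor."""
    keep = list(keep)
    while keep:
        _, t, _ = ols(y, X[:, keep])
        weakest = int(np.argmin(np.abs(t)))
        if abs(t[weakest]) >= t_crit:
            break  # every remaining regressor is significant
        keep.pop(weakest)
    return tuple(sorted(keep))

def bic(y, X, keep):
    """Schwarz criterion for the model that retains the columns in keep."""
    n = len(y)
    rss = ols(y, X[:, list(keep)])[2] if keep else float(y @ y)
    return n * np.log(rss / n) + len(keep) * np.log(n)

def multipath_gets(y, X, t_crit=1.96):
    """Explore one simplification path per initially insignificant regressor."""
    k = X.shape[1]
    _, t, _ = ols(y, X)
    starts = [j for j in range(k) if abs(t[j]) < t_crit]
    terminals = {
        simplify(y, X, [i for i in range(k) if i != j], t_crit) for j in starts
    }
    if not terminals:  # nothing to eliminate: the general model is retained
        terminals = {tuple(range(k))}
    return min(terminals, key=lambda m: bic(y, X, m))

# Hypothetical data: six candidate regressors, of which only the first two matter.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 6))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)
selected = multipath_gets(y, X)
```

In this sketch `selected` holds the column indices of the retained regressors. A fuller implementation would test each terminal model for congruence and would use encompassing tests, rather than an information criterion alone, to discriminate between rival terminal models.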
Hoover and Perez (1999a) re-analyzed the Monte Carlo experiments in Lovell (1983) and found that their (Hoover and Perez's) general-to-specific algorithm performed much better than any method investigated by Lovell. Hendry and Krolzig (1999) demonstrated improvements on Hoover and Perez's general-to-specific algorithm, and Hoover and Perez (2004) showed how their algorithm could be successfully modified for analyzing cross-section data; see also Hendry and Krolzig (2004).
Automating the general-to-specific approach throws light on several methodological issues and prompts many new ideas, several of which are discussed in Section 3.1. Three common concerns with the general-to-specific approach are repeated testing, recursive procedures, and selection of variables.
First, critics such as Ed Leamer have worried about the interpretation of mis-specification tests that are repeatedly applied during simplification; see Hendry, Leamer, and Poirier (1990). Automation of the general-to-specific approach reveals two distinct roles for mis-specification tests: their initial application to test the congruence of the general model, and their subsequent use to guard against invalid reductions during simplification. The mis-specification tests are thus applied only once as statistical mis-specification tests--to the general model--so no doubts of interpretation arise. Their subsequent role in diagnostic checking during reduction does not alter their statistical properties as applied to the general model.
Second, recursive procedures such as subsample estimation play two roles. Recursive procedures help investigate parameter constancy, which is essential for congruence and for any practical use of the resulting model. Recursive procedures also assist in checking that estimated effects are significant over subsamples, and not just over the whole sample. Specifically, if a variable is only occasionally significant over various subsamples, that occasional significance may reflect chance rather than substance, especially if the variable's significance does not increase as the sample size grows. Eliminating such occasionally significant variables may be useful for parsimony.
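As a minimal sketch of recursive estimation, the following Python code re-estimates a regression by least squares on expanding subsamples and records the coefficient path. The function name `recursive_ols`, the minimum subsample size, and the simulated constant-parameter data are assumptions chosen for illustration; formal constancy tests are not included.

```python
import numpy as np

def recursive_ols(y, X, min_obs):
    """Estimate OLS on the expanding samples 1..t for t = min_obs, ..., T."""
    path = []
    for t in range(min_obs, len(y) + 1):
        beta, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
        path.append(beta)
    return np.array(path)

# Hypothetical data with constant parameters: intercept 1.0 and slope 0.5.
rng = np.random.default_rng(1)
T = 300
x = rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 0.5 * x + rng.standard_normal(T)

path = recursive_ols(y, X, min_obs=20)  # one row of estimates per subsample
```

Plotting each column of `path` against the subsample end point gives the familiar recursive-coefficient graph: estimates that settle down as the sample grows are consistent with constancy, whereas estimates that drift or jump suggest a shift.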
Third, model selection procedures that use diagnostic testing and are based on multiple criteria have eluded most attempts at theoretical analysis. However, computer implementation of the model selection process has allowed evaluation of such strategies by Monte Carlo simulation. Krolzig and Hendry (2001) and Hendry and Krolzig (2003) present the results of many Monte Carlo experiments to investigate whether the model selection process works well or fails badly for time series processes. The implications for the calibration of their econometrics computer program PcGets are also noted below; see Hendry and Krolzig (2001).
The remainder of this paper is organized into four sections. Section 2 reviews the theory of reduction as a major background component to general-to-specific modeling. Section 3 reviews the basis for general-to-specific modeling and discusses the econometrics of model selection, noting that general-to-specific modeling is the practical embodiment of reduction. Section 4 then summarizes, paper by paper, fifty-seven papers key to the development of general-to-specific modeling. These papers are reprinted in the two volumes of Campos, Ericsson, and Hendry (2005), where they are grouped into nine parts; see the Appendix below for a listing of these papers. This nine-part division is also paralleled by the subsections in Section 4, as follows.
Volume I
Part I. Introduction to the methodology (Section 4.1);
Part II. Theory of reduction (Section 4.2);
Part III. Dynamic specification (Section 4.3);
Part IV. Model selection procedures (Section 4.4);
Volume II
Part I. Model selection criteria (Section 4.5);
Part II. Model comparison (Section 4.6);
Part III. Encompassing (Section 4.7);
Part IV. Computer automation (Section 4.8); and
Part V. Empirical applications (Section 4.9).
Section 5 concludes.
For ease of reference, when an article reprinted in Campos, Ericsson, and Hendry (2005) is initially cited in Sections 2-3, text in square brackets immediately following the citation indicates the volume and chapter where the reprint appears: for instance, "Hoover and Perez (1999a) [Volume II: Chapter 22]". Thereafter--except in Section 4's actual summary of the article--citation to an article reprinted in Campos, Ericsson, and Hendry (2005) is indicated by a trailing asterisk, as in "Hoover and Perez (1999a)*". On a separate issue, the mathematical notation in Section 2 differs somewhat from that used in Section 3, owing to the different strands of literature on which those two sections are based. The mathematical notation in Section 4 generally follows the notation in the article being described.
This section examines the relationship between the data generation process and an empirical model. Section 2.1 defines what an empirical model is, Section 2.2 defines what the data generation process is, and Section 2.3 shows how an empirical model is obtained as a reduction or simplification of the data generation process. Sections 2.4, 2.5, and 2.6 then examine how models can be compared, how to measure the information losses associated with a model, and what information is used in evaluating models. Section 2.7 turns to explicit model design, which recognizes and utilizes the relationship between empirical models and the data generation process.
The exposition in this section draws on Hendry (1995a). Hendry (1987) [Volume I: Chapter 8] presents a non-technical exposition of the theory of reduction. Important contributions to its development include Florens and Mouchart (1980, 1985), Hendry and Richard (1982) [Volume I: Chapter 9], Engle, Hendry, and Richard (1983) [Volume I: Chapter 13], and Florens, Mouchart, and Rolin (1990).
Before proceeding to the theory of reduction, a few comments are in order. First, data are generated from an unknown high-dimensional probability distribution (the DGP), which is indexed by a set of parameters. Some functions of those parameters are of interest to an investigator. A central aim of modeling is to determine the numerical values of those parameters, which can be used for testing theories, forecasting, conducting economic policy, and learning about the economy.
Second, the DGP itself involves far too many parameters to estimate on available data, so reductions of the DGP are essential to produce an operational model. The key feature of every reduction is whether or not it involves a loss of information about the parameters of interest. Such a loss may be total (as when the parameters of interest no longer enter the model) or partial (as when some parameters can be gleaned but others not, or when some cease to be constant), or it may just lower the statistical efficiency with which the parameters may be estimated. Logically, reductions lead from the original DGP--which involves the universe of variables--to the distribution of a small subset thereof--which is the local DGP.
Third, an empirical model of that subset of variables then approximates their local DGP. Estimation of the parameters of that model comes last, logically speaking. That model's estimation has been extensively analyzed in econometrics for many postulated types of model. A variety of approaches are still extant, such as "classical" and "Bayesian", and with varying degrees of emphasis on least squares, instrumental variables, maximum likelihood, method of moments, and so on. Many of these approaches are minor variants on the basic likelihood score equations; see Hendry (1976).
Fourth, an empirical model may include variables that do not actually enter the local DGP, in which case a more parsimonious representation can be selected from the sample evidence. Such a data-based simplification is a major focus of this paper. As noted in the previous section, model selection procedures have been debated extensively, with little professional accord. The discussion below shows how the general-to-specific approach is the analogue in modeling of reduction in theory. Critiques of the general-to-specific approach and the practical arguments in its favor are also considered.
Both economic theory and data are important in empirical economic modeling, so this subsection compares theory models and empirical models. Theory models are freely created, whereas empirical models are derived and not autonomous. That distinction in a model's status has direct implications for the roles that sequential reduction and mis-specification testing play in empirical modeling. At a more general level, theory models play a key part in interpreting economic data, but theory models in themselves are not sufficient for doing so in a reliable fashion. To illustrate the distinction between theory models and empirical models, consider a standard linear model that is subsequently estimated by least squares.
From the outset, it is important to distinguish between the economic theory-model and the empirical model that the theory model serves to interpret. A theory model is freely created by the human imagination. For instance, economic theory might specify a relationship between two variables $\mathsf{y}_t$ and $\mathsf{z}_t$:

$$\underbrace{\mathsf{y}_t}_{\text{output}} = \underbrace{b'\mathsf{z}_t}_{\text{transformed input}} + \underbrace{e_t}_{\text{perturbation}} \tag{1}$$

where the coefficient vector $b$ is the partial derivative $\partial \mathsf{y}_t / \partial \mathsf{z}_t$, the variables $\mathsf{y}_t$ and $\mathsf{z}_t$ (in sans serif font) denote the economic theoretic variables at time $t$, and the error $e_t$ is an independent perturbation at time $t$. The theory model is defined by how it is constructed. In equation (1), $\mathsf{y}_t$ is constructed from $\mathsf{z}_t$ and the error $e_t$, and the relationship between $\mathsf{y}_t$ and $\mathsf{z}_t$ can be treated as if it were a causal mechanism.
A corresponding empirical model is anything but freely created, with the properties of the empirical model being determined by reductions of the DGP. The mapping between variables and disturbances also differs. To demonstrate, consider the following empirical model:
$$\underbrace{y_t}_{\text{observed variable of interest}} = \underbrace{\beta' z_t}_{\text{explanation}} + \underbrace{\varepsilon_t}_{\text{remainder}} \tag{2}$$

where $y_t$ and $z_t$ (in italic) are economic data at time $t$, the conditional expectation $\mathsf{E}(\varepsilon_t \mid z_t)$ is zero (by assumption), and $\mathsf{E}(\cdot)$ is the expectations operator. The orthogonality assumption that $\mathsf{E}(\varepsilon_t \mid z_t) = 0$ defines the parameter vector $\beta$ from equation (2) in terms of the data's properties:

$$\mathsf{E}(y_t \mid z_t) = \beta' z_t \tag{3}$$

The error $\varepsilon_t$ in equation (2) is also defined as a function of the data:

$$\varepsilon_t = y_t - \mathsf{E}(y_t \mid z_t) \tag{4}$$

In particular, equation (4) shows explicitly that the error $\varepsilon_t$ contains everything in $y_t$ that is not modeled by $z_t$. From a slightly different perspective, $y_t$ can always be decomposed into two components: $\beta' z_t$ (the explained part) and $\varepsilon_t$ (the unexplained part). Such a partition is feasible, even when $y_t$ does not depend on $z_t$ in any way and so is not caused by $z_t$.

From equations (3) and (4), the properties of $\beta$ and $\varepsilon_t$ vary with the choice of $z_t$ and the orthogonality assumption. Equivalently, the properties of $\beta$ and $\varepsilon_t$ vary with the choice of variables ignored and with the conditioning assumption. The coefficients and errors of empirical models are thus derived and not autonomous.
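The derived status of the coefficients and the error can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the original exposition: it uses the least-squares (linear projection) sample analogue of the orthogonality condition, simulated data in which the dependent variable is generated independently of the regressors, and hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
z = rng.standard_normal((T, 2))
y = rng.standard_normal(T)  # y is generated independently of z

# Sample analogue of the orthogonality condition: choose beta so that z'e = 0.
beta = np.linalg.solve(z.T @ z, z.T @ y)
explained = z @ beta
e = y - explained  # the error is whatever in y is not modeled by z

# Changing the conditioning set redesigns both the coefficients and the error.
z_small = z[:, :1]
beta_small = np.linalg.solve(z_small.T @ z_small, z_small.T @ y)
e_small = y - z_small @ beta_small
```

Even though `y` does not depend on `z` in any way, the split into an explained part and an unexplained part exists, and the residual is orthogonal to the regressors by construction. Dropping a regressor yields a different coefficient and a different error, so neither is autonomous.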
As an immediate implication, empirical models can be designed through the selection of $z_t$: changing $z_t$ redesigns the error $\varepsilon_t$ and the coefficients $\beta$. Consequently, design criteria can be analyzed. For example, what makes one design better than another? Sections 2.5-2.7 address this issue, leading to the notion of a congruent model--one that matches the data evidence on all the measured attributes. Any given sample may have more than one congruent model. That leads to the requirement of a dominant congruent model, which entails an ordering over congruent models. Indeed, successive congruent models of a given phenomenon should be able to explain or encompass previous models, thereby achieving progress in modeling.
This subsection formally (and briefly) discusses the statistical basis for the data generation process.
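Let $\{\mathbf{u}_t\}$ denote a stochastic process for a vector $\mathbf{u}_t$ of random variables at time $t$ that is defined on the probability space $(\Omega, \mathcal{F}, \mathsf{P})$, where $\Omega$ is the sample space, $\mathcal{F}$ is the event space (sigma field), and $\mathsf{P}$ is the probability measure. Let $\psi$ denote a vector of parameters, which are entities that do not depend on $\mathcal{F}$.

Consider the full sample $\mathbf{U}_T^1 = (\mathbf{u}_1 \ldots \mathbf{u}_T)$, which covers $t = 1, \ldots, T$, where the notation $\mathbf{U}_j^i$ means $(\mathbf{u}_i \ldots \mathbf{u}_j)$ for $j \geq i$; and denote the initial (pre-sample) conditions by $\mathbf{U}_0$. The distribution function $\mathsf{D}_U(\cdot)$ of $\mathbf{U}_T^1$, conditional on $\mathbf{U}_0$, is denoted by $\mathsf{D}_U(\mathbf{U}_T^1 \mid \mathbf{U}_0, \cdot)$, which is often called the Haavelmo distribution; see Spanos (1989) [Volume I: Chapter 4]. To make explicit the observed phenomenon of parameter nonconstancy, $\mathsf{D}_U(\cdot)$ is represented parametrically by the $kT$-dimensional vector of parameters $\psi_T^1 = (\psi_1' \ldots \psi_T')'$, where each time period has an associated parameter $\psi_t = (\psi_{1,t} \ldots \psi_{k,t})'$. Thus, elements of $\psi$ need not be the same at each time $t$, and some of the $\{\psi_{i,t}\}$ may reflect transient effects or regime shifts. The parameter space is $\Psi \subseteq \mathbb{R}^{kT}$, so $\psi_T^1 \in \Psi$.

The data generating process of $\{\mathbf{u}_t\}$ is therefore written as:

$$\mathsf{D}_U(\mathbf{U}_T^1 \mid \mathbf{U}_0, \psi_T^1) \quad \text{with} \quad \psi_T^1 \in \Psi \subseteq \mathbb{R}^{kT} \tag{5}$$

From equation (5), the complete sample $\{\mathbf{u}_t,\ t = 1, \ldots, T\}$ is generated from $\mathsf{D}_U(\cdot)$ by a population parameter value, which is denoted $\psi_p$.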
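This subsection considers the sequence of reductions that obtains the empirical model from the DGP. Because the DGP is unmanageably large, operational models are defined by a sequence of data reductions, which can be viewed in ten stages: parameters of interest, data transformations and aggregation, sequential factorization, data partition, marginalization, mapping to stationarity, conditional factorization, constancy, lag truncation, and functional form.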
It is assumed that empirical modeling aims to determine the values of a set of parameters of interest that are relevant to an investigator's objectives, such as testing theories or undertaking policy analysis. The key concern of any given reduction is its effect on the parameters of interest. To derive that effect, this subsection briefly considers each of these ten stages in turn, while noting that some of these stages do not involve a reduction per se.
Parameters of interest. Let the parameters of interest be denoted by $\mu \in \mathcal{M}$. Both economic theory and empirical properties may suggest that certain parameters are parameters of interest. Parameters that are identifiable and invariant to an empirically relevant class of interventions are likely to be of interest. Other parameters may be of interest, depending on the purpose of the exercise. Also, if $\mu$ is not a function of $\psi_T^1$, then the modeling exercise will be vacuous, so assume that $\mu = f(\psi_T^1)$. After each reduction, it is essential to check that $\mu$ can still be retrieved from the parameters characterizing the lower-dimensional data density.
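Data transformations and aggregation. Consider a one-to-one mapping of $\mathbf{U}_T^1$ to a new dataset $\mathbf{W}_T^1$: $\mathbf{U}_T^1 \leftrightarrow \mathbf{W}_T^1$. The variables in $\mathbf{W}_T^1$ may include aggregates of the original variables, their growth rates, etc. The transformation from $\mathbf{U}$ to $\mathbf{W}$ affects the parameter space, so $\Psi$ is transformed into $\Phi$ (say). Because densities are equivariant under one-to-one transformations, the DGP of $\mathbf{U}_T^1$ is characterized by the joint density of $\mathbf{W}_T^1$:

$$\mathsf{D}_W(\mathbf{W}_T^1 \mid \mathbf{W}_0, \phi_T^1) = \mathsf{D}_U(\mathbf{U}_T^1 \mid \mathbf{U}_0, \psi_T^1) \tag{6}$$

where $\phi_T^1$ is the set of transformed parameters, with $\phi_T^1 \in \Phi$.

For $\mathsf{D}_W(\cdot)$ in equation (6), the key issue is how the transformation from $\psi_T^1$ to $\phi_T^1$ alters the properties of the parameters. Some parameters in $\phi_T^1$ may be more constant than ones in $\psi_T^1$; others may be less constant; and a smaller (or larger) number of the parameters in $\phi_T^1$ may be needed to characterize the parameters of interest $\mu$.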
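Sequential factorization. Using the basic result that a joint probability equals the product of the conditional and marginal probabilities, and noting that time is irreversible, sequentially factorize the density of $\mathbf{W}_T^1$ into its (martingale-difference) components:

$$\mathsf{D}_W(\mathbf{W}_T^1 \mid \mathbf{W}_0, \phi_T^1) = \prod_{t=1}^{T} \mathsf{D}_w(\mathbf{w}_t \mid \mathbf{W}_{t-1}, \delta_t) \tag{7}$$

where $\mathbf{W}_{t-1} = (\mathbf{W}_0 : \mathbf{W}_{t-1}^1)$, $\mathbf{w}_t$ is the $t$-th column in $\mathbf{W}_T^1 = (\mathbf{w}_1 \ldots \mathbf{w}_T)$, and $\delta_t$ is the parameterization resulting from the sequential factorization. The right-hand side of equation (7) implicitly defines an innovation process $\eta_t$, which equals $\mathbf{w}_t - \mathsf{E}(\mathbf{w}_t \mid \mathbf{W}_{t-1})$.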
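A small numerical check may help fix ideas about sequential factorization. The sketch below assumes a scalar Gaussian first-order autoregression with hypothetical parameter values (none of which come from the paper) and verifies that the joint log-density of the sample, conditional on the initial value, equals the sum of the one-step-ahead conditional log-densities.

```python
import numpy as np

rng = np.random.default_rng(3)
T, rho, sigma, w0 = 6, 0.7, 1.5, 0.4

# Simulate w_1..w_T from w_t = rho * w_{t-1} + eta_t, conditional on w_0.
eta = sigma * rng.standard_normal(T)
w = np.empty(T)
prev = w0
for t in range(T):
    w[t] = rho * prev + eta[t]
    prev = w[t]

# Joint Gaussian density of (w_1..w_T) given w_0, from A w = c + eta,
# where A is lower bidiagonal with ones on the diagonal.
A = np.eye(T) - rho * np.eye(T, k=-1)
c = np.zeros(T)
c[0] = rho * w0
mean = np.linalg.solve(A, c)
A_inv = np.linalg.inv(A)
cov = sigma**2 * A_inv @ A_inv.T
dev = w - mean
_, logdet = np.linalg.slogdet(cov)
joint_logpdf = -0.5 * (
    T * np.log(2 * np.pi) + logdet + dev @ np.linalg.solve(cov, dev)
)

# Sum of the one-step conditional log-densities of w_t given w_{t-1}.
lagged = np.concatenate([[w0], w[:-1]])
innov = w - rho * lagged  # innovations: w_t minus its conditional mean
seq_logpdf = np.sum(-0.5 * (np.log(2 * np.pi * sigma**2) + (innov / sigma) ** 2))
```

The two log-densities agree to rounding error, and `innov` reproduces the simulated shocks, which is the sense in which the factorization implicitly defines an innovation process.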
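Data partition. Now, partition $\mathbf{W}_T^1$ into two sets, one set to be analyzed ($\mathbf{X}_T^1$) and one set to be marginalized ($\mathbf{V}_T^1$):

$$\mathbf{W}_T^1 = (\mathbf{X}_T^1 : \mathbf{V}_T^1) \tag{8}$$

where $\mathbf{X}_T^1$ is an $n \times T$ matrix. Consequently, everything about $\mu$ must be learnt from $\mathbf{X}_T^1$ alone, which entails that $\mathbf{V}_T^1$ must not be essential to inference about $\mu$.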
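Marginalization. Actual marginalization proceeds as follows. Using the partition in equation (8), and noting that $\mathbf{W}_{t-1} = \{\mathbf{X}_{t-1}^1, \mathbf{V}_{t-1}^1, \mathbf{W}_0\}$, factorize $\mathsf{D}_w(\cdot)$ into the conditional distribution of $\mathbf{v}_t$ given $\mathbf{x}_t$, and the marginal distribution of $\mathbf{x}_t$:

$$\mathsf{D}_w(\mathbf{w}_t \mid \mathbf{W}_{t-1}, \delta_t) = \mathsf{D}_{v|x}(\mathbf{v}_t \mid \mathbf{x}_t, \mathbf{W}_{t-1}, \delta_{a,t}) \cdot \mathsf{D}_x(\mathbf{x}_t \mid \mathbf{V}_{t-1}^1, \mathbf{X}_{t-1}^1, \mathbf{W}_0, \delta_{b,t}) \tag{9}$$

If only $\{\mathbf{x}_t\}$ is to be analyzed, with only $\mathsf{D}_x(\mathbf{x}_t \mid \cdot)$ retained, then $\mu$ must be obtainable from $\{\delta_{b,t}\}$ alone.

If lagged information about $\mathbf{v}$ is also to be eliminated, then $\mathsf{D}_x(\mathbf{x}_t \mid \cdot)$ must be marginalized with respect to $\mathbf{V}_{t-1}^1$, requiring the very strong condition that:

$$\mathsf{D}_x(\mathbf{x}_t \mid \mathbf{V}_{t-1}^1, \mathbf{X}_{t-1}^1, \mathbf{W}_0, \delta_{b,t}) = \mathsf{D}_x(\mathbf{x}_t \mid \mathbf{X}_{t-1}^1, \mathbf{W}_0, \delta_{b,t}^*) \tag{10}$$

There is no loss of information from eliminating the history $\mathbf{V}_{t-1}^1$ if and only if $\delta_{b,t} = \delta_{b,t}^*\ \forall t$, in which case the conditional sequential distribution of $\{\mathbf{x}_t\}$ does not depend on $\mathbf{V}_{t-1}^1$. That is, $\mathbf{v}$ does not Granger-cause $\mathbf{x}$; see Granger (1969). In modeling, another important condition is that there is no loss of relevant information when $\mu$ remains a function of $\{\delta_{b,t}^*\}$ alone. That is still a strong condition, but less stringent than $\delta_{b,t} = \delta_{b,t}^*\ \forall t$. Also, marginalizing $\mathbf{v}$ will entail a loss of information unless $\delta_t = (\delta_{a,t}, \delta_{b,t}) \in \Delta_a \times \Delta_b$ for parameter spaces $\Delta_a$ and $\Delta_b$. Otherwise, the parameters of the conditional and marginal distributions in equation (9) are cross-linked.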
The above discussion implies that modeling aggregated data (say) can be viewed as a two-step process. First, the disaggregated series for a given variable (such as expenditure) are transformed by a one-to-one transformation into the corresponding aggregated series and all but one of those disaggregated series. Second, those disaggregated series are marginalized.
Mapping to stationarity. An economy may generate integrated data, where a variable that is integrated of order $d$ (denoted I($d$)) must be differenced $d$ times to eliminate all unit roots. Mapping such data to stationarity is a reduction from I($d$) to I(0). This mapping is generally useful for interpreting the resulting models, and it is needed to ensure that conventional inference is valid for all parameters. Still, many inferences will be valid even if this reduction is not enforced; see Sims, Stock, and Watson (1990). While differencing the data can map the data to stationarity, cointegration can also eliminate unit roots between linear combinations of variables. Cointegration is merely noted here, as it is treated extensively in numerous books and expository articles: see Banerjee and Hendry (1992), Ericsson (1992a), Banerjee, Dolado, Galbraith, and Hendry (1993), Johansen (1995), Hatanaka (1996), Doornik, Hendry, and Nielsen (1998), and Hendry and Juselius (2001) inter alia.
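The link between the order of integration and the number of differences required can be seen in a short simulation. The sketch below builds hypothetical I(1) and I(2) series by cumulating white-noise shocks once and twice, and then differences them back; the variable names and sample size are illustrative assumptions. Cointegration is not covered by this example.

```python
import numpy as np

rng = np.random.default_rng(4)
shocks = rng.standard_normal(1000)  # an I(0) series

i1 = np.cumsum(shocks)  # I(1): one unit root
i2 = np.cumsum(i1)      # I(2): two unit roots

d_i1 = np.diff(i1, n=1)   # differencing once recovers the I(0) shocks
d2_i2 = np.diff(i2, n=2)  # an I(2) series must be differenced twice
```

Each difference removes one unit root, so `d_i1` matches `shocks[1:]` and `d2_i2` matches `shocks[2:]`, whereas a single difference of `i2` would still be I(1).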
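Conditional factorization. Typically in empirical modeling, some variables are treated as endogenous and others are treated as given or non-modeled. Formally, this partitioning of the variables arises by factorizing the density of $n$ variables in $\mathbf{x}_t$ into sets of $n_1$ and $n_2$ variables $\mathbf{y}_t$ and $\mathbf{z}_t$:

$$\mathbf{x}_t' = (\mathbf{y}_t' : \mathbf{z}_t') \tag{11}$$

where $\mathbf{y}_t$ denotes the endogenous variables in $\mathbf{x}_t$, $\mathbf{z}_t$ denotes the non-modeled variables in $\mathbf{x}_t$, and $n_1 + n_2 = n$. Using the partition in equation (11), the joint distribution of $\mathbf{x}_t$ on the right-hand side of equation (10) can always be factorized as:

$$\mathsf{D}_x(\mathbf{x}_t \mid \mathbf{X}_{t-1}^1, \mathbf{W}_0, \delta_{b,t}^*) = \mathsf{D}_{y|z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}^1, \mathbf{W}_0, \theta_{a,t}) \cdot \mathsf{D}_z(\mathbf{z}_t \mid \mathbf{X}_{t-1}^1, \mathbf{W}_0, \theta_{b,t}) \tag{12}$$

where $\mathsf{D}_{y|z}(\mathbf{y}_t \mid \cdot)$ is the conditional density of $\mathbf{y}_t$ given $\mathbf{z}_t$, $\mathsf{D}_z(\mathbf{z}_t \mid \cdot)$ is the marginal density of $\mathbf{z}_t$, and $\theta_{a,t}$ and $\theta_{b,t}$ are those densities' parameters. Modeling only $\mathbf{y}_t$ and treating $\mathbf{z}_t$ as given corresponds to modeling only $\mathsf{D}_{y|z}(\mathbf{y}_t \mid \cdot)$ and discarding the marginal distribution $\mathsf{D}_z(\mathbf{z}_t \mid \cdot)$ on the right-hand side of equation (12). No loss of information in this reduction corresponds to the condition that $\mathbf{z}_t$ is weakly exogenous for $\mu$. Specifically, weak exogeneity requires that $\mu$ be a function of $\theta_{a,t}$ alone and that $(\theta_{a,t}, \theta_{b,t}) \in \Theta_a \times \Theta_b$ for parameter spaces $\Theta_a$ and $\Theta_b$; see Engle, Hendry, and Richard (1983)*.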
Constancy. Complete parameter constancy in the conditional density $\mathsf{D}_{y|z}(\mathbf{y}_t \mid \cdot)$ means that $\theta_{a,t} = \theta_a\ \forall t$, where $\theta_a \in \Theta_a$. In such a situation, if weak exogeneity holds, $\mu$ itself is constant because $\mu$ is a function of only $\theta_a$. While appearing simple enough, constancy is actually a subtle concept; see Hendry (1996) and Ericsson, Hendry, and Prestwich (1998).
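Lag truncation. Lag truncation limits the extent of the history $\mathbf{X}_{t-1}^1$ in the conditional density $\mathsf{D}_{y|z}(\mathbf{y}_t \mid \cdot)$ in equation (12). For instance, truncation at $s$ lags implies:

$$\mathsf{D}_{y|z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}^1, \mathbf{W}_0, \theta_a) = \mathsf{D}_{y|z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}^{t-s}, \zeta) \tag{13}$$

In equation (13), no loss of relevant information requires that $\mu$ be a function of $\zeta$.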
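Functional form. Functional form could be treated as a set of data transformations, but it merits some discussion on its own. Specifically, map $\mathbf{y}_t$ into $\mathbf{y}_t^\dagger$ ($= h_1(\mathbf{y}_t)$) and $\mathbf{z}_t$ into $\mathbf{z}_t^\dagger$ ($= h_2(\mathbf{z}_t)$), and denote the resulting data by $\mathbf{X}^\dagger$. Assume that the transformations $\mathbf{y}_t^\dagger$ and $\mathbf{z}_t^\dagger$ together make $\mathsf{D}_{y^\dagger|z^\dagger}(\cdot)$ approximately normal and homoscedastic, denoted $\mathsf{N}_{n_1}(\lambda_t, \Upsilon)$. Then there is no loss of information--and no change in the specification--if:

$$\mathsf{D}_{y^\dagger|z^\dagger}(\mathbf{y}_t^\dagger \mid \mathbf{z}_t^\dagger, \mathbf{X}_{t-1}^{\dagger\, t-s}, \gamma) = \mathsf{D}_{y|z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}^{t-s}, \zeta) \tag{14}$$

A well-known example in which no loss of information occurs is transforming the log-normal density of a variable to the normal density in the logarithm of that variable. The left-hand side density in equation (14) defines the local (conditional) DGP of $\mathbf{y}_t^\dagger$. When joint normality holds, as is assumed here, that final conditional model is linear in the transformed space.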
The derived model. This sequence of reductions delivers the following specification:
$$A(L)\, h_1(y_t) = B(L)\, h_2(z_t) + \varepsilon_t \tag{15}$$
where $h_1(y_t)$ and $h_2(z_t)$ are the transformed endogenous and non-modeled variables, $\varepsilon_t$ is a mean-zero, homoscedastic, mean-innovation process with variance $\Sigma_\varepsilon$, and $A(L)$ and $B(L)$ are constant-parameter polynomial matrices of order $s$ in the lag operator $L$. That is, $A(L)$ and $B(L)$ are matrices whose elements are polynomials. The error $\varepsilon_t$ is a derived process that is defined by:
$$\varepsilon_t = A(L)\, h_1(y_t) - B(L)\, h_2(z_t) \tag{16}$$
so $\varepsilon_t$ as given in equation (16) is not autonomous. For the same reason, equation (15) is a derived model, rather than an autonomous model. Equations (15) and (16) parallel and generalize Section 2.1's discussion of the empirical model in equation (2) and its derived error in equation (4). Section 4.3 discusses further aspects of model formulation, focusing on dynamic specification, noting that equation (15) is an autoregressive distributed lag model.
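To make the idea of a derived error concrete, the following minimal sketch simulates a scalar autoregressive distributed lag model with one lag and then recovers the error by applying the lag polynomials to the data, in the spirit of equation (16). The coefficient values, the function names, and the use of identity transformations for the variables are assumptions made only for this illustration.

```python
import random

def simulate_adl(a1, b0, b1, n_obs, seed=0):
    """Simulate y_t = a1*y_{t-1} + b0*z_t + b1*z_{t-1} + e_t with white-noise e_t."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    e = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    y = [0.0] * n_obs  # y_0 is fixed at zero as an initial condition
    for t in range(1, n_obs):
        y[t] = a1 * y[t - 1] + b0 * z[t] + b1 * z[t - 1] + e[t]
    return y, z, e

def derived_error(y, z, a1, b0, b1):
    """Return A(L) y_t - B(L) z_t for t >= 1, with A(L) = 1 - a1*L and B(L) = b0 + b1*L."""
    return [
        (y[t] - a1 * y[t - 1]) - (b0 * z[t] + b1 * z[t - 1])
        for t in range(1, len(y))
    ]

# Hypothetical coefficients (assumptions for illustration).
y, z, e = simulate_adl(a1=0.5, b0=1.0, b1=-0.3, n_obs=200)

# With the coefficients used in the simulation, the derived error matches the draws.
eps = derived_error(y, z, a1=0.5, b0=1.0, b1=-0.3)
max_gap = max(abs(u - v) for u, v in zip(eps, e[1:]))

# With a different lag polynomial, the same data imply a different error.
eps_alt = derived_error(y, z, a1=0.0, b0=1.0, b1=-0.3)
max_change = max(abs(u - v) for u, v in zip(eps_alt, eps))
print(max_gap, max_change)
```

The second calculation illustrates why the error is not autonomous: it is whatever remains after the chosen polynomials are applied to the data, so it changes whenever the model's coefficients change.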
Reduction from the DGP to the generic econometric equation in (15) involves all ten stages of reduction discussed above, thereby transforming the parameters of the DGP into the coefficients of the empirical model. Because the DGP is congruent with itself, equation (15) would be an undominated congruent model if there were no information losses from the corresponding reductions. More generally, equation (15) is congruent under the conditions stated, but it still could be dominated, as the next subsection discusses.
Comparison of empirical models is often of interest, and dominance is a useful criterion when comparing models. Dominance can be summarized, as follows.
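Consider two distinct scalar empirical models, denoted $M_1$ and $M_2$, with mean innovation processes (MIPs) $\varepsilon_{1t}$ and $\varepsilon_{2t}$ relative to their own information sets, where $\varepsilon_{1t}$ and $\varepsilon_{2t}$ have constant finite variances $\sigma_1^2$ and $\sigma_2^2$ respectively. Model $M_1$ variance-dominates model $M_2$ if $\sigma_1^2 < \sigma_2^2$. That property is denoted $M_1 \succ M_2$.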
Several implications follow immediately from the definition of variance dominance in terms of the models' variances.
These implications favor starting with general rather than simple empirical models, given any choice of information set; and they suggest modeling the conditional expectation.
Valid reductions involve no losses in information. Econometrics has created concepts that correspond to avoiding possible losses, as the following list highlights.
At a more general level, encompassing implies that no loss of information arises from ignoring another model's data, conditional on retaining the data in one's own model.
To provide some insight into encompassing, note that the local DGP is a reduction of the DGP itself and is nested within that DGP. The properties of the local DGP are explained by the reduction process: knowledge of the DGP entails knowledge of all reductions thereof. Thus, when knowledge of one model entails knowledge of another, the first model is said to encompass the second. Several relevant papers on encompassing are reprinted in Campos, Ericsson, and Hendry (2005); see Sections 4.6 and 4.7.
Given the potential information losses that can occur in reduction, it is natural to evaluate the extent to which there is evidence of invalid reductions. A taxonomy of information sources arises from the data themselves, theory and measurement information, and data used in rival models. In this taxonomy, it is useful to partition the data used in modeling ($X_T^1$) into its relative past, relative present, and relative future:
$$X_T^1 = (X_{t-1}^1 : x_t : X_T^{t+1}) \tag{17}$$
The taxonomy of information sources is thus:
Each source of information implies a criterion for evaluating an empirical model:
Models that satisfy the criteria for the first five information sets are said to be congruent. An encompassing congruent model satisfies all six criteria; see Sections 4.6 and 4.7.
Model design can be either implicit or explicit. Implicit model design typically occurs when modeling aims to address the symptoms of existing mis-specification. For instance, a simple model might be initially specified and then tested for problems such as autocorrelation, heteroscedasticity, and omitted variables, correcting such problems as they are discovered. The model is thus implicitly designed to minimize (or control) the values of the test statistics that are computed.
Explicit model design aims to mimic reduction theory in empirical modeling, thereby minimizing the losses due to the reductions imposed. Explicit model design leads to general-to-specific modeling; see the following section. Gilbert (1986) [Volume I: Chapter 1] discusses and contrasts implicit and explicit model design in detail.
General-to-specific modeling is the practical embodiment of reduction. The introduction to the current section summarizes important aspects of that relationship, focusing on implications for automated general-to-specific modeling algorithms. See Gilbert (1986)* and Ericsson, Campos, and Tran (1990) [Volume I: Chapter 7] for detailed expositions on general-to-specific modeling and Phillips (1988) [Volume I: Chapter 6] for a constructive critique.
As background to the methodology of general-to-specific modeling, consider the local DGP, which is the joint distribution of the subset of variables under analysis. A general unrestricted model such as equation (15) is formulated to provide a congruent approximation to that local DGP, given the theoretical, institutional, and existing empirical background. The empirical analysis commences from this general specification, which is assessed for discrepancies from congruency by using mis-specification tests. If no mis-specification is apparent, the general unrestricted model (or GUM) is simplified to a parsimonious congruent representation, with each step in the simplification being checked by diagnostic testing.
Simplification can proceed in many ways. Although the goodness of a model is intrinsic to the model and is not a property of the selection route, poor routes seem unlikely to deliver useful models. Consequently, some economists worry about how selection rules may affect the properties of the resulting models, thus advocating the use of a priori specifications. To be a viable empirical approach, these a priori specifications require knowledge of the answer before starting; and they deny any useful role to empirical modeling. Conversely, good routes may have a high chance of delivering congruent representations of the local DGP. Section 4.4 addresses model selection procedures and summarizes the papers reprinted in Campos, Ericsson, and Hendry (2005).
Several studies have recently investigated how well general-to-specific modeling does in model selection. In particular, the pathbreaking Monte Carlo study by Hoover and Perez (1999a) [Volume II: Chapter 22] reconsiders earlier Monte Carlo results by Lovell (1983) [Volume II: Chapter 20]. Hoover and Perez start with series on 20 macroeconomic variables ($x_t$) and generate a new variable (denoted $y_t$) as a function of zero to five of the $x_t$'s and an error. Hoover and Perez then regress $y_t$ on all twenty $x_t$'s plus lags thereof, and they let their general-to-specific algorithm simplify that general unrestricted model until it finds an irreducible, congruent, encompassing result. Their algorithm checks up to 10 different simplification paths, testing for mis-specification at every step. It then collects the models from those different simplification paths and selects the one that variance-dominates the others. By following many paths, the algorithm guards against choosing a misleading route; and it delivers an undominated congruent model.
Hoover and Perez (1999a)* stimulated a flurry of activity on the methodology of general-to-specific modeling. Hendry and Krolzig (1999) [Volume II: Chapter 23] improved on Hoover and Perez's algorithm. More recently, Hoover and Perez (2004) extended their investigations to cross-section datasets and found equally impressive performance by general-to-specific modeling in model selection. Hendry and Krolzig (2003) also reported a wide range of simulation studies used to calibrate their algorithm.
To see why general-to-specific modeling does well, the current section examines this approach from several standpoints. Section 3.1 summarizes generic difficulties in model selection. Section 3.2 reviews the debates about general-to-specific modeling prior to the publication of Hoover and Perez (1999a)*. Section 3.3 develops the analytics for several procedures in general-to-specific modeling, drawing on Hendry (2000), from which Section 3.4 derives various costs of search. Section 3.5 reports recent simulation evidence on the properties of general-to-specific modeling.
There are four potential basic mistakes in selecting a model from data evidence:
The first mistake is outside the purview of selection issues, although mis-specification testing of the general unrestricted model can alert the investigator to potential problems with that model. When the general unrestricted model is congruent, the fourth mistake can be avoided by ensuring that all simplifications are valid. That leaves the second and third mistakes as the two central problems for model selection. In what follows, the resulting costs of selection are typically considered for situations in which the general unrestricted model is much larger than the required model.
The conventional statistical analysis of repeated testing provides a pessimistic background for model selection. Every test has a nonzero null rejection frequency (or size, if independent of nuisance parameters), so type I errors accumulate across tests. Setting a tight significance level for tests would counteract that phenomenon, but would also induce low power to detect the influences that really matter. Thus, the conventional view concludes that incorrect decisions in model search must be commonplace.
As a contrasting view, in a progressive research strategy that utilizes general-to-specific modeling, evidence accumulates over time against irrelevant influences and in favor of relevant influences. White (1990) [Volume I: Chapter 23] showed that, with sufficiently rigorous testing, the selected model will converge to the DGP, so overfitting and mis-specification problems are primarily finite sample issues. Sections 3.3, 3.4, and 3.5 show that general-to-specific modeling has relatively low search costs, with sizes close to their nominal levels and powers near those attainable when commencing from the correct specification.
In addition to the four potential mistakes above, critics of general-to-specific methods have voiced concerns about data-based model selection, measurement without theory, pre-test biases, ignored selection effects, data mining, repeated testing, lack of identification, and the potential path dependence of any selected model. This subsection highlights central references in these debates and considers various responses to the critics.
Keynes (1939, 1940), Tinbergen (1940), Koopmans (1947), Judge and Bock (1978), Leamer (1978), Lovell (1983)*, Pagan (1987) [Volume I: Chapter 5], Hendry, Leamer, and Poirier (1990), and Faust and Whiteman (1997) inter alia critique general-to-specific methods in several key exchanges in the literature. Although Hendry and Krolzig (2001) address numerous concerns raised about general-to-specific modeling, it has taken a considerable time to develop the conceptual framework within which that rebuttal is possible. For instance, when Hendry (1980) was written, it was unclear how to counter earlier attacks, such as those by Keynes on Tinbergen (1939) or by Koopmans on Burns and Mitchell (1946). These debates set the scene for doubting any econometric analysis that failed to commence from a pre-specified model. Leamer (1978) also worried about the effects of data mining, which he defined as "the data-dependent process of selecting a statistical model" (p. 1). After Lovell (1983)* found low success rates for selecting a small relation hidden in a large database, an adverse view of data-based model selection became entrenched in the profession. This view was reinforced by two additional results: the apparent coefficient bias that arises when variables are selected by significance tests, and the claimed under-estimation in reported coefficient standard errors that arises from treating a selected model as if it were certain.
Consequently, many empirical econometric studies have tried to appear to commence from pre-specified models, whether they did so or not. Econometric evidence became theory dependent: empirical evidence provided little value added, and it was likely to be discarded when fashions in theory changed. Confusion over the role of econometric evidence was so great that (e.g.) Summers (1991) failed to notice that theory dependence was a source of the problem, not the use of "sophisticated" econometrics.
Keynes and others claimed that valid econometric analysis must be based on models pre-specified by economic theory. The fallacy in that approach is that theoretical models are themselves incomplete and incorrect. Similarly, Koopmans inter alia relied on the (unstated) assumption that only one form of economic theory was applicable, that it was correct, and that it was immutable; see Hendry and Morgan (1995). That said, it is actually not necessary (or even possible) to know everything in advance when commencing statistical work in economics. If it were necessary, no one would ever discover anything not already known! Partial explanations are likewise valuable empirically, as the development of the natural sciences has demonstrated. Progressive research can discover invariant features of reality without prior knowledge of the whole; see Hendry (1995b).
A critic might well grant the force of such arguments, yet remain skeptical that data mining could produce anything useful, thereby undermining an evidence-based approach. However, as Gilbert (1986)* discusses, pejorative forms of data mining can be discovered from conflicting evidence or by rival models that cannot be encompassed. Stringent and critical model evaluation can detect and avoid these forms of data mining. See Hendry (1995a, Ch. 15) and Campos and Ericsson (1999) [Volume II: Chapter 33] for further discussion of data mining, including other less pejorative senses. Even when an undominated congruent model is data-based, it can provide a good approximation to the local DGP; and it can help reveal pejorative data mining.
At first blush, the theory of repeated testing appears to wreak havoc with general-to-specific model selection, so this subsection addresses that issue. In fact, the probabilities of deleting irrelevant variables are relatively high. The greater difficulty is retaining relevant effects, even if the analysis commences from the "correct" model, i.e., the local DGP. This subsection re-examines the probabilities associated with deleting irrelevant variables and retaining relevant variables when adopting a general-to-specific modeling strategy.
Deleting irrelevant variables. To illustrate the probabilities associated with deleting irrelevant variables in general-to-specific model selection, consider a classical regression model in which $n$ regressors are irrelevant, i.e., have regression parameters equal to zero. Under that null hypothesis, the probability $p_\alpha$ that at least one of $n$ corresponding t-tests rejects at the $100\alpha\%$ level is one minus the probability that none of those $n$ t-tests rejects:
$$p_\alpha = 1 - P(|t_i| < c_\alpha \ \text{for all } i = 1, \ldots, n) = 1 - (1 - \alpha)^n \tag{18}$$
where $c_\alpha$ is the critical value associated with an $\alpha$ rejection frequency, and the second equality treats the $n$ tests as independent. For example, when $n = 40$ tests of correct null hypotheses are conducted at $\alpha = 0.05$ (say), then $p_{0.05} \approx 0.87$ from equation (18). With approximately 87% probability, at least one t-test spuriously rejects at the 5% level. Such a high probability is usually the focus of worry with repeated testing.
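The arithmetic behind equation (18) can be checked with a short sketch. It assumes that the tests are independent, so that the probability of at least one rejection among n true null hypotheses is one minus (1 - alpha) raised to the power n; the function name is a hypothetical choice for this illustration.

```python
def prob_at_least_one_rejection(n_tests, alpha):
    """Probability that at least one of n independent tests of true nulls rejects."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Forty irrelevant variables, as in the text, at three significance levels.
for alpha in (0.05, 0.01, 0.005):
    print(alpha, round(prob_at_least_one_rejection(40, alpha), 3))
```

Tightening the significance level from 5% to 0.5% lowers the probability of at least one spurious rejection among forty tests from roughly 0.87 to roughly 0.18.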
One solution is to use larger critical values. The t-distribution is thin-tailed, leading Sargan (2001a) to note how difficult it is to obtain spurious t-values exceeding three in absolute value; see also Sargan (2001b). A critical value of three corresponds to (approximately) the 0.5% critical value for a t-test with a moderate number of degrees of freedom, for which the probability of at least one false rejection is $p_{0.005} \approx 0.18$ when $n = 40$ irrelevant variables are tested. While an 18% chance of a false rejection may still be high from some perspectives, this situation has other interpretations, as is now shown.
To better grasp the issues involved in false rejection, reconsider the probabilities of irrelevant variables being significant on a t-test at significance level $\alpha$ when there are $n$ such variables. That probability distribution $\{p_j\}$ is given by the $n + 1$ terms of the binomial expansion of $(\alpha + (1 - \alpha))^n$, namely:
$$p_j = \frac{n!}{j!\,(n-j)!}\, \alpha^j (1 - \alpha)^{n-j}, \qquad j = 0, \ldots, n \tag{19}$$
where $p_j$ is the probability that exactly $j$ of the $n$ variables are significant. Thus, the probability of all $n$ coefficients being significant is $\alpha^n$, the probability of $n - 1$ being significant is $n \alpha^{n-1} (1 - \alpha)$, and (as used in equation (18)) the probability of none being significant is $(1 - \alpha)^n$. Using equation (19), the average number of variables found significant by chance is:
$$k = \sum_{j=0}^{n} j\, p_j = n \alpha \tag{20}$$
If $\alpha = 0.05$ and $n = 40$, then $k$ equals two, which explains the high probability of at least one spurious rejection. Even so, 38 out of the 40 variables will be deleted (on average) when using individual t-tests at this relatively loose significance level. Moreover, from equation (20), $k$ falls to 0.4 for $\alpha = 0.01$ and to 0.2 for $\alpha = 0.005$. That is, when using a critical value of three, one variable out of forty is retained just once in five attempts, on average. That value of $k$ explains why the probability of at least one false rejection is $p_{0.005} \approx 0.18$. It also reveals that such a rejection rate does not correspond to the usual interpretation of size but to the rare occurrence of a selection error, despite a large number of irrelevant variables. Even if repeated t-tests are used, few spurious variables will typically be retained for small $\alpha$, such as the values 0.01 and 0.005 considered here. Unfortunately, small values of $\alpha$ often imply difficulty in detecting relevant variables, leading to the issue of retention.
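The binomial reasoning in equations (19) and (20) can be verified numerically. The sketch below assumes independent tests, computes the probability of exactly j spurious rejections among n irrelevant variables, and confirms that the implied average number of rejections equals n times alpha. The function names are hypothetical.

```python
from math import comb

def rejection_distribution(n_tests, alpha):
    """Binomial probabilities of exactly j spurious rejections, for j = 0, ..., n."""
    return [
        comb(n_tests, j) * alpha ** j * (1.0 - alpha) ** (n_tests - j)
        for j in range(n_tests + 1)
    ]

def expected_rejections(n_tests, alpha):
    """Average number of irrelevant variables found significant by chance."""
    probs = rejection_distribution(n_tests, alpha)
    return sum(j * p for j, p in enumerate(probs))

# Forty irrelevant variables, as in the text.
for alpha in (0.05, 0.01, 0.005):
    probs = rejection_distribution(40, alpha)
    print(alpha, round(expected_rejections(40, alpha), 6), round(1.0 - probs[0], 3))
```

The last printed column is one minus the probability of no rejections, which ties the binomial distribution back to the probability of at least one false rejection in equation (18).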
Retaining relevant variables. Unless the relevant variables are highly significant in the population, retaining them appears inherently difficult, whether or not general-to-specific modeling is used. The difficulty in retaining relevant variables can be illustrated in a framework similar to the one used above for calculating the probabilities of deleting irrelevant variables, but with the t-ratios having noncentral distributions.
Consider drawing from a t-distribution with $T$ degrees of freedom and a noncentrality parameter $\psi$, which is approximately the t-statistic's mean. The null hypothesis $H_0$ is $\psi = 0$, the alternative hypothesis $H_1$ is $\psi \neq 0$, and assume that the alternative hypothesis is true. For a critical value $c_\alpha$, the probability $P(|t| \geq c_\alpha \mid H_0) = \alpha$ determines the size, whereas power is given by:
$$P(|t| \geq c_\alpha \mid H_1) \tag{21}$$
Suppose that $\psi = 2$, which corresponds to an expected t-value of approximately two in the population. That is, $E(t) \approx 2$ for a fixed $T$. Because the t-distribution is nearly symmetric around its mean, the probability of that t-ratio exceeding two is about 50%:
$$P(t \geq 2 \mid E(t) = 2) \approx 0.50 \tag{22}$$
For even relatively small $T$, equation (22) is the power in equation (21) when calculated for the alternative hypothesis $\psi = 2$, noting that then $c_\alpha = 2$ roughly corresponds to $\alpha = 0.05$ and $P(t \leq -2 \mid E(t) = 2)$ is approximately zero. The probability of rejecting the null hypothesis that $\psi = 0$ is thus only 50% under the alternative hypothesis that $\psi = 2$, even though the local DGP is known and only a single variable is involved. For three such variables, the probability of detecting all three is:
$$P(|t_i| \geq 2 \ \text{for } i = 1, 2, 3 \mid E(t_i) = 2) \approx 0.5^3 = 0.125 \tag{23}$$
where the estimated model is the local DGP and includes no additional irrelevant variables, and the included variables are orthogonal. From equation (23), all three variables are deemed significant only about one-eighth of the time (12.5%)--the same probability that no variables are retained. This low power represents an inexorable cost of inference from the given evidence, even when commencing from a correctly specified model.
This difficulty with power is exacerbated if the critical value is increased (e.g.) to offset problems of "overfitting". For example, $P(|t| \geq 2.6 \mid E(t) = 2) \approx 0.27$ for even large $T$, implying only about a 2% chance of keeping all three variables. Despite knowing the local DGP, such a variable will only occasionally be retained using t-tests if the null hypothesis is tested. If there are many such variables, retaining all of them is highly unlikely.
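The retention probabilities discussed above can be approximated with a short calculation. This sketch replaces the noncentral t-distribution with a normal distribution that has mean equal to the noncentrality parameter and unit variance, which is a large-sample simplification adopted here for illustration rather than the paper's own calculation. It also treats the three t-ratios as independent, consistent with the orthogonal regressors assumed in the text. The function name is hypothetical.

```python
from statistics import NormalDist

def approx_power(noncentrality, critical_value):
    """Two-sided rejection probability when the t-ratio is treated as N(noncentrality, 1)."""
    dist = NormalDist(mu=noncentrality, sigma=1.0)
    upper = 1.0 - dist.cdf(critical_value)   # P(t >= c)
    lower = dist.cdf(-critical_value)        # P(t <= -c), close to zero here
    return upper + lower

# A population t-value of two, evaluated at three critical values.
for c in (2.0, 2.6, 3.0):
    power = approx_power(2.0, c)
    print(c, round(power, 3), round(power ** 3, 3))
```

The second printed column is the probability of retaining a single relevant variable, and the third is the probability of retaining all three. Raising the critical value from 2 to 2.6 lowers the joint retention probability from about one-eighth to about 2%.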
These calculations paint a potentially gloomy picture for data-based model selection. However, the situation is more promising than it might appear. As Section 3.4 explains, general-to-specific modeling can have surprisingly small search costs--i.e., the additional costs that arise by commencing from a general unrestricted model that nests the local DGP, rather than by commencing from the local DGP and knowing that it is the local DGP. These search costs are typically positive: it is difficult to improve on model selection if the local DGP is known. However, multiple-variable procedures such as F-tests can improve on single-variable procedures such as the t-tests above.
The costs associated with model selection can be usefully separated into the costs of inference and the costs of search, as the following example illustrates.
The costs of inference are those costs associated with inference about variables in a model when the model is the local DGP but the modeler does not know that. Consider a local DGP with $k$ variables (all relevant), and denote those variables by the set $S_{rel}$ ($rel$ for relevant). Let $p_{\alpha,i}^{dgp}$ denote the probability of retaining the $i$th variable at significance level $\alpha$ when commencing from the local DGP as the initial specification. The probability of dropping the $i$th relevant variable is $1 - p_{\alpha,i}^{dgp}$, so a measure of the total costs of inference in this situation is:
$$\sum_{i \in S_{rel}} \left(1 - p_{\alpha,i}^{dgp}\right) \tag{24}$$
While equation (24) is one measure of the costs of inference, other measures are feasible as well, such as unity minus the probability of selecting the local DGP; cf. equation (23).
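The costs of search are those costs associated with inference about variables in a general unrestricted model that nests the local DGP, relative to inferences in the local DGP itself. Let $p_{\alpha,i}^{gum}$ denote the probability of retaining the $i$th variable when commencing from a general unrestricted model with $n$ variables, applying the same selection tests and significance levels as before, and let $p_{\alpha,i}^{dgp}$ denote the corresponding probability when commencing from the local DGP, which has $k$ variables. The cost of search is $p_{\alpha,i}^{dgp} - p_{\alpha,i}^{gum}$ for a relevant variable ($i \in S_{rel}$), and it is $p_{\alpha,i}^{gum}$ for an irrelevant variable ($i \in S_0$), where $S_0$ is the set of $n - k$ irrelevant variables in the general unrestricted model. By construction, the local DGP has no irrelevant variables, so the whole cost of keeping irrelevant variables in model selection is attributed to search. Thus, a measure of the pure search costs is:
$$\sum_{i \in S_{rel}} \left(p_{\alpha,i}^{dgp} - p_{\alpha,i}^{gum}\right) + \sum_{i \in S_0} p_{\alpha,i}^{gum} \tag{25}$$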
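The two cost measures in equations (24) and (25) amount to simple sums over retention probabilities, as the following sketch illustrates. All retention probabilities, variable names, and function names below are hypothetical values chosen for illustration; in practice such probabilities would come from analysis or simulation.

```python
def inference_cost(p_dgp):
    """Sum over relevant variables of the probability of dropping each one."""
    return sum(1.0 - p for p in p_dgp.values())

def search_cost(p_dgp, p_gum_relevant, p_gum_irrelevant):
    """Lost retention of relevant variables plus retention of irrelevant variables."""
    lost = sum(p_dgp[name] - p_gum_relevant[name] for name in p_dgp)
    spurious = sum(p_gum_irrelevant.values())
    return lost + spurious

# Hypothetical retention probabilities (assumptions for illustration only).
p_dgp = {"x1": 0.50, "x2": 0.90}                       # starting from the local DGP
p_gum_relevant = {"x1": 0.45, "x2": 0.88}              # same variables, starting from the GUM
p_gum_irrelevant = {"x3": 0.05, "x4": 0.05, "x5": 0.05}

print(inference_cost(p_dgp))
print(search_cost(p_dgp, p_gum_relevant, p_gum_irrelevant))
```

With these made-up numbers, most of the total cost comes from inference rather than search, which is the pattern that the text associates with general-to-specific modeling.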
In principle, the pure search costs given in equation (25) could be negative if the algorithm for selection from a general unrestricted model were different from that for testing in a local DGP. The complexities of multiple-variable problems typically preclude analytical answers, so Section 3.5 reports some simulation evidence. Before e