
General-to-Specific Modeling: An Overview and Selected Bibliography

Julia Campos, Neil R. Ericsson, and David F. Hendry*

NOTE: International Finance Discussion Papers are preliminary materials circulated to stimulate discussion and critical comment. References in publications to International Finance Discussion Papers (other than an acknowledgment that the writer has had access to unpublished material) should be cleared with the author or authors. Recent IFDPs are available on the Web at http://www.federalreserve.gov/pubs/ifdp/. This paper can be downloaded without charge from the Social Science Research Network electronic library at http://www.ssrn.com/.


Abstract:

This paper discusses the econometric methodology of general-to-specific modeling, in which the modeler simplifies an initially general model that adequately characterizes the empirical evidence within his or her theoretical framework. Central aspects of this approach include the theory of reduction, dynamic specification, model selection procedures, model selection criteria, model comparison, encompassing, computer automation, and empirical implementation. This paper thus reviews the theory of reduction, summarizes the approach of general-to-specific modeling, and discusses the econometrics of model selection, noting that general-to-specific modeling is the practical embodiment of reduction. This paper then summarizes fifty-seven articles key to the development of general-to-specific modeling.

Keywords: cointegration, conditional models, data mining, diagnostic testing, dynamic specification, econometric methodology, encompassing, equilibrium correction models, error correction models, exogeneity, general-to-specific modeling, model comparison, model design, model evaluation, model selection, non-nested hypotheses, PcGets, PcGive, reduction, specific-to-general modeling

JEL classification: C1, C5



1  Motivation and Overview

This paper focuses on a central method for selecting useful empirical models, called general-to-specific modeling. In this method, the modeler simplifies an initially general model that adequately characterizes the empirical evidence within his or her theoretical framework. While the methodological, statistical, and empirical foundations for general-to-specific modeling have been laid down over the last several decades, a burst of activity has occurred in the last half-dozen years, stimulated in fair part by Hoover and Perez's (1999a) development and analysis of a computer algorithm for general-to-specific modeling. The papers discussed herein detail how the subject has advanced to its present stage of success and convey the promise of these developments for future empirical research. The remainder of this overview motivates the interest in general-to-specific modeling and summarizes the structure of the subsequent sections (Sections 2-5).

Economists have long sought to develop quantitative models of economic behavior by blending economic theory with data evidence. The task has proved an arduous one because of the nature of the economy modeled, the economic theory, and the data evidence. The economy is a complicated, dynamic, nonlinear, simultaneous, high-dimensional, and evolving entity; social systems alter over time; laws change; and technological innovations occur. Thus, the target is not only a moving one; it behaves in a distinctly nonstationary manner, both evolving over time and being subject to sudden and unanticipated shifts. Economic theories are highly abstract and simplified; and they also change over time, with conflicting rival explanations sometimes coexisting. The data evidence is tarnished: economic magnitudes are inaccurately measured and subject to substantive revisions, and many important variables are not even observable. The data themselves are often time series where samples are short, highly aggregated, heterogeneous, time-dependent, and inter-dependent. Econometric modeling of economic time series has nevertheless strived to discover sustainable and interpretable relationships between observed economic variables. This paper focuses on general-to-specific modeling, in which the modeler simplifies an initially general model that adequately characterizes the empirical evidence within his or her theoretical framework. This method has proved useful in practice for selecting empirical economic models.

The difficulties of empirical modeling are well reflected in the slowness of empirical progress, providing plenty of ammunition for critics. However, part of the problem may be internal to the discipline, deriving from inappropriate modeling methods. The "conventional" approach insists on a complete theoretical model of the phenomena of interest prior to data analysis, leaving the empirical evidence as little more than quantitative clothing. Unfortunately, the complexity and nonstationarity of economies make it improbable that anyone--however brilliant--could deduce a priori the multitude of quantitative equations characterizing the behavior of millions of disparate and competing agents. Without a radical change in the discipline's methodology, empirical progress seems doomed to remain slow.

The situation is not as bleak as just described, for two reasons. First, the accumulation of knowledge is progressive, implying that one does not need to know all the answers at the start. Otherwise, no science could have advanced. Although the best empirical model at any given time may be supplanted later, it can provide a springboard for further discovery. Data-based model selection need not raise serious concerns: this implication is established below and is demonstrated by the actual behavior of model selection algorithms.

Second, inconsistencies between the implications of any conjectured model and the observed data are often easy to detect. The ease of model rejection worries some economists, yet it is also a powerful advantage by helping sort out which models are empirically adequate and which are not. Constructive progress may still be difficult because "we don't know what we don't know, and so we cannot know how best to find out what we don't know". The dichotomy between model destruction and model construction is an old one in the philosophy of science. While critical evaluation of empirical evidence is a destructive use of econometrics, it can also establish a legitimate basis for empirical models.

To undertake empirical modeling, one must begin by assuming a probability structure for the data, which is tantamount to conjecturing the data generation process (DGP). Because the economic mechanism is itself unknown, the relevant probability structure is also unknown, so one must proceed iteratively: conjecture the DGP, develop the associated probability theory, use that theory for modeling empirical evidence, and revise the starting point when the results do not match consistently. The development of econometric theory highlights this iterative progression: from stationarity assumptions, through integrated-cointegrated systems, to general nonstationary mixing processes, as empirical evidence revealed the inadequacy of each earlier step. Further developments will undoubtedly occur, leading to a still more useful foundation for empirical modeling. See Hendry (1995a) for an extensive treatment of progressive research strategies.

Having postulated a reasonable probability basis for the DGP, including the procedures used for data measurement and its collection, the next issue concerns what classes of model might be useful. The theory of reduction (discussed in Section 2) explains how empirical models arise and what their status is, noting that they are not facsimiles of the DGP. Specifically, empirical models describe the behavior of a relatively small set of variables--often from one to several hundred--and never the many millions of distinct variables present in most economies.

A key concept here is that of the local DGP, which is the probability mechanism in the space of those variables under analysis. The theory of reduction shows how the local DGP arises as a simplification of a vastly more general DGP involving millions of variables. The usefulness of a given local DGP depends on it capturing sustainable links, which in turn depends partly on the theoretical framework and partly on data accuracy. An econometric model cannot do better than capture the salient characteristics of its corresponding local DGP. The extent to which the model does capture those characteristics depends both on its specification at least embedding the local DGP and on the goodness of its selection.

There are thus two distinct conceptual steps in modeling, albeit ones closely related in practice. First, specify a useful information set for a "well-behaved" local DGP. Second, select a "good" empirical model of that local DGP.

A viable methodology for empirical modeling is an integral component of achieving the second step. Despite the controversy surrounding every aspect of econometric methodology, the "LSE" (or London School of Economics) approach has emerged as a leading methodology for empirical modeling; see Hendry (1993) for an overview. One of the LSE approach's main tenets is general-to-specific modeling, sometimes abbreviated as Gets. In general-to-specific modeling, empirical analysis starts with a general statistical model that captures the essential characteristics of the underlying dataset, i.e., that general model is congruent. Then, that general model is reduced in complexity by eliminating statistically insignificant variables, checking the validity of the reductions at every stage to ensure congruence of the finally selected model.

The papers discussed below articulate many reasons for adopting a general-to-specific approach. First amongst these reasons is that general-to-specific modeling implements the theory of reduction in an empirical context. Section 2 summarizes the theory of reduction, and Section 3 discusses general-to-specific modeling as the empirical analogue of reduction.

General-to-specific modeling also has excellent characteristics for model selection, as documented in Monte Carlo studies of automatic general-to-specific modeling algorithms. Hoover and Perez (1999a) were the first to evaluate the performance of general-to-specific modeling as a general approach to econometric model building. To analyze the general-to-specific approach systematically, Hoover and Perez mechanized the decisions in general-to-specific modeling by coding them in a computer algorithm. In doing so, Hoover and Perez also made important advances in practical modeling.

To appreciate Hoover and Perez's contributions to general-to-specific modeling, consider the most basic steps that such an algorithm follows.

  1. Ascertain that the general statistical model is congruent.

  2. Eliminate a variable (or variables) that satisfies the selection (i.e., simplification) criteria.

  3. Check that the simplified model remains congruent.

  4. Continue steps 2 and 3 until none of the remaining variables can be eliminated.
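
The four steps above can be sketched in a few lines of code. The following is a minimal, single-path illustration only, not Hoover and Perez's algorithm or PcGets: it follows one simplification path, uses a fixed critical value for t-ratios, and replaces the battery of mis-specification tests with a single toy check on first-order residual autocorrelation. The names `gets`, `congruent`, `simulate`, and `t_crit`, as well as the simulated data generation process, are assumptions introduced purely for illustration.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(a)
    m = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(k):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[k:] for row in m]


def ols(y, X):
    """Least squares of y on the columns of X: (coefficients, t-ratios, residuals)."""
    n, k = len(y), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    t = [b[i] / math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return b, t, resid


def congruent(resid):
    """Toy stand-in for diagnostic testing: first-order residual autocorrelation
    must lie within roughly two standard errors of zero."""
    n = len(resid)
    r1 = sum(resid[t] * resid[t - 1] for t in range(1, n)) / sum(e * e for e in resid)
    return abs(r1) < 2.0 / math.sqrt(n)


def gets(y, X, names, t_crit=2.0, diagnostic=congruent):
    """Single-path general-to-specific search; returns the names of retained regressors."""
    def cols(keep):
        return [[row[j] for j in keep] for row in X]

    keep = list(range(len(names)))
    # Step 1: the general model must pass the diagnostic check.
    if not diagnostic(ols(y, cols(keep))[2]):
        raise ValueError("general model fails the diagnostic check")
    while len(keep) > 1:  # this sketch never drops the last remaining regressor
        t = ols(y, cols(keep))[1]
        # Step 2: consider candidates from least to most significant.
        for i in sorted(range(len(keep)), key=lambda i: abs(t[i])):
            if abs(t[i]) >= t_crit:
                return [names[j] for j in keep]  # Step 4: nothing left to eliminate
            trial = keep[:i] + keep[i + 1:]
            # Step 3: accept the reduction only if the simplified model stays congruent.
            if diagnostic(ols(y, cols(trial))[2]):
                keep = trial
                break
        else:
            # Every insignificant candidate failed the diagnostic when removed.
            return [names[j] for j in keep]
    return [names[j] for j in keep]


def simulate(n=200, seed=0):
    """Hypothetical DGP for illustration: y depends on x1 only; x2 and x3 are irrelevant."""
    rng = random.Random(seed)
    X = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [1.0 + 2.0 * row[1] + rng.gauss(0, 1) for row in X]
    return y, X
```

Calling `gets(*simulate(), ["const", "x1", "x2", "x3"])` runs the search on the simulated data. Because the sketch follows a single path, its outcome can depend on the order of elimination, which is precisely the concern taken up in the next paragraph.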

Pagan (1987) and other critics have argued that the outcome of general-to-specific modeling may depend on the simplification path chosen--that is, on the order in which variables are eliminated and on the data transformations adopted--and so the selected model might vary with the investigator. Many reduction paths certainly could be considered from an initial general model.

Hoover and Perez (1999a) turned this potential drawback into a virtue by exploring many feasible paths and seeing which models result. When searches do lead to different model selections, encompassing tests can be used to discriminate between these models, with only the surviving (possibly non-nested) specifications retained. If multiple models are found that are both congruent and encompassing, a new general model can be formed from their union, and the simplification process is then re-applied. If that union model re-occurs, a final selection among the competing models can be made by using (say) information criteria. Otherwise, a unique, congruent, encompassing reduction has been located.

Hoover and Perez (1999a) re-analyzed the Monte Carlo experiments in Lovell (1983) and found that their (Hoover and Perez's) general-to-specific algorithm performed much better than any method investigated by Lovell. Hendry and Krolzig (1999) demonstrated improvements on Hoover and Perez's general-to-specific algorithm, and Hoover and Perez (2004) showed how their algorithm could be successfully modified for analyzing cross-section data; see also Hendry and Krolzig (2004).

Automating the general-to-specific approach throws light on several methodological issues and prompts many new ideas, several of which are discussed in Section 3.1. Three common concerns with the general-to-specific approach are repeated testing, recursive procedures, and selection of variables.

First, critics such as Ed Leamer have worried about the interpretation of mis-specification tests that are repeatedly applied during simplification; see Hendry, Leamer, and Poirier (1990). Automation of the general-to-specific approach reveals two distinct roles for mis-specification tests: their initial application to test the congruence of the general model, and their subsequent use to guard against invalid reductions during simplification. The mis-specification tests are thus applied only once as statistical mis-specification tests--to the general model--so no doubts of interpretation arise. Their subsequent role in diagnostic checking during reduction does not alter their statistical properties as applied to the general model.

Second, recursive procedures such as subsample estimation play two roles. Recursive procedures help investigate parameter constancy, which is essential for congruence and for any practical use of the resulting model. Recursive procedures also assist in checking that estimated effects are significant over subsamples, and not just over the whole sample. Specifically, if a variable is only occasionally significant over various subsamples, that occasional significance may reflect chance rather than substance, especially if the variable's significance does not increase as the sample size grows. Eliminating such occasionally significant variables may be useful for parsimony.

Third, model selection procedures that use diagnostic testing and are based on multiple criteria have eluded most attempts at theoretical analysis. However, computer implementation of the model selection process has allowed evaluation of such strategies by Monte Carlo simulation. Krolzig and Hendry (2001) and Hendry and Krolzig (2003) present the results of many Monte Carlo experiments to investigate whether the model selection process works well or fails badly for time series processes. The implications for the calibration of their econometrics computer program PcGets are also noted below; see Hendry and Krolzig (2001).

The remainder of this paper is organized into four sections. Section 2 reviews the theory of reduction as a major background component to general-to-specific modeling. Section 3 reviews the basis for general-to-specific modeling and discusses the econometrics of model selection, noting that general-to-specific modeling is the practical embodiment of reduction. Section 4 then summarizes, paper by paper, fifty-seven papers key to the development of general-to-specific modeling. These papers are reprinted in Campos, Ericsson, and Hendry (2005) and are grouped into nine parts across the two volumes of Campos, Ericsson, and Hendry (2005): see the Appendix below for a listing of these papers. This nine-part division is also paralleled by the subsections in Section 4, as follows.

Volume I
Part I.  Introduction to the methodology (Section 4.1);
Part II.  Theory of reduction (Section 4.2);
Part III.   Dynamic specification (Section 4.3);
Part IV.   Model selection procedures (Section 4.4);

Volume II
Part I.   Model selection criteria (Section 4.5);
Part II.   Model comparison (Section 4.6);
Part III.   Encompassing (Section 4.7);
Part IV.   Computer automation (Section 4.8); and
Part V.   Empirical applications (Section 4.9).

Section 5 concludes.

For ease of reference, when an article reprinted in Campos, Ericsson, and Hendry (2005) is initially cited in Sections 2-3, text in square brackets immediately following the citation indicates the volume and chapter where the reprint appears: for instance, "Hoover and Perez (1999a) [Volume II: Chapter 22]". Thereafter--except in Section 4's actual summary of the article--citation to an article reprinted in Campos, Ericsson, and Hendry (2005) is indicated by a trailing asterisk, as in "Hoover and Perez (1999a)*". On a separate issue, the mathematical notation in Section 2 differs somewhat from that used in Section 3, owing to the different strands of literature on which those two sections are based. The mathematical notation in Section 4 generally follows the notation in the article being described.

2  Theory of Reduction

This section examines the relationship between the data generation process and an empirical model. Section 2.1 defines what an empirical model is, Section 2.2 defines what the data generation process is, and Section 2.3 shows how an empirical model is obtained as a reduction or simplification of the data generation process. Sections 2.4, 2.5, and 2.6 thus examine how models can be compared, how to measure the information losses associated with a model, and what information is used in evaluating models. Section 2.7 turns to explicit model design, which recognizes and utilizes the relationship between empirical models and the data generation process.

The exposition in this section draws on Hendry (1995a). Hendry (1987) [Volume I: Chapter 8] presents a non-technical exposition of the theory of reduction. Important contributions to its development include Florens and Mouchart (1980, 1985), Hendry and Richard (1982) [Volume I: Chapter 9], Engle, Hendry, and Richard (1983) [Volume I: Chapter 13], and Florens, Mouchart, and Rolin (1990).

Before proceeding to the theory of reduction, a few comments are in order. First, data are generated from an unknown high-dimensional probability distribution (the DGP), which is indexed by a set of parameters. Some functions of those parameters are of interest to an investigator. A central aim of modeling is to determine the numerical values of those parameters, which can be used for testing theories, forecasting, conducting economic policy, and learning about the economy.

Second, the DGP itself involves far too many parameters to estimate on available data, so reductions of the DGP are essential to produce an operational model. The key feature of every reduction is whether or not it involves a loss of information about the parameters of interest. Such a loss may be total (as when the parameters of interest no longer enter the model) or partial (as when some parameters can be gleaned but others not, or when some cease to be constant), or it may just lower the statistical efficiency with which the parameters may be estimated. Logically, reductions lead from the original DGP--which involves the universe of variables--to the distribution of a small subset thereof--which is the local DGP.

Third, an empirical model of that subset of variables then approximates their local DGP. Estimation of the parameters of that model comes last, logically speaking. That model's estimation has been extensively analyzed in econometrics for many postulated types of model. A variety of approaches are still extant, such as "classical" and "Bayesian", and with varying degrees of emphasis on least squares, instrumental variables, maximum likelihood, method of moments, and so on. Many of these approaches are minor variants on the basic likelihood score equations; see Hendry (1976).

Fourth, an empirical model may include variables that do not actually enter the local DGP, in which case a more parsimonious representation can be selected from the sample evidence. Such a data-based simplification is a major focus of this paper. As noted in the previous section, model selection procedures have been debated extensively, with little professional accord. The discussion below shows how the general-to-specific approach is the analogue in modeling of reduction in theory. Critiques of the general-to-specific approach and the practical arguments in its favor are also considered.

2.1  Theory Models and Empirical Models

Both economic theory and data are important in empirical economic modeling, so this subsection compares theory models and empirical models. Theory models are freely created, whereas empirical models are derived and not autonomous. That distinction in a model's status has direct implications for the roles that sequential reduction and mis-specification testing play in empirical modeling. At a more general level, theory models play a key part in interpreting economic data, but theory models in themselves are not sufficient for doing so in a reliable fashion. To illustrate the distinction between theory models and empirical models, consider a standard linear model that is subsequently estimated by least squares.

From the outset, it is important to distinguish between the economic theory-model and the empirical model that the theory model serves to interpret. A theory model is freely created by the human imagination. For instance, economic theory might specify a relationship between two variables $ \mathsf{y}_{t}$ and $ \mathsf{z}_{t}$:

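$\displaystyle \underset{\text{[output]}}{\mathsf{y}_{t}}\;=\;\underset{\text{[transformed input]}}{\mathsf{b}^{\prime}\mathsf{z}_{t}}\;+\;\underset{\text{[perturbation]}}{\mathsf{e}_{t}}$ (1)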

where the coefficient vector $ \mathsf{b}$ is the partial derivative $ \partial\mathsf{y}_{t}/\partial\mathsf{z}_{t}$, the variables $ \mathsf{y}_{t}$ and $ \mathsf{z}_{t}$ (in sans serif font) denote the economic theoretic variables at time $ t$, and the error $ \mathsf{e}_{t}$ is an independent perturbation at time $ t$. The theory model is defined by how it is constructed. In equation (1), $ \mathsf{y}_{t}$ is constructed from $ \mathsf{z}_{t}$ and the error $ \mathsf{e}_{t}$, and the relationship between $ \mathsf{y}_{t}$ and $ \mathsf{z}_{t}$ can be treated as if it were a causal mechanism.

A corresponding empirical model is anything but freely created, with the properties of the empirical model being determined by reductions of the DGP. The mapping between variables and disturbances also differs. To demonstrate, consider the following empirical model:

$\displaystyle \underset{\text{[observed variable of interest]}}{y_{t}}\;=\;\underset{\text{[explanation]}}{b^{\prime}z_{t}}\;+\;\underset{\text{[remainder]}}{e_{t}}$ (2)

where $ y_{t}$ and $ z_{t}$ (in italic) are economic data at time $ t$, the conditional expectation  $ \mathcal{E}(e_{t}\vert z_{t})$ is zero (by assumption), and $ \mathcal{E}(\cdot)$ is the expectations operator. The orthogonality assumption that $ \mathcal{E}(e_{t}\vert z_{t})=0$ defines the parameter vector $ b$ from equation (2) in terms of the data's properties:

$\displaystyle \mathcal{E}(y_{t}\vert z_{t})\;=\;b^{\prime}z_{t}.$ (3)

The error $ e_{t}$ in equation (2) is also defined as a function of the data:

$\displaystyle e_{t}\;=\;y_{t}-\mathcal{E}(y_{t}\vert z_{t}).$ (4)

In particular, equation (4) shows explicitly that the error $ e_{t}$ contains everything in $ y_{t}$ that is not modeled by $ z_{t}$. From a slightly different perspective, $ y_{t}$ can always be decomposed into two components: $ b^{\prime}z_{t}$ (the explained part) and $ e_{t}$ (the unexplained part). Such a partition is feasible, even when $ y_{t}$ does not depend on $ z_{t}$ in any way and so is not caused by $ z_{t}$.

From equations (3) and (4), the properties of $ b$ and $ e_{t}$ vary with the choice of $ z_{t}$ and the orthogonality assumption. Equivalently, the properties of $ b$ and $ e_{t}$ vary with the choice of variables ignored and with the conditioning assumption. The coefficients and errors of empirical models are thus derived and not autonomous.
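
As a worked illustration of this dependence, one can solve for $ b$ under the additional assumption (not stated above) that the second-moment matrix $ \mathcal{E}(z_{t}z_{t}^{\prime})$ is finite and nonsingular. Because $ \mathcal{E}(e_{t}\vert z_{t})=0$ implies $ \mathcal{E}(z_{t}e_{t})=0$ by iterated expectations, premultiplying equation (2) by $ z_{t}$ and taking expectations gives:

$\displaystyle \mathcal{E}(z_{t}y_{t})\;=\;\mathcal{E}(z_{t}z_{t}^{\prime})\,b \quad\Longrightarrow\quad b\;=\;\left[ \mathcal{E}(z_{t}z_{t}^{\prime})\right] ^{-1}\mathcal{E}(z_{t}y_{t}).$

In this sketch, $ b$ is a function of the moments of the data alone. A different choice of $ z_{t}$ changes both moment matrices, and so it generally changes $ b$ and, through equation (4), the error $ e_{t}$ as well.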

As an immediate implication, empirical models can be designed through the selection of $ z_{t}$: changing $ z_{t}$ redesigns the error $ e_{t}$ and the coefficients $ b$. Consequently, design criteria can be analyzed. For example, what makes one design better than another? Sections 2.5-2.7 address this issue, leading to the notion of a congruent model--one that matches the data evidence on all the measured attributes. Any given sample may have more than one congruent model. That leads to the requirement of a dominant congruent model, which entails an ordering over congruent models. Indeed, successive congruent models of a given phenomenon should be able to explain or encompass previous models, thereby achieving progress in modeling.

2.2  The Data Generation Process

This subsection formally (and briefly) discusses the statistical basis for the data generation process.

Let $ \left\{ \mathbf{u}_{t}\right\} $ denote a stochastic process for a vector $ \mathbf{u}_{t}$ of random variables at time $ t$ that is defined on the probability space $ \left( \mathbf{\Omega},\mathcal{F},\mathsf{P}\right) $, where $ \mathbf{\Omega}$ is the sample space, $ \mathcal{F}$ is the event space (sigma field), and $ \mathsf{P}$ is the probability measure. Let $ \mathbf{\psi }$ denote a vector of parameters, which are entities that do not depend on $ \mathcal{F}$.

Consider the full sample $ \mathbf{U}_{T}^{1}=\left( \mathbf{u}_{1}\ldots\mathbf{u}_{T}\right) $ for $ t=1,\ldots,T$, where the notation $ \mathbf{U}_{j}^{i}$ means $ \left( \mathbf{u}_{i}\ldots\mathbf{u}_{j}\right) $ for $ j\geq i$; and denote the initial (pre-sample) conditions by $ \mathbf{U}_{0}=\left( \ldots\mathbf{u}_{-2}\;\mathbf{u}_{-1}\;\mathbf{u}_{0}\right) $. The distribution function $ \mathsf{D}_{\mathsf{U}}\left( \cdot\right) $ of $ \mathbf{U}_{T}^{1}$, conditional on $ \mathbf{U}_{0}$, is denoted by $ \mathsf{D}_{\mathsf{U}}\left( \mathbf{U}_{T}^{1}\mid\mathbf{U}_{0},\mathbf{\cdot}\right) $, which is often called the Haavelmo distribution; see Spanos (1989) [Volume I: Chapter 4]. To make explicit the observed phenomenon of parameter nonconstancy, $ \mathsf{D}_{\mathsf{U}}\left( \cdot\right) $ is represented parametrically by the $ kT$-dimensional vector of parameters $ \mathbf{\psi}_{T}^{1}=\left( \mathbf{\psi}_{1}^{\prime}\ldots\mathbf{\psi}_{T}^{\prime}\right) ^{\prime}$, where each time period has an associated parameter $ \mathbf{\psi}_{t}=\left( \psi_{1,t}\ldots\psi_{k,t}\right) ^{\prime}$. Thus, elements of $ \mathbf{\psi}$ need not be the same at each time $ t$, and some of the $ \left\{ \psi_{i,t}\right\} $ may reflect transient effects or regime shifts. The parameter space is $ \mathbf{\Psi}\subseteq{\mathbb{R}}^{kT}$, so $ \mathbf{\psi}_{T}^{1}\in\mathbf{\Psi}$.

The data generation process of $ \left\{ \mathbf{u}_{t}\right\} $ is therefore written as:

$\displaystyle \mathsf{D}_{\mathsf{U}}\left( \mathbf{U}_{T}^{1}\mid\mathbf{U}_{0},\mathbf{\psi}_{T}^{1}\right) \quad\text{with}\quad\mathbf{\psi}_{T}^{1}\;\in\;\mathbf{\Psi}\;\subseteq\;{\mathbb{R}}^{kT}.$ (5)

From equation (5), the complete sample  $ \left\{ \mathbf{u}_{t},t=1,\ldots,T\right\} $ is generated from $ \mathsf{D} _{\mathsf{U}}\left( \cdot\right) $ by a population parameter value, which is denoted $ \mathbf{\psi}_{p}$.

2.3  The Reduction Sequence

This subsection considers the sequence of reductions that obtains the empirical model from the DGP. Because $ \mathbf{U}_{T}^{1}$ is unmanageably large, operational models are defined by a sequence of data reductions, which can be viewed in ten stages.

It is assumed that empirical modeling aims to determine the values of a set of parameters of interest that are relevant to an investigator's objectives, such as testing theories or undertaking policy analysis. The key concern of any given reduction is its effect on the parameters of interest. To derive that effect, this subsection briefly considers each of these ten stages in turn, while noting that some of these stages do not involve a reduction per se.

Parameters of interest.    Let the parameters of interest be denoted by $ \mathbf{\mu}\in\mathcal{M}$. Both economic theory and empirical properties may suggest that certain parameters are parameters of interest. Parameters that are identifiable and invariant to an empirically relevant class of interventions are likely to be of interest. Other parameters may be of interest, depending on the purpose of the exercise.

Also, if $ \mathbf{\mu}$ is not a function of $ \mathbf{\psi}_{T}^{1}$, then the modeling exercise will be vacuous, so assume that $ \mathbf{\mu}=\mathbf{g} \left( \mathbf{\psi}_{T}^{1}\right) $. After each reduction, it is essential to check that $ \mathbf{\mu}$ can still be retrieved from the parameters characterizing the lower-dimensional data density.

Data transformations and aggregation.    Consider a one-to-one mapping of $ \mathbf{U}_{T}^{1}$ to a new dataset $ \mathbf{W}_{T}^{1}$: $ \mathbf{U} _{T}^{1}\leftrightarrow\mathbf{W}_{T}^{1}$. The variables in $ \mathbf{W} _{T}^{1}$ may include aggregates of the original variables, their growth rates, etc. The transformation from $ \mathbf{U}$ to $ \mathbf{W}$ affects the parameter space, so $ \mathbf{\Psi}$ is transformed into $ \mathbf{\Phi}$ (say). Because densities are equivariant under one-to-one transformations, the DGP of $ \mathbf{W}_{T}^{1}$ is characterized by the joint density of $ \mathbf{U} _{T}^{1}$:

$\displaystyle \mathsf{D}_{\mathsf{W}}\left( \mathbf{W}_{T}^{1}\mid\mathbf{W}_{0} ,\mathbf{\phi}_{T}^{1}\right) \;=\;\mathsf{D}_{\mathsf{U}}\left( \mathbf{U}_{T}^{1}\mid\mathbf{U}_{0},\mathbf{\psi}_{T}^{1}\right) ,$ (6)

where $ \mathbf{\phi}_{T}^{1}$ is the set of transformed parameters, with $ \mathbf{\phi}_{T}^{1}\in\mathbf{\Phi}$.

For $ \mathsf{D}_{\mathsf{W}}\left( \cdot\right) $ in equation (6), the key issue is how the transformation from $ \mathbf{\psi}_{T}^{1}$ to $ \mathbf{\phi}_{T}^{1}$ alters the properties of the parameters. Some parameters in $ \mathbf{\phi}_{T}^{1}$ may be more constant than ones in $ \mathbf{\psi}_{T}^{1}$; others may be less constant; and a smaller (or larger) number of the parameters in $ \mathbf{\phi}_{T}^{1}$ may be needed to characterize the parameters of interest  $ \mathbf{\mu}$.

Sequential factorization.    Using the basic result that a joint probability equals the product of the conditional and marginal probabilities, and noting that time is irreversible, sequentially factorize the density of $ \mathbf{W}_{T}^{1}$ into its (martingale-difference) components:

$\displaystyle \mathsf{D}_{\mathsf{W}}\left( \mathbf{W}_{T}^{1}\mid\mathbf{W}_{0} ,\mathbf{\phi}_{T}^{1}\right) \;=\;\prod_{t=1}^{T}\mathsf{D}_{\mathsf{w} }\left( \mathbf{w}_{t}\mid\mathbf{W}_{t-1},\mathbf{\delta}_{t}\right) ,$ (7)

where $ \mathbf{W}_{t-1}=\left( \mathbf{W}_{0}:\mathbf{W}_{t-1} ^{1}\right) $ , $ \mathbf{w}_{t}$ is the $ t$-th column in $ \mathbf{W}_{T} ^{1}=\left( \mathbf{w}_{1}\ldots\mathbf{w}_{T}\right) $ , and $ \mathbf{\delta }_{t}$ is the parameterization resulting from the sequential factorization. The right-hand side of equation (7) implicitly defines an innovation process $ \mathbf{\eta}_{t}$, which equals $ \mathbf{w} _{t}-\mathcal{E}(\mathbf{w}_{t}\vert\mathbf{W}_{t-1}^{1})$ .
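
For concreteness, consider the shortest non-trivial case, $ T=2$, as an illustration. Equation (7) then reads:

$\displaystyle \mathsf{D}_{\mathsf{W}}\left( \mathbf{w}_{1},\mathbf{w}_{2}\mid\mathbf{W}_{0},\mathbf{\phi}_{2}^{1}\right) \;=\;\mathsf{D}_{\mathsf{w}}\left( \mathbf{w}_{1}\mid\mathbf{W}_{0},\mathbf{\delta}_{1}\right) \cdot\mathsf{D}_{\mathsf{w}}\left( \mathbf{w}_{2}\mid\mathbf{W}_{1},\mathbf{\delta}_{2}\right) ,$

where $ \mathbf{W}_{1}=\left( \mathbf{W}_{0}:\mathbf{w}_{1}\right) $. The martingale-difference label follows directly from the definition of $ \mathbf{\eta}_{t}$: taking expectations conditional on the same information set gives

$\displaystyle \mathcal{E}(\mathbf{\eta}_{t}\vert\mathbf{W}_{t-1}^{1})\;=\;\mathcal{E}(\mathbf{w}_{t}\vert\mathbf{W}_{t-1}^{1})-\mathcal{E}(\mathbf{w}_{t}\vert\mathbf{W}_{t-1}^{1})\;=\;\mathbf{0},$

so $ \mathbf{\eta}_{t}$ is unpredictable in mean from the history on which it is defined. That property holds by construction rather than by assumption.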

Data partition.    Now, partition $ \mathbf{W}_{T}^{1}$ into two sets, one set to be analyzed ($ \mathbf{X}_{T}^{1}$) and one set to be marginalized ($ \mathbf{V}_{T}^{1}$):

$\displaystyle \mathbf{W}_{T}^{1}\;=\;\left( \mathbf{X}_{T}^{1}:\mathbf{V}_{T}^{1}\right) ,$ (8)

where $ \mathbf{X}_{T}^{1}$ is a $ T\times n$ matrix. Consequently, everything about $ \mathbf{\mu}$ must be learnt from $ \mathbf{X}_{T}^{1}$ alone, which entails that $ \mathbf{V}_{T}^{1}$ must not be essential to inference about $ \mathbf{\mu}$.

Marginalization. Actual marginalization proceeds as follows. Using the partition in equation (8), and noting that $ \mathbf{W} _{t-1}=\left\{ \mathbf{X}_{t-1}^{1},\mathbf{V}_{t-1}^{1},\mathbf{W} _{0}\right\} $ , factorize $ \mathsf{D}_{\mathsf{w}}\left( \cdot\right) $ into the conditional distribution of $ \mathbf{v}_{t}$ given $ \mathbf{x}_{t}$, and the marginal distribution of $ \mathbf{x}_{t}$:

$\displaystyle \mathsf{D}_{\mathsf{w}}\left( \mathbf{w}_{t}\mid\mathbf{W}_{t-1},\mathbf{\delta}_{t}\right) \;=\;\mathsf{D}_{\mathsf{v\vert x}}\left( \mathbf{v}_{t}\mid\mathbf{x}_{t},\mathbf{W}_{t-1},\mathbf{\delta}_{a,t}\right) \cdot\mathsf{D}_{\mathsf{x}}\left( \mathbf{x}_{t}\mid\mathbf{V}_{t-1}^{1},\mathbf{X}_{t-1}^{1},\mathbf{W}_{0},\mathbf{\delta}_{b,t}\right) .$ (9)

If only $ \left\{ \mathbf{x}_{t}\right\} $ is to be analyzed, with only $ \mathsf{D}_{\mathsf{x}}\left( \mathbf{x}_{t}\vert\mathbf{\cdot}\right) $ retained, then $ \mathbf{\mu}$ must be obtainable from $ \left\{ \mathbf{\delta }_{b,t}\right\} $ alone.

If lagged information about $ \mathbf{v}$ is also to be eliminated, then $ \mathsf{D}_{\mathsf{x}}\left( \mathbf{x}_{t}\vert\mathbf{\cdot}\right) $ must be marginalized with respect to $ \mathbf{V}_{t-1}^{1}$, requiring the very strong condition that:

$\displaystyle \mathsf{D}_{\mathsf{x}}\left( \mathbf{x}_{t}\mid\mathbf{V}_{t-1} ^{1},\mathbf{X}_{t-1}^{1},\mathbf{W}_{0},\mathbf{\delta}_{b,t}\right) \;=\;\mathsf{D}_{\mathsf{x}}\left( \mathbf{x}_{t}\mid\mathbf{X}_{t-1} ^{1},\mathbf{W}_{0},\mathbf{\delta}_{b,t}^{\ast}\right) .$ (10)

There is no loss of information from eliminating the history $ \mathbf{V}_{t-1}^{1}$ if and only if $ \mathbf{\delta}_{b,t}=\mathbf{\delta }_{b,t}^{\ast}\ \forall t$ , in which case the conditional sequential distribution of $ \left\{ \mathbf{x}_{t}\right\} $ does not depend on $ \mathbf{V}_{t-1}^{1}$. That is, $ \mathbf{v}$ does not Granger-cause $ \mathbf{x}$; see Granger (1969). In modeling, another important condition is that there is no loss of relevant information when $ \mathbf{\mu} =\mathbf{g}(\left\{ \mathbf{\delta}^{\ast}\right\} _{b,T}^{1})$ . That is still a strong condition, but less stringent than $ \mathbf{\delta} _{b,t}=\mathbf{\delta}_{b,t}^{\ast}\ \forall t$ . Also, marginalizing $ \mathbf{v}$ will entail a loss of information unless $ \mathbf{\delta} _{t}=\left( \mathbf{\delta}_{a,t},\mathbf{\delta}_{b,t}\right) \in\mathbf{\Delta}_{a}\times\mathbf{\Delta}_{b}$ for parameter spaces $ \mathbf{\Delta}_{a}$ and $ \mathbf{\Delta}_{b}$. Otherwise, the parameters of the conditional and marginal distributions in equation (9) are cross-linked.
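As a simple illustration of condition (10), consider a hypothetical bivariate Gaussian VAR(1) in scalar $ x_{t}$ and $ v_{t}$. This example, its coefficients $ \pi_{ij}$, and its jointly normal innovations $ (e_{1t},e_{2t})$ are assumptions introduced here for exposition only:

$\displaystyle x_{t}\;=\;\pi_{11}x_{t-1}+\pi_{12}v_{t-1}+e_{1t},\hspace*{0.5in}v_{t}\;=\;\pi_{21}x_{t-1}+\pi_{22}v_{t-1}+e_{2t}.$

The density of $ x_{t}$ given the full history has conditional mean $ \pi_{11}x_{t-1}+\pi_{12}v_{t-1}$. In this example, the equality in (10) therefore holds if and only if $ \pi_{12}=0$, i.e., if and only if $ v$ does not Granger-cause $ x$. When $ \pi_{12}\neq0$, the density of $ x_{t}$ given only its own history generally has different parameters, so dropping lagged $ v$ loses information about the $ x$ equation.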

The above discussion implies that modeling aggregated data (say) can be viewed as a two-step process. First, the disaggregated series for a given variable (such as expenditure) are transformed by a one-to-one transformation into the corresponding aggregated series and all but one of those disaggregated series. Second, those disaggregated series are marginalized.

Mapping to stationarity.    An economy may generate integrated data, where a variable that is integrated of order $ d$ (denoted I($ d$)) must be differenced $ d$ times to eliminate all unit roots. Mapping such data to stationarity is a reduction from I($ d$) to I(0). This mapping is generally useful for interpreting the resulting models, and it is needed to ensure that conventional inference is valid for all parameters. Still, many inferences will be valid even if this reduction is not enforced; see Sims, Stock, and Watson (1990). While differencing the data can map the data to stationarity, cointegration can also eliminate unit roots between linear combinations of variables. Cointegration is merely noted here, as it is treated extensively in numerous books and expository articles: see Banerjee and Hendry (1992), Ericsson (1992a), Banerjee, Dolado, Galbraith, and Hendry (1993), Johansen (1995), Hatanaka (1996), Doornik, Hendry, and Nielsen (1998), and Hendry and Juselius (2001) inter alia.
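The differencing route to stationarity can be sketched in a few lines. The example below is illustrative only: it assumes Gaussian white-noise shocks, builds an I(1) series by cumulating them once and an I(2) series by cumulating twice, and then checks that differencing $ d$ times recovers the original shocks. The function name `difference` and the simulated series are hypothetical.

```python
import random
from itertools import accumulate

def difference(series, d=1):
    """Apply the first-difference operator d times to a list of numbers."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

random.seed(0)
shocks = [random.gauss(0.0, 1.0) for _ in range(200)]   # I(0) by construction
level = list(accumulate(shocks))                        # I(1): a random walk
twice = list(accumulate(level))                         # I(2): a cumulated random walk

recovered_1 = difference(level, d=1)   # matches shocks[1:] up to rounding
recovered_2 = difference(twice, d=2)   # matches shocks[2:] up to rounding
```

Each round of differencing drops one initial observation, which is why the recovered series align with `shocks[1:]` and `shocks[2:]` respectively. Cointegration, the other route to I(0) noted above, is not covered by this sketch.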

Conditional factorization.    Typically in empirical modeling, some variables are treated as endogenous and others are treated as given or non-modeled. Formally, this partitioning of the variables arises by factorizing the density of $ n$ variables in $ \mathbf{x}_{t}$ into sets of $ n_{1}$ and $ n_{2}$ variables  $ \mathbf{y}_{t}$ and  $ \mathbf{z}_{t}$:

$\displaystyle \mathbf{x}_{t}^{\prime}\;=\;\left( \mathbf{y}_{t}^{\prime}:\mathbf{z} _{t}^{\prime}\right) ,$ (11)

where $ \mathbf{y}_{t}$ denotes the endogenous variables in $ \mathbf{x}_{t}$, $ \mathbf{z}_{t}$ denotes the non-modeled variables in $ \mathbf{x}_{t}$, and $ n_{1}+n_{2}=n$. Using the partition in equation (11), the joint distribution of $ \mathbf{x}_{t}$ on the right-hand side of equation (10) can always be factorized as:

$\displaystyle \mathsf{D}_{\mathsf{x}}\left( \mathbf{x}_{t}\mid\mathbf{X}_{t-1}^{1},\mathbf{W}_{0},\mathbf{\delta}_{b,t}^{\ast}\right) \;=\;\mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\mid\mathbf{z}_{t},\mathbf{X}_{t-1}^{1},\mathbf{W}_{0},\mathbf{\theta}_{a,t}\right) \cdot\mathsf{D}_{\mathsf{z}}\left( \mathbf{z}_{t}\mid\mathbf{X}_{t-1}^{1},\mathbf{W}_{0},\mathbf{\theta}_{b,t}\right) ,$ (12)

where $ \mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t} \vert\mathbf{\cdot}\right) $ is the conditional density of $ \mathbf{y}_{t}$ given $ \mathbf{z}_{t}$, $ \mathsf{D}_{\mathsf{z}}\left( \mathbf{z}_{t} \vert\mathbf{\cdot}\right) $ is the marginal density of $ \mathbf{z}_{t}$, and $ \mathbf{\theta}_{a,t}$ and $ \mathbf{\theta}_{b,t}$ are those densities' parameters. Modeling only $ \mathbf{y}_{t}$ and treating $ \mathbf{z}_{t}$ as given corresponds to modeling only $ \mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\vert\mathbf{\cdot}\right) $ and discarding the marginal distribution $ \mathsf{D}_{\mathsf{z}}\left( \mathbf{z}_{t}\vert\mathbf{\cdot }\right) $ on the right-hand side of equation (12). No loss of information in this reduction corresponds to the condition that $ \mathbf{z} _{t}$ is weakly exogenous for $ \mathbf{\mu}$. Specifically, weak exogeneity requires that $ \mathbf{\mu}=\mathbf{f}\left( \mathbf{\theta}_{a,t}\right) $ alone and that $ \left( \mathbf{\theta}_{a,t},\mathbf{\theta}_{b,t}\right) \in\mathbf{\Theta}_{a}\times\mathbf{\Theta}_{b}$ for parameter spaces $ \mathbf{\Theta}_{a}$ and $ \mathbf{\Theta}_{b}$; see Engle, Hendry, and Richard (1983)*.
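A standard textbook case may help fix ideas for equation (12). The setup and symbols below ($ m_{1}$, $ m_{2}$, $ \sigma_{ij}$, $ \alpha$, $ \beta$, $ \omega^{2}$) are assumptions introduced here for illustration. Suppose scalar $ y_{t}$ and $ z_{t}$ are jointly normal and independent over time, with means $ m_{1}$ and $ m_{2}$, variances $ \sigma_{11}$ and $ \sigma_{22}$, and covariance $ \sigma_{12}$. The factorization into conditional and marginal densities is then:

$\displaystyle y_{t}\mid z_{t}\;\sim\;\mathsf{N}\left( \alpha+\beta z_{t},\,\omega^{2}\right) ,\hspace*{0.5in}z_{t}\;\sim\;\mathsf{N}\left( m_{2},\,\sigma_{22}\right) ,$

where $ \beta=\sigma_{12}/\sigma_{22}$, $ \alpha=m_{1}-\beta m_{2}$, and $ \omega^{2}=\sigma_{11}-\sigma_{12}^{2}/\sigma_{22}$. Here the conditional parameters are $ (\alpha,\beta,\omega^{2})$ and the marginal parameters are $ (m_{2},\sigma_{22})$. Absent further restrictions, the two sets are variation free, so if the parameter of interest is $ \beta$, then $ z_{t}$ is weakly exogenous for it. If instead the parameter of interest depended on $ m_{2}$ or $ \sigma_{22}$, or if theory imposed a restriction linking the two sets, the marginal density of $ z_{t}$ could not be discarded without loss.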

Constancy.    Complete parameter constancy in the conditional density $ \mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\vert\mathbf{\cdot}\right) $ means that $ \mathbf{\theta}_{a,t}=\mathbf{\theta}_{a}\ \forall t$, where $ \mathbf{\theta}_{a}\in\mathbf{\Theta}_{a}$. In such a situation, if weak exogeneity holds, $ \mathbf{\mu}$ itself is constant because $ \mathbf{\mu}$ is a function of only $ \mathbf{\theta}_{a}$. While appearing simple enough, constancy is actually a subtle concept; see Hendry (1996) and Ericsson, Hendry, and Prestwich (1998).

Lag truncation. Lag truncation limits the extent of the history  $ \mathbf{X}_{t-1}^{1}$ in the conditional density $ \mathsf{D} _{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\vert\mathbf{\cdot}\right) $ in equation (12). For instance, truncation at $ s$ lags implies:

$\displaystyle \mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\mid\mathbf{z}_{t} ,\mathbf{X}_{t-1}^{1},\mathbf{W}_{0},\mathbf{\theta}_{a}\right) \;=\;\mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\mid\mathbf{z} _{t},\mathbf{X}_{t-1}^{t-s},\mathbf{W}_{0},\mathbf{\zeta}\right) .$ (13)

In equation (13), no loss of relevant information requires that $ \mathbf{\mu}=\mathbf{f}\left( \mathbf{\zeta}\right) $.

Functional form.    Functional form could be treated as a set of data transformations, but it merits some discussion on its own. Specifically, map $ \mathbf{y}_{t}$ into $ \mathbf{y}_{t}^{\dagger}$ ( $ =\mathbf{h}_{1}\left( \mathbf{y}_{t}\right) $) and $ \mathbf{z}_{t}$ into $ \mathbf{z}_{t}^{\dagger}$ ( $ =\mathbf{h}_{2}\left( \mathbf{z}_{t}\right) $), and denote the resulting data by $ \mathbf{X}^{\dagger}$. Assume that the transformations $ \mathbf{y}_{t}^{\dagger}$ and $ \mathbf{z}_{t}^{\dagger}$ together make $ \mathsf{D}_{\mathsf{y}^{\dagger}\mathsf{\vert z}^{\dagger}}(\mathbf{y}_{t}^{\dagger}\vert\cdot)$ approximately normal and homoscedastic, denoted $ \mathsf{N}_{n_{1}}(\mathbf{\lambda}_{t},\mathbf{\Upsilon})$. Then there is no loss of information--and no change in the specification--if:

$\displaystyle \mathsf{D}_{\mathsf{y}^{\dagger}\mathsf{\vert z}^{\dagger}}(\mathbf{y}_{t}^{\dagger}\mid\mathbf{z}_{t}^{\dagger},\mathbf{X}^{\dagger}{}_{t-1}^{t-s},\mathbf{W}_{0},\mathbf{\gamma})\;=\;\mathsf{D}_{\mathsf{y\vert z}}\left( \mathbf{y}_{t}\mid\mathbf{z}_{t},\mathbf{X}_{t-1}^{t-s},\mathbf{W}_{0},\mathbf{\zeta}\right) .$ (14)

A well-known example in which no loss of information occurs is transforming the log-normal density of a variable to the normal density in the logarithm of that variable. The left-hand side density in equation (14) defines the local (conditional) DGP of $ \mathbf{y} _{t}^{\dagger}$. When joint normality holds, as is assumed here, that final conditional model is linear in the transformed space.

The derived model.    This sequence of reductions delivers the following specification:

$\displaystyle \mathbf{A}\left( L\right) \mathbf{h}_{1}\left( \mathbf{y}_{t}\right) \;=\;\mathbf{B}\left( L\right) \mathbf{h}_{2}\left( \mathbf{z}_{t}\right) \;+\;\mathbf{\varepsilon}_{t}\hspace*{0.5in}\mathbf{\varepsilon}_{t}\ \underset{app}{\sim}\ \mathsf{N}_{n_{1}}(\mathbf{0},\mathbf{\Sigma}_{\varepsilon}),$ (15)

where $ \mathbf{\varepsilon}_{t}$ is a mean-zero, homoscedastic, mean-innovation process with variance $ \mathbf{\Sigma}_{\varepsilon}$, and $ \mathbf{A}\left( L\right) $ and $ \mathbf{B}\left( L\right) $ are constant-parameter polynomial matrices of order $ s$ in the lag operator $ L$. That is, $ \mathbf{A}\left( L\right) $ and $ \mathbf{B}\left( L\right) $ are matrices whose elements are polynomials. The error  $ \mathbf{\varepsilon }_{t} $ is a derived process that is defined by:

$\displaystyle \mathbf{\varepsilon}_{t}\;=\;\mathbf{A}\left( L\right) \mathbf{h}_{1}\left( \mathbf{y}_{t}\right) \;-\;\mathbf{B}\left( L\right) \mathbf{h}_{2}\left( \mathbf{z}_{t}\right) ,$ (16)

so $ \mathbf{\varepsilon}_{t}$ as given in equation (16) is not autonomous. For the same reason, equation (15) is a derived model, rather than an autonomous model. Equations (15) and (16) parallel and generalize Section 2.1's discussion of the empirical model in equation (2) and its derived error in equation (4). Section 4.3 discusses further aspects of model formulation, focusing on dynamic specification, noting that equation (15) is an autoregressive distributed lag model.
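To see concretely why the error in equation (16) is derived rather than autonomous, the sketch below specializes equations (15) and (16) to a scalar case with identity transformations and one lag, so that $ \mathbf{A}(L)=1-a_{1}L$ and $ \mathbf{B}(L)=b_{0}+b_{1}L$. These simplifications, the coefficient values, and the function name `adl_residuals` are assumptions made for illustration. The same data yield different "errors" under different choices of the lag polynomials.

```python
def adl_residuals(y, z, a1, b0, b1):
    """Derived errors e_t = y_t - a1*y_{t-1} - b0*z_t - b1*z_{t-1}, t = 1, ..., len(y)-1."""
    return [y[t] - a1 * y[t - 1] - b0 * z[t] - b1 * z[t - 1]
            for t in range(1, len(y))]

# Hypothetical data generated without noise from an ADL(1,1) relation.
z = [1.0, 0.5, -0.2, 0.8, 0.3]
y = [0.0]
for t in range(1, len(z)):
    y.append(0.5 * y[t - 1] + 1.0 * z[t] - 0.4 * z[t - 1])

exact = adl_residuals(y, z, a1=0.5, b0=1.0, b1=-0.4)   # generating coefficients
static = adl_residuals(y, z, a1=0.0, b0=1.0, b1=0.0)   # static model, no dynamics
```

With the generating coefficients, the derived errors vanish up to rounding. Under the static specification, they absorb the omitted dynamics, which is the sense in which the error is a by-product of the model rather than a separate input to it.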

Reduction from the DGP to the generic econometric equation in (15) involves all ten stages of reduction discussed above, thereby transforming the parameters  $ \mathbf{\psi}_{T}^{1}$ in the DGP to the coefficients of the empirical model. Because the DGP is congruent with itself, equation (15) would be an undominated congruent model if there were no information losses from the corresponding reductions. More generally, equation (15) is congruent under the conditions stated, but it still could be dominated, as the next subsection discusses.

2.4  Dominance

Comparison of empirical models is often of interest, and dominance is a useful criterion when comparing models. Dominance can be summarized as follows.

Consider two distinct scalar empirical models, denoted $ \mathsf{M}_{1}$ and $ \mathsf{M}_{2}$, with mean innovation processes (MIPs) $ \left\{ \varepsilon_{1t}\right\} $ and $ \left\{ \varepsilon_{2t}\right\} $ relative to their own information sets, where $ \varepsilon_{1t}$ and $ \varepsilon_{2t}$ have constant finite variances $ \sigma_{1}^{2}$ and $ \sigma_{2}^{2}$ respectively. Model  $ \mathsf{M}_{1}$ variance-dominates model  $ \mathsf{M}_{2} $ if $ \sigma_{1}^{2}<\sigma_{2}^{2}$. That property is denoted $ \mathsf{M} _{1}\succ\mathsf{M}_{2}$.

Several implications follow immediately from the definition of variance dominance in terms of the models' variances.

These implications favor starting with general rather than simple empirical models, given any choice of information set; and they suggest modeling the conditional expectation.

2.5  Measuring Information Loss

Valid reductions involve no losses in information. Econometrics has created concepts that correspond to avoiding possible losses, as the following list highlights.

At a more general level, encompassing implies that no loss of information arises from ignoring another model's data, conditional on retaining the data in one's own model.

To provide some insight into encompassing, note that the local DGP is a reduction of the DGP itself and is nested within that DGP. The properties of the local DGP are explained by the reduction process: knowledge of the DGP entails knowledge of all reductions thereof. Thus, when knowledge of one model entails knowledge of another, the first model is said to encompass the second. Several relevant papers on encompassing are reprinted in Campos, Ericsson, and Hendry (2005); see Sections 4.6 and 4.7.

2.6  Information for Model Evaluation

Given the potential information losses that can occur in reduction, it is natural to evaluate the extent to which there is evidence of invalid reductions. A taxonomy of information sources arises from the data themselves, theory and measurement information, and data used in rival models. In this taxonomy, it is useful to partition the data used in modeling ( $ \mathbf{X} _{T}^{1}$) into its relative past, relative present, and relative future:

$\displaystyle \mathbf{X}_{T}^{1}\;=\;\left( \mathbf{X}_{t-1}^{1}:\mathbf{x}_{t} :\mathbf{X}_{T}^{t+1}\right) .$ (17)

The taxonomy of information sources is thus:

  1. past data $ \mathbf{X}_{t-1}^{1}$;
  2. present data $ \mathbf{x}_{t}$;
  3. future data $ \mathbf{X}_{T}^{t+1}$;
  4. theory information, which often motivates the choice of parameters of interest;
  5. measurement information, including price index theory, constructed identities, and data accuracy; and
  6. data of rival models, which is partitioned into its relative past, present, and future, paralleling the partition of the model's own data in equation (17).

Each source of information implies a criterion for evaluating an empirical model:

  1. homoscedastic innovation errors;
  2. weakly exogenous conditioning variables for the parameters of interest;
  3. constant invariant parameters of interest;
  4. theory-consistent identifiable structures;
  5. data-admissible formulations on accurate observations; and
  6. encompassing of rival models.

Models that satisfy the criteria for the first five information sets are said to be congruent. An encompassing congruent model satisfies all six criteria; see Sections 4.6 and 4.7.

2.7  Model Design

Model design can be either implicit or explicit. Implicit model design typically occurs when modeling aims to address the symptoms of existing mis-specification. For instance, a simple model might be initially specified and then tested for problems such as autocorrelation, heteroscedasticity, and omitted variables, correcting such problems as they are discovered. The model is thus implicitly designed to minimize (or control) the values of the test statistics that are computed.

Explicit model design aims to mimic reduction theory in empirical modeling, thereby minimizing the losses due to the reductions imposed. Explicit model design leads to general-to-specific modeling; see the following section. Gilbert (1986) [Volume I: Chapter 1] discusses and contrasts implicit and explicit model design in detail.


3  General-to-specific Modeling

General-to-specific modeling is the practical embodiment of reduction. The introduction to the current section summarizes important aspects of that relationship, focusing on implications for automated general-to-specific modeling algorithms. See Gilbert (1986)* and Ericsson, Campos, and Tran (1990) [Volume I: Chapter 7] for detailed expositions on general-to-specific modeling and Phillips (1988) [Volume I: Chapter 6] for a constructive critique.

As background to the methodology of general-to-specific modeling, consider the local DGP, which is the joint distribution of the subset of variables under analysis. A general unrestricted model such as equation (15) is formulated to provide a congruent approximation to that local DGP, given the theoretical, institutional, and existing empirical background. The empirical analysis commences from this general specification, which is assessed for discrepancies from congruency by using mis-specification tests. If no mis-specification is apparent, the general unrestricted model (or GUM) is simplified to a parsimonious congruent representation, with each step in the simplification being checked by diagnostic testing.

Simplification can proceed in many ways. Although the goodness of a model is intrinsic to the model and is not a property of the selection route, poor routes seem unlikely to deliver useful models. Consequently, some economists worry about how selection rules may affect the properties of the resulting models, thus advocating the use of a priori specifications. To be a viable empirical approach, these a priori specifications require knowledge of the answer before starting; and they deny any useful role to empirical modeling. Conversely, good routes may have a high chance of delivering congruent representations of the local DGP. Section 4.4 addresses model selection procedures and summarizes the papers reprinted in Campos, Ericsson, and Hendry (2005).

Several studies have recently investigated how well general-to-specific modeling does in model selection. In particular, the pathbreaking Monte Carlo study by Hoover and Perez (1999a) [Volume II: Chapter 22] reconsiders earlier Monte Carlo results by Lovell (1983) [Volume II: Chapter 20]. Hoover and Perez start with series on 20 macroeconomic variables ($ \mathbf{z}_{t}$) and generate a new variable (denoted $ y_{t}$) as a function of zero to five of the $ \mathbf{z}_{t}$'s and an error. Hoover and Perez then regress $ y_{t}$ on all twenty $ \mathbf{z}_{t}$'s plus lags thereof ($ y_{t-1},y_{t-2},y_{t-3} ,y_{t-4},\mathbf{z}_{t-1}$), and they let their general-to-specific algorithm simplify that general unrestricted model until it finds an irreducible, congruent, encompassing result. Their algorithm checks up to 10 different simplification paths, testing for mis-specification at every step. It then collects the models from those different simplification paths and selects the one that variance-dominates the others. By following many paths, the algorithm guards against choosing a misleading route; and it delivers an undominated congruent model.

Hoover and Perez (1999a)* stimulated a flurry of activity on the methodology of general-to-specific modeling. Hendry and Krolzig (1999) [Volume II: Chapter 23] improved on Hoover and Perez's algorithm. More recently, Hoover and Perez (2004) extended their investigations to cross-section datasets and found equally impressive performance by general-to-specific modeling in model selection. Hendry and Krolzig (2003) also reported a wide range of simulation studies used to calibrate their algorithm.

To see why general-to-specific modeling does well, the current section examines this approach from several standpoints. Section 3.1 summarizes generic difficulties in model selection. Section 3.2 reviews the debates about general-to-specific modeling prior to the publication of Hoover and Perez (1999a)*. Section 3.3 develops the analytics for several procedures in general-to-specific modeling, drawing on Hendry (2000), from which Section 3.4 derives various costs of search. Section 3.5 reports recent simulation evidence on the properties of general-to-specific modeling.

3.1  The Econometrics of Model Selection

There are four potential basic mistakes in selecting a model from data evidence:

  1. mis-specifying the general unrestricted model;
  2. failing to retain variables that should be included;
  3. retaining variables that should be omitted; and
  4. selecting a noncongruent representation, which renders conventional inference hazardous.

The first mistake is outside the purview of selection issues, although mis-specification testing of the general unrestricted model can alert the investigator to potential problems with that model. When the general unrestricted model is congruent, the fourth mistake can be avoided by ensuring that all simplifications are valid. That leaves the second and third mistakes as the two central problems for model selection. In what follows, the resulting costs of selection are typically considered for situations in which the general unrestricted model is much larger than the required model.

The conventional statistical analysis of repeated testing provides a pessimistic background for model selection. Every test has a nonzero null rejection frequency (or size, if independent of nuisance parameters), so type I errors accumulate across tests. Setting a tight significance level for tests would counteract that phenomenon, but would also induce low power to detect the influences that really matter. Thus, the conventional view concludes that incorrect decisions in model search must be commonplace.

As a contrasting view, in a progressive research strategy that utilizes general-to-specific modeling, evidence accumulates over time against irrelevant influences and in favor of relevant influences. White (1990) [Volume I: Chapter 23] showed that, with sufficiently rigorous testing, the selected model will converge to the DGP, so overfitting and mis-specification problems are primarily finite sample issues. Sections 3.3, 3.4, and 3.5 show that general-to-specific modeling has relatively low search costs, with sizes close to their nominal levels and powers near those attainable when commencing from the correct specification.

3.2  Past Debates

In addition to the four potential mistakes above, critics of general-to-specific methods have voiced concerns about data-based model selection, measurement without theory, pre-test biases, ignored selection effects, data mining, repeated testing, lack of identification, and the potential path dependence of any selected model. This subsection highlights central references in these debates and considers various responses to the critics.

Keynes (1939, 1940), Tinbergen (1940), Koopmans (1947), Judge and Bock (1978), Leamer (1978), Lovell (1983)*, Pagan (1987) [Volume I: Chapter 5], Hendry, Leamer, and Poirier (1990), and Faust and Whiteman (1997) inter alia critique general-to-specific methods in several key exchanges in the literature. Although Hendry and Krolzig (2001) address numerous concerns raised about general-to-specific modeling, it has taken a considerable time to develop the conceptual framework within which that rebuttal is possible. For instance, when Hendry (1980) was written, it was unclear how to counter earlier attacks, such as those by Keynes on Tinbergen (1939) or by Koopmans on Burns and Mitchell (1946). These debates set the scene for doubting any econometric analysis that failed to commence from a pre-specified model. Leamer (1978) also worried about the effects of data mining, which he defined as "the data-dependent process of selecting a statistical model" (p. 1). After Lovell (1983)* found low success rates for selecting a small relation hidden in a large database, an adverse view of data-based model selection became entrenched in the profession. This view was reinforced by two additional results: the apparent coefficient bias that arises when variables are selected by significance tests, and the claimed under-estimation in reported coefficient standard errors that arises from treating a selected model as if it were certain.

Consequently, many empirical econometric studies have tried to appear to commence from pre-specified models, whether they did so or not. Econometric evidence became theory dependent: empirical evidence provided little value added, and it was likely to be discarded when fashions in theory changed. Confusion over the role of econometric evidence was so great that (e.g.) Summers (1991) failed to notice that theory dependence was a source of the problem, not the use of "sophisticated" econometrics.

Keynes and others claimed that valid econometric analysis must be based on models pre-specified by economic theory. The fallacy in that approach is that theoretical models are themselves incomplete and incorrect. Similarly, Koopmans inter alia relied on the (unstated) assumption that only one form of economic theory was applicable, that it was correct, and that it was immutable; see Hendry and Morgan (1995). That said, it is actually not necessary (or even possible) to know everything in advance when commencing statistical work in economics. If it were necessary, no one would ever discover anything not already known! Partial explanations are likewise valuable empirically, as the development of the natural sciences has demonstrated. Progressive research can discover invariant features of reality without prior knowledge of the whole; see Hendry (1995b).

A critic might well grant the force of such arguments, yet remain skeptical that data mining could produce anything useful, thereby undermining an evidence-based approach. However, as Gilbert (1986)* discusses, pejorative forms of data mining can be discovered from conflicting evidence or by rival models that cannot be encompassed. Stringent and critical model evaluation can detect and avoid these forms of data mining. See Hendry (1995a, Ch. 15) and Campos and Ericsson (1999) [Volume II: Chapter 33] for further discussion of data mining, including other less pejorative senses. Even when an undominated congruent model is data-based, it can provide a good approximation to the local DGP; and it can help reveal pejorative data mining.

3.3  Probabilities of Deletion and Retention

At first blush, the theory of repeated testing appears to wreak havoc with general-to-specific model selection, so this subsection addresses that issue. In fact, the probabilities of deleting irrelevant variables are relatively high. The greater difficulty is retaining relevant effects, even if the analysis commences from the "correct" model, i.e., the local DGP. This subsection re-examines the probabilities associated with deleting irrelevant variables and retaining relevant variables when adopting a general-to-specific modeling strategy.

Deleting irrelevant variables.    To illustrate the probabilities associated with deleting irrelevant variables in general-to-specific model selection, consider a classical regression model in which $ n$ regressors are irrelevant, i.e., have regression parameters equal to zero. Under that null hypothesis, the probability $ \mathsf{p}_{\alpha}$ that at least one of $ n$ corresponding $ \mathsf{t}$-tests rejects at the $ (100\cdot\alpha)\%$ level is one minus the probability that none of those $ \mathsf{t}$-tests rejects:

\begin{displaymath}\begin{array}[b]{lll} \mathsf{p}_{\alpha} & = & 1-\mathsf{P}(\vert\mathsf{t}_{i}\vert<c_{\alpha}\;\forall i=1,\ldots,n)\medskip\\ & = & 1-\left( 1-\alpha\right) ^{n}, \end{array}\end{displaymath} (18)

where $ c_{\alpha}$ is the critical value associated with an $ \alpha$ rejection frequency. For example, when $ 40$ tests of correct null hypotheses are conducted at $ \alpha=0.05$ (say), then $ \mathsf{p}_{0.05}\simeq0.87$ from equation (18). With approximately 87% probability, at least one t-test spuriously rejects at the 5% level. Such a high value of $ \mathsf{p}_{\alpha}$ is usually the focus of worry with repeated testing.

One solution is to use larger critical values. The t-distribution is thin-tailed, leading Sargan (2001a) to note how difficult it is to obtain spurious t-values exceeding three in absolute value; see also Sargan (2001b). A critical value of three corresponds to (approximately) the 0.5% critical value for a t-test with $ T=50 $, for which $ \mathsf{p}_{0.005}\simeq0.18$ when $ n=40$. While an 18% chance of a false rejection may still be high from some perspectives, this situation has other interpretations, as is now shown.
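The two probabilities quoted from equation (18) are easy to reproduce. The short sketch below evaluates $ 1-(1-\alpha)^{n}$ directly; the function name `prob_any_rejection` is hypothetical, and the formula presumes independent tests, as in equation (18).

```python
def prob_any_rejection(alpha, n):
    """Equation (18): probability that at least one of n independent
    tests of true null hypotheses rejects at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n

p_loose = prob_any_rejection(0.05, 40)    # 5% level, 40 irrelevant regressors
p_tight = prob_any_rejection(0.005, 40)   # 0.5% level, critical value near three
print(round(p_loose, 2), round(p_tight, 2))
```

Tightening the significance level tenfold, from 5% to 0.5%, cuts the probability of any spurious rejection from roughly 0.87 to roughly 0.18.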

To better grasp the issues involved in false rejection, reconsider the probability of irrelevant variables being significant on a t-test at significance level $ \alpha$. That probability distribution $ \{\mathsf{p}_{j}\}$ is given by the $ n+1$ terms of the binomial expansion of $ \left( \alpha+\left( 1-\alpha\right) \right) ^{n}$, namely:

$\displaystyle \mathsf{p}_{j}\;=\;\frac{n!}{j!\left( n-j\right) !}\alpha^{j}\left( 1-\alpha\right) ^{n-j}\hspace*{0.5in}j=0,\ldots,n.$ (19)

Thus, the probability of all $ n$ coefficients being significant is $ \alpha^{n}$, the probability of $ n-1$ being significant is $ n\alpha ^{n-1}\left( 1-\alpha\right) $, and (as used in equation (18)) the probability of none being significant is $ \left( 1-\alpha\right) ^{n}$. Using equation (19), the average number of variables found significant by chance is:

$\displaystyle m\;=\;\sum_{j=0}^{n}j\cdot\mathsf{p}_{j}\;=\;n\alpha.$ (20)

If $ \alpha=0.05$ and $ n=40$, then $ m$ equals two, which explains the high probability of at least one spurious rejection. Even so, 38 out of the 40 variables will be deleted (on average) when using individual t-tests at this relatively loose significance level. Moreover, from equation (20), $ m$ falls to 0.4 for $ \alpha=0.01$ and to 0.2 for $ \alpha=0.005$. That is, when using a critical value of three, one variable out of forty is retained just once in five attempts, on average. That value of $ m$ explains why $ \mathsf{p}_{0.005}\simeq0.18$. It also reveals that such a rejection rate does not correspond to the usual interpretation of size but to the rare occurrence of a selection error, despite a large number of irrelevant variables. Even if repeated t-tests are used, few spurious variables will typically be retained for $ \alpha\leq0.01$. Unfortunately, small values of $ \alpha$ often imply difficulty in detecting relevant variables, leading to the issue of retention.
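Equations (19) and (20) can be checked numerically. The sketch below, with hypothetical function names, tabulates the binomial probabilities $ \mathsf{p}_{j}$ for $ n=40$ and computes the implied average number of spuriously significant variables at the three significance levels discussed in the text.

```python
from math import comb

def rejection_distribution(alpha, n):
    """Equation (19): probabilities p_j that exactly j of n irrelevant
    variables are significant at level alpha, for j = 0, ..., n."""
    return [comb(n, j) * alpha ** j * (1 - alpha) ** (n - j) for j in range(n + 1)]

def expected_rejections(alpha, n):
    """Equation (20): mean number of irrelevant variables retained by chance."""
    return sum(j * p for j, p in enumerate(rejection_distribution(alpha, n)))

n = 40
means = {alpha: expected_rejections(alpha, n) for alpha in (0.05, 0.01, 0.005)}
p_none = rejection_distribution(0.005, n)[0]   # probability that all 40 are deleted
```

The summed means agree with $ n\alpha$, and `1 - p_none` reproduces the value of $ \mathsf{p}_{0.005}$ from equation (18), linking the "at least one rejection" probability to the small expected count of retained variables.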

Retaining relevant variables.    Unless the relevant variables are highly significant in the population, retaining them appears inherently difficult, whether or not general-to-specific modeling is used. The difficulty in retaining relevant variables can be illustrated in a framework similar to the one used above for calculating the probabilities of deleting irrelevant variables, but with the t-ratios having noncentral distributions.

Consider drawing from a t-distribution $ \mathsf{t}\left( T,\tau\right) $ with $ T$ degrees of freedom and a noncentrality parameter $ \tau$, which is approximately the t-statistic's mean. The null hypothesis $ \mathsf{H}_{0}$ is $ \tau=0$, and the alternative hypothesis $ \mathsf{H}_{1}$ is $ \tau\neq0$; assume that the alternative hypothesis is true. For a critical value  $ c_{\alpha}$, the probability $ \mathsf{P}\left( \left\vert \mathsf{t}\right\vert \geq c_{\alpha}\mid \mathsf{H}_{0}\right) =\alpha$ determines the size, whereas power is given by:

$\displaystyle \mathsf{P}\left( \left\vert \mathsf{t}\right\vert \geq c_{\alpha}\mid\mathsf{H} _{1}\right) .$ (21)

Suppose that $ \tau=2$, which corresponds to an expected t-value of approximately two in the population. That is, $ \mathcal{E} (\mathsf{t}\left( T,2\right) ) \simeq2$ for a fixed $ T$. Because the t-distribution is nearly symmetric around its mean, the probability of that t-ratio exceeding two is about 50%:

$\displaystyle \mathsf{P}\left( \mathsf{t}\geq2\mid\tau=2\right) \;\; \simeq \;\;0.50.$ (22)

For even relatively small $ T$, equation (22) is the power in equation (21) when calculated for the alternative hypothesis $ \mathsf{H}_{1}:\tau=2$, noting that then $ c_{\alpha}=2$ roughly corresponds to $ \alpha=0.05$ and $ \mathsf{P}\left( \mathsf{t}\leq-2\mid \tau=2\right) $ is approximately zero. The probability of rejecting the null hypothesis that $ \tau=0$ is thus only 50% under the alternative hypothesis that $ \tau=2$, even though the local DGP is known and only a single variable is involved. For three such variables, the probability of detecting all three is:

$\displaystyle \mathsf{P}\left( \vert\mathsf{t}_{i}\vert\geq c_{\alpha}\;\forall i=1,2,3\mid \tau=2\right) \;\; \simeq\;\;0.50^{3}\;\;=\;\;0.125,$ (23)

where the estimated model is the local DGP and includes no additional irrelevant variables, and the included variables are orthogonal. From equation (23), all three variables are deemed significant only about one-eighth of the time (12.5%), which is also the probability that none of the three is retained, because each variable is dropped with a probability of about 0.50. This low power represents an inexorable cost of inference from the given evidence, even when commencing from a correctly specified model.

This difficulty with power is exacerbated if the critical value is increased, say to offset problems of "overfitting". For example, $ \mathsf{P}\left( \vert\mathsf{t}\vert\geq c_{0.01}\mid\tau=2\right) \simeq0.28$ even for large $ T$, implying only about a 2% chance of keeping all three variables (because $ 0.28^{3}\simeq0.02$). Despite knowing the local DGP, such a variable will only occasionally be retained using t-tests if the null hypothesis is tested. If there are many such variables, retaining all of them is highly unlikely.
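
The power calculations above can be checked numerically. The following is a minimal sketch, not the authors' own computation: it assumes the large-$ T$ limit, in which the noncentral t-statistic is approximated by a normal variate with mean $ \tau$ and unit variance, and it uses the rounded critical values 2 and 2.576 for the 5% and 1% levels. The function names `normal_cdf`, `approx_power`, and `prob_retain_all` are illustrative only.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def approx_power(tau: float, c: float) -> float:
    """Approximate P(|t| >= c) when t ~ N(tau, 1).

    Assumption: this normal form is the large-T approximation to the
    noncentral t-distribution t(T, tau) used in the text.
    """
    upper_tail = 1.0 - normal_cdf(c - tau)  # P(t >= c)
    lower_tail = normal_cdf(-c - tau)       # P(t <= -c), negligible for tau = 2
    return upper_tail + lower_tail


def prob_retain_all(tau: float, c: float, k: int) -> float:
    """Probability that k orthogonal variables, each with noncentrality tau,
    are all individually significant at critical value c; cf. equation (23)."""
    return approx_power(tau, c) ** k


if __name__ == "__main__":
    # Rounded critical values: 2 for roughly 5%, 2.576 for 1% (two-sided).
    for label, c in [("alpha ~ 0.05", 2.0), ("alpha = 0.01", 2.576)]:
        single = approx_power(2.0, c)
        joint = prob_retain_all(2.0, c, 3)
        print(f"{label}: power = {single:.3f}, all three retained = {joint:.3f}")
```

Under this approximation, the single-variable power is close to 0.50 at the 5% level and close to 0.28 at the 1% level, in line with the figures quoted in the text. For small $ T$, an exact calculation would require the noncentral t-distribution itself.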

These calculations paint a potentially gloomy picture for data-based model selection. However, the situation is more promising than it might appear. As Section 3.4 explains, general-to-specific modeling can have surprisingly small search costs--i.e., the additional costs that arise by commencing from a general unrestricted model that nests the local DGP, rather than by commencing from the local DGP and knowing that it is the local DGP. These search costs are typically positive: it is difficult to improve on model selection if the local DGP is known. However, multiple-variable procedures such as F-tests can improve on single-variable procedures such as the t-tests above.

3.4  Costs of Inference and Costs of Search

The costs associated with model selection can be usefully separated into the costs of inference and the costs of search, as the following example illustrates.

The costs of inference are those costs associated with inference about variables in a model when the model is the local DGP but the modeler does not know that. Consider a local DGP with $ k$ variables (all relevant), and denote those variables by the set $ S_{rel}$ ($ rel$ for relevant). Let $ \mathsf{p}_{\alpha,i}^{dgp}$ denote the probability of retaining the $ i^{th} $ variable at significance level $ \alpha$ when commencing from the local DGP as the initial specification. The probability of dropping the $ i^{th}$ relevant variable is $ (1-\mathsf{p}_{\alpha,i}^{dgp})$, so a measure of the total costs of inference in this situation is:

$\displaystyle \sum_{i\in S_{rel}}(1-\mathsf{p}_{\alpha,i}^{dgp}).$ (24)

While equation (24) is one measure of the costs of inference, other measures are feasible as well, such as unity minus the probability of selecting the local DGP; cf. equation (23).

The costs of search are those costs associated with inference about variables in a general unrestricted model that nests the local DGP, relative to inferences in the local DGP itself. Let $ \mathsf{p}_{\alpha,i}^{gum}$ denote the probability of retaining the $ i^{th}$ variable when commencing from a general unrestricted model with $ n$ variables, applying the same selection tests and significance levels as before. The cost of search is $ \mathsf{p}_{\alpha,i}^{dgp}-\mathsf{p}_{\alpha,i}^{gum}$ for a relevant variable ( $ i\in S_{rel}$), and it is $ \mathsf{p}_{\alpha,i}^{gum}$ for an irrelevant variable ( $ i\in S_{0}$), where $ S_{0}$ is the set of $ n-k$ irrelevant variables in the general unrestricted model. By construction, the local DGP has no irrelevant variables, so the whole cost of keeping irrelevant variables in model selection is attributed to search. Thus, a measure of the pure search costs is:

$\displaystyle \sum_{i\in S_{rel}}(\mathsf{p}_{\alpha,i}^{dgp}-\mathsf{p}_{\alpha,i} ^{gum})\;+\;\sum_{i\in S_{0}}\mathsf{p}_{\alpha,i}^{gum} .$ (25)
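
The measures in equations (24) and (25) are simple sums over retention probabilities, and can be sketched as follows. This is an illustration under stated assumptions rather than a result from the paper: the retention probabilities in the example are hypothetical placeholders (in practice they would come from analytical results or simulation), and the function names `inference_cost` and `search_cost` are invented for this sketch.

```python
from typing import Sequence


def inference_cost(p_dgp: Sequence[float]) -> float:
    """Equation (24): sum over relevant variables of (1 - p_dgp_i)."""
    return sum(1.0 - p for p in p_dgp)


def search_cost(
    p_dgp: Sequence[float],
    p_gum_relevant: Sequence[float],
    p_gum_irrelevant: Sequence[float],
) -> float:
    """Equation (25): retention probability lost on relevant variables
    plus the probability of retaining each irrelevant variable."""
    if len(p_dgp) != len(p_gum_relevant):
        raise ValueError("need one GUM retention probability per relevant variable")
    lost_on_relevant = sum(d - g for d, g in zip(p_dgp, p_gum_relevant))
    kept_irrelevant = sum(p_gum_irrelevant)
    return lost_on_relevant + kept_irrelevant


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    p_dgp = [0.50, 0.50, 0.50]       # k = 3 relevant variables, starting from the local DGP
    p_gum_rel = [0.45, 0.45, 0.45]   # same variables, starting from the general model
    p_gum_irr = [0.05] * 10          # n - k = 10 irrelevant variables
    print(f"costs of inference: {inference_cost(p_dgp):.2f}")
    print(f"costs of search:    {search_cost(p_dgp, p_gum_rel, p_gum_irr):.2f}")
```

With these placeholder values, the first sum in equation (25) contributes 0.15 and the second contributes 0.50, so most of the search cost in this hypothetical case comes from retaining irrelevant variables rather than from losing relevant ones.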

In principle, the pure search costs given in equation (25) could be negative if the algorithm for selection from a general unrestricted model were different from that for testing in a local DGP. The complexities of multiple-variable problems typically preclude analytical answers, so Section 3.5 reports some simulation evidence. Before e