




Table 12.2  Regression Diagnostics for Detecting Outliers for the Production Function (4.24)

Year   Studentized Residual   DFFITS     Year   Studentized Residual   DFFITS
1929         -0.38            -0.08      1949         -0.61            -0.16
1930         -1.28            -0.35      1950          0.04             0.01
1931         -1.21            -0.39      1951         -0.79            -0.19
1932         -0.42            -0.21      1952         -0.96            -0.22
1933         -0.97            -0.46      1953         -0.70            -0.15
1934         -0.61            -0.22      1954          0.32             0.06
1935         -0.24            -0.07      1955          0.43             0.08
1936          0.31             0.08      1956         -0.45            -0.09
1937         -0.08            -0.02      1957         -0.29            -0.06
1938          1.36             0.37      1958          0.42             0.10
1939          1.46             0.37      1959          0.01             0.00
1940          1.29             0.29      1960         -0.41            -0.11
1941          0.54             0.12      1961          0.11             0.03
1942         -0.51            -0.12      1962          0.37             0.11
1943          0.04             0.01      1963          0.75             0.23
1944          1.94             0.46      1964          0.70             0.22
1945          3.28             0.88      1965          0.56             0.18
1946          0.42             0.17      1966          0.02             0.01
1947         -2.04            -0.85      1967         -0.91            -0.33
1948         -1.80            -0.66

tics. The results in Table 12.2 have been obtained from the SAS regression program.
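The DFFITS column of Table 12.2 can be checked in a few lines of Python against the 0.34 cutoff the text applies below (the values are transcribed from the table; no other data are assumed):

```python
# DFFITS values transcribed from Table 12.2
dffits = {
    1929: -0.08, 1930: -0.35, 1931: -0.39, 1932: -0.21, 1933: -0.46,
    1934: -0.22, 1935: -0.07, 1936:  0.08, 1937: -0.02, 1938:  0.37,
    1939:  0.37, 1940:  0.29, 1941:  0.12, 1942: -0.12, 1943:  0.01,
    1944:  0.46, 1945:  0.88, 1946:  0.17, 1947: -0.85, 1948: -0.66,
    1949: -0.16, 1950:  0.01, 1951: -0.19, 1952: -0.22, 1953: -0.15,
    1954:  0.06, 1955:  0.08, 1956: -0.09, 1957: -0.06, 1958:  0.10,
    1959:  0.00, 1960: -0.11, 1961:  0.03, 1962:  0.11, 1963:  0.23,
    1964:  0.22, 1965:  0.18, 1966:  0.01, 1967: -0.33,
}

# years whose |DFFITS| exceeds the 0.34 cutoff used in the text
flagged = sorted(yr for yr, d in dffits.items() if abs(d) > 0.34)
print(flagged)  # [1930, 1931, 1933, 1938, 1939, 1944, 1945, 1947, 1948]
```

These are the nine observations that receive weights below 1 in the bounded influence estimation discussed next.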

Belsley, Kuh, and Welsch suggest that DFFITS is a better criterion for detecting outliers and influential observations. DFFITS_t is a standardized measure of the change in the fitted value of y_t that results from deleting that particular observation. Further, they suggest that observations with large studentized residuals or DFFITS should not be deleted; rather, their influence should be minimized. This method of estimation is called bounded influence estimation. The details of this method are complicated, but a simple one-step bounded influence estimator suggested by Welsch is as follows: minimize

    Σ_t w_t (y_t − β′x_t)²

where

    w_t = 1                    if |DFFITS_t| ≤ 0.34
    w_t = 0.34 / |DFFITS_t|    if |DFFITS_t| > 0.34
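The one-step estimator just described can be sketched in NumPy. This is an illustrative reconstruction from the formulas in the text, not the SAS routine used for Table 12.2; the function names are invented for the example and the data would be supplied by the user.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Externally studentized residuals and DFFITS for an OLS fit of y on X."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                               # OLS residuals
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages (hat-matrix diagonal)
    s2_i = (e @ e - e**2 / (1 - h)) / (n - k - 1)  # leave-one-out error variance
    t = e / np.sqrt(s2_i * (1 - h))                # studentized residuals
    dffits = t * np.sqrt(h / (1 - h))
    return t, dffits

def bounded_influence_fit(X, y, c=0.34):
    """One-step bounded influence estimator: WLS with w_t = 1 if
    |DFFITS_t| <= c and w_t = c/|DFFITS_t| otherwise."""
    _, dffits = influence_diagnostics(X, y)
    w = np.minimum(1.0, c / np.abs(dffits))
    sw = np.sqrt(w)
    # weighted least squares: scale each row of X and y by sqrt(w_t)
    return np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
```

Downweighting rather than deleting the flagged observations is exactly the "bounded influence" idea: an observation's weight shrinks in proportion to how far its DFFITS exceeds the cutoff.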

Roy E. Welsch, "Regression Sensitivity Analysis and Bounded Influence Estimation," in J. Kmenta and J. B. Ramsey (eds.), Evaluation of Econometric Models (New York: Academic Press, 1980), pp. 153-167.



Illustrative Example

As an illustration, consider again the production function (4.24). The values of DFFITS are shown in Table 12.2. There are nine observations (all before 1948) with |DFFITS_t| > 0.34. These observations receive a weight less than 1 in the bounded influence estimation.

Using this weighting scheme, we obtained the following results:

Estimate of:              Bounded Influence      OLS      OLS with Outlier Deletion
Constant                       -3.987          -3.938             -3.980
Coefficient of log L1           1.468           1.451              1.466
Coefficient of log K1           0.375           0.384              0.376

For comparison we also present the estimates from the OLS regression (4.24), as well as estimates from OLS after deleting the observations for 1944, 1945, 1947, and 1948, the years for which the OLS residuals in Table 12.1 are large.

In this example there was not much difference in the estimated coefficients. In fact, the bounded influence method and OLS with outlier deletion gave almost identical results. The data set we have used is perhaps not appropriate for illustrating the bounded influence method. The problems of parameter instability and autocorrelated errors appear to be more important with this data set than the detection of outliers.

In any case, the preceding discussion gives an idea of what "bounded influence estimation" is about. The basic point is that the OLS residuals are not appropriate for the detection of outliers. Further, outliers should not all be discarded; their influence on the least squares estimates should be reduced (bounded) according to their magnitude.

As mentioned earlier, the data set we have used did not turn out to be appropriate for illustrating the method. Other data sets in the book can be used to check the usefulness of the method.

Krasker gives an interesting example of the use of bounded influence estimation. The problem is a forecasting problem faced by Korvettes Department Stores in 1975. The company had to choose between two locations, A and B, for a new store. Data are available for 25 existing stores on the following variables:

y  = sales per capita
x1 = median home value
x2 = average family size
x3 = percent of the population that is black or Hispanic

W. S. Krasker, "The Role of Bounded Influence Estimation in Model Selection," Journal of Econometrics, Vol. 16, 1981, pp. 131-138.



The regression results were as follows (dependent variable: sales; figures in parentheses are standard errors):

         Constant      x1        x2        x3
OLS       -0.13       2.70      0.22     0.014
          (0.05)     (0.51)    (1.1)     (3.1)
WLS       -0.05       1.00     -4.1      0.010
          (0.04)     (0.52)    (1.0)     (2.9)

Note the change in the coefficient of x2. The WLS estimator is the weighted least squares (bounded influence) estimator. Krasker argues that there are two outliers (observations 2 and 11 in this sample). The other 23 observations are "well described by an OLS regression whose estimates are essentially those of the WLS." Thus, again, in this example the bounded influence estimator does not appear to differ much from OLS with the two outliers omitted. (Results from OLS with 23 observations are not presented here.)

Krasker suggests that site A is similar to observation 2, and that if the model cannot be used to predict observation 2, it should not be used to make predictions for site A (where 50.8% of the population is from minorities). The model (OLS with 23 observations, or WLS with all observations) can be used to make predictions for site B.

The weighting scheme is slightly different from the weighting scheme discussed in Welsch, "Regression Sensitivity Analysis and Bounded Influence Estimation."

12.6 Model Selection

In the usual textbook econometrics, the statistical model underlying the data is assumed to be known at the outset, and the problem is merely one of obtaining good estimates of the parameters in the model. In reality, however, the choice of a model is almost always made after some preliminary data analysis. For instance, in the case of a regression model, we start with a specification that seems most reasonable a priori. But after examining the coefficients, their standard errors, and the residuals, we change the specification of the model. Purists would consider this "data mining" an illegitimate activity, but it is equally unreasonable to assume that we know the model exactly at the very outset.

The area of model selection comprises:

1. Choice among models specified before any data analysis.

2. Simplification of complicated models based on the data (data-based simplification).

3. Post-data model construction.


