
efficient. However, it does cause problems for the validity of any model! If the true values of Y are never observed, a model can only be fitted to the measured values, and if the measurements are unreliable and difficult to predict, the model cannot perform well. There is little one can do about this, except to try to obtain better data.

In summary, if unreliable data are used for explanatory variables the attenuation bias will tend to reduce the size of OLS parameter estimates. When data on the dependent variable are known to contain significant measurement errors there is little to be done other than to obtain better data.
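The attenuation bias described above is easy to demonstrate by simulation. The sketch below (all numbers are illustrative, not from the text) regresses y on a true explanatory variable and then on a noisy measurement of it; the slope estimated from the noisy data is shrunk towards zero by the factor var(x)/(var(x) + var(error)).

```python
# Simulation of attenuation bias: measurement error in an explanatory
# variable biases the OLS slope estimate towards zero.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta = 2.0

x_true = rng.normal(0.0, 1.0, n)             # true explanatory variable
y = beta * x_true + rng.normal(0.0, 0.5, n)  # dependent variable

x_noisy = x_true + rng.normal(0.0, 1.0, n)   # measured with error (var = 1)

def ols_slope(x, y):
    # OLS slope of y on x (with intercept), via the covariance formula
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = ols_slope(x_true, y)    # close to the true beta = 2.0
slope_noisy = ols_slope(x_noisy, y)  # attenuated towards beta/2 = 1.0,
                                     # since var(x)/(var(x)+var(err)) = 0.5
```

With equal variances for the signal and the measurement error, the expected attenuation factor is one half, so the noisy-data slope comes out near 1.0 rather than 2.0.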

A.4.3 Missing Data

Missing observations on bank holidays are usually filled by repeating the last observation, or by linear interpolation. But what should one do with new markets that have existed only for a short time, or with illiquid markets where trading occurs only sporadically? If data are available on related variables there are some measures that can be taken towards estimating the model regardless of the incomplete data, without resorting to filling in the gaps with dummy or proxy variables. One approach, based on PCA, was described in §6.4.2. This section describes an alternative approach for use in the context of regression models.

If the dependent variable in a regression model has missing data, then some auxiliary regressions of the explanatory variables can be substituted in the regression model. Divide the data on the dependent variable into two parts, letting y* denote the vector of missing observations on y and yc denote the complete observations on y. Likewise, the explanatory data are divided into two matrices: X* denotes the data on explanatory variables corresponding to the missing observations on y, and Xc denotes the data on explanatory variables corresponding to the complete observations on y.

To estimate the model on all the data, regardless of the missing observations y*, one proceeds as follows. First obtain parameter estimates bc using the complete data vectors that are available, giving bc = (Xc′Xc)⁻¹Xc′yc. Then use some estimates y* of the missing observations and obtain parameter estimates b* based on the incomplete data vectors, so b* = (X*′X*)⁻¹X*′y*. Final parameter estimates are then a weighted average of bc and b* given by

bw = Wbc + (I - W)b*, (A.4.2)

where W = (Xc′Xc + X*′X*)⁻¹Xc′Xc.

The big question is how we should fill in the missing observations y*. Only if the missing data are estimated in such a way that b* is unbiased will (A.4.2) give unbiased final estimates of the model parameters. Taking b* = bc and




Figure A.9 The FTSE 100 index, 1986-1989.

backing out y* from b* = (X*′X*)⁻¹X*′y* is a possibility, but that does not really add any new information to the model, since then bw = bc and we may just as well base the model on the complete data alone. Taking every element of y* to be the average of the complete data on y is also possible, but then (A.4.2) will give biased estimates, and we may be better off just using the data set that is complete. Perhaps the most attractive alternative is to define scenarios over y* that are more general than taking the average over yc. Then (A.4.2) can be used to estimate model parameters for a number of different y*, and this will give some idea of how realistic the scenarios over y* are. However, this method will still produce biased parameter estimates.
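A minimal numerical sketch of the weighted estimator (A.4.2), on simulated data and assuming the weight matrix W = (Xc′Xc + X*′X*)⁻¹Xc′Xc, confirms the point made above: if y* is backed out from the complete-data estimates, then b* = bc and the weighted average bw simply reproduces bc.

```python
# Sketch of the weighted estimator (A.4.2) with simulated data.
# Assumption: W = (Xc'Xc + X*'X*)^{-1} Xc'Xc as in the text.
import numpy as np

rng = np.random.default_rng(1)
k, n_c, n_m = 3, 200, 50

Xc = rng.normal(size=(n_c, k))   # explanatory data, complete part
Xm = rng.normal(size=(n_m, k))   # explanatory data for the missing y (X*)
beta = np.array([1.0, -0.5, 2.0])
yc = Xc @ beta + rng.normal(0.0, 0.1, n_c)

# b_c from the complete data: (Xc'Xc)^{-1} Xc'yc
b_c = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

# Fill y* by 'backing out' from b_c, then estimate b* on that part
y_star = Xm @ b_c
b_star = np.linalg.solve(Xm.T @ Xm, Xm.T @ y_star)

# Weighted average (A.4.2): b_w = W b_c + (I - W) b_star
W = np.linalg.solve(Xc.T @ Xc + Xm.T @ Xm, Xc.T @ Xc)
b_w = W @ b_c + (np.eye(k) - W) @ b_star
```

Here b_star and b_w coincide with b_c (up to floating point), so nothing has been gained over fitting on the complete data alone; a more informative scenario for y* would move b_w away from b_c.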

A.4.4 Dummy Variables

Dummy variables are proxies for explanatory variables that we know are important, but for which there are no direct data, and neither is there a suitable proxy variable. In that case all that can be done is to make up the data by creating a dummy variable. In tick data one might consider creating a dummy to model the timing of important news announcements, or a dummy corresponding to opening times in the major markets. In daily data day-of-the-week dummies are sometimes used. Structural break dummy variables are important whenever the data period covers a permanent shift arising from a change in regime, or a temporary shift due to an extreme market movement.

Dummy variables should be used prudently and only if there is a real reason, such as an important news announcement or a change in government policy. For an example of a very basic dummy, suppose an extreme event such as Black Monday occurs in the dependent variable data. Figure A.9 illustrates the FTSE 100 index during the period around Black Monday. If a model is to explain the FTSE 100 returns around Black Monday, it will have to include a variable that has similar characteristics (such as the returns on



another equity index). If no similar variables are included in the model, the large returns during the Black Monday period will appear in the residuals, and this will upset the whole model because the residual variance will be increased.

A simple solution is to add a dummy variable to the model that takes the value 1 during the few days of large negative returns around Black Monday, and 0 otherwise. Denote by a the estimate of the model constant without the dummy variable. During the period that the dummy is 1, the constant will shift to a + d, where d is the coefficient estimate on the dummy. In this example d will be large and negative, so in effect the regression line is temporarily shifted downwards.
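As a minimal sketch (on simulated returns, not actual FTSE 100 data), the event dummy just described can be estimated by adding a 0/1 column to the regressor matrix; the dummy coefficient d then measures the temporary downward shift of the regression line.

```python
# Sketch of an event dummy: regress returns on a related index plus a
# dummy that is 1 on the crash days, 0 otherwise. Data are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(0.0, 0.01, n)    # returns on a related equity index
crash = np.zeros(n)
crash[250:253] = 1.0            # three 'Black Monday' days

# Simulated dependent returns with a large negative shift on crash days
y = 0.0005 + 0.9 * x - 0.08 * crash + rng.normal(0.0, 0.003, n)

# OLS with constant a, slope b and dummy coefficient d
X = np.column_stack([np.ones(n), x, crash])
a, b, d = np.linalg.lstsq(X, y, rcond=None)[0]
# d is large and negative: on crash days the constant shifts to a + d,
# moving the regression line temporarily downwards
```

Without the crash column, the three extreme observations would sit in the residuals and inflate the residual variance, which is exactly the problem the dummy is designed to absorb.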

Dummy variables allow different regression lines to be estimated with the one set of data. For example, if there is a structural break in the data, such as in Figure A.7a, the structural break dummy D that is 0 before the break and 1 afterwards should be included in the model. There are a number of ways in which D can be included:

1. as an additional constant term, as in (A.4.3a), in which case the regression line shifts in parallel, as depicted in Figure A.10a;

2. as a change in the slope of the regression line, which is model (A.4.3b) and is shown in Figure A.10b;

3. or as both, which is model (A.4.3c) and is shown in Figure A.10c.

There are three ways to incorporate this type of dummy variable into the simple model:

Shift: Y = α + δD + βX + ε (A.4.3a)

Slope: Y = α + γDX + βX + ε (A.4.3b)

Shift + Slope: Y = α + δD + γDX + βX + ε (A.4.3c)

In many cases it is sufficient to use the dummy just to shift the regression constant, as in (A.4.3a). For example, day-of-the-week effects in daily data might use day-of-the-week dummies, MON (= 1 on Monday and 0 otherwise), TUE (= 1 on Tuesday and 0 otherwise), and so on. The constant term of the model becomes (α + α₁MON + α₂TUE + α₃WED + α₄THU), so that it takes the value α + α₁ on Mondays, α + α₂ on Tuesdays, α + α₃ on Wednesdays, α + α₄ on Thursdays, and α on Fridays. Note that only four daily dummies are used: using five dummies would introduce perfect multicollinearity.
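The dummy variable trap mentioned above is easy to verify numerically. In this sketch (with simulated daily returns), four day-of-the-week dummies plus a constant give a full-rank regressor matrix, while adding a fifth dummy makes the columns sum to the constant and the matrix rank-deficient.

```python
# Sketch of day-of-the-week dummies: four dummies plus a constant
# (five would be perfectly collinear with the constant). Simulated data.
import numpy as np

rng = np.random.default_rng(3)
n = 250                          # roughly one year of daily returns
day = np.arange(n) % 5           # 0 = Monday, ..., 4 = Friday

# Four dummies for Monday..Thursday; Friday is the base case
D = np.column_stack([(day == d).astype(float) for d in range(4)])

# Simulated returns with a small Monday effect
y = -0.001 * (day == 0) + rng.normal(0.0, 0.01, n)

X = np.column_stack([np.ones(n), D])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
# coef[0] is the Friday constant; coef[1+i] shifts it on day i

# Five dummies plus a constant: the dummy columns sum to the constant
# column, so the regressor matrix is rank-deficient (the 'dummy trap')
X_bad = np.column_stack([np.ones(n)] +
                        [(day == d).astype(float) for d in range(5)])
```

The rank check below shows X has full column rank 5, while X_bad has six columns but still only rank 5, so its OLS normal equations have no unique solution.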

Dummy variables should be viewed as necessary measures for data that have structural breaks, regime shifts or seasonalities. If dummies are omitted there will be residual problems that lead to inefficient parameter estimates on the real explanatory variables. However, if too many dummies are used the power of other explanatory variables may be reduced.


