shall have to be estimated using some consistent method. One method is to first apply OLS to each country model separately, each time obtaining e_j, the vector of residuals for country j, and then estimating each σ_ij as e_i'e_j/T.
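A minimal sketch of this two-step estimation, with hypothetical names and assuming every country model has the same number of observations T:

```python
import numpy as np

def residual_covariance(X_list, y_list):
    """Estimate the error covariances sigma_ij as e_i'e_j / T, where e_j are the
    OLS residuals of each country model fitted separately.
    X_list, y_list: per-country design matrices (T x k_j) and responses (length T)."""
    residuals = []
    for X, y in zip(X_list, y_list):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit for country j
        residuals.append(y - X @ b)                # residual vector e_j
    E = np.column_stack(residuals)                 # T x n matrix of residuals
    T = E.shape[0]
    return E.T @ E / T                             # entry (i, j) equals e_i'e_j / T
```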

Appendix 4 Data Problems

The rapid growth in new financial products and markets during the past decade means that good-quality data are difficult to obtain. In illiquid markets reliable daily data are not always available; inappropriate quotes might remain for several days, and in new markets data will only cover a recent period. Even in established markets one may question the reliability of data. Accountancy data from the banking or trading book can have significant measurement errors. Quoted prices may not have been traded, and when they are it is not always clear whether the quoted price is bid or offer. And often important model parameters have little or no empirical validation. So although a vast quantity of data are recorded for many financial markets, many problems may be associated with these data. This appendix deals with three of the most common data problems: highly collinear data, unreliable data, and missing data.

A.4.1 Multicollinearity

One of the problems that is common to all linear regression models is that explanatory variables can have a high degree of correlation between themselves. In this case it may not be possible to determine their individual effects. The problem is referred to as one of multicollinearity.

Perfect multicollinearity occurs when two (or more) explanatory variables are perfectly correlated. In this case the OLS estimators do not even exist, because X'X has less than full rank and so is not invertible.18 This is not really a problem. In fact it is just a fundamental mistake in the model specification: some linear transform of one of the explanatory variables has been included as another explanatory variable.

However, when there is a high degree of multicollinearity, with a large, but not perfect, correlation between some explanatory variables, there may be a real problem. The OLS estimators do exist, and they are still unbiased. They are also still the most efficient of all linear unbiased estimators. But that does not mean that their variances and covariances are small. They may be the most efficient, but when there is a high degree of multicollinearity the problem is that most efficient is still not very efficient.

18. The rank of a square matrix is the number of linearly independent rows (or columns). If the rank of an n × n matrix is less than n then the matrix has no inverse.

To see why this is so, note that if some of the variables in X are highly (but not perfectly) collinear, the X'X matrix will have some very large elements in the off-diagonals corresponding to these variables. Thus the determinant of X'X will be small, and this has the effect of increasing all elements of (X'X)⁻¹. From (A.1.13) the covariance matrix of the estimates is governed by the matrix (X'X)⁻¹. The estimated variances and covariances of the collinear variables, in particular, will be very large. The t-ratios on their coefficient estimates will be depressed, and OLS estimates of the coefficients of these collinear variables will fluctuate greatly, even when there are only small changes in the data. In short, multicollinearity implies a lack of robustness in OLS estimates.
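This effect is easy to see by simulation. The sketch below (illustrative values only) compares the OLS coefficient variances σ²(X'X)⁻¹ for a design with nearly independent columns and one with two highly collinear columns:

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma2 = 500, 1.0

def ols_coeff_variances(X):
    """Diagonal of sigma^2 (X'X)^{-1}: the OLS coefficient variances."""
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

x1 = rng.standard_normal(T)
x2_indep = rng.standard_normal(T)                    # uncorrelated with x1
x2_coll = 0.99 * x1 + 0.1 * rng.standard_normal(T)   # highly collinear with x1

X_indep = np.column_stack([np.ones(T), x1, x2_indep])
X_coll = np.column_stack([np.ones(T), x1, x2_coll])

print(ols_coeff_variances(X_indep))   # small, similar variances
print(ols_coeff_variances(X_coll))    # variances of the collinear pair are inflated
```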

Multicollinearity is not an all-or-nothing condition; it is a question of degree, so there is no formal test for multicollinearity. If the intercorrelations between certain explanatory variables are too high (a rule of thumb is that they should be no greater than the R² from the whole regression), multicollinearity can be severe enough to distort the model estimation procedure. In that case the simple solution is to drop one of the collinear variables and, if collinear variables remain, to continue removing them from the model until multicollinearity is no longer a problem. However, this may not be in line with the fundamental theory of the model. Another solution is to obtain more data, or different data on the same variables, but this may simply not be possible.
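As a rough diagnostic along the lines of this rule of thumb, one might compare the pairwise correlations of the regressors with the regression R². A hypothetical sketch:

```python
import numpy as np

def collinearity_warning(X, y):
    """Flag pairs of regressors whose correlation exceeds the regression R^2.
    X: T x k matrix of regressors (without an intercept column), y: length-T response."""
    Xc = np.column_stack([np.ones(len(y)), X])       # add intercept for the fit
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    r2 = 1.0 - resid.var() / y.var()                 # regression R^2
    corr = np.corrcoef(X, rowvar=False)              # pairwise regressor correlations
    k = X.shape[1]
    flagged = [(i, j, corr[i, j]) for i in range(k) for j in range(i + 1, k)
               if abs(corr[i, j]) > r2]
    return flagged, r2
```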

If none of these measures offers a feasible solution and the problem of multicollinearity persists, the model parameters can be estimated using the ridge estimator

b_r = (X'X + rD)⁻¹X'y,    (A.4.1)

where D is the diagonal matrix containing the diagonal terms of X'X and the constant r is as small as possible. The optimal value of r can be determined by re-estimating the model for increasing values of r until the estimates become stable. The justification for using the ridge estimator is that since multicollinearity increases the off-diagonal elements of X'X in particular, more efficient results can be obtained by augmenting the diagonal to be more in line with the off-diagonals. Although ridge estimators are biased, they will be more precise than the OLS estimators when the regressors are highly collinear.
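A minimal sketch of (A.4.1), with an illustrative function name and assuming X already includes any intercept column:

```python
import numpy as np

def ridge_a41(X, y, r):
    """Ridge estimator of equation (A.4.1): b_r = (X'X + rD)^{-1} X'y,
    where D holds the diagonal terms of X'X. Reduces to OLS when r = 0."""
    XtX = X.T @ X
    D = np.diag(np.diag(XtX))
    return np.linalg.solve(XtX + r * D, X.T @ y)

# Keep r as small as possible: increase it until the estimates stabilize, e.g.
# for r in (0.0, 0.01, 0.05, 0.1, 0.2):
#     print(r, ridge_a41(X, y, r))
```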

None of these measures for coping with multicollinearity is as powerful as that of principal component analysis (PCA). The use of PCA to cope with highly collinear explanatory variables is described in §6.4.1.
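As an illustration only (the full treatment is in §6.4.1): principal components of the standardized regressors are orthogonal by construction, so regressing on the first few components sidesteps the multicollinearity. A hypothetical sketch, assuming X holds the regressors without an intercept column:

```python
import numpy as np

def pca_regression(X, y, n_components):
    """Regress y on the first few principal components of X instead of on the
    collinear columns themselves (cf. Section 6.4.1)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the regressors
    eigval, eigvec = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigval)[::-1]                 # components by variance explained
    P = Xs @ eigvec[:, order[:n_components]]         # principal component scores
    Z = np.column_stack([np.ones(len(y)), P])
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)    # OLS on the orthogonal components
    return gamma
```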

A.4.2 Data Errors

The first step towards developing any model should be to plot all the available data. This will not only reveal something of the relationships between variables, but also identify any serious errors in the data. Often prices are recorded incorrectly, and these errors can pass through even when data vendors employ proper filtering procedures.
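As a crude illustration of the kind of filter one might apply alongside plotting, the following hypothetical sketch flags prices that jump by a large amount and then immediately reverse, a pattern more often produced by a mis-recorded quote than by a genuine move:

```python
import numpy as np

def flag_suspect_prices(prices, threshold=0.10):
    """Flag observations where the price jumps by more than `threshold` (in
    relative terms) and then immediately reverses."""
    p = np.asarray(prices, dtype=float)
    ret = np.diff(p) / p[:-1]
    suspects = [t + 1 for t in range(len(ret) - 1)
                if abs(ret[t]) > threshold
                and np.sign(ret[t + 1]) == -np.sign(ret[t])
                and abs(ret[t + 1]) > threshold]
    return suspects  # indices worth inspecting on the plot

# Example: the spike at index 3 that reverses immediately is flagged.
print(flag_suspect_prices([100, 101, 100.5, 140, 100.8, 101.2]))
```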

Accounting data often have to be revised after the auditing process, which itself takes a considerable time. But however careful the audit, it will never be possible to measure certain quantities with great accuracy because too many guesstimates have to be made. This does not happen only in book data. The accuracy of important data for credit risk and operational risk modelling leaves a lot to be desired. It is enormously difficult to obtain reliable data on credit spreads, default rates, ratings migrations, correlations and recovery rates, in fact on all the processes that constitute credit loss. Similarly, for many low-frequency but high-impact operational risks, data are extremely scarce.

This subsection examines the detrimental effect that data errors have on the parameter estimates of a linear regression model. Generally speaking, they induce an attenuation bias; that is, they decrease the size of the parameter estimates. To see this, suppose that data on an explanatory variable X* are not available, although data are available on X = X* + u, where u is an error process. The attenuation bias may be illustrated in the framework of the very simple model

Y = βX* + ε.

This model cannot be estimated, since no data are available on X*, so let us write the model in a form that can be estimated, as

Y = βX + v,

where the new error process is v = ε − βu. Now X is a stochastic regressor, so OLS will give unbiased estimates only if cov(X, v) = 0 (§A.1.3). However, cov(X, v) is not 0, since both X and v contain u and so they will be correlated. In fact

cov(X, v) = cov(X* + u, ε − βu)
= cov(X*, ε) + cov(u, ε) − cov(X*, βu) − cov(u, βu).

The first three terms are all zero, but cov(u, βu) = βV(u) and so cov(X, v) = −βV(u) is not zero.

This example shows that when there are errors in the data on explanatory variables, the assumptions necessary for OLS to be unbiased are violated. The bias is towards zero: cov(X, v) = −βV(u) has the opposite sign to β, so a positive coefficient is biased downwards. Nor is OLS consistent, and the coefficient estimates will remain biased towards zero even in very large samples.
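The attenuation is easy to verify by simulation. A minimal sketch (illustrative values only): even with a very large sample, the OLS slope settles near βV(X*)/(V(X*) + V(u)) rather than β.

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta = 100_000, 2.0

x_true = rng.standard_normal(T)              # the unobserved regressor X*
y = beta * x_true + rng.standard_normal(T)   # Y = beta * X* + eps
u = rng.standard_normal(T)                   # measurement error
x_obs = x_true + u                           # what is actually observed: X = X* + u

def ols_slope(x, y):
    """Simple-regression OLS slope: cov(x, y) / var(x)."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

print(ols_slope(x_true, y))   # close to beta = 2.0
print(ols_slope(x_obs, y))    # close to beta * V(X*)/(V(X*) + V(u)) = 1.0
```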

On the other hand, if the dependent variable is badly measured this does not cause problems with the OLS estimators: They will still be unbiased and


