The errors affecting a continuous sequence of quotes cannot be sufficiently filtered by the means described in the previous sections; they pose a special challenge to filtering. The danger is that the continuous stream of false quotes is accepted as valid after a while because this false series appears internally consistent.

A filtering hypothesis is characterized by one general assumption on an error affecting all its quotes. This can lead to another unusual property. Sometimes the cause of the error is so clear and the size of the error so obvious that quotes can be corrected. In these cases, the filter produces not only credibilities and filtering reasons but also corrected quotes that can be used in further applications. This is discussed further below.

The errors leading to a filtering hypothesis are rare. Before discussing the details, we should evaluate the relevance of this filtering element in general. Such an evaluation may lead to the conclusion that the filtering hypothesis algorithm is not necessary in a new implementation of the filter.

Decimal errors were the dominant error type in the page-based data feed from Reuters in 1987-1989. In later years, they became rare; they hardly exist in modern data feeds. The few remaining decimal errors in the 1990s were often of short duration, so they could also be filtered successfully by the standard data filter. Thus there is no convincing case for adding a decimal error filter algorithm to a filter of modern data. A decimal error filter is needed only if old, historical data have to be cleaned.
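As an illustration of the idea, the following minimal sketch shows how a decimal-error hypothesis might detect and correct a quote whose price differs from the last valid price by roughly a power of ten. The function name, the reference price, and the tolerance are our own illustrative assumptions, not part of the original filter specification.

```python
import math

def correct_decimal_error(price, previous_price, tolerance=0.15):
    """Sketch of a decimal-error hypothesis: if a quote differs from the
    previous valid price by roughly a power of ten, assume a misplaced
    decimal point and return a corrected value; otherwise return None.
    The tolerance (in log10 units) is illustrative, not a book parameter."""
    if price <= 0 or previous_price <= 0:
        return None
    log_ratio = math.log10(price / previous_price)
    shift = round(log_ratio)          # nearest integer power of ten
    if shift != 0 and abs(log_ratio - shift) < tolerance:
        return price / 10 ** shift    # undo the suspected decimal shift
    return None

# Example: previous price 1.5825, new quote 15.825 -> corrected to 1.5825
print(correct_decimal_error(15.825, 1.5825))
```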

The scaling filter is also superfluous if the user of the filter organizes the raw data well. If a currency is rescaled (e.g., 1000 old units = 1 new unit, as in the case of the Russian Ruble), a company with good data handling rules will not need the data cleaning filter to detect this; the rescaling will be handled appropriately before the data are passed to the filter. Rescaled currencies (or equity quotes after a stock split) can be treated as a new time series. However, the transition between the two definitions may not be abrupt, and there may be a mixture of quotes of both scaling types for a while. A scaling analysis within the filter can serve as an additional element of safety to treat this case and detect unexpected scale changes.
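The following sketch illustrates how such a scaling analysis might classify quotes during a transition period with mixed scaling types. The rescaling factor, the relative tolerance, and the reference level are illustrative assumptions, not values from the book.

```python
def classify_scale(price, reference_level, factor=1000.0, rel_tol=0.5):
    """Sketch of a scaling hypothesis for a rescaled currency
    (e.g., 1000 old units = 1 new unit). Classifies each quote as
    'old' or 'new' scale relative to a running reference level, so a
    transition period with mixed scaling types can be handled.
    All names and thresholds here are illustrative assumptions."""
    if abs(price / reference_level - 1.0) < rel_tol:
        return "old"                       # consistent with current scale
    if abs(price * factor / reference_level - 1.0) < rel_tol:
        return "new"                       # consistent with rescaled quotes
    return "outlier"                       # fits neither scaling hypothesis

# Mixed stream around a 1000:1 redenomination of a hypothetical currency:
for p in [5800.0, 5810.0, 5.81, 5795.0, 5.80]:
    print(p, classify_scale(p, reference_level=5800.0))
```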

There is the possibility of having coexisting hypotheses, for example, the hypothesis of having a decimal error and the hypothesis of having none. If an immediate decision in favor of one hypothesis is always made, there is no need to store two coexisting hypotheses. Note that the filtering hypothesis algorithms are executed for each new quote before quote splitting.
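A minimal sketch of how coexisting hypotheses might be stored and resolved, assuming that conflicts are decided in favor of the hypothesis with the highest overall credibility (as in the real-time mode of Section 4.7.2); the class and attribute names are illustrative, not the book's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class FilterHypothesis:
    """One filtering hypothesis (e.g., 'decimal error' vs. 'no error').
    Illustrative structure, not the book's implementation."""
    name: str
    overall_credibility: float = 1.0   # running credibility of the hypothesis
    corrected_quotes: list = field(default_factory=list)

def winning_hypothesis(hypotheses):
    # The hypothesis with the highest overall credibility wins.
    return max(hypotheses, key=lambda h: h.overall_credibility)

hypotheses = [FilterHypothesis("no error", 0.93),
              FilterHypothesis("decimal error", 0.41)]
print(winning_hypothesis(hypotheses).name)   # -> "no error"
```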

4.7.1 The Results of Univariate Filtering

The output of the univariate filter consists of several parts. For every quote entered, the following filtering results are available:

1. The credibility of the quote

2. The value(s) of the quote, possibly corrected according to a filtering hypothesis such as a scaling factor or a decimal error as explained in Section 4.2.2



3. The filtering reason, explaining why the filter has rejected a quote

4. Individual credibilities of scalar quotes (bid, ask, spread)

Users may only want a minimum of results, perhaps just a yes/no decision on using or not using the quote. This can be obtained by simply checking whether the credibility of the quote is above or below a threshold value, which is usually chosen to be 0.5.

In the case of bid-ask data, the credibility of the full quote has to be determined from the credibilities of the scalar quotes, usually applying the following formula:

$$ C = \min(C_{\mathrm{bid}}, C_{\mathrm{ask}}, C_{\mathrm{spread}}) \qquad (4.46) $$

This formula is conservative and safe; valid quotes are meant to be valid in every respect. The timing of the univariate filtering output depends on whether it is in a historical or real-time mode.
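Equation 4.46 and the threshold test translate directly into code. The following is a minimal sketch; the function names and example credibilities are ours.

```python
def full_quote_credibility(c_bid, c_ask, c_spread):
    """Equation 4.46: the credibility of a full bid-ask quote is the
    minimum of the credibilities of its scalar quotes."""
    return min(c_bid, c_ask, c_spread)

def accept_quote(c_bid, c_ask, c_spread, threshold=0.5):
    """Minimal yes/no decision: accept the quote if its credibility
    exceeds the usual threshold of 0.5."""
    return full_quote_credibility(c_bid, c_ask, c_spread) > threshold

print(accept_quote(0.9, 0.8, 0.3))   # False: the spread is not credible
```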

4.7.2 Filtering in Historical and Real-Time Modes

The terms "historical" and "real-time" are defined from the perspective of filtering here. A filter in real-time mode may be applied in a historical test. The two modes differ in their timing:

In the real-time mode, the credibilities of a newly included quote resulting from Equations 4.38 and 4.1 are immediately passed to the univariate filtering unit. If there is only one filtering hypothesis, these credibilities are directly accessible to the user. If there are several hypotheses, the hypothesis with the highest overall credibility will be chosen.

In the case of historical filtering, the initially produced credibilities are modified by the advent of new quotes. Only those quotes are output whose credibilities are finally determined. At that time, the quotes leave the full-quote filtering window and this implies that their components have also left the corresponding scalar filtering windows. If several filtering hypotheses coexist, their full-quote windows do not dismiss any quotes and so we get filtering results only when conflicts between filtering hypotheses are finally resolved in favor of one winning hypothesis.

Although these modes are different, their implementation and selection are easy. In the historical mode, we retrieve the oldest member of the full-quote window only after testing whether this oldest quote and its results are ready. In the real-time mode, we pick the newest member of the same full-quote window. Thus it is possible to get both modes from the same filter run.
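A minimal sketch of a full-quote window serving both modes, under the simplifying assumption that each stored quote carries a flag marking its credibility as finally determined; all names are illustrative.

```python
from collections import deque

class FullQuoteWindow:
    """Sketch of a full-quote filtering window serving both timing modes.
    The 'ready' flag stands in for 'credibility finally determined' and
    is an illustrative simplification of the book's mechanism."""

    def __init__(self):
        self._quotes = deque()          # [quote, credibility, ready]

    def push(self, quote, credibility):
        self._quotes.append([quote, credibility, False])

    def newest(self):
        """Real-time mode: the newest member, credibility provisional."""
        return self._quotes[-1]

    def pop_ready(self):
        """Historical mode: release the oldest members only once their
        credibilities can no longer be modified by newer quotes."""
        while self._quotes and self._quotes[0][2]:
            yield self._quotes.popleft()

    def flush(self):
        """Output everything, even quotes whose credibilities are not
        yet final (the special option for the end of a historical run)."""
        while self._quotes:
            yield self._quotes.popleft()
```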

Historical filtering should offer a special option for obtaining the last quotes and their results when the analysis reaches the most recent available quote. It should be possible to output the full-quote window (of the dominant filtering hypothesis) for that purpose, even if the credibilities of its newest quotes are not yet final.



This leads to another timing mode that might frequently occur in practice. A real-time filter might be started from historical data. In this case, we start the filter in historical mode, flush the full-quote window as soon as the filter time reaches real time, and then continue in real-time mode. This can be implemented as a special mode if such applications are likely.
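Building on the window sketch above, the mixed timing mode might be organized as follows; the quote representation and the credibility callback are placeholders for the actual filter computation.

```python
def run_from_history(quotes, realtime_start, window, credibility_of):
    """Mixed timing mode sketch, building on the FullQuoteWindow above:
    filter historically, flush the window once the filter time reaches
    real time, then continue in real-time mode. All names are ours."""
    flushed = False
    for time, quote in quotes:
        window.push(quote, credibility_of(quote))
        if time < realtime_start:
            yield from window.pop_ready()     # historical mode
        elif not flushed:
            yield from window.flush()         # one-time transition
            flushed = True
        else:
            yield window.newest()             # real-time mode
```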

4.7.3 Choosing the Filter Parameters

The filter algorithm as a whole depends on many configuration parameters. Table 4.5 summarizes the definitions and explanations. The parameters are listed in the sequence of their appearance in Chapter 4. Some less important parameters have no symbol and appear directly as numbers in the text; nevertheless they have been included in Table 4.5. The same parameter values can be chosen for the different financial markets. Tests have shown that we need no parameter adjustments because the adaptive algorithm successfully adjusts to different financial instruments.

Filter users may choose the parameter values in order to obtain a filter with properties suited to their needs. A higher value of ξ₀ in Equation 4.11, for instance, will lead to a more tolerant filter. For a sensitivity test, we define different filters, for example, a weak (tolerant) filter and a strong (fussy) filter. This is explained in Section 4.9.
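For illustration, a weak and a strong filter could be defined as two parameter sets differing in their tolerance; the values below are placeholders, not the calibration of Table 4.5.

```python
from dataclasses import dataclass

@dataclass
class FilterConfig:
    """Illustrative parameter set for a sensitivity test; the values
    used below are placeholders, not the book's calibration."""
    xi_0: float                         # tolerance of Equation 4.11;
                                        # higher values -> more tolerant
    credibility_threshold: float = 0.5

weak_filter = FilterConfig(xi_0=8.0)     # tolerant ("weak") filter
strong_filter = FilterConfig(xi_0=3.0)   # fussy ("strong") filter
```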

4.8 SPECIAL FILTER ELEMENTS

The filter described so far is flexible enough for most cases, but not for some of the special error types presented at the end of Section 4.2.2. These errors can be identified by additional algorithmic elements, which are discussed by Müller (1999). Moreover, there can be disruptive events such as the redefinition of financial instruments that pose some additional problems. For these rare cases, the data cleaning environment should provide the possibility of human intervention.

4.8.1 Multivariate Filtering: Filtering Sparse Data

Multivariate filtering is a concept that has not been used in the empirical results of this book, and univariate filtering as described in Section 4.7 remains the highest algorithmic level. Multivariate filtering requires more complex and less modular software than univariate filtering, but it seems to be the only way to filter very sparse time series with unreliable quotes. Some concepts of a possible implementation are presented here.

In the financial markets, there is a quite stable structure of only slowly varying correlations between financial instruments. In risk management software packages, a large, regularly updated covariance matrix is used to keep track of these correlations. Covariance matrices between financial instruments can also be applied in the data cleaning of sparse quotes. Although univariate filtering methods work well for dense quotes, they lose a large part of their power when the density


