A neighborhood of quotes, called the filtering window, is needed to judge the credibility of a quote. Such a data window can grow and shrink according to data quality.
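To make the adaptive window concrete, here is a minimal Python sketch, assuming a deque-based window whose target size grows on credible ticks and halves on rejected ones; the size bounds and the halving rule are illustrative assumptions, not prescriptions from the text.

```python
from collections import deque

class AdaptiveWindow:
    """Filtering window that grows when data quality is good and
    shrinks when it deteriorates.  Size bounds and the halving rule
    are illustrative assumptions."""

    def __init__(self, min_size=10, max_size=200):
        self.min_size = min_size
        self.max_size = max_size
        self.target = min_size          # current target window size
        self.ticks = deque()

    def add(self, tick, credible):
        self.ticks.append(tick)
        if credible:
            # good data: let the window grow toward its maximum
            self.target = min(self.max_size, self.target + 1)
        else:
            # poor data: shrink so old quotes lose influence faster
            self.target = max(self.min_size, self.target // 2)
        while len(self.ticks) > self.target:
            self.ticks.popleft()        # drop the oldest quotes
```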

Quotes with a complex structure (e.g., bid-ask) are split into scalar variables to be filtered separately. The filtered variables are derived from the raw variables (e.g., the logarithm of a bid price or the bid-ask spread). Some special error types may also be analyzed for full quotes before data splitting.
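A minimal sketch of such a splitting step, assuming a bid-ask quote; the chosen derived variables (log prices and the raw spread) follow the examples above, but other choices are possible.

```python
import math

def split_quote(bid, ask):
    """Split a bid-ask quote into scalar variables to be filtered
    separately.  The set of derived variables is a design choice."""
    return {
        "log_bid": math.log(bid),   # filtered variable for the bid
        "log_ask": math.log(ask),   # filtered variable for the ask
        "spread": ask - bid,        # bid-ask spread as its own series
    }
```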

Numerical methods with convergence problems (such as model estimation and nonlinear minimization) are avoided. The chosen algorithm produces well-defined results in all situations.

The filter needs to be computationally fast. This requirement excludes algorithms starting from scratch for each new incoming tick. The chosen algorithm is sequential and iterative. It uses the existing filter information base when a new tick arrives, with a minimum amount of updating.
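The following sketch illustrates this sequential, iterative style, assuming an information base consisting of an exponentially weighted mean and variance that is updated in O(1) per tick; the decay constant and the 3-sigma credibility test are illustrative assumptions.

```python
class SequentialFilter:
    """Sequential filter skeleton: each new tick is judged against an
    incrementally maintained information base, so no computation
    starts from scratch when a new tick arrives."""

    def __init__(self, decay=0.05):
        self.decay = decay
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Judge the new value x and update the information base in O(1)."""
        if self.mean is None:
            self.mean = x
            return True                 # first tick: nothing to compare with
        dev = x - self.mean
        credible = self.var == 0.0 or dev * dev <= 9.0 * self.var
        # minimal update, no re-scan of history (a full filter would
        # down-weight rejected ticks rather than absorb them fully)
        self.mean += self.decay * dev
        self.var += self.decay * (dev * dev - self.var)
        return credible
```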

The filter has two modes: real-time and historical. Thanks to the windowing technique, both modes are supported by the same filter. In historical filtering, the final validation of a quote is delayed until its successor quotes have been seen.
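A sketch of the historical mode, assuming a hypothetical predicate judge(successors, tick) that delivers the final verdict once successor quotes are available; the delay of five ticks is an arbitrary illustration.

```python
from collections import deque

def historical_filter(ticks, judge, delay=5):
    """Historical-mode sketch: the verdict on a tick is finalized only
    after up to `delay` successor ticks have been seen, so later
    evidence can overturn the first impression."""
    pending = deque()
    for tick in ticks:
        pending.append(tick)
        if len(pending) > delay:
            candidate = pending.popleft()
            if judge(list(pending), candidate):   # successors are known
                yield candidate
    while pending:                                # flush the tail
        candidate = pending.popleft()
        if judge(list(pending), candidate):
            yield candidate
```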

4.2 DATA AND DATA ERRORS

4.2.1 Time Series of Ticks

The object of data cleaning is a time series of ticks. The term "tick" stands for "quote" in a very general sense: any variable that is quoted, from any origin and for any financial instrument. The time-ordered sequence of ticks is inhomogeneous in the general case where the time intervals between ticks vary in size. Normally, one time series is filtered independently from other series. The multivariate cleaning of several time series together is discussed in Section 4.8.1.

The ticks of the series must be of the same type. They may differ in the origins of the contributors, but should not differ in important parameters such as the maturity (of interest rates, etc.) or the moneyness (of options or implied volatilities). If a data feed provides bid or ask quotes (or transaction quotes) alternately in random sequence, we advise splitting the data stream into independent bid and ask streams. Normal bid-ask pairs, however, are appropriately handled inside the filter.
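A minimal splitting sketch, under the assumption that each tick carries a "side" field marking it as bid or ask; the field names are hypothetical.

```python
def split_streams(ticks):
    """Split a mixed feed into independent bid and ask streams.
    The 'side' field is an assumed convention of the feed."""
    bids = [t for t in ticks if t["side"] == "bid"]
    asks = [t for t in ticks if t["side"] == "ask"]
    return bids, asks
```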

The following data structure of ticks is assumed (a sketch in code follows the list):

1. A time stamp.

2. The tick level(s) of which the data cleaning algorithm supports two kinds:

(a) Data with one level (a price or transaction volume, etc.), such as a stock index.



(b) Data with two levels: bid-ask pairs, such as foreign exchange (fx) spot rates.

3. Information on the origin of the tick, e.g., an identification code of the contributor (a bank or broker). For some financial instruments, notably those traded at an exchange, this information is trivial or not available.
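The assumed structure can be summarized in a small record type; this is an illustrative sketch, not a format prescribed by the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tick:
    """Illustrative tick record matching the assumed structure."""
    time: float                    # 1. time stamp, e.g., seconds since epoch
    level: float                   # 2a. single level, or the bid of a pair
    second_level: Optional[float] = None   # 2b. ask of a bid-ask pair
    origin: Optional[str] = None   # 3. contributor id; may be unavailable
```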

A data feed may provide some other information which is not utilized by the filter.

4.2.2 Data Error Types

A data error is a piece of quoted data that does not conform to the real situation of the market. A price quote has to be identified as a data error if it is neither a correctly reported transaction price nor a possible transaction price at the reported time. We have to tolerate some transmission time delays and small deviations, especially in the case of indicative prices.

There are many causes for data errors. The errors can be separated into two classes:

1. Human errors: Errors directly caused by human data contributors, for different reasons:

Unintentional errors, such as typing errors

Intentional errors, such as dummy ticks produced just for technical testing

2. System errors: Errors caused by computer systems, their interactions and failures

Human operators have the ultimate responsibility for system errors. However, the distance between the data error and the responsible person is much larger for system errors than for "human" errors. In many cases, it is impossible to find the exact reason for a data error even if a tick is very aberrant. The task of the filter is to identify such outliers whatever the reason. Sometimes the cause of the error can be guessed from the particular behavior of the bad ticks. This knowledge of the error mechanism can help to improve filtering and, in a few cases, allow the correction of bad ticks.

The following error types are so particular that they need special treatment.

1. Decimal errors: Failure to change a "big" decimal digit of the quote. For instance, a bid price of 1.3498 is followed by a true quote 1.3505, but the published, bad quote is 1.3405. This error is most damaging if the quoting software is using a cache memory somewhere. The wrong decimal digit may stay in the cache and cause a long series of bad quotes. Around 1988, this was a dominant error type.

2. "Test": Some data contributors sometimes send test ticks to the system, usually at times when the market is not liquid. These test ticks can cause a lot of damage because they may look plausible to the filter, at least initially. Two important examples follow:



"Early morning test": A contributor sends a bad tick very early in the morning to test whether the connection to the data distributor is operational. If the market is inactive overnight, no trader would take this test tick seriously. For the filter, such a tick may be a major challenge. The filter has to be very critical to first ticks after a data gap-

Monotonic series: Some contributors test the performance and the time delay of their data connection by sending a long series of linearly increasing ticks at inactive times such as overnight or during a weekend. This is hard for the filter to detect because tick-by-tick returns look plausible. Only the monotonic behavior in the long run can be used to identify the fake nature of this type of data.

3. Repeated ticks: Some contributors let their computers repeat the last tick in more or less regular time intervals. This is harmless if it happens in a moderate way. In some markets with coarse granularity of tick values (such as short-term interest rate futures), repeated tick values are quite natural. However, some contributors repeat old ticks thousands of times with high frequency, thereby obstructing the validation of the few good ticks produced by other, more reasonable contributors.

4. Tick copying: Some contributors employ computers to copy and re-send the ticks of other contributors, as explained in Section 2.2.3. If these ticks are on a correct level, a filter has no reason to care, with one exception. Some contributors run computer programs to produce slightly modified ticks by adding small random corrections. Such slightly varying copy ticks are damaging because they obstruct the clear identification of fake monotonic or repeated series made by other contributors.

5. Scaling problem: Quoting conventions may differ or be officially redefined in some markets. Some contributors may quote the value of 100 units, others the value of 1 unit. Scaling factors are often integer powers of 10, but other values may occur (for stock splits in equity markets). The filter will run into this problem "by surprise" unless a human filter user anticipates all scale changes and preprocesses the data accordingly.

A complete data cleaning tool has to include algorithmic elements to deal with each of these special error types.
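As illustrations of such algorithmic elements, here are heuristic sketches for three of the error types above: decimal errors, monotonic test series, and scaling problems. All thresholds, digit positions, and tolerances are assumptions chosen for illustration.

```python
import math

def decimal_error_candidate(prev, new, decimals=4, tol=0.0015):
    """Decimal-error heuristic: flag the new quote if correcting a
    single decimal digit would bring it within `tol` of the previous
    quote.  Returns the repaired value, or None if no single-digit
    repair helps."""
    if abs(new - prev) <= tol:
        return None                       # plausible as quoted
    s = f"{new:.{decimals}f}"
    for i, ch in enumerate(s):
        if not ch.isdigit():
            continue                      # skip the decimal point
        for d in "0123456789":
            if d != ch:
                repaired = float(s[:i] + d + s[i + 1:])
                if abs(repaired - prev) <= tol:
                    return repaired       # likely intended quote
    return None

def trailing_monotonic_run(prices):
    """Monotonic-series heuristic: length of the strictly increasing
    run at the end of the series.  A very long run at an inactive
    time is suspicious even though every tick-by-tick return looks
    plausible."""
    run = 1
    for i in range(len(prices) - 1, 0, -1):
        if prices[i] > prices[i - 1]:
            run += 1
        else:
            break
    return run

def scale_factor(prev, new, rel_tol=0.05):
    """Scaling-problem heuristic: detect whether the new quote differs
    from the previous one by an integer power of 10 (quotes assumed
    positive; other factors, e.g., stock splits, need extra rules)."""
    ratio = new / prev
    exp = round(math.log10(ratio))
    if exp != 0 and abs(ratio / 10.0 ** exp - 1.0) < rel_tol:
        return 10.0 ** exp
    return None
```

For instance, with the decimal-error example quoted above, decimal_error_candidate(1.3498, 1.3405) recovers 1.3505 as the likely intended quote.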

4.3 GENERAL OVERVIEW OF THE FILTER

4.3.1 The Functionality of the Filter

The flowcharts in Figure 4.1 illustrate some typical applications of a data cleaning filter in a larger context. Normal users simply want to eliminate "invalid" data from the stream, but the chart on the right-hand side shows that the filter can also deliver more information on the ticks and their quality.
