
To avoid the spurious singularity when $t_{j-1} = t_j$, the MA operator has to be evaluated with the next-point interpolation (see Equation 3.52). This makes the computation numerically stable even when an extremely small value of $\delta t_0$ in Equation 3.89 is chosen.

Figure 3.17 shows the behavior of $A$ in our example week. At first glance, the activity $A$ looks rather different from the tick frequency $f$. Yet an interesting feature of this definition is that it is equivalent to the tick frequency $f$ when the MA operator is a rectangular moving average with $\tau = \delta t/2$ (this is easy to prove by computing the integral of the piecewise constant function $a$). However, the activity $A$ has some advantages: it is much simpler to compute on a moving window, and the weighting function of the past can be controlled through the choice of the MA kernel.
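The equivalence with the tick frequency can be checked numerically. The sketch below is only an illustration under stated assumptions: the tick rate is taken to be the reciprocal of the inter-tick interval, held constant over each interval (next-point interpolation), because its definition is not restated on this page. The function names, the window length `dt`, and the sample tick times are hypothetical.

```python
from bisect import bisect_right

def tick_frequency(times, t, dt):
    """Number of ticks per unit time in the window (t - dt, t]."""
    count = bisect_right(times, t) - bisect_right(times, t - dt)
    return count / dt

def activity_rectangular(times, t, dt):
    """Rectangular average over (t - dt, t] of a piecewise constant tick rate.

    Assumption: the rate equals 1 / (t_j - t_{j-1}) on (t_{j-1}, t_j].
    """
    total = 0.0
    for prev, cur in zip(times, times[1:]):
        lo, hi = max(prev, t - dt), min(cur, t)
        # Zero-length intervals (t_{j-1} == t_j) never satisfy hi > lo,
        # so they are skipped rather than dividing by zero.
        if hi > lo:
            total += (hi - lo) / (cur - prev)
    return total / dt

# Hypothetical, irregularly spaced tick times (sorted).
times = [0.0, 1.0, 1.5, 4.0, 4.25, 6.0]
f = tick_frequency(times, 6.0, 5.0)
A = activity_rectangular(times, 6.0, 5.0)
print(f, A)
```

Each complete inter-tick interval contributes exactly 1 to the integral, so the integral counts the ticks in the window. The two numbers agree exactly here because both window edges coincide with tick times; otherwise they differ by the fractional contribution of the partially covered intervals.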



ADAPTIVE DATA CLEANING

4.1 Introduction: Using a Filter to Clean the Data

High-frequency data are commercially transmitted as real-time information to human users, usually traders. These data users are professionals who know the context (e.g., the market state and the likely level of a quoted price). If bad data are transmitted, professional users immediately recognize them and implicitly clean the data by using information from their personal information set. They do not need additional human or computerized input to check the correctness of the data.

The situation changes if the data users are different, such as researchers investigating historical high-frequency data or computer algorithms that extract real-time information for a given purpose (e.g., a trading algorithm, risk assessment). If bad quotes are used, the results are inevitably bad and totally unusable in the case of aberrant outliers. In the experience of the authors and many other researchers, almost every high-frequency data source contains some bad quotes. Data cleaning is a necessity; it has nothing to do with manipulation or cosmetics.

Data cleaning is a very technical topic. Readers interested in economic results rather than methods and researchers enjoying the privilege of possessing cleaned high-frequency data may skip the remainder of Chapter 4.



A data cleaning methodology requires some criteria to decide on the correctness and possible elimination of quotes. As long as the data set is not too large, human judgment may be a sufficient criterion. In this book, however, we focus on high-frequency data with thousands and millions of observations. Therefore, the criteria have to be formalized through a statistical model that can be implemented as a computer algorithm. Such an algorithm is called a data filter. In this chapter, the term "filter" is exclusively used within the context of data cleaning and the term "filtering" is a synonym of "cleaning." Data cleaning is done as a first, independent step of analysis, before applying any time series operator as studied in Chapter 3 and before statistically analyzing the resulting time series. We choose this approach because it is universally applicable, regardless of the type of further analysis. There is a less favorable alternative to prior data filtering: robust statistics, where all the data (including outliers) are included in the main statistical analysis. The methods of robust statistics depend on the nature of the analysis and are not universally applicable.

Cleaning a high-frequency time series is a demanding, often underestimated task. It is complicated for several reasons:

The variety of possible errors and their causes

The variety of statistical properties of the filtered variables (distribution functions, conditional behavior, structural breaks)

The variety of data sources and contributors of different reliability

The irregularity of time intervals (sparse/dense data, sometimes data gaps of long duration)

The complexity and variety of the quoted data as discussed in Chapter 2: transaction prices, indicative prices, FX forward points (where negative values are allowed), interest rates, figures from derivative markets, transaction volumes, bid-ask quotes versus single-valued quotes

The necessity of real-time filtering (some applications need instant information before seeing successor quotes)

The data cleaning algorithm presented here is adaptive; it is also presented in Müller (1999). The algorithm learns from the data while sequentially cleaning a time series. It continuously updates its information base in real time.

Further guidelines are needed in a filtering methodology:

The cause of data errors is rarely known. Therefore the validity of a quote is judged according to its plausibility given the statistical properties of the series.[1]

[1] We have to distinguish true, plausible movements from spurious movements due to erroneous quotes. Brock and Kleidon (1992) suggest decomposing observed movements in the data according to three causes: (1) erroneous quotes, (2) bid-ask spread dynamics due to the pressure on trading factors, and (3) other economic forces.
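To make the ideas of adaptivity and plausibility concrete, here is a deliberately minimal sketch of a sequential filter. It is not the algorithm of this chapter or of Müller (1999); it only illustrates the principle of judging each quote against statistics learned from previously accepted quotes. The class name, the exponentially weighted variance estimate, and the parameters `threshold`, `decay`, and `initial_var` are all assumptions made for the illustration.

```python
import math

class SimpleAdaptiveFilter:
    """Toy sequential filter (hypothetical, not the book's method).

    A quote is accepted if its change from the last accepted quote is
    within `threshold` standard deviations, where the variance of
    changes is an exponentially weighted estimate updated in real time.
    """

    def __init__(self, threshold=5.0, decay=0.05, initial_var=1e-6):
        self.threshold = threshold  # plausibility band in standard deviations
        self.decay = decay          # weight given to the newest squared change
        self.var = initial_var      # assumed prior variance of quote changes
        self.last = None            # last accepted quote

    def update(self, value):
        """Return True if the quote is judged plausible, False otherwise."""
        if self.last is None:
            # The first quote cannot be judged; accept it as the reference.
            self.last = value
            return True
        change = value - self.last
        plausible = abs(change) <= self.threshold * math.sqrt(self.var)
        if plausible:
            # Learn only from accepted quotes.
            self.var = (1 - self.decay) * self.var + self.decay * change ** 2
            self.last = value
        return plausible

# Hypothetical quote series with one decimal error (11.003).
quotes = [1.1000, 1.1002, 1.1001, 1.1003, 11.003, 1.1004, 1.1002]
flt = SimpleAdaptiveFilter()
flags = [flt.update(q) for q in quotes]
print(flags)
```

Updating the statistics only with accepted quotes keeps an outlier from inflating the variance estimate. The flip side is that a genuine level shift would be rejected indefinitely, which this sketch does not attempt to handle.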


