FIGURE 4.1 Data cleaning (filtering): Normal users want to eliminate bad ticks from the application (left chart). In special cases, users want to know filtering results such as the credibility or the reason for rejecting a tick (right chart).

A filter has some configuration parameters depending on the type of instrument, as will be shown later. Once it is created, it performs the following operations:

1. It receives financial ticks in the ordered sequence of their time stamps.
2. It delivers the same ticks in the same ordered sequence, plus the filter results.

For each tick, the following results are delivered:

- Credibility values of the tick and of its individual elements (such as bid, ask, and bid-ask spread); the credibility is defined between 0 (totally invalid) and 1 (totally valid)
- The value(s) of the tick, whose errors can possibly be corrected in some cases where the error mechanism is well known
- The "filtering reason," which is a formalized piece of text explaining why the filter has rejected (or corrected) the tick

Normal users use only those (possibly corrected) ticks with a credibility exceeding a threshold value (often chosen to be 0.5). They ignore all invalid ticks and all side results of the filter such as the filter report.

The timing of the filter operations is nontrivial. In real-time operation, the result of a filter is used right after the tick has entered the filter. In historical operation, the user takes the corrected result after the filter has seen a few newer ticks and adapted the credibility of older ticks.

The filter needs a build-up period to learn from the data. This is natural for an adaptive filter. If the data cleaning operation starts at the first available tick (the beginning of the data series), the build-up means running the filter for a few weeks from this point, storing a set of statistical variables in preparation for restarting the filter from the first available tick. The filter will then be well adapted because it can use the previously stored statistical variables. If the data cleaning operation starts at some later point in time, the natural build-up period is the period immediately preceding the first tick needed.
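To make the shape of the per-tick filter output described above more concrete, the sketch below models a result record (credibility values, possibly corrected prices, and the filtering reason) together with the typical "normal user" policy of keeping only ticks whose credibility exceeds a threshold such as 0.5. The class and field names are illustrative assumptions, not part of the original algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FilterResult:
    """Per-tick filter output, as described above (names are illustrative)."""
    timestamp: float            # time stamp of the tick
    credibility: float          # overall credibility in [0, 1]
    bid_credibility: float      # credibilities of individual elements
    ask_credibility: float
    spread_credibility: float
    corrected_bid: Optional[float] = None   # set when a known error mechanism allows correction
    corrected_ask: Optional[float] = None
    filtering_reason: str = ""  # formalized text explaining a rejection or correction

def accepted_ticks(results: List[FilterResult],
                   threshold: float = 0.5) -> List[FilterResult]:
    """A 'normal user' keeps only the (possibly corrected) ticks whose
    credibility exceeds the threshold and ignores all side results."""
    return [r for r in results if r.credibility > threshold]
```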
The filtering algorithm can be seen as one whole block that can be used several times in a data flow such as the following:

- Mixing already filtered data streams from several sources, where the mixing result is again filtered. The danger is that the combined filters reject too many quotes, especially in the real-time filtering of fast moves (or price jumps).
- Filtering combined with computational blocks: raw data -> filter -> computational block -> filter -> application. Some computational blocks, such as cross rate or yield curve computations, require filtered input and produce an output that the user may again want to filter.

Repeated filtering of the same time series is rather dangerous because it may lead to too many rejections of quotes. If it cannot be avoided, only one of the filters in the chain should be of the standard type. The other filter(s) should be configured to be weak (i.e., they should eliminate no more than the obviously aberrant outliers).

4.3.2 Overview of the Filtering Algorithm and Its Structure

The filtering algorithm is structured in a hierarchical scheme of subalgorithms. Table 4.1 gives an overview of this structure for a univariate filter for one financial instrument. A higher hierarchy level can be added at the top of Table 4.1 for multivariate filtering, as discussed in Section 4.8.1. Details of the different algorithmic levels are explained in the next sections. The sequence of these sections follows Table 4.1, from bottom to top. Some special filter elements are not treated there but are briefly described in Section 4.8.

4.4 BASIC FILTERING ELEMENTS AND OPERATIONS

The first element to be discussed in a bottom-to-top specification is the scalar filtering window. Its position in the algorithm is shown in Table 4.1. The basic filtering operations utilize the quotes in the simplified form of scalar quotes consisting of the following:

1. The time stamp
2. One scalar variable value to be filtered (e.g., the logarithm of a bid price), here denoted by x
3. The origin of the quote (as in the full quote of Section 4.2.1)

The basic operations can be divided into two types:

1. Filtering of single scalar quotes: considering the credibility of one scalar quote alone. An important part is the level filter, where the level of the filtered variable is the criterion.
2. Pair filtering: comparing two scalar quotes. The most important part is the change filter, which considers the change of the filtered variable from one quote to another. Filtering depends on the time interval between the two quotes and the time scale on which this is measured. Pair filtering also includes a comparison of quote origins.
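As an illustration of these two operation types, the sketch below defines a scalar quote with the three elements listed above and two placeholder checks: a single-quote level test and a pair-wise change test. The concrete credibility functions of the real filter are specified later in this chapter; the simple formulas used here are only assumed stand-ins to show where each test applies.

```python
from dataclasses import dataclass

@dataclass
class ScalarQuote:
    """Simplified scalar quote: time stamp, scalar value, and origin."""
    timestamp: float   # time stamp of the quote
    x: float           # scalar value to be filtered (e.g., logarithm of a bid price)
    origin: str        # origin/contributor of the quote

def level_credibility(quote: ScalarQuote, expected_level: float,
                      tolerance: float) -> float:
    """Single-quote filtering: the level of the filtered variable is the
    criterion. (Illustrative formula only; the actual level filter is
    defined later in the chapter.)"""
    deviation = abs(quote.x - expected_level) / tolerance
    return 1.0 / (1.0 + deviation * deviation)

def change_credibility(old: ScalarQuote, new: ScalarQuote,
                       volatility_per_unit_time: float) -> float:
    """Pair filtering: the change of the variable between two quotes is the
    criterion, judged relative to the elapsed (business) time between them.
    (Illustrative formula only; the actual change filter is defined later.)"""
    dt = max(new.timestamp - old.timestamp, 1e-9)
    expected_move = volatility_per_unit_time * dt ** 0.5
    relative_change = abs(new.x - old.x) / expected_move
    return 1.0 / (1.0 + relative_change * relative_change)
```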
TABLE 4.1 Basic structure of the filtering algorithm used for data cleaning. The data cleaning algorithm has three main hierarchy levels, each with its specific functionalities.

Level name: Univariate filter
Purpose, description: The complete filtering of one time series:
- Passing incoming ticks to the lower hierarchy levels
- Collecting the filter results of the lower hierarchy levels and packaging them into the right output format
- Supporting real-time and historical filtering
- Supporting one or more filtering hypotheses, each with its own full-tick filtering window

Level name: Full-tick filtering window
Purpose, description: A sequence of recent full ticks (bid-ask), some of them possibly corrected according to a general filtering hypothesis. The tasks are as follows:
- Tick splitting: splitting a full tick into scalar quotes to be filtered in their own scalar filtering windows
- Basic validity test (e.g., whether prices are positive)
- A possible mathematical transformation (e.g., logarithm)
- All those filtering steps that require full ticks (not just bid or ask ticks alone)

Level name: Scalar filtering window
Purpose, description: A sequence of recent scalar quotes whose credibilities are still being modified. The tasks are as follows:
- Testing new, incoming scalar quotes
- Comparing a new scalar quote to all older quotes of the window (using a business time scale and a dependence analysis of quote origins)
- Computing a first credibility of the new scalar quote; modifying the credibilities of older quotes based on new information
- Dismissing the oldest scalar quotes when their credibility is finally settled
- Updating the statistics with good scalar quotes when they are dismissed from the window
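The three hierarchy levels of Table 4.1 can be pictured as nested objects: a univariate filter owns one full-tick filtering window (or several, one per filtering hypothesis), and each full-tick window splits an incoming tick into scalar quotes that are passed to their own scalar filtering windows. The skeleton below, continuing the ScalarQuote sketch above, is only a structural sketch under those assumptions; the class names and method bodies are not taken from the original algorithm.

```python
import math
from typing import Dict, List

class ScalarFilteringWindow:
    """Bottom level: a sequence of recent scalar quotes whose credibilities
    are still being modified (Table 4.1, bottom row)."""
    def __init__(self) -> None:
        self.window: List[ScalarQuote] = []   # reuses ScalarQuote from the sketch above

    def add(self, quote: ScalarQuote) -> None:
        # Compare the new quote to the older quotes of the window, compute its
        # first credibility, and update older credibilities.
        # (The actual tests are described in the following sections.)
        self.window.append(quote)

class FullTickFilteringWindow:
    """Middle level: basic validity test, mathematical transformation
    (e.g., logarithm), and tick splitting into scalar quotes."""
    def __init__(self) -> None:
        self.scalar_windows: Dict[str, ScalarFilteringWindow] = {
            "bid": ScalarFilteringWindow(),
            "ask": ScalarFilteringWindow(),
        }

    def add_tick(self, timestamp: float, bid: float, ask: float, origin: str) -> None:
        if bid <= 0.0 or ask <= 0.0:          # basic validity test
            return
        # Tick splitting: each scalar quote is filtered in its own window.
        self.scalar_windows["bid"].add(ScalarQuote(timestamp, math.log(bid), origin))
        self.scalar_windows["ask"].add(ScalarQuote(timestamp, math.log(ask), origin))

class UnivariateFilter:
    """Top level: passes incoming ticks to the lower levels and collects
    their results; may hold several full-tick windows, one per hypothesis."""
    def __init__(self, n_hypotheses: int = 1) -> None:
        self.windows = [FullTickFilteringWindow() for _ in range(n_hypotheses)]

    def add_tick(self, timestamp: float, bid: float, ask: float, origin: str) -> None:
        for window in self.windows:
            window.add_tick(timestamp, bid, ask, origin)
```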
The basic filtering operations and another basic concept of filtering, credibility, are presented in the following sections. Their actual application in the larger algorithm is explained later, starting from Section 4.5.

4.4.1 Credibility and Trust Capital

Credibility is a central concept of the filtering algorithm. It is expressed by a variable taking values between 0 and 1, where 1 indicates validity and 0 invalidity. This number can be interpreted as the probability of a quote being valid according to a given criterion. For two reasons, we avoid the formal introduction of the term "probability." First, the validity of a quote is a fuzzy concept (e.g., slightly deviating quotes of an over-the-counter spot market can perhaps be termed valid even if they are very unlikely to lead to a real transaction). Second, we have no model of