
The Importance of Data Validation

As the investment management industry continues its quantitative shift, data plays an increasingly central role in the identification of profitable trading signals, amplifying its historical role and value. Yet as both the supply of and demand for data grow, so does the effort required to ensure it is clean and reliable. It cannot be assumed that raw data will be free from errors; rather, it should be assumed that many datasets are replete with inaccuracies. To avoid the monotony and complexity of data cleaning, and to enable greater focus on research and portfolio management, a structured data validation process is required.

Common Errors in Financial Markets Data

There is a broad range of common errors that can adversely affect a dataset, its analysis, and the investment decisions it informs. These anomalies can exert varying degrees of influence on an investment strategy’s expected risk/return profile, so any dataset used for research should be subjected to a comprehensive data validation process. The methodologies below provide a glimpse of some of the processes employed on the SigTech quant technologies platform and our data validation engine (DaVe).

SigTech ensures that data on our quant technologies platform has been validated before it is made available to users. DaVe allows both data producers and data users (i.e. researchers, portfolio managers, data scientists) to validate their datasets by customizing the thresholds for error detection and subsequent cleaning.
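To make the idea of configurable thresholds concrete, below is a minimal, hypothetical sketch of threshold-driven validation in pandas. It does not reflect DaVe’s actual API; the threshold names and default values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical illustration only -- not the DaVe API. It shows the idea of
# user-configurable thresholds driving error detection on a price series.
THRESHOLDS = {
    "max_daily_return": 0.10,  # flag moves larger than +/-10% (assumed value)
    "max_stale_days": 5,       # flag runs of unchanged prices (assumed value)
}

def validate(prices: pd.Series, thresholds: dict = THRESHOLDS) -> pd.DataFrame:
    """Return a per-date report of suspect observations."""
    returns = prices.pct_change()
    # Count how many consecutive days the price has been unchanged
    stale_run = prices.groupby((prices != prices.shift()).cumsum()).cumcount()
    report = pd.DataFrame({
        "extreme_move": returns.abs() > thresholds["max_daily_return"],
        "stale": stale_run >= thresholds["max_stale_days"],
    }, index=prices.index)
    return report[report.any(axis=1)]
```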

Comparing Clean and Dirty Data

To demonstrate the importance of data validation, the following example considers how three simple yet common data errors can significantly impact a bond futures carry strategy.

A carry strategy seeks to exploit the difference between the yields on two financial instruments with differing maturities. The strategy utilizes one of SigTech’s pre-built and fully customizable building blocks: the Rolling Future Strategy. It trades four bond futures contracts, going long the instruments in the top 25% of signal values and short those in the bottom 25%, while maintaining 100% gross exposure; a sketch of this allocation logic is shown below. The contracts traded include the US 5Y T-Note, US 10Y T-Note and US Long Bond futures featured in the figures that follow.
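The snippet below is a minimal sketch of that signal-to-position logic, not the implementation of SigTech’s Rolling Future Strategy; splitting gross exposure equally between the long and short sides, and the contract names, are our own illustrative assumptions.

```python
import pandas as pd

def carry_weights(signals: pd.Series) -> pd.Series:
    """Map cross-sectional carry signals to weights with 100% gross exposure."""
    ranks = signals.rank(pct=True)
    longs = ranks > 0.75           # top 25% of signal values
    shorts = ranks <= 0.25         # bottom 25% of signal values
    weights = pd.Series(0.0, index=signals.index)
    weights[longs] = 0.5 / longs.sum()     # +50% gross on the long side
    weights[shorts] = -0.5 / shorts.sum()  # -50% gross on the short side
    return weights

# Placeholder signal values for four contracts (names are illustrative)
print(carry_weights(pd.Series({"FUT_A": 1.2, "FUT_B": 0.4, "FUT_C": 0.1, "FUT_D": -0.3})))
```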

To illustrate the impact of erroneous data points, we synthetically add data errors to each of the datasets. These mimic errors commonly found in financial data.

Misplaced Decimals

The first error arises frequently in financial markets data and can have extraordinary repercussions for traders: the misplaced decimal point. It is often a consequence of manual data entry, which remains a common industry practice.

Figure (1): US 10Y T-Note futures with a misplaced decimal for a single data point

For a visual comparison, the clean time series is included below.

Figure (2): US 10Y T-Note futures with a clean dataset

Misplaced decimals disrupt the plotted time series to an extent that is easily recognizable as an anomaly. However, when building a strategy, researchers and portfolio managers may not always visually inspect the data being used, whether because plotting multiple datasets is inefficient or because the data is simply assumed to be clean. Even when every dataset is plotted, some errors produce no discernible disruption, complicating their visual detection. A programmatic check is therefore more dependable; one way to flag misplaced decimals is sketched below.
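This sketch reflects our own assumptions rather than DaVe’s method: a decimal shift shows up as a point roughly 10x or 0.1x its neighbours, so the ratio of each price to a local rolling median lands near a power of ten. The window size and tolerance are illustrative.

```python
import numpy as np
import pandas as pd

def flag_decimal_shifts(prices: pd.Series, window: int = 5) -> pd.Series:
    """Flag points whose ratio to the local median is roughly a power of ten."""
    local_median = prices.rolling(window, center=True, min_periods=1).median()
    log_ratio = np.log10(prices / local_median)
    # A correct point sits near 10**0 of its neighbours; a shifted decimal
    # sits near 10**1, 10**-1, 10**2, ... The 0.15 tolerance is arbitrary.
    nearest = log_ratio.round()
    return (nearest.abs() >= 1) & np.isclose(log_ratio, nearest, atol=0.15)
```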

Values out of Range

Another common error in financial markets data is data points whose values deviate from an accepted range. Deviations vary widely: while large deviations have a greater impact on analysis, they are generally easier to identify. Below are two versions of the US 5Y T-Note futures time series, the first with an incorrect value and the second without; a simple programmatic range check follows the figures.

Figure (3): US 5Y T-Note futures with an incorrect value

Figure (4): US 5Y T-Note futures with a clean dataset
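One simple form of range check, again a sketch with illustrative parameters rather than DaVe’s defaults, flags any point lying more than a chosen multiple of the local median absolute deviation (MAD) away from a rolling median:

```python
import pandas as pd

def flag_out_of_range(prices: pd.Series, window: int = 21, k: float = 5.0) -> pd.Series:
    """Flag points more than k local MADs away from a rolling median."""
    med = prices.rolling(window, center=True, min_periods=1).median()
    mad = (prices - med).abs().rolling(window, center=True, min_periods=1).median()
    # Note: if the local MAD is zero (e.g. a flat stretch), any deviation is flagged.
    return (prices - med).abs() > k * mad
```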

Stale Values

The third error is also commonplace and easily overlooked: stale values. In the US Long Bond futures time series below, we have randomly added stale values on 10 days across the series. The first plot includes the stale values, while the second shows the clean time series. As a comparison of the two makes clear, these errors are far harder to identify visually; a programmatic run-length check, sketched after the figures, is more reliable.

Figure (5): US Long Bond futures with stale values for March 2017

Figure (6): US Long Bond futures with a clean dataset
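Stale values can be flagged by measuring the length of runs of unchanged prices, since liquid futures rarely print identical settlement prices many days in a row. The sketch below is illustrative, and the three-day cutoff is our assumption rather than a DaVe setting.

```python
import pandas as pd

def flag_stale(prices: pd.Series, min_run: int = 3) -> pd.Series:
    """Flag prices belonging to runs of identical consecutive values."""
    run_id = (prices != prices.shift()).cumsum()  # new id whenever the price changes
    run_length = prices.groupby(run_id).transform("size")
    return run_length >= min_run
```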

Overall Strategy Performance

Backtesting our carry strategy on the corrupted datasets results in an annualized excess return of 2.5%, a Sharpe ratio of 0.3 and a maximum drawdown of -6.6%.

Figure (7): Bond Carry Strategy with corrupted datasets

When backtesting with clean datasets, the results suggest a riskier, less profitable strategy: annualized returns fall to 1.7%, the Sharpe ratio dips to 0.14, and the maximum drawdown deepens to -7.4%. In the presence of the anomalies, therefore, investors may have been enticed into making an ill-informed investment decision. The headline statistics quoted here can be reproduced from the strategy’s daily returns, as sketched after the figure below.

Figure (8): Bond Carry Strategy with clean datasets
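For reference, these statistics are typically computed from a daily excess return series along the following standard lines; this is a conventional sketch, not necessarily SigTech’s exact methodology.

```python
import numpy as np
import pandas as pd

def summary(excess_returns: pd.Series, periods: int = 252) -> dict:
    """Annualized excess return, Sharpe ratio and maximum drawdown."""
    ann_return = excess_returns.mean() * periods
    sharpe = ann_return / (excess_returns.std() * np.sqrt(periods))
    equity = (1 + excess_returns).cumprod()           # compounded equity curve
    max_drawdown = (equity / equity.cummax() - 1).min()  # worst peak-to-trough fall
    return {"ann_return": ann_return, "sharpe": sharpe, "max_drawdown": max_drawdown}
```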

Conclusion

This blog has demonstrated the significant impact data errors can have on the construction of investment strategies. Without accurate and reliable data, investment decisions are subject to distorted risk and return expectations. The SigTech platform provides investment managers with a wide range of quant technologies and analytical tools, supported by a rich pool of financial markets data. This data has been pre-validated and fully operationalized via DaVe, allowing users to focus on research and portfolio management without the added burden of data management. DaVe also allows users to customize their own validation parameters and efficiently clean their own data.

Disclaimer

This content is not, and should not be construed as, financial advice or an invitation to purchase financial products. It is provided for information purposes only and is subject to the terms and conditions of our disclaimer, which can be accessed here.
