
Data now exist on almost everything, and they can be used to try to answer all kinds of questions. Do clinical trials really prove a drug's efficacy? Can polls really predict who will win the next election? Can a money manager really pick a profitable portfolio?
Adjustments made for missing information (the people who drop out of drug trials, the questions respondents choose not to answer in polls, the gaps in corporate financial reports) may significantly skew the results of predictive models, according to Joachim Freyberger and Björn Höppner of the University of Bonn, Andreas Neuhierl of Washington University in St. Louis, and Michael Weber of Chicago Booth.
The researchers propose a better method for handling missing data and compare it with two widely used existing ones in a real-world application: forecasting stock returns. The results show that their method consistently comes out ahead.
To compare the three approaches, the researchers assembled a database of US stock and balance-sheet data from 1978 through 2021. The initial data set had 2.4 million observations, or rows, covering 82 variables, including trading volume, accounting details, and momentum indicators. As with many data sets, it was incomplete: not every row had values for all 82 variables.
The first of the two widely used techniques, the "complete cases" approach, discards any incomplete observation, violating the cardinal rule of data analysis: "Thou shalt not throw away data." If any value was missing, the researchers had to drop the entire row. For example, if a stock's trading volume was missing for one month, the complete cases approach required discarding all the data gathered for that stock in that month. Once the researchers had finished, just 10 percent of the original data remained. Most of the discarded observations had five or fewer variables with missing values.
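To make the cost of the complete cases approach concrete, here is a minimal Python sketch on a toy panel. The stocks, variable names, and numbers are invented for illustration and are not taken from the researchers' data set.

```python
# Toy panel: each row is one stock-month; None marks a missing value.
# All names and numbers are hypothetical.
rows = [
    {"stock": "A", "volume": 1.2, "momentum": 0.05, "book_to_market": 0.8},
    {"stock": "B", "volume": None, "momentum": 0.02, "book_to_market": 1.1},
    {"stock": "C", "volume": 0.7, "momentum": None, "book_to_market": 0.6},
    {"stock": "D", "volume": 2.1, "momentum": -0.01, "book_to_market": 0.9},
]


def complete_cases(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


kept = complete_cases(rows)
share_kept = len(kept) / len(rows)
print([r["stock"] for r in kept], share_kept)
```

Stocks B and C are each missing a single value, yet both rows are dropped entirely, so half of this toy panel disappears.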
The second popular technique, "mean imputation," keeps all of the data but introduces bias. Each missing value is filled in with the average of the observed values for that variable in that month. But the missing data may include extreme values that materially alter prediction models. Consider a housing database in which most of the upscale homes are sold by realtors who never disclose the square footage. Analysts who replaced the missing values with the average square footage of all homes would most likely understate the size of those homes and bias their model's predictions of market prices.
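The housing example can be sketched in a few lines of Python. The home sizes below, including the assumed true size of the undisclosed upscale homes, are made up purely to illustrate the direction of the bias.

```python
from statistics import mean

# Hypothetical homes: (square_feet or None, is_upscale). Numbers are invented.
homes = [
    (1500, False), (1700, False), (1600, False),
    (None, True), (None, True),  # upscale homes with undisclosed size
]
true_upscale_sqft = 4000  # assumed true size of the undisclosed homes


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


imputed = mean_impute([sqft for sqft, _ in homes])
print(imputed)
```

Every undisclosed home is assigned the average of the modest homes that did report a size, far below the assumed true figure, so any price model fit on the filled-in data would start from understated sizes for the upscale homes.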
The researchers' strategy to compensate for missing variables fared better in forecasting stock returns than the popular "complete cases" and "mean imputation" methods.
The approach developed by Freyberger, Höppner, Neuhierl, and Weber fills in the gaps by grouping observations that share similar patterns of missing data and using the observations with complete data to predict the missing values. Regression modeling then combines the complete cases and the cases with imputed values into a single data set.
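The general idea of grouping rows by missing-data pattern and predicting the gaps from complete rows can be sketched as follows. This is a deliberately simplified illustration and not the researchers' actual estimator: it assumes only two variables, at most one missing value per row, and a one-predictor least-squares fit, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def impute_by_pattern(rows):
    """Fill gaps in two-variable rows using regressions fit on complete rows.

    Assumes each row is a pair with at most one None.
    """
    complete = [r for r in rows if None not in r]
    # Group incomplete rows by which column is missing (the "pattern").
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        if None in r:
            groups[r.index(None)].append(i)
    filled = [list(r) for r in rows]
    for missing_col, idxs in groups.items():
        seen_col = 1 - missing_col
        # Regress the missing variable on the observed one, complete rows only.
        a, b = fit_line(
            [r[seen_col] for r in complete],
            [r[missing_col] for r in complete],
        )
        for i in idxs:
            filled[i][missing_col] = a + b * rows[i][seen_col]
    return filled


rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None), (None, 10.0)]
filled = impute_by_pattern(rows)
print(filled)
```

Unlike mean imputation, each filled-in value here depends on what is observed for that particular row, so a row with an unusually large observed value receives an unusually large prediction for the missing one.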
In simulations using the researchers' method for handling missing data, portfolios that went long the 100 stocks with the highest predicted returns and short the 100 stocks with the lowest predicted returns earned roughly 52 percent. This outperformed portfolios built with the complete cases and mean imputation approaches, which returned 11 percent and 49 percent, respectively. Portfolios using the researchers' method also beat the other two in terms of the return earned for the amount of risk taken: the Sharpe ratio, which measures risk-adjusted returns, was 1.79, compared with 1.19 and 1.66, respectively.
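For readers unfamiliar with these measures, the sketch below shows one common way to compute a long-short portfolio return and a Sharpe ratio. It assumes equal weights within each leg and a simple, non-annualized Sharpe ratio; the article does not specify the researchers' weighting or annualization, and all numbers are invented.

```python
from statistics import mean, stdev


def long_short_return(predicted, realized, n):
    """Return from going long the n highest-predicted stocks and short the
    n lowest-predicted stocks, with equal weights in each leg (an assumption)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    short_leg = mean([realized[i] for i in order[:n]])
    long_leg = mean([realized[i] for i in order[-n:]])
    return long_leg - short_leg


def sharpe_ratio(excess_returns):
    """Mean excess return divided by its sample standard deviation
    (not annualized)."""
    return mean(excess_returns) / stdev(excess_returns)


# Hypothetical one-period example with four stocks and n = 1.
predicted = [0.03, -0.01, 0.02, 0.00]
realized = [0.04, -0.02, 0.01, 0.00]
period_return = long_short_return(predicted, realized, n=1)

# Hypothetical series of periodic long-short returns.
ratio = sharpe_ratio([0.02, 0.04, 0.06])
print(period_return, ratio)
```

A higher Sharpe ratio means more return per unit of volatility, which is why the comparison above looks at both raw returns and risk-adjusted returns.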
Their method also outperformed the others when a nonlinear model was used to forecast returns: it earned 92 percent, compared with 11 percent and 86 percent for the two popular approaches, and its Sharpe ratio was 2.82, compared with 1.29 and 2.44.
According to Weber, the improved approach to handling missing data helps investors determine which of the thousands of possible return predictors provide reliable information, enabling them to build well-balanced portfolios with good risk-adjusted returns.