Missing data is a reality of virtually every business data set. In statistics classes, you are taught to replace missing values with the average or with a more sophisticated estimate generated through regression or other techniques. At times, this replacement becomes so mechanical that analysts forget there could be a reason why the data is missing.
Missing data, or the absence of something, can in certain cases be strong evidence in itself. This is particularly true in risk and fraud analytics. At the beginning of an analytics project, it is a good idea to scrutinize the missing data and check whether compelling clues are hiding within it.
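One way to make this concrete is to compare the outcome rate between records where a field is missing and records where it is present, and to keep a "was missing" flag when imputing. The following is a minimal sketch on made-up data: the field names (`income`, `income_missing`, `is_fraud`) and the records themselves are hypothetical and serve only as illustration.

```python
from statistics import mean

# Hypothetical records: (declared income, or None if missing; fraud flag 0/1).
# These values are invented for illustration only.
records = [
    (52000, 0), (None, 1), (61000, 0), (None, 1),
    (48000, 0), (None, 0), (75000, 0), (58000, 1),
]

def fraud_rate(rows):
    """Share of rows flagged as fraud; None for an empty group."""
    flags = [flag for _, flag in rows]
    return mean(flags) if flags else None

missing = [r for r in records if r[0] is None]
present = [r for r in records if r[0] is not None]

# A large gap between these two rates suggests that missingness
# itself carries information about the outcome.
rate_missing = fraud_rate(missing)
rate_present = fraud_rate(present)

# Impute with the mean of the observed values, but keep an indicator
# column so the "was missing" signal is not erased by the imputation.
observed_mean = mean([value for value, _ in present])
prepared = [
    {
        "income": value if value is not None else observed_mean,
        "income_missing": int(value is None),
        "is_fraud": flag,
    }
    for value, flag in records
]

print(f"fraud rate when income is missing: {rate_missing:.2f}")
print(f"fraud rate when income is present: {rate_present:.2f}")
```

With real data the groups would need to be large enough for the difference in rates to be meaningful. The point of the sketch is only that the check happens before the missing values are overwritten.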
Another problem for analysis, highlighted by every statistics textbook, is outliers. Outliers are observations that are extremely dissimilar to the rest of the studied population. For instance, if you are studying the net wealth of individuals on the planet, then Bill Gates is an outlier.
One strategy for dealing with outliers is data transformation, i.e. taking the log or the square root of all the observations. This compresses the data into a more ‘normal’ range. At other times, outliers can be removed from the data being analyzed. Removal is a good strategy in many cases but is ineffective in several others. For example, in several marketing analytics applications, it is a better idea to create different segments of the population and build a separate model for each segment.
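The effect of a log transformation, and the alternative of segmenting instead of removing, can be sketched on a handful of made-up numbers. Everything here is an assumption for illustration: the net-worth figures, the segment names, and the cut-off of 1,000,000 are hypothetical, and the log transform assumes strictly positive values.

```python
import math
from statistics import mean, median

# Hypothetical net-worth figures; the last value is an extreme outlier.
net_worth = [20_000, 45_000, 80_000, 150_000, 300_000, 100_000_000_000]

# Log transformation: requires strictly positive values.
log_worth = [math.log10(x) for x in net_worth]

# On the raw scale the largest value is millions of times the smallest;
# on the log scale the values sit within a few units of each other.
raw_ratio = max(net_worth) / min(net_worth)
log_spread = max(log_worth) - min(log_worth)

# The mean is dominated by the single outlier, the median is not.
raw_mean = mean(net_worth)
raw_median = median(net_worth)

# Segmentation instead of removal: split on an assumed cut-off and
# summarize (or model) each segment separately.
CUTOFF = 1_000_000  # hypothetical threshold
segments = {
    "mass": [x for x in net_worth if x < CUTOFF],
    "ultra_high": [x for x in net_worth if x >= CUTOFF],
}
segment_means = {name: mean(values) for name, values in segments.items()}

print(f"raw max/min ratio: {raw_ratio:,.0f}")
print(f"log10 spread:      {log_spread:.2f}")
print(f"mean vs median:    {raw_mean:,.0f} vs {raw_median:,.0f}")
print(f"segment means:     {segment_means}")
```

In this sketch the per-segment mean stands in for a per-segment model. The outlier stays in the analysis but no longer distorts the summary of the rest of the population.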