Reproducibility of Results

Many analytical models are built on the assumption that a model built today will still perform well in the future. If the results are not reproducible, the predictive model is worthless. The project team must therefore identify the reasons why its models will or will not work in the future. It is also a good idea to define the boundaries within which a model will operate properly.

For instance, consider this fictitious model of professionals' salaries:

Salary = 1,000 * Years of Experience + 5,000

Taken literally, this equation says that someone with infinite years of experience will earn an infinite salary. We know this is incorrect. The model may be correct within the range of 0 to 30 years of experience. Yet most models in business systems are implemented without defining the boundaries within which they are effective.
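One way to make such a boundary explicit is to encode it next to the model itself. The following is a minimal Python sketch rather than a production implementation: the function name predict_salary and the constant names are hypothetical, and the 0 to 30 year range is simply the one suggested above.

```python
# Hypothetical illustration of the fictitious salary model above.
MIN_YEARS = 0   # lower bound of the range where the model is assumed valid
MAX_YEARS = 30  # upper bound, taken from the text above


def predict_salary(years_of_experience: float) -> float:
    """Return 1,000 * years of experience + 5,000, but only inside the valid range."""
    if not MIN_YEARS <= years_of_experience <= MAX_YEARS:
        raise ValueError(
            f"model is only defined for {MIN_YEARS} to {MAX_YEARS} years of experience"
        )
    return 1_000 * years_of_experience + 5_000


print(predict_salary(10))  # 15000
```

Calling predict_salary(45) raises a ValueError instead of quietly extrapolating, which forces whoever uses the model to notice that they have left the range it was built for.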

Missing Data & Outliers

Missing data is a reality of virtually every business data set. In statistics classes, you are told to replace missing data with either the average or a more sophisticated value generated through regression or other techniques. At times, this process of replacing missing values becomes so mechanical that analysts forget there could be a reason why the data is missing.

In certain cases, missing data, or the absence of something, can be strong evidence in itself. This is particularly true in risk and fraud analytics. At the beginning of an analytics project, it is a good idea to scrutinize the missing data and check whether compelling clues are hidden within it.
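A simple way to avoid losing that evidence is to record where values are missing before filling them in. The sketch below assumes a hypothetical list of reported incomes in which None marks a missing value. Mean imputation is used only because it is the simplest of the techniques mentioned above, not because it is the right choice for every data set.

```python
from statistics import mean

# Hypothetical records: None marks a value the applicant did not provide.
reported_income = [52_000, None, 61_000, None, 48_000]

# Step 1: record where data is missing before any imputation hides it.
income_missing = [value is None for value in reported_income]

# Step 2: impute with the mean of the observed values (one common choice).
observed = [value for value in reported_income if value is not None]
fill_value = mean(observed)
income_filled = [fill_value if value is None else value for value in reported_income]

print(income_missing)  # [False, True, False, True, False]
print(sum(income_missing) / len(income_missing))  # share of missing values: 0.4
```

The income_missing flags can then be examined on their own, or passed to a model as an extra variable, so that the pattern of absence remains available as a possible signal.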

Another problem for analysis, highlighted by every statistics textbook, is outliers. Outliers are observations that are extremely dissimilar to the rest of the studied population. For instance, if you are studying the net wealth of individuals on the planet, then Bill Gates is an outlier.

One strategy for dealing with outliers is data transformation, i.e., taking the log or the square root of all the observations. This compresses the data into a ‘normal’ range. At other times, outliers can be removed from the data being analyzed. These strategies work well in many cases but are ineffective in several others. For example, in several marketing analytics applications, it is a better idea to divide the population into segments and build a separate model for each segment.
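The effect of a log transformation can be seen on a handful of numbers. The net-wealth figures below are invented for illustration, and the sketch assumes every value is strictly positive, since the logarithm is undefined for zero or negative values.

```python
import math

# Hypothetical net-wealth figures; the last value is an extreme outlier.
net_wealth = [20_000, 45_000, 80_000, 150_000, 100_000_000_000]

# A log transform compresses the scale so the outlier no longer dominates.
# Assumes all values are strictly positive.
log_wealth = [math.log10(value) for value in net_wealth]

# Compare the spread (largest value divided by smallest) before and after.
raw_ratio = max(net_wealth) / min(net_wealth)
log_ratio = max(log_wealth) / min(log_wealth)

print(raw_ratio)  # 5000000.0
print(round(log_ratio, 2))
```

On the raw scale the largest value is five million times the smallest, while on the log scale the largest is less than three times the smallest. The outlier is still the largest observation, but it no longer swamps the rest of the data.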

Not Identifying the Right Variables

After identification of the right question(s) for a business analytics problem, the next step is to identify the right data and variables to work with.

“Assume you want to build a model to predict job satisfaction for employees. In any human resources system, the easily available and highly quantifiable metrics are income, bonus, levels, promotions, etc. But we all know from our experience that job satisfaction is a highly complicated phenomenon and can barely be predicted with just these variables. However, when one builds this model there is a greater temptation to just use the easily available variables. The ability to identify the right set of variables at the beginning of the project differentiates a good analyst from the rest. Identification of variables requires a good understanding of the domain and lots of creativity. Creativity helps in generating derived variables from the available data in the business systems.”

– Roopam Upadhyay    May 2, 2016
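As a small illustration of derived variables, the sketch below builds two features from fields an HR system might plausibly hold. The field names, records and reference date are all hypothetical, and nothing here implies that these particular features actually predict job satisfaction. Choosing the right ones still requires the domain knowledge described above.

```python
from datetime import date

# Hypothetical HR records; field names are assumptions for illustration only.
employees = [
    {"id": 1, "income": 60_000, "level": 2, "last_promotion": date(2015, 3, 1)},
    {"id": 2, "income": 90_000, "level": 2, "last_promotion": date(2011, 6, 1)},
]
as_of = date(2016, 5, 2)  # hypothetical reference date for the calculation

# Average income per level, used to express pay relative to peers.
level_incomes = {}
for e in employees:
    level_incomes.setdefault(e["level"], []).append(e["income"])
level_avg = {lvl: sum(v) / len(v) for lvl, v in level_incomes.items()}

for e in employees:
    # Derived variable 1: approximate years since the last promotion.
    e["years_since_promotion"] = (as_of - e["last_promotion"]).days / 365.25
    # Derived variable 2: income relative to the average at the same level.
    e["income_vs_level_avg"] = e["income"] / level_avg[e["level"]]

print(employees[0]["income_vs_level_avg"])  # 0.8
```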

Eagerness to Solve Problems

Thinking about data in a scientific way is at the core of a successful analytics project that produces a competitive edge for the organization.  Yet, there are several reasons why analytics projects fail to create sound outcomes for an organization.

On Facebook, people post something like:

Identify a word that starts and ends with the letter ‘r’

Almost always, hundreds of users on this social media site immediately start answering the question. Every now and then someone asks, “Why is this an important question?” In this setting, whoever asks is considered a spoilsport.

Still, there is something extremely interesting happening here. Humans are wired, particularly by schooling, to answer questions without questioning the question. We see a problem and we need to solve it. This is a dangerous strategy for analytics projects.

Identification of the right business problem is at the core of successful analytics projects. Not every business problem is equally important, and many problems are not worth any effort at all. Always ask why the problem you are solving is important, and don’t start your project until you have a satisfactory answer.