Insights from data come from models. Models are always based on assumptions about the data. The better the assumptions, the closer the insights are to reality.
Data: sales, price, product review score...
Assumptions: linearity, independence, ignorability...
Models: regression, random forest, neural network...
Data centricity is staying true to the data. Staying true to the data is not limited to prioritizing data over its applications or improving data quality. It is also about strengthening the path from the data to the models used to derive insights from the data. This path is defined by assumptions. Assumptions must be made about the data (and the underlying data generation processes) to connect data to models.
Assumptions link data to models in statistics, engineering, and computer science. The British statistician George Box, who argued in a 1976 paper that all models are wrong, later put it famously in a 1987 book with Norman Draper: “Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” Models are wrong in this sense because they all simplify reality in order to reveal associations or causal relationships, or to make predictions about the future.
We can divide the assumptions used to learn from data into two main categories: method-based and model-based. Method-based assumptions come with the statistical or machine learning method used. Model-based assumptions are specific to the problem and solution at hand.
Example: As a parametric method, linear regression assumes, among other things, that the error term follows a normal distribution. That is, if the outcome is sales, we expect the difference between actual and estimated sales to stay close to its mean (zero) most of the time, with deviations below and above the mean of about the same size and frequency.
Example: As a nonparametric method, XGBoost (which is based on decision trees) does not assume any particular family of probability distributions, but random sampling remains a critical assumption.
Example: Gaussian Mixture Models (GMMs) are semi-parametric. A GMM can be used to cluster sales data (e.g., from different stores). In such a model, sales may be assumed to follow a mixture of several normal distributions (parametric component), while cluster membership may be assigned probabilistically by iteratively updating the model fit until convergence (nonparametric component).
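To make the iterative fitting concrete, here is a minimal sketch of expectation-maximization (EM) for a two-component, one-dimensional Gaussian mixture. It is an illustration, not a production implementation: the store sales are simulated, the function name `fit_two_component_gmm` and the initialization (means at the smallest and largest observation) are assumptions made for this example, and the number of components is fixed at two.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_component_gmm(data, n_iter=100, tol=1e-8):
    """Fit a two-component 1D Gaussian mixture with EM.

    Returns (weights, means, sds, responsibilities), where the responsibilities
    come from the last E-step and give each point's membership probabilities.
    """
    means = [min(data), max(data)]          # simple initialization (an assumption)
    spread = statistics.pstdev(data)
    sds = [spread, spread]
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each observation belongs to each component.
        resp = []
        ll = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
        # Stop when the log-likelihood no longer improves.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, sds, resp


# Simulated (hypothetical) sales from two types of stores.
rng = random.Random(42)
sales = [rng.gauss(100, 10) for _ in range(300)] + [rng.gauss(200, 10) for _ in range(300)]

weights, means, sds, resp = fit_two_component_gmm(sales)
print(weights, means, sds)
```

Each entry of `resp` holds the two membership probabilities for one observation, which is the probabilistic cluster assignment described above.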
Example (positivity): If there is a promotion, each customer should have a non-zero probability of receiving the promotional offer and a non-zero probability of not receiving the promotional offer.
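This requirement is often called positivity or overlap. One simple, partial check is to compute the share of customers who received the promotion within each segment and flag segments where that share is exactly 0 or 1. The sketch below assumes a single hypothetical segment variable and made-up records; the function names are illustrative. An empirical share of 0 or 1 in a sample is a warning sign rather than proof of a violation.

```python
from collections import defaultdict


def promotion_rates(records):
    """Share of customers who received the promotion within each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [treated, total]
    for segment, treated in records:
        counts[segment][0] += int(treated)
        counts[segment][1] += 1
    return {segment: treated / total for segment, (treated, total) in counts.items()}


def positivity_violations(records):
    """Segments in which everyone, or no one, received the promotion."""
    rates = promotion_rates(records)
    return sorted(segment for segment, rate in rates.items() if rate in (0.0, 1.0))


# Hypothetical records: (customer segment, received the promotion?)
records = [
    ("new", True), ("new", False), ("new", True),
    ("loyal", True), ("loyal", True),      # every loyal customer was treated
    ("lapsed", False), ("lapsed", True),
]

print(promotion_rates(records))
print(positivity_violations(records))  # ['loyal']
```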
Example (consistency): The potential sales for a customer who actually received the promotional offer should equal the observed sales. Note that this assumption would be violated if some customers received physical coupons while others received emails.
Example (ignorability): Conditional on the observed covariates (e.g., customer type, past sales, seasonality), the potential sales for customers who receive the promotion are comparable to those who do not receive the promotion. That is, the allocation of the promotion is independent of potential sales.
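To see why conditioning on covariates matters, the sketch below compares a naive difference in mean sales with a difference computed within customer segments and then averaged. Everything here is an assumption for illustration: the data are made up so that loyal customers both buy more and receive the promotion more often, the only covariate is a hypothetical segment label, and the function names are illustrative. The stratified estimate is valid only if ignorability holds given that covariate.

```python
from collections import defaultdict
from statistics import fmean


def naive_effect(rows):
    """Difference in mean sales between treated and untreated customers."""
    treated = [sales for _, t, sales in rows if t]
    control = [sales for _, t, sales in rows if not t]
    return fmean(treated) - fmean(control)


def stratified_effect(rows):
    """Average of within-segment differences, weighted by segment size."""
    by_segment = defaultdict(lambda: {True: [], False: []})
    for segment, t, sales in rows:
        by_segment[segment][t].append(sales)
    effect = 0.0
    for groups in by_segment.values():
        if not groups[True] or not groups[False]:
            raise ValueError("a segment lacks treated or untreated customers")
        weight = (len(groups[True]) + len(groups[False])) / len(rows)
        effect += weight * (fmean(groups[True]) - fmean(groups[False]))
    return effect


# Hypothetical rows: (segment, received promotion?, sales).
# Within each segment the promotion adds 10 to sales.
rows = (
    [("loyal", True, 110)] * 3 + [("loyal", False, 100)]
    + [("new", True, 60)] + [("new", False, 50)] * 3
)

print(naive_effect(rows))       # 35.0
print(stratified_effect(rows))  # 10.0
```

The naive comparison mixes the effect of the promotion with the difference between loyal and new customers, while the stratified comparison recovers the within-segment effect built into the made-up data.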
In addition, (1) if compliance with the treatment is an issue (that is, if compliance varies between subjects), the estimated effect may be reduced to the local average treatment effect, the effect among the subjects who comply with their assignment.
Example: The compliance assumption would require that when customers are assigned to receive the coupon, they actually receive and redeem the coupon.
(2) There must be no interference between subjects (the treatment of one subject cannot affect the outcome of another subject). Together with consistency, this last assumption is called the Stable Unit Treatment Values Assumption (SUTVA = Consistency + No interference).
Example: When a customer receives a coupon, their purchasing behavior should not be influenced by whether their friends or family members also received the coupon. This ensures no interference. The coupon must also be consistent in its terms and conditions. This ensures consistency.
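Returning to the compliance point in (1): when only some assigned customers redeem the coupon, one common way to obtain the local average treatment effect is the Wald estimator, which divides the effect of assignment on sales by the effect of assignment on redemption. The sketch below uses made-up data and a hypothetical function name, and it assumes random assignment, that assignment affects sales only through redemption, and that no customer does the opposite of their assignment.

```python
from statistics import fmean


def wald_late(rows):
    """rows: (assigned, redeemed, sales). Returns (itt, uptake, late)."""
    assigned = [r for r in rows if r[0]]
    not_assigned = [r for r in rows if not r[0]]
    # Intention-to-treat effect: effect of being assigned the coupon on sales.
    itt = fmean([r[2] for r in assigned]) - fmean([r[2] for r in not_assigned])
    # Effect of assignment on actually redeeming the coupon.
    uptake = (fmean([int(r[1]) for r in assigned])
              - fmean([int(r[1]) for r in not_assigned]))
    if uptake == 0:
        raise ValueError("assignment did not change redemption; LATE is undefined")
    return itt, uptake, itt / uptake


# Hypothetical rows: half of the assigned customers redeem the coupon.
rows = (
    [(True, True, 70)] * 2 + [(True, False, 50)] * 2
    + [(False, False, 50)] * 4
)

itt, uptake, late = wald_late(rows)
print(itt, uptake, late)  # 10.0 0.5 20.0
```

In this made-up sample, assignment raises average sales by 10, but only half of the assigned customers redeem, so the estimated effect among those who comply is 20.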
Example: The promotion should perform in the future the way it performed in the past. In addition, the historical and future samples of customer data should be representative of the population data.
Being data-centric means making good assumptions and deriving the right insights from the data. This is not a trivial task. Method and model assumptions can get very complicated, and in the fast-paced data science environment, assumptions often go unchecked. But can you trust the insights from the data in such cases? The answer is a resounding no. That's why we're working on an AI tool to help you with this problem.
For articles on data centricity, visit our blog Data Duets.