[Figure: the path from data to model. Data (sales, price, product review score, ...) → Assumptions (linearity, independence, ignorability, ...) → Model (regression, random forest, neural network, ...)]

What is data centricity?*

Data centricity is staying true to the data. Staying true to the data is not just about prioritizing data over applications or improving data quality. It is about strengthening the path from the data to the models used to derive insights from the data. This path is defined by assumptions. Assumptions must be made about the data (and the underlying data generation processes) to connect data to models.**

Assumptions link data to models in statistics, engineering, and computer science. The British statistician George Box (writing with Norman Draper in 1987) famously put it this way: “Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” This is because all models simplify reality in order to reveal associations or causal relationships, or to make predictions about the future.

We can divide modeling assumptions into two main categories: method-based and model-based. Method-based assumptions relate to the statistical or machine learning methods used. Model-based assumptions are specific to the problem and solution at hand.

  1. Method-based: Different types of methods require different types and levels of assumptions. This first category has a well-established typology. Briefly, we can talk about three groups of assumptions here:
    1. Fully parametric: The probability distributions describing the data generation process behind a sample are assumed to follow a family of probability distributions with a finite number of unknown parameters. A common assumption is that the distribution of values is normal, with unknown mean and variance, and that the data sets are generated by simple random sampling.

      Example: As a parametric method, a linear regression model assumes that the outcome variable, conditional on the predictors, follows a normal distribution; equivalently, the errors are assumed to be normally distributed (along with the other accompanying assumptions). That is, if the outcome is sales, we expect sales to be around the conditional mean most of the time, with deviations below and above the mean of about the same size and frequency (sketched in code after this list).

    2. Non-parametric: The assumptions made about the process generating the data are much fewer than in parametric methods, but there are still assumptions, including random sampling. In addition, depending on the specific method, independence of observations may be assumed.

      Example: As a non-parametric method, a decision-tree-based XGBoost model does not assume that sales follow any particular family of probability distributions, while random sampling remains critical (sketched in code after this list).

    3. Semi-parametric: These methods make assumptions that fall between the fully parametric and non-parametric approaches. For example, the mean of the outcome may be assumed to have a linear relationship with some explanatory variables (a parametric assumption), while the random variation around this mean is not assumed to follow any particular distribution. Semi-parametric models can often be separated into parametric and non-parametric parts (e.g., structural and random variation).

      Example: Gaussian Mixture Models (GMMs) are often treated as semi-parametric. A GMM can be used to cluster sales data (say, from different stores). In such a model, sales are assumed to follow a mixture of several normal distributions (the parametric part), while cluster membership is assigned probabilistically, with the fit updated iteratively via the EM algorithm until convergence, which gives the model its more flexible, non-parametric character (sketched in code after this list).

  2. Model-based: Different modeling objectives and identification strategies require different types and levels of assumptions. Unlike the well-established typology of method-based assumptions above, the assumptions tied to different modeling objectives are less structured. We broadly divide models into two groups:
    1. Causal modeling: Both experimental and observational data can be used in causal modeling. Depending on the specific modeling approach, the required assumptions differ, but three assumptions are typically invoked to identify the average causal effect: positivity, consistency, and exchangeability.
      1. Positivity: Each subject (product, store, customer…) has a positive probability of receiving each value of the treatment variable (difficult to meet when the treatment is continuous).

        Example: If there is a promotion, each customer should have a non-zero probability of receiving the promotion and a non-zero probability of not receiving it (a propensity-score overlap check is sketched after this list).

      2. Consistency: The treatment is well defined: there are no multiple versions of the treatment, or, if multiple versions exist, the potential outcome would not differ under the alternative versions.

        Example: The promotion must be the exact same promotion applied in the exact same way to all customers, so that “receiving the promotion” means the same thing for every customer and each customer's sales outcome under the promotion is well defined.

      3. Exchangeability: Treatment assignment is independent of the potential outcomes, so the treated and control groups are comparable (exchangeable). This is also referred to as the independence, or ignorability, assumption.

        Example: Conditional on the observed covariates (e.g., customer type, seasonality, historical sales), the potential sales outcomes for customers that receive the promotion are comparable to those that do not receive the promotion. In other words, the allocation of the promotion (treatment) is independent of potential sales outcomes.

      If adherence to the experimental treatment is an issue (adherence varies between subjects), the estimated effect may be limited to the local average treatment effect, that is, the effect among compliers.

      Example: Say a coupon is sent to customers instead of a product-group promotion. Full compliance would require that customers assigned to receive the coupon actually receive and redeem it (a back-of-the-envelope calculation of the local average treatment effect under imperfect compliance is sketched after this list).

      In addition, there must be no interference between subjects (the treatment of one subject cannot affect the outcome of another subject). Together with consistency, this no-interference condition forms the Stable Unit Treatment Value Assumption (SUTVA): consistency + no interference.

      Example: When a customer receives a coupon, their purchasing behavior should not be influenced by whether their friends or family members also received a coupon. This ensures no interference. The coupon must also be consistent in its terms and conditions. This ensures consistency.

    2. Predictive modeling: The basic premise of a predictive model is that the training set is a good representation of the test set. In other words, the underlying assumption is that the process that generated the historical data will continue to hold for the future data. In addition, the data in the training and test sets must be representative of the population. These assumptions must hold on top of the methodological assumptions made by the underlying methods used for prediction.

      Example: A predictive model assumes that the relationship between the promotion (or coupon) and sales observed historically will continue to hold in the future, and that both the historical and future data are representative of the population for the promotion and coupon (a simple distribution-shift check is sketched after this list).
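
The sketches below illustrate how some of these assumptions show up in code. They use simulated data and illustrative column names (sales, price, review_score, and so on), so treat them as minimal examples rather than prescribed implementations.

To make the fully parametric example concrete, this sketch fits a linear regression of sales on price and review score with statsmodels and checks the normality assumption on the residuals with a Shapiro-Wilk test.

```python
# Minimal sketch (simulated data): fit a parametric linear regression and
# check the normality assumption on its residuals.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "price": rng.uniform(5, 50, n),
    "review_score": rng.uniform(1, 5, n),
})
# Outcome simulated with approximately normal noise around a linear mean.
df["sales"] = 200 - 2.5 * df["price"] + 15 * df["review_score"] + rng.normal(0, 10, n)

X = sm.add_constant(df[["price", "review_score"]])
fit = sm.OLS(df["sales"], X).fit()

# The parametric assumption: residuals are normally distributed around the fitted mean.
_, p_value = stats.shapiro(fit.resid)
print(fit.params)
print(f"Shapiro-Wilk p-value for residual normality: {p_value:.3f}")
```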
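
For the non-parametric example, this sketch fits a tree-based XGBoost regressor to a deliberately skewed sales outcome. No distributional family is assumed for sales, but the data are still split by random sampling; the hyperparameters are arbitrary choices for illustration.

```python
# Minimal sketch (simulated data): a tree-based XGBoost regressor makes no
# assumption about the distribution family of sales.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "price": rng.uniform(5, 50, n),
    "review_score": rng.uniform(1, 5, n),
})
# Skewed, non-normal outcome: a normal-errors model would be misspecified here.
df["sales"] = rng.lognormal(mean=3 - 0.02 * df["price"], sigma=0.5)

# Random sampling into train and test sets remains critical.
X_train, X_test, y_train, y_test = train_test_split(
    df[["price", "review_score"]], df["sales"], random_state=0
)
model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```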
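
For the semi-parametric example, this sketch clusters simulated store-level sales with scikit-learn's GaussianMixture. The mixture-of-normals form is the parametric part; cluster membership is assigned probabilistically and refined iteratively by the EM algorithm. The two store segments are invented.

```python
# Minimal sketch (simulated data): cluster store-level sales with a
# Gaussian Mixture Model fitted by the EM algorithm.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two simulated store segments with different average sales levels.
sales = np.concatenate([
    rng.normal(100, 10, 300),   # lower-volume stores
    rng.normal(250, 25, 200),   # higher-volume stores
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(sales)
labels = gmm.predict(sales)            # hard cluster assignments
posteriors = gmm.predict_proba(sales)  # probabilistic (soft) membership
print("Estimated component means:", gmm.means_.ravel())
```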
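
For positivity, one practical check is to estimate each customer's probability of receiving the promotion (a propensity score) from observed covariates and verify that the estimates stay away from 0 and 1. The covariates and the logistic model below are illustrative assumptions, not a prescribed recipe.

```python
# Minimal sketch (simulated data): check positivity by estimating propensity
# scores and inspecting whether they stay away from 0 and 1.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "customer_type": rng.integers(0, 3, n),      # encoded customer segment
    "historical_sales": rng.normal(100, 20, n),
})
# Simulated promotion assignment that depends on the covariates.
logit = -2 + 0.5 * df["customer_type"] + 0.01 * df["historical_sales"]
df["promotion"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = df[["customer_type", "historical_sales"]]
propensity = LogisticRegression().fit(X, df["promotion"]).predict_proba(X)[:, 1]

# Positivity is suspect if estimated propensities pile up near the boundaries.
print(f"Propensity score range: {propensity.min():.3f} to {propensity.max():.3f}")
print("Share near 0 or 1:", ((propensity < 0.01) | (propensity > 0.99)).mean())
```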
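
For imperfect compliance, the Wald (instrumental-variable) ratio of the intention-to-treat effect to the difference in compliance rates recovers the local average treatment effect among compliers. The sketch simulates coupon assignment (Z), redemption (D), and sales (Y) with a known effect of 10 so the estimate can be sanity-checked; all numbers are invented.

```python
# Minimal sketch (simulated data): Wald estimator of the local average
# treatment effect (LATE) under imperfect compliance.
# Z = coupon assigned, D = coupon redeemed, Y = sales.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 5000
Z = rng.binomial(1, 0.5, n)            # random assignment of the coupon
complier = rng.binomial(1, 0.7, n)     # about 70% of customers comply
D = Z * complier                       # redemption only if assigned and compliant
Y = 50 + 10 * D + rng.normal(0, 5, n)  # true effect of redemption is 10

df = pd.DataFrame({"Z": Z, "D": D, "Y": Y})
itt = df.loc[df.Z == 1, "Y"].mean() - df.loc[df.Z == 0, "Y"].mean()          # intention-to-treat
first_stage = df.loc[df.Z == 1, "D"].mean() - df.loc[df.Z == 0, "D"].mean()  # compliance gap
late = itt / first_stage                                                     # Wald estimator
print(f"ITT: {itt:.2f}, compliance gap: {first_stage:.2f}, LATE: {late:.2f}")
```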
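
For predictive modeling, one rough way to probe the “history represents the future” assumption is to compare feature distributions between the training period and the period being predicted, for example with a two-sample Kolmogorov-Smirnov test. The shifted price distribution below is simulated; the choice of feature and test is illustrative.

```python
# Minimal sketch (simulated data): compare a feature's distribution in the
# training period versus the prediction period with a two-sample KS test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
train_price = rng.normal(20, 5, 1000)   # prices seen in historical data
future_price = rng.normal(24, 5, 1000)  # prices in the period to predict (shifted)

stat, p_value = stats.ks_2samp(train_price, future_price)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.4f}")
# A small p-value signals distribution shift: the assumption that history
# predicts the future deserves a closer look before trusting the model.
```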


Being data-centric means making the correct assumptions and deriving the correct insights from the data. This is not a trivial task. Method and model assumptions can get very complicated, and in the fast-paced data science environment, assumptions often go unchecked. But can you trust the insights from the data in such cases? The answer is clearly no. That's why we're working on an AI tool to help you with this problem.

Want to learn more?

For articles on data centricity, visit our blog Data Duets.