There is more to data structure than the almost obvious aspects I’ve been discussing. Take, as an example, my comments about clustering demographic variables to develop segments of consumers. The segments were implicit (i.e., hidden or latent) in the demographics. Using a clustering procedure such as hierarchical clustering, K-means, or latent variable clustering reveals what was there all along. By collapsing the demographics into a single new variable, the segment assignment, more structure is added to the data table. Different graphs can now be created for an added visual analysis of the data. In fact, this is what is done most frequently: a few simple graphs (the ubiquitous pie and bar charts) are created to summarize the segments against a few key variables, such as purchase intent by segment or satisfaction by segment.
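As a minimal sketch of this segmentation step, the following uses scikit-learn's K-means to collapse two demographic variables into a single segment variable. The demographics, their distributions, and the choice of three segments are illustrative assumptions, not values from any real study.

```python
# Illustrative only: segmenting consumers by clustering demographics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical demographics: age (years) and income ($000s) for 300 consumers.
demographics = np.column_stack([
    rng.normal(45, 12, 300),   # age
    rng.normal(70, 20, 300),   # income
])

# Standardize so each demographic contributes comparably to the distances.
X = StandardScaler().fit_transform(demographics)

# Collapse the demographics into one new categorical variable: the segment.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segment = kmeans.fit_predict(X)

print(np.bincount(segment))  # consumers per segment
```

The new `segment` column is exactly the kind of added structure described above: a single categorical variable that can then feed the summary pie and bar charts.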

**Modeling with OLS**

In addition to the visuals, sometimes a simple OLS regression model is estimated with the newly created segments as an independent variable. More precisely, the segment variable is dummified and the dummies are used, since the segment variable is categorical rather than quantitative and so cannot enter a regression model directly. Unfortunately, an OLS model is not appropriate here because a key assumption required for OLS to be effectively used is violated: the independence of the observations. In the case of the segments, almost by definition the observations are not independent, because the very nature of segmentation says that all the consumers in a single segment are homogeneous. They should all behave “the same” and are therefore not independent.
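The dummification step can be sketched as follows with statsmodels' formula interface, where `C(segment)` expands the categorical segment labels into dummy columns. The data, variable names, and effect sizes are invented purely for illustration.

```python
# Illustrative only: a categorical segment variable dummified for OLS.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"segment": rng.choice(["A", "B", "C"], size=n)})

# Simulated purchase intent that differs by segment, plus noise.
effects = {"A": 5.0, "B": 6.5, "C": 8.0}
df["intent"] = df["segment"].map(effects) + rng.normal(0, 1, n)

# C(segment) creates dummies (one level becomes the baseline), since a
# label like "B" is not quantitative and cannot enter the model as-is.
model = smf.ols("intent ~ C(segment)", data=df).fit()
print(model.params)
```

Note that the fit runs without complaint: OLS will happily estimate these coefficients even though, with real segment data, the within-segment homogeneity described above means the independence assumption behind its standard errors does not hold.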

**Consider the Hierarchical Structure of Your Data**

The problem is that there is a hierarchical or multilevel structure to the data that must be accounted for when estimating a model with data such as these. There are *micro* or *first-level* units nested inside *macro* or *second-level* units. The consumers are the micro units embedded inside the segments, which are the macro units. The observations used in estimation are on the micro units. The macro units give a context to those micro units, and it is that context that must be accounted for. There could be several layers of macro units, so my example is somewhat simplistic. This diagram illustrates a possible hierarchy consisting of three macro levels for consumers:

Consumer traits such as income and age would be used in modeling, but so would higher-level context variables. In the case of segments, there is just a single macro level.
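The nesting of micro units inside a macro unit can be made concrete with a pandas MultiIndex, as in this small sketch. The levels, labels, and values are assumptions for illustration, covering only the single-macro-level segment case.

```python
# Illustrative only: micro units (consumers) nested inside macro units
# (segments), represented with a pandas MultiIndex.
import pandas as pd

df = pd.DataFrame({
    "segment":  ["S1", "S1", "S2", "S2", "S2"],   # macro (second-level) units
    "consumer": ["c1", "c2", "c3", "c4", "c5"],   # micro (first-level) units
    "income":   [55, 62, 80, 75, 90],             # micro-level traits
    "age":      [34, 41, 52, 47, 60],
}).set_index(["segment", "consumer"])

# Each consumer's observation carries its macro context via the outer index.
print(df.loc["S2"])  # all micro units inside macro unit S2
```

The outer index level is the context: selecting a macro unit returns exactly the micro units it contains.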

Business units could also have a hierarchical structure. A possibility is illustrated here:

The macro levels usually have key driver variables that influence the lower level. For consumers, macroeconomic factors such as interest rates, the national unemployment rate, and real GDP in a given time period would influence or drive how consumers behave. The interest rate in that period would be the same for all consumers. Over several periods, the rates would change, but they would change the same way for all consumers. Somehow this factor, and others like it, has to be incorporated into a model. Traditionally, this has been done either by aggregating the micro-level data to a higher level or by disaggregating the macro-level data to a lower level.
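The two traditional approaches can be sketched in a few lines of pandas: aggregation collapses the micro rows up to the macro level, while disaggregation copies the macro value down onto every micro row. All numbers and column names here are invented for illustration.

```python
# Illustrative only: aggregating micro data up vs. disaggregating a
# macro variable down.
import pandas as pd

consumers = pd.DataFrame({
    "segment": ["S1", "S1", "S2", "S2"],
    "spend":   [100.0, 120.0, 200.0, 240.0],
})
macro = pd.DataFrame({
    "segment":       ["S1", "S2"],
    "interest_rate": [0.05, 0.05],   # same for all consumers in a period
})

# Aggregation: collapse micro rows to segment means (detail is lost).
agg = consumers.groupby("segment", as_index=False)["spend"].mean()

# Disaggregation: copy the macro value onto every micro row ("blown up").
disagg = consumers.merge(macro, on="segment")

print(agg)
print(disagg)
```

Aggregation leaves two rows where there were four; disaggregation leaves four rows that all carry the identical macro value. Both outcomes lead directly to the problems listed next.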

Both data handling methods have issues:

*Aggregation problems*:

- Information is hidden or obscured. Recall that information is buried inside the data and must be extracted; aggregation buries it deeper.
- Statistically, there is a loss of power for tests and procedures, since the effective sample size shrinks to the number of macro units.

*Disaggregation problems*:

- Data are “blown up” and artificially evenly spread.
- Statistical tests assume independent draws from a distribution, but the disaggregated observations are not independent since they share a common base, violating that key assumption.
- The sample size is distorted, since the measures are taken at a higher level than the one the sampling was designed for.

This last point on sample size has a subtle consequence. The sample size is too large because the measurement units (e.g., consumers) are counted at the wrong level, **SO** the standard errors are too small, **SO** the test statistics are too large, **SO** you would reject the Null Hypothesis of no effect too often.
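The chain above can be verified with a small simulation: duplicating each observation (the disaggregation “blow-up”) inflates the apparent sample size and shrinks the standard error, even though no new information was added. The data here are simulated, with an arbitrary duplication factor of 10.

```python
# Illustrative only: how blowing up the data inflates test statistics.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.2, 1.0, 50)     # 50 genuinely independent observations
x_blown = np.repeat(x, 10)       # each row copied 10 times: n "becomes" 500

def se_of_mean(a):
    return a.std(ddof=1) / np.sqrt(len(a))

t_true  = x.mean() / se_of_mean(x)
t_blown = x_blown.mean() / se_of_mean(x_blown)

# The mean is unchanged, but the blown-up t statistic is roughly
# sqrt(10) times larger, so the null would be rejected far too often.
print(t_true, t_blown)
```

The inflation factor is approximately the square root of the duplication factor, which is exactly the "standard errors too small, test statistics too large" mechanism described above.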

**My Next Post**

In my next post, I’ll explore other issues associated with a multilevel data structure and how to handle such data.