Understanding the structure of your data is critical for data analysis because the structure largely dictates the type of analysis you can, and should, do to reach your objective. In my previous post, I talked about a simple structure, a rectangular array of numeric/continuous variables, and then extended that structure slightly by including a nominal/discrete variable. I referred to the first case as the Stat 101 case because the analysis was straightforward using the concepts and tools developed in that course: a histogram and summary statistics. These do, of course, provide insight, but not terribly much. Certainly not enough to be the basis for a multi-billion-dollar business merger, the enactment of a major tax bill, or the decision about which presidential candidate to vote for in the next election. The slight extension adds more insight, but again not much.
In the last post, the structure was explicit in the data set: Major Political Party Affiliation was a variable that was part of the data set. It divided the states into “red” and “blue” states based on the majority party affiliation, and some analysis was done using this variable. That analysis was straightforward because the variable was right there in the data set; the structure was explicit. There are times, however, when a structure is present in the data but only implicitly. A variable may have to be added to the data set to reveal that implicit structure, which could then be used for further analysis, analysis that otherwise would have gone undone simply because the structure was not evident.
Let’s consider the data set from the last post. This had state data on unemployment and household income plus political party affiliation. An implicit variable that could be added is the regional assignment of each state. I expanded the data set for this post to include the US Census Region classification of each state. The Census Bureau assigns each state to one of four regions: Northeast, South, Midwest, and West. The distribution of the states across the regions is shown here:
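If you want to add this implicit variable yourself, a small lookup table is all it takes. The sketch below encodes the Census Bureau’s published four-region assignment by two-letter postal code (the variable name `CENSUS_REGION` is mine) and tallies the distribution:

```python
from collections import Counter

# U.S. Census Bureau region assignment for the 50 states, keyed by
# two-letter postal code (District of Columbia omitted here).
CENSUS_REGION = {
    # Northeast (9 states)
    "CT": "Northeast", "ME": "Northeast", "MA": "Northeast", "NH": "Northeast",
    "NJ": "Northeast", "NY": "Northeast", "PA": "Northeast", "RI": "Northeast",
    "VT": "Northeast",
    # Midwest (12 states)
    "IL": "Midwest", "IN": "Midwest", "IA": "Midwest", "KS": "Midwest",
    "MI": "Midwest", "MN": "Midwest", "MO": "Midwest", "NE": "Midwest",
    "ND": "Midwest", "OH": "Midwest", "SD": "Midwest", "WI": "Midwest",
    # South (16 states)
    "AL": "South", "AR": "South", "DE": "South", "FL": "South",
    "GA": "South", "KY": "South", "LA": "South", "MD": "South",
    "MS": "South", "NC": "South", "OK": "South", "SC": "South",
    "TN": "South", "TX": "South", "VA": "South", "WV": "South",
    # West (13 states)
    "AK": "West", "AZ": "West", "CA": "West", "CO": "West",
    "HI": "West", "ID": "West", "MT": "West", "NV": "West",
    "NM": "West", "OR": "West", "UT": "West", "WA": "West",
    "WY": "West",
}

# Tally the number of states in each region.
print(Counter(CENSUS_REGION.values()))
# Counter({'South': 16, 'West': 13, 'Midwest': 12, 'Northeast': 9})
```

Joining this mapping onto the state data set makes the implicit regional structure an explicit variable, ready for grouping and comparison.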
The inclusion of regions complicates the structure slightly because there are now four divisions of the data that were not evident before: the states are embedded or clustered within the regions. This division is important because any regional issues and tendencies would affect the states that are members of those regions. And the regions do differ in many ways. For example, consider that “a broad analysis of … general trends suggests regions do vary in their mean levels of personality traits. For example, Neuroticism appears to be highest in the Northeast and Southeast and lowest in the Midwest and West; Openness appears to be highest in the New England, Middle Atlantic, and Pacific regions and lower in the Great Plain, Midwest, and southeastern states; and Agreeableness is generally high in the Southern regions and low in the Northeast. The spatial patterns for Extraversion and Conscientiousness do not appear consistent across studies.” (P. J. Rentfrow, et al. “Divided We Stand: Three Psychological Regions of the United States and Their Political, Economic, Social, and Health Correlates.” Journal of Personality and Social Psychology, 2013, Vol. 105, No. 6, 996–1012)
This simple extension of the structure of the data, like the one discussed in my last post, involves a grouping of the data. In the last case, the grouping was simple and we were able to do some simple analyses, primarily a t-test for differences in means. We cannot, however, use pairwise t-tests to determine if there are differences among the regions because of the multiple comparisons problem. With four regions, there are six possible pairwise comparisons. The problem is that the error rate across this family of comparisons is inflated. If the traditional per-comparison error rate is set at 0.05, meaning that there is a 5% chance of making an incorrect decision on any single test, then the family-wise error rate that accounts for the multiplicity of tests is 1 − (1 − 0.05)^6 ≈ 0.265, more than a five-fold increase in the error rate. This means that the probability of making at least one incorrect decision among all six possible tests is more than 25%. The Tukey multiple comparisons procedure is usually used to adjust, or correct, for this inflated error rate.
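The arithmetic behind that inflation is short enough to verify directly. A minimal sketch, assuming the six tests are independent (an approximation; the actual pairwise tests share data, but the independence calculation is the standard illustration):

```python
from math import comb

alpha = 0.05              # per-comparison error rate
k = comb(4, 2)            # number of pairwise comparisons among 4 regions
fwer = 1 - (1 - alpha) ** k  # family-wise error rate, assuming independence

print(k)                  # 6
print(round(fwer, 3))     # 0.265
```

At ten groups the same formula gives 45 comparisons and a family-wise error rate above 0.90, which is why multiplicity adjustments become unavoidable as the number of groups grows.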
Consider the median annual household income for the 50 US states but aggregated by the four Census Regions. A diamond plot shown here
indicates that there is a difference in income level between the Northeast and the South regions. The indication comes from the diamonds for those two regions not overlapping; non-overlapping diamonds suggest a difference in means. On the other hand, there does not appear to be any difference between the West and Midwest regions.
The Tukey mean comparison test for all pairs of regions, shown here, supports the suggestion from the diamond plot:
The Connecting Letter Report and the Ordered Differences Report support the suggestion that the Northeast and the South differ in household income. The implication is that any modeling that involves household income should account for a regional effect.
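For readers working outside JMP, the same all-pairs Tukey comparison can be run in Python. This is a sketch only: the income samples below are made-up illustrative numbers, not the post's actual state data, and `scipy.stats.tukey_hsd` (available in SciPy 1.8+) stands in for JMP's report:

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)

# Hypothetical median household incomes (in $1000s) for states grouped
# by Census region -- illustrative only, not the post's actual data.
northeast = rng.normal(75, 8, size=9)
midwest = rng.normal(65, 8, size=12)
south = rng.normal(58, 8, size=16)
west = rng.normal(68, 8, size=13)

# Tukey's HSD adjusts the p-values for all 6 pairwise comparisons at once.
res = tukey_hsd(northeast, midwest, south, west)

# res.pvalue[i, j] is the adjusted p-value for the (i, j) pair of groups;
# a small value (e.g., Northeast vs. South) indicates a regional difference.
print(res.pvalue.round(3))
```

The p-value matrix plays the role of the Ordered Differences Report: any pair whose adjusted p-value falls below 0.05 would get different connecting letters in JMP's report.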
The goal of any data analysis is to extract information. In this example, some valuable information would have been overlooked had the implicit structure been ignored. I’ll explore implicit structures further in my next post.