In my last post, I discussed the importance of understanding the structure of your data before you do any empirical work such as calculating statistics, visualizing data, or estimating advanced models. The structure determines the type of analysis you can do and aids in the interpretation of results. More importantly, the structure determines how much information can be extracted.
Telling you to know the structure raises two questions:
- What is structure?
- How does structure impact analysis?
What Is Data Structure?
I noted before that there are simple and complex data structures. The simple one is what you’re shown and taught in Stat 101. The data typically are in a rectangular array or data table, which is just a matrix of rows and columns. The rows are objects and the columns are variables. An object can be a person or an event. The words object, case, individual, event, observation, and instance are often used interchangeably. Each row is an individual case and each case is in its own row. There are never any blank rows. The columns are the variables, features, or attributes; all three terms are used interchangeably. Each variable is in one column with one variable per column. This is the structure of a Stat 101 data table.
The Stat 101 data table also typically has just numeric data, usually continuous variables measured on a ratio scale. For a ratio-scaled variable, both the distance between values and the ratio of values are meaningful (to say one object is twice another has meaning), and there is a fixed, nonarbitrary origin or zero point, so a value of zero means something. For example, real GDP is a ratio variable since zero real GDP means that a country produced nothing. Sometimes a discrete nominal variable is included, but the focus is on the continuous ratio variable. A nominal variable is categorical with a finite number of values or levels. The minimum number of levels is, of course, two; with one level it is just a constant. With this simple structure, there is a limited number of operations and analyses you can do and, therefore, a limited amount of information can be extracted.
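To make the Stat 101 layout concrete, here is a minimal Python sketch of a rectangular data table with one ratio variable and one nominal variable. The table, the column names (`id`, `weight_kg`, `group`), and every value are hypothetical placeholders chosen only for illustration.

```python
# Hypothetical table: all names and values are placeholders for illustration.
table = [
    {"id": "case_1", "weight_kg": 40.0, "group": "A"},
    {"id": "case_2", "weight_kg": 80.0, "group": "B"},
    {"id": "case_3", "weight_kg": 60.0, "group": "A"},
]

# One case per row and one variable per column: every row has the same columns.
columns = list(table[0])
is_rectangular = all(list(row) == columns for row in table)
shape = (len(table), len(columns))  # (number of cases, number of variables)

# Ratio scale: a true zero point makes ratios meaningful,
# so it makes sense to say case_2 weighs twice as much as case_1.
ratio = table[1]["weight_kg"] / table[0]["weight_kg"]

# Nominal scale: only the set of levels matters; levels have no order or distance.
levels = sorted({row["group"] for row in table})

print(is_rectangular, shape, ratio, levels)
```

The point of the sketch is only that the shape and the variable types are known before any analysis starts, since they decide which operations make sense.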
A Simple Data Structure
A simple example that might be used in an elementary statistics course is the following based on state data*:
* Only 6 out of 50 states shown. Sources:
Unemployment rate: https://www.bls.gov/web/laus/laumstrk.htm
This is a simple data structure: a 50 x 3 rectangular array with each state in one row and only one state per row. There is one character variable and two numeric/continuous variables. The typical analysis would be very simple. Means for the unemployment rate and household income would be calculated, although the median income might be used. Histograms and boxplots could be developed to show the distributions. These might be combined into one display such as the following:
You could also display the pattern between the two series using a scatter plot as shown here:
There is one large outlier in the upper right corner (which happens to be Alaska) and what appears to be a negative relationship between income and unemployment if the outlier is ignored. This relationship might be expected intuitively. Otherwise, not much else is evident. However, a slightly more advanced analysis might include a contour or density plot relating the two series. Such a plot is shown here:
The two dark patches indicate a concentration or clustering of states: a large black patch with high unemployment and low income and a second smaller black patch for low unemployment and not much better income. This is slightly more informative.
These analyses are standard for a simple structure such as this, although the contour plot might be a little more advanced. The information extracted is minimal at best.
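The summary statistics described for the simple table can be sketched in a few lines of Python. The rows below are hypothetical placeholders, not the actual state figures from the sources cited in this post; one deliberately extreme row is included to show why the median might be used for income instead of the mean.

```python
from statistics import mean, median

# Hypothetical placeholder rows: (state label, unemployment rate %, household income $).
# These are NOT the real state values from the cited sources.
table = [
    ("State A", 3.0, 50000.0),
    ("State B", 4.0, 52000.0),
    ("State C", 5.0, 54000.0),
    ("State D", 4.0, 56000.0),
    ("State E", 8.0, 98000.0),  # deliberately extreme row acting as an outlier
]

unemployment = [row[1] for row in table]
income = [row[2] for row in table]

summary = {
    "mean_unemployment": mean(unemployment),
    "mean_income": mean(income),
    "median_income": median(income),  # less sensitive to the outlier than the mean
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these made-up numbers the single extreme row pulls the mean income well above the median, which is the usual argument for reporting the median.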
A More Complex Data Structure
Now reconsider the above but with one new variable: the dominant political party.
Source for party affiliation: https://www.cleveland.com/datacentral/index.ssf/2015/09/blue_states_red_states_rich_st.html
This is now a 50 x 4 data table with one new character variable. The political party affiliation is based on the majority party among three offices: the governor and the two US Senators. If at least two of the three are Democrats, the state's affiliation is Democrat; if at least two are Republicans, it is Republican. Party affiliation is a nominal variable with two levels.
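The two-out-of-three rule above can be written as a small function. This is a minimal sketch: the function name `state_affiliation` and the party codes `"D"`, `"R"`, and `"I"` are hypothetical, and returning `None` when no party holds two offices is an assumption of mine, since the post does not say how such a case would be handled.

```python
from collections import Counter

def state_affiliation(governor, senator_1, senator_2):
    """Return the party holding at least two of the three offices, else None."""
    counts = Counter([governor, senator_1, senator_2])
    party, n_offices = counts.most_common(1)[0]
    # Assumption: with no two-office majority, leave the state unclassified.
    return party if n_offices >= 2 else None

print(state_affiliation("D", "D", "D"))  # all three offices held by one party
print(state_affiliation("R", "D", "R"))  # two of three offices held by one party
print(state_affiliation("D", "R", "I"))  # no party holds two offices
```

With only two parties in play, three offices always produce a majority, so the derived variable has exactly two levels.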
A proportion for the party affiliation could now be calculated and displayed in a pie or bar chart such as this:
There is actually slightly more to this data structure because of the political party affiliation. This makes it a little more complex, although not by much, because the states can be divided into two groups or clusters: red (Republican) and blue (Democrat) states. The unemployment and income data can then be compared by affiliation. This invites more possible analyses and potentially useful information so the information content extracted is larger.
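Comparing a measure by affiliation amounts to splitting the rows on the nominal variable and then comparing the group summaries. Here is a minimal sketch of one such comparison, a two-sample t statistic. It uses Welch's unequal-variance form, which is my assumption; the post does not say which variant of the test was run. The income values (in thousands) are hypothetical placeholders, not the real state data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical placeholder rows: (party, household income in $1000s).
rows = [
    ("blue", 62.0), ("blue", 66.0), ("blue", 70.0),
    ("red", 52.0), ("red", 56.0), ("red", 60.0),
]

# Split the table into one list of incomes per party.
groups = {}
for party, income in rows:
    groups.setdefault(party, []).append(income)

group_means = {party: mean(values) for party, values in groups.items()}
t_stat, df = welch_t(groups["blue"], groups["red"])
print(group_means, round(t_stat, 3), round(df, 1))
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind` with `equal_var=False` returns both the statistic and the p-value if SciPy is available.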
One possibility is a comparison of the unemployment rate by party affiliation using a two-sample t-test on the means. This would result in the following:
The p-values show there is no statistically significant difference between the unemployment rates for red and blue states under any alternative hypothesis. But this isn’t the case for household income. A similar test shows that the blue states have a significantly higher household income:
Now this is more interesting information. A scatter plot could again be created but with a party affiliation overlay to color the markers and two regression lines superimposed, one for each party:
Now we see a negative relationship for both parties, but with the red states lying below the blue states. A contour plot could also be created similar to the one above but with an overlay for party affiliation. This is shown here:
Now you can see that the two clusters observed before are red states. This is even more interesting information.
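The two regression lines described for the scatter plot, one per party, come from fitting a separate least-squares line to each group. A minimal sketch follows. The points are hypothetical placeholders built to mimic the pattern described in this section, and putting income on the horizontal axis and unemployment on the vertical axis is my assumption, since the original plot's axes are not stated.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

# Hypothetical placeholder points: (party, income in $1000s, unemployment rate %).
points = [
    ("blue", 60.0, 4.0), ("blue", 70.0, 3.0), ("blue", 80.0, 2.0),
    ("red", 50.0, 3.0), ("red", 60.0, 2.0), ("red", 70.0, 1.0),
]

# Fit one line per level of the nominal variable.
lines = {}
for party in ("blue", "red"):
    xs = [income for p, income, _ in points if p == party]
    ys = [rate for p, _, rate in points if p == party]
    lines[party] = fit_line(xs, ys)

for party, (slope, intercept) in lines.items():
    print(f"{party}: unemployment = {intercept:.1f} + ({slope:.2f}) * income")
```

In this made-up data both slopes are negative and equal, and the red line has the lower intercept, so it sits below the blue line at every income level.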
Higher Dimensional Data Adds Complexity and Opportunity
Adding the one extra variable increased the data table’s dimensionality by one, so the structure is now slightly more complex. But there is now more opportunity to extract information. The structure is important since it determines what you can do. The more complex the structure, the more analytical possibilities there are and the more information can be extracted. Andrew Gelman referred to this as the blessing of dimensionality (http://andrewgelman.com/2004/10/27/the_blessing_of/) as opposed to the curse of dimensionality. In my next post, I will discuss even more complex structures.