Since my blog posts started a few weeks ago, I have been musing about data structures and their importance for Deep Data Analysis. In my previous post, I noted that some structures across the rows of a data table are explicit while others are implicit. The explicit structures are obvious from the variables in the data table. Political party affiliation is the example I have been using. A variable on party affiliation is in the data table, so a division of the data by party is obvious. How this variable is actually used is a separate matter, but it can and should be used to extract more information from the whole data table.

Region of the country, the variable I discussed last time, is implicit in the data table. There was no variable named “Region” in the table, yet it was there in the sense that the states comprising the regions were there, and states are nested under regions. As I noted in the last post, the states are mapped to regions by the Census Bureau, and this mapping is easy to obtain. Using this Census Bureau mapping, a Region variable could be added that was not previously present, at least not explicitly. With this added variable, further cuts of the data are possible, leading to more detailed and refined analysis.
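As a rough sketch of how such a mapping might be applied in Python with pandas: the small data table below is hypothetical (it is not the table from my earlier posts), and the dictionary covers only a handful of states for illustration. A real analysis would load the full state-to-region list.

```python
import pandas as pd

# Illustrative subset of a state-to-region mapping (four Census regions).
# A full mapping would cover every state.
STATE_TO_REGION = {
    "NY": "Northeast",
    "PA": "Northeast",
    "IL": "Midwest",
    "OH": "Midwest",
    "TX": "South",
    "FL": "South",
    "CA": "West",
    "WA": "West",
}

# Hypothetical data table with an explicit State and Party variable.
df = pd.DataFrame({
    "State": ["NY", "TX", "CA", "OH", "FL"],
    "Party": ["Democrat", "Republican", "Democrat", "Republican", "Independent"],
})

# Make the implicit Region variable explicit by mapping State to Region.
df["Region"] = df["State"].map(STATE_TO_REGION)

# The new variable supports further cuts, e.g. party counts within region.
region_by_party = pd.crosstab(df["Region"], df["Party"])
print(region_by_party)
```

Any state missing from the dictionary would come back as a missing value in Region, which is a convenient way to spot gaps in the mapping.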
Implicit Structural Variables
The explicit structural variables are clear-cut – they are whatever is in the data table. The implicit structural variables also depend on what is already in the data table, but their underlying components have to be found and manipulated (i.e., wrangled) into the new variables. This is what I did with mapping States to Regions. Variables that are candidates for the mapping include, but are certainly not limited to, any of the following:
- Telephone numbers:
  - Extract international codes and domestic US area codes.
- ZIP codes and other postal codes.
- Time/Date stamps for extracting:
  - Day-of-week
  - Work day vs weekend
  - Time-of-day (e.g., Morning/afternoon/evening/night)
  - Holidays (and holiday weekends)
  - Season of the year
- Web addresses
- Date-of-Birth (DOB)
  - Year of birth
  - Decade of birth
- SKUs which are often combinations of codes
  - Product category
  - Product line within a category
  - Specific product
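A few of the items in this list can be sketched in pandas. Everything in the example is an assumption made for illustration: the column names, the sample values, the time-of-day cut points, the Northern Hemisphere month-based seasons, and especially the SKU layout (category, line, and product code separated by hyphens). Real SKUs follow whatever scheme the business uses.

```python
import pandas as pd

# Hypothetical transaction table; all values and the SKU layout are made up.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-06 08:15",  # a Saturday
        "2024-07-10 14:30",  # a Wednesday
        "2024-10-21 21:05",  # a Monday
    ]),
    "dob": pd.to_datetime(["1985-03-12", "1992-11-30", "1978-06-01"]),
    "sku": ["APP-SHRT-0042", "ELC-PHON-0007", "APP-PANT-0113"],
})

ts = df["timestamp"]

# Day-of-week and work day vs weekend (Monday=0, ..., Sunday=6).
df["day_of_week"] = ts.dt.day_name()
df["is_weekend"] = ts.dt.dayofweek >= 5

# Time-of-day from the hour; the cut points are an arbitrary choice.
df["time_of_day"] = pd.cut(
    ts.dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# Season from the month (Dec-Feb winter, Mar-May spring, and so on).
seasons = {0: "winter", 1: "spring", 2: "summer", 3: "fall"}
df["season"] = ((ts.dt.month % 12) // 3).map(seasons)

# Year and decade of birth from the DOB.
df["birth_year"] = df["dob"].dt.year
df["birth_decade"] = df["birth_year"] // 10 * 10

# Split the SKU into its assumed components.
sku_parts = df["sku"].str.split("-", expand=True)
df["product_category"] = sku_parts[0]
df["product_line"] = sku_parts[1]
df["product_id"] = sku_parts[2]

print(df.drop(columns=["timestamp", "dob", "sku"]))
```

Each derived column is a new discrete variable that was already in the table implicitly, just as Region was implicit in State.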
You could also bin or categorize continuous variables to create new discrete variables that add structure. For example, age may be calculated from the date-of-birth (DOB), and the people may then be binned into pre-teens, teens, adults, and seniors.
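The age example might look like the following sketch. The birth dates, the fixed reference date, and the age cut points (13, 20, and 65) are my assumptions for illustration; the appropriate boundaries depend on the analysis.

```python
import pandas as pd

# Hypothetical dates of birth and a fixed "as of" date for computing age.
dob = pd.Series(pd.to_datetime(
    ["2015-05-01", "2008-09-15", "1990-02-20", "1950-12-31"]
))
as_of = pd.Timestamp("2024-06-30")

# Age in completed years: subtract one if this year's birthday
# has not yet occurred as of the reference date.
had_birthday = (dob.dt.month < as_of.month) | (
    (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
)
age = as_of.year - dob.dt.year - (~had_birthday).astype(int)

# Bin the continuous age into discrete groups; intervals are
# closed on the left, so 13 is a teen and 65 is a senior.
age_group = pd.cut(
    age,
    bins=[0, 13, 20, 65, float("inf")],
    right=False,
    labels=["pre-teen", "teen", "adult", "senior"],
)

print(pd.DataFrame({"age": age, "age_group": age_group}))
```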
Implicit Variables Based on Several Explicit Variables
In each of these cases, a single explicit variable can be used to identify an implicit variable. The problem is compounded when several explicit variables can be used. But which ones and how? This is where several multivariate statistical methods come in. To mention two: cluster analysis and principal components analysis (PCA). The former can be used to group or cluster rows of the data table based on a number of explicit variables. The result is a new implicit variable that can be added to the data table as a discrete variable with levels or values identifying the clusters. This new discrete variable is much like the Region and Party Affiliation variables I previously discussed.

For instance, consider an enhanced version of the data table from the prior posts. The enhancement is the addition of variables, at the state level, for educational attainment (percent with a high school education, percent with a bachelor’s degree, and percent with an advanced degree), the state Human Development Index (HDI), and the state Gini coefficient as a measure of income inequality. A hierarchical cluster analysis (using Ward’s method) was done on these variables. The dendrogram is shown here:
Four clusters were identified and added to the main data table. The distribution is shown here:
This new variable is discrete with four levels so it can be used to dig deeper into the original data. More structure has been imposed.
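For readers who want to try something similar, here is a minimal sketch of Ward's hierarchical clustering using SciPy. It is not the analysis behind the figures above: the state-level values are randomly generated stand-ins, the column names are hypothetical, and standardizing the variables before clustering is my assumption rather than something stated in the post.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the state-level table (NOT the real data).
rng = np.random.default_rng(0)
n_states = 50
states = pd.DataFrame({
    "pct_high_school": rng.uniform(80, 95, n_states),
    "pct_bachelors": rng.uniform(20, 45, n_states),
    "pct_advanced": rng.uniform(7, 20, n_states),
    "hdi": rng.uniform(0.86, 0.96, n_states),
    "gini": rng.uniform(0.42, 0.52, n_states),
})

# Standardize so that no variable dominates the distances (an assumption).
z = (states - states.mean()) / states.std()

# Hierarchical clustering with Ward's method on Euclidean distances.
merge_tree = linkage(z.to_numpy(), method="ward")

# Cut the tree into at most four clusters and add the labels (1 to 4)
# to the data table as a new discrete variable.
states["Cluster"] = fcluster(merge_tree, t=4, criterion="maxclust")

# Distribution of states across the clusters.
print(states["Cluster"].value_counts().sort_index())
```

The Cluster column can then be used like Region or Party Affiliation, for example as a grouping variable in a cross-tab or a group-by summary.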
My Next Posting
In future blog postings, I will explore even more structure across variables.