Finding Structure in Columns of a Data Table

I already discussed explicit and implicit data structures based on a variable in a single data table.  It’s important that you recognize both of these because these structures enhance your analysis and allow you to extract more information and insight from your data.

Just to review and summarize, explicit structure is based on variables that are already part of your data table.  Gender, political party, education level, store last purchased at, and so forth are examples of variables that are most likely already present in a data table.  Implicit structure is based on variables that buried or are latent inside a single variable or are combinations of explicit structural variables.  They can be defined or derived from existing variables.  Examples are regions based on states, age groups based on age in years, and clusters based on demographic variables.

Finding Structure Across Rows

A feature of these explicit and implicit structural variables is that they define structure across the records within a single data table.  So, the explicit structural variable “gender” tells you how the rows of the data table – the cases or subjects or respondents – are divided.  This division could be incorporated in a regression model by including a gender dummy variable to capture a gender effect.

Finding Structure Across Columns

There could be structure, however, across columns, again within a single data table.  On the simplest level, some groupings of columns are natural or explicit (to reuse that word).  For example, a survey data table could have a group of columns for demographics and another for product performance for a series of attributes.  A Check-all-that-apply (CATA) question is another example since the responses to this type of question are usually recorded in separate columns with nominal values (usually 0/1).  But like the implicit variables for the rows of the data table, new variables can be derived from existing ones.

Consider, for example, a 2010 survey of US military veterans conducted by the Department of Veterans Affairs.  They surveyed veterans, active duty service members, activated national guard and reserve members, family members and survivors to help the VA plan future programs and services for veterans.  The sample size was 8,710 veterans representing the Army, Navy, Air Force, Marine Corps, Coast Guard, and Other (e.g. Public Health Services, Environmental Services Administration, NOAA, U.S. Merchant Marine). One question asked the vets to indicate their service branch.  This is a CATA question with a separate variable for each branch and each variable is nominally coded as 0/1.  There were 221 respondents who were in more than one branch.  A new variable could be created that combines the components through simple addition into one new variable to indicate the branch of service plus Multiple Branches for those in more than one branch.  This one structural variable would replace all the CATA variables plus it would include some new insight for the multiple branches.  Incidentally, this new variable would also add structure to the rows of the data table since its components are themselves classifiers.

High Dimensional Data Tables

Combining variables is especially prevalent in large data tables with hundreds if not thousands of variables which makes the data table “high dimensional.”   A high-dimensional data table is one that not only has a large number of variables, but this number outstrips the number of cases or observations.  If N is the number of cases and Q is the number of variables, then the data table is high dimensional if Q >> N.  You can see a discussion of this in my latest book on Pricing Analytics which is listed elsewhere on my website.  Because of this situation, standard statistical methods (e.g., regression analysis) become jeopardized.  Regression models, for example, cannot be estimated if Q >> N.  So, some means must be employed to reduce the dimensionality.

Working with High-Dimensional Data Tables

Just as there is cluster analysis for the rows of a data table, an approach most analysts are readily familiar with, so there is cluster analysis for the variables.  Another technique is Principal Components Analysis (PCA).  Both have the same goal: dimension reduction.  PCA is probably better known, but variable clustering is becoming more popular because the developing algorithms are handling different types of data plus it avoids a problem with PCA.  The data types typically found in a data table are quantitative and qualitative (also known as categorical).  The former is just standard numbers, interval and ratio, that most people would immediately think about.  The latter consists of classifiers such as gender, political party, and region.  These may be represented by numbers in the data table or words such as “Male” and “Female.”  If words are used, they are usually encoded with numerics but the codes are meaningless and arbitrary.  In these three examples, the numbers would be nominal values.  Ordinal values are also possible if there is an intuitive ordering to the data such as “Low-Level Manager”, “Mid-level Manager”, and “Executive.”  Technically, PCA is designed for quantitative data because it tries to find components based on variances which are quantitative concepts.  There are R packages that try to circumvent this issue, but, nonetheless, PCA in general has this problem.  Variable clustering algorithms avoid this issue.

Summary

The following table summarizes the discussion so far:

Facebook
Google+
Twitter
LinkedIn
Pinterest

Leave a Reply

Close Menu