When I started analyzing data many years ago, I would panic each time I was given a new dataset. All the rows and columns of numbers, and sometimes words, just stared at me, challenging me to uncover the information inside. Eventually I would, but it was painful. Not because I didn’t know the analysis objective or the statistical and econometric methods needed for the analysis, but because I didn’t know the structure of the data.
By poking around and “looking” at the dataset, I would, sooner or later, come to understand the structure, the panic would disappear, and I would complete the analysis.
What is Structure in Data?
The structure is not the number of rows and columns in the data table or which columns come first and which come last. That is the physical structure, and it is relatively unimportant. The real structure is the organization of columns relative to each other so that they tell a story. Take a survey dataset, for example. Typically, case ID variables are at the beginning and demographic variables are at the end; this is the physical structure. The real structure consists of columns (i.e., variables) that are conditions for other columns. So, if a survey respondent’s answer is “No” in one column, then other columns that depend on that answer might contain one set of responses, and a different set if the answer is “Yes.” The dependent responses could, of course, simply be missing values. For a soup preference study, if the first question is “Are you a vegetarian?” and the respondent said “Yes”, then later columns for types of meats preferred in the soups would have missing values. This is a structural dependency.
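The vegetarian example can be sketched in a few lines of Python. This is only an illustration: the column names (`vegetarian`, `meat_preferred`), the sample responses, and the helper `classify_meat_answer` are all hypothetical, and `None` is assumed to mark a missing value. The point is that a missing meat preference means something different depending on the answer to the earlier question.

```python
from collections import Counter

# Hypothetical soup-survey responses; None marks a missing value.
responses = [
    {"id": 1, "vegetarian": "Yes", "meat_preferred": None},
    {"id": 2, "vegetarian": "No", "meat_preferred": "chicken"},
    {"id": 3, "vegetarian": "No", "meat_preferred": None},
    {"id": 4, "vegetarian": "Yes", "meat_preferred": None},
]


def classify_meat_answer(row):
    """Label the meat question using the answer to the vegetarian question."""
    is_vegetarian = row["vegetarian"] == "Yes"
    has_answer = row["meat_preferred"] is not None
    if is_vegetarian and has_answer:
        return "inconsistent"   # a vegetarian should have skipped this question
    if is_vegetarian:
        return "structural"     # missing by design of the questionnaire
    if has_answer:
        return "answered"
    return "unexplained"        # missing for some other reason


counts = Counter(classify_meat_answer(r) for r in responses)
print(dict(counts))
```

Only the "unexplained" cases are missing values in the usual sense; the "structural" ones are a consequence of how the questionnaire is organized.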
The soup example is obviously a simple structure. An even simpler structure is the one that appears in Stat 101 textbooks. It has just a few rows and columns, no missing values, and no structural dependencies. Very neat and clean – and always very small. All the data needed for a problem are also in that one dataset. Real-world datasets are not neat, clean, small, and self-contained. Aside from being “messy” (i.e., having missing values, structural and otherwise), they also have complex structures. Consider a dataset of purchase transactions that has purchase locations, date and time of purchase (these last two making it a panel or longitudinal dataset), product type, product class or category, customer information (e.g., gender, tenure as a customer, last purchase), prices, discounts, sales incentives, sales rep identification, multiple levels of relationships (e.g., stores within cities and cities within sales regions), and so forth. And these data may be spread across several databases, so they have to be joined together in a logical and consistent fashion. And don’t forget that this is Big Data with gigabytes! This is a complex structure different from the Stat 101 structure as well as the survey structure.
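To make the multilevel idea concrete, here is a minimal sketch, assuming three tiny made-up tables that stand in for separate databases: transactions, a store-to-city lookup, and a city-to-region lookup. The store IDs, city and region names, and the helper `join_hierarchy` are hypothetical. The sketch joins each transaction up the hierarchy and sets aside any row whose key has no match, which is one way to keep the join “logical and consistent.”

```python
from collections import defaultdict

# Hypothetical lookup tables, each standing in for a separate database.
stores = {"S1": "CityA", "S2": "CityA", "S3": "CityB"}   # store -> city
cities = {"CityA": "East", "CityB": "West"}              # city -> sales region

transactions = [
    {"store_id": "S1", "date": "2024-01-05", "amount": 20.0},
    {"store_id": "S2", "date": "2024-01-05", "amount": 35.0},
    {"store_id": "S3", "date": "2024-01-06", "amount": 15.0},
    {"store_id": "S9", "date": "2024-01-06", "amount": 10.0},  # no matching store
]


def join_hierarchy(rows, store_to_city, city_to_region):
    """Attach city and region to each transaction; collect rows that fail to match."""
    joined, unmatched = [], []
    for row in rows:
        city = store_to_city.get(row["store_id"])
        region = city_to_region.get(city)
        if city is None or region is None:
            unmatched.append(row)
            continue
        joined.append({**row, "city": city, "region": region})
    return joined, unmatched


joined, unmatched = join_hierarchy(transactions, stores, cities)

# Roll sales up to the top level of the hierarchy.
sales_by_region = defaultdict(float)
for row in joined:
    sales_by_region[row["region"]] += row["amount"]

print(dict(sales_by_region))
print("unmatched transactions:", len(unmatched))
```

Checking the unmatched rows before any analysis is a design choice worth keeping at scale: a silent drop during a join changes the totals without any warning.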
Why You Need to Know the Structure of Your Data
Knowing the structure of a dataset is critical for being able to apply the right toolset to extract information hidden or latent inside the data. The data per se are not the information – they’re just “stuff.” It’s what is inside that matters. That’s the information. The more complex the structure, the more information is inside, the more difficult it is to extract that information from the “stuff”, and the more sophisticated the tools that are needed for that extraction.
For an analogy, consider two books: Dick & Jane and War & Peace. Each book is a collection of words that are data points no different than what is in a dataset. The words per se have no meaning, just as data points (i.e., numbers) have no meaning. But both books have a message (i.e., information) that is distilled from the words; the same for a dataset.
Obviously, Dick & Jane has a simple structure: just a few words on a page, a few pages, and one or two simple messages (the information). War & Peace, on the other hand, has a complex structure: hundreds of words on a page, hundreds of pages, and deep, thought-provoking messages throughout. You would never read War & Peace the way you would read Dick & Jane: the required toolsets are different. And someone who could only read Dick & Jane would never survive War & Peace. You would never even approach War & Peace the way you would approach Dick & Jane. Yet this is what many of us do: we approach a complex dataset the way we approach a Stat 101 dataset.
How to Find the Structure in Your Data
When you read War & Peace, or a math book, or a physics book, or a history book, or an economics book, anything that is complex, the first thing you (should) do is look at the structure. This is given by the Table of Contents with chapter headings, section headings, and subsection headings all in a logical sequence. The Index at the back of the book gives hints about what is important. The book’s cover jacket has ample insight into the structure and complexity of the book and even about the author’s motive for writing the book. Also, the Preface has clues about the book’s content, theme, and major conclusions. You would not do this for Dick & Jane.
Just as you would (should) look at these for a complex book, so should you take comparable steps to understand a dataset’s structure. A data dictionary is one place to start; for survey data, the questionnaire is an obvious one; examining missing value patterns is a must; and groupings, as in a multilevel or hierarchical dataset, are more challenging to uncover. Once the structure is known, the analysis is made easier. This is not to say that the analysis will become trivial if you do this, but you will be better off than if you did not.
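As one illustration of examining missing value patterns, the sketch below tabulates which combinations of columns are missing together. The rows, the column names, and the helper `missing_pattern` are hypothetical, and `None` is assumed to mark a missing value. Columns that tend to go missing together are a hint that they depend on the same earlier condition.

```python
from collections import Counter

# Hypothetical rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000, "region": "East"},
    {"id": 2, "age": None, "income": None, "region": "East"},
    {"id": 3, "age": 41, "income": None, "region": "West"},
    {"id": 4, "age": None, "income": None, "region": "West"},
]
columns = ["age", "income", "region"]


def missing_pattern(row, cols):
    """Return the tuple of column names that are missing in this row."""
    return tuple(col for col in cols if row.get(col) is None)


pattern_counts = Counter(missing_pattern(r, columns) for r in rows)

# List patterns from most to least frequent; an empty tuple is a complete row.
for pattern, n in pattern_counts.most_common():
    print(pattern or "(complete)", n)
```

A table like this does not explain why values are missing, but it tells you where to look in the data dictionary or questionnaire for the explanation.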