“Designed Data” and “Organic Data”

Written by: Director Robert Groves

What’s the difference between “data” and “information”?

We’re entering a world where data will be the cheapest commodity around, simply because society has created systems that automatically track transactions of all sorts. For example, internet search engines build data sets with every entry, Twitter generates tweet data continuously, traffic cameras digitally count cars, scanners record purchases, RFID tags signal the presence of packages and equipment, and internet sites capture and store mouse clicks. Collectively, society is assembling data on massive amounts of its behaviors. Indeed, if you think of these processes as an ecosystem, it is self-measuring in increasingly broad scope. We might label these data as “organic,” a now-natural feature of this ecosystem.

Information is produced from data by uses. Data streams have no meaning until they are used. The user finds meaning in data by bringing questions to the data and finding their answers in the data. An old quip notes that a thousand monkeys at typewriters will eventually produce the complete works of Shakespeare. (For younger readers, typewriters were early word processing hardware.) The monkeys produce “data” with every keystroke. Only we, as “users,” identify the Shakespearian content. Data without a user are merely the jumbled-together shadows of a past reality.

What’s this got to do with the Census Bureau? For decades, the Census Bureau has created “designed data” in contrast to “organic data.” The questions we ask of businesses and households create data with a pre-specified purpose, with a use in mind. Indeed, designed data through surveys and censuses are often created by the users. This means that the ratio of information to data (for those uses) is very high, relative to much organic data. Direct estimates are made from each data item – no need to search for a Shakespearian sonnet within the masses of data.

What has changed is that the volume of organic data produced as auxiliary to the Internet and other systems now swamps the volume of designed data. In 2004 the monthly traffic on the internet exceeded 1 exabyte or 1 billion gigabytes. The risk of confusing data with information has grown exponentially. We must collectively figure out the role of organic data in extracting useful information about society. Hence the interest in developments like “Google Flu,” which tries to predict the course of flu epidemics, and the MIT billion prices index, which scrapes price data from internet sales sites to measure price inflation.

The challenge to the Census Bureau is to discover how to combine designed data with organic data, to produce resources with the most efficient information-to-data ratio. This means we need to learn how surveys and censuses can be designed to incorporate transaction data continuously produced by the internet and other systems in useful ways. Combining data sources to produce new information not contained in any single source is the future. I suspect that the biggest payoff will lie in new combinations of designed data and organic data, not in one type alone.

To continue the monkey-typewriter metaphor, the internet and other computer systems are like typewriters that have an unknown set of keys disabled. Some keys are missing but we don’t know which ones are missing. They’re not capturing all behaviors in the society, just some. The Shakespearian library may or may not be the result of the monkeys pounding on the keys. In contrast to the beauty of the bard’s words, we may only find pedestrian jingles and conclude that’s as good as it gets. We need designed data for the missing keys; then we need to piece them together with the masses of organic data from the present keys.
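
For readers who think in code, the “missing keys” idea can be sketched roughly as follows. This is only a toy illustration, not a description of any Census Bureau method: the function name combine_sources, the group labels, and every number are hypothetical, and the rule of preferring organic data wherever it exists is an assumption made purely to keep the example short.

```python
def combine_sources(organic_totals, survey_estimates):
    """Piece together one picture from two sources.

    Assumption for illustration: use the organic total for any group
    the organic systems cover (a "present key"), and fall back to the
    designed survey estimate for groups they miss (a "missing key").
    """
    combined = {}
    for group in sorted(set(organic_totals) | set(survey_estimates)):
        if group in organic_totals:
            combined[group] = ("organic", organic_totals[group])
        else:
            combined[group] = ("designed", survey_estimates[group])
    return combined


# Hypothetical, made-up figures: the organic systems never see cash payments.
organic_totals = {"online_retail": 120.0, "card_payments": 340.0}
survey_estimates = {"card_payments": 355.0, "cash_payments": 90.0}

combined = combine_sources(organic_totals, survey_estimates)
total = sum(value for _, value in combined.values())

for group, (source, value) in combined.items():
    print(f"{group}: {value} (from {source} data)")
print(f"combined total: {total}")
```

In practice the hard part is the one the sketch assumes away: knowing which keys are missing in the first place, which is exactly where designed data earn their keep.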

The combination of designed data with organic data is the ticket to the future.

3 Responses to “Designed Data” and “Organic Data”

  1. Aref N. Dajani and Caribert Irazi says:

    We find great value when statistical products found in academia, other governmental institutions, and industry are benchmarked to those produced by the Census Bureau.
    The University of Michigan’s Institute for Social Research, collaborating with Reuters, conducts the Survey of Consumers to calculate the Index of Consumer Expectations. Together with statistics generated from our monthly economic surveys, they become official components of the Index of Leading Economic Indicators. http://ns.umich.edu/index.html?Releases/2006/Jul06/reuters_bg
    Our National Crime Victimization Survey is used alongside the Uniform Crime Reports from the Federal Bureau of Investigation to give a more precise assessment of crime in America. http://www2.fbi.gov/ucr/cius2009/about/crime_measures.html
    Finally, the new MIT billion prices index serves its stakeholders well when benchmarked against the U.S. Consumer Price Index, calculated through a number of our monthly surveys, including the Consumer Expenditures Survey. http://bpp.mit.edu/
    Our metadata painstakingly document our assumptions, definitions, and limitations. This well suits academia, other governmental agencies, and industry that benchmark against our statistical products.
    We believe that there will always be demand for our statistical products. We believe that our data have value far beyond their pre-specified purpose.
    Aref Dajani, Mathematical Statistician, Demographic Statistical Methods Division and Caribert Irazi, Survey Statistician, Foreign Trade Division

  2. Finding the difference between data and information is hard, especially since we live in a world where there is an immense amount of data. We often lose track of what is useful and what is not. I agree with the quote “The combination of designed data with organic data is the ticket to the future.”

  3. Jim Pruitt says:

    I find it disheartening that it is April 2013 and this post has only been viewed 478 times. Director Groves, I find your words cogent and prophetic. I agree with your assessment that a combination of both designed and organic data is the future.

    I hope your time as Provost at Georgetown is rewarding.

