Written by: Director Robert Groves
What’s the difference between “data” and “information”?
We’re entering a world where data will be the cheapest commodity around, simply because the society has created systems that automatically track transactions of all sorts. For example, internet search engines build data sets with every entry, Twitter generates tweet data continuously, traffic cameras digitally count cars, scanners record purchases, RFID tags signal the presence of packages and equipment, and internet sites capture and store mouse clicks. Collectively, the society is assembling data on massive amounts of its behaviors. Indeed, if you think of these processes as an ecosystem, it is self-measuring in increasingly broad scope. We might label these data as “organic,” a now-natural feature of this ecosystem.
Information is produced from data by uses. Data streams have no meaning until they are used. The user finds meaning in data by bringing questions to the data and finding their answers in the data. An old quip notes that a thousand monkeys at typewriters will eventually produce the complete works of Shakespeare. (For younger readers, typewriters were early word processing hardware.) The monkeys produce “data” with every keystroke. Only we, as “users,” identify the Shakespearian content. Data without a user are merely the jumbled-together shadows of a past reality.
What’s this got to do with the Census Bureau? For decades, the Census Bureau has created “designed data” in contrast to “organic data.” The questions we ask of businesses and households create data with a pre-specified purpose, with a use in mind. Indeed, designed data through surveys and censuses are often created by the users. This means that the ratio of information to data (for those uses) is very high, relative to much organic data. Direct estimates are made from each data item – no need to search for a Shakespearian sonnet within the masses of data.
What has changed is that the volume of organic data produced as auxiliary to the internet and other systems now swamps the volume of designed data. In 2004 the monthly traffic on the internet exceeded 1 exabyte, or 1 billion gigabytes. The risk of confusing data with information has grown exponentially. We must collectively figure out the role of organic data in extracting useful information about the society. This is the motivation for developments like “Google Flu,” which tries to predict the course of flu epidemics, and the MIT billion prices index, which scrapes price data from internet sales sites to measure price inflation.
The challenge to the Census Bureau is to discover how to combine designed data with organic data, to produce resources with the most efficient information-to-data ratio. This means we need to learn how surveys and censuses can be designed to incorporate transaction data continuously produced by the internet and other systems in useful ways. Combining data sources to produce new information not contained in any single source is the future. I suspect that the biggest payoff will lie in new combinations of designed data and organic data, not in one type alone.
To continue the monkey-typewriter metaphor, the internet and other computer systems are like typewriters that have an unknown set of keys disabled: some keys are missing, but we don’t know which ones. They’re not capturing all behaviors in the society, just some. The Shakespearian library may or may not be the result of the monkeys pounding on the keys. In contrast to the beauty of the Bard’s words, we may only find pedestrian jingles and conclude that’s as good as it gets. We need designed data for the missing keys; then we need to piece them together with the masses of organic data from the present keys.
The combination of designed data with organic data is the ticket to the future.