Written by: Director Robert Groves
Several posts ago, I outlined a set of thoughts (The Future of Producing Social and Economic Statistical Information, Part I, Part II, Part III) on how statistical agencies might navigate the future. How should they react to massive data sets that are being produced daily through internet search, social media, and administrative data processing? Can these sources of information about social and economic activities be combined in some effective way to reduce the burden on the American public of responding to surveys and censuses? How should agencies react to increasing demands for more timely and more disaggregated statistical information on the society and the economy?
Most records of companies, governments, and other organizations that formerly were on paper are now digitized. The phrase “big data,” which is increasingly being used, applies to some of these data. One source of “big data” is the management information systems that are part of modern manufacturing, financial, and transportation firms. Other “big data” sources are consumer transactions data, internet site content, traffic camera data, and a whole host of others. Many of them are designed to be used to administer some process (e.g., a retail sales company, a program of support payments), but they could also be aggregated to produce statistics useful to the country.
Statistical agencies like the Census Bureau exist for the sole purpose of producing such statistical aggregations. The methods and practices employed by these agencies lead to high-quality benchmark products that are trusted by a broad range of stakeholders. These aggregated products are distinct from identifiable information; the agencies are also governed by laws that prohibit revealing the identity of those described by the data.
The Census Bureau has outlined a vision that addresses a key challenge: society needs more timely statistics, but there is no new money to pay for the added information. To meet it, we are developing a new system for data collection, data integration, and real-time estimation that offers cheaper, faster, and potentially better statistical information for the country:
- Whenever possible we will offer multiple modes of data collection – internet, mail, telephone, face-to-face – as well as the use of administrative records, all under very tight security controls. This will allow us to reduce costs by using the most cost-efficient tool for each sample unit.
- We will use empirical quality criteria to manage our follow-up efforts on cases, deciding when to switch modes of data collection and, more importantly, when the cost and quality criteria suggest that we should terminate efforts at data collection.
- This will require us to perform near-real-time monitoring of the progress of data collections and to estimate preliminary statistics (fully edited and imputed) each day of the data collection.
- For survey statistics that can be improved by combining them with external data sources, the real-time estimation will utilize those sources.
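To make the quality criteria in these bullets concrete, here is a minimal sketch, in Python, of the kind of "continue, switch modes, or stop" rule such a system might apply. The mode costs, response propensities, and cutoff below are purely hypothetical illustrations, not actual Census Bureau parameters.

```python
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    mode: str                   # current collection mode for this case
    cost_so_far: float          # dollars already spent on this case
    response_propensity: float  # estimated chance the next attempt succeeds

# Hypothetical per-attempt costs by mode (not real figures)
MODE_COSTS = {"internet": 1.0, "mail": 3.0, "telephone": 12.0, "face-to-face": 60.0}
GAIN_PER_DOLLAR_CUTOFF = 0.01  # hypothetical quality-gain-per-dollar threshold

def next_action(case: Case) -> str:
    """Continue in the current mode, switch to a cheaper mode, or stop."""
    gain_per_dollar = case.response_propensity / MODE_COSTS[case.mode]
    if gain_per_dollar >= GAIN_PER_DOLLAR_CUTOFF:
        return "continue"
    # Otherwise, look for any mode cheap enough to justify another attempt
    for mode, cost in sorted(MODE_COSTS.items(), key=lambda kv: kv[1]):
        if mode != case.mode and case.response_propensity / cost >= GAIN_PER_DOLLAR_CUTOFF:
            return f"switch:{mode}"
    return "stop"
```

A production system would estimate response propensities daily from paradata; the point of the sketch is only that the follow-up decision can be driven by an explicit cost-and-quality criterion rather than a fixed number of attempts.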
This vision has the goal of not burdening American households and businesses with questions to be used for social and economic statistics when the information already exists in some record. Indeed, this vision will allow us to optimize our survey data collection efforts to fill critical gaps in administrative and other record systems. It continues the strong pledges of confidentiality that are essential to a statistical agency. Finally, it allows us to more nimbly meet the needs of our stakeholders for timely, relevant, and reliable data.

To realize this vision, we are building a “mixed-mode data collection” system that will provide the ability to improve our statistics by combining relevant other data sources with the survey/census data that we collect. It will allow us to offer respondents a mode of data collection that minimizes their burden. It will permit real-time monitoring of data collection to reassign cases to the best mode, to import administrative records or other relevant external data when appropriate, and to reduce the costs of large-scale surveys for a given quality target of the statistics.
What’s the role of “big data” in this vision? To be useful statistically, a “big data” set must have some relation to the statistical goals of a Census Bureau program. For example, some transaction data contain a recorded time of the transaction. These might be useful if our current survey data also record time. Each month we measure the sales of firms by asking them to report them. If we had access to customer purchase transaction volumes, we might construct models blending our benchmark sample survey data with the continuous transaction data to produce more timely and more disaggregated estimates. The strength of the transaction data will be their timeliness and the large number of transactions they reflect; their weakness will be that they omit transactions conducted in other ways (e.g., cash purchases might be missing). The strength of our benchmark survey will be its statistical coverage of the entire population of business units; its weakness is its lack of timeliness and its relatively small sample size of firms. Combined, the “big data” and the benchmark survey data can produce better statistics.
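As an illustration of the blending idea, here is a minimal Python sketch of one very simple combination: scale the timely daily transaction volumes so they agree with the monthly survey benchmark (a ratio adjustment). The function name and figures are hypothetical; real blending models would be considerably richer.

```python
def blended_daily_estimates(survey_month_total, transaction_daily):
    """Scale daily transaction volumes so they sum to the survey benchmark.

    The survey total corrects coverage gaps in the transaction data (e.g.,
    missing cash sales); the transactions supply the daily detail the
    monthly survey lacks.
    """
    txn_total = sum(transaction_daily)
    if txn_total == 0:
        raise ValueError("no transaction volume to benchmark against")
    ratio = survey_month_total / txn_total
    return [ratio * v for v in transaction_daily]

# e.g., spread a hypothetical $3,000 monthly benchmark over observed daily volumes
daily = blended_daily_estimates(3000.0, [90.0, 110.0, 100.0])
```

By construction the blended daily series always re-aggregates to the benchmark total, so the timelier estimates remain consistent with the published monthly statistic.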
Sometimes the link between our sample surveys and the big data will be time; other times it will be space. “Big data” will be useful for constructing small area estimates. For example, internet sites listing asking prices for houses may be accompanied by the exact location of the units. Their strength is that they offer millions of records of prospective sales; their weaknesses are that they do not cover all areas of the country, that not all sales are listed, and that asking prices are not sale prices. Our sample survey on residential sales offers statistical coverage of all sales, but its sample size is too small to provide statistics on all areas. Combining the two data series might offer more spatial detail.
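One standard way to combine a noisy direct survey estimate for a small area with a prediction built from abundant listing data is a precision-weighted composite, in the spirit of Fay–Herriot-type small-area models. The sketch below is a simplified illustration that assumes both variance figures are known; the function name and inputs are hypothetical.

```python
def composite_estimate(survey_mean, survey_var, listing_pred, model_var):
    """Precision-weighted blend of a direct small-area survey estimate and
    a model prediction from listing data.

    More weight goes to the survey estimate where its sampling variance is
    small; more weight goes to the listing-based prediction where the
    survey sample in the area is thin.
    """
    w = model_var / (model_var + survey_var)  # weight on the direct estimate
    return w * survey_mean + (1 - w) * listing_pred
```

Areas with no survey sample at all (survey variance effectively infinite) fall back entirely to the listing-based prediction, which is what gives the combined series its extra spatial detail.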
At other times, the link between the big data and our sample survey data may be measures that are highly correlated with our key statistics. For example, we might have access to traffic volume data continuously streaming from traffic cameras, with location codes to permit fine spatial detail. Our sample survey reports of commuting times from home to place of work might be enhanced by statistically combining them with the traffic count data from available cameras. The strength of the traffic camera counts would be very fine-grained detail on time; the weaknesses would be incomplete coverage of roads and the inclusion of commercial traffic along with private cars.
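A textbook way to exploit an auxiliary measure correlated with a survey variable is a regression estimator: adjust the survey mean by the gap between the sample and population means of the auxiliary data. A minimal sketch follows, with hypothetical variable names (commute times as the survey variable, traffic counts as the auxiliary measure).

```python
def regression_estimate(y_sample, x_sample, x_pop_mean):
    """Regression estimator of the mean of y (e.g., commute time) using an
    auxiliary x (e.g., traffic counts) whose mean is known for the whole
    population from the continuous "big data" source."""
    n = len(y_sample)
    y_bar = sum(y_sample) / n
    x_bar = sum(x_sample) / n
    # Least-squares slope of y on x within the sample
    sxx = sum((x - x_bar) ** 2 for x in x_sample)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_sample, y_sample))
    slope = sxy / sxx
    # Shift the sample mean by the auxiliary gap, scaled by the slope
    return y_bar + slope * (x_pop_mean - x_bar)
```

The stronger the correlation between the auxiliary counts and commute times, the more the adjustment reduces the variance of the survey-only estimate.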
So the system we are building permits the ingestion of auxiliary information from “big data” sets and their use in improving the estimation of key attributes of the society and economy we monitor.
Data will be the cheapest commodity in the future. Extracting useful information from the data, however, will be the expertise in greatest demand. Maintaining the strong confidentiality pledges of Census Bureau data is of paramount importance. The challenge to the Census Bureau is accessing available data (wherever they might be) and using them in ways to enhance the quality of statistical information we provide the public.