Written by: Director Robert Groves
My last posts addressed how the changing world of economic units and households is diminishing the effectiveness of our current methods and increasing their costs, at a time of great fiscal pressure. This post comments on a way forward.
The future is likely to value more timely statistical information, while certain populations may become more difficult to measure directly. However, many of these difficult-to-measure businesses, households, and persons will be included in administrative data systems (data already supplied by the units) that could be used as companions to survey data. This future will require using multiple alternative sources of data simultaneously.
Currently, most of our economic and demographic surveys tend to use single sampling frames, often based on mailing addresses, from which we attempt to cover the entire population. Generally, we need to find contact information when we switch from one data collection mode to another (e.g., requesting phone numbers in wave 1 face-to-face interviews to prepare for wave 2 telephone interviews). This mode-by-mode approach means we use separate systems to manage each mode’s assigned sample, give work assignments to interviewers, monitor the progress of data collection, and perform data editing and estimation after data collection.
The economic directorate at the Census Bureau has already combined administrative data with survey data in inventive ways. It has more experience in simultaneously dealing with paper forms, internet responses, and telephone interviews. It has also grappled with decision rules that define what preliminary estimates shall be and what ingredients are required for revisions of those preliminary estimates. Timeliness of economic indicators has in the past been valued more highly than timeliness of demographic estimates. The future is likely to increase the demand for more timely information and the increased use of preliminary estimates.
In short, the future will require a blending together of multiple sources of data, real-time processing of data, and the production of more timely estimates. We need the infrastructure to facilitate this.
One could imagine a three-step evolution to this future.
Step 1. Consolidation of Frame Data and Paradata for Active Use during Data Collection. The first step would require no change in our current frames or in the current assignment of the first mode to cases. Sample designs would not be altered and would be implemented as currently performed. It would add, however, the capability of switching more quickly to the mode of data collection preferred by the sample unit. It would keep track of how the sample unit reacted to the initial approach to complete the measurement, using “paradata” (process data), in order to tailor the request to their lifestyles. Current management procedures, based on the human judgment of supervisors for assigning cases to modes, would continue to be used. In essence, the innovation in this phase of the evolution would be a) the integration of the sample management systems across modes into a single unified system, and b) a unified software module that provides a cross-mode progress monitoring capability.
Step 1 is already taking shape. We have launched the developmental effort for the “Unified Tracking System,” which contains a repository of frame data, process data on field activities, cost/effort data, call record data, and questionnaire data on sample cases. The repository will be updated daily. Statistical analysis of the data will produce new monitoring and reporting information for survey management. Such analysis will also permit better costing of future surveys.
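To make the idea of a single cross-mode repository concrete, here is a minimal Python sketch of what one case record and a cross-mode progress report might look like. The class names, field names, and mode labels are hypothetical illustrations, not the design of the Unified Tracking System itself.

```python
from dataclasses import dataclass, field


@dataclass
class ContactAttempt:
    """One row of call-record paradata (hypothetical layout)."""
    day: int       # day of the field period
    mode: str      # e.g. "mail", "web", "phone", "visit"
    outcome: str   # e.g. "no_contact", "refusal", "complete"
    cost: float    # cost of the attempt, in dollars


@dataclass
class CaseRecord:
    """Frame data, paradata, cost data, and responses for one sample case."""
    case_id: str
    frame: dict                                     # frame variables
    current_mode: str
    attempts: list = field(default_factory=list)    # ContactAttempt objects
    responses: dict = field(default_factory=dict)   # questionnaire data

    def log_attempt(self, attempt: ContactAttempt) -> None:
        self.attempts.append(attempt)

    def total_cost(self) -> float:
        return sum(a.cost for a in self.attempts)

    def is_complete(self) -> bool:
        return any(a.outcome == "complete" for a in self.attempts)


def progress_by_mode(cases):
    """Cross-mode progress report: mode -> (completed cases, all cases)."""
    report = {}
    for c in cases:
        done, total = report.get(c.current_mode, (0, 0))
        report[c.current_mode] = (done + int(c.is_complete()), total + 1)
    return report
```

Because every mode writes to the same record, cost and progress can be summarized with one function instead of one report per mode-specific system.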
Step 2. Real-time Mode Assignment. The second evolutionary step would add mode assignment based on a set of software-driven business rules that would guide real-time movement across modes. Some of the business rules would be based on statistical models run in the background and used to update mode assignments every 24 hours. The rules will never replace some of the unique knowledge held by field supervisors, but they would greatly increase the efficiency of case management for a large majority of cases. Hence, management would also retain the capability to assign cases to modes manually when direct intervention is desirable.
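A business rule of this kind could be as simple as the following Python sketch, run once per case in a nightly update. The escalation order, the thresholds, and the function name are all assumptions chosen for illustration; real rules would come out of the analysis of cost and paradata.

```python
# Hypothetical escalation order, cheapest mode first.
MODE_LADDER = ["web", "mail", "phone", "visit"]


def next_mode(current_mode, attempts_in_mode, propensity,
              max_attempts=3, min_propensity=0.10, manual_override=None):
    """Return the mode a case should be in after the nightly update.

    propensity is a model-based estimate (between 0 and 1) that the next
    attempt in the current mode yields a completed case.
    """
    # A supervisor's manual assignment always wins over the rule.
    if manual_override is not None:
        return manual_override

    exhausted = attempts_in_mode >= max_attempts
    unpromising = propensity < min_propensity
    if not (exhausted or unpromising):
        return current_mode

    position = MODE_LADDER.index(current_mode)
    if position + 1 < len(MODE_LADDER):
        return MODE_LADDER[position + 1]
    return current_mode  # already in the last mode; nowhere to escalate
```

For example, `next_mode("web", 3, 0.4)` moves a case from web to mail after three attempts, while `next_mode("web", 1, 0.4, manual_override="visit")` honors the supervisor's choice.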
Similar business rules would guide subsampling of nonrespondents for followup modes (as currently done in the American Community Survey). Thus, Step 2 enhances the data-based direction of cases from mode to mode to completion, exploiting more powerful statistical models to direct them. At the same time, it provides managers with full control capabilities over the course of the sample administration. It also bears directly on the timeliness of data production.
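As a rough illustration of nonrespondent subsampling, the Python sketch below draws a simple random subsample of nonrespondents and inflates their weights by the inverse of the realized sampling fraction, so that the subsample still represents all nonrespondents. The record layout and function name are hypothetical, and this is not the American Community Survey's actual procedure.

```python
import random


def subsample_nonrespondents(cases, rate, seed=0):
    """Select nonrespondents for followup and reweight them.

    cases: list of dicts with keys "id", "weight", "responded"
           (a hypothetical layout).
    rate:  target fraction of nonrespondents to follow up, 0 < rate <= 1.
    Returns copies of the selected cases with inflated weights.
    """
    nonresp = [c for c in cases if not c["responded"]]
    if not nonresp:
        return []

    n_select = max(1, round(rate * len(nonresp)))
    rng = random.Random(seed)          # seeded so the draw is reproducible
    chosen = rng.sample(nonresp, n_select)

    # Divide by the realized fraction so selected cases carry the weight
    # of the nonrespondents that were not selected.
    fraction = n_select / len(nonresp)
    return [{**c, "weight": c["weight"] / fraction} for c in chosen]
```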
To move from Step 1 to Step 2 we need a new group of statistical analysts to create the cost, effort, and quality indicators of survey data during their collection. We need statistical models of the likelihood that the next followup on a case will generate a completed measurement. Analysis of such data will lead to statistical models that form the business rules of mode-switches. Such analysis will permit better budgeting and cost controls.
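One common form for such a model is a logistic regression on paradata. The Python sketch below only scores a case with a fitted model; the predictors and the coefficient values are invented for illustration and would in practice be estimated from call records.

```python
import math

# Hypothetical coefficients, not estimated from any real survey.
COEFFICIENTS = {
    "intercept": -0.5,
    "prior_attempts": -0.3,   # each earlier attempt lowers the odds
    "prior_contact": 1.2,     # someone was reached before
    "prior_refusal": -1.5,    # the unit refused at least once
}


def completion_propensity(prior_attempts, prior_contact, prior_refusal,
                          coef=COEFFICIENTS):
    """Estimated probability that the next followup completes the case."""
    z = (coef["intercept"]
         + coef["prior_attempts"] * prior_attempts
         + coef["prior_contact"] * int(prior_contact)
         + coef["prior_refusal"] * int(prior_refusal))
    return 1.0 / (1.0 + math.exp(-z))   # logistic (inverse logit) function
```

A score like this is the kind of input a mode-switch rule could compare against a threshold.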
When we achieve this step, we can control or reduce costs subject to controls on quality differences across modes. This is the value of a unified mixed-mode operational control system.
Step 3. Real-time Estimation During Data Collection. The third evolutionary step would add to the second step real-time estimation routines, run every 24 hours of a data collection. This real-time estimation would have real-time imputation, nonresponse adjustment, and variance estimation as part of the process, with an assessment of whether the imputation variance inflation was acceptable or whether increased effort at nonresponse followup was necessary. It integrates the editing and imputation step into the data collection process.
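To show one piece of such a nightly routine, the Python sketch below computes a weighted mean with a weighting-class nonresponse adjustment, which is one standard adjustment among several. It leaves out imputation and variance estimation, and the record layout and function name are hypothetical.

```python
from collections import defaultdict


def adjusted_mean(cases):
    """Weighted mean of y with a weighting-class nonresponse adjustment.

    cases: list of dicts with keys "cell", "weight", "y", where y is None
           for cases that have not yet responded (hypothetical layout).
    """
    all_w = defaultdict(float)    # total weight per adjustment cell
    resp_w = defaultdict(float)   # respondent weight per adjustment cell
    for c in cases:
        all_w[c["cell"]] += c["weight"]
        if c["y"] is not None:
            resp_w[c["cell"]] += c["weight"]

    numerator = denominator = 0.0
    for c in cases:
        if c["y"] is None:
            continue
        # Respondents stand in for nonrespondents in the same cell.
        factor = all_w[c["cell"]] / resp_w[c["cell"]]
        w = c["weight"] * factor
        numerator += w * c["y"]
        denominator += w

    if denominator == 0:
        raise ValueError("no respondents yet")
    # Note: a cell with no respondents contributes nothing here; a real
    # system would collapse such cells with neighboring ones.
    return numerator / denominator
```

Rerunning a function like this each night on the cases received so far gives the running estimate that later monitoring steps would track.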
It thereby provides the survey management with a new source of information on whether a case should be shifted to another mode for followup – if high quality imputations can be made at a given point, the cost efficiency of followup may be too low to tolerate. Of special application is the use of pre-loaded administrative data for the case, which may have item missing data rates higher than the full data record from another mode. (In one sense, Step 3 adds imputation as another “mode” of data collection.)
Similarly, tracking of key survey estimates on a real-time basis can provide useful information on whether continued attempts to gather data are cost-efficient. If estimates are unlikely to change in important ways with further efforts, the data collection could be halted.
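A simple version of such a check is sketched below in Python: data collection is flagged as a candidate for halting once the daily estimate has moved by less than a small relative tolerance over several consecutive updates. The window length, the tolerance, and the function name are assumptions for illustration; an actual rule would also weigh cost and variance.

```python
def should_stop(daily_estimates, window=3, tolerance=0.01):
    """True when the estimate changed by less than `tolerance` (relative)
    in each of the last `window` daily updates."""
    if len(daily_estimates) < window + 1:
        return False  # not enough history to judge stability

    recent = daily_estimates[-(window + 1):]
    for previous, current in zip(recent, recent[1:]):
        if previous == 0:
            return False  # relative change is undefined; keep collecting
        if abs(current - previous) / abs(previous) >= tolerance:
            return False
    return True
```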
The movement from Step 2 to Step 3 requires advancing the editing and imputation step in all modes as much as possible. Two monitoring tools would be used during data collection to help make tradeoff decisions about when to terminate efforts on a case for a given mode: a) active tracking of the key statistics based on the fully imputed/adjusted estimates, and b) measures of imputation variance and sampling variance of estimates. With data on the costs of moving a case with partial data to another mode and some assessment of bias and variance impacts of accepting imputed data, more nearly optimal estimates can be achieved. Step 3 is likely to be taken first for surveys that require more timely estimates.
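One well-known way to separate imputation variance from sampling variance is multiple imputation with Rubin's combining rules, in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The Python sketch below applies those rules; it is offered as one possible measure, not as the method the Census Bureau would necessarily adopt, and the function name is hypothetical.

```python
from statistics import mean, variance


def combine_imputations(estimates, sampling_variances):
    """Combine m multiply imputed estimates using Rubin's rules.

    estimates:          the point estimate from each imputed data set
    sampling_variances: the sampling variance of each of those estimates
    Returns (point estimate, total variance, share due to imputation).
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two imputations of equal length")

    q_bar = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # between-imputation variance
    imputation_var = (1 + 1 / m) * between
    total = within + imputation_var
    share = imputation_var / total if total > 0 else 0.0
    return q_bar, total, share
```

A large imputation share suggests that accepting imputed values is costing precision, which is one signal that further followup may be worth its cost.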
The use of multiple sources of data takes advantage of the diverse resources describing our sample businesses and households. The use of multiple sources is the device to “fill-in” the missing data. For the self-report part of surveys, real-time assignment of new modes (e.g., internet, phone, mail, face to face) can be enhanced with active statistical analysis of paradata and survey data, in order to optimize the alternative uses of imputation versus original data collection. In this sense, imputation itself can be viewed as an alternative source of data.
The use of multiple modes together with administrative data provides new control over the costs of our programs that will serve both our clients and ourselves well. When active statistical modeling is used during data collection, it offers survey managers the possibility of making some cost-quality tradeoff decisions with greater confidence. For programs desiring to increase the timeliness of their products, the real-time management of multiple modes can offer gains.
This future ties together Census Bureau successes in individual programs. It takes advantage of our assembly and record matching of administrative data; our use of web, CATI, CAPI, and paper questionnaire technologies; our experience with running paper, CATI, CAPI, and web surveys; our development of paradata; our strong field interviewing staff; our building of statistical staff in the regional offices; and our access to many data sets unavailable to other data collection organizations.
We will not move through these steps in a year. Getting to Step 3 may take more than five years, I suspect. I am, however, fully convinced that we must move in this direction. I am also convinced that we can succeed and indeed thrive in this new environment because we can gain access to more alternative data sources than most organizations. We, more than most organizations, have fully elaborated systems for alternative modes of data collection. We, more than most organizations, have the broad measurement mandate that provides us with more data resources within our security firewalls. Finally, we have the talent to get from “here” to “there.” Getting “there” will assure our place in the future.