National Statistical Offices: Independent, Identical, Simultaneous Actions Thousands of Miles Apart

Written by: Director Robert Groves

Several weeks ago, at the initiative of Brian Pink, the Australian statistician, leaders of the government statistical agencies from Australia, Canada, New Zealand, the United Kingdom, and the United States held a summit meeting to identify common challenges and share information about current initiatives.  While there had been casual sharing of partial information in previous years among these leaders, this event was unprecedented.

The five countries share languages and some cultural features; they vary in size and in the organization of their statistical systems.  They also vary in the current health of their national economies, their regional economic foci, and key social and political issues.   None of them have population registers with mandatory updating features.  The legal frameworks of the countries’ statistical systems give different powers to the chief statistician.

While meetings of this character happen periodically in many sectors, the findings of the meeting were notable on one dimension – the five countries’ statisticians report that the strategic activities now being mounted are very nearly identical.  They perceive the same likely future challenges for central government statistical agencies, and they are making similar organizational changes to prepare for the future.  While they vary in specific current innovations, the components of the full future vision are remarkably similar.

Ingredients of the future vision:

  1. The volume of data generated outside the government statistical systems is increasing much faster than the volume of data collected by the statistical systems; almost all of these data are digitized in electronic files.
  2. As this occurs, the leaders expect that the traditional survey and census approaches of the agencies may become less attractive in relative cost, timeliness, and effectiveness.
  3. Blending together multiple available data sources (administrative and other records) with traditional surveys and censuses (using paper, internet, telephone, face-to-face interviewing) to create high quality, timely statistics that tell a coherent story of economic, social and environmental progress must become a major focus of central government statistical agencies.
  4. This requires efficient record linkage capabilities, the building of master universe frames that act as core infrastructure to the blending of data sources, and the use of modern statistical modeling to combine data sources with highest accuracy.
  5. Agencies will need to develop the analytical and communication capabilities to distill insights from more integrated views of the world and impart a stronger systems view across government and private sector information.
  6. There are growing demands from researchers and policy-related organizations to analyze the micro-data collected by the agencies, to extract more information from the data.

In some of the countries the difficulty of obtaining high participation rates in surveys and censuses is growing, creating cost inflation due to the need for greater efforts to contact and persuade sample units of the value of their participation.  At the same time, central government budgets are constrained, not amenable to major initiatives.

In most of the countries the agencies realize that there are data resources that are not being fully used to the benefit of the country’s statistics.  Many of these are controlled by other government agencies that use the data for program administration.  In all of the countries, many of the record systems of lower-level geographical units, businesses, and program agencies of the central government are increasingly digitized.  These record systems often contain some data of relevance to the statistical agency’s mandate to describe its society and economy.

Further, there are different surveys the agencies conduct that could be linked together to increase the amount of information – sometimes different economic surveys of the same unit; sometimes a mix of household data and employer data.  Such linking can produce new statistical information without the need to collect any new data.  All agencies report efforts to link such data resources together.
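As a rough illustration of how linking can yield new information without new collection, the sketch below joins two survey files on a shared unit identifier. The field name `unit_id`, the function `link_records`, and the sample records are all hypothetical. It assumes an exact, error-free identifier, which real linkage work often cannot rely on.

```python
def link_records(survey_a, survey_b, key="unit_id"):
    """Exact-match linkage of two lists of dicts on a shared key.

    Assumes `key` is present in every record and unique within survey_b.
    Units that appear in only one file are left out of the result.
    """
    index_b = {rec[key]: rec for rec in survey_b}
    linked = []
    for rec in survey_a:
        match = index_b.get(rec[key])
        if match is not None:
            # Merge the two records; on a field-name clash survey_b wins.
            linked.append({**rec, **match})
    return linked


# Hypothetical example: two economic surveys of the same businesses.
payroll = [{"unit_id": 1, "employees": 40}, {"unit_id": 2, "employees": 7}]
sales = [{"unit_id": 1, "revenue": 900}, {"unit_id": 3, "revenue": 120}]
linked = link_records(payroll, sales)
```

Here only unit 1 appears in both files, so the linked file carries both its employee count and its revenue, a combination neither survey held on its own.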

The global internet is currently offering near real-time data on durable and nondurable goods prices, housing sales, and other relevant events.  Since the data reflect global phenomena, new conceptual puzzles arise in using the data to describe the nation-state.  The global internet search capabilities also generate verbal data describing billions of information requests (e.g., Google Search) and behavioral reports (e.g., tweets).  All of these sources of data are fallible.  They fail to offer complete coverage of the population and behaviors of interest.  They tend to be lean, reporting only the behavior and time and geography, not other characteristics of the person who performed the behavior.  In contrast to other data sources the internet data often have global reach, with little concern about nation-state boundaries.

There are practical implications for the management and governance of internal activities of the agencies:

  1. The traditional functional separations among population census, economic surveys, and household/person surveys are not well fitted to a world of censuses and surveys that draw on multiple data sources.  Hence, management changes are being considered to unify data collection processes under the same structures.
  2. Generalized IT systems that serve census, economic surveys, and demographic surveys are being developed.  These have the advantages of reduced maintenance costs, flexibility in rotating staff across subunits, and new functions suited to linked files.
  3. The staffs of the statistical agencies need to learn about the purposes and procedures of program data resources; sometimes this involves placing statistical agency staff in program agencies (e.g., within tax authorities).  They will be less wedded to collecting data and more attentive to generating and utilizing that which is most appropriate and cost effective.
  4. There is an increasing need for high-speed, “big data” software systems for record linkage and extraction of key information from massive files.
  5. Efficient and sophisticated imputation procedures are needed to make the combined data sources jointly useful.
  6. There is more use of statistical modeling for statistical estimation, to provide more timely and small area estimates.
  7. The agencies are inventing new ways to give secure access to micro-data for legitimate research purposes, to increase the impact of their work.

For some decades these organizations have been increasing their use of software systems to improve the efficiency of data collection and data processing.  These have held the costs of producing statistical information at lower levels than would otherwise have been expected.  The future will see the integration of new data resources, each fallible in some way, combined through linking and statistical modeling to produce the requisite statistical information for their countries.  Through this, the agencies hope to provide more statistics at lower cost.

In short, the five countries are actively inventing a future unlike the past, requiring new ways of thinking and calling for new skills.  The payoff sought is timelier, more trustworthy, and lower cost statistical information measuring new components of the society, economy, and environment, telling a richer story of our countries’ progress.

Posted in About the Agency

A Little Known Special Day

Written by: Director Robert Groves

January 28 is National Data Privacy Day.  It’s not a big deal to most folks, but it is a time for those of us at the Census Bureau to reflect on our role in the US society.

We at the Census Bureau collect answers to questions from people throughout the country, cumulate those answers, and provide freely to our society statistical summaries about important social and economic phenomena.  “Privacy” is a key concept to this work.

“Privacy,” in a discussion by Warren and Brandeis in 1890, was defined as the “right to be left alone.”  More recent definitions have added thoughts like the right to control the use of information about yourself.

By the very nature of what the Census Bureau does, we enter very briefly into people’s daily lives.  We do this to ask them questions that fulfill common good purposes – the percentage of persons currently suffering from some health condition; the average time it takes people to get to work each day; the median income of the adult population.  The intrusions are relatively short, limited to questions that must meet the criterion of national need.

For almost all of our household surveys (two exceptions are the decennial census and the American Community Survey), participation of a respondent is based on his/her voluntary consent to do the interview or fill out a questionnaire.  Before consent is sought, we describe the purpose of the survey and explain that statistical information will be produced when we assemble all the answers we get.  This is the notion of “informed consent” that is the basis of voluntary surveys.

Part of the informed consent process details our pledge that the answers the respondent gives us are never revealed to anyone who is not part of cumulating them to produce statistical information.  That means that data that identify individuals are not released in any public reports.

That “pledge of confidentiality” is, for some people, a key factor in their deciding to give up a little of their privacy to give us their answers – they can contribute to the common good without fearing that they’d be personally harmed in some way by abuse of their answers.

Another way that the Census Bureau tries to maximize privacy is to use data that have already been provided by the public.  Congress, in the law governing the Census Bureau, has directed us, to “the maximum extent possible,” to use records and information gathered by other government agencies, instead of asking the same questions of people yet again.  Implementing this fully would achieve the goal that the American public would never be asked to provide the same information twice by a government agency.  This would be a real advance on privacy.

We at the Census Bureau know that privacy, informed consent, and confidentiality are the three touchstones of our ethical code in meeting our mission to produce useful statistics for our country.  We’re out of business without people’s trust that their answers are safe with us and that our sole purpose is combining their answers into summaries to provide our entire society with useful information about itself.

So National Data Privacy Day, to us, represents a reminder of that important fact.

Posted in About the Agency

The Future of Producing Social and Economic Statistical Information, Part III

Written by: Director Robert Groves

My last posts addressed how the changing world of economic units and households is diminishing the effectiveness of our current methods and increasing their costs, at a time of great fiscal pressure.  This post comments on a way forward.

The future is likely to value more timely statistical information, while certain populations may become more difficult to measure directly.  However, many of these difficult-to-measure businesses, households, and persons will be included in administrative data systems (data already supplied by the units) that could be used as companions to survey data.  This future will require using multiple alternative sources of data simultaneously.

Now, most of our Economic and Demographic surveys tend to use single sampling frames, often based on mailing addresses, from which we attempt to cover the entire population.  Generally, we need to find contact information when we switch from one data collection mode to another (e.g., requesting phone numbers in wave 1 face to face interviews to prepare for wave 2 telephone interviews).  It means we use separate systems to manage each mode’s assigned sample, give work assignments to interviewers, monitor the progress of data collection, and perform data editing and estimation in post-data collection.

The economic directorate at the Census Bureau has already combined administrative data with survey data in inventive ways.  It has more experience in simultaneously dealing with paper forms, internet responses, and telephone interviews.  It has also grappled with decision rules that define what preliminary estimates shall be and what ingredients are required for revisions of those preliminary estimates.  Timeliness of economic indicators has in the past been valued more highly than timeliness of demographic estimates.  The future is likely to increase the demand for more timely information and the increased use of preliminary estimates.

In short, the future will require a blending together of multiple sources of data, real-time processing of data, and the production of more timely estimates.  We need the infrastructure to facilitate this.

One could imagine a three-step evolution to this future.

Step 1. Consolidation of Frame Data and Paradata for Active Use during Data Collection.  The first step would require no change in our current frames and the current assignment of the first mode to cases.  Sample designs would not be altered and would be implemented as currently performed.  It would add, however, the capability of changing more quickly to the mode of data collection preferred by the sample unit.  It would keep track of how the sample unit reacted to the initial approach to complete the measurement, in order to tailor the request to their lifestyles (based on “paradata” or process data).  Current management procedures, based on human judgment of supervisors for assigning cases to mode, would continue to be used.  In essence, the innovation in this phase of the evolution would be a) the integration of the sample management systems across modes into a single unified system, and b) a unified software module that provides a cross-mode progress monitoring capability.

Step 1 is already taking shape.  We have launched the developmental effort for the “Unified Tracking System,” which contains a repository of frame data, process data on field activities, cost/effort data, call record data, and questionnaire data on sample cases.  The repository will be updated daily.  Statistical analysis of the data will produce new monitoring and reporting information for survey management.  Such analysis will also permit better costing of future surveys.

Step 2. Real-time Mode Assignment.  The second evolutionary step would add mode assignment based on a set of software-driven business rules that would guide real-time movement across modes.  Some of the business rules would be based on statistical models run in the background and used to update mode assignments every 24 hours.  The rules would never replace some of the unique knowledge held by field supervisors, but they would greatly increase efficiency of case management for a large majority of cases.  Hence, the management would also have capabilities of manual assignment to modes, when direct intervention is desirable.

Similar business rules would guide subsampling of nonrespondents for followup modes (as currently done in the American Community Survey).  Thus, step 2 enhances the data-based direction of cases from mode to mode to completion, exploiting more powerful statistical models to direct these.  At the same time, it provides managers with full control capabilities over the course of the sample administration.  It acts on the timeliness of data production.
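One way to picture the subsampling of nonrespondents is the sketch below. It draws a simple random subsample of the outstanding cases for followup and returns the weight each selected case would carry to represent the full nonrespondent group. The function name, the one-in-three rate, and the fixed seed are illustrative assumptions, not a description of the American Community Survey's actual procedure.

```python
import random


def subsample_nonrespondents(case_ids, rate, seed=0):
    """Pick a simple random subsample of nonrespondent cases for followup.

    Returns (selected_ids, weight), where weight is the number of
    nonrespondents each selected case stands in for.
    """
    cases = list(case_ids)
    if not cases:
        return [], 0.0
    n = max(1, round(len(cases) * rate))  # always follow up at least one case
    rng = random.Random(seed)             # seeded so the draw is reproducible
    selected = rng.sample(cases, n)
    weight = len(cases) / n
    return selected, weight


# Hypothetical example: follow up one in three of 90 outstanding cases.
selected, weight = subsample_nonrespondents(range(90), rate=1 / 3)
```

With 90 outstanding cases and a one-in-three rate, 30 cases go to the followup mode and each counts for three in estimation.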

To move from Step 1 to Step 2 we need a new group of statistical analysts to create the cost, effort, and quality indicators of survey data during their collection.  We need statistical models of the likelihood that the next followup on a case will generate a completed measurement.  Analysis of such data will lead to statistical models that form the business rules of mode-switches.  Such analysis will permit better budgeting and cost controls.

When we achieve this step we can control or reduce costs subject to controls on quality differences across modes.  This is the value of a unified mixed-mode operational control system.
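To make the idea of a software-driven mode-switch rule concrete, here is a minimal sketch under stated assumptions. It supposes that a statistical model has already produced, for each mode, the probability that the next attempt on a case yields a completed measurement, and that a per-attempt cost is known. The rule picks the mode with the lowest expected cost per completed case, and a supervisor's manual assignment always takes precedence. The function name, mode labels, and all numbers are hypothetical.

```python
def choose_mode(completion_prob, cost_per_attempt, manual_override=None):
    """Pick the next mode for a case.

    completion_prob:  {mode: modeled probability the next attempt completes}
    cost_per_attempt: {mode: cost of one attempt in that mode}
    manual_override:  a supervisor's choice, used as-is when given
    """
    if manual_override is not None:
        return manual_override
    best_mode, best_cost = None, float("inf")
    for mode, p in completion_prob.items():
        if p <= 0:
            continue  # a mode with no chance of completion is never chosen
        expected_cost = cost_per_attempt[mode] / p  # cost per expected complete
        if expected_cost < best_cost:
            best_mode, best_cost = mode, expected_cost
    return best_mode


# Hypothetical inputs for one sample case.
probs = {"web": 0.005, "phone": 0.2, "in_person": 0.6}
costs = {"web": 1.0, "phone": 10.0, "in_person": 60.0}
next_mode = choose_mode(probs, costs)
```

In this example the expected costs per completed case are 200 for web, 50 for phone, and 100 for in-person, so the rule routes the case to phone. A production rule would also need to account for quality differences across modes, which this sketch ignores.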

Step 3.  Real-time Estimation During Data Collection.  The third evolutionary step would add to the second step real-time estimation routines run every 24 hours during a data collection.  This real-time estimation would have real-time imputation, nonresponse adjustment, and variance estimation as part of the process, with an assessment of whether the imputation variance inflation was acceptable or whether increased effort at nonresponse followup was necessary.  It integrates the editing and imputation step into the data collection process.

It thereby provides the survey management with a new source of information on whether a case should be shifted to another mode for followup – if high quality imputations can be made at a given point, the cost efficiency of followup may be too low to tolerate.  Of special application is the use of pre-loaded administrative data for the case, which may have item missing data rates higher than the full data record from another mode.  (In one sense, Step 3 adds imputation as another “mode” of data collection.)

Similarly, tracking of key survey estimates on a real-time basis can provide useful information on whether continued attempts to gather data are cost-efficient.   If estimates are unlikely to change in important ways with further efforts, the data collection could be halted.

The movement from step 2 to step 3 requires advancing the editing and imputation step in all modes as much as possible.  Two monitoring tools would be used during data collection to help make tradeoff decisions about when to terminate efforts on a case for a given mode: a) active tracking of the key statistics based on the fully imputed/adjusted estimates, b) measures of imputation variance and sampling variance of estimates.  With data on the costs of moving a case with partial data to another mode and some assessment of bias and variance impacts of accepting imputed data, more nearly optimal estimates can be achieved.  Step 3 is likely to be taken first for surveys that require more timely estimates.
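The first monitoring tool, active tracking of a key statistic, can be sketched as a simple stability check. The assumption here is that a fully imputed and adjusted estimate is recomputed once a day. Collection is flagged as a candidate for halting when the last few daily estimates stay within a chosen tolerance of one another. The window length, the tolerance, and the function name are illustrative choices, and the sketch leaves out the variance measures that a real decision would also weigh.

```python
def should_stop(daily_estimates, window=3, tolerance=0.005):
    """Return True when the most recent `window` daily estimates differ
    from one another by no more than `tolerance`."""
    if len(daily_estimates) < window:
        return False  # not enough history to judge stability
    recent = daily_estimates[-window:]
    return max(recent) - min(recent) <= tolerance


# Hypothetical daily values of one key estimate during collection.
early = [0.40, 0.45, 0.47]
later = [0.40, 0.45, 0.47, 0.471, 0.472]
```

With these numbers the early series is still moving, so collection continues. The later series has settled to within 0.002 over three days, so further effort is unlikely to change the estimate in an important way.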

Summary

The use of multiple sources of data takes advantage of the diverse resources describing our sample businesses and households.  The use of multiple sources is the device to “fill-in” the missing data.  For the self-report part of surveys, real-time assignment of new modes (e.g., internet, phone, mail, face to face) can be enhanced with active statistical analysis of paradata and survey data, in order to optimize the alternative uses of imputation versus original data collection.  In this sense, imputation itself can be viewed as an alternative source of data.

The use of multiple modes together with administrative data provides new control over costs of our programs that will serve both our clients and ourselves well.  When active statistical modeling is used during data collection, it offers survey managers the possibility of making some cost-quality tradeoff decisions with greater confidence.  For programs desiring to increase the timeliness of their products, the real-time management of multiple modes can offer gains.

This future ties together Census Bureau successes in individual programs.  It takes advantage of our assembly and record matching of administrative data; our use of web, CATI, CAPI, and paper questionnaire technologies; our experience with running paper, CATI, CAPI, and web surveys; our development of paradata; our strong field interviewing staff; our building of statistical staff in the regional offices; our access to many data sets unavailable to other data collection organizations.

We will not move through these steps in a year. Getting to Step 3 may take more than five years, I suspect.  I am, however, fully convinced that we must move in this direction.  I am also convinced that we can succeed and indeed thrive in this new environment because we can gain access to more alternative data sources than most organizations.  We, more than most organizations, have fully elaborated systems for alternative modes of data collection.  We, more than most organizations, have the broad measurement mandate that provides us with more data resources within our security firewalls.  Finally, we have the talent to get from “here” to “there.”  Getting “there” will assure our place in the future.

Posted in Measuring America

Estimating the Size of a Small Population

Written by: Director Robert Groves

Let me tell you a wonderful story, a statistical detective story of sorts.

During the summer, you may have seen statistics released from the 2010 Census Summary File 1 on same-sex couple unmarried partner households.

We noticed that reported counts of same-sex couples from the 2010 census were much higher than similar estimates from the American Community Survey in earlier years. Our demographic analysts had some immediate ideas, explained nicely in this video:

So we suspected that the format of the nonresponse followup form was the culprit. If that were the case, one should see some obvious mismatches between the name of the person written on the form and the recorded sex of that person. Bingo! A qualitative inspection of some of the records showed suspicious combinations (e.g., “Harold” recorded as a “female”). Past research led us to believe that the name entered was likely to be more accurate than the recorded sex.

How could the unintentional mistakes be fixed? We have an analysis of the full Census that lists the percentage male and female for all first names. Some names are common for both males and females (e.g., “Leslie,” “Dana,” “Alex”). Other names are very dominantly one sex or another (e.g., “Mary,” “Thomas,” “Alicia”). Our analysts identified the names that were 95% or higher male and those 95% or higher female. Then we completely reanalyzed the entire 2010 Census. When we found a name from either list paired with a very unlikely reported sex, we noted that as a likely error.
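The name-based check can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Bureau's actual procedure: records are plain (first name, recorded sex) pairs, sex is coded "M" or "F", and the function names are hypothetical. Only the 95% cutoff comes from the description above.

```python
from collections import Counter, defaultdict

THRESHOLD = 0.95  # share cutoff taken from the description above


def build_name_lists(records, threshold=THRESHOLD):
    """Return {name: dominant_sex} for names whose dominant sex share
    meets the threshold. Names are compared case-insensitively."""
    counts = defaultdict(Counter)
    for name, sex in records:
        counts[name.strip().lower()][sex] += 1
    dominant = {}
    for name, by_sex in counts.items():
        sex, n = by_sex.most_common(1)[0]
        if n / sum(by_sex.values()) >= threshold:
            dominant[name] = sex
    return dominant


def flag_likely_errors(records, dominant):
    """Return the records whose recorded sex contradicts a dominant name.
    Names used by both sexes are not in `dominant` and are never flagged."""
    return [(name, sex) for name, sex in records
            if dominant.get(name.strip().lower(), sex) != sex]


# Hypothetical data: "Harold" is 99% male, "Leslie" is evenly split.
records = ([("Harold", "M")] * 99 + [("Harold", "F")]
           + [("Leslie", "M")] * 50 + [("Leslie", "F")] * 50)
dominant = build_name_lists(records)
flagged = flag_likely_errors(records, dominant)
```

In this toy data only the single "Harold" recorded as female is flagged. "Leslie" never reaches the cutoff for either sex, so none of those records is touched.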

When we counted those apparent mistakes and reclassified them as consistent name-sex pairs, we found that the same-sex couple counts from the Census agreed with other estimates. The best comparison is to the sample-based estimates of the American Community Survey, which moved to the improved question format in 2008. The chart below shows why we are confident that the “preferred estimates” are likely much better than the original counts.

[Chart: Same-sex couple households]

The chart above shows a large decrease in the number of same-sex couples when we changed the format of the American Community Survey in the 2007-2008 time period. We have evidence that the lower estimates are more accurate.

Similarly, we are confident that the “Preferred estimates” at the rightmost bar of the chart are more accurate than the “original counts” from the 2010 Census. The logic of our analysis and repair procedure on the 2010 coding is compelling, and the closer agreement with the just-released 2010 American Community Survey results strengthens our confidence.

This is the technical expertise of the Census Bureau at its finest – examining statistics for anomalies, detecting the cause of a found anomaly, and fixing mistakes from data collection when possible to give the country the best statistics possible.

Posted in 2010 Census, Measuring America, Non-response Follow Up

The Future of Producing Social and Economic Statistical Information, Part II

Written by: Director Robert Groves

In my last post, I reviewed five observations. Because of changes in American society, 1) the Census Bureau’s methods of data collection are costing more money to produce the same statistical information, but 2) the demands are increasing for more statistical information from businesses, governments, and the public, and 3) there are new data collection technologies that are being invented constantly, 4) there are new sources of digital data from Federal program agencies, the internet, and economic transactions, but 5) in the medium run the Census Bureau is not likely to have more fiscal resources to take advantage of these. My conclusion: the current methods used in the Census Bureau are unsustainable in the medium run.

These observations suggest a way forward for this agency. In some areas, we have unique resources to achieve success; for others, we will need to work together in new ways.

  1. The Census Bureau’s future must actively employ multiple modes of data collection from the American public and businesses. Some people prefer to talk to someone they can see face-to-face; others want to talk to someone over the telephone; others want to use the internet to answer survey questions at whatever hour of the day they wish; still others want us to use answers they’ve already provided to another agency. We need to adapt to these diverse desires. We need to make survey response as convenient as possible.
  2. The resulting data sets may have more missing data, connected to the weaknesses of specific modes of data collection. For example, respondents using paper questionnaires sometimes fail to answer some questions that they would answer when asked by an interviewer. Each mode of data collection has strengths and weaknesses. We must be ready to use one mode to fill in gaps of another. When multiple modes are used in a single survey we can use each mode to bolster the weaknesses of another.
  3. Prior to contacting our sample units, we will be ignorant about their mode preferences; we must be able to switch across modes in real-time during the data collection phase to produce timely estimates. Our lists of addresses provide little information about who lives in them. We know nothing about their preferences about completing our surveys. We will learn about their preferences only after sending them requests. To save taxpayer money, we need to switch to another mode when there emerges evidence that one mode is not effective. The faster we do this, the better.
  4. Despite the multi-mode approach, we will not be able to gain self-reports from all sample members; statistical models will be used to produce accurate estimates when mode-switches to fill in missing data are judged cost-inefficient. To save taxpayer money, we need real-time data-based decisions about when it is more effective to cease efforts to measure a sample unit and instead use statistical models to account for the case in the final estimate.
  5. Our final statistical estimates from these “swiss cheese” data sets must rely on new statistical modeling techniques to repair differences in item missing data and measurement properties across modes. Modern statistical modeling can improve the quality of statistical information. The Census Bureau can incorporate such tools to address the weakness of failing to measure all sample cases in our demographic and economic surveys.

These new designs thus will build on practices that are appearing within the Bureau already.

This future will require some changes in our key work processes. All the changes share the common theme of developing new design, data collection, and analysis methods to improve participation rates and efficiencies of surveys and censuses.

All of these appear to be within our reach, given pockets of developments that have occurred in different Bureau programs recently. We need to consolidate these practices for the benefit of all the programs as they move to mixed-mode designs. One area needing direct and quick attention is the collection and analysis of cost data so that wise tradeoff decisions about mode switches can be made. Another development area is management information systems permitting the real-time administration of mixed-mode surveys.

I have some ideas on steps that we might mount to address these issues. I’ll talk about them in a later post.

Posted in About the Agency