Creating a Codebook

A guide to documenting qualitative data to facilitate communication and reuse.

What Is a Codebook?

According to The Encyclopedia of Survey Research Methods, “Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file…. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean.”

A codebook is analogous to a data dictionary, but for qualitative rather than quantitative data. In practice, the two terms are sometimes used interchangeably.

Some software programs (e.g., ATLAS.ti) can help generate codebooks automatically.

Why Are Codebooks Important?

As indicated by the name, a codebook is intended in part to help researchers code their interview or survey data. It may serve as a reminder for the research team of the codes that are in use and their definitions. A codebook also helps researchers who want to reuse the data by providing all of the information needed to make the data understandable. Finally, codebooks document the data analysis process, increasing the validity and reproducibility of the process.

What Should a Codebook Include?

ICPSR, the largest repository for social science data, recommends including the following in a codebook:

  • Variable name: The name or number assigned to each variable in the data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers (e.g., Q1, Q2b).
  • Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording.
  • Question text: Where applicable, the exact wording from survey questions.
  • Values: The actual coded values in the data for this variable.
  • Value labels: The textual descriptions of the codes.
  • Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts showing the number of times a value occurs and the percentage of cases that value represents for the variable are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
  • Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis and is important to convey in study documentation. Remember to describe all missing codes, including “system missing” and blank.
  • Universe skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables.
  • Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
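
The fields listed above can be sketched as a small data structure. The following Python example is only an illustration, not an ICPSR format: the class name, field names, question wording, codes, and responses are all hypothetical (only the variable name EMPLOY1 comes from the list above). It shows one codebook entry for a categorical variable, with unweighted frequency counts and percentages computed from the coded responses.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """One variable's documentation (hypothetical structure for illustration)."""
    name: str                      # variable name, e.g. EMPLOY1
    label: str                     # brief variable label
    question_text: str             # exact survey wording, where applicable
    value_labels: dict             # coded value -> textual description
    missing_values: dict = field(default_factory=dict)  # missing code -> label
    universe: str = "All respondents"
    notes: str = ""

    def label_for(self, value):
        """Look up the description of a code, including missing-data codes."""
        if value in self.value_labels:
            return self.value_labels[value]
        return self.missing_values.get(value, "Undocumented code")

    def frequencies(self, responses):
        """Return {value: (count, percent)}, unweighted, over all cases."""
        counts = Counter(responses)
        total = len(responses)
        return {
            value: (count, round(100 * count / total, 1))
            for value, count in sorted(counts.items())
        }


# Hypothetical entry and made-up responses, for illustration only.
employ1 = CodebookEntry(
    name="EMPLOY1",
    label="Current employment status",
    question_text="Are you currently employed?",
    value_labels={1: "Yes", 2: "No"},
    missing_values={-9: "Refused"},
)
responses = [1, 2, 1, 1, -9, 2, 1, 1]

for value, (count, percent) in employ1.frequencies(responses).items():
    print(f"{value:>3}  {employ1.label_for(value):<8} {count:>3}  {percent}%")
```

Keeping missing-data codes in their own mapping makes it easy to report them alongside the substantive values, as the "Missing data" item recommends.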

Examples

Marked up Codebooks from the Data Documentation Initiative

Sources

ICPSR. “What Is a Codebook?,” n.d. https://www.icpsr.umich.edu/web/ICPSR/cms/1983.

University of Iowa Libraries. “Documenting Data: Metadata.” Accessed February 8, 2024. https://www.lib.uiowa.edu/data/manage/documenting/.

Lavrakas, Paul. Encyclopedia of Survey Research Methods. Thousand Oaks, CA: Sage Publications, Inc., 2008. https://doi.org/10.4135/9781412963947.

 
