import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

By the end of this chapter you will be able to:
- Load and inspect complex survey datasets using pandas
- Understand and apply CES variable naming conventions and question families
- Create lookup dictionaries to translate numeric codes into meaningful labels
- Navigate survey documentation to connect concepts to measured variables
- Understand missing value patterns and their substantive implications
- Create standardized categorical variables for analysis
- Use efficient pandas techniques for survey data processing
The previous chapter established survey research foundations: Total Survey Error, sampling designs, and measurement principles. Now we apply those concepts by working with actual CES data.
This chapter focuses on understanding complex survey datasets before assessing quality or exploring patterns. We’ll learn to load and inspect the data, interpret variable naming conventions, create meaningful labels, and understand how theoretical concepts become measured variables. By the end, you’ll be oriented within the CES structure and ready to assess data quality.
Think of this chapter as learning to read a map before starting a journey. We need to understand the terrain (dataset structure), the landmarks (key variables), and the legend (codebooks and documentation) before we can navigate confidently.
4.1 Setting up the analysis environment
4.2 Loading and understanding survey data
4.2.1 Dataset structure and file formats
Survey datasets are typically much larger than the polling data we worked with in previous chapters. The CES includes responses from around 20,000 Canadians per election, with hundreds of variables per respondent (over 1,000 columns in the 2021 file). Understanding the structure is crucial before beginning analysis.
In data analysis, we organize information in a rectangular format:
- Observations (rows): Each individual case or unit of analysis. In survey data, each row represents one person who completed the survey.
- Variables (columns): Each characteristic or measurement we collected. This could be age, income, political preference, etc.
So a dataset with 20,000 observations and 400 variables means we surveyed 20,000 people and asked each person about 400 different things. This creates a data rectangle with 20,000 rows × 400 columns = 8 million individual data points.
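The rectangle arithmetic is easy to verify directly from a DataFrame's shape. A minimal sketch with a toy frame standing in for a survey dataset (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy rectangle: 5 "respondents" (rows) x 3 "variables" (columns)
df = pd.DataFrame(np.zeros((5, 3)), columns=["age", "income", "party"])

n_rows, n_cols = df.shape
data_points = n_rows * n_cols
print(data_points)  # 15 individual data points
```

The same `.shape` check on the full CES gives the 20,000-by-hundreds rectangle described above.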
The CES provides data in Stata format (.dta), which preserves important metadata about variable meanings and coding schemes. We load with numeric codes to maintain full control over how we apply and interpret value labels.
data_path = "data/source/ces-2021/2021 Canadian Election Study v2.0.dta"
ces = pd.read_stata(data_path, convert_categoricals=False)
ces.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20968 entries, 0 to 20967
Columns: 1059 entries, cps21_StartDate to pes21_weight_general_restricted
dtypes: datetime64[ns](7), float32(22), float64(753), int16(2), int32(2), int8(180), object(93)
memory usage: 142.1+ MB
ces.head(3)

| cps21_StartDate | cps21_EndDate | Duration__in_seconds_ | RecordedDate | cps21_ResponseId | DistributionChannel | UserLanguage | cps21_consent_t_First_Click | cps21_consent_t_Last_Click | cps21_consent_t_Page_Submit | ... | feduid | fedname | message | pccf_pcode_problem | manual_PCCF | provcode | cps21_weight_general_all | cps21_weight_general_restricted | pes21_weight_general_all | pes21_weight_general_restricted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-19 06:14:46 | 2021-09-19 06:28:25 | 818 | 2021-09-19 06:28:26 | R_001Vw6R3CxCzbcR | anonymous | FR-CA | 1.302 | 9.791 | 10.761 | ... | 24042 | Lévis--Lotbinière | 0.0 | 0.0 | 24.0 | 0.848994 | 0.842803 | 0.837168 | 0.822091 | |
| 1 | 2021-09-15 15:23:33 | 2021-09-15 15:46:57 | 1403 | 2021-09-15 15:46:58 | R_00AJoGE6B8Xifwl | anonymous | EN | 2.488 | 2.488 | 3.287 | ... | 59025 | Richmond Centre / Richmond-Centre | 0.0 | 0.0 | 59.0 | 0.868409 | 0.887286 | 1.056942 | 1.091878 | |
| 2 | 2021-08-20 09:44:55 | 2021-08-20 09:57:51 | 775 | 2021-08-20 09:57:52 | R_00QYXuUFwGAZLgZ | anonymous | EN | 3.851 | 8.468 | 8.501 | ... | 59021 | North Vancouver | 0 ERROR: NO MATCH TO PCCF - CHECK PCODE/ADDRES... | 1.0 | 0.0 | NaN | 0.868409 | 0.887286 | 1.056942 | 1.091878 |
3 rows × 1059 columns
The 2021 CES contains 20,968 respondents (rows) and 1,059 variables (columns). Each row represents one person who completed the survey. Each column represents a question or derived variable. Understanding what each column means and who actually saw each question is essential for responsible analysis.
# Examine survey timing
ces["cps21_StartDate"].min(), ces["cps21_EndDate"].max()

(Timestamp('2021-08-17 15:40:28'), Timestamp('2021-09-20 03:28:47'))
Survey duration is already calculated in minutes:
ces['cps21_time'].describe()

count    20968.000000
mean       145.160446
std       1018.010498
min          6.033333
25%         16.583334
50%         22.083334
75%         31.254167
max      26252.583984
Name: cps21_time, dtype: float64
4.3 Variable naming conventions
The CES uses systematic naming conventions that reflect the survey’s structure and make it easier to locate related variables:
Wave prefixes: Variables beginning with cps21_ come from the 2021 pre-election campaign period wave, while pes21_ variables come from the post-election wave. This convention makes it easy to identify when a question was asked and which respondents have valid data for each variable. It also matters because some questions were asked in only one wave.
Question families: Related questions are grouped with similar names. For example:
- cps21_party_rating_23, cps21_party_rating_24, cps21_party_rating_25 measure feelings toward different political parties (Liberal, Conservative, NDP)
- cps21_lr_parties_1, cps21_lr_parties_2, cps21_lr_parties_3 ask about left-right ideological positions of parties
- cps21_lr_scale_bef_1 reports how the respondent places themselves on a left-right political scale
Demographic variables: Standard demographic measures like age, sex and gender identity, education, and income typically have consistent names across election years, making longitudinal analysis possible.
# Examine variable naming patterns
sample_variables = [
'cps21_StartDate', 'cps21_time', 'cps21_age', 'cps21_education',
'cps21_party_rating_23', 'cps21_lr_scale_bef_1', 'cps21_data_quality',
'pes21_votechoice2021', 'pes21_satisfaction'
]
print("Sample of CES variable naming patterns:")
for var in sample_variables:
    if var in ces.columns:
        print(f"{var:<25} | Non-null: {ces[var].count():,} | Type: {ces[var].dtype}")

Sample of CES variable naming patterns:
cps21_StartDate | Non-null: 20,968 | Type: datetime64[ns]
cps21_time | Non-null: 20,968 | Type: float32
cps21_age | Non-null: 20,968 | Type: float32
cps21_education | Non-null: 20,968 | Type: int8
cps21_party_rating_23 | Non-null: 20,968 | Type: int8
cps21_lr_scale_bef_1 | Non-null: 20,968 | Type: int8
cps21_data_quality | Non-null: 20,968 | Type: float32
pes21_votechoice2021 | Non-null: 13,331 | Type: float64
With 400+ variables, finding what you need requires systematic approaches:
- Use .filter() on column names: ces.filter(like='party').columns
- Search documentation PDFs: Use Ctrl+F to find question text
- Group by prefixes: All ideology questions contain lr_
- Check codebooks: Appendices list all variables by topic
Developing efficient search strategies saves hours of frustration when working with large survey datasets.
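The search strategies above can be combined in a few lines of pandas. A minimal sketch, using a toy frame whose columns follow the CES naming conventions (the full dataset would work identically):

```python
import pandas as pd

# Toy frame standing in for the full CES; column names follow CES conventions
ces = pd.DataFrame(columns=[
    "cps21_party_rating_23", "cps21_party_rating_24",
    "cps21_lr_scale_bef_1", "pes21_votechoice2021", "cps21_age",
])

# 1. Substring search with .filter(like=...)
party_cols = ces.filter(like="party").columns.tolist()

# 2. Wave-prefix search with a list comprehension
pre_wave = [c for c in ces.columns if c.startswith("cps21_")]

# 3. Regex search, e.g. every left-right item
lr_cols = ces.filter(regex=r"lr_").columns.tolist()

print(party_cols)  # ['cps21_party_rating_23', 'cps21_party_rating_24']
print(len(pre_wave), lr_cols)
```

Note that `.filter(like='party')` matches the thermometer items but not `pes21_votechoice2021`, while the prefix search cleanly separates pre-election from post-election variables.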
4.4 Creating analysis-ready variables
4.4.1 Building lookup dictionaries from codebooks
Raw survey data uses numeric codes that need to be translated into meaningful labels. Creating lookup dictionaries from codebook information is essential for producing interpretable results.
Survey data often stores responses as numbers for efficiency:
- Education: 1=“No schooling”, 2=“Some elementary”, 3=“Elementary completed”
- Party ID: 1=“Liberal”, 2=“Conservative”, 3=“NDP”
Lookup dictionaries translate these codes into readable labels. They’re Python dictionaries where:
- Keys = the numeric codes in your data
- Values = the meaningful labels from the codebook
This lets you transform 1, 2, 3 into "Liberal", "Conservative", "NDP" for analysis and visualization.
# Create lookup dictionaries based on CES codebook information
# Education levels (from CES codebook)
education_labels = {
1: "No schooling",
2: "Some elementary school",
3: "Elementary school completed",
4: "Some secondary school",
5: "Secondary school completed",
6: "Some technical/community college",
7: "Technical/community college completed",
8: "Some university",
9: "Bachelor's degree",
10: "Master's degree",
11: "Professional degree",
12: "Doctoral degree",
}
# Federal party identification (from CES codebook)
party_labels = {
1: "Liberal",
2: "Conservative",
3: "NDP",
4: "Bloc Québécois",
5: "Green",
6: "Other party",
7: "None of these",
8: "Don't know/Prefer not to answer",
}
ces["education_labeled"] = ces["cps21_education"].map(education_labels)
ces["party_id_labeled"] = ces["cps21_fed_id"].map(party_labels)
ces["education_labeled"].value_counts().head()

education_labeled
Bachelor's degree 6069
Technical/community college completed 4425
Secondary school completed 2726
Some university 2261
Master's degree 2153
Name: count, dtype: int64
ces['party_id_labeled'].value_counts().head()

party_id_labeled
Liberal 6395
Conservative 4800
NDP 3113
None of these 2108
Bloc Québécois 1832
Name: count, dtype: int64
Keeping label dictionaries in your code (rather than relying on embedded Stata labels) serves three purposes:
- Transparency: Anyone reading your code sees exactly how codes map to meanings
- Portability: Labels work the same way whether data is in Stata, CSV, or other formats
- Flexibility: Easy to create alternative labeling schemes (shorter labels for plots, etc.)
This approach makes your analysis self-documenting and reproducible.
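The flexibility point is easy to demonstrate: the same numeric codes can feed several labeling schemes. In this sketch the short-form abbreviations are our own illustrative choice, not labels defined by the CES:

```python
import pandas as pd

# Full labels for tables (as in the codebook-based dictionary above)
party_labels = {1: "Liberal", 2: "Conservative", 3: "NDP",
                4: "Bloc Québécois", 5: "Green"}

# Alternative scheme: short labels for cramped plot axes (our choice)
party_short = {1: "LPC", 2: "CPC", 3: "NDP", 4: "BQ", 5: "GPC"}

codes = pd.Series([1, 2, 4])  # toy stand-in for ces["cps21_fed_id"]
print(codes.map(party_labels).tolist())  # ['Liberal', 'Conservative', 'Bloc Québécois']
print(codes.map(party_short).tolist())   # ['LPC', 'CPC', 'BQ']
```

Because both dictionaries live in the code, switching schemes is a one-line change rather than a round-trip through Stata metadata.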
The pattern above where new columns are created based on values mapped from other columns is common in data processing. Doing it repeatedly, one column at a time, can lead to performance issues in pandas. For multiple transformations, it’s more efficient to do:
new_columns = pd.DataFrame({
'education_labeled': ces['cps21_education'].map(education_labels),
'party_id_labeled': ces['cps21_fed_id'].map(party_labels)
})
ces = pd.concat([ces, new_columns], axis=1)

This approach prevents DataFrame fragmentation by completing all transformations in a single step, which can speed up subsequent operations on large datasets.
4.5 From concepts to variables: The measurement process
Survey research requires translating abstract theoretical concepts into concrete measurements. This transformation, from idea to question to coded variable, shapes what we can credibly claim.
Consider political efficacy: the belief that ordinary citizens can understand and influence politics. As a theoretical construct, efficacy is abstract. To measure it, researchers craft survey items like:
“People like me don’t have any say about what the government does.”
Respondents evaluate this using response options (strongly agree to strongly disagree). Their responses get coded into a numeric variable, perhaps cps21_efficacy_1, which combines with related items into an efficacy scale.
Each step involves decisions:
- Question wording (could emphasize “people like me” or “average citizens”)
- Response categories (5-point? 7-point? labeled or numeric?)
- Coding direction (does agreement indicate high or low efficacy?)
- Scale construction (combine items? weight equally?)
These choices aren’t right or wrong, but they are consequential. Slightly different operationalizations can produce different substantive conclusions.
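Two of these choices, coding direction and scale construction, can be made concrete in a short sketch. The item names and the 1-5 agree/disagree coding below are hypothetical, not actual CES columns:

```python
import pandas as pd

# Toy responses to two hypothetical efficacy items on a 1-5 agree/disagree scale.
# item_a: "People like me don't have any say..."        (agreement = LOW efficacy)
# item_b: "I am well qualified to participate..."       (agreement = HIGH efficacy)
df = pd.DataFrame({"item_a": [1, 5, 3], "item_b": [5, 1, 3]})

# Coding direction: reverse item_a so higher always means MORE efficacy
df["item_a_rev"] = 6 - df["item_a"]

# Scale construction: equally weighted mean of the two items
df["efficacy"] = df[["item_a_rev", "item_b"]].mean(axis=1)
print(df["efficacy"].tolist())  # [5.0, 1.0, 3.0]
```

Reversing before averaging is exactly the kind of consequential, easy-to-miss decision the list above describes: forget the reversal and the two items cancel each other out.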
A second example comes from the study of affective polarization. Here the concept refers to the tendency to feel warmly toward one’s own party while viewing opposing parties with hostility. In the Canadian Election Study (CES), this is measured with feeling thermometers: respondents are asked, “How do you feel about the [Liberal/Conservative/NDP] Party?” and record their answer on a scale from 0 (very cold) to 100 (very warm). The resulting variables (for example, cps21_party_rating_23 for the Liberal Party) can be combined into a construct such as “in-group minus out-group difference.” This captures the extent to which respondents express positive feelings for their preferred party and negative feelings for its rivals.
These examples illustrate a general principle. Concepts become data through a chain of transformations, and each link in that chain involves choices. Good measurement is not accidental; it is designed, tested, and refined to ensure that the final numbers capture as much of the underlying concept as possible without introducing unnecessary noise or bias.
4.5.1 CES example: Measuring party identification
The theoretical concept of “party identification” refers to a psychological attachment to a political party-a social identity that shapes how people interpret political events. The CES operationalizes this concept with the question:
“In federal politics, do you usually think of yourself as a: Liberal, Conservative, NDP, Bloc Québécois, Green, or none of these?”
This becomes variable cps21_fed_id with numeric codes (1=Liberal, 2=Conservative, etc.). The single question doesn’t capture all aspects of party identification (strength of attachment, stability over time, multiple identities), but it provides a measurable indicator of the underlying construct.
Later in the survey, related questions about voting history, party thermometers, and issue positions help validate whether cps21_fed_id captures meaningful variation in party attachment.
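One way to probe that validation logic is to group the Liberal thermometer by labeled party ID and compare means: if cps21_fed_id captures real attachment, Liberal identifiers should rate the Liberal Party far warmer than Conservative identifiers do. A minimal sketch with toy data standing in for the full CES:

```python
import pandas as pd

# Toy stand-in for ces: party ID codes and the Liberal thermometer (0-100)
ces = pd.DataFrame({
    "cps21_fed_id": [1, 1, 2, 2, 3],
    "cps21_party_rating_23": [85, 70, 20, 30, 40],
})
party_labels = {1: "Liberal", 2: "Conservative", 3: "NDP"}

# Keep only in-range thermometer values (coded missings would fall outside 0-100)
therm = ces["cps21_party_rating_23"].where(ces["cps21_party_rating_23"].between(0, 100))

# Mean Liberal rating by labeled party identification
means = therm.groupby(ces["cps21_fed_id"].map(party_labels)).mean()
print(means)
```

A large gap between the "Liberal" and "Conservative" group means is evidence of convergent validity; a flat profile would suggest the single item is not capturing attachment.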
4.6 Understanding missing value patterns
Survey data commonly includes missing value codes for responses like “Don’t know” (-8), “Refused” (-9), or system missing values. Let’s examine a key political variable:
# Examine the left-right scale variable
print("Left-right self-placement variable examination:")
print(f"Variable: cps21_lr_scale_bef_1")
print(f"Total responses: {ces['cps21_lr_scale_bef_1'].count():,}")
print(f"Missing values: {ces['cps21_lr_scale_bef_1'].isna().sum():,}")
# Look at the actual values to identify missing value codes
print(f"\nValue distribution:")
print(ces['cps21_lr_scale_bef_1'].value_counts(dropna=False).sort_index())

Left-right self-placement variable examination:
Variable: cps21_lr_scale_bef_1
Total responses: 20,968
Missing values: 0
Value distribution:
cps21_lr_scale_bef_1
-99 2895
0 539
1 635
2 1436
3 2113
4 2169
5 3372
6 2501
7 2410
8 1629
9 625
10 644
Name: count, dtype: int64
The CES uses negative values for missing responses; codes such as -99, -88, and -77 indicate different types of non-response. For analysis, we need to filter to valid responses only:
# Filter to valid left-right responses (0-10 scale)
valid_lr_data = ces[(ces['cps21_lr_scale_bef_1'] >= 0) &
(ces['cps21_lr_scale_bef_1'] <= 10)].copy()
print(f"After filtering to valid responses (0-10 scale):")
print(f"Valid responses: {len(valid_lr_data):,}")
print(f"Removed as invalid: {len(ces) - len(valid_lr_data):,}")
print(f"Response rate for LR scale: {len(valid_lr_data)/len(ces):.1%}")

After filtering to valid responses (0-10 scale):
Valid responses: 18,073
Removed as invalid: 2,895
Response rate for LR scale: 86.2%
Survey nonresponse is coded with negative values (here -99; other CES items use codes such as -8 for Don’t know or -9 for Refused), alongside genuine system missing values. We centralize the logic so it’s consistent across variables.
# Centralized missing-code handling (confirm codes in CES codebook)
MISSING_CODES = {-99, -98, -97, -88, -77, -9, -8}

def valid_range(series, lo, hi):
    """Keep only values in [lo, hi] and drop coded missings."""
    s = series.copy()
    mask = s.between(lo, hi) & ~s.isin(MISSING_CODES)
    return s[mask]

# Example: left-right self-placement (0-10)
lr_raw = ces["cps21_lr_scale_bef_1"]
lr = valid_range(lr_raw, 0, 10)
print(f"LR valid N: {len(lr):,} ({len(lr) / len(lr_raw):.1%} of full sample)")

LR valid N: 18,073 (86.2% of full sample)
Survey missing values carry different meanings:
- -9 (Refused): Respondent saw the question but declined to answer
- -8 (Don’t know): Respondent unsure or lacks information
- -7 (Not asked): Question filtered based on earlier responses
- NaN (System missing): Data collection or processing issue
These distinctions matter. “Don’t know” on ideology questions often indicates low political sophistication, which is substantively meaningful rather than mere noise. We’ll handle missing data carefully, not just drop it automatically.
Political attitude questions often have substantial missing data because:
“Don’t know” responses: Many people genuinely don’t have opinions on complex political issues
Refusal to answer: Some respondents prefer not to reveal political views
Question complexity: Abstract concepts like left-right ideology can be difficult to understand
Survey fatigue: Political batteries often come late in surveys when attention wanes
Analysis implications:
- Always report response rates for political variables
- Consider whether missing data patterns relate to other variables
- Be cautious about generalizing from smaller valid-response samples
- Missing data may not be “missing at random”: systematic non-response can bias results
4.6.1 Missing data patterns
Missing data patterns fall into three broad categories:
Missing Completely at Random (MCAR): Like a coin flip, who’s missing is unrelated to anything. Rare in real surveys.
Missing at Random (MAR): Missingness relates to observed variables but not the missing value itself. Example: younger people less likely to report income, but conditional on age, income reporting is random.
Missing Not at Random (MNAR): Missingness relates to the unobserved value. Example: high earners less likely to report exact income.
We can’t definitively determine which pattern holds, but thinking about plausible mechanisms helps us make principled decisions about handling missing data.
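We can at least gather evidence against MCAR by checking whether missingness is associated with observed variables. A toy sketch (the -99 code follows the CES extract shown earlier; the education values reuse the labels from our lookup dictionary):

```python
import pandas as pd

# Toy stand-in: LR placement uses -99 for nonresponse, as in the CES extract
ces = pd.DataFrame({
    "cps21_lr_scale_bef_1": [5, -99, 3, -99, 7, 8],
    "education_labeled": ["Some university", "Secondary school completed",
                          "Bachelor's degree", "Secondary school completed",
                          "Bachelor's degree", "Some university"],
})

# Missingness indicator: 1 if coded nonresponse, 0 otherwise
missing = (ces["cps21_lr_scale_bef_1"] == -99).astype(int)

# If nonresponse rates differ sharply across education groups, MCAR is implausible
rates = missing.groupby(ces["education_labeled"]).mean()
print(rates)
```

Unequal group rates are consistent with MAR (or MNAR); roughly equal rates are necessary, though not sufficient, for MCAR.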
4.7 Region mapping
We define a canonical Region using the CES province code so later cross-tabs don’t break.
region_labels = {
1:"Atlantic", 2:"Quebec", 3:"Ontario",
4:"Prairies", 5:"Alberta", 6:"British Columbia"
}
ces["Region"] = ces["cps21_province"].map(region_labels)
assert ces["Region"].notna().any(), "Region mapping produced all-NA; check codes."

Creating a canonical region variable serves two purposes:
- Consistent analysis: The same region categories work across all analyses
- Readable output: “Atlantic” is clearer than “Province code: 1” in tables
Later analyses (cross-tabs, regional comparisons) will reference this standardized variable rather than numeric codes.
4.8 Understanding survey documentation
Before analyzing survey data, researchers must understand how it was collected. This requires careful attention to survey documentation, which provides crucial context for interpretation and helps identify potential limitations that affect what conclusions can be drawn.
4.8.1 Key documentation elements
Survey documentation typically includes several crucial components:
Sampling Design: Details about how participants were selected, including sampling frame, stratification procedures, and response rates. For the CES, this includes information about the target population (eligible Canadian voters), the sampling frame (online panel providers), and efforts to ensure demographic representativeness.
Questionnaire: Complete question wording, response options, and skip patterns. The CES documentation includes every question asked, the exact wording used, and the logic that determines which respondents see which questions.
Field Procedures: Data collection methods, quality control measures, and any interviewer training (though the CES is self-administered online). This includes information about survey length, incentives provided to respondents, and measures taken to ensure data quality.
Data Processing: Coding decisions, missing value conventions, and variable construction. The CES documentation explains how raw responses are converted into analysis variables and what different numeric codes mean.
Weighting: How sampling weights were constructed to adjust for nonresponse and sampling design. This is particularly important because not all demographic groups respond to surveys at equal rates.
Metadata = Information about what survey variables mean and how they were collected
- Codebook entries explaining variable names and values
- Question wording and response options
- Sample design and weighting procedures
Paradata = Information about the data collection process itself
- Survey completion times and response patterns
- Device type and browser information
- Response sequence and revision patterns
Example: Metadata tells you that cps21_party_rating_23 measures Liberal Party feeling thermometer (0-100). Paradata tells you this respondent took 45 seconds to answer and revised their response twice.
Both types help you assess data quality and make informed analytical decisions.
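As a small paradata example, completion times can flag suspiciously fast interviews. The 10-minute cutoff below is purely illustrative, not a CES rule; the next chapter treats quality thresholds systematically:

```python
import pandas as pd

# Toy completion times in minutes, standing in for ces["cps21_time"]
times = pd.Series([6.0, 14.3, 22.1, 31.2, 145.0])

# Illustrative paradata check: flag completions under an assumed 10-minute threshold
speeder = times < 10
print(f"{speeder.sum()} of {len(times)} flagged as possible speeders")
```

Metadata tells us what `cps21_time` measures; paradata like this tells us how the interview went, and the two together support quality decisions.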
Before analyzing any survey data, you must understand how it was collected. Tradeoffs are unavoidable; no design is perfect. To do credible, fair, and honest work we have to acknowledge these limitations, be transparent about them, and do our best to mitigate any problems. For example, if a survey oversamples certain groups who would otherwise be under-represented, we need to know how to use weights and other tools to adjust for that in our analysis, and base our interpretations accordingly.
4.9 Descriptive statistics: Central tendency and variability
Before moving to quality assessment, we pause to summarize the data with basic descriptive statistics.
Descriptive statistics help us understand our data by summarizing it in simple terms:
For Categorical Variables (like party preference):
- Frequencies: Raw counts (847 people chose Liberal, 623 chose Conservative)
- Percentages: Proportions of the total (40.2% Liberal, 29.6% Conservative)
For Continuous Variables (like age or income):
- Central tendency: Where is the “middle” of our data?
- Variability: How spread out are the values?
Think of descriptive statistics as taking a quick photograph of your data: they don’t explain why patterns exist, but they show you what patterns are there to begin with.
Central Tendency: For numeric variables, we calculate measures of the “typical” or “average” value:
- Mean: The arithmetic average, sensitive to extreme values
- Median: The middle value when data are arranged in order, less sensitive to extremes
- Mode: The most frequently occurring value
Imagine you have survey completion times: 12, 15, 18, 20, 22, 25, 180 minutes (one person took a 3-hour break!)
- Mean (average): Add them all up and divide by the count: (12+15+18+20+22+25+180) ÷ 7 = 41.7 minutes
- Median: Put them in order and pick the middle: 12, 15, 18, 20, 22, 25, 180 → 20 minutes
- Mode: The most common value (if there were multiple 20s, that would be the mode)
Notice how the one extreme value (180 minutes) pulled the mean way up to 41.7 minutes, but the median stayed at a reasonable 20 minutes. This is why the median is often better for data with outliers: it tells you what a typical person experienced.
# Duration analysis for our sample
duration_mean = ces['cps21_time'].mean()
duration_median = ces['cps21_time'].median()
duration_mode = ces['cps21_time'].mode()[0] if len(ces['cps21_time'].mode()) > 0 else None
print(f"Survey duration - Mean: {duration_mean:.1f} minutes")
print(f"Survey duration - Median: {duration_median:.1f} minutes")
print(f"Survey duration - Mode: {duration_mode:.1f} minutes" if duration_mode else "Survey duration - Mode: N/A")

Survey duration - Mean: 145.2 minutes
Survey duration - Median: 22.1 minutes
Survey duration - Mode: 14.3 minutes
Variability: Measures of how spread out our data are:
- Range: Difference between maximum and minimum values
- Standard deviation: Average distance from the mean, in the original units
- Variance: Standard deviation squared (less intuitive but mathematically useful)
Imagine two classes with nearly the same average test score (75% and 74%):
- Class A scores: 73%, 74%, 75%, 76%, 77% (low variability: everyone did similarly)
- Class B scores: 45%, 60%, 75%, 90%, 100% (high variability: big differences)
Range: Class A = 77-73 = 4 points; Class B = 100-45 = 55 points
Standard deviation tells us the typical distance from the average:
- Class A: About 1.6 points (most students within 1-2 points of average)
- Class B: About 22 points (students typically 20+ points above or below average)
Low variability = everyone similar; high variability = lots of differences. Both patterns are important to understand.
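These figures can be checked directly. Note that `statistics.stdev` uses the sample formula (dividing by n − 1):

```python
import statistics

class_a = [73, 74, 75, 76, 77]
class_b = [45, 60, 75, 90, 100]

# Range: maximum minus minimum
range_a = max(class_a) - min(class_a)   # 4 points
range_b = max(class_b) - min(class_b)   # 55 points

# Sample standard deviation (divides by n - 1)
sd_a = statistics.stdev(class_a)
sd_b = statistics.stdev(class_b)
print(range_a, range_b, round(sd_a, 1), round(sd_b, 1))  # 4 55 1.6 22.2
```

For small lists like these, `statistics.pstdev` (the population formula) gives somewhat smaller values; which formula applies depends on whether the scores are a sample or the whole class.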
# Measures of variability for duration
duration_std = ces['cps21_time'].std()
duration_range = ces['cps21_time'].max() - ces['cps21_time'].min()
print(f"Survey duration - Standard deviation: {duration_std:.1f} minutes")
print(f"Survey duration - Range: {duration_range:.1f} minutes")

Survey duration - Standard deviation: 1018.0 minutes
Survey duration - Range: 26246.6 minutes
When the mean is much larger than the median, it suggests the presence of extreme values (outliers) pulling the average upward. We’ll address these outliers systematically in the next chapter through quality assessment.
# Generate comprehensive descriptive statistics for key variables
desc_vars = {
'Age': 'cps21_age',
'Survey Duration (minutes)': 'cps21_time',
'Left-Right Self-Placement': 'cps21_lr_scale_bef_1'
}
# Create descriptive statistics table
descriptive_stats = []
for var_name, var_col in desc_vars.items():
    if var_col == 'cps21_lr_scale_bef_1':
        # Filter to valid responses for ideology
        var_data = ces[(ces[var_col] >= 0) & (ces[var_col] <= 10)][var_col]
    else:
        var_data = ces[var_col].dropna()
    stats = {
        'Variable': var_name,
        'N': len(var_data),
        'Mean': var_data.mean(),
        'Std Dev': var_data.std(),
        'Min': var_data.min(),
        '25th %ile': var_data.quantile(0.25),
        'Median': var_data.median(),
        '75th %ile': var_data.quantile(0.75),
        'Max': var_data.max()
    }
    descriptive_stats.append(stats)
# Add categorical variables
categorical_vars = {
'Education': 'education_labeled',
'Party ID': 'party_id_labeled'
}
for var_name, var_col in categorical_vars.items():
    var_data = ces[var_col].dropna()
    mode_value = var_data.mode()[0] if len(var_data.mode()) > 0 else "N/A"
    mode_count = (var_data == mode_value).sum()
    mode_pct = (mode_count / len(var_data)) * 100
    stats = {
        'Variable': var_name,
        'N': len(var_data),
        'Mean': f"Mode: {mode_value}",
        'Std Dev': f"({mode_pct:.1f}%)",
        'Min': f"{len(var_data.unique())} categories",
        '25th %ile': "",
        'Median': "",
        '75th %ile': "",
        'Max': ""
    }
    descriptive_stats.append(stats)
# Convert to DataFrame and format
desc_df = pd.DataFrame(descriptive_stats)
# Format numeric columns
numeric_cols = ['Mean', 'Std Dev', 'Min', '25th %ile', 'Median', '75th %ile', 'Max']
for col in numeric_cols:
    # The isinstance check must include NumPy scalar types: float32 statistics
    # (e.g., age) are np.float32, which is not a subclass of Python float
    desc_df[col] = desc_df[col].apply(
        lambda x: f"{x:.2f}" if isinstance(x, (int, float, np.number)) else x
    )
print("\nDescriptive Statistics for Analysis Sample:")
print("=" * 80)
print(desc_df.to_string(index=False))

Descriptive Statistics for Analysis Sample:
================================================================================
                 Variable     N                    Mean Std Dev     Min 25th %ile Median 75th %ile      Max
                      Age 20968                   51.30   17.20   18.00     36.00  53.00     66.00    97.00
Survey Duration (minutes) 20968                  145.16 1018.01    6.03     16.58  22.08     31.25 26252.58
Left-Right Self-Placement 18073                    5.11    2.35    0.00      3.00   5.00      7.00    10.00
                Education 20968 Mode: Bachelor's degree (28.9%) 12 categories
                 Party ID 20968           Mode: Liberal (30.5%)  8 categories
This comprehensive descriptive statistics table would typically appear in a research paper, showing the characteristics of our analysis sample. These statistics describe our sample, not the Canadian population. In the next chapter, we’ll assess data quality to create a clean analysis sample suitable for exploration.
4.10 Looking forward: From understanding to evaluation
We can now load CES data, understand its structure, and translate numeric codes into meaningful labels. We know how variables are named, how question families are organized, and how theoretical concepts are operationalized as survey items.
But datasets of this size inevitably contain quality issues: respondents who rush through questions, technical glitches that corrupt responses, or duplicate entries from data collection errors. Before exploring patterns, we need to systematically assess quality and create a clean analysis sample.
The next chapter applies the Total Survey Error framework from Chapter 3 to identify and address quality issues, building a trustworthy dataset for analysis. We’ll examine duration outliers, attention checks, duplicate detection, and other indicators that help distinguish high-quality from problematic responses.