import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

By the end of this chapter you will be able to:
- Load and inspect complex survey datasets using pandas
- Understand and apply CES variable naming conventions and question families
- Create lookup dictionaries to translate numeric codes into meaningful labels
- Navigate survey documentation to connect concepts to measured variables
- Understand missing value patterns and their substantive implications
- Create standardized categorical variables for analysis
- Use efficient pandas techniques for survey data processing
The previous chapter established survey research foundations: Total Survey Error, sampling designs, and measurement principles. Now we apply those concepts by working with actual CES data.
This chapter focuses on understanding complex survey datasets before assessing quality or exploring patterns. We’ll learn to load and inspect the data, interpret variable naming conventions, create meaningful labels, and understand how theoretical concepts become measured variables. By the end, you’ll be oriented within the CES structure and ready to assess data quality.
Think of this chapter as learning to read a map before starting a journey. We need to understand the terrain (dataset structure), the landmarks (key variables), and the legend (codebooks and documentation) before we can navigate confidently.
4.1 Setting up the analysis environment
4.2 Loading and understanding survey data
4.2.1 Dataset structure and file formats
Survey datasets are typically much larger than the polling data we worked with in previous chapters. The CES includes responses from around 20,000 Canadians per election, with hundreds of variables per respondent (over 1,000 columns in the 2021 file). Understanding the structure is crucial before beginning analysis.
In data analysis, we organize information in a rectangular format:
- Observations (rows): Each individual case or unit of analysis. In survey data, each row represents one person who completed the survey.
- Variables (columns): Each characteristic or measurement we collected. This could be age, income, political preference, etc.
So a dataset with 20,000 observations and 400 variables means we surveyed 20,000 people and asked each person about 400 different things. This creates a data rectangle with 20,000 rows × 400 columns = 8 million individual data points.
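The rectangle arithmetic is easy to verify directly from a DataFrame's shape. A minimal sketch with a toy frame standing in for a survey dataset (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy rectangle: 5 "respondents" (rows) x 3 "variables" (columns)
df = pd.DataFrame(np.zeros((5, 3)), columns=["age", "income", "party"])

n_rows, n_cols = df.shape
data_points = n_rows * n_cols
print(data_points)  # 15 individual data points
```

The same `.shape` check on the full CES gives the 20,000-by-hundreds rectangle described above.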
The CES provides data in Stata format (.dta), which preserves important metadata about variable meanings and coding schemes. We load with numeric codes to maintain full control over how we apply and interpret value labels.
data_path = "data/source/ces-2021/2021 Canadian Election Study v2.0.dta"
ces = pd.read_stata(data_path, convert_categoricals=False)
ces.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20968 entries, 0 to 20967
Columns: 1059 entries, cps21_StartDate to pes21_weight_general_restricted
dtypes: datetime64[ns](7), float32(22), float64(753), int16(2), int32(2), int8(180), object(93)
memory usage: 142.1+ MB
ces.head(3)

| cps21_StartDate | cps21_EndDate | Duration__in_seconds_ | RecordedDate | cps21_ResponseId | DistributionChannel | UserLanguage | cps21_consent_t_First_Click | cps21_consent_t_Last_Click | cps21_consent_t_Page_Submit | ... | feduid | fedname | message | pccf_pcode_problem | manual_PCCF | provcode | cps21_weight_general_all | cps21_weight_general_restricted | pes21_weight_general_all | pes21_weight_general_restricted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-19 06:14:46 | 2021-09-19 06:28:25 | 818 | 2021-09-19 06:28:26 | R_001Vw6R3CxCzbcR | anonymous | FR-CA | 1.302 | 9.791 | 10.761 | ... | 24042 | Lévis--Lotbinière | 0.0 | 0.0 | 24.0 | 0.848994 | 0.842803 | 0.837168 | 0.822091 | |
| 1 | 2021-09-15 15:23:33 | 2021-09-15 15:46:57 | 1403 | 2021-09-15 15:46:58 | R_00AJoGE6B8Xifwl | anonymous | EN | 2.488 | 2.488 | 3.287 | ... | 59025 | Richmond Centre / Richmond-Centre | 0.0 | 0.0 | 59.0 | 0.868409 | 0.887286 | 1.056942 | 1.091878 | |
| 2 | 2021-08-20 09:44:55 | 2021-08-20 09:57:51 | 775 | 2021-08-20 09:57:52 | R_00QYXuUFwGAZLgZ | anonymous | EN | 3.851 | 8.468 | 8.501 | ... | 59021 | North Vancouver | 0 ERROR: NO MATCH TO PCCF - CHECK PCODE/ADDRES... | 1.0 | 0.0 | NaN | 0.868409 | 0.887286 | 1.056942 | 1.091878 |
3 rows × 1059 columns
The 2021 CES contains 20,968 respondents (rows) and 1,059 variables (columns). Each row represents one person who completed the survey. Each column represents a question or derived variable. Understanding what each column means and who actually saw each question is essential for responsible analysis.
# Examine survey timing
ces["cps21_StartDate"].min(), ces["cps21_EndDate"].max()

(Timestamp('2021-08-17 15:40:28'), Timestamp('2021-09-20 03:28:47'))
Survey duration is already calculated in minutes:
ces['cps21_time'].describe()

count    20968.000000
mean       145.160446
std       1018.010498
min          6.033333
25%         16.583334
50%         22.083334
75%         31.254167
max      26252.583984
Name: cps21_time, dtype: float64
4.3 Variable naming conventions
The CES uses systematic naming conventions that reflect the survey’s structure and make it easier to locate related variables:
Wave prefixes: Variables beginning with cps21_ come from the 2021 pre-election campaign period wave, while pes21_ variables come from the post-election wave. This convention makes it easy to identify when a question was asked and which respondents have valid data for each variable. It also matters because some questions were asked in only one wave.
Question families: Related questions are grouped with similar names. For example:
- cps21_party_rating_23, cps21_party_rating_24, cps21_party_rating_25 measure feelings toward different political parties (Liberal, Conservative, NDP)
- cps21_lr_parties_1, cps21_lr_parties_2, cps21_lr_parties_3 ask about left-right ideological positions of parties
- cps21_lr_scale_bef_1 reports how the respondent places themselves on a left-right political scale
Demographic variables: Standard demographic measures like age, sex and gender identity, education, and income typically have consistent names across election years, making longitudinal analysis possible.
# Examine variable naming patterns
sample_variables = [
'cps21_StartDate', 'cps21_time', 'cps21_age', 'cps21_education',
'cps21_party_rating_23', 'cps21_lr_scale_bef_1', 'cps21_data_quality',
'pes21_votechoice2021', 'pes21_satisfaction'
]
print("Sample of CES variable naming patterns:")
for var in sample_variables:
    if var in ces.columns:
        print(f"{var:<25} | Non-null: {ces[var].count():,} | Type: {ces[var].dtype}")

Sample of CES variable naming patterns:
cps21_StartDate | Non-null: 20,968 | Type: datetime64[ns]
cps21_time | Non-null: 20,968 | Type: float32
cps21_age | Non-null: 20,968 | Type: float32
cps21_education | Non-null: 20,968 | Type: int8
cps21_party_rating_23 | Non-null: 20,968 | Type: int8
cps21_lr_scale_bef_1 | Non-null: 20,968 | Type: int8
cps21_data_quality | Non-null: 20,968 | Type: float32
pes21_votechoice2021 | Non-null: 13,331 | Type: float64
With 400+ variables, finding what you need requires systematic approaches:
- Use .filter() on column names: ces.filter(like='party').columns
- Search documentation PDFs: Use Ctrl+F to find question text
- Group by prefixes: All ideology questions contain lr_
- Check codebooks: Appendices list all variables by topic
Developing efficient search strategies saves hours of frustration when working with large survey datasets.
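The search strategies above can be combined in a few lines of pandas. A minimal sketch, using a toy frame whose columns follow the CES naming conventions (the full dataset would work identically):

```python
import pandas as pd

# Toy frame standing in for the full CES; column names follow CES conventions
ces = pd.DataFrame(columns=[
    "cps21_party_rating_23", "cps21_party_rating_24",
    "cps21_lr_scale_bef_1", "pes21_votechoice2021", "cps21_age",
])

# 1. Substring search with .filter(like=...)
party_cols = ces.filter(like="party").columns.tolist()

# 2. Wave-prefix search with a list comprehension
pre_wave = [c for c in ces.columns if c.startswith("cps21_")]

# 3. Regex search, e.g. every left-right item
lr_cols = ces.filter(regex=r"lr_").columns.tolist()

print(party_cols)  # ['cps21_party_rating_23', 'cps21_party_rating_24']
print(len(pre_wave), lr_cols)
```

Note that `.filter(like='party')` matches the thermometer items but not `pes21_votechoice2021`, while the prefix search cleanly separates pre-election from post-election variables.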
4.4 Creating analysis-ready variables
4.4.1 Building lookup dictionaries from codebooks
Raw survey data uses numeric codes that need to be translated into meaningful labels. Creating lookup dictionaries from codebook information is essential for producing interpretable results.
Survey data often stores responses as numbers for efficiency:
- Education: 1=“No schooling”, 2=“Some elementary”, 3=“Elementary completed”
- Party ID: 1=“Liberal”, 2=“Conservative”, 3=“NDP”
Lookup dictionaries translate these codes into readable labels. They’re Python dictionaries where:
- Keys = the numeric codes in your data
- Values = the meaningful labels from the codebook
This lets you transform 1, 2, 3 into "Liberal", "Conservative", "NDP" for analysis and visualization.
# Create lookup dictionaries based on CES codebook information
# Education levels (from CES codebook)
education_labels = {
1: "No schooling",
2: "Some elementary school",
3: "Elementary school completed",
4: "Some secondary school",
5: "Secondary school completed",
6: "Some technical/community college",
7: "Technical/community college completed",
8: "Some university",
9: "Bachelor's degree",
10: "Master's degree",
11: "Professional degree",
12: "Doctoral degree",
}
# Federal party identification (from CES codebook)
party_labels = {
1: "Liberal",
2: "Conservative",
3: "NDP",
4: "Bloc Québécois",
5: "Green",
6: "Other party",
7: "None of these",
8: "Don't know/Prefer not to answer",
}
ces["education_labeled"] = ces["cps21_education"].map(education_labels)
ces["party_id_labeled"] = ces["cps21_fed_id"].map(party_labels)
ces["education_labeled"].value_counts().head()

education_labeled
Bachelor's degree 6069
Technical/community college completed 4425
Secondary school completed 2726
Some university 2261
Master's degree 2153
Name: count, dtype: int64
ces['party_id_labeled'].value_counts().head()

party_id_labeled
Liberal 6395
Conservative 4800
NDP 3113
None of these 2108
Bloc Québécois 1832
Name: count, dtype: int64
Keeping label dictionaries in your code (rather than relying on embedded Stata labels) serves three purposes:
- Transparency: Anyone reading your code sees exactly how codes map to meanings
- Portability: Labels work the same way whether data is in Stata, CSV, or other formats
- Flexibility: Easy to create alternative labeling schemes (shorter labels for plots, etc.)
This approach makes your analysis self-documenting and reproducible.
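The flexibility point is easy to demonstrate: the same numeric codes can feed several labeling schemes. In this sketch the short-form abbreviations are our own illustrative choice, not labels defined by the CES:

```python
import pandas as pd

# Full labels for tables (as in the codebook-based dictionary above)
party_labels = {1: "Liberal", 2: "Conservative", 3: "NDP",
                4: "Bloc Québécois", 5: "Green"}

# Alternative scheme: short labels for cramped plot axes (our choice)
party_short = {1: "LPC", 2: "CPC", 3: "NDP", 4: "BQ", 5: "GPC"}

codes = pd.Series([1, 2, 4])  # toy stand-in for ces["cps21_fed_id"]
print(codes.map(party_labels).tolist())  # ['Liberal', 'Conservative', 'Bloc Québécois']
print(codes.map(party_short).tolist())   # ['LPC', 'CPC', 'BQ']
```

Because both dictionaries live in the code, switching schemes is a one-line change rather than a round-trip through Stata metadata.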
The pattern above where new columns are created based on values mapped from other columns is common in data processing. Doing it repeatedly, one column at a time, can lead to performance issues in pandas. For multiple transformations, it’s more efficient to do:
new_columns = pd.DataFrame({
'education_labeled': ces['cps21_education'].map(education_labels),
'party_id_labeled': ces['cps21_fed_id'].map(party_labels)
})
ces = pd.concat([ces, new_columns], axis=1)

This approach prevents DataFrame fragmentation by completing all transformations in a single step, which can speed up subsequent operations on large datasets.
4.5 From concepts to variables: The measurement process
Survey research requires translating abstract theoretical concepts into concrete measurements. This transformation, from idea to question to coded variable, shapes what we can credibly claim.
Consider political efficacy: the belief that ordinary citizens can understand and influence politics. As a theoretical construct, efficacy is abstract. To measure it, researchers craft survey items like:
“People like me don’t have any say about what the government does.”
Respondents evaluate this using response options (strongly agree to strongly disagree). Their responses get coded into a numeric variable, perhaps cps21_efficacy_1, which combines with related items into an efficacy scale.
Each step involves decisions:
- Question wording (could emphasize “people like me” or “average citizens”)
- Response categories (5-point? 7-point? labeled or numeric?)
- Coding direction (does agreement indicate high or low efficacy?)
- Scale construction (combine items? weight equally?)
These choices aren’t right or wrong, but they are consequential. Slightly different operationalizations can produce different substantive conclusions.
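Two of these choices, coding direction and scale construction, can be made concrete in a short sketch. The item names and the 1-5 agree/disagree coding below are hypothetical, not actual CES columns:

```python
import pandas as pd

# Toy responses to two hypothetical efficacy items on a 1-5 agree/disagree scale.
# item_a: "People like me don't have any say..."        (agreement = LOW efficacy)
# item_b: "I am well qualified to participate..."       (agreement = HIGH efficacy)
df = pd.DataFrame({"item_a": [1, 5, 3], "item_b": [5, 1, 3]})

# Coding direction: reverse item_a so higher always means MORE efficacy
df["item_a_rev"] = 6 - df["item_a"]

# Scale construction: equally weighted mean of the two items
df["efficacy"] = df[["item_a_rev", "item_b"]].mean(axis=1)
print(df["efficacy"].tolist())  # [5.0, 1.0, 3.0]
```

Reversing before averaging is exactly the kind of consequential, easy-to-miss decision the list above describes: forget the reversal and the two items cancel each other out.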
A second example comes from the study of affective polarization. Here the concept refers to the tendency to feel warmly toward one’s own party while viewing opposing parties with hostility. In the Canadian Election Study (CES), this is measured with feeling thermometers: respondents are asked, “How do you feel about the [Liberal/Conservative/NDP] Party?” and record their answer on a scale from 0 (very cold) to 100 (very warm). The resulting variables (for example, cps21_party_rating_23 for the Liberal Party) can be combined into a construct such as “in-group minus out-group difference.” This captures the extent to which respondents express positive feelings for their preferred party and negative feelings for its rivals.
These examples illustrate a general principle. Concepts become data through a chain of transformations, and each link in that chain involves choices. Good measurement is not accidental; it is designed, tested, and refined to ensure that the final numbers capture as much of the underlying concept as possible without introducing unnecessary noise or bias.
4.5.1 CES example: Measuring party identification
The theoretical concept of “party identification” refers to a psychological attachment to a political party-a social identity that shapes how people interpret political events. The CES operationalizes this concept with the question:
“In federal politics, do you usually think of yourself as a: Liberal, Conservative, NDP, Bloc Québécois, Green, or none of these?”
This becomes variable cps21_fed_id with numeric codes (1=Liberal, 2=Conservative, etc.). The single question doesn’t capture all aspects of party identification (strength of attachment, stability over time, multiple identities), but it provides a measurable indicator of the underlying construct.
Later in the survey, related questions about voting history, party thermometers, and issue positions help validate whether cps21_fed_id captures meaningful variation in party attachment.
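One way to probe that validation logic is to group the Liberal thermometer by labeled party ID and compare means: if cps21_fed_id captures real attachment, Liberal identifiers should rate the Liberal Party far warmer than Conservative identifiers do. A minimal sketch with toy data standing in for the full CES:

```python
import pandas as pd

# Toy stand-in for ces: party ID codes and the Liberal thermometer (0-100)
ces = pd.DataFrame({
    "cps21_fed_id": [1, 1, 2, 2, 3],
    "cps21_party_rating_23": [85, 70, 20, 30, 40],
})
party_labels = {1: "Liberal", 2: "Conservative", 3: "NDP"}

# Keep only in-range thermometer values (coded missings would fall outside 0-100)
therm = ces["cps21_party_rating_23"].where(ces["cps21_party_rating_23"].between(0, 100))

# Mean Liberal rating by labeled party identification
means = therm.groupby(ces["cps21_fed_id"].map(party_labels)).mean()
print(means)
```

A large gap between the "Liberal" and "Conservative" group means is evidence of convergent validity; a flat profile would suggest the single item is not capturing attachment.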
4.6 Understanding missing value patterns
Survey data commonly includes missing value codes for responses like “Don’t know” (-8), “Refused” (-9), or system missing values. Let’s examine a key political variable:
# Examine the left-right scale variable
print("Left-right self-placement variable examination:")
print(f"Variable: cps21_lr_scale_bef_1")
print(f"Total responses: {ces['cps21_lr_scale_bef_1'].count():,}")
print(f"Missing values: {ces['cps21_lr_scale_bef_1'].isna().sum():,}")
# Look at the actual values to identify missing value codes
print(f"\nValue distribution:")
print(ces['cps21_lr_scale_bef_1'].value_counts(dropna=False).sort_index())

Left-right self-placement variable examination:
Variable: cps21_lr_scale_bef_1
Total responses: 20,968
Missing values: 0
Value distribution:
cps21_lr_scale_bef_1
-99 2895
0 539
1 635
2 1436
3 2113
4 2169
5 3372
6 2501
7 2410
8 1629
9 625
10 644
Name: count, dtype: int64
The CES uses negative values for missing responses; codes such as -99, -88, and -77 indicate different types of non-response. For analysis, we need to filter to valid responses only:
# Filter to valid left-right responses (0-10 scale)
valid_lr_data = ces[(ces['cps21_lr_scale_bef_1'] >= 0) &
(ces['cps21_lr_scale_bef_1'] <= 10)].copy()
print(f"After filtering to valid responses (0-10 scale):")
print(f"Valid responses: {len(valid_lr_data):,}")
print(f"Removed as invalid: {len(ces) - len(valid_lr_data):,}")
print(f"Response rate for LR scale: {len(valid_lr_data)/len(ces):.1%}")

After filtering to valid responses (0-10 scale):
Valid responses: 18,073
Removed as invalid: 2,895
Response rate for LR scale: 86.2%
Survey nonresponse is coded with negative values (here -99; other CES items use codes such as -8 for Don’t know or -9 for Refused), alongside genuine system missing values. We centralize the logic so it’s consistent across variables.
# Centralized missing-code handling (confirm codes in CES codebook)
MISSING_CODES = {-99, -98, -97, -88, -77, -9, -8}

def valid_range(series, lo, hi):
    """Keep only values in [lo, hi] and drop coded missings."""
    s = series.copy()
    mask = s.between(lo, hi) & ~s.isin(MISSING_CODES)
    return s[mask]

# Example: left-right self-placement (0-10)
lr_raw = ces["cps21_lr_scale_bef_1"]
lr = valid_range(lr_raw, 0, 10)
print(f"LR valid N: {len(lr):,} ({len(lr) / len(lr_raw):.1%} of full sample)")

LR valid N: 18,073 (86.2% of full sample)
Survey missing values carry different meanings:
- -9 (Refused): Respondent saw the question but declined to answer
- -8 (Don’t know): Respondent unsure or lacks information
- -7 (Not asked): Question filtered based on earlier responses
- NaN (System missing): Data collection or processing issue
These distinctions matter. “Don’t know” on ideology questions often indicates low political sophistication, which is substantively meaningful rather than mere noise. We’ll handle missing data carefully, not just drop it automatically.
Political attitude questions often have substantial missing data because:
“Don’t know” responses: Many people genuinely don’t have opinions on complex political issues
Refusal to answer: Some respondents prefer not to reveal political views
Question complexity: Abstract concepts like left-right ideology can be difficult to understand
Survey fatigue: Political batteries often come late in surveys when attention wanes
Analysis implications:
- Always report response rates for political variables
- Consider whether missing data patterns relate to other variables
- Be cautious about generalizing from smaller valid-response samples
- Missing data may not be “missing at random”: systematic non-response can bias results
4.6.1 Missing data patterns
Missing data patterns fall into three broad categories:
Missing Completely at Random (MCAR): Like a coin flip, who’s missing is unrelated to anything. Rare in real surveys.
Missing at Random (MAR): Missingness relates to observed variables but not the missing value itself. Example: younger people less likely to report income, but conditional on age, income reporting is random.
Missing Not at Random (MNAR): Missingness relates to the unobserved value. Example: high earners less likely to report exact income.
We can’t definitively determine which pattern holds, but thinking about plausible mechanisms helps us make principled decisions about handling missing data.
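We can at least gather evidence against MCAR by checking whether missingness is associated with observed variables. A toy sketch (the -99 code follows the CES extract shown earlier; the education values reuse the labels from our lookup dictionary):

```python
import pandas as pd

# Toy stand-in: LR placement uses -99 for nonresponse, as in the CES extract
ces = pd.DataFrame({
    "cps21_lr_scale_bef_1": [5, -99, 3, -99, 7, 8],
    "education_labeled": ["Some university", "Secondary school completed",
                          "Bachelor's degree", "Secondary school completed",
                          "Bachelor's degree", "Some university"],
})

# Missingness indicator: 1 if coded nonresponse, 0 otherwise
missing = (ces["cps21_lr_scale_bef_1"] == -99).astype(int)

# If nonresponse rates differ sharply across education groups, MCAR is implausible
rates = missing.groupby(ces["education_labeled"]).mean()
print(rates)
```

Unequal group rates are consistent with MAR (or MNAR); roughly equal rates are necessary, though not sufficient, for MCAR.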
4.7 Region mapping
We define a canonical Region using the CES province code so later cross-tabs don’t break.
region_labels = {
1:"Atlantic", 2:"Quebec", 3:"Ontario",
4:"Prairies", 5:"Alberta", 6:"British Columbia"
}
ces["Region"] = ces["cps21_province"].map(region_labels)
assert ces["Region"].notna().any(), "Region mapping produced all-NA; check codes."

Creating a canonical region variable serves two purposes:
- Consistent analysis: The same region categories work across all analyses
- Readable output: “Atlantic” is clearer than “Province code: 1” in tables
Later analyses (cross-tabs, regional comparisons) will reference this standardized variable rather than numeric codes.
4.8 Understanding survey documentation
Before analyzing survey data, researchers must understand how it was collected. This requires careful attention to survey documentation, which provides crucial context for interpretation and helps identify potential limitations that affect what conclusions can be drawn.
4.8.1 Key documentation elements
Survey documentation typically includes several crucial components:
Sampling Design: Details about how participants were selected, including sampling frame, stratification procedures, and response rates. For the CES, this includes information about the target population (eligible Canadian voters), the sampling frame (online panel providers), and efforts to ensure demographic representativeness.
Questionnaire: Complete question wording, response options, and skip patterns. The CES documentation includes every question asked, the exact wording used, and the logic that determines which respondents see which questions.
Field Procedures: Data collection methods, quality control measures, and any interviewer training (though the CES is self-administered online). This includes information about survey length, incentives provided to respondents, and measures taken to ensure data quality.
Data Processing: Coding decisions, missing value conventions, and variable construction. The CES documentation explains how raw responses are converted into analysis variables and what different numeric codes mean.
Weighting: How sampling weights were constructed to adjust for nonresponse and sampling design. This is particularly important because not all demographic groups respond to surveys at equal rates.
Metadata = Information about what survey variables mean and how they were collected
- Codebook entries explaining variable names and values
- Question wording and response options
- Sample design and weighting procedures
Paradata = Information about the data collection process itself
- Survey completion times and response patterns
- Device type and browser information
- Response sequence and revision patterns
Example: Metadata tells you that cps21_party_rating_23 measures Liberal Party feeling thermometer (0-100). Paradata tells you this respondent took 45 seconds to answer and revised their response twice.
Both types help you assess data quality and make informed analytical decisions.
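As a small paradata example, completion times can flag suspiciously fast interviews. The 10-minute cutoff below is purely illustrative, not a CES rule; the next chapter treats quality thresholds systematically:

```python
import pandas as pd

# Toy completion times in minutes, standing in for ces["cps21_time"]
times = pd.Series([6.0, 14.3, 22.1, 31.2, 145.0])

# Illustrative paradata check: flag completions under an assumed 10-minute threshold
speeder = times < 10
print(f"{speeder.sum()} of {len(times)} flagged as possible speeders")
```

Metadata tells us what `cps21_time` measures; paradata like this tells us how the interview went, and the two together support quality decisions.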
Before analyzing any survey data, you must understand how it was collected. Tradeoffs are unavoidable; no design is perfect. To do credible, fair, and honest work we have to acknowledge these limitations, be transparent about them, and do our best to mitigate any problems. For example, if a survey oversamples certain groups who would otherwise be under-represented, we need to know how to use weights and other tools to adjust for that in our analysis, and base our interpretations accordingly.
4.9 Descriptive statistics: Central tendency and variability
Before moving to quality assessment, we pause to summarize the data with basic descriptive statistics.
Descriptive statistics help us understand our data by summarizing it in simple terms:
For Categorical Variables (like party preference):
- Frequencies: Raw counts (847 people chose Liberal, 623 chose Conservative)
- Percentages: Proportions of the total (40.2% Liberal, 29.6% Conservative)
For Continuous Variables (like age or income):
- Central tendency: Where is the “middle” of our data?
- Variability: How spread out are the values?
Think of descriptive statistics as taking a quick photograph of your data: they don’t explain why patterns exist, but they show you what patterns are there to begin with.
Central Tendency: For numeric variables, we calculate measures of the “typical” or “average” value:
- Mean: The arithmetic average, sensitive to extreme values
- Median: The middle value when data are arranged in order, less sensitive to extremes
- Mode: The most frequently occurring value
Imagine you have survey completion times: 12, 15, 18, 20, 22, 25, 180 minutes (one person took a 3-hour break!)
- Mean (average): Add them all up and divide by the count: (12+15+18+20+22+25+180) ÷ 7 = 41.7 minutes
- Median: Put them in order and pick the middle: 12, 15, 18, 20, 22, 25, 180 → 20 minutes
- Mode: The most common value (if there were multiple 20s, that would be the mode)
Notice how the one extreme value (180 minutes) pulled the mean way up to 41.7 minutes, but the median stayed at a reasonable 20 minutes. This is why the median is often better for data with outliers: it tells you what a typical person experienced.
# Duration analysis for our sample
duration_mean = ces['cps21_time'].mean()
duration_median = ces['cps21_time'].median()
duration_mode = ces['cps21_time'].mode()[0] if len(ces['cps21_time'].mode()) > 0 else None
print(f"Survey duration - Mean: {duration_mean:.1f} minutes")
print(f"Survey duration - Median: {duration_median:.1f} minutes")
print(f"Survey duration - Mode: {duration_mode:.1f} minutes" if duration_mode else "Survey duration - Mode: N/A")

Survey duration - Mean: 145.2 minutes
Survey duration - Median: 22.1 minutes
Survey duration - Mode: 14.3 minutes
Variability: Measures of how spread out our data are:
- Range: Difference between maximum and minimum values
- Standard deviation: Average distance from the mean, in the original units
- Variance: Standard deviation squared (less intuitive but mathematically useful)
Imagine two classes with nearly the same average test score (75% and 74%):
- Class A scores: 73%, 74%, 75%, 76%, 77% (low variability: everyone did similarly)
- Class B scores: 45%, 60%, 75%, 90%, 100% (high variability: big differences)
Range: Class A = 77-73 = 4 points; Class B = 100-45 = 55 points
Standard deviation tells us the typical distance from the average:
- Class A: About 1.6 points (most students within 1-2 points of average)
- Class B: About 22 points (students typically 20+ points above or below average)
Low variability = everyone similar; high variability = lots of differences. Both patterns are important to understand.
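These figures can be checked directly. Note that `statistics.stdev` uses the sample formula (dividing by n − 1):

```python
import statistics

class_a = [73, 74, 75, 76, 77]
class_b = [45, 60, 75, 90, 100]

# Range: maximum minus minimum
range_a = max(class_a) - min(class_a)   # 4 points
range_b = max(class_b) - min(class_b)   # 55 points

# Sample standard deviation (divides by n - 1)
sd_a = statistics.stdev(class_a)
sd_b = statistics.stdev(class_b)
print(range_a, range_b, round(sd_a, 1), round(sd_b, 1))  # 4 55 1.6 22.2
```

For small lists like these, `statistics.pstdev` (the population formula) gives somewhat smaller values; which formula applies depends on whether the scores are a sample or the whole class.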
# Measures of variability for duration
duration_std = ces['cps21_time'].std()
duration_range = ces['cps21_time'].max() - ces['cps21_time'].min()
print(f"Survey duration - Standard deviation: {duration_std:.1f} minutes")
print(f"Survey duration - Range: {duration_range:.1f} minutes")

Survey duration - Standard deviation: 1018.0 minutes
Survey duration - Range: 26246.6 minutes
When the mean is much larger than the median, it suggests the presence of extreme values (outliers) pulling the average upward. We’ll address these outliers systematically in the next chapter through quality assessment.
# Generate comprehensive descriptive statistics for key variables
desc_vars = {
'Age': 'cps21_age',
'Survey Duration (minutes)': 'cps21_time',
'Left-Right Self-Placement': 'cps21_lr_scale_bef_1'
}
# Create descriptive statistics table
descriptive_stats = []
for var_name, var_col in desc_vars.items():
    if var_col == 'cps21_lr_scale_bef_1':
        # Filter to valid responses for ideology
        var_data = ces[(ces[var_col] >= 0) & (ces[var_col] <= 10)][var_col]
    else:
        var_data = ces[var_col].dropna()
    stats = {
        'Variable': var_name,
        'N': len(var_data),
        'Mean': var_data.mean(),
        'Std Dev': var_data.std(),
        'Min': var_data.min(),
        '25th %ile': var_data.quantile(0.25),
        'Median': var_data.median(),
        '75th %ile': var_data.quantile(0.75),
        'Max': var_data.max()
    }
    descriptive_stats.append(stats)
# Add categorical variables
categorical_vars = {
'Education': 'education_labeled',
'Party ID': 'party_id_labeled'
}
for var_name, var_col in categorical_vars.items():
    var_data = ces[var_col].dropna()
    mode_value = var_data.mode()[0] if len(var_data.mode()) > 0 else "N/A"
    mode_count = (var_data == mode_value).sum()
    mode_pct = (mode_count / len(var_data)) * 100
    stats = {
        'Variable': var_name,
        'N': len(var_data),
        'Mean': f"Mode: {mode_value}",
        'Std Dev': f"({mode_pct:.1f}%)",
        'Min': f"{len(var_data.unique())} categories",
        '25th %ile': "",
        'Median': "",
        '75th %ile': "",
        'Max': ""
    }
    descriptive_stats.append(stats)
# Convert to DataFrame and format
desc_df = pd.DataFrame(descriptive_stats)
# Format numeric columns
numeric_cols = ['Mean', 'Std Dev', 'Min', '25th %ile', 'Median', '75th %ile', 'Max']
for col in numeric_cols:
    # The isinstance check must include NumPy scalar types: float32 statistics
    # (e.g., age) are np.float32, which is not a subclass of Python float
    desc_df[col] = desc_df[col].apply(
        lambda x: f"{x:.2f}" if isinstance(x, (int, float, np.number)) else x
    )
print("\nDescriptive Statistics for Analysis Sample:")
print("=" * 80)
print(desc_df.to_string(index=False))

Descriptive Statistics for Analysis Sample:
================================================================================
                 Variable     N                    Mean Std Dev     Min 25th %ile Median 75th %ile      Max
                      Age 20968                   51.30   17.20   18.00     36.00  53.00     66.00    97.00
Survey Duration (minutes) 20968                  145.16 1018.01    6.03     16.58  22.08     31.25 26252.58
Left-Right Self-Placement 18073                    5.11    2.35    0.00      3.00   5.00      7.00    10.00
                Education 20968 Mode: Bachelor's degree (28.9%) 12 categories
                 Party ID 20968           Mode: Liberal (30.5%)  8 categories
This comprehensive descriptive statistics table would typically appear in a research paper, showing the characteristics of our analysis sample. These statistics describe our sample, not the Canadian population. In the next chapter, we’ll assess data quality to create a clean analysis sample suitable for exploration.
4.10 Looking forward: From understanding to evaluation
We can now load CES data, understand its structure, and translate numeric codes into meaningful labels. We know how variables are named, how question families are organized, and how theoretical concepts are operationalized as survey items.
But datasets of this size inevitably contain quality issues: respondents who rush through questions, technical glitches that corrupt responses, or duplicate entries from data collection errors. Before exploring patterns, we need to systematically assess quality and create a clean analysis sample.
The next chapter applies the Total Survey Error framework from Chapter 3 to identify and address quality issues, building a trustworthy dataset for analysis. We’ll examine duration outliers, attention checks, duplicate detection, and other indicators that help distinguish high-quality from problematic responses.