import qmcp
import pandas as pd
import janitor
from bs4 import BeautifulSoup
By the end of this chapter, you will be able to:
- Load, inspect, and parse semi-structured web data (HTML tables) into Pandas
- Clean messy data (names, types, missing values, and out-of-band rows)
- Reshape data from wide to long format and explain why that’s useful
- Produce and save publication-quality figures of polling trends
- Explain what “margin of error” and “percentage points (pp)” mean, and what they don’t
- Build a non-destructive, reproducible processing pipeline and save clean outputs
In Chapter 1, we successfully scraped polling data from Wikipedia and saved both the raw HTML and extracted tables. We applied Alexander’s framework by planning our target schema and preserving our sources exactly as found.
This chapter continues the Acquire phase by transforming raw extracts into analysis-ready data. The current data has mixed types (everything is text), inconsistent formatting (percentages include “%” symbols, sample sizes have commas), and structural issues (event rows mixed with polling records). Our goal: clean, typed, documented datasets suitable for exploration and modeling.
This chapter introduces some of the practical skills that make up the bulk of quantitative research work: cleaning inconsistent formats, handling missing values, converting data types, and reshaping or transforming data. These may seem like mundane technical tasks, but they’re essential for credible research. The decisions you make during data processing directly affect your analysis and conclusions.
We’ll work with the polling data from Chapter 1, learning to diagnose and fix common data quality issues. Along the way, you’ll develop a systematic approach to data cleaning that emphasizes transparency, reproducibility, and documentation. By the end, you’ll have both clean datasets ready for analysis and a workflow you can apply to any messy data you encounter in your research.
2.1 Setup
Hey 3040 students! Every time you start a new Colab session, or restart an existing session, you’ll need to install some packages that aren’t included in the default Colab environment. To install what we need for this chapter, create a new code cell and run:
!pip install qmcp pyjanitor seaborn beautifulsoup4
2.2 The Non-destructive Processing Principle
A core principle of reproducible research: never modify original data files. Instead, document all transformations in code, creating new datasets at each processing stage:
- polling_page.html → Raw source, exactly as downloaded
- raw_polling_data.csv → Extracted tables, no cleaning applied
- campaign_period_polls_cleaned.csv → Processed, analysis-ready data
This workflow provides three critical benefits:
Reproducibility: Anyone can verify our work by re-running the code. Every transformation is explicit and documented.
Flexibility: We can revise cleaning decisions without re-scraping. If we later discover we filtered too aggressively or made a coding error, we return to the raw data and adjust our processing code.
Transparency: The original source remains pristine. Reviewers can see exactly what we started with and what decisions we made. This builds trust.
In social science research, this separation is essential. Journals increasingly require data and code alongside publications. The non-destructive workflow makes this sharing straightforward.
2.3 Wikipedia Opinion Poll Data
In the previous chapter, we successfully scraped HTML data from a Wikipedia polling page and saved it to disk. Now we move to the next phase: transforming raw extracted data into clean, analysis-ready datasets.
2.3.1 Reloading the Source HTML
We can load the HTML source we saved in the previous chapter from data/processed/scraped-wikipedia/polling_page.html. It’s just a plaintext file, so if you open it in a text editor, you will see the raw HTML code (e.g., see Figure 2.1).
Figure 2.1: The raw HTML source, showing the <h2> tag with element ID National_polls for the “National Polls” section.
Let’s load the file:
with open(
"data/processed/scraped-wikipedia/polling_page.html", "r", encoding="utf-8"
) as f:
html = f.read()
type(html)
As expected, the HTML source is a string. But we don’t really want a string, we want soup!
| Tag | Section Title | Element ID |
|---|---|---|
| h2 | National Polls | National_polls |
| h2 | Regional Polls | Regional_polls |
| h2 | Leadership Polls | Leadership_polls |
| h2 | Seat Projections | Seat_projections |
| h2 | Government Approval | Government_approval_polls |
We can convert the HTML string to a BeautifulSoup object using the BeautifulSoup() constructor:
soup = BeautifulSoup(html, "html.parser")
type(soup)
There are a number of ways we can go about getting the data we want. How do we keep it simple? The approach I’ve settled on, having worked through a few, is to get all the wikitables on the page and then find the nearest heading (h1-h6) above each table. This way we can see what tables we have and which section each belongs to.
wikitables = soup.find_all('table', class_='wikitable')
cleaned_tables = {}
for table in wikitables:
heading = qmcp.get_nearest_heading_above_html_table(table)
df = qmcp.read_wikitable(table)
cleaned_tables[heading] = df
for key in cleaned_tables.keys():
print(key)
The figure we want to produce is based on national data from the “Campaign period,” so we’ll select that dataframe.
cp_poll = cleaned_tables['Campaign period'].copy()
cp_poll.head()
When working with DataFrames, it’s good practice to create a copy when you plan to modify the data. This prevents unintended changes to your original dataset. The .copy() method creates a new DataFrame with the same data, allowing you to transform it safely while preserving the original for reference or alternative analyses.
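To see what .copy() protects against, here is a tiny, hypothetical illustration on a toy DataFrame (not the polling data):
toy = pd.DataFrame({"a": [1, 2, 3]})
alias = toy            # plain assignment: both names refer to the same object
alias["a"] = 0         # this also changes `toy`
independent = toy.copy()
independent["a"] = 99  # `toy` is unaffected by changes to the copy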
Source data often comes with messy column names containing spaces, special characters, or inconsistent capitalization. The clean_names() function from PyJanitor automatically converts column names to lowercase, replaces spaces with underscores, and removes special characters. This creates consistent, code-friendly column names that are easier to work with programmatically and less prone to errors.
cp_poll = cp_poll.clean_names()
cp_poll.head()
cp_poll.info()
We need to get the right dtypes! Can we just do something like this?
cp_poll['lpc'].astype("float")
Nope! The error message you’ll see if you run that code ends with ValueError: could not convert string to float: "Leaders' Debates Commission English debate". That’s not the only non-numeric value in the lpc column, either. There are a few other rows with text in them, as well as some missing values. As you can see in Figure 2.2, there is a pattern of using rows to indicate the timing of events, such as leaders’ debates and major changes in party leadership.
We’ll drop those rows (using 'polling_firm' as a proxy), and then convert each of the relevant columns to numeric. The errors='coerce' argument will convert any remaining non-numeric values to NaN (indicating missing values).
cp_poll = cp_poll.dropna(subset=['polling_firm'])
columns_float = ['cpc', 'lpc', 'ndp', 'bq', 'ppc', 'gpc', 'others', 'lead']
for party in columns_float:
cp_poll[party] = pd.to_numeric(cp_poll[party], errors='coerce')
cp_poll.info()
Great! The next thing we’ll do is convert the data types of the remaining columns to more appropriate types. We’ll convert the last_date_of_polling column to a datetime type, which will allow us to work with dates more easily.
cp_poll['last_date_of_polling'] = pd.to_datetime(
cp_poll['last_date_of_polling'], errors='coerce' )
cp_poll.info()
Next, let’s clean sample_size and margin_of_error, each of which will require some special treatment.
Note that sample_size contains commas (e.g., “1,500”) and sometimes has values in parentheses (e.g., “1,500 (1/3)”).
cp_poll['sample_size'].value_counts(dropna=False).head(10)
A footnote on the Wikipedia page explains the parentheses:
Refers to the total, “raw” sample size, including undecided and non-voters, and before demographic weighting is applied. Fractions in parentheses apply to rolling polls (see below) and indicate the proportion of the sample that is independent from the previous poll in the series.
We’ll consider what this means a little later. The parenthetical information is useful and should be retained. There are a few ways we could do this, but since we won’t actually use that information for this analysis here, we’ll set it aside for now. Below, I’ll duplicate the column to retain the raw information, and then clean the original sample_size column to convert it to numeric.
cp_poll['sample_size_raw'] = cp_poll['sample_size']
cp_poll['sample_size'] = (
cp_poll['sample_size'].str.split('(').str[0]
.str.replace(',', '')
.astype(int)
)
cp_poll.info()
There are still a few final things to do, but this is looking good!
Finally, if we take a look at sample_size, we’ll see that the largest sample is huge. Why is that?
cp_poll['sample_size'].sort_values(ascending=False).head(10)
It’s because the first row of the DataFrame is aggregated data from multiple polls.
cp_poll.head()
We can drop that row, and then reset the index.
cp_poll = cp_poll.iloc[1:].reset_index(drop=True)
cp_poll.head()
Next, we’ll clean up the margin_of_error column. It contains the “±” character, which we can remove using the str.replace() method, as well as whitespace and pp. Then we can convert the column to numeric.
The margin of error indicates expected variation from random sampling. If a poll reports CPC support at 41% ± 3 percentage points (pp), the true value in the sampling frame might plausibly range from 38% to 44%.
Key terminology:
- Percentage points (pp): Absolute differences (41% → 44% is +3 pp)
- Percent change: Relative differences (41% → 44% is a 7.3% increase)
- ±: Plus-or-minus range around the estimate
The margin assumes the poll used a probability sample where every person in the population had a known chance of selection.
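As a quick worked example of this terminology (the numbers are made up for illustration, and the 1.96/√n formula below is the textbook simple-random-sample calculation at a 95% confidence level with p near 0.5):
support_before = 41  # per cent
support_after = 44   # per cent
pp_change = support_after - support_before                            # +3 percentage points
pct_change = (support_after - support_before) / support_before * 100  # ~7.3% relative increase

n = 1000  # hypothetical sample size
moe = 1.96 * (0.5 * 0.5 / n) ** 0.5 * 100  # ~3.1 percentage points at 95% confidence
print(pp_change, round(pct_change, 1), round(moe, 1))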
Standard polling margins of error make strong assumptions that often don’t hold:
- Random sampling: Most modern polls use quota sampling or online panels, not pure probability samples
- Coverage: Only accounts for sampling variation, not coverage errors (who’s missing from the frame)
- Nonresponse: Assumes respondents represent non-respondents
- Measurement: Ignores question wording effects, interviewer bias, etc.
The Total Survey Error framework (Chapter 3) shows that sampling error is just one component. A poll with a small margin of error can still be badly biased if coverage or nonresponse problems dominate. Use margins of error as rough guides to precision, not guarantees of accuracy.
cp_poll['margin_of_error'] = (
cp_poll['margin_of_error']
.str.replace('-', '')
.str.replace('±', '')
.str.replace('pp', '')
.str.strip()
)
cp_poll['margin_of_error'] = pd.to_numeric(
cp_poll['margin_of_error'], errors='coerce'
)
cp_poll.info()
All columns now have appropriate data types. The next step is visualization.
cp_poll.head()
2.4 From Wide to Long Format
2.4.1 Why Reshape to Long Format?
Most statistical graphics and models expect “long” format data: one row per observation, with a column identifying categories. Our current “wide” format, where each party has its own column, makes certain analyses difficult.
For example, to plot polling trends for all parties on one figure with Seaborn, we need:
| date | party | polling_percentage |
|---|---|---|
| 2024-01-15 | CPC | 35 |
| 2024-01-15 | LPC | 40 |
| 2024-01-15 | NDP | 20 |
Rather than:
| date | CPC | LPC | NDP |
|---|---|---|---|
| 2024-01-15 | 35 | 40 | 20 |
Long format is also called “tidy data”: each variable is a column, each observation is a row, and each type of observational unit is a table. The melt() function transforms wide data into long format.
First, we’ll select the columns we want to plot and sort by date to ensure the time series plot is correct.
plotting_columns = [
'last_date_of_polling', 'sample_size', 'margin_of_error', 'polling_firm',
'cpc', 'lpc', 'ndp', 'bq', 'ppc', 'gpc', "others"
]
plot_data = cp_poll[plotting_columns].copy()
plot_data.sort_values(by='last_date_of_polling', inplace=True)
plot_data.head()
Data can be organized in two primary formats: wide and long. In wide format, each variable has its own column (like our polling data, where each party has a separate column). In long format, variables are stacked into rows with identifier columns.
For example, wide format:
| Date | CPC | LPC | NDP |
|---|---|---|---|
| 2024-01-01 | 35 | 40 | 20 |
| 2024-01-02 | 36 | 39 | 21 |
And long format:
| Date | Party | Percentage |
|---|---|---|
| 2024-01-01 | CPC | 35 |
| 2024-01-01 | LPC | 40 |
| 2024-01-01 | NDP | 20 |
| 2024-01-02 | CPC | 36 |
Long format is often preferred for visualization and statistical modeling because it treats each observation as a separate row, making it easier to group, filter, and plot by categories.
Pandas provides two primary methods for reshaping data: melt() (wide to long) and pivot() (long to wide). We emphasize melt() because long format is generally more flexible for analysis and visualization. The melt() function “melts” column headers into row values, creating a more normalized data structure that’s easier to work with in most analytical contexts.
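To see the two methods side by side, here is a minimal round trip on a toy DataFrame (the values are invented for illustration):
toy_wide = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "cpc": [35, 36],
    "lpc": [40, 39],
})
# Wide -> long: party names move from the column headers into a 'party' column
toy_long = toy_wide.melt(id_vars="date", var_name="party", value_name="pct")
# Long -> wide: pivot() reverses the reshaping
toy_back = toy_long.pivot(index="date", columns="party", values="pct").reset_index()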
Our plotting data is in wide format, which is not ideal for plotting with Seaborn. We can convert it to long format using the melt() method. This will create a new DataFrame with the columns last_date_of_polling, party, and polling_percentage.
plot_data_melted = plot_data.melt(
id_vars=['last_date_of_polling', 'sample_size', 'margin_of_error', 'polling_firm'],
var_name='party',
value_name='polling_percentage'
)
plot_data_melted.head()
2.5 Plotting Daily Trends During the Campaign Period
Now that we have clean data in long format, we can easily recreate the plot from Wikipedia using Seaborn. We’ll use a line plot to show the trends over time for each party.
import matplotlib.pyplot as plt
import seaborn as sns
qmcp.set_style()
plt.figure(figsize=(12, 7))
sns.lineplot(
data=plot_data_melted,
x='last_date_of_polling',
y='polling_percentage',
hue='party'
)
Now we’ll do a bit more iteration to improve the plot. We’ll add titles and labels, modify the x-axis labels for better readability, and use a custom color palette to match the parties’ colors. I’ll also adjust the legend to be more readable and use the errorbar argument to show the variability in our data.
I’ll define a custom color palette to match the parties’ colors.
party_colors = {
"cpc":"#1A4782",
"lpc":"#D71920",
"ndp":"#F37021",
"bq":"#33B2CC",
"ppc":"#67319B",
"gpc":"#3D9B35",
"others":"gray"
}
plt.figure(figsize=(12, 7))
ax = sns.lineplot(
data=plot_data_melted,
x='last_date_of_polling',
y='polling_percentage',
hue='party',
errorbar='sd',
palette=party_colors
)
ax.xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%b %d'))
plt.legend(
title="",
loc="lower center",
bbox_to_anchor=(0.5, -0.15),
ncol=len(party_colors),
frameon=False
)
plt.title('Opinion Polling Trends')
plt.xlabel('')
plt.ylabel('Polling Percentage (%)')
plt.grid(True)
plt.show()
In the plot above, I added error bars representing one standard deviation around the mean polling percentage for each party at each time point. This is not a measure of uncertainty in our estimates; it is a description of the variability in our observed sample. The standard deviation shows us how much the polling results vary from the average for each party across the time period.
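For reference, the spread that errorbar='sd' summarizes at each x value can be computed directly; the sketch below groups by party and date (dates with only a single poll will have an undefined standard deviation):
spread = (
    plot_data_melted
    .groupby(['party', 'last_date_of_polling'])['polling_percentage']
    .agg(['mean', 'std'])
)
spread.head(10)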
There are different ways we can approach uncertainty when analyzing polling data:
From a Frequentist perspective, you might use confidence intervals based on sampling distributions and assumptions about random sampling. This approach treats the “true” population parameter as fixed but unknown, and uses probability to describe the sampling process.
Alternatively, from a Bayesian perspective, you would treat the population preferences as uncertain quantities and use probability to describe your beliefs about them, updating these beliefs as new polling data becomes available. This approach naturally incorporates prior information and provides more intuitive interpretations of uncertainty.
We’ll explore these different approaches to uncertainty in later chapters when we move beyond descriptive analysis to inferential modeling.
2.6 Counting Polling Methods
Now let’s examine the different polling methodologies used in these surveys:
cp_poll['polling_method'].value_counts(dropna=False).head(20)
These polling methods represent different approaches to data collection, each with distinct characteristics:
IVR (Interactive Voice Response): Automated phone calls where respondents answer questions by pressing keypad buttons or speaking responses. When marked as “rolling,” the poll is conducted continuously with results updated regularly rather than as a single snapshot.
Online: Internet-based surveys distributed via email, social media, or website links. This method is cost-effective and fast but may suffer from selection bias if the sample isn’t representative of the broader population.
Telephone: Traditional phone interviews conducted by live interviewers. “Rolling” telephone polls maintain continuous data collection with regular updates, providing real-time trend tracking.
Text/Online: Hybrid approach combining SMS invitations with online survey completion. This method attempts to reach respondents through multiple channels while maintaining the convenience of digital data collection.
Telephone/Online: Mixed-mode methodology where some respondents complete surveys by phone while others use online platforms. This approach aims to improve representativeness by accommodating different respondent preferences.
The choice of polling method affects both the quality and representativeness of the data, with implications for how we should interpret and weight the results.
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
qmcp.set_style()
firm_totals = (
cp_poll.groupby('polling_method', as_index=False)
.agg(
total_sample_size=('sample_size', 'sum'),
num_polls=('polling_method', 'size')
)
.sort_values('total_sample_size', ascending=False)
)
plt.figure(figsize=(11, 6))
sns.barplot(
data=firm_totals,
y="polling_method",
x="total_sample_size",
orient="h",
errorbar=None
)
ax = plt.gca()
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: f"{int(x):,}"))
plt.title("Total Sample Size by Polling Method")
plt.xlabel("Total Sample Size")
plt.ylabel("")
plt.show()
2.7 Reproducibility & Non-destructive Data Processing Pipelines
Now that we’ve accomplished our goal, let’s step back to consider what we’ve done and how it relates to some broader principles of data processing. In particular, we will consider the principles of reproducibility and non-destructive data processing pipelines.
2.7.1 The Importance of Non-destructive Workflows
Throughout this chapter, we’ve followed a key principle: never modify your original data files. Instead, we’ve documented all our data cleaning and transformation decisions in code. This approach has several important benefits:
- Reproducibility: Anyone (including your future self) can re-run your analysis and get the same results
- Transparency: All data processing decisions are explicitly documented in your code
- Flexibility: You can easily modify your cleaning process without losing the original data
- Error recovery: If you make a mistake, you can always return to the source data
2.7.2 Building Effective Data Pipelines
A data processing pipeline is a sequence of steps that transforms raw data into analysis-ready datasets. Effective pipelines share several characteristics (a minimal sketch follows the list below):
Modularity: Each step has a clear, single purpose. We saw this when we separated data loading, cleaning, transformation, and visualization into distinct sections.
Documentation: Each step is clearly documented, explaining not just what is being done but why. Our use of comments and markdown explanations demonstrates this principle.
Validation: The pipeline includes checks to ensure data quality at each step. Our use of .info() calls and data type conversions illustrates this practice.
Version control: All code is tracked in a version control system, allowing you to see how your analysis evolved over time.
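Here is a minimal, hypothetical sketch of how these characteristics can be organized as small functions; the function names, file names, and checks are illustrative, not part of this chapter’s code:
def load_raw(path):
    # Read the raw extract without touching the file on disk
    return pd.read_csv(path)

def clean(df):
    # Apply the documented cleaning steps and return a new DataFrame
    df = df.copy()
    df = df.dropna(subset=["polling_firm"])
    df["last_date_of_polling"] = pd.to_datetime(df["last_date_of_polling"], errors="coerce")
    return df

def validate(df):
    # Basic quality checks: fail loudly rather than silently
    assert df["last_date_of_polling"].notna().all(), "unparseable dates found"
    return df

def save_processed(df, path):
    # Write the analysis-ready dataset to a new file, leaving the input untouched
    df.to_csv(path, index=False)

# Each stage reads the previous stage's output and writes a new artifact, e.g.:
# save_processed(validate(clean(load_raw("raw_polling_data.csv"))), "campaign_period_polls_cleaned.csv")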
2.7.3 Saving Processed Data
After cleaning our data, we should save it in a format that preserves our work:
import os
# Create the output directory if it doesn't exist
os.makedirs("data/processed/scraped-wikipedia", exist_ok=True)
# Save the cleaned polling data
cp_poll.to_csv("data/processed/scraped-wikipedia/campaign_period_polls_cleaned.csv", index=False)
# Save the visualization-ready data
plot_data_melted.to_csv("data/processed/scraped-wikipedia/polling_data_long_format.csv", index=False)
print("Cleaned data saved to data/processed/scraped-wikipedia/ directory")This ensures that future analyses can begin with the cleaned data rather than repeating all the cleaning steps.
2.7.4 Documentation Standards
Good data processing documentation should include the following (a small example of recording it appears after the list):
- Data sources: Where did the data come from? When was it collected?
- Processing decisions: Why did you make specific cleaning choices?
- Assumptions: What assumptions underlie your analysis?
- Limitations: What are the potential weaknesses or biases in your data?
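One lightweight way to record this information is to save a small notes file alongside the processed data. The file name and fields below are illustrative, not a required format:
import json
from datetime import date

processing_notes = {
    "source": "Wikipedia opinion polling page (HTML saved in Chapter 1)",
    "date_processed": str(date.today()),
    "decisions": [
        "Dropped event rows (identified by missing polling_firm)",
        "Dropped the aggregate first row",
        "Coerced non-numeric values to NaN",
    ],
    "assumptions": ["last_date_of_polling is used as the time index"],
    "limitations": ["Margins of error reflect sampling variation only"],
}

with open("data/processed/scraped-wikipedia/processing_notes.json", "w") as f:
    json.dump(processing_notes, f, indent=2)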
2.7.5 Looking Forward
The principles we’ve established here (reproducibility, transparency, and non-destructive processing) will serve as the foundation for all our subsequent work. As we move toward more complex analyses in future chapters, maintaining these standards becomes even more critical.
Figure 2.3 shows a high-level overview of a data processing pipeline that incorporates these principles:
2.8 Looking Forward: From Data to Design
We’ve successfully transformed raw Wikipedia extracts into clean, analysis-ready datasets. The polling data now has appropriate types, consistent formatting, and documented provenance. Both raw and processed versions are saved for reproducibility.
But before diving into analysis, we need conceptual foundations. Opinion polling is just one type of survey research, and not all surveys are created equal. The next chapter establishes principles of survey design, sampling, and measurement that will guide our work with the Canadian Election Study.
Understanding these foundations will help us distinguish between quick polls (like our Wikipedia data) and carefully designed surveys, and to assess what each can reliably tell us about public opinion.



