1  Python via Web Scraping

Quantitative Social Science – Core Concepts, Skills, and Stories
(draft manuscript)

Author: John McLevey (he/him)

Affiliation: Sociology, Memorial University

 

This open-access book accompanies the quantitative research methods course I teach at Memorial University. It’s under active development and revision. Chapters are in different stages of development, so some may be a little rougher than others. Feedback is welcome!

By the end of this chapter you will be able to:

  • Explain what web scraping is and when it’s appropriate
  • Practice collecting data from a real web page in a careful, reproducible way
  • Load a page with requests
  • Turn its HTML into a structure Python can navigate with BeautifulSoup
  • Extract the page title, headings, paragraphs, links, and table data
  • Save your results as tidy, reusable files (CSV) for later analysis

Quantitative social scientists often need data that doesn’t exist as a convenient download. Wikipedia’s opinion polling pages contain structured tables tracking campaign dynamics, but no “Export to CSV” button. Web scraping lets us transform these HTML pages into analysis-ready datasets.

This chapter demonstrates the full Acquire phase of Alexander’s framework: planning what we need, understanding the source structure, extracting data programmatically, and saving both raw and processed versions for reproducible research. We’ll work with Canadian federal election polling data, creating a dataset suitable for analyzing campaign dynamics.

1.1 Telling Stories with Data

Before diving into code, let’s understand where web scraping fits in the broader quantitative social science workflow. In the Introduction, you learned about Alexander’s framework: Plan → Simulate → Acquire → Explore/Analyze → Share. This chapter lives primarily in the Acquire phase, but we’ll touch on all five steps as we work through our polling data project.

Planning gives direction to what we collect and helps us avoid gathering irrelevant data. Quick simulation or prototyping helps us spot problems before they cost time. A bit of exploration tells us whether we actually got what we thought we were getting. Saving results in clear, well-organized formats makes later sharing easier, including for your future self.

1.1.1 Planning Our Data Collection

Before writing any code, let’s apply the first step of Alexander’s framework: Plan. We’re going to collect polling data from Wikipedia’s page on the 2025 Canadian federal election. Rather than diving straight into code, let’s sketch our endpoint: what we want our final dataset to look like. This planning step helps us stay focused and make better decisions as we work.

Figure 1.1: Our target: polling data tables on Wikipedia that we’ll scrape and convert to a DataFrame.

A schema is a blueprint for your data table: it names each column, states its data type, and notes any constraints you expect. Planning your schema upfront keeps your data collection focused and gives you a target to check against. Here’s our simple schema:

Table 1.1: Basic schema for our polling data. We’ll keep data types simple initially and clean them in the next chapter.
Field                  Type      Purpose
Polling firm           string    Organization that conducted the poll
Last date of polling   datetime  When the poll was completed
Sample size            string    Number of people surveyed (we’ll clean this later)
CPC                    string    Conservative Party support percentage
LPC                    string    Liberal Party support percentage
NDP                    string    New Democratic Party support percentage
BQ                     string    Bloc Québécois support percentage
Other parties          string    Support for other parties

This schema serves as our “simulation” step in Alexander’s framework. While we’re not creating fake data here, we’re doing something pretty close: being very specific about what we expect our real data to look like. This helps us write better scraping code and catch problems early.

Notice that we’re keeping this simple. We’ll extract the data as text first, then clean and convert data types in the next chapter. This two-step approach (extract first, clean later) is common in quantitative social science and makes debugging easier.

Writing this schema first keeps your cleaning steps honest and gives you a target to check against once you extract the data.
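
One lightweight way to make that target concrete in code is to write the expected column names down as a Python list and compare them to whatever we end up extracting. This is just a sketch; the stand-in DataFrame below exists only to demonstrate the check.

import pandas as pd

# Expected columns, taken from the schema in Table 1.1
expected_columns = [
    "Polling firm", "Last date of polling", "Sample size",
    "CPC", "LPC", "NDP", "BQ", "Other parties",
]

# A tiny stand-in DataFrame just to demonstrate the check
demo_df = pd.DataFrame(columns=["Polling firm", "Sample size", "CPC"])

# Which schema columns are missing from the extracted table?
missing = [col for col in expected_columns if col not in demo_df.columns]
print("Missing columns:", missing)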

1.2 What is Web Scraping?

Web scraping is the process of programmatically extracting data from websites. When a website doesn’t offer a direct download or API (Application Programming Interface), we can still collect information by parsing the HTML that makes up web pages.

Sometimes websites provide data directly as a download or through an API (Application Programming Interface), which is an official, structured way for computers to request specific data from a website. When an API or download exists, prefer it because it is usually more stable than scraping. When no such option is available, we can still collect information by programmatically extracting it from the HTML (the markup language that describes web pages).

A typical scraping workflow has four steps (sketched in code after the list):

  1. Request a webpage from a server
  2. Parse the returned HTML to understand its structure
  3. Extract the specific information you need
  4. Save the results in a structured format for analysis
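
Here is a compact preview of those four steps, using a placeholder URL (https://example.com) rather than our real target. The rest of the chapter walks through each step carefully with the Wikipedia polling page.

import requests
from bs4 import BeautifulSoup

# 1. Request a webpage from a server (placeholder URL for illustration)
response = requests.get("https://example.com", timeout=30)

# 2. Parse the returned HTML to understand its structure
soup = BeautifulSoup(response.text, "html.parser")

# 3. Extract the specific information you need
page_title = soup.find("title").get_text()

# 4. Save the results in a structured format for analysis
with open("page_title.txt", "w", encoding="utf-8") as f:
    f.write(page_title)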

1.2.1 Responsible Web Scraping

Responsible scraping blends technical etiquette and research ethics. Here are the key principles:

Prefer official sources: Use APIs or direct downloads when available; they’re more reliable and faster than scraping.

Respect website policies: Read the site’s robots.txt file (found at /robots.txt on most sites) and terms of use to understand what the site expects of automated visitors.
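
If you want to check a robots.txt file from Python rather than reading it in the browser, the standard library’s urllib.robotparser can do it. A minimal sketch, checking the article path we care about:

from urllib import robotparser

# Download and parse Wikipedia's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch the article page
article = "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2025_Canadian_federal_election"
print("Allowed to fetch?", rp.can_fetch("*", article))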

Be gentle: Add delays between requests so you don’t overwhelm the server, and save what you fetch so you don’t need to ask again.

Protect privacy: Treat anything that could be personally identifying with care. Aggregate data where possible, and remember it’s easier than you might think to re-identify individuals from seemingly harmless details.

Be transparent: Don’t redistribute copyrighted content. Instead, share the code you wrote to collect and clean the data so others can reproduce your steps.

Common Scraping Errors and Solutions
  • Getting blocked (403/429 errors): Add proper User-Agent headers and delays between requests (see the sketch after this list).
  • Empty results: Check if the site loads content with JavaScript (try viewing source vs. inspecting elements). BeautifulSoup only sees the initial HTML.
  • Changing structure: Websites update their layout. Save raw HTML so you can debug without re-scraping.
  • Rate limiting: If you get timeouts, add time.sleep(1) between requests to slow down your scraper.
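
Putting this etiquette into practice mostly means an honest User-Agent header and a short pause between requests. A minimal sketch (the URLs are placeholders; in this chapter we only fetch a single page):

import time
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; SOCI3040-Student/1.0; +https://www.mun.ca/)"
}

# Placeholder list of pages to fetch politely
example_urls = ["https://example.com", "https://example.org"]

for page_url in example_urls:
    page_response = requests.get(page_url, headers=headers, timeout=30)
    print(page_url, "->", page_response.status_code)
    time.sleep(1)  # pause so we don't overwhelm the server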

For our Wikipedia example, scraping is appropriate because the data is publicly available and we’re using it for educational purposes.

1.2.2 Understanding HTML Structure

Web pages use HTML tags to structure content. Tables use <table>, rows use <tr>, cells use <td>. BeautifulSoup parses this hierarchical structure so we can navigate it programmatically.

The key insight: HTML is a tree of nested tags. We don’t need to memorize every tag; when you need to understand a page’s structure, right-click → Inspect Element to see the underlying HTML.

For our Wikipedia polling table, the structure looks like:

<table class="wikitable">
  <tr>
    <th>Polling firm</th>
    <th>CPC</th>
    <th>LPC</th>
  </tr>
  <tr>
    <td>Polling Company</td>
    <td>35%</td>
    <td>32%</td>
  </tr>
</table>

BeautifulSoup lets us find elements by tag name (<table>) or attributes (class="wikitable") without writing complex parsing code.
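
To see this in action before we set up the real page, here is a quick preview of BeautifulSoup applied to the toy table above:

from bs4 import BeautifulSoup

# The small example table from above, stored as a Python string
sample_html = """
<table class="wikitable">
  <tr><th>Polling firm</th><th>CPC</th><th>LPC</th></tr>
  <tr><td>Polling Company</td><td>35%</td><td>32%</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable")

# Each <tr> is a row; <th> and <td> are header and data cells
for row in sample_table.find_all("tr"):
    print([cell.get_text() for cell in row.find_all(["th", "td"])])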

1.3 Setting Up for Scraping

We’ll use three Python packages for web scraping:

  • requests: Downloads web pages
  • BeautifulSoup: Parses HTML into a navigable structure
  • pandas: Converts extracted data into DataFrames

As you learned in the Introduction, packages are collections of pre-written Python code that solve common problems. The import statement brings a package into your program so you can use its functions.

Python Concept: Packages and Imports

A package is a collection of pre-written Python code that solves common problems. Instead of writing web scraping code from scratch, we can use the requests package that others have already built and tested.

The import statement brings a package into your program. When you write import requests, you’re telling Python to load the requests package so you can use its functions. This is like adding tools to your toolbox-once imported, you can use requests.get() to download web pages.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from io import StringIO

For consistency and reproducibility, we’ll use a specific version of the Wikipedia page. Wikipedia pages change over time, so using a permanent link ensures everyone gets the same results.

As you learned in the Introduction, a variable is a name that stores a value so you can refer to it later.

Python Concept: Variables and Strings

A variable is a name that stores a value. When you write url = "https://...", you’re creating a variable named url and storing a text string in it.

Strings are sequences of characters enclosed in quotes. You can use single quotes ('text') or double quotes ("text") - they work the same way. The long URL below is just a string that we can refer to later using the variable name url.

url = "https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_2025_Canadian_federal_election&oldid=1306495612"

We also need to set a User-Agent header, which tells the website what kind of program is requesting the page. We’ll do this by defining a dictionary, which (as you may remember from the Introduction) is a data structure that stores key-value pairs:

headers = {
 "User-Agent": "Mozilla/5.0 (compatible; SOCI3040-Student/1.0; +https://www.mun.ca/)"
}

Many websites block requests that don’t include a User-Agent header because they look like anonymous bots. Including a clear User-Agent string signals that we’re behaving responsibly. The string above identifies our request as coming from a course project, which is honest and transparent.

Python Concept: Dictionaries

A dictionary is a way to store key-value pairs. You create one using curly braces {}. In {"User-Agent": "Mozilla/5.0..."}, the key is “User-Agent” and the value is the string identifying our program.

Dictionaries are like lookup tables - you can ask for the value associated with any key. They’re perfect for storing configuration settings like HTTP headers.
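
For example, we can look up a value in our headers dictionary, or build a new dictionary of our own (party_names below is just an illustration, not something we need later):

# Look up the value stored under the "User-Agent" key
print(headers["User-Agent"])

# Dictionaries can hold any key-value pairs, not just HTTP headers
party_names = {"CPC": "Conservative Party", "LPC": "Liberal Party"}
print(party_names["LPC"])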

1.4 Loading Web Pages with requests

The requests library makes it easy to download web pages. We make a request to the server and get back a response containing the page content.

Python Concept: Functions and Parameters

A function is a piece of code that performs a specific task. requests.get() is a function that downloads web pages.

Parameters are values you pass to a function inside parentheses. In requests.get(url, headers=headers, timeout=30), we’re passing three parameters:

  • url: the web address to download
  • headers=headers: our dictionary of HTTP headers
  • timeout=30: wait at most 30 seconds for a response

Functions often return a value - requests.get() returns a response object containing the webpage.

response = requests.get(url, headers=headers, timeout=30)
print("Status code:", response.status_code)

The timeout=30 parameter ensures our request doesn’t hang indefinitely if the server is slow.

Status codes tell us what happened (a quick defensive check is sketched after this list):

  • 200: Success! We got the page
  • 404: Page not found (usually a URL typo)
  • 500: Server error
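
A defensive habit is to check the status code before moving on, or to let requests raise an error for you. A short sketch reusing the response object from above (raise_for_status() is a built-in method on the response):

# Stop early if the request was not successful
if response.status_code != 200:
    print("Something went wrong: status", response.status_code)

# Equivalent shortcut: raise an exception for any 4xx/5xx response
response.raise_for_status()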

We can access the HTML content as text:

Python Concept: Object Attributes and Methods

In Python, an object is a container that holds both data and functions. The response we got from requests.get() is an object.

Attributes are pieces of data stored in an object. response.status_code is an attribute containing the HTTP status code.

Methods are functions that belong to an object and are called with parentheses, like response.raise_for_status(), which raises an error if the download failed. By contrast, response.text is accessed like an attribute: behind the scenes it decodes the raw webpage bytes into readable text. You use dot notation for both: object.attribute or object.method().

html = response.text
print("HTML length:", len(html), "characters")
print("\nFirst 500 characters:")
print(html[:500])

This raw HTML contains all the information on the page, but it’s not easy to work with directly. We need to parse it into a structure that Python can navigate.

1.5 Parsing HTML with BeautifulSoup

BeautifulSoup converts HTML text into a tree structure that we can search and navigate. Think of it as turning a long string of HTML tags into a map we can explore.

Python Concept: String Slicing

String slicing lets you extract parts of a string using square brackets. In html[:500], the :500 means “from the beginning up to character 500.”

The general format is string[start:end]:

  • text[:100] - first 100 characters
  • text[50:] - from character 50 to the end
  • text[10:20] - characters 10 through 19

Python uses zero-based indexing, so the first character is at position 0, not 1.
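
A tiny self-contained example (the string here is arbitrary):

text = "Beautiful is better than ugly."

print(text[:9])     # first 9 characters: "Beautiful"
print(text[10:])    # from character 10 to the end
print(text[10:12])  # characters 10 and 11: "is"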

soup = BeautifulSoup(html, "html.parser")
print("Type:", type(soup))

Now we can easily find specific elements on the page.

1.5.1 Extracting Basic Information

Let’s start by extracting some basic information to get familiar with BeautifulSoup:

1.5.1.1 Page title

title = soup.find("h1").get_text()
print("Page title:", title)

1.5.1.2 Section headings

headings = []
for heading in soup.find_all(["h2", "h3"]):
    text = heading.get_text().strip()
    # Remove Wikipedia's "[edit]" links
    text = text.replace("[edit]", "").strip()
    if text:  # Skip empty headings
        headings.append(text)

print("First 5 headings:")
for i, heading in enumerate(headings[:5]):
    print(f"{i+1}. {heading}")

1.5.1.3 Opening paragraphs

Python Concept: Conditional Statements

Conditional statements use if to make decisions in your code. The statement if text and len(text) > 50: checks two conditions:

  1. text - is the text not empty?
  2. len(text) > 50 - is the text longer than 50 characters?

The and operator means both conditions must be true. Comparison operators like > (greater than), < (less than), and == (equal to) let you compare values. The len() function returns the length of a string or list.

paragraphs = []
for p in soup.find_all("p"):
    text = p.get_text().strip()
    if text and len(text) > 50:  # Skip very short paragraphs
        paragraphs.append(text)

print("First paragraph:")
print(paragraphs[0][:300] + "...")

These examples show the basic pattern: use find() to get the first matching element, or find_all() to get all matches, then use get_text() to extract the readable content.

Handling Relative URLs

HTML pages often contain two types of links: absolute URLs that start with https:// and relative URLs that start with / or just a page name. Relative URLs need the website’s base address added to work outside the original page. The urljoin() function handles this automatically: it leaves absolute URLs unchanged and properly combines relative URLs with the base address.
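
Links aren’t part of our polling schema, but collecting them is a nice illustration of this pattern, and of urljoin(), which we imported earlier. A minimal sketch that gathers every link on the page into a small DataFrame:

links = []
for a in soup.find_all("a", href=True):
    absolute_url = urljoin(url, a["href"])  # resolve relative links against the page URL
    links.append({"text": a.get_text().strip(), "url": absolute_url})

links_df = pd.DataFrame(links)
print(f"Found {len(links_df)} links")
print(links_df.head())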

1.5.2 Finding Tables

Now let’s find the polling tables. Wikipedia typically marks data tables with the class “wikitable”:

tables = soup.find_all("table", class_="wikitable")
print(f"Found {len(tables)} tables with class 'wikitable'")

Let’s examine what we found by looking at the first few rows of each table:

Python Concept: Exception Handling

Sometimes code encounters errors that would normally crash your program. Exception handling using try and except lets you catch these errors and handle them gracefully.

In try: we put code that might fail. If an error occurs, Python jumps to the except: block instead of crashing. This is essential when parsing web data because HTML can be unpredictable - some tables might be malformed or contain unexpected content.

for i, table in enumerate(tables[:3]):  # Look at first 3 tables
    print(f"\n--- Table {i+1} ---")
    
    # Try to convert to DataFrame
    try:
        df = pd.read_html(StringIO(str(table)))[0]
        print("Shape:", df.shape)
        print("Columns:", list(df.columns))
        print("First row:", df.iloc[0].tolist() if len(df) > 0 else "Empty")
    except Exception as e:
        print("Could not parse as DataFrame:", e)

The pd.read_html() function is a powerful tool that automatically finds and converts HTML tables to DataFrames. We wrap the table in str() to convert it from a BeautifulSoup object to plain text, and in StringIO so pandas treats that text like a file (newer versions of pandas prefer this).

1.5.3 Identifying Polling Tables

Not all tables contain polling data. We need to identify which ones have the columns we want. Let’s look for tables that contain polling-related columns:

Python Concept: List Comprehensions and Boolean Logic

List comprehensions are a compact way to transform lists. [str(col).lower().strip() for col in df.columns] creates a new list by applying str().lower().strip() to each column name.

Boolean logic uses and, or, and not operators:

  • any() returns True if at least one item in a list is True
  • "polling" in col and "firm" in col checks if both words appear in the column name
  • len(df) > 1 ensures the table has actual data rows, not just headers

The and in our if statement means all conditions must be true.

polling_tables = []

for i, table in enumerate(tables):
    try:
        # Convert table to DataFrame using StringIO for proper parsing
        df_list = pd.read_html(StringIO(str(table)), header=0)
        for df in df_list:
            # Check if this looks like a polling table
            columns = [str(col).lower().strip() for col in df.columns]
            
            # Look for key polling columns
            has_polling_firm = any("polling" in col and "firm" in col for col in columns)
            has_sample_size = any("sample" in col for col in columns)
            
            if has_polling_firm and has_sample_size and len(df) > 1:
                print(f"Found polling table {i+1}:")
                print(f"  Shape: {df.shape}")
                print(f"  Columns: {list(df.columns)}")
                
                # Remove duplicate header row if it exists
                if len(df) > 0 and str(df.iloc[0, 0]).strip().lower() == "polling firm":
                    print("  Removing duplicate header row")
                    df = df.iloc[1:].reset_index(drop=True)
                
                polling_tables.append(df)
                print(f"  Added {len(df)} rows of data\n")
                
    except Exception as e:
        print(f"Could not process table {i+1}: {e}")

print(f"Total polling tables found: {len(polling_tables)}")

Why StringIO?

pd.read_html() can accept a URL, a file, or a string of HTML. StringIO wraps our HTML string to make it behave like a file, which ensures pandas parses it correctly. This is a common pattern when working with HTML content that’s already loaded in memory.

Handling Messy HTML Tables

Wikipedia tables can be inconsistent. Sometimes the header row gets parsed as data, or tables contain nested structures. Our code checks for these common issues and handles them automatically. This kind of defensive checking is normal in web scraping where you can’t control the source format.

1.5.4 Combining the Data

If we found multiple polling tables, we can combine them into a single DataFrame:

if polling_tables:
    # Combine all polling tables
    combined_data = pd.concat(polling_tables, ignore_index=True)
    print("Combined data shape:", combined_data.shape)
    print("\nColumns:", list(combined_data.columns))
    print("\nFirst few rows:")
    print(combined_data.head())
else:
    print("No polling tables found!")

1.6 Saving Our Data

Now that we’ve extracted the data, we should save both the raw HTML and the extracted data. This follows the principle of preserving your original sources while creating clean, reusable datasets.

Save the raw HTML:

Python Concept: File Operations and F-strings

File operations in Python use the with open() pattern. This safely opens a file, lets you work with it, and automatically closes it when done - even if an error occurs.

The "w" parameter means “write mode” (create/overwrite the file). The encoding="utf-8" ensures special characters display correctly.

F-strings let you insert variables into text. f"Found {len(links_df)} links" puts the value of len(links_df) into the string. The f before the quotes makes it an f-string, and anything in {curly braces} gets evaluated and inserted.
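
One practical wrinkle before writing: open() will not create missing folders, so the nested data/ directories need to exist first. A small sketch using Python’s built-in pathlib (assuming the folder layout used below):

from pathlib import Path

# Create the output folders if they don't already exist
Path("data/source/scraped-wikipedia").mkdir(parents=True, exist_ok=True)
Path("data/processed/scraped-wikipedia").mkdir(parents=True, exist_ok=True)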

with open(
    "data/source/scraped-wikipedia/polling_page.html", "w", encoding="utf-8"
) as f:
    f.write(html)

Save the extracted data:

if polling_tables:
    combined_data.to_csv(
        "data/processed/scraped-wikipedia/raw_polling_data.csv", index=False
    )

    print(f"\nData summary:")
    print(f"- {len(combined_data)} polling records")
    print(f"- {len(combined_data.columns)} columns")
    if "Last date of polling" in combined_data.columns:
        non_null_dates = combined_data["Last date of polling"].notna().sum()
        print(f"- {non_null_dates} records with valid dates")

1.7 What We’ve Accomplished

This chapter implemented the Acquire phase of Alexander’s storytelling framework through a complete web scraping workflow:

Planning: We sketched our target schema before writing code, identifying exactly which fields we needed and what data types to expect.

Acquiring: We respectfully requested Wikipedia’s polling page, parsed its HTML structure, and extracted table data while following ethical scraping practices.

Preserving: We saved both raw HTML and extracted tables. This separation (source data exactly as found, processed data ready for analysis) supports reproducibility and transparent research.

The data needs cleaning (percentages include “%” symbols, sample sizes have commas, dates are text strings), but that’s expected and appropriate. Chapter 2 will transform this raw extraction into analysis-ready form while maintaining our non-destructive workflow.

1.7.1 Key Principles Demonstrated

  • Respect for sources: User-Agent headers, rate limiting, saving originals
  • Systematic extraction: Planning schema before coding
  • Transparency: Every step documented and reproducible
  • Separation of concerns: Collection preserves sources; processing transforms them

1.8 Looking Ahead

We now have polling data extracted from Wikipedia, but it’s still quite messy. The data types are all text, there are footnote markers mixed in with numbers, and some formatting needs to be cleaned up.

In the next chapter, we’ll process this source data, transforming it into a clean, analysis-ready dataset. We’ll convert text to appropriate data types, handle missing values, create visualizations, and build a complete data processing pipeline. We’ll also learn about reshaping data and creating publication-quality plots.

The separation between data collection (this chapter) and data processing (next chapter) is intentional and reflects how real quantitative social science projects work. This approach makes it easier to debug problems and allows you to refine your cleaning process without re-scraping the website-following our principle of preserving original sources while building clean, reusable datasets.