import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from io import StringIO

By the end of this chapter you will be able to:
- Explain what web scraping is and when it’s appropriate
- Practice collecting data from a real web page in a careful, reproducible way
- Load a page with requests
- Turn its HTML into a structure Python can navigate with BeautifulSoup
- Extract the page title, headings, paragraphs, links, and table data
- Save your results as tidy, reusable files (CSV) for later analysis
Quantitative social scientists often need data that doesn’t exist as a convenient download. Wikipedia’s opinion polling pages contain structured tables tracking campaign dynamics, but no “Export to CSV” button. Web scraping lets us transform these HTML pages into analysis-ready datasets.
This chapter demonstrates the full Acquire phase of Alexander’s framework: planning what we need, understanding the source structure, extracting data programmatically, and saving both raw and processed versions for reproducible research. We’ll work with Canadian federal election polling data, creating a dataset suitable for analyzing campaign dynamics.
1.1 Telling Stories with Data
Before diving into code, let’s understand where web scraping fits in the broader quantitative social science workflow. In the Introduction, you learned about Alexander’s framework: Plan → Simulate → Acquire → Explore/Analyze → Share. This chapter lives primarily in the Acquire phase, but we’ll touch on all five steps as we work through our polling data project.
Planning gives direction to what we collect and helps us avoid gathering irrelevant data. Quick simulation or prototyping helps us spot problems before they cost time. A bit of exploration tells us whether we actually got what we thought we were getting. Saving results in clear, well-organized formats makes later sharing easier, including for your future self.
1.1.1 Planning Our Data Collection
Before writing any code, let’s apply the first step of Alexander’s framework: Plan. We’re going to collect polling data from Wikipedia’s page on the 2025 Canadian federal election. Rather than diving straight into code, let’s sketch our endpoint-what we want our final dataset to look like. This planning step helps us stay focused and make better decisions as we work.
A schema is a blueprint for your data table: it names each column, states its data type, and notes any constraints you expect. Planning your schema upfront keeps your data collection focused and gives you a target to check against. Here’s our simple schema:
| Field | Type | Purpose |
|---|---|---|
| Polling firm | string | Organization that conducted the poll |
| Last date of polling | datetime | When the poll was completed |
| Sample size | string | Number of people surveyed (we’ll clean this later) |
| CPC | string | Conservative Party support percentage |
| LPC | string | Liberal Party support percentage |
| NDP | string | New Democratic Party support percentage |
| BQ | string | Bloc Québécois support percentage |
| Other parties | string | Support for other parties |
This schema serves as our “simulation” step in Alexander’s framework. While we’re not creating fake data here, we’re doing something pretty close: being very specific about what we expect our real data to look like. This helps us write better scraping code and catch problems early.
Notice that we’re keeping this simple. We’ll extract the data as text first, then clean and convert data types in the next chapter. This two-step approach-extract first, clean later-is common in quantitative social science and makes debugging easier.
Writing this schema first keeps your cleaning steps honest and gives you a target to check against once you extract the data.
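If you want to make the simulation step concrete, a minimal sketch is to build a single made-up row that matches this schema and keep the expected column names around as a checklist. The values below are invented for illustration only:
import pandas as pd

# The column names we expect to see after extraction (from the schema above)
expected_columns = [
    "Polling firm", "Last date of polling", "Sample size",
    "CPC", "LPC", "NDP", "BQ", "Other parties",
]

# One invented poll that looks the way we expect real rows to look
fake_poll = pd.DataFrame(
    [["Example Research", "March 1, 2025", "1,500", "38%", "40%", "11%", "6%", "5%"]],
    columns=expected_columns,
)
print(fake_poll)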
1.2 What is Web Scraping?
Web scraping is the process of programmatically extracting data from websites. Sometimes websites provide data directly as a download or through an API (Application Programming Interface), which is an official, structured way for computers to request specific data from a website. When an API or download exists, prefer it, because it is usually more stable than scraping. When no such option is available, we can still collect the information by parsing the HTML (the markup language that describes web pages) that makes up the site.
A typical scraping workflow has four steps:
- Request a webpage from a server
- Parse the returned HTML to understand its structure
- Extract the specific information you need
- Save the results in a structured format for analysis
1.2.1 Responsible Web Scraping
Responsible scraping blends technical etiquette and research ethics. Here are the key principles:
Prefer official sources: Use APIs or direct downloads when available-they’re more reliable and faster than scraping.
Respect website policies: Read the site’s robots.txt file (found at /robots.txt on most sites) and terms of use to understand what the site expects of automated visitors.
Be gentle: Add delays between requests and avoid overwhelming servers with rapid requests. Save what you fetch so you don’t need to ask again.
Protect privacy: Treat anything that could be personally identifying with care. Aggregate data where possible, and remember it’s easier than you might think to re-identify individuals from seemingly harmless details.
Be transparent: Don’t redistribute copyrighted content. Instead, share the code you wrote to collect and clean the data so others can reproduce your steps.
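As a small, optional sketch of these habits in code, we can check a site’s robots.txt with Python’s built-in urllib.robotparser and pause between requests with time.sleep(). The user-agent name below is just our course example:
from urllib import robotparser
import time

# A quick courtesy check of what Wikipedia's robots.txt allows for our user agent
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

page = "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2025_Canadian_federal_election"
print("Allowed to fetch?", rp.can_fetch("SOCI3040-Student", page))

# If we were fetching several pages, a short pause between requests keeps us gentle
time.sleep(1)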
Common problems you may run into while scraping:
- Getting blocked (403/429 errors): Add proper User-Agent headers and delays between requests.
- Empty results: Check if the site loads content with JavaScript (try viewing source vs. inspecting elements). BeautifulSoup only sees the initial HTML.
- Changing structure: Websites update their layout. Save raw HTML so you can debug without re-scraping.
- Rate limiting: If you get timeouts, add time.sleep(1) between requests to slow down your scraper.
For our Wikipedia example, scraping is appropriate because the data is publicly available and we’re using it for educational purposes.
1.2.2 Understanding HTML Structure
Web pages use HTML tags to structure content. Tables use <table>, rows use <tr>, cells use <td>. BeautifulSoup parses this hierarchical structure so we can navigate it programmatically.
The key insight: HTML is a tree of nested tags. We don’t need to memorize every tag-when you need to understand a page’s structure, right-click → Inspect Element to see the underlying HTML.
For our Wikipedia polling table, the structure looks like:
<table class="wikitable">
<tr>
<th>Polling firm</th>
<th>CPC</th>
<th>LPC</th>
</tr>
<tr>
<td>Polling Company</td>
<td>35%</td>
<td>32%</td>
</tr>
</table>

BeautifulSoup lets us find elements by tag name (<table>) or attributes (class="wikitable") without writing complex parsing code.
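As a quick, self-contained illustration, we can search that same snippet with BeautifulSoup the way we will later search the real page. The html_snippet string below simply repeats the example table above:
from bs4 import BeautifulSoup

html_snippet = """
<table class="wikitable">
  <tr><th>Polling firm</th><th>CPC</th><th>LPC</th></tr>
  <tr><td>Polling Company</td><td>35%</td><td>32%</td></tr>
</table>
"""

snippet_soup = BeautifulSoup(html_snippet, "html.parser")
table = snippet_soup.find("table", class_="wikitable")   # find by tag name and attribute
print([th.get_text() for th in table.find_all("th")])    # ['Polling firm', 'CPC', 'LPC']
print([td.get_text() for td in table.find_all("td")])    # ['Polling Company', '35%', '32%']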
1.3 Setting Up for Scraping
We’ll use three Python packages for web scraping:
- requests: Downloads web pages
- BeautifulSoup: Parses HTML into a navigable structure
- pandas: Converts extracted data into DataFrames
As you learned in the Introduction, packages are collections of pre-written Python code that solve common problems. The import statement brings a package into your program so you can use its functions.
A package is a collection of pre-written Python code that solves common problems. Instead of writing web scraping code from scratch, we can use the requests package that others have already built and tested.
The import statement brings a package into your program. When you write import requests, you’re telling Python to load the requests package so you can use its functions. This is like adding tools to your toolbox-once imported, you can use requests.get() to download web pages.
For consistency and reproducibility, we’ll use a specific version of the Wikipedia page. Wikipedia pages change over time, so using a permanent link ensures everyone gets the same results.
As you learned in the Introduction, a variable is a name that stores a value so you can refer to it later.
A variable is a name that stores a value. When you write url = "https://...", you’re creating a variable named url and storing a text string in it.
Strings are sequences of characters enclosed in quotes. You can use single quotes ('text') or double quotes ("text") - they work the same way. The long URL below is just a string that we can refer to later using the variable name url.
url = "https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_2025_Canadian_federal_election&oldid=1306495612"We also need to set a User-Agent header, which tells the website what kind of program is requesting the page. We’ll do by defining a dictionary, which you may remember from the Introduction is a data structure that stores key-value pairs:
headers = {
"User-Agent": "Mozilla/5.0 (compatible; SOCI3040-Student/1.0; +https://www.mun.ca/)"
}Many websites block requests that don’t include a User-Agent header because they look like anonymous bots. Including a clear User-Agent string signals that we’re behaving responsibly. The string above identifies our request as coming from a course project, which is honest and transparent.
A dictionary is a way to store key-value pairs. You create one using curly braces {}. In {"User-Agent": "Mozilla/5.0..."}, the key is “User-Agent” and the value is the string identifying our program.
Dictionaries are like lookup tables - you can ask for the value associated with any key. They’re perfect for storing configuration settings like HTTP headers.
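To see the lookup in action, here is a tiny self-contained example. The request_headers name is just for illustration; it mirrors the headers dictionary we defined above:
# Store a key-value pair, then ask for the value by its key
request_headers = {"User-Agent": "Mozilla/5.0 (compatible; SOCI3040-Student/1.0; +https://www.mun.ca/)"}
print(request_headers["User-Agent"])  # prints the string identifying our program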
1.4 Loading Web Pages with requests
The requests library makes it easy to download web pages. We make a request to the server and get back a response containing the page content.
A function is a piece of code that performs a specific task. requests.get() is a function that downloads web pages.
Parameters are values you pass to a function inside parentheses. In requests.get(url, headers=headers, timeout=30), we’re passing three parameters:
- url: the web address to download
- headers=headers: our dictionary of HTTP headers
- timeout=30: wait at most 30 seconds for a response
Functions often return a value - requests.get() returns a response object containing the webpage.
response = requests.get(url, headers=headers, timeout=30)
print("Status code:", response.status_code)The timeout=30 parameter ensures our request doesn’t hang indefinitely if the server is slow.
Status codes tell us what happened:
- 200: Success! We got the page
- 404: Page not found (usually a URL typo)
- 500: Server error
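One optional way to act on the status code, sketched below, is to proceed on 200 and otherwise let requests raise a descriptive error with its raise_for_status() method:
# Stop early if the request failed; raise_for_status() raises an error for 4xx/5xx codes
if response.status_code == 200:
    print("Success - ready to parse the page")
else:
    response.raise_for_status()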
We can access the HTML content as text:
In Python, an object is a container that holds both data and functions. The response we got from requests.get() is an object.
Attributes are pieces of data stored in an object. response.status_code is an attribute containing the HTTP status code.
Methods are functions that belong to an object; you call them with parentheses, like response.raise_for_status(), which raises an error if the request failed. (response.text looks like a plain attribute, but behind the scenes it decodes the raw webpage bytes into readable text.) You access both using dot notation: object.attribute or object.method().
html = response.text
print("HTML length:", len(html), "characters")
print("\nFirst 500 characters:")
print(html[:500])

This raw HTML contains all the information on the page, but it’s not easy to work with directly. We need to parse it into a structure that Python can navigate.
1.5 Parsing HTML with BeautifulSoup
BeautifulSoup converts HTML text into a tree structure that we can search and navigate. Think of it as turning a long string of HTML tags into a map we can explore.
String slicing lets you extract parts of a string using square brackets. In html[:500], the :500 means “from the beginning up to character 500.”
The general format is string[start:end]:
- text[:100] - first 100 characters
- text[50:] - from character 50 to the end
- text[10:20] - characters 10 through 19
Python uses zero-based indexing, so the first character is at position 0, not 1.
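Here is a quick demonstration on a short string (the example text is just the page title written out):
text = "Opinion polling for the 2025 Canadian federal election"
print(text[:7])     # 'Opinion'  - characters 0 through 6
print(text[8:15])   # 'polling'  - characters 8 through 14
print(len(text))    # 54 characters in total
With slicing in hand, let’s parse the full page we downloaded: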
soup = BeautifulSoup(html, "html.parser")
print("Type:", type(soup))Now we can easily find specific elements on the page.
1.5.1 Extracting Basic Information
Let’s start by extracting some basic information to get familiar with BeautifulSoup:
1.5.1.1 Page title
title = soup.find("h1").get_text()
print("Page title:", title)1.5.1.2 Section headings
headings = []
for heading in soup.find_all(["h2", "h3"]):
    text = heading.get_text().strip()
    # Remove Wikipedia's "[edit]" links
    text = text.replace("[edit]", "").strip()
    if text:  # Skip empty headings
        headings.append(text)

print("First 5 headings:")
for i, heading in enumerate(headings[:5]):
    print(f"{i+1}. {heading}")

1.5.1.3 Opening paragraphs
Conditional statements use if to make decisions in your code. The statement if text and len(text) > 50: checks two conditions:
- text - is the text not empty?
- len(text) > 50 - is the text longer than 50 characters?
The and operator means both conditions must be true. Comparison operators like > (greater than), < (less than), and == (equal to) let you compare values. The len() function returns the length of a string or list.
paragraphs = []
for p in soup.find_all("p"):
    text = p.get_text().strip()
    if text and len(text) > 50:  # Skip very short paragraphs
        paragraphs.append(text)

print("First paragraph:")
print(paragraphs[0][:300] + "...")

These examples show the basic pattern: use find() to get the first matching element, or find_all() to get all matches, then use get_text() to extract the readable content.
HTML pages often contain two types of links: absolute URLs that start with https:// and relative URLs that start with / or just a page name. Relative URLs need the website’s base address added to work outside the original page. The urljoin() function handles this automatically-it leaves absolute URLs unchanged and properly combines relative URLs with the base address.
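A quick illustration of that behaviour (the example links are made up, but the base address is the real article URL):
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2025_Canadian_federal_election"
# A relative URL gets combined with the base address
print(urljoin(base, "/wiki/Liberal_Party_of_Canada"))
# -> https://en.wikipedia.org/wiki/Liberal_Party_of_Canada
# An absolute URL passes through unchanged
print(urljoin(base, "https://www.example.com/report.pdf"))
# -> https://www.example.com/report.pdf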
1.5.2 Finding Tables
Now let’s find the polling tables. Wikipedia typically marks data tables with the class “wikitable”:
tables = soup.find_all("table", class_="wikitable")
print(f"Found {len(tables)} tables with class 'wikitable'")Let’s examine what we found by looking at the first few rows of each table:
Sometimes code encounters errors that would normally crash your program. Exception handling using try and except lets you catch these errors and handle them gracefully.
In try: we put code that might fail. If an error occurs, Python jumps to the except: block instead of crashing. This is essential when parsing web data because HTML can be unpredictable - some tables might be malformed or contain unexpected content.
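Here is a tiny standalone example of the pattern before we apply it to real tables (the error is deliberate):
try:
    result = 10 / 0          # this line raises a ZeroDivisionError
except ZeroDivisionError as e:
    print("Caught an error instead of crashing:", e)
Now back to our tables: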
for i, table in enumerate(tables[:3]): # Look at first 3 tables
print(f"\n--- Table {i+1} ---")
# Try to convert to DataFrame
try:
df = pd.read_html(str(table))[0]
print("Shape:", df.shape)
print("Columns:", list(df.columns))
print("First row:", df.iloc[0].tolist() if len(df) > 0 else "Empty")
except Exception as e:
print("Could not parse as DataFrame:", e)The pd.read_html() function is a powerful tool that automatically finds and converts HTML tables to DataFrames. We wrap the table in str() to convert it from a BeautifulSoup object to text that pandas can process.
1.5.3 Identifying Polling Tables
Not all tables contain polling data. We need to identify which ones have the columns we want. Let’s look for tables that contain polling-related columns:
List comprehensions are a compact way to transform lists. [str(col).lower().strip() for col in df.columns] creates a new list by applying str().lower().strip() to each column name.
Boolean logic uses and, or, and not operators:
- any() returns True if at least one item in a list is True
- "polling" in col and "firm" in col checks if both words appear in the column name
- len(df) > 1 ensures the table has actual data rows, not just headers
The and in our if statement means all conditions must be true.
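A small standalone illustration of the same checks on made-up column names:
columns = ["Polling firm ", "Sample size", "CPC"]
cleaned = [str(col).lower().strip() for col in columns]   # ['polling firm', 'sample size', 'cpc']
print(any("polling" in col and "firm" in col for col in cleaned))   # True
print(any("date" in col for col in cleaned))                        # False
Now let’s run those checks on the real tables: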
polling_tables = []
for i, table in enumerate(tables):
    try:
        # Convert table to DataFrame using StringIO for proper parsing
        df_list = pd.read_html(StringIO(str(table)), header=0)
        for df in df_list:
            # Check if this looks like a polling table
            columns = [str(col).lower().strip() for col in df.columns]
            # Look for key polling columns
            has_polling_firm = any("polling" in col and "firm" in col for col in columns)
            has_sample_size = any("sample" in col for col in columns)
            if has_polling_firm and has_sample_size and len(df) > 1:
                print(f"Found polling table {i+1}:")
                print(f"  Shape: {df.shape}")
                print(f"  Columns: {list(df.columns)}")
                # Remove duplicate header row if it exists
                if len(df) > 0 and str(df.iloc[0, 0]).strip().lower() == "polling firm":
                    print("  Removing duplicate header row")
                    df = df.iloc[1:].reset_index(drop=True)
                polling_tables.append(df)
                print(f"  Added {len(df)} rows of data\n")
    except Exception as e:
        print(f"Could not process table {i+1}: {e}")

print(f"Total polling tables found: {len(polling_tables)}")

pd.read_html() can accept a URL, a file, or a string of HTML. StringIO wraps our HTML string to make it behave like a file, which ensures pandas parses it correctly. This is a common pattern when working with HTML content that’s already loaded in memory.
Wikipedia tables can be inconsistent. Sometimes the header row gets parsed as data, or tables contain nested structures. Our code checks for these common issues and handles them automatically. This kind of defensive checking is normal in web scraping where you can’t control the source format.
1.5.4 Combining the Data
If we found multiple polling tables, we can combine them into a single DataFrame:
if polling_tables:
    # Combine all polling tables
    combined_data = pd.concat(polling_tables, ignore_index=True)
    print("Combined data shape:", combined_data.shape)
    print("\nColumns:", list(combined_data.columns))
    print("\nFirst few rows:")
    print(combined_data.head())
else:
    print("No polling tables found!")

1.6 Saving Our Data
Now that we’ve extracted the data, we should save both the raw HTML and the extracted data. This follows the principle of preserving your original sources while creating clean, reusable datasets.
Save the raw HTML:
File operations in Python use the with open() pattern. This safely opens a file, lets you work with it, and automatically closes it when done - even if an error occurs.
The "w" parameter means “write mode” (create/overwrite the file). The encoding="utf-8" ensures special characters display correctly.
F-strings let you insert variables into text. f"Found {len(tables)} tables" puts the value of len(tables) into the string. The f before the quotes makes it an f-string, and anything in {curly braces} gets evaluated and inserted.
with open(
"data/source/scraped-wikipedia/polling_page.html", "w", encoding="utf-8"
) as f:
f.write(html)1.6.0.1 Save the extracted data
if polling_tables:
    combined_data.to_csv(
        "data/processed/scraped-wikipedia/raw_polling_data.csv", index=False
    )
    print(f"\nData summary:")
    print(f"- {len(combined_data)} polling records")
    print(f"- {len(combined_data.columns)} columns")
    if "Last date of polling" in combined_data.columns:
        non_null_dates = combined_data["Last date of polling"].notna().sum()
        print(f"- {non_null_dates} records with valid dates")

1.7 What We’ve Accomplished
This chapter implemented the Acquire phase of Alexander’s storytelling framework through a complete web scraping workflow:
Planning: We sketched our target schema before writing code, identifying exactly which fields we needed and what data types to expect.
Acquiring: We respectfully requested Wikipedia’s polling page, parsed its HTML structure, and extracted table data while following ethical scraping practices.
Preserving: We saved both raw HTML and extracted tables. This separation-source data exactly as found, processed data ready for analysis-supports reproducibility and transparent research.
The data needs cleaning (percentages include “%” symbols, sample sizes have commas, dates are text strings), but that’s expected and appropriate. Chapter 2 will transform this raw extraction into analysis-ready form while maintaining our non-destructive workflow.
1.7.1 Key Principles Demonstrated
- Respect for sources: User-Agent headers, rate limiting, saving originals
- Systematic extraction: Planning schema before coding
- Transparency: Every step documented and reproducible
- Separation of concerns: Collection preserves sources; processing transforms them
1.8 Looking Ahead
We now have polling data extracted from Wikipedia, but it’s still quite messy. The data types are all text, there are footnote markers mixed in with numbers, and some formatting needs to be cleaned up.
In the next chapter, we’ll process this source data, transforming it into a clean, analysis-ready dataset. We’ll convert text to appropriate data types, handle missing values, create visualizations, and build a complete data processing pipeline. We’ll also learn about reshaping data and creating publication-quality plots.
The separation between data collection (this chapter) and data processing (next chapter) is intentional and reflects how real quantitative social science projects work. This approach makes it easier to debug problems and allows you to refine your cleaning process without re-scraping the website-following our principle of preserving original sources while building clean, reusable datasets.
