This tutorial covers HTML fundamentals and how to collect, parse, and process web data using Python:
- Introduction to HTML Structure
- Setting Up Your Python Environment
- Extracting Complete HTML from a Webpage
- Targeting Specific HTML Elements
- Interacting with Page Elements
- Saving Extracted Data to CSV
- Final Thoughts
Want a comprehensive Python web scraping resource? Check out our detailed guide.
Before diving into scraping, it's important to understand the building blocks of HTML.
HTML consists of tags that define a webpage's structure and components. For instance, <h1>Text</h1> defines heading text, while <a href="">link</a> creates a hyperlink.
HTML attributes provide additional information about an element. For example, the href attribute in an <a></a> tag specifies the link's destination URL.
Classes and IDs are crucial attributes for precisely identifying page elements. Classes group similar elements so they can be styled consistently with CSS or manipulated via JavaScript. In selectors, a class is referenced as .class-name.
On W3Schools, class groupings appear like this:
<div class="city">
<h2>London</h2>
<p>London is the capital of England.</p>
</div>
<div class="city">
<h2>Paris</h2>
<p>Paris is the capital of France.</p>
</div>
<div class="city">
<h2>Tokyo</h2>
<p>Tokyo is the capital of Japan.</p>
</div>
Notice how each city's heading and description are enclosed in a div sharing the same city class.
In contrast, IDs must be unique to individual elements (no two elements can share an ID). For example, these H1 elements have unique IDs for individual styling/manipulation:
<h1 id="header1">Hello World!</h1>
<h1 id="header2">Lorem Ipsum Dolor</h1>
The syntax for targeting ID elements is #id-name.
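Later in this guide, selectors like these are passed to Playwright in Python. As a quick preview, here's a minimal sketch that loads the snippets above into a Playwright page and counts the matches; Playwright is only installed in the next section, so you can revisit this after setup:
from playwright.sync_api import sync_playwright

html = """
<div class="city"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="city"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<h1 id="header1">Hello World!</h1>
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.set_content(html)                   # load the markup directly, no URL needed
    print(page.locator(".city").count())     # class selector -> 2
    print(page.locator("#header1").count())  # ID selector -> 1
    browser.close()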
This guide uses Python because of its many HTML scraping libraries and beginner-friendly syntax. To check whether Python is installed, run this command in PowerShell (Windows) or Terminal (macOS):
python3 --version
If Python is installed, the command prints the version number; otherwise, you'll get an error message. If needed, download and install Python.
Next, create a folder named WebScraper and, inside it, a file called scraper.py. Open this file in your preferred integrated development environment (IDE). We'll use Visual Studio Code in this guide:
An IDE is a comprehensive tool that enables developers to write code, debug, test programs, create automations, and more. You'll use it to develop your HTML scraper.
Now, isolate your global Python installation from your scraping project by creating a virtual environment. This prevents dependency conflicts and keeps your project organized.
Install the virtualenv library with this command (on Python 3.3+, you can skip this step, since the built-in venv module used below ships with Python):
pip3 install virtualenv
Navigate to your project folder:
cd WebScraper
Create a virtual environment:
python<version> -m venv <virtual-environment-name>
For example: python3 -m venv env
This creates a directory for all packages and scripts within your project folder:
Now activate your virtual environment using the appropriate command for your system:
source <virtual-environment-name>/bin/activate      # On macOS and Linux
<virtual-environment-name>\Scripts\activate.bat     # In CMD
<virtual-environment-name>\Scripts\Activate.ps1     # In PowerShell
When activated successfully, your virtual environment name will appear on the left side of your terminal:
With your virtual environment active, install a web scraping library. Options include Playwright, Selenium, Beautiful Soup, and Scrapy. For this tutorial, we'll use Playwright because it's user-friendly, supports multiple browsers, handles dynamic content, and offers headless mode (scraping without a GUI).
Run pip install pytest-playwright to install Playwright, then install the required browsers with playwright install.
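Before moving on, you can optionally confirm that Playwright and its browsers installed correctly. Here's a minimal sanity check (the filename and target URL are arbitrary choices, not part of the tutorial's project):
# check_playwright.py - optional sanity check
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # headless: no browser window opens
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()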
Now you're ready to start web scraping.
The first step in any scraping project is selecting your target website. For this tutorial, we'll use this e-commerce test site.
Next, identify what information you want to extract. Initially, we'll capture the entire HTML content of the page.
After identifying your scraping target, start coding your scraper. In Python, begin by importing the necessary Playwright libraries. Playwright offers two API types: sync and async. Since we're not writing asynchronous code, import the sync library:
from playwright.sync_api import sync_playwright
After importing the sync library, define a Python function:
def main():
    # The rest of the code will go inside this function
All your web scraping code will reside within this function.
Typically, to access a website's information, you open a browser, create a tab, and visit the site. For scraping, translate these actions into code using Playwright. According to their documentation, you can call the imported sync_playwright function and launch a browser:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
Setting headless=False allows you to see the browser content during execution.
After launching the browser, open a new tab and navigate to your target URL:
page = browser.new_page()
try:
    page.goto("https://webscraper.io/test-sites/e-commerce/static")
except:
    print("Error")
Note: Add these lines below the previous browser launch code. All this code belongs inside the main function in a single file.
This code wraps the goto() function in a try-except block for better error handling.
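The bare except catches any failure. If you'd rather catch Playwright's navigation timeout specifically, a hedged variant (not required for this tutorial) looks like this:
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

# ...inside main(), after creating the page:
try:
    page.goto("https://webscraper.io/test-sites/e-commerce/static", timeout=30000)  # 30-second limit
except PlaywrightTimeoutError:
    print("The page took too long to load")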
Just as you wait for a site to load when entering a URL, add a waiting period to your code:
page.wait_for_timeout(7000)  # wait time in milliseconds
Note: Add this line below the previous code.
Finally, extract the complete HTML content from the page:
print(page.content())
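If you'd rather keep the markup for later inspection instead of printing it, you could write it to a file; a small optional sketch (the page.html filename is arbitrary):
html = page.content()
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)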
The complete HTML extraction code looks like this:
from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static")
        except:
            print("Error")
        page.wait_for_timeout(7000)
        print(page.content())

main()
In Visual Studio Code, the extracted HTML appears like this:
While extracting an entire webpage is possible, web scraping becomes truly valuable when you focus on specific information. In this section, we'll extract only the laptop titles from the website's first page:
To extract specific elements, understand the website's structure first. Right-click and select Inspect on the page:
Alternatively, use these keyboard shortcuts:
- macOS: Cmd + Option + I
- Windows: Control + Shift + C
Here's the structure of our target page:
You can examine specific page elements using the selection tool in the top-left corner of the Inspect window:
Select one of the laptop titles in the Inspect window:
You can see that each title is contained in an <a> tag wrapped in an <h4> tag, and that the link carries a title class. So we need to look for <a> tags with the title class inside <h4> tags.
To create a targeted scraping program, import the required libraries, create a Python function, launch the browser, and navigate to the laptop page:
from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static/computers/laptops")
        except:
            print("Error")
        page.wait_for_timeout(7000)
Note that we've updated the URL in the page.goto() function to point directly to the laptop listing page.
Now locate the target elements based on your structure analysis. Playwright provides locators to find elements using various attributes:
- get_by_label() finds elements by their associated label
- get_by_text() finds elements containing specific text
- get_by_alt_text() finds images by their alt text
- get_by_test_id() finds elements by their test ID
See the official documentation for more element location methods.
To extract all laptop titles, locate the <h4> tags that contain them. Use the get_by_role() locator, which finds elements by their function (buttons, checkboxes, headings, etc.):
titles = page.get_by_role("heading").all()
Print the results to your console:
print(titles)
The output shows an array of elements:
This output doesn't show the titles directly; it references the elements matching our criteria. We need to loop through these elements to find the <a> tags with the title class and extract their text.
Use the CSS locator to find elements by path and class, and the all_inner_texts() function to extract their text:
for title in titles:
    laptop = title.locator("a.title").all_inner_texts()
Running this code produces output like:
To filter out the empty lists, add this inside the loop:
    if len(laptop) == 1:
        print(laptop[0])
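As an aside, the same titles can be collected in a single pass with a combined CSS selector instead of looping over every heading. This is an alternative sketch, not the approach used in the rest of this tutorial:
# Grab every <a class="title"> that sits inside an <h4> with one locator call
laptop_titles = page.locator("h4 > a.title").all_inner_texts()
for name in laptop_titles:
    print(name)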
Here's the complete code for this specific element scraper:
from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static/computers/laptops")
        except:
            print("Error")
        page.wait_for_timeout(7000)
        titles = page.get_by_role("heading").all()
        for title in titles:
            laptop = title.locator("a.title").all_inner_texts()
            if len(laptop) == 1:
                print(laptop[0])

main()
Let's enhance our scraper to extract titles from multiple pages. We'll scrape titles from the first laptop page, navigate to the second page, and extract those titles as well.
Since we already know how to extract titles, we just need to learn how to navigate to the next page.
The website has pagination buttons at the bottom. We need to locate and click the "2" button programmatically. Inspecting the page reveals that this element is a list item (an <li> tag) with the text "2":
We can use the get_by_role() selector to find a list item and the get_by_text() selector to find text containing "2":
page.get_by_role("listitem").get_by_text("2", exact=True)
This finds an element matching both conditions: it must be a list item and have exactly "2" as its text.
The exact=True parameter ensures the element's text is exactly "2", not merely text that contains "2".
To click this button, modify the code:
page.get_by_role("listitem").get_by_text("2", exact=True).click()
The click() function performs a click on the matched element.
Now wait for the page to load and extract titles again:
page.wait_for_timeout(5000)
titles = page.get_by_role("heading").all()
for title in titles:
    laptop = title.locator("a.title").all_inner_texts()
    if len(laptop) == 1:
        print(laptop[0])
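If you prefer not to hard-code sleep times, Playwright can also wait for network activity to settle. This is an optional substitute for the fixed timeouts used here (the test site is static, so either approach works):
page.wait_for_load_state("networkidle")  # resolves once there are no network requests for 500 ms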
Your complete multi-page scraper code should look like this:
from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static/computers/laptops")
        except:
            print("Error")
        page.wait_for_timeout(7000)
        titles = page.get_by_role("heading").all()
        for title in titles:
            laptop = title.locator("a.title").all_inner_texts()
            if len(laptop) == 1:
                print(laptop[0])
        page.get_by_role("listitem").get_by_text("2", exact=True).click()
        page.wait_for_timeout(5000)
        titles = page.get_by_role("heading").all()
        for title in titles:
            laptop = title.locator("a.title").all_inner_texts()
            if len(laptop) == 1:
                print(laptop[0])

main()
Scraped data needs to be stored and analyzed to be useful. Now we'll create an advanced program that asks users how many laptop pages to scrape, extracts the titles, and saves them to a CSV file.
First, import the CSV library:
import csv
Next, determine how to visit multiple pages based on user input.
Looking at the website's URL structure, we notice that each laptop page uses a URL parameter. For example, the second page URL is https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=2.
We can navigate to different pages by changing the value of the page parameter (?page=2, ?page=3, and so on). Ask the user how many pages to scrape:
pages = int(input("enter the number of pages to scrape: "))
To visit each page from 1 to the user-specified number, use a for loop:
for i in range(1, pages+1):
The range() function starts at the first value (1) and stops before the second value (pages+1). For example, range(1, 5) loops from 1 to 4.
Now visit each page by using the loop variable i as the URL parameter. We can insert variables into strings using Python f-strings.
An f-string uses an f prefix before the quotation marks and allows variable insertion with curly brackets:
print(f"The value of the variable is {variable_name_goes_here}")
For our scraper, use f-strings in the navigation code:
try:
    page.goto(f"https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={i}")
except:
    print("Error")
Wait for the page to load and extract titles:
page.wait_for_timeout(7000)
titles = page.get_by_role("heading").all()
Next, open a CSV file, loop through each title, extract the text, and write it to the file.
Open a CSV file with:
with open("laptops.csv", "a") as csvfile:
We're opening laptops.csv in append mode (a), so new data is added without erasing existing data; if the file doesn't exist, it is created. The newline="" argument prevents blank rows between entries on Windows. Python's open() function supports several file modes (a small header-row sketch follows the list):
- r: Default mode, opens file for reading only
- w: Opens file for writing, overwrites existing data
- a: Opens file for appending, preserves existing data
- r+: Opens file for both reading and writing
- x: Creates a new file, failing if it already exists
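Because append mode keeps adding to the same file on every run, you may want a header row only when the file is first created. Here's a small optional sketch; the laptop_title column name is made up for illustration:
import csv
import os

write_header = not os.path.exists("laptops.csv")  # True only before the file exists
with open("laptops.csv", "a", newline="") as csvfile:
    writer = csv.writer(csvfile)
    if write_header:
        writer.writerow(["laptop_title"])  # one-time header row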
Next, create a writer object to manipulate the CSV file:
writer = csv.writer(csvfile)
Loop through each title element and extract the text:
for title in titles:
    laptop = title.locator("a.title").all_inner_texts()
To filter out the empty lists and write valid titles to the CSV, add this inside the loop:
    if len(laptop) == 1:
        writer.writerow([laptop[0]])
The writerow() function adds a new row to the CSV file.
Here's the complete CSV export scraper code:
from playwright.sync_api import sync_playwright
import csv

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        pages = int(input("enter the number of pages to scrape: "))
        for i in range(1, pages+1):
            try:
                page.goto(f"https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={i}")
            except:
                print("Error")
            page.wait_for_timeout(7000)
            titles = page.get_by_role("heading").all()
            with open("laptops.csv", "a", newline="") as csvfile:
                writer = csv.writer(csvfile)
                for title in titles:
                    laptop = title.locator("a.title").all_inner_texts()
                    if len(laptop) == 1:
                        writer.writerow([laptop[0]])
        browser.close()

main()
After running this code, your CSV file should look like:
While this guide demonstrates basic web scraping, real-world scenarios often present challenges such as CAPTCHAs, rate limits, site layout changes, and regulatory requirements. Bright Data offers solutions for these challenges, including advanced residential proxies to improve scraping performance, a Web Scraper IDE for building scalable scrapers, and a Web Unblocker to access blocked sites.
Start your free trial today!