HTML Web Scraping with Python

Bright Data Promo

This tutorial covers HTML fundamentals and how to collect, parse, and process web data using Python.

Want a comprehensive Python web scraping resource? Check out our detailed guide.

Introduction to HTML Structure

Before diving into scraping, it's important to understand the building blocks of HTML.

HTML consists of tags that define a webpage's structure and components. For instance, <h1> Text </h1> defines heading text, while <a href=""> link </a> creates a hyperlink.

HTML attributes provide additional element information. For example, the href attribute in an <a> </a> tag specifies the link destination URL.

Classes and IDs are crucial attributes for precisely identifying page elements. Classes group similar elements for consistent styling with CSS or manipulation via JavaScript. Classes are referenced using .class-name.

On W3Schools, class groupings appear like this:

<div class="city">
  <h2>London</h2>
  <p>London is the capital of England.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris is the capital of France.</p>
</div>

<div class="city">
  <h2>Tokyo</h2>
  <p>Tokyo is the capital of Japan.</p>
</div>

Notice how each city's heading and description are enclosed in a <div> that shares the same city class.

In contrast, IDs must be unique to individual elements (no two elements can share an ID). For example, these H1 elements have unique IDs for individual styling/manipulation:

<h1 id="header1">Hello World!</h1>

<h1 id="header2">Lorem Ipsum Dolor</h1>

The syntax for targeting ID elements is #id-name.

Setting Up Your Python Environment

This guide uses Python because of its many HTML scraping libraries and beginner-friendly syntax. To verify if Python is installed, run this command in PowerShell (Windows) or Terminal (macOS):

python3 --version

If Python is installed, you'll see the version number; otherwise, you'll get an error message. If needed, download and install Python.

Next, create a folder named WebScraper and inside it create a file called scraper.py. Open this file in your preferred integrated development environment (IDE). We'll use Visual Studio Code in this guide:

VSCode showing the project

An IDE is a comprehensive tool that enables developers to write code, debug, test programs, create automations, and more. You'll use it to develop your HTML scraper.

Now, isolate your global Python installation from your scraping project by creating a virtual environment. This prevents dependency conflicts and keeps your project organized.

Install the virtualenv library with this command:

pip3 install virtualenv

Navigate to your project folder:

cd WebScraper

Create a virtual environment:

python<version> -m venv <virtual-environment-name>
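
For example, assuming Python 3 and an environment named venv (both purely illustrative choices):

python3 -m venv venv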

This creates a directory for all packages and scripts within your project folder:

Virtual environment folder creation

Now activate your virtual environment using the appropriate command for your system:

source <virtual-environment-name>/bin/activate # macOS and Linux

<virtual-environment-name>\Scripts\activate.bat # Windows Command Prompt

<virtual-environment-name>\Scripts\Activate.ps1 # Windows PowerShell

When activated successfully, your virtual environment name will appear on the left side of your terminal:

Virtual environment activation indicator

With your virtual environment active, install a web scraping library. Options include Playwright, Selenium, Beautiful Soup, and Scrapy. For this tutorial, we'll use Playwright because it's user-friendly, supports multiple browsers, handles dynamic content, and offers headless mode (scraping without a GUI).

Install Playwright, then download the browsers it drives:

pip install pytest-playwright

playwright install

Now you're ready to start web scraping.

Extracting Complete HTML from a Webpage

The first step in any scraping project is selecting your target website. For this tutorial, we'll use this e-commerce test site.

Next, identify what information you want to extract. Initially, we'll capture the entire HTML content of the page.

After identifying your scraping target, start coding your scraper. In Python, begin by importing the necessary Playwright libraries. Playwright offers two API types: sync and async. Since we're not writing asynchronous code, import the sync library:

from playwright.sync_api import sync_playwright

After importing the sync library, define a Python function:

def main():
    #Rest of the code will be inside this function

All your web scraping code will reside within this function.

Typically, to access a website's information, you open a browser, create a tab, and visit the site. For scraping, translate these actions into code using Playwright. According to their documentation, you can call the imported sync_api and launch a browser:

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)

Setting headless=False lets you watch the browser while the script runs; once your scraper works, you can switch to headless=True to run without a GUI.

After launching the browser, open a new tab and navigate to your target URL:

page = browser.new_page()
try:
  page.goto("https://webscraper.io/test-sites/e-commerce/static")
except:
  print("Error")

Note:

Add these lines below the previous browser launch code. All this code belongs inside the main function in a single file.

This code wraps the goto() function in a try-except block for better error handling.

Just as you wait for a site to load when entering a URL, add a waiting period to your code:

page.wait_for_timeout(7000) # wait time in milliseconds

Note:

Add these lines below the previous code.
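
A fixed timeout is a blunt instrument: it always waits the full seven seconds. As an alternative sketch, Playwright can instead wait until the page itself reports that it has finished loading:

page.wait_for_load_state("load") # resolves once the page fires its load event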

Finally, extract the complete HTML content from the page:

print(page.content())

The complete HTML extraction code looks like this:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static")
        except:
            print("Error")

        page.wait_for_timeout(7000)
        print(page.content())

main()

In Visual Studio Code, the extracted HTML appears like this:

Extracted HTML in VSCode

Targeting Specific HTML Elements

While extracting an entire webpage is possible, web scraping becomes truly valuable when you focus on specific information. In this section, we'll extract only the laptop titles from the website's first page:

Laptop titles to extract

To extract specific elements, understand the website's structure first. Right-click and select Inspect on the page:

Using inspect on the target website

Alternatively, use these keyboard shortcuts:

  • macOS: Cmd + Option + I
  • Windows: Control + Shift + C

Here's the structure of our target page:

HTML structure of target website

You can examine specific page elements using the selection tool in the top-left corner of the Inspect window:

Inspecting specific elements

Select one of the laptop titles in the Inspect window:

Inspecting a laptop title

You can see that each title is an <a> tag carrying a title class, wrapped in an <h4> tag. So we need to look for <a> tags with the title class inside <h4> tags.
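
As a side note, this structure maps directly onto a CSS selector, so a single locator could collect every title at once. Here's a sketch of that alternative (not the path this tutorial takes):

titles = page.locator("h4 > a.title").all_inner_texts() # all <a class="title"> links directly inside <h4> tags
print(titles)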

To create a targeted scraping program, import the required libraries, create a Python function, launch the browser, and navigate to the laptop page:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static/computers/laptops")
        except:
            print("Error")

        page.wait_for_timeout(7000)

Note that we've updated the URL in the page.goto() function to point directly to the laptop listing page.

Now locate the target elements based on your structure analysis. Playwright provides locators to find elements using various attributes:

  • get_by_label() finds elements by their associated label
  • get_by_text() finds elements containing specific text
  • get_by_alt_text() finds images by their alt text
  • get_by_test_id() finds elements by their test ID

See the official documentation for more element location methods.
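
To make a couple of these concrete, here are hedged one-liners; the argument values are illustrative guesses about the page, not verified against it:

link = page.get_by_text("Laptops", exact=True) # element whose text is exactly "Laptops"
image = page.get_by_alt_text("item") # image located by its alt attribute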

To extract all laptop titles, locate the <h4> tags that contain them. Use the get_by_role() locator, which finds elements by their role (button, checkbox, heading, and so on):

titles = page.get_by_role("heading").all()

Print the results to your console:

print(titles)

The output shows an array of elements:

Array of heading elements

This output doesn't show the titles directly; it's a list of locators pointing at the elements that match our criteria. We need to loop through these elements to find <a> tags with a title class and extract their text.

Use the CSS locator to find elements by path and class, and the all_inner_texts() function to extract their text:

for title in titles:
  laptop = title.locator("a.title").all_inner_texts()

Running this code produces output like:

Output of title extraction

Headings elsewhere on the page contain no a.title link and therefore produce empty lists. To filter them out, add this inside the loop:

if len(laptop) == 1:
  print(laptop[0])

Here's the complete code for this specific element scraper:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static/computers/laptops")
        except:
            print("Error")

        page.wait_for_timeout(7000)

        titles = page.get_by_role("heading").all()

        for title in titles:
            laptop = title.locator("a.title").all_inner_texts()
            if len(laptop) == 1:
                print(laptop[0])

main()

Interacting with Page Elements

Let's enhance our scraper to extract titles from multiple pages. We'll scrape titles from the first laptop page, navigate to the second page, and extract those titles as well.

Since we already know how to extract titles, we just need to learn how to navigate to the next page.

The website has pagination buttons at the bottom. We need to locate and click on the "2" button programmatically. Inspecting the page reveals that this element is a list item (<li> tag) with the text "2":

Pagination element with text "2"

We can use the get_by_role() selector to find a list item and the get_by_text() selector to find text containing "2":

page.get_by_role("listitem").get_by_text("2", exact=True)

This finds an element matching both conditions: it must be a list item and have exactly "2" as its text.

The exact=True parameter prevents partial matches, so "2" won't also match entries like "12".

To click this button, modify the code:

page.get_by_role("listitem").get_by_text("2", exact=True).click()

The click() function performs a click on the matched element.

Now wait for the page to load and extract titles again:

page.wait_for_timeout(5000)

titles = page.get_by_role("heading").all()

for title in titles:
    laptop = title.locator("a.title").all_inner_texts()
    if len(laptop) == 1:
        print(laptop[0])

Your complete multi-page scraper code should look like this:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto("https://webscraper.io/test-sites/e-commerce/static/computers/laptops")
        except:
            print("Error")

        page.wait_for_timeout(7000)

        titles = page.get_by_role("heading").all()

        for title in titles:
            laptop = title.locator("a.title").all_inner_texts()
            if len(laptop) == 1:
                print(laptop[0])
        
        page.get_by_role("listitem").get_by_text("2", exact=True).click()

        page.wait_for_timeout(5000)

        titles = page.get_by_role("heading").all()

        for title in titles:
            laptop = title.locator("a.title").all_inner_texts()
            if len(laptop) == 1:
                print(laptop[0])

main()

Saving Extracted Data to CSV

Scraped data needs to be stored and analyzed to be useful. Now we'll create an advanced program that asks users how many laptop pages to scrape, extracts the titles, and saves them to a CSV file.

First, import the CSV library:

import csv

Next, determine how to visit multiple pages based on user input.

Looking at the website's URL structure, we notice that each laptop page uses a URL parameter. For example, the second page URL is https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=2.

We can navigate to different pages by changing the ?page=2 parameter. Ask the user how many pages to scrape:

pages = int(input("enter the number of pages to scrape: "))

To visit each page from 1 to the user-specified number, use a for loop:

for i in range(1, pages+1):

The range function starts at the first value (1) and ends before the second value (pages+1). For example, range(1,5) loops from 1 to 4.

Now visit each page by using the loop variable i as the URL parameter. We can insert variables into strings using Python f-strings.

An f-string uses an f prefix before quotation marks, allowing variable insertion using curly brackets:

print(f"The value of the variable is {variable_name_goes_here}")

For our scraper, use f-strings in the navigation code:

try:
    page.goto(f"https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={i}")
except:
    print("Error")

Wait for the page to load and extract titles:

page.wait_for_timeout(7000)
titles = page.get_by_role("heading").all()

Next, open a CSV file, loop through each title, extract the text, and write it to the file.

Open a CSV file with:

with open("laptops.csv", "a") as csvfile:

We're opening laptops.csv in append mode (a), passing newline="" because the csv module manages its own line endings (without it, you can get blank lines between rows on Windows). Append mode adds new data without erasing existing data, and the file is created if it doesn't exist. Python's open() function supports several file modes (a short runnable sketch follows the list):

  • r: Default mode, opens file for reading only
  • w: Opens file for writing, overwrites existing data
  • a: Opens file for appending, preserves existing data
  • r+: Opens file for both reading and writing
  • x: Creates a new file, failing if it already exists
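
Here is a minimal, self-contained sketch of the append-and-read-back cycle, using only the standard library (the file name matches the scraper's):

import csv

# Append one row; the file is created on first use.
with open("laptops.csv", "a", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Example laptop"])

# Read the file back; each row comes out as a list of column strings.
with open("laptops.csv", "r", newline="") as csvfile:
    for row in csv.reader(csvfile):
        print(row)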

Next, create a writer object to manipulate the CSV file:

writer = csv.writer(csvfile)

Loop through each title element and extract the text:

for title in titles:
    laptop = title.locator("a.title").all_inner_texts()

To filter out empty arrays and write valid titles to the CSV:

if len(laptop) == 1:
    writer.writerow([laptop[0]])

The writerow() function writes a single row to the CSV file; we pass a one-element list because each row here has only one column.
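
If you also want a header row, one option (a sketch; the column name is just a suggestion) is to write it once, before any data rows:

writer.writerow(["title"]) # write the header exactly once, before appending data rows

In the multi-page scraper below, that would mean opening the file and creating the writer once, before the for loop, rather than inside it.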

Here's the complete CSV export scraper code:

from playwright.sync_api import sync_playwright
import csv

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        pages = int(input("enter the number of pages to scrape: "))

        for i in range(1, pages+1):
            try:
                page.goto(f"https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={i}")
            except:
                print("Error")

            page.wait_for_timeout(7000)

            titles = page.get_by_role("heading").all()

            with open("laptops.csv", "a") as csvfile:
                writer = csv.writer(csvfile)
                for title in titles:
                    laptop = title.locator("a.title").all_inner_texts()
                    if len(laptop) == 1:
                        writer.writerow([laptop[0]])

        browser.close()


main()

After running this code, your CSV file should look like:

CSV file output example

Final Thoughts

While this guide demonstrates basic web scraping, real-world scenarios often present challenges such as CAPTCHAs, rate limits, site layout changes, and regulatory requirements. Bright Data offers solutions for these challenges, including advanced residential proxies to improve scraping performance, a Web Scraper IDE for building scalable scrapers, and a Web Unlocker to access blocked sites.

Start your free trial today!
