These scripts were created as part of my college project, ReadUniverse, where I needed a large volume of book data for development and dummy content.
Before running the scripts, make sure you have Python 3.8+ installed. Then install the required dependencies using:
```bash
pip install -r requirements.txt
```
This project consists of three main Python scripts that work together to collect and extract book data from Goodreads:
- `url.py` — Collects book URLs based on a genre, shelf, list, or search query.
- `list.py` — Automatically generated after running `url.py`; this file stores the collected book URLs by default.
- `scraper.py` — Extracts detailed book information from the URLs stored in the `list.py` file.
The following command-line arguments configure the URL collection process:
- `--url` (string, required): The Goodreads shelf, search, or list page URL from which to start scraping book URLs.
- `--max` (integer, optional): The maximum number of book URLs to collect. Default: `20`
- `--delay` (integer, optional): Delay in seconds between consecutive page requests, to respect Goodreads' rate limits and avoid overloading their servers. Default: `2`
- `--output` (string, optional): The filename where the collected URLs will be saved as a Python file containing a list of URLs. Default: `"list.py"`
Note: The collected URLs will be saved into the specified output file (default is `list.py`).
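For reference, the generated output file is just a small Python module containing one list of URLs. Below is a hypothetical example of its contents; the variable name `urls` and the specific URLs are assumptions, so inspect your generated file for the exact layout:

```python
# list.py — hypothetical example of the file url.py generates.
# The variable name "urls" is an assumption, not confirmed from the source.
urls = [
    "https://www.goodreads.com/book/show/36236124-fight-club",
    "https://www.goodreads.com/book/show/4671.The_Great_Gatsby",
]
```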
Usage example:
```bash
python url.py --url https://www.goodreads.com/shelf/show/fantasy --max 50 --delay 1 --output books.py
```
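As a rough sketch, the flags above map naturally onto Python's `argparse`. The snippet below is illustrative only and may not mirror `url.py`'s actual internals:

```python
# Illustrative argparse setup mirroring the documented flags;
# url.py itself may declare them differently.
import argparse

parser = argparse.ArgumentParser(description="Collect Goodreads book URLs.")
parser.add_argument("--url", required=True,
                    help="Goodreads shelf, search, or list page URL")
parser.add_argument("--max", type=int, default=20,
                    help="maximum number of book URLs to collect")
parser.add_argument("--delay", type=int, default=2,
                    help="seconds to wait between consecutive page requests")
parser.add_argument("--output", default="list.py",
                    help="Python file to write the collected URLs to")
args = parser.parse_args()
```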
Currently, `scraper.py` does not accept any command-line arguments. It automatically processes the list of book URLs saved in the output file generated by `url.py` (default: `list.py`).
Usage example:
```bash
python scraper.py
```
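Since `scraper.py` takes no arguments, it presumably imports the generated file directly. Here is a minimal sketch of that pattern, assuming the file defines a `urls` list and that `requests` and BeautifulSoup handle fetching and parsing (all assumptions about the implementation):

```python
# Minimal sketch: load the generated URL list and fetch each page.
# Assumes list.py defines a list named "urls"; the real scraper.py
# may load the file and extract fields differently.
import time

import requests
from bs4 import BeautifulSoup

from list import urls  # the file generated by url.py

for url in urls:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")  # Goodreads markup changes over time; adjust selectors
    print(url, "->", title.get_text(strip=True) if title else "?")
    time.sleep(2)  # stay polite to Goodreads' servers
```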
- Run `url.py` to collect book URLs and save them to a Python file (e.g., `list.py` or a custom filename).
- Run `scraper.py` to scrape detailed book data from the URLs contained in that file.
- The scraped data will be exported in JSON and CSV formats for further analysis.
The extracted data are saved in `data.json` and `data.csv`. Below is a JSON example:
```json
{
  "title": "Fight Club",
  "authorName": "Chuck Palahniuk",
  "description": "Chuck Palahniuk showed himself to be his generation",
  "isbn": "9780393355949",
  "publication": "1996-08-17",
  "pages": 224,
  "category": [
    "Fiction",
    "Classics",
    "Thriller",
    "Contemporary",
    "Novels",
    "Mystery",
    "Literature"
  ],
  "likes": 69,
  "averageRating": 4.18,
  "totalRating": 625058,
  "totalReview": 25009,
  "price": 57000,
  "stock": 7,
  "imageURL": "https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1558216416i/36236124.jpg"
}
```
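To illustrate the export step, records shaped like the example above can be written to `data.json` and `data.csv` with the standard library alone. This is a sketch under that assumption, not necessarily `scraper.py`'s exact code:

```python
# Sketch of exporting scraped records to data.json and data.csv.
# "books" stands in for whatever list of dicts scraper.py builds.
import csv
import json

books = [
    {"title": "Fight Club", "authorName": "Chuck Palahniuk",
     "category": ["Fiction", "Classics"], "averageRating": 4.18},
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(books, f, ensure_ascii=False, indent=2)

with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=books[0].keys())
    writer.writeheader()
    for book in books:
        row = dict(book)
        row["category"] = "|".join(book["category"])  # flatten the list for CSV
        writer.writerow(row)
```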
Note: the `likes`, `price`, and `stock` fields contain dummy data created for this project; they are not taken from the actual Goodreads website.
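Fields like these are typically filled with random values. A sketch of how they might be generated (the ranges are illustrative, not the project's actual ones):

```python
# Illustrative generation of the dummy likes/price/stock fields;
# the exact ranges used by the project are assumptions.
import random

book = {"title": "Fight Club"}          # a scraped record, before dummy fields
book["likes"] = random.randint(0, 100)
book["price"] = random.randint(30, 150) * 1000  # price-like value, e.g. 57000
book["stock"] = random.randint(0, 20)
```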
This project provides a scalable and modular solution for extracting comprehensive book data from Goodreads. The `url.py` script collects book URLs based on specified criteria (genre, shelf, or search) and saves them in a Python file (default: `list.py`). The `scraper.py` script processes these URLs to scrape detailed book metadata, exporting the results in JSON and CSV formats. This approach enables efficient, large-scale data harvesting suitable for analytics, research, machine learning datasets, or app development.