Goodreads Scraper

Python 3.8+ · License: MIT

This scraper was created as part of my college project, ReadUniverse, for which I needed a large volume of book data for development and dummy content.

Prerequisites

Before running the scripts, make sure you have Python 3.8+ installed. Then install the required dependencies using:

pip install -r requirements.txt

Overview

This project consists of three main Python scripts that work together to collect and extract book data from Goodreads:

url.py — Collects book URLs based on a genre, shelf, list, or search query.

list.py — Automatically generated after running url.py, this file stores the collected book URLs by default (a sketch of its format appears after this list).

scraper.py — Extracts detailed book information from the URLs stored in the list.py file.
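
For reference, the generated URL file is a plain Python module. A minimal sketch of what it might contain, assuming the URLs are stored in a list named urls (the variable name and the example entries are illustrative, not confirmed by this README):

    # list.py - generated by url.py (sketch; the real variable name may differ)
    urls = [
        # Illustrative Goodreads book page URLs collected by url.py
        "https://www.goodreads.com/book/show/36236124-fight-club",
        # ... further collected URLs, up to the --max limit
    ]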

Command-Line Arguments

url.py

The following command-line arguments configure the URL collection process:

  • --url (string, required):
    The Goodreads shelf, search, or list page URL from which to start scraping book URLs.

  • --max (integer, optional):
    The maximum number of book URLs to collect.
    Default: 20

  • --delay (integer, optional):
    Delay in seconds between consecutive page requests to respect Goodreads' rate limits and avoid overloading their servers.
    Default: 2

  • --output (string, optional):
    The filename where the collected URLs will be saved as a Python file containing a list of URLs.
    Default: "list.py"

Note: The collected URLs will be saved into the specified output file (default is list.py).

Usage example:

python url.py --url https://www.goodreads.com/shelf/show/fantasy --max 50 --delay 1 --output books.py
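
A minimal sketch of how these arguments might be wired up with Python's standard argparse module (the actual implementation in url.py may differ):

    import argparse

    parser = argparse.ArgumentParser(description="Collect Goodreads book URLs.")
    parser.add_argument("--url", required=True,
                        help="Goodreads shelf, search, or list page URL to start from")
    parser.add_argument("--max", type=int, default=20,
                        help="maximum number of book URLs to collect")
    parser.add_argument("--delay", type=int, default=2,
                        help="delay in seconds between consecutive page requests")
    parser.add_argument("--output", default="list.py",
                        help="Python file in which the collected URLs are saved")
    args = parser.parse_args()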

scraper.py

Currently, scraper.py does not accept any command-line arguments. It automatically processes the list of book URLs saved in the output file generated by url.py (default: list.py).

Usage example:

python scraper.py
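
A rough sketch of the kind of loop scraper.py runs, assuming the URL file exposes a list named urls and that pages are fetched with requests and parsed with BeautifulSoup (neither detail is confirmed by this README):

    import json
    import time

    import requests
    from bs4 import BeautifulSoup

    from list import urls  # default output file of url.py

    books = []
    for url in urls:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "html.parser")
        # Selector is illustrative; the real script extracts many more fields.
        title = soup.find("h1")
        books.append({"title": title.get_text(strip=True) if title else None, "url": url})
        time.sleep(2)  # stay polite to Goodreads, mirroring url.py's --delay

    with open("data.json", "w", encoding="utf-8") as f:
        json.dump(books, f, ensure_ascii=False, indent=4)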

Workflow

  1. Run url.py to collect book URLs and save them to a Python file (e.g., list.py or a custom filename).
  2. Run scraper.py to scrape detailed book data from the URLs contained in that file.
  3. The scraped data will be exported in JSON and CSV formats for further analysis.

Scraped Data Output

The extracted data are saved in data.json and data.csv. Below is an example JSON record:

    {
        "title": "Fight Club",
        "authorName": "Chuck Palahniuk",
        "description": "Chuck Palahniuk showed himself to be his generation",
        "isbn": "9780393355949",
        "publication": "1996-08-17",
        "pages": 224,
        "category": [
            "Fiction",
            "Classics",
            "Thriller",
            "Contemporary",
            "Novels",
            "Mystery",
            "Literature"
        ],
        "likes": 69,
        "averageRating": 4.18,
        "totalRating": 625058,
        "totalReview": 25009,
        "price": 57000,
        "stock": 7,
        "imageURL": "https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1558216416i/36236124.jpg"
    }

The likes, price, and stock fields are generated dummy data only; they are not taken from the actual Goodreads website.
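
Once scraping finishes, records like the one above can be inspected with the standard json module, assuming data.json holds a list of such records:

    import json

    with open("data.json", encoding="utf-8") as f:
        books = json.load(f)

    # Print a quick summary of each scraped book.
    for book in books:
        print(f"{book['title']}: {book['averageRating']} ({book['totalRating']} ratings)")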

Summary

This project provides a scalable and modular solution for extracting comprehensive book data from Goodreads. The url.py script collects book URLs based on specified criteria (genre, shelf, or search) and saves them in a Python file (default: list.py). The scraper.py script processes these URLs to scrape detailed book metadata, exporting the results in JSON and CSV formats. This approach enables efficient, large-scale data harvesting suitable for analytics, research, machine learning datasets, or app development.
