These scripts were created as part of my college project, ReadUniverse, where I needed a large volume of book data for development and dummy content.
Before running the scripts, make sure you have Python 3.8+ installed. Then install the required dependencies using:
```bash
pip install -r requirements.txt
```
This project consists of three main Python scripts that work together to collect and extract book data from Goodreads:
- `url.py` — Collects book URLs based on a genre, shelf, list, or search query.
- `list.py` — Automatically generated after running `url.py`; this file stores the collected book URLs by default.
- `scraper.py` — Extracts detailed book information from the URLs stored in the `list.py` file.
The following command-line arguments configure the URL collection process:
- `--url` (string, required): The Goodreads shelf, search, or list page URL from which to start scraping book URLs.
- `--max` (integer, optional): The maximum number of book URLs to collect. Default: `20`
- `--delay` (integer, optional): Delay in seconds between consecutive page requests, to respect Goodreads' rate limits and avoid overloading their servers. Default: `2`
- `--output` (string, optional): The filename where the collected URLs will be saved as a Python file containing a list of URLs. Default: `"list.py"`
Note: The collected URLs will be saved into the specified output file (default is `list.py`).
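For reference, the generated output file is just a small Python module containing one list of URLs. Below is a hypothetical example of its contents; the variable name `urls` and the specific URLs are assumptions, so inspect your generated file for the exact layout:

```python
# list.py — hypothetical example of the file url.py generates.
# The variable name "urls" is an assumption, not confirmed from the source.
urls = [
    "https://www.goodreads.com/book/show/36236124-fight-club",
    "https://www.goodreads.com/book/show/4671.The_Great_Gatsby",
]
```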
Usage example:
```bash
python url.py --url https://www.goodreads.com/shelf/show/fantasy --max 50 --delay 1 --output books.py
```
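As a rough sketch, the flags above map naturally onto Python's `argparse`. The snippet below is illustrative only and may not mirror `url.py`'s actual internals:

```python
# Illustrative argparse setup mirroring the documented flags;
# url.py itself may declare them differently.
import argparse

parser = argparse.ArgumentParser(description="Collect Goodreads book URLs.")
parser.add_argument("--url", required=True,
                    help="Goodreads shelf, search, or list page URL")
parser.add_argument("--max", type=int, default=20,
                    help="maximum number of book URLs to collect")
parser.add_argument("--delay", type=int, default=2,
                    help="seconds to wait between consecutive page requests")
parser.add_argument("--output", default="list.py",
                    help="Python file to write the collected URLs to")
args = parser.parse_args()
```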
Currently, `scraper.py` does not accept any command-line arguments. It automatically processes the list of book URLs saved in the output file generated by `url.py` (default: `list.py`).
Usage example:
```bash
python scraper.py
```
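Since `scraper.py` takes no arguments, it presumably imports the generated file directly. Here is a minimal sketch of that pattern, assuming the file defines a `urls` list and that `requests` and BeautifulSoup handle fetching and parsing (all assumptions about the implementation):

```python
# Minimal sketch: load the generated URL list and fetch each page.
# Assumes list.py defines a list named "urls"; the real scraper.py
# may load the file and extract fields differently.
import time

import requests
from bs4 import BeautifulSoup

from list import urls  # the file generated by url.py

for url in urls:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")  # Goodreads markup changes over time; adjust selectors
    print(url, "->", title.get_text(strip=True) if title else "?")
    time.sleep(2)  # stay polite to Goodreads' servers
```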
- Run `url.py` to collect book URLs and save them to a Python file (e.g., `list.py` or a custom filename).
- Run `scraper.py` to scrape detailed book data from the URLs contained in that file.
- The scraped data will be exported in JSON and CSV formats for further analysis.
The extracted data are saved in `data.json` and `data.csv`. Below is a JSON example:
```json
{
  "title": "Fight Club",
  "authorName": "Chuck Palahniuk",
  "description": "Chuck Palahniuk showed himself to be his generation",
  "isbn": "9780393355949",
  "publication": "1996-08-17",
  "pages": 224,
  "category": [
    "Fiction",
    "Classics",
    "Thriller",
    "Contemporary",
    "Novels",
    "Mystery",
    "Literature"
  ],
  "likes": 69,
  "averageRating": 4.18,
  "totalRating": 625058,
  "totalReview": 25009,
  "price": 57000,
  "stock": 7,
  "imageURL": "https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1558216416i/36236124.jpg"
}
```
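To illustrate the export step, records shaped like the example above can be written to `data.json` and `data.csv` with the standard library alone. This is a sketch under that assumption, not necessarily `scraper.py`'s exact code:

```python
# Sketch of exporting scraped records to data.json and data.csv.
# "books" stands in for whatever list of dicts scraper.py builds.
import csv
import json

books = [
    {"title": "Fight Club", "authorName": "Chuck Palahniuk",
     "category": ["Fiction", "Classics"], "averageRating": 4.18},
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(books, f, ensure_ascii=False, indent=2)

with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=books[0].keys())
    writer.writeheader()
    for book in books:
        row = dict(book)
        row["category"] = "|".join(book["category"])  # flatten the list for CSV
        writer.writerow(row)
```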
Note: the `likes`, `price`, and `stock` fields contain dummy data created for this project; they are not taken from the actual Goodreads website.
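Fields like these are typically filled with random values. A sketch of how they might be generated (the ranges are illustrative, not the project's actual ones):

```python
# Illustrative generation of the dummy likes/price/stock fields;
# the exact ranges used by the project are assumptions.
import random

book = {"title": "Fight Club"}          # a scraped record, before dummy fields
book["likes"] = random.randint(0, 100)
book["price"] = random.randint(30, 150) * 1000  # price-like value, e.g. 57000
book["stock"] = random.randint(0, 20)
```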
This project provides a scalable and modular solution for extracting comprehensive book data from Goodreads. The `url.py` script collects book URLs based on specified criteria (genre, shelf, or search) and saves them in a Python file (default: `list.py`). The `scraper.py` script processes these URLs to scrape detailed book metadata, exporting the results in JSON and CSV formats. This approach enables efficient, large-scale data harvesting suitable for analytics, research, machine learning datasets, or app development.