
Commit 6c37d5c: update README
1 parent bc0a145

2 files changed: +52 -56 lines

README-en.md (+37 -39)
@@ -1,41 +1,35 @@
[中文](./README.md) | English
Translated by ChatGPT

- # A Web Crawler for v2ex.com
+ # A web crawler for v2ex.com

- This is a small crawler I wrote to scrape data from the v2ex.com website using the Scrapy framework.
+ A small crawler written to learn Scrapy.

- The data is stored in an SQLite database, making it convenient for sharing. The entire database size is 2.1GB.
+ The data is stored in an SQLite database for easy sharing, with a total database size of 3.7GB.

I have released the complete SQLite database file on GitHub.

- ## Note: I do not recommend running the crawler again, as the data is already available
+ **Database Update: 2023-07-22-full**

- The crawling process took several dozen hours. If you crawl too fast, your IP may be banned, and I didn't use a proxy pool. Setting the concurrency to 3 allows you to crawl continuously.
+ It contains all post data, including the flamewar node. Post and comment content are now scraped as the original HTML. In addition, `topic` has a new field `reply_count`, and `comment` has a new field `no`.
+
+ ## Running the crawler yourself is not recommended, as the data is already available
+
+ The crawling process took several dozen hours, because crawling too fast gets your IP banned and I didn't use a proxy pool. Setting the concurrency to 3 should allow continuous crawling.

[Download the database](https://github.com/oldshensheep/v2ex_scrapy/releases)

## Explanation of Crawled Data

- The crawler starts from `topic_id = 1`, with the path `https://www.v2ex.com/t/{topic_id}`. The server may return 404/403/302/200 status codes: 404 means the post has been deleted, 403 means the crawler has been restricted, 302 is usually a redirect to the login page or homepage, and 200 is a normal page.
-
- Since the crawler doesn't log in, the collected data may be incomplete; for example, posts in the flamewar node were not crawled. If a post returns a 302, its ID is still recorded, but 404/403 posts are not recorded.
+ The crawler starts from `topic_id = 1`, and the path is `https://www.v2ex.com/t/{topic_id}`. The server may return 404/403/302/200 status codes: 404 means the post has been deleted, 403 means the crawler has been restricted, 302 is usually a redirect to the login page or homepage, and 200 is a normal page.
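To make the ID-by-ID crawl above concrete, here is a minimal Scrapy sketch of the idea; the spider name, the ID range, and the yielded fields are illustrative assumptions, not the actual spider in `v2ex_scrapy/spiders`.

```python
import scrapy


class TopicSketchSpider(scrapy.Spider):
    """Illustrative only: iterates topic IDs the way the README describes."""

    name = "topic-sketch"  # hypothetical name, not the project's real "v2ex" spider
    # Let Scrapy pass 403/404 responses to parse() instead of dropping them,
    # and keep 302s visible by not following redirects.
    handle_httpstatus_list = [302, 403, 404]
    custom_settings = {"REDIRECT_ENABLED": False}

    def start_requests(self):
        for topic_id in range(1, 1001):  # the real crawl starts at topic_id = 1
            yield scrapy.Request(
                f"https://www.v2ex.com/t/{topic_id}",
                callback=self.parse,
                cb_kwargs={"topic_id": topic_id},
            )

    def parse(self, response, topic_id):
        if response.status == 404:   # post deleted
            return
        if response.status == 403:   # crawler restricted, back off
            return
        if response.status == 302:   # login/homepage redirect: record only the id
            yield {"topic_id": topic_id, "status": 302}
            return
        # 200: a normal page; the title stands in for the full item here
        yield {"topic_id": topic_id, "title": response.css("h1::text").get()}
```

Disabling redirects is what makes the 302 case visible to `parse` in this sketch.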

- The crawler collects post content, comments, and user information for the comments.
+ The crawler fetches post content, comments, and user information during the crawling process.

Database table structure: [Table structure source code](./v2ex_scrapy/items.py)
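Once the released database file is downloaded, it can be inspected with plain `sqlite3`; the table and column names below (`topic`, `reply_count`, `id`, `title`) follow the fields mentioned above but are assumptions, so check them against `items.py`, where the real schema lives.

```python
import sqlite3

# Path to the database file downloaded from the GitHub release (file name is illustrative).
conn = sqlite3.connect("v2ex.sqlite")

# List the tables that actually exist before assuming any names.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)

# Example query, assuming a `topic` table with a `reply_count` column as described above.
for row in conn.execute(
        "SELECT id, title, reply_count FROM topic ORDER BY reply_count DESC LIMIT 10"):
    print(row)

conn.close()
```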

- Note 1: I realized halfway through that post appendices (附言) were not being crawled; they are only crawled starting from `topic_id = 448936`.
-
- Note 2: The user count from `select count(*) from member` is relatively small, about 200,000, because users are crawled from comments and posts: an account that has neither commented nor posted is never reached. Some posts are inaccessible, which also leaves some accounts uncrawled, and deleted accounts aren't crawled either.
-
- Note 3: All times are in UTC+0, in seconds.
-
- Note 4: Apart from primary keys and unique indexes, the database has no other indexes.
+ ## Running

- ## Running the Crawler
-
- Ensure that you have Python >= 3.10
+ Ensure Python version is >= 3.10

### Install Dependencies

@@ -45,71 +39,75 @@ pip install -r requirements.txt

### Configuration

- The default concurrency is set to 1. If you want to change it, modify `CONCURRENT_REQUESTS`.
+ The default concurrency is set to 1. To change it, modify `CONCURRENT_REQUESTS`.
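As a hedged sketch of what that change in `v2ex_scrapy/settings.py` might look like: `CONCURRENT_REQUESTS` is the setting named above, the value 3 echoes the concurrency the README says crawls continuously, and `DOWNLOAD_DELAY` is a standard Scrapy option added purely for illustration, not necessarily one the project sets.

```python
# v2ex_scrapy/settings.py (excerpt, illustrative values)
CONCURRENT_REQUESTS = 3   # default in this project is 1; 3 reportedly crawls continuously
DOWNLOAD_DELAY = 0.5      # standard Scrapy setting to spread requests out
```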

#### Cookie

- Some posts and certain post-related information require login to be crawled. You can set a Cookie to log in by modifying the `COOKIES` value in `v2ex_scrapy/settings.py`.
+ Some posts and some post information require login to crawl. You can set a Cookie to log in by modifying the `COOKIES` value in `v2ex_scrapy/settings.py`:

```python
COOKIES = """
a=b;c=d;e=f
"""
```

- #### Proxies
+ #### Proxy

- To change `PROXIES` in `v2ex_scrapy/settings.py`, use the following format:
+ Change the value of `PROXIES` in `v2ex_scrapy/settings.py`, for example:

```python
[
    "http://127.0.0.1:7890"
]
```

- Requests will randomly select a proxy. For more advanced proxy management, you can use third-party libraries or implement middleware yourself.
+ Each request randomly selects one of the proxies. For a more advanced proxy setup, use a third-party library or implement a middleware yourself.
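Purely as an illustration of the "implement a middleware yourself" option, here is a minimal sketch of a custom downloader middleware that picks a random proxy per request; the class is generic Scrapy, not code from this repository, and it only assumes the `PROXIES` list described above.

```python
import random


class RandomProxyMiddleware:
    """Minimal example of a per-request random proxy (illustrative, not the repo's middleware)."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Reads the same PROXIES list described above from settings.py.
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
```

Such a class would then be registered under `DOWNLOADER_MIDDLEWARES` in `settings.py`.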

- #### Log
+ #### LOG

- Logging to a file is disabled by default. If you want to enable it, uncomment the following line in `v2ex_scrapy/settings.py`:
+ Writing logs to a file is disabled by default. To enable it, uncomment this line in `v2ex_scrapy/settings.py`:

```python
- # LOG_FILE = "v2ex_scrapy.log"
+ LOG_FILE = "v2ex_scrapy.log"
```

### Run the Crawler

- Crawl all posts on the entire website:
+ Crawl all posts, user information, and comments on the entire site:

```bash
scrapy crawl v2ex
```

- Crawl posts from a specific node. If `node-name` is empty, it will crawl from the "flamewar" node:
+ Crawl posts, user information, and comments for a specific node. If `node-name` is empty, it crawls "flamewar":
+
+ ```bash
+ scrapy crawl v2ex-node node=${node-name}
+ ```
+
+ Crawl user information, starting from uid=1 and crawling up to uid=635000:

```bash
- scrapy crawl v2ex-node ${node-name}
+ scrapy crawl v2ex-member start_id=${start_id} end_id=${end_id}
```

- If you encounter a `scrapy: command not found` error, it means that the Python package installation path has not been added to your environment variables.
+ > If you see `scrapy: command not found`, the Python package installation path has not been added to your PATH environment variable.

- ### Continue from Where It Left Off
+ ### Resuming the Crawl

- Just run the crawling command, and it will automatically continue crawling. It will skip the posts that have already been crawled.
+ Simply run the crawl command again; it will automatically continue crawling, skipping the posts that have already been crawled:

```bash
scrapy crawl v2ex
```
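For intuition, one common way this kind of resume behaviour is implemented is to read the IDs already stored in the SQLite file and skip them when generating requests. This is only a sketch of the idea, not the project's actual logic; the spider name and the table/column names are assumptions.

```python
import sqlite3

import scrapy


def already_crawled(db_path="v2ex.sqlite"):
    """Collect topic ids already present in the database (table/column names assumed)."""
    with sqlite3.connect(db_path) as conn:
        return {row[0] for row in conn.execute("SELECT id FROM topic")}


class ResumeSketchSpider(scrapy.Spider):
    name = "resume-sketch"  # illustrative, not one of the project's spiders

    def start_requests(self):
        done = already_crawled()
        for topic_id in range(1, 1_000_000):
            if topic_id in done:
                continue  # skip posts that were already crawled
            yield scrapy.Request(f"https://www.v2ex.com/t/{topic_id}")
```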

### Notes

- If you encounter a 403 error during crawling, it's likely due to IP restrictions. In such cases, wait for some time and try again.
-
- After code updates, you cannot continue crawling with your old database. The table structure has changed, and `topic_content` now stores the complete HTML instead of just the text content.
+ If you encounter a 403 error during crawling, it is most likely due to IP restrictions; wait for a while and try again.
110-
## Data Analysis
108+
## Statistical Analysis
111109

112-
The SQL queries for statistics are in the [query.sql](query.sql) file, and the source code for generating charts is in [analysis.py](analysis.py).
110+
The SQL queries used for statistics can be found in the [query.sql](query.sql) file, and the source code for the charts is in the [analysis](analysis) subproject. It includes a Python script for exporting data to JSON for analysis and a frontend display project.
113111

114112
The first analysis can be found at <https://www.v2ex.com/t/954480>
115113

README.md (+15 -17)
@@ -4,10 +4,14 @@
A small crawler written to learn Scrapy

- The data is stored in an SQLite database for easy sharing; the whole database is 2.1GB
+ The data is stored in an SQLite database for easy sharing; the whole database is 3.7GB

I have released the complete SQLite database file on GitHub

+ **Database update: 2023-07-22-full**
+
+ Contains all post data (including the flamewar node). Post and comment content is no longer scraped as text only but as the original HTML; in addition, `topic` gains a separate `reply_count` field and `comment` gains a `no` field
+
## Running the crawler yourself is not recommended; the data is already available

Crawling took several dozen hours, because crawling too fast gets the IP banned, and I didn't use a proxy pool. With the concurrency set to 3 it can basically crawl continuously.
@@ -18,20 +22,10 @@

The crawler starts from `topic_id = 1`, and the path is `https://www.v2ex.com/t/{topic_id}`. The server may return 404/403/302/200: 404 means the post was deleted, 403 means the crawler was restricted, 302 is usually a redirect to the login page (sometimes to the homepage), and 200 returns a normal page.

- The crawler is not logged in, so the crawled data is incomplete; for example, posts in the flamewar node were not crawled. Posts that return 302 have their topic id recorded, while 404/403 posts are not recorded.
-
During crawling, post content, comments, and commenters' user information are collected.

Database table structure: [Table structure source code](./v2ex_scrapy/items.py)

- Note 1: Only halfway through did I notice that post appendices (附言) were not being crawled; they are only crawled starting from `topic_id = 448936`
-
- Note 2: The user count from `select count(*) from member` is fairly small, around 200,000, because users are crawled from comments and posts: an account that has neither commented nor posted is never reached. Some posts are inaccessible, which also leaves some accounts uncrawled, and some deleted accounts were not crawled either. (The code has since been changed so they can be crawled, but the crawl was already finished...)
-
- Note 3: All times are UTC+0, in seconds
-
- Note 4: Apart from primary keys and unique indexes, the database has no other indexes
-
## Running

Ensure Python >= 3.10
7468

7569
### 运行爬虫
7670

77-
爬取全站帖子
71+
爬取全站帖子、用户信息和评论
7872

7973
```bash
8074
scrapy crawl v2ex
8175
```
8276

83-
爬取指定节点帖子,如果node-name为空则爬flamewar
77+
爬取指定节点帖子、用户信息和评论,如果node-name为空则爬flamewar
78+
79+
```bash
80+
scrapy crawl v2ex-node node=${node-name}
81+
```
82+
83+
爬取用户信息,从uid=1开始爬到uid=635000
8484

8585
```bash
86-
scrapy crawl v2ex-node ${node-name}
86+
scrapy crawl v2ex-member start_id=${start_id} end_id=${end_id}
8787
```
8888

8989
> `scrapy: command not found` 说明没有添加python包的安装位置到环境变量
@@ -100,11 +100,9 @@ scrapy crawl v2ex
100100

101101
爬取过程中出现403基本上是因为IP被限制了,等待一段时间即可
102102

103-
代码更新后不能继续用我之前的数据库爬了。表结构改了,topic_content爬取的内容改为完整的HTML而不是只有文本内容。
104-
105103
## 统计分析
106104

107-
统计用的SQL在[query.sql](query.sql)这个文件下,图表的源码在[analysis.py](analysis.py)
105+
统计用的SQL在[query.sql](query.sql)这个文件下,图表的源码在[analysis](analysis)这个子项目下,包含一个分析数据导出到JSON的Python脚本和一个前端展示项目
108106

109107
第一次的分析见 <https://www.v2ex.com/t/954480>
110108
