[中文](./README.md) | English
Translated by ChatGPT

# A Web Crawler for v2ex.com

A small crawler written to learn Scrapy, used to scrape data from v2ex.com.

The data is stored in an SQLite database for easy sharing; the total database size is 3.7GB.

I have released the complete SQLite database file on GitHub.

**Database Update: 2023-07-22-full**

It contains all post data, including hot topics. Both post and comment content are now scraped in their original HTML format. Additionally, `topic` has a new field `reply_count`, and `comment` has a new field `no`.

## It is not recommended to run the crawler, as the data is already available

The crawling process took several dozen hours because rapid crawling can result in IP banning, and I didn't use a proxy pool. Setting the concurrency to 3 should allow for continuous crawling.

[Download the database](https://github.com/oldshensheep/v2ex_scrapy/releases)

## Explanation of Crawled Data

The crawler starts crawling from `topic_id = 1`, and the path is `https://www.v2ex.com/t/{topic_id}`. The server might return 404/403/302/200 status codes. A 404 indicates that the post has been deleted, 403 indicates that the crawler has been restricted, 302 is usually a redirection to the login page or homepage, and 200 indicates a normal page.
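
For readers unfamiliar with how a Scrapy spider can tell these cases apart, here is a minimal, hypothetical sketch (not the project's actual spider code): it disables automatic redirects and whitelists the non-200 codes so the callback can inspect `response.status`.

```python
# Hypothetical sketch, not the project's actual spider: shows how a Scrapy
# callback could observe 404/403/302 instead of having them filtered out.
import scrapy


class TopicStatusDemoSpider(scrapy.Spider):
    name = "topic-status-demo"
    # Let these statuses reach the callback instead of the error handler.
    handle_httpstatus_list = [302, 403, 404]

    def start_requests(self):
        for topic_id in range(1, 10):
            yield scrapy.Request(
                f"https://www.v2ex.com/t/{topic_id}",
                callback=self.parse_topic,
                meta={"dont_redirect": True},  # keep the 302 visible
                cb_kwargs={"topic_id": topic_id},
            )

    def parse_topic(self, response, topic_id):
        if response.status == 404:
            self.logger.info("topic %s was deleted", topic_id)
        elif response.status == 403:
            self.logger.warning("blocked by the server, slow down or wait")
        elif response.status == 302:
            self.logger.info("topic %s needs login (redirected)", topic_id)
        else:  # 200, a normal page
            yield {"topic_id": topic_id, "title": response.css("h1::text").get()}
```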

The crawler fetches post content, comments, and user information during the crawling process.

Note 1: I realized halfway through that I had missed crawling the postscripts of posts, which are crawled starting from `topic_id = 448936`.

Note 2: The total number of users obtained from `select count(*) from member` is relatively small, about 200,000. This is because users are crawled based on comments and posts: if a user has neither commented nor posted anything, their account won't be crawled. Some posts may not be accessible, which can also leave some accounts uncrawled, and accounts that have been deleted weren't crawled either. (The code has since been changed so that these can be crawled, but the crawl has already finished.)

Note 3: All times are in UTC+0, in seconds.

Note 4: Apart from primary keys and unique indexes, there are no other indexes in the database.
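
Once the database is downloaded, it can be explored directly with Python's built-in `sqlite3` module. The sketch below is only an illustration of the notes above: the file name `v2ex.sqlite3` and the `created` column on the `topic` table are assumptions, while the `member` count follows Note 2 and the UTC-seconds conversion follows Note 3.

```python
# Exploratory sketch for the released database. The file name "v2ex.sqlite3"
# and the "created" column on the topic table are assumptions for illustration.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("v2ex.sqlite3")

# Note 2: the member table only holds users seen in posts/comments (~200k rows).
(member_count,) = conn.execute("SELECT count(*) FROM member").fetchone()
print(f"members crawled: {member_count}")

# Note 3: times are stored as UTC+0 seconds, so convert them explicitly.
row = conn.execute("SELECT id, created FROM topic LIMIT 1").fetchone()
if row:
    topic_id, created = row
    print(topic_id, datetime.fromtimestamp(created, tz=timezone.utc))

# Note 4: only primary keys and unique indexes exist, so add your own
# indexes before running heavy analytical queries.
conn.execute("CREATE INDEX IF NOT EXISTS idx_topic_created ON topic(created)")
conn.close()
```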

The default concurrency is set to 1. To change it, modify `CONCURRENT_REQUESTS`.
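
For example, in `v2ex_scrapy/settings.py` (the value 3 reflects the rate the author reports as sustainable without a proxy pool; treat it as a starting point, not the shipped default):

```python
# Sketch for v2ex_scrapy/settings.py: standard Scrapy setting; 3 is the
# concurrency the author found sustainable, not necessarily the default.
CONCURRENT_REQUESTS = 3
```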

#### Cookie

Some posts and some post information require login to crawl. You can set a Cookie to log in by modifying the `COOKIES` value in `v2ex_scrapy/settings.py`:

```python
COOKIES = """
a=b;c=d;e=f
"""
```
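
The string uses the same `name=value;name=value` format as the `Cookie` request header. As a rough, illustrative sketch (the project's own handling of `COOKIES` may differ), such a string can be turned into the dict that `scrapy.Request(cookies=...)` expects like this:

```python
# Illustrative only: turn an "a=b;c=d;e=f" style string into a cookies dict.
# The project's own parsing of COOKIES may differ.
def parse_cookie_string(raw: str) -> dict[str, str]:
    cookies = {}
    for pair in raw.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies


print(parse_cookie_string("a=b;c=d;e=f"))  # {'a': 'b', 'c': 'd', 'e': 'f'}
```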

#### Proxy

Change the value of `PROXIES` in `v2ex_scrapy/settings.py`, for example:

```python
PROXIES = [
    "http://127.0.0.1:7890"
]
```

Requests will randomly choose one of the proxies. If you need more advanced proxy handling, you can use a third-party library or implement a middleware yourself.
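
If you do roll your own, a minimal Scrapy downloader middleware that picks a random entry from `PROXIES` for each request could look like the sketch below. This only illustrates the standard `request.meta["proxy"]` mechanism; it is not the project's actual middleware.

```python
# Minimal sketch of a random-proxy downloader middleware; not the project's
# actual implementation. Enable it via DOWNLOADER_MIDDLEWARES in settings.py.
import random


class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's HttpProxyMiddleware honours this meta key.
            request.meta["proxy"] = random.choice(self.proxies)
```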

#### Log

Logging to a file is disabled by default. To enable it, uncomment this line in `v2ex_scrapy/settings.py`:

```python
LOG_FILE = "v2ex_scrapy.log"
```

### Run the Crawler

Crawl all posts, user information, and comments on the entire site:

```bash
scrapy crawl v2ex
```

Crawl posts, user information, and comments for a specific node. If `node-name` is empty, it crawls the "flamewar" node:

```bash
scrapy crawl v2ex-node node=${node-name}
```

Crawl user information, starting from uid=1 and crawling up to uid=635000.
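
The exact command for this spider is not shown in this excerpt. Assuming the member spider is named `v2ex-member` (a hypothetical name; run `scrapy list` to see the actual spider names), the invocation would look like:

```bash
# Hypothetical spider name; check `scrapy list` for the real one.
scrapy crawl v2ex-member
```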

> If you see `scrapy: command not found`, the Python package installation path has not been added to your environment variables.

### Resuming the Crawl

Simply run the crawl command again, and it will automatically continue crawling, skipping the posts that have already been crawled:

```bash
scrapy crawl v2ex
```

### Notes

If you encounter a 403 error during the crawling process, it is likely due to IP restrictions. Wait for a while before trying again.
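
If you still want the crawler to run for long stretches, Scrapy's built-in throttling settings can reduce the chance of hitting that 403. The values below are a conservative sketch using standard Scrapy settings, not values shipped with this project:

```python
# Conservative throttling sketch for v2ex_scrapy/settings.py; these are
# standard Scrapy settings, not the project's shipped configuration.
DOWNLOAD_DELAY = 1.0                   # wait between consecutive requests
AUTOTHROTTLE_ENABLED = True            # back off automatically when the site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # keep the average concurrency modest
```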

## Statistical Analysis

The SQL queries used for the statistics are in the [query.sql](query.sql) file, and the source code for the charts is in the [analysis](analysis) subproject, which includes a Python script that exports data to JSON for analysis and a frontend display project.
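
As an illustration of that export step, here is a hedged sketch (not the actual script in the `analysis` subproject): it runs one query in the spirit of `query.sql` against the SQLite database and dumps the rows to JSON. The file names and the `created` column are assumptions.

```python
# Hedged sketch of the "export query results to JSON" step; not the actual
# script in the analysis/ subproject. File names and columns are assumptions.
import json
import sqlite3

conn = sqlite3.connect("v2ex.sqlite3")

# Posts per year, assuming a `created` column holding UTC seconds on `topic`.
query = """
SELECT strftime('%Y', created, 'unixepoch') AS year, count(*) AS topics
FROM topic
GROUP BY year
ORDER BY year
"""

rows = [{"year": year, "topics": topics} for year, topics in conn.execute(query)]
with open("topics_per_year.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
conn.close()
```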
The first analysis can be found at <https://www.v2ex.com/t/954480>