[中文](./README.md) | English
Translated by ChatGPT

# A Web Crawler for v2ex.com

A small crawler written to learn Scrapy, used to scrape data from v2ex.com.

The data is stored in an SQLite database for easy sharing; the total database size is 3.7GB.

I have released the complete SQLite database file on GitHub.

**Database Update: 2023-07-22-full**

It contains all post data, including hot topics. Both post and comment content are now scraped in their original HTML format. Additionally, `topic` has a new field `reply_count`, and `comment` has a new field `no`.

## It is not recommended to run the crawler, as the data is already available

The crawling process took several dozen hours because rapid crawling can result in IP banning, and I didn't use a proxy pool. Setting the concurrency to 3 should allow for continuous crawling.

[Download the database](https://github.com/oldshensheep/v2ex_scrapy/releases)

## Explanation of Crawled Data

The crawler starts crawling from `topic_id = 1`, and the path is `https://www.v2ex.com/t/{topic_id}`. The server might return 404/403/302/200 status codes. A 404 indicates that the post has been deleted, 403 indicates that the crawler has been restricted, 302 is usually a redirection to the login page or homepage, and 200 indicates a normal page.
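
For readers unfamiliar with how a Scrapy spider can tell these cases apart, here is a minimal, hypothetical sketch (not the project's actual spider code): it disables automatic redirects and whitelists the non-200 codes so the callback can inspect `response.status`.

```python
# Hypothetical sketch, not the project's actual spider: shows how a Scrapy
# callback could observe 404/403/302 instead of having them filtered out.
import scrapy


class TopicStatusDemoSpider(scrapy.Spider):
    name = "topic-status-demo"
    # Let these statuses reach the callback instead of the error handler.
    handle_httpstatus_list = [302, 403, 404]

    def start_requests(self):
        for topic_id in range(1, 10):
            yield scrapy.Request(
                f"https://www.v2ex.com/t/{topic_id}",
                callback=self.parse_topic,
                meta={"dont_redirect": True},  # keep the 302 visible
                cb_kwargs={"topic_id": topic_id},
            )

    def parse_topic(self, response, topic_id):
        if response.status == 404:
            self.logger.info("topic %s was deleted", topic_id)
        elif response.status == 403:
            self.logger.warning("blocked by the server, slow down or wait")
        elif response.status == 302:
            self.logger.info("topic %s needs login (redirected)", topic_id)
        else:  # 200, a normal page
            yield {"topic_id": topic_id, "title": response.css("h1::text").get()}
```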

The crawler fetches post content, comments, and user information during the crawling process.

Note 1: I realized halfway through that I had missed crawling the postscripts of posts, which are crawled starting from `topic_id = 448936`.

Note 2: The total number of users obtained from `select count(*) from member` is relatively small, about 200,000. This is because users are crawled based on comments and posts: if a user has neither commented nor posted anything, their account won't be crawled. Some posts may not be accessible, which can also leave some accounts uncrawled, and accounts that have been deleted weren't crawled either. (The code has since been changed so that these can be crawled, but the crawl has already finished.)

Note 3: All times are in UTC+0, in seconds.

Note 4: Apart from primary keys and unique indexes, there are no other indexes in the database.
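
Once the database is downloaded, it can be explored directly with Python's built-in `sqlite3` module. The sketch below is only an illustration of the notes above: the file name `v2ex.sqlite3` and the `created` column on the `topic` table are assumptions, while the `member` count follows Note 2 and the UTC-seconds conversion follows Note 3.

```python
# Exploratory sketch for the released database. The file name "v2ex.sqlite3"
# and the "created" column on the topic table are assumptions for illustration.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("v2ex.sqlite3")

# Note 2: the member table only holds users seen in posts/comments (~200k rows).
(member_count,) = conn.execute("SELECT count(*) FROM member").fetchone()
print(f"members crawled: {member_count}")

# Note 3: times are stored as UTC+0 seconds, so convert them explicitly.
row = conn.execute("SELECT id, created FROM topic LIMIT 1").fetchone()
if row:
    topic_id, created = row
    print(topic_id, datetime.fromtimestamp(created, tz=timezone.utc))

# Note 4: only primary keys and unique indexes exist, so add your own
# indexes before running heavy analytical queries.
conn.execute("CREATE INDEX IF NOT EXISTS idx_topic_created ON topic(created)")
conn.close()
```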

The default concurrency is set to 1. To change it, modify `CONCURRENT_REQUESTS`.
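
For example, in `v2ex_scrapy/settings.py` (the value 3 reflects the rate the author reports as sustainable without a proxy pool; treat it as a starting point, not the shipped default):

```python
# Sketch for v2ex_scrapy/settings.py: standard Scrapy setting; 3 is the
# concurrency the author found sustainable, not necessarily the default.
CONCURRENT_REQUESTS = 3
```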

#### Cookie

Some posts and some post information require login to crawl. You can set a Cookie to log in by modifying the `COOKIES` value in `v2ex_scrapy/settings.py`:

```python
COOKIES = """
a=b;c=d;e=f
"""
```
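
The string uses the same `name=value;name=value` format as the `Cookie` request header. As a rough, illustrative sketch (the project's own handling of `COOKIES` may differ), such a string can be turned into the dict that `scrapy.Request(cookies=...)` expects like this:

```python
# Illustrative only: turn an "a=b;c=d;e=f" style string into a cookies dict.
# The project's own parsing of COOKIES may differ.
def parse_cookie_string(raw: str) -> dict[str, str]:
    cookies = {}
    for pair in raw.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies


print(parse_cookie_string("a=b;c=d;e=f"))  # {'a': 'b', 'c': 'd', 'e': 'f'}
```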

#### Proxy

Change the value of `PROXIES` in `v2ex_scrapy/settings.py`, for example:

```python
PROXIES = [
    "http://127.0.0.1:7890"
]
```

Requests will randomly choose one of the proxies. If you need more advanced proxy handling, you can use a third-party library or implement a middleware yourself.
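
If you do roll your own, a minimal Scrapy downloader middleware that picks a random entry from `PROXIES` for each request could look like the sketch below. This only illustrates the standard `request.meta["proxy"]` mechanism; it is not the project's actual middleware.

```python
# Minimal sketch of a random-proxy downloader middleware; not the project's
# actual implementation. Enable it via DOWNLOADER_MIDDLEWARES in settings.py.
import random


class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's HttpProxyMiddleware honours this meta key.
            request.meta["proxy"] = random.choice(self.proxies)
```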

#### Log

Logging to a file is disabled by default. To enable it, uncomment this line in `v2ex_scrapy/settings.py`:

```python
LOG_FILE = "v2ex_scrapy.log"
```

### Run the Crawler

Crawl all posts, user information, and comments on the entire site:

```bash
scrapy crawl v2ex
```

Crawl posts, user information, and comments for a specific node. If `node-name` is empty, it crawls the "flamewar" node:

```bash
scrapy crawl v2ex-node node=${node-name}
```

Crawl user information, starting from uid=1 and crawling up to uid=635000.
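
The exact command for this spider is not shown in this excerpt. Assuming the member spider is named `v2ex-member` (a hypothetical name; run `scrapy list` to see the actual spider names), the invocation would look like:

```bash
# Hypothetical spider name; check `scrapy list` for the real one.
scrapy crawl v2ex-member
```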

> If you see `scrapy: command not found`, the Python package installation path has not been added to your environment variables.

### Resuming the Crawl

Simply run the crawl command again, and it will automatically continue crawling, skipping the posts that have already been crawled:

```bash
scrapy crawl v2ex
```

### Notes

If you encounter a 403 error during the crawling process, it is likely due to IP restrictions. Wait for a while before trying again.
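
If you still want the crawler to run for long stretches, Scrapy's built-in throttling settings can reduce the chance of hitting that 403. The values below are a conservative sketch using standard Scrapy settings, not values shipped with this project:

```python
# Conservative throttling sketch for v2ex_scrapy/settings.py; these are
# standard Scrapy settings, not the project's shipped configuration.
DOWNLOAD_DELAY = 1.0                   # wait between consecutive requests
AUTOTHROTTLE_ENABLED = True            # back off automatically when the site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # keep the average concurrency modest
```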

## Statistical Analysis

The SQL queries used for the statistics are in the [query.sql](query.sql) file, and the source code for the charts is in the [analysis](analysis) subproject, which includes a Python script that exports data to JSON for analysis and a frontend display project.
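
As an illustration of that export step, here is a hedged sketch (not the actual script in the `analysis` subproject): it runs one query in the spirit of `query.sql` against the SQLite database and dumps the rows to JSON. The file names and the `created` column are assumptions.

```python
# Hedged sketch of the "export query results to JSON" step; not the actual
# script in the analysis/ subproject. File names and columns are assumptions.
import json
import sqlite3

conn = sqlite3.connect("v2ex.sqlite3")

# Posts per year, assuming a `created` column holding UTC seconds on `topic`.
query = """
SELECT strftime('%Y', created, 'unixepoch') AS year, count(*) AS topics
FROM topic
GROUP BY year
ORDER BY year
"""

rows = [{"year": year, "topics": topics} for year, topics in conn.execute(query)]
with open("topics_per_year.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
conn.close()
```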
The first analysis can be found at <https://www.v2ex.com/t/954480>