You can actually use **Special:Export** to retrieve pages from *any* Wiki site. On the German Wiktionary, the tool is labelled **Spezial:Exportieren**, but it works the same way.
### Examples
**Exporting Pages from Any Wiki Site**
To access the XML content of the page titled "Austria" from English Wikipedia, you can construct your URL as follows.
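As a rough sketch (assuming the standard `Special:Export/<title>` URL pattern), the export URL can be built from the page title like this:

```python
title = "Austria"

# Special:Export/<title> returns the XML export of the given page
url = f"https://en.wikipedia.org/wiki/Special:Export/{title}"
print(url)  # https://en.wikipedia.org/wiki/Special:Export/Austria
```

Opening this URL in a browser shows (or downloads) the XML export of the page.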
For the German Wiktionary, the export tool uses `Spezial:Exportieren` instead of `Special:Export`.
To programmatically fetch and download XML content, you can use Python's `requests` library. This example shows how to build the URL, make a request, and get the XML content of a Wiktionary page by its title.
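Here is a minimal sketch of such a helper for the German Wiktionary; the exact `fetch` function defined in the earlier sections may differ in its details:

```python
import requests


def fetch(title: str) -> requests.Response:
    """Fetch the XML export of a German Wiktionary page by its title."""
    url = f"https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response
```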
Next, let us attempt to retrieve the XML content for the page titled "hoch" and print the initial 500 bytes for a glimpse of the XML content.
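Continuing the sketch above, this could look like:

```python
response = fetch("hoch")        # reuses the `fetch` helper sketched above
print(response.content[:500])   # first 500 bytes of the raw XML export
```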
[TOC]
## Importing Packages
```python exec="1" source="above" session="wiki"
import requests # to fetch info from URLs
import lxml.etree as ET  # to parse XML data
import mwparserfromhell # to parse and analyze wikitext
import re # to extract information using regular expressions
import functools # to implement caching with a decorator
```
## Get the Wikitext Data
We already know how to extract *wikitext* from **Dump** files and the **Special Export** tool. In this section, we will parse the *wikitext*.
We will use the page titled `stark` ([Wiktionary page](https://de.wiktionary.org/wiki/stark)) and the functions we created in the previous sections based on the **Special Export** method.
I added `@functools.cache` (optional) to avoid redundant requests and to be more respectful to the server: it stores the results of `fetch(title)`, so repeated calls with the same title return the cached response instead of requesting the page again from the wiki servers.
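A small sketch of what the caching changes in practice (the body of `fetch` here is a placeholder, not the exact function from the previous sections):

```python
import functools

import requests


@functools.cache
def fetch(title: str) -> requests.Response:
    url = f"https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}"
    return requests.get(url, timeout=30)


first = fetch("stark")   # performs the HTTP request
second = fetch("stark")  # answered from the cache, no new request
print(first is second)   # True: the very same Response object is returned
```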
- Finally, the **fourth-level** headings contain information on the translation of the word into different languages (`Übersetzungen`).
For my project, I only need the *German-to-German* dictionary. So, let us extract the *wikitext* for that heading. We can use the method `get_sections()`, which accepts a heading level as an argument. Passing level **2** will split the text into sections based on the second-level headings.
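As a small illustration with a made-up miniature wikitext (the real page is much longer), `get_sections(levels=[2])` behaves like this:

```python
import mwparserfromhell

sample = (
    "== stark ({{Sprache|Deutsch}}) ==\n"
    "German part of the entry...\n"
    "== stark ({{Sprache|Englisch}}) ==\n"
    "English part of the entry...\n"
)

wikicode = mwparserfromhell.parse(sample)
sections = wikicode.get_sections(levels=[2])  # one section per level-2 heading

for section in sections:
    print(section.filter_headings()[0])  # the heading that starts each section
```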
```python
print(type(root))  # <class 'lxml.etree._Element'>
```

=== "Result"

    ```pycon
    <class 'lxml.etree._ElementTree'>
    <class 'lxml.etree._Element'>
    ```
The parser returns an `ElementTree` object. We use the `getroot()` method to access the root `Element`.
## Displaying the XML Structure
The XML structure of the dump file is quite large, so printing the entire tree would not only be inefficient but also quite overwhelming. To make it more manageable, let us modify our `print_tags_tree` function.
We will add options to limit the number of children displayed for the root element and to control the depth of the tree.
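One possible implementation is sketched below; the parameter names `max_children` and `max_depth` are my assumptions, and the child limit is applied at every level here for simplicity:

```python
def print_tags_tree(element, level=0, max_children=None, max_depth=None):
    """Print the tag tree below `element`, optionally limiting width and depth."""
    for i, child in enumerate(element):
        if max_children is not None and i >= max_children:
            break  # show at most `max_children` children per element
        tag = child.tag.split('}')[-1]  # drop the XML namespace prefix
        print('    ' * level + f'{i} {tag}')
        if max_depth is None or level + 1 < max_depth:
            print_tags_tree(child, level + 1, max_children, max_depth)
```

For example, `print_tags_tree(root, max_children=3, max_depth=2)` would display only the first three children of the root and two levels of the tree.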
## Extracting Data
### `element.findall`
As with the previous section, we are interested in extracting the `page`, `title`, `ns`, and `text` tags.
The main difference in structure here is that we now have multiple `page` elements, and we want to extract all of them.
We cannot use `find`, because it will return only the first `page`. However, we can use the `findall` method instead, which will return a list of all `page` elements.
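A sketch of that extraction, assuming `root` is the root element parsed above and that we collect main-namespace pages into a dictionary called `dict_0` (the loop in the full document may differ):

```python
# the export XML uses a default namespace, so tag names must include it
ns_uri = root.tag.split('}')[0].lstrip('{')

dict_0 = {}
for page in root.findall(f'{{{ns_uri}}}page'):
    title = page.findtext(f'{{{ns_uri}}}title')
    page_ns = page.findtext(f'{{{ns_uri}}}ns')
    text = page.findtext(f'{{{ns_uri}}}revision/{{{ns_uri}}}text')
    if page_ns == '0':  # keep only main-namespace entries
        dict_0[title] = text

print(len(dict_0), 'pages collected')
```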
## Saving the Dictionary Locally
Once the dictionary is built, we save it locally using the `pickle` module, which allows us to store the dictionary in a serialized format. This way, we will not need to parse the XML file again in the future.
```python
import pickle

dict_file = DICT_PATH / f'wikidict_{ns}.pkl'
with open(dict_file, 'wb') as f:
    pickle.dump(dict_0, f)
```
## Loading Dictionary
The next time you need to retrieve *wikitext*, simply load the dictionary from the pickle file and look up the page title you need!
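A matching sketch for loading, assuming `dict_file` is the same path used when saving:

```python
import pickle

with open(dict_file, 'rb') as f:
    dict_0 = pickle.load(f)

wikitext = dict_0['stark']  # wikitext of the page titled 'stark'
print(wikitext[:200])       # a first look at the entry
```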