
Commit a4a51f4
Creating content for PyCon AT.
1 parent 2c65e5d commit a4a51f4

14 files changed: +6,671 −112 lines

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -4,7 +4,9 @@
 # Cache
 __pycache__/
 .pytest_cache/
-
+
+# Secrets
+.env

 # editors
 .vscode/

docs/Fetching XML data/Special Exports.md

Lines changed: 17 additions & 21 deletions
@@ -2,43 +2,44 @@ The **Special Export** tool fetches specific pages with their raw content (*wiki

 [TOC]

+## Importing Packages
+```python exec="true" source="above" session="requests"
+import requests # to fetch info from URLs
+```
+
 ## Using the **Special Export** Tool

-You can actually use **Special Export** to retrieve pages from *any* Wiki site. On the German Wiktionary, however, the tool is labelled **Spezial:Exportieren**, but it works the same way.
+You can use **Special:Export** to retrieve pages from *any* Wiki site. On the German Wiktionary, the tool is labelled **Spezial:Exportieren**, but it works the same way.

-### Examples

 **Exporting Pages from Any Wiki Site**

-To access the XML content of the page titled "Austria" from English Wikipedia, you can use the following Python code. When you press `run`, it will open the export link in your default browser:
-
-```pyodide session="webbrowser"
-import webbrowser
+To access the XML content of the page titled "Austria" from English Wikipedia, you can construct the URL as follows:

+```python exec="true" source="tabbed-left" result="pycon" session="manual"
 title = 'Austria'
 domain = 'en.wikipedia.org'
 url = f'https://{domain}/wiki/Special:Export/{title}'
-webbrowser.open_new_tab(url)
+print(url)
 ```

 **Exporting Pages from the German Wiktionary**

-For the German Wiktionary, the export tool uses `Spezial:Exportieren` instead of `Special:Export`. You can use similar Python code to open the export link for the page titled "schön" (German for "beautiful"):
+For the German Wiktionary, the export tool uses `Spezial:Exportieren` instead of `Special:Export`.

-```pyodide session="webbrowser"
-title = 'schön'
+```python exec="true" source="tabbed-left" result="pycon" session="manual"
+title = 'hoch'
 domain = 'de.wiktionary.org'
 url = f'https://{domain}/wiki/Spezial:Exportieren/{title}'
-webbrowser.open_new_tab(url)
+print(url)
 ```

-## Using the `requests` Library
+## Fetching XML Data with `requests`
+

 To programmatically fetch and download XML content, you can use Python's `requests` library. This example shows how to build the URL, make a request, and get the XML content of a Wiktionary page by its title.

 ```python exec="true" source="above" session="requests"
-import requests
-
 def fetch(title):
     # Construct the URL for the XML export of the given page title
     url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'
@@ -50,22 +51,17 @@
     resp.raise_for_status()

     # Return the XML content of the requested page
-    return resp.content
+    return resp.text
 ```

-Next, let us attempt to retrieve the XML content for the page titled "hoch" and print the initial 500 bytes for a glimpse of the XML content displayed in the `Result` tab.
+Next, let us attempt to retrieve the XML content for the page titled "hoch" and print the first 500 characters for a glimpse of the XML content.


 ```python exec="true" source="tabbed-left" result="pycon" session="requests"
 page = fetch('hoch')
 print(page[:500])
 ```

-<!-- Which will return
-```xml
-b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="de">\n <siteinfo>\n <sitename>Wiktionary</sitename'
-```
--->
 We will continue to use the `fetch` function throughout this tutorial.

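A minimal usage sketch (not part of this commit), assuming the `fetch` helper defined above; saving the export locally lets you re-parse it later without another request, and the file name `hoch.xml` is just an example:

```python
import requests
from pathlib import Path

def fetch(title):
    # Same helper as above: request the XML export for a page title
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

xml_text = fetch('hoch')
Path('hoch.xml').write_text(xml_text, encoding='utf-8')  # example file name
```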

docs/Parsing Wikitext/Parsing Wikitext.md

Lines changed: 34 additions & 12 deletions
@@ -1,28 +1,51 @@
 [TOC]

+## Importing Packages
+```python exec="1" source="above" session="wiki"
+import requests # to fetch info from URLs
+import lxml.etree as ET # to parse XML data
+import mwparserfromhell # to parse and analyze wikitext
+import re # to extract information using regular expressions
+import functools # to implement caching with a decorator
+```
+
 ## Get the Wikitext Data

-We already know how to extract *wikitext* from dump files and the special exports tool. In this section, we will parse this *wikitext*.
+We already know how to extract *wikitext* from **Dump** files and the **Special Export** tool. In this section, we will parse the *wikitext*.

-We will use the word `stark` as an example ([link to the wiktionary page](https://de.wiktionary.org/wiki/stark)). We will retrieve the *wikitext* for the page `stark` from my GitHub repository so that we have the same version of the page. However, you can use either of the two methods we have learned so far to retrieve the *wikitext*.
+We will use the page titled `stark` ([Wiktionary page](https://de.wiktionary.org/wiki/stark)) and the functions we created in the previous sections based on the **Special Export** method.

+```python exec="1" source="above" session="wiki"
+@functools.cache
+def fetch(title):
+    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'
+    resp = requests.get(url)
+    resp.raise_for_status()
+    return resp.text
+
+def fetch_wikitext(title):
+    xml_content = fetch(title)
+    root = ET.fromstring(xml_content)
+    namespaces = root.nsmap
+    page = root.find('page', namespaces)
+    wikitext = page.find('revision/text', namespaces)
+    return wikitext.text

-```python exec="1" source="tabbed-left" result="pycon" session="wiki"
-import requests
-title = 'stark'
-url = f'https://raw.githubusercontent.com/lennon-c/python-wikitext-parser-guide/refs/heads/main/docs/data/{title}.txt'
-resp = requests.get(url)
-wikitext = resp.text
+```
+
+I added `@functools.cache` (optional) to avoid redundant requests and to be more respectful to the server. The decorator stores the results of `fetch(title)`, so repeated calls with the same title return the cached response instead of requesting the page again from the wiki servers.

-print(wikitext[:500])
+
+```python exec="1" source="tabbed-left" result="pycon" session="wiki"
+wikitext = fetch_wikitext('stark')
+print(wikitext[:1000])
 ```

 ## Parsing Wikitext

 First, we need to import `mwparserfromhell`. Then, we use the `parse` function and pass in our wikitext, which will return a `Wikicode` object.

 ```python exec="1" source="tabbed-left" result="pycon" session="wiki"
-import mwparserfromhell
 parsed = mwparserfromhell.parse(wikitext)
 print(type(parsed)) # <class 'mwparserfromhell.wikicode.Wikicode'>
 ```
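A small, hedged check (not part of the commit) that the cache is doing its job: `functools.cache` exposes `cache_info()`, so you can confirm that a repeated call is served from memory rather than from the wiki servers.

```python
import functools
import requests

@functools.cache
def fetch(title):
    # Same cached helper as above
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

fetch('stark')             # first call hits the server (a cache miss)
fetch('stark')             # second call is answered from the cache (a hit)
print(fetch.cache_info())  # e.g. CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)
```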
@@ -75,7 +98,7 @@ print_headings_tree(parsed)

 - Finally, the **fourth-level** headings contain information on the translation of the word into different languages (`Übersetzungen`).

-For my project, I only need the *German-to-German* dictionary. So, let us extract the *wikitext* for that heading. We can use the method `get_sections()`, which accepts a heading level as an argument. Passing **level 2** will split the text into sections based on the second-level headings.
+For my project, I only need the *German-to-German* dictionary. So, let us extract the *wikitext* for that heading. We can use the method `get_sections()`, which accepts a heading level as an argument. Passing level **2** will split the text into sections based on the second-level headings.

 ```python exec="1" source="tabbed-left" result="pycon" session="wiki"
 sections = parsed.get_sections(levels=[2])
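A self-contained, hedged sketch (not from the commit) of what `get_sections(levels=[2])` does, using a tiny hand-made wikitext snippet instead of the full `stark` page:

```python
import mwparserfromhell

demo = (
    "== Adjektiv ==\n"
    "entry text A\n"
    "== Adverb ==\n"
    "entry text B\n"
)
parsed_demo = mwparserfromhell.parse(demo)
for section in parsed_demo.get_sections(levels=[2]):
    # Each section starts with its own second-level heading
    heading = section.filter_headings()[0]
    print(str(heading.title).strip())  # Adjektiv, then Adverb
```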
@@ -291,7 +314,6 @@ Putting everything together, we get the following pattern:
 Let us try it using the `re.search` method:

 ```python exec="1" source="tabbed-left" result="pycon" session="wiki"
-import re
 # Define the pattern
 pattern = r'\n\n\{\{Bedeutungen\}\}\n(.+?)\n\n'

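For readers skimming the diff, a hedged, self-contained illustration (not from the commit) of the pattern above applied to an invented minimal snippet:

```python
import re

pattern = r'\n\n\{\{Bedeutungen\}\}\n(.+?)\n\n'
snippet = "\n\n{{Bedeutungen}}\n:[1] kräftig, stark gebaut\n\nnext block"
match = re.search(pattern, snippet)
if match:
    print(match.group(1))  # :[1] kräftig, stark gebaut
```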

docs/Parsing XML/Parsing XML from Dump file.md

Lines changed: 15 additions & 36 deletions
@@ -1,13 +1,13 @@
 [TOC]

+## Importing Packages
+```python
+from pathlib import Path
+import lxml.etree as ET # to parse XML documents
+import pickle # to store the dictionary locally
+```

-As in the previous section, we begin by importing the `lxml.etree` module:
-
-```python exec="true" source="above" session="dump"
-import lxml.etree as ET
-```
-
-### Setting Up Paths
+## Setting Up Paths on Your Local Machine

 To follow along in this section:

@@ -18,32 +18,18 @@ To follow along in this section:
 - Therefore, do not forget to specify in which folder the dictionary should be saved in `DICT_PATH`.

 ```python
-from pathlib import Path
-
 # Specify your own paths
 XML_FILE = Path(r'path\to\xml\dewiktionary-20241020-pages-articles-multistream.xml')
 DICT_PATH = Path(r'path\to\dict')
 ```

-<!-- ```python exec="1" session="dump"
-from pathlib import Path
-
-# Specify your own paths
-XML_FILE = Path(r'D:\Dropbox\Python\My_packages\de_wiktio\data\dewiktionary-20241020-pages-articles-multistream.xml')
-DICT_PATH = Path(r"D:\Dropbox\Python\My_packages\de_wiktio\out")
-``` -->
-
-### Parsing the XML File
-
+## Parsing the XML File
 Since we are working with a file, we cannot use the `ET.fromstring` function to parse the XML content. Instead, we must use the `ET.parse` function.

 Note that this process can take some time. On my computer, it takes approximately 42 seconds to load the entire XML tree.

-<!-- ```python exec="1" source="tabbed-left" result="pycon" session="dump" -->
 === "Source"
     ```python
-    import lxml.etree as ET
-
     # ET.parse for a xml file
     tree = ET.parse(XML_FILE)
     print(type(tree)) # lxml.etree._ElementTree
@@ -52,7 +38,6 @@ Note that this process can take some time. On my computer, it takes approximatel
     print(type(root)) # <class 'lxml.etree._Element'>
     ```
 === "Result"
-
     ```pycon
     <class 'lxml.etree._ElementTree'>
     <class 'lxml.etree._Element'>
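If you want to check the 42-second figure on your own machine, here is a minimal, hedged timing sketch (not part of the commit); `XML_FILE` is the placeholder path from the tutorial and must point at your downloaded dump:

```python
import time
from pathlib import Path
import lxml.etree as ET

XML_FILE = Path(r'path\to\xml\dewiktionary-20241020-pages-articles-multistream.xml')  # placeholder

start = time.perf_counter()
tree = ET.parse(str(XML_FILE))   # parses the whole dump into memory
root = tree.getroot()
print(f'parsed in {time.perf_counter() - start:.1f} s, root tag: {root.tag}')
```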
@@ -62,15 +47,14 @@ Note that this process can take some time. On my computer, it takes approximatel
 The parser returns an `ElementTree` object. We use the `getroot()` method to access the root `Element`.


-### Displaying the XML Structure
+## Displaying the XML Structure

 The XML structure of the dump file is quite large, so printing the entire tree would not only be inefficient but also quite overwhelming. To make it more manageable, let us modify our `print_tags_tree` function.

 We will add options to limit the number of children displayed for the root element and to control the depth of the tree.

 Here is our updated `print_tags_tree` function:

-<!-- ```python exec="1" source="above" session="dump" -->
 ```python
 def print_tags_tree(elem, level=0, only_tagnames=False, max_children=5, max_level=5):
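The function body falls outside this hunk, so here is one possible implementation, hedged: it matches the signature above and the indented `level tag` output shown in the `Result` tabs, but it is an illustration, not necessarily the commit's exact code.

```python
import lxml.etree as ET

def print_tags_tree(elem, level=0, only_tagnames=False, max_children=5, max_level=5):
    # Stop once the requested depth is exceeded
    if level > max_level:
        return
    # Show either the bare tag name or the fully qualified (namespaced) tag
    tag = ET.QName(elem).localname if only_tagnames else elem.tag
    print('    ' * level + f'{level} {tag}')
    # Recurse into at most `max_children` children per element
    for child in list(elem)[:max_children]:
        print_tags_tree(child, level + 1, only_tagnames, max_children, max_level)
```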

@@ -88,7 +72,6 @@ def print_tags_tree(elem, level=0, only_tagnames=False, max_children=5, max_leve

 To display only the first 5 direct children of the root element and limit the tree to the first level:

-<!-- ```python exec="1" source="tabbed-left" result="pycon" session="dump" -->
 === "Source"
     ```python
     print_tags_tree(root, only_tagnames=True, max_children=5, max_level=1)
@@ -106,7 +89,6 @@ To view the first 3 children of the root element and display two levels of the t

 To view the first 3 children of the root element and display two levels of the tree:

-<!-- ```python exec="1" source="tabbed-left" result="pycon" session="dump" -->
 === "Source"
     ```python
     print_tags_tree(root, only_tagnames=True, max_children=3, max_level=2)
@@ -133,15 +115,16 @@ To view the first 3 children of the root element and display two levels of the t
     2 revision
     ```

-### Extracting Data
+## Extracting Data
+
+### `element.findall`

 As with the previous section, we are interested in extracting the `page`, `title`, `ns`, and `text` tags.

 The main difference in structure here is that we now have multiple `page` elements, and we want to extract all of them.

 We cannot use `find`, because it will return only the first `page`. However, we can use the `findall` method instead, which will return a list of all `page` elements.

-<!-- ```python exec="1" source="tabbed-left" result="pycon" session="dump" -->
 === "Source"
     ```python
     NAMESPACES = root.nsmap
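A hedged, self-contained sketch (not part of the commit) of the `findall` pattern this passage describes. It mirrors the tutorial's own namespace handling (`root.nsmap` passed to `find`/`findall`, which relies on a reasonably recent lxml for the default-namespace lookup) and uses a tiny in-memory stand-in for the dump:

```python
import lxml.etree as ET

# Tiny stand-in for the dump, with the same default-namespace layout
xml = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    '<page><title>hoch</title><ns>0</ns><revision><text>wikitext A</text></revision></page>'
    '<page><title>stark</title><ns>0</ns><revision><text>wikitext B</text></revision></page>'
    '</mediawiki>'
)
root = ET.fromstring(xml)
namespaces = root.nsmap                       # {None: 'http://www.mediawiki.org/xml/export-0.11/'}

pages = root.findall('page', namespaces)      # every <page> element, not just the first
print(len(pages))                             # 2
print(pages[0].find('title', namespaces).text)  # hoch
```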
@@ -161,7 +144,7 @@ We will create a dictionary, `dict_0`, using page titles as keys and their *wiki

 This process may take a couple of minutes!

-<!-- ```python exec="1" source="above" session="dump" -->
+
 ```python
 ns = '0'
 dict_0 = dict()
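The loop itself lies outside this hunk, so here is a hedged sketch (not necessarily the commit's exact code) of the step the text describes: keep pages whose namespace is `'0'` and map each title to its wikitext. It assumes `pages` and `NAMESPACES` from the `findall` step above.

```python
ns = '0'
dict_0 = dict()
for page in pages:
    # Keep only main-namespace pages (<ns>0</ns>)
    if page.find('ns', NAMESPACES).text == ns:
        title = page.find('title', NAMESPACES).text
        text = page.find('revision/text', NAMESPACES).text
        dict_0[title] = text
print(len(dict_0))  # number of main-namespace pages collected
```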
@@ -175,7 +158,6 @@
 To check that our dictionary is correctly populated, let us print out part of the *wikitext* for a sample page:

-<!-- ```python exec="1" source="tabbed-left" result="pycon" session="dump" -->
 === "Source"
     ```python
     print(dict_0['schön'][:300])
@@ -196,24 +178,21 @@ To check that our dictionary is correctly populated, let us print out part of th
     ```

-### Saving the Dictionary Locally
+## Saving the Dictionary Locally

 Once the dictionary is built, we save it locally using the `pickle` module, which allows us to store the dictionary in a serialized format. This way, we will not need to parse the XML file again in the future.

 ```python
-import pickle
-
 dict_file = DICT_PATH / f'wikidict_{ns}.pkl'

 with open(dict_file, 'wb') as f:
     pickle.dump(dict_0, f)
 ```

-### Loading Dictionary
+## Loading the Dictionary

 The next time you need to retrieve *wikitext*, simply load the dictionary from the pickle file and select the title page you need!

-<!-- ```python exec="1" source="tabbed-left" result="pycon" session="dump" -->
 === "Source"
     ```python
     import pickle
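The `Source` block above is cut off by the end of the diff; a hedged sketch of the loading step it introduces (assuming `DICT_PATH` and `ns = '0'` from earlier on this page) would look like this:

```python
import pickle
from pathlib import Path

DICT_PATH = Path(r'path\to\dict')            # placeholder path from the tutorial
dict_file = DICT_PATH / 'wikidict_0.pkl'     # the file written in the saving step

with open(dict_file, 'rb') as f:
    dict_0 = pickle.load(f)

print(dict_0['schön'][:300])                 # part of the wikitext for a sample page
```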
