
Commit e61e5d0

Initial commit.

25 files changed: 4457 additions & 0 deletions

.github/workflows/ci.yaml

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
name: ci
on:
  push:
    branches:
      # - master
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - uses: actions/setup-python@v5
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v4
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install mkdocs-material
      - run: pip install mkdocstrings-python
      - run: pip install markdown-exec[ansi]
      - run: pip install mkdocs-open-in-new-tab
      - run: pip install requests
      - run: pip install lxml
      - run: pip install mwparserfromhell
      - run: mkdocs gh-deploy --force

.gitignore

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# MkDocs documentation
/site/

# Cache
__pycache__/
.pytest_cache/


# editors
.vscode/

# Test
/tests/

# Python Built
/build/
/dist/

# temporary folders
temp/
analysis/
docs/attachments
docs/Sandbox.md
pyodide.md

# Data
/data/
out/

docs/Fetching XML data/Dump files.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
**Dump File Fetching:** This is a static snapshot of all wiki pages, which is stored in a single compressed file (e.g., `dewiktionary-latest-pages-articles.xml.bz2`).

[TOC]

## German Wiktionary Dump Files

### Latest Version

- To download the latest dump file of the German Wiktionary, click [here](https://dumps.wikimedia.org/dewiktionary/latest/dewiktionary-latest-pages-articles-multistream.xml.bz2).
- This will download the compressed file: `dewiktionary-latest-pages-articles-multistream.xml.bz2`.
- The file is stored in this directory: [https://dumps.wikimedia.org/dewiktionary/latest/](https://dumps.wikimedia.org/dewiktionary/latest/).
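If you prefer to fetch the dump from a script instead of the browser, here is a minimal sketch using `requests`; the URL matches the link above, while the local file name and chunk size are arbitrary choices of this sketch.

```python
import requests

# Latest German Wiktionary dump (same link as above)
url = ('https://dumps.wikimedia.org/dewiktionary/latest/'
       'dewiktionary-latest-pages-articles-multistream.xml.bz2')

# Stream the response so the whole archive never has to fit into memory at once
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open('dewiktionary-latest-pages-articles-multistream.xml.bz2', 'wb') as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB per chunk
            fh.write(chunk)
```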
### Older Versions

- If you need an older version of the Wiktionary dump, visit this directory: [https://dumps.wikimedia.org/dewiktionary/](https://dumps.wikimedia.org/dewiktionary/).
- To download a specific version:
    - Navigate to the folder for the desired date.
    - A new window will open.
    - Look for the section titled **Articles, templates, media/file descriptions, and primary meta-pages**.
    - Select the file. The file name will follow the pattern: `dewiktionary-YYYYMMDD-pages-articles.xml.bz2`, where `YYYYMMDD` represents the dump date.
- **Notes**:
    - Older dumps are only retained for a few months.
    - You can also fetch the latest version from this directory by choosing the most recent date.

## Any Wiki Dump File

- Click on **[Database backup dumps](https://dumps.wikimedia.org/backup-index.html)**.
- Scroll down the page to find the domain of interest.
    - For example, use `enwiktionary` for the English Wiktionary.
- Click on the domain link, then look for the section titled **Articles, templates, media/file descriptions, and primary meta-pages**.
- The file you are looking for should end with `-pages-articles.xml.bz2`.
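If you would rather build those links programmatically, the helpers below reproduce the naming patterns described in this section; treat the directory layout as an assumption and verify it against the backup index for your wiki.

```python
def latest_dump_url(db: str) -> str:
    # Assumed directory layout, mirroring the "latest" links used in this section
    return f'https://dumps.wikimedia.org/{db}/latest/{db}-latest-pages-articles.xml.bz2'

def dated_dump_url(db: str, yyyymmdd: str) -> str:
    # Older versions live in per-date folders and follow the file name pattern above
    return f'https://dumps.wikimedia.org/{db}/{yyyymmdd}/{db}-{yyyymmdd}-pages-articles.xml.bz2'

print(latest_dump_url('enwiktionary'))                # English Wiktionary
print(dated_dump_url('dewiktionary', '20240501'))     # hypothetical dump date
```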
## Should I download a multistream dump file or not?

The files `-pages-articles.xml.bz2` and `multistream-pages-articles.xml.bz2` contain the same information. For our purposes here, either option will work just fine because we are working with a relatively small wiki database.

However, if you plan to work with a larger dump file in the future that exceeds your computer's memory capacity, you could download the `multistream-pages-articles.xml.bz2` version. This would allow you to adjust your parsing strategy to process the data in smaller chunks.
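As a rough illustration of that chunked strategy, the sketch below streams `<page>` elements out of a compressed dump with `bz2` and `lxml` instead of loading the whole file; the local file name and the export namespace version are assumptions (check the `<mediawiki>` root tag of your dump), and the sketch does not use the multistream index itself.

```python
import bz2
from lxml import etree

# Export namespace -- an assumption; check the <mediawiki> root tag of your dump
NS = '{http://www.mediawiki.org/xml/export-0.11/}'

# Hypothetical local file name; use the dump you actually downloaded
DUMP = 'dewiktionary-latest-pages-articles-multistream.xml.bz2'

with bz2.open(DUMP, 'rb') as fh:
    # iterparse yields one complete <page> element at a time, so memory usage stays small
    for count, (_event, page) in enumerate(etree.iterparse(fh, tag=NS + 'page')):
        print(page.findtext(NS + 'title'))
        page.clear()  # drop the processed element to keep memory bounded
        if count >= 4:  # only peek at the first few pages in this sketch
            break
```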
docs/Fetching XML data/Special Exports.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
The **Special Export** tool fetches specific pages with their raw content (*wikitext*) in real time, without needing to download the entire dataset. The content is provided in XML format.

[TOC]

## Using the **Special Export** Tool

You can use **Special Export** to retrieve pages from *any* Wiki site. On the German Wiktionary the tool is labelled **Spezial:Exportieren**, but it works the same way.

### Examples

**Exporting Pages from Any Wiki Site**

To access the XML content of the page titled "Austria" from the English Wikipedia, you can use the following Python code. When you press `run`, it will open the export link in your default browser:

```pyodide session="webbrowser"
import webbrowser

title = 'Austria'
domain = 'en.wikipedia.org'
url = f'https://{domain}/wiki/Special:Export/{title}'
webbrowser.open_new_tab(url)
```

**Exporting Pages from the German Wiktionary**

For the German Wiktionary, the export tool uses `Spezial:Exportieren` instead of `Special:Export`. You can use similar Python code to open the export link for the page titled "schön" (German for "beautiful"):

```pyodide session="webbrowser"
title = 'schön'
domain = 'de.wiktionary.org'
url = f'https://{domain}/wiki/Spezial:Exportieren/{title}'
webbrowser.open_new_tab(url)
```

## Using the `requests` Library

To fetch and download XML content programmatically, you can use Python's `requests` library. This example shows how to build the URL, make a request, and get the XML content of a Wiktionary page by its title.

```python exec="true" source="above" session="requests"
import requests

def fetch(title):
    # Construct the URL for the XML export of the given page title
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'

    # Send a GET request
    resp = requests.get(url)

    # Check if the request was successful, and raise an error if not
    resp.raise_for_status()

    # Return the XML content of the requested page
    return resp.content
```

Next, let us retrieve the XML content for the page titled "hoch" and print the first 500 bytes to get a glimpse of the XML shown in the `Result` tab.

```python exec="true" source="tabbed-left" result="pycon" session="requests"
page = fetch('hoch')
print(page[:500])
```

<!-- Which will return
```xml
b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="de">\n <siteinfo>\n <sitename>Wiktionary</sitename'
```
-->
We will continue to use the `fetch` function throughout this tutorial.
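As a rough preview of what those bytes contain, the sketch below parses the export XML with `lxml` and pulls out the page title and raw wikitext; the namespace string comes from the sample output above, and this is only an illustration, not necessarily how later parts of the tutorial parse the data.

```python
from lxml import etree

# Namespace string taken from the sample output above (export format 0.11)
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.11/'}

page = fetch('hoch')           # bytes returned by the fetch() function defined above
root = etree.fromstring(page)  # parse the export XML

title = root.findtext('.//mw:page/mw:title', namespaces=NS)
wikitext = root.findtext('.//mw:page/mw:revision/mw:text', namespaces=NS)

print(title)
print(wikitext[:200])  # first 200 characters of the raw wikitext
```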

docs/Fetching XML data/index.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
---
title: Fetching XML
---

In the first section, we will cover two ways of accessing the XML files that contain *Wikitext*.

- First, we will access them online using the Wiki [Special Export tool](Special Exports.md).
- Next, we will learn where to find the Wiki [Dump File](Dump files.md).

Note that you can use the `Previous` and `Next` links in the footer to navigate back and forth through this hands-on tutorial.

Let us begin by exploring the [Special Export tool](Special Exports.md) method.
