You can actually use **Special:Export** to retrieve pages from *any* Wiki site. On the German Wiktionary, the tool is labelled **Spezial:Exportieren**, but it works the same way.
### Examples
**Exporting Pages from Any Wiki Site**
To access the XML content of the page titled "Austria" from English Wikipedia, you can construct your URL as follows.
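As a rough sketch (assuming the standard `Special:Export/<title>` URL pattern), the export URL can be built from the page title like this:

```python
title = "Austria"

# Special:Export/<title> returns the XML export of the given page
url = f"https://en.wikipedia.org/wiki/Special:Export/{title}"
print(url)  # https://en.wikipedia.org/wiki/Special:Export/Austria
```

Opening this URL in a browser shows (or downloads) the XML export of the page.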
For the German Wiktionary, the export tool uses `Spezial:Exportieren` instead of `Special:Export`.
To programmatically fetch and download XML content, you can use Python's `requests` library. This example shows how to build the URL, make a request, and get the XML content of a Wiktionary page by its title.
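Here is a minimal sketch of such a helper for the German Wiktionary; the exact `fetch` function defined in the earlier sections may differ in its details:

```python
import requests


def fetch(title: str) -> requests.Response:
    """Fetch the XML export of a German Wiktionary page by its title."""
    url = f"https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response
```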
Next, let us attempt to retrieve the XML content for the page titled "hoch" and print the initial 500 bytes for a glimpse of the XML content.
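Continuing the sketch above, this could look like:

```python
response = fetch("hoch")        # reuses the `fetch` helper sketched above
print(response.content[:500])   # first 500 bytes of the raw XML export
```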
[TOC]
## Importing Packages
```python exec="1" source="above" session="wiki"
import requests # to fetch info from URLs
import lxml.etree as ET  # to parse XML data
import mwparserfromhell # to parse and analyze wikitext
import re # to extract information using regular expressions
import functools # to implement caching with a decorator
```
## Get the Wikitext Data
We already know how to extract *wikitext* from **Dump** files and the **Special Export** tool. In this section, we will parse the *wikitext*.
We will use the page titled `stark` ([Wiktionary page](https://de.wiktionary.org/wiki/stark)) and the functions we created in the previous sections based on the **Special Export** method.
I added `@functools.cache` (optional) to avoid redundant requests and to be more respectful to the server: it stores the results of `fetch(title)`, so repeated calls with the same title return the cached response instead of requesting the page again from the wiki servers.
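A small sketch of what the caching changes in practice (the body of `fetch` here is a placeholder, not the exact function from the previous sections):

```python
import functools

import requests


@functools.cache
def fetch(title: str) -> requests.Response:
    url = f"https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}"
    return requests.get(url, timeout=30)


first = fetch("stark")   # performs the HTTP request
second = fetch("stark")  # answered from the cache, no new request
print(first is second)   # True: the very same Response object is returned
```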
- Finally, the **fourth-level** headings contain information on the translation of the word into different languages (`Übersetzungen`).
For my project, I only need the *German-to-German* dictionary. So, let us extract the *wikitext* for that heading. We can use the method `get_sections()`, which accepts a heading level as an argument. Passing level **2** will split the text into sections based on the second-level headings.
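As a small illustration with a made-up miniature wikitext (the real page is much longer), `get_sections(levels=[2])` behaves like this:

```python
import mwparserfromhell

sample = (
    "== stark ({{Sprache|Deutsch}}) ==\n"
    "German part of the entry...\n"
    "== stark ({{Sprache|Englisch}}) ==\n"
    "English part of the entry...\n"
)

wikicode = mwparserfromhell.parse(sample)
sections = wikicode.get_sections(levels=[2])  # one section per level-2 heading

for section in sections:
    print(section.filter_headings()[0])  # the heading that starts each section
```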
```python
print(type(root))  # <class 'lxml.etree._Element'>
```

=== "Result"

    ```pycon
    <class 'lxml.etree._ElementTree'>
    <class 'lxml.etree._Element'>
    ```
The parser returns an `ElementTree` object. We use the `getroot()` method to access the root `Element`.
## Displaying the XML Structure
The XML structure of the dump file is quite large, so printing the entire tree would not only be inefficient but also quite overwhelming. To make it more manageable, let us modify our `print_tags_tree` function.
We will add options to limit the number of children displayed for the root element and to control the depth of the tree.
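One possible implementation is sketched below; the parameter names `max_children` and `max_depth` are my assumptions, and the child limit is applied at every level here for simplicity:

```python
def print_tags_tree(element, level=0, max_children=None, max_depth=None):
    """Print the tag tree below `element`, optionally limiting width and depth."""
    for i, child in enumerate(element):
        if max_children is not None and i >= max_children:
            break  # show at most `max_children` children per element
        tag = child.tag.split('}')[-1]  # drop the XML namespace prefix
        print('    ' * level + f'{i} {tag}')
        if max_depth is None or level + 1 < max_depth:
            print_tags_tree(child, level + 1, max_children, max_depth)
```

For example, `print_tags_tree(root, max_children=3, max_depth=2)` would display only the first three children of the root and two levels of the tree.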
## Extracting Data
### `element.findall`
As with the previous section, we are interested in extracting the `page`, `title`, `ns`, and `text` tags.
The main difference in structure here is that we now have multiple `page` elements, and we want to extract all of them.
We cannot use `find`, because it will return only the first `page`. However, we can use the `findall` method instead, which will return a list of all `page` elements.
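A sketch of that extraction, assuming `root` is the root element parsed above and that we collect main-namespace pages into a dictionary called `dict_0` (the loop in the full document may differ):

```python
# the export XML uses a default namespace, so tag names must include it
ns_uri = root.tag.split('}')[0].lstrip('{')

dict_0 = {}
for page in root.findall(f'{{{ns_uri}}}page'):
    title = page.findtext(f'{{{ns_uri}}}title')
    page_ns = page.findtext(f'{{{ns_uri}}}ns')
    text = page.findtext(f'{{{ns_uri}}}revision/{{{ns_uri}}}text')
    if page_ns == '0':  # keep only main-namespace entries
        dict_0[title] = text

print(len(dict_0), 'pages collected')
```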
## Saving the Dictionary Locally
Once the dictionary is built, we save it locally using the `pickle` module, which allows us to store the dictionary in a serialized format. This way, we will not need to parse the XML file again in the future.
```python
import pickle

dict_file = DICT_PATH / f'wikidict_{ns}.pkl'
with open(dict_file, 'wb') as f:
    pickle.dump(dict_0, f)
```
## Loading Dictionary
The next time you need to retrieve *wikitext*, simply load the dictionary from the pickle file and look up the page title you need!
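A matching sketch for loading, assuming `dict_file` is the same path used when saving:

```python
import pickle

with open(dict_file, 'rb') as f:
    dict_0 = pickle.load(f)

wikitext = dict_0['stark']  # wikitext of the page titled 'stark'
print(wikitext[:200])       # a first look at the entry
```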