Commit cd58762

markdown source builds
Auto-generated via `{sandpaper}`
Source: d7d88a9
Branch: main
Author: Aleksandra Nenadic <[email protected]>
Time: 2025-05-14 20:44:57 +0000
Message: Merge pull request #179 from smangham/issue114-utf-encoding

Issue 114: Add encoding to Python file operations.
1 parent 3028a68 commit cd58762

2 files changed

+65
-25
lines changed

05-reproducible-dev-environment.md

Lines changed: 64 additions & 24 deletions
@@ -19,7 +19,7 @@ After completing this episode, participants should be able to:
1919

2020
::::::::::::::::::::::::::::::::::::::::::::::::
2121

22-
So far we have created a local Git repository to track changes in our software project and pushed it to GitHub
22+
So far we have created a local Git repository to track changes in our software project and pushed it to GitHub
2323
to enable others to see and contribute to it.
2424

2525
::: instructor
@@ -29,7 +29,7 @@ https://github.com/carpentries-incubator/astronaut-data-analysis-not-so-fair/tre
2929

3030
:::
3131

32-
We now want to start developing the code further.
32+
We now want to start developing the code further.
3333
If we have a look at our script, we may notice a few `import` lines like the following:
3434

3535
```python
@@ -81,14 +81,14 @@ allowing isolation from other software projects on your machine that may require
8181
different versions of Python or external libraries.
8282

8383
It is recommended to create a separate virtual environment for each project.
84-
Then you do not have to worry about changes to the environment of the current project you are working on
84+
Then you do not have to worry about changes to the environment of the current project you are working on
8585
affecting other projects - you can use different Python versions and different versions of the same third party
8686
dependency by different projects on your machine independently from one another.
8787

8888
We can visualise the use of virtual environments for different Python projects on the same machine as follows:
8989
![Diagram to depict different Python environments containing different packages on the same machine](episodes/fig/ep05_virtual-env.png){alt='Diagram to depict different Python environments containing different packages on the same machine'}
9090

91-
Another big motivator for using virtual environments is that they make sharing your code with others much easier -
91+
Another big motivator for using virtual environments is that they make sharing your code with others much easier -
9292
as we will see shortly you can record your virtual environment in a special file and share it with your collaborators
9393
who can then recreate the same development environment on their machines.
9494

@@ -100,7 +100,7 @@ They also enable you to use a specific older version of a package for your proje
100100

101101
## Managing virtual environments
102102

103-
There are several command line tools used for managing Python virtual environments - we will use `venv`,
103+
There are several command line tools used for managing Python virtual environments - we will use `venv`,
104104
available by default from the standard `Python` distribution since `Python 3.3`.
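Once an environment has been created and activated, it can be useful to confirm from within Python which interpreter is actually in use. The following is a minimal sketch rather than part of the lesson's code; the helper name `in_virtual_environment` is hypothetical, and it relies only on the standard `sys` module.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if this interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtual_environment())
```

Running this before and after activating an environment should show the interpreter path change.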
105105

106106
Part of managing your (virtual) working environment involves
@@ -212,11 +212,11 @@ we will see how to handle it using Git in one of the subsequent episodes.
212212
### Installing external packages
213213

214214
We noticed earlier that our code depends on four **external packages/libraries** -
215-
`json`, `csv`, `datetime` and `matplotlib`.
216-
As of Python 3.5, Python comes with in-built JSON and CSV libraries - this means there is no need to install these
217-
additional packages (if you are using a fairly recent version of Python), but you still need to import them in any
218-
script that uses them.
219-
However, we still need to install packages `datetime` and `matplotlib` as they do not come as standard with
215+
`json`, `csv`, `datetime` and `matplotlib`.
216+
As of Python 3.5, Python comes with in-built JSON and CSV libraries - this means there is no need to install these
217+
additional packages (if you are using a fairly recent version of Python), but you still need to import them in any
218+
script that uses them.
219+
However, we still need to install packages `datetime` and `matplotlib` as they do not come as standard with
220220
Python distribution.
221221

222222
To install the latest version of packages `datetime` and `matplotlib` with `pip`
@@ -233,7 +233,7 @@ or like this to install multiple packages at once for short:
233233
(venv_spacewalks) $ python3 -m pip install datetime matplotlib
234234
```
235235

236-
The above commands have installed packages `datetime` and `matplotlib` in our currently active `venv_spacewalks`
236+
The above commands have installed packages `datetime` and `matplotlib` in our currently active `venv_spacewalks`
237237
environment and will not affect any other Python projects we may have on our machines.
238238

239239
If you run the `python3 -m pip install` command on a package that is already installed,
@@ -257,15 +257,15 @@ To display information about a particular installed package do:
257257
Name: matplotlib
258258
Version: 3.9.0
259259
Summary: Python plotting package
260-
Home-page:
260+
Home-page:
261261
Author: John D. Hunter, Michael Droettboom
262262
Author-email: Unknown <[email protected]>
263263
License: License agreement for matplotlib versions 1.3.0 and later
264264
=========================================================
265265
...
266266
Location: /opt/homebrew/lib/python3.11/site-packages
267267
Requires: contourpy, cycler, fonttools, kiwisolver, numpy, packaging, pillow, pyparsing, python-dateutil
268-
Required-by:
268+
Required-by:
269269
```
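Similar information can also be read from inside Python using the standard `importlib.metadata` module (available since Python 3.8). This is a rough sketch, not something the lesson itself uses; the helper name `describe_package` is hypothetical, and the output depends on what is installed in the active environment.

```python
from importlib import metadata


def describe_package(name):
    """Return basic metadata for an installed package, or None if it is not installed."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    # requires() returns None when a package declares no dependencies.
    return {"name": name, "version": version, "requires": metadata.requires(name) or []}


if __name__ == "__main__":
    for package in ("matplotlib", "pip"):
        info = describe_package(package)
        if info is None:
            print(f"{package}: not installed in this environment")
        else:
            print(f"{package} {info['version']}")
```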
270270

271271
To list all packages installed with `pip` (in your current virtual environment):
@@ -339,9 +339,9 @@ The `requirements.txt` file can then be committed to a version control system
339339
(we will see how to do this using Git in a moment)
340340
and get shipped as part of your software and shared with collaborators and/or users.
341341

342-
Note that you only need to share the small `requirements.txt` file with your collaborators - and not the entire
343-
`venv_spacewalks` directory with packages contained in your virtual environment.
344-
We need to tell Git to ignore that directory, so it is not tracked and shared - we do this by creating a file
342+
Note that you only need to share the small `requirements.txt` file with your collaborators - and not the entire
343+
`venv_spacewalks` directory with packages contained in your virtual environment.
344+
We need to tell Git to ignore that directory, so it is not tracked and shared - we do this by creating a file
345345
`.gitignore` in the root directory of our project and adding a line `venv_spacewalks` to it.
346346

347347
Let's now put `requirements.txt` under version control and share it along with our code.
@@ -352,9 +352,9 @@ Let's now put `requirements.txt` under version control and share it along with o
352352
(venv_spacewalks) $ git push origin main
353353
```
354354

355-
Your collaborators or users of your software can now download your software's source code and replicate the same
356-
virtual software environment for running your code on their machines using `requirements.txt` to install all
357-
the necessary depending packages.
355+
Your collaborators or users of your software can now download your software's source code and replicate the same
356+
virtual software environment for running your code on their machines using `requirements.txt` to install all
357+
the necessary depending packages.
358358

359359
To recreate a virtual environment from `requirements.txt`, from the project root one can do the following:
360360

@@ -365,7 +365,7 @@ To recreate a virtual environment from `requirements.txt`, from the project root
365365
As your project grows - you may need to update your environment for a variety of reasons, e.g.:
366366

367367
- one of your project's dependencies has just released a new version (dependency version number update),
368-
- you need an additional package for data analysis (adding a new dependency), or
368+
- you need an additional package for data analysis (adding a new dependency), or
369369
- you have found a better package and no longer need the older package
370370
(adding a new and removing an old dependency).
371371

@@ -382,8 +382,48 @@ We are now setup to run our code from the newly created virtual environment:
382382
(venv_spacewalks) $ python3 eva_data_analysis.py
383383
```
384384

385-
You should get a pop up window with a graph.
386-
Let's inspect the code in a more detail, see if we can understand and improve it.
385+
You should get a pop up window with a graph.
386+
However, some (but not all) Windows users will not.
387+
You might instead see an error like:
388+
389+
```
390+
Traceback (most recent call last):
391+
File "C:\Users\Toaster\Desktop\spacewalks\eva_data_analysis.py", line 30, in <module>
392+
w.writerow(data[j].values())
393+
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2544.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 19, in encode
394+
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
395+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
396+
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 101: character maps to <undefined>
397+
(spacewalks) (spacewalks)
398+
```
399+
400+
This is not what we were expecting!
401+
The problem is *character encoding*.
402+
'Standard' Latin characters are encoded using ASCII,
403+
but the expanded Unicode character set covers many more.
404+
In this case, the data contains non-ASCII Unicode characters that are written in the ASCII input file as JSON escape sequences (`Â` as `\u00c2`, and the invisible control character U+0092 as `\u0092`).
405+
406+
When we read the file, Python converts those into the Unicode characters.
407+
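Then by default Python on Windows tries to write out `eva-data.csv` using the system's legacy code page, which here is `cp1252` (Windows-1252), as the traceback shows.
408+
This is a single-byte encoding, unlike the now-standard UTF-8,
409+
so it can only represent a small subset of Unicode characters.
410+
Python tries to encode the character `\u0092` (displayed as `'\x92'` in the error message),
411+
then discovers that it has no equivalent in `cp1252`.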
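The error in the traceback above can be reproduced on any operating system by encoding the offending character by hand. This is a minimal sketch, assuming the `cp1252` codec named in the traceback is the one Python selected; the variable name `text` is only illustrative.

```python
text = "\u0092"  # the control character reported in the error message

print(repr(text))  # Python displays this character as '\x92'

try:
    text.encode("cp1252")
except UnicodeEncodeError as error:
    # The codec has no byte assigned to this character.
    print("cp1252 cannot encode it:", error.reason)

print(text.encode("utf-8"))  # UTF-8 can encode it, using two bytes
```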
412+
413+
The fact that different systems have different defaults,
414+
which can change or even break your code's behaviour,
415+
shows why it is so important to make our code's requirements explicit!
416+
417+
We can easily fix this by explicitly telling Python what encoding to use when reading and writing our files:
418+
419+
```
420+
data_f = open('./eva-data.json', 'r', encoding='ascii')
421+
data_t = open('./eva-data.csv','w', encoding='utf-8')
422+
```
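The same idea can be tried out without the lesson data. The sketch below is illustrative only: the record, file names and variable names are hypothetical stand-ins, and it uses `with` blocks so that the files are closed automatically, which the lesson's script may or may not do.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in record; the real lesson data is not reproduced here.
records = [{"id": 1, "note": "caf\u00e9 \u0092"}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "example.json"
    csv_path = Path(tmp) / "example.csv"

    # json.dumps escapes non-ASCII characters by default, so this file is pure ASCII.
    json_path.write_text(json.dumps(records), encoding="ascii")

    with open(json_path, "r", encoding="ascii") as infile:
        data = json.load(infile)

    # The csv module documentation recommends newline="" for files passed to csv.writer.
    with open(csv_path, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.writer(outfile)
        for row in data:
            writer.writerow(row.values())

    written = csv_path.read_text(encoding="utf-8")

print(repr(written))
```

Because both encodings are stated explicitly, the result should not depend on the default encoding of the machine it runs on.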
423+
424+
Now we have the code running in a virtual environment,
425+
in the next episode we will inspect it in more detail,
426+
to see if we can understand and improve it.
387427

388428
## Further reading
389429

@@ -397,8 +437,8 @@ Also check the [full reference set](learners/reference.md#litref) for the course
397437
:::::: keypoints
398438
- Virtual environments keep Python versions and dependencies required by different projects separate.
399439
- A Python virtual environment is itself a directory structure.
400-
- You can use `venv` to create and manage Python virtual environments, and `pip` to install and manage Python
440+
- You can use `venv` to create and manage Python virtual environments, and `pip` to install and manage Python
401441
external (third-party) libraries.
402-
- By convention, you can save and export your Python virtual environment in a `requirements.txt` in your project's root
442+
- By convention, you can save and export your Python virtual environment in a `requirements.txt` in your project's root
403443
directory, which can then be shared with collaborators/users and used to replicate your virtual environment elsewhere.
404444
::::::

md5sum.txt

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
99
"episodes/02-fair-research-software.md" "77166cb9e7f376cc9a1f8bb932a5beb7" "site/built/02-fair-research-software.md" "2025-03-13"
1010
"episodes/03-tools-practices.md" "aee50624ea854598a6054cee37baf119" "site/built/03-tools-practices.md" "2025-03-27"
1111
"episodes/04-version-control.md" "13c6fe30d2a760371c78ff44a8000e8b" "site/built/04-version-control.md" "2025-03-25"
12-
"episodes/05-reproducible-dev-environment.md" "17022303ea71bd1dc8d83693368d1930" "site/built/05-reproducible-dev-environment.md" "2025-03-17"
12+
"episodes/05-reproducible-dev-environment.md" "e50409fbf9c48e24d99e607d54efb30d" "site/built/05-reproducible-dev-environment.md" "2025-05-14"
1313
"episodes/06-code-readability.md" "0e0be3bb2115d56719e893f0822c1f77" "site/built/06-code-readability.md" "2025-02-12"
1414
"episodes/07-code-structure.md" "fb8d5799fcd6235bdc548c7fd14c83fd" "site/built/07-code-structure.md" "2025-02-12"
1515
"episodes/08-code-correctness-testing.md" "452d6f07b4c6f01c7249492d3b03d667" "site/built/08-code-correctness-testing.md" "2025-03-13"
