
Commit 9917c8d

Merge pull request #342 from carpentries-incubator/stevecrouch-section3devreview
Section 3 rework - develop branch review revisions
2 parents 6934815 + cb967db commit 9917c8d

4 files changed: +105 additions, −84 deletions

_episodes/32-software-architecture-design.md

Lines changed: 16 additions & 10 deletions
```diff
@@ -130,19 +130,20 @@ either as a whole or in terms of its individual functions.
 Now that we know what goals we should aspire to, let us take a critical look at the code in our
 software project and try to identify ways in which it can be improved.
 
-Our software project contains a branch `full-data-analysis` with code for a new feature of our
-inflammation analysis software. Recall that you can see all your branches as follows:
+Our software project contains a pre-existing branch `full-data-analysis` which contains code for a new feature of our
+inflammation analysis software, which we'll consider as a contribution by another developer.
+Recall that you can see all your branches as follows:
 
 ~~~
 $ git branch --all
 ~~~
 {: .language-bash}
 
-Let's checkout a new local branch from the `full-data-analysis` branch, making sure we
-have saved and committed all current changes before doing so.
+Once you have saved and committed any current changes,
+checkout this `full-data-analysis` branch:
 
 ~~~
-git checkout -b full-data-analysis
+git switch full-data-analysis
 ~~~
 {: .language-bash}

@@ -161,20 +162,25 @@ calculates and compares standard deviation across all the data by day and finally
 > Critically examine the code in `analyse_data()` function in `compute_data.py` file.
 >
 > In what ways does this code not live up to the ideal properties of 'good' code?
-> Think about ways in which you find it hard to understand.
+> Think about ways in which you find it hard to read and understand.
 > Think about the kinds of changes you might want to make to it, and what would
-> make making those changes challenging.
+> make those changes challenging.
+>
 >> ## Solution
+>>
 >> You may have found others, but here are some of the things that make the code
 >> hard to read, test and maintain.
 >>
 >> * **Hard to read:** everything is implemented in a single function.
 >> In order to understand it, you need to understand how file loading works at the same time as
 >> the analysis itself.
+>> * **Hard to read:** using the `--full-data-analysis` flag changes the meaning of the `infiles` argument
+>> to indicate a single data directory, instead of a set of data files, which may cause confusion.
 >> * **Hard to modify:** if you wanted to use the data for some other purpose and not just
->> plotting the graph you would have to change the `data_analysis()` function.
->> * **Hard to modify or test:** it is always analysing a fixed set of CSV data files
->> stored on a disk.
+>> plotting the graph you would have to change the `analyse_data()` function.
+>> * **Hard to modify or test:** it only analyses a set of CSV data files
+>> matching a very particular hardcoded `inflammation*.csv` pattern, which seems an unreasonable assumption.
+>> What if someone wanted to use files which don't match this naming convention?
 >> * **Hard to modify:** it does not have any tests so we cannot be 100% confident the code does
 >> what it claims to do; any changes to the code may break something and it would be harder and
 >> more time-consuming to figure out what.
```
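The hardcoded `inflammation*.csv` pattern criticised in the solution above can be seen concretely: any file not matching the glob pattern is silently excluded. A quick illustration using Python's `fnmatch` (the filenames here are made up for demonstration):

```python
from fnmatch import fnmatch

# Hypothetical directory listing: only names matching the hardcoded pattern survive
filenames = ['inflammation_01.csv', 'inflammation_02.csv', 'patient_data.csv']
matched = [f for f in filenames if fnmatch(f, 'inflammation*.csv')]
print(matched)  # ['inflammation_01.csv', 'inflammation_02.csv'] -- patient_data.csv is dropped
```

A user whose data files follow a different naming convention would see their data ignored with no warning, which is why the solution flags this as an unreasonable assumption.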

_episodes/33-code-decoupling-abstractions.md

Lines changed: 48 additions & 33 deletions
```diff
@@ -62,29 +62,46 @@ Data loading is a functionality separate from data analysis, so firstly
 let's decouple the data loading part into a separate component (function).
 
 > ## Exercise: Decouple Data Loading from Data Analysis
-> Separate out the data loading functionality from `analyse_data()` into a new function
-> `load_inflammation_data()` that returns a list of 2D NumPy arrays with inflammation data
+>
+> Modify `compute_data.py` to separate out the data loading functionality from `analyse_data()` into a new function
+> `load_inflammation_data()`, that returns a list of 2D NumPy arrays with inflammation data
 > loaded from all inflammation CSV files found in a specified directory path.
+> Then, change your `analyse_data()` function to make use of this new function instead.
+>
 >> ## Solution
+>>
 >> The new function `load_inflammation_data()` that reads all the inflammation data into the
 >> format needed for the analysis could look something like:
+>>
 >> ```python
 >> def load_inflammation_data(dir_path):
->> data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
->> if len(data_file_paths) == 0:
->> raise ValueError(f"No inflammation CSV files found in path {dir_path}")
->> data = map(models.load_csv, data_file_paths) #load inflammation data from each CSV file
->> return list(data) #return the list of 2D NumPy arrays with inflammation data
+>>     data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
+>>     if len(data_file_paths) == 0:
+>>         raise ValueError(f"No inflammation CSV files found in path {dir_path}")
+>>     data = map(models.load_csv, data_file_paths)  # Load inflammation data from each CSV file
+>>     return list(data)  # Return the list of 2D NumPy arrays with inflammation data
 >> ```
->> This function can now be used in the analysis as follows:
+>>
+>> The new function `analyse_data()` could then look like:
+>>
 >> ```python
 >> def analyse_data(data_dir):
->> data = load_inflammation_data(data_dir)
->> daily_standard_deviation = compute_standard_deviation_by_data(data)
->> ...
+>>     data = load_inflammation_data(data_dir)
+>>
+>>     means_by_day = map(models.daily_mean, data)
+>>     means_by_day_matrix = np.stack(list(means_by_day))
+>>
+>>     daily_standard_deviation = np.std(means_by_day_matrix, axis=0)
+>>
+>>     graph_data = {
+>>         'standard deviation by day': daily_standard_deviation,
+>>     }
+>>     views.visualize(graph_data)
 >> ```
+>>
 >> The code is now easier to follow since we do not need to understand the data loading part
 >> to understand the statistical analysis part, and vice versa.
+>> In most cases, functions work best when they are short!
 > {: .solution}
 {: .challenge}
 
```
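The statistical core of the refactored `analyse_data()` above can be exercised on small in-memory arrays, with no file loading involved. A minimal sketch (the helper name `compute_daily_std_of_means` is ours for illustration, not part of the lesson's codebase):

```python
import numpy as np

def compute_daily_std_of_means(data):
    """Standard deviation across files of the per-day mean inflammation.

    data: list of 2D NumPy arrays shaped (patients, days).
    """
    means_by_day = [d.mean(axis=0) for d in data]  # one mean-per-day row per file
    return np.std(np.stack(means_by_day), axis=0)  # spread of those means, day by day

# Two single-patient files that differ only on day 2
data = [np.array([[0., 2., 0.]]), np.array([[0., 1., 0.]])]
compute_daily_std_of_means(data)  # -> array([0. , 0.5, 0. ])
```

Because the computation is separated from loading, this kind of spot check needs no CSV files on disk at all.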
```diff
@@ -185,13 +202,12 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
 >> At the end of this exercise, the code in the `analyse_data()` function should look like:
 >> ```python
 >> def analyse_data(data_source):
->> data = data_source.load_inflammation_data()
->> daily_standard_deviation = compute_standard_deviation_by_data(data)
->> ...
+>>     data = data_source.load_inflammation_data()
+>>     ...
 >> ```
 >> The controller code should look like:
 >> ```python
->> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> analyse_data(data_source)
 >> ```
 > {: .solution}
```
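The point of this interface is that `analyse_data()` no longer cares where the data comes from: anything exposing a `load_inflammation_data()` method will do. A runnable sketch using an in-memory stand-in (the class `InMemoryDataSource` is our own illustration, not part of the lesson's codebase, and the statistics are inlined here rather than calling the project's helpers):

```python
import numpy as np

class InMemoryDataSource:
    """Stand-in for CSVDataSource: same interface, but no files involved."""
    def __init__(self, data):
        self._data = data

    def load_inflammation_data(self):
        return self._data

def analyse_data(data_source):
    # Works with any object implementing load_inflammation_data()
    data = data_source.load_inflammation_data()
    means_by_day = np.stack([d.mean(axis=0) for d in data])
    return np.std(means_by_day, axis=0)

source = InMemoryDataSource([np.array([[0., 2., 0.]]), np.array([[0., 1., 0.]])])
analyse_data(source)  # -> array([0. , 0.5, 0. ])
```

Swapping `InMemoryDataSource` for a CSV- or JSON-backed source requires no change to `analyse_data()` itself, which is exactly the decoupling the exercise is after.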
```diff
@@ -200,33 +216,32 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
 >>
 >> ```python
 >> class CSVDataSource:
->> """
->> Loads all the inflammation CSV files within a specified directory.
->> """
->> def __init__(self, dir_path):
->> self.dir_path = dir_path
+>>     """
+>>     Loads all the inflammation CSV files within a specified directory.
+>>     """
+>>     def __init__(self, dir_path):
+>>         self.dir_path = dir_path
 >>
->> def load_inflammation_data(self):
->> data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
->> if len(data_file_paths) == 0:
->> raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
->> data = map(models.load_csv, data_file_paths)
->> return list(data)
+>>     def load_inflammation_data(self):
+>>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
+>>         if len(data_file_paths) == 0:
+>>             raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
+>>         data = map(models.load_csv, data_file_paths)
+>>         return list(data)
 >> ```
 >> In the controller, we create an instance of CSVDataSource and pass it
 >> into the statistical analysis function.
 >>
 >> ```python
->> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> analyse_data(data_source)
 >> ```
 >> The `analyse_data()` function is modified to receive any data source object (that implements
 >> the `load_inflammation_data()` method) as a parameter.
 >> ```python
 >> def analyse_data(data_source):
->> data = data_source.load_inflammation_data()
->> daily_standard_deviation = compute_standard_deviation_by_data(data)
->> ...
+>>     data = data_source.load_inflammation_data()
+>>     ...
 >> ```
 >> We have now fully decoupled the reading of the data from the statistical analysis and
 >> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various
```
```diff
@@ -364,11 +379,11 @@ data sources with no extra work.
 >> Additionally, in the controller we will need to select an appropriate DataSource instance to
 >> provide to the analysis:
 >>```python
->> _, extension = os.path.splitext(InFiles[0])
+>> _, extension = os.path.splitext(infiles[0])
 >> if extension == '.json':
->>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+>>     data_source = JSONDataSource(os.path.dirname(infiles[0]))
 >> elif extension == '.csv':
->>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>>     data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> else:
 >>     raise ValueError(f'Unsupported data file format: {extension}')
 >> analyse_data(data_source)
```
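The extension-based selection in the controller above can be factored into a small helper. This sketch returns class names rather than constructing instances, so it runs standalone; the `.json`/`.csv` mapping mirrors the lesson's `JSONDataSource`/`CSVDataSource` choice, but the helper name is ours:

```python
import os

def select_data_source_name(path):
    """Pick a data source (by name, for illustration) based on a file's extension."""
    _, extension = os.path.splitext(path)
    sources = {'.json': 'JSONDataSource', '.csv': 'CSVDataSource'}
    if extension not in sources:
        raise ValueError(f'Unsupported data file format: {extension}')
    return sources[extension]

select_data_source_name('data/inflammation_01.csv')   # 'CSVDataSource'
select_data_source_name('data/inflammation_01.json')  # 'JSONDataSource'
```

Keeping the dispatch in one place means adding a new format later touches only the mapping, not `analyse_data()`.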

_episodes/34-code-refactoring.md

Lines changed: 28 additions & 28 deletions
```diff
@@ -193,8 +193,8 @@ be harder to test but, when simplified like this, may only require a handful of
 > The pure function should take in the data, and return the analysis result, as follows:
 > ```python
 > def compute_standard_deviation_by_day(data):
->   # TODO
->   return daily_standard_deviation
+>     # TODO
+>     return daily_standard_deviation
 > ```
 >> ## Solution
 >> The analysis code will be refactored into a separate function that may look something like:
```
```diff
@@ -208,23 +208,23 @@ be harder to test but, when simplified like this, may only require a handful of
 >> ```
 >> The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function,
 >> while keeping all the logic for reading the data, processing it and showing it in a graph:
->>```python
->>def analyse_data(data_dir):
->> """Calculates the standard deviation by day between datasets.
->> Gets all the inflammation data from CSV files within a directory, works out the mean
->> inflammation value for each day across all datasets, then visualises the
->> standard deviation of these means on a graph."""
->> data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv'))
->> if len(data_file_paths) == 0:
->> raise ValueError(f"No inflammation csv's found in path {data_dir}")
->> data = map(models.load_csv, data_file_paths)
->> daily_standard_deviation = compute_standard_deviation_by_day(data)
+>> ```python
+>> def analyse_data(data_dir):
+>>     """Calculates the standard deviation by day between datasets.
+>>     Gets all the inflammation data from CSV files within a directory, works out the mean
+>>     inflammation value for each day across all datasets, then visualises the
+>>     standard deviation of these means on a graph."""
+>>     data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv'))
+>>     if len(data_file_paths) == 0:
+>>         raise ValueError(f"No inflammation csv's found in path {data_dir}")
+>>     data = map(models.load_csv, data_file_paths)
+>>     daily_standard_deviation = compute_standard_deviation_by_day(data)
 >>
->> graph_data = {
->> 'standard deviation by day': daily_standard_deviation,
->> }
->> # views.visualize(graph_data)
->> return daily_standard_deviation
+>>     graph_data = {
+>>         'standard deviation by day': daily_standard_deviation,
+>>     }
+>>     # views.visualize(graph_data)
+>>     return daily_standard_deviation
 >>```
 >> Make sure to re-run the regression test to check this refactoring has not
 >> changed the output of `analyse_data()`.
```
```diff
@@ -252,17 +252,17 @@ from CSV to JSON, the bulk of the tests need not be updated
 >> You might have thought of more tests, but we can easily extend the test by parametrizing
 >> with more inputs and expected outputs:
 >> ```python
->>@pytest.mark.parametrize('data,expected_output', [
->> ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),
->> ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),
->> ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0])
->>],
->>ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files'])
->>def test_compute_standard_deviation_by_day(data, expected_output):
->> from inflammation.compute_data import compute_standard_deviation_by_data
+>> @pytest.mark.parametrize('data,expected_output', [
+>>     ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),
+>>     ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),
+>>     ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0])
+>> ],
+>> ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files'])
+>> def test_compute_standard_deviation_by_day(data, expected_output):
+>>     from inflammation.compute_data import compute_standard_deviation_by_data
 >>
->> result = compute_standard_deviation_by_data(data)
->> npt.assert_array_almost_equal(result, expected_output)
+>>     result = compute_standard_deviation_by_data(data)
+>>     npt.assert_array_almost_equal(result, expected_output)
 >>```
 > {: .solution}
 {: .challenge}
```
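The three parametrized cases above can also be checked with plain assertions, independent of pytest. The implementation of `compute_standard_deviation_by_day` here is our reconstruction from the lesson's description (per-day means per file, then the standard deviation of those means across files):

```python
import math
import numpy as np

def compute_standard_deviation_by_day(data):
    # data: list of 2D arrays shaped (patients, days)
    means_by_day = np.stack([np.mean(d, axis=0) for d in data])
    return np.std(means_by_day, axis=0)

cases = [
    ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),                          # two patients, same file
    ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),          # two patients, different files
    ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),  # two identical files
]
for data, expected in cases:
    result = compute_standard_deviation_by_day([np.array(d) for d in data])
    np.testing.assert_array_almost_equal(result, expected)
```

Because the function is pure, each case needs only literal arrays as input and a literal expected output, which is precisely what makes the parametrized pytest version so compact.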

_episodes/35-software-architecture-revisited.md

Lines changed: 13 additions & 13 deletions
```diff
@@ -72,19 +72,19 @@ how you should structure your code.
 >>
 >> ```python
 >> if args.full_data_analysis:
->>   _, extension = os.path.splitext(InFiles[0])
->>   if extension == '.json':
->>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
->>   elif extension == '.csv':
->>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
->>   else:
->>     raise ValueError(f'Unsupported file format: {extension}')
->>   data_result = analyse_data(data_source)
->>   graph_data = {
->>     'standard deviation by day': data_result,
->>   }
->>   views.visualize(graph_data)
->>   return
+>>     _, extension = os.path.splitext(infiles[0])
+>>     if extension == '.json':
+>>         data_source = JSONDataSource(os.path.dirname(infiles[0]))
+>>     elif extension == '.csv':
+>>         data_source = CSVDataSource(os.path.dirname(infiles[0]))
+>>     else:
+>>         raise ValueError(f'Unsupported file format: {extension}')
+>>     data_result = analyse_data(data_source)
+>>     graph_data = {
+>>         'standard deviation by day': data_result,
+>>     }
+>>     views.visualize(graph_data)
+>>     return
 >> ```
 >> Note that this is, more or less, the change we did to write our regression test.
 >> This demonstrates that splitting up Model code from View code can
```
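The Model/View/Controller split that this last change relies on can be sketched end to end on in-memory data. All names here (`daily_std_of_means`, `render`, `main`) are our own stand-ins for the project's model, view and controller code, with the view reduced to returning text instead of plotting:

```python
import numpy as np

# Model: pure computation, no file I/O and no plotting
def daily_std_of_means(data):
    means_by_day = np.stack([d.mean(axis=0) for d in data])
    return np.std(means_by_day, axis=0)

# View: presentation only (a text stand-in for views.visualize)
def render(graph_data):
    return ', '.join(f'{name}: {values.tolist()}' for name, values in graph_data.items())

# Controller: wires the model result into the view
def main(data):
    result = daily_std_of_means(data)
    return render({'standard deviation by day': result})

print(main([np.array([[0., 2., 0.]]), np.array([[0., 1., 0.]])]))
# prints: standard deviation by day: [0.0, 0.5, 0.0]
```

Because the model returns its result instead of drawing it, a regression test can assert on `daily_std_of_means()` directly while the view stays free to change.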
