
Commit 9917c8d

Merge pull request #342 from carpentries-incubator/stevecrouch-section3devreview
Section 3 rework - develop branch review revisions
2 parents 6934815 + cb967db commit 9917c8d

4 files changed: +105 additions, −84 deletions

_episodes/32-software-architecture-design.md

Lines changed: 16 additions & 10 deletions
```diff
@@ -130,19 +130,20 @@ either as a whole or in terms of its individual functions.
 Now that we know what goals we should aspire to, let us take a critical look at the code in our
 software project and try to identify ways in which it can be improved.
 
-Our software project contains a branch `full-data-analysis` with code for a new feature of our
-inflammation analysis software. Recall that you can see all your branches as follows:
+Our software project contains a pre-existing branch `full-data-analysis` which contains code for a new feature of our
+inflammation analysis software, which we'll consider as a contribution by another developer.
+Recall that you can see all your branches as follows:
 
 ~~~
 $ git branch --all
 ~~~
 {: .language-bash}
 
-Let's checkout a new local branch from the `full-data-analysis` branch, making sure we
-have saved and committed all current changes before doing so.
+Once you have saved and committed any current changes,
+checkout this `full-data-analysis` branch:
 
 ~~~
-git checkout -b full-data-analysis
+git switch full-data-analysis
 ~~~
 {: .language-bash}

@@ -161,20 +162,25 @@ calculates and compares standard deviation across all the data by day and finally
 > Critically examine the code in `analyse_data()` function in `compute_data.py` file.
 >
 > In what ways does this code not live up to the ideal properties of 'good' code?
-> Think about ways in which you find it hard to understand.
+> Think about ways in which you find it hard to read and understand.
 > Think about the kinds of changes you might want to make to it, and what would
-> make making those changes challenging.
+> make those changes challenging.
+>
 >> ## Solution
+>>
 >> You may have found others, but here are some of the things that make the code
 >> hard to read, test and maintain.
 >>
 >> * **Hard to read:** everything is implemented in a single function.
 >> In order to understand it, you need to understand how file loading works at the same time as
 >> the analysis itself.
+>> * **Hard to read:** using the `--full-data-analysis` flag changes the meaning of the `infiles` argument
+>> to indicate a single data directory, instead of a set of data files, which may cause confusion.
 >> * **Hard to modify:** if you wanted to use the data for some other purpose and not just
->> plotting the graph you would have to change the `data_analysis()` function.
->> * **Hard to modify or test:** it is always analysing a fixed set of CSV data files
->> stored on a disk.
+>> plotting the graph you would have to change the `analyse_data()` function.
+>> * **Hard to modify or test:** it only analyses a set of CSV data files
+>> matching a very particular hardcoded `inflammation*.csv` pattern, which seems an unreasonable assumption.
+>> What if someone wanted to use files which don't match this naming convention?
 >> * **Hard to modify:** it does not have any tests so we cannot be 100% confident the code does
 >> what it claims to do; any changes to the code may break something and it would be harder and
 >> more time-consuming to figure out what.
```
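The hardcoded `inflammation*.csv` pattern criticised in the solution above can be seen concretely: any file not matching the glob pattern is silently excluded. A quick illustration using Python's `fnmatch` (the filenames here are made up for demonstration):

```python
from fnmatch import fnmatch

# Hypothetical directory listing: only names matching the hardcoded pattern survive
filenames = ['inflammation_01.csv', 'inflammation_02.csv', 'patient_data.csv']
matched = [f for f in filenames if fnmatch(f, 'inflammation*.csv')]
print(matched)  # ['inflammation_01.csv', 'inflammation_02.csv'] -- patient_data.csv is dropped
```

A user whose data files follow a different naming convention would see their data ignored with no warning, which is why the solution flags this as an unreasonable assumption.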

_episodes/33-code-decoupling-abstractions.md

Lines changed: 48 additions & 33 deletions
```diff
@@ -62,29 +62,46 @@ Data loading is a functionality separate from data analysis, so firstly
 let's decouple the data loading part into a separate component (function).
 
 > ## Exercise: Decouple Data Loading from Data Analysis
-> Separate out the data loading functionality from `analyse_data()` into a new function
-> `load_inflammation_data()` that returns a list of 2D NumPy arrays with inflammation data
+>
+> Modify `compute_data.py` to separate out the data loading functionality from `analyse_data()` into a new function
+> `load_inflammation_data()`, that returns a list of 2D NumPy arrays with inflammation data
 > loaded from all inflammation CSV files found in a specified directory path.
+> Then, change your `analyse_data()` function to make use of this new function instead.
+>
 >> ## Solution
+>>
 >> The new function `load_inflammation_data()` that reads all the inflammation data into the
 >> format needed for the analysis could look something like:
+>>
 >> ```python
 >> def load_inflammation_data(dir_path):
->> data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
->> if len(data_file_paths) == 0:
->> raise ValueError(f"No inflammation CSV files found in path {dir_path}")
->> data = map(models.load_csv, data_file_paths) #load inflammation data from each CSV file
->> return list(data) #return the list of 2D NumPy arrays with inflammation data
+>>     data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
+>>     if len(data_file_paths) == 0:
+>>         raise ValueError(f"No inflammation CSV files found in path {dir_path}")
+>>     data = map(models.load_csv, data_file_paths)  # Load inflammation data from each CSV file
+>>     return list(data)  # Return the list of 2D NumPy arrays with inflammation data
 >> ```
->> This function can now be used in the analysis as follows:
+>>
+>> The new function `analyse_data()` could then look like:
+>>
 >> ```python
 >> def analyse_data(data_dir):
->> data = load_inflammation_data(data_dir)
->> daily_standard_deviation = compute_standard_deviation_by_data(data)
->> ...
+>>     data = load_inflammation_data(data_dir)
+>>
+>>     means_by_day = map(models.daily_mean, data)
+>>     means_by_day_matrix = np.stack(list(means_by_day))
+>>
+>>     daily_standard_deviation = np.std(means_by_day_matrix, axis=0)
+>>
+>>     graph_data = {
+>>         'standard deviation by day': daily_standard_deviation,
+>>     }
+>>     views.visualize(graph_data)
 >> ```
+>>
 >> The code is now easier to follow since we do not need to understand the data loading part
 >> to understand the statistical analysis part, and vice versa.
+>> In most cases, functions work best when they are short!
 > {: .solution}
 {: .challenge}
 
```
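The statistical core of the refactored `analyse_data()` above can be exercised on small in-memory arrays, with no file loading involved. A minimal sketch (the helper name `compute_daily_std_of_means` is ours for illustration, not part of the lesson's codebase):

```python
import numpy as np

def compute_daily_std_of_means(data):
    """Standard deviation across files of the per-day mean inflammation.

    data: list of 2D NumPy arrays shaped (patients, days).
    """
    means_by_day = [d.mean(axis=0) for d in data]  # one mean-per-day row per file
    return np.std(np.stack(means_by_day), axis=0)  # spread of those means, day by day

# Two single-patient files that differ only on day 2
data = [np.array([[0., 2., 0.]]), np.array([[0., 1., 0.]])]
compute_daily_std_of_means(data)  # -> array([0. , 0.5, 0. ])
```

Because the computation is separated from loading, this kind of spot check needs no CSV files on disk at all.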
```diff
@@ -185,13 +202,12 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
 >> At the end of this exercise, the code in the `analyse_data()` function should look like:
 >> ```python
 >> def analyse_data(data_source):
->> data = data_source.load_inflammation_data()
->> daily_standard_deviation = compute_standard_deviation_by_data(data)
->> ...
+>>     data = data_source.load_inflammation_data()
+>>     ...
 >> ```
 >> The controller code should look like:
 >> ```python
->> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> analyse_data(data_source)
 >> ```
 > {: .solution}
```
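The point of this interface is that `analyse_data()` no longer cares where the data comes from: anything exposing a `load_inflammation_data()` method will do. A runnable sketch using an in-memory stand-in (the class `InMemoryDataSource` is our own illustration, not part of the lesson's codebase, and the statistics are inlined here rather than calling the project's helpers):

```python
import numpy as np

class InMemoryDataSource:
    """Stand-in for CSVDataSource: same interface, but no files involved."""
    def __init__(self, data):
        self._data = data

    def load_inflammation_data(self):
        return self._data

def analyse_data(data_source):
    # Works with any object implementing load_inflammation_data()
    data = data_source.load_inflammation_data()
    means_by_day = np.stack([d.mean(axis=0) for d in data])
    return np.std(means_by_day, axis=0)

source = InMemoryDataSource([np.array([[0., 2., 0.]]), np.array([[0., 1., 0.]])])
analyse_data(source)  # -> array([0. , 0.5, 0. ])
```

Swapping `InMemoryDataSource` for a CSV- or JSON-backed source requires no change to `analyse_data()` itself, which is exactly the decoupling the exercise is after.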
```diff
@@ -200,33 +216,32 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
 >>
 >> ```python
 >> class CSVDataSource:
->> """
->> Loads all the inflammation CSV files within a specified directory.
->> """
->> def __init__(self, dir_path):
->> self.dir_path = dir_path
+>>     """
+>>     Loads all the inflammation CSV files within a specified directory.
+>>     """
+>>     def __init__(self, dir_path):
+>>         self.dir_path = dir_path
 >>
->> def load_inflammation_data(self):
->> data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
->> if len(data_file_paths) == 0:
->> raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
->> data = map(models.load_csv, data_file_paths)
->> return list(data)
+>>     def load_inflammation_data(self):
+>>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
+>>         if len(data_file_paths) == 0:
+>>             raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
+>>         data = map(models.load_csv, data_file_paths)
+>>         return list(data)
 >> ```
 >> In the controller, we create an instance of CSVDataSource and pass it
 >> into the statistical analysis function.
 >>
 >> ```python
->> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> analyse_data(data_source)
 >> ```
 >> The `analyse_data()` function is modified to receive any data source object (that implements
 >> the `load_inflammation_data()` method) as a parameter.
 >> ```python
 >> def analyse_data(data_source):
->> data = data_source.load_inflammation_data()
->> daily_standard_deviation = compute_standard_deviation_by_data(data)
->> ...
+>>     data = data_source.load_inflammation_data()
+>>     ...
 >> ```
 >> We have now fully decoupled the reading of the data from the statistical analysis and
 >> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various
```
```diff
@@ -364,11 +379,11 @@ data sources with no extra work.
 >> Additionally, in the controller we will need to select an appropriate DataSource instance to
 >> provide to the analysis:
 >>```python
->> _, extension = os.path.splitext(InFiles[0])
+>> _, extension = os.path.splitext(infiles[0])
 >> if extension == '.json':
->>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+>>     data_source = JSONDataSource(os.path.dirname(infiles[0]))
 >> elif extension == '.csv':
->>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>>     data_source = CSVDataSource(os.path.dirname(infiles[0]))
 >> else:
 >>     raise ValueError(f'Unsupported data file format: {extension}')
 >> analyse_data(data_source)
```
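The extension-based selection in the controller above can be factored into a small helper. This sketch returns class names rather than constructing instances, so it runs standalone; the `.json`/`.csv` mapping mirrors the lesson's `JSONDataSource`/`CSVDataSource` choice, but the helper name is ours:

```python
import os

def select_data_source_name(path):
    """Pick a data source (by name, for illustration) based on a file's extension."""
    _, extension = os.path.splitext(path)
    sources = {'.json': 'JSONDataSource', '.csv': 'CSVDataSource'}
    if extension not in sources:
        raise ValueError(f'Unsupported data file format: {extension}')
    return sources[extension]

select_data_source_name('data/inflammation_01.csv')   # 'CSVDataSource'
select_data_source_name('data/inflammation_01.json')  # 'JSONDataSource'
```

Keeping the dispatch in one place means adding a new format later touches only the mapping, not `analyse_data()`.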

_episodes/34-code-refactoring.md

Lines changed: 28 additions & 28 deletions
```diff
@@ -193,8 +193,8 @@ be harder to test but, when simplified like this, may only require a handful of
 > The pure function should take in the data, and return the analysis result, as follows:
 > ```python
 > def compute_standard_deviation_by_day(data):
->   # TODO
->   return daily_standard_deviation
+>     # TODO
+>     return daily_standard_deviation
 > ```
 >> ## Solution
 >> The analysis code will be refactored into a separate function that may look something like:
```
```diff
@@ -208,23 +208,23 @@ be harder to test but, when simplified like this, may only require a handful of
 >> ```
 >> The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function,
 >> while keeping all the logic for reading the data, processing it and showing it in a graph:
->>```python
->>def analyse_data(data_dir):
->> """Calculates the standard deviation by day between datasets.
->> Gets all the inflammation data from CSV files within a directory, works out the mean
->> inflammation value for each day across all datasets, then visualises the
->> standard deviation of these means on a graph."""
->> data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv'))
->> if len(data_file_paths) == 0:
->> raise ValueError(f"No inflammation csv's found in path {data_dir}")
->> data = map(models.load_csv, data_file_paths)
->> daily_standard_deviation = compute_standard_deviation_by_day(data)
+>> ```python
+>> def analyse_data(data_dir):
+>>     """Calculates the standard deviation by day between datasets.
+>>     Gets all the inflammation data from CSV files within a directory, works out the mean
+>>     inflammation value for each day across all datasets, then visualises the
+>>     standard deviation of these means on a graph."""
+>>     data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv'))
+>>     if len(data_file_paths) == 0:
+>>         raise ValueError(f"No inflammation csv's found in path {data_dir}")
+>>     data = map(models.load_csv, data_file_paths)
+>>     daily_standard_deviation = compute_standard_deviation_by_day(data)
 >>
->> graph_data = {
->> 'standard deviation by day': daily_standard_deviation,
->> }
->> # views.visualize(graph_data)
->> return daily_standard_deviation
+>>     graph_data = {
+>>         'standard deviation by day': daily_standard_deviation,
+>>     }
+>>     # views.visualize(graph_data)
+>>     return daily_standard_deviation
 >>```
 >> Make sure to re-run the regression test to check this refactoring has not
 >> changed the output of `analyse_data()`.
```
```diff
@@ -252,17 +252,17 @@ from CSV to JSON, the bulk of the tests need not be updated
 >> You might have thought of more tests, but we can easily extend the test by parametrizing
 >> with more inputs and expected outputs:
 >> ```python
->>@pytest.mark.parametrize('data,expected_output', [
->> ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),
->> ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),
->> ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0])
->>],
->>ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files'])
->>def test_compute_standard_deviation_by_day(data, expected_output):
->> from inflammation.compute_data import compute_standard_deviation_by_data
+>> @pytest.mark.parametrize('data,expected_output', [
+>>     ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),
+>>     ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),
+>>     ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0])
+>> ],
+>> ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files'])
+>> def test_compute_standard_deviation_by_day(data, expected_output):
+>>     from inflammation.compute_data import compute_standard_deviation_by_data
 >>
->> result = compute_standard_deviation_by_data(data)
->> npt.assert_array_almost_equal(result, expected_output)
+>>     result = compute_standard_deviation_by_data(data)
+>>     npt.assert_array_almost_equal(result, expected_output)
 >>```
 > {: .solution}
 {: .challenge}
```
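The three parametrized cases above can also be checked with plain assertions, independent of pytest. The implementation of `compute_standard_deviation_by_day` here is our reconstruction from the lesson's description (per-day means per file, then the standard deviation of those means across files):

```python
import math
import numpy as np

def compute_standard_deviation_by_day(data):
    # data: list of 2D arrays shaped (patients, days)
    means_by_day = np.stack([np.mean(d, axis=0) for d in data])
    return np.std(means_by_day, axis=0)

cases = [
    ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),                          # two patients, same file
    ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),          # two patients, different files
    ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),  # two identical files
]
for data, expected in cases:
    result = compute_standard_deviation_by_day([np.array(d) for d in data])
    np.testing.assert_array_almost_equal(result, expected)
```

Because the function is pure, each case needs only literal arrays as input and a literal expected output, which is precisely what makes the parametrized pytest version so compact.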

_episodes/35-software-architecture-revisited.md

Lines changed: 13 additions & 13 deletions
```diff
@@ -72,19 +72,19 @@ how you should structure your code.
 >>
 >> ```python
 >> if args.full_data_analysis:
->>   _, extension = os.path.splitext(InFiles[0])
->>   if extension == '.json':
->>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
->>   elif extension == '.csv':
->>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
->>   else:
->>     raise ValueError(f'Unsupported file format: {extension}')
->>   data_result = analyse_data(data_source)
->>   graph_data = {
->>     'standard deviation by day': data_result,
->>   }
->>   views.visualize(graph_data)
->>   return
+>>     _, extension = os.path.splitext(infiles[0])
+>>     if extension == '.json':
+>>         data_source = JSONDataSource(os.path.dirname(infiles[0]))
+>>     elif extension == '.csv':
+>>         data_source = CSVDataSource(os.path.dirname(infiles[0]))
+>>     else:
+>>         raise ValueError(f'Unsupported file format: {extension}')
+>>     data_result = analyse_data(data_source)
+>>     graph_data = {
+>>         'standard deviation by day': data_result,
+>>     }
+>>     views.visualize(graph_data)
+>>     return
 >> ```
 >> Note that this is, more or less, the change we did to write our regression test.
 >> This demonstrates that splitting up Model code from View code can
```
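The Model/View/Controller split that this last change relies on can be sketched end to end on in-memory data. All names here (`daily_std_of_means`, `render`, `main`) are our own stand-ins for the project's model, view and controller code, with the view reduced to returning text instead of plotting:

```python
import numpy as np

# Model: pure computation, no file I/O and no plotting
def daily_std_of_means(data):
    means_by_day = np.stack([d.mean(axis=0) for d in data])
    return np.std(means_by_day, axis=0)

# View: presentation only (a text stand-in for views.visualize)
def render(graph_data):
    return ', '.join(f'{name}: {values.tolist()}' for name, values in graph_data.items())

# Controller: wires the model result into the view
def main(data):
    result = daily_std_of_means(data)
    return render({'standard deviation by day': result})

print(main([np.array([[0., 2., 0.]]), np.array([[0., 1., 0.]])]))
# prints: standard deviation by day: [0.0, 0.5, 0.0]
```

Because the model returns its result instead of drawing it, a regression test can assert on `daily_std_of_means()` directly while the view stays free to change.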
