@@ -62,29 +62,46 @@ Data loading is a functionality separate from data analysis, so firstly
let's decouple the data loading part into a separate component (function).
> ## Exercise: Decouple Data Loading from Data Analysis
- > Separate out the data loading functionality from `analyse_data()` into a new function
- > `load_inflammation_data()` that returns a list of 2D NumPy arrays with inflammation data
+ >
+ > Modify `compute_data.py` to separate out the data loading functionality from `analyse_data()` into a new function
+ > `load_inflammation_data()`, which returns a list of 2D NumPy arrays with inflammation data
> loaded from all inflammation CSV files found in a specified directory path.
+ > Then, change your `analyse_data()` function to make use of this new function instead.
+ >
>> ## Solution
+ >>
>> The new function `load_inflammation_data()` that reads all the inflammation data into the
>> format needed for the analysis could look something like:
+ >>
>> ```python
>> def load_inflammation_data(dir_path):
- >>     data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
- >>     if len(data_file_paths) == 0:
- >>         raise ValueError(f"No inflammation CSV files found in path {dir_path}")
- >>     data = map(models.load_csv, data_file_paths) # load inflammation data from each CSV file
- >>     return list(data) # return the list of 2D NumPy arrays with inflammation data
+ >>     data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
+ >>     if len(data_file_paths) == 0:
+ >>         raise ValueError(f"No inflammation CSV files found in path {dir_path}")
+ >>     data = map(models.load_csv, data_file_paths) # Load inflammation data from each CSV file
+ >>     return list(data) # Return the list of 2D NumPy arrays with inflammation data
>> ```
- >> This function can now be used in the analysis as follows:
+ >>
+ >> The modified `analyse_data()` function could then look like:
+ >>
>> ```python
>> def analyse_data(data_dir):
- >>     data = load_inflammation_data(data_dir)
- >>     daily_standard_deviation = compute_standard_deviation_by_data(data)
- >>     ...
+ >>     data = load_inflammation_data(data_dir)
+ >>
+ >>     means_by_day = map(models.daily_mean, data)
+ >>     means_by_day_matrix = np.stack(list(means_by_day))
+ >>
+ >>     daily_standard_deviation = np.std(means_by_day_matrix, axis=0)
+ >>
+ >>     graph_data = {
+ >>         'standard deviation by day': daily_standard_deviation,
+ >>     }
+ >>     views.visualize(graph_data)
>> ```
+ >>
>> The code is now easier to follow since we do not need to understand the data loading part
>> to understand the statistical analysis part, and vice versa.
+ >> In most cases, functions work best when they are short!
> {: .solution}
{: .challenge}
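
With this split in place, the controller only needs to hand `analyse_data()` a directory path. A minimal sketch of such a call, assuming (as in the controller code shown later in this section) that `infiles` holds the paths of the input CSV files:

```python
import os

# Hypothetical driver code: analyse all inflammation CSV files found
# in the directory containing the first input file
data_dir = os.path.dirname(infiles[0])
analyse_data(data_dir)
```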
@@ -185,13 +202,12 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio
>> At the end of this exercise, the code in the `analyse_data()` function should look like:
>> ```python
>> def analyse_data(data_source):
- >>     data = data_source.load_inflammation_data()
- >>     daily_standard_deviation = compute_standard_deviation_by_data(data)
- >>     ...
+ >>     data = data_source.load_inflammation_data()
+ >>     ...
>> ```
>> The controller code should look like:
>> ```python
- >> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+ >> data_source = CSVDataSource(os.path.dirname(infiles[0]))
>> analyse_data(data_source)
>> ```
> {: .solution}
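
Note the design choice here: instead of a file path, `analyse_data()` now receives an object that knows how to supply the data, a form of what is often called dependency injection. Any object with a `load_inflammation_data()` method will work; as a minimal sketch, the stub class below (hypothetical, not part of the lesson code) feeds the analysis fixed in-memory data with no file I/O at all:

```python
import numpy as np

class StubDataSource:
    """Feeds the analysis a fixed list of 2D NumPy arrays, with no file I/O."""
    def __init__(self, data):
        self.data = data

    def load_inflammation_data(self):
        return self.data

# The same analysis runs unchanged on in-memory data
analyse_data(StubDataSource([np.array([[1., 2.], [3., 4.]])]))
```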
@@ -200,33 +216,32 @@ In addition, implementation of the method `get_area()` is hidden too (abstraction
>>
>> ```python
>> class CSVDataSource:
- >>     """
- >>     Loads all the inflammation CSV files within a specified directory.
- >>     """
- >>     def __init__(self, dir_path):
- >>         self.dir_path = dir_path
+ >>     """
+ >>     Loads all the inflammation CSV files within a specified directory.
+ >>     """
+ >>     def __init__(self, dir_path):
+ >>         self.dir_path = dir_path
>>
- >>     def load_inflammation_data(self):
- >>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
- >>         if len(data_file_paths) == 0:
- >>             raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
- >>         data = map(models.load_csv, data_file_paths)
- >>         return list(data)
+ >>     def load_inflammation_data(self):
+ >>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
+ >>         if len(data_file_paths) == 0:
+ >>             raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
+ >>         data = map(models.load_csv, data_file_paths)
+ >>         return list(data)
>> ```
>> In the controller, we create an instance of `CSVDataSource` and pass it
>> into the statistical analysis function.
>>
>> ```python
- >> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+ >> data_source = CSVDataSource(os.path.dirname(infiles[0]))
>> analyse_data(data_source)
>> ```
>> The `analyse_data()` function is modified to receive any data source object (that implements
>> the `load_inflammation_data()` method) as a parameter.
>> ```python
>> def analyse_data(data_source):
- >>     data = data_source.load_inflammation_data()
- >>     daily_standard_deviation = compute_standard_deviation_by_data(data)
- >>     ...
+ >>     data = data_source.load_inflammation_data()
+ >>     ...
>> ```
>> We have now fully decoupled the reading of the data from the statistical analysis and
>> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various
@@ -364,11 +379,11 @@ data sources with no extra work.
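>> For instance, a JSON-backed source could mirror `CSVDataSource`. A minimal sketch, assuming a
>> `models.load_json` helper (hypothetical here) that reads a single JSON file into a 2D NumPy array:
>>
>> ```python
>> class JSONDataSource:
>>     """
>>     Loads all the inflammation JSON files within a specified directory.
>>     """
>>     def __init__(self, dir_path):
>>         self.dir_path = dir_path
>>
>>     def load_inflammation_data(self):
>>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.json'))
>>         if len(data_file_paths) == 0:
>>             raise ValueError(f"No inflammation JSON files found in path {self.dir_path}")
>>         data = map(models.load_json, data_file_paths)  # Load inflammation data from each JSON file
>>         return list(data)
>> ```
>>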
>> Additionally, in the controller we will need to select an appropriate DataSource instance to
>> provide to the analysis:
>> ```python
- >> _, extension = os.path.splitext(InFiles[0])
+ >> _, extension = os.path.splitext(infiles[0])
>> if extension == '.json':
- >>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+ >>     data_source = JSONDataSource(os.path.dirname(infiles[0]))
>> elif extension == '.csv':
- >>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+ >>     data_source = CSVDataSource(os.path.dirname(infiles[0]))
>> else:
>>     raise ValueError(f'Unsupported data file format: {extension}')
>> analyse_data(data_source)