Commit 1cc9d09

To the clouds!
1 parent 2a68cac commit 1cc9d09

4 files changed

+154
-31
lines changed

references.bib

Lines changed: 77 additions & 0 deletions
@@ -1,3 +1,80 @@
1+
@software{zarr,
2+
author = {Alistair Miles and
3+
jakirkham and
4+
M Bussonnier and
5+
Josh Moore and
6+
Dimitri Papadopoulos Orfanos and
7+
Davis Bennett and
8+
David Stansby and
9+
Joe Hamman and
10+
James Bourbeau and
11+
Andrew Fulton and
12+
Gregory Lee and
13+
Ryan Abernathey and
14+
Norman Rzepka and
15+
Zain Patel and
16+
Mads R. B. Kristensen and
17+
Sanket Verma and
18+
Saransh Chopra and
19+
Matthew Rocklin and
20+
AWA BRANDON AWA and
21+
Max Jones and
22+
Martin Durant and
23+
Elliott Sales de Andrade and
24+
Vincent Schut and
25+
raphael dussin and
26+
Shivank Chaudhary and
27+
Chris Barnes and
28+
Juan Nunez-Iglesias and
29+
shikharsg},
30+
title = {zarr-developers/zarr-python: v3.0.0-alpha},
31+
month = jun,
32+
year = 2024,
33+
publisher = {Zenodo},
34+
version = {v3.0.0-alpha},
35+
doi = {10.5281/zenodo.11592827},
36+
url = {https://doi.org/10.5281/zenodo.11592827}
37+
}
38+
39+
@inproceedings{Norman2021CloudBank,
40+
author = {Norman, Michael and Kellen, Vince and Smallen, Shava and DeMeulle, Brian and Strande, Shawn and Lazowska, Ed and Alterman, Naomi and Fatland, Rob and Stone, Sarah and Tan, Amanda and Yelick, Katherine and Van Dusen, Eric and Mitchell, James},
41+
title = {{CloudBank: Managed Services to Simplify Cloud Access for Computer Science Research and Education}},
42+
year = {2021},
43+
isbn = {9781450382922},
44+
publisher = {Association for Computing Machinery},
45+
address = {New York, NY, USA},
46+
url = {https://doi.org/10.1145/3437359.3465586},
47+
doi = {10.1145/3437359.3465586},
48+
abstract = {CloudBank is a cloud access entity founded to enable the computer science research and education communities to harness the profound computational potential of public clouds. By delivering a set of managed services designed to alleviate common points of friction associated with cloud adoption, Cloudbank serves as an integrated service provider to the research and education community. These services include front-line help desk support, cloud solution consulting, training, account management, cost monitoring and optimization support, and automated billing. CloudBank has a multi-cloud pay-per-use billing model and aims to serve the spectrum of cloud users from novice to advanced.},
49+
booktitle = {Practice and Experience in Advanced Research Computing},
50+
articleno = {45},
51+
numpages = {4},
52+
keywords = {Cloud Computing},
53+
location = {Boston, MA, USA},
54+
series = {PEARC '21}
55+
}
56+
57+
@article{Connolly2023Software,
58+
author = {Connolly, Andrew and Hellerstein, Joseph and Alterman, Naomi and Beck, David and Fatland, Rob and Lazowska, Ed and Mandava, Vani and Stone, Sarah},
59+
journal = {Harvard Data Science Review},
60+
number = {2},
61+
year = {2023},
62+
month = {apr 27},
63+
note = {https://hdsr.mitpress.mit.edu/pub/f0f7h5cu},
64+
publisher = {},
65+
title = {
66+
67+
{Software} {Engineering} {Practices} in {Academia}: Promoting the 3Rs---{Readability}, {Resilience}, and {Reuse}},
68+
volume = {5},
69+
}
70+
71+
@article{pestilli2021community,
72+
title={A community-driven development of the Brain Imaging Data Standard (BIDS) to describe macroscopic brain connections},
73+
author={Pestilli, Franco and Poldrack, Russ and Rokem, Ariel and Satterthwaite, Theodore and Feingold, Franklin and Duff, Eugene and Pernet, Cyril and Smith, Robert and Esteban, Oscar and Cieslak, Matt},
74+
journal={OSF},
75+
year={2021}
76+
}
77+
178
@MISC{Nosek2019CultureChange,
279
title = "Strategy for Culture Change",
380
author = "Nosek, Brian",

sections/02-use-cases.qmd

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# Use cases
1+
# Use cases {#sec-use-cases}
22

33
To understand how OSS development practices affect the development of data and
44
metadata standards, it is informative to demonstrate this cross-fertilization

sections/03-challenges.qmd

Lines changed: 46 additions & 13 deletions
@@ -1,9 +1,9 @@
1-
# Opportunities and risks for open-source standards {#sec-opportunities}
1+
# Opportunities and risks for open-source standards {#sec-challenges}
22

33
At the same time, these tools and practices are associated with risks that need
44
to be mitigated.
55

6-
## Flexibility vs. stability
6+
## Flexibility vs. Stability
77

88
One of the defining characteristics of OSS is its dynamism and its rapid
99
evolution. Because OSS can be used by anyone and, in most cases, contributions
@@ -59,27 +59,60 @@ standardization lacks formal avenues for success and recognition, for example th
5959
Data standardization investment is justified if the standard is generalizable
6060
beyond any specific science domain. However while the use cases are domain
6161
sciences based, data standardization is seen as a data infrastructure and not a
62-
science investment. Moreover due to how science research funding works,
63-
scientists lack incentives to work across domains, or work on infrastructure
62+
science investment. Moreover, due to how science research funding works,
63+
scientists lack incentives to work across domains or to work on infrastructure
6464
problems.
6565

6666
## Data instrumentation issues
6767

6868
Data for scientific observations are often generated by proprietary
69-
instrumentation due to commercialization or other profit driven incentives.
70-
There islack of regulatory oversight to adhere to available standards or evolve
71-
Significant data transformation is required to get data to a state that is
72-
amenable to standards, if available. If not available, there is lack of
69+
instrumentation due to commercialization or other profit-driven incentives.
70+
There is a lack of regulatory oversight to adhere to available standards or
71+
evolve them. Significant data transformation is required to get data to a state that
72+
is amenable to standards, if available. If not available, there is a lack of
7373
incentive to set aside investment or resources to invest in establishing data
7474
standards.
7575

76+
### Harnessing new computing paradigms and technologies
77+
78+
Open-source standards development faces the challenges of adapting to new
79+
computing paradigms and technologies. Cloud computing provides a particularly
80+
stark set of opportunities and challenges. On the one hand, cloud computing
81+
offers practical solutions for many challenges of contemporary data-driven
82+
research. For example, the scalability of cloud resources addresses some of the
83+
challenges of the scale of data that is produced by instruments in many fields.
84+
The cloud also makes data access relatively straightforward, because of the
85+
ability to determine data access permissions in a granular fashion. On the
86+
other hand, cloud computing requires reinstrumenting many data formats. This is
87+
because cloud data access patterns are fundamentally different from the ones
88+
that are used in local POSIX-style file systems. Suspicion of cloud computing
89+
comes in two different flavors. The first comes from researchers and administrators who
90+
may be wary of costs associated with cloud computing, and especially with the
91+
difficulty of predicting these costs. Projects such as NSF's CloudBank seek to
92+
mitigate some of these concerns by providing an additional layer of
93+
transparency into cloud costs [@Norman2021CloudBank]. The other type of
94+
objection relates to the fact that cloud computing services, by their very
95+
nature, are closed ecosystems that resist portability and interoperability.
96+
Some aspects of the services are always going to remain hidden and privy only
97+
to the cloud computing service provider. In this respect, cloud computing runs
98+
afoul of some of the appealing aspects of OSS. That said, the development of
99+
"cloud native" standards can provide significant benefits in terms of the
100+
research that can be conducted. For example, NOAA plans to use cloud computing
101+
for integration across the multiple disparate datasets that it collects to
102+
build knowledge graphs that can be queried by researchers to answer questions
103+
that can only be answered through this integration. Putting all the data "in
104+
one place" should help with that. Adaptation to the cloud in terms of data
105+
standards has driven development of new file formats. A salient example is the
106+
Zarr format [@zarr], which supports random access into array-based datasets
107+
stored in cloud object storage, facilitating scalable and parallelized
108+
computing on these data. Indeed, data standards such as NWB (neuroscience) and
109+
OME (microscopy) now use Zarr as a backend for cloud-based storage. In other
110+
cases, file formats that were once not straightforward to use in the cloud,
111+
such as HDF5 and TIFF, have been adapted to cloud use (e.g., through the
112+
Cloud Optimized GeoTIFF format).
113+
76114
## Sustainability
77115

78116
## The importance of automated validation
79117

80-
## Harnessing new computing paradigms and technologies
81-
82-
Open-source standards development faces the challenges of adapting to new
83-
technologies The development of standards that are well-Cloud computing
84-
provides
85118

sections/05-recommendations.qmd

Lines changed: 30 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
11

2-
# Recommendations for open-source data and metadata standards
2+
# Recommendations for open-source data and metadata standards {#sec-recommendations}
33

44
In conclusion of this report, we propose the following recommendations:
55

6-
## Funding or Grantmaking entities:
6+
## Policy-making and Funding entities:
77

88
### Fund Data Standards Development
99

@@ -15,17 +15,26 @@ encourage the development and adoption of standards, and fund associated
1515
community efforts and tools for this. The OSS model is seen as a particularly
1616
promising avenue for an investment of resources, because it builds on
1717
previously-developed procedures and technical infrastructure and because it
18-
provides avenues for community input along the way. The clarity offered by
19-
procedures for enhancement proposals and semantic versioning schemes adopted in
20-
standards development offer avenues for a range of stakeholders to propose to
21-
funding bodies well-defined contributions to large and field-wide standards
22-
efforts.
23-
24-
### Invest in Data Stewards Recognize data stewards as a distinct role in
25-
research and science investment. Set up programs for training for data stewards
26-
and invest in career paths that encourage this role. Initial proposals for the
27-
curriculum and scope of the role have already been proposed (e.g., in
28-
[@Mons2018DataStewardshipBook])
18+
provides avenues for democratization of development processes and for community
19+
input along the way. The clarity offered by procedures for enhancement
20+
proposals and semantic versioning schemes adopted in standards development
21+
offers avenues for a range of stakeholders to propose to funding bodies
22+
well-defined contributions to large and field-wide standards efforts (e.g., [@pestilli2021community]).
23+
24+
### Invest in Data Stewards
25+
26+
Advancing the development and adoption of open-source standards requires the
27+
dissemination of knowledge to researchers in a variety of fields, but this
28+
dissemination itself may not be enough without the fostering of specialized
29+
expertise. Therefore, it is important to recognize *data stewards* as a
30+
distinct role in research. To truly support experts whose role will be to
31+
develop, maintain, and facilitate the adoption and use of open-source
32+
standards, it will be necessary to set up programs for training for data
33+
stewards and invest in career paths that encourage this role. Initial proposals
34+
for the curriculum and scope of the role have already been made (e.g., in
35+
[@Mons2018DataStewardshipBook]). In addition, to make the best use of open-source standards, these individuals will need to be facile in the methodology of OSS. This does not mean that they need to become software engineers -- though there may be some overlap with the role of research software engineers [@Connolly2023Software] -- but rather that they
36+
need to become familiar with those parts of the OSS development life-cycle that
37+
are useful for development of open-source standards.
2938

3039
### Review Data Standards Pathways
3140

@@ -50,18 +59,22 @@ metadata and descriptions of how to use it.
5059

5160
### Program Manage Cross Sector alliances
5261

53-
Encourage cross sector and cross domain alliances that can impact successful standards creation. Invest in robust program management of these alliances to align pace and create incentives (for instance via Open Source Program Office / OSPO efforts). Similar to program officers at funding agencies, standards evolution need sustained PM efforts. Multi company partnerships should include strategic initiatives for standard establishment e.g. [Pistoiaalliance](https://www.pistoiaalliance.org/news/press-release-pistoia-alliance-launches-idmp-1-0/).
54-
62+
Encourage cross-sector and cross-domain alliances that can impact successful
63+
standards creation. Invest in robust program management of these alliances to
64+
align pace and create incentives (for instance via Open Source Program Office /
65+
OSPO efforts). Similar to program officers at funding agencies, standards
66+
evolution needs sustained PM efforts. Multi-company partnerships should include
67+
strategic initiatives for standard establishment, e.g.,
68+
[Pistoia Alliance](https://www.pistoiaalliance.org/news/press-release-pistoia-alliance-launches-idmp-1-0/).
5569

5670

5771
### Curriculum Development
5872

5973
Stakeholder organizations should invest in training grants to establish curriculum for data and metadata standards education. </ol>
6074

61-
6275
## Science and Technology Communities:
6376

64-
### User Driven Development
77+
### User-Driven Development
6578

6679
Standards should be needs-driven and developed in close collaboration with users. Changes and enhancements should be in response to community feedback.
6780
