@@ -253,13 +253,14 @@ \section{Introduction}\label{sec-intro}
253
253
the universe. However, the development of new machine learning methods
254
254
and data-intensive discovery more generally depends on Findability,
255
255
Accessibility, Interoperability and Reusability (FAIR) of data
256
- (Wilkinson et al. 2016). One of the main mechanisms through which the
257
- FAIR principles are promoted is the development of \emph{standards} for
258
- data and metadata. Standards can vary in the level of detail and scope,
259
- and encompass such things as \emph{file formats} for the storage of
260
- certain data types, \emph{schemas} for databases that organize data,
261
- \emph{ontologies} to describe and organize metadata in a manner that
262
- connects it to field-specific meaning, as well as mechanisms to describe
256
+ (Wilkinson et al. 2016) as well as metadata (Musen 2022). One of the
257
+ main mechanisms through which the FAIR principles are promoted is the
258
+ development of \emph{standards} for data and metadata. Standards can
259
+ vary in the level of detail and scope, and encompass such things as
260
+ \emph{file formats} for the storage of certain data types,
261
+ \emph{schemas} for databases that organize data, \emph{ontologies} to
262
+ describe and organize metadata in a manner that connects it to
263
+ field-specific meaning, as well as mechanisms to describe
263
264
\emph{provenance} of analysis products.
264
265
265
266
Community-driven development of robust, adaptable and useful standards
@@ -272,30 +273,30 @@ \section{Introduction}\label{sec-intro}
272
273
OSS. For example, the Open Source Initiative (OSI), a non-profit
273
274
organization that was founded in the 1990s, developed a set of guidelines
274
275
for licensing of OSS that is designed to protect the rights of
275
- developers and users. On the more technical side, tools such as the Git
276
- Source-code management system support open-source development workflows
277
- that can be adopted in the development of standards. Governance
278
- approaches have been honed to address the challenges of managing a range
279
- of stakeholder interests and to mediate between large numbers of
280
- weakly-connected individuals that contribute to OSS. When these social
281
- and technical innovations are put together they enable a host of
282
- positive defining features of OSS, such as transparency, collaboration,
283
- and decentralization. These features allow OSS to have a remarkable
284
- level of dynamism and productivity, while also retaining the ability of
285
- a variety of stakeholders to guide the evolution of the software to take
286
- their needs and interests into account.
287
-
288
- Data and metadata standards that adopt tools and practices of OSS
289
- (``open-source standards'' henceforth) stand to reap many of the
290
- benefits that the OSS model has provided in the development of other
291
- technologies. The present report explore how OSS processes and tools
292
- have affected the development of data and metadata standards. The report
293
- will triangulate common features of a variety of use cases; it will
294
- identify some of the challenges and pitfalls of this mode of standards
295
- development, with a particular focus on cross-sector interactions; and
296
- it will make recommendations for future developments and policies that
297
- can help this mode of standards development thrive and reach its full
298
- potential.
276
+ developers and users. On the technical side, tools such as the Git
277
+ source-code management system support complex and distributed
278
+ open-source workflows that accelerate, streamline, and strengthen OSS
279
+ development. Governance approaches have been honed to address the
280
+ challenges of managing a range of stakeholder interests and to mediate
281
+ between large numbers of weakly-connected individuals that contribute to
282
+ OSS. When these social and technical innovations are put together they
283
+ enable a host of positive defining features of OSS, such as
284
+ transparency, collaboration, and decentralization. These features allow
285
+ OSS to have a remarkable level of dynamism and productivity, while also
286
+ retaining the ability of a variety of stakeholders to guide the
287
+ evolution of the software to take their needs and interests into
288
+ account.
289
+
290
+ Data and metadata standards that use tools and practices of OSS
291
+ (``open-source standards'' henceforth) reap many of the benefits that
292
+ the OSS model has provided in the development of other technologies. The
293
+ present report explores how OSS processes and tools have affected the
294
+ development of data and metadata standards. The report will triangulate
295
+ common features of a variety of use cases; it will identify some of the
296
+ challenges and pitfalls of this mode of standards development, with a
297
+ particular focus on cross-sector interactions; and it will make
298
+ recommendations for future developments and policies that can help this
299
+ mode of standards development thrive and reach its full potential.
299
300
300
301
\section{Use cases}\label{sec-use-cases}
301
302
@@ -307,11 +308,14 @@ \section{Use cases}\label{sec-use-cases}
307
308
organizations such as LSST, CERN, and NASA, while other fields have only
308
309
relatively recently become aware of the value of data sharing and its
309
310
impact. These disparate histories inform how standards have evolved and
310
- how OSS practices have pervaded their development.
311
+ how OSS practices have pervaded their development. They also demonstrate
312
+ field-specific limitations on the adoption of OSS tools and practices
313
+ that exemplify some of the challenges, which we will explore
314
+ subsequently.
311
315
312
316
\subsection{Astronomy}\label{astronomy}
313
317
314
- One prominent example of a community-driven standard is the FITS
318
+ An early, prominent example of a community-driven standard is the FITS
315
319
(Flexible Image Transport System) file format standard, which was
316
320
developed in the late 1970s and early 1980s (Wells and Greisen 1979),
317
321
and has been adopted worldwide for astronomy data preservation and
@@ -320,7 +324,7 @@ \subsection{Astronomy}\label{astronomy}
320
324
1980s to store image data in the visible and x-ray spectrum. It has been
321
325
endorsed by the IAU, as well as by funding agencies. Though the format has
322
326
evolved over time, ``once FITS, always FITS''. That is, the format
323
- cannot be evolved to introduce changes that break backwards
327
+ cannot be evolved to introduce changes that break backward
324
328
compatibility. Among the features that make FITS so durable is that it
325
329
was designed originally to have a very restricted metadata schema. That
326
330
is, FITS records were designed to be the lowest common denominator of
@@ -336,13 +340,14 @@ \subsection{High-energy physics (HEP)}\label{high-energy-physics-hep}
336
340
Because data collection is centralized, standards to collect and store
337
341
HEP data have been established and the adoption of these standards in
338
342
data analysis has high penetration (Basaglia et al. 2023). A top-down
339
- approach is taken so that within every large collaboration standards are
340
- enforced, and this adoption is centrally managed. Access to raw data is
341
- essentially impossible, and making it publicly available is both
342
- technically very hard and potentially ill-advised. Therefore, analysis
343
- tools are tuned specifically to the standards. Incentives to use the
344
- standards are provided by funders that require data management plans
345
- that specify how the data is shared.
343
+ approach is taken so that within every large collaboration, standards
344
+ are enforced, and this adoption is centrally managed. Access to raw data
345
+ is essentially impossible because of its large volume, and making it
346
+ publicly available is both technically very hard and potentially
347
+ ill-advised. Therefore, analysis tools are tuned specifically to the
348
+ standards of the released data. Incentives to use the standards are
349
+ provided by funders that require data management plans that specify how
350
+ the data is shared (i.e., in a standards-compliant manner).
346
351
347
352
\subsection{Earth sciences}\label{earth-sciences}
348
353
@@ -380,16 +385,32 @@ \subsection{Neuroscience}\label{neuroscience}
380
385
development to accept contributions from a wide range of stakeholders
381
386
and tap a broad base of expertise.
382
387
383
- \subsection{Automated discovery}\label{automated-discovery}
384
-
385
388
\subsection{Community science}\label{community-science}
386
389
387
390
Another interesting use case for open-source standards is
388
- community/citizen science. This approach, which has grown Here,
389
- standards are needed to facilitate interactions between an in-group of
390
- expert researchers who generate and curate data and a broader set of
391
- out-group enthusiasts who would like to make meaningful contributions to
392
- the science.
391
+ community/citizen science. This approach, which has grown in the last 20
392
+ years, benefits both the research field, which harnesses the energy of
393
+ non-scientist members of the community to engage with scientific
394
+ data, and the community members themselves, who can draw both
395
+ knowledge and pride from their participation in the scientific
396
+ endeavor. It is also recognized that unique broader benefits are accrued
397
+ from this mode of scientific research, through the inclusion of
398
+ perspectives and data that would not otherwise be included. To make data
399
+ accessible to community scientists, and to make the data collected by
400
+ community scientists accessible to professional scientists, it needs to
401
+ be provided in a form that can be created and accessed without
402
+ specialized instruments or specialized knowledge. Here, standards are
403
+ needed to facilitate interactions between an in-group of expert
404
+ researchers who generate and curate data and a broader set of out-group
405
+ enthusiasts who would like to make meaningful contributions to the
406
+ science. This creates a particularly stringent constraint on
407
+ transparency and simplicity of standards. Creating these standards in a
408
+ manner that addresses these unique constraints can benefit from OSS
409
+ tools, with the caveat that some of these tools require additional
410
+ expertise. For example, if the standard is developed using Git/GitHub
411
+ for versioning, this would require learning the complex and obscure
412
+ technical aspects of these systems that are far from easy to adopt, even
413
+ for many professional scientists.
393
414
394
415
\section{Opportunities and risks for open-source
395
416
standards}\label{sec-challenges}
@@ -851,6 +872,10 @@ \section*{References}\label{references}
851
872
Implementing FAIR Principles}. 1st ed. Vol. 1. Milton: CRC Press.
852
873
\url{https://doi.org/10.1201/9781315380711}.
853
874
875
+ \bibitem[\citeproctext]{ref-Musen2022metadata}
876
+ Musen, Mark A. 2022. {``Without Appropriate Metadata, Data-Sharing
877
+ Mandates Are Pointless.''} \emph{Nature} 609 (7926): 222.
878
+
854
879
\bibitem[\citeproctext]{ref-NIST2019}
855
880
National Institute of Standards and Technology. 2019. {``{U.S}.
856
881
{LEADERSHIP} {IN} {AI}: A Plan for Federal Engagement in Developing