Commit cd61e15

author
Quarto GHA Workflow Runner
committed
Built site for gh-pages
1 parent d5ddc61 commit cd61e15

22 files changed

+209
-144
lines changed

.nojekyll

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1 +1 @@
1-
4a8cda21
1+
d3ff964a

_tex/index.tex

Lines changed: 73 additions & 48 deletions
@@ -253,13 +253,14 @@ \section{Introduction}\label{sec-intro}
253253
the universe. However, the development of new machine learning methods
254254
and data-intensive discovery more generally depends on Findability,
255255
Accessibility, Interoperability and Reusability (FAIR) of data
256-
(Wilkinson et al. 2016). One of the main mechanisms through which the
257-
FAIR principles are promoted is the development of \emph{standards} for
258-
data and metadata. Standards can vary in the level of detail and scope,
259-
and encompass such things as \emph{file formats} for the storage of
260-
certain data types, \emph{schemas} for databases that organize data,
261-
\emph{ontologies} to describe and organize metadata in a manner that
262-
connects it to field-specific meaning, as well as mechanisms to describe
256+
(Wilkinson et al. 2016) as well as metadata (Musen 2022). One of the
257+
main mechanisms through which the FAIR principles are promoted is the
258+
development of \emph{standards} for data and metadata. Standards can
259+
vary in the level of detail and scope, and encompass such things as
260+
\emph{file formats} for the storage of certain data types,
261+
\emph{schemas} for databases that organize data, \emph{ontologies} to
262+
describe and organize metadata in a manner that connects it to
263+
field-specific meaning, as well as mechanisms to describe
263264
\emph{provenance} of analysis products.
264265

265266
Community-driven development of robust, adaptable and useful standards
@@ -272,30 +273,30 @@ \section{Introduction}\label{sec-intro}
272273
OSS. For example, the Open Source Initiative (OSI), a non-profit
273274
organization that was founded in the 1990s developed a set of guidelines
274275
for licensing of OSS that is designed to protect the rights of
275-
developers and users. On the more technical side, tools such as the Git
276-
Source-code management system support open-source development workflows
277-
that can be adopted in the development of standards. Governance
278-
approaches have been honed to address the challenges of managing a range
279-
of stakeholder interests and to mediate between large numbers of
280-
weakly-connected individuals that contribute to OSS. When these social
281-
and technical innovations are put together they enable a host of
282-
positive defining features of OSS, such as transparency, collaboration,
283-
and decentralization. These features allow OSS to have a remarkable
284-
level of dynamism and productivity, while also retaining the ability of
285-
a variety of stakeholders to guide the evolution of the software to take
286-
their needs and interests into account.
287-
288-
Data and metadata standards that adopt tools and practices of OSS
289-
(``open-source standards'' henceforth) stand to reap many of the
290-
benefits that the OSS model has provided in the development of other
291-
technologies. The present report explore how OSS processes and tools
292-
have affected the development of data and metadata standards. The report
293-
will triangulate common features of a variety of use cases; it will
294-
identify some of the challenges and pitfalls of this mode of standards
295-
development, with a particular focus on cross-sector interactions; and
296-
it will make recommendations for future developments and policies that
297-
can help this mode of standards development thrive and reach its full
298-
potential.
276+
developers and users. On the technical side, tools such as the Git
277+
Source-code management system support complex and distributed
278+
open-source workflows that accelerate, streamline, and robustify OSS
279+
development. Governance approaches have been honed to address the
280+
challenges of managing a range of stakeholder interests and to mediate
281+
between large numbers of weakly-connected individuals that contribute to
282+
OSS. When these social and technical innovations are put together they
283+
enable a host of positive defining features of OSS, such as
284+
transparency, collaboration, and decentralization. These features allow
285+
OSS to have a remarkable level of dynamism and productivity, while also
286+
retaining the ability of a variety of stakeholders to guide the
287+
evolution of the software to take their needs and interests into
288+
account.
289+
290+
Data and metadata standards that use tools and practices of OSS
291+
(``open-source standards'' henceforth) reap many of the benefits that
292+
the OSS model has provided in the development of other technologies. The
293+
present report explores how OSS processes and tools have affected the
294+
development of data and metadata standards. The report will triangulate
295+
common features of a variety of use cases; it will identify some of the
296+
challenges and pitfalls of this mode of standards development, with a
297+
particular focus on cross-sector interactions; and it will make
298+
recommendations for future developments and policies that can help this
299+
mode of standards development thrive and reach its full potential.
299300

300301
\section{Use cases}\label{sec-use-cases}
301302

@@ -307,11 +308,14 @@ \section{Use cases}\label{sec-use-cases}
307308
organizations such as LSST, CERN, and NASA, while other fields have only
308309
relatively recently become aware of the value of data sharing and its
309310
impact. These disparate histories inform how standards have evolved and
310-
how OSS practices have pervaded their development.
311+
how OSS practices have pervaded their development. They also demonstrate
312+
field-specific limitations on the adoption of OSS tools and practices
313+
that exemplify some of the challenges, which we will explore
314+
subsequently.
311315

312316
\subsection{Astronomy}\label{astronomy}
313317

314-
One prominent example of a community-driven standard is the FITS
318+
An early prominent example of a community-driven standard is the FITS
315319
(Flexible Image Transport System) file format standard, which was
316320
developed in the late 1970s and early 1980s (Wells and Greisen 1979),
317321
and has been adopted worldwide for astronomy data preservation and
@@ -320,7 +324,7 @@ \subsection{Astronomy}\label{astronomy}
320324
1980s to store image data in the visible and x-ray spectrum. It has been
321325
endorsed by IAU, as well as funding agencies. Though the format has
322326
evolved over time, ``once FITS, always FITS''. That is, the format
323-
cannot be evolved to introduce changes that break backwards
327+
cannot be evolved to introduce changes that break backward
324328
compatibility. Among the features that make FITS so durable is that it
325329
was designed originally to have a very restricted metadata schema. That
326330
is, FITS records were designed to be the lowest common denominator of
@@ -336,13 +340,14 @@ \subsection{High-energy physics (HEP)}\label{high-energy-physics-hep}
336340
Because data collection is centralized, standards to collect and store
337341
HEP data have been established and the adoption of these standards in
338342
data analysis has high penetration (Basaglia et al. 2023). A top-down
339-
approach is taken so that within every large collaboration standards are
340-
enforced, and this adoption is centrally managed. Access to raw data is
341-
essentially impossible, and making it publicly available is both
342-
technically very hard and potentially ill-advised. Therefore, analysis
343-
tools are tuned specifically to the standards. Incentives to use the
344-
standards are provided by funders that require data management plans
345-
that specify how the data is shared.
343+
approach is taken so that within every large collaboration, standards
344+
are enforced, and this adoption is centrally managed. Access to raw data
345+
is essentially impossible because of its large volume, and making it
346+
publicly available is both technically very hard and potentially
347+
ill-advised. Therefore, analysis tools are tuned specifically to the
348+
standards of the released data. Incentives to use the standards are
349+
provided by funders that require data management plans that specify how
350+
the data is shared (i.e., in a standards-compliant manner).
346351

347352
\subsection{Earth sciences}\label{earth-sciences}
348353

@@ -380,16 +385,32 @@ \subsection{Neuroscience}\label{neuroscience}
380385
development to accept contributions from a wide range of stakeholders
381386
and tap a broad base of expertise.
382387

383-
\subsection{Automated discovery}\label{automated-discovery}
384-
385388
\subsection{Community science}\label{community-science}
386389

387390
Another interesting use case for open-source standards is
388-
community/citizen science. This approach, which has grown Here,
389-
standards are needed to facilitate interactions between an in-group of
390-
expert researchers who generate and curate data and a broader set of
391-
out-group enthusiasts who would like to make meaningful contributions to
392-
the science.
391+
community/citizen science. This approach, which has grown in the last 20
392+
years, has many benefits for both the research field that harnesses the
393+
energy of non-scientist members of the community to engage with
394+
scientific data, as well as to the community members themselves who can
395+
draw both knowledge and pride in their participation in the scientific
396+
endeavor. It is also recognized that unique broader benefits are accrued
397+
from this mode of scientific research, through the inclusion of
398+
perspectives and data that would not otherwise be included. To make data
399+
accessible to community scientists, and to make the data collected by
400+
community scientists accessible to professional scientists, it needs to
401+
be provided in a manner that can be created and accessed without
402+
specialized instruments or specialized knowledge. Here, standards are
403+
needed to facilitate interactions between an in-group of expert
404+
researchers who generate and curate data and a broader set of out-group
405+
enthusiasts who would like to make meaningful contributions to the
406+
science. This creates a particularly stringent constraint on
407+
transparency and simplicity of standards. Creating these standards in a
408+
manner that addresses these unique constraints can benefit from OSS
409+
tools, with the caveat that some of these tools require additional
410+
expertise. For example, if the standard is developed using git/GitHub
411+
for versioning, this would require learning the complex and obscure
412+
technical aspects of these systems that are far from easy to adopt, even
413+
for many professional scientists.
393414

394415
\section{Opportunities and risks for open-source
395416
standards}\label{sec-challenges}
@@ -851,6 +872,10 @@ \section*{References}\label{references}
851872
Implementing FAIR Principles}. 1st ed. Vol. 1. Milton: CRC Press.
852873
\url{https://doi.org/10.1201/9781315380711}.
853874

875+
\bibitem[\citeproctext]{ref-Musen2022metadata}
876+
Musen, Mark A. 2022. {``Without Appropriate Metadata, Data-Sharing
877+
Mandates Are Pointless.''} \emph{Nature} 609 (7926): 222.
878+
854879
\bibitem[\citeproctext]{ref-NIST2019}
855880
National Institute of Standards and Technology. 2019. {``{U.S}.
856881
{LEADERSHIP} {IN} {AI}: A Plan for Federal Engagement in Developing

_tex/references.bib

Lines changed: 21 additions & 0 deletions
@@ -1,3 +1,24 @@
1+
2+
@ARTICLE{Musen2022metadata,
3+
title = "Without appropriate metadata, data-sharing mandates are
4+
pointless",
5+
author = "Musen, Mark A",
6+
abstract = "Funders and investigators must demand appropriate metadata
7+
standards to take data from foul to FAIR. Funders and
8+
investigators must demand appropriate metadata standards to take
9+
data from foul to FAIR.",
10+
journal = "Nature",
11+
publisher = "Springer Science and Business Media LLC",
12+
volume = 609,
13+
number = 7926,
14+
pages = "222",
15+
month = sep,
16+
year = 2022,
17+
keywords = "Research data; Research management",
18+
language = "en"
19+
}
20+
21+
122
@software{zarr,
223
author = {Alistair Miles and
324
jakirkham and

index.docx

692 Bytes
Binary file not shown.
