Commit 2568099

committed
More edits.
1 parent 34bae6b commit 2568099

2 files changed

+69
-45
lines changed

sections/01-introduction.qmd

Lines changed: 54 additions & 45 deletions
Original file line number | Diff line number | Diff line change
@@ -9,54 +9,63 @@ machine learning techniques, these datasets can help us understand everything
99
from the cellular operations of the human body, through business transactions
1010
on the internet, to the structure and history of the universe. However, the
1111
development of new machine learning methods, and data-intensive discovery more
12-
generally, rely heavily on the availability and usability of these large
13-
datasets. Data can be openly available but still not useful if it cannot be
14-
properly understood. In current conditions in which almost all of the relevant
15-
data is stored in digital formats, and many relevant datasets can be found
16-
through the communication networks of the world wide web, Findability,
17-
Accessibility, Interoperability and Reusability (FAIR) principles for data
18-
management and stewardship become critically important
19-
\cite{Wilkinson2016FAIR}.
12+
generally, rely heavily on Findability, Accessibility, Interoperability and
13+
Reusability (FAIR) of data [@Wilkinson2016FAIR].
2014

21-
One of the main mechanisms through which these principles are promoted is the
22-
development of \emph{standards} for data and metadata. Standards can vary in
23-
the level of detail and scope, and encompass such things as \emph{file formats}
24-
for the storing of certain data types, \emph{schemas} for databases that store
25-
a range of data types, \emph{ontologies} to describe and organize metadata in a
15+
One of the main mechanisms through which the FAIR principles are promoted is the
16+
development of *standards* for data and metadata. Standards can vary in
17+
the level of detail and scope, and encompass such things as *file formats*
18+
for the storing of certain data types, *schemas* for databases that store
19+
a range of data types, *ontologies* to describe and organize metadata in a
2620
manner that connects it to field-specific meaning, as well as mechanisms to
27-
describe \emph{provenance} of different data derivatives. The importance of
28-
standards was underscored in a recent report report by the Subcommittee on Open
29-
Science of the National Science and Technology Council on "Desirable
30-
characteristics of data repositories for federally funded research"
31-
\cite{nstc2022desirable}. The report explicitly called out the importance of
32-
"allow[ing] datasets and metadata to be accessed, downloaded, or exported from
33-
the repository in widely used, preferably non-proprietary, formats consistent
34-
with standards used in the disciplines the repository serves." This highlights
35-
the need for data and metadata standards across a variety of different kinds of
36-
data. In addition, a report from the National Institute of Standards and
37-
Technology on "U.S. Leadership in AI: A Plan for Federal Engagement in
38-
Developing Technical Standards and Related Tools" emphasized that --
39-
specifically for the case of AI -- "U.S. government agencies should prioritize
40-
AI standards efforts that are [...] Consensus-based, [...] Inclusive and
41-
accessible, [...] Multi-path, [...] Open and transparent, [...] and [that]
42-
Result in globally relevant and non-discriminatory standards..."
43-
\cite{NIST2019}. The converging characteristics of standards that arise from
44-
these reports suggest that considerable thought needs to be given to the manner
45-
in which standards arise, so that these goals are achieved.
21+
describe *provenance* of analysis products.
4622

47-
Standards for a specific domain can come about in various ways, but very
48-
broadly speaking two kinds of mechanisms can generate a standard for a specific
49-
type of data: (i) top-down: in this case a (usually) small group of people
50-
develop the standard and disseminate it to the communities of interest with
51-
very little input from these communities. An example of this mode of standards
52-
development can occur when an instrument is developed by a manufacturer and
53-
users of this instrument receive the data in a particular format that was
54-
developed in tandem with the instrument; and (ii) bottom-up: in this case,
55-
standards are developed by a larger group of people that convene and reach
56-
consensus about the details of the standard in an attempt to cover a large
57-
range of use-cases. Most standards are developed through an interplay between
58-
these two modes, and understanding how to make the best of these modes is
59-
critical in advancing the development of data and metadata standards.
23+
The importance of standards stems not only from discussions within research
24+
fields about how research can best be conducted to take advantage of existing
25+
and growing datasets, but also from an ongoing series of policy
26+
discussions that address the interactions between research communities and the
27+
general public. In the United States, these include memos issued in 2013 and 2022 by the
28+
directors of the White House Office of Science and Technology Policy (OSTP),
29+
John Holdren (2013) and Alondra Nelson (2022). While these memos focused
30+
primarily on making peer-reviewed publications funded by the US Federal
31+
government available to the general public, they also lay an increasingly
32+
detailed path towards the publication and general availability of the data that
33+
is collected as part of the research that is funded by the US government.
34+
35+
The general guidance and overall spirit of these memos dovetail with more
36+
specific policy discussions that put meat on the bones of the general guidance.
37+
The importance of data and metadata standards, for example, was underscored in
38+
a recent report by the Subcommittee on Open Science of the National Science and
39+
Technology Council on the "Desirable characteristics of data repositories for
40+
federally funded research" [@nstc2022desirable]. The report explicitly called
41+
out the importance of "allow[ing] datasets and metadata to be accessed,
42+
downloaded, or exported from the repository in widely used, preferably
43+
non-proprietary, formats consistent with standards used in the disciplines the
44+
repository serves." This highlights the need for data and metadata standards
45+
across a variety of different kinds of data. In addition, a report from the
46+
National Institute of Standards and Technology on "U.S. Leadership in AI: A
47+
Plan for Federal Engagement in Developing Technical Standards and Related
48+
Tools" emphasized that -- specifically for the case of AI -- "U.S. government
49+
agencies should prioritize AI standards efforts that are [...] Consensus-based,
50+
[...] Inclusive and accessible, [...] Multi-path, [...] Open and transparent,
51+
[...] and [that] Result in globally relevant and non-discriminatory
52+
standards..." [@NIST2019]. The converging characteristics of standards that
53+
arise from these reports suggest that considerable thought needs to be given to
54+
the manner in which standards arise, so that these goals are achieved.
55+
56+
Standards for a specific domain can come about in various ways. Broadly
57+
speaking, two kinds of mechanisms can generate a standard for a specific type of
58+
data: (i) top-down: in this case a (usually) small group of people develop the
59+
standard and disseminate it to the communities of interest with very little
60+
input from these communities. An example of this mode of standards development
61+
can occur when an instrument is developed by a manufacturer and users of this
62+
instrument receive the data in a particular format that was developed in tandem
63+
with the instrument; and (ii) bottom-up: in this case, standards are developed
64+
by a larger group of people that convene and reach consensus about the details
65+
of the standard in an attempt to cover a large range of use-cases. Most
66+
standards are developed through an interplay between these two modes, and
67+
understanding how to make the best of these modes is critical in advancing the
68+
development of data and metadata standards.
6069

6170
One source of inspiration for bottom-up development of robust, adaptable and
6271
useful standards comes from open-source software (OSS). OSS has a long history

sections/03-recommendations.qmd

Lines changed: 15 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,6 @@
11

22

3+
<<<<<<< HEAD
34
## Funding or Grantmaking entities:
45

56
### Fund Data Standards Development
@@ -57,5 +58,19 @@ Development of standards should be coupled with development of associated softwa
5758
Additionally, standards evolution should maintain software compatibility, and ability to translate and migrate between standards.
5859

5960

61+
=======
62+
1. Training for data stewards and career paths that encourage this role.
63+
2. Development of meta-standards or standards-of-standards. These are descriptions of cross-cutting best practices. They can be used as a basis for the analysis or assessment of an existing standard, or as guidelines to develop new standards.
64+
3. Recommend pathways or lifecycles for successful data standards. Include process, creators, affiliations, grants, and adoption journeys. Make this documentation step integral to the work of standards creators and granting agencies.
65+
4. Retroactively document #3 for standards such as CF (climate science), NASA GeneLab (space omics), OpenGIS (geospatial), DICOM (medical imaging), GA4GH (genomics), FITS (astronomy), Zarr (domain-agnostic n-dimensional arrays)... ?
66+
5. Create an ontology for the standards process, covering aspects such as top-down vs. bottom-up development, minimum number of datasets, and community size. Examine schema.org (W3C), PEP (Python), CDISC (FDA).
67+
6. Amplify formalization/guidelines on how to create standards (example metadata schema specifications using https://linkml.io).
68+
7. Make data standards machine-readable, and make software creation an integral part of establishing a standard's schema, e.g., identifiers for a person using CFF in citations. The cffconvert software makes the CFF standard usable and useful.
69+
8. Survey and document the failures of current standards for a specific dataset or domain before establishing a new one. Use resources such as FAIRsharing.org or the Digital Curation Centre https://www.dcc.ac.uk/guidance/standards.
70+
9. Funding agencies and science communities need to establish governance for standards creation and adoption (cite https://www.theopensourceway.org/the_open_source_way-guidebook-2.0.html#_project_and_community_governance).
71+
10. Cross-sector alliances, such as those between industry and academia, need closer coordination and alignment of pace through strong program management (for instance via OSPO efforts).
72+
11. Multi-company partnerships should include strategic initiatives for establishing standards (example https://www.pistoiaalliance.org/news/press-release-pistoia-alliance-launches-idmp-1-0/).
73+
12. Stakeholder organizations should invest in training grants to establish curricula for data and metadata standards education.
74+
>>>>>>> 8cb3f6b (More edits.)
6075
6176
