An analysis of characteristics and structures embedded in data papers: a preliminary study

Research data or datasets can be regarded as a catalyst to inspire new research by repurposing or combining existing research data, and funding institutions have requested that grant applicants include a data management plan as part of research project proposals. In addition to the metadata approach, data papers may mirror the scientific publication model as an alternative means of describing and managing research data. However, there is no common standard for data papers across communities. This study aimed to build a common structural framework for investigating the characteristics and structures embedded in the content of data papers using a content analysis approach; 26 data journals from 16 publishers were selected as the subjects of this study. The study proposes a common framework and further embodies a concept map (Candela et al. 2015) in more concrete components for the structure of data papers.


Introduction
Traditionally, journal articles and books have been the primary conduit of scholarly communication. In the information age, these resources are still important, but research data or datasets are also emerging as another important source of scholarly output. Data have been regarded as a catalyst to inspire new research by repurposing or combining existing research data. Many international institutions have advocated making research data readily available, including the International Council for Science, the Global Biodiversity Information Facility, the UK Research Councils, the US National Science Foundation, and the US National Institutes of Health. These institutions have requested that grant applicants include a data management plan as part of research projects to manage data for future reuse. To enable the reuse and sharing of research data, data documentation is a required component of research data management (RDM) for data discovery and curation. In addition to the metadata approach, data papers may mirror the scientific publication model as an alternative means of describing and managing research data.

Literature review
In the process of knowledge inquiry, journal articles and books are regarded as a way of "standing on the shoulders of giants" to support exploration and research. In recent years, research has been enhanced by extensive use of data to test and examine hypotheses. Raw research data or datasets have become an essential part of data-driven research such as eResearch, eScience and digital humanities. However, sharing research data is not the norm for researchers in science. Thus, data are easily packaged and locked in dark archives. These data often lack adequate documentation for discovery and management, and are in danger of being lost (Chavan and Penev 2011, 2). Furthermore, although data documentation is time consuming and costly, the cost of recollecting or reproducing data is much higher than that of documenting it (Kratz and Strasser 2014, 4). Even more importantly, some data cannot be recollected or reproduced. Therefore, adequate documentation of research data is an essential part of RDM for future sharing and reuse (Atici et al. 2013, 670; Chavan and Penev 2011, 2; Costello 2009, 421; Kansa and Kansa 2013, 4; Kratz and Strasser 2014, 4; Niu and Hedstrom 2008, 4; Rees 2010).
Metadata is a fundamental component of digital libraries, and various standards or guidelines have been developed for different communities or data types; however, descriptive metadata may not be suitable for RDM, as "metadata may not be sufficient to enable them [researchers] to use the data" (Costello 2009, 4). The reason may be that metadata "may not provide sufficient documentation of the context in which data were collected" (Borgman et al. 2007, 275); for example, a "research methods" description is not included in most metadata standards or guidelines (Chao 2015). This means that most descriptive elements of metadata schemas are not suitable for the description and discovery of research data. Therefore, "data papers" that mirror the scientific publication model (Akers 2013) have been proposed as an alternative to metadata for the description and discovery of research data.
Data papers are data publications resembling traditional journal articles, only shorter (Candela et al. 2015, 1751; Gray 2015). Data papers can be published to make research data or datasets public (Breure 2014; Callaghan et al. 2012, 112; Chavan and Penev 2011, 3; Gray 2015). Data papers and research data are required to have a digital object identifier (Candela et al. 2015, 1754; Gray 2015). Researchers have proposed various categories for the structure of data papers, including general information such as authors, keywords and abstracts on the title page (Candela et al. 2015, 1754), and specific information such as data collection or production (Akers 2013; Breure 2014; Candela et al. 2015, 1754), data processing (Akers 2013; Rees 2010), data analysis or analytical methods (Akers 2013; Atici et al. 2013, 670), provenance (Candela et al. 2015, 1754; Rees 2010), context or coverage (e.g., time and place) (Atici et al. 2013, 670; Candela et al. 2015, 1754), background (Breure 2014; Candela et al. 2015, 1754), competing interests (Candela et al. 2015, 1754), license (Candela et al. 2015, 1754), attribution (Candela et al. 2015, 1754), reuse (Candela et al. 2015, 1754), etc. On the other hand, some researchers have proposed the 5Ws (i.e., what, where, why, how and who) as a documentation basis for the description of research data (Callaghan et al. 2012, 12; Kennedy, Ascoli, and De Schutter 2011, 318-319). Although some data journals have also defined structural categories in their templates or guidelines for data papers, there is no common standard for data papers across communities (Candela et al. 2015, 1753-4; Callaghan et al. 2014; Chavan and Penev 2011, 3; Smith 2009, 2).
With reference to the previous research outlined above, this study poses the following research questions: RQ: Is there a common framework for the structure of data papers to describe research data to facilitate discovery, sharing and reuse across various communities? In addition to the proposed framework and its categories, what components and their characteristics are embedded in existing data papers?

Methodology
In this study, we aimed to build a common structural framework for the content of data papers using a content analysis approach. Data papers are referred to by various terms, including database article, data paper, data note, data article, data descriptor, data in brief, data original article, database paper, dataset paper, and genome database (Candela et al. 2015). In addition to these variants of data papers, software papers were also included in this study to expand the examination of the characteristics and structures of data papers. The 73 data journals provided by Akers (2014) were extended to 94 journals through query results for "data journal" in Ulrichsweb. One third of the 94 data journals (i.e., 31) were randomly selected for this preliminary study. In order to include diverse disciplinary domains and their characteristics, the 31 data journals were reduced to 27. The Data Science Journal was then excluded because, to date, it has not published any data papers. As a result, 26 data journals from 16 publishers were selected as the subjects of this study, and their disciplinary domains covered science, social science, and humanities. The publishers, number of journals, and their disciplinary domains are shown in Table 1. Templates or guidelines offered by journals for data papers were downloaded or printed out for content analysis. Journal websites were also cross-checked to examine the characteristics and structures embedded in data papers. Based on this analysis, a common framework was generalized to examine the embedded characteristics and structures of data papers. Furthermore, a crosswalk between the proposed common framework and a concept map for data papers (Candela et al. 2015) was created to examine their similarities and differences for in-depth investigation of data papers.
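The random selection step described above can be illustrated with a minimal sketch. The text does not state how the draw was made, so simple random sampling without replacement is an assumption here, and the journal identifiers, the seed, and the function name `sample_one_third` are all hypothetical.

```python
import random


def sample_one_third(journal_ids, seed=0):
    """Draw one third of the journals (rounded down) without replacement.

    Assumption: simple random sampling; the seed is fixed only so that the
    illustration is repeatable.
    """
    k = len(journal_ids) // 3  # 94 // 3 == 31, matching the sample size in the text
    rng = random.Random(seed)
    return sorted(rng.sample(journal_ids, k))


# Hypothetical placeholder identifiers for the 94 candidate data journals.
candidates = [f"CAND{i:02d}" for i in range(1, 95)]
selected = sample_one_third(candidates)
print(len(selected))
```

The later reductions (from 31 to 27, and then to 26) were made by judgment rather than by chance, so they are not modelled in this sketch.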

Type of data journal: pure or hybrid data journals
Thirteen journals were pure data journals focused on data papers. Ten journals were hybrid journals, meaning that they publish traditional journal articles but also regularly include data papers. Three journals used special issues to publish data papers (Table 2).

A framework for data papers
The proposed framework for the structure of data papers can be generalized into three categories: a title page, a description of datasets for research data, and relationships among data papers, datasets, journal articles and data repositories. Each category is composed of individual components, each describing a different object. Within the framework, the proposed categories and their components and contents can be regarded not only as a two-level hierarchical metadata schema, but also as access points for discovery and as contextual background information for sharing and reusing research data. Detailed information is shown below:

• Title page
This category is composed of title, authors, author's affiliation, author's email address, abstract, keywords, identifiers, copyright, citation and date. These components are often regarded as basic descriptive metadata elements for data papers. The detailed analysis is as follows:
  ✓ One journal (JRN04) does not offer authors' affiliations.
  ✓ Four journals (JRN04, 07, 12 and 25) do not offer keywords.
  ✓ One journal (JRN04) assigns specific identifiers using the journal data platform, rather than a DOI or URL.
  ✓ Two journals (JRN04 and 16) do not have citation data for users.
  ✓ Four journals (JRN04, 08, 16 and 25) do not indicate the date of data papers. Most data papers offered four kinds of dates (received, revised, accepted and published online) to illustrate the publishing process.
  ✓ With the exception of JRN25, which adopts "all rights reserved" as copyright for data papers, the journals adopted Creative Commons licensing terms and conditions to release data papers openly for wide public use, in the same way as open access. CC-BY is the most popular license for data papers.
  ✓ HTML (24/26) is the most popular format provided by data journals, followed by PDF (23/26). XML was the third most popular format. Twelve of the 26 data journals offered two formats (HTML and PDF). Nine data journals offered three formats (HTML, PDF and XML) and one offered four formats (HTML, PDF, XML, and EPub). Three data journals offered HTML only, and one journal offered PDF only.

• Description of datasets
This category consists of collection, description, coverage, identifier, competing interest, ethics approval, consent for publication, funding statement, copyright, reuse, availability, author's contribution, authors' information, references, and acknowledgements. These components are used to describe research datasets with an adequate understanding of the context within which they were collected or processed to answer specific research question(s). The information embedded in these components is useful for reuse and interpretation in future studies and cannot be found in most descriptive metadata standards. The detailed analysis is as follows:
  ✓ Collection: this component focuses on how data is captured or created and reflects the significance of "research method" (Chao 2015). Most of the content of data papers provides information describing the methodology through which data are collected or produced to answer specific research problems in a certain context. Other important information is included within the description of methodology, such as background, ideas of the project, experimental design, factors, features and quality control.
  ✓ Description: this component is focused on the description of file information of data, including file format, versions, creation date and file creators. There are three approaches to this description of the dataset. The first is structured components similar to structured metadata elements (e.g., JRN18-20, 22-23 and 25), the second tends towards textual description (JRN01-02, 04-05, 07, 09, 11-14, 16-17, 21 and 24), and the last is a hybrid of the first two, with a structured category name accompanied by textual statements with basic information about the dataset (e.g., JRN10). Most of the file formats of datasets described by data papers are dependent on the requirements of the data repositories in which they are deposited. Therefore, there is no common agreement on the file format of datasets.
  ✓ Coverage: data papers related to the disciplinary domains of medicine, biology, and archaeology are inclined to provide temporal and spatial keywords (e.g., JRN15, 18-24). Furthermore, some data papers indicate the spatial coverage by tagging with longitude and latitude (e.g., JRN09-10). In addition to temporal and spatial coverage, this component is also used to indicate the taxonomy in terms of the biological classification of species (e.g., JRN14-15).
  ✓ Competing interest: the majority of data papers provide this component, which clarifies potential factors that might affect the results of the dataset (e.g., JRN09-14, 17, 19-22 and 25).
  ✓ Ethical approval and consent for publication: data papers in which the target subjects are related to individual privacy or human and animal rights are required to indicate whether researchers have received approval from subjects for the public release and use of data (e.g., JRN01-03, 09-11, 17-18, 20, 22-23, and 25).
  ✓ Funding statement: two approaches are used to indicate whether production of data was supported by a funding grant. One is described in this component, and the other is embedded in the acknowledgment component.
  ✓ Copyright: the majority of subjects in this study tend towards the adoption of open licensing terms and conditions such as CC, CC0 and PDDL.
  ✓ Author's contribution: an interesting point that deserves note is that some data papers offer information to clarify the contribution of each author to the data paper (e.g., JRN05, 11, 13, 17 and 23).

• Relationships: there are three types of relationships between data papers and their datasets (or data repositories): the relationship between versions of data papers (e.g., JRN11), between datasets and their derived journal article (e.g., JRN08-09, 12 and 15), and between datasets and the data repository in which they are deposited (e.g., JRN01-06, 09-15, 17-22 and 24-26). These types of relationships can be regarded as three components of this category.
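The two-level structure of the framework (categories, each holding a list of components) can be written down as a small data structure. The following is a minimal sketch, not a published schema: the component names are taken from the lists above, while the dictionary layout, the function name `missing_components`, and the sample record are illustrative assumptions.

```python
# Two-level hierarchy: category -> components, as listed in the text.
FRAMEWORK = {
    "title page": [
        "title", "authors", "author's affiliation", "author's email address",
        "abstract", "keywords", "identifiers", "copyright", "citation", "date",
    ],
    "description of datasets": [
        "collection", "description", "coverage", "identifier",
        "competing interest", "ethics approval", "consent for publication",
        "funding statement", "copyright", "reuse", "availability",
        "author's contribution", "authors' information", "references",
        "acknowledgements",
    ],
    "relationships": [
        "versions of data papers",
        "dataset and derived journal article",
        "dataset and data repository",
    ],
}


def missing_components(category, offered):
    """Return the framework components of `category` that a journal does not offer."""
    offered = set(offered)
    return [c for c in FRAMEWORK[category] if c not in offered]


# Illustrative record for JRN04, built from the exceptions reported above
# (no affiliation, keywords, citation data or date); the remaining components
# are assumed to be offered.
jrn04_title_page = [
    "title", "authors", "author's email address", "abstract",
    "identifiers", "copyright",
]
print(missing_components("title page", jrn04_title_page))
```

Keeping the components as ordered lists preserves the order used in the text, which makes the output of such a check easy to compare against the per-journal analysis.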

A crosswalk between the proposed common framework of this study and a concept map for data papers
A mapping between the common framework proposed in this study and a concept map (CP) (Candela et al. 2015) revealed that these two are fully interoperable (Table 3). We further illustrate the components of data papers in a more concrete manner. The Global Biodiversity Information Facility (GBIF) has transferred the content of the data paper into GBIF Integrated Publishing Tool (IPT) Metadata Profile elements with clear instructions (Penev et al. 2015).
Following the successful case demonstrated by GBIF, we mapped the components of the proposed common framework onto those of CP (Table 4). In addition to one-to-one, many-to-one and one-to-many mappings, some components embedded in data papers, such as ethical approval, consent for publication and relationships, are not provided or defined by CP (Candela et al. 2015).
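The mapping types named above can be made concrete with a short sketch that labels each framework component by the shape of its mapping. The contents of Tables 3 and 4 are not reproduced here, so the pairs below use hypothetical placeholder names (`fw_a`, `cp_x`, and so on); only the unmapped status of ethical approval comes from the text. The function name `classify_crosswalk` is also an assumption.

```python
from collections import defaultdict


def classify_crosswalk(pairs, framework_components):
    """Label each framework component by the cardinality of its mapping to CP.

    `pairs` is a list of (framework_component, cp_component) tuples.
    A component mapped to several CP components is labelled "one-to-many",
    even if one of those CP components is shared with another component.
    """
    to_cp = defaultdict(set)    # framework component -> CP components
    from_cp = defaultdict(set)  # CP component -> framework components
    for fw, cp in pairs:
        to_cp[fw].add(cp)
        from_cp[cp].add(fw)

    labels = {}
    for fw in framework_components:
        cps = to_cp.get(fw, set())
        if not cps:
            labels[fw] = "unmapped"
        elif len(cps) > 1:
            labels[fw] = "one-to-many"
        elif len(from_cp[next(iter(cps))]) > 1:
            labels[fw] = "many-to-one"
        else:
            labels[fw] = "one-to-one"
    return labels


# Hypothetical pairs; they do not reproduce the actual crosswalk tables.
pairs = [
    ("fw_a", "cp_x"),
    ("fw_b", "cp_y"), ("fw_c", "cp_y"),
    ("fw_d", "cp_z1"), ("fw_d", "cp_z2"),
]
components = ["fw_a", "fw_b", "fw_c", "fw_d", "ethical approval"]
for name, label in classify_crosswalk(pairs, components).items():
    print(name, "->", label)
```

Components that come back as "unmapped" are the ones that extend the concept map rather than refine it.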

Conclusion
In this study, 26 data journals from 16 publishers were selected to examine the characteristics and structures embedded in data papers. The CP of data papers (Candela et al. 2015) was embodied in more concrete and extended components, and new components for the structure of data papers were provided, although this preliminary study only partially reveals the phenomena associated with data papers. Furthermore, the proposed categories and their components and descriptions can be regarded as a useful facilitator for discovery, sharing and reuse of research data, as well as a useful basis for developing a set of structured descriptive metadata elements for research data. In the near future, we intend to include more target subjects to examine the feasibility of the proposed common framework and to investigate more implicit characteristics and structures embedded in data papers, such as core and optional characteristics and structures of the proposed categories, and the comparison with a metadata approach in RDM.

Table 1. Subjects of this study

Table 2. Type of data journal

Table 3. A crosswalk between components of CP and the proposed framework found in this study

Table 4. A crosswalk between CP category and subcategories in the proposed framework found in this study