Statistical analyses of digital collections: using a large corpus of systematic reviews to study non-citations

Using statistical methods to analyse digital material makes it possible to detect patterns in big data that we would otherwise be unable to detect. This paper seeks to exemplify this by statistically analysing a large corpus of references in systematic reviews. The aim of the analysis is to study the phenomenon of non-citation: situations where just one (or some) document(s) are cited from a pool of otherwise equally citable documents. The study is based on more than 120,000 cited studies and more than 1.6 million non-cited studies. The number of cited studies is thus much smaller than the number of non-cited studies. The cited and non-cited studies are also found to differ in age. Very recent studies tend to be non-cited, whereas the cited studies are rarely of recent age (e.g. within the same year). The greatest differences are found within the first 10 years. After 10 years the cited and non-cited studies tend to be more similar in terms of age. Separating the data set into different sub-disciplines reveals that the sub-disciplines vary in terms of the age of cited vs. non-cited references. Some fields may be expanding, and the number of published studies is thus growing. Consequently, cited and non-cited studies tend to be younger. Other fields may be progressing more slowly and use a greater proportion of the older literature within the field. These field differences manifest themselves in the average age of references.


Introduction
The recent production of digital materials more than equals the total amount of information produced so far in the entire history of our species (e.g., Dienes 2012, Hilbert and López 2012, Kitchin 2014b). Clearly, this is a big challenge to the traditional core competences of information professionals (curation, preservation, organization, and seeking). Yet, it also promises great opportunities. Using statistical methods to analyse digital material makes it possible to detect patterns in big data that we would otherwise be unable to detect (Kitchin 2014a).
This paper seeks to exemplify this great potential by statistically analyzing a large corpus of references in systematic reviews. The aim of the analysis is to study the phenomenon of non-citation: situations where just a few documents are cited from a pool of otherwise equally citable documents. Previously, this phenomenon has only been studied by a process of close reading. For example, MacRoberts and MacRoberts (1986) analysed 15 randomly selected papers from a single discipline and determined the influences manifested in them. They found that in many cases influence was not captured in references and footnotes. Thus, by close reading, they determined a group of cited and non-cited studies. Yet, their method limited them to just 15 papers, and they also had to admit that their process was quite subjective. Some form of distant reading (Moretti 2013) is needed in order to increase the amount of data and to carry out a more objective evaluation process. Using systematic reviews, we believe we have found both a way to increase the amount of data to be analysed and a more objective process of evaluation. The research question is thus two-fold: How can we use distant reading to study the non-citation phenomenon, and to what extent are we able to confirm the results of previous studies of non-citations based on close reading?
The next section provides a review of previous studies of the non-citation phenomenon and related citation phenomena. We then outline the new method in detail, and show how it works in practice by applying it to a study of non-citations in the field of healthcare.

Review
According to the so-called normative citation theory (Nicolaisen 2007), failure to give credit where credit is due is unusual. Cole & Cole (1972, p. 370), for example, claim that "sometimes […] a crucial intellectual forebear to a paper is not cited. The omission is rarely due to direct malice on the part of the author but more often to oversight or lack of awareness […]. We can assume that omitted citations to less influential work are random in nature […]". Garfield (1977, 7) agrees, stating that "the vast majority of citations are accurate and the vast majority of papers do properly cite the earlier literature". Garfield, however, admits that this assertion had not been empirically substantiated: "Unfortunately, there has never been a definitive study of this assertion" (Ibid).
The basic assumption of the normative theory of citing was not tested before the 1980s. MacRoberts and MacRoberts wrote a number of articles during the 1980s and 1990s in which they argued that citation analysis is an invalid tool for research evaluation (MacRoberts 1997, MacRoberts and MacRoberts 1984, 1986, 1987a, 1987b, 1988, 1989a, 1989b, 1996). In these articles they challenge the basic assumption of the normative theory of citing: that scientists cite their influences. In their 1986 paper MacRoberts and MacRoberts report the results of a test of this assumption. They had read and analysed fifteen randomly selected papers in the history of genetics, a subject with which they claimed to be familiar, and had found that from zero (the paper had no references or footnotes) to 64 percent of the influence was captured in references and footnotes. After having reconstructed the bibliographies of the fifteen papers, they were able to estimate that the papers required some 719 references at a minimum to cover the influence evident in them, when in fact they contained only 216, a coverage of thirty percent for the entire sample. In their 1996 paper MacRoberts and MacRoberts claim that this percentage typifies all fields with which they are familiar (botany, zoology, ethology, sociology, and psychology) and conclude that "if one wants to know what influence has gone into a particular bit of research, there is only one way to proceed: head for the lab bench, stick close to the scientist as he works and interacts with colleagues, examine his lab notebooks, pay close attention to what he reads, and consider carefully his cultural milieu" (MacRoberts and MacRoberts 1996, 442).
Terrence A. Brooks published a pair of papers in the mid-1980s, which also challenge the basic assumption of the normative citation theory (Brooks 1985, 1986). These papers report the results of a survey covering 26 authors at the University of Iowa. Brooks had asked the authors to indicate their motivations for giving each reference in their recently published articles by rating seven motives for citing. One of the motives was labelled "persuasiveness". Brooks claims that the results of his survey show that authors cite for many reasons, giving credit being the least important one. Of the 900 references studied, Brooks found that about 70 percent were motivated by more than one reason, and concluded: "No longer can we naively assume that authors cite only noteworthy pieces in a positive manner. Authors are revealed to be advocates of their own points of view who utilize previous literature in a calculated attempt to self-justify" (Brooks 1985, 228). However, as pointed out by White (2004), the results of Brooks' survey need to be assessed with some caution. This is because respondents almost certainly misunderstood the motive reading "persuasiveness" to denote "citing to help build a case", and not "citing to utilize previous literature in a calculated attempt to self-justify".
Zuckerman (1987, 334) asked: if persuasion really were the major motivation to cite, would citation distributions look as they do? She answered "plainly not", referring to data from an article by Eugene Garfield. Garfield (1985, 406) presents a table illustrating the number of citations retrieved by items cited one or more times in the 1975-1979 cumulated Science Citation Index. The table reveals, among other things, that only 6.3 percent of the 10.6 million citations went to documents cited 10 or more times in the five-year period. Zuckerman (1987) points to the low percentage as evidence against the persuasion hypothesis. According to the persuasion hypothesis, the percentage should be much higher. Zuckerman (1987, 334) refers to Gilbert (1977, 113), one of the "inventors" of the persuasion hypothesis, who states that it is the papers seen as "important and correct" which "are selected because the author hopes that the referenced papers will be regarded as authoritative by the intended audience". However, as Zuckerman (1987) notes, if one adopts a modest criterion of authoritative papers being equal to those that have been cited at least ten times in five years (or twice annually), the persuasion hypothesis needs to be radically adjusted.
White (2004) realized that instead of testing the persuasion hypothesis as Zuckerman (1987) had done, by determining the percentage of citations received by authoritative papers, he could test the hypothesis by determining the percentage of citations received by authoritative authors. Initially he had to determine how to measure the reputation of cited authors. This was done by counting the number of citations the cited authors had received. He then drew a judgment sample that consisted of 28 citing authors from different disciplines (ten from information science, eight from science studies, six from various natural sciences and four from cultural studies in the humanities). Finally, he tabulated the references provided by the 28 citing authors in their publications. The method enabled him to determine the frequencies with which reputable and non-reputable authors appeared in the bibliographies under study. His findings do not support the persuasion hypothesis. Most of the 28 authors cited at all levels over the entire scale of reputation, and they did not exclusively favour high-end names with authoritative reputations. If anything, White's (2004) findings suggest that authors tend to favour low-end names slightly.
Moed and Garfield (2003, 192) asked "how does the relative frequency at which authors in a research field cite 'authoritative' documents in the reference lists in their papers vary with the number of references such papers contain"? They reasoned "if this proportion decreases as reference lists become shorter, it can be concluded that citing authoritative documents is less important than other types of citations, and is not a major motivation to cite" (Ibid). They went on to analyse the references cited in all source items denoted as 'normal articles' included in the 2001 edition of the Science Citation Index on CD-ROM. The source papers were arranged by research field, defined in terms of aggregates of journal categories. They limited their study to four fields: molecular biology & biochemistry, physics & astronomy, applied physics & chemistry, and engineering. The cited references were classified in two groups: those published in journals processed for the ISI citation indexes, and those published in non-ISI sources, including monographs, multi-authored books and proceedings volumes. In each research field the distribution of citations among cited items was compiled in each group separately, and the 90th percentile of that distribution was determined. Thus, the ten percent most frequently cited items published in ISI journals, and the ten percent most frequently cited documents in non-ISI sources were identified. These two sets were then combined. The combined set was assumed to represent the documents perceived in the year 2001 as 'authoritative' in a research field. Source articles were arranged in classes on the basis of the number of references they contained. The percentage of references to 'authoritative' documents was finally calculated per class. The findings of their analysis clearly show that authors in all four fields cite proportionally fewer 'authoritative' documents as their bibliographies become shorter. In other words, when the authors display selective referencing behaviour, references to 'authoritative' documents are found to drop radically. Moed and Garfield (2003, 195) therefore concluded "In this sense, persuasion is not the major motivation to cite".

Method
From the review above it is evident that some types of citation behaviour have been studied statistically by analyzing large corpora of references and citations. Yet, the specific citation behaviour known as non-citation has so far only been studied by close reading, which, as noted in the introduction, limits the generalisability of the obtained results for various reasons. We believe we have found both a way to increase the amount of data to be analysed and a more objective process of evaluation, which together make it possible to draw stronger conclusions regarding the non-citation phenomenon.
Using systematic reviews from the field of healthcare we are able to identify studies addressing the same research question. "Systematic reviews seek to collate all evidence that fits pre-specified eligibility criteria in order to address a specific research question" (Higgins and Green 2011). Importantly, systematic reviews aim to minimize bias by using explicit, systematic methods. Cochrane reviews are usually held to be among the best systematic reviews (Collier, Heilig, Schilling, Williams, and Dellavalle 2006, Moseley, Elkins, Herbert, Maher, and Sherrington 2009). Consequently, we decided to use Cochrane reviews for our analyses. Cochrane reviews are prepared by authors who register titles with one of the 53 Cochrane Review Groups. A Cochrane review contains a list of included studies. These studies address the same research question. The included studies of any given systematic review may therefore be examined to determine which of the preceding studies were cited in that specific study. To give an example: Three studies (A, B, C) are included in a given Cochrane review (X); consequently we know that they are addressing the same research question. We can then check whether the later of these studies cite the earlier ones.
We retrieved 5,843 systematic reviews from 53 Cochrane groups (withdrawn reviews were excluded). Some did not contain any included studies, leaving 5,042 systematic reviews. In those reviews we were able to match the included studies to 60,495 references in Web of Science. These approximately 60,000 studies can cite the previous studies addressing the same question but obviously not the ones published after. We look at studies that can cite similar previous studies, and this results in more than 1.5 million incidences of a given study citing or not citing a preceding study. This means that in our sample every study can on average potentially cite about 27 previous studies. We only included a pair of potentially citing and cited documents if the cited / non-cited document is from the same publication year or older. We included pairs of publications with the same publication year although we know that citing within the same year is rare. However, it does happen (due to preprints, early view etc.).
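The pairing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Study` and `Pair` records, the `candidate_pairs` function and the toy review with studies A, B and C are hypothetical names and invented values, not the data structures or data used in the study.

```python
from itertools import permutations
from typing import NamedTuple


class Study(NamedTuple):
    study_id: str     # hypothetical identifier of an included study
    year: int         # publication year
    cites: frozenset  # ids of included studies found in this study's reference list


class Pair(NamedTuple):
    citing_id: str
    candidate_id: str
    age_difference: int
    cited: bool


def candidate_pairs(included):
    """Pair each included study with every other study of the same review
    that was published in the same year or earlier."""
    pairs = []
    for citing, candidate in permutations(included, 2):
        if candidate.year <= citing.year:
            pairs.append(Pair(citing.study_id,
                              candidate.study_id,
                              citing.year - candidate.year,
                              candidate.study_id in citing.cites))
    return pairs


# Toy review X with three included studies (values invented for illustration).
review_x = [
    Study("A", 1998, frozenset()),
    Study("B", 2003, frozenset({"A"})),
    Study("C", 2003, frozenset({"B"})),
]
pairs = candidate_pairs(review_x)
for pair in pairs:
    print(pair)
```

In this toy review, B cites A and C cites B, while C does not cite A and B does not cite C, so the four candidate pairs split into two citations and two non-citations.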
The data collected consists of the following information:
• Publication year of the citing or potentially citing study
• Publication year of the cited or potentially cited study
• Whether or not the study is actually being cited
We analysed the data focusing on the distribution of cited and non-cited publications, both in general and between sub-disciplines. The publication year of the cited or potentially cited study was in each case subtracted from the publication year of the citing or potentially citing study, such that we only look at the age difference between the citing and cited study.
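As a minimal sketch of this step, the age difference can be computed per pair and tallied separately for cited and non-cited studies. The records below are invented, and the cut-off constant `MAX_AGE` is an assumption that simply mirrors the 0-25 year window used in the table and figures.

```python
from collections import Counter

# Each record: (citing year, cited or potentially cited year, actually cited?).
# All values are invented for illustration.
records = [
    (2005, 2005, False),
    (2005, 2004, False),
    (2005, 2001, True),
    (2010, 2001, True),
    (2010, 2009, False),
    (2010, 1980, False),
]

MAX_AGE = 25  # assumed display window, matching the 0-25 years shown in the paper

cited_by_age = Counter()
non_cited_by_age = Counter()
for citing_year, candidate_year, cited in records:
    age = citing_year - candidate_year  # age difference in years
    if age < 0 or age > MAX_AGE:
        continue  # outside the window (or not a valid pair)
    if cited:
        cited_by_age[age] += 1
    else:
        non_cited_by_age[age] += 1

for age in sorted(set(cited_by_age) | set(non_cited_by_age)):
    print(age, cited_by_age[age], non_cited_by_age[age])
```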

Results
Table 1 provides an overview of how many cited and non-cited studies we were able to match by the age difference between the cited and citing studies. Since the age difference distribution has a very long tail, we limit the table to pairs with age differences of 25 years or less.
The total number of cited studies is just above 120,000, whereas the number of non-cited studies is more than 1.6 million. Consequently, the cited studies are outnumbered by the non-cited by a factor of about 13, which is what we would expect knowing that a considerable part of all publications are never cited, even within the health sciences (see e.g. Ranasinghe et al. 2015, Weale, Bailey, and Lear 2004).
As we would expect, pairs with an age difference of zero exhibit very few cited studies and many more non-cited studies, since only rarely will a study be able to cite another study published in the same year.
Figure 1 provides an overview of the cited as well as the non-cited studies by age. We can see that the non-cited studies outnumber the cited. The non-cited studies are particularly numerous at the smallest age differences, indicating that publications are less likely to cite a relevant, similar study if that study was published recently. Again, publication lags may also play a role. By years 9 and 10 the cited and non-cited studies show similar shares, and thus it is within the first 10 years of publication that we find the greatest differences.
Figure 3 shows the cumulative distribution of the cited and non-cited studies by age difference. In the figure we can see that studies a year or less old make up about 25 per cent of the non-cited studies, whereas this share is only a little over 10 per cent for the cited studies. By year 4 both cited and non-cited studies have reached 50 per cent.
The distributions depicted in figures 1-3 may not necessarily be valid for all disciplines or sub-areas. To examine this we separate the data into the 53 Cochrane groups from which the data originate. Due to space limitations we cannot show figures for all groups, but will present some examples.
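The cumulative comparison of cited and non-cited studies described above can be sketched as follows. The counts are invented and chosen only to mimic the general shape described in the text; they are not the values behind Figure 3, and the helper names `cumulative_shares` and `first_age_reaching` are hypothetical.

```python
from itertools import accumulate


def cumulative_shares(counts):
    """Turn counts per age difference (list index = age in years)
    into cumulative shares of the total."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


def first_age_reaching(shares, threshold):
    """Return the smallest age difference whose cumulative share
    is at least the threshold."""
    return next(age for age, share in enumerate(shares) if share >= threshold)


# Invented counts for age differences 0..5 (illustration only).
cited_counts = [2, 8, 15, 15, 20, 40]
non_cited_counts = [10, 15, 10, 10, 15, 40]

cited_cdf = cumulative_shares(cited_counts)
non_cited_cdf = cumulative_shares(non_cited_counts)

print("share aged one year or less:", cited_cdf[1], non_cited_cdf[1])
print("age at which half is reached:",
      first_age_reaching(cited_cdf, 0.5),
      first_age_reaching(non_cited_cdf, 0.5))
```

Reading the two curves side by side in this way shows how two distributions can differ clearly at the youngest ages and still reach the 50 per cent mark at the same age difference.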
The first example is the Cochrane Dementia and Cognitive Improvement Group. The majority of the cited studies are found in the literature dating back 5 years. For the non-cited studies the tendency is even clearer, as the non-cited studies are plentiful in the literature from the last 3 years. This implies that studies within this area tend to be relatively young, which may be caused by an increase in production. The next example is the Methodology Review Group. Figure 5 indicates that studies within this area are slightly fewer during the first year, whereas the number of studies that are 2-4 years old is greater than in the case of the dementia group. Based on these figures, the methods group seems to represent a more slowly progressing field that uses a greater proportion of the older literature within the field, whereas the dementia group represents a faster-moving field.
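A group-wise comparison of this kind can be sketched by splitting the pairs by review group and summarising the age differences of cited and non-cited studies in each. The rows and the shortened group labels below are invented for illustration, and the mean is used here only as one simple summary; the figures in the paper show full distributions.

```python
from collections import defaultdict
from statistics import mean

# (group label, age difference in years, actually cited?)
# All rows are invented; the labels are shortened, hypothetical group names.
rows = [
    ("dementia", 0, False), ("dementia", 1, False),
    ("dementia", 2, True), ("dementia", 3, True),
    ("methods", 2, False), ("methods", 4, False),
    ("methods", 6, True), ("methods", 8, True),
]

ages = defaultdict(list)
for group, age, cited in rows:
    ages[(group, cited)].append(age)

# Mean age difference per group, separately for cited and non-cited studies.
mean_age = {key: mean(values) for key, values in ages.items()}
for (group, cited), value in sorted(mean_age.items()):
    print(group, "cited" if cited else "non-cited", value)
```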

Discussion and conclusion
This study exemplifies the great potential of digital collections for statistical analyses. By statistically analyzing a large corpus of references in systematic reviews we are able to draw at least three conclusions regarding the phenomenon of non-citation:
1. The number of non-citations far outweighs the number of citations.
2. Citations and non-citations differ in age.
3. There are great differences between fields.
These conclusions would be difficult or even impossible to reach by other methods (i.e., close reading).
The data for this study consists of 120,000+ cited studies and more than 1.6 million non-cited studies. It is thus obvious that the number of cited studies is much smaller than the number of non-cited studies. Apart from the difference in size, the pools of cited and non-cited studies differ in age. Very recent studies tend to be non-cited, whereas the cited studies are rarely very recent (e.g. within the same year). The greatest differences are found within the first 10 years. After 10 years the cited and non-cited studies tend to be more similar in terms of age. Also, we have separated the data set into different sub-disciplines, and we find that the sub-disciplines vary in terms of the age of cited and non-cited references. Some fields may be expanding, and the number of published studies is thus growing fast. Consequently, cited and non-cited studies tend to be younger. Other fields may be progressing more slowly and use a greater proportion of the older literature within the field. These field differences manifest themselves in the average age of references.
Our results confirm the results of previous studies of the non-citation phenomenon: Only a small fraction of the literature that could or should have been cited ends up being cited. Traditionally, sceptics of citation analysis have argued that this makes research evaluation based on citation analysis an invalid evaluation method (MacRoberts and MacRoberts 1996, Seglen 1997).
Conversely, proponents have defended citation analysis by arguing that as long as citation analyses are based on many reference lists, results are valid (e.g., Narin 1987, Nederhof and Van Raan 1987, Small 1987, White 2001). The sceptics have countered this by claiming that this would only be true if bias were distributed randomly, but biased citing, they claim, is not random (MacRoberts 1997). For instance, it is claimed that authors cite works by established authorities, so as to gain credibility by association (Gilbert 1977, Latour 1987). This hypothesis is known as the persuasion by name-dropping hypothesis (White 2004). As we saw in the review section, proponents of citation analysis have challenged the hypothesis by pointing to empirical studies of citation distributions that show little or no signs of such biased citing (e.g., Zuckerman 1987, Moed and Garfield 2003, White 2004). These studies reveal that authors do not tend to favour highly cited names or highly cited publications. Instead, authors generally cite the entire scale of citation reputation. However, these studies do not test the essence of the persuasion by name-dropping hypothesis. To test the hypothesis, we need to investigate whether authors that have a choice between citing equally citable authors or documents tend to choose the highly cited ones. As we have just shown, cohorts of research addressing the same research question may be identified using systematic reviews. The reviewed studies of any given systematic review may then be examined to determine which of the preceding studies were cited in later publications. On the basis of that analysis it should be possible to match pairs of cited and uncited studies of the same age and then to trace their number of citations at the time of citation/non-citation. In a forthcoming article (Frandsen and Nicolaisen, In Press), we present the results of a large-scale test of the persuasion by name-dropping hypothesis using this approach on a dataset similar to the one we study in the present paper. Our results seem to suggest a more careful interpretation than simply name-dropping.
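The matching idea outlined above could be sketched roughly as follows. This is not the procedure of the forthcoming article, whose details are not given here; it is a hypothetical illustration with invented records, in which cited and uncited candidates of the same age are paired per citing study and their prior citation counts are compared.

```python
from collections import defaultdict
from itertools import product

# (citing study, candidate study, age difference, actually cited?,
#  citations the candidate had received when the citing study appeared)
# All values are invented for illustration.
candidates = [
    ("S1", "A", 3, True, 40),
    ("S1", "B", 3, False, 12),
    ("S1", "C", 5, False, 90),
    ("S2", "D", 2, True, 5),
    ("S2", "E", 2, False, 30),
]

# Group prior citation counts by citing study and age, split into cited / uncited.
by_citing_and_age = defaultdict(lambda: {True: [], False: []})
for citing, _candidate, age, cited, prior_citations in candidates:
    by_citing_and_age[(citing, age)][cited].append(prior_citations)

# Compare every cited candidate with every uncited candidate of the same age.
outcomes = {"cited_higher": 0, "cited_lower": 0, "tie": 0}
for groups in by_citing_and_age.values():
    for cited_count, uncited_count in product(groups[True], groups[False]):
        if cited_count > uncited_count:
            outcomes["cited_higher"] += 1
        elif cited_count < uncited_count:
            outcomes["cited_lower"] += 1
        else:
            outcomes["tie"] += 1

print(outcomes)
```

Candidate C is left out of the comparison because no cited study of the same age exists for the same citing study. Under the name-dropping hypothesis one would expect the "cited_higher" outcome to dominate in real data.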

Figure 1. The number of cited and non-cited studies by age difference (only years 0-25 are shown)

Figure 2. Distribution of the percentages of cited and non-cited studies by age difference (only years 0-25 are shown)

Figure 3. Cumulative distribution functions for cited and non-cited studies by age difference (only years 0-25 are shown)

Figure 4. Distribution of the shares of cited and non-cited studies in the dementia group (only years 0-25 are shown)

Figure 5. Distribution of the shares of cited and non-cited studies in the methods group (only years 0-25 are shown)

Table 1. Overview of the age difference between cited and non-cited studies (only years 0 to 25 are shown)