By: Nancy K. Herther, writer, consultant and former librarian with the University of Minnesota Libraries
In a new 2022 Scientometrics assessment of the three core citation indexes – Web of Science, Dimensions and Scopus – German scientists concluded that “WoS largely represents the well-interconnected core citation network component on base research, while Scopus allows us to observe some transfer from the core to the applied research periphery. Dimensions with its laissez-faire indexation policy conveys, apart from the improving metadata quality, more coverage but a similar, although less decisive, message to Scopus.”
In an upcoming article in Quantitative Science Studies, Henry Small writes about the future of the field, noting that “scientometrics and quantitative studies of science have traditionally avoided epistemological issues such as the nature of scientific knowledge, how knowledge is discovered and confirmed, and the relationship of theory and evidence. This is despite the fact that the scientific papers we count, classify, and map are filled with arguments and descriptions dealing with theories and observations, and why we should believe one finding or theory rather than another. Clearly the field will need new tools, or adapt old ones, to enable us to delve into this deeper level of scientific content.”
Clearly the COVID crisis is providing researchers and information professionals alike with a critical challenge that cuts across disciplines and includes all types of information – from social media posts to preprints to formal publication and research reports. The consequences of the pandemic for individuals and communities are making the quickly evolving systems for communicating information, key data, theoretical approaches and potential solutions of critical interest. To better understand the challenges ahead, we speak with two of the key figures in the ongoing development of the field of scientometrics.
HENRY SMALL: ON CHANGING TRENDS IN SCIENTIFIC CITATION
Henry Small received a joint Ph.D. in Chemistry and the History of Science from the University of Wisconsin, beginning his long and distinguished career first as a historian of science at the American Institute of Physics’ Center for History and Philosophy of Physics. He joined Gene Garfield and the Institute for Scientific Information in 1972, becoming one of the earliest and most respected experts in the field of scientific citation. He co-pioneered scientometrics, co-creating the first global model of research. Dr. Small served as Director of Research and Chief Scientist at ISI (now Clarivate Analytics). At SciTech Strategies he continues to push the boundaries of the field of scientometrics for the betterment of science.
NKH: We all know that knowledge evolves as research changes over time. The role of citation began in the 1960s with Gene Garfield’s interest and your groundbreaking work to create indexing and study the role and nature of citation itself. Can you bring readers up-to-date on the growth and development of the Citation Databases and their role today?
HS: I don’t know the details of the growth of citation databases, so you should contact ISI and Scopus directly to learn about that. As best I can tell, they are both continuing to expand their coverage by adding new journals and new types of data. For example, ISI just added “citation context” data for a subset of citing and cited papers, which is a good thing. Scopus may have done something similar since they have direct access to the Elsevier citation context data. Adding citation contexts to citation databases is the obvious next evolutionary step in developing these databases. This gives users much more information on why citations are made in specific instances and opens up all kinds of possibilities for analysis (see my paper attached for example). The scite startup tries to aggregate this type of data too, and attempts to label “types” of citation, but in a very limited way.
NKH: Today citation numbers are accepted as a quality metric by users searching for “good” information – the assumption being that the higher the number of citations, the more important or “better” the article or research, and perhaps even a sign of higher quality. This continues to be an important surrogate in judging the quality of research or the researchers themselves. How is this changing today as we are now getting more detailed, granular information about research quality?
HS: I think people are getting away from just using citation numbers as a proxy for quality, although nowadays you can look up usage numbers in several places and see if they agree. Not too many hold the belief simply that usage equals quality. For example, method papers are widely known to receive more citations than research papers. There are plenty of examples of papers that should have been cited but were not maybe because they were “premature”, too far before their time. The era when I was at ISI was the time when citation context studies were difficult to do.
I spent a lot of time getting papers from the “stacks” and copying them to collect “contexts”. We also spent a lot of time trying to figure out what the best citation count normalization to use was. For example, Gene’s impact factor was simply an attempt to “normalize” the citation count for a journal. We devised simpler formulas for other entities like authors, organizations, countries, fields, etc. There are a lot of methods out there for trying to eliminate size effects of various kinds. Anyone in the indicators business had to use them because otherwise the largest entities would always rank first.
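Garfield’s impact factor, which Small mentions here as the archetypal normalization, can be illustrated with a toy calculation – a minimal sketch, assuming the standard two-year definition; the journal names and numbers below are invented for illustration only:

```python
def impact_factor(citations_to_prior_two_years: int, citable_items_prior_two_years: int) -> float:
    """Two-year impact factor: citations received this year to a journal's
    papers from the previous two years, divided by the number of citable
    items the journal published in those two years."""
    return citations_to_prior_two_years / citable_items_prior_two_years

# Hypothetical journal: 420 citations in 2021 to its 2019-2020 papers,
# of which there were 150 citable items.
print(round(impact_factor(420, 150), 2))  # 2.8
```

The point of dividing by the number of citable items is exactly the size normalization Small describes: without it, journals (or authors, or countries) that simply publish more would always rank first.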
NKH: COVID has made access to information – and use of the internet for research – all the more critical and pervasive. Preprints and non-peer-reviewed pieces appear in return lists when searching the internet. How has COVID changed the role of citation and precedent in terms of publications that you are seeing?
HS: I don’t think this started with COVID, it was already happening before. COVID perhaps accelerated the process. I still think peer-review is important, and you need to have papers vetted even if it’s an imperfect process. Open access journals often allow authors to post their papers prior to official publication. This is OK as long as the paper has gone through some kind of peer review. That doesn’t mean the paper is “correct,” it’s just that it passed some minimal standards for the particular journal.
NKH: The very presence of instant information over the internet brings up issues of trust and value as well. Research is coming from all types of governments, institutions, organizations, and in a wide variety of languages from individuals and institutions across the globe. What trends or changes are you seeing in “publication” itself? And how can assessment systems value/judge results in pre-publication, drafts or other non-traditional reporting? Is this a danger to the integrity of research itself?
HS: I don’t see any way to deal with this, and maybe it’s a good thing to have all these opinions “out there”. It just means people have to be careful not to believe everything they read. Just today the Supreme Court’s anti-abortion opinion was leaked. At least we know what they intend to do, and we can take a stand ahead of the official decision. Obviously, the information you don’t see is “proprietary,” what the originators don’t want you to know about, e.g., inventions, new ideas, political schemes, etc. You only see what the originators want you to see – propaganda, misinformation, etc.
NKH: In one recent article, the authors assessed citation finding that “citations considered as critical by these same authors are often comparisons between results rather than blunt attacks against the cited works….the lexical and grammatical markers characterizing comparison must be taken into account in addition to those expressing a negative evaluation and those expressing doubt.” Are citation tools today able to do this?
HS: I’ve done a lot of studies using lexical markers (otherwise known as “words”) to assess “sentiment” in a broad sense, like certainty/uncertainty, supporting evidence, disagreement, etc. scite only distinguishes two types: “support” and “contrast.” I’m sure the categories will expand in the future based on a variety of lexical markers, because researchers can now use powerful computational technologies like deep learning and older machine learning methods to do classification studies (in addition to just looking for specific sentiment words).
I personally don’t think the automatic categorizations are reliable enough yet to be generally useful. For example, the scite “supporting” category captures only about 6% of references based on citation contexts, and this probably undercounts the number of supporting citations. I recently looked at a Nobel Prize winning discovery, and one of the key papers had only 3% supporting citations based on scite. Now, this might be correct, but just because a citation context is not classified as “supporting” does not mean that the citing author would not support the correctness of the cited paper. Authors of scientific papers are expected to be objective and non-emotional in their prose, so trying to detect “affect” is very problematic. For example, when a scientist labels something as a “discovery,” that is about as strong an expression of approval as you can find in scientific prose. Papers can be rejected by peer review for being too “subjective”. Of course, the situation is reversed in “social media,” where only the outlandish is even noticed.
NKH: What trends are you seeing with publication itself? Are we seeing a larger, more democratic representation of research across the globe? How about issues such as the uploading of datasets for further review/use? How are more of what used to be called “ephemeral” types of reports being assessed today?
HS: We are seeing a deluge of new information on the internet. I have a friend who publishes his poetry on the internet, and creative musicians and composers use YouTube to “publish” their new work. In the past, their creative work would have remained unpublished or unrecorded because traditional publishers didn’t see any way of making money on it. So creative people can self-publish, and that’s a good thing. Of course, getting the word out is another story. Another source of data is open access academic papers that require authors to make their datasets available for downloading so other people can analyze them. Databases now generally allow downloading subsets for further analysis.
NKH: What is your take on all of the approaches/products out there today in the marketplace – SCI/SSCI, scite and potentially others? Is this adding weight to citation research, or is there some fragmentation or confusion being created with all of the potential ways research can be discussed/cited/used now?
HS: As I noted above, I like some of the new services like scite and some of the new features and enhancements of older databases that have appeared in Web of Science and Scopus. For example, I recently learned how to use WoS to find the papers co-citing two papers by “ANDing” their citing paper sets. The papers dealt with the discovery of “neutral currents” in particle physics. Back in 1973, when I was writing my paper on co-citation, it would have taken me days to figure that out working from the printed SCI. Now it takes just a couple of seconds. By the way, I haven’t mentioned the amazing advances that have been made in “science mapping” based on citation data in recent years. In my view these mapping methods complement the citation context analysis that is now possible.
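The “ANDing” of citing paper sets that Small describes is simply a set intersection: the papers that cite both A and B are the co-citing papers, and the size of that intersection is the co-citation count. A toy sketch, using invented paper identifiers:

```python
# Papers that cite paper A and paper B, respectively (hypothetical IDs).
citing_a = {"p1", "p2", "p3", "p5"}
citing_b = {"p2", "p4", "p5"}

# "ANDing" the two citing sets yields the papers that cite both A and B;
# its size is the co-citation strength of the pair (A, B).
co_citing = citing_a & citing_b

print(sorted(co_citing))   # ['p2', 'p5']
print(len(co_citing))      # co-citation count: 2
```

In a citation database this intersection is what a combined “cited reference = A AND cited reference = B” search computes; science-mapping methods then cluster papers whose pairwise co-citation counts are high.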
NKH: Greater globalization of research – and the huge number of publishing venues available now through Google Scholar and so many other venues – how do you see the field today – and what concerns do you have about the future of citation research?
HS: I don’t have any concerns. We have an active society pursuing these topics (ISSI – the International Society for Scientometrics and Informetrics) and we have a new open access journal (Quantitative Science Studies). Plus we have all the older, more established outlets like Scientometrics, Journal of Informetrics, JASIST, etc. And there are software packages in “R” and tools like Chen’s CiteSpace.
Research is becoming more global. China is now a major player, I think, because the Chinese government sees the importance of studies of the scientific literature. The US government never paid much attention to statistical studies of scientific literature and citation indexes in the 1970s and ‘80s.
Gene Garfield kept looking for support from NSF, NIH, etc. but didn’t have much success, probably because they didn’t see any relevance to their missions. I don’t know if the situation has changed now, but perhaps there is more recognition of the field. Perhaps some people still think that citations are only useful for coming up with a dubious measure of “quality”, and don’t see how access to citation contexts allows new ways of assessing scientific knowledge that go way beyond what was possible before.
CEO & CO-FOUNDER JOSH NICHOLSON ON SCITE TODAY
Josh Nicholson is co-founder and CEO of scite, a deep learning platform that evaluates the reliability of scientific claims through citation analysis, providing a deeper understanding of the content of scholarly mentions and discussions of research than was possible before. ATG featured Josh and his company in a series of articles in 2021. In the past year, the database has grown in size and functionality, making it worthwhile to talk with Josh about the progress his company is making.
NKH: Can you bring readers up-to-date on the growth and development of the scite database itself?
JN: scite is working to change the conversation from “how many times has this paper been cited” to “how has this paper been cited.” We want to transform citations from superficial counts that readers and/or administrators might glance at into rich sources of contextual information that help people understand the research better. We have spent years working to build relationships with leading publishers in order to extract citation statements from the full text of scientific articles. To date, we have extracted over 1 billion citation statements from 30 million full-text articles. Looking at this another way, we now have over 1 billion expert insights, critiques, analyses, and viewpoints on nearly every research topic.
NKH: How has COVID changed the role of citation and precedent in terms of publications that you are seeing?
JN: Anecdotally, we have seen citations accrue much quicker with preprints. In some cases, citations from preprint to preprint occur on the order of days or weeks. Over the last year, we have started to work with preprint servers to display Smart Citations from scite directly on the servers and are now live on arXiv, Research Square, and Authorea, helping readers to see how preprints have been cited by other preprints and papers. This information can help readers assess and understand how preprints have been received ahead of peer review and potentially be used by editors to solicit submission to their journals.
Critically, we have also seen preprints that have been shared widely in the news, on social media, and elsewhere – with very high Altmetric scores – receive subsequent contrasting citations. Showing how a paper or preprint has been received by other studies is important, and we think it is very important to show if work has been challenged. This is something we as a community need to address better, so that we are not amplifying click-bait findings whose later refutations go unseen when the work is shown to be flawed or weak.
NKH: As a database that uses text as a key to its indexing (if you will), how is scite dealing with text – especially in our global community of publication? Language itself is a major factor, I’d assume.
JN: Our contrasting citations can range from “We could not reproduce this work” to “We find a smaller effect.” Clearly, even within this same type of citation (Contrasting Citations), there are different levels. We think it is crucial to show citation context so that readers can read this information and understand it. Of course, there are challenges with our approach in that contrasting language can appear outside of the citation context we capture. However, overall we have found scite can help readers better understand any article or finding at scale. It’s not a replacement or silver bullet for everything; it is, though, a significant improvement over current citations.
NKH: Today we have “instant information.” What trends or changes are you seeing in “publication” itself? How does scite value/judge results in pre-publication, drafts or other non-traditional reporting?
JN: I think the challenges with information overload are more apparent than ever. Everyone is dealing with trying to understand guidance on COVID and we all have a hard time knowing what to trust or not. scite helps here by helping show the conversation happening amongst papers in an easily digestible way. By indexing preprints and peer-reviewed articles only, we rely upon data that is backed by evidence and analyses. Thus, scite can help you see what the literature says and it can do it in a way that doesn’t require a full literature review.
NKH: Are you seeing a larger, more democratic representation of research across the globe? Are more of what used to be called “ephemeral” types of reports being assessed with scite for validity/use?
JN: I can’t really speak to this as I have not done the analysis but in general, preprints are certainly increasing. Models of peer review around preprints are being increasingly explored by various initiatives and publishers. I think it is an exciting time and while not everything will work, it is worth trying various models of evaluation and review, including scite.
NKH: Tell us more about scite itself – the size and nature of the database as it grows. Today scite is partnering with “over 20 different publishers,” having analyzed and indexed “nearly 1B citation statements extracted from over 30M full-text articles.” That’s a massive accomplishment. What types of use/value is it giving scientists and researchers, as well as the other key communities (institutional evaluation, libraries or other user groups, etc.) that you are serving?
JN: We have been hyperfocused on making sure we have good coverage across all research areas. We have grown our database by hundreds of millions of citations by signing new indexing partnerships with publishers. We are excited to have recently crossed the 1 billion citation mark, and we are still rapidly growing as we work through back content from publishers and regularly add new articles as well. Beyond indexing, we also work with publishers to display our Smart Citations and are now live on 3 million-plus articles, including PNAS, Wiley, Royal Society, and others.
NKH: In the years since ATG first featured scite, you have further developed the Smart Citations and are actively working with a still growing group of publishing and research partners. Can you give us a sense of the size and extent of coverage in scite today? What’s next for the developing scite database?
JN: As mentioned, we have 24 indexing agreements signed and have extracted and analyzed over 1 billion citation statements from 30 million full-text articles. Additionally, we have over 1.5 billion traditional citations in our system, including all Elsevier references. Thus, scite is very comprehensive at this point, often exceeding the coverage of traditional indices. We will continue to work with publishers, indexing their content, and will continue to improve how our users discover and interact with this content.
I truly believe that research articles contain the world’s most important knowledge; we want to organize that knowledge so that anyone can better discover and understand research and make more informed decisions in their research, business, schoolwork, or personal lives.
NKH: What is scite telling us today about the pandemic, as well as the future of research publication?
JN: This is a broad question and I am not exactly sure how to address it. I think in general we need to continue to find and develop ways of improving how research is disseminated and how it is understood and evaluated. scite helps but there are more things we can do and we are excited to start to work on them more as scite matures.
FINDING NEW PERSPECTIVES TO MOVE RESEARCH FORWARD
Henry Small noted in an excellent overview of Gene Garfield’s evolving development of the concept of citation indexing that “Gene’s early conviction on how citation indexing would foster the hybridization and cross-fertilization of scientific fields was critical to the design and implementation of the Science Citation Index.” Garfield’s 1955 Science article explained his growing realization that “the farther away you get from the immediate subject area of the main article, the fewer the references to it you will locate. Yet these may well be the most useful references of all, for the cross fertilization of subject fields is one of the most important problems in scientific literature.” To him, it was at these intersections, this cross-fertilization, that new ideas and hybrid fields grow.
Perhaps British physicist Lawrence Bragg’s reflection is apt: “The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them.” I have no doubt that Gene would be most pleased with the directions that his work and the evolving nature of citation is moving – despite the chaos of the current crazy quilt of research reporting in this age of the internet.
The first part of this series looked at how Internet posting of information and research data was creating issues of reliability by challenging traditional scientific publication systems.