By Donald T. Hawkins (Freelance Columnist and Conference Blogger)
Charleston In Between conferences are held approximately midway between the annual Charleston Library Conferences, which take place in November. They typically feature an in-depth look at a major event in the information industry. Organized by Heather Staines, Senior Consultant, Delta Think, and Gary Price, Editor, infoDOCKET, the third Charleston In Between conference took place virtually on April 4-5, 2023, and attracted over 400 attendees from 14 countries.
Peter Brantley, Director of Online Strategy at the University of California, Davis, Library, keynoted the conference with a presentation entitled “Aiding and Abetting: The Machines We Are Making”. He began by noting that current discussions center on the underlying technology of artificial intelligence (AI): large language models (LLMs), which are based on neural networks and can process massive amounts of information. LLMs are transformative and are increasingly able to process different types of information, such as text, images, and videos, although we do not actually know how they formulate their responses to queries, and we do not fully understand the relationships among training data, computation, and model parameters. “Chain of thought” reasoning appears to emerge only in models with over 100 billion parameters.
Many LLMs are “owned” by Big Tech companies that do not provide access to the underlying models or training data. Models may exhibit bias and harmful behavior, and their privacy impacts are unregulated. In a new AI world, there are concerns about originality, copyright, and the law. Is an LLM a legal “person”? LLMs are predictive, and their capabilities are not fully understood. Much like human language, they are very adaptable and increasingly ubiquitous, but the user experience needs consideration. We are seeing a race to integrate AI into every app on our desktops.
In higher education AI research, there is wariness toward commercialized AI, so truly open AI systems have been produced. Fine-tuning the output of LLMs provides structured data for parameter optimization. Much professional work is repetitive, methodical, and based on patterns or rules; likely targets for AI include programming, medical diagnosis, paralegal work, editorial functions, and librarian reference services. Brantley presented an AI publishing taxonomy that can help us channel our work and determine how it may change.
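Fine-tuning, as mentioned above, means supplying an LLM with structured example data so its parameters can be optimized for a specific repetitive, rule-based task. A minimal sketch of what such structured data might look like (the task, field names, and examples here are hypothetical, not from any speaker's system) is shown below; many fine-tuning pipelines expect one JSON object per line ("JSONL"):

```python
import json

# Hypothetical fine-tuning examples: each pair maps a repetitive,
# rule-based editorial task (here, reference formatting) to its
# desired output.
examples = [
    {"prompt": "Format as APA: smith j 2020 ai in libraries",
     "completion": "Smith, J. (2020). AI in libraries."},
    {"prompt": "Format as APA: doe a 2019 open access trends",
     "completion": "Doe, A. (2019). Open access trends."},
]

def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

jsonl = to_jsonl(examples)
print(jsonl.splitlines()[0])
```

The point of the structure is that each example pairs an input with the exact output the model should learn to produce, which is what makes parameter optimization possible.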
In the future, new intelligence will bring new insights. Work will move faster, and much of it will become less repetitive and routine. New jobs will emerge, but will we want them? Trying to understand AI systems may help us better understand the different natures of intelligence; direct comparisons of AI output with human output are doomed to fail. More specialized AI that is highly trained and focused is starting to emerge. It is important to engage people; we cannot hide this technology.
Stakeholder Perspectives Panel
This session, moderated by Heather Staines, featured five participants discussing their views on AI and scholarly communications.
Kyle Courtney, Copyright Advisor at the Harvard University Library, said that the law of AI is currently nebulous, but it has many copyright implications, which have resulted, for example, in ChatGPT being shut off in Italy. Input and output are separate issues that must be addressed. Are works generated by AI copyrightable? AI systems are trained to create works by using examples of various types of works, so creating copies might infringe copyright; it may therefore be necessary to get permission from the owner. We have seen that new technologies lower risk over time. AI can be used to enhance images, but what happens if those images are combined with author-written text?
Raymond Pun, a reference and instruction librarian at Alder Graduate School of Education and now a teacher of educators, noted that ChatGPT is impacting learning. Libraries are moving away from being repositories, adopting teaching and learning roles, and developing critical reading and research courses for graduate students. ChatGPT is accelerating the progress of those who would benefit from such courses by helping them to focus. Piracy and plagiarism are major issues; it is probably not possible to identify an AI-generated passage of text or the use of ChatGPT in published works.
Danielle Cooper, Director, Libraries, Scholarly Communication, and Museums at ITHAKA S+R, follows patterns in technology, evaluates use cases for university guidance, and studies how generative AI is being approached in higher education. Universities are in a nascent phase and are preparing people to explore and create a response. EDUCAUSE is doing a good job in this area.
Kyle Jensen, Director of English at Arizona State University, has been writing about AI for more than seven years. The scale and scope of LLMs are getting very complicated. ChatGPT came on the scene quickly and abruptly, but it is not the only large language system. What value will students get in the classroom, and what are we asking them to learn? How do AI and ChatGPT technologies help them make decisions for effective arguments? We are in an evolving landscape, and it is difficult to keep up with new technological developments and how we are going to engage with them.
Theresa Fucito, Director of Publications, AIP Publishing, noted that the American Institute of Physics (AIP) is using guidelines developed by the Committee on Publication Ethics (COPE) to formulate a publisher’s perspective on ChatGPT. AIP now requires authors to identify their specific contributions to an article and their funding sources, which can help them get tenure. This requirement also helps researchers ensure that they have all the information needed to have their paper published. Digital copy editing is not intended to replace input from humans, who check what needs to be added to an article. There is a significant opportunity to use AI in similarity checks to guard against plagiarism. AIP is also experimenting with the use of AI in peer review.
Gary Price moderated two panels describing AI-based tools.
Tim Vines, Founder of DataSeer, said that funders of research typically spend $200,000 to $500,000 on a research grant and want to know what the recipients did with the money. Most research outputs never become public, so much of the grant funding is wasted. Furthermore, a new US policy requires granting agencies to ensure that published articles include the underlying data. DataSeer’s services include a database of articles used to conduct compliance checks.
Josh Nicholson, Co-Founder and CEO of scite.ai, noted that research has allowed us to see amazing things in our lives. Although almost everything we can think of has been touched by research, there are growing concerns about reproducibility. How do we know what to trust? Scite was developed to help researchers discover and understand articles through a database of “smart citations”, which displays citations in context and notes whether the citing article provides supporting or contrasting evidence. We used to treat all citations equally, but there are many reasons to cite articles, so the main focus of scite is to find the reason for a citation. Challenges include getting access to the full text of articles, different text formats and reference styles, and the use of ChatGPT, which is often wrong even though it is used by researchers and students.
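The core idea behind smart citations is classifying the sentence in which a citation appears as supporting, contrasting, or merely mentioning the cited work. As a toy illustration only (scite's actual system is a trained model, not the cue-phrase lookup sketched here, and these cue lists are invented for the example), the task can be pictured like this:

```python
# Toy illustration (not scite's actual method): classify a citation's
# context sentence by looking for cue phrases.
SUPPORT_CUES = ("consistent with", "confirms", "in agreement with", "supports")
CONTRAST_CUES = ("in contrast to", "contradicts", "unlike", "fails to replicate")

def classify_citation(context: str) -> str:
    """Return 'supporting', 'contrasting', or 'mentioning' for one sentence."""
    text = context.lower()
    if any(cue in text for cue in SUPPORT_CUES):
        return "supporting"
    if any(cue in text for cue in CONTRAST_CUES):
        return "contrasting"
    return "mentioning"

print(classify_citation("Our results are consistent with Smith et al. (2020)."))
# -> supporting
```

Even this crude sketch shows why full-text access matters: the classification depends on the sentence around the citation, not on the reference list alone.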
Emma Warren-Jones is Co-Founder of Scholarcy, a tool that helps researchers understand large collections of the scholarly literature. Knowledge management challenges include:
- Screening articles for key information,
- Organizing and recalling them even after months or years,
- Expanding research with related articles, and
- Managing the cognitive load of reading and retaining information.
We can think of this process as breaking research into “flash cards” to screen and evaluate it. With Scholarcy, libraries of articles can be created from a variety of formats and imported into a database, from which summary outputs, such as headlines and reviews of articles within the context of other research, can be generated and then exported in various formats to Excel, Zotero, and similar systems.
David Harvey, Head of Research and Business Development at Prophy, said that finding a good peer reviewer can be a problem, and it will only get worse. Prophy opens the way to efficient peer review in an era of open access (OA). Interdisciplinary science is growing, which leads to long waits to find a reviewer, imposes a burden on authors and editors, and may introduce bias into reviews. Prophy makes an AI map of scientific concepts and creates a “digital fingerprint” of each article, allowing possible reviewers to be found in seconds and similar articles to be grouped together. Editors have control of the process and can specify the type of referee they want.
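The "digital fingerprint" idea can be pictured as representing each article by the scientific concepts it contains and then comparing fingerprints numerically. A minimal sketch, assuming a bag-of-words fingerprint and cosine similarity (Prophy's actual representation and matching algorithm are not described in the talk; all names and texts below are invented for illustration):

```python
import math
from collections import Counter

def fingerprint(text: str) -> Counter:
    """Hypothetical fingerprint: counts of concept terms in an article."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two fingerprints (1.0 = identical)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Match a submission against candidate reviewers' own work.
submission = fingerprint("neutrino oscillation detector physics")
candidates = {
    "Reviewer A": fingerprint("detector design for neutrino physics"),
    "Reviewer B": fingerprint("medieval poetry archives"),
}
best = max(candidates, key=lambda r: cosine(submission, candidates[r]))
print(best)  # -> Reviewer A
```

Because the comparison is a simple vector operation, ranking thousands of candidate reviewers takes well under a second, which is what makes "reviewers in seconds" plausible.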
Anita Schjoll Abidegaard is Founder and CEO of iris.ai, a startup founded in Norway about seven years ago that has developed a researcher workspace. With the rapid growth in publications, interdisciplinary research has become prominent. However, ChatGPT and other LLM systems fail to adhere to facts, which is a particular problem in the most fact-based discipline: science. The researcher workspace is expected to launch shortly. Users will be able to load any content and apply a range of tools to it: exploration, analysis, summarization, filtering, and extraction. Summaries from several papers can be combined into one, and their core concepts and topics can be visualized. Data can be extracted from text, tables, etc. automatically, which will permit interdisciplinary discovery.
AI Tools to Watch
Juan Castro, Founder and CEO of Writefull, said that his company has been providing custom AI in the world of publishing and education since 2016. AI has been integrated throughout the copy editing and publishing pipeline, with benefits to publishers, institutions, and researchers through automatic editing of manuscripts and quality control and assurance features.
AI helps authors write by generating titles and abstracts. Some institutions have been concerned that AI will make students lazy, so the service incorporates a plagiarism detector and a capability for institutions to enable or disable widgets.
Petr Knoth, Research Fellow at the Knowledge Media Institute, Open University, and Founder of the CORE system, which aggregates articles from OA repositories, said that the mission of CORE is to deliver seamless access for humans and machines to content collected from 11,000 providers of OA research articles worldwide. CORE offers fact checking and plagiarism detection, and CORE-GPT addresses the reduction of bias. Knoth also reflected on the limitations of ChatGPT and CORE-GPT.
Artur Nowak, Co-Founder of Evidence Prime (a collaboration between McMaster University in Ontario and a group of Polish IT professionals), discussed LASER.AI in a ChatGPT world, noting that it is a next-generation tool for systematic reviews in support of evidence-based medicine at McMaster. Does ChatGPT make us obsolete? LASER.AI provides several features that ChatGPT does not:
- Choice of the best models for a job and validation of them,
- Fine-tuning to a biomedical domain and knowledge of how its data should be structured,
- A user interface that allows interaction with AI business-specific workflows, and
- Security and compliance.
Eric Olson and Christian Salem, Co-Founders of Consensus, described the Consensus search engine, which uses AI to find and surface claims made in research articles. It provides intuitive searching, allows natural language questions against a database of 200 million articles from the Semantic Scholar corpus, and includes synthesis and summarization features. Consensus now has over 125,000 users from more than 1,000 universities. Access is currently free, but a premium paid product is under development and will be launched soon. Consensus has redefined search: it does not search for articles but for answers, delivering summaries from the most relevant articles along with “quality indicator tags” obtained by reading through each article and determining what type of study was done.
Dustin Smith, Co-Founder and President of Hum, a customer data platform for professional associations and scholarly publishers, said that much data goes to waste. Hum applies deep intelligence to publisher data to create an LLM database that helps publishers and societies manage their business. Its LLM approach provides AI for many functions and can represent people and behaviors in their environment: for example, people with specific interests, those who lose interest as less content is relevant to them, those whose interests change over time, and topics that attract new users or create significant engagement.
Kalev Leetaru, Senior Fellow at the George Washington University School of Engineering, described the Global Database of Events, Language, and Tone (GDELT) project, which captures global events in over 400 languages and derives data from them. It uses AI to understand the world around us and how people are interconnected. Leetaru presented two examples of GDELT data: fact checking of television news, and an archive of news from Belarus, Iran, Russia, and Ukraine that contains 236 million broadcasts comprising 1.1 billion words of spoken English, all of which is downloadable today.
At the closing session of the conference, the three sponsors presented information about their products.
In her presentation entitled “Analyze and Evaluate Real World Impact: Policy, Clinical Guidelines, and Point of Care Data”, Manisha Bolina, Sr. Sales Manager at The BMJ, described BMJ Impact Analytics. She said that it is the only tool in the world dedicated to health and social care: it uniquely shows links to patient outcomes, provides citations in context, links to impact factors, and offers a search of over 30,000 global organizations by region, country, government sources, DOIs, and ORCID identifiers. It was developed with guidance from medical departments at universities as well as funders of medical research. Users can view data globally at a glance and find top researchers in a field. BMJ Impact Analytics is also unique because search results show the exact page in the text where the answer was found, the funder, and the journal where the article was published. It is a standalone product; users do not have to subscribe to The BMJ to access it. However, it is sold to institutions only. Free trials are available.
Lorna Vasica, AIP Publishing Sales Manager—USA West, discussed “Building Tomorrow Together: Accelerating the Physical Sciences”. She noted that science never stands still: it evolves and improves. Discoveries are based on what has gone before, which is what drives scientific progress and also AIP Publishing. Science belongs to everyone and should be practiced, published, and available to anyone who seeks it.
She also presented AIP Publishing’s new look and branding.
Its mission is to accelerate science and disseminate new research results worldwide. Two new journals, APL Energy and APL Quantum, will be available soon. Every article has a “Scilight”, a highlight that focuses on the most crucial aspects of the article and the author’s research focus. Collections of e-books were launched in 2020, and 83 books were published during 2021 and 2022. With 4,000 subscribing institutions in 195 countries, AIP Publishing is expanding the scope of research globally, now and in the future, with global authorship. The pursuit of knowledge is a pursuit of a better world. AIP Publishing also publishes the journals American Journal of Physics and The Physics Teacher in partnership with their sponsoring society. In April 2023, AIP Publishing is migrating to the Silverchair platform, which will provide an enhanced user experience from a new domain.
John Dillon, Sr. Product Manager of ProQuest TDM Studio, said that the Studio was launched during the COVID pandemic. It uses data science to analyze dissertations and determine the top terms used by authors. The world’s volume of data is growing rapidly and is expected to reach 181 zettabytes (181 trillion gigabytes) by 2025, most of it unstructured information such as text. How do we support machine learning? Problems faced by researchers include gated publications, uncertain usage rights, long delivery times, and unstructured data. TDM Studio provides several solutions: immediate access to a library’s holdings, rights-cleared content, a breadth of providers, and data in consistent formats such as XML and CSV. The Studio contains a wealth of information that supports both teaching and research, including over 3 million dissertations and 180 million full-text newspaper articles going back to 1851. Data mining this content has large positive effects on the course of research; the Studio uses a visualization interface and sentiment analysis to create data sets that support TDM researchers at all levels. It works with the ProQuest content to which a university subscribes.
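The kind of analysis Dillon described, such as determining the top terms used by dissertation authors, reduces at its simplest to counting term frequencies across a corpus. A toy sketch (not TDM Studio's implementation; the sample texts and stop-word list are invented for illustration):

```python
import re
from collections import Counter

# Toy sketch (not TDM Studio's implementation): find the most frequent
# terms in a small corpus, skipping a few common stop words.
STOP = {"the", "of", "and", "a", "in", "to", "is", "for"}

def top_terms(texts, n=3):
    """Return the n most frequent non-stop-word terms across texts."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z]+", text.lower())
                  if w not in STOP]
    return [term for term, _ in Counter(words).most_common(n)]

abstracts = [
    "Machine learning methods for text mining of dissertations.",
    "Text mining and sentiment analysis in historical newspapers.",
]
print(top_terms(abstracts))  # first two terms: 'text', 'mining'
```

Real TDM work layers rights-cleared full text, consistent formats, and scale (millions of documents) on top of this basic counting idea, which is exactly the infrastructure the Studio is selling.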
Donald T. Hawkins is a conference blogger and information industry freelance writer. He blogs and writes about conferences for Information Today, Inc. (ITI) and The Charleston Information Group, LLC (publisher of Against The Grain). He maintains the Conference Calendar on the ITI website. He contributed a chapter to the book Special Libraries: A Survival Guide (ABC-Clio, 2013) and is the Editor of Personal Archiving: Preserving Our Digital Heritage (Information Today, 2013) and Co-Editor of Public Knowledge: Access and Benefits (Information Today, 2016). He holds a Ph.D. degree from the University of California, Berkeley and has worked in the online information industry for over 50 years.