“Giant, free index to world’s research papers released online” is by Holly Else and appears in Nature.
Technologist Carl Malamud. Credit: Smita Sharma
“In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.
“The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California, that he founded.
“Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the reuse of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place.
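To make the structure described above concrete, here is a minimal Python sketch of an index of that general shape: every run of one to five consecutive words, listed next to the articles that contain it. It is only an illustration. The article does not describe how the General Index is actually built or stored, so the tokenization (lowercasing and splitting on whitespace), the function names `ngrams` and `build_index`, and the article IDs and sample text are all assumptions made for this example.

```python
from collections import defaultdict

MAX_WORDS = 5  # the article says snippets run up to five words


def ngrams(text, max_words=MAX_WORDS):
    """Yield every run of 1..max_words consecutive words in text."""
    words = text.lower().split()  # assumed tokenization, not the real one
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles):
    """Map each snippet to the set of article IDs that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for snippet in ngrams(text):
            index[snippet].add(article_id)
    return index


# Hypothetical article IDs and text, for illustration only.
articles = {
    "paper-1": "plants emit volatile organic compounds",
    "paper-2": "volatile organic compounds affect air quality",
}
index = build_index(articles)

# Look up which articles mention a phrase without reading their full text.
print(sorted(index["volatile organic compounds"]))
```

Because no entry is longer than five words, a table like this lets software find which papers mention a phrase without the table itself holding any article's full text.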
“Some researchers who have had early access to the index say it’s a major development in helping them to search the literature with software — a procedure known as text mining. Gitanjali Yadav, a computational biologist at the University of Cambridge, UK, who studies volatile organic compounds emitted by plants, says she aims to comb through Malamud’s index to produce analyses of the plant chemicals described in the world’s research papers. ‘There is no way for me — or anyone else — to experimentally analyse or measure the chemical fingerprint of each and every plant species on Earth. Much of the information we seek already exists, in published literature,’ she says. But researchers are restricted by lack of access to many papers, Yadav adds.
“Malamud’s General Index, as he calls it, aims to address the problems faced by researchers such as Yadav. Computer scientists already text mine papers to build databases of genes, drugs and chemicals found in the literature, and to explore papers’ content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are restricted to mining only open-access papers, or those articles they (or their institutions) have subscriptions to. Some publishers have said that researchers looking to mine the text of paywalled papers need their authorization.
“And although free search engines such as Google Scholar have — with publishers’ agreement — indexed the text of paywalled literature, they only allow users to search with certain types of text queries, and restrict automated searching. That doesn’t allow large-scale computerized analysis using more specialized searches, Malamud says…”