TOO LONG; DIDN’T READ

 

OPEN ACCESS AI RESEARCH
https://openscholar.allen.ai/paper
https://allenai.org/blog/openscholar
https://github.com/AkariAsai/OpenScholar?tab=readme-ov-file
https://huggingface.co/openscholar-v1-67376a89f6a80f448da411a6
https://venturebeat.com/ai/openscholar-the-open-source-a-i
OpenScholar: The open-source A.I. that’s outperforming GPT-4o in scientific research
by   /  November 20, 2024

“Scientists are drowning in data. With millions of research papers published every year, even the most dedicated experts struggle to stay updated on the latest findings in their fields. A new artificial intelligence system, called OpenScholar, is promising to rewrite the rules for how researchers access, evaluate, and synthesize scientific literature. Built by the Allen Institute for AI (Ai2) and the University of Washington, OpenScholar combines cutting-edge retrieval systems with a fine-tuned language model to deliver citation-backed, comprehensive answers to complex research questions. “Scientific progress depends on researchers’ ability to synthesize the growing body of literature,” the OpenScholar researchers wrote in their paper. But that ability is increasingly constrained by the sheer volume of information. OpenScholar, they argue, offers a path forward—one that not only helps researchers navigate the deluge of papers but also challenges the dominance of proprietary AI systems like OpenAI’s GPT-4o.

At OpenScholar’s core is a retrieval-augmented language model that taps into a datastore of more than 45 million open-access academic papers. When a researcher asks a question, OpenScholar doesn’t merely generate a response from pre-trained knowledge, as models like GPT-4o often do. Instead, it actively retrieves relevant papers, synthesizes their findings, and generates an answer grounded in those sources. This ability to stay “grounded” in real literature is a major differentiator. In tests using a new benchmark called ScholarQABench, designed specifically to evaluate AI systems on open-ended scientific questions, OpenScholar excelled. The system demonstrated superior performance on factuality and citation accuracy, even outperforming much larger proprietary models like GPT-4o.
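In outline, that retrieve-then-generate loop is simple to sketch. Below is a minimal, hedged Python illustration; the toy datastore, the word-overlap scorer, and the answer formatter are stand-ins of ours, not OpenScholar's released retriever or model:

```python
# Minimal sketch of retrieval-augmented answering over a paper datastore.
# Everything here (the toy corpus, the overlap scorer, the formatter) is an
# illustrative assumption, not OpenScholar's actual code.
from dataclasses import dataclass

@dataclass
class Passage:
    paper_id: str
    text: str

# Toy stand-in for the 45-million-paper datastore.
DATASTORE = [
    Passage("paper:1", "Retrieval-augmented generation grounds answers in retrieved text."),
    Passage("paper:2", "Fine-tuned 8B models can rival larger models on domain tasks."),
    Passage("paper:3", "Citation accuracy improves when each claim is tied to a source."),
]

def retrieve(query: str, k: int = 2) -> list[Passage]:
    """Rank passages by naive word overlap with the query, a stand-in for
    the dense retriever and reranker the article describes."""
    q = set(query.lower().split())
    ranked = sorted(DATASTORE, key=lambda p: -len(q & set(p.text.lower().split())))
    return ranked[:k]

def answer(query: str) -> str:
    """Generate a citation-backed answer: every line is tied to the passage
    it came from, rather than to pre-trained knowledge alone."""
    hits = retrieve(query)
    return "\n".join(f"{p.text} [{p.paper_id}]" for p in hits)

print(answer("how does retrieval-augmented generation improve citation accuracy"))
```

The point of the design is visible even at this toy scale: the model never answers from memory alone, so every claim arrives with a paper ID a reader can check.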

One particularly damning finding involved GPT-4o’s tendency to generate fabricated citations—hallucinations, in AI parlance. When tasked with answering biomedical research questions, GPT-4o cited nonexistent papers in more than 90% of cases. OpenScholar, by contrast, remained firmly anchored in verifiable sources. The grounding in real, retrieved papers is fundamental. The system uses what the researchers describe as their “self-feedback inference loop” and “iteratively refines its outputs through natural language feedback, which improves quality and adaptively incorporates supplementary information.” The implications for researchers, policy-makers, and business leaders are significant. OpenScholar could become an essential tool for accelerating scientific discovery, enabling experts to synthesize knowledge faster and with greater confidence.
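The quoted self-feedback loop reduces to a small piece of control flow. Here is a hedged sketch of that loop as described; the generate, critique, and refine helpers are hypothetical stubs standing in for language-model calls:

```python
# Hedged sketch of the "self-feedback inference loop": draft an answer,
# ask for natural-language feedback, retrieve any supplementary passages
# the feedback requests, refine, and repeat. The helpers are hypothetical
# stand-ins for LM calls, not the released implementation.

def generate_draft(question, passages):
    return f"Draft answer to {question!r} citing {len(passages)} passages."

def critique(draft):
    # A real system would prompt the LM for feedback such as
    # ["Coverage: missing results on X; retrieve more."]; returning an
    # empty list here ends the loop after one round.
    return []

def refine(draft, feedback, passages):
    return draft + f" (revised per {feedback}, now citing {len(passages)} passages)"

def self_feedback_answer(question, retrieve, max_rounds=3):
    passages = retrieve(question)
    draft = generate_draft(question, passages)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if not feedback:              # no remaining issues: stop iterating
            return draft
        for note in feedback:         # adaptively pull in supplementary info
            passages += retrieve(note)
        draft = refine(draft, feedback, passages)
    return draft

print(self_feedback_answer("what limits citation accuracy?", lambda q: ["p1", "p2"]))
```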


“How OpenScholar works: The system begins by searching 45 million research papers (left), uses AI to retrieve and rank relevant passages, generates an initial response, and then refines it through an iterative feedback loop before verifying citations. This process allows OpenScholar to provide accurate, citation-backed answers to complex scientific questions. | Source: Allen Institute for AI and University of Washington”
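The final step in that caption, citation verification, is the guard against the fabricated-reference failure described above. A few illustrative lines show the idea; the bracketed-citation format and IDs are assumptions of this sketch, not the project's actual checker:

```python
# Illustrative citation check: every bracketed key in the answer must match
# a passage that was actually retrieved; anything else gets flagged as
# potentially fabricated.
import re

def unverified_citations(answer_text: str, retrieved_ids: set[str]) -> list[str]:
    cited = re.findall(r"\[([^\]]+)\]", answer_text)
    return [c for c in cited if c not in retrieved_ids]

print(unverified_citations("RAG helps. [paper:1] See also [paper:99].",
                           {"paper:1", "paper:2"}))  # -> ['paper:99']
```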

OpenScholar’s debut comes at a time when the AI ecosystem faces a growing tension between closed, proprietary systems and the rise of open-source alternatives like Meta’s Llama. Models like OpenAI’s GPT-4o and Anthropic’s Claude offer impressive capabilities, but they are expensive, opaque, and inaccessible to many researchers. OpenScholar flips this model on its head by being fully open-source. The OpenScholar team has released not only the code for the language model but also the entire retrieval pipeline, a specialized 8-billion-parameter model fine-tuned for scientific tasks, and a datastore of scientific papers. “To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints,” the researchers wrote in their blog post announcing the system.

 

This openness is not just a philosophical stance; it’s also a practical advantage. OpenScholar’s smaller size and streamlined architecture make it far more cost-efficient than proprietary systems. For example, the researchers estimate that OpenScholar-8B is 100 times cheaper to operate than PaperQA2, a concurrent system built on GPT-4o. This cost-efficiency could democratize access to powerful AI tools for smaller institutions, underfunded labs, and researchers in developing countries. Still, OpenScholar is not without limitations. Its datastore is restricted to open-access papers, leaving out paywalled research that dominates some fields. This constraint, while legally necessary, means the system might miss critical findings in areas like medicine or engineering. The researchers acknowledge this gap and hope future iterations can responsibly incorporate closed-access content.


“How OpenScholar performs: Expert evaluations show OpenScholar (OS-GPT4o and OS-8B) competing favorably with both human experts and GPT-4o across four key metrics: organization, coverage, relevance and usefulness. Notably, both OpenScholar versions were rated as more “useful” than human-written responses. | Source: Allen Institute for AI and University of Washington”

The OpenScholar project raises important questions about the role of AI in science. While the system’s ability to synthesize literature is impressive, it is not infallible. In expert evaluations, OpenScholar’s answers were preferred over human-written responses 70% of the time, but the remaining 30% highlighted areas where the model fell short—such as failing to cite foundational papers or selecting less representative studies. These limitations underscore a broader truth: AI tools like OpenScholar are meant to augment, not replace, human expertise. The system is designed to assist researchers by handling the time-consuming task of literature synthesis, allowing them to focus on interpretation and advancing knowledge.

 

Critics may point out that OpenScholar’s reliance on open-access papers limits its immediate utility in high-stakes fields like pharmaceuticals, where much of the research is locked behind paywalls. Others argue that the system’s performance, while strong, still depends heavily on the quality of the retrieved data. If the retrieval step fails, the entire pipeline risks producing suboptimal results. But even with its limitations, OpenScholar represents a watershed moment in scientific computing. While earlier AI models impressed with their ability to engage in conversation, OpenScholar demonstrates something more fundamental: the capacity to process, understand, and synthesize scientific literature with near-human accuracy.

 

The numbers tell a compelling story. OpenScholar’s 8-billion-parameter model outperforms GPT-4o while being orders of magnitude smaller. It matches human experts in citation accuracy where other AIs fail 90% of the time. And perhaps most tellingly, experts prefer its answers to those written by their peers. These achievements suggest we’re entering a new era of AI-assisted research, where the bottleneck in scientific progress may no longer be our ability to process existing knowledge, but rather our capacity to ask the right questions. The researchers have released everything—code, models, data, and tools—betting that openness will accelerate progress more than keeping their breakthroughs behind closed doors. In doing so, they’ve answered one of the most pressing questions in AI development: Can open-source solutions compete with Big Tech’s black boxes? The answer, it seems, is hiding in plain sight among 45 million papers.”

BIG DATA SYNTHESIS
https://storm.genie.stanford.edu/
https://arxiv.org/abs/2402.14207
https://github.com/stanford-oval/storm
https://storm-project.stanford.edu/research/storm/
https://blog.acer.com/en/discussion/2218/storm-by-stanford-university-the-ai-model-for-academic-and-research-purposes
STORM by Stanford: The AI Model for Academic and Research Purposes
by Edmund_McGowan  /  November 13

“Artificial intelligence is a swiftly evolving beast. From novel chatbots that come with the dust and are gone with the wind to behemoths like ChatGPT, AI is on the march. STORM by Stanford University is an innovative AI-powered research tool currently making waves in the global academic community and beyond. Since early 2024, this open-source research project has helped many academics, students, and content creators craft articles from scratch. “Articles from scratch?” We hear you ask. Yes, in a nutshell, it can be used to create Wikipedia-style papers, complete with citations in a matter of minutes. Whether you’re interested in AI for schoolwork, or even AI for grad school level writing, STORM can help you on your path to a PhD. Get set, because we’re headed for the eye of the storm to discover the origins of STORM, and the humans behind it. We’ll also go on to discuss its performance and steer you in the direction of the STORM website so you can try it out for yourself.

Short for “Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking”, STORM is a Stanford AI research project that can create Wikipedia-style entries faster than you can make a cup of coffee. Let’s be clear: STORM is not your average B- chatbot; it is an A+ gifted-class knowledge creator and research assistant that’s ready to back up its statements and provide citations galore. While AI is often a faceless, authorless corporate beast, the team behind STORM are actual Stanford students and faculty. STORM was created by human members of Stanford’s OVAL team, namely: Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam.

LLMs (large language models) may be useful for a layman’s general research, but for academics and content creators, they tend to fall short in several areas. Accuracy is king in academia, and LLMs have well-publicized limitations in veracity, specificity, and understanding of complex academic topics. What’s more, LLMs are renowned for producing confident yet incorrect answers that lack citations. The final nail in the coffin for academic use of LLMs is plagiarism: rapid generation of text comes with the risk that the LLM is simply replicating existing academic sources. While many LLM-based tools create content via plain retrieval-augmented generation (RAG), STORM takes content creation several steps further to craft accurate, organized answers. Now let’s find out more about the multi-agent conversations behind every STORM search. At the time of writing, STORM is powered by Bing Search and Azure OpenAI GPT-4o-mini. This recent upgrade to the latest technologies enables STORM to break down the barrier between the excess of accessible information out there and what an individual is able to assimilate. The “knowledge curation agent” explored in STORM (remember, it is still a research project) aims to provide a solid foundation for knowledge discovery, making in-depth learning possible without the stress of laborious research.

Where many LLMs are a letdown, STORM is a success. This is in no small part thanks to STORM’s multi-perspective question asking. Multiple AI agents cooperate in an agentic system, with individual agents handling content retrieval, multi-perspective question asking, and finally, synthesis of content. Similar in many ways to how a human team would collaborate to research and write an ambitious project, STORM approaches complex tasks from multiple angles to create comprehensive written content that can give human-created articles a run for their money. STORM offers users a choice between fully autonomous STORM and Co-STORM (human-AI collaboration), as well as a choice of search engines. After you input your topic, the platform generally takes a minute or two to generate your article.
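In rough outline, that division of labor can be sketched in a few lines of Python. The personas and helper functions below are illustrative assumptions of ours, not STORM's actual pipeline, which drives LM-simulated expert conversations and live web search:

```python
# Compressed sketch of STORM-style multi-perspective research: several
# "editor" personas each interrogate the topic, their questions drive
# retrieval, and the notes are merged into a wiki-style outline. The
# personas and helpers are illustrative assumptions, not STORM's code.

PERSPECTIVES = ["historian", "practitioner", "skeptic"]

def ask_questions(perspective: str, topic: str) -> list[str]:
    # A real agent would hold a multi-turn conversation with a topic expert.
    return [f"As a {perspective}, what should readers know about {topic}?"]

def search(question: str) -> list[str]:
    # Stand-in for the Bing Search calls the article mentions.
    return [f"(snippet answering: {question})"]

def write_outline(topic: str, notes: list[str]) -> str:
    sections = [f"Section {i + 1}: {note}" for i, note in enumerate(notes)]
    return f"Topic: {topic}\n" + "\n".join(sections)

def storm(topic: str) -> str:
    notes = []
    for perspective in PERSPECTIVES:           # multi-perspective questioning
        for question in ask_questions(perspective, topic):
            notes.extend(search(question))     # grounded retrieval per question
    return write_outline(topic, notes)         # synthesis into a draft article

print(storm("open-source research assistants"))
```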

Once an article is completed, a “See BrainSTORMing Process” option appears above the summary of your article. This neat feature allows users to see the AI agents (editors) and the steps they have taken to contribute to the final article. If you do try STORM, do the good folks at Stanford a favor and provide feedback using the handy feedback box on the web demo. This information, as well as your purpose for writing the article, will be securely stored and not combined with your Google account info. If you’re looking for an AI tool to assist your academic writing, or just AI for school in general, then STORM is certainly worth a try. Here are a few different user groups that may find STORM more useful than regular old LLMs.

  • Academics and researchers can both benefit from using STORM, as it can create structured outlines on complex academic topics that can be used as educational resources. The verification and citation features of STORM are particularly attractive for this cohort.
  • Students today may lack the time to conduct their own research. With STORM, students of all levels can quickly get well-organized notes and summaries in easy-to-understand Wikipedia-style articles, a form they are likely already familiar with.
  • Content creators with deadlines to meet or day jobs to attend to can rapidly research and organize data on STORM. Verified, fact-based outlines that offer multiple perspectives can be quickly crafted and updated by users as topics evolve.

As with all AI platforms, STORM is not without its limitations. If you’ve read this far, chances are you’re not plotting to misuse STORM to graduate from school or college. But just in case you were wondering, STORM is not (yet) an AI writing tool that can knock out a 10,000 word college-level dissertation for you. Try out STORM and you will quickly discover that the “research preview” excels in generating Wikipedia-style articles.

Similar to Wikipedia, STORM is very good at providing a comprehensive outline of a topic. The Wikipedia-esque sections are useful as foundations to build out from, but may lack specific or detailed information that some users require. This is presumably an aspect of the platform that will be improved; time will tell. Another issue that may deter or, indeed, attract some users is STORM’s limited safety measures. The potential to generate offensive content is certainly present on STORM, and on behalf of the Stanford Open Virtual Assistant Lab team, we remind you to follow STORM’s guidelines. As with other AI content generators, mistakes are still likely, so double-check your info before going to print!”

HALLUCINATION RISK
https://perplexity.ai/introducing-the-election-information-hub
https://techcrunch.com/the-other-election-night-winner-perplexity
https://zdnet.com/why-i-prefer-perplexity-over-every-other-ai-chatbot
https://science.org/is-your-ai-hallucinating-when-chatbots-make-things-up
https://forbes.com/perplexitys-election-hub-triggers-reactions-from-ai-experts
AI Experts Test Perplexity’s New Election Hub
by Tor Constantino  /  Nov 4, 2024

“While most Americans are focusing on who will be the next president of the United States following Tuesday’s general election in the U.S., there are hundreds of other important political races to be decided as well. There are 33 U.S. Senate seats and another 435 contests in the House of Representatives up for grabs, not to mention the hundreds of other races at the various state and county levels of government that need to be settled. To keep track of the national contests, Perplexity announced on Friday the launch of the first publicly available AI-based election tracker. Beginning on Tuesday, the election information hub will be powered by data from the Associated Press as well as Democracy Works.

Several AI experts praised the tracker, but they also expressed concerns about inconsistencies in framing and tonality, as well as hallucinations and inaccurate summaries. An article on Sunday in The Verge reported errors in the Perplexity election information hub that it said the company later updated. It’s also worth noting that Perplexity itself has in recent months been embroiled in litigation and threats of legal action from several companies—including Forbes—over its AI’s unauthorized use of content, fabricated facts, and false article abstracts. Despite those signals, users who still might want to use the AI election hub can visit the Perplexity link and enter their ZIP code. All the races and ballot measures that apply to that ZIP code will populate the screen, which the user can explore — the interface is pictured below.


“Screenshot of Perplexity query box for its 2024 election information hub”

“We want to make it as simple as possible to receive trusted, easy-to-understand information to inform your voting decisions. For each response, you can view the sources that informed an answer, allowing you to dive deeper and verify referenced materials. Whether you’re seeking to understand complex ballot measures, verify candidate positions, or simply find your polling place,” the company statement reads. In an email exchange seeking answers about whether Perplexity was paying AP and Democracy Works, about the company’s rationale and timing for the project, and about how Perplexity’s large language model might mitigate the hallucination and fake-news allegations that multiple media outlets made earlier this year, the company sent a single blanket statement.

“We want to make it easier for people to make informed choices on all ballot items, including elected offices and ballot measures. We waited to release this to the public until we could conduct the appropriate testing,” wrote a company spokesperson. “To clarify, Perplexity uses LLMs for summarizing content, but is designed to optimize for accuracy. We use a process called Retrieval-Augmented Generation to identify relevant information and summarize it in a way that’s tailored to a user’s query. Answers are not utilizing stored knowledge from a model’s training data, which makes us different from other AI chatbots. It’s also why we chose to work with organizations like the AP and Democracy Works to provide us with up-to-date information on ballot items and election results,” the spokesperson’s message concluded.
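What the spokesperson describes, summarizing only supplied, vetted material rather than training-data knowledge, ultimately comes down to how the prompt is assembled. A minimal sketch under that assumption; the wording and snippet data are illustrative, since Perplexity has not published its prompts:

```python
# Minimal sketch of source-constrained summarization: the model is told to
# answer only from the supplied, vetted snippets (e.g. AP / Democracy Works
# feeds) instead of its training-data knowledge. Wording and data are
# illustrative assumptions, not Perplexity's actual prompts.

def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    sources = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using ONLY the numbered sources below. Cite a source number "
        "for every claim. If the sources do not contain the answer, say so "
        "rather than guessing.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

print(build_grounded_prompt(
    "When do polls close in my county?",
    ["Polls close at 8 p.m. local time, per the county election office."],
))
```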

In general, the AI influencers and experts I spoke with were impressed with the concept behind the election-specific info hub. Kirk Borne, Ph.D., is an internationally recognized thought leader and speaker within the AI and data space, as well as founder of the Data Leadership Group. “I believe that this platform will be very helpful for many people: with ‘less talk, more data’ and ‘fewer opinions, more analysis,’” Dr. Borne wrote via text. “The specificity, utility, currency, and accuracy—four key dimensions of all large language models and Generative AI—are 100% dependent on their data sources. Very broad LLMs that try to answer all possible end-user queries, utilizing massive datasets can be impossible to tame in all 4 key dimensions. Perplexity’s focus on a hyper-targeted and tightly constrained use case, with a limited spectrum of end-user queries, thus makes sense and is consistent with having such formalized data sources,” he explained.

Ahmed Banafa, Ph.D., is a technology expert and engineering professor at San Jose State University. He thinks there could be benefits from the hub as well. “Using AI to provide real-time updates on voting requirements, polling locations and candidate details—the platform aims to increase voter engagement and support informed choices. I checked it and I found the list of candidates in my area, it was accurate with good information about each candidate and each measure on the ballot. This approach reflects the growing trend of using AI to simplify access to critical election information, with a user-friendly design that makes it easy for voters to find what they need quickly,” Dr. Banafa wrote via email.

He also applauded Perplexity’s partnering with AP and Democracy Works for this initiative, which is intended to give the AI model authoritative, up-to-date election data. He noted that using such trusted data sources lends credibility to the provided information and reduces reliance on less dependable outlets or data suppliers. “These collaborations are crucial for providing accurate and trustworthy content, especially during elections, when misinformation can have serious impacts. This will save the service the steps of verifications of the information as the AP/DW already did that work. These partnerships uphold high standards of election integrity, helping ensure users receive only thoroughly vetted, current information,” added Dr. Banafa. Conor Grennan is chief AI Architect at NYU Stern School of Business as well as CEO and founder of the consultancy AI Mindset. In an email exchange, he wrote that he ran extensive queries to test the model and gained some interesting insights.

“Perplexity’s election hub addresses a crucial need by centralizing essential election information, from candidate profiles to voting logistics. While their factual information on voting procedures is reliable and serves a valuable public function, the platform faces challenges in presenting candidate information equitably,” he wrote. “A comparison of candidate pages reveals inconsistencies in tone and framing — Harris’s page emphasizes historic achievements, while Trump’s page takes a markedly different editorial approach. This highlights a fundamental challenge with LLM-based platforms: maintaining consistent, unbiased presentation across variable content generation,” Grennan explained. He also lauded the team-up with AP and Democracy Works, stating that it should help ensure greater consistency in information delivery. “This is particularly important for candidate information, where varying source material can lead to dramatically different presentations of the same individual. Having authorized data vendors helps establish a baseline for information quality,” noted Grennan.

Despite the data collaborations, Grennan stated that they’re not a panacea for all the risks, such as faulty article summaries and hallucinations, that can plague LLMs — even those that use RAG technology. “While partnerships with established data providers like AP and Democracy Works should definitely help reduce technical hallucinations, they don’t fully address the challenge of perceived bias in presentation. Even factually accurate information can create different impressions based on framing and emphasis. The contrast in candidate biographies demonstrates how LLMs can inadvertently reflect existing biases in their training data, potentially affecting how information is contextualized and presented,” Grennan concluded.

Dr. Banafa echoed those sentiments, writing that even if third-party data providers have reliable, rigorous fact-checking standards — the AI models sourcing those data can benefit from continual monitoring and refinement. “While trusted sources lower the chances of misinformation, continuous monitoring and validation of AI outputs are still crucial to maintain reliability and trustworthiness,” he wrote. “It’s equally important for users to crosscheck critical details, given the potential consequences of even minor inaccuracies in election information.” However, Dr. Borne was a bit more optimistic that the specific use case that Perplexity has developed should further curtail the incidence of hallucinations.

“The typical hallucinations arise in LLMs when the end user queries are essentially unconstrained on the vast historical knowledgebase of the world. Those LLMs cannot give a truly accurate and complete—and short—answer to a complex question any more than a physics professor can explain the fullness and the intricacies of quantum theory or general relativity 100% accurately in a few sentences to a general audience. I am optimistic that Perplexity will do better than the typical LLM track record,” wrote Dr. Borne. Dr. Banafa believes that Perplexity’s unique model may hold promise for the future. “But it’s essential to consider the broader challenges of using AI in election contexts. AI chatbots have previously provided incorrect or partially correct answers to election-related questions. This highlights the need for continuous evaluation and refinement of AI systems to meet the rigorous standards required for sharing accurate election information. Additionally, advancements in AI transparency and interpretability could further reduce errors, fostering more trust in AI-generated election information,” he noted.

While Dr. Borne described Perplexity’s election tracker platform as an experiment, he noted that it is centered on the highly charged and personalized human sentiments and context-driven narratives associated with modern politics. “We will see if this works well enough to be considered a success, or—if like any science-technology implementation—we learn from it and refine it for the next time. In this specific instance, I believe that the outcomes should be positive since there is more ‘technology implementation’ than ‘scientific experimentation’ involved, but the latter is definitely not 0%. Perplexity’s project is still ultimately an LLM after all,” he concluded.”


“The discs had images of skeletons on them and were called ‘Bones’ or ‘Ribs’ and contained music that was forbidden. The practice of copying and recording music onto X-rays really got going in St Petersburg, a port where it was easier to obtain illicit records from abroad. But it spread, first to Moscow and then to most major conurbations throughout the states of the Soviet Union.”

OPEN SAMIZDAT
https://sovietmaps.com/CityMil
https://jstor.org/stable/jj.5425967
https://press.princeton.edu/ideas/forbidden-texts
https://semanticscholar.org/Libraries-in-the-Post-Scarcity-Era
https://reason.com/2022/07/24/you-cant-stop-pirate-libraries
You Can’t Stop Pirate Libraries
by Elizabeth Nolan Brown  / August/September 2022

“Shadow libraries exist in the space where intellectual property rights collide with the free-flowing exchange of knowledge and ideas. In some cases, these repositories of pirated books and journal articles serve as a blow against censorship, allowing those under repressive regimes to access otherwise verboten works. At other times, shadow libraries—a.k.a. pirate libraries—function as a peer-to-peer lending economy, providing e-books and PDFs of research papers to people who can’t or won’t pay for access, as well as to people who might otherwise be paying customers. Are the proprietors of these pirate libraries freedom fighters? Digital Robin Hoods? Criminals? That depends on your perspective, and it may also differ depending on the platform in question.

But one thing is certain: These platforms are nearly impossible to eradicate. Even a greatly enhanced crackdown on them would be little more than a waste of time and resources. Some of the biggest digital-age shadow libraries—including Library Genesis (or Libgen) and Aleph—have roots in Russia, where a culture of illicit book sharing arose under communism. “Russian academic and research institutions…had to somehow deal with the frustrating lack of access to up-to-date and affordable western works to be used in education and research,” the legal researcher Balázs Bodó wrote in the 2015 paper “Libraries in the Post-Scarcity Era.”

“samizdat copy of Aleksandr Solzhenitsyn’s novel In the First Circle, 1960s”

“This may explain why the first batch of shadow libraries started in a number of academic/research institutions such as the Department of Mechanics and Mathematics…at Moscow State University.” As PCs and internet access slowly penetrated Russian society, “an extremely lively digital librarianship movement emerged, mostly fuelled by enthusiastic readers, book fans and often authors, who spared no effort to make their favorite books available on FIDOnet, a popular BBS [bulletin board system] in Russia,” Bodó’s paper explained.

As a result, a “bottom-up, decentralized, often anarchic digital library movement” emerged. These libraries have found large audiences among academics in America and around the world, thanks to the high cost of accessing scholarly journal articles. “Payment of 32 dollars is just insane when you need to skim or read tens or hundreds of these papers to do research,” wrote Alexandra Elbakyan—the Russia-based founder of the massive shadow library Sci-Hub—in a 2015 letter to the judge presiding over the academic publisher Elsevier’s suit against Sci-Hub. Elbakyan pointed out that in days of yore, students and researchers would share access to papers via forum requests and emails, a system which Sci-Hub simply streamlines. She also noted that Elsevier makes money off the work of researchers who do not get paid for their work.

Such economic imperatives are just one part of the Sci-Hub ethos. “Any law against knowledge is fundamentally unjust,” Elbakyan tweeted in December 2021. “There seems to be a widely shared…consensus in the academic sector about the moral acceptability of such radical open access practices,” wrote Bodó, Dániel Antal, and Zoltán Puha in a 2020 paper published by PLOS One. “Willful copyright infringement in the research and education sector is seen as an act of civil disobedience, resisting the business models in academic publishing that have faced substantial criticism in recent years for unsustainable prices and outstanding profit margins.”

In his earlier paper, Bodó argued that “the emergence of black markets whether they be of culture, of drugs or of arms is always a symptom, a warning sign of a friction between supply and demand.” When “there is a substantial difference between what is legally available and what is in demand, cultural black markets will be here to compete with and outcompete the established and recognized cultural intermediaries. Under this constant existential threat, business models and institutions are forced to adapt, evolve or die.” The 2020 paper underlined the point: Its “supply side analysis” of scholarly piracy suggested “that a significant chunk of the shadow library supply is not available in digital format and a significant share of downloads concentrate on legally inaccessible works.”

Many would reply that such piracy is just plain wrong, no matter how much trouble and expense copyright causes for authors and researchers. But copyright, according to some strains of libertarian thought, is not the sort of “property right” we ought to justly respect, given its historical genesis in propping up unjust monopoly by creating artificial scarcity. “Only tangible, scarce resources are the possible object of interpersonal conflict, so it is only for them that property rules are applicable,” the libertarian lawyer Stephan Kinsella argued in “Against Intellectual Property,” published in the Journal of Libertarian Studies in 2001. “Thus, patents and copyrights are unjustifiable monopolies granted by government legislation.”

Intellectual property rights give creators “partial rights of control—ownership—over the tangible property of everyone else” and can “prohibit them from performing certain actions with their own property,” Kinsella continues. “Author X, for example, can prohibit a third party, Y, from inscribing a certain pattern of words on Y’s own blank pages with Y’s own ink. That is, by merely authoring an original expression of ideas…the [intellectual property] creator instantly, magically becomes a partial owner of others’ property.” Justly enforced property rights, by this line of thinking, ought to apply only to physical things that are scarce and whose control is rivalrous. This would not apply to words or ideas that can—as the very existence of these pirate libraries shows—be copied exactly and infinitely. Enforcing copyright inherently stops other people from doing things with their minds and their justly owned property, including their server space and hard drives.

What about the utilitarian case for intellectual property? The U.S. Constitution enshrines copyrights to “promote the progress of science and the useful arts.” But banning shadow libraries could do more harm to such promotion of “science and the useful arts” than good, given how much they facilitate research and scholarship that would otherwise be either prohibitively expensive or outright impossible.

As a 2016 letter in The Lancet pointed out, such sites could be hugely beneficial for doctors in places like Peru, where few physicians have access “to the papers and information they need to care for a growing and diverse set of patients.” Such arguments became even more powerful during the COVID-19 pandemic. Interestingly, the 2020 Immersive Media & Books survey found that pirates are more likely to be avid book buyers than nonpirates. “Compared to the general survey population, a higher percentage of book pirates during COVID are buying more ebooks (38.7%), audiobooks (27.1%) and print books (33.7%),” the study concluded.

But publishers love their copyrights, and they do not wish to adapt their legacy systems to the digital age. They thus have been trying to crush the shadow libraries, with the help of the legal system. In 2015, Elsevier sued to shut down Sci-Hub and Libgen. A federal court eventually ruled in Elsevier’s favor, awarding it $15 million in damages and issuing an injunction against the two platforms. In 2017, the American Chemical Society (ACS) sued Sci-Hub. The U.S. District Court for the Eastern District of Virginia ruled in the plaintiff’s favor, saying that Sci-Hub owed it $4.8 million in damages. The court ordered American web hosting companies, domain registrars, and search engines to stop facilitating access to “any or all domain names and websites through which Defendant Sci-Hub engages in unlawful access to, use, reproduction, and distribution” of ACS’s works. Other countries, such as Sweden and France, have also ordered internet service providers to block Sci-Hub and Libgen.

Enforcing any of these rulings has proven nearly impossible, since Sci-Hub and Libgen are hosted in other countries and not beholden to U.S.—or Swedish, or French—rules. The people behind Sci-Hub and LibGen didn’t bother to contest the lawsuits against them. When internet service providers and domain registrars in these countries cut off access, the shadow libraries simply popped up elsewhere. And even if search engines don’t display them, these libraries can be accessed via the dark web. Yet publishers keep signing up to play this game of whack-a-mole in different venues. Elsevier, research publisher Wiley, and ACS are currently suing Sci-Hub in Indian court. (This time, Elbakyan is fighting back, arguing that Sci-Hub is covered under the exemptions in India’s Copyright Act.) Another shadow library, the Ukraine-based Kiss Library, lost a case last year in the U.S. District Court for the Western District of Washington and was ordered to pay $7.8 million in statutory damages and to stop distributing copyrighted materials. The library has not paid a cent.

Since U.S. courts have no real power to make any of these institutions pay, popular authors John Grisham and Scott Turow have challenged the Department of Justice to do more. “The time and money required for the suit demonstrate the absurdity of leaving anti-piracy enforcement to the victims,” they wrote in a February op-ed for The Hill. “We are also asking Congress to amend the law to stop U.S. search engines from linking to notorious foreign-based piracy sites, which they have refused to do on their own.” It’s no surprise that some best-selling authors are among those most inflamed about pirate libraries.

“The few existing studies in the general e-book piracy space…echo findings of research on music and audiovisual piracy: displacement effects are mostly detrimental for best sellers,” while “long tail content enjoys a discovery effect,” wrote Bodó and his colleagues in their 2020 paper. But the U.S. Department of Justice will have no more luck than the courts in getting the outcome those American authors want. Nor would stopping search engines from linking to shadow libraries make much of a dent, since the sites would still be accessible to those in the know and since social media can easily provide this knowledge to anyone searching for it. The whole business would ultimately be a costly and time-consuming failure—in addition to keeping students, scientists, doctors, and others from accessing important information. In an earlier internet era, people liked to say that information wants to be free. Information, of course, wants nothing. But so long as people want free information, the modern tech and digital ecosystem will provide it. Perhaps authors and publishers would do better to accept that and address ways to mitigate its effects rather than engage in an unwinnable copyright war.”

PREVIOUSLY

MACHINE READABLE
https://spectrevision.net/2024/04/25/machine-readable/
SHADOW LIBRARIES
https://spectrevision.net/2019/12/18/data-hoarding/
GUERRILLA OPEN ACCESS
https://spectrevision.net/2016/02/18/guerrilla-open-access/