OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
@AkariAsai et al. introduce a retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45M open-access papers.
https://t.co/a3qI45sHNL https://t.co/vjQif5xli1
— Sumit (@_reachsumit) November 22, 2024
OPEN ACCESS AI RESEARCH
https://openscholar.allen.ai/paper
https://allenai.org/blog/openscholar
https://github.com/AkariAsai/OpenScholar?tab=readme-ov-file
https://huggingface.co/openscholar-v1-67376a89f6a80f448da411a6
https://venturebeat.com/ai/openscholar-the-open-source-a-i
OpenScholar: The open-source A.I. that's outperforming GPT-4o in scientific research
by Michael Nuñez / November 20, 2024
“Scientists are drowning in data. With millions of research papers published every year, even the most dedicated experts struggle to stay updated on the latest findings in their fields. A new artificial intelligence system, called OpenScholar, is promising to rewrite the rules for how researchers access, evaluate, and synthesize scientific literature. Built by the Allen Institute for AI (Ai2) and the University of Washington, OpenScholar combines cutting-edge retrieval systems with a fine-tuned language model to deliver citation-backed, comprehensive answers to complex research questions. "Scientific progress depends on researchers' ability to synthesize the growing body of literature," the OpenScholar researchers wrote in their paper. But that ability is increasingly constrained by the sheer volume of information. OpenScholar, they argue, offers a path forward—one that not only helps researchers navigate the deluge of papers but also challenges the dominance of proprietary AI systems like OpenAI's GPT-4o.
At OpenScholar's core is a retrieval-augmented language model that taps into a datastore of more than 45 million open-access academic papers. When a researcher asks a question, OpenScholar doesn't merely generate a response from pre-trained knowledge, as models like GPT-4o often do. Instead, it actively retrieves relevant papers, synthesizes their findings, and generates an answer grounded in those sources. This ability to stay "grounded" in real literature is a major differentiator. In tests using a new benchmark called ScholarQABench, designed specifically to evaluate AI systems on open-ended scientific questions, OpenScholar excelled. The system demonstrated superior performance on factuality and citation accuracy, even outperforming much larger proprietary models like GPT-4o.
One particularly damning finding involved GPT-4o's tendency to generate fabricated citations—hallucinations, in AI parlance. When tasked with answering biomedical research questions, GPT-4o cited nonexistent papers in more than 90% of cases. OpenScholar, by contrast, remained firmly anchored in verifiable sources. The grounding in real, retrieved papers is fundamental. The system uses what the researchers describe as their "self-feedback inference loop" and "iteratively refines its outputs through natural language feedback, which improves quality and adaptively incorporates supplementary information." The implications for researchers, policy-makers, and business leaders are significant. OpenScholar could become an essential tool for accelerating scientific discovery, enabling experts to synthesize knowledge faster and with greater confidence.
“How OpenScholar works: The system begins by searching 45 million research papers (left), uses AI to retrieve and rank relevant passages, generates an initial response, and then refines it through an iterative feedback loop before verifying citations. This process allows OpenScholar to provide accurate, citation-backed answers to complex scientific questions. | Source: Allen Institute for AI and University of Washington”
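For readers who want the mechanics, here is a minimal Python sketch of the retrieve-generate-refine loop the caption describes. The component names and the Feedback fields are hypothetical placeholders standing in for OpenScholar's retriever, reranker, generator LM, and feedback model; they are not the project's actual API (see the GitHub repo for that).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Feedback:
    text: str                  # natural-language critique of the current draft (hypothetical field)
    followup_query: str        # extra search query, if more evidence is requested
    needs_more_evidence: bool
    is_satisfied: bool

def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[str]],            # search + rank over the open-access paper datastore
    generate: Callable[..., str],                         # drafts a citation-backed answer from passages
    critique: Callable[[str, str, List[str]], Feedback],  # self-feedback model
    verify_citations: Callable[[str, List[str]], str],    # final citation check
    max_rounds: int = 3,
) -> str:
    passages = retrieve(query, 20)                 # 1. retrieve relevant passages
    answer = generate(query, passages)             # 2. draft an initial grounded answer
    for _ in range(max_rounds):                    # 3. iterative self-feedback loop
        fb = critique(query, answer, passages)
        if fb.is_satisfied:
            break
        if fb.needs_more_evidence:                 #    adaptively pull in supplementary evidence
            passages = passages + retrieve(fb.followup_query, 5)
        answer = generate(query, passages, feedback=fb.text)
    return verify_citations(answer, passages)      # 4. keep only citations backed by retrieved passages
```

The point of the sketch is the control flow: retrieval happens before generation, and the feedback step can trigger further retrieval rather than falling back on pre-trained knowledge.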
OpenScholar's debut comes at a time when the AI ecosystem faces a growing tension between closed, proprietary systems and the rise of open-source alternatives like Meta's Llama. Models like OpenAI's GPT-4o and Anthropic's Claude offer impressive capabilities, but they are expensive, opaque, and inaccessible to many researchers. OpenScholar flips this model on its head by being fully open-source. The OpenScholar team has released not only the code for the language model but also the entire retrieval pipeline, a specialized 8-billion-parameter model fine-tuned for scientific tasks, and a datastore of scientific papers. "To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints," the researchers wrote in their blog post announcing the system.
2/ On the shoulders of giants
With millions of papers published yearly, keeping up with scientific literature has become a monumental challenge. OpenScholar aims to help researchers navigate this vast landscape by synthesizing grounded, citation-supported answers from… — Akari Asai (@AkariAsai) November 19, 2024
This openness is not just a philosophical stance; it's also a practical advantage. OpenScholar's smaller size and streamlined architecture make it far more cost-efficient than proprietary systems. For example, the researchers estimate that OpenScholar-8B is 100 times cheaper to operate than PaperQA2, a concurrent system built on GPT-4o. This cost-efficiency could democratize access to powerful AI tools for smaller institutions, underfunded labs, and researchers in developing countries. Still, OpenScholar is not without limitations. Its datastore is restricted to open-access papers, leaving out paywalled research that dominates some fields. This constraint, while legally necessary, means the system might miss critical findings in areas like medicine or engineering. The researchers acknowledge this gap and hope future iterations can responsibly incorporate closed-access content.
“How OpenScholar performs: Expert evaluations show OpenScholar (OS-GPT4o and OS-8B) competing favorably with both human experts and GPT-4o across four key metrics: organization, coverage, relevance and usefulness. Notably, both OpenScholar versions were rated as more "useful" than human-written responses. | Source: Allen Institute for AI and University of Washington”
The OpenScholar project raises important questions about the role of AI in science. While the system's ability to synthesize literature is impressive, it is not infallible. In expert evaluations, OpenScholar's answers were preferred over human-written responses 70% of the time, but the remaining 30% highlighted areas where the model fell short—such as failing to cite foundational papers or selecting less representative studies. These limitations underscore a broader truth: AI tools like OpenScholar are meant to augment, not replace, human expertise. The system is designed to assist researchers by handling the time-consuming task of literature synthesis, allowing them to focus on interpretation and advancing knowledge.
7/ What's next?
We’re just getting started with OpenScholar! đExpanding domains: Support for non-CS fields is coming soon.
Public API: Full-text search over 45M+ papers will be available shortly.
Try the OpenScholar demo and share your feedback—your input is invaluable as… — Akari Asai (@AkariAsai) November 19, 2024
Critics may point out that OpenScholar's reliance on open-access papers limits its immediate utility in high-stakes fields like pharmaceuticals, where much of the research is locked behind paywalls. Others argue that the system's performance, while strong, still depends heavily on the quality of the retrieved data. If the retrieval step fails, the entire pipeline risks producing suboptimal results. But even with its limitations, OpenScholar represents a watershed moment in scientific computing. While earlier AI models impressed with their ability to engage in conversation, OpenScholar demonstrates something more fundamental: the capacity to process, understand, and synthesize scientific literature with near-human accuracy.
8/ Summary
Try it out: https://t.co/4QrEWAnBhL
Read more: https://t.co/QjVY0eVW3Q — we discuss more details as well as limitations of OpenScholar, based on our beta testing with CS researchers!
Code & data: https://t.co/W7aFN1FcI1
Paper: https://t.co/2Ovz7qMpdT pic.twitter.com/h63MHi0XwS — Akari Asai (@AkariAsai) November 19, 2024
The numbers tell a compelling story. OpenScholar's 8-billion-parameter model outperforms GPT-4o while being orders of magnitude smaller. It matches human experts in citation accuracy where other AIs fail 90% of the time. And perhaps most tellingly, experts prefer its answers to those written by their peers. These achievements suggest we're entering a new era of AI-assisted research, where the bottleneck in scientific progress may no longer be our ability to process existing knowledge, but rather our capacity to ask the right questions. The researchers have released everything—code, models, data, and tools—betting that openness will accelerate progress more than keeping their breakthroughs behind closed doors. In doing so, they've answered one of the most pressing questions in AI development: Can open-source solutions compete with Big Tech's black boxes? The answer, it seems, is hiding in plain sight among 45 million papers.”
4. You can also chat with articles and expand on them.
To do so, switch to Co-Storm and type in the topic and purpose of your article.
Storm will research and synthesize available information and give you the required article. pic.twitter.com/2zhvwDoSr9
— Mushtaq Bilal, PhD (@MushtaqBilalPhD) October 29, 2024
BIG DATA SYNTHESIS
https://storm.genie.stanford.edu/
https://arxiv.org/abs/2402.14207
https://github.com/stanford-oval/storm
https://storm-project.stanford.edu/research/storm/
https://blog.acer.com/en/discussion/2218/storm-by-stanford-university-the-ai-model-for-academic-and-research-purposes
STORM by Stanford: The AI Model for Academic and Research Purposes
by Edmund_McGowan / November 13
“Artificial intelligence is a swiftly evolving beast. From novel chatbots that come with the dust and are gone with the wind to behemoths like ChatGPT, AI is on the march. STORM by Stanford University is an innovative AI-powered research tool currently making waves in the global academic community and beyond. Since early 2024, this open-source research project has helped many academics, students, and content creators craft articles from scratch. "Articles from scratch?" We hear you ask. Yes, in a nutshell, it can be used to create Wikipedia-style papers, complete with citations in a matter of minutes. Whether you're interested in AI for schoolwork, or even AI for grad school level writing, STORM can help you on your path to a PhD. Get set, because we're headed for the eye of the storm to discover the origins of STORM, and the humans behind it. We'll also go on to discuss its performance and steer you in the direction of the STORM website so you can try it out for yourself.
STORM by Stanford is an interactive data platform designed to support machine learning research by streamlining data access and model development. A powerful tool for accelerating AI research and development. https://t.co/nFgK0TFsB0 pic.twitter.com/lAf7yFTyYY
— Pouria Akbari (@pouriaakbari_) October 29, 2024
Short for "Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking", the STORM Stanford AI research project is an AI tool that can create Wikipedia-style entries faster than you can make a cup of coffee. Let's be clear: STORM is not your average B- chatbot; it is an A+ gifted-class knowledge creator and research assistant that's ready to back up its statements and provide citations galore. While AI is often a faceless, authorless corporate beast, the team behind STORM are actually Stanford students and faculty. STORM was created by human members of Stanford's OVAL team, namely: Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam.
LLMs (large language models) may be useful for a layman's general research. But for academics and content creators, they tend to fall short in several areas. Accuracy is king in academia, and LLMs have well-publicized limitations in veracity, as well as specificity, and understanding of complex academic topics. What's more, LLMs are renowned for producing confident, yet incorrect answers that lack citations. The final nail in the coffin for academic use of LLMs is plagiarism. Rapid generation of text comes with the risk that the LLM is simply replicating existing academic sources. While the majority of LLMs create content via retrieval-augmented generation (RAG), STORM takes content creation several steps further to craft accurate, organized answers. Now let's find out more about the multi-agent conversations behind every STORM search. At the time of writing, STORM is powered by Bing Search and Azure OpenAI GPT-4o-mini. This recent upgrade featuring the latest technologies enables STORM to break down the barrier between the excess of accessible information out there and what an individual is able to assimilate. The "knowledge curation agent" explored in STORM (remember, it is still a research project) aims to provide a solid foundation for knowledge discovery, making in-depth learning possible without the stress of laborious research.
Where many LLMs are a letdown, STORM is a success. This is in no small part thanks to STORM's multi-perspective question asking. Multiple AI agents cooperate in an agentic system, where individual AI agents perform the tasks of content retrieval, multi-perspective question asking, and finally, synthesis of content. Similar in many ways to how a human team would collaborate to research and write an ambitious project, STORM approaches complex tasks from multiple angles to create comprehensive written content that can give human-created articles a run for their money. STORM provides users with the option of STORM AI autonomous or Co-STORM (Human-AI collaboration), as well as search engine choices. After inputting your topic to STORM, the platform generally takes a minute or two to generate your article.
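As a rough illustration of that multi-perspective, multi-agent flow, the sketch below simulates perspective-specific conversations and then synthesizes them into an outline. The function names (discover_perspectives, ask_question, answer_with_sources, write_outline) are invented for illustration and do not correspond to the actual stanford-oval/storm codebase.

```python
from typing import Callable, Dict, List

def storm_outline(
    topic: str,
    discover_perspectives: Callable[[str], List[str]],    # e.g. "historian", "practitioner" (assumed helper)
    ask_question: Callable[[str, str, List[str]], str],    # perspective-conditioned writer agent
    answer_with_sources: Callable[[str], str],             # retrieval-grounded expert agent
    write_outline: Callable[[str, Dict[str, List[str]]], str],
    turns_per_perspective: int = 3,
) -> str:
    notes: Dict[str, List[str]] = {}
    for perspective in discover_perspectives(topic):
        history: List[str] = []
        for _ in range(turns_per_perspective):
            # Each simulated turn: the perspective agent asks a question,
            # and the expert agent answers it from retrieved, citable sources.
            question = ask_question(topic, perspective, history)
            answer = answer_with_sources(question)
            history += [question, answer]
        notes[perspective] = history
    # Synthesize the collected Q&A into a structured, Wikipedia-style outline.
    return write_outline(topic, notes)
```

The design point is that questions come from several simulated viewpoints rather than one prompt, which is how the system surfaces angles a single-pass query would miss.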
Once an article is completed, a "See BrainSTORMing Process" option appears above the summary of your article. This neat feature allows users to see the AI agents (editors) and the steps they have taken to contribute to the final article. If you do try STORM, do the good folks at Stanford a favor and provide feedback using the handy feedback box on the web demo. This information, as well as your purpose for writing the article, will be securely stored and not combined with your Google account info. If you're looking for an AI tool to assist your academic writing, or just AI for school in general, then STORM is certainly worth a try. Here are a few different user groups that may find STORM more useful than regular old LLMs.
- Academics and researchers can both benefit from using STORM, as it can create structured outlines on complex academic topics that can be used as educational resources. The verification and citation features of STORM are particularly attractive for this cohort.
- Students today may lack the time to conduct their own research. With STORM, students of all levels can quickly get well-organized notes and summaries in easy to understand Wikipedia style articles, likely a form that they are already familiar with.
- Content creators with deadlines to meet or day jobs to attend to can rapidly research and organize data on STORM. Verified, fact-based outlines that offer multiple perspectives can be quickly crafted, and updated by users as topics evolve.
As with all AI platforms, STORM is not without its limitations. If you've read this far, chances are you're not plotting to misuse STORM to graduate from school or college. But just in case you were wondering, STORM is not (yet) an AI writing tool that can knock out a 10,000 word college-level dissertation for you. Try out STORM and you will quickly discover that the "research preview" excels in generating Wikipedia-style articles.
stanford-oval/storm: An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations. https://t.co/OfxUQm7ZBH pic.twitter.com/SoYT3hb9Od
— Vukosi Marivate (@vukosi) April 12, 2024
Similar to Wikipedia, STORM is very good at providing a comprehensive outline of a topic. The Wikipedia-esque sections are useful as foundations to build out from, but may lack specific or detailed information that some users require. This is presumably an aspect of the platform that will be improved; time will tell. Another issue that may deter or, indeed, attract some users is STORM's limited safety measures. The potential to generate offensive content is certainly present on STORM, and on behalf of the Stanford Open Virtual Assistant Lab team, we remind you to follow STORM's guidelines. As with other AI content generators, mistakes are still a likelihood, so double check your info before going to print!”
Google accused of being “woke” after Gemini creates inaccurate, racially diverse historical images
Generative AIs are often accused of being biased, but it appears that Google went a bit too far in trying to address this problem with Ge… https://t.co/AsWfm7bDlp
— TechSpot (@TechSpot) February 22, 2024
HALLUCINATION RISK
https://perplexity.ai/introducing-the-election-information-hub
https://techcrunch.com/the-other-election-night-winner-perplexity
https://zdnet.com/why-i-prefer-perplexity-over-every-other-ai-chatbot
https://science.org/is-your-ai-hallucinating-when-chatbots-make-things-up
https://forbes.com/perplexitys-election-hub-triggers-reactions-from-ai-experts
AI Experts Test Perplexity's New Election Hub
by Tor Constantino / Nov 4, 2024
“While most Americans are focusing on who will be the next president of the United States following Tuesday's general election in the U.S., there are hundreds of other important political races to be decided as well. There are 33 U.S. Senate seats and another 435 contests in the House of Representatives up for grabs. Not to mention the hundreds of other races at the various state and county levels of government that need to be settled. To keep track of the national contests, Perplexity announced the launch on Friday of the first publicly available AI-based election tracker. Beginning on Tuesday, the election information hub will be powered using data from the Associated Press as well as Democracy Works.
Several AI experts praised the tracker, but they also expressed concerns about inconsistencies in framing and tonality as well as hallucinations and inaccurate summaries. An article on Sunday in The Verge reported errors in the Perplexity election information hub that it said the company later updated. It's also worth noting that Perplexity itself has in recent months been embroiled in litigation and threats of legal action with several companies—including Forbes—for its AI's unauthorized use of content, fabricating facts and false article abstracts. Despite those signals, users who still might want to use the AI election hub can visit the Perplexity link and enter their ZIP code. All the races and ballot measures that apply to that ZIP code will populate the screen, which the user can explore — the interface is pictured below.
“Screenshot of Perplexity query box for its 2024 election information hub”
"We want to make it as simple as possible to receive trusted, easy-to-understand information to inform your voting decisions. For each response, you can view the sources that informed an answer, allowing you to dive deeper and verify referenced materials. Whether you're seeking to understand complex ballot measures, verify candidate positions, or simply find your polling place," the company statement reads. In an email exchange seeking answers about whether Perplexity was paying AP and Democracy Works, the company's rationale behind the project and its timing, and how Perplexity's large language model might mitigate the hallucination allegations of creating fake news that multiple media outlets made earlier this year, the company sent a single blanket statement.
The controversial AI search engine, accused of aggressively scraping content, went all in on providing AI-generated election information. https://t.co/bLmMD9XJ5N
— WIRED (@WIRED) November 6, 2024
"We want to make it easier for people to make informed choices on all ballot items, including elected offices and ballot measures. We waited to release this to the public until we could conduct the appropriate testing," wrote a company spokesperson. "To clarify, Perplexity uses LLMs for summarizing content, but is designed to optimize for accuracy. We use a process called Retrieval-Augmented Generation to identify relevant information and summarize it in a way that's tailored to a user's query. Answers are not utilizing stored knowledge from a model's training data, which makes us different from other AI chatbots. It's also why we chose to work with organizations like the AP and Democracy Works to provide us with up-to-date information on ballot items and election results," the spokesperson's message concluded.
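Read literally, the spokesperson is describing retrieval-augmented generation restricted to vetted feeds, with no fallback to the model's stored training knowledge. A hypothetical Python sketch of that behavior (not Perplexity's implementation; the function names are placeholders) might look like this:

```python
from typing import Callable, List, Optional, Tuple

def grounded_answer(
    question: str,
    retrieve_vetted: Callable[[str, int], List[Tuple[str, str]]],  # returns (source_url, passage) pairs from trusted feeds
    summarize: Callable[[str, List[Tuple[str, str]]], str],        # summary LM tailored to the user's query
    min_sources: int = 1,
) -> Optional[str]:
    evidence = retrieve_vetted(question, 10)
    if len(evidence) < min_sources:
        return None  # abstain rather than answer from the model's parametric memory
    summary = summarize(question, evidence)
    sources = "\n".join(url for url, _ in evidence)
    return summary + "\n\nSources:\n" + sources
```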
In general, the AI influencers and experts I spoke with were impressed with the concept behind the election-specific info hub. Kirk Borne, Ph.D., is an internationally recognized thought leader and speaker within the AI and data space, as well as founder of the Data Leadership Group. "I believe that this platform will be very helpful for many people: with 'less talk, more data' and 'fewer opinions, more analysis,'" Dr. Borne wrote via text. "The specificity, utility, currency, and accuracy—four key dimensions of all large language models and Generative AI—are 100% dependent on their data sources. Very broad LLMs that try to answer all possible end-user queries, utilizing massive datasets can be impossible to tame in all 4 key dimensions. Perplexity's focus on a hyper-targeted and tightly constrained use case, with a limited spectrum of end-user queries, thus makes sense and is consistent with having such formalized data sources," he explained.
In the recent U.S. elections, some voters turned to Perplexity, an AI-powered search engine, to access verified, nonpartisan election information.
Users received tailored insights based on vetted sources with links and citations.
Learn more in #TheBatch: https://t.co/6eXZ3fbA8i
— DeepLearning.AI (@DeepLearningAI) November 19, 2024
Ahmed Banafa, Ph.D., is a technology expert and engineering professor at San Jose State University. He thinks there could be benefits from the hub as well. "Using AI to provide real-time updates on voting requirements, polling locations and candidate details—the platform aims to increase voter engagement and support informed choices. I checked it and I found the list of candidates in my area, it was accurate with good information about each candidate and each measure on the ballot. This approach reflects the growing trend of using AI to simplify access to critical election information, with a user-friendly design that makes it easy for voters to find what they need quickly," Dr. Banafa wrote via email.
The AP calls Wisconsin for Trump, declaring him the winner after clearing 270 electoral votes.
More at https://t.co/kxpAIJtgy5. pic.twitter.com/Uj585o63Iz
— Perplexity (@perplexity_ai) November 6, 2024
He also applauded Perplexity's partnering with AP and Democracy Works for this initiative, which is intended to give the AI model authoritative, up-to-date election data. He noted that using such trusted data sources lends credibility to the provided information and reduces reliance on less dependable outlets or data suppliers. "These collaborations are crucial for providing accurate and trustworthy content, especially during elections, when misinformation can have serious impacts. This will save the service the steps of verifications of the information as the AP/DW already did that work. These partnerships uphold high standards of election integrity, helping ensure users receive only thoroughly vetted, current information," added Dr. Banafa. Conor Grennan is chief AI Architect at NYU Stern School of Business as well as CEO and founder of the consultancy AI Mindset. In an email exchange, he wrote that he ran extensive queries to test the model and gained some interesting insights.
It's almost election day! While the presidency gets the most attention, Perplexity's here to help you make an informed vote on all ballot items, including statewide and local.
We've curated a trusted set of sources to answer all election-related queries: https://t.co/ERtkwpo8yj pic.twitter.com/FfptvEw9KN
— Perplexity (@perplexity_ai) November 1, 2024
"Perplexity's election hub addresses a crucial need by centralizing essential election information, from candidate profiles to voting logistics. While their factual information on voting procedures is reliable and serves a valuable public function, the platform faces challenges in presenting candidate information equitably," he wrote. "A comparison of candidate pages reveals inconsistencies in tone and framing — Harris's page emphasizes historic achievements, while Trump's page takes a markedly different editorial approach. This highlights a fundamental challenge with LLM-based platforms: maintaining consistent, unbiased presentation across variable content generation," Grennan explained. He also lauded the team-up with AP and Democracy Works, stating that it should help ensure greater consistency in information delivery. "This is particularly important for candidate information, where varying source material can lead to dramatically different presentations of the same individual. Having authorized data vendors helps establish a baseline for information quality," noted Grennan.
Despite the data collaborations, Grennan stated that they're not a panacea for all the risks such as faulty article summations and hallucinations that can plague LLMs — even those that use RAG technology. "While partnerships with established data providers like AP and Democracy Works should definitely help reduce technical hallucinations, they don't fully address the challenge of perceived bias in presentation. Even factually accurate information can create different impressions based on framing and emphasis. The contrast in candidate biographies demonstrates how LLMs can inadvertently reflect existing biases in their training data, potentially affecting how information is contextualized and presented," Grennan concluded.
Dr. Banafa echoed those sentiments, writing that even if third-party data providers have reliable, rigorous fact-checking standards — the AI models sourcing those data can benefit from continual monitoring and refinement. "While trusted sources lower the chances of misinformation, continuous monitoring and validation of AI outputs are still crucial to maintain reliability and trustworthiness," he wrote. "It's equally important for users to crosscheck critical details, given the potential consequences of even minor inaccuracies in election information." However, Dr. Borne was a bit more optimistic that the specific use case that Perplexity has developed should further curtail the incidence of hallucinations.
Google apologizes for "missing the mark" after Gemini generated racially diverse Nazis https://t.co/0S6y6aJZ39
— The Verge (@verge) February 21, 2024
"The typical hallucinations arise in LLMs when the end user queries are essentially unconstrained on the vast historical knowledgebase of the world. Those LLMs cannot give a truly accurate and complete—and short—answer to a complex question any more than a physics professor can explain the fullness and the intricacies of quantum theory or general relativity 100% accurately in a few sentences to a general audience. I am optimistic that Perplexity will do better than the typical LLM track record," wrote Dr. Borne. Dr. Banafa believes that Perplexity's unique model may hold promise for the future. "But it's essential to consider the broader challenges of using AI in election contexts. AI chatbots have previously provided incorrect or partially correct answers to election-related questions. This highlights the need for continuous evaluation and refinement of AI systems to meet the rigorous standards required for sharing accurate election information. Additionally, advancements in AI transparency and interpretability could further reduce errors, fostering more trust in AI-generated election information," he noted.
Dr. Borne described Perplexity's election tracker platform as an experiment, one that's centered around the highly charged and personalized human sentiments and context-driven narratives associated with modern politics. "We will see if this works well enough to be considered a success, or—if like any science-technology implementation—we learn from it and refine it for the next time. In this specific instance, I believe that the outcomes should be positive since there is more 'technology implementation' than 'scientific experimentation' involved, but the latter is definitely not 0%. Perplexity's project is still ultimately an LLM after all," he concluded.”
“The discs had images of skeletons on them and were called 'Bones' or 'Ribs' and contained music that was forbidden. The practice of copying and recording music onto X-rays really got going in St Petersburg, a port where it was easier to obtain illicit records from abroad. But it spread, first to Moscow and then to most major conurbations throughout the states of the Soviet Union.”
OPEN SAMIZDAT
https://sovietmaps.com/CityMil
https://jstor.org/stable/jj.5425967
https://press.princeton.edu/ideas/forbidden-texts
https://semanticscholar.org/Libraries-in-the-Post-Scarcity-Era
https://reason.com/2022/07/24/you-cant-stop-pirate-libraries
You Can't Stop Pirate Libraries
by Elizabeth Nolan Brown / August/September 2022
“Shadow libraries exist in the space where intellectual property rights collide with the free-flowing exchange of knowledge and ideas. In some cases, these repositories of pirated books and journal articles serve as a blow against censorship, allowing those under repressive regimes to access otherwise verboten works. At other times, shadow libraries—a.k.a. pirate libraries—function as a peer-to-peer lending economy, providing e-books and PDFs of research papers to people who can't or won't pay for access, as well as to people who might otherwise be paying customers. Are the proprietors of these pirate libraries freedom fighters? Digital Robin Hoods? Criminals? That depends on your perspective, and it may also differ depending on the platform in question.
On this day 56 years ago (July 22, 1968) the New York Times finally got around to publishing the samizdat manifesto of dissident Soviet physicist Andrei Sakharov. It mentions carbon dioxide build up as one to watch… https://t.co/6AErcCYfc7
1/2 pic.twitter.com/gUBZuLEzxH — All Our Yesterdays (@our_yesterdays) July 21, 2024
But one thing is certain: These platforms are nearly impossible to eradicate. Even a greatly enhanced crackdown on them would be little more than a waste of time and resources. Some of the biggest digital-age shadow libraries—including Library Genesis (or Libgen) and Aleph—have roots in Russia, where a culture of illicit book sharing arose under communism. "Russian academic and research institutions…had to somehow deal with the frustrating lack of access to up-to-date and affordable western works to be used in education and research," the legal researcher Balázs Bodó wrote in the 2015 paper "Libraries in the Post-Scarcity Era."
“samizdat copy of Aleksandr Solzhenitsyn's novel In the First Circle, 1960s”
"This may explain why the first batch of shadow libraries started in a number of academic/research institutions such as the Department of Mechanics and Mathematics…at Moscow State University." "As PCs and internet access slowly penetrated Russian society, an extremely lively digital librarianship movement emerged, mostly fuelled by enthusiastic readers, book fans and often authors, who spared no effort to make their favorite books available on FIDOnet, a popular BBS [bulletin board system] in Russia," Bodó's paper explained.
As a result, a "bottom-up, decentralized, often anarchic digital library movement" emerged. These libraries have found large audiences among academics in America and around the world, thanks to the high cost of accessing scholarly journal articles. "Payment of 32 dollars is just insane when you need to skim or read tens or hundreds of these papers to do research," wrote Alexandra Elbakyan—the Russia-based founder of the massive shadow library Sci-Hub—in a 2015 letter to the judge presiding over the academic publisher Elsevier's suit against Sci-Hub. Elbakyan pointed out that in days of yore, students and researchers would share access to papers via forum requests and emails, a system which Sci-Hub simply streamlines. She also noted that Elsevier makes money off the work of researchers who do not get paid for their work.
Such economic imperatives are just one part of the Sci-Hub ethos. "Any law against knowledge is fundamentally unjust," Elbakyan tweeted in December 2021. "There seems to be a widely shared…consensus in the academic sector about the moral acceptability of such radical open access practices," wrote Bodó, Dániel Antal, and Zoltán Puha in a 2020 paper published by PLOS One. "Willful copyright infringement in the research and education sector is seen as an act of civil disobedience, resisting the business models in academic publishing that have faced substantial criticism in recent years for unsustainable prices and outstanding profit margins."
In his earlier paper, Bodó argued that "the emergence of black markets whether they be of culture, of drugs or of arms is always a symptom, a warning sign of a friction between supply and demand." When "there is a substantial difference between what is legally available and what is in demand, cultural black markets will be here to compete with and outcompete the established and recognized cultural intermediaries. Under this constant existential threat, business models and institutions are forced to adapt, evolve or die." The 2020 paper underlined the point: Its "supply side analysis" of scholarly piracy suggested "that a significant chunk of the shadow library supply is not available in digital format and a significant share of downloads concentrate on legally inaccessible works."
Many would reply that such piracy is just plain wrong, no matter how much trouble and expense copyright causes for authors and researchers. But copyright, according to some strains of libertarian thought, is not the sort of "property right" we ought to justly respect, given its historical genesis in propping up unjust monopoly by creating artificial scarcity. "Only tangible, scarce resources are the possible object of interpersonal conflict, so it is only for them that property rules are applicable," the libertarian lawyer Stephan Kinsella argued in "Against Intellectual Property," published in the Journal of Libertarian Studies in 2001. "Thus, patents and copyrights are unjustifiable monopolies granted by government legislation."
Intellectual property rights give creators "partial rights of control—ownership—over the tangible property of everyone else" and can "prohibit them from performing certain actions with their own property," Kinsella continues. "Author X, for example, can prohibit a third party, Y, from inscribing a certain pattern of words on Y's own blank pages with Y's own ink. That is, by merely authoring an original expression of ideas…the [intellectual property] creator instantly, magically becomes a partial owner of others' property." Justly enforced property rights, by this line of thinking, ought to apply only to physical things that are scarce and whose control is rivalrous. This would not apply to words or ideas that can—as the very existence of these pirate libraries shows—be copied exactly and infinitely. Enforcing copyright inherently stops other people from doing things with their minds and their justly owned property, including their server space and hard drives.
What about the utilitarian case for intellectual property? The U.S. Constitution enshrines copyrights to "promote the progress of science and the useful arts." But banning shadow libraries could do more harm to such promotion of "science and the useful arts" than good, given how much they facilitate research and scholarship that would otherwise be either prohibitively expensive or outright impossible.
As a 2016 letter in The Lancet pointed out, such sites could be hugely beneficial for doctors in places like Peru, where few physicians have access "to the papers and information they need to care for a growing and diverse set of patients." Such arguments became even more powerful during the COVID-19 pandemic. Interestingly, the 2020 Immersive Media & Books survey found that pirates are more likely to be avid book buyers than nonpirates. "Compared to the general survey population, a higher percentage of book pirates during COVID are buying more ebooks (38.7%), audiobooks (27.1%) and print books (33.7%)," the study concluded.
Why Sci-Hub matters: new empirical study shows that “articles downloaded from Sci-Hub were cited 1.72 times more than papers not downloaded from Sci-Hub…the number of downloads from Sci-Hub was a robust predictor of future citations” https://t.co/RyFixhJJMr
— Evgeny Morozov (@evgenymorozov) January 31, 2021
But publishers love their copyrights, and they do not wish to adapt their legacy systems to the digital age. They thus have been trying to crush the shadow libraries, with the help of the legal system. In 2015, Elsevier sued to shut down Sci-Hub and Libgen. A federal court eventually ruled in Elsevier's favor, awarding it $15 million in damages and issuing an injunction against the two platforms. In 2017, the American Chemical Society (ACS) sued Sci-Hub. The U.S. District Court for the Eastern District of Virginia ruled in the plaintiff's favor, saying that Sci-Hub owed it $4.8 million in damages. The court ordered American web hosting companies, domain registrars, and search engines to stop facilitating access to "any or all domain names and websites through which Defendant Sci-Hub engages in unlawful access to, use, reproduction, and distribution" of ACS's works. Other countries, such as Sweden and France, have also ordered internet service providers to block Sci-Hub and Libgen.
DeSci memes and coins are exploding with RIF up 150% to 195M, and SCIHUB up 250% to 57M
If the DeSci trend continues, (which based on mindshare stats it is) then imo LIBGEN is a very slept on coin
(obligatory this is nfa, i hold a position in libgen, this is a short history… pic.twitter.com/qM2dqp0WIS
— atareh (@atareh) November 18, 2024
Enforcing any of these rulings has proven nearly impossible, since Sci-Hub and Libgen are hosted in other countries and not beholden to U.S.—or Swedish, or French—rules. The people behind Sci-Hub and LibGen didn't bother to contest the lawsuits against them. When internet service providers and domain registrars in these countries cut off access, the shadow libraries simply popped up elsewhere. And even if search engines don't display them, these libraries can be accessed via the dark web. Yet publishers keep signing up to play this game of whack-a-mole in different venues. Elsevier, research publisher Wiley, and ACS are currently suing Sci-Hub in Indian court. (This time, Elbakyan is fighting back, arguing that Sci-Hub is covered under the exemptions in India's Copyright Act.) Another shadow library, the Ukraine-based Kiss Library, lost a case last year in the U.S. District Court for the Western District of Washington and was ordered to pay $7.8 million in statutory damages and to stop distributing copyrighted materials. The library has not paid a cent.
Unlike Sci-Hub, Z-Library never was a political project and never cared about Sci-Hub fighting for free access to knowledge as a human right. Few months after Z-Library founders were arrested by FBI, they finally changed their mind: https://t.co/S3yBh9b219
— Alexandra Elbakyan (@ringo_ring) August 14, 2023
Since U.S. courts have no real power to make any of these institutions pay, popular authors John Grisham and Scott Turow have challenged the Department of Justice to do more. "The time and money required for the suit demonstrate the absurdity of leaving anti-piracy enforcement to the victims," they wrote in a February op-ed for The Hill. "We are also asking Congress to amend the law to stop U.S. search engines from linking to notorious foreign-based piracy sites, which they have refused to do on their own." It's no surprise that some best-selling authors are among those most inflamed about pirate libraries.
Glad to see video playback return to the @internetarchive — and, with it, access to four Emperor Norton films from 1936, 1956, and 1966 that we contributed to IA in 2017. https://t.co/FIWDTfB0dY
— The Emperor Norton Trust (@EmpNortonTrust) November 4, 2024
"The few existing studies in the general e-book piracy space…echo findings of research on music and audiovisual piracy: displacement effects are mostly detrimental for best sellers," while "long tail content enjoys a discovery effect," wrote Bodó and his colleagues in their 2020 paper. But the U.S. Department of Justice will have no more luck than the courts in getting the outcome those American authors want. Nor would stopping search engines from linking to shadow libraries make much of a dent, since the sites would still be accessible to those in the know and since social media can easily provide this knowledge to anyone searching for it. The whole business would ultimately be a costly and time-consuming failure—in addition to keeping students, scientists, doctors, and others from accessing important information. In an earlier internet era, people liked to say that information wants to be free. Information, of course, wants nothing. But so long as people want free information, the modern tech and digital ecosystem will provide it. Perhaps authors and publishers would do better to accept that and address ways to mitigate its effects rather than engage in an unwinnable copyright war.”
PREVIOUSLY
MACHINE READABLE
https://spectrevision.net/2024/04/25/machine-readable/
SHADOW LIBRARIES
https://spectrevision.net/2019/12/18/data-hoarding/
GUERRILLA OPEN ACCESS
https://spectrevision.net/2016/02/18/guerrilla-open-access/