Challenges in AI Biomed Lit Searches

Palo Alto, CA — If you ask most scientists what part of their job they dislike, chances are “literature search” will be high on the list. Sifting through mountains of biomedical research to find relevant information is often slow, tedious, and frustrating. But with the rapid rise of AI tools like ChatGPT, could that be about to change?

In a recent Stanford study published in PLOS Digital Health, “Artificial intelligence’s contribution to biomedical literature search: revolutionizing or complicating?,” researchers found ChatGPT had limitations in consistency, accuracy, and relevancy that made it unreliable for widespread scientific literature searches. 

Led by Ray Yip, a researcher in the lab of Vinit Mahajan, M.D., Ph.D., the study explored ChatGPT's utility for literature searches from the end-user perspective of clinicians and biomedical researchers. The AI models were tasked with finding information on niche topics with limited published data, navigating well-known topics with overwhelming amounts of research, generating new hypotheses based on existing knowledge, and searching for clinical guidelines and best practices.

The searches missed key papers and pulled in non-academic sources such as blogs and media articles, echoing the failures of earlier AI models, which hallucinated facts or cited non-existent papers.

When asked the question, “Give me 6 vitreous proteomics papers in age-related macular degeneration (AMD),” researchers found “GPT-3.5 generated inconsistent results, failing to suggest relevant publications in six instances and providing inaccurate references in others, often advising users to consult PubMed and Google Scholar. In contrast, while ChatGPT Classic generated lists of publications in every iteration, it also failed to provide accurate references, often featuring rephrased words in titles, fabricated authorships, or incorrect publication dates or journals.”

Mahajan, a Stanford professor and vice chair of research, said, “The inconsistent accuracy of conversational AI tools underscores the need for careful human oversight. As more advanced large language models continue to be optimized with plugins that add accuracy-improving functions, we envision a time when AI becomes a strong and reliable ally in streamlining and reshaping scientific research practices.”

Meaningful improvements were observed as models evolved from GPT-3.5 to GPT-4, especially when enhanced with support functions, prompt engineering, and plugins. In a handful of cases, AI tools retrieved papers the study authors had not found and consistently provided some accurate references across repeated tests.

Ray said, “Overall, augmentations significantly improved ChatGPT’s ability to return relevant research manuscripts with accurate sources. Yet, it couldn’t fully resolve issues with consistency, accuracy, and relevancy.”
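One practical form of the oversight both researchers describe is to cross-check every AI-suggested citation against PubMed before trusting it. The minimal sketch below is not part of the study; the list of suggested titles and the helper function are illustrative assumptions. It uses NCBI's public E-utilities search endpoint to flag titles that return zero PubMed hits and therefore deserve manual verification.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hit_count(title: str) -> int:
    """Return the number of PubMed records whose title matches the query.

    Zero hits is a strong hint that a model-suggested citation may be
    fabricated or garbled and should be checked by hand.
    """
    params = {
        "db": "pubmed",
        "term": f'"{title}"[Title]',
        "retmode": "json",
    }
    resp = requests.get(EUTILS, params=params, timeout=10)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

# Hypothetical titles returned by a chat model's literature search;
# the point is the verification step, not these specific papers.
suggested_titles = [
    "Proteomic analysis of the vitreous in age-related macular degeneration",
]

for title in suggested_titles:
    hits = pubmed_hit_count(title)
    status = "found in PubMed" if hits else "NOT found -- verify manually"
    print(f"{status}: {title}")
```

A check like this does not judge relevance or quality, but it catches the most damaging failure mode the study reports: references that simply do not exist.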

More powerful AI models have since emerged that were not evaluated in this study: xAI’s Grok 3, OpenAI’s newer models (GPT-4o, o3, and o4-mini), and Anthropic’s Claude 3 series. Meanwhile, specialized AI agents such as Elicit and FutureHouse aim to automate not just literature searches but entire literature reviews, with FutureHouse reporting accuracies of up to 70%; together, these developments promise to make AI a powerful ally for researchers.

Ray expressed both optimism and caution. “As AI tools rapidly develop, I’m excited for a future where biomedical literature searches become effortless. But we, as researchers, also need to be incredibly cautious—we must continue to thoroughly test these tools before fully relying on them.”

Mahajan said, “Ray’s interest in and efficient use of AI was the impetus behind this paper. It is an important issue for researchers, and I feel like we have shed some light on the benefits and drawbacks of using AI for biomedical literature searches.”

20/20 Blog
May 15, 2025