Evaluating Medical Retrieval-Augmented Generation (RAG) with NVIDIA AI Endpoints and Ragas

Published on 2024-11-15

In medicine, advanced technologies can enhance patient care and improve research methodologies. Retrieval-augmented generation (RAG) is one such approach: it combines large language models (LLMs) with external knowledge retrieval. By pulling relevant information from databases, scientific literature, and patient records, RAG systems ground their responses in accurate, context-rich sources, addressing limitations such as outdated information and hallucinations often observed in standalone LLMs.

In this overview, we'll explore RAG's growing role in healthcare, focusing on its potential to transform applications like drug discovery and clinical trials. We'll also look at the methods and tools needed to evaluate medical RAG systems: NVIDIA AI endpoints accessed through LangChain, the Ragas framework, and the MACCROBAT dataset, a collection of patient reports from PubMed Central.

Key Challenges of Medical RAG

Scalability: With medical data expanding at a compound annual growth rate (CAGR) of over 35%, RAG systems need to manage and retrieve information efficiently without compromising speed, especially in scenarios where timely insights can impact patient care.
Specialized Language and Knowledge Requirements: Medical RAG systems require domain-specific tuning since the medical lexicon and content differ substantially from other domains like finance or law.
Absence of Tailored Evaluation Metrics: Unlike general-purpose RAG applications, medical RAG lacks well-suited benchmarks. Conventional metrics (like BLEU or ROUGE) emphasize text similarity rather than the factual accuracy critical in medical contexts.
Component-wise Evaluation: Effective evaluation requires independent scrutiny of both the retrieval and generation components. Retrieval must pull relevant, current data, and the generation component must ensure faithfulness to retrieved content.
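To illustrate why text-similarity metrics fall short here, the sketch below uses a simplified unigram-overlap F1 (in the spirit of ROUGE-1, not the official implementation). The sentences are made-up examples, not clinical guidance: an answer with a wrong dose shares almost every token with the reference, while a correct paraphrase shares far fewer.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both strings
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration only.
reference = "the patient should take 5 mg of warfarin daily"
wrong_dose = "the patient should take 50 mg of warfarin daily"  # factually different
paraphrase = "a daily warfarin dose of 5 mg is recommended"     # same meaning

overlap_wrong = unigram_f1(wrong_dose, reference)
overlap_paraphrase = unigram_f1(paraphrase, reference)
print(f"wrong dose: {overlap_wrong:.2f}, paraphrase: {overlap_paraphrase:.2f}")
```

The wrong-dose answer scores higher on overlap than the correct paraphrase, which is the behavior a medical evaluation has to avoid.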

Introducing Ragas for RAG Evaluation

Ragas, an open-source evaluation framework, offers an automated approach for assessing RAG pipelines. Its toolkit focuses on context relevancy, context recall, faithfulness, and answer relevancy. By using an LLM as a judge, Ragas reduces the need for manually annotated data, making the process efficient and cost-effective.
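As a rough illustration of one of these metrics, faithfulness can be read as the share of claims in the generated answer that the retrieved context supports. The sketch below only shows that ratio; it is not the Ragas implementation. The `ClaimVerdict` type and the example claims are hypothetical, and in practice the verdicts would come from the judge LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimVerdict:
    claim: str       # one statement extracted from the generated answer
    supported: bool  # judge verdict: can the claim be inferred from the context?


def faithfulness(verdicts: List[ClaimVerdict]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hypothetical verdicts for a three-claim answer.
verdicts = [
    ClaimVerdict("Blood pressure was elevated on admission.", True),
    ClaimVerdict("An echocardiogram showed a reduced ejection fraction.", True),
    ClaimVerdict("The patient was discharged on day 3.", False),
]
score = faithfulness(verdicts)
print(f"faithfulness = {score:.2f}")
```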

Evaluation Strategies for RAG Systems

For robust RAG evaluation, consider these steps:

Synthetic Data Generation: Generate triplet data (question, answer, context) based on the vector store documents to create synthetic test data.
Metric-Based Evaluation: Evaluate the RAG system on metrics like precision and recall, comparing its responses to the generated synthetic data as ground truth.
Independent Component Evaluation: For each question, assess retrieval context relevance and the generation’s answer accuracy.

Here’s an example pipeline: given a question like “What are typical BP measurements in congestive heart failure?” the system first retrieves relevant context and then evaluates if the response addresses the question accurately.
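For the retrieval side of the metric-based step, precision and recall can be computed by comparing the chunks the retriever returned with the chunks a synthetic question was generated from. This is a minimal sketch that assumes every chunk has a unique identifier; the IDs below are hypothetical.

```python
from typing import List, Set, Tuple


def retrieval_precision_recall(
    retrieved: List[str], relevant: Set[str]
) -> Tuple[float, float]:
    """Precision and recall of retrieved chunk IDs against ground-truth IDs."""
    unique_retrieved = list(dict.fromkeys(retrieved))  # drop duplicates, keep order
    hits = sum(1 for chunk_id in unique_retrieved if chunk_id in relevant)
    precision = hits / len(unique_retrieved) if unique_retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical chunk IDs: what the retriever returned vs. the ground-truth chunks.
retrieved_ids = ["report_12#3", "report_07#1", "report_12#4", "report_31#2"]
relevant_ids = {"report_12#3", "report_12#4", "report_19#1"}

precision, recall = retrieval_precision_recall(retrieved_ids, relevant_ids)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```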

Setting Up RAG with NVIDIA API and LangChain

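To follow along, create an NVIDIA account, obtain an API key, and expose it as the NVIDIA_API_KEY environment variable, which the LangChain NVIDIA integration reads. The Ragas snippets in this article follow the 0.1-series module layout (for example ragas.testset.generator), so you may need to pin that series if a newer release reports import errors. Install the necessary packages with:

pip install langchain langchain-community
pip install langchain-nvidia-ai-endpoints
pip install ragas datasets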

Next, load the MACCROBAT dataset, a collection of annotated patient case reports hosted on the Hugging Face Hub, through LangChain's dataset loader:

from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Hugging Face dataset ID and the column that holds the report text
dataset_name = "singh-aditya/MACCROBAT_biomedical_ner"
page_content_column = "full_text"

# Each dataset row becomes one LangChain Document
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
dataset = loader.load()

Using NVIDIA endpoints and LangChain, we can now build a robust test set generator and create synthetic data based on the dataset:

from ragas.testset.generator import TestsetGenerator
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# The generator LLM writes questions and answers; the critic LLM filters them
critic_llm = ChatNVIDIA(model="meta/llama3.1-8b-instruct")
generator_llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")
# truncate="END" cuts off inputs that exceed the embedding model's token limit
embeddings = NVIDIAEmbeddings(model="nv-embedqa-e5-v5", truncate="END")

generator = TestsetGenerator.from_langchain(
    generator_llm, critic_llm, embeddings, chunk_size=512
)
# Generate 10 synthetic test samples from the loaded reports
testset = generator.generate_with_langchain_docs(dataset, test_size=10)

Deploying and Evaluating the Pipeline

Deploy your RAG system on top of a vector store and evaluate it with questions generated from actual medical reports, for example:

# Sample questions
["What are typical BP measurements in the case of congestive heart failure?",
 "What can scans reveal in patients with severe acute pain?",
 "Is surgical intervention necessary for liver metastasis?"]

Each question links with a retrieved context and a generated ground truth answer, which can then be used to evaluate the performance of both retrieval and generation components.
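One way to hold these linked pieces together is a column-oriented record, with one list per field. The sketch below is only an assumption about a convenient layout: the helper `build_eval_rows` is hypothetical, the column names follow the question/contexts/answer/ground_truth convention that Ragas-style evaluation commonly expects, and the context and answer strings are placeholders rather than real pipeline output.

```python
from typing import Dict, List, Sequence


def build_eval_rows(
    questions: Sequence[str],
    retrieved_contexts: Sequence[Sequence[str]],
    answers: Sequence[str],
    ground_truths: Sequence[str],
) -> Dict[str, List]:
    """Collect per-question evaluation data into parallel columns."""
    lengths = {len(questions), len(retrieved_contexts), len(answers), len(ground_truths)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have exactly one entry per question")
    return {
        "question": list(questions),
        "contexts": [list(ctx) for ctx in retrieved_contexts],  # several chunks per question
        "answer": list(answers),              # what the RAG pipeline generated
        "ground_truth": list(ground_truths),  # the synthetic reference answer
    }


rows = build_eval_rows(
    questions=["What are typical BP measurements in the case of congestive heart failure?"],
    retrieved_contexts=[["<retrieved chunk 1>", "<retrieved chunk 2>"]],  # placeholders
    answers=["<answer generated by the RAG pipeline>"],                   # placeholder
    ground_truths=["<synthetic ground-truth answer>"],                    # placeholder
)
print(sorted(rows))
```

A dictionary like this can be turned into a Hugging Face dataset with `datasets.Dataset.from_dict(rows)` before it is handed to an evaluator.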

Custom Metrics with Ragas

Medical RAG systems may need custom metrics to assess retrieval precision. For instance, a metric could determine if a retrieved document is relevant enough for a search query:

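from dataclasses import dataclass, field

# Import paths follow the ragas 0.1.x layout; other releases organize these modules differently
from ragas import evaluate
from ragas.llms.prompt import Prompt
from ragas.metrics.base import EvaluationMode, MetricWithLLM

RETRIEVAL_PRECISION = Prompt(
    name="retrieval_precision",
    instruction="Is this result relevant enough for the first page of search results? Answer '1' for yes and '0' for no.",
    input_keys=["question", "context"],
    output_key="answer",  # assumed name for the field holding the judge's verdict
)

@dataclass
class RetrievalPrecision(MetricWithLLM):
    name: str = "retrieval_precision"
    # qc mode: the metric consumes the question and the retrieved context
    evaluation_mode: EvaluationMode = EvaluationMode.qc
    context_relevancy_prompt: Prompt = field(default_factory=lambda: RETRIEVAL_PRECISION)

    # A complete metric must also implement the scoring hook that sends the
    # prompt to the judge LLM and parses its '1'/'0' answer; it is omitted here.

# Use this custom metric in evaluation; dataset["eval"] stands for your evaluation split
score = evaluate(dataset["eval"], metrics=[RetrievalPrecision()])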
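The prompt above asks the judge to answer '1' or '0', so the raw replies still have to be turned into a score. The following is a minimal sketch of that post-processing step, independent of Ragas; the function names and sample replies are hypothetical, and taking the first standalone digit is a simplification that a production parser would tighten.

```python
import re
from typing import Iterable, Optional


def parse_binary_verdict(raw: str) -> Optional[int]:
    """Return 1 or 0 if the reply contains a standalone '1' or '0', else None."""
    match = re.search(r"\b([01])\b", raw)
    return int(match.group(1)) if match else None


def mean_retrieval_precision(raw_replies: Iterable[str]) -> float:
    """Average the parseable verdicts; unparseable replies are skipped."""
    verdicts = [parse_binary_verdict(reply) for reply in raw_replies]
    valid = [v for v in verdicts if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


# Hypothetical judge replies for four retrieved contexts.
replies = ["1", "Answer: 0", "'1'", "not sure"]
precision_score = mean_retrieval_precision(replies)
print(f"retrieval precision = {precision_score:.2f}")
```

Skipping unparseable replies keeps one malformed response from distorting the average, at the cost of hiding how often the judge fails to follow the format; counting those failures separately is a reasonable extension.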

Structured Output for Precision and Reliability

Structured output makes evaluation more efficient and reliable because the judge's answers no longer need free-text parsing. With NVIDIA AI endpoints used through LangChain, you can constrain the LLM response to predefined categories (e.g., yes/no).

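import enum

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Reuses the critic model from earlier; nvidia_llm can be any chat model that supports structured output
nvidia_llm = ChatNVIDIA(model="meta/llama3.1-8b-instruct")

class Choices(enum.Enum):
    Y = "Y"
    N = "N"

# Constrain the model so that it can only answer with one of the enum values
structured_llm = nvidia_llm.with_structured_output(Choices)
structured_llm.invoke("Is this search result relevant to the query?")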
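To see what structured output saves you from, consider the fallback of mapping free-text replies onto the same enum by hand. The sketch below is a hypothetical helper, not part of LangChain or the NVIDIA integration; it accepts a few obvious spellings and rejects everything else.

```python
import enum
from typing import Optional


class Choices(enum.Enum):
    Y = "Y"
    N = "N"


def to_choice(raw: str) -> Optional[Choices]:
    """Map a free-text reply to Choices, or None if it is not a clean yes/no."""
    token = raw.strip().strip(".!'\"").upper()
    if token in ("Y", "YES"):
        return Choices.Y
    if token in ("N", "NO"):
        return Choices.N
    return None


print(to_choice(" yes. "))               # a clean reply maps to Choices.Y
print(to_choice("Yes, it is relevant"))  # a verbose reply is rejected (None)
```

Verbose replies like the second one are the common failure case for hand-written parsing, which is why constraining the model's output up front is the more reliable option.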

Conclusion

RAG bridges LLMs and dense vector retrieval for highly efficient, scalable applications across medical, multilingual, and code generation domains. In healthcare, its potential to bring accurate, contextually aware responses is evident, but evaluation must prioritize accuracy, domain specificity, and cost-efficiency.

The outlined evaluation pipeline, employing synthetic test data, NVIDIA endpoints, and Ragas, offers a robust method to meet these demands. For a deeper dive, you can explore Ragas and NVIDIA Generative AI examples on GitHub.

Release Statement: This article is reproduced from https://dev.to/koolkamalkishor/evaluating-medical-retrieval-augmented-generation-rag-with-nvidia-ai-endpoints-and-ragas-2m34?1. If there is any infringement, please contact study_golang@163.com to have it removed.
