From 73c194f304f827b55081b15524479f82a1b7d94c Mon Sep 17 00:00:00 2001
From: maszhongming
Date: Tue, 16 Sep 2025 15:15:29 -0500
Subject: Initial commit

---
 README.md | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e6ed987
--- /dev/null
+++ b/README.md
@@ -0,0 +1,113 @@

# CS 598 JH Assignment: Enhancing KG-RAG for Biomedical Question Answering

This assignment asks you to set up, replicate, and enhance the **KG-RAG (Knowledge Graph-based Retrieval Augmented Generation)** framework. You will first reproduce the baseline results using the `gemini-2.0-flash` model, then design and evaluate three distinct improvement strategies.

## Assignment Instructions

### **Step 1: Set Up the Environment**

First, prepare your local environment by cloning the repository, creating a virtual environment, installing the dependencies, and running the setup script.

```bash
# Clone the repository
git clone <repository-url>
cd KG_RAG

# Create and activate a conda virtual environment
conda create -n kg_rag python=3.10.9
conda activate kg_rag

# Install the required packages
pip install -r requirements.txt

# Run the setup script to create the disease vector database
python -m kg_rag.run_setup
```

### **Step 2: Update Your Google API Key**

Configure your Google API key to use the Gemini model. The recommended LLM for this assignment is **`gemini-2.0-flash`**, which is used for both Disease Entity Extraction and Answer Generation.

  * **Get Your API Key**: You can create an API key for free by visiting [this link](https://makersuite.google.com/app/apikey).
  * **Free Credits & Rate Limits**: Free credits are available, but daily rate limits apply. Plan your project schedule accordingly to avoid interruptions.
  * **Update Config File**: Add your API key to the `gpt_config.env` file. (A short key-check sketch is included in the appendix at the end of this README.)

### **Step 3: Replicate the Baseline Model**

Run the following script to generate results with the baseline KG-RAG implementation and `gemini-2.0-flash`.

```bash
sh run_gemini.sh
```

### **Step 4: Evaluate the Baseline**

Execute the evaluation script to measure the performance of the baseline model.

```bash
python data/my_results/evaluate_gemini.py
```

### **Step 5: Implement Enhancement Strategies**

This is the core of the assignment. You are required to implement **three distinct improvement strategies** in the [**`kg_rag/rag_based_generation/GPT/run_mcq_qa.py`**](kg_rag/rag_based_generation/GPT/run_mcq_qa.py) file.

We have left TODO sections for `Mode 1`, `Mode 2`, and `Mode 3` in the code as placeholders for your implementations. (A hypothetical skeleton of what such modes might look like is included in the appendix at the end of this README.)

### **Step 6: Evaluate Your Enhancements**

Evaluate the performance of each of your proposed strategies.

1. Ensure each enhanced model variant saves its output to a new file path.
2. Open the evaluation script at [**`data/my_results/evaluate_gemini.py`**](data/my_results/evaluate_gemini.py) and modify the file path to point to your new results file.
3. Run the script again and record the results for each of your three strategies.

A minimal evaluation sketch is also included in the appendix at the end of this README.

-----

## What is KG-RAG?

KG-RAG is a task-agnostic framework that combines the explicit knowledge of a Knowledge Graph (KG) with the implicit knowledge of a Large Language Model (LLM). It empowers a general-purpose LLM by incorporating an optimized, domain-specific **"prompt-aware context"** extracted from a massive biomedical KG called [SPOKE](https://spoke.ucsf.edu/).

The main feature of KG-RAG is that it extracts the minimal context sufficient to respond to the user prompt, making the information provided to the LLM both dense and highly relevant.
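To make the idea of prompt-aware context concrete, here is a minimal, self-contained sketch of the retrieval pattern. It is **not** the KG-RAG implementation: the toy triples, the token-overlap `score` function (standing in for a real embedding model), and `build_context` are illustrative assumptions only.

```python
# Illustrative sketch of prompt-aware context retrieval (not the KG-RAG source code).
# A real pipeline uses an LLM for disease entity extraction, sentence embeddings, and the
# SPOKE KG; a toy token-overlap score keeps this example standard-library only.
from collections import Counter
import math

# Hypothetical mini knowledge graph: (subject, predicate, object) triples per disease.
KG_TRIPLES = {
    "cystic fibrosis": [
        ("cystic fibrosis", "ASSOCIATES", "CFTR gene"),
        ("cystic fibrosis", "TREATED_BY", "ivacaftor"),
        ("cystic fibrosis", "LOCALIZES_TO", "lung"),
    ],
    "sickle cell anemia": [
        ("sickle cell anemia", "ASSOCIATES", "HBB gene"),
        ("sickle cell anemia", "TREATED_BY", "hydroxyurea"),
    ],
}

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; punctuation is stripped so 'fibrosis?' matches 'fibrosis'."""
    return Counter("".join(c if c.isalnum() else " " for c in text.lower()).split())

def score(text_a: str, text_b: str) -> float:
    """Cosine similarity over token counts (a stand-in for an embedding model)."""
    a, b = tokenize(text_a), tokenize(text_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_context(question: str, top_k: int = 2) -> str:
    """Pick the best-matching disease node, then keep only the top-k most relevant triples."""
    disease = max(KG_TRIPLES, key=lambda d: score(question, d))
    ranked = sorted(KG_TRIPLES[disease], key=lambda t: score(question, " ".join(t)), reverse=True)
    return "\n".join(f"{s} {p} {o}" for s, p, o in ranked[:top_k])

if __name__ == "__main__":
    question = "Which gene is associated with cystic fibrosis?"
    print(build_context(question))  # minimal, prompt-relevant context to prepend to the LLM prompt
```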
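-----

## Appendix: Illustrative Sketches

The snippets below are rough sketches intended to make some of the steps above more concrete. They are **not** part of the KG-RAG codebase; function names, environment-variable names, and file or column names are assumptions that you should adapt to the actual repository.

**Step 2 (API key check).** A quick way to confirm your key works before launching the full pipeline, assuming `gpt_config.env` stores the key under a variable such as `GOOGLE_API_KEY` (check the file for the exact name) and that `python-dotenv` and `google-generativeai` are installed by `requirements.txt`:

```python
# Smoke test for the Gemini API key; adjust the env-file path and variable name as needed.
import os

import google.generativeai as genai   # from the google-generativeai package
from dotenv import load_dotenv        # from the python-dotenv package

load_dotenv("gpt_config.env")                         # read key=value pairs from the env file
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))  # variable name is an assumption

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content("Reply with the single word: ready")
print(response.text)
```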
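**Step 5 (enhancement modes).** One possible shape for the three modes: a prompt rewrite, a context filter, and self-consistency voting. The `answer_once` stand-in below is a toy heuristic so the sketch runs on its own; in `run_mcq_qa.py` you would replace it with the repository's actual Gemini call and wire each mode into the corresponding TODO block:

```python
# Hypothetical skeleton for three enhancement modes; names and hooks are illustrative only.
from collections import Counter

def answer_once(question: str, context: str, options: list[str]) -> str:
    """Toy stand-in for a single LLM call: pick the option that overlaps most with the context."""
    return max(options, key=lambda o: sum(tok in context.lower() for tok in o.lower().split()))

def mode_1_prompt_rewrite(question: str, context: str, options: list[str]) -> str:
    """Mode 1 idea: restate the question with explicit step-by-step instructions."""
    rewritten = f"Use only this context:\n{context}\nThink step by step, then answer: {question}"
    return answer_once(rewritten, context, options)

def mode_2_context_filter(question: str, context: str, options: list[str]) -> str:
    """Mode 2 idea: drop context lines that share no tokens with the question."""
    keep = [line for line in context.splitlines()
            if set(line.lower().split()) & set(question.lower().split())]
    return answer_once(question, "\n".join(keep) or context, options)

def mode_3_self_consistency(question: str, context: str, options: list[str], n_samples: int = 5) -> str:
    """Mode 3 idea: sample several answers and return the majority vote (useful when the LLM call is stochastic)."""
    votes = Counter(answer_once(question, context, options) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    ctx = "cystic fibrosis ASSOCIATES CFTR gene"
    q = "Which gene is associated with cystic fibrosis?"
    opts = ["CFTR", "HBB", "BRCA1"]
    print(mode_1_prompt_rewrite(q, ctx, opts), mode_2_context_filter(q, ctx, opts), mode_3_self_consistency(q, ctx, opts))
```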
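**Step 6 (evaluation).** A minimal accuracy check over a results file, assuming your variant writes a CSV with `correct_answer` and `llm_answer` columns. Match the path and column names to what your code actually writes, and mirror the path change in `data/my_results/evaluate_gemini.py`:

```python
# Minimal accuracy computation over a results CSV; path and column names are assumptions.
import csv

RESULTS_PATH = "data/my_results/mode_1_results.csv"   # point this at your new output file

def accuracy(path: str) -> float:
    """Fraction of rows where the model's answer matches the gold answer (case-insensitive)."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    correct = sum(r["llm_answer"].strip().lower() == r["correct_answer"].strip().lower() for r in rows)
    return correct / len(rows) if rows else 0.0

if __name__ == "__main__":
    print(f"Accuracy: {accuracy(RESULTS_PATH):.3f}")
```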