
HealthGPT : A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Mengze Li4, Xiaohui Song1, Siliang Tang1, Jun Xiao1, Hui Lin1, Yueting Zhuang1, Beng Chin Ooi5
1Zhejiang University, 2University of Electronic Science and Technology of China, 3Alibaba, 4The Hong Kong University of Science and Technology, 5National University of Singapore
Overview
Welcome to HealthGPT! HealthGPT is an advanced medical Large Vision-Language Model with a unified framework that integrates both medical visual comprehension and generation capabilities. In this project, a heterogeneous low-rank adaptation (H-LoRA) technique and a three-stage learning strategy are proposed, enabling the pre-trained large language model to efficiently follow both visual comprehension and generation instructions.
News
- [2025.02.26] We have released the UI/UX for inference.
- [2025.02.17] We have released the pre-trained weights on Hugging Face and the inference script.
TODO
- Release inference code.
- Release the pre-trained weight of the model.
- Release the inference UI/UX.
- Release VL-Health dataset.
- Release training scripts.
- Construct the website.
Task Classification and Support
HealthGPT supports 7 types of medical comprehension tasks and 5 types of medical generation tasks, outperforming recent unified visual models and medical-specific models.
Architecture
The HealthGPT architecture integrates hierarchical visual perception with H-LoRA, employing a task-specific hard router to select visual features and H-LoRA plugins, and generates text and vision outputs in an autoregressive manner.
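To make the routing idea concrete, here is a minimal PyTorch sketch (our illustration, not the repository's actual implementation) of a frozen linear layer extended with several LoRA-style low-rank plugins and a hard router that activates exactly one plugin per task; the layer sizes, rank, and task-id convention are assumptions.

```python
import torch
import torch.nn as nn

class HardRoutedLoRALinear(nn.Module):
    """A frozen linear layer with several LoRA-style low-rank plugins.

    A hard router picks exactly one plugin per forward pass based on the
    task id (e.g. 0 = comprehension, 1 = generation). Illustrative only.
    """

    def __init__(self, in_features, out_features, num_plugins=2, r=64, alpha=128):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)  # the base LLM projection stays frozen
        self.scaling = alpha / r
        # One (A, B) low-rank pair per task-specific plugin.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(r, in_features) * 0.01) for _ in range(num_plugins)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_features, r)) for _ in range(num_plugins)]
        )

    def forward(self, x, task_id):
        # Hard routing: only the plugin selected by task_id contributes.
        delta = (x @ self.lora_A[task_id].T) @ self.lora_B[task_id].T
        return self.base(x) + self.scaling * delta

# Route the same hidden states through the comprehension plugin (0)
# or the generation plugin (1).
layer = HardRoutedLoRALinear(in_features=3072, out_features=3072)
hidden = torch.randn(1, 16, 3072)
out_comprehension = layer(hidden, task_id=0)
out_generation = layer(hidden, task_id=1)
```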
Getting Started
We have released our model in two configurations, HealthGPT-M3 and HealthGPT-L14, to suit different requirements and resource availability:
- HealthGPT-M3: A smaller version optimized for speed and reduced memory usage.
- HealthGPT-L14: A larger version designed for higher performance and more complex tasks.
Installation
1. Prepare Environment
First, clone our repository and create the Python environment for running HealthGPT with the following commands:
# clone our project
git clone https://github.com/DCDmllm/HealthGPT.git
cd HealthGPT
# prepare python environment
conda create -n HealthGPT python=3.10
conda activate HealthGPT
pip install -r requirements.txt
2. Prepare Pre-trained Weights
HealthGPT utilizes clip-vit-large-patch14-336 as the visual encoder and employs Phi-3-mini-4k-instruct and phi-4 as the pre-trained LLM base models for HealthGPT-M3 and HealthGPT-L14, respectively. Please download the corresponding weights:
Model Type | Model Name | Download
---|---|---
ViT | clip-vit-large-patch14-336 | Download
Base Model (HealthGPT-M3) | Phi-3-mini-4k-instruct | Download
Base Model (HealthGPT-L14) | phi-4 | Download
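If you prefer to fetch these checkpoints from a script rather than the download links above, a minimal sketch using huggingface_hub would look like the following (our choice of tool, not a project requirement; the local_dir locations are arbitrary):

```python
from huggingface_hub import snapshot_download

# Vision encoder shared by both configurations.
snapshot_download(repo_id="openai/clip-vit-large-patch14-336",
                  local_dir="weights/clip-vit-large-patch14-336")

# Base LLM for HealthGPT-M3.
snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct",
                  local_dir="weights/Phi-3-mini-4k-instruct")

# Base LLM for HealthGPT-L14.
snapshot_download(repo_id="microsoft/phi-4",
                  local_dir="weights/phi-4")
```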
For medical vision generation tasks, please follow the official VQGAN guide and download the VQGAN OpenImages (f=8), 8192 model weights from the "Overview of pretrained models" section. Below is the direct link to the corresponding VQGAN pre-trained weights:
Model Name | Download
---|---
VQGAN OpenImages (f=8), 8192, GumbelQuantization | Download
After downloading, place the last.ckpt and model.yaml files in the taming_transformers/ckpt directory.
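As an optional sanity check (our addition, not part of the repository), a short snippet like the one below confirms that both files are where the inference code expects them:

```python
from pathlib import Path

# Expected VQGAN checkpoint layout, as described above.
ckpt_dir = Path("taming_transformers/ckpt")
for name in ("last.ckpt", "model.yaml"):
    path = ckpt_dir / name
    print(f"{path}: {'found' if path.is_file() else 'MISSING'}")
```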
3. Prepare H-LoRA and Adapter Weights
HealthGPT enhances the base model's capabilities for medical visual comprehension and generation by training a small number of H-LoRA parameters and adapter layers that align vision and text. We have currently released some of the weights from the training process, supporting the medical visual question answering and open-world visual reconstruction tasks. Here are the corresponding weights: Download.
We will soon release the full weights for HealthGPT-L14, along with the H-LoRA weights for medical generation tasks. Stay tuned!
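Assuming the released .bin files are standard PyTorch state dicts (which their use in the inference scripts suggests, though we have not verified the internal key names), a quick way to check that a download is intact is to load it on CPU and list a few entries:

```python
import torch

# Load the downloaded H-LoRA weights without moving anything to GPU.
state = torch.load("com_hlora_weights.bin", map_location="cpu")
print(f"{len(state)} entries")
for name, value in list(state.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value)
    print(name, shape)
```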
Inference
Medical Visual Question Answering
To perform inference using HealthGPT, please follow these steps:
- Download Necessary Files:
  - Ensure you have downloaded all the required model weights and resources.
- Update Script Paths:
  - Open the script located at llava/demo/com_infer.sh.
  - Modify the following variables to point to the paths where you stored the downloaded files:
    - MODEL_NAME_OR_PATH: Path or identifier for the base model.
    - VIT_PATH: Path to the Vision Transformer model weights.
    - HLORA_PATH: Path to the HLORA weights file for visual comprehension.
    - FUSION_LAYER_PATH: Path to your fusion layer weights file.
- Run the Script:
  - Execute the script in your terminal to begin inference:
    cd llava/demo
    bash com_infer.sh
Alternatively, you can run the Python command directly in your terminal by specifying the paths and parameters. This approach makes it easy to change the image or question as needed:
python3 com_infer.py \
--model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
--dtype "FP16" \
--hlora_r "64" \
--hlora_alpha "128" \
--hlora_nums "4" \
--vq_idx_nums "8192" \
--instruct_template "phi3_instruct" \
--vit_path "openai/clip-vit-large-patch14-336/" \
--hlora_path "path/to/your/local/com_hlora_weights.bin" \
--fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
--question "Your question" \
--img_path "path/to/image.jpg"
- Customize the Question and Image: You can modify the --question and --img_path parameters to ask different questions or analyze different images.
Correspondingly, the visual question answering task of HealthGPT-L14 can be executed with the following Python command:
python3 com_infer_phi4.py \
--model_name_or_path "microsoft/Phi-4" \
--dtype "FP16" \
--hlora_r "32" \
--hlora_alpha "64" \
--hlora_nums "4" \
--vq_idx_nums "8192" \
--instruct_template "phi4_instruct" \
--vit_path "openai/clip-vit-large-patch14-336/" \
--hlora_path "path/to/your/local/com_hlora_weights_phi4.bin" \
--question "Your question" \
--img_path "path/to/image.jpg"
The weights of com_hlora_weights_phi4.bin can be downloaded here.
Image Reconstruction
Similarly, simply set HLORA_PATH to point to the gen_hlora_weights.bin file and configure the other model paths. Then, you can perform the image reconstruction task using the following script:
cd llava/demo
bash gen_infer.sh
You can also directly execute the following python command:
python3 gen_infer.py \
--model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
--dtype "FP16" \
--hlora_r "256" \
--hlora_alpha "512" \
--hlora_nums "4" \
--vq_idx_nums "8192" \
--instruct_template "phi3_instruct" \
--vit_path "openai/clip-vit-large-patch14-336/" \
--hlora_path "path/to/your/local/gen_hlora_weights.bin" \
--fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
--question "Reconstruct the image." \
--img_path "path/to/image.jpg" \
--save_path "path/to/save.jpg"
Server
An interactive chat UI based on Gradio that accepts text + image input and returns text or images depending on the selected mode.
Project Introduction
This project provides a Gradio front-end interface that supports two modes (a minimal interface sketch follows this list):
- Analyze image (comprehension task): input text + image, output text
- Generate image (generation task): input text + image, output image
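The repository ships its own app.py; the sketch below is only an illustration of how such a two-mode interface can be wired up in Gradio, with run_healthgpt standing in as a hypothetical placeholder for the actual comprehension and generation calls.

```python
import gradio as gr

def run_healthgpt(mode, question, image):
    # Hypothetical placeholder: call the comprehension or generation
    # pipeline here and return (text, image) accordingly.
    if mode == "Analyze image":
        return f"[answer to: {question}]", None
    return "Generated image below.", image

with gr.Blocks(title="HealthGPT Demo") as demo:
    mode = gr.Radio(["Analyze image", "Generate image"],
                    value="Analyze image", label="Mode")
    question = gr.Textbox(label="Question / instruction")
    image_in = gr.Image(type="pil", label="Input image")
    run = gr.Button("Run")
    text_out = gr.Textbox(label="Text output")
    image_out = gr.Image(label="Image output")
    run.click(run_healthgpt,
              inputs=[mode, question, image_in],
              outputs=[text_out, image_out])

demo.launch(server_port=5010)
```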
Installation Dependencies
The app runs on Python and requires Gradio and Pillow:
pip install gradio pillow
Run the Project
Run the following command in the terminal:
python app.py
After it starts, the terminal will print the Gradio access address (e.g., http://127.0.0.1:5010), which you can open in a browser.
Citation
If you find this work useful, please consider giving this repository a star and citing our paper as follows:
@misc{lin2025healthgptmedicallargevisionlanguage,
title={HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation},
author={Tianwei Lin and Wenqiao Zhang and Sijing Li and Yuqian Yuan and Binhe Yu and Haoyuan Li and Wanggui He and Hao Jiang and Mengze Li and Xiaohui Song and Siliang Tang and Jun Xiao and Hui Lin and Yueting Zhuang and Beng Chin Ooi},
year={2025},
eprint={2502.09838},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.09838},
}
Acknowledgment
Our project is built on the following repositories:
- LLaVA: Large Language and Vision Assistant
- LLaVA++: Extending Visual Capabilities with LLaMA-3 and Phi-3
- Taming Transformers: Taming Transformers for High-Resolution Image Synthesis
License
This repository is under Apache License 2.0.