
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Tianwei Lin1, Wenqiao Zhang1, Sijing Li1, Yuqian Yuan1, Binhe Yu2, Haoyuan Li3, Wanggui He3, Hao Jiang3,

Mengze Li4, Xiaohui Song1, Siliang Tang1, Jun Xiao1, Hui Lin1, Yueting Zhuang1, Beng Chin Ooi5

1Zhejiang University, 2University of Electronic Science and Technology of China, 3Alibaba, 4The Hong Kong University of Science and Technology, 5National University of Singapore

🌟 Overview

Welcome to HealthGPT! 🚀 HealthGPT is an advanced medical Large Vision-Language Model with a unified framework that integrates both medical visual comprehension and generation capabilities. The project proposes heterogeneous low-rank adaptation (H-LoRA) and a three-stage learning strategy, enabling the pre-trained large language model to efficiently follow both visual comprehension and generation instructions.

🔥 News

  • [2025.02.26] We have released the inference UI.
  • [2025.02.17] We have released the pre-trained weights on Hugging Face and the inference script.

TODO

  • Release inference code.
  • Release the pre-trained weights of the model.
  • Release the inference UI/UX.
  • Release VL-Health dataset.
  • Release training scripts.
  • Construct the website.

📚 Task Classification and Support

HealthGPT supports 7 types of medical comprehension tasks and 5 types of medical generation tasks, outperforming recent unified visual models and medical-specific models.

Example Image

πŸ—οΈ Architecture

The HealthGPT architecture integrates hierarchical visual perception with H-LoRA, employing a task-specific hard router to select visual features and H-LoRA plugins, and generates text and vision outputs in an autoregressive manner.

Example Image
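To make the H-LoRA idea more concrete, below is a minimal conceptual sketch (not the authors' implementation) of a linear layer with task-specific low-rank plugins selected by a hard router. The class and parameter names are illustrative only; the default rank/alpha values simply mirror the flags used in the inference commands further down.

import torch
import torch.nn as nn

class HLoRALinear(nn.Module):
    """Illustrative sketch: a frozen base Linear plus one low-rank plugin per task;
    a hard router (here simply the task id) picks which plugin is applied."""
    def __init__(self, base: nn.Linear, rank: int = 64, alpha: int = 128, num_tasks: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank plugins are trained
        self.scaling = alpha / rank
        self.lora_A = nn.ModuleList(nn.Linear(base.in_features, rank, bias=False) for _ in range(num_tasks))
        self.lora_B = nn.ModuleList(nn.Linear(rank, base.out_features, bias=False) for _ in range(num_tasks))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Hard routing: only the selected task's low-rank branch is evaluated.
        return self.base(x) + self.scaling * self.lora_B[task_id](self.lora_A[task_id](x))

# e.g. task_id=0 for comprehension, task_id=1 for generation
layer = HLoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024), task_id=0)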

πŸ› οΈ Getting Started

We have released our model in two configurations, HealthGPT-M3 and HealthGPT-L14, to suit different requirements and resource availability:

  • HealthGPT-M3: A smaller version optimized for speed and reduced memory usage.
  • HealthGPT-L14: A larger version designed for higher performance and more complex tasks.

Installation

1. Prepare Environment

First, clone our repository and create the Python environment for running HealthGPT using the following command:

# clone our project
git clone https://github.com/DCDmllm/HealthGPT.git
cd HealthGPT

# prepare python environment
conda create -n HealthGPT python=3.10
conda activate HealthGPT
pip install -r requirements.txt
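After the environment is created, an optional sanity check like the one below (not part of the official setup; it assumes torch and transformers are installed by requirements.txt) confirms that the core libraries and a GPU are visible:

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())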

2. Prepare Pre-trained Weights

HealthGPT utilizes clip-vit-large-patch14-336 as the visual encoder and employs Phi-3-mini-4k-instruct and phi-4 as the pre-trained LLM base models for HealthGPT-M3 and HealthGPT-L14, respectively. Please download the corresponding weights:

  • ViT: clip-vit-large-patch14-336 (Download)
  • Base Model (HealthGPT-M3): Phi-3-mini-4k-instruct (Download)
  • Base Model (HealthGPT-L14): phi-4 (Download)

For medical vision generation tasks, please follow the official VQGAN guide and download the VQGAN OpenImages (f=8), 8192 model weights from the "Overview of pretrained models" section. Below is the direct link to the corresponding VQGAN pre-trained weights:

  • VQGAN OpenImages (f=8), 8192, GumbelQuantization (Download)

After downloading, place the last.ckpt and model.yaml files in the taming_transformers/ckpt directory.
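If you prefer to fetch the Hugging Face weights programmatically, a sketch like the following (using huggingface_hub; the weights/ target directory is just an example, and the repo ids are taken from the inference commands below) downloads the visual encoder and both base models:

from huggingface_hub import snapshot_download

# Example layout only; point local_dir wherever you keep model weights.
for repo_id in [
    "openai/clip-vit-large-patch14-336",  # ViT visual encoder
    "microsoft/Phi-3-mini-4k-instruct",   # base LLM for HealthGPT-M3
    "microsoft/Phi-4",                    # base LLM for HealthGPT-L14
]:
    snapshot_download(repo_id=repo_id, local_dir=f"weights/{repo_id.split('/')[-1]}")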

3. Prepare H-LoRA and Adapter Weights

HealthGPT enhances the base model's capabilities for medical visual comprehension and generation by training a small number of H-LoRA parameters and adapter layers that align vision and text. We have currently released some weights from the training process, supporting medical visual question answering and open-world visual reconstruction tasks. Here are the corresponding weights: Download.

We will soon be releasing the full weights for HealthGPT-L14, along with the H-LoRA weights for medical generation tasks. Stay tuned!!!
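To sanity-check a downloaded weight file before wiring it into the scripts, you can load it on CPU and inspect its keys. This is a generic torch.load check, not an official tool, and it assumes the .bin file is a plain state dict of tensors:

import torch

# Load the H-LoRA checkpoint on CPU and list a few parameter names and shapes.
state_dict = torch.load("com_hlora_weights.bin", map_location="cpu")
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))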

⚡ Inference

Medical Visual Question Answering

To perform inference using HealthGPT, please follow these steps:

  1. Download Necessary Files:
    • Ensure you have downloaded all the required model weights and resources.
  2. Update Script Paths:
    • Open the script located at llava/demo/com_infer.sh.
    • Modify the following variables to point to the paths where you stored the downloaded files:
      • MODEL_NAME_OR_PATH: Path or identifier for the base model.
      • VIT_PATH: Path to the Vision Transformer model weights.
      • HLORA_PATH: Path to the H-LoRA weights file for visual comprehension.
      • FUSION_LAYER_PATH: Path to your fusion layer weights file.
  3. Run the Script:
    • Execute the script in your terminal to begin inference:
      cd llava/demo
      bash com_infer.sh
      

Alternatively, you can run the Python command directly in your terminal, specifying the paths and parameters yourself. This makes it easy to change the image or question as needed:

python3 com_infer.py \
    --model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
    --dtype "FP16" \
    --hlora_r "64" \
    --hlora_alpha "128" \
    --hlora_nums "4" \
    --vq_idx_nums "8192" \
    --instruct_template "phi3_instruct" \
    --vit_path "openai/clip-vit-large-patch14-336/" \
    --hlora_path "path/to/your/local/com_hlora_weights.bin" \
    --fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
    --question "Your question" \
    --img_path "path/to/image.jpg"
  • Customize the Question and Image: You can modify the --question and --img_path parameters to ask different questions or analyze different images.
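If you want to ask the same question about a whole folder of images, a small driver like the one below can loop over the files. It is a convenience wrapper around com_infer.py, not part of the repository, and the image folder path is a placeholder:

import subprocess
from pathlib import Path

# Hypothetical batch driver: ask the same question about every image in a folder.
question = "Your question"
for img in sorted(Path("path/to/images").glob("*.jpg")):
    subprocess.run([
        "python3", "com_infer.py",
        "--model_name_or_path", "microsoft/Phi-3-mini-4k-instruct",
        "--dtype", "FP16",
        "--hlora_r", "64",
        "--hlora_alpha", "128",
        "--hlora_nums", "4",
        "--vq_idx_nums", "8192",
        "--instruct_template", "phi3_instruct",
        "--vit_path", "openai/clip-vit-large-patch14-336/",
        "--hlora_path", "path/to/your/local/com_hlora_weights.bin",
        "--fusion_layer_path", "path/to/your/local/fusion_layer_weights.bin",
        "--question", question,
        "--img_path", str(img),
    ], check=True)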

Correspondingly, the visual question answering task for HealthGPT-L14 can be executed with the following Python command:

python3 com_infer_phi4.py \
    --model_name_or_path "microsoft/Phi-4" \
    --dtype "FP16" \
    --hlora_r "32" \
    --hlora_alpha "64" \
    --hlora_nums "4" \
    --vq_idx_nums "8192" \
    --instruct_template "phi4_instruct" \
    --vit_path "openai/clip-vit-large-patch14-336/" \
    --hlora_path "path/to/your/local/com_hlora_weights_phi4.bin" \
    --question "Your question" \
    --img_path "path/to/image.jpg"

The weights of com_hlora_weights_phi4.bin can be downloaded here.

Image Reconstruction

Similarly, set HLORA_PATH to point to the gen_hlora_weights.bin file and configure the other model paths. Then you can perform the image reconstruction task using the following script:

cd llava/demo
bash gen_infer.sh

You can also directly execute the following Python command:

python3 gen_infer.py \
    --model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
    --dtype "FP16" \
    --hlora_r "256" \
    --hlora_alpha "512" \
    --hlora_nums "4" \
    --vq_idx_nums "8192" \
    --instruct_template "phi3_instruct" \
    --vit_path "openai/clip-vit-large-patch14-336/" \
    --hlora_path "path/to/your/local/gen_hlora_weights.bin" \
    --fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
    --question "Reconstruct the image." \
    --img_path "path/to/image.jpg" \
    --save_path "path/to/save.jpg"

Server

An interactive chat UI based on Gradio that supports text + image input and returns text or images depending on the selected mode.

📌 Project Introduction

This project provides a Gradio front-end interface that lets users:

  • Analyze image (comprehension task): input text + image, output text
  • Generate image (generation task): input text + image, output image

📦 Install Dependencies

The UI runs on Python and requires Gradio and Pillow:

pip install gradio pillow

▶️ Run the project

Run the following command in the terminal:

python app.py

After it starts, the terminal will print the Gradio access address (e.g., http://127.0.0.1:5010), which you can open in a browser.
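The released app.py is not reproduced here, but as a rough idea of the UI wiring described above, a minimal Gradio app might look like the sketch below. The run_healthgpt stub and the port are illustrative assumptions; the real app calls into the inference scripts instead of returning placeholders.

import gradio as gr

# Hypothetical stand-in for app.py: wire text + image inputs to a placeholder backend.
def run_healthgpt(question, image, mode):
    if mode == "Analyze image (comprehension)":
        return f"[model answer to: {question}]", None  # comprehension: text out
    return None, image                                 # generation: echo the image as a placeholder

with gr.Blocks(title="HealthGPT Demo") as demo:
    mode = gr.Radio(
        ["Analyze image (comprehension)", "Generate image (generation)"],
        value="Analyze image (comprehension)",
        label="Mode",
    )
    question = gr.Textbox(label="Question / Instruction")
    image_in = gr.Image(type="pil", label="Input image")
    text_out = gr.Textbox(label="Text output")
    image_out = gr.Image(label="Image output")
    gr.Button("Run").click(run_healthgpt, [question, image_in, mode], [text_out, image_out])

demo.launch(server_port=5010)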

Example Image

🔗 Citation

If you find this work useful, please consider giving this repository a star and citing our paper as follows:

@misc{lin2025healthgptmedicallargevisionlanguage,
      title={HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation}, 
      author={Tianwei Lin and Wenqiao Zhang and Sijing Li and Yuqian Yuan and Binhe Yu and Haoyuan Li and Wanggui He and Hao Jiang and Mengze Li and Xiaohui Song and Siliang Tang and Jun Xiao and Hui Lin and Yueting Zhuang and Beng Chin Ooi},
      year={2025},
      eprint={2502.09838},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.09838}, 
}

🤝 Acknowledgment

Our project is developed based on the following repositories:

  • LLaVA: Large Language and Vision Assistant
  • LLaVA++: Extending Visual Capabilities with LLaMA-3 and Phi-3
  • Taming Transformers: Taming Transformers for High-Resolution Image Synthesis

βš–οΈ License

This repository is released under the Apache License 2.0.