	---
	license: mit
	base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
	tags:
	- text-embeddings
	- sentence-transformers
	- llm2vec
	- medical
	- chest-xray
	- radiology
	- clinical-nlp
	language:
	- en
	pipeline_tag: feature-extraction
	library_name: transformers
	---

	# LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

	LLM2Vec4CXR is a text encoder optimized for chest X-ray report analysis and medical text understanding.
	It is introduced in our paper [Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays](https://arxiv.org/pdf/2509.15234).

	## Model Description

	LLM2Vec4CXR is a bidirectional text encoder fine-tuned with a `latent_attention` pooling strategy.
	This design enhances semantic representation of chest X-ray reports, making the model robust across different reporting styles and effective even with domain-specific abbreviations.
	It improves performance on clinical text similarity, retrieval, and interpretation tasks.

	### Key Features

	- Base Architecture: LLM2CLIP-Llama-3.2-1B-Instruct
	- Pooling Mode: Latent Attention (trained weights automatically loaded)
	- Bidirectional Processing: Enabled for better context understanding
	- Medical Domain: Specialized for chest X-ray report analysis
	- Max Length: 512 tokens
	- Precision: bfloat16
	- Automatic Loading: Latent attention weights are automatically loaded from safetensors
	- Simple API: Built-in methods for similarity computation and instruction-based encoding

	## Usage

	### Installation

	```bash
	# Only transformers and torch are required
	pip install transformers torch
	```

	### Basic Usage

	```python
	import torch
	from transformers import AutoModel

	# Load the model - that's it!
	model = AutoModel.from_pretrained(
	    "lukeingawesome/llm2vec4cxr",
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	).to("cuda" if torch.cuda.is_available() else "cpu").eval()

	# Simple text encoding
	report = "Small left pleural effusion with basal atelectasis."
	embedding = model.encode_text([report])
	print(embedding.shape) # torch.Size([1, 2048])

	# Multiple texts at once
	reports = [
	    "No acute cardiopulmonary abnormality.",
	    "Small bilateral pleural effusions.",
	    "Large left pleural effusion with compressive atelectasis.",
	]
	embeddings = model.encode_text(reports)
	print(embeddings.shape) # torch.Size([3, 2048])
	```

	### Instruction-Based Encoding and Similarity

	```python
	import torch
	from transformers import AutoModel

	# Load model
	model = AutoModel.from_pretrained(
	    "lukeingawesome/llm2vec4cxr",
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	).to("cuda" if torch.cuda.is_available() else "cpu").eval()

	# Instruction-based task with separator
	instruction = "Determine the status of the pleural effusion."
	report = "There is a small increase in the left-sided effusion."
	query = instruction + "!@#$%^&*()" + report

	# Compare against multiple candidates
	candidates = [
	    "No pleural effusion",
	    "Pleural effusion present",
	    "Worsening pleural effusion",
	    "Improving pleural effusion",
	]

	# One-line similarity computation
	scores = model.compute_similarities(query, candidates)
	print(scores)
	# tensor([0.7171, 0.8270, 0.9155, 0.8113], device='cuda:0')

	best_match = candidates[torch.argmax(scores)]
	print(f"Best match: {best_match}")
	# Best match: Worsening pleural effusion
	```
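
The query format used above (instruction, then separator, then report) can be wrapped in a small helper. This is a minimal sketch: `build_query` and `split_query` are hypothetical names and are not part of the model's API. Only the separator string `!@#$%^&*()` is taken from the example above.

```python
# Hypothetical helpers; only the separator string comes from this README.
SEPARATOR = "!@#$%^&*()"


def build_query(instruction: str, text: str, separator: str = SEPARATOR) -> str:
    """Join an instruction and a report into a single query string."""
    if separator in instruction or separator in text:
        raise ValueError("instruction and text must not contain the separator")
    return f"{instruction.strip()}{separator}{text.strip()}"


def split_query(query: str, separator: str = SEPARATOR) -> tuple[str, str]:
    """Split a query into (instruction, text); instruction is '' if absent."""
    instruction, sep, text = query.partition(separator)
    return (instruction, text) if sep else ("", query)


query = build_query(
    "Determine the status of the pleural effusion.",
    "There is a small increase in the left-sided effusion.",
)
print(query)
```

Rejecting inputs that already contain the separator keeps the split unambiguous. Whether the model itself performs such a check is not stated here.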

	### Medical Report Retrieval Example

	```python
	import torch
	from transformers import AutoModel

	# Load model
	model = AutoModel.from_pretrained(
	    "lukeingawesome/llm2vec4cxr",
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	).to("cuda" if torch.cuda.is_available() else "cpu").eval()

	# Instruction for retrieval
	instruction = "Retrieve semantically similar reports"
	query_report = "Small left pleural effusion with basal atelectasis."
	query = instruction + "!@#$%^&*()" + query_report

	# Candidate reports
	candidates = [
	    "No acute cardiopulmonary abnormality.",
	    "Small left pleural effusion is present.",
	    "Large right pleural effusion causing compressive atelectasis.",
	    "Heart size is normal with no evidence of pleural effusion.",
	]

	# Compute similarities
	scores = model.compute_similarities(query, candidates)

	# Get most similar
	best_idx = torch.argmax(scores)
	print(f"Most similar: {candidates[best_idx]}")
	print(f"Score: {scores[best_idx]:.4f}")
	```
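
The examples above pick a single best candidate with `torch.argmax`. For retrieval, a ranked top-k list is often more useful. The sketch below assumes the scores have already been converted to a plain list of floats (for example with `scores.float().tolist()`). `rank_candidates` is a hypothetical helper, not a method of the model. The score values are copied from the instruction-based example in this README.

```python
def rank_candidates(scores, candidates, k=3):
    """Return the top-k (candidate, score) pairs, highest score first."""
    if len(scores) != len(candidates):
        raise ValueError("scores and candidates must have the same length")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


candidates = [
    "No pleural effusion",
    "Pleural effusion present",
    "Worsening pleural effusion",
    "Improving pleural effusion",
]
# Values copied from the printed tensor in the instruction-based example.
scores = [0.7171, 0.8270, 0.9155, 0.8113]

top = rank_candidates(scores, candidates, k=2)
for text, score in top:
    print(f"{score:.4f}  {text}")
```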

	## API Reference

	The model provides three main methods:

	### `encode_text(texts, max_length=512)`
	Simple text encoding for one or more texts.

	Parameters:
	- `texts`: List of strings or single string
	- `max_length`: Maximum sequence length (default: 512)

	Returns: Tensor of shape `(batch_size, 2048)`
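
For a large corpus it may be preferable to call `encode_text` on fixed-size chunks rather than on the whole list at once. Whether `encode_text` batches internally is not documented here, so the sketch below assumes it does not. `iter_batches` and the batch size of 32 are hypothetical choices for illustration.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Placeholder strings standing in for real reports.
reports = [f"Report {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(reports, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

With the model loaded, `torch.cat([model.encode_text(b) for b in iter_batches(reports)])` should then give a tensor of shape `(len(reports), 2048)`, given the return shape documented above.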

	### Instruction-based encoding
	Encoding for texts in which an instruction and the content are joined by a separator.

	Parameters:
	- `texts`: List of strings with optional separator
	- `separator`: String separator (default: `'!@#$%^&*()'`)
	- `max_length`: Maximum sequence length (default: 512)

	Returns: Tensor of shape `(batch_size, 2048)`

	### `compute_similarities(query_text, candidate_texts, separator='!@#$%^&*()', max_length=512)`
	Computes the similarity between one query and each candidate text.

	Parameters:
	- `query_text`: Single query string
	- `candidate_texts`: List of candidate strings
	- `separator`: String separator (default: `'!@#$%^&*()'`)
	- `max_length`: Maximum sequence length (default: 512)

	Returns: Tensor of shape `(num_candidates,)` with cosine similarity scores
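
For reference, the cosine similarity between a query vector `q` and a candidate vector `c` is `dot(q, c) / (norm(q) * norm(c))`. The sketch below works through that arithmetic on plain Python lists with toy 3-dimensional vectors. It illustrates the formula only and is not the model's implementation, which operates on 2048-dimensional tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (real ones have 2048 dimensions).
query = [1.0, 2.0, 2.0]
candidates = [
    [2.0, 4.0, 4.0],     # same direction as the query
    [2.0, -1.0, 0.0],    # orthogonal to the query
    [-1.0, -2.0, -2.0],  # opposite direction
]
scores = [cosine_similarity(query, c) for c in candidates]
print(scores)  # [1.0, 0.0, -1.0]
```

Scores therefore lie between -1 and 1, and only the direction of the embeddings matters, not their magnitude.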

	## Training Details

	### Training Data
	- Fully fine-tuned on chest X-ray reports and medical text data
	- Training focused on understanding pleural effusion status and other chest X-ray findings

	### Training Configuration
	- Pooling Mode: `latent_attention` (512 latents, 8 attention heads; modified from the base model)
	- Enable Bidirectional: True
	- Max Length: 512 tokens
	- Torch Dtype: bfloat16
	- Full Fine-tuning: All model weights were updated during training

	## Technical Specifications

	- Model Type: Bidirectional Language Model (LLM2Vec)
	- Architecture: LlamaBiModel (modified Llama 3.2) + Latent Attention Pooling
	- Parameters: ~1B parameters
	- Hidden Size: 2048
	- Input Length: Up to 512 tokens
	- Output Dimension: 2048
	- Precision: bfloat16
	- Dependencies: Only transformers and torch
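
The figures above allow a rough storage estimate. The sketch below is back-of-the-envelope arithmetic: it takes "~1B parameters" as exactly 1e9 (the exact count is not given here), ignores framework overhead, and uses a corpus size of 100,000 reports chosen only for illustration.

```python
BYTES_PER_BFLOAT16 = 2          # bfloat16 stores each value in 16 bits
EMBEDDING_DIM = 2048            # output dimension listed above
APPROX_PARAMS = 1_000_000_000   # assumption: "~1B parameters" taken as 1e9


def embedding_bytes(num_texts: int, dim: int = EMBEDDING_DIM,
                    bytes_per_value: int = BYTES_PER_BFLOAT16) -> int:
    """Bytes needed to store `num_texts` embeddings of size `dim`."""
    return num_texts * dim * bytes_per_value


per_report = embedding_bytes(1)
corpus = embedding_bytes(100_000)
weights = APPROX_PARAMS * BYTES_PER_BFLOAT16

print(f"One embedding: {per_report} bytes")          # 4096 bytes
print(f"100k embeddings: {corpus / 1e6:.1f} MB")     # 409.6 MB
print(f"Weights (approx.): {weights / 1e9:.1f} GB")  # 2.0 GB
```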

	## Intended Use

	### Primary Use Cases
	- Medical Text Embeddings: Generate embeddings for chest X-ray reports
	- Clinical Text Similarity: Compare medical texts for semantic similarity
	- Medical Information Retrieval: Find relevant medical reports or findings
	- Clinical NLP Research: Foundation model for medical text analysis

	### Limitations
	- Specialized for chest X-ray reports - may not generalize to other medical domains
	- Requires careful preprocessing for optimal performance
	- Should be used as part of a larger clinical decision support system, not for standalone diagnosis

	## Evaluation

	The model has been evaluated on chest X-ray report analysis tasks, particularly for:
	- Text retrieval and encoding
	- Medical text similarity comparison
	- Clinical finding extraction

	### Sample Performance

	The model demonstrates consistent improvements over the base LLM2CLIP architecture on medical text understanding benchmarks.
	LLM2Vec4CXR shows stronger performance in:
	- Handling medical abbreviations and radiological terminology
	- Capturing fine-grained semantic differences in chest X-ray reports
	- Understanding clinical context and temporal changes

	## Related Resources

	📄 Paper: [Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays](https://arxiv.org/pdf/2509.15234)

	🔗 Related Projects:
	- [LLM2CLIP4CXR](https://github.com/lukeingawesome/llm2clip4cxr): A CLIP-based model that leverages the LLM2Vec encoder to align visual and textual representations of chest X-rays

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@article{ko2025exploring,
	title={Exploring the Capabilities of LLM Encoders for Image--Text Retrieval in Chest X-rays},
	author={Ko, Hanbin and Cho, Gihun and Baek, Inhyeok and Kim, Donguk and Koo, Joonbeom and Kim, Changi and Lee, Dongheon and Park, Chang Min},
	journal={arXiv preprint arXiv:2509.15234},
	year={2025}
	}
	```

	## Acknowledgments

	This model is built upon:
	- [LLM2Vec](https://github.com/McGill-NLP/llm2vec) - Framework for converting decoder-only LLMs into text encoders
	- [LLM2CLIP](https://github.com/microsoft/LLM2CLIP) - Microsoft's implementation for connecting LLMs with CLIP models

	## License

	This model is licensed under the MIT License.