---
license: mit
base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
tags:
- text-embeddings
- sentence-transformers
- llm2vec
- medical
- chest-xray
- radiology
- clinical-nlp
language:
- en
pipeline_tag: feature-extraction
library_name: transformers
---

# LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

LLM2Vec4CXR is a text encoder optimized for chest X-ray report analysis and medical text understanding.  
It is introduced in our paper [Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays](https://arxiv.org/pdf/2509.15234).

## Model Description

LLM2Vec4CXR is a **bidirectional text encoder** fine-tuned with a `latent_attention` pooling strategy.  
This design enhances semantic representation of chest X-ray reports, making the model robust across different reporting styles and effective even with domain-specific abbreviations.  
It improves performance on clinical text similarity, retrieval, and interpretation tasks.

### Key Features

- **Base Architecture**: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
- **Pooling Mode**: Latent Attention (trained weights automatically loaded)
- **Bidirectional Processing**: Enabled for better context understanding
- **Medical Domain**: Specialized for chest X-ray report analysis
- **Max Length**: 512 tokens
- **Precision**: bfloat16
- **Automatic Loading**: Latent attention weights are automatically loaded from safetensors
- **Simple API**: Built-in methods for similarity computation and instruction-based encoding

## Training Details

### Training Data
- Fully fine-tuned on chest X-ray reports and medical text data
- Training focused on understanding pleural effusion status and other chest X-ray findings

### Training Configuration
- **Pooling Mode**: `latent_attention` (512 latents, 8 attention heads; modified from the base model)
- **Enable Bidirectional**: True
- **Max Length**: 512 tokens
- **Torch Dtype**: bfloat16
- **Full Fine-tuning**: All model weights were updated during training
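
To make the pooling strategy concrete, here is a minimal PyTorch sketch of latent-attention pooling with the hyperparameters above (512 latents, 8 heads, hidden size 2048). It illustrates the general technique only and is not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Illustrative latent-attention pooling: learned latent vectors
    cross-attend over token states, then are averaged into one embedding."""

    def __init__(self, hidden_size=2048, num_latents=512, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_size))
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, token_states):
        # token_states: (batch, seq_len, hidden_size) from the bidirectional LM
        batch = token_states.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, token_states, token_states)
        return attended.mean(dim=1)  # (batch, hidden_size)

pool = LatentAttentionPooling()
pooled = pool(torch.randn(2, 128, 2048))
print(pooled.shape)  # torch.Size([2, 2048])
```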

## Usage

### Installation

```bash
# Only transformers and torch are needed
pip install transformers torch
```

### Basic Usage

```python
import torch
from transformers import AutoModel

# Load the model - that's it!
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Simple text encoding
report = "Small left pleural effusion with basal atelectasis."
embedding = model.encode_text([report])
print(embedding.shape)  # torch.Size([1, 2048])

# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis."
]
embeddings = model.encode_text(reports)
print(embeddings.shape)  # torch.Size([3, 2048])
```
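
Embeddings returned by `encode_text` are ordinary tensors, so downstream comparisons can use standard PyTorch. A minimal sketch (plain `torch.nn.functional`, not a method of this model) computing a pairwise cosine-similarity matrix for the three reports above:

```python
import torch.nn.functional as F

# Pairwise cosine similarities between the three report embeddings above.
normed = F.normalize(embeddings.float(), dim=-1)  # cast from bfloat16 for stability
sim_matrix = normed @ normed.T                    # (3, 3), ones on the diagonal
print(sim_matrix)
```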

### Instruction-Based Encoding and Similarity

```python
import torch
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Instruction-based task with separator
instruction = "Determine the status of the pleural effusion."
report = "There is a small increase in the left-sided effusion."
query = instruction + "!@#$%^&*()" + report

# Compare against multiple candidates
candidates = [
    "No pleural effusion",
    "Pleural effusion present",
    "Worsening pleural effusion",
    "Improving pleural effusion"
]

# One-line similarity computation
scores = model.compute_similarities(query, candidates)
print(scores)
# tensor([0.7171, 0.8270, 0.9155, 0.8113], device='cuda:0')

best_match = candidates[torch.argmax(scores)]
print(f"Best match: {best_match}")
# Best match: Worsening pleural effusion
```

### Medical Report Retrieval Example

```python
import torch
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Instruction for retrieval
instruction = "Retrieve semantically similar reports"
query_report = "Small left pleural effusion with basal atelectasis."
query = instruction + "!@#$%^&*()" + query_report

# Candidate reports
candidates = [
    "No acute cardiopulmonary abnormality.",
    "Small left pleural effusion is present.",
    "Large right pleural effusion causing compressive atelectasis.",
    "Heart size is normal with no evidence of pleural effusion.",
]

# Compute similarities
scores = model.compute_similarities(query, candidates)

# Get most similar
best_idx = torch.argmax(scores)
print(f"Most similar: {candidates[best_idx]}")
print(f"Score: {scores[best_idx]:.4f}")
```

## API Reference

The model provides three main methods:

### `encode_text(texts, max_length=512)`
Simple text encoding for one or more texts.

**Parameters:**
- `texts`: List of strings or single string
- `max_length`: Maximum sequence length (default: 512)

**Returns:** Tensor of shape `(batch_size, 2048)`
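
Both input forms produce a batched tensor; for example (reusing `model` from Basic Usage, with shapes following the Returns note above):

```python
single = model.encode_text("No pneumothorax.")  # a single string is accepted
batch = model.encode_text(["No pneumothorax.", "Stable cardiomegaly."])
print(single.shape, batch.shape)  # expected: torch.Size([1, 2048]) torch.Size([2, 2048])
```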

### `encode_with_separator(texts, separator='!@#$%^&*()', max_length=512)`
Encodes texts in which an instruction and an input report are joined by the separator string, as shown in the usage examples above.

**Parameters:**
- `texts`: List of strings with optional separator
- `separator`: String separator (default: `'!@#$%^&*()'`)
- `max_length`: Maximum sequence length (default: 512)

**Returns:** Tensor of shape `(batch_size, 2048)`
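
A hypothetical call, reusing `model` from Basic Usage (the method name follows the reconstruction above):

```python
# Hypothetical usage; the method name follows the section heading above.
instruction = "Determine the status of the pleural effusion."
report = "There is a small increase in the left-sided effusion."
emb = model.encode_with_separator([instruction + "!@#$%^&*()" + report])
print(emb.shape)  # (1, 2048) per the Returns note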

### `compute_similarities(query_text, candidate_texts, separator='!@#$%^&*()', max_length=512)`
One-call cosine similarity between a query and a list of candidate texts.

**Parameters:**
- `query_text`: Single query string
- `candidate_texts`: List of candidate strings
- `separator`: String separator (default: `'!@#$%^&*()'`)
- `max_length`: Maximum sequence length (default: 512)

**Returns:** Tensor of shape `(num_candidates,)` with cosine similarity scores
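
Functionally, the scores should match encoding the query and candidates separately and comparing them with cosine similarity. A sketch of that manual equivalent (an assumption about the internals, using only the documented `encode_text` method and the `query`/`candidates` variables from the earlier example):

```python
import torch.nn.functional as F

# Assumed manual equivalent: encode query and candidates, then cosine-compare.
query_emb = model.encode_text([query])       # (1, 2048)
cand_embs = model.encode_text(candidates)    # (num_candidates, 2048)
scores = F.cosine_similarity(query_emb.float(), cand_embs.float(), dim=-1)
print(scores)                                # (num_candidates,)
```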

## Technical Specifications

- **Model Type**: Bidirectional Language Model (LLM2Vec)
- **Architecture**: LlamaBiModel (modified Llama 3.2) + Latent Attention Pooling
- **Parameters**: ~1B parameters
- **Hidden Size**: 2048
- **Input Length**: Up to 512 tokens
- **Output Dimension**: 2048
- **Precision**: bfloat16
- **Dependencies**: Only transformers and torch
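
A quick sanity check of the I/O specs above (assuming the model was loaded as in Basic Usage):

```python
emb = model.encode_text(["No acute findings."])
assert emb.shape == (1, 2048)        # output dimension matches the hidden size
assert emb.dtype == torch.bfloat16   # assumption: dtype follows torch_dtype at load
```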

## Intended Use

### Primary Use Cases
- **Medical Text Embeddings**: Generate embeddings for chest X-ray reports
- **Clinical Text Similarity**: Compare medical texts for semantic similarity
- **Medical Information Retrieval**: Find relevant medical reports or findings
- **Clinical NLP Research**: Foundation model for medical text analysis

### Limitations
- Specialized for chest X-ray reports; may not generalize to other medical domains
- Requires careful preprocessing for optimal performance
- Should be used as part of a larger clinical decision support system, not for standalone diagnosis

## Evaluation

The model has been evaluated on chest X-ray report analysis tasks, particularly for:
- Text retrieval and encoding
- Medical text similarity comparison
- Clinical finding extraction

### Sample Performance

The model demonstrates consistent improvements over the base LLM2CLIP architecture on medical text understanding benchmarks.  
**LLM2Vec4CXR** shows stronger performance in:
- Handling medical abbreviations and radiological terminology  
- Capturing fine-grained semantic differences in chest X-ray reports  
- Understanding clinical context and temporal changes

## Related Resources

📄 **Paper**: [Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays](https://arxiv.org/pdf/2509.15234)  

🔗 **Related Projects**:
- [LLM2CLIP4CXR](https://github.com/lukeingawesome/llm2clip4cxr): A CLIP-based model that leverages the LLM2Vec encoder to align visual and textual representations of chest X-rays

## Citation

If you use this model in your research, please cite:

```bibtex
@article{ko2025exploring,
  title={Exploring the Capabilities of LLM Encoders for Image--Text Retrieval in Chest X-rays},
  author={Ko, Hanbin and Cho, Gihun and Baek, Inhyeok and Kim, Donguk and Koo, Joonbeom and Kim, Changi and Lee, Dongheon and Park, Chang Min},
  journal={arXiv preprint arXiv:2509.15234},
  year={2025}
}
```

## Acknowledgments

This model is built upon:
- [LLM2Vec](https://github.com/McGill-NLP/llm2vec) - Framework for converting decoder-only LLMs into text encoders
- [LLM2CLIP](https://github.com/microsoft/LLM2CLIP) - Microsoft's implementation for connecting LLMs with CLIP models

## License

This model is licensed under the MIT License.