Model Card: English–Faroese Translation Adapter
Model Details
Model Description
- Developed by: Barbara Scalvini
- Model type: Language model adapter for English → Faroese translation
- Language(s): English, Faroese
- License: This adapter inherits the license from the original Llama 3.1 8B model.
- Finetuned from model: meta-llama/Meta-Llama-3.1-8B
- Library used: PEFT 0.13.0
Model Sources
- Paper: [COMING SOON]
Uses
Direct Use
This adapter is intended to perform English→Faroese translation, leveraging a parameter-efficient fine-tuning (PEFT) approach.
Downstream Use
- Can be integrated into broader multilingual or localization workflows.
Out-of-Scope Use
- Any uses that rely on languages other than English or Faroese will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.
Bias, Risks, and Limitations
- Biases: The model could reflect biases present in the training data, such as historical or societal biases in English or Faroese texts.
- Recommendation: Users should critically evaluate outputs, especially in sensitive or high-stakes applications.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the fine-tuned model and tokenizer from the Hugging Face Hub (or a local checkpoint directory)
checkpoint_dir = "barbaroo/llama3.1_translate_8B"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    device_map="auto",
    load_in_8bit=True,  # 8-bit loading requires the bitsandbytes package; drop for full precision
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
# Llama tokenizers ship without a padding token; fall back to EOS so padding and generation work
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
MAX_SEQ_LENGTH = 512
sentences = ["What's your name?"]
# Define the prompt template (same as in training)
alpaca_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""
# Inference loop
for sentence in sentences:
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # Instruction
                sentence,  # The input sentence to translate
                "",  # Leave blank for generation
            )
        ],
        return_tensors="pt",
        padding=True,
        truncation=True,  # Make sure the input is not too long
        max_length=MAX_SEQ_LENGTH  # Enforce the max length if necessary
    ).to("cuda")
    # Generate the translation
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,                   # Limit the number of new tokens generated
        eos_token_id=tokenizer.eos_token_id,  # Stop at the end-of-sequence token
        pad_token_id=tokenizer.pad_token_id,  # Padding token for batched generation
        do_sample=True,                       # Enable sampling so temperature/top_p take effect
        temperature=0.1,                      # Low temperature keeps output close to greedy
        top_p=1.0,                            # No nucleus truncation
        use_cache=True                        # Reuse the KV cache for faster decoding
    )
    # Decode the generated tokens; the output contains the prompt, so keep only the text after "### Response:"
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    translation = output_string.split("### Response:")[-1].strip()
    print(f"Input: {sentence}")
    print(f"Generated Translation: {translation}")
Training Details
Training Data
We used the Sprotin parallel corpus for English–Faroese translation: barbaroo/Sprotin_parallel.
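For a quick look at the corpus, it can be loaded with the datasets library (the split and column layout follow the dataset repository and are not restated here):

from datasets import load_dataset

# Inspect the splits and columns of the parallel corpus
dataset = load_dataset("barbaroo/Sprotin_parallel")
print(dataset)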
Training Procedure
Preprocessing
- Tokenization: We used the tokenizer from the base model meta-llama/Llama-3.1-8B.
- The Alpaca prompt format was used, with Instruction, Input, and Response fields (see the sketch after this list).
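As a rough illustration, one corpus pair would be rendered into the template as shown below; the Faroese target is left as a placeholder, and appending the tokenizer's EOS token is an assumption about the training setup:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

alpaca_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""

source = "What's your name?"   # English side of a corpus pair
target = "..."                 # corresponding Faroese reference from the corpus
training_text = alpaca_prompt.format(
    "Translate this sentence from English to Faroese:",
    source,
    target,
) + tokenizer.eos_token  # mark where the translation ends
print(training_text)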
Training Hyperparameters
- Epochs: 3 total, with an early stopping criterion monitoring validation loss.
- Batch size: 2, with 4 gradient accumulation steps (effective batch size of 8).
- Learning Rate: 2e-4
- Optimizer: AdamW with a linear learning-rate scheduler and warm-up. A hedged mapping of these settings to transformers training arguments is sketched below.
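The following is a minimal sketch of how these settings could be expressed with transformers' TrainingArguments and an early-stopping callback; the actual training script (including the LoRA/PEFT configuration) is not reproduced here, and the warm-up ratio, evaluation cadence, and patience are assumptions:

from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="llama3.1_translate_8B",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,      # effective batch size of 8 per device
    learning_rate=2e-4,
    optim="adamw_torch",                # AdamW optimizer
    lr_scheduler_type="linear",         # linear decay after warm-up
    warmup_ratio=0.03,                  # assumed warm-up fraction
    eval_strategy="steps",              # "evaluation_strategy" in older transformers releases
    load_best_model_at_end=True,        # restore the best checkpoint when early stopping triggers
    metric_for_best_model="eval_loss",  # monitor validation loss
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # assumed patience

The early-stopping callback is then passed to the trainer via its callbacks argument.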
Evaluation
Testing Data, Factors & Metrics
Testing Data
- The model was evaluated on the FLORES-200 benchmark (approximately 1,012 English–Faroese sentence pairs).
Metrics and Results
- BLEU: 0.175
- chrF: 49.5
- BERTScore F1: 0.948
Human evaluation was also performed (see the paper).
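For reference, the automatic metrics can be recomputed with standard tooling such as sacrebleu and bert-score; note that sacrebleu reports BLEU and chrF on a 0–100 scale, and the BERTScore model is not stated in this card, so bert-base-multilingual-cased below is an assumption:

import sacrebleu
from bert_score import score

hypotheses = ["..."]  # model translations of the FLORES-200 English sentences
references = ["..."]  # Faroese reference translations

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
P, R, F1 = score(hypotheses, references, model_type="bert-base-multilingual-cased")

print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}  BERTScore F1: {F1.mean().item():.3f}")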
Citation
[COMING SOON]
Framework versions
- PEFT 0.13.0