faceless-void/protbert-sequence-unmasking

@@ -1,3 +1,18 @@
 # ProtBERT-Unmasking
 This model is a fine-tuned version of ProtBERT specifically optimized for unmasking protein sequences. It can predict masked amino acids in protein sequences based on the surrounding context.
@@ -8,6 +23,10 @@ This model is a fine-tuned version of ProtBERT specifically optimized for unmask
 - **Task**: Protein Sequence Unmasking
 - **Training**: Fine-tuned on masked protein sequences
 - **Use Case**: Predicting missing or masked amino acids in protein sequences
 ## Usage
@@ -15,20 +34,45 @@ This model is a fine-tuned version of ProtBERT specifically optimized for unmask
 from transformers import AutoModelForMaskedLM, AutoTokenizer
 # Load model and tokenizer
-model = AutoModelForMaskedLM.from_pretrained("faceless-void/protbert-sequence-unmasking")
-tokenizer = AutoTokenizer.from_pretrained("faceless-void/protbert-sequence-unmasking")
-# Example usage
 sequence = "M A L N [MASK] K F G P [MASK] L V R K"  # ProtBERT tokenizers expect space-separated residues
 inputs = tokenizer(sequence, return_tensors="pt")
 outputs = model(**inputs)
 predictions = outputs.logits
 ```
 ## Limitations and Biases
-This model is specifically designed for protein sequence unmasking and may not perform optimally for other protein-related tasks.
 ## Training Details
-The model was fine-tuned from the original ProtBERT model with specific focus on masked sequence prediction.

+---
+language: en
+tags:
+- protein
+- protbert
+- masked-language-modeling
+- bioinformatics
+- sequence-prediction
+datasets:
+- custom
+license: mit
+library_name: transformers
+pipeline_tag: fill-mask
+---
 # ProtBERT-Unmasking
 This model is a fine-tuned version of ProtBERT specifically optimized for unmasking protein sequences. It can predict masked amino acids in protein sequences based on the surrounding context.
 - **Task**: Protein Sequence Unmasking
 - **Training**: Fine-tuned on masked protein sequences
 - **Use Case**: Predicting missing or masked amino acids in protein sequences
+- **Optimal Use**: Performs best on E. coli sequences in which the amino acids K, C, Y, H, S, and M are known
+For detailed information about the training methodology and approach, please refer to our paper:
+[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)
 ## Usage
 ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer
 # Load model and tokenizer
+model = AutoModelForMaskedLM.from_pretrained("faceless-void/protbert-sequence-unmasking")
+tokenizer = AutoTokenizer.from_pretrained("faceless-void/protbert-sequence-unmasking")
+# Example usage for E. coli sequence with known amino acids (K,C,Y,H,S,M)
 sequence = "M A L N [MASK] K F G P [MASK] L V R K"  # ProtBERT tokenizers expect space-separated residues
 inputs = tokenizer(sequence, return_tensors="pt")
 outputs = model(**inputs)
 predictions = outputs.logits
 ```
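ProtBERT-style tokenizers are normally fed one residue per token, separated by spaces, so a compact string such as `MALN[MASK]KFGP` may not tokenize as intended. The sketch below shows one way to convert a compact sequence into that form. The helper name `to_protbert_input` is hypothetical and not part of this repository, and it assumes the fine-tuned tokenizer keeps the base ProtBERT convention.

```python
import re


def to_protbert_input(sequence: str) -> str:
    """Return the sequence with one space between residues, keeping [MASK] intact.

    Hypothetical helper: assumes the tokenizer follows the base ProtBERT
    convention of space-separated, single-letter residues.
    """
    # Match either a whole [MASK] token or any single non-space character.
    tokens = re.findall(r"\[MASK\]|\S", sequence)
    return " ".join(tokens)


print(to_protbert_input("MALN[MASK]KFGP[MASK]LVRK"))
# M A L N [MASK] K F G P [MASK] L V R K
```

Input that is already space-separated passes through unchanged, so the helper can be applied unconditionally before tokenization.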
+## Inference API
+The model is optimized for:
+- **Organism**: E. coli
+- **Known Amino Acids**: K, C, Y, H, S, M
+- **Task**: Predicting unknown amino acids in a sequence
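One reading of the setup above is that residues from the known set stay visible and every other position becomes a `[MASK]` for the model to predict. The sketch below builds such an input under that assumption. The function name `mask_unknown` and the constant `KNOWN_RESIDUES` are hypothetical, and the paper remains the authority on how inputs were actually constructed.

```python
# Known amino acids listed above; treated here as the only visible residues.
KNOWN_RESIDUES = frozenset("KCYHSM")


def mask_unknown(sequence: str, known=KNOWN_RESIDUES, mask_token: str = "[MASK]") -> str:
    """Keep residues in `known` and replace every other residue with the mask token.

    Hypothetical helper: illustrates one way to prepare an input, not the
    authors' preprocessing.
    """
    tokens = [
        residue if residue in known else mask_token
        for residue in sequence.upper()
        if not residue.isspace()
    ]
    # Space-separated output, following the base ProtBERT input convention.
    return " ".join(tokens)


print(mask_unknown("MKTAYC"))
# M K [MASK] [MASK] Y C
```

Real proteins contain many residues outside the known set, so inputs built this way can carry a large number of masks.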
+Example API usage:
+```python
+from transformers import pipeline
+unmasker = pipeline('fill-mask', model='faceless-void/protbert-sequence-unmasking')
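+sequence = "K [MASK] Y H S [MASK]"  # Known amino acids K, Y, H, S; residues are space-separated
+results = unmasker(sequence)
+# With more than one [MASK], the pipeline returns one list of candidates per mask
+for position, candidates in enumerate(results, start=1):
+    for candidate in candidates:
+        print(f"Mask {position}: predicted amino acid: {candidate['token_str']}, Score: {candidate['score']:.3f}")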
+```
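The `fill-mask` pipeline returns a flat list of candidate dicts when the input has one mask and a list of such lists when it has several. The sketch below normalizes both shapes and fills each mask with its highest-scoring candidate. The helper `fill_masks` is hypothetical, it assumes a space-separated input, and the mock results are hand-written placeholders rather than real model predictions.

```python
def fill_masks(sequence: str, results, mask_token: str = "[MASK]") -> str:
    """Replace each mask, left to right, with its highest-scoring candidate."""
    # A single mask yields a flat list of dicts; wrap it so both shapes match.
    if results and isinstance(results[0], dict):
        results = [results]
    per_mask = iter(results)
    filled = []
    for token in sequence.split():
        if token == mask_token:
            candidates = next(per_mask)
            best = max(candidates, key=lambda c: c["score"])
            filled.append(best["token_str"])
        else:
            filled.append(token)
    return " ".join(filled)


# Hand-written stand-in for pipeline output (made-up scores, not real predictions).
mock_results = [
    [{"token_str": "A", "score": 0.6}, {"token_str": "L", "score": 0.3}],
    [{"token_str": "G", "score": 0.5}, {"token_str": "M", "score": 0.4}],
]
print(fill_masks("K [MASK] Y H S [MASK]", mock_results))
# K A Y H S G
```

Taking the top candidate for each mask independently is a simplification: the pipeline scores each mask separately, so the combined sequence is not guaranteed to be the most likely joint completion.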
 ## Limitations and Biases
+- This model is specifically designed for protein sequence unmasking in E. coli
+- Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
+- The model may not perform optimally for:
+  - Sequences from other organisms
+  - Sequences without the specified known amino acids
+  - Other protein-related tasks
 ## Training Details
+The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper:
+[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)