+---
+language:
+- english
+thumbnail:
+tags:
+- token classification
+license:
+datasets:
+- EMBO/sd-panels
+metrics:
+-
+---
+# sd-ner
+## Model description
+This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling task on a compendium of English scientific text from the life sciences, the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-panels](https://huggingface.co/datasets/EMBO/sd-panels) dataset to perform Named Entity Recognition of bioentities.
+## Intended uses & limitations
+#### How to use
+The intended use of this model is Named Entity Recognition of the biological entities used in SourceData annotations (https://sourcedata.embo.org): small molecules, gene products (genes and proteins), subcellular components, cell lines and cell types, organs and tissues, species, and experimental methods.
+To quickly check the model:
+```python
+from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification
+example = """<s> F. Western blot of input and eluates of Upf1 domains purification in a Nmd4-HA strain. The band with the # might corresponds to a dimer of Upf1-CH, bands marked with a star correspond to residual signal with the anti-HA antibodies (Nmd4). Fragments in the eluate have a smaller size because the protein A part of the tag was removed by digestion with the TEV protease. G6PDH served as a loading control in the input samples </s>"""
+tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
+model = RobertaForTokenClassification.from_pretrained('EMBO/sd-ner')
+ner = pipeline('ner', model, tokenizer=tokenizer)
+res = ner(example)
+for r in res:
+    print(r['word'], r['entity'])
+```
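The pipeline above prints one tag per token, using the `B-`/`I-`/`O` labels listed under Training procedure. A minimal sketch of merging such tags into entity spans follows. The `group_entities` helper and the sample tags are hypothetical illustrations, not actual model output, and the sketch works on whole words, whereas the pipeline returns RoBERTa sub-word tokens that would first need to be joined.

```python
def group_entities(tagged_tokens):
    """Merge IOB2-tagged (word, tag) pairs into (text, entity_type) spans."""
    spans = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)
            continue
        # Anything else closes the open span, if there is one.
        if current_words:
            spans.append((" ".join(current_words), current_type))
        if prefix in ("B", "I"):
            # B- opens a new span; a stray I- is also treated as a span start.
            current_words, current_type = [word], entity_type
        else:
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Hypothetical tags for illustration only, not real predictions of the model.
tagged = [
    ("Western", "B-EXP_ASSAY"), ("blot", "I-EXP_ASSAY"), ("of", "O"),
    ("Upf1", "B-GENEPROD"), ("in", "O"),
    ("TEV", "B-GENEPROD"), ("protease", "I-GENEPROD"),
]
spans = group_entities(tagged)
for text, entity_type in spans:
    print(entity_type, text)
```

With the pipeline output, the same function could be fed `(r['word'], r['entity'])` pairs once sub-word pieces have been merged back into words.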
+#### Limitations and bias
+The model must be used with the `roberta-base` tokenizer.
+## Training data
+The model was trained for token classification using the [EMBO/sd-panels dataset](https://huggingface.co/datasets/EMBO/sd-panels), which includes manually annotated examples.
+## Training procedure
+The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.
+Training code is available at https://github.com/source-data/soda-roberta
+- Command: `python -m tokcl.train /data/json/sd_panels NER  --num_train_epochs=3.5`
+- Tokenizer vocab size: 50265
+- Training data: EMBO/biolang MLM
+- Training with 31410 examples.
+- Evaluating on 8861 examples.
+- Training on 15 features: O, I-SMALL_MOLECULE, B-SMALL_MOLECULE, I-GENEPROD, B-GENEPROD, I-SUBCELLULAR, B-SUBCELLULAR, I-CELL, B-CELL, I-TISSUE, B-TISSUE, I-ORGANISM, B-ORGANISM, I-EXP_ASSAY, B-EXP_ASSAY
+- Epochs: 3.5
+- `per_device_train_batch_size`: 32
+- `per_device_eval_batch_size`: 32
+- `learning_rate`: 0.0001
+- `weight_decay`: 0.0
+- `adam_beta1`: 0.9
+- `adam_beta2`: 0.999
+- `adam_epsilon`: 1e-08
+- `max_grad_norm`: 1.0
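The 15 features listed above are one `O` label plus an `I-`/`B-` pair for each of the seven entity types. The sketch below rebuilds that list in the order given in the card. The index assignment in `id2label` is an assumption made for illustration; the authoritative mapping is whatever is stored in the model's configuration.

```python
# Entity types in the order they appear in the feature list above.
ENTITY_TYPES = [
    "SMALL_MOLECULE", "GENEPROD", "SUBCELLULAR", "CELL",
    "TISSUE", "ORGANISM", "EXP_ASSAY",
]

# One "outside" label, then an inside/begin pair per entity type.
labels = ["O"]
for entity_type in ENTITY_TYPES:
    labels += [f"I-{entity_type}", f"B-{entity_type}"]

# Assumed index order (illustrative only; check the model config for the real one).
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))
```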
+## Eval results
+On the test set, with `sklearn.metrics`:
+```
+                precision    recall  f1-score   support
+          CELL       0.77      0.81      0.79      3477
+     EXP_ASSAY       0.71      0.70      0.71      7049
+      GENEPROD       0.86      0.90      0.88     16140
+      ORGANISM       0.80      0.82      0.81      2759
+SMALL_MOLECULE       0.78      0.82      0.80      4446
+   SUBCELLULAR       0.71      0.75      0.73      2125
+        TISSUE       0.70      0.75      0.73      1971
+     micro avg       0.79      0.82      0.81     37967
+     macro avg       0.76      0.79      0.78     37967
+  weighted avg       0.79      0.82      0.81     37967
+```
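As a reading aid for the table above, the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class rows, so it is only an approximate consistency check, not the original evaluation. The micro average is not recomputed here.

```python
# Per-class (precision, recall, f1, support), copied from the table above.
rows = {
    "CELL":           (0.77, 0.81, 0.79, 3477),
    "EXP_ASSAY":      (0.71, 0.70, 0.71, 7049),
    "GENEPROD":       (0.86, 0.90, 0.88, 16140),
    "ORGANISM":       (0.80, 0.82, 0.81, 2759),
    "SMALL_MOLECULE": (0.78, 0.82, 0.80, 4446),
    "SUBCELLULAR":    (0.71, 0.75, 0.73, 2125),
    "TISSUE":         (0.70, 0.75, 0.73, 1971),
}

total_support = sum(row[3] for row in rows.values())


def macro(metric_index):
    """Unweighted mean of one metric over all classes."""
    return sum(row[metric_index] for row in rows.values()) / len(rows)


def weighted(metric_index):
    """Mean of one metric over all classes, weighted by class support."""
    return sum(row[metric_index] * row[3] for row in rows.values()) / total_support


# Indices 0, 1, 2 are precision, recall and f1.
macro_avg = tuple(round(macro(i), 2) for i in range(3))
weighted_avg = tuple(round(weighted(i), 2) for i in range(3))

print("support     ", total_support)
print("macro avg   ", macro_avg)
print("weighted avg", weighted_avg)
```

`GENEPROD` has both the largest support and the highest scores, which is why the weighted averages sit above the macro averages.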