metadata
datasets:
  - pythainlp/thainer-corpus-v2
language:
  - th
base_model:
  - clicknext/phayathaibert
pipeline_tag: token-classification
library_name: transformers
tags:
  - medical

No Name Thai NER

Compact Thai token-classification model optimized for fast named-entity recognition (NER) and practical de-identification of medical text. This checkpoint was trained for robust entity detection on Thai clinical and conversational text and is intended for use in context-preserving anonymization pipelines.

At Looloo Health, we're passionate about making healthcare more accessible and affordable for everyone. The model is a core component of our AI Medical Scribe, PresScribe, where it helps ensure patient privacy through automated de-identification. We believe that unlocking the potential of clinical data is key to this goal, and we're excited to share our work with the community.

Features

  • Detects common sensitive entity types found in medical text (names, phone numbers, IDs, addresses, dates, etc.).
  • Lightweight and fast to run on CPUs with the Hugging Face transformers pipeline.
  • Designed to be used as part of a de-identification workflow (post-processing is recommended to merge token-level predictions into spans).
  • Trained on a synthetic dataset of over 300,000 samples, chosen to improve robustness and generalization.
  • Achieved over 95% accuracy on our internal test set for our de-identification use case.

Supported entity labels

  • PERSON
  • PHONE
  • EMAIL
  • ADDRESS (sometimes labelled as LOCATION)
  • DATE
  • NATIONAL_ID
  • HOSPITAL_IDS
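
Because ADDRESS may also appear as LOCATION, downstream code may want to map tags to one canonical label before redacting. The following is a minimal sketch of that idea, not part of the model's API: the alias table, `canonical_label`, and `placeholder` are hypothetical names introduced here, and the B-/I- prefix handling assumes the tagging style described in the post-processing notes.

```python
# Hypothetical alias table: the model card only states that ADDRESS
# is sometimes labelled LOCATION. Extend it if you observe other aliases.
LABEL_ALIASES = {"LOCATION": "ADDRESS"}


def canonical_label(tag: str) -> str:
    """Strip a leading B-/I- prefix and map known aliases to one label."""
    label = tag[2:] if tag[:2] in ("B-", "I-") else tag
    return LABEL_ALIASES.get(label, label)


def placeholder(tag: str) -> str:
    """Build a redaction token such as [PERSON] from a raw tag."""
    return f"[{canonical_label(tag)}]"


print(canonical_label("B-LOCATION"))  # ADDRESS
print(placeholder("I-NATIONAL_ID"))   # [NATIONAL_ID]
```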

Quick start

Install minimal dependencies:

pip install -U transformers torch

Load and run the model with Hugging Face pipelines:

from transformers import pipeline

# device=-1 runs inference on the CPU; pass a GPU index such as 0 to use CUDA.
ner = pipeline("token-classification", model="loolootech/no-name-ner-th", device=-1)
text = "คุณสมชายเป็นอะไรมาครับวันนี้ อ๋อวันนี้ปวดตับครับ งั้นวันนี้หมอขอตรวจละเอียดหน่อยนะ ได้เลยครับน้องมาร์ค"
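# Each result is a dict for one predicted token, with keys such as entity, score, word, start, and end.
results = ner(text)
print(results)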
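
As an alternative to merging tokens by hand, the transformers token-classification pipeline accepts an `aggregation_strategy` argument that groups token-level predictions into entities. The sketch below assumes that `"simple"` aggregation works acceptably with this checkpoint, which has not been verified here. `load_grouped_ner` and `texts_by_label` are hypothetical helper names.

```python
from typing import Any, Dict, List


def load_grouped_ner(model_id: str = "loolootech/no-name-ner-th"):
    """Build a pipeline that returns grouped entities instead of single tokens."""
    # Imported lazily so texts_by_label can be used without loading transformers.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # group B-/I- tokens into one entity
        device=-1,  # CPU
    )


def texts_by_label(text: str, entities: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect the matched substrings of `text` under each entity label.

    Expects grouped pipeline output, where each dict carries
    `entity_group`, `start`, and `end`.
    """
    grouped: Dict[str, List[str]] = {}
    for entity in entities:
        snippet = text[entity["start"]:entity["end"]]
        grouped.setdefault(entity["entity_group"], []).append(snippet)
    return grouped
```

Typical use would be `ner = load_grouped_ner()` followed by `texts_by_label(text, ner(text))`. Slicing the original string by `start` and `end` avoids relying on the `word` field, which may contain tokenizer artifacts.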

Notes on post-processing (see our example notebook for more details)

  • The pipeline returns token-level predictions (B-/I- style). For redaction or anonymization you should merge adjacent tokens with the same label to form full spans before replacing with entity-specific redaction tokens (e.g. [PERSON], [PHONE]).
  • When redacting, replace spans from right-to-left or rebuild the output string from slices to avoid offset shifts.
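
The two notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `merge_spans` and `redact` are hypothetical names, the sample predictions are made up to mimic the pipeline's output shape (`entity`, `start`, `end`), and tokens are merged whenever they share a label and touch, regardless of the B-/I- prefix.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def merge_spans(predictions: List[Dict[str, Any]], max_gap: int = 0) -> List[Span]:
    """Merge adjacent token predictions that share a label into character spans."""
    spans: List[Span] = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        tag = pred["entity"]
        label = tag[2:] if tag[:2] in ("B-", "I-") else tag
        if label == "O":
            continue  # not an entity
        start, end = pred["start"], pred["end"]
        if spans and spans[-1][2] == label and start - spans[-1][1] <= max_gap:
            # Same label and touching the previous span: extend it.
            spans[-1] = (spans[-1][0], max(end, spans[-1][1]), label)
        else:
            spans.append((start, end, label))
    return spans


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with [LABEL], working right to left so offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Made-up token-level predictions for illustration only.
sample_text = "Call Somchai at 0812345678"
sample_predictions = [
    {"entity": "B-PERSON", "start": 5, "end": 8},
    {"entity": "I-PERSON", "start": 8, "end": 12},
    {"entity": "B-PHONE", "start": 16, "end": 20},
    {"entity": "I-PHONE", "start": 20, "end": 26},
]
print(redact(sample_text, merge_spans(sample_predictions)))
# Call [PERSON] at [PHONE]
```

Merging on label alone means two different entities of the same type that sit directly next to each other collapse into one placeholder, which is usually acceptable for redaction. Set `max_gap=1` to also bridge a single separating character, such as a space inside a name.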

Disclaimer

  • This model is intended as an assistive tool for de-identification. It is not a substitute for professional, legal, or medical advice.

  • Users are fully responsible for ensuring compliance with applicable privacy, legal, and regulatory requirements.

  • While efforts have been made to improve accuracy, no automated system is 100% reliable. We strongly recommend implementing a regular human review process to validate outputs.

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

Citation

If you use the model, please cite it with the following BibTeX entry.

@misc {no_name_ner_th,
    author       = { Atirut Boribalburephan and Chiraphat Boonnag and Knot Pipatsrisawat },
    title        = { no-name-ner-th },
    year         = 2025,
    url          = { https://huggingface.co/loolootech/no-name-ner-th },
    publisher    = { Hugging Face }
}

Acknowledgement

We extend our gratitude to the PhayaThaiBERT team and Pavarissy/phayathaibert-thainer for providing the initial checkpoint for our model, which served as a crucial starting point. We also acknowledge PyThaiNLP for their invaluable contribution of the thainer-corpus-v2 dataset, which was essential for training and evaluation.