metadata

datasets:
  - pythainlp/thainer-corpus-v2
language:
  - th
base_model:
  - clicknext/phayathaibert
pipeline_tag: token-classification
library_name: transformers
tags:
  - medical

No Name Thai NER

Compact Thai token-classification model optimized for fast named-entity recognition (NER) and practical de-identification of medical text. This checkpoint was trained to detect entities in Thai clinical and conversational text and is intended for use in context-preserving anonymization pipelines.

At Looloo Health, we're passionate about making healthcare more accessible and affordable for everyone. The model is a core component of our AI Medical Scribe, PresScribe, where it helps ensure patient privacy through automated de-identification. We believe that unlocking the potential of clinical data is key to this goal, and we're excited to share our work with the community.

Features

Detects common sensitive entity types found in medical text (names, phone numbers, IDs, addresses, dates, etc.).
Lightweight and fast to run on CPUs with the Hugging Face transformers pipeline.
Designed to be used as part of a de-identification workflow (post-processing is recommended to merge token-level predictions into spans).
Trained on a synthetic dataset of over 300,000 samples, chosen to improve robustness and generalization.
On our internal test set, we achieved over 95% accuracy for our specific use case.

Supported entity labels

PERSON
PHONE
EMAIL
ADDRESS (sometimes labelled as LOCATION)
DATE
NATIONAL_ID
HOSPITAL_IDS
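
Because predictions are token-level and ADDRESS may also appear as LOCATION, downstream code usually benefits from normalizing raw tags before redaction. The following is a minimal sketch, not part of the released model: the helper names `normalize_label` and `redaction_token` are hypothetical, and it assumes the raw tags carry `B-`/`I-` prefixes and that LOCATION can be treated as an alias of ADDRESS.

```python
from typing import Optional

# Label names taken from the list above.
SUPPORTED_LABELS = {
    "PERSON", "PHONE", "EMAIL", "ADDRESS",
    "DATE", "NATIONAL_ID", "HOSPITAL_IDS",
}

# Assumption: LOCATION is folded into ADDRESS for redaction purposes.
ALIASES = {"LOCATION": "ADDRESS"}


def normalize_label(raw: str) -> Optional[str]:
    """Strip a B-/I- prefix, apply aliases, and return None for unsupported tags."""
    label = raw[2:] if raw[:2] in ("B-", "I-") else raw
    label = ALIASES.get(label, label)
    return label if label in SUPPORTED_LABELS else None


def redaction_token(label: str) -> str:
    """Build an entity-specific placeholder such as [PERSON]."""
    return f"[{label}]"
```

Returning `None` for unknown tags (including `O`) lets the caller skip them explicitly instead of redacting with an unexpected placeholder.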

Quick start

Install minimal dependencies:

pip install -U transformers torch

Load and run the model with Hugging Face pipelines:

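from transformers import pipeline

# device=-1 runs the pipeline on CPU.
ner = pipeline("token-classification", model="loolootech/no-name-ner-th", device=-1)

# Sample doctor-patient dialogue in Thai that mentions two person names (สมชาย, มาร์ค).
text = "คุณสมชายเป็นอะไรมาครับวันนี้ อ๋อวันนี้ปวดตับครับ งั้นวันนี้หมอขอตรวจละเอียดหน่อยนะ ได้เลยครับน้องมาร์ค"

# Each result is a token-level prediction.
results = ner(text)
print(results)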

Notes on post-processing (more details in our example notebook)

The pipeline returns token-level predictions (B-/I- style). For redaction or anonymization, merge adjacent tokens that share a label into full spans before replacing them with entity-specific redaction tokens (e.g. [PERSON], [PHONE]).
When redacting, replace spans from right-to-left or rebuild the output string from slices to avoid offset shifts.
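
The two notes above can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: `merge_spans`, `redact`, and the `max_gap` parameter are hypothetical names, and the input is assumed to be a list of dicts with `entity`, `start`, and `end` keys, as the `transformers` token-classification pipeline returns when no aggregation strategy is set.

```python
from typing import Dict, List


def merge_spans(tokens: List[Dict], max_gap: int = 1) -> List[Dict]:
    """Merge adjacent token predictions that share a label into character spans.

    A B- tag always opens a new span; an I- tag (or an unprefixed tag) extends
    the previous span when the label matches and the tokens are at most
    `max_gap` characters apart (assumption: 1 allows a single space).
    """
    spans: List[Dict] = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        tag = tok["entity"]
        if tag == "O":
            continue
        if tag[:2] in ("B-", "I-"):
            prefix, label = tag[0], tag[2:]
        else:
            prefix, label = "I", tag
        prev = spans[-1] if spans else None
        extends = (
            prev is not None
            and prefix != "B"
            and prev["label"] == label
            and tok["start"] - prev["end"] <= max_gap
        )
        if extends:
            prev["end"] = max(prev["end"], tok["end"])
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return spans


def redact(text: str, spans: List[Dict]) -> str:
    """Replace spans with [LABEL], right-to-left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text
```

With the quick-start objects, this would be called as `redact(text, merge_spans(results))`. The `transformers` pipeline also accepts `aggregation_strategy="simple"`, which performs a similar grouping and returns an `entity_group` key instead of `entity`.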

Disclaimer

This model is intended as an assistive tool for de-identification. It is not a substitute for professional, legal, or medical advice.
Users are fully responsible for ensuring compliance with applicable privacy, legal, and regulatory requirements.
While efforts have been made to improve accuracy, no automated system is 100% reliable. We strongly recommend implementing a regular human review process to validate outputs.

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

For commercial usage, please contact [email protected].

Citation

If you use the model, you can cite it with the following BibTeX entry.

@misc {no_name_ner_th,
    author       = { Atirut Boribalburephan and Chiraphat Boonnag and Knot Pipatsrisawat },
    title        = { no-name-ner-th },
    year         = 2025,
    url          = { https://huggingface.co/loolootech/no-name-ner-th },
    publisher    = { Hugging Face }
}

Acknowledgement

We extend our gratitude to the PhayaThaiBERT team and Pavarissy/phayathaibert-thainer for providing the initial checkpoint for our model, which served as a crucial starting point. We also acknowledge PyThaiNLP for their invaluable contribution of the thainer-corpus-v2 dataset, which was essential for training and evaluation.