---
title: Comma fixer
emoji: 🤗
colorFrom: red
colorTo: indigo
sdk: docker
sdk_version: 20.10.17
app_file: app.py
pinned: true
app_port: 8000
---
# Comma fixer

This repository contains a web service for fixing comma placement within a given text, for instance:

`"A sentence however, not quite good correct and sound."` -> `"A sentence, however, not quite good, correct and sound."`

It provides a webpage for testing the functionality, a REST API,
and Jupyter notebooks for evaluating and training comma-fixing models.
A web demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/klasocki/comma-fixer).

## Development setup

Deploying the service for local development can be done by running `docker-compose up` in the root directory.
Note that you might have to run `sudo service docker start` first.
The application should then be available at http://localhost:8000.
For the API, see the `openapi.yaml` file.
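
As a quick illustration, the running service can also be called programmatically. The endpoint path and JSON fields used below are assumptions made for the sake of the example; consult `openapi.yaml` for the actual contract.

```python
import requests

# Hypothetical call to the locally running service; the /fix-commas path and
# the JSON fields are illustrative -- check openapi.yaml for the real ones.
response = requests.post(
    "http://localhost:8000/fix-commas",
    json={"s": "A sentence however, not quite good correct and sound."},
)
response.raise_for_status()
print(response.json())
```
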
Docker Compose mounts a volume and watches for changes in the source code, so the application is reloaded
automatically to reflect them.

We use multi-stage builds to reduce the image size, keep the requirements flexible, and ensure that tests are run
before each deployment.
However, while this reduces the size by nearly 3GB, the resulting image still contains deep learning libraries and
pre-downloaded models, and will take around 9GB of disk space.

NOTE: Since the service hosts two large deep learning models, there might be memory issues depending on your
machine, where the terminal running Docker simply crashes.
Should that happen, you can try increasing the resources allocated to Docker, or splitting commands in the Dockerfile,
e.g., running tests one by one.
If everything fails, you can still use the hosted Hugging Face Spaces demo, or follow the steps below and run the app
locally without Docker.

Alternatively, you can set up a Python environment by hand. It is recommended to use a virtualenv. Inside one, run

```bash
pip install -e .[test]
```

The `[test]` option makes sure that test dependencies are installed.
Then, run `python app.py` or `uvicorn --host 0.0.0.0 --port 8000 "app:app" --reload` to start the application.
If you intend to train and evaluate deep learning models, also install with the `[training]` option.

### Running tests

To run the tests, execute

```bash
docker build -t comma-fixer --target test .
```

or `python -m pytest tests/` if you already have a local Python environment.
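
For orientation, an integration-style test could look like the sketch below. It reuses the hypothetical `/fix-commas` endpoint from the earlier example and assumes the service is already running locally, so the path, fields, and assertion are illustrative rather than taken from the actual test suite in `tests/`.

```python
import requests

# Illustrative integration-style test, not part of the repository's test suite.
# Assumes a locally running service and the hypothetical /fix-commas endpoint
# sketched above; see openapi.yaml and tests/ for the real contract.
def test_commas_are_inserted():
    response = requests.post(
        "http://localhost:8000/fix-commas",
        json={"s": "A sentence however, not quite good correct and sound."},
    )
    assert response.status_code == 200
    fixed = response.json()["s"]  # the response field name is an assumption
    assert fixed.count(",") >= 1
```
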

### Deploying to Hugging Face Spaces

In order to deploy the application, one needs to be added as a collaborator to the space and set up a
corresponding git remote.
The application is then continuously deployed on each push.

```bash
git remote add hub https://huggingface.co/spaces/klasocki/comma-fixer
git push hub
```

## Evaluation

In order to evaluate, run `jupyter notebook notebooks/`, or copy the notebooks to a web hosting service with GPUs,
such as Google Colab or Kaggle, and clone this repository there.

We use the [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large)
model as the baseline.
It is a RoBERTa large model fine-tuned for punctuation restoration on a dataset of political speeches
in English, German, French and Italian.
That is, it takes a sentence without any punctuation as input and predicts the missing punctuation as a token
classification task, so the original token structure stays unchanged.
We use a subset of its capabilities, focusing solely on commas and leaving other punctuation intact.
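
To make the baseline concrete, the snippet below shows one way to query the model directly with the `transformers` token-classification pipeline and keep only the comma predictions. It is a sketch rather than the service's implementation, and the assumption that the comma class is labelled `","` should be checked against the model's configuration.

```python
from transformers import pipeline

# Query the baseline punctuation-restoration model directly (sketch only).
# The comma class is assumed to be labelled "," -- verify in the model config.
punctuation = pipeline(
    "token-classification",
    model="oliverguhr/fullstop-punctuation-multilang-large",
)

text = "A sentence however not quite good correct and sound"
for prediction in punctuation(text):
    if prediction["entity"] == ",":
        # the model suggests a comma after the word ending at this character span
        print(prediction["word"], prediction["start"], prediction["end"], prediction["score"])
```
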

The authors report the following token classification F1 scores on commas for different languages on the original
dataset:

| English | German | French | Italian |
|---------|--------|--------|---------|
| 0.819   | 0.945  | 0.831  | 0.798   |

The results of our evaluation of the baseline model out of domain on the English wikitext-103-raw-v1 validation
dataset are as follows:

| Model    | Precision | Recall | F1   | Support |
|----------|-----------|--------|------|---------|
| baseline | 0.79      | 0.72   | 0.75 | 10079   |
| ours*    | 0.84      | 0.84   | 0.84 | 10079   |

*Details of the fine-tuning process are given in the next section.

We treat each comma as a single token instance, as opposed to the original paper, which NER-tags all the tokens of
the word preceding a comma as comma-class tokens.
In our approach, for each comma in the text predicted by the model (see the sketch after this list):

* If it should be there according to the ground truth, it counts as a true positive.
* If it should not be there, it counts as a false positive.
* If a comma from the ground truth is not predicted, it counts as a false negative.
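
The counting rule can be illustrated with a small, self-contained sketch. This is not the repository's evaluation code; it simply identifies each comma by the word it follows, in the spirit of the description above.

```python
def comma_confusion_counts(ground_truth: str, predicted: str):
    """Count comma-level true positives, false positives and false negatives."""
    def comma_positions(text: str) -> set:
        # word indices (counted without commas) that are followed by a comma
        positions, index = set(), 0
        for token in text.split():
            if token.rstrip(","):  # skip stray standalone commas
                if token.endswith(","):
                    positions.add(index)
                index += 1
        return positions

    truth, pred = comma_positions(ground_truth), comma_positions(predicted)
    return len(truth & pred), len(pred - truth), len(truth - pred)


tp, fp, fn = comma_confusion_counts(
    "A sentence, however, not quite good, correct and sound.",
    "A sentence however, not quite good, correct and sound.",
)
precision, recall = tp / (tp + fp), tp / (tp + fn)
print(tp, fp, fn, precision, recall)  # 2 0 1 1.0 0.666...
```
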

## Training

The fine-tuned model is [klasocki/roberta-large-lora-ner-comma-fixer](https://huggingface.co/klasocki/roberta-large-lora-ner-comma-fixer).
Further description can be found in the model card.

To compare with the baseline, we fine-tune the same model, RoBERTa large, on the English wikitext dataset.
We use a similar approach, where we treat comma fixing as a NER problem and, for each token, predict whether a comma
should be inserted after it.
The biggest differences are the dataset, the fact that we focus on commas, and that we use [LoRA](https://arxiv.org/pdf/2106.09685.pdf)
for parameter-efficient fine-tuning of the base model.
The biggest advantage of this approach is that it preserves the input structure and only focuses on commas,
ensuring that nothing else will be changed and that the model does not have to learn to repeat the input back when
no commas should be inserted.
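
As a rough illustration of this setup, the sketch below wires RoBERTa-large for binary token classification ("insert a comma after this token" vs. "do not") with a LoRA adapter via the `peft` library. The rank, alpha, dropout, and label names are illustrative assumptions, not the hyperparameters of the released model.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Sketch: RoBERTa-large as a binary token classifier wrapped with a LoRA
# adapter. All hyperparameters and label names are illustrative.
tokenizer = AutoTokenizer.from_pretrained("roberta-large", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-large",
    num_labels=2,
    id2label={0: "O", 1: "COMMA_AFTER"},
    label2id={"O": 0, "COMMA_AFTER": 1},
)

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows the small fraction of parameters being trained
```
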

We also thought that trying out pre-trained text-to-text or decoder-only LLMs for this task using PEFT could be
interesting, and wanted to check whether we have enough resources for low-rank adaptation or prefix tuning.
While the model would have to learn not to change anything other than commas, and the free-form output could make
evaluation difficult, this approach adds flexibility in case we decide to fix other errors in the future, not
just commas.
However, even with the smallest model from the family, we struggled with CUDA memory errors on the free Google
Colab GPU quotas, and could only train with a batch size of two.
After a short training run, it seems the loss keeps fluctuating and the model only learns to repeat the
original phrase back.
If time permits, we plan to experiment with seq2seq pre-trained models, increase the gradient accumulation steps and
the percentage of data containing commas, and try artificially inserting mistaken commas instead of removing them
during preprocessing.