# LM-Combiner
All the code and model weights are released at [this link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!
# Model Weights
- cbart_large.zip
  - Weights of the BART baseline model.
- lm_combiner.zip
  - Weights of LM-Combiner trained for the BART baseline on the FCGEC dataset.
# Requirements
The models are implemented with the Hugging Face framework, and the required environment is as follows:
- Python
- torch
- transformers
- datasets
- tqdm
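These dependencies can be installed with `pip install torch transformers datasets tqdm`; no versions are pinned here, so recent releases are assumed.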
For evaluation, we follow the environment configuration of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).
# Training Stage
## Preprocessing
### Baseline Model
- First, we train a baseline model (Chinese BART-large) for LM-Combiner on the FCGEC dataset in the Seq2Seq format; a minimal inference sketch follows the command below.
```bash
sh ./script/run_bart_baseline.sh
```
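A minimal sketch of Seq2Seq correction inference with a Chinese BART checkpoint under the Hugging Face API; the checkpoint name, example sentence, and generation settings are illustrative assumptions, not this repo's exact configuration:

```python
# Minimal Seq2Seq correction sketch. The checkpoint name and generation
# settings are assumptions for illustration, not this repo's configuration.
from transformers import BartForConditionalGeneration, BertTokenizer

# Chinese BART checkpoints (e.g. fnlp/bart-large-chinese) ship with a
# BERT-style tokenizer rather than the standard BART tokenizer.
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-large-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-large-chinese")

src = "他做了一个重要的决订。"  # hypothetical erroneous input sentence
input_ids = tokenizer(src, return_tensors="pt").input_ids

# The baseline reads the source sentence and generates the corrected
# sentence token by token with beam search.
output_ids = model.generate(input_ids, num_beams=5, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```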
### Candidate Datasets
1. Candidate Sentence Generation
   - We use the baseline model to generate candidate sentences for the training and test sets.
   - On tasks where the model fits the training data closely (spelling correction, etc.), we recommend the K-fold cross-inference from the paper, so that candidates for each training sentence are generated by a model that never trained on it; see the sketch after the command below.
```bash
python ./src/predict_bl_tsv.py
```
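A minimal sketch of the K-fold cross-inference idea; `train_model` and `generate_candidates` are hypothetical stand-ins for this repo's training and inference scripts:

```python
# K-fold cross-inference: candidates for each training sentence come from a
# model that never saw that sentence, avoiding overly confident candidates
# on data the baseline has memorized.

def train_model(examples):
    """Hypothetical stand-in for `sh ./script/run_bart_baseline.sh`."""
    return examples  # a real implementation returns a trained model

def generate_candidates(model, src):
    """Hypothetical stand-in for `python ./src/predict_bl_tsv.py`."""
    return src  # a real implementation returns the model's correction

def kfold_cross_inference(pairs, k=5):
    """pairs: list of (source, target) training examples."""
    folds = [pairs[i::k] for i in range(k)]  # simple round-robin split
    out = []
    for i, held_out in enumerate(folds):
        train_split = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train_model(train_split)
        out += [(s, generate_candidates(model, s), t) for s, t in held_out]
    return out
```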
2. Golden Label Merging
   - We use the ChERRANT tool to merge the golden labels into the candidate sentences, fully decoupling the error correction task from the rewriting task; a toy illustration follows the command below.
```bash
python ./scorer_wapper/golden_label_merging.py
```
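To make the merging step concrete, here is a toy sketch that extracts character-level edits between a source sentence and its golden reference using Python's difflib; the actual pipeline relies on ChERRANT's alignment and M2-format edits, so this only illustrates the idea:

```python
# Toy edit extraction between a source sentence and its golden reference.
# The real pipeline uses ChERRANT's alignment and M2 edits; difflib is used
# here only to make the idea concrete.
from difflib import SequenceMatcher

def extract_edits(src: str, gold: str):
    """Return (op, src_span, gold_span) tuples for the non-matching regions."""
    matcher = SequenceMatcher(None, src, gold)
    return [
        (op, src[i1:i2], gold[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

print(extract_edits("他做了一个重要的决订。", "他做了一个重要的决定。"))
# -> [('replace', '订', '定')]
```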
## LM-Combiner (GPT-2)
- Subsequently, we train LM-Combiner on the constructed candidate dataset.
- In particular, we supplement the GPT-2 vocabulary (mainly **double quotes**) to better fit the FCGEC dataset; see `./pt_model/gpt2-base/vocab.txt` for details. A toy sketch of vocabulary supplementation and LM scoring follows the training command below.
```bash
sh ./script/run_lm_combiner.sh
```
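A toy sketch of the two mechanics mentioned above, supplementing a GPT-2 vocabulary and scoring candidates with a causal LM, under the Hugging Face API. Note that LM-Combiner itself is a trained rewriting model that merges candidate corrections, not a plain perplexity ranker, and the checkpoint name below is an assumption:

```python
# Illustrative sketch only: (1) supplementing a GPT-2 vocabulary with extra
# tokens, and (2) scoring candidate sentences with a causal LM. LM-Combiner
# itself is a trained rewriting model, not a perplexity ranker; the
# checkpoint name below is an assumption.
import torch
from transformers import BertTokenizerFast, GPT2LMHeadModel

# Chinese GPT-2 checkpoints such as uer/gpt2-chinese-cluecorpussmall use a
# BERT-style tokenizer.
name = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = BertTokenizerFast.from_pretrained(name)
model = GPT2LMHeadModel.from_pretrained(name)

# (1) Supplement the vocabulary, e.g. with full-width double quotes, and
# resize the embedding matrix to match the enlarged tokenizer.
added = tokenizer.add_tokens(["“", "”"])
if added:
    model.resize_token_embeddings(len(tokenizer))

# (2) Score each candidate by its average token negative log-likelihood.
@torch.no_grad()
def nll(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

candidates = ["他做了一个重要的决定。", "他做了一个重要的决订。"]
print(min(candidates, key=nll))  # the LM prefers the fluent candidate
```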
# Evaluation
- We use the official ChERRANT script to evaluate the model on the FCGEC dev set.
```bash
sh ./script/compute_score.sh
```
|Method|Prec|Rec|F0.5|
|-|-|-|-|
|bart_baseline|28.88|**38.95**|40.46|
|+lm_combiner|**52.15**|37.41|**48.34**|
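For reference, ChERRANT's F0.5 favors precision over recall. A quick sanity check of the metric (the helper below is just the standard F-beta formula, not part of this repo):

```python
# Standard F-beta score (beta = 0.5 emphasizes precision over recall).
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Reproduces the +lm_combiner row of the table above (up to rounding).
print(round(f_beta(52.15, 37.41), 2))  # 48.34
```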
# Citation
If you find this work useful for your research, please cite our paper:
```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan  and
      Wang, Baoxin  and
      Liu, Yijun  and
      Wu, Dayong  and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```