CodeSearchNet Multilingual Tokenizer

Paper: CodeSearchNet Challenge: Evaluating the State of Semantic Code Search (arXiv:1909.09436)
A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.
This tokenizer is based on GPT-2's tokenizer but retrained specifically for source code across multiple programming languages. It provides more efficient tokenization for code compared to general-purpose tokenizers.
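The exact retraining script is not published with this model card; the sketch below shows one common way to derive such a tokenizer from GPT-2's, using `train_new_from_iterator` from the `transformers` library. The toy in-memory corpus and the `vocab_size` are illustrative assumptions, not the actual training setup.

```python
from transformers import AutoTokenizer

# Toy stand-in for the CodeSearchNet functions (assumption: the released
# tokenizer was trained on the full multi-language corpus instead).
corpus = [
    "def add(a, b):\n    return a + b",
    "function add(a, b) { return a + b; }",
    "public int add(int a, int b) { return a + b; }",
]

# Start from GPT-2's byte-level BPE and learn a new vocabulary from code.
base = AutoTokenizer.from_pretrained("openai-community/gpt2")
code_tokenizer = base.train_new_from_iterator(iter(corpus), vocab_size=50257)

code_tokenizer.save_pretrained("code-tokenizer")
```

Starting from GPT-2's tokenizer keeps the byte-level pre-tokenization (so any input can be encoded) while the learned merges adapt to code idioms such as indentation and identifier patterns.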
This tokenizer is designed for preprocessing source code before training or inference with language models. It is particularly useful for:

- training or fine-tuning language models on source code
- code search and retrieval pipelines over the CodeSearchNet languages
- any preprocessing step where compact tokenization of code reduces sequence length
Compared to the original GPT-2 tokenizer, this specialized tokenizer produces more compact encodings of source code (fewer tokens per snippet), since its vocabulary is learned from code rather than from general web text. A token-count comparison sketch follows the usage example below.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)   # list of subword strings
token_ids = tokenizer.encode(code)  # list of vocabulary ids
```
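To make the efficiency claim above concrete, the sketch below counts the tokens produced by the original GPT-2 tokenizer and by this one on the same snippet. The exact ratio will vary with the snippet and the programming language.

```python
from transformers import AutoTokenizer

gpt2 = AutoTokenizer.from_pretrained("openai-community/gpt2")
code_tok = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

snippet = "def add(a, b):\n    return a + b\n"

# Fewer tokens means shorter sequences (and cheaper attention) for the same code.
print("gpt2:", len(gpt2.tokenize(snippet)))
print("code:", len(code_tok.tokenize(snippet)))
```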
Trained on the CodeSearchNet dataset, which contains:

- roughly 6 million functions collected from open-source projects
- about 2 million of those functions paired with natural-language documentation
- code in six languages: Python, Java, JavaScript, PHP, Ruby, and Go
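If you want to inspect the training corpus yourself, CodeSearchNet is available on the Hugging Face Hub. A minimal sketch, assuming the `code_search_net` dataset id (a script-based dataset, so recent `datasets` versions require `trust_remote_code=True`):

```python
from datasets import load_dataset

# Load the Python subset; other configs include "java", "javascript",
# "php", "ruby", "go", and "all".
ds = load_dataset("code_search_net", "python", trust_remote_code=True)

# Each example pairs a function's source with its documentation string.
print(ds["train"][0]["func_code_string"][:200])
```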
```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub},
}

@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```
Base model: openai-community/gpt2