---
title: Word Count
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
  Returns the total number of words, and the number of unique words in the input data.
---
# Measurement Card for Word Count
## Measurement Description
The `word_count` measurement returns the total number of words and the number of unique words in the input strings, computed with scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
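
To make the two counts concrete, here is a minimal pure-Python sketch of the same idea. It is not the measurement's actual code: it assumes `CountVectorizer`'s default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters), and the helper name `count_words` is hypothetical.

```python
import re

# Assumption: mirrors CountVectorizer's default token pattern, which keeps
# only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_words(data):
    """Hypothetical helper: total and unique word counts for a list of strings."""
    tokens = []
    for text in data:
        # Lowercase first, as CountVectorizer does by default.
        tokens.extend(TOKEN_PATTERN.findall(text.lower()))
    return {"total_word_count": len(tokens), "unique_words": len(set(tokens))}


print(count_words(["hello world and hello moon"]))
# {'total_word_count': 5, 'unique_words': 4}
```

Under these assumptions, single-character tokens such as "a" or "I" are not counted, so results can be lower than a naive whitespace split would give.
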
## How to Use
This measurement requires a list of strings as input:
```python
>>> import evaluate
>>> data = ["hello world and hello moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
```
### Inputs
- **data** (list of `str`): The input list of strings for which the word count is calculated.
- **max_vocab** (`int`): (optional) the maximum number of most frequent words to consider (useful when the dataset is very large).
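
As a rough illustration of what `max_vocab` does, the sketch below keeps only the most frequent words before counting. It assumes that `max_vocab` is handed to `CountVectorizer` as `max_features`, which this card does not confirm. The helper `count_words_capped` is hypothetical, and tie-breaking between equally frequent words may differ from scikit-learn.

```python
import re
from collections import Counter

# Assumption: same default tokenization as CountVectorizer
# (lowercased tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_words_capped(data, max_vocab=None):
    """Hypothetical helper: word counts restricted to the top `max_vocab` words."""
    counts = Counter()
    for text in data:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    # most_common(None) returns every word, so no cap is applied by default.
    kept = counts.most_common(max_vocab)
    return {
        "total_word_count": sum(n for _, n in kept),
        "unique_words": len(kept),
    }


data = ["hello sun and goodbye moon", "foo bar foo bar"]
print(count_words_capped(data))
# {'total_word_count': 9, 'unique_words': 7}
print(count_words_capped(data, max_vocab=2))
# {'total_word_count': 4, 'unique_words': 2}
```

With the cap set to 2, only "foo" and "bar" remain in the vocabulary, so both reported counts shrink.
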
### Output Values
- **total_word_count** (`int`): the total number of words in the input string(s).
- **unique_words** (`int`): the number of unique words in the input string(s).

Output Example(s):
```python
{'total_word_count': 5, 'unique_words': 4}
```
### Examples
Example for a single string:
```python
>>> data = ["hello sun and goodbye moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 5, 'unique_words': 5}
```
Example for multiple strings:
```python
>>> data = ["hello sun and goodbye moon", "foo bar foo bar"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 9, 'unique_words': 7}
```
Example for a dataset from 🤗 Datasets:
```python
>>> import datasets
>>> imdb = datasets.load_dataset('imdb', split='train')
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=imdb['text'])
>>> print(results)
{'total_word_count': 5678573, 'unique_words': 74849}
```
## Citation(s)
## Further References
- [Sklearn `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)