<h1 style='color: purple;'> Using on your data </h1>

Source code is available as a pip-installable Python package.

## Installation

Use of a virtual environment is recommended. Create one with conda:
```bash
conda create -n selfrank python=3.10
```

Activate the virtual environment:

```bash
conda activate selfrank
```

and then install the package:

```bash
pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
```
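To verify the installation, the imports used in the usage examples below should succeed; a quick sanity check could look like this:

```python
# Sanity check: these imports should succeed once installation is complete
from selfrank.algos.iterative import SelfRank
from selfrank.algos.greedy import SelfRankGreedy
from selfrank.algos.triplet import rouge, equality

print("selfrank is ready to use")
```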
## Usage
Start by gathering model inferences for the same question/prompt across all models you want to rank. The ranking method expects a pandas DataFrame with a row for each prompt and a column for each model, i.e.
|     | M1  | M2  | M3  | ... |
|:----|:----|:----|:----|:----|
| Q1  | a   | a   | b   | ... |
| Q2  | a   | b   | b   | ... |
| ... | ... | ... | ... | ... |
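For example, if each model's outputs are collected in lists aligned by prompt, the DataFrame can be assembled along these lines (the model names and outputs below are purely illustrative):

```python
import pandas as pd

# Illustrative only: index i in every list corresponds to the same prompt
outputs = {
    "M1": ["a", "a"],
    "M2": ["a", "b"],
    "M3": ["b", "b"],
}

df = pd.DataFrame(outputs)

# Save without the index so that every column in the CSV is a model,
# matching how the file is read in the snippet below
df.to_csv("inferences.csv", index=False)
```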
With this data, the self-ranking procedure can be invoked as follows:
```python
import pandas as pd

from selfrank.algos.iterative import SelfRank      # the full ranking algorithm
from selfrank.algos.greedy import SelfRankGreedy   # the greedy version
from selfrank.algos.triplet import rouge, equality # built-in evaluation functions

f = "inferences.csv"
df = pd.read_csv(f)

models_to_rank = df.columns.tolist()  # one column per model
evaluator = rouge                     # function the judge model uses to compare outputs
true_ranking = None                   # pass a list of model names if the true ranking is known

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)

r.fit(df)
print(r.ranking)
```
This should output the estimated ranking (best to worst): `['M5', 'M2', 'M1', ...]`. If the true ranking is known, evaluation measures can be computed with `r.measure(metric='rbo')` (for rank-biased overlap) or `r.measure(metric='mapk')` (for mean average precision).
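For instance, if a ground-truth ordering is available, it can be supplied when constructing the ranker and then compared against the estimate; a minimal sketch (the ordering below is a placeholder):

```python
# Placeholder ground-truth ordering (best to worst); it must use the same
# model names as models_to_rank
true_ranking = ["M5", "M2", "M1", "M4", "M3"]

r = SelfRank(models_to_rank, evaluator, true_ranking)
r.fit(df)

print(r.ranking)                 # estimated ranking
print(r.measure(metric="rbo"))   # rank-biased overlap vs. true_ranking
print(r.measure(metric="mapk"))  # mean average precision vs. true_ranking
```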
We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. While `rouge` is recommended for generative tasks like summarization, `equality` is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.
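In code, the choice simply determines which evaluator is passed to the ranker, for example:

```python
from selfrank.algos.triplet import rouge, equality

# Pick the evaluator to match the task:
evaluator = rouge      # generative tasks, e.g. summarization
# evaluator = equality # multiple-choice (e.g. MMLU) or classification tasks
```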
You can also pass any custom function to the ranker, as long as it has the following signature:
```python
import pandas as pd

def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c to evaluate a vs. b.
    df: a DataFrame with the inferences of all models.
    Returns 1 if a is preferred, 0 if b is preferred.
    """
    # In this example, we count the number of times a/b agrees with c
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~(ties))
    b_wins = sum((df[b] == df[c]) & ~(ties))
    if a_wins >= b_wins:
        return 1
    else:
        return 0
```
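The custom function can then be passed to the ranker in place of the built-in evaluators, for example:

```python
# Use the custom pairwise evaluator defined above instead of rouge/equality
r = SelfRank(models_to_rank, user_function, None)
r.fit(df)
print(r.ranking)
```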