Update README.md

README.md CHANGED

@@ -107,7 +107,7 @@ SteamSHP-XL gets an average 72.8% accuracy across all domains:
 
 
 
-
+## Biases and Limitations
 
 SteamSHP is trained to predict which of two responses humans will find *more helpful*, not which response is *less harmful*.
 It should not be used to detect toxicity, make ethical judgments, or for a similar purpose.
@@ -119,6 +119,8 @@ The responses that humans collectively found more helpful are also not guarantee
 The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population.
 Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
 
+[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
+
 
 ## Contact
 