Update README.md

README.md CHANGED

@@ -107,7 +107,7 @@ SteamSHP-XL gets an average 72.8% accuracy across all domains:
 
 
 
-
+## Biases and Limitations
 
 SteamSHP is trained to predict which of two responses humans will find *more helpful*, not which response is *less harmful*.
 It should not be used to detect toxicity, make ethical judgments, or for a similar purpose.
@@ -119,6 +119,8 @@ The responses that humans collectively found more helpful are also not guarantee
 The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population.
 Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
 
+[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
+
 
 ## Contact
 