Machine Learning Research
Breaking Jailbreaks: New E-DPO method strengthens defenses against jailbreak prompts
Jailbreak prompts can prod a large language model (LLM) to overstep its built-in boundaries, for example by answering queries it was trained to refuse. Researchers devised a way to further boost the probability that LLMs respond in ways that respect those limits.
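For context, E-DPO's name suggests it builds on Direct Preference Optimization (DPO), which fine-tunes a model to prefer one response over another (say, a refusal over harmful compliance) without training a separate reward model. The sketch below shows only the standard DPO loss; the function name, arguments, and the safe-versus-unsafe pairing are illustrative assumptions, not details drawn from this article.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over per-sequence log-probabilities.

    Background sketch only: E-DPO presumably modifies this objective,
    but those modifications are not described here.
    """
    # Log-ratio of the trained policy vs. a frozen reference model
    # for the preferred (e.g., refusal) and dispreferred responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's implicit reward above the
    # dispreferred one's, which raises the probability of safe replies.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```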