AI Control: Improving Safety Despite Intentional Subversion
-
Talk:
AI Control: Improving Safety Despite Intentional Subversion
-
Venue:
252 / BBB
-
Date:
2026-01-27
- Speaker:
-
Time:
15:45
-
Large language models (LLMs) are becoming more capable across a wide range of tasks and can operate autonomously for increasingly long periods of time. Sometimes LLMs take actions that do not align with their operators' intent or values. It is therefore desirable to control LLMs in such a way that they cannot cause harm, even if they actively try to undermine these safety measures. This talk discusses AI control -- a set of techniques for monitoring AI systems so that harmful actions can be detected and prevented. While this is not cryptography itself, it shares similarities with it, such as reasoning about security in adversarial settings and using games to measure security. The talk is based on the paper of the same name: https://arxiv.org/pdf/2312.06942.
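The monitoring idea behind AI control can be sketched as a simple protocol: a weaker but trusted monitor scores the outputs of a stronger untrusted model, and the most suspicious fraction is escalated for human audit. The sketch below is purely illustrative; the scoring heuristic, function names, and audit budget are assumptions for the example, not details from the paper.

```python
# Minimal sketch of a trusted-monitoring protocol (illustrative only).
# A weak "trusted" monitor assigns suspicion scores to outputs of a
# strong "untrusted" model; the top fraction goes to a human auditor.

def trusted_monitor(output: str) -> float:
    """Hypothetical suspicion score in [0, 1]; higher = more suspicious."""
    suspicious_keywords = ["backdoor", "exfiltrate", "disable logging"]
    hits = sum(kw in output.lower() for kw in suspicious_keywords)
    return min(1.0, hits / len(suspicious_keywords))

def run_protocol(outputs: list[str], audit_budget: float = 0.02) -> list[str]:
    """Return the top `audit_budget` fraction of outputs by suspicion score."""
    scored = sorted(outputs, key=trusted_monitor, reverse=True)
    n_audit = max(1, int(len(outputs) * audit_budget))
    return scored[:n_audit]  # these are sent to human review

outputs = [
    "def add(a, b): return a + b",
    "subprocess.run('disable logging', shell=True)  # exfiltrate data",
    "print('hello world')",
]
flagged = run_protocol(outputs, audit_budget=0.34)
print(flagged[0])
```

In the control framework, safety is then measured by playing this protocol as a game against a red team that deliberately tries to slip harmful actions past the monitor within the audit budget.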