Stealing part of a production language model
ICML 2024
Abstract
We introduce the first model-stealing attack that extracts precise,
nontrivial information from black-box production language models like OpenAI's
ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding
projection layer (up to symmetries) of a transformer model, given typical API
access. We estimate it would cost under $2,000 in queries to
recover the entire projection matrix. We conclude with potential defenses and
mitigations, and discuss the implications of possible future work that could
extend our attack.
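The core observation can be illustrated with a toy simulation (a hedged sketch, not the authors' implementation; all dimensions and the use of NumPy here are illustrative assumptions): a transformer's final logits are the product of a low-dimensional hidden state with the embedding projection matrix, so a stack of observed logit vectors has rank equal to the hidden dimension, and an SVD of that stack recovers the projection layer up to a change of basis (the "symmetries" mentioned above).

```python
import numpy as np

# Toy model of what an API exposing full logits leaks. The secret
# projection matrix W maps hidden states to logits: logits = W @ h.
rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 1000, 64, 256

W = rng.normal(size=(vocab_size, hidden_dim))   # secret projection layer
H = rng.normal(size=(hidden_dim, n_queries))    # hidden states from n prompts
logits = W @ H                                  # logit vectors the attacker observes

# Every logit vector lies in the column span of W, so the numerical
# rank of the stacked logits reveals the hidden dimension.
singular_values = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int((singular_values > 1e-6 * singular_values[0]).sum())
print(recovered_dim)  # 64

# The top singular vectors span the same subspace as W's columns:
# W is recovered up to an unknown hidden_dim x hidden_dim basis change.
U, S, _ = np.linalg.svd(logits, full_matrices=False)
W_hat = U[:, :recovered_dim] * S[:recovered_dim]

# Verify: W's columns lie in the span of the recovered W_hat.
residual = W - W_hat @ np.linalg.lstsq(W_hat, W, rcond=None)[0]
print(np.allclose(residual, 0, atol=1e-8))  # True
```

In practice the attack must first coax full (or enough partial) logit information out of a restricted API, which is where most of the paper's query budget goes; this sketch only shows why the hidden dimension and projection layer are mathematically exposed once logits are available.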