Stealing part of a production language model
ICML · Mar 11, 2024 · Best Paper
We introduce the first model-stealing attack that extracts precise,
nontrivial information from black-box production language models like OpenAI's
ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding
projection layer (up to symmetries) of a transformer model, given typical API
access. For under $20 USD, our attack extracts the entire projection matrix of
OpenAI's Ada and Babbage language models. We thereby confirm, for the first
time, that these black-box models have a hidden dimension of 1024 and 2048,
respectively. We also recover the exact hidden dimension size of the
gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to
recover the entire projection matrix. We conclude with potential defenses and
mitigations, and discuss the implications of possible future work that could
extend our attack.
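
The abstract does not spell out the mechanics, but the core linear-algebra observation can be sketched briefly: every logit vector the model emits lies in the h-dimensional column space of the final projection matrix W, so stacking enough logit vectors and counting the significant singular values reveals the hidden dimension h, and the top singular vectors recover W up to an unknown invertible h-by-h transform (the "symmetries" mentioned above). The sketch below assumes an idealized API that returns full logit vectors; the paper's actual attack must reconstruct these through a more restricted interface, and all function names here are illustrative.

```python
import numpy as np

def estimate_hidden_dim(logit_vectors, tol=1e-4):
    """Estimate the hidden dimension h from a stack of full logit vectors.

    Each element is one l-dimensional logit vector (l = vocab size) obtained
    for a distinct prompt. Because logits come from a rank-h projection
    W (l x h) applied to h-dimensional hidden states, the stacked matrix
    has numerical rank h once more than h vectors are collected.
    """
    Q = np.stack(logit_vectors)                  # shape (n, l), n > h
    s = np.linalg.svd(Q, compute_uv=False)
    # Count singular values above a noise threshold relative to the largest.
    return int(np.sum(s > tol * s[0]))

def recover_projection_up_to_symmetry(logit_vectors, h):
    """Recover W up to an invertible h x h transform.

    The columns of Q.T span the same space as the columns of W, so the
    top-h left singular vectors (scaled by their singular values) equal
    W @ G for some invertible G that logits alone cannot identify.
    """
    Q = np.stack(logit_vectors).T                # shape (l, n)
    U, s, _ = np.linalg.svd(Q, full_matrices=False)
    return U[:, :h] * s[:h]                      # (l, h): W up to symmetry

if __name__ == "__main__":
    # Toy demonstration with a simulated model in place of a real API.
    rng = np.random.default_rng(0)
    l, h, n = 500, 32, 128                       # vocab size, hidden dim, #queries
    W = rng.normal(size=(l, h))                  # the "secret" projection matrix
    logits = [W @ rng.normal(size=h) for _ in range(n)]
    print(estimate_hidden_dim(logits))           # prints 32
```

On this idealized input the hidden dimension falls out of a single SVD; the practical difficulty the paper addresses is obtaining (or reconstructing) full logit vectors from production APIs that expose only top-k log-probabilities and a logit-bias parameter.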