Specializing Word Embeddings (for Parsing) by Information Bottleneck
EMNLP · Oct 1, 2019 · Best Paper
Pre-trained word embeddings like ELMo and BERT contain rich syntactic and
semantic information, resulting in state-of-the-art performance on various
tasks. We propose a very fast variational information bottleneck (VIB) method
to nonlinearly compress these embeddings, keeping only the information that
helps a discriminative parser. We compress each word embedding to either a
discrete tag or a continuous vector. In the discrete version, our automatically
compressed tags form an alternative tag set: we show experimentally that our
tags capture most of the information in traditional POS tag annotations, but
our tag sequences can be parsed more accurately at the same level of tag
granularity. In the continuous version, we show experimentally that moderately
compressing the word embeddings by our method yields a more accurate parser in
8 of 9 languages, unlike simple dimensionality reduction.
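To make the continuous version concrete, here is a minimal sketch (not the authors' released code) of a VIB layer over pretrained embeddings: an encoder maps each frozen embedding x to a Gaussian posterior over a compressed code z, and training trades the downstream task loss against a KL "rate" term that bounds I(X; Z). The dimensions, the `beta` weight, and the placeholder task loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContinuousVIB(nn.Module):
    """Hypothetical continuous-VIB compressor for pretrained embeddings."""

    def __init__(self, embed_dim: int = 1024, code_dim: int = 64):
        super().__init__()
        # Encoder q(z|x): a diagonal Gaussian posterior over the code.
        self.to_mu = nn.Linear(embed_dim, code_dim)
        self.to_logvar = nn.Linear(embed_dim, code_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.to_mu(x), self.to_logvar(x)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL( q(z|x) || N(0, I) ) upper-bounds the compression rate I(X; Z).
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        return z, kl

vib = ContinuousVIB()
x = torch.randn(32, 1024)      # a batch of (frozen) pretrained embeddings
z, kl = vib(x)
task_loss = torch.zeros(())    # stand-in for the parser's loss computed on z
beta = 1e-3                    # assumed trade-off: larger beta = more compression
loss = task_loss + beta * kl.mean()
loss.backward()
```

In this sketch, sweeping `beta` traces out the compression/accuracy trade-off the abstract describes: at moderate compression the code z discards nuisance variation while retaining the syntactic information the parser needs. The discrete version would instead place a categorical (tag) distribution over z, which is not shown here.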