LAION-5B: An open large-scale dataset for training next generation image-text models
NeurIPS | Oct 16, 2022 | Datasets & Benchmarks Best Paper
Groundbreaking language-vision architectures like CLIP and DALL-E proved the
utility of training on large amounts of noisy image-text data, without relying
on expensive accurate labels used in standard vision unimodal supervised
learning. The resulting models showed strong text-guided image generation and
transfer to downstream tasks, while performing remarkably well at zero-shot
classification with noteworthy out-of-distribution robustness. Since
then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and
Imagen made further improvements. Studying the training and capabilities of
such models requires datasets containing billions of image-text pairs. Until
now, no datasets of this size have been made openly available for the broader
research community. To address this problem and democratize research on
large-scale multi-modal models, we present LAION-5B, a dataset consisting of
5.85 billion CLIP-filtered image-text pairs, of which 2.32 billion contain
English-language text. We show successful replication and fine-tuning of foundational models
like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further
experiments enabled by an openly available dataset of this scale.
Additionally, we provide several nearest neighbor indices, an improved
web interface for dataset exploration and subset generation, and detection
scores for watermark, NSFW, and toxic content. Announcement page:
https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
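
As a rough illustration of what "CLIP-filtered" can mean in practice, the sketch below keeps only image-text pairs whose image and text embeddings have a cosine similarity above a cutoff. The abstract does not spell out the filtering procedure, so everything here is an assumption made for illustration: the function names `cosine_similarity` and `clip_filter`, the `threshold` value, and the toy two-dimensional vectors standing in for real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_filter(pairs, threshold=0.3):
    """Keep (image_embedding, text_embedding, caption) triples scoring at or above threshold.

    The default threshold is a hypothetical value chosen for this example only.
    """
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= threshold]


# Toy stand-ins for CLIP embeddings; real embeddings have hundreds of dimensions.
pairs = [
    ([1.0, 0.0], [0.9, 0.1], "a photo of a cat"),   # nearly parallel vectors: kept
    ([1.0, 0.0], [0.0, 1.0], "click here to buy"),  # orthogonal vectors: dropped
]
kept = clip_filter(pairs)
print([caption for _, _, caption in kept])
```

In a real pipeline the embeddings would come from a pretrained CLIP image encoder and text encoder, and the cutoff would be tuned by inspecting samples on either side of it.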