Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours
ICRA 2016
Abstract
Current learning-based robot grasping approaches exploit human-labeled
datasets for training the models. However, there are two problems with such a
methodology: (a) since each object can be grasped in multiple ways, manually
labeling grasp locations is not a trivial task; (b) human labeling is biased by
semantics. While there have been attempts to train robots through trial-and-error
experiments, the amount of data collected in such experiments remains small,
making the learner prone to overfitting. In this paper, we take the leap of
increasing the available training data to 40 times that of prior work, yielding a
dataset of 50K data points collected over 700 hours of
robot grasping attempts. This allows us to train a Convolutional Neural Network
(CNN) for the task of predicting grasp locations without severe overfitting. In
our formulation, we recast the regression problem as an 18-way binary
classification over image patches. We also present a multi-stage learning
approach where a CNN trained in one stage is used to collect hard negatives in
subsequent stages. Our experiments clearly show the benefit of using
large-scale datasets (and multi-stage training) for the task of grasping. We
also compare against several baselines and show state-of-the-art performance in
generalizing to unseen objects for grasping.
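
To make the 18-way formulation concrete, below is a minimal sketch (in PyTorch, not the authors' original training code) of a classification head that discretizes the grasp angle into 18 bins of 10 degrees each and predicts, for a given image patch, the probability of grasp success in every bin. The names GraspHead, NUM_ANGLE_BINS, and grasp_loss, as well as the 4096-dimensional feature input, are illustrative assumptions rather than the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ANGLE_BINS = 18  # 180 degrees of grasp angle, discretized into 10-degree bins

class GraspHead(nn.Module):
    """Final layer on top of CNN patch features: one sigmoid output per
    angle bin, read as P(grasp success | patch, angle bin)."""
    def __init__(self, feature_dim: int = 4096):
        super().__init__()
        self.fc = nn.Linear(feature_dim, NUM_ANGLE_BINS)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(features))

def grasp_loss(probs: torch.Tensor, executed_bin: torch.Tensor,
               success: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy applied only to the bin whose angle was
    actually executed; the other 17 outputs get no gradient for that
    sample, since the robot only tried one angle per attempt."""
    p = probs[torch.arange(probs.size(0)), executed_bin]
    return F.binary_cross_entropy(p, success.float())
```

In use, each trial contributes one (patch, executed angle bin, success/failure) triple; the loss supervises only the attempted bin, which is what turns a continuous angle-regression problem into per-bin binary classification.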
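Similarly, a hedged sketch of the multi-stage hard-negative collection loop: the CNN trained in stage k proposes grasps, and attempts where the model was confident of success but the physical grasp failed are kept as hard negatives for stage k+1. The function collect_hard_negatives, the injected execute_grasp callable standing in for the robot pipeline, and the 0.5 confidence threshold are all assumptions for illustration.

```python
from typing import Callable, List, Tuple
import torch

def collect_hard_negatives(
    model: torch.nn.Module,
    patches: List[torch.Tensor],
    execute_grasp: Callable[[torch.Tensor, int], bool],  # robot trial: returns success
    threshold: float = 0.5,
) -> List[Tuple[torch.Tensor, int, int]]:
    """Run the stage-k model over candidate patches; keep confident
    predictions that nevertheless fail on the robot as hard negatives."""
    hard_negatives = []
    model.eval()
    with torch.no_grad():
        for patch in patches:
            probs = model(patch.unsqueeze(0)).squeeze(0)  # 18 per-angle probabilities
            bin_idx = int(probs.argmax())
            if probs[bin_idx].item() > threshold:
                if not execute_grasp(patch, bin_idx):
                    # Confident prediction, failed grasp: a hard negative.
                    hard_negatives.append((patch, bin_idx, 0))  # label 0 = failure
    return hard_negatives
```

The collected triples are then folded into the training set for the next stage, so each stage's mistakes concentrate supervision exactly where the current model is overconfident.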