Robot Synesthesia: A Sound and Emotion Guided AI Painter
IROS 2024
Abstract
If a picture paints a thousand words, sound may voice a million. While recent
robotic painting and image synthesis methods have made progress in
generating visuals from text inputs, the translation of sound into images
remains largely unexplored. Sound-based interfaces and sonic interactions
have the potential to expand accessibility and control for the user and to
provide a means of conveying complex emotions and the dynamic aspects of the
real world.
In this paper, we propose an approach for using sound and speech to guide a
robotic painting process, which we call robot synesthesia. For general sound,
we encode the simulated paintings and the input sounds into a shared latent
space (see the first sketch below). For speech, we decouple the signal into
its transcribed text and its tone: the text controls the content of the
painting, while emotions estimated from the tone guide its mood (see the
second sketch below).
fully integrated with FRIDA, a robotic painting framework, adding sound and
speech to FRIDA's existing input modalities, such as text and style. In two
surveys, participants identified the emotion or natural sound used to
generate a given painting at a rate more than twice that of random chance.
We qualitatively discuss the results of our sound-guided image manipulation
and music-guided paintings.
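
As a rough illustration of the general-sound pathway, the following minimal
PyTorch sketch aligns a rendered painting with an input sound by maximizing
the cosine similarity of their embeddings in a shared latent space. The
encoder modules, the render_painting function, and the optimization setup are
hypothetical stand-ins (in practice one would use pretrained encoders, e.g. a
CLIP image encoder and a Wav2CLIP-style audio encoder), not FRIDA's actual
implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in encoders mapping images and audio into one latent space.
class ImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, image):
        return self.net(image)

class AudioEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, audio):
        return self.net(audio)

def render_painting(stroke_params):
    """Hypothetical differentiable renderer: stroke parameters -> RGB canvas."""
    return torch.sigmoid(stroke_params).view(1, 3, 64, 64)

image_enc, audio_enc = ImageEncoder(), AudioEncoder()
audio = torch.randn(1, 16000)                         # dummy 1 s waveform
strokes = torch.randn(1, 3 * 64 * 64, requires_grad=True)
optimizer = torch.optim.Adam([strokes], lr=0.05)

with torch.no_grad():
    audio_emb = F.normalize(audio_enc(audio), dim=-1)  # fixed target embedding

for step in range(100):
    optimizer.zero_grad()
    painting = render_painting(strokes)
    img_emb = F.normalize(image_enc(painting), dim=-1)
    # Alignment loss: push the painting's embedding toward the sound's.
    loss = 1.0 - (img_emb * audio_emb).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
```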
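
Similarly, the speech pathway can be summarized as transcription plus
tone-based emotion estimation, combined into a single guidance prompt. The
sketch below uses hypothetical stubs for the ASR model and the tone
classifier (a real system might call a Whisper-style transcriber and a
speech-emotion model); the emotion vocabulary is illustrative, not the
paper's exact set.

```python
from dataclasses import dataclass

# Illustrative emotion vocabulary (an assumption, not the paper's exact set).
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

@dataclass
class SpeechGuidance:
    content: str   # what to paint (from the transcribed text)
    emotion: str   # how it should feel (from the tone)

def transcribe(audio_path: str) -> str:
    """Hypothetical ASR stub; stands in for a real speech-to-text model."""
    return "a sailboat on a stormy sea"

def classify_emotion(audio_path: str) -> str:
    """Hypothetical tone classifier over the waveform, ignoring the words."""
    return "fear"

def decouple_speech(audio_path: str) -> SpeechGuidance:
    # Decoupling: the same utterance yields two independent signals.
    return SpeechGuidance(content=transcribe(audio_path),
                          emotion=classify_emotion(audio_path))

def build_prompt(g: SpeechGuidance) -> str:
    # Text controls content; the estimated emotion guides the mood.
    return f"{g.content}, evoking {g.emotion}"

print(build_prompt(decouple_speech("utterance.wav")))
# -> "a sailboat on a stormy sea, evoking fear"
```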