Machine learning algorithms excel primarily in settings where an engineer can first reduce the problem to a particular function (e.g. an image classifier), and then collect a substantial amount of labeled input-output pairs for that function. In drastic contrast, humans are capable of learning from streams of raw sensory data with minimal external instruction. In this talk, I will argue that, in order to build intelligent systems that are as capable as humans, machine learning models should not be trained in the context of one particular application. Instead, we should be designing systems that can be versatile, can learn in unstructured settings without detailed human-provided labels, and can accomplish many tasks, all while processing high-dimensional sensory inputs. To do so, these systems must be able to actively explore and experiment, collecting data themselves rather than relying on detailed human labels.
My talk will focus on two key aspects of this goal: generalization and self-supervision. I will first show how we can move away from hand-designed, task-specific representations of a robot’s environment by enabling the robot to learn high-capacity models, such as deep networks, for representing complex skills from raw pixels. Further, I will present an algorithm that learns deep models that can be rapidly adapted to different objects, new visual concepts, or varying environments, leading to versatile behaviors. Beyond such versatility, a hallmark of human intelligence is self-supervised learning. I will discuss how we can allow a robot to learn by playing with objects in the environment without any human supervision. From this experience, the robot can acquire a visual predictive model of the world that can be used for maneuvering many different objects to varying positions. In all settings, our experiments on simulated and real robot platforms demonstrate the ability to scale to complex, vision-based skills with novel objects.