Variable-Viewpoint Representations for Object Recognition
Object recognition is an essential task in computer vision, one in which deep learning models trained on large-scale datasets have made remarkable achievements. However, while datasets like ImageNet often span large numbers of instances per category, relatively few such datasets contain viewpoint-rich information about specific object instances. Conversely, 3D datasets like ModelNet can simulate viewpoint-rich information about an object instance, but these data are based on CAD models rather than real-world objects. In contrast to the types of training data available in most current object recognition datasets, humans learn to recognize many objects during infancy through extended, embodied experience with a few, highly familiar object instances (e.g., using a favorite "cup" multiple times a day for months). Inspired by research from developmental psychology showing the importance of egocentric and viewpoint-rich experiences in human learning, this dissertation explores three key aspects of knowledge representations for computational object recognition. First, I present the new Toybox dataset, which contains viewpoint-rich and instance-rich video recordings of 360 handheld object instances across 12 categories. The Toybox dataset fills important gaps in currently available object recognition datasets, supporting research on how varying training data along these dimensions affects the recognition performance of deep learning models, and on how human recognition performance varies along the same dimensions. Second, I propose a framework called Variable-Viewpoint (V2) representations that unifies common computational approaches to object recognition by parameterizing representations according to the number, resolution, and spatial distribution of viewpoints.
I demonstrate how the V2 framework enables systematic comparisons of different representational approaches using the ModelNet dataset, and how integrating multiple levels of representation within the V2 framework can support strong recognition performance by current deep learning models. Third, I describe how the Toybox dataset, which contains data structured by both viewpoint and time, can be leveraged along with methods from time-series analysis to enhance the interpretation of intermediate representations in deep learning models.
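To make the V2 parameterization concrete, the idea of describing a representation by its number, resolution, and spatial distribution of viewpoints can be sketched as a small data structure. The names below (`V2Spec`, `ring_viewpoints`) and the ring layout are illustrative assumptions for exposition, not the dissertation's actual implementation:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class V2Spec:
    """Hypothetical parameterization of a Variable-Viewpoint (V2) representation."""
    num_views: int     # number of viewpoints sampled around the object
    resolution: int    # pixels per side of each rendered view
    distribution: str  # spatial layout of viewpoints, e.g. "ring" or "sphere"


def ring_viewpoints(spec: V2Spec, radius: float = 1.0):
    """Return (x, y, z) camera positions evenly spaced on a horizontal ring."""
    assert spec.distribution == "ring"
    views = []
    for i in range(spec.num_views):
        theta = 2.0 * math.pi * i / spec.num_views
        views.append((radius * math.cos(theta),
                      radius * math.sin(theta),
                      0.0))
    return views


# A multi-view setup in the style of ModelNet renderings: 12 views on a ring,
# each rendered at 224x224.
spec = V2Spec(num_views=12, resolution=224, distribution="ring")
cams = ring_viewpoints(spec)
print(len(cams))  # 12 camera positions, one per viewpoint
```

Varying `num_views`, `resolution`, and `distribution` in a structure like this is one way to traverse the space of representations that the V2 framework compares.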