Structuring Representation Geometry in Self-Supervised Learning
Author(s)
Gupta, Sharut
Advisor
Jegelka, Stefanie
Abstract
The central promise of deep learning is to learn a map 𝑓 : 𝒳 → ℝ^𝑑 that transforms objects in 𝒳—represented in their raw perceptual forms, such as images or molecular strings—into a representation space ℝ^𝑑 where everything that is hard to do with raw perceptual data becomes easy. For instance, measuring the similarity between two objects 𝑥₁, 𝑥₂ ∈ 𝒳 expressed as tensors of pixel intensities is non-trivial in their raw form, but becomes straightforward if 𝑓 maps these objects to a space where simple Euclidean distances ‖𝑓(𝑥₁) − 𝑓(𝑥₂)‖₂ are meaningful measures of similarity. While this simple recipe has shown standout success in a range of tasks, certain applications require representations that encode richer structural relationships beyond pairwise similarity. For instance, tasks built on relational information—such as “𝑋 is a parent of 𝑌” or “𝐴 is a treatment for 𝐵”—require embedding spaces that capture such relations explicitly. In this thesis, we explore what 𝑓 should encode in order to be useful for a range of unknown downstream tasks, from the point of view of the geometric structure of the representation space. We investigate this question in the context of self-supervised learning, a paradigm that extracts meaningful representations by leveraging the structure of the data itself, without relying on explicit labels. Specifically, we propose adding geometric structure to the embedding space by enforcing that transformations of the input space correspond to simple (i.e., linear) transformations in the embedding space. To this end, we introduce an equivariance objective and theoretically prove that its minima force transformations of the input space to correspond to rotations of the spherical embedding space. Our proposed method significantly improves performance on downstream tasks and ensures sensitivity of the embedding space to important variations in the data (e.g., color, rotation) that existing contrastive methods fail to capture.
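The equivariance idea in the abstract can be sketched in a few lines: an encoder maps inputs onto the unit sphere, and a loss penalizes the gap between the embedding of a transformed input, 𝑓(𝑡(𝑥)), and a rotation 𝑅 applied to the embedding of the original, 𝑅𝑓(𝑥). The toy linear encoder, the names `encoder` and `equivariance_loss`, and the fixed rotation matrix below are illustrative assumptions, not the thesis's actual architecture or training setup.

```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit sphere, as the spherical
    # embedding space in the abstract requires.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def encoder(x, W):
    # Toy linear encoder f: X -> S^(d-1). A real instantiation would
    # be a deep network; this stands in only for illustration.
    return normalize(x @ W)

def equivariance_loss(x, t_x, W, R):
    # Penalize the mismatch between f(t(x)) and R f(x): the input
    # transformation t should correspond to the rotation R in
    # embedding space. R is assumed orthogonal (a rotation).
    z = encoder(x, W)      # embeddings of the original inputs
    z_t = encoder(t_x, W)  # embeddings of the transformed inputs
    return np.mean(np.sum((z_t - z @ R.T) ** 2, axis=-1))
```

When `t` is the identity transformation and `R` is the identity rotation, the loss is exactly zero; during training, `R` (or a map producing it from the transformation's parameters) would be learned jointly with the encoder.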
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology