Structuring Representation Geometry in Self-Supervised Learning
Author(s)
Gupta, Sharut
Advisor
Jegelka, Stefanie
Abstract
The central promise of deep learning is to learn a map 𝑓 : 𝒳 → ℝ^𝑑 that transforms objects in 𝒳—represented in their raw perceptual forms, such as images or molecular strings—into a representation space ℝ^𝑑 where everything that is hard to do with raw perceptual data becomes easy. For instance, measuring the similarity between two objects 𝑥₁, 𝑥₂ ∈ 𝒳 expressed as tensors of pixel intensities is non-trivial in their raw form, but becomes straightforward if 𝑓 maps these objects to a space where simple Euclidean distances ‖𝑓(𝑥₁) − 𝑓(𝑥₂)‖₂ are meaningful measures of similarity. While this simple recipe has shown standout success in a range of tasks, certain applications require representations that encode richer structural relationships beyond pairwise similarity. For instance, tasks built on relational information—such as “𝑋 is a parent of 𝑌” or “𝐴 is a treatment for 𝐵”—require embedding spaces that capture such relations explicitly. In this thesis, we explore what 𝑓 should encode in order to be useful for a range of unknown downstream tasks, from the point of view of the geometric structure of the representation space. We investigate this question in the context of self-supervised learning, a paradigm that extracts meaningful representations by leveraging the structure of the data itself, without relying on explicit labels. Specifically, we propose adding geometric structure to the embedding space by enforcing that transformations of the input space correspond to simple (i.e., linear) transformations in the embedding space. To this end, we introduce an equivariance objective and theoretically prove that its minima force transformations of the input space to correspond to rotations of the spherical embedding space. Our proposed method significantly improves performance on downstream tasks and ensures sensitivity of the embedding space to important variations in the data (e.g., color, rotation) that existing contrastive methods fail to capture.
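The equivariance idea in the abstract can be sketched in a few lines: an encoder maps inputs onto the unit sphere, and a loss penalizes the gap between the embedding of a transformed input, 𝑓(𝑡(𝑥)), and a rotation 𝑅 applied to the embedding of the original, 𝑅𝑓(𝑥). The toy linear encoder, the names `encoder` and `equivariance_loss`, and the fixed rotation matrix below are illustrative assumptions, not the thesis's actual architecture or training setup.

```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit sphere, as the spherical
    # embedding space in the abstract requires.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def encoder(x, W):
    # Toy linear encoder f: X -> S^(d-1). A real instantiation would
    # be a deep network; this stands in only for illustration.
    return normalize(x @ W)

def equivariance_loss(x, t_x, W, R):
    # Penalize the mismatch between f(t(x)) and R f(x): the input
    # transformation t should correspond to the rotation R in
    # embedding space. R is assumed orthogonal (a rotation).
    z = encoder(x, W)      # embeddings of the original inputs
    z_t = encoder(t_x, W)  # embeddings of the transformed inputs
    return np.mean(np.sum((z_t - z @ R.T) ** 2, axis=-1))
```

When `t` is the identity transformation and `R` is the identity rotation, the loss is exactly zero; during training, `R` (or a map producing it from the transformation's parameters) would be learned jointly with the encoder.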
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology