A Variational Lower Bound to Mitigate Batch Effect in Molecular Representations
Author(s)
Wang, Chenyu
DownloadThesis PDF (5.820Mb)
Advisor
Jaakkola, Tommi S.
Terms of use
Metadata
Show full item recordAbstract
High-throughput drug screening – using cell imaging or gene expression measurements as readouts of drug effect – is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. Experiments on drug screening data reveal InfoCORE’s superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes.
Date issued
2025-02Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology