Abstract
Representations lie at the heart of artificial intelligence, enabling machines to perceive, interpret, and interact with the world. Visual representations, extracted from images or videos, enable tasks such as image classification, image retrieval, and object detection. Visual-textual representations, bridging the gap between the visual and linguistic domains, enable tasks like image captioning, visual question answering, and cross-modal retrieval. The ability to learn and manipulate these representations is paramount for advancing the state of the art in computer vision and beyond. In this dissertation, we investigate novel methods for learning both visual (unimodal) and visual-textual (multimodal) representations, focusing mainly on applications in deep metric learning, image classification, and composed image retrieval. We address the challenges of learning representations from both data-centric and model-centric perspectives, aiming to unlock new capabilities for visual understanding and interaction.

In visual representation learning, we first focus on data and introduce Metrix, a deep metric learning method utilizing mixup for data augmentation. Metrix addresses the challenge of interpolating both examples and target labels, overcoming the non-additive nature of traditional metric learning loss functions. By generalizing existing loss functions to incorporate mixup, Metrix enhances learning and explores new embedding space regions. We introduce a novel metric, utilization, to measure this exploration. Experiments on four benchmark datasets, including various mixup settings, show that Metrix significantly outperforms state-of-the-art methods, improving robustness and generalization. This work exemplifies our aim to advance visual representation learning through innovative data augmentation.
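To make the idea of interpolating both examples and labels concrete, the following is a minimal toy sketch of mixup in a metric-learning setting. It is illustrative only and not the actual Metrix formulation: the function names, the toy contrastive loss, and the choice of weighting the positive and negative terms by the interpolation coefficient are all assumptions made for this example.

```python
import random

def mixup_pair(x_a, x_b, alpha=1.0):
    """Interpolate two input vectors with a Beta-distributed coefficient.

    Returns the mixed vector and the coefficient lam, which doubles as a
    soft label weight (how 'positive' the mixed example is w.r.t. x_a).
    """
    lam = random.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam

def soft_contrastive_loss(sim_pos, sim_neg, lam, margin=0.5):
    """Toy 'mixed' contrastive loss (NOT the Metrix loss).

    The positive term is weighted by lam and the negative term by
    (1 - lam), mirroring how label interpolation softens the hard
    positive/negative assignment of a standard metric learning loss.
    """
    pos_term = lam * (1.0 - sim_pos)
    neg_term = (1.0 - lam) * max(0.0, sim_neg - margin)
    return pos_term + neg_term
```

With `lam = 1` the loss reduces to an ordinary positive-pair term, and with `lam = 0` to an ordinary negative-pair term; intermediate values populate regions of the embedding space that neither clean example occupies, which is what the utilization metric mentioned above is designed to measure.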
Next, we shift our focus to the model architecture, introducing SimPool, a simple attention-based pooling method designed to replace the default pooling in both convolutional neural networks (CNNs) and vision transformers (ViTs). We develop a generic pooling framework and formulate existing pooling methods as its instantiations, allowing us to analyze, compare, and discuss their properties. Through this, we derive SimPool, which improves performance in supervised and self-supervised settings on standard benchmarks and downstream tasks. SimPool generates high-quality attention maps that accurately delineate object boundaries, significantly enhancing object localization and robustness to background changes. It improves object discovery metrics and performs efficiently, even when removing ViT blocks, thus optimizing the balance between performance and model complexity. This work exemplifies our aim to advance visual representation learning through innovative model architecture components.

Transitioning to visual-textual representations, we introduce FreeDom, a training-free method for zero-shot composed image retrieval in open-world domain conversion. FreeDom leverages the descriptive power of a frozen vision-language model (VLM) and employs textual inversion, enabling flexible image and text query composition. Unlike traditional methods that invert query images to the continuous latent space of tokens, FreeDom's inversion into the discrete input space of text is pivotal for its success. Experiments on four benchmark domain conversion datasets, including three newly introduced by us, demonstrate its superior performance. Additionally, FreeDom performs on par with the best methods in generic composed image retrieval. This work exemplifies our aim to advance multimodal representation learning through innovative discrete-space textual inversion.
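The core mechanism of attention-based pooling can be sketched in a few lines: a single learned query attends over the patch (or spatial) features, and the pooled representation is the attention-weighted sum. This is a minimal single-query sketch in the spirit of SimPool, not the actual SimPool formulation; the function name and the plain dot-product scoring are assumptions made for illustration.

```python
import math

def attention_pool(features, query):
    """Pool a set of feature vectors into one vector via attention.

    features: list of D-dimensional patch features.
    query:    a single D-dimensional query vector.
    Returns (pooled_vector, attention_weights), where the weights form a
    softmax distribution over the patches.
    """
    # Dot-product scores between the query and each patch feature.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Numerically stable softmax over the patches.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the patch features.
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features))
              for d in range(dim)]
    return pooled, weights
```

The attention weights themselves form a spatial map over the patches, which is why a pooling layer of this kind can double as an object-localization signal, as described above.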
Expanding on visual-textual representations, we now focus on their applications in remote sensing to introduce a novel task: remote sensing composed image retrieval (RSCIR). This task aims to provide a more expressive and flexible search capability within the remote sensing domain. We explore and qualitatively evaluate the unique challenges and capabilities this task introduces. Users can now pair a query image with a query text specifying modifications related to color, shape, size, texture, density, context, quantity, or the presence of certain classes. To quantitatively assess this, we establish a benchmark, PatternCom, and an evaluation protocol focusing on shape, color, density, and quantity modifications. Our method, WeiCom, operates training-free by utilizing a frozen vision-language model and incorporates a modality control parameter for generating more image- or text-oriented results based on specific search needs. This work exemplifies our aim to advance multimodal representation learning by introducing a flexible method that showcases the potential of this novel task in a new domain.