Isabel Rio-Torto

Isabel Rio-Torto received her master's degree in Electrical and Computers Engineering from the Faculty of Engineering of the University of Porto (FEUP) in 2019. She is currently a research assistant at INESC TEC, associated with the Visual Computing and Machine Intelligence Group (VCMI), and a Ph.D. student in Computer Science at the Faculty of Sciences of the University of Porto (FCUP). She is also an Invited Teaching Assistant at FEUP, where she teaches programming courses. Her work currently focuses on "Self-explanatory computer-aided diagnosis with limited supervision".

Details

Name: Isabel Rio-Torto
Role: Research Assistant
Since: 6th July 2020
Nationality: Portugal
Centre: Telecommunications and Multimedia
Contacts: +351222094000, isabel.riotorto@inesctec.pt


Publications


2025

CBVLM: Training-free explainable concept-based Large Vision Language Models for medical image classification

Authors
Patrício, C; Torto, IR; Cardoso, JS; Teixeira, LF; Neves, J;

Publication
Comput. Biol. Medicine

Abstract
The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the model output on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/. © 2025 Elsevier B.V., All rights reserved.


2024

DeViL: Decoding Vision features into Language

Authors
Dani, M; Rio Torto, I; Alaniz, S; Akata, Z;

Publication
PATTERN RECOGNITION, DAGM GCPR 2023

Abstract
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned. Our DeViL method generates textual descriptions of visual features at different layers of the network as well as highlights the attribution locations of learned concepts. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model can generalize training on image-text pairs to generate localized explanations. As it uses a pre-trained language model, our approach is fast to train and can be applied to any vision backbone. Moreover, DeViL can create open-vocabulary attribution maps corresponding to words or phrases even outside the training scope of the vision model. We demonstrate that DeViL generates textual descriptions relevant to the image content on CC3M, surpassing previous lightweight captioning models and attribution maps, uncovering the learned concepts of the vision backbone. Further, we analyze fine-grained descriptions of layers as well as specific spatial locations and show that DeViL outperforms the current state-of-the-art on the neuron-wise descriptions of the MILANNOTATIONS dataset.


2024

On the Suitability of B-cos Networks for the Medical Domain

Authors
Rio-Torto, I; Gonçalves, T; Cardoso, JS; Teixeira, LF;

Publication
IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING, ISBI 2024

Abstract
In fields that rely on high-stakes decisions, such as medicine, interpretability plays a key role in promoting trust and facilitating the adoption of deep learning models by the clinical communities. In the medical image analysis domain, gradient-based class activation maps are the most widely used explanation methods, and the field lacks a more in-depth investigation into inherently interpretable models that focus on integrating knowledge that ensures the model is learning the correct rules. A new approach, B-cos networks, for increasing the interpretability of deep neural networks by inducing weight-input alignment during training showed promising results on natural image classification. In this work, we study the suitability of these B-cos networks to the medical domain by testing them on different use cases (skin lesions, diabetic retinopathy, cervical cytology, and chest X-rays) and conducting a thorough evaluation of several explanation quality assessment metrics. We find that, just like in natural image classification, B-cos explanations yield more localised maps, but it is not clear that they are better than other methods' explanations when considering more explanation properties.


2024

Parameter-Efficient Generation of Natural Language Explanations for Chest X-ray Classification

Authors
Rio-Torto, I; Cardoso, JS; Teixeira, LF;

Publication
MEDICAL IMAGING WITH DEEP LEARNING

Abstract
The increased interest and importance of explaining neural networks' predictions, especially in the medical community, associated with the known unreliability of saliency maps, the most common explainability method, has sparked research into other types of explanations. Natural Language Explanations (NLEs) emerge as an alternative, with the advantage of being inherently understandable by humans and the standard way that radiologists explain their diagnoses. We build upon previous work on NLE generation for multi-label chest X-ray diagnosis by replacing the traditional decoder-only NLE generator with an encoder-decoder architecture. This constitutes a first step towards Reinforcement Learning-free adversarial generation of NLEs when no (or few) ground-truth NLEs are available for training, since the generation is done in the continuous encoder latent space, instead of in the discrete decoder output space. However, in the current scenario, large amounts of annotated examples are still required, which are especially costly to obtain in the medical domain, given that they need to be provided by clinicians. Thus, we explore how the recent developments in Parameter-Efficient Fine-Tuning (PEFT) can be leveraged for this use case. We compare different PEFT methods and find that integrating the visual information into the NLE generator layers instead of only at the input achieves the best results, even outperforming the fully fine-tuned encoder-decoder-based model, while only training 12% of the model parameters. Additionally, we empirically demonstrate the viability of supervising the NLE generation process on the encoder latent space, thus laying the foundation for RL-free adversarial training in low ground-truth NLE availability regimes. The code is publicly available at https://github.com/icrto/peft-nles.


2023

Fill in the blank for fashion complementary outfit product Retrieval: VISUM summer school competition

Authors
Castro, E; Ferreira, PM; Rebelo, A; Rio-Torto, I; Capozzi, L; Ferreira, MF; Goncalves, T; Albuquerque, T; Silva, W; Afonso, C; Sousa, RG; Cimarelli, C; Daoudi, N; Moreira, G; Yang, HY; Hrga, I; Ahmad, J; Keswani, M; Beco, S;

Publication
MACHINE VISION AND APPLICATIONS

Abstract
Every year, the VISion Understanding and Machine intelligence (VISUM) summer school runs a competition where participants can learn and share knowledge about Computer Vision and Machine Learning in a vibrant environment. The 2021 edition of VISUM focused on applying those methodologies to fashion. Recently, there has been an increase in interest within the scientific community in applying computer vision methodologies to the fashion domain. This is largely motivated by fashion being one of the world's largest industries, with e-commerce developing rapidly, mainly since the COVID-19 pandemic. Computer Vision for Fashion enables a wide range of innovations, from personalized recommendations to outfit matching. The competition enabled students to apply the knowledge acquired in the summer school to a real-world problem. The ambition was to foster research and development in fashion outfit complementary product retrieval by leveraging vast visual and textual data with domain knowledge. For this, a new fashion outfit dataset (acquired and curated by FARFETCH) for research and benchmark purposes is introduced. Additionally, a competitive baseline with an original negative sampling process for triplet mining was implemented and served as a starting point for participants. The top 3 performing methods are described in this paper since they constitute the reference state-of-the-art for this particular problem. To our knowledge, this is the first challenge in fashion outfit complementary product retrieval. Moreover, this joint project between academia and industry brings several relevant contributions to disseminating science and technology, promoting economic and social development, and helping to connect early-career researchers to real-world industry challenges.

Supervised Theses


2023

Self-Supervised Learning for Medical Image Classification: A Study on MoCo-CXR

Author
Hugo Miguel Monteiro Guimarães

Institution
UM

2023

Improving Image Captioning through Segmentation

Author
Pedro Daniel Fernandes Ferreira

Institution
UM

2021

Combining simulated and real images in deep learning

Author
Pedro Xavier Tavares Monteiro Correia de Pinho

Institution
UM

2020

Automatic generation of textual explanations in deep learning

Author
Patrícia Ferreira Rocha

Institution
UM
