Cookies Policy
The website need some cookies and similar means to function. If you permit us, we will use those means to collect data on your visits for aggregated statistics to improve our service. Find out More
Accept Reject
  • Menu
Publications

Publications by CTM

2025

CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Authors
Patrício, C; Torto, IR; Cardoso, JS; Teixeira, LF; Neves, JC;

Publication
CoRR

Abstract

2025

Markerless multi-view 3D human pose estimation: A survey

Authors
Nogueira, AFR; Oliveira, HP; Teixeira, LF;

Publication
IMAGE AND VISION COMPUTING

Abstract
3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. The creation of accurate and efficient methods is required for several real-world applications including animation, human-robot interaction, surveillance systems or sports, among many others. However, several obstacles such as occlusions, random camera perspectives, or the scarcity of 3D labelled data, have been hampering the models' performance and limiting their deployment in real-world scenarios. The higher availability of cameras has led researchers to explore multi-view solutions due to the advantage of being able to exploit different perspectives to reconstruct the pose. Most existing reviews focus mainly on monocular 3D human pose estimation and a comprehensive survey only on multi-view approaches to determine the 3D pose has been missing since 2012. Thus, the goal of this survey is to fill that gap and present an overview of the methodologies related to 3D pose estimation in multi-view settings, understand what were the strategies found to address the various challenges and also, identify their limitations. According to the reviewed articles, it was possible to find that most methods are fully-supervised approaches based on geometric constraints. Nonetheless, most of the methods suffer from 2D pose mismatches, to which the incorporation of temporal consistency and depth information have been suggested to reduce the impact of this limitation, besides working directly with 3D features can completely surpass this problem but at the expense of higher computational complexity. Models with lower supervision levels were identified to overcome some of the issues related to 3D pose, particularly the scarcity of labelled datasets. Therefore, no method is yet capable of solving all the challenges associated with the reconstruction of the 3D pose. Due to the existing trade-off between complexity and performance, the best method depends on the application scenario. Therefore, further research is still required to develop an approach capable of quickly inferring a highly accurate 3D pose with bearable computation cost. To this goal, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology to solve this task.

2025

A two-step concept-based approach for enhanced interpretability and trust in skin lesion diagnosis

Authors
Patrício, C; Teixeira, LF; Neves, JC;

Publication
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL

Abstract
The main challenges hindering the adoption of deep learning-based systems in clinical settings are the scarcity of annotated data and the lack of interpretability and trust in these systems. Concept Bottleneck Models (CBMs) offer inherent interpretability by constraining the final disease prediction on a set of human-understandable concepts. However, this inherent interpretability comes at the cost of greater annotation burden. Additionally, adding new concepts requires retraining the entire system. In this work, we introduce a novel two-step methodology that addresses both of these challenges. By simulating the two stages of a CBM, we utilize a pretrained Vision Language Model (VLM) to automatically predict clinical concepts, and an off-the-shelf Large Language Model (LLM) to generate disease diagnoses grounded on the predicted concepts. Furthermore, our approach supports test-time human intervention, enabling corrections to predicted concepts, which improves final diagnoses and enhances transparency in decision-making. We validate our approach on three skin lesion datasets, demonstrating that it outperforms traditional CBMs and state-of-the-art explainable methods, all without requiring any training and utilizing only a few annotated examples. The code is available at https://github.com/CristianoPatricio/2step-concept-based-skin-diagnosis.

2025

Causal representation learning through higher-level information extraction

Authors
Silva, F; Oliveira, HP; Pereira, T;

Publication
ACM COMPUTING SURVEYS

Abstract
The large gap between the generalization level of state-of-the-art machine learning and human learning systems calls for the development of artificial intelligence (AI) models that are truly inspired by human cognition. In tasks related to image analysis, searching for pixel-level regularities has reached a power of information extraction still far from what humans capture with image-based observations. This leads to poor generalization when even small shifts occur at the level of the observations. We explore a perspective on this problem that is directed to learning the generative process with causality-related foundations, using models capable of combining symbolic manipulation, probabilistic reasoning, and pattern recognition abilities. We briefly review and explore connections of research from machine learning, cognitive science, and related fields of human behavior to support our perspective for the direction to more robust and human-like artificial learning systems.

2025

AI-based models to predict decompensation on traumatic brain injury patients

Authors
Ribeiro, R; Neves, I; Oliveira, HP; Pereira, T;

Publication
Comput. Biol. Medicine

Abstract
Traumatic Brain Injury (TBI) is a form of brain injury caused by external forces, resulting in temporary or permanent impairment of brain function. Despite advancements in healthcare, TBI mortality rates can reach 30%–40% in severe cases. This study aims to assist clinical decision-making and enhance patient care for TBI-related complications by employing Artificial Intelligence (AI) methods and data-driven approaches to predict decompensation. This study uses learning models based on sequential data from Electronic Health Records (EHR). Decompensation prediction was performed based on 24-h in-mortality prediction at each hour of the patient's stay in the Intensive Care Unit (ICU). A cohort of 2261 TBI patients was selected from the MIMIC-III dataset based on age and ICD-9 disease codes. Logistic Regressor (LR), Long-short term memory (LSTM), and Transformers architectures were used. Two sets of features were also explored combined with missing data strategies by imputing the normal value, data imbalance techniques with class weights, and oversampling. The best performance results were obtained using LSTMs with the original features with no unbalancing techniques and with the added features and class weight technique, with AUROC scores of 0.918 and 0.929, respectively. For this study, using EHR time series data with LSTM proved viable in predicting patient decompensation, providing a helpful indicator of the need for clinical interventions. © 2025 Elsevier Ltd

2025

Evaluation of Lyrics Extraction from Folk Music Sheets Using Vision Language Models (VLMs)

Authors
Sales Mendes, A; Lozano Murciego, Á; Silva, LA; Jiménez Bravo, M; Navarro Cáceres, M; Bernardes, G;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
Monodic folk music has traditionally been preserved in physical documents. It constitutes a vast archive that needs to be digitized to facilitate comprehensive analysis using AI techniques. A critical component of music score digitization is the transcription of lyrics, an extensively researched process in Optical Character Recognition (OCR) and document layout analysis. These fields typically require the development of specific models that operate in several stages: first, to detect the bounding boxes of specific texts, then to identify the language, and finally, to recognize the characters. Recent advances in vision language models (VLMs) have introduced multimodal capabilities, such as processing images and text, which are competitive with traditional OCR methods. This paper proposes an end-to-end system for extracting lyrics from images of handwritten musical scores. We aim to evaluate the performance of two state-of-the-art VLMs to determine whether they can eliminate the need to develop specialized text recognition and OCR models for this task. The results of the study, obtained from a dataset in a real-world application environment, are presented along with promising new research directions in the field. This progress contributes to preserving cultural heritage and opens up new possibilities for global analysis and research in folk music. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

  • 2
  • 315