Publications by CESE

2023

Predicting Model Training Time to Optimize Distributed Machine Learning Applications

Authors
Guimaraes, M; Carneiro, D; Palumbo, G; Oliveira, F; Oliveira, O; Alves, V; Novais, P;

Publication
ELECTRONICS

Abstract
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Mostly, these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs, a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models constituted by different base models, trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data.
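
The abstract links predicted training times to makespan but does not say how tasks are assigned to workers. As a rough illustration only, the sketch below uses a longest-processing-time-first greedy heuristic, which is an assumption here and not necessarily what CEDEs does. The function name `schedule_lpt`, the task names, and the predicted durations are all hypothetical.

```python
import heapq

def schedule_lpt(predicted_seconds, n_workers):
    """Greedy longest-processing-time-first assignment (illustrative only).

    predicted_seconds maps a task name to its predicted training time.
    Returns (makespan, assignment), where assignment[w] lists the tasks
    given to worker w in the order they would run.
    """
    # Min-heap of (accumulated predicted load, worker index).
    loads = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(loads)
    assignment = {w: [] for w in range(n_workers)}

    # Place the longest predicted task first, always on the least-loaded worker.
    by_duration = sorted(predicted_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for task, seconds in by_duration:
        load, worker = heapq.heappop(loads)
        assignment[worker].append(task)
        heapq.heappush(loads, (load + seconds, worker))

    # The makespan is the load of the busiest worker.
    makespan = max(load for load, _ in loads)
    return makespan, assignment

# Hypothetical predictions (seconds) for five base models of one Ensemble.
predicted = {"tree_a": 2.0, "tree_b": 3.0, "net_a": 40.0, "net_b": 35.0, "net_c": 30.0}
makespan, plan = schedule_lpt(predicted, n_workers=2)
print(makespan, plan)
```

The quality of such a plan depends directly on the prediction error: the larger error reported for Neural Networks would translate into a less reliable makespan estimate than for Decision Trees.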


2023

Teaching Data Structures and Algorithms Through Games

Authors
Carneiro, D; Carvalho, M;

Publication
METHODOLOGIES AND INTELLIGENT SYSTEMS FOR TECHNOLOGY ENHANCED LEARNING

Abstract
Computer Science degrees are often seen as challenging by students, especially subjects such as programming, data structures or algorithms. Many reasons can be pointed out for this, some of which are related to the abstract nature of these subjects and the students' lack of previous related knowledge. In this paper we tackle this challenge using gamification in the teaching/learning process, with two main goals in mind. The first is to increase the intrinsic motivation of students to learn, by making the whole process more fun, enjoyable and competitive. The second is to facilitate the learning process by providing intuitive tools for the visualization of data structures and algorithmic output, together with a tool for automated assessment that decreases the dependence on the teacher and allows students to work more autonomously. We validated this approach over the course of three academic years in a Computer Science degree of the Polytechnic of Porto, Portugal, through the use of a questionnaire. Results show that using games and game elements has a generally positive effect on motivation and on the overall learning process.


2023

Using meta-learning to predict performance metrics in machine learning problems

Authors
Carneiro, D; Guimaraes, M; Carvalho, M; Novais, P;

Publication
EXPERT SYSTEMS

Abstract
Machine learning has been facing significant challenges in recent years, many of which stem from the new characteristics of machine learning problems, such as learning from streaming data or incorporating human feedback into existing datasets and models. In these dynamic scenarios, data change over time and models must adapt. However, new data do not necessarily mean new patterns. The main goal of this paper is to devise a method to predict a model's performance metrics before it is trained, in order to decide whether it is worth training. That is, will the model achieve significantly better results than the current one? To address this issue, we propose the use of meta-learning. Specifically, we evaluate two different meta-models, one built for a specific machine learning problem, and another built based on many different problems, meant to be a generic meta-model, applicable to virtually any problem. In this paper, we focus only on the prediction of the root mean square error (RMSE). Results show that it is possible to accurately predict the RMSE of future models, even in streaming scenarios. Moreover, results also show that it is possible to reduce the need for re-training models by between 60% and 98%, depending on the problem and on the threshold used.
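
The decision described in the abstract, whether a new model is worth training given its predicted RMSE, can be sketched as a simple threshold rule. This is a minimal sketch under stated assumptions: the abstract does not define the threshold, so the relative-gain criterion, the function name `should_retrain`, the default of 5%, and the sample values are all hypothetical, and the predicted RMSE is taken as given rather than produced by an actual meta-model.

```python
def should_retrain(current_rmse, predicted_rmse, min_relative_gain=0.05):
    """Return True if the predicted RMSE improves on the current one enough.

    The relative-gain rule and its default threshold are illustrative
    assumptions, not the criterion used in the paper.
    """
    if current_rmse <= 0:
        # A model with zero error cannot be improved; skip re-training.
        return False
    gain = (current_rmse - predicted_rmse) / current_rmse
    return gain >= min_relative_gain

# Hypothetical stream of (current RMSE, RMSE predicted for a re-trained model).
updates = [(1.00, 0.98), (1.00, 0.90), (0.90, 0.89), (0.90, 0.70)]
decisions = [should_retrain(cur, pred) for cur, pred in updates]
skipped = decisions.count(False)
print(decisions, f"{skipped} of {len(updates)} re-trainings skipped")
```

A stricter threshold skips more re-trainings at the cost of keeping a slightly worse model for longer, which is consistent with the abstract's note that the savings depend on the threshold used.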


2023

Using Segmentation to Improve Machine Learning Performance in Human-in-the-Loop Systems

Authors
Carneiro, D; Carvalho, M;

Publication
INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 2

Abstract
The expectations of Machine Learning systems are becoming increasingly demanding, namely regarding the diversity of applications, the expected accuracy, and the pressure for results. However, there are cases in which Human experts are needed to label the data, which may have a significant cost in terms of human resources and time. In these cases, it is often best to learn on-the-fly, without waiting for the whole dataset to be labeled. Often, it is desirable to guide the Human annotators into focusing on the more relevant instances: this is known as active learning. In this paper we propose an approach in which a clustering algorithm is used to find groups of similar instances. Then, the procedure is guided with the objective of favoring the annotation of the groups that are under-represented in the labeled dataset. Results show that this approach leads to models that are, over time, more accurate and reliable.
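
The selection step described above, favoring groups that are under-represented in the labeled dataset, might look roughly like the sketch below. It assumes the clustering has already been run and that "under-represented" means the lowest fraction of labeled instances within a cluster. That measure, the function name `next_to_annotate`, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter

def next_to_annotate(cluster_of, labeled, batch_size=2):
    """Pick unlabeled instances from the least-labeled cluster.

    cluster_of[i] is the cluster id of instance i (clustering assumed done).
    labeled is the set of instance indices that already have a label.
    """
    total = Counter(cluster_of)
    done = Counter(cluster_of[i] for i in labeled)

    # Only clusters that still contain unlabeled instances are candidates.
    candidates = [c for c in total if done[c] < total[c]]
    if not candidates:
        return []

    # Lowest labeled fraction first; ties are broken by cluster id.
    target = min(candidates, key=lambda c: (done[c] / total[c], c))
    pool = [i for i, c in enumerate(cluster_of) if c == target and i not in labeled]
    return pool[:batch_size]

# Ten instances in three clusters; cluster 2 has no labels yet.
cluster_of = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
labeled = {0, 1, 2, 4}
first_batch = next_to_annotate(cluster_of, labeled)
second_batch = next_to_annotate(cluster_of, labeled | set(first_batch))
print(first_batch, second_batch)
```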


2023

Dynamic Management of Distributed Machine Learning Projects

Authors
Oliveira, F; Alves, A; Moço, H; Monteiro, J; Oliveira, O; Carneiro, D; Novais, P;

Publication
INTELLIGENT DISTRIBUTED COMPUTING XV, IDC 2022

Abstract
Given the new requirements of Machine Learning problems in recent years, especially regarding the volume, diversity and speed of data, new approaches are needed to deal with the associated challenges. In this paper we describe CEDEs, a distributed learning system that runs on top of a Hadoop cluster and takes advantage of blocks, replication and balancing. CEDEs trains models in a distributed manner following the principle of data locality, and is able to change parts of the model through an optimization module, thus allowing a model to evolve over time as the data changes. This paper describes its generic architecture, details the implementation of the first modules, and provides a first validation.


2023

Observability: Towards Ethical Artificial Intelligence

Authors
Palumbo, G; Carneiro, D; Alves, V;

Publication
NEW TRENDS IN DISRUPTIVE TECHNOLOGIES, TECH ETHICS AND ARTIFICIAL INTELLIGENCE, DITTET 2023

Abstract
In recent years, several regulatory initiatives have been carried out at the European Commission level to ensure the ethical use of Artificial Intelligence, including the General Data Protection Regulation, the Data Governance Act, and the Artificial Intelligence Act. However, there is also a need for technological solutions that effectively enable the implementation of this regulation in a realistic and efficient way. The main goal of this work is to propose and implement such a technological solution, relying on the notion of observability. The hypothesis is that a set of ethics metrics can be implemented along a domain-agnostic Data Science/Artificial Intelligence pipeline. These metrics, when observed in real time, will make it possible not only to assess the level of compliance of the pipeline with ethics standards at different levels, but also for the organization to react in a timely manner when the data, the model or any other artifact in the pipeline exhibits undesired behavior. In this way, some of the most important ethical principles of AI are guaranteed: responsibility and prevention of harm. This work aims to identify a large group of ethics metrics, implement them, map them onto the different stages of a typical Data Science/AI process, and determine whether the presence of these metrics ensures or contributes to the development of AI solutions that can be considered ethical according to the latest European regulation.
