2021
Authors
Carneiro, D; Oliveira, F; Novais, P;
Publication
Ambient Intelligence - Software and Applications - 12th International Symposium on Ambient Intelligence, ISAmI 2021, Salamanca, Spain, 6-8 October, 2021.
Abstract
Machine Learning problems are growing significantly in complexity, whether due to increasing data volumes, new forms of data, or changes in the data over time. This poses new challenges that are both technical and scientific. In this paper we propose a Distributed Learning System that runs on top of a Hadoop cluster, leveraging its native functionalities, and is guided by the principle of data locality. Data are distributed across the cluster, so models are also distributed and trained in parallel. Models are thus seen as Ensembles of base models, and predictions are made by combining the predictions of the base models. Moreover, models are replicated and distributed across the cluster, so that multiple nodes can answer requests. The result is a system that is both resilient and highly available. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
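As a rough illustration of the combination step this abstract describes (the function names and the majority-vote rule are illustrative assumptions, not the paper's actual implementation), combining distributed base-model predictions might look like:

```python
from collections import Counter

def ensemble_predict(base_models, x):
    # Gather one prediction per base model, as if each model had been
    # trained in parallel on a different data block of the cluster.
    votes = [model(x) for model in base_models]
    # The Ensemble's answer is the most common base-model prediction.
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical base models (stand-ins for trained classifiers)
models = [lambda x: x > 3, lambda x: x > 5, lambda x: x > 4]
print(ensemble_predict(models, 6))  # all three agree: True
```

Because each base model is trained and stored where its data block lives, only the small prediction values travel over the network at inference time.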
2023
Authors
Guimarães, M; Oliveira, F; Carneiro, D; Novais, P;
Publication
Ambient Intelligence - Software and Applications - 14th International Symposium on Ambient Intelligence, ISAmI 2023, Guimarães, Portugal, July 12-14, 2023
Abstract
Distributed Machine Learning, in which data and learning tasks are scattered across a cluster of computers, is one of the field's answers to the challenges posed by Big Data. Still, in an era in which data abound, decisions must be made about which specific data to use to train a model, either because the amount of available data is simply too large, or because the training time or complexity of the model must be kept low. Typical approaches include, for example, selection based on data freshness. However, old data are not necessarily outdated and may still contain relevant patterns. Likewise, relying only on recent data may significantly decrease data diversity and representativeness, and with them model quality. The goal of this paper is to compare different heuristics for selecting data in a distributed Machine Learning scenario. Specifically, we ascertain whether selecting data based on their characteristics (meta-features), and optimizing for maximum diversity, improves model quality while potentially allowing model complexity to be reduced. This will enable the development of more informed data selection strategies in distributed settings, in which the criteria include not only the location of the data or the state of each node in the cluster, but also intrinsic and relevant characteristics of the data. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.
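One way to optimize for maximum diversity over meta-features, as this abstract proposes, is greedy farthest-point selection. A minimal sketch, assuming each data block is summarized by a numeric meta-feature vector (the specific meta-features and the greedy rule are assumptions for illustration):

```python
import math

def diversity_select(meta_features, k):
    # Greedy farthest-point selection: start from the first block
    # (arbitrary seed) and repeatedly add the block whose meta-feature
    # vector is farthest from everything already selected.
    selected = [0]
    while len(selected) < k:
        best = max(
            (i for i in range(len(meta_features)) if i not in selected),
            key=lambda i: min(
                math.dist(meta_features[i], meta_features[j]) for j in selected
            ),
        )
        selected.append(best)
    return selected

# Hypothetical meta-feature vectors, e.g. (class balance, feature sparsity)
blocks = [(0.1, 0.9), (0.15, 0.88), (0.8, 0.2), (0.5, 0.5)]
print(diversity_select(blocks, 2))  # picks the two most dissimilar blocks
```

Blocks 0 and 1 are near-duplicates, so the selection skips block 1 in favour of the distant block 2, which is the intuition behind diversity-driven data selection.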
2023
Authors
Oliveira, F; Carneiro, D; Guimaraes, M; Oliveira, O; Novais, P;
Publication
INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS
Abstract
As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.
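The "fewer but better base models" finding suggests pruning an Ensemble down to its strongest members before combining them. A minimal sketch of that idea (the dictionary layout and validation-accuracy criterion are illustrative assumptions, not the heuristics evaluated in the paper):

```python
def select_best_models(models, k):
    # "Fewer but better": keep only the k base models with the highest
    # validation accuracy before forming the distributed Ensemble.
    return sorted(models, key=lambda m: m["val_acc"], reverse=True)[:k]

# Hypothetical base models trained on different data blocks
candidates = [
    {"name": "tree_a", "val_acc": 0.91},
    {"name": "tree_b", "val_acc": 0.78},
    {"name": "tree_c", "val_acc": 0.88},
]
print([m["name"] for m in select_best_models(candidates, 2)])
```

Dropping the weakest base models shrinks the Ensemble's memory and inference cost, which is what makes the pruned Ensemble competitive with a full Random Forest.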
2023
Authors
Guimaraes, M; Carneiro, D; Palumbo, G; Oliveira, F; Oliveira, O; Alves, V; Novais, P;
Publication
ELECTRONICS
Abstract
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Mostly, these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs, a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models composed of different base models, trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show that results depend significantly on the hyperparameters of the model and on the characteristics of the input data.
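The prediction task described above can be cast as regression over past runs: estimate a new task's duration from tasks with similar model and data characteristics. A minimal 1-nearest-neighbour sketch (the feature choice and record layout are assumptions for illustration, not the paper's actual predictor):

```python
import math

def predict_training_time(history, job):
    # Estimate a training task's duration from the most similar past run,
    # comparing (data rows, data columns, tree max_depth) vectors.
    nearest = min(history, key=lambda rec: math.dist(rec["features"], job))
    return nearest["seconds"]

# Hypothetical log of past base-model training runs
history = [
    {"features": (10_000, 20, 5), "seconds": 0.8},
    {"features": (100_000, 20, 10), "seconds": 6.5},
    {"features": (50_000, 50, 8), "seconds": 4.1},
]
print(predict_training_time(history, (95_000, 22, 10)))  # nearest run: 6.5
```

A scheduler with such per-task estimates can pack hundreds of base-model training jobs onto cluster nodes so that no node sits idle while another determines the makespan.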
2023
Authors
Oliveira, F; Alves, A; Moço, H; Monteiro, J; Oliveira, O; Carneiro, D; Novais, P;
Publication
INTELLIGENT DISTRIBUTED COMPUTING XV, IDC 2022
Abstract
Given the new requirements of Machine Learning problems in recent years, especially regarding the volume, diversity and speed of data, new approaches are needed to deal with the associated challenges. In this paper we describe CEDEs, a distributed learning system that runs on top of a Hadoop cluster and takes advantage of blocks, replication and balancing. CEDEs trains models in a distributed manner following the principle of data locality, and is able to change parts of the model through an optimization module, thus allowing a model to evolve over time as the data change. This paper describes its generic architecture, details the implementation of the first modules, and provides a first validation.
2024
Authors
Ferreira, HM; Carneiro, DR; Guimaraes, MA; Oliveira, FV;
Publication
5TH INTERNATIONAL CONFERENCE ON INDUSTRY 4.0 AND SMART MANUFACTURING, ISM 2023
Abstract
Quality inspection is a critical step in ensuring the quality and efficiency of textile production processes. With the increasing complexity and scale of modern textile manufacturing systems, the need for accurate and efficient quality inspection and defect detection techniques has become paramount. This paper compares supervised and unsupervised Machine Learning techniques for defect detection in the context of industrial textile production, in terms of their respective advantages and disadvantages, and their implementation and computational costs. We explore the use of an autoencoder for the detection of defects in textiles. The goal of this preliminary work is to determine whether unsupervised methods can successfully train models with good performance without the need for defect-labelled data. (c) 2023 The Authors. Published by Elsevier B.V.
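The unsupervised approach explored above rests on a simple decision rule: an autoencoder trained only on defect-free fabric reconstructs normal patches well, so patches with high reconstruction error are flagged as defects. A minimal sketch of that thresholding step (the error values and threshold are hypothetical; the trained autoencoder itself is assumed):

```python
def flag_defects(reconstruction_errors, threshold):
    # Patches whose reconstruction error exceeds the threshold are
    # flagged as likely defects; no defect labels are needed, only
    # defect-free samples to train the autoencoder.
    return [i for i, err in enumerate(reconstruction_errors) if err > threshold]

# Hypothetical per-patch reconstruction errors from a trained autoencoder
errors = [0.02, 0.03, 0.41, 0.025, 0.38]
print(flag_defects(errors, threshold=0.1))  # patches 2 and 4 stand out
```

The threshold is typically calibrated on a held-out set of defect-free patches, e.g. a high percentile of their reconstruction errors.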