2022
Authors
Strecht, P; Mendes Moreira, J; Soares, C;
Publication
ADVANCED DATA MINING AND APPLICATIONS, ADMA 2022, PT II
Abstract
Density estimation is an important tool for data analysis. Non-parametric approaches have a reputation for offering state-of-the-art density estimates, but only in a few dimensions. Despite providing less accurate density estimates, histogram-based approaches remain the only alternative for datasets in high-dimensional spaces. In this paper, we present a multivariate histogram approach to estimate the density of a dataset without restrictions on the number of dimensions, containing both numerical and categorical variables (without numerical encoding) and allowing missing data (without the need to preprocess them). Results from the empirical evaluation show that it is possible to estimate the density of datasets without restrictions on dimensionality, and that the method is robust to missing values and categorical variables.
2022
Authors
Cerqueira, V; Torgo, L; Soares, C;
Publication
JOURNAL OF INTELLIGENT INFORMATION SYSTEMS
Abstract
Time series forecasting is one of the most active research topics. Machine learning methods have been increasingly adopted to solve these predictive tasks. However, a recent work presented evidence that these approaches systematically show lower predictive performance than simple statistical methods. In this work, we counter these results. We show that they are only valid under an extremely low sample size. Using a learning curve method, our results suggest that machine learning methods improve their relative predictive performance as the sample size grows. The R code to reproduce all of our experiments is available at https://github.com/vcerqueira/MLforForecasting.
2022
Authors
Brazdil, P; van Rijn, JN; Soares, C; Vanschoren, J;
Publication
Cognitive Technologies
Abstract
2022
Authors
Hetlerovic, D; Popelinsky, L; Brazdil, P; Soares, C; Freitas, F;
Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS XX, IDA 2022
Abstract
Although outlier detection/elimination has been studied before, few comprehensive studies exist on when exactly this technique would be useful as preprocessing in classification tasks. The objective of our study is to fill this gap. We have performed experiments with 12 outlier elimination methods (OEMs) and 10 classification algorithms on 50 different datasets. The results were then processed by the proposed reduction method, whose aim is to identify the most useful workflows for a given set of tasks (datasets). The reduction method has identified just three OEMs that are generally useful for the given set of tasks. We have shown that the inclusion of these OEMs is indeed useful, as it leads to lower loss in accuracy and the difference is quite significant (0.5%) on average.
2022
Authors
Baghcheband, H; Soares, C; Reis, LP;
Publication
2022 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY, WI-IAT
Abstract
The amount of data produced by distributed devices, such as smart devices and the IoT, is increasing continuously. The cost of transmitting data, together with the availability of distributed computing power, raises interest in distributed data mining (DDM). However, in a pure DDM scenario, data availability may not be enough to generate reliable models in a distributed environment. So, the ability to exchange data efficiently and effectively will become a crucial component of DDM. In this paper, we propose the concept of the Machine Learning Data Market (MLDM), a framework for the exchange of data among autonomous agents. We consider a set of learning agents in a cooperative distributed ML setting, where agents negotiate data to improve the models they use locally. In the proposed data market, the system's predictive accuracy is investigated, as well as the economic value of data. The question addressed in this paper is: how does data exchange among the agents improve the accuracy of the learning model? The agent budget is defined as a limit on negotiation. We defined a multi-agent system with negotiation and assessed it against the multi-agent system baseline and the single-agent system. The proposed framework is analyzed for different sizes of batch data collected over time, to find out how batch size changes the effect of negotiation on the accuracy of the model. The results indicate that even simple negotiation among agents increases their learning accuracy.
2022
Authors
Martins, I; Resende, JS; Sousa, PR; Silva, S; Antunes, L; Gama, J;
Publication
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE
Abstract
The Internet of Things (IoT) envisions a smart environment powered by connectivity and heterogeneity where ensuring reliable services and communications across multiple industries, from financial fields to healthcare and fault detection systems, is a top priority. In such fields, data is collected and broadcast at high speed, continuously and in real time, which places IoT within the stream processing paradigm. Intrusion Detection Systems (IDS) rely on manually defined security policies and signatures that fail to deliver a real-time solution or prevent zero-day attacks. Therefore, anomaly detection appears as a prominent solution capable of recognizing patterns, learning from experience, and detecting abnormal behavior. However, most approaches do not meet these pressing requirements and are often evaluated on deprecated datasets not representative of the working environment. As a result, our contributions comprise an overview of cybersecurity threats in IoT, important recommendations for a real-time IDS, and a real-time dataset setting to evaluate a security system covering multiple cyber threats. The dataset used to evaluate current host-based IDS approaches is publicly available and can be used as a benchmark by the community.