Publications

Publications by Bruno Miguel Veloso

2020

A case study on using heavy-hitters in interconnect bypass fraud

Authors
Veloso, B; Gama, J; Martins, C; Espanha, R; Azevedo, R;

Publication
ACM SIGAPP Applied Computing Review

Abstract
Nowadays, fraudsters are continually trying to exploit technical gaps in telecom companies to make a profit. The high cost of international termination rates in telecom companies, and mainly their high asymmetry, attracts the attention of fraudsters. In this paper, we explore the application of three deterministic algorithms and one probabilistic algorithm that, combined, can help to identify possible abnormal behaviors. Interconnect Bypass Fraud (IBF) is among the three most common frauds worldwide in the telecommunication domain. Typically, telecom companies can detect IBF by the occurrence of bursts of calls, repetitions, and mirror behaviors from specific numbers. The goal of our work is to discover numbers with abnormal behaviors as soon as possible, and based on this assumption we developed: (i) the lossy count algorithm with a fast forgetting technique; and (ii) the single-pass hierarchical heavy hitter algorithm, which also contains a forgetting technique. We also apply HyperLogLog sketches and the sticky sampling algorithm. We applied the four algorithms to two real datasets and performed a parameter sensitivity analysis. The results show that our two proposals (Lossy Counting with fast forgetting and the Hierarchical Heavy Hitters) can capture the most recent abnormal behaviors faster than the baseline algorithms. Nonetheless, these four algorithms combined can make committing fraud more difficult and can complement the techniques used by the telecom company.

2020

Self Hyper-parameter Tuning for Stream Classification Algorithms

Authors
Veloso, B; Gama, J;

Publication
IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning - Second International Workshop, IoT Streams 2020, and First International Workshop, ITEM 2020, Co-located with ECML/PKDD 2020, Ghent, Belgium, September 14-18, 2020, Revised Selected Papers

Abstract
The new 5G mobile communication system era brings a new set of communication devices that will appear on the market. These devices will generate data streams that require proper handling by machine learning algorithms. The processing of these data streams requires the design, development, and adaptation of appropriate machine learning algorithms. While stream processing algorithms include hyper-parameters for performance refinement, their tuning process is time-consuming and typically requires an expert to do the task. In this paper, we present an extension of the Self Parameter Tuning (SPT) optimization algorithm for data streams. We apply the Nelder-Mead algorithm to dynamically sized samples that converge to optimal settings in a double pass over data (during the exploration phase), using a relatively small number of data points. Additionally, the SPT automatically readjusts hyper-parameters when concept drift occurs. We ran a set of experiments with well-known classification data sets, and the results show that the proposed algorithm can outperform the results of previous hyper-parameter tuning efforts by human experts. The statistical results show that this extension converges faster and achieves at least similar accuracy when compared with the standard optimization techniques. © 2020, Springer Nature Switzerland AG.

2020

Failure Detection of an Air Production Unit in Operational Context

Authors
Barros, M; Veloso, B; Pereira, PM; Ribeiro, RP; Gama, J;

Abstract
The transformation of industrial manufacturing with computers and automation with smart systems leads us to monitor and log industrial equipment events. It is possible to apply analytic approaches and to find interpretable results for strategic decision making, providing advantages such as failure detection and predictive maintenance. Over the last years, many researchers have been studying the application of machine learning techniques to improve such tasks. In this context, we develop a system capable of detecting anomalies in an Air Production Unit (APU), taking into consideration the peak frequency of each sensor. The study started with the analysis of the sensors installed on the APU, defining its normal behavior and its failure mode. Using that information, we define rules to monitor the APU, to detect anomalies in its components, and to predict possible failures. The definition of rules was based on the peak frequency analysis, which allowed the setting of boundaries of normality for the APU working modes and, thus, the identification of anomalies. © 2020, Springer Nature Switzerland AG.

2021

Classification and Recommendation With Data Streams

Authors
Veloso, B; Gama, J; Malheiro, B;

Publication
Encyclopedia of Information Science and Technology, Fifth Edition - Advances in Information Quality and Management

Abstract
Nowadays, with the exponential growth of data stream sources (e.g., Internet of Things [IoT], social networks, crowdsourcing platforms, and personal mobile devices), data stream processing has become indispensable for online classification, recommendation, and evaluation. Its main goal is to keep dynamic models up to date, holding the captured patterns, to make accurate predictions. The foundations of data stream algorithms are incremental processing, in order to reduce the computational resources required to process large quantities of data, and relevance model updating. This article addresses data stream knowledge processing, covering classification, recommendation, and evaluation; describing existing algorithms/techniques; and identifying open challenges.

2020

AutoML for Stream k-Nearest Neighbors Classification

Authors
Bahri, M; Veloso, B; Bifet, A; Gama, J;

Publication
2020 IEEE International Conference on Big Data (Big Data)

Abstract
The last few decades have witnessed a significant evolution of technology in different domains, changing the way the world operates, which leads to an overwhelming amount of data generated in an open-ended way as streams. Over the past years, we observed the development of several machine learning algorithms to process big data streams. However, the accuracy of these algorithms is very sensitive to their hyper-parameters, which require expertise and extensive trials to tune. Another relevant aspect is the high dimensionality of data, which can degrade computational performance. To cope with these issues, this paper proposes a stream k-nearest neighbors (kNN) algorithm that applies an internal dimension reduction to the stream in order to reduce resource usage and uses an automatic monitoring system that dynamically tunes the configuration of the kNN algorithm and the output dimension size with big data streams. Experiments over a wide range of datasets show that the predictive and computational performances of the kNN algorithm are improved.

2021

Crowdsourced Data Stream Mining for Tourism Recommendation

Authors
Leal, F; Veloso, B; Malheiro, B; Burguillo, JC;

Publication
Trends and Applications in Information Systems and Technologies - Volume 1, WorldCIST 2021, Terceira Island, Azores, Portugal, 30 March - 2 April, 2021.

Abstract
Crowdsourced data streams are continuous flows of data generated at a high rate by users, also known as the crowd. These data streams are popular and extremely valuable in several domains. This is the case of tourism, where crowdsourcing platforms rely on tourist and business inputs to provide tailored recommendations to future tourists in real time. The continuous, open and non-curated nature of the crowd-originated data requires robust data stream mining techniques for on-line profiling, recommendation and evaluation. The sought techniques need not only to continuously improve profiles and learn models, but also to be transparent, overcome biases, prioritise preferences, and master huge data volumes, all in real time. This article surveys the state of the art in this field and identifies future research opportunities. © 2021, The Author(s), under exclusive license to Springer Nature Switzerland AG.
