Publications

Publications by LIAAD

2021

Modelling Voting Behaviour During a General Election Campaign Using Dynamic Bayesian Networks

Authors
Costa, P; Nogueira, AR; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2021)

Abstract
This work aims to develop a Machine Learning framework to predict voting behaviour. The data consist of variables collected longitudinally during the Portuguese 2019 general election campaign. Naive Bayes (NB), Tree Augmented Naive Bayes (TAN), and three different expert models built with Dynamic Bayesian Networks (DBN) predict voting behaviour at each considered moment in time using past information. Even though the differences found in some performance comparisons are not statistically significant, TAN and NB outperformed the DBN experts' models. The learned models outperformed one of the experts' models when predicting abstention and two when predicting the right-wing parties' vote. Specifically, for the right-wing parties' vote, TAN and NB presented satisfactory accuracy, while the experts' models were below 50% in the third evaluation moment.

2021

An Analysis of Performance Metrics for Imbalanced Classification

Authors
Gaudreault, JG; Branco, P; Gama, J;

Publication
DISCOVERY SCIENCE (DS 2021)

Abstract
Numerous machine learning applications involve dealing with imbalanced domains, where the learning focus is on the least frequent classes. This imbalance introduces new challenges for both the performance assessment of these models and their predictive modeling. While several performance metrics have been established as baselines in balanced domains, some cannot be applied to the imbalanced case since the use of the majority class in the metric could lead to a misleading evaluation of performance. Other metrics, such as the area under the precision-recall curve, have been demonstrated to be more appropriate for imbalanced domains due to their focus on class-specific performance. There are, however, many proposed implementations for this particular metric, which could potentially lead to different conclusions depending on the one used. In this research, we carry out an experimental study to better understand these issues and aim at providing a set of recommendations by studying the impact of using different metrics and different implementations of the same metric under multiple imbalance settings.
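To illustrate how two implementations of the area under the precision-recall curve can disagree on the same predictions, here is a minimal sketch in plain Python. It compares a step-wise sum (often called average precision) with trapezoidal interpolation. The function names, the toy labels, and the toy scores are assumptions made for illustration; they are not taken from the paper and are not necessarily the implementations the authors compared.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = fp = 0
    points = []
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Step-wise area: precision weighted by each increase in recall."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def trapezoidal_auprc(points):
    """Area with linear interpolation between consecutive points.

    Assumption: the curve is anchored at recall 0 with the precision
    of the first point.
    """
    pts = [(0.0, points[0][1])] + points
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Toy imbalanced data: 3 positives out of 10 examples (hypothetical values).
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

points = pr_points(y_true, scores)
ap = average_precision(points)
trap = trapezoidal_auprc(points)
print(f"average precision: {ap:.4f}, trapezoidal: {trap:.4f}")
```

On this toy data the two values differ noticeably even though they summarise the same ranking, which is the kind of discrepancy the abstract refers to.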

2021

A sketch for the KS test for Big Data

Authors
Galeno, TD; Gama, J; Cardoso, DO;

Publication
Anais do IX Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2021)

Abstract
Motivated by the challenges of Big Data, this paper presents an approximate algorithm to compute the Kolmogorov-Smirnov test. This goodness-of-fit statistical test is extensively used because it is non-parametric. This work focuses on the one-sample test, which considers the hypothesis that a given univariate sample follows some reference distribution. The method makes it possible to evaluate the departure of an input stream from such a distribution while being space and time efficient. We show the accuracy of our algorithm through several experiments in different scenarios: varying the reference distribution and its parameters, the sample size, and the available memory. The performance of rival methods, some of which are considered the state of the art, was compared. It is demonstrated that our algorithm is superior in most of the cases, considering the absolute error of the test statistic.
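For reference, the exact one-sample Kolmogorov-Smirnov statistic that such a sketch approximates can be written in a few lines. The version below is a minimal illustration, not the paper's algorithm: it sorts the whole sample in memory, which is precisely the cost a streaming method tries to avoid. The function names, the normal reference distribution, and the synthetic samples are assumptions chosen for the example.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, used here as the reference distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """Exact one-sample KS statistic: sup |F_n(x) - F(x)|.

    Needs the full sorted sample, so memory grows with the sample size.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x,
        # so both sides of the jump are checked.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(0)
matching = [rng.gauss(0, 1) for _ in range(2000)]  # drawn from the reference
shifted = [rng.gauss(1, 1) for _ in range(2000)]   # mean shifted by one sigma

d_match = ks_statistic(matching, normal_cdf)
d_shift = ks_statistic(shifted, normal_cdf)
print(f"matching sample: D = {d_match:.4f}")
print(f"shifted sample:  D = {d_shift:.4f}")
```

A small statistic indicates that the sample is close to the reference distribution; the shifted sample yields a much larger value.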

2021

Text documents streams with improved incremental similarity

Authors
Sarmento, RP; Cardoso, DO; Dearo, K; Brazdil, P; Gama, J;

Publication
SOCIAL NETWORK ANALYSIS AND MINING

Abstract
There has been a significant effort by the research community to provide methods for organizing documentation with the help of Information Retrieval methods. In this paper, we present several experiments with stream analysis methods to explore streams of text documents. This paper also presents possible architectures for Text Document Stream Organization, using incremental algorithms such as Incremental Sparse TF-IDF and Incremental Similarity. Our results show that this architecture achieves significant improvements in the efficiency of grouping similar documents. These improvements are important since it is well known that analysing large amounts of text is a high-dimensional and complex subject of study in the data analysis area.
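The general idea of maintaining TF-IDF statistics incrementally and comparing documents with a sparse similarity can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Incremental Sparse TF-IDF or Incremental Similarity algorithms: the class name, the whitespace tokenizer, the smoothed IDF formula, and the example documents are all hypothetical.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Keeps document frequencies up to date as documents arrive."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.term_counts = []      # one Counter of raw term counts per document

    def add(self, text):
        """Register a new document and return its id."""
        counts = Counter(text.lower().split())
        self.term_counts.append(counts)
        self.doc_freq.update(counts.keys())
        self.n_docs += 1
        return len(self.term_counts) - 1

    def vector(self, doc_id):
        """Sparse TF-IDF vector (dict) using the current corpus statistics."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        return {
            term: (c / total)
            * (math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0)
            for term, c in counts.items()
        }

    def similarity(self, a, b):
        """Cosine similarity computed only over the terms the vectors share."""
        va, vb = self.vector(a), self.vector(b)
        dot = sum(w * vb[t] for t, w in va.items() if t in vb)
        norm_a = math.sqrt(sum(w * w for w in va.values()))
        norm_b = math.sqrt(sum(w * w for w in vb.values()))
        return dot / (norm_a * norm_b)


index = IncrementalTfidf()
d0 = index.add("incremental text stream clustering")
d1 = index.add("text stream clustering methods")
d2 = index.add("bayesian networks predict voting")

print(index.similarity(d0, d1))  # documents sharing three terms
print(index.similarity(d0, d2))  # documents sharing no terms
```

Because the vectors are stored as dictionaries, each similarity touches only the terms that actually occur, which is what keeps the computation cheap when the vocabulary is large.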

2021

Hyper-parameter Optimization for Latent Spaces

Authors
Veloso, B; Caroprese, L; Konig, M; Teixeira, S; Manco, G; Hoos, HH; Gama, J;

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT III

Abstract
We present an online optimization method for time-evolving data streams that can automatically adapt the hyper-parameters of an embedding model. More specifically, we employ the Nelder-Mead algorithm, which uses a set of heuristics to produce and exploit several potentially good configurations, from which the best one is selected and deployed. This step is repeated whenever the distribution of the data changes. We evaluate our approach on streams of real-world as well as synthetic data, where the latter is generated in such a way that its characteristics change over time (concept drift). Overall, we achieve good performance in terms of accuracy compared to state-of-the-art AutoML techniques.
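The Nelder-Mead heuristics mentioned above (reflection, expansion, contraction, and shrinking of a simplex of candidate configurations) can be sketched in plain Python. This is a textbook batch variant under stated assumptions, not the authors' online method: there is no drift detection or re-triggering, and `toy_loss` is a hypothetical stand-in for the validation loss of a model with two hyper-parameters.

```python
def nelder_mead(f, x0, step=0.5, iters=200,
                alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """Minimise f starting from x0 using the standard Nelder-Mead moves."""
    n = len(x0)
    # Initial simplex: x0 plus one point offset along each axis.
    simplex = [list(x0)]
    for i in range(n):
        point = list(x0)
        point[i] += step
        simplex.append(point)

    for _ in range(iters):
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        # Centroid of every point except the worst one.
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + alpha * (c - w) for c, w in zip(centroid, worst)]

        if f(best) <= f(reflected) < f(second_worst):
            simplex[-1] = reflected
        elif f(reflected) < f(best):
            # Reflection found a new best: try going further.
            expanded = [c + gamma * (r - c) for c, r in zip(centroid, reflected)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        else:
            # Reflection was poor: contract towards the centroid.
            contracted = [c + rho * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every point towards the best one.
                simplex = [best] + [
                    [b + sigma * (v - b) for b, v in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


def toy_loss(h):
    """Hypothetical loss surface with its minimum at (0.3, 0.7)."""
    a, b = h
    return (a - 0.3) ** 2 + (b - 0.7) ** 2


best_config = nelder_mead(toy_loss, [0.0, 0.0])
print(best_config)
```

In an online setting like the one the abstract describes, a search of this kind would be rerun when the data distribution changes; how that change is detected is outside the scope of this sketch.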

2021

Dynamic Topic Modeling Using Social Network Analytics

Authors
Tabassum, S; Gama, J; Azevedo, P; Teixeira, L; Martins, C; Martins, A;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2021)

Abstract
Topic modeling or inference has been one of the well-known problems in the area of text mining. It deals with the automatic categorisation of words or documents into similarity groups, also known as topics. On most social media platforms, such as Twitter, Instagram, and Facebook, hashtags are used to define the content of posts. Therefore, modelling hashtags helps in categorising posts as well as analysing user preferences. In this work, we address this problem for hashtags that stream in real time. Our approach encompasses a graph of hashtags, dynamic sampling, and modularity-based community detection over data from a popular social media engagement application. Further, we analysed the topic clusters' structure and quality using empirical experiments. The results unveil latent semantic relations between hashtags and also show frequent hashtags in a cluster. Moreover, in this approach, words in different languages are treated as synonyms. We also observed top trending topics and correlated clusters.
