Publications - INESC TEC

Publications

Publications by João Gama

2019

Learning under Concept Drift: A Review

Authors
Lu, J; Liu, AJ; Dong, F; Gu, F; Gama, J; Zhang, GQ

Publication
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Abstract
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding, and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aiming at handling concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.


2021

Statistically Robust Evaluation of Stream-Based Recommender Systems

Authors
Vinagre, J; Jorge, AM; Rocha, C; Gama, J

Publication
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Abstract
Online incremental models for recommendation are nowadays pervasive in both industry and academia. However, there is not yet a standard evaluation methodology for the algorithms that maintain such models. Moreover, online evaluation methodologies available in the literature generally fall short on the statistical validation of results, since this validation is not trivially applicable to stream-based algorithms. We propose a k-fold validation framework for the pairwise comparison of recommendation algorithms that learn from user feedback streams, using prequential evaluation. Our proposal enables continuous statistical testing on adaptive-size sliding windows over the outcome of the prequential process, allowing practitioners and researchers to make decisions in real time based on solid statistical evidence. We present a set of experiments to gain insights on the sensitivity and robustness of two statistical tests, McNemar's and Wilcoxon signed rank, in a streaming data environment. Our results show that besides allowing a real-time, fine-grained online assessment, the online versions of the statistical tests are at least as robust as the batch versions, and definitely more robust than a simple prequential single-fold approach.
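
As a rough illustration of the idea in this abstract, the sketch below runs a prequential (test-then-train) loop over a stream and recomputes McNemar's statistic on a sliding window of paired outcomes. It is not the authors' framework: the fixed-size window stands in for the adaptive-size windows the abstract mentions, there is no k-fold split, and the two toy models (`MajorityClass`, `LastLabel`) and all function names are hypothetical placeholders.

```python
from collections import deque


def mcnemar_statistic(outcomes):
    """McNemar's chi-square statistic (no continuity correction).

    outcomes: iterable of (a_correct, b_correct) boolean pairs.
    Only discordant pairs (exactly one model correct) contribute.
    """
    b = sum(1 for a_ok, b_ok in outcomes if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in outcomes if b_ok and not a_ok)
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)


class MajorityClass:
    """Toy model (assumption): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


class LastLabel:
    """Toy model (assumption): predicts the previously observed label."""

    def __init__(self):
        self.last = None

    def predict(self, x):
        return self.last

    def learn(self, x, y):
        self.last = y


def prequential_compare(stream, model_a, model_b, window=50):
    """Test each example first, then train on it; track a windowed statistic."""
    recent = deque(maxlen=window)  # fixed-size window (simplification)
    stats = []
    for x, y in stream:
        a_ok = model_a.predict(x) == y  # test ...
        b_ok = model_b.predict(x) == y
        recent.append((a_ok, b_ok))
        stats.append(mcnemar_statistic(recent))
        model_a.learn(x, y)             # ... then train
        model_b.learn(x, y)
    return stats
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to a difference that is significant at the 0.05 level, so the returned list can be read as a running signal of whether the two models differ on recent data.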


2019

Credit scoring for microfinance using behavioral data in emerging markets

Authors
Ruiz, S; Gomes, P; Rodrigues, L; Gama, J

Publication
INTELLIGENT DATA ANALYSIS

Abstract
Emerging markets contain the vast majority of the world's population. Despite the enormous number of inhabitants, these markets still lack a proper finance infrastructure. One of the main difficulties felt by customers is access to loans. This limitation arises from the fact that most customers usually lack a verifiable credit history. As such, traditional banks are unable to provide loans. This paper proposes credit scoring modeling based on non-traditional data, acquired from smartphones, for loan classification processes. We use Logistic Regression (LR) and Support Vector Machine (SVM) models, which are the top linear models in traditional banking. We then compare transforming the training datasets into boolean indicators against categorization using Weight of Evidence (WoE). Our models surpassed the performance of the manual loan application selection process, improving the approval rate and decreasing the overdue rate. Compared to the baseline, the loans approved by meeting the criteria of the SVM model presented a decreased overdue rate. At the same time, using the score generated by an SVM model we were able to grant more loans. This paper shows that credit scoring can be useful in emerging markets. The non-traditional data can be used to build robust algorithms that can identify good borrowers as in traditional banking.
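
For readers unfamiliar with Weight of Evidence, the sketch below shows one common way to compute it for a categorical feature. It is an illustration only, not the paper's pipeline: the sign convention (log of the good share over the bad share), the additive smoothing, and the function and parameter names are all assumptions.

```python
import math
from collections import Counter


def weight_of_evidence(categories, is_good, smoothing=0.5):
    """Return {level: WoE} for one categorical feature.

    WoE(level) = ln( share of good borrowers in level
                     / share of bad borrowers in level ).
    The smoothing term (an assumption) avoids log(0) and division by zero
    for levels that contain only good or only bad borrowers.
    """
    good = Counter(c for c, g in zip(categories, is_good) if g)
    bad = Counter(c for c, g in zip(categories, is_good) if not g)
    levels = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(levels)
    total_bad = sum(bad.values()) + smoothing * len(levels)
    return {
        level: math.log(
            ((good[level] + smoothing) / total_good)
            / ((bad[level] + smoothing) / total_bad)
        )
        for level in levels
    }
```

Under this convention a positive value marks a level where good borrowers are over-represented, and a negative value marks one where bad borrowers are. Each category is replaced by a single number, whereas boolean indicators add one column per category.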


2019

Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks

Authors
Cordeiro, M; Sarmento, RP; Brazdil, P; Kimura, M; Gama, J

Publication
Complex Networks and Their Applications VIII: Volume 1, Proceedings of the Eighth International Conference on Complex Networks and Their Applications (COMPLEX NETWORKS 2019), Lisbon, Portugal, December 10-12, 2019.

Abstract
Discovering communities in a network is a fundamental and important problem in complex networks. Finding the most influential actors among their peers is a major task. On the one hand, studies on community detection ignore the influence of actors and communities; on the other hand, ignoring the hierarchy and community structure of the network neglects the actor or community influence. We bridge this gap by combining a dynamic community detection method with a dynamic centrality measure. The proposed enhanced dynamic hierarchical community detection method computes centrality for nodes and aggregated communities and selects each community's representative leader using the ranked centrality of every node belonging to the community. This method is then able to unveil, track, and measure the importance of the main actors and of the network's intra- and inter-community structural hierarchies based on a centrality measure. The empirical analysis, performed using two temporal networks, shows that the method is able to find and track community leaders in evolving networks. © 2020, Springer Nature Switzerland AG.


2019

Adapting ClusTree for more challenging data stream environments

Authors
Zgraja, J; Moulton, RH; Gama, J; Kasprzak, A; Wozniak, M

Publication
JOURNAL OF INTELLIGENT & FUZZY SYSTEMS

Abstract
Data stream mining seeks to extract useful information from quickly-arriving, infinitely-sized and evolving data streams. Although these challenges have been addressed throughout the literature, none of them can be considered "solved." We contribute to closing this gap for the task of data stream clustering by proposing two modifications to the well-known ClusTree data stream clustering algorithm: pruning unused branches and detecting concept drift. Our experimental results show the difficulty in tackling these aspects of data stream mining and the sensitivity of stream mining algorithms to parameter values. We conclude that further research is required to better equip stream learners for the data stream clustering task.


2019

Correction to: Database Systems for Advanced Applications

Authors
Li, G; Yang, J; Gama, J; Natwichai, J; Tong, Y

Publication
Database Systems for Advanced Applications - Lecture Notes in Computer Science

Abstract