Publicacoes - INESC TEC

Publicações

Publicações por João Gama

2015

Collaborative filtering with recency-based negative feedback

Autores
Vinagre, J; Jorge, AM; Gama, J;

Publicação
30TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, VOLS I AND II

Abstract
Many online communities and services continuously generate data that can be used by recommender systems. When explicit ratings are not available, rating prediction algorithms are not directly applicable. Instead, data consists of positive-only user-item interactions, and the task is therefore not to predict ratings, but rather to predict good items to recommend - item prediction. One particular challenge of positive-only data is how to interpret absent user-item interactions. These can either be seen as negative or as unknown preferences. In this paper, we propose a recency-based scheme to perform negative preference imputation in an incremental matrix factorization algorithm designed for streaming data. Our results show that this approach substantially improves the accuracy of the baseline method, outperforming both classic and state-of-the-art algorithms.

FecharLer Abstract

2016

Concept Neurons - Handling Drift Issues for Real-Time Industrial Data Mining

Autores
Moreira Matias, L; Gama, J; Mendes Moreira, J;

Publicação
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2016, PT III

Abstract
Learning from data streams is a challenge faced by data science professionals from multiple industries. Most of them struggle hardly on applying traditional Machine Learning algorithms to solve these problems. It happens so due to their high availability on ready-to-use software libraries on big data technologies (e.g. SparkML). Nevertheless, most of them cannot cope with the key characteristics of this type of data such as high arrival rate and/or non-stationary distributions. In this paper, we introduce a generic and yet simplistic framework to fill this gap denominated Concept Neurons. It leverages on a combination of continuous inspection schemas and residual-based updates over the model parameters and/or the model output. Such framework can empower the resistance of most of induction learning algorithms to concept drifts. Two distinct and hence closely related flavors are introduced to handle different drift types. Experimental results on successful distinct applications on different domains along transportation industry are presented to uncover the hidden potential of this methodology.

FecharLer Abstract

2016

Detection of Fraud Symptoms in the Retail Industry

Autores
Ribeiro, RP; Oliveira, R; Gama, J;

Publicação
ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA 2016

Abstract
Data mining is one of the most effective methods for fraud detection. This is highlighted by 25% of organizations that have suffered from economic crimes [1]. This paper presents a case study using real-world data from a large retail company. We identify symptoms of fraud by looking for outliers. To identify the outliers and the context where outliers appear, we learn a regression tree. For a given node, we identify the outliers using the set of examples covered at that node, and the context as the conjunction of the conditions in the path from the root to the node. Surprisingly, at different nodes of the tree, we observe that some outliers disappear and new ones appear. From the business point of view, the outliers that are detected near the leaves of the tree are the most suspicious ones. These are cases of difficult detection, being observed only in a given context, defined by a set of rules associated with the node.

FecharLer Abstract

2016

Dynamic community detection in evolving networks using locality modularity optimization

Autores
Cordeiro, M; Sarmento, RP; Gama, J;

Publicação
SOCIAL NETWORK ANALYSIS AND MINING

Abstract
The amount and the variety of data generated by today's online social and telecommunication network services are changing the way researchers analyze social networks. Facing fast evolving networks with millions of nodes and edges are, among other factors, its main challenge. Community detection algorithms in these conditions have also to be updated or improved. Previous state-of-the-art algorithms based on the modularity optimization (i.e. Louvain algorithm), provide fast, efficient and robust community detection on large static networks. Nonetheless, due to the high computing complexity of these algorithms, the use of batch techniques in dynamic networks requires to perform network community detection for the whole network in each one of the evolution steps. This fact reveals to be computationally expensive and unstable in terms of tracking of communities. Our contribution is a novel technique that maintains the community structure always up-to-date following the addition or removal of nodes and edges. The proposed algorithm performs a local modularity optimization that maximizes the modularity gain function only for those communities where the editing of nodes and edges was performed, keeping the rest of the network unchanged. The effectiveness of our algorithm is demonstrated with the comparison to other state-of-the-art community detection algorithms with respect to Newman's Modularity, Modularity with Split Penalty, Modularity Density, number of detected communities and running time.

FecharLer Abstract

2015

EigenEvent: An algorithm for event detection from complex data streams in syndromic surveillance

Autores
Fanaee T, H; Gama, J;

Publicação
INTELLIGENT DATA ANALYSIS

Abstract
Syndromic surveillance systems continuously monitor multiple pre-diagnostic daily streams of indicators from different regions with the aim of early detection of disease outbreaks. The main objective of these systems is to detect outbreaks hours or days before the clinical and laboratory confirmation. The type of data that is being generated via these systems is usually multivariate and seasonal with spatial and temporal dimensions. The algorithm What's Strange About Recent Events (WSARE) is the state-of-the-art method for such problems. It exhaustively searches for contrast sets in the multivariate data and signals an alarm when find statistically significant rules. This bottom-up approach presents a much lower detection delay comparing the existing top-down approaches. However, WSARE is very sensitive to the small-scale changes and subsequently comes with a relatively high rate of false alarms. We propose a new approach called EigenEvent that is neither fully top-down nor bottom-up. In this method, we instead of top-down or bottom-up search, track changes in data correlation structure via eigenspace techniques. This new methodology enables us to detect both overall changes (via eigenvalue) and dimension-level changes (via eigenvectors). Experimental results on hundred sets of benchmark data reveals that EigenEvent presents a better overall performance comparing state-of-the-art, in particular in terms of the false alarm rate.

FecharLer Abstract

2015

Eigenspace method for spatiotemporal hotspot detection

Autores
Fanaee T, H; Gama, J;

Publicação
EXPERT SYSTEMS

Abstract
Hotspot detection aims at identifying sub-groups in the observations that are unexpected, with respect to some baseline information. For instance, in disease surveillance, the purpose is to detect sub-regions in spatiotemporal space, where the count of reported diseases (e.g. cancer) is higher than expected, with respect to the population. The state-of-the-art method for this kind of problem is the space-time scan statistics, which exhaustively search the whole space through a sliding window looking for significant spatiotemporal clusters. Space-time scan statistics makes some restrictive assumptions about the distribution of data, the shape of the hotspots and the quality of data, which can be unrealistic for some non-traditional data sources. A novel methodology called EigenSpot is proposed where instead of an exhaustive search over the space, it tracks the changes in a space-time occurrences structure. The new approach does not only present much more computational efficiency but also makes no assumption about the data distribution, hotspot shape or the data quality. The principal idea is that with the joint combination of abnormal elements in the principal spatial and the temporal singular vectors, the location of hotspots in the spatiotemporal space can be approximated. The experimental evaluation, both on simulated and real data sets, reveals the effectiveness of the proposed method.

FecharLer Abstract