Publications - INESC TEC

Publications by João Gama

2008

Improving the performance of an incremental algorithm driven by error margins

Authors
del Campo Avila, J; Ramos Jimenez, G; Gama, J; Morales Bueno, R;

Publication
Intelligent Data Analysis

Abstract
Classification is a highly relevant task within the data analysis field. It is not a trivial task, and different difficulties can arise depending on the nature of the problem. These difficulties can become worse when the datasets are too large or when new information can arrive at any time. Incremental learning is an approach that can be used to deal with the classification task in these cases. It must alleviate, or solve, the problem of limited time and memory resources. One emerging approach uses concentration bounds to ensure that decisions are made when enough information supports them. IADEM is one of the most recent algorithms that use this approach. The aim of this paper is to improve the performance of this algorithm in different ways: simplifying the induced models, adding the ability to deal with continuous data, improving the detection of noise, selecting new criteria for evolving the model, including the use of more powerful prediction techniques, etc. Besides these new properties, the new system, IADEM-2, preserves the ability to obtain a performance similar to standard learning algorithms independently of the dataset size, and it can incorporate new information as the basic algorithm does: using a short time per example.
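
The idea of deciding only when enough information supports the decision can be illustrated with the Hoeffding inequality, one common concentration bound. The sketch below is not the IADEM or IADEM-2 procedure; the function names, the choice of bound, and the decision rule are assumptions made for illustration only.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Half-width epsilon such that, with probability at least 1 - delta,
    the true mean lies within epsilon of the mean of n observations
    bounded in an interval of width value_range (Hoeffding inequality)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def enough_evidence(best: float, second: float, value_range: float,
                    delta: float, n: int) -> bool:
    """Hypothetical decision rule: commit to the best candidate only when
    its observed advantage over the runner-up exceeds the error margin."""
    return (best - second) > hoeffding_epsilon(value_range, delta, n)


if __name__ == "__main__":
    # The margin shrinks as more examples are seen.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_epsilon(1.0, 0.05, n), 4))
```

Under these assumptions the margin decreases with the square root of the number of examples, so a small observed difference between two candidates eventually becomes decisive if it persists.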

2000

Iterative Bayes

Authors
Gama, J;

Publication
Intelligent Data Analysis

Abstract
Naive Bayes is a well-known and well-studied algorithm in both statistics and machine learning. Bayesian learning algorithms represent each concept with a single probabilistic summary. In this paper we present an iterative approach to naive Bayes. Iterative Bayes begins with the distribution tables built by naive Bayes. Those tables are iteratively updated in order to improve the class probability distribution associated with each training example. Experimental evaluation of Iterative Bayes on 27 benchmark datasets shows consistent gains in accuracy. Moreover, the update schema can take costs into account, making the algorithm cost-sensitive. Unlike stratification, it is applicable to any number of classes and to arbitrary cost matrices. An interesting side effect of our algorithm is that it proves robust to attribute dependencies.

2008

Schema matching on streams with accuracy guarantees

Authors
Gama, J; Aguilar Ruiz, J; Klinkenberg, R;

Publication
Intelligent Data Analysis

Abstract
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing all database records, which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, yet is guaranteed to find a matching that closely approximates the matching that would be obtained if the entire stream were processed. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.

2007

Pursuing the best ECOC dimension for multiclass problems

Authors
Pimenta, E; Gama, J; Carvalho, A;

Publication
Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2007

Abstract
Recent work highlights advantages in decomposing multiclass decision problems into multiple binary problems. Several strategies have been proposed for this decomposition. The most frequently investigated are All-vs-All, One-vs-All and Error-Correcting Output Codes (ECOC). ECOC are binary words (codewords) that can be adapted for use in classification problems. They must, however, comply with some specific constraints. The codewords can have several dimensions for each number of classes to be represented. These dimensions grow exponentially with the number of classes of the multiclass problem. Two methods to choose the dimension of an ECOC, which assure a good trade-off between redundancy and error correction capacity, are proposed in this paper. The methods are evaluated on a set of benchmark classification problems. Experimental results show that they are competitive with conventional multiclass decomposition methods.
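
The link between codeword length and error correction can be sketched with a small example. The code below is not one of the two dimension-selection methods proposed in the paper; the 7-bit codebook for four classes, the class labels, and the function names are hypothetical and chosen only to show nearest-codeword decoding by Hamming distance.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple

# Hypothetical codebook: one 7-bit codeword per class. Each bit position
# corresponds to one binary classifier in the decomposition.
CODEBOOK: Dict[str, Tuple[int, ...]] = {
    "c1": (1, 1, 1, 1, 1, 1, 1),
    "c2": (0, 0, 0, 0, 1, 1, 1),
    "c3": (0, 0, 1, 1, 0, 0, 1),
    "c4": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook: Dict[str, Tuple[int, ...]]) -> int:
    """Smallest pairwise Hamming distance between codewords; a code with
    minimum distance d can correct up to (d - 1) // 2 bit errors."""
    return min(hamming(u, v) for u, v in combinations(codebook.values(), 2))


def decode(outputs: Sequence[int], codebook: Dict[str, Tuple[int, ...]]) -> str:
    """Return the class whose codeword is closest to the binary outputs."""
    return min(codebook, key=lambda label: hamming(outputs, codebook[label]))


if __name__ == "__main__":
    noisy = (0, 0, 1, 1, 0, 0, 0)  # codeword of c3 with the last bit flipped
    print(min_distance(CODEBOOK), decode(noisy, CODEBOOK))
```

In this sketch a longer codeword adds redundancy, which is what allows a single wrong binary classifier to be tolerated, at the cost of training more classifiers.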

1999

Linear tree

Authors
Gama, J; Brazdil, P;

Publication
Intelligent Data Analysis

Abstract
In this paper we present Ltree, a system for propositional supervised learning. Ltree is able to define decision surfaces both orthogonal and oblique to the axes defined by the attributes of the input space. This is done by combining a decision tree with a linear discriminant by means of constructive induction. At each decision node, Ltree defines a new instance space by inserting new attributes that are projections of the examples that fall at this node onto the hyperplanes given by a linear discriminant function. This new instance space is propagated down through the tree. Tests based on those new attributes are oblique with respect to the original input space. Ltree is a probabilistic tree in the sense that it outputs a class probability distribution for each query example. The class probability distribution is computed at learning time, taking into account the different class distributions on the path from the root to the current node. We have carried out experiments on twenty-one benchmark datasets and compared our system with other well-known decision tree systems (orthogonal and oblique) such as C4.5, OC1, LMDT, and CART. On these datasets we have observed that our system has advantages in accuracy and learning times at statistically significant confidence levels.

2010

Drift Severity Metric

Authors
Kosina, P; Gama, J; Sebastiao, R;

Publication
ECAI 2010 - 19th European Conference on Artificial Intelligence

Abstract
Concept drift in data is usually considered only as abrupt or gradual, thus referring to the speed of change. Such a simple distinction by speed is sufficient for most problems, but there might be situations for which a finer representation would be of use. This paper further studies the phenomenon of concept drift and introduces a simple measure that is relevant to the speed and amount of change between different concepts.
