Publications

Publications by Alípio Jorge

2006

Design of an end-to-end method to extract information from tables

Authors
Costa e Silva, A; Jorge, AM; Torgo, L;

Publication
INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION

Abstract
This paper plans an end-to-end method for extracting information from tables embedded in documents; input format is ASCII, to which any richer format can be converted, preserving all textual and much of the layout information. We start by defining table. Then we describe the steps involved in extracting information from tables and analyse table-related research to place the contribution of different authors, find the paths research is following, and identify issues that are still unsolved. We then analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells/columns/rows. We proceed to design our own end-to-end method, where there is a higher interaction between different steps; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally, we explore how the actual interpretation of the table not only allows inferring the accuracy of the overall extraction process but also contributes to actually improving its quality. In order to do so, we believe interpretation has to consider context-specific knowledge; we explore how the addition of this knowledge can be made in a plug-in/out manner, such that the overall method will maintain its operability in different contexts.


2012

Finding interesting contexts for explaining deviations in bus trip duration using distribution rules

Authors
Jorge, AM; Mendes Moreira, J; De Sousa, JF; Soares, C; Azevedo, PJ;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
In this paper we study the deviation of bus trip duration and its causes. Deviations are obtained by comparing scheduled times against actual trip duration and are either delays or early arrivals. We use distribution rules, a kind of association rules that may have continuous distributions on the consequent. Distribution rules allow the systematic identification of particular conditions, which we call contexts, under which the distribution of trip time deviations differs significantly from the overall deviation distribution. After identifying specific causes of delay, the bus company's operational managers can make adjustments to the timetables, increasing punctuality without disrupting the service. © Springer-Verlag Berlin Heidelberg 2012.


2012

HCAC: Semi-supervised hierarchical clustering using confidence-based active learning

Authors
Nogueira, BM; Jorge, AM; Rezende, SO;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
Despite its importance, hierarchical clustering has been little explored for semi-supervised algorithms. In this paper, we address the problem of semi-supervised hierarchical clustering by using an active learning solution with cluster-level constraints. This active learning approach is based on a new concept of merge confidence in agglomerative clustering. When there is low confidence in a cluster merge, the user is queried and provides a cluster-level constraint. The proposed method is compared with an unsupervised algorithm (average-link) and two state-of-the-art semi-supervised algorithms (pairwise constraints and Constrained Complete-Link). Results show that our algorithm tends to be better than the two semi-supervised algorithms and can achieve a significant improvement when compared to the unsupervised algorithm. Our approach is particularly useful when the number of clusters is high, which is the case in many real problems. © 2012 Springer-Verlag Berlin Heidelberg.


2011

An Exploratory Study on the Impact of Temporal Features on the Classification and Clustering of Future-Related Web Documents

Authors
Campos, R; Dias, G; Jorge, A;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
In the last few years, a huge amount of temporal written information has become widely available on the Internet with the advent of forums, blogs and social networks. This gave rise to a new challenging problem called future retrieval, which consists of extracting future temporal information, that is known in advance, from web sources in order to answer queries that combine text of a future temporal nature. This paper aims to confirm whether web snippets can be used to form an intelligent web that can detect future expected events when their dates are already known. Moreover, the objective is to identify the nature of future texts and understand how these temporal features affect the classification and clustering of the different types of future-related texts: informative texts, scheduled texts and rumor texts. We have conducted a set of comprehensive experiments and the results show that web documents are a valuable source of future data that can be particularly useful in identifying and understanding the future temporal nature of a given implicit temporal query.


2009

Efficient Coverage of Case Space with Active Learning

Authors
Escudeiro, NF; Jorge, AM;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS

Abstract
Collecting and annotating exemplary cases is a costly and critical task that is required in early stages of any classification process. Reducing labeling cost without degrading accuracy calls for a compromise solution which may be achieved with active learning. Common active learning approaches focus on accuracy and assume the availability of a pre-labeled set of exemplary cases covering all classes to learn. This assumption does not necessarily hold. In this paper we study the capabilities of a new active learning approach, d-Confidence, in rapidly covering the case space when compared to the traditional active learning confidence criterion, when the representativeness assumption is not met. Experimental results show that d-Confidence reduces the number of queries required to achieve complete class coverage and tends to improve or maintain classification error.


2009

Analysis and Forecast of Team Formation in the Simulated Robotic Soccer Domain

Authors
Almeida, R; Reis, LP; Jorge, AM;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS

Abstract
This paper proposes a classification approach to identify the team's formation (formation means the strategic layout of the players in the field) in the robotic soccer domain for the two-dimensional (2D) simulation league. It is a tool for decision support that allows the coach to understand the strategy of the opponent. To reach that goal we employ Data Mining classification techniques. To understand the simulated robotic soccer domain we briefly describe the Simulation system, some related work and the use of Data Mining techniques for the detection of formations. In order to perform a robotic soccer match with different formations we develop a way to configure the formations in a training base team (FC Portugal) and a data preparation process. The paper describes the base team and the test team used and the respective configuration process. After the matches between test teams the data is subjected to a reduction process taking into account the players' position in the field given the collective. In the modeling stage appropriate learning algorithms were selected. In the solution analysis, the error rate (% of incorrectly classified instances), together with Student's t-test for paired samples, was selected as the evaluation measure. Experimental results show that it is possible to automatically identify the formations used by the base team (FC Portugal) in distinct matches against different opponents, using Data Mining techniques. The experimental results also show that the SMO (Sequential Minimal Optimization) learning algorithm has the best performance.
