Publications - INESC TEC

Publications by Carla Lopes

2019

Knowledge Graph Implementation of Archival Descriptions Through CIDOC-CRM

Authors
Koch, I; Freitas, N; Ribeiro, C; Lopes, CT; da Silva, JR

Publication
DIGITAL LIBRARIES FOR OPEN KNOWLEDGE, TPDL 2019

Abstract
Archives have well-established description standards, namely the ISAD(G) and ISAAR(CPF) with a hierarchical structure adapted to the nature of archival assets. However, as archives connect to a growing diversity of data, they aim to make their representations more apt to the so-called linked data cloud. The corresponding move from hierarchical, ISAD-conforming descriptions to graph counterparts requires state-of-the-art technologies, data models and vocabularies. Our approach addresses this problem from two perspectives. The first concerns the data model and description vocabularies, as we adopt and build upon the CIDOC-CRM standard. The second is the choice of technologies to support a knowledge graph, including a graph database and an Object Graph Mapping library. The case study is the Portuguese National Archives, Torre do Tombo, and the overall goal is to build a CIDOC-CRM-compliant system for document description and retrieval, to be used by professionals and the public. The early stages described here include the design of the core data model for archival records represented as the ArchOnto ontology and its embodiment in the ArchGraph knowledge graph. The goal of a semantic archival information system will be pursued in the migration of existing records to the richer representation and the development of applications supported on the graph.


2019

Readability of web content: An analysis by topic

Authors
Antunes, H; Lopes, CT

Publication
2019 14TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Readability is determined by the characteristics of a text that influence its understanding. The web is composed of content on various topics, and the results retrieved in the top positions by the main search engines are expected to be those with the highest number of views. In this study, we analyzed the readability of web pages according to the topic to which they belong and their position in the search results. For that, we collected the top-20 results retrieved by Google for 23,779 queries from 20 topics and used several readability metrics. The results of the analysis showed that content from organizations (like colleges and other institutions) and health-related content have lower readability values. The categories Games and Home are on the opposite side. For the categories identified as having lower readability, tools can be developed that help users understand their content. We also found that top-ranked pages have higher readability values. One can conclude that, directly or indirectly, readability is a factor that seems to be considered by the Google search engine or to have an influence on page popularity.


2019

Is it a lay or medico-scientific concept? Automatic classification in two languages

Authors
Santos, PM; Lopes, CT

Publication
2019 14TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Searching for health information is the third most popular activity on the Internet. There is evidence that query suggestions in lay and medico-scientific terminology improve health information retrieval by people who are not health professionals. Developing systems that suggest queries in these terminologies requires knowing whether concepts are lay or medico-scientific. In this paper, we propose and compare approaches to compute the degree of association of a concept with lay and medico-scientific terminology. We use different thesauri for this purpose and use cosine similarity to measure the closeness of concepts to subsets of those thesauri. The evaluation of our approaches uses an existing glossary containing concepts in both terminologies in English and Portuguese, and a set of queries submitted by users and classified by health professionals as lay or medico-scientific. We concluded that the best method to classify a concept uses the CHV vocabulary as a subset.
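As a rough illustration of the cosine-similarity step mentioned in the abstract, the sketch below compares a concept against two small term sets. The abstract does not say how concepts and thesaurus subsets are turned into vectors, so the bag-of-words representation, the function names, and the toy vocabularies are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Assumed representation: lowercase whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counter objects)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical toy subsets standing in for lay and medico-scientific vocabularies.
lay_subset = bag_of_words("heart attack high blood pressure stomach ache")
medical_subset = bag_of_words("myocardial infarction hypertension abdominal pain")

concept = bag_of_words("heart attack")
lay_score = cosine_similarity(concept, lay_subset)
medical_score = cosine_similarity(concept, medical_subset)
label = "lay" if lay_score >= medical_score else "medico-scientific"
```

Comparing the two scores gives a simple degree of association with each terminology; a real system would need richer vectors than raw token counts.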


2019

Normalized Google Distance in the identification and characterization of health queries

Authors
Lopes, CT; Moura, D

Publication
2019 14TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Classifying web queries into a set of categories is a crucial task to better understand the user's intent behind a query, contextualize their search and provide more relevant results. However, web queries are typically short and ambiguous, making query classification a non-trivial problem. In this article, we present a new automatic approach for identifying and characterizing queries in the health domain. This method makes use of search engine counts through a semantic similarity measure called Normalized Google Distance (NGD), combined with Support Vector Machines, to classify queries into three dimensions: health-related, severity and semantic type. To evaluate our methods, we used two datasets in different languages, Portuguese and English, and built another for evaluating the last dimension. Overall, the results achieved were satisfactory. The most generic classification obtains better results than more specific ones. The NGD proved to be a valuable asset in query classification.
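For readers unfamiliar with the measure, the sketch below computes the Normalized Google Distance from hit counts using its standard definition, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The function name, the handling of zero counts, and the example numbers are assumptions for illustration; the abstract does not describe how the counts were obtained or how NGD values were fed to the classifier.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from search-engine hit counts.

    fx, fy: number of pages containing term x and term y respectively
    fxy:    number of pages containing both terms
    n:      (estimated) total number of indexed pages
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumed convention: terms that never occur, or never co-occur,
        # are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Made-up counts: two terms that always appear together are at distance 0.
same = normalized_google_distance(1000, 1000, 1000, 1_000_000)

# Made-up counts: terms that co-occur on only a tenth of their pages.
partial = normalized_google_distance(100, 100, 10, 10_000)
```

Smaller values mean the two terms tend to appear on the same pages, so a query with a low distance to health-related seed terms would be a candidate for the health-related class.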


2019

Analyzing the Adequacy of Readability Indicators to a Non-English Language

Authors
Antunes, H; Lopes, CT

Publication
Experimental IR Meets Multilinguality, Multimodality, and Interaction - 10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9-12, 2019, Proceedings

Abstract
Readability is a linguistic feature that indicates how difficult it is to read a text. Traditional readability formulas were made for the English language. This study evaluates their adequacy to the Portuguese language. We applied the traditional formulas to 10 parallel corpora. We verified that the Portuguese language had higher grade scores (less readability) in the formulas that use the number of syllables per word or the number of complex words per sentence. Formulas that use letters per word instead of syllables per word output similar grade scores. Considering this, we evaluated the correlation of the complex words in 65 Portuguese school books of 12 schooling years. We found that the concept of a complex word as a word with 4 or more syllables, instead of 3 or more syllables as originally used in traditional formulas applied to English texts, is more correlated with the grade of Portuguese school books. In the end, we adapted each traditional readability formula to the Portuguese language by performing a multiple linear regression on the same dataset of school books. © Springer Nature Switzerland AG 2019.
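The effect of the complex-word threshold can be seen in a small sketch. The abstract does not name the formulas studied, so the Gunning Fog index, 0.4 * (words per sentence + 100 * complex words / words), is used here only as a well-known example of a formula based on complex words. The syllable counter is a crude vowel-group heuristic and the sample text is invented; both are assumptions, and the regression-based adaptation described in the abstract is not reproduced.

```python
import re

# Vowels including common Portuguese accented forms (assumed, not exhaustive).
VOWELS = "aeiouyáàâãéêíóôõúü"


def count_syllables(word):
    """Crude heuristic: one syllable per run of consecutive vowels."""
    groups = re.findall(f"[{VOWELS}]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text, complex_threshold=3):
    """Gunning Fog index with a configurable syllable threshold for complex words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    complex_words = sum(1 for w in words if count_syllables(w) >= complex_threshold)
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))


sample = "O gato dorme. A universidade funciona."
score_english_rule = gunning_fog(sample, complex_threshold=3)
score_portuguese_rule = gunning_fog(sample, complex_threshold=4)
```

Raising the threshold from 3 to 4 syllables reduces the number of words counted as complex, which lowers the grade score for the same text.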


2018

InfoLabPM at TREC 2018 Precision Medicine Track

Authors
Ferreira, J; Lopes, CT

Publication
Proceedings of the Twenty-Seventh Text REtrieval Conference, TREC 2018, Gaithersburg, Maryland, USA, November 14-16, 2018

Abstract
This paper reports the participation of the InfoLab in the TREC Precision Medicine Track 2018. InfoLab is an informal group that brings together researchers with an interest in the information area and is located at the Faculty of Engineering of the University of Porto. The experiments made in this participation include query expansion approaches for the disease and gene concepts. The expansion of the disease terms was done using the Unified Medical Language System (UMLS). UMLS is a repository that provides the mapping between a large number of vocabularies. The gene terms were expanded using Ensembl. Ensembl provides a genome browser that maps genes to their synonyms. An additional layer was developed on top of Terrier to support the execution of a large batch of experiments. Multiple runs were evaluated in order to measure the influence of each expansion approach. © 2018 Copyright held by the owner/author(s).
