Publications - INESC TEC

Publications by HumanISE

2019

Characterizing and comparing Portuguese and English Wikipedia medicine-related articles

Authors
Domingues, G; Lopes, CT;

Publication
COMPANION OF THE WORLD WIDE WEB CONFERENCE (WWW 2019)

Abstract
Wikipedia is the largest online collaborative encyclopedia, containing information from a plethora of fields, including medicine. It has been shown that Wikipedia is one of the sites most visited by readers looking for information on this topic. This heavy reliance on Wikipedia for medical information drives research into the quality of its articles. In this work, we evaluate and compare the quality of medicine-related articles in the English and Portuguese Wikipedia. To do so, we use metrics such as authority, completeness, complexity, informativeness, consistency, currency and volatility, as well as domain-specific measurements. We conclude that the English articles score better than the Portuguese articles across most metrics.


2019

Readability of web content: An analysis by topic

Authors
Antunes, H; Lopes, CT;

Publication
2019 14TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Readability is determined by the characteristics of a text that influence its understanding. The web is composed of content on various topics, and the results retrieved in the top positions by the main search engines are expected to be those with the highest number of views. In this study, we analyzed the readability of web pages according to the topic to which they belong and their position in the search results. To do so, we collected the top-20 results retrieved by Google for 23,779 queries from 20 topics and applied several readability metrics. The analysis showed that content from organizations (such as colleges and other institutions) and health-related content have lower readability values, while the categories Games and Home are at the opposite end. For the categories identified as less readable, tools can be developed to help users understand their content. We also found that top-ranked pages have higher readability values. One can conclude that, directly or indirectly, readability is a factor that seems to be considered by the Google search engine or that has an influence on page popularity.


2019

Is it a lay or medico-scientific concept? Automatic classification in two languages

Authors
Santos, PM; Lopes, CT;

Publication
2019 14TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Searching for health information is the third most popular activity on the Internet. There is evidence that query suggestions in lay and medico-scientific terminology improve health information retrieval by people who are not health professionals. Developing systems that suggest queries in these terminologies requires knowing whether concepts are lay or medico-scientific. In this paper, we propose and compare approaches to compute the degree of association of a concept with lay and medico-scientific terminology. We use different thesauri for this purpose and use cosine similarity to measure the closeness of concepts to subsets of those thesauri. The evaluation of our approaches uses an existing glossary containing concepts in both terminologies in English and Portuguese, and a set of queries submitted by users and classified by health professionals as lay or medico-scientific. We concluded that the best method to classify a concept uses the CHV vocabulary as a subset.
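The cosine similarity step mentioned in the abstract can be illustrated with a minimal sketch. It assumes concepts and thesaurus subsets are represented as bag-of-words term-count vectors; the abstract does not say how the vectors are actually built, so `term_vector`, the sample terms, and the comparison at the end are hypothetical illustrations, not the authors' method.

```python
import math
from collections import Counter


def term_vector(terms):
    """Build a term-count vector (hypothetical representation) from a list of terms."""
    return Counter(term.lower() for term in terms)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


# Made-up example data: a concept and two tiny terminology subsets.
concept = term_vector(["heart", "attack"])
lay_subset = term_vector(["heart", "attack", "heart", "pain", "chest"])
scientific_subset = term_vector(["myocardial", "infarction", "cardiac", "arrest"])

lay_score = cosine_similarity(concept, lay_subset)
scientific_score = cosine_similarity(concept, scientific_subset)
label = "lay" if lay_score >= scientific_score else "medico-scientific"
```

Under these assumptions, a concept is labelled with whichever terminology subset it is closer to; a real system would need far richer vectors than these toy counts.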


2019

Normalized Google Distance in the identification and characterization of health queries

Authors
Lopes, CT; Moura, D;

Publication
2019 14TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Classifying web queries into a set of categories is a crucial task for understanding the user's intent behind a query, contextualizing the search and providing more relevant results. However, web queries are typically short and ambiguous, making query classification a non-trivial problem. In this article, we present a new automatic approach for identifying and characterizing queries in the health domain. This method uses search engine counts through a semantic similarity measure called Normalized Google Distance (NGD), combined with Support Vector Machines, to classify queries along three dimensions: health-related, severity and semantic type. To evaluate our methods, we used two datasets in different languages, Portuguese and English, and built another for evaluating the last dimension. Overall, the results achieved were satisfactory. The most generic classification obtains better results than the more specific ones. The NGD proved to be a valuable asset in query classification.
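For readers unfamiliar with the measure, the following is a minimal sketch of the standard Normalized Google Distance formula, computed from search engine hit counts. The function name, its parameters, and the sample counts are assumptions made for illustration; the abstract does not describe how the authors obtained counts or fed the distances to the classifier.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from hit counts.

    fx, fy: number of pages containing term x and term y, respectively
    fxy:    number of pages containing both terms
    n:      total number of pages indexed (an estimate in practice)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if n <= max(fx, fy):
        raise ValueError("n must exceed the individual hit counts")
    if fxy <= 0:
        return math.inf  # the terms never co-occur
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Made-up counts, chosen only to show the arithmetic.
identical_terms = normalized_google_distance(1000, 1000, 1000, 1_000_000)
loosely_related = normalized_google_distance(1000, 100, 10, 1_000_000)
```

Smaller values indicate terms that tend to co-occur. In a setup like the one described, distances between a query and a set of reference terms could serve as feature values for a Support Vector Machine, though that wiring is not shown here.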


2019

Analyzing the Adequacy of Readability Indicators to a Non-English Language

Authors
Antunes, H; Lopes, CT;

Publication
Experimental IR Meets Multilinguality, Multimodality, and Interaction - 10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9-12, 2019, Proceedings

Abstract
Readability is a linguistic feature that indicates how difficult it is to read a text. Traditional readability formulas were made for the English language. This study evaluates their adequacy for the Portuguese language. We applied the traditional formulas to 10 parallel corpora. We verified that the Portuguese language had higher grade scores (lower readability) in the formulas that use the number of syllables per word or the number of complex words per sentence. Formulas that use letters per word instead of syllables per word output similar grade scores. Considering this, we evaluated the correlation of complex words with grade level in 65 Portuguese school books covering 12 schooling years. We found that defining a complex word as a word with 4 or more syllables, instead of 3 or more syllables as originally used in traditional formulas applied to English texts, correlates better with the grade of Portuguese school books. Finally, we adapted each traditional readability formula to the Portuguese language by performing a multiple linear regression on the same dataset of school books. © Springer Nature Switzerland AG 2019.
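The effect of the complex-word threshold can be illustrated with a small sketch based on the Gunning Fog index, one traditional formula that relies on complex words. The abstract does not name the formulas studied, so using Gunning Fog here is an assumption. The coefficients are the original English ones, not the regression-adapted Portuguese ones from the paper, and the per-word syllable counts are taken as given because syllable counting is language specific.

```python
def count_complex_words(syllables_per_word, threshold=3):
    """Count words whose syllable count is at least `threshold`."""
    return sum(1 for syllables in syllables_per_word if syllables >= threshold)


def gunning_fog(num_words, num_sentences, num_complex_words):
    """Original Gunning Fog index: 0.4 * (words/sentence + 100 * complex/words)."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("text must contain at least one word and one sentence")
    words_per_sentence = num_words / num_sentences
    percent_complex = 100 * num_complex_words / num_words
    return 0.4 * (words_per_sentence + percent_complex)


# Made-up syllable counts for a ten-word, two-sentence text.
syllables = [1, 2, 3, 4, 5, 1, 2, 3, 1, 2]

complex_at_3 = count_complex_words(syllables, threshold=3)  # English convention
complex_at_4 = count_complex_words(syllables, threshold=4)  # threshold the study found better for Portuguese

fog_at_3 = gunning_fog(len(syllables), 2, complex_at_3)
fog_at_4 = gunning_fog(len(syllables), 2, complex_at_4)
```

Raising the threshold from 3 to 4 syllables reduces the number of words counted as complex, which lowers the grade score for the same text.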


2019

Combining sentiment analysis scores to improve accuracy of polarity classification in MOOC posts

Authors
Pinto, HL; Rocio, V;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
Sentiment analysis is a set of techniques that deal with the verification of sentiment and emotions in written texts. This introductory work aims to explore the combination of scores and polarities of sentiments (positive, neutral and negative) provided by different sentiment analysis tools. The goal is to generate a final score and its respective polarity by normalizing and taking the arithmetic average of the scores given by those tools that provide a minimum of reliability. The texts analyzed to test our hypotheses were obtained from forum posts by participants in a massive open online course (MOOC) offered by Universidade Aberta de Portugal, and were submitted to four online service APIs offering sentiment analysis: Amazon Comprehend, Google Natural Language, IBM Watson Natural Language Understanding, and Microsoft Text Analytics. The initial results are encouraging, suggesting that the average score is a valid way to increase the accuracy of the predictions from different sentiment analyzers. © Springer Nature Switzerland AG 2019.
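The normalize-then-average idea can be sketched as follows. This is only an illustration under stated assumptions: each tool's score range is supplied by the caller, scores are rescaled linearly to [-1, 1], and the neutral band of 0.25 is an arbitrary hypothetical threshold. None of these values come from the paper or from the actual APIs.

```python
from statistics import mean


def normalize(score, low, high):
    """Linearly rescale a score from [low, high] to [-1, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return 2 * (score - low) / (high - low) - 1


def combine_scores(readings):
    """Average the normalized scores; each reading is (score, low, high)."""
    return mean(normalize(score, low, high) for score, low, high in readings)


def polarity(score, neutral_band=0.25):
    """Map a combined score to a label; the band width is a hypothetical choice."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Made-up readings from two hypothetical tools with different score ranges.
readings = [
    (0.8, -1.0, 1.0),  # tool A reports scores in [-1, 1]
    (0.9, 0.0, 1.0),   # tool B reports scores in [0, 1]
]
combined = combine_scores(readings)
label = polarity(combined)
```

Rescaling before averaging keeps a tool with a wider numeric range from dominating the combined score.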
