2022
Authors
Loureiro, D; Barbieri, F; Neves, L; Anke, LE; Camacho-Collados, J;
Publication
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): PROCEEDINGS OF SYSTEM DEMONSTRATIONS
Abstract
Despite its importance, the time variable has been largely neglected in the NLP and language model literature. In this paper, we present TimeLMs, a set of language models specialized on diachronic Twitter data. We show that a continual learning strategy contributes to enhancing Twitter-based language models' capacity to deal with future and out-of-distribution tweets, while making them competitive with standardized and more monolithic benchmarks. We also perform a number of qualitative analyses showing how they cope with trends and peaks in activity involving specific named entities or concept drift. TimeLMs is available at https://github.com/cardiffnlp/timelms.
2022
Authors
Abdulmumin, I; Dash, SR; Dawud, MA; Parida, S; Muhammad, SH; Ahmad, IS; Panda, S; Bojar, O; Galadanci, BS; Bello, BS;
Publication
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION
Abstract
Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as a valuable piece of context information to decrease the ambiguity of input sentences. Despite the increasing popularity of such a technique, good and sizeable datasets are scarce, limiting the full extent of their potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite a large number of speakers, the Hausa language is considered low-resource in natural language processing (NLP). This is due to the absence of sufficient resources to implement most NLP tasks. While some datasets exist, they are either scarce, machine-generated, or in the religious domain. Therefore, there is a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. To prepare the dataset, we started by translating the English description of the images in the Hindi Visual Genome (HVG) into Hausa automatically. Afterward, the synthetic Hausa data was carefully post-edited considering the respective images. The dataset comprises 32,923 images and their descriptions that are divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.
2022
Authors
Moreno, M; Vilaca, R; Ferreira, PG;
Publication
BMC BIOINFORMATICS
Abstract
Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures are made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
2022
Authors
Baptista, D; Ferreira, PG; Rocha, M;
Publication
Abstract
2022
Authors
Jurado Rodriguez, D; Jurado, JM; Padua, L; Neto, A; Munoz Salinas, R; Sousa, JJ;
Publication
COMPUTERS & GRAPHICS-UK
Abstract
Environment understanding in real-world scenarios has gained increasing interest in research and industry. Advances in data capture and processing allow highly detailed reconstruction from a set of multi-view images by generating meshes and point clouds. Likewise, deep learning architectures, along with the broad availability of image datasets, bring new opportunities for the segmentation of 3D models into several classes. Among the areas that can benefit from 3D semantic segmentation is the automotive industry. However, there is a lack of labeled 3D models that can be used for training and as ground truth in deep learning-based methods. In this work, we propose an automatic procedure for the generation and semantic segmentation of 3D cars obtained from the photogrammetric processing of UAV-based imagery, in which sixteen car parts are identified in the point cloud. To this end, a convolutional neural network based on the U-Net architecture combined with an Inception V3 encoder was trained on a publicly available dataset of car parts. The trained model is then applied to the UAV-based images, and these are mapped onto the photogrammetric point clouds. Starting from this preliminary image-based segmentation, an optimization method is developed to obtain a fully labeled point cloud, taking advantage of the geometric and spatial features of the 3D model. The results demonstrate the method's capabilities for the semantic segmentation of car models. Moreover, the proposed methodology has the potential to be extended or adapted to other applications that benefit from 3D segmented models.
2022
Authors
Pinto, H; Pernice, R; Silva, ME; Javorka, M; Faes, L; Rocha, AP;
Publication
PHYSIOLOGICAL MEASUREMENT
Abstract
Objective. In this work, an analytical framework for the multiscale analysis of multivariate Gaussian processes is presented, whereby the computation of Partial Information Decomposition measures is achieved accounting for the simultaneous presence of short-term dynamics and long-range correlations. Approach. We consider physiological time series mapping the activity of the cardiac, vascular and respiratory systems in the field of Network Physiology. In this context, the multiscale representation of transfer entropy within the network of interactions among systolic arterial pressure (S), respiration (R) and heart period (H), as well as the decomposition into unique, redundant and synergistic contributions, is obtained using a Vector AutoRegressive Fractionally Integrated (VARFI) framework for Gaussian processes. This novel approach makes it possible to quantify the directed information flow accounting for the simultaneous presence of short-term dynamics and long-range correlations among the analyzed processes. Additionally, it provides analytical expressions for the computation of the information measures, by exploiting the theory of state space models. The approach is first illustrated in simulated VARFI processes and then applied to H, S and R time series measured in healthy subjects monitored at rest and during mental and postural stress. Main Results. We demonstrate the ability of the VARFI modeling approach to account for the coexistence of short-term and long-range correlations in the study of multivariate processes. Physiologically, we show that postural stress induces larger redundant and synergistic effects from S and R to H at short time scales, while mental stress induces larger information transfer from S to H at longer time scales, thus evidencing the different nature of the two stressors. Significance. The proposed methodology makes it possible to extract useful information about the dependence of the information transfer on the balance between short-term and long-range correlations in coupled dynamical systems, which cannot be observed using standard methods that do not consider long-range correlations.