RCLN Team

Seminars

Towards Efficient and Effective Vocabulary in Sparse Information Retrieval

#SeminaireRCLN
Yuxuan ZONG
2026-04-13 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
In the era of big data, information retrieval (IR) plays a central role in how information is accessed and consumed. Recent advances in Transformer-based neural models have substantially improved retrieval performance. Two major paradigms have emerged in this context: learned sparse retrieval, which represents texts using weighted vocabulary terms, and generative retrieval, which formulates retrieval as the generation of a document identifier. While both approaches have shown strong performance, they also exhibit important limitations. Sparse retrieval methods are often constrained by the fixed vocabulary of the underlying language model, limiting their adaptability, whereas generative retrieval methods rely on arbitrary document identifiers that tend to generalize poorly to unseen documents. In this thesis, we explore how these two paradigms can be combined to obtain more efficient and more effective retrieval representations. Our core idea is to learn sparse retrieval vocabularies rather than rely on predefined lexical tokens. We first propose REFERENTIAL and HotBERT to investigate the use of hierarchical structured identifiers as the vocabulary representation for retrieval, whose coarse-to-fine representation is designed to capture global semantics at higher levels and progressively refine finer-grained distinctions. While this representation proves expressive and effective, our analysis reveals that directly learning and optimizing hierarchical identifiers is challenging in practice. Motivated by this observation, we introduce SAE-SPLADE, a sparse retrieval framework built on sparse autoencoders (SAE), an architecture that learns sparse, interpretable latent representations. By using SAE latents as the retrieval vocabulary, SAE-SPLADE removes the dependence on fixed token vocabularies and improves flexibility and representation capacity.
Finally, recognizing the efficiency challenges of HotBERT, we propose a theoretically lossless token-pruning method for late interaction models that reduces computation while preserving retrieval performance.
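As a rough illustration of the sparse retrieval idea described above (texts represented by weighted vocabulary terms), the sketch below scores documents by a sparse dot product over an inverted index. It is a minimal toy, not the method of the thesis: the documents, terms, weights, and function names are all invented for illustration, and a learned model would produce the weights.

```python
from collections import defaultdict

# Toy sparse representations: each text maps a few vocabulary terms to weights.
# All terms and weights are invented; a learned sparse model would predict them.
docs = {
    "d1": {"retrieval": 1.2, "sparse": 0.9, "vocabulary": 0.4},
    "d2": {"generation": 1.1, "identifier": 0.8, "retrieval": 0.3},
}


def build_inverted_index(docs):
    """Map each term to the list of (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
    return index


def score(query, index):
    """Sparse dot product between the query and every document sharing a term."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        # .get avoids creating empty postings for unseen query terms
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = build_inverted_index(docs)
query = {"sparse": 1.0, "retrieval": 0.5}
ranking = score(query, index)
```

Only documents that share at least one term with the query are touched, which is where the efficiency of sparse representations comes from. Replacing the lexical terms with learned units would change the dictionary keys but not the scoring scheme.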

Title coming soon

#SeminaireRCLN
Nesrine Bannour
2025-01-27 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Abstract forthcoming

Title coming soon

#SeminaireRCLN
Paul Lerner
2025-01-13 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
Abstract forthcoming

Multidimensional Graph Summarization: Application to Heterogeneous Documents

#SeminaireRCLN
Amal BELDI
2024-10-07 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
Data integration involves the coherent organization and consolidation of diverse data formats and structures, ensuring their compatibility for in-depth analysis. Graph-based analysis uses advanced graph modeling techniques to uncover complex connections and patterns within datasets, providing valuable insights for decision-making. Graph summarization techniques aim to condense complex data graphs while preserving essential information, enabling more efficient data processing and visualization and thereby easier understanding and interpretation. Our objectives are as follows: - Address data heterogeneity: focus on summarizing varied data types, in particular numerical and textual data, to handle the diversity of data formats, structures, and representations effectively. - Personalize data summarization: adapt the summarization process to users' specific needs, so that data consolidation is both relevant and user-centered. - Implement semantic summarization: develop a summarization approach built primarily on a standardized framework for describing resources, integrating semantic elements to improve the interconnection and meaning of the data.

Innovations and Challenges in Language Engineering for Phraseological Units

#SeminaireRCLN
Belem Priego Sanchez
2024-06-17 12:45:00
Salle B107, bâtiment B, Université de Villetaneuse
TBC

MAFALDA: A Comparative and Comprehensive Study of Fallacy Detection and Classification

#SeminaireRCLN
Pierre Henri Paris
2024-03-18 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
We present MAFALDA, a benchmark for fallacy classification that merges and unifies previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We also provide a manual annotation of part of the data, together with manual explanations for each annotation. We propose a new annotation scheme tailored to subjective NLP tasks, as well as a new evaluation method designed to handle subjectivity. We then evaluate several language models in a zero-shot setting, along with human performance, on MAFALDA to assess their ability to detect and classify fallacies.

Ethically-driven Multimodal Emotion Detection for Children with Autism

#SeminaireRCLN
Annanda Sousa
2024-02-12 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
Emotion detection (ED) aims to identify people’s emotions automatically. However, most ED applications do not consider individuals who express emotions differently, such as people with autism. Although studies have already focused on creating ED models tailored for children with ASD, this application of ED suffers from a scarcity of resources and remains underperforming compared to the state-of-the-art ED models for the general population. This thesis addresses the gap in automatic ED between the general population and autistic children while ensuring an ethically driven approach, i.e., having the well-being of participants as the main priority during the whole research process. To meet our research objectives, we created a data collection framework that minimises emotional disruption to the participants, respects their privacy and rights according to GDPR, and provides a dataset that can be shared with the research community. We created CALMED, a multimodal annotated dataset for ED featuring children with autism that includes privacy-preserving features, novel target emotion classes, annotations provided by the participants’ parents and a researcher specialist who works with children with ASD. Using the CALMED dataset, we created hundreds of models with unique configurations and analysed them to explore the effectiveness of various methods for multimodal ED in autism. Then, utilising the knowledge acquired in this analysis, we proposed a multimodal ED model that outperformed the previous state-of-the-art, reaching 81.56% and 75.47% for accuracy and balanced accuracy, respectively. Finally, we created and shared many systems to support the data acquisition process and data experiments creation and analysis. We placed great importance on ensuring reproducibility, reusability, and ethical conduct. This research has made significant contributions to the field of ED applied to ASD.
It has provided a valuable dataset, analytical insights, a state-of-the-art model, and many computer systems that can serve as a groundwork for future work.
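The abstract reports both accuracy and balanced accuracy, and the two diverge when classes are imbalanced. The sketch below shows the difference, taking balanced accuracy in its usual sense as the mean of per-class recall. The labels are invented for illustration and have no connection to the CALMED data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    support = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


# Invented, imbalanced toy labels: 8 examples of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one of the two "B" examples is missed

acc = accuracy(y_true, y_pred)            # 9 correct out of 10
bal_acc = balanced_accuracy(y_true, y_pred)  # (8/8 + 1/2) / 2
```

Here a single error on the minority class barely moves accuracy but lowers balanced accuracy noticeably, which is why both figures are useful on imbalanced data.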

The role of Knowledge Graphs in externalizing information from conceptual models

#SeminaireRCLN
Ana-Maria Ghiran
2024-02-08 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
Because Knowledge Graphs (KGs) represent facts and ontological models in a machine-readable format, they enable AI systems to make decisions or to provide humans with insights by revealing hidden relationships between entities. Nevertheless, decision making in enterprises is far from being assigned to AI. Describing and evaluating business processes takes the form of visual models, which have gained popularity among managers. But a business process diagram, usually described in the standardized notation BPMN (Business Process Model and Notation), enables more than just a visual representation of knowledge: it creates a structured encoding of knowledge, which can be captured in a graph-based format. In this way, information that captures diverse facets of an enterprise (e.g. about business processes, resources, strategies, goals etc.) and that was mainly used by business executives and restricted to human interpretation, is externalized as KGs and provided for machine interpretation, thus enabling reasoning and semantic linking with external knowledge. In this presentation I will highlight that conceptual models should be considered as knowledge acquisition structures for any domain and that they can be processed as KGs with the help of Semantic Technology.

On Semantic Annotation of Legislation

#SeminaireRCLN
Adam Wyner
2023-10-30 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
The talk presents an overview of recent work on semantic annotation of legislation. The law is presented in a range of complex, dense texts. Querying and correlating laws would help individuals and organisations access, understand, and comply with their legal obligations. We first present the Core Legal Annotation Language (CLAL), a machine readable XML for key semantic elements such as obligations, prohibitions, exceptions, and others. CLAL is applied to the GDPR; we show some examples. We then turn to issues related to information retrieval from the annotated GDPR, particularly where implicitly related information is needed, e.g., obligations and rights. Finally, we step back and discuss general methodological issues. Currently, there is diversity amongst the metadata of legal texts. This is particularly problematic for the law, as it is desirable to have common resources in order to extract information or support inferences. To achieve this, we propose a methodology based on the notions of formalisation continuum, modularisation, and stepwise refinement.

Towards a Formalisation of Value-based Actions and Consequentialist Ethics

#SeminaireRCLN
Adam Wyner
2023-10-23 13:00:00
Salle B107, bâtiment B, Université de Villetaneuse
Agents act in ways that relate to their personal or institutional values, amongst other reasons to act; that is, Agents aim to bring about a state of the world that is more compatible with their values. To formalise and ground this intuition, the paper proposes an action framework based on the familiar STRIPS formalisation. The technical contribution is to express actions in terms of Value-based Formal Reasoning (VFR), which provides a set of propositions derived from an Agent’s value profile and the Agent’s assessment of propositions in light of the profile. The conceptual contribution is to provide a computational framework for a form of consequentialist ethics which is satisficing, pluralistic, act-based, and preferential.
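For readers unfamiliar with STRIPS, the formalisation the abstract builds on, the sketch below shows a classical STRIPS-style action with preconditions, an add list, and a delete list. It illustrates plain STRIPS only and does not attempt to reproduce the Value-based Formal Reasoning framework of the paper. The propositions and the action are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action over sets of ground propositions."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def applicable(state, action):
    """An action is applicable when all its preconditions hold in the state."""
    return action.preconditions <= state


def apply_action(state, action):
    """Return the successor state: remove the delete list, then add the add list."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable in this state")
    return (state - action.delete_effects) | action.add_effects


# Hypothetical propositions and action, chosen only to show the mechanics.
state = frozenset({"has_money", "at_shop"})
donate = Action(
    name="donate",
    preconditions=frozenset({"has_money"}),
    add_effects=frozenset({"charity_funded"}),
    delete_effects=frozenset({"has_money"}),
)
new_state = apply_action(state, donate)
```

Comparing `state` and `new_state` gives the set of propositions an action changes, which is the kind of information a value-based assessment of actions could then be applied to.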

Ontologies for Knowledge Representation and Decision-Making in Multi-Sensor Systems on Airborne Platforms

#SeminaireRCLN
Vincent Beugnet
2023-10-23 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Object identification is a critical issue for airborne platforms in a military context. Existing systems in use today do not allow automatic identification of the objects encountered, despite the increasing capabilities of sensors. We propose an ontology-based system that optimizes the acquisition of information about the objects encountered by taking control of the available sensor suite.

A family of contrast-pattern based classifiers for class-imbalance problems

#SeminaireRCLN
Raul Monroy
2023-07-03 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
In this talk, I will give an overview of a family of contrast-pattern based classification mechanisms, especially designed to deal with class-imbalance problems. In particular, I will go into the internal workings of three classifiers, namely: PBC4cip, MHLDT and FT4cip. I will highlight their pros and cons and outline some of their greatest hits.

ChêneTAL: An Experimentation Platform for Natural Language Processing and Artificial Intelligence Tools

#SeminaireRCLN
Othman Boudarga
2023-06-26 12:00:00
Salle B107, bâtiment B, Université de Villetaneuse
The ChêneTAL platform was designed to support heterogeneous Natural Language Processing (NLP) pipelines by integrating existing corpus management and manipulation software with more recent Artificial Intelligence (AI) models, while keeping a simplified interface usable both by researchers in the NLP community and by researchers in Linguistics and the Humanities and Social Sciences who are not computing experts. A first working version of the platform will be presented during the seminar.

From Language Models to (very) Large Language Models

#SeminaireRCLN
Davide Buscaldi
2023-03-20 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Originally intended for the RCLN team, this seminar is open to anyone curious about the latest language models: BERT, GPT, GPT-2, GPT-3, GPT-4 and, of course, ChatGPT. I have designed the presentation to also cover the basics of language models, in order to explain how these models work at a lower level.

Towards Detecting Pre-training Data Set Manipulations: the Need to Build Efficient Language Models

#SeminaireRCLN
Wissam Antoun
2023-02-13 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
The high compute cost required to train Large Language Models (LLMs) makes them available only to a handful of high-budget private institutions and countries. These institutions rarely document their training data or the source code used for data collection and filtering, thus raising questions about potential vulnerabilities of models that have been trained on such data. For example, one of the many ways to inject adversarial biases and tamper with training data is to produce machine-generated text carrying these biases and have it included in the training data. The robust detection of machine-generated text is therefore becoming crucial. Answering these questions first requires efficient ways to iterate and train language models quickly. In this talk, I will present my work on pretraining language models for Arabic and French and showcase the lessons learned in designing and training efficient LLMs. In particular, I'll talk about training AraBERT, AraELECTRA, AraGPT2, the current largest Transformer-based models for Arabic, and the AraGPT2 detector. I’ll also introduce CamemBERTa, a new sample-efficient language model for French, the first publicly available DeBERTa V3-based model outside of the original paper, which establishes a new SOTA for this language in many tasks. (Joint work with Benoit Sagot and Djamé Seddah, at Inria’s Almanach project team)

Trustworthy AI: Ethical considerations when using AI techniques

#SeminaireRCLN
Fernando Perez-Tellez
2022-11-07 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Artificial Intelligence (AI) is now used everywhere, owing to the accessibility of this technology in different aspects of everyday life. The idea of incorporating AI systems into several aspects of human life is to benefit humans by reducing labour and increasing everyday conveniences. Independently of the adopted definition of AI, we know that AI can represent either a benefit or a threat (unintentional in most cases). We should therefore create intelligent systems with important ethical and legal aspects in mind. Dr. Fernando Perez Tellez, a lecturer and researcher from the Technological University Dublin (TU Dublin), Ireland, is visiting LIPN. Dr. Perez Tellez will give a presentation on why it is important to consider ethics when AI techniques are used and how to make responsible use of AI. He will also present his TU Dublin colleagues' research interests to promote potential research collaborations between LIPN and TU Dublin research groups.

Abstractive Summarization Evaluation: Overview and Reflections

#SeminaireRCLN
Yanzhu Guo
2022-03-28 13:00:00
Salle B107, bâtiment B, Université de Villetaneuse
The topic of summarization evaluation has recently received a surge of attention due to the rapid development of abstractive summarization systems. We conduct a survey of the state-of-the-art evaluation metrics along with relevant datasets and visualization systems. We also touch upon the statistical deficiencies in current meta-evaluation approaches such as the problematic choice of scoring range, the lack of paired evaluation as well as the prevalence of underpowered tests. Finally, we show experimental results proving the unreliability of human-annotated ground-truth reference summaries and thus argue for reference-free metrics as a more promising future direction.

Relation Extraction with Distant Supervision: Noise Reduction

#SeminaireRCLN
Juan Luis Garcia-Mendoza
2021-11-12 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Distant Supervision is an approach that allows automatic labeling of instances. This approach has been used in Relation Extraction. Still, the main challenge of this task is handling instances with noisy labels (e.g., when two entities in a sentence are automatically labeled with an invalid relation). The approaches reported in the literature addressed this problem by employing noise-tolerant classifiers. However, if a noise reduction stage is introduced before the classification step, macro precision either increases or stays the same with fewer instances. An approach based on Adversarial Autoencoders is proposed to obtain a new representation that allows noise reduction in Distant Supervision.
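To make the noisy-label problem concrete, here is a minimal sketch of distant supervision in its simplest form: any sentence that mentions both entities of a knowledge-base pair is labeled with that pair's relation. The knowledge base, entity names, relation name, and sentences are all invented for illustration, and the string matching is a deliberate simplification of real entity linking.

```python
# Hypothetical knowledge base: (head entity, tail entity) -> relation.
kb = {("Alice Martin", "Lyon"): "born_in"}

# Invented sentences: only the first one actually expresses born_in.
sentences = [
    "Alice Martin was born in Lyon.",
    "Alice Martin gave a lecture in Lyon last week.",
]


def distant_label(sentence, kb):
    """Label a sentence with every KB relation whose two entities it mentions."""
    labels = []
    for (head, tail), relation in kb.items():
        # Naive substring matching stands in for real entity recognition.
        if head in sentence and tail in sentence:
            labels.append((head, relation, tail))
    return labels


labeled = [(sentence, distant_label(sentence, kb)) for sentence in sentences]
```

Both sentences receive the `born_in` label even though the second one says nothing about a birthplace. That second instance is the kind of noise a reduction stage placed before the classifier would aim to filter out.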

Knowledge-based Detection of Automatically Generated Text

#SeminaireRCLN
Vijini Liyanage
2021-05-31 13:00:00
Salle A303, Bâtiment A, LIPN
Seminar by Vijini Liyanage, a PhD student in the RCLN group, who will present her thesis topic and the first stages of her work on detecting text automatically generated by neural language models such as GPT-2.

wikiSERA: Domain independent evaluation of automatic summaries using relevance analysis on Wikipedia

#SeminaireRCLN
Jorge Garcia Flores
2020-12-07 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Text summarization has been the subject of increasing research efforts in recent years. However, automatic summary evaluation is as crucial as the summarization task itself. For more than 15 years, the dominant approach for evaluating this task has been ROUGE [Lin, 2004], a machine-translation-inspired lexical comparison between a candidate machine summary and a set of human gold standard summaries. Lexical comparison might be a suitable evaluation approach for extractive summarization systems. However, the methodological leap of Deep Learning brought increasing research efforts on abstractive summarization, which raised some questions about the pertinence of an all-lexical evaluation perspective. In this work we present wikiSERA, an open source improvement of the SERA evaluation method [Cohan et al., 2018], based on a semantic comparison of information extraction vectors from a document base. We adapted the method to generic domain summarization and provide the community with a Wikipedia-based implementation that shows robust correlation with human evaluations. --- After the seminar we will say goodbye to Jorge, who is leaving us for a few months, with a "resistance" aperitif (against Covid, the LPR, etc.)
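The limitation of purely lexical evaluation raised in the abstract can be seen with a simplified ROUGE-1 recall, computed here as clipped unigram overlap with a single reference. This is a sketch under simplifying assumptions (whitespace tokenization, no stemming, one reference), not the official ROUGE implementation. The example sentences are invented.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that also appear in the candidate (clipped)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference token is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
extractive = "the cat sat on the mat today"  # reuses the reference wording
abstractive = "a feline rested on a rug"     # same meaning, different words

extractive_score = rouge1_recall(extractive, reference)
abstractive_score = rouge1_recall(abstractive, reference)
```

The paraphrase shares only the word "on" with the reference, so it scores far lower than the near-copy despite conveying the same content. This is the gap that semantic evaluation methods try to close.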

Person-Independent Multimodal Emotion Detection for Children with High-Functioning Autism

#SeminaireRCLN
Annanda Sousa
2020-10-12 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
The use of affect-sensitive interfaces carries the promise of enhancing human-computer interaction by delivering a system capable of identifying a user's emotions and adapting its content accordingly. Today's technology shows great potential to support children with autism, for example by using computer systems to improve their social skills. Generally, however, this technology does not encompass the potential of affect-sensitive interfaces. This is mainly because Emotion Detection (ED) models built for the general population usually do not perform well when applied to children with autism, who express emotions differently. The aim of this project is therefore to build a person-independent Multimodal Emotion Detection system tailored for children with high-functioning autism, with the ultimate goal of applying it to design affect-sensitive interfaces dedicated to children with autism. This is a work in progress and the project expects to build upon the current body of knowledge on methods to apply ED systems to this specific subset of the general population. We expect to apply the overall theoretical and practical design perspectives that arise from this research investigation (e.g. analysis of modalities and feature extraction, behavioural-cue-based features, fusion layers and classifier techniques) to propose a guiding framework for future studies.

Expert Finding from Scientific Publications

#SeminaireRCLN
Stella Zevio
2020-09-28 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
Whom should I assign to the program committee of the conference I am organizing? To my PhD student's thesis committee? Who are the eminent members and what are the landmark publications of my research field? Am I a distinguished researcher? Whom should I cite, and with whom should I collaborate, to have a chance of joining the eminent members of the scientific community and improving my reputation? To answer these essential questions, we propose a method for finding experts from scientific publications that combines semantic annotation using an ontology with pattern mining in the cores of attributed graphs.

Generating Referring Expressions from RDF Knowledge Graphs for Data Linking

#SeminaireRCLN
Armita Khajeh-Nassiri
2020-06-08 15:00:00
Online via Jitsi: https://jitsi.lipn.univ-paris13.fr/RDFKGforDataLinking
In a knowledge graph, a referring expression (RE) is a logical formula that can uniquely identify an entity. We propose a novel approach for discovering REs that are valid within a class of a knowledge graph. There can potentially exist many REs for each entity, hence we have focused on those descriptions that are 1) minimal, 2) diverse, and 3) cannot be found by instantiating the keys. As an application, we study the data linking problem that, given two knowledge graphs G1 and G2, finds the possible links between the entities of G1 and G2. We show that REs can drastically improve the quality of data linking. Join the meeting: https://jitsi.lipn.univ-paris13.fr/RDFKGforDataLinking
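The notion of a referring expression can be illustrated with a deliberately simplified sketch in which an expression is a conjunction of attribute-value pairs, and it refers to an entity when that entity is the only one in the class satisfying it. The logical formulas considered in the talk may be richer than this, and the entities, attributes, and function names below are invented for illustration.

```python
# Hypothetical entities of one class, described by attribute-value pairs.
entities = {
    "e1": {"country": "France", "capital": True},
    "e2": {"country": "France", "capital": False},
    "e3": {"country": "Italy", "capital": True},
}


def matches(expression, attributes):
    """True when the entity satisfies every attribute-value pair of the expression."""
    return all(attributes.get(key) == value for key, value in expression.items())


def refers_uniquely(expression, target, entities):
    """True when the target is the only entity satisfying the expression."""
    matched = [name for name, attrs in entities.items() if matches(expression, attrs)]
    return matched == [target]


too_general = refers_uniquely({"country": "France"}, "e1", entities)
unique = refers_uniquely({"country": "France", "capital": True}, "e1", entities)
```

The first expression also matches `e2`, so it does not identify `e1`. The second matches `e1` alone, and dropping either of its two conditions breaks uniqueness, which is what minimality means in this toy setting.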

Attention is all I need

#SeminaireRCLN
José Angel Gonzalez-Barba
2020-02-10 12:30:00
Salle B107, bâtiment B, Université de Villetaneuse
The use of attention mechanisms has become widespread across natural language processing tasks. These mechanisms have increased the capacity of Deep Learning models, allowing them to focus explicitly on the most discriminant relationships and properties for a given task. Recently, the Transformer model has replaced Convolutional and Recurrent neural networks in many NLP tasks, mainly due to its capability of modeling sequences while avoiding sequential processing by using only attention mechanisms. In this talk I will speak about the application of Transformer encoders to text classification in social media (Sentiment Analysis and Irony Detection in Twitter) and their application in a novel framework for extractive summarization. About the author: José Angel has just finished his PhD and is going to start a postdoc in the group of Yoshua Bengio at the University of Montreal. His work on Spanish NLP has been very promising, and he has developed state-of-the-art systems for sentiment analysis and summarization.