Equipe RCLN

Seminars

Towards Efficient and Effective Vocabulary in Sparse Information Retrieval

Yuxuan ZONG

2026-04-13 12:15:00
Room B107, Building B, Université de Villetaneuse

In the era of big data, information retrieval (IR) plays a central role in how information is accessed and consumed. Recent advances in Transformer-based neural models have substantially improved retrieval performance. Two major paradigms have emerged in this context: learned sparse retrieval, which represents texts using weighted vocabulary terms, and generative retrieval, which formulates retrieval as the generation of a document identifier. While both approaches have shown strong performance, they also exhibit important limitations. Sparse retrieval methods are often constrained by the fixed vocabulary of the underlying language model, which limits their adaptability, whereas generative retrieval methods rely on arbitrary document identifiers that tend to generalize poorly to unseen documents. In this thesis, we explore how these two paradigms can be combined to obtain more efficient and more effective retrieval representations. Our core idea is to learn the sparse retrieval vocabulary rather than take it from predefined lexical tokens. We first propose REFERENTIAL and HotBERT to investigate hierarchically structured identifiers as the retrieval vocabulary; their coarse-to-fine representation is designed to capture global semantics at higher levels and to progressively refine finer-grained distinctions. While this representation proves expressive and effective, our analysis reveals that directly learning and optimizing hierarchical identifiers is challenging in practice. Motivated by this observation, we introduce SAE-SPLADE, a sparse retrieval framework built on sparse autoencoders (SAEs), an architecture that learns sparse, interpretable latent representations. By using SAE latents as the retrieval vocabulary, SAE-SPLADE removes the dependence on fixed token vocabularies and improves flexibility and representation capacity.
Finally, recognizing the efficiency challenges of HotBERT, we propose a theoretically lossless token-pruning method for late-interaction models that reduces computation while preserving retrieval performance.

Seminars

Multidimensional Graphical Synthesis: Application to Heterogeneous Documents

Innovations and Challenges in Language Engineering around Phraseological Units

MAFALDA: A Comprehensive Comparative Study of Fallacy Detection and Classification

Ethically-driven Multimodal Emotion Detection for Children with Autism

The role of Knowledge Graphs in externalizing information from conceptual models

On Semantic Annotation of Legislation

Towards a Formalisation of Value-based Actions and Consequentialist Ethics

Ontology for Knowledge Representation and Decision-Making in Multi-Sensor Systems on Airborne Platforms

A family of contrast-pattern based classifiers for class-imbalance problems

ChêneTAL: An Experimentation Platform for Natural Language Processing and Artificial Intelligence Tools

From Language Models to (very) Large Language Models

Towards Detecting Pre-training Data Set Manipulations: the Need to Build Efficient Language Models

Trustworthy AI: Ethical considerations when using AI techniques

Abstractive Summarization Evaluation: Overview and Reflections

Relation Extraction with Distant Supervision: Noise Reduction

Knowledge-based Detection of Automatically Generated Text

wikiSERA: Domain independent evaluation of automatic summaries using relevance analysis on Wikipedia

Person-Independent Multimodal Emotion Detection for Children with High-Functioning Autism

Expert Finding from Scientific Publications

Generating Referring Expressions from RDF Knowledge Graphs for Data Linking

Attention is all I need