University of Wollongong
Browse

MIRS: [MASK] Insertion Based Retrieval Stabilizer for Query Variations

journal contribution
posted on 2024-11-17, 13:41 authored by Junping Liu, Mingkang Gong, Xinrong Hu, Jie Yang, Yi Guo
Pre-trained Language Models (PLMs) have greatly pushed the frontier of document retrieval tasks. Recent studies, however, show that PLMs are vulnerable to query variations, i.e., queries containing misspellings or word re-ordering of original queries, and etc. Despite the increasing interest to robustify the retriever performance, the impact of the query variations is not fully exploited. To effectively address this problem, this paper revisits the Masked-Language Modeling (MLM) and proposes a robust fine-tuning algorithm, termed [MASK] Insertion based Retrieval Stabilizer (MIRS). The proposed algorithm differs from existing methods via the injection of [MASK] tokens into query variations and further encouraging the representation similarity between the pair of original queries and their variations. In comparison to MLM, the traditional [MASK] substitution-then-prediction is less emphasized in MIRS. Additionally, an in-depth analysis of our algorithm is also provided to reveal: (1) the latent representation (or semantic) of the original query forms a convex hull, while the impact of the query variation is then quantified as a “distortion” to this hull via deviating the hull vertices; and (2) inserted [MASK] tokens play a significant role in enlarging the intersection between the newly-formed hull (after variations) and the original one, thereby preserving more semantic from original queries. With the proposed [MASK] injection, MIRS exhibits a relative 1.8 MRR@10 absolute point enhancement on average in the retrieval accuracy, verified using 5 baselines across 3 public datasets with 4 types of query variations. We also provide intensive ablation studies to investigate the hyperparameter sensitiveness, to breakdown the model into individual components to manifest their efficacy, and further, to evaluate the out-of-domain model generalizability.

Funding

Australian Research Council (DP210101426)

History

Journal title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume

14146 LNCS

Pagination

392-407

Language

English

Usage metrics

    Categories

    No categories selected

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC