2 months ago

SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Iman Barati, Mostafa Amiri, Heshaam Faili

Abstract

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: https://github.com/mostafaamiri/SearchInstruct
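The three stages the abstract describes (expand seed questions, retrieve domain resources, generate grounded answers) can be pictured with a minimal Python sketch. This is not the authors' implementation: the names `build_dataset`, `InstructionPair`, `toy_expand`, `toy_retrieve`, and `toy_answer` are hypothetical, the toy functions merely stand in for an LLM and a search backend, and the de-duplication and "skip questions with no retrieved documents" rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    question: str
    answer: str
    sources: List[str]  # documents the answer was grounded on


def build_dataset(
    seeds: List[str],
    expand: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> List[InstructionPair]:
    """Expand each seed, retrieve documents, then answer from those documents."""
    pairs: List[InstructionPair] = []
    seen = set()
    for seed in seeds:
        for question in expand(seed):
            key = question.strip().lower()
            if key in seen:  # assumption: drop exact duplicate questions
                continue
            seen.add(key)
            docs = retrieve(question)
            if not docs:  # assumption: skip questions with no supporting documents
                continue
            pairs.append(InstructionPair(question, answer(question, docs), docs))
    return pairs


# Toy stand-ins (hypothetical): a real pipeline would call an LLM and a search tool.
CORPUS = [
    "Saffron is harvested by hand in autumn.",
    "Saffron threads should be stored in an airtight container away from light.",
]


def tokens(text: str) -> set:
    # Lowercased words longer than four characters, with punctuation stripped.
    words = (w.strip("?.:,").lower() for w in text.split())
    return {w for w in words if len(w) > 4}


def toy_expand(seed: str) -> List[str]:
    return [f"Explain briefly: {seed}", f"Give one practical tip: {seed}"]


def toy_retrieve(question: str, k: int = 1) -> List[str]:
    q = tokens(question)
    ranked = sorted(CORPUS, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked if q & tokens(d)][:k]


def toy_answer(question: str, docs: List[str]) -> str:
    return " ".join(docs)


pairs = build_dataset(
    ["How should saffron be stored?", "What is quantum tunnelling?"],
    toy_expand,
    toy_retrieve,
    toy_answer,
)
```

Passing the expansion, retrieval, and answering steps in as callables keeps the sketch independent of any particular model or search API; the linked repository is the place to look for the actual prompts and tooling.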
