
To fine-tune or not to fine-tune

When adapting general-purpose models such as GPT-4 and LLaMA-2 to specific domains like healthcare, we can choose among several strategies, including prompt engineering, parameter-efficient fine-tuning, instruction tuning, and re-training from scratch (ordered by increasing technical difficulty, figure 1). The common belief is that fine-tuning on domain-specific knowledge and data yields better results than prompt engineering. But is fine-tuning really necessary, given that it can consume significant computing resources while delivering only marginal performance gains?

Figure 1. LLM adoption techniques (source: Nvidia)

In a recent manuscript published by Microsoft, the authors showed that GPT-4, combined with a composition of several prompting strategies (named Medprompt, figure 2), can achieve superior performance on multiple medical challenge benchmarks without any fine-tuning. These strategies include (figure 3):

  1. Few-shot prompting: Provide a few examples of correct Q&A. The study uses kNN-based few-shot selection, which picks examples whose text embeddings are the K nearest neighbors of the test question among previously correct Q&A pairs (a sketch of this step, together with ensembling, follows the list).
  2. Chain-of-thought: Guide the LLM to reason step by step.
  3. Ensembling: Produce multiple outputs and consolidate them into a consensus output.
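
To make the first and third components concrete, here is a minimal Python sketch of kNN-based few-shot example selection plus a simple majority-vote consensus. The embedding model (`all-MiniLM-L6-v2` via sentence-transformers), the tiny Q&A bank, and the sampled answers are illustrative assumptions, not details taken from the Medprompt paper.

```python
# Minimal sketch of two Medprompt-style components:
# (1) kNN-based few-shot example selection and (2) majority-vote ensembling.
# The embedding model and example data are illustrative assumptions.
from collections import Counter

import numpy as np
from sentence_transformers import SentenceTransformer

# Small illustrative bank of previously answered questions (assumed data).
solved_qa = [
    {"question": "Which vitamin deficiency causes scurvy?", "answer": "Vitamin C"},
    {"question": "Which organ produces insulin?", "answer": "Pancreas"},
    {"question": "What is the most common cause of community-acquired pneumonia?",
     "answer": "Streptococcus pneumoniae"},
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
bank_embeddings = embedder.encode([qa["question"] for qa in solved_qa])


def select_few_shot(test_question: str, k: int = 2) -> list[dict]:
    """Return the k solved Q&A pairs whose questions are nearest in embedding space."""
    q_emb = embedder.encode([test_question])[0]
    # Cosine similarity between the test question and every bank question.
    sims = bank_embeddings @ q_emb / (
        np.linalg.norm(bank_embeddings, axis=1) * np.linalg.norm(q_emb)
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [solved_qa[i] for i in top_idx]


def majority_vote(answers: list[str]) -> str:
    """Consolidate multiple sampled answers into a consensus answer."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    shots = select_few_shot("Which hormone is deficient in type 1 diabetes?", k=2)
    prompt = "\n".join(f"Q: {s['question']}\nA: {s['answer']}" for s in shots)
    print(prompt)  # few-shot examples to prepend before the test question
    # Ensembling over mocked answers; in practice these come from repeated LLM calls.
    print(majority_vote(["Insulin", "Insulin", "Glucagon"]))
```

In practice, the selected Q&A pairs are prepended to the test question as few-shot examples, and the consensus step runs over answers sampled from the LLM (for instance with shuffled answer choices), rather than the mocked strings used here.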
