To fine-tune or not to fine-tune
AI + Medicine Newsletter 2023-12-05
When adapting general models such as GPT-4 and LLaMA-2 to specific domains like healthcare, we can choose among several strategies: prompt engineering, parameter-efficient fine-tuning, instruction tuning, and re-training from scratch (ordered by increasing technical difficulty; figure 1). It is commonly believed that fine-tuning on domain-specific knowledge and data yields better results than prompt engineering. But is fine-tuning necessary, when it can consume significant computing resources yet may deliver only marginal performance gains?
In a recent manuscript from Microsoft, the authors showed that GPT-4, paired with a composition of several prompting strategies (named Medprompt, figure 2), can achieve superior performance on multiple medical challenge benchmarks without any fine-tuning. These strategies include (figure 3):
- Few-shot prompting: Provide a few examples of correctly answered questions. The study uses kNN-based few-shot selection, which picks the examples whose text embeddings are nearest to the current question from a pool of previously correct Q&A pairs (first sketch after this list).
- Chain-of-thought: Guide the LLM to reason step by step before committing to an answer (second sketch below).
- Ensembling: Produce multiple outputs and consolidate them into a consensus answer, e.g., by majority vote (third sketch below).
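To make the kNN-based few-shot selection concrete, here is a minimal Python sketch. The `embed` placeholder and the `qa_bank` structure are assumptions for illustration, not code from the paper; plug in any text-embedding model you like.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a text embedding from an embedding model
    of your choice (any sentence-embedding API works here)."""
    raise NotImplementedError

def select_few_shot(question: str, qa_bank: list[dict], k: int = 5) -> list[dict]:
    """Return the k Q&A pairs from qa_bank whose question embeddings
    are closest (by cosine similarity) to the new question."""
    q = embed(question)
    scored = []
    for qa in qa_bank:  # each qa: {"question": ..., "answer": ...}
        v = embed(qa["question"])
        cos = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((cos, qa))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [qa for _, qa in scored[:k]]
```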
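For chain-of-thought, the prompt interleaves the selected examples with written rationales (in the paper, GPT-4 generates these rationales itself rather than relying on expert-written ones). A minimal sketch, with illustrative field names:

```python
def build_cot_prompt(question: str, examples: list[dict]) -> str:
    """Assemble a few-shot prompt in which each example carries a
    rationale, nudging the model to reason before answering.
    Field names ("rationale", etc.) are illustrative."""
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['rationale']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # End with the new question and a step-by-step cue.
    parts.append(f"Question: {question}\nLet's think step by step.")
    return "\n".join(parts)
```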
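For ensembling, Medprompt shuffles the multiple-choice options across runs (to reduce position bias) and majority-votes the answers. The sketch below assumes a hypothetical `ask_llm` callable that submits the prompt and returns the chosen option:

```python
import random
from collections import Counter
from typing import Callable

def ensemble_answer(
    question: str,
    options: list[str],
    ask_llm: Callable[[str, list[str]], str],  # hypothetical model call
    n_runs: int = 5,
    seed: int = 0,
) -> str:
    """Query the model n_runs times, shuffling the answer options on
    each run, then return the majority-vote answer."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_runs):
        shuffled = options[:]
        rng.shuffle(shuffled)
        votes.append(ask_llm(question, shuffled))
    return Counter(votes).most_common(1)[0][0]
```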