Modern Fine-Tuning

This is a non-comprehensive review of modern LLM fine-tuning. These notes are my way of revising for research roles. If you're hiring, reach out!

TL;DR: Full fine-tuning updates all model parameters, maximising performance but requiring high memory and compute. Parameter-efficient fine-tuning (PEFT) attempts to reduce the number of trainable parameters while maintaining good performance.

Full Fine-Tuning

Full fine-tuning updates all the parameters of a (typically pre-trained) deep neural network on a new, task-specific dataset. This approach generally yields the highest performance but is computationally intensive and requires considerable VRAM since all gradients must be tracked for all parameters. When the new task has limited data, full fine-tuning can lead to overfitting and may result in the loss of general knowledge from pre-training (catastrophic forgetting).

If \(N\) is the number of parameters and each parameter takes \(B\) bytes (e.g. 4 bytes for FP32 or 2 bytes for FP16), then for plain SGD (which stores parameters and gradients) and for Adam (which additionally stores two tensors per parameter, the first and second moment estimates), the total static memory footprint (ignoring activations) is given by:

\[\text{Memory}_{\text{SGD}} = (\text{params} + \text{gradients}) \times N \times B = 2NB\]

\[\text{Memory}_{\text{Adam}} = (\text{params} + \text{gradients} + 2 \times \text{optimizer state}) \times N \times B = 4NB\]

Importantly, these equations do not include the additional memory required for activations during training, which depends on batch size and model architecture, nor CUDA/framework overhead.

Numerical Precision Considerations
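
The byte width \(B\) above depends on the numerical precision used for weights, gradients, and optimizer state. As a rough illustration, here is a minimal sketch (the model sizes and helper function are illustrative, not from the original notes) that plugs different precisions into the formula:

```python
# Rough static memory estimate (parameters + gradients + optimizer state),
# ignoring activations, CUDA context, and framework overhead.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def static_memory_gb(n_params: float, dtype: str = "fp32", optimizer: str = "adam") -> float:
    """2NB for plain SGD (params + grads), 4NB for Adam (+ two moment tensors)."""
    b = BYTES_PER_PARAM[dtype]
    n_states = 2 if optimizer == "sgd" else 4
    return n_states * n_params * b / 1e9


if __name__ == "__main__":
    for n in (125e6, 1.3e9, 7e9):          # illustrative model sizes
        for dtype in ("fp32", "fp16"):
            print(f"{n / 1e9:>5.2f}B params, {dtype}, Adam: "
                  f"{static_memory_gb(n, dtype):6.1f} GB")
```

Note that in practice, mixed-precision training usually keeps an FP32 master copy of the weights and optimizer state, so the real footprint typically sits between the pure-FP16 and pure-FP32 estimates.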

Parameter-Efficient Fine-Tuning (PEFT)

Fine-tuning all parameters is costly. PEFT methods reduce the number of trainable parameters while maintaining good performance.

Quantized Fine-Tuning

Quantization reduces memory usage by storing model parameters, gradients, and optimizer states in lower bit-width formats (e.g., 8-bit or 4-bit). This reduces both memory footprint and computational load but can affect training stability and model performance.
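
As a sketch of the core idea (not the exact scheme used by QLoRA or bitsandbytes), the following shows simple absmax int8 quantization of a weight tensor and the round-trip error it introduces:

```python
import torch


def absmax_quantize(w: torch.Tensor):
    """Symmetric (absmax) int8 quantization: store int8 values plus one fp scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale


w = torch.randn(4096, 4096)                       # a toy weight matrix
q, scale = absmax_quantize(w)

print("fp32 size:", w.numel() * 4 / 1e6, "MB")    # ~67 MB
print("int8 size:", q.numel() * 1 / 1e6, "MB")    # ~17 MB
print("mean abs error:", (w - dequantize(q, scale)).abs().mean().item())
```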

Adapters [Houlsby et al., 2019]

Adapters insert small trainable modules between layers while freezing the original weights. This method introduces additional parameters but allows for efficient fine-tuning with minimal computational overhead.
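
A minimal sketch of a Houlsby-style bottleneck adapter (the dimensions and placement are illustrative; the original paper inserts two such adapters per transformer layer):

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Near-identity initialisation so training starts close to the frozen model.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))


# Usage: freeze the base model and train only the adapter parameters.
adapter = Adapter(d_model=768)
h = torch.randn(2, 16, 768)        # (batch, seq, hidden)
print(adapter(h).shape)            # torch.Size([2, 16, 768])
```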

Low-Rank Adaptation (LoRA) [Hu et al., 2021]

LoRA updates weights using low-rank matrices:

\[\Delta W = AB\]

where \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times k}\) with \(r < \min(d,k)\). LoRA significantly reduces trainable parameters and allows merging the updates back into the original model after training.
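
A minimal sketch of a LoRA-augmented linear layer, following the notation above (\(\Delta W = AB\), base weight frozen; the \(\alpha / r\) scaling follows the original paper):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """y = base(x) + x (A B) * (alpha / r), with the base weight frozen and only A, B trained."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                 # freeze pre-trained weight
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)     # d x r
        self.B = nn.Parameter(torch.zeros(r, d_out))           # r x k, zero-init so ΔW = 0 at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A @ self.B) * self.scaling

    def merge(self) -> None:
        """Fold ΔW back into the base weight for inference (no extra latency)."""
        with torch.no_grad():
            self.base.weight += (self.A @ self.B).T * self.scaling
            self.A.zero_()
            self.B.zero_()


layer = LoRALinear(768, 768, r=8)
x = torch.randn(2, 16, 768)
print(layer(x).shape)              # torch.Size([2, 16, 768])
```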

Variants

Popular LoRA variants include QLoRA [Dettmers et al., 2023], which trains LoRA adapters on top of a 4-bit quantized base model, and DoRA [Liu et al., 2024], which decomposes each pre-trained weight into magnitude and direction and applies the low-rank update to the direction component.

Prefix-Tuning [Li and Liang, 2021]

Prefix-tuning prepends trainable soft tokens to the input sequence at each layer, keeping LM weights frozen. A simplified version, Prompt-Tuning [Lester et al., 2021], applies this only to the embedding layer. This method increases sequence length but allows for fast adaptation to new tasks.
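
A minimal sketch of the prompt-tuning variant (soft tokens learned only at the embedding layer; prefix-tuning additionally learns key/value prefixes at every attention layer, which is omitted here):

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepends n_tokens learned embeddings to the (frozen) token embeddings."""

    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq, d_model) from the frozen embedding layer
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)   # sequence grows by n_tokens


soft_prompt = SoftPrompt(n_tokens=20, d_model=768)
embeds = torch.randn(2, 16, 768)
print(soft_prompt(embeds).shape)   # torch.Size([2, 36, 768])
```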

Representation Fine-Tuning (ReFT) [Wu et al., 2024]

ReFT applies task-specific interventions to hidden states at predefined positions and layers, keeping the model weights frozen. LoReFT, its low-rank instantiation, updates the hidden state as:

\[h' = h + R^T(W h + b - R h)\]

where \(R \in \mathbb{R}^{r \times d}\) and \(W \in \mathbb{R}^{r \times d}\) are low-rank projection matrices (with \(r \ll d\)) and \(b \in \mathbb{R}^{r}\) is a bias term.
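
A minimal sketch of the LoReFT intervention above, applied to hidden states at the chosen positions (the orthonormality constraint on \(R\) that the original paper enforces is only noted in a comment here):

```python
import torch
import torch.nn as nn


class LoReFTIntervention(nn.Module):
    """h' = h + R^T (W h + b - R h): edit the hidden state within an r-dimensional subspace."""

    def __init__(self, d_model: int, r: int = 4):
        super().__init__()
        # The ReFT paper keeps the rows of R orthonormal; a plain parameter is used here.
        self.R = nn.Parameter(torch.randn(r, d_model) * 0.02)
        self.W = nn.Linear(d_model, r)        # computes W h + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, positions, d_model) -- only the intervened positions are passed in
        delta = self.W(h) - h @ self.R.T      # (batch, positions, r)
        return h + delta @ self.R             # project the edit back to d_model


reft = LoReFTIntervention(d_model=768, r=4)
h = torch.randn(2, 3, 768)    # hidden states at 3 intervened positions
print(reft(h).shape)          # torch.Size([2, 3, 768])
```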

Summary: Fine-Tuning Trade-offs

| Method | Pros | Cons | Trainable Parameters |
|--------|------|------|----------------------|
| Full Fine-Tuning | Maximises performance; utilises all model capacity; simple | High memory/compute cost; risk of overfitting | 100% |
| Adapters | Low additional parameter count; modular (switchable adapters) | Increases total parameter count; performance depends on design | ~3% |
| LoRA | Reduces trainable parameters; merges back into the base model | Hyperparameter sensitive; may not match full fine-tuning performance | 0.5-1% |
| Prefix-Tuning | Switchable prefixes for different tasks | Increases sequence length | 0.1% |
| ReFT | Minimises parameter changes; hybrid of LoRA/adapters | Requires careful tuning (layer/position selection) | 0.025-0.05% |

Conclusion

The choice between full fine-tuning and PEFT depends on available resources and task requirements. Full fine-tuning maximises performance but is expensive. PEFT methods like Adapters, LoRA, and ReFT provide efficient alternatives when data or compute is limited.

Citations

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.

Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv:2402.09353.

Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190.

Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., & Potts, C. (2024). ReFT: Representation Finetuning for Language Models. arXiv:2404.03592.

Gautier Dagan © 2025