Fanny Walldén: Fine-Tuning Language Models with Preferences for Text Summarization Tasks: A Comparative Study of Reinforcement Learning from Human Feedback and Direct Preference Optimization
Presentation of Master's theses in Mathematical statistics
Time: Wed 2026-06-03 08.00 - 08.45
Location: Albano, Mittag-Leffler room, floor 3, house 1
Respondent: Fanny Walldén
Supervisor: Chun-Biu Li
Abstract: This work investigates reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), two approaches for fine-tuning language models on human preferences. Both approaches are trained and evaluated on a text summarization task using the ROUGE and BERTScore metrics. In addition, the language models are fine-tuned on AI preferences obtained from an off-the-shelf language model, to investigate how robust the approaches are to different sources of feedback. The results demonstrate the main disadvantage of RLHF, namely that it is prone to reward hacking, a common problem in reinforcement learning that occurs when the optimized objective diverges from the actual goal. The results also demonstrate that the main advantage of DPO lies in its robustness and stability. Lastly, the results show that both approaches are robust to different sources of feedback, enabling efficient feedback generation without loss of quality.
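For readers unfamiliar with DPO, the following is a minimal sketch of the DPO loss for a single preference pair, as it is commonly formulated in the literature. It is not taken from the thesis: the function name, the argument names, and the default beta=0.1 are illustrative assumptions. The inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") summaries under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, for illustration).

    Computes -log(sigmoid(beta * (chosen_logratio - rejected_logratio))),
    where each log-ratio compares the trained policy to the reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Shifting probability toward the chosen summary lowers the loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Unlike RLHF, this objective needs no separately trained reward model and no reinforcement learning loop, which is one commonly cited reason for the stability of DPO.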
