This paper is published in Volume-11, Issue-5, 2025
Area
Computer Science
Author
Armaan Mahajan
Org/Univ
Pathways School Gurgaon, Haryana, India
Keywords
LLMs, Adversarial Training, Artificial Intelligence, RLHF, ChatGPT, Jailbreaking, LAT
Citations
IEEE
Armaan Mahajan. A Deep Dive into the Adversarial Training Process of Modern LLMs, International Journal of Advance Research, Ideas and Innovations in Technology, www.IJARIIT.com.
APA
Armaan Mahajan (2025). A Deep Dive into the Adversarial Training Process of Modern LLMs. International Journal of Advance Research, Ideas and Innovations in Technology, 11(5) www.IJARIIT.com.
MLA
Armaan Mahajan. "A Deep Dive into the Adversarial Training Process of Modern LLMs." International Journal of Advance Research, Ideas and Innovations in Technology 11.5 (2025). www.IJARIIT.com.
Abstract
Large language models can be trained to resist jailbreak attempts, which helps keep their behavior safe. This literature review analyzes adversarial training techniques commonly used in modern LLMs and breaks down how the adversarial training process works. The paper examines six techniques, ranging from projected gradient descent (PGD) to deployment-time safety layers and model editing. For each method, we describe how it works, its strengths and weaknesses, and its real-world usage. We then synthesize these findings into a recommended approach for training models against adversarial inputs. The review finds that the most effective approach is a multilayered defense strategy rather than reliance on any single technique. Future research could improve model-editing coverage, create open-source LLM benchmarks for testing jailbreak resistance, and close the gap between embedding-space and discrete-input training.
