
I work on trustworthy and interpretable AI. My research goal is to understand how models learn (a.k.a. training dynamics) well enough that we can build models we are willing to trust. I am a master's student at Hanyang University, advised by Sungyoon Lee.
The two perturbations, δ in the input space and η in the parameter space, are usually studied in different rooms: one as adversarial robustness, the other as training dynamics. I think they belong on the same page. Studying them jointly is a useful lens on robustness, generalization, and what a model is really doing when it generalizes.
Concretely, I am drawn to questions like: Why does the robustness–accuracy trade-off persist? Why does catastrophic overfitting appear so abruptly? When does in-context learning behave like an implicit optimizer, and when does it not? And which components of training dynamics cause reward hacking?
- Adversarial Robustness
  - Robustness–Accuracy Trade-off
  - Catastrophic Overfitting
- LLM Jailbreaking
  - Certifiable Defense
  - Continuous Defense
- In-Context Learning
- Linear Transformer
- Model Unlearning
- Reasoning
- Reward Hacking
- Edge of Stability in Adversarial Training — how the EoS regime interacts with catastrophic overfitting.
- Reward Hacking — mechanisms behind reward hacking in alignment-tuned models, and training-dynamics-aware ways to mitigate it.
For papers, see publications. For the long form, CV. For anything else, email works.