
I work on trustworthy and interpretable AI. My goal is to understand how models learn, well enough that we can build models we are willing to trust. I am a master's student at Hanyang University, advised by Sungyoon Lee.
The two perturbations, δ in the input and η in the parameters, are usually studied in different rooms: δ under adversarial robustness, η under training dynamics. I think they belong on the same page. Studying them jointly is a useful lens on robustness, generalization, and what a model is really doing when it learns.
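As a schematic, here is one way to put the two on the same page (my shorthand for this page, not a formulation from any specific paper; ε and ρ are the input and parameter perturbation budgets, f_θ the model, and ℒ the loss):

$$
\min_{\theta}\; \max_{\|\delta\| \le \epsilon}\; \max_{\|\eta\| \le \rho}\; \mathcal{L}\big(f_{\theta+\eta}(x+\delta),\, y\big)
$$

Setting ρ = 0 recovers adversarial training; setting ε = 0 recovers a sharpness-aware (SAM-style) objective. The questions below mostly live where the two maximizations interact.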
Concretely, I am drawn to questions like: Why does the robustness–accuracy trade-off persist? Why does catastrophic overfitting appear so abruptly? When does in-context learning behave like an implicit optimizer, and when does it not? And what is actually being forgotten in model unlearning?
- Adversarial Robustness
  - Robustness–Accuracy Trade-off
  - Catastrophic Overfitting
- LLM Jailbreaking
  - Certifiable Defense
- In-Context Learning
- Linear Transformer
- Model Unlearning
- Reasoning
- Edge of Stability in Adversarial Training — how the EoS regime interacts with catastrophic overfitting.
- Reward Hacking — mechanisms behind reward hacking in alignment-tuned models, and training-dynamics-aware ways to mitigate it.
For papers, see publications. For the long form, see the CV. For anything else, email works.