BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach

Based on the presentation of the paper by Mao Ye, Bo Liu, Stephen Wright, Peter Stone, and Qiang Liu


Motivation

Bilevel Optimization (BO) is crucial for meta-learning, hyperparameter optimization, and adversarial training, but it is notoriously difficult to scale. Traditional methods fall into two camps:

  1. Hypergradient Descent (Implicit Differentiation): Requires solving the inner $\arg\min$ (nearly) exactly and computing expensive Hessian inverses ($\nabla_{v} \theta^*(v) = -[\nabla_{22}g]^{-1} \nabla_{12}g$). In practice this means conjugate-gradient iterations or Neumann-series approximations, making it cumbersome for deep learning.
  2. Stationary-Seeking Methods: Replace the inner problem with the stationarity constraint $\nabla_\theta g = 0$. This only works if $g$ is convex in $\theta$ and fails in the non-convex settings typical of neural networks.
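
For reference, the implicit gradient in camp 1 follows from differentiating the inner optimality condition $\nabla_2 g(v, \theta^*(v)) = 0$ with respect to $v$:

\[0 = \nabla_{12} g + \nabla_{22} g \, \nabla_v \theta^*(v) \quad\Longrightarrow\quad \nabla_v \theta^*(v) = -[\nabla_{22} g]^{-1} \nabla_{12} g,\]

so every outer update requires solving a linear system against the inner Hessian $\nabla_{22} g$.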

The Core Question: How can we perform Bilevel Optimization in large-scale deep learning without computing implicit derivatives or Hessians?

The Answer: BOME (Bilevel Optimization Made Easy).


Key Concepts

1. Value-Function Based Reformulation

Instead of dealing with the nested structure, BOME leverages a Value-Function reformulation:

\[\min_{v,\theta} f(v,\theta) \quad \text{s.t.} \quad q(v,\theta) := g(v,\theta) - g^*(v) \leq 0\]

where $g^*(v) = \min_\theta g(v,\theta)$ is the value function. Since $q(v,\theta) \geq 0$ by construction, the constraint $q \leq 0$ holds exactly when $\theta$ minimizes $g(v,\cdot)$, so this single-level problem is equivalent to the original bilevel problem.

Why is this powerful? By Danskin’s theorem, the gradient of the value function satisfies

\[\nabla_v g^*(v) = \nabla_1 g(v, \theta^*(v)),\]

i.e., it depends only on the partial gradient of $g$ at the inner minimizer and not on the Jacobian $\nabla_v \theta^*(v)$.

This removes the need for implicit differentiation entirely! This formulation is valid even when $g$ is non-convex.
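
To see why, apply the chain rule to $g^*(v) = g(v, \theta^*(v))$ (a one-line check, assuming $\theta^*(v)$ is differentiable):

\[\nabla_v g^*(v) = \nabla_1 g(v, \theta^*(v)) + \nabla_v \theta^*(v)^\top \underbrace{\nabla_2 g(v, \theta^*(v))}_{=\,0 \ \text{at the inner minimizer}} = \nabla_1 g(v, \theta^*(v)).\]

The Jacobian only ever multiplies the inner-optimality gradient, which vanishes, so it never needs to be computed.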

2. Algorithm: BOME in Practice

BOME approximates the value function by simply running a few steps of gradient descent on the inner problem, and enforces the resulting constraint with a dynamic barrier method.

Algorithm 1: BOME

  1. Input: Initial $(v_0, \theta_0)$, outer step size $\xi$, inner step size $\alpha$, inner steps $T$, barrier coefficient $\eta$.
  2. Inner Loop (Approximate Solution):
    Run $T$ steps of GD on the inner problem: $\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta g(v, \theta^{(t)})$.
  3. Approximate Value Function:
    $\hat{q}(v, \theta) = g(v, \theta) - g(v, \theta^{(T)})$.
    Crucial detail: A stop-gradient is applied to $\theta^{(T)}$ to avoid backpropagating through the inner optimization path.
  4. Dynamic Barrier Multiplier:
    $\lambda_k = \max\left( \frac{\phi_k - \langle \nabla f_k, \nabla \hat{q}_k \rangle}{\|\nabla \hat{q}_k\|^2}, \, 0 \right)$ with control barrier $\phi_k = \eta \|\nabla \hat{q}_k\|^2$, chosen so that the combined update direction still decreases $\hat{q}$.
  5. Joint Update:
    $(v_{k+1}, \theta_{k+1}) \leftarrow (v_k, \theta_k) - \xi (\nabla f_k + \lambda_k \nabla \hat{q}_k)$.

Simplicity: This is a fully first-order method requiring only standard autodiff operations available in PyTorch/TensorFlow.
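
To make this concrete, here is a minimal PyTorch sketch of the loop above on a toy quadratic bilevel problem (the problem instance, hyperparameter values, and variable names are illustrative assumptions, not taken from the paper):

```python
import torch

# Hypothetical toy bilevel problem (illustrative only):
#   outer:  f(v, th) = (th - 1)^2 + v^2
#   inner:  g(v, th) = 0.5*th^2 - v*th,  whose minimizer is th*(v) = v
# Substituting th = v into f gives the bilevel solution v* = th* = 0.5.

def f(v, th):
    return (th - 1.0) ** 2 + v ** 2

def g(v, th):
    return 0.5 * th ** 2 - v * th

v = torch.tensor(3.0, requires_grad=True)
th = torch.tensor(0.0, requires_grad=True)
xi, alpha, eta, T = 0.05, 0.1, 0.5, 10  # outer lr, inner lr, barrier coeff., inner steps

for k in range(500):
    # Inner loop: T plain gradient-descent steps on g, approximating theta^(T)
    th_T = th.detach().clone()
    for _ in range(T):
        th_T.requires_grad_(True)
        (grad_th,) = torch.autograd.grad(g(v.detach(), th_T), th_T)
        th_T = (th_T - alpha * grad_th).detach()

    # Surrogate constraint q_hat = g(v, th) - g(v, th_T); th_T is detached,
    # which is exactly the stop-gradient on theta^(T)
    q_hat = g(v, th) - g(v, th_T)

    df = torch.autograd.grad(f(v, th), (v, th))   # first-order grads of f
    dq = torch.autograd.grad(q_hat, (v, th))      # first-order grads of q_hat

    # Dynamic-barrier multiplier:
    # lambda = max((eta*||dq||^2 - <df, dq>) / ||dq||^2, 0)
    dq_norm2 = sum((x ** 2).sum() for x in dq)
    df_dot_dq = sum((a * b).sum() for a, b in zip(df, dq))
    lam = torch.clamp((eta * dq_norm2 - df_dot_dq) / (dq_norm2 + 1e-12), min=0.0)

    # Joint first-order update of (v, th)
    with torch.no_grad():
        v -= xi * (df[0] + lam * dq[0])
        th -= xi * (df[1] + lam * dq[1])

print(f"v = {v.item():.3f}, th = {th.item():.3f}")  # both should land near 0.5
```

The only backward passes are plain first-order gradients of $f$ and $\hat{q}$; detaching $\theta^{(T)}$ before it enters $\hat{q}$ implements the stop-gradient from step 3, so no second-order term ever appears.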


Theoretical Analysis

1. Proximal KKT Measure

Standard KKT conditions are uninformative here: since $q \geq 0$ everywhere, $\nabla q = 0$ at every feasible point, so the usual constraint qualifications (e.g., CRCQ) fail. BOME therefore analyzes convergence via a proximal KKT measure:

\[\mathcal{K}(v, \theta) = \min_{\lambda \geq 0} \underbrace{\| \nabla f(v, \theta) + \lambda \nabla q(v, \theta) \|^2}_{\text{Stationarity}} + \underbrace{q(v, \theta)}_{\text{Feasibility}}\]
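
The minimization over $\lambda$ is a one-dimensional quadratic with a closed-form solution, so the measure is cheap to evaluate. Below is a small helper sketch, assuming the two gradients have been flattened into single vectors (in practice $\nabla q$ involves the exact value function, so BOME tracks the measure through the surrogate $\hat{q}$):

```python
import torch

def proximal_kkt(grad_f: torch.Tensor, grad_q: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # min_{lam >= 0} ||grad_f + lam * grad_q||^2 is quadratic in lam;
    # its constrained minimizer is lam* = max(0, -<grad_f, grad_q> / ||grad_q||^2).
    lam = torch.clamp(-(grad_f @ grad_q) / (grad_q @ grad_q + 1e-12), min=0.0)
    return (grad_f + lam * grad_q).pow(2).sum() + q
```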

2. Convergence Rates (First Non-Asymptotic Guarantee)

Under smoothness and PL (Polyak-Łojasiewicz) assumptions for the inner problem:

Unimodal $g$:

\[\min_{k \leq K} \mathcal{K}(v_k, \theta_k) = O\left(\sqrt{\xi} + \sqrt{\frac{q_0}{\xi K}} + \frac{1}{\xi K} + e^{-bT}\right)\]

where $q_0$ is the initial constraint violation. A better initialization (smaller $q_0$) therefore speeds up convergence.

Multimodal $g$ (Local Optima): Even when $g$ satisfies PL only locally within basins, the algorithm maintains the rate

\[\min_{k \leq K} \mathcal{K}^\circ(v_k, \theta_k) = O\left(\sqrt{\xi} + \sqrt{\frac{1}{\xi K}} + e^{-bT}\right)\]

With the step-size schedule $\xi = O(K^{-1/2})$, BOME achieves an overall rate of $O(K^{-1/4})$.


Experimental Results: Continual Learning

BOME was tested on Contextual Transformation Networks (CTN) for sequential tasks (PMNIST, Split CIFAR).

| Method (PMNIST) | ACC (↑) | NBT (↓) | FT (↑) |
| --- | --- | --- | --- |
| CTN (+ITD) | 78.40 | 5.62 | 84.02 |
| CTN (+BVFSM) | 77.78 | 7.25 | 85.03 |
| CTN (+BOME) | **80.70** | **4.09** | 84.79 |

(ACC: average accuracy; NBT: negative backward transfer; FT: forward transfer. Arrows indicate the better direction.)

Key Observations:

  1. BOME delivers the best average accuracy (80.70) and the least forgetting (lowest NBT, 4.09) among the compared bilevel solvers.
  2. It stays competitive on forward transfer (84.79) while using only first-order information.

Key Takeaways

  1. No Hessians Required: BOME is a simple, fully first-order method that bypasses the need for implicit differentiation.
  2. Theoretical Backing: It is the first method to offer a non-asymptotic convergence rate for general non-convex bilevel optimization.
  3. Practicality: It scales to deep learning problems and is robust to hyperparameter choices, making it an excellent default choice for meta-learning pipelines.

References

  1. Mao Ye, Bo Liu, Stephen Wright, Peter Stone, Qiang Liu. "BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach." Advances in Neural Information Processing Systems (NeurIPS), 2022.
