BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach

Based on the presentation of the paper by Mao Ye, Bo Liu, Stephen Wright, Peter Stone, and Qiang Liu


Motivation

Bilevel Optimization (BO) is crucial for meta-learning, hyperparameter optimization, and adversarial training, but it is notoriously difficult to scale. Traditional methods fall into two camps:

  1. Hypergradient Descent (Implicit Differentiation): Requires solving the inner $\arg\min$ (nearly) exactly and computing expensive Hessian inverses ($\nabla_{v} \theta^*(v) = -[\nabla_{22}g]^{-1} \nabla_{12}g$). In practice this means conjugate-gradient iterations or Neumann-series approximations, making it cumbersome for deep learning.
  2. Stationary-Seeking Methods: Replace the inner problem with the stationarity constraint $\nabla_\theta g = 0$. This only works if $g$ is convex in $\theta$ and fails in the non-convex settings typical of neural networks.
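
For reference, the implicit gradient in camp 1 follows from differentiating the inner optimality condition $\nabla_2 g(v, \theta^*(v)) = 0$ with respect to $v$:

\[0 = \nabla_{12} g + \nabla_{22} g \, \nabla_v \theta^*(v) \quad\Longrightarrow\quad \nabla_v \theta^*(v) = -[\nabla_{22} g]^{-1} \nabla_{12} g,\]

so every outer update requires solving a linear system against the inner Hessian $\nabla_{22} g$.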

The Core Question: How can we perform Bilevel Optimization in large-scale deep learning without computing implicit derivatives or Hessians?

The Answer: BOME (Bilevel Optimization Made Easy).


Key Concepts

1. Value-Function Based Reformulation

Instead of dealing with the nested structure, BOME leverages a Value-Function reformulation:

\[\min_{v,\theta} f(v,\theta) \quad \text{s.t.} \quad q(v,\theta) := g(v,\theta) - g^*(v) \leq 0\]

where $g^*(v) = \min_\theta g(v,\theta)$ is the value function. Since $q(v,\theta) \geq 0$ by construction, the constraint $q \leq 0$ holds exactly when $\theta$ minimizes $g(v,\cdot)$, so this single-level problem is equivalent to the original bilevel problem.

Why is this powerful? By Danskin’s theorem, the gradient of the value function satisfies

\[\nabla_v g^*(v) = \nabla_1 g(v, \theta^*(v)),\]

i.e., it depends only on the partial gradient of $g$ at the inner minimizer and not on the Jacobian $\nabla_v \theta^*(v)$.

This removes the need for implicit differentiation entirely! This formulation is valid even when $g$ is non-convex.
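
To see why, apply the chain rule to $g^*(v) = g(v, \theta^*(v))$ (a one-line check, assuming $\theta^*(v)$ is differentiable):

\[\nabla_v g^*(v) = \nabla_1 g(v, \theta^*(v)) + \nabla_v \theta^*(v)^\top \underbrace{\nabla_2 g(v, \theta^*(v))}_{=\,0 \ \text{at the inner minimizer}} = \nabla_1 g(v, \theta^*(v)).\]

The Jacobian only ever multiplies the inner-optimality gradient, which vanishes, so it never needs to be computed.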

2. Algorithm: BOME in Practice

BOME approximates the value function by simply running a few steps of gradient descent on the inner problem, and enforces the resulting constraint with a dynamic barrier method.

Algorithm 1: BOME

  1. Input: Initial $(v_0, \theta_0)$, outer step size $\xi$, inner step size $\alpha$, inner steps $T$, barrier coefficient $\eta$.
  2. Inner Loop (Approximate Solution):
    Run $T$ steps of GD on the inner problem: $\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta g(v, \theta^{(t)})$.
  3. Approximate Value Function:
    $\hat{q}(v, \theta) = g(v, \theta) - g(v, \theta^{(T)})$.
    Crucial detail: A stop-gradient is applied to $\theta^{(T)}$ to avoid backpropagating through the inner optimization path.
  4. Dynamic Barrier Multiplier:
    $\lambda_k = \max\left( \frac{\phi_k - \langle \nabla f_k, \nabla \hat{q}_k \rangle}{\|\nabla \hat{q}_k\|^2}, \, 0 \right)$ with control barrier $\phi_k = \eta \|\nabla \hat{q}_k\|^2$, chosen so that the combined update direction still decreases $\hat{q}$.
  5. Joint Update:
    $(v_{k+1}, \theta_{k+1}) \leftarrow (v_k, \theta_k) - \xi (\nabla f_k + \lambda_k \nabla \hat{q}_k)$.

Simplicity: This is a fully first-order method requiring only standard autodiff operations available in PyTorch/TensorFlow.
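
To make this concrete, here is a minimal PyTorch sketch of the loop above on a toy quadratic bilevel problem (the problem instance, hyperparameter values, and variable names are illustrative assumptions, not taken from the paper):

```python
import torch

# Hypothetical toy bilevel problem (illustrative only):
#   outer:  f(v, th) = (th - 1)^2 + v^2
#   inner:  g(v, th) = 0.5*th^2 - v*th,  whose minimizer is th*(v) = v
# Substituting th = v into f gives the bilevel solution v* = th* = 0.5.

def f(v, th):
    return (th - 1.0) ** 2 + v ** 2

def g(v, th):
    return 0.5 * th ** 2 - v * th

v = torch.tensor(3.0, requires_grad=True)
th = torch.tensor(0.0, requires_grad=True)
xi, alpha, eta, T = 0.05, 0.1, 0.5, 10  # outer lr, inner lr, barrier coeff., inner steps

for k in range(500):
    # Inner loop: T plain gradient-descent steps on g, approximating theta^(T)
    th_T = th.detach().clone()
    for _ in range(T):
        th_T.requires_grad_(True)
        (grad_th,) = torch.autograd.grad(g(v.detach(), th_T), th_T)
        th_T = (th_T - alpha * grad_th).detach()

    # Surrogate constraint q_hat = g(v, th) - g(v, th_T); th_T is detached,
    # which is exactly the stop-gradient on theta^(T)
    q_hat = g(v, th) - g(v, th_T)

    df = torch.autograd.grad(f(v, th), (v, th))   # first-order grads of f
    dq = torch.autograd.grad(q_hat, (v, th))      # first-order grads of q_hat

    # Dynamic-barrier multiplier:
    # lambda = max((eta*||dq||^2 - <df, dq>) / ||dq||^2, 0)
    dq_norm2 = sum((x ** 2).sum() for x in dq)
    df_dot_dq = sum((a * b).sum() for a, b in zip(df, dq))
    lam = torch.clamp((eta * dq_norm2 - df_dot_dq) / (dq_norm2 + 1e-12), min=0.0)

    # Joint first-order update of (v, th)
    with torch.no_grad():
        v -= xi * (df[0] + lam * dq[0])
        th -= xi * (df[1] + lam * dq[1])

print(f"v = {v.item():.3f}, th = {th.item():.3f}")  # both should land near 0.5
```

The only backward passes are plain first-order gradients of $f$ and $\hat{q}$; detaching $\theta^{(T)}$ before it enters $\hat{q}$ implements the stop-gradient from step 3, so no second-order term ever appears.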


Theoretical Analysis

1. Proximal KKT Measure

Standard KKT conditions are uninformative here: since $q \geq 0$ everywhere, $\nabla q = 0$ at every feasible point, so the usual constraint qualifications (e.g., CRCQ) fail. BOME therefore analyzes convergence via a proximal KKT measure:

\[\mathcal{K}(v, \theta) = \min_{\lambda \geq 0} \underbrace{\| \nabla f(v, \theta) + \lambda \nabla q(v, \theta) \|^2}_{\text{Stationarity}} + \underbrace{q(v, \theta)}_{\text{Feasibility}}\]
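
The minimization over $\lambda$ is a one-dimensional quadratic with a closed-form solution, so the measure is cheap to evaluate. Below is a small helper sketch, assuming the two gradients have been flattened into single vectors (in practice $\nabla q$ involves the exact value function, so BOME tracks the measure through the surrogate $\hat{q}$):

```python
import torch

def proximal_kkt(grad_f: torch.Tensor, grad_q: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # min_{lam >= 0} ||grad_f + lam * grad_q||^2 is quadratic in lam;
    # its constrained minimizer is lam* = max(0, -<grad_f, grad_q> / ||grad_q||^2).
    lam = torch.clamp(-(grad_f @ grad_q) / (grad_q @ grad_q + 1e-12), min=0.0)
    return (grad_f + lam * grad_q).pow(2).sum() + q
```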

2. Convergence Rates (First Non-Asymptotic Guarantee)

Under smoothness and PL (Polyak-Łojasiewicz) assumptions for the inner problem:

Unimodal $g$:

\[\min_{k \leq K} \mathcal{K}(v_k, \theta_k) = O\left(\sqrt{\xi} + \sqrt{\frac{q_0}{\xi K}} + \frac{1}{\xi K} + e^{-bT}\right)\]

where $q_0$ is the initial constraint violation. A better initialization (smaller $q_0$) therefore speeds up convergence.

Multimodal $g$ (Local Optima): Even when $g$ satisfies PL only locally within basins, the algorithm maintains the rate

\[\min_{k \leq K} \mathcal{K}^\circ(v_k, \theta_k) = O\left(\sqrt{\xi} + \sqrt{\frac{1}{\xi K}} + e^{-bT}\right)\]

With the step-size schedule $\xi = O(K^{-1/2})$, BOME achieves an overall rate of $O(K^{-1/4})$.


Experimental Results: Continual Learning

BOME was tested on Contextual Transformation Networks (CTN) for sequential tasks (PMNIST, Split CIFAR).

| Method (PMNIST) | ACC (↑) | NBT (↓) | FT (↑) |
| --- | --- | --- | --- |
| CTN (+ITD) | 78.40 | 5.62 | 84.02 |
| CTN (+BVFSM) | 77.78 | 7.25 | 85.03 |
| CTN (+BOME) | **80.70** | **4.09** | 84.79 |

(ACC: average accuracy; NBT: negative backward transfer; FT: forward transfer. Arrows indicate the better direction.)

Key Observations:

  1. BOME delivers the best average accuracy (80.70) and the least forgetting (lowest NBT, 4.09) among the compared bilevel solvers.
  2. It stays competitive on forward transfer (84.79) while using only first-order information.

Key Takeaways

  1. No Hessians Required: BOME is a simple, fully first-order method that bypasses the need for implicit differentiation.
  2. Theoretical Backing: It is the first method to offer a non-asymptotic convergence rate for general non-convex bilevel optimization.
  3. Practicality: It scales to deep learning problems and is robust to hyperparameter choices, making it an excellent default choice for meta-learning pipelines.

References

  1. Mao Ye, Bo Liu, Stephen Wright, Peter Stone, Qiang Liu. "BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach." Advances in Neural Information Processing Systems (NeurIPS), 2022.
