!pip install plotly
!pip install modopt
Blog: Deriving and experimenting with a (new) non-convex penalty based on the $k$-support norm
Disclaimer: This blog post just proposes some initial ideas; forgive the imperfect writing and typos, and feel free to let me know if you spot any mistake!
Description of the problem and introduction of the non-convex penalty based on the $k$-support norm
We will consider the following optimization problem:
$$ \min_{x \in \mathbb{R}^d} f(x) + \lambda h(x), $$where $f$ is a smooth function, $h$ is a penalty term whose goal is to ensure sparsity of the solution $x$, and $\lambda \geq 0$ is the penalty strength. We will consider the following (new, to the best of our knowledge) non-convex penalty term:
$$h(x) := g(x) - \frac{1}{2} \|x \|^2,$$where $g(\cdot) := \frac{1}{2}\left(\| \cdot \|^{sp}_{k}\right)^2 $ denotes the half-squared $k$-support norm defined in [1]. We will optimize this problem with (non-convex) proximal gradient descent [2].
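To make this definition concrete, here is a minimal NumPy sketch of the $k$-support norm and of the penalty $h$. It is based on our recollection of the computational formula of Proposition 2.1 in [1] (please double-check against the paper before reusing it); the helper names k_support_norm and penalty_h are ours.

import numpy as np

def k_support_norm(w, k):
    # Sketch following (our reading of) Proposition 2.1 in [1]: sort |w| in
    # non-increasing order, find the unique r in {0, ..., k-1} such that
    #   |w|_{k-r-1} > (1/(r+1)) * sum_{i=k-r}^{d} |w|_i >= |w|_{k-r}
    # (1-based indices, with the convention |w|_0 = +inf), then
    #   ||w||_k^sp = sqrt( sum_{i<=k-r-1} |w|_i^2 + (1/(r+1)) * (sum_{i>=k-r} |w|_i)^2 ).
    w = np.asarray(w, dtype=float)
    a = np.sort(np.abs(w))[::-1]            # |w| sorted in non-increasing order
    a_pad = np.concatenate(([np.inf], a))   # a_pad[i] = |w|_i, with |w|_0 = +inf
    for r in range(k):
        tail = a[k - r - 1:].sum()          # sum_{i = k-r}^{d} |w|_i
        if a_pad[k - r - 1] > tail / (r + 1) >= a_pad[k - r]:
            head = np.sum(a[: k - r - 1] ** 2)
            return np.sqrt(head + tail ** 2 / (r + 1))
    # Fallback (should not be reached if the unique r of the proposition exists)
    return np.sqrt(a.sum() ** 2 / k)

def penalty_h(w, k):
    # h(w) = 0.5 * (||w||_k^sp)^2 - 0.5 * ||w||_2^2
    w = np.asarray(w, dtype=float)
    return 0.5 * k_support_norm(w, k) ** 2 - 0.5 * np.sum(w ** 2)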
Motivation
This penalty is $0$ for (at most) $k$-sparse vectors, and strictly positive for vectors with more than $k$ non-zero elements (the reason why will become clearer from the proofs below). Indeed, as will be shown below, for any $x \in \mathbb{R}^d$ we have $\| x \|^{sp}_{k} \geq \| x\|_2$, with equality if $x$ is (at most) $k$-sparse. As such, when $\lambda \rightarrow \infty$, the penalty converges to the indicator of the $\ell_0$ pseudo-ball of radius $k$. To our knowledge, it is the first non-convex penalty (apart from the $\ell_0$ penalty itself) that explicitly allows specifying a precise sparsity level $k$: in comparison, other non-convex penalties such as SCAD, MCP, or $\ell_p$ norms penalize any non-zero vector, even if its sparsity is lower than or equal to $k$.
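As a quick numerical illustration of these claims (a self-contained sketch, restricted to $k = 1$, for which the $k$-support norm is simply the $\ell_1$ norm, a fact also used in the plotting code below):

import numpy as np

def h_k1(x):
    # For k = 1: h(x) = 0.5 * (||x||_1^2 - ||x||_2^2)
    x = np.asarray(x, dtype=float)
    return 0.5 * (np.sum(np.abs(x)) ** 2 - np.sum(x ** 2))

print(h_k1([3.0, 0.0]))   # 1-sparse vector  -> penalty is 0
print(h_k1([3.0, 4.0]))   # 2 non-zero entries -> penalty is 12 (= |3| * |4|) > 0

The same check can be run for general $k$ with the k_support_norm sketch above.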
Closed form for the proximal operator
We recall the definition of the proximal operator, for some function $h: \mathbb{R}^d \rightarrow \mathbb{R}$, and for all $z \in \mathbb{R}^d$:
$$\text{prox}_{h}(z) \in \arg\min_{x} h(x) + \frac{1}{2} \| x - z\|^2 $$Below we give the closed form for the proximal operator of $\lambda h$, which shows that it can be obtained efficiently from the proximal operator of the half-squared $k$-support norm (which is known, cf. [1] or [4]).
Lemma 1. (Closed form for the proximal operator)
$$\text{prox}_{\lambda h} (z) = \begin{cases} \text{prox}_{\frac{\lambda}{1 - \lambda} g}(\frac{1}{1 - \lambda}z) ~\text{if}~ \lambda < 1\\ \mathcal{H}_k(z) ~\text{if}~ \lambda \geq 1 \end{cases}, $$where $ \mathcal{H}_k$ denotes the hard-thresholding operator, which keeps the $k$ largest components (in absolute value) and sets the remaining ones to $0$ (if there are ties, we can break them for instance lexicographically).
Proof.
Case 1: $\lambda < 1$
We have:
\begin{align*} \text{prox}_{\lambda h} (z) &= \arg\min_{x} \lambda \left(\frac{1}{2} \left(\|x \|^{sp}_{k}\right)^2 - \frac{1}{2} \|x \|^2 \right) + \frac{1}{2} \| x - z\|^2\\ &= \arg\min_{x} \frac{\lambda}{2} \left(\|x \|^{sp}_{k}\right)^2 + \frac{1-\lambda}{2} \|x \|^2 - \langle x, z \rangle\\ &= \arg\min_{x} \frac{\lambda}{2(1-\lambda)} \left(\|x \|^{sp}_{k}\right)^2 +\frac{1}{2} \|x \|^2 - \langle x, \frac{1}{1-\lambda} z \rangle\\ &= \arg\min_{x} \frac{\lambda}{2(1-\lambda)} \left(\|x \|^{sp}_{k}\right)^2 +\frac{1}{2} \|x - \frac{1}{1 - \lambda} z \|^2\\ &= \text{prox}_{\frac{\lambda}{1- \lambda} g} (\frac{1}{1 - \lambda} z) \end{align*}where we used the fact that shifting the objective by an additive constant and/or multiplying it by a strictly positive constant does not change the $\arg\min$.
Case 2: $\lambda \geq 1$
[1] introduces the $k$-support norm as the norm whose unit ball is the tightest convex relaxation of the intersection of the $\ell_0$ pseudo-ball (of radius $k$) with the $\ell_2$ unit ball. That is, the $k$-support norm is the gauge function associated with the set $\mathcal{C}_k = \text{conv}\{x \in \mathbb{R}^d: \| x\|_0 \leq k \text{ and } \| x\|_2\leq 1\}$, i.e. for a vector $x \in \mathbb{R}^d$: $\| x \|_{k}^{sp} = \inf \{ \lambda \in \mathbb{R}_{+} \text{ s.t. } x \in \lambda \mathcal{C}_k\}$ (as per footnote 4 in [1]). Let us now consider some $(v, w) \in (\mathbb{R}^d)^2$. According to the above, there exist $N \in \mathbb{N}$, vectors $\{y_i\}_{i=1}^N$ and weights $\{\nu_i\}_{i=1}^N$ such that $w = \lambda \sum_{i=1}^N \nu_i y_i$ with $\lambda = \| w\|_{k}^{sp}$ (the infimum defining the gauge is attained since $\mathcal{C}_k$ is compact), and for all $i \in [N]$: $\| y_i\|_2 \leq 1$, $\|y_i \|_0 \leq k$, $\nu_i \in [0, 1]$, and $\sum_{i=1}^N \nu_i = 1$. Since the $k$-support norm is a norm (and therefore satisfies the homogeneity property $\lambda \|\cdot \|_{k}^{sp} = \|\lambda \cdot \|_{k}^{sp}$ for $\lambda \geq 0$), this implies (by defining $x_i := \lambda y_i$ for all $i \in [N]$) that there exist some $\{x_i\}_{i=1}^N$ such that $w = \sum_{i=1}^N \nu_i x_i$ and, for all $i \in [N]$: $\|x_i \|_0 \leq k$, $\| x_i\| \leq \| w \|_{k}^{sp}$, $\nu_i \in [0, 1]$, and $\sum_{i=1}^N \nu_i = 1$. Let us now consider the following function:
\begin{align} h_{v}(w)&:=\lambda \left(\frac{1}{2} \left(\|w\|_k^{sp}\right)^2 - \frac{1}{2} \|w\|^2 \right) + \frac{1}{2} \|w - v \|^2 \nonumber\\ &=\frac{\lambda}{2} \left(\|w\|_k^{sp}\right)^2 + \frac{1 - \lambda}{2} \| w \|^2 - \langle w, v \rangle + \frac{1}{2}\|v\|^2\nonumber\\ &\overset{(a)}{\geq} \frac{\lambda}{2} \left(\|w\|_k^{sp}\right)^2 + \frac{1 - \lambda}{2} \left( \sum_{i=1}^N \nu_i \| x_i \|^2 \right) - \langle w, v \rangle + \frac{1}{2}\|v\|^2\nonumber\\ &\overset{(b)}{\geq } \frac{\lambda}{2} \left(\sum_{i=1}^N\nu_i \|x_i\|^2\right) + \frac{1 - \lambda}{2} \left( \sum_{i=1}^N \nu_i \| x_i \|^2 \right) - \langle w, v \rangle + \frac{1}{2}\|v\|^2\nonumber\\ &= \frac{1}{2} \left(\sum_{i=1}^N\nu_i \|x_i\|^2\right) - \left(\sum_{i=1}^N \nu_i \langle x_i, v \rangle\right) + \sum_{i=1}^N \frac{\nu_i}{2}\|v\|^2\nonumber\\ &= \frac{1}{2}\sum_{i=1}^N \nu_i\| x_i - v \|^2 \geq \frac{1}{2}\min_{i \in [N]}\| x_i - v \|^2 \overset{(c)}{\geq} \frac{1}{2}\min_{x \in \mathbb{R}^d \text{ s.t. } \|x \|_0\leq k} \| x - v \|^2 \overset{(d)}{=} \frac{1}{2} \|\mathcal{H}_k(v) - v \|^2 \tag{1} \end{align}where (a) follows from Jensen's inequality (convexity of $\| \cdot \|^2$, which gives $\|w\|^2 \leq \sum_{i=1}^N \nu_i \|x_i\|^2$) together with the fact that $1 - \lambda \leq 0$, (b) from the fact that $\lambda \geq 0$ and that $\| x_i \| \leq \| w \|_{k}^{sp}$ for all $i \in [N]$ as mentioned above, (c) from the fact that $\| x_i \|_0 \leq k$ for all $i \in [N]$ as mentioned above, and (d) from the fact that projecting a vector $v$ onto the $\ell_0$ pseudo-ball of radius $k$ amounts to applying the hard-thresholding operator to $v$, i.e. preserving the top-$k$ (in magnitude) components of $v$ and setting the others to $0$ (ties can be broken lexicographically); this can be seen by observing that, for any fixed support of size at most $k$, the best approximation of $v$ keeps the entries of $v$ on that support, so the optimal support is the one carrying the $k$ largest entries of $v$ in magnitude.
Therefore, from (1) above, we have that: \begin{equation} \tag{2} \forall w \in \mathbb{R}^d: h_{v}(w) \geq \frac{1}{2} \| \mathcal{H}_k(v) - v\|^2 \end{equation}
Additionally, for any $x \in \mathbb{R}^d$ such that $\|x \|_0 \leq k$, we have $\| x\|_{k}^{sp} = \| x\|$ (see [1] for more details; this can be seen for instance from Proposition 2.1 therein), which implies that: \begin{equation} \tag{3} h_v(\mathcal{H}_k(v)) = \lambda \times 0 + \frac{1}{2} \| \mathcal{H}_k(v) - v \|^2 = \frac{1}{2} \| \mathcal{H}_k(v) - v \|^2. \end{equation} Therefore, from (2), for any $v \in \mathbb{R}^d$, $\frac{1}{2} \| \mathcal{H}_k(v) - v \|^2$ is a lower bound on $h_v(w)$ as $w$ spans $\mathbb{R}^d$, and from (3) this lower bound is attained at $w = \mathcal{H}_k(v)$; hence it is the minimum of $h_v$ over $\mathbb{R}^d$, which implies that:
$$\mathcal{H}_k(v) \in \arg\min_{w \in \mathbb{R}^d} h_v(w) = \text{prox}_{\lambda h}(v)$$$\square$
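Before moving to the visualizations, here is a minimal sketch of $\text{prox}_{\lambda h}$ following Lemma 1. The hard-thresholding branch is self-contained; for the $\lambda < 1$ branch we simply delegate to a user-supplied proximal operator of the half-squared $k$-support norm (for instance the $k$-support-norm proximity operator shipped with ModOpt, presumably the reason it is installed above; we do not rely on its exact name or signature here). The helper names are ours.

import numpy as np

def hard_threshold(z, k):
    # H_k: keep the k largest-magnitude entries of z, set the rest to 0
    # (ties are broken by np.argsort, which is one valid deterministic rule).
    out = np.zeros_like(z, dtype=float)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

def prox_lambda_h(z, lam, k, prox_half_sq_ksup):
    # Proximal operator of lambda * h, as given by Lemma 1.
    # `prox_half_sq_ksup(v, beta)` is assumed to return prox_{beta * g}(v),
    # where g = 0.5 * (||.||_k^sp)^2; any correct implementation can be
    # plugged in (e.g. a suitably wrapped ModOpt proximity operator).
    z = np.asarray(z, dtype=float)
    if lam >= 1:
        return hard_threshold(z, k)
    return prox_half_sq_ksup(z / (1 - lam), lam / (1 - lam))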
Visualizing the penalty and its proximal operator
Below we provide a few visualizations of the penalty and of the proximal operator of $\lambda h$, for various values of $\lambda$, when $d=2$ and $k=1$ (in this case the $k$-support norm is simply the $\ell_1$ norm, which is what the code below uses). As we can see, the larger the $\lambda$, the more the penalty resembles the $\ell_0$ constraint.
import plotly.graph_objects as go
import numpy as np

# Create figure
fig = go.Figure()

# Add traces, one for each slider step (one value of lambda per step)
for step in np.arange(0, 100, 10):
    x = np.arange(-5, 5, 0.05)
    y = np.arange(-5, 5, 0.05)
    # For d = 2 and k = 1 the k-support norm is the l1 norm, so
    # lambda * h = lambda / 2 * ((|x| + |y|)^2 - (x^2 + y^2))
    z = step * 1 / 2 * ((np.abs(x[None]) + np.abs(y[:, None])) ** 2 - (x[None] ** 2 + y[:, None] ** 2))
    fig.add_trace(
        go.Contour(
            visible=False,
            z=z,
            x=x,  # horizontal axis
            y=y,  # vertical axis
            contours=dict(
                start=0,
                end=500,
            ),
        )
    )

# Make the trace for lambda = 40 visible by default
fig.data[4].visible = True

# Create and add slider
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="restyle",
        args=["visible", [False] * len(fig.data)],
        label=str(10 * i),
    )
    step["args"][1][i] = True  # Toggle i-th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=4,  # matches the trace made visible above (lambda = 40)
    currentvalue={"prefix": "λ: "},
    # pad={"t": 50},
    steps=steps,
)]

fig.update_layout(
    sliders=sliders,
    width=600,   # Set the width of the plot
    height=600,  # Set the height of the plot to be equal to the width
)

fig.show()
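As a small complement to the figure above (which only shows the penalty itself), the sketch below illustrates the proximal operator in the $\lambda \geq 1$ regime, where Lemma 1 says it reduces to hard thresholding: for $d = 2$ and $k = 1$, every input point is snapped onto the coordinate axis corresponding to its largest-magnitude coordinate. This extra cell is not part of the original figure, just an illustrative sketch.

import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(0)
z = rng.uniform(-5, 5, size=(20, 2))   # a few random 2-D input points

# Hard thresholding H_1: zero out the smaller-magnitude coordinate of each point
p = z.copy()
smaller = np.argmin(np.abs(z), axis=1)
p[np.arange(len(z)), smaller] = 0.0

fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=z[:, 0], y=z[:, 1], mode="markers",
                          marker_symbol="circle", name="input z"))
fig2.add_trace(go.Scatter(x=p[:, 0], y=p[:, 1], mode="markers",
                          marker_symbol="x", name="prox output (λ ≥ 1)"))
fig2.update_layout(width=600, height=600)
fig2.show()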