Chebyshev Scalarization Explained

I've been reading about multi-objective optimization recently.1 Many papers state limitations of "linear scalarization" approaches, mainly that it might not be able to represent all Pareto-optimal solutions (if this sentence does not make sense to you, see background). Chebyshev scalarization is sometimes mentioned as an alternative which can represent all solutions. However, these papers mention it in passing without a proper explanation, and I did not find a good explanation of it online.2

After doing a bit of my own research,3 I found that Chebyshev scalarization is actually not too complicated, so I thought I would explain it online. In this post, I:

  • Give definitions for Chebyshev scalarization (for both maximization and minimization)
  • Give a simple proof that it can represent all Pareto-optimal solutions
  • Explain its relationship to linear scalarization via $\ell_p$ norms
  • Give a geometric interpretation via an interactive visualization

Notation and Acronyms
  • MOO: multi-objective optimization
  • Bold denotes vectors (e.g. $\mathbf{f}$); $f_i$ denotes the $i$-th element of a vector

Background to multi-objective optimization

Feel free to skip to the next section if you are familiar with this field; my notation and definitions are fairly standard.

Multi-objective optimization (MOO)

The goal of multi-objective optimization is to optimize a k-dimensional vector-valued function $\mathbf{f}: \mathcal{X} \mapsto \mathbb{R}^k$ ($\mathcal{X}$ is an arbitrary input space). Optimize can mean maximize or minimize. In this post I will generally assume maximization. This means we want to find a single point $x\in\mathcal{X}$ where $\mathbf{f}(x)$ is as large as possible in each dimension.4

Pareto optimality and non-domination

Although MOO is a problem in $\mathcal{X}$ space, the remainder of this post will only discriminate between points $x$ via their values of $\mathbf{f}(x)$. Therefore we will drop the dependence on $x$ and simply write $\mathbf{f}$ for the remainder of the post (and always assume there is a point $x$ attached to $\mathbf{f}$). This should not cause any ambiguity.

If a point $\mathbf{f}$ is at least as large as a point $\mathbf{f}'$ in every dimension and exceeds $\mathbf{f}'$ in at least one dimension, we say that $\mathbf{f}$ dominates $\mathbf{f}'$ and write5 $\mathbf{f}\succ \mathbf{f}'$.

Unlike in single-objective optimization, there may not be a single dominant point $\mathbf{f}^\ast$. For example, in the set $\{[0, 1], [1, 0]\}$, neither point dominates the other: they are both better in one dimension and worse in the other. Therefore, MOO generally seeks points $\mathbf{f}^*$ which are not dominated by any other vector $\mathbf{f}$. Such vectors are called Pareto optimal, and the collection of all such vectors is called the Pareto set.

For a more precise definition, see this footnote.6
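
To make these definitions concrete, here is a minimal Python sketch (the helper names `dominates` and `pareto_mask` are my own) that checks domination and extracts the non-dominated vectors from a finite set:

```python
import numpy as np

def dominates(f, f_prime):
    """True if f dominates f_prime: >= in every dimension and > in at least one (maximization)."""
    f, f_prime = np.asarray(f), np.asarray(f_prime)
    return bool(np.all(f >= f_prime) and np.any(f > f_prime))

def pareto_mask(F):
    """Boolean mask of the non-dominated rows of F (each row is one objective vector)."""
    F = np.asarray(F)
    n = F.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and dominates(F[j], F[i]):
                mask[i] = False
                break
    return mask

F = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.4, 0.4]])
print(pareto_mask(F))  # [ True  True  True False]: only [0.4, 0.4] is dominated
```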

Scalarization: turning multi-objective optimization into single-objective optimization

Optimization algorithms which are truly multi-objective will look very different from single-objective optimization algorithms because they need to find a Pareto set and explore different trade-offs between the k objectives. This is a shame, since single-objective optimization is well-studied and many useful algorithms have been developed. Scalarization is essentially an attempt to turn a MOO problem into a set of single-objective problems via a set of scalarization functions $$\{s_\theta: \mathbb{R}^k\mapsto \mathbb{R}\}_{\theta\in\Theta}\ .$$

Each scalarization function assigns a single (scalar) score to the vector-valued output of $\mathbf{f}$. The most common type of scalarization is linear scalarization, aka "weighted sum": $$s_{\mathbf{w}}(\mathbf{f}) = \sum_i w_i f_i$$ where $\mathbf{w}=[w_1,\ldots,w_k]$ is a vector of positive weights.
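
As a minimal sketch (assuming NumPy; the function name is my own), linear scalarization is a one-liner:

```python
import numpy as np

def linear_scalarization(f, w):
    """Weighted sum s_w(f) = sum_i w_i * f_i, to be maximized."""
    return float(np.dot(w, f))

print(linear_scalarization([0.2, 0.9], [0.5, 0.5]))  # ≈ 0.55
```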

One would hope to find Pareto optimal points by optimizing $s_\theta(\mathbf{f}(x))$ for various values of $\theta$. Clearly, not every choice of scalarization function $s_\theta$ will actually produce Pareto optimal points. However:

  • If the scalarization functions are strictly monotonically increasing in all of their inputs, then the maximizer(s) of the scalarization function must all be Pareto optimal.
  • If the scalarization functions are non-decreasing (ie non-strictly monotonically increasing), then the set of maximizers will contain Pareto optimal points, but may also include sub-optimal points.

Hopefully this makes intuitive sense; otherwise, see these theorems.

Chebyshev scalarization

Definition

Chebyshev scalarization7 functions take two parameters:

  1. A vector of positive weights $\mathbf{w}$ (like linear scalarization)
  2. A reference point $\mathbf{z}\in\mathbb{R}^k$, interpreted as the ideal value for $\mathbf{f}$.

Scalarization was introduced previously as a set of functions $s_\theta$ to be maximized. In this form, the function is:

$$s_{\mathbf{w}, \mathbf{z}}(\mathbf{f}) = s_{\mathrm{Cheby(max)},\mathbf{w}, \mathbf{z}}(\mathbf{f}) = - \max_i \left[ w_i|f_i - z_i|\right]\ .$$

Note that by maximization, I mean that $s$ is to be maximized, not $\mathbf{f}$. The form is the same regardless of whether $\mathbf{f}$ is to be maximized or minimized; that information is encoded by the reference point $\mathbf{z}$.

Importantly, notice the negative sign: it means that this function is never positive. This is because Chebyshev scalarization is normally defined as a scalar to be minimized. This alternative (and more natural) definition is:

$$ s_{\mathrm{Cheby(min)},\mathbf{w}, \mathbf{z}}(\mathbf{f}) = \max_i \left[ w_i|f_i - z_i|\right]\ .$$

$s_{\mathrm{Cheby(min)},\mathbf{w}, \mathbf{z}}(\mathbf{f})$ and $s_{\mathrm{Cheby(max)},\mathbf{w}, \mathbf{z}}(\mathbf{f})$ differ only by a negative sign. I will mainly use the (max) variant in this post for consistency.
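
Here is a minimal sketch of both variants (again assuming NumPy, with my own function names):

```python
import numpy as np

def cheby_max(f, w, z):
    """Chebyshev scalarization to be MAXIMIZED: -max_i w_i |f_i - z_i|."""
    return -float(np.max(np.asarray(w) * np.abs(np.asarray(f) - np.asarray(z))))

def cheby_min(f, w, z):
    """Chebyshev scalarization to be MINIMIZED: max_i w_i |f_i - z_i|."""
    return -cheby_max(f, w, z)

f = np.array([0.2, 0.9])
w = np.array([0.5, 0.5])
z = np.array([1.1, 1.1])  # reference point dominating f
print(cheby_max(f, w, z))  # ≈ -0.45: the worst weighted deviation is in the first dimension
```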

Interpretation

Chebyshev scalarization is essentially a penalty equal to the largest deviation between $\mathbf{f}$ and the reference point $\mathbf{z}$ across any dimension, with all deviations re-scaled by the weight vector $\mathbf{w}$. It is optimized when this penalty is smallest (hence the negative sign when it is maximized).

Taking the maximum weighted absolute deviation across dimensions is essentially the definition of a (weighted) $\ell_\infty$ norm, also called the Chebyshev norm. This is actually where Chebyshev scalarization gets its name.8 In this form, it is:

$$s_{\mathbf{w}, \mathbf{z}}(\mathbf{f}) = -\left| \mathbf{w}\odot \left(\mathbf{f} - \mathbf{z} \right) \right|_\infty \ .$$

Therefore, think of Chebyshev scalarization as "distance to reference point in a different norm."
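
As a quick check of this norm form (a tiny sketch; values chosen arbitrarily):

```python
import numpy as np

f, w, z = np.array([0.2, 0.9]), np.array([0.5, 0.5]), np.array([1.1, 1.1])
# -||w ⊙ (f - z)||_∞ agrees with the max-based definition above
print(-np.linalg.norm(w * (f - z), ord=np.inf))  # ≈ -0.45
```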

Pareto coverage property

This is the most important property of Chebyshev scalarization.

IF $\mathbf{z} > \mathbf{f}$ in each dimension for all achievable points $\mathbf{f}$, then:

  1. All Pareto-optimal points maximize $s_{\mathbf{w}, \mathbf{z}}(\cdot)$ for some positive weight vector $\mathbf{w}$.
  2. For any positive weight vector $\mathbf{w}$, maximizing $s_{\mathbf{w}, \mathbf{z}}(\cdot)$ will produce at least one Pareto-optimal point.

This means:

  1. Picking a weight vector $\mathbf{w}$ and maximizing $s_{\mathbf{w}, \mathbf{z}}(\cdot)$ will produce a Pareto optimal point (although there may be additional maximizers which are sub-optimal). Useful, but not surprising: if $\mathbf{z} > \mathbf{f}$ element-wise, then the function is non-decreasing and proposition A.2 applies.
  2. ALL Pareto-optimal points can be produced by picking a suitable vector $\mathbf{w}$.
  3. These properties do not depend on a specific value of the reference point $\mathbf{z}$ (as long as it exceeds every achievable point in each dimension).

Property 2 is what distinguishes Chebyshev scalarization from linear scalarization (which can only find points on the convex hull of the Pareto frontier). At the end of this post I give a simple proof by construction of point 2, which essentially amounts to setting $w_i = 1/\left(z_i-f_i\right)$.

(NOTE: for minimization of $\mathbf{f}$, simply replace the condition $\mathbf{z} > \mathbf{f}$ with $\mathbf{z} < \mathbf{f}$ and the same result will hold.)
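
To make point 2 concrete, here is a small Python sketch (my own construction, not from any library): it samples a synthetic concave front, builds the weights $w_i = 1/(z_i - f_i^\ast)$ from the proof for an arbitrary target point, and checks which point each scalarization picks.

```python
import numpy as np

# A sampled concave Pareto front (maximization): an arc bulging toward the origin.
theta = np.linspace(0.0, np.pi / 2, 201)
F = np.column_stack([1.0 - np.cos(theta), 1.0 - np.sin(theta)])

z = np.array([1.1, 1.1])   # reference point dominating every row of F
target = F[120]            # an arbitrary Pareto-optimal point we want to recover
w = 1.0 / (z - target)     # weights from the proof: w_i = 1 / (z_i - f*_i)

cheby = -np.max(w * np.abs(F - z), axis=1)   # Chebyshev (max) score of every point
linear = F @ (w / w.sum())                   # linear scalarization with the same (normalized) weights

print(np.allclose(F[np.argmax(cheby)], target))  # True: Chebyshev recovers the target
print(F[np.argmax(linear)])                      # ≈ [1, 0]: an endpoint of the front, not the target
```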

Note about the weights

Because one can factor a multiplicative constant out of the weights, it suffices to consider only weights which sum up to 1: $$\mathbf{w}: \sum_i w_i = 1\ .$$ This is true for Chebyshev and linear scalarization.

Comparison with linear scalarization

Earlier I stated that Chebyshev scalarization is essentially minimizing the $\ell_\infty$ norm between $\mathbf{f}$ and $\mathbf{z}$. Linear scalarization actually has a close connection to the $\ell_1$ distance between $\mathbf{f}$ and $\mathbf{z}$.

Assume that we have a reference $\mathbf{z}$ where $\mathbf{z} > \mathbf{f}$ element-wise (just as done previously to get Pareto coverage for Chebyshev scalarization). Then, we write

$$-\left| \mathbf{w}\odot \left(\mathbf{f} - \mathbf{z} \right) \right|_1 = -\sum_i w_i \left|f_i - z_i \right| =\sum_i w_i f_i - \sum_i w_i z_i$$

(Step-by-step explanations in footnote9) The first term is linear scalarization. The second term is constant with respect to the input vector $\mathbf{f}$, and therefore will not change which inputs maximize the scalarization. This means the reference point can be dropped from linear scalarization without consequence, or, equivalently, that a reference point can be added to linear scalarization without changing its maximizers.
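
A quick numeric sanity check of this identity (values chosen arbitrarily):

```python
import numpy as np

f, w, z = np.array([0.2, 0.9]), np.array([0.5, 0.5]), np.array([1.1, 1.1])
lhs = -np.sum(w * np.abs(f - z))   # -||w ⊙ (f - z)||_1
rhs = np.dot(w, f) - np.dot(w, z)  # linear scalarization minus a constant
print(np.isclose(lhs, rhs))        # True
```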

Summary: Chebyshev and linear scalarization can both be interpreted as $\ell_p$ distances between objectives and an unattainable reference point. Linear scalarization uses $p=1$, and you can manipulate the expression to put the reference point in an additive constant. Chebyshev scalarization uses $p=\infty$ and does not allow you to separate out the reference point.

Augmented Chebyshev

Because Chebyshev scalarization is not strictly monotonic, there may be some maximizers which are not Pareto optimal. This can be fixed by augmenting it with a linear scalarization term, weighted by a factor $\rho > 0$: $$s_{\mathrm{Cheby(aug)},\mathbf w, \mathbf z, \rho}(\mathbf f) = - \max_i \left[ w_i|f_i - z_i|\right] + \rho\sum_i w_i f_i \ .$$

This augmented Chebyshev function is strictly monotonic, and therefore all maximizers will be Pareto optimal. Unfortunately, this function does not guarantee that all Pareto optimal points can be found, although a sufficiently small value of $\rho$ makes this likely in practice.
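
A sketch of the augmented variant (same conventions as the earlier snippets; the function name and the default value of $\rho$ are my own choices):

```python
import numpy as np

def cheby_aug(f, w, z, rho=0.05):
    """Augmented Chebyshev (to be maximized): Chebyshev term plus a small linear term."""
    f, w, z = map(np.asarray, (f, w, z))
    return float(-np.max(w * np.abs(f - z)) + rho * np.dot(w, f))

print(cheby_aug([0.2, 0.9], [0.5, 0.5], [1.1, 1.1]))  # ≈ -0.45 + 0.05 * 0.55 = -0.4225
```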

Geometric interpretation

To build understanding, I made an interactive visualization comparing linear, Chebyshev, and augmented Chebyshev scalarization. Disclaimer: this was vibe-coded with Cursor and definitely isn't perfect; I just had no experience using plotly before. The visualization will probably work much better on desktop compared to mobile (apologies 🙏). If you want to modify / improve the visualization, check out the source code here.

The visualization shows 3 types of Pareto fronts in $\mathbf{f}$ space: convex (default), concave, and mixed. The feasible region is marked in blue. You can switch between Pareto front types using the drop-down menu.

Because this is a 2D problem, we can set $w_2=1-w_1$ and just look at the scalarizations as a function of $w_1$. The bottom slider controls $w_1$ for all scalarizations simultaneously. All scalarizations use the reference point $\mathbf{z} = [1.1, 1.1]$ (which dominates all points in the feasible region for all examples).

The feasible point which maximizes the scalarization is marked with a black dot (if several points are tied then there may be multiple black dots). The level sets of the scalarizations are shown with contour lines (hopefully making it clear where the maximum comes from).

I suggest you take some time to play around with the visualization now (otherwise I provide a summary below).

My analysis of visualization

First, note the shapes of the level sets. For "Linear" they are parallel lines. For Chebyshev they are square-like (which should not be surprising since the $\ell_\infty$ unit ball is a square). For augmented Chebyshev they are almost square, except the corner angle is a bit larger than 90°.

Now, let's compare the behavior as $w_1$ ranges from 0 to 1:

  • For the "convex" Pareto front, all 3 functions sweep across the entire front. The maximum point is always tangent10 to a level set.
  • For the "concave" Pareto front:
    • Linear scalarization completely fails. For $w_1<0.5$ the maximum is near the top left, and for $w_1>0.5$ it suddenly shifts to the bottom right.11 This is a well-understood limitation of linear scalarization (the numeric sweep after this list reproduces this jump).
    • Chebyshev scalarization nicely sweeps across the entire front. However, note that for very small or very large values of $w_1$ there is a tie and a small number of dominated points also maximize the objective.
    • Augmented Chebyshev also covers the entire front, but does not include the tied dominated points at the ends (NOTE: at $w_1=0.5$ two points appear, but this is just an artifact of the maximization being done on a finite grid; the true maximum lies between those two points).
  • For the "mixed" Pareto front:
    • Linear scalarization finds the two convex sections, but completely skips the concave part (looking at the level sets should make it clear why this is).
    • Chebyshev scalarization sweeps across the entire front, but for several values of $w_1$ around 0.25 and 0.75 a few dominated points are included (see small vertical or horizontal lines of black dots).
    • Augmented Chebyshev scalarization sweeps across most of the front, but if you look carefully, some parts of the concave region are actually skipped (look for a big jump around $w_1\approx 0.45$ which does not appear for Chebyshev). It should be clear that this is caused by the level sets not being "pointy" enough to reach into all parts of the concavity. If $\rho$ were decreased then this problem would be reduced.
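
The concave-front behavior described above can be reproduced with a quick numeric sweep over $w_1$ (a sketch using a synthetic front, not the visualization's actual code):

```python
import numpy as np

# Synthetic concave Pareto front (maximization): arc of a circle bulging toward the origin.
theta = np.linspace(0.0, np.pi / 2, 401)
F = np.column_stack([1.0 - np.cos(theta), 1.0 - np.sin(theta)])
z = np.array([1.1, 1.1])

def argmax_point(scores):
    """Return the front point with the highest scalarization score."""
    return F[np.argmax(scores)]

for w1 in [0.3, 0.45, 0.5, 0.55, 0.7]:
    w = np.array([w1, 1.0 - w1])
    lin = F @ w                                  # linear scalarization
    cheby = -np.max(w * np.abs(F - z), axis=1)   # Chebyshev (max) scalarization
    print(f"w1={w1:0.2f}  linear -> {argmax_point(lin).round(2)}  "
          f"Chebyshev -> {argmax_point(cheby).round(2)}")
# Linear jumps between the two endpoints of the front around w1 = 0.5, while the
# Chebyshev maximizer moves smoothly along the front as w1 changes.
```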

Conclusion

This post explained Chebyshev scalarization, which is essentially a weighted $\ell_\infty$ distance to a reference point. Its main advantage is its ability to cover every point on a Pareto front, while its main disadvantage is that it can be maximized by some dominated points. Another potential disadvantage is that it requires specifying a reference point which is guaranteed to dominate all achievable points. Its Pareto coverage property does not depend on the exact value of this reference point (as long as it dominates everything), but the value does affect which point(s) are optimal for a given $\mathbf{w}$.

Augmented Chebyshev has properties in between those of Chebyshev and linear scalarization: it may not cover every point on a Pareto front, but will likely cover most points, especially when $\rho$ is small. However, dominated points will not maximize augmented Chebyshev.

Perhaps another middle ground is the $\ell_p$ distance with large $p$. The level sets would be "pointier", but the function would still be strictly monotonic in all its inputs.

Ultimately, I think Chebyshev scalarization is a good idea and people should try to use it!


Theorems and proofs

A: Scalarization and monotonicity

The following propositions show that strictly monotonic functions are only maximized by Pareto-optimal points, and that the maximizer set of a non-strictly monotonic function contains at least one Pareto-optimal point (although other maximizers may not be Pareto optimal).

Proposition A.1: if $s: \mathbb{R}^k\mapsto\mathbb{R}$ is strictly monotonically increasing, then all maximizers $$\mathbf f^\ast\in\arg\max_{\mathbf{f} \in \mathbf{f}(\mathcal{X})} s(\mathbf{f})$$ must be Pareto optimal.

Proof: Assume a maximizer $\mathbf f^\ast$ is not Pareto optimal; then there is another point $\mathbf f ' \succ\mathbf f^\ast$. By monotonicity, $s(\mathbf f') > s(\mathbf f^\ast)$: we have $f'_i \geq f^\ast_i$ in all dimensions (showing $s(\mathbf f') \geq s(\mathbf f^\ast)$), and $f'_i > f^\ast_i$ in at least one dimension, giving strict inequality. Therefore, $\mathbf{f}^\ast$ is not actually a maximizer: a contradiction.

Proposition A.2: if $s: \mathbb{R}^k\mapsto\mathbb{R}$ is non-strictly monotonically increasing, then the set $$M=\arg\max_{\mathbf{f} \in \mathbf{f}(\mathcal{X})} s(\mathbf{f})$$ must contain at least one non-dominated point.

Proof: Similar to above. Assume $M$ contains no non-dominated point, take any $\mathbf{f}\in M$, and let $\mathbf{f}^\ast$ be a non-dominated point which dominates it (so $\mathbf{f}^\ast\not\in M$ by assumption). By monotonicity, we must have that $s(\mathbf{f}^\ast) \geq s(\mathbf{f})$, and therefore $\mathbf{f}^\ast\in M$: a contradiction.

B: Chebyshev coverage of Pareto front

Here I provide a simple proof by construction for Pareto front coverage. Note that this proof is not original to me.

Proposition B.1: If $\mathbf{z} > \mathbf{f}$ in each dimension for all achievable points $\mathbf{f}$, then all Pareto-optimal points maximize $s_{\mathbf{w}, \mathbf{z}}(\cdot)$ for some positive weight vector $\mathbf{w}$.

Proof: Let $\mathbf f^\ast$ be a Pareto optimal point, and construct $\mathbf w$ by $$w_i = \frac{1}{z_i - f^\ast_i}\ .$$ Note that $w_i>0$ for all $i$ because $z_i>f_i^\ast$ by assumption. Then $$s_{\mathbf{w}, \mathbf{z}}(\mathbf f ^\ast) = - \max_i \left[ w_i|f_i^\ast - z_i|\right]=-\max_i \left[ \frac{z_i-f_i^\ast}{z_i-f_i^\ast} \right] = -1\ .$$

Now, assume that for some other achievable $\mathbf{f}$, $s_{\mathbf{w}, \mathbf{z}}(\mathbf f) > -1$. For this to happen, we need, for all $1\leq i \leq k$, $$\frac{z_i-f_i}{z_i - f_i^\ast} < 1 \Rightarrow f_i^\ast < f_i\ .$$ This would imply $\mathbf f \succ \mathbf f^\ast$, contradicting the assumption that $\mathbf f^\ast$ is Pareto optimal!


Some references

Here are some references which I relied on to write this post:

  • Wikipedia has a page (link), but it describes 2 different kinds of function, and only describes the function for minimizing $\mathbf{f}$.
  • An old paper by Steuer & Choo (1983) describes random Chebyshev scalarization and basically covers the same material as this post, but with denser notation and more theorems (such that the key points are harder to ascertain).
  • This NeurIPS paper by Lin et al re-states some of this content more clearly, but jumps to augmented Chebyshev scalarization without fully explaining its consequences.

  1. Since real-world drug design problems are always multi-objective (e.g. potency, toxicity, cost), this shouldn't be very surprising! 

  2. More information about my sources is in the references section

  3. Aided by Gemini 2.5, which was surprisingly helpful! 

  4. Some papers write $\max_{x\in\mathcal{X}} \mathbf{f}(x)$. This captures the spirit of the problem, but is not a well-defined mathematical expression. I always cringe when I see this 😬 

  5. We use $\succ$ instead of $>$ because "domination" is only a partial ordering. 

  6. Is the Pareto set a subset of $\mathcal{X}$ or $\mathbb{R}^k$? It can be both depending on the context. For this post I will define the Pareto set to be a subset of $\mathbb{R}^k$, specifically the set of all non-dominated vectors in the image of $\mathcal{X}$ under $\mathbf{f}$. $$\{\mathbf{f}(x)| x\in\mathcal{X} \text{ and } \not\exists y\in\mathcal{X} : \mathbf{f}(y)\succ \mathbf{f}(x)\}$$ However, in the broader context of MOO I think the Pareto set should really be the set of all inputs $x$ which achieve these Pareto optimal values. I've mainly chosen the first definition for simplicity in this post. 

  7. Sometimes also spelled Tschebyschev or Tchebycheff, but I will just use Chebyshev. 

  8. I think "$\ell_\infty$-scalarization" would be a better name than Chebyshev scalarization, but it's hard to change an established name 🫠 

  9. First use the definition of the $\ell_1$ norm, then note that $f_i-z_i$ is always negative by the assumption that $\mathbf{z} > \mathbf{f}$ in every dimension, then cancel with the minus sign at the front and distribute the product. 

  10. I'm counting points on a corner as "tangent" too. 

  11. Note that nothing is intrinsically special about the value 0.5, the switch only happens here because the feasible region is symmetric between $f_1$ and $f_2$. For other concave objectives the critical value might be different.