Chebyshev Scalarization Explained

machine learning

bayesian optimization

_all-time-best

Published

May 12, 2025

I’ve been reading about multi-objective optimization recently.¹ Many papers state limitations of “linear scalarization” approaches, mainly that it might not be able to represent all Pareto-optimal solutions (if this sentence does not make sense to you, see background). Chebyshev scalarization is sometimes mentioned as an alternative which can represent all solutions. However, these papers mention it in passing without a proper explanation, and I did not find a good explanation of it online.²

After doing a bit of my own research,³ I found that Chebyshev scalarization is actually not too complicated, so I thought I would explain it online. In this post, I:

Give definitions for Chebyshev scalarization (for both maximization and minimization)
Give a simple proof that it can represent all Pareto-optimal solutions
Explain its relationship to linear scalarization via \ell_p norms.
Give a geometric interpretation via an interactive visualization

Notation and Acronyms

MOO: multi-objective optimization
Bold denotes vectors (eg \mathbf{f}), f_i is element of a vector

Background to multi-objective optimization

Feel free to skip to the next section if you are familiar with this field, my notation and definitions are fairly standard.

Multi-objective optimization (MOO)

The goal of multi-objective optimization is to optimize a k-dimensional vector-valued function \mathbf{f}: \mathcal{X} \mapsto \mathbb{R}^k (\mathcal{X} is an arbitrary input space). Optimize can mean maximize or minimize. In this post I will generally assume maximization. This means we want to find a single point x\in\mathcal{X} where \mathbf{f}(x) is as large as possible in each dimension.⁴

Pareto optimality and non-domination

Although MOO is a problem in \mathcal{X} space, the remainder of this post will only discriminate between points x via their values of \mathbf{f}(x). Therefore we will drop the dependence on x and simply write \mathbf{f} for the remainder of the post (and always assume there is a point x attached to \mathbf{f}). This should not cause any ambiguity.

If a point \mathbf{f} is at least as large as a point \mathbf{f}' in every dimension and exceeds \mathbf{f}' in at least one dimension, we say that \mathbf{f} dominates \mathbf{f}' and write⁵ \mathbf{f}\succ \mathbf{f}'.

Unlike in single-objective optimization, there may not be a single dominant point \mathbf{f}^\ast. For example, in the set \\{[0, 1], [1, 0]\\}, neither point dominates the other: they are both better in one dimension and worse in the other. Therefore, MOO generally seeks points \mathbf{f}^* which are not dominated by any other vector \mathbf{f}. Such vectors are called Pareto optimal, and the collection of all such vectors is called the Pareto set.

For a more precise definition, see this footnote.⁶

Scalarization: turning multi-objective optimization into single-objective optimization

Optimization algorithms which are truly multi-objective will look very different from single-objective optimization algorithms because they need to find a Pareto set and explore different trade-offs between the k objectives. This is a shame, since single-objective optimization is well-studied and many useful algorithms have been developed. Scalarization is essentially an attempt to turn a MOO problem into a set of single-objective problems via a set of scalarization functions \\{s_\theta: \mathbb{R}^k\mapsto \mathbb{R}\\}_{\theta\in\Theta}\ .

Each scalarization function assigns a single (scalar) score to the vector-valued output of \mathbf{f}. The most common type of scalarization is linear scalarization, aka “weighted sum”: s_{\mathbf{w}}(\mathbf{f}) = \sum_i w_i f_i where \mathbf{w}=[w_1,\ldots,w_k] is a vector of positive weights.

One would hope to find Pareto optimal points by optimizing s_\theta(\mathbf{f}(x)) for various values of \theta. Clearly this will not actually produce Pareto optimal points for any functions s_\theta. However:

If the scalarization functions are are strictly monotonically increasing in all of their inputs, then the maximizer(s) of the scalarization function must all be Pareto optimal.
If the scalarization functions are non-decreasing (ie non-strictly monotonically increasing), then the set of maximizers will contain Pareto optimal points, but may also include sub-optimal points.

Hopefully this makes intuitive sense, otherwise see these theorems.

Chebyshev scalarization

Definition

Chebyshev scalarization⁷ functions take two parameters:

A vector of positive weights \mathbf{w} (like linear scalarization)
A reference point \mathbf{z}\in\mathbb{R}^k, interpreted as the ideal value for \mathbf{f}.

Scalarization was introduced previously as a set of functions s_\theta to be maximized. In this form, the function is:

s_{\mathbf{w}, \mathbf{z}}(\mathbf{f}) = s_{\mathrm{Cheby(max)},\mathbf{w}, \mathbf{z}}(\mathbf{f}) = - \max_i \left[ w_i|f_i - z_i|\right]\ .

Note that by maximization, I mean that s is to be maximized, not \mathbf{f}. The form is the same regardless of whether \mathbf{f} is to be maximized or minimized- that information is encoded by the reference point \mathbf{z}.

Importantly, notice the negative sign: it means that this function is always negative. This is because Chebyshev scalarization is normally defined as a scalar to be minimized. This alternative (and more natural) definition is:

s_{\mathrm{Cheby(min)},\mathbf{w}, \mathbf{z}}(\mathbf{f}) = \max_i \left[ w_i|f_i - z_i|\right]\ .

s_{\mathrm{Cheby(min)},\mathbf{w}, \mathbf{z}}(\mathbf{f}) and s_{\mathrm{Cheby(max)},\mathbf{w}, \mathbf{z}}(\mathbf{f}) differ only by a negative sign. I will mainly use the (max) variant in this post for consistency.

Interpretation

Chebyshev scalarization is essentially a penalty equal to the largest deviation between \mathbf{f} and the reference point \mathbf{z} across any dimension, with all deviations re-scaled by the weight vector \mathbf{w}. It is optimized when this penalty is smallest (hence the negative sign when it is maximized).

Dimension-wise maximum is essentially the definition of the \ell_\infty norm, also called the Chebyshev norm. This is actually where Chebyshev scalarization gets its name.⁸ In this form, it is:

s_{\mathbf{w}, \mathbf{z}}(\mathbf{f}) = -\left\| \mathbf{w}\odot \left(\mathbf{f} - \mathbf{z} \right) \right\|_\infty \ .

Therefore, think of Chebyshev scalarization as “distance to reference point in a different norm.”

Pareto coverage property

This is the most important property of Chebyshev scalarization.

IF \mathbf{z} > \mathbf{f} in each dimension for all achievable points \mathbf{f}, then:

All Pareto-optimal points maximize s_{\mathbf{w}, \mathbf{z}}(\cdot) for some positive weight vector \mathbf{w}.
For any positive weight vector \mathbf{w}, maximizing s_{\mathbf{w}, \mathbf{z}}(\cdot) will produce at least one Pareto-optimal point.

This means:

Picking a weight vector \mathbf{w} and maximizing s_{\mathbf{w}, \mathbf{z}}(\cdot) will produce a Pareto optimal point (although it may also produce additional maximums for sub-optimal points). Useful, but this is not surprising since, if \mathbf{z} > \mathbf{f} element-wise, then the function is non-decreasing and proposition A.2 applies.
ALL Pareto-optimal points can be produced by picking a suitable vector \mathbf{w}.
These properties do not depend on a specific value of the reference point \mathbf{z} (as long as it is an unachievable high value).

Property 2 is what distinguishes Chebyshev scalarization from linear scalarization (which can only find points on the convex hull of the Pareto frontier). At the end of this post I give a simple proof by construction of point 2, which essentially amounts to setting w_i = 1/\left(z_i-f_i\right).

(NOTE: for minimization of \mathbf{f}, simply replace the condition \mathbf{z} > \mathbf{f} with \mathbf{z} < \mathbf{f}) and the same result will hold.

Note about the weights

Because one can factor a multiplicative constant out of the weights, it suffices to consider only weights which sum up to 1: \mathbf{w}: \sum_i w_i = 1\ . This is true for Chebyshev and linear scalarization.

Comparison with linear scalarization

Earlier I stated that Chebyshev scalarization is essentially minimizing the \ell_\infty norm between \mathbf{f} and \mathbf{z}. Linear scalarization actually has a close connection to the \ell_1 distance between \mathbf{f} and \mathbf{z}.

Assume that we have a reference \mathbf{z} where \mathbf{z} > \mathbf{f} element-wise (just as done previously to get Pareto coverage for Chebyshev scalarization). Then, we write

-\left\| \mathbf{w}\odot \left(\mathbf{f} - \mathbf{z} \right) \right\|_1 = -\sum_i w_i \left|f_i - z_i \right| =\sum_i w_i f_i - \sum_i w_i z_i

(Step-by-step explanations in footnote⁹) The first time is linear scalarization. The second term is constant with respect to the input vector \mathbf{f}, and therefore will not change which inputs maximize the scalarization. This allows the reference point to be excluded from the scalarization without consequence, but also allows us to include a reference point in linear scalarization without consequence.

Summary: Chebyshev and linear scalarization can both by interpreted as \ell_p distances between objectives and an unattainable reference point. Linear scalarization uses p=1, and you can manipulate the expression to put the reference point in an additive constant. Chebyshev scalarization uses p=\infty and does not allow you to separate out the reference point.

Augmented Chebyshev

Because Chebyshev scalarization is not strictly monotonic, there may be some maximizers which are not Pareto optimal. This can be fixed by augmenting it with a linear scalarization term, weighted by a factor \rho > 0: s_{\mathrm{Cheby(aug)},\mathbf w, \mathbf z, \rho} = - \max_i \left[ w_i|f_i - z_i|\right] + \rho\sum_i w_i f_i \ .

This augmented Chebyshev function is strictly monotonic, and therefore all maximizers will be Pareto optimal. Unfortunately, this function does not guarantee that all Pareto optimal points can be found, although a sufficiently small value of \rho makes this likely in practice.

Geometric interpretation

To build understanding, I made an interactive visualization comparing linear, Chebyshev, and augmented Chebyshev scalarization. Disclaimer: this was vibe-coded with Cursor and definitely isn’t perfect; I just had no experience using plotly before. The visualization will probably work much better on desktop compared to mobile (apologies 🙏). If you want to modify / improve the visualization, check out the source code here.

The visualization shows 3 types of Pareto fronts in \mathbf{f} space: convex (default), concave, and mixed. The feasible region is marked in blue. You can switch between Pareto front types using the drop-down menu.

Because this is a 2D problem, we can set w_2=1-w_1 and just look at the scalarizations as a function of w_1. The bottom slider controls w_1 for all scalarizations simultaneously. All scalarizations use the reference point \mathbf{z} = [1.1, 1.1] (which dominates all points in the feasible region for all examples).

The feasible point which maximizes the scalarization is marked with a black dot (if several points are tied then there may be multiple black dots). The level sets of the scalarizations are shown with contour lines (hopefully making it clear where the maximum comes from).

I suggest you take some time to play around with the visualization now (otherwise I provide a summary below).

My analysis of visualization

First, note the shapes of the level sets. For “Linear” they are parallel lines. For Chebyshev they are square-like (which should not be surprising since the \ell_\infty unit ball is a square). For augmented Chebyshev they are almost square, except the angle is a bit larger than 90°.

Now, let’s compare the behavior as w_1 ranges from 0 to 1:

For the “convex” Pareto front, all 3 functions sweep across the entire front. The maximum point is always tangent¹⁰ to a level set.
For the “concave” Pareto front:
- Linear scalarization completely fails. For w_1<0.5 the maximum is near the top left, and for w_1>0.5 it suddenly shifts to the bottom right.¹¹ This is a well-understood limitation of linear scalarization.
- Chebyshev scalarization nicely sweeps across the entire front. However, note that for very small or very large values of w_1 there is a tie and a small number of dominated points also maximize the objective.
- Augmented Chebyshev also covers the entire front, but does not include duplicate points at the end (NOTE: at w_1=0.5 2 points appear, but this is just an artifact of the maximization being done on a finite grid, the true maximum should be between those two points).
For the “mixed” Pareto front:
- Linear scalarization finds the two convex sections, but completely skips the concave part (looking at the level sets should make it clear why this is).
- Chebyshev scalarization sweeps across the entire front, but for several values of w_1 around 0.25 and 0.75 a few dominated points are included (see small vertical or horizontal lines of black dots).
- Augmented Chebyshev scalarization sweeps across most of the front, but if you look carefully, some parts of the concave region are actually skipped (look for a big jump around w_1\approx 0.45 which does not appear for Chebyshev). It should be clear that this is caused by the level sets not being “pointy” enough to reach into all parts of the concavity. If \rho were decreased then this problem would be reduced.

Conclusion

This post explained Chebyshev scalarization, which is essentially a weighted \ell_\infty distance to a reference point. Its main advantage is its ability to cover every point on a Pareto front, while its main disadvantage is that it can be maximized by some dominated points. Another potential disadvantage is that it requires specifying a reference point which is guaranteed to dominate all other points. Its Pareto coverage property does not depend on the exact value of this reference point, but it must dominate all points, and it does affect which point(s) are optimal for a given value of \mathbf{w}.

Augmented Chebyshev has properties in between that of Chebyshev and linear scalarization: it may not cover every point on a Pareto front, but will likely cover most points, especially when \rho is small. However, dominated points will not maximize augmented Chebyshev.

Perhaps another middle ground is the \ell_p distance with large p. The level sets would be “pointier”, but the function would still be strictly monotonic in all its inputs.

Ultimately, I think Chebyshev scalarization is a good idea and people should try to use it!

Theorems and proofs

A: Scalarization and monotonicity

The following propositions show that strictly monotonic functions are only maximized by Pareto-optimal points, and that non-strictly monotonic functions are at least maximized by Pareto-optimal points (although other maximizers may not be Pareto optimal).

Proposition A.1: if s: \mathbb{R}^k\mapsto\mathbb{R} is strictly monotonically increasing, then all maximizers \mathbf f^\ast\in\arg\max_{\mathbf{f} \in \mathbf{f}(\mathcal{X})} s(\mathbf{f}) must be Pareto optimal.

Proof: Assume a \mathbf f^\ast is not Pareto optimal, then there is another point \mathbf f ' \succ\mathbf f^\ast. By monotonicity, s(\mathbf f') > s(\mathbf f), because f'_i \geq f_i in all dimensions (showing s(\mathbf f') \geq s(\mathbf f)), and f'_i > f_i in at least 1 dimension, thereby getting strict inequality. Therefore, s(\mathbf{f}^\ast) must not actually be a maximizer: a contradiction.

Proposition A.2: if s: \mathbb{R}^k\mapsto\mathbb{R} is non-strictly monotonically increasing, then the set M=\arg\max_{\mathbf{f} \in \mathbf{f}(\mathcal{X})} s(\mathbf{f}) must contain at least one non-dominated point.

Proof: Similar to above. Assume it does not contain a non-dominated point, and let \mathbf{f}^\ast\not\in M be a point which dominates at least one point in M (call it \mathbf{f}). By monotonicity, we must have that s(\mathbf{f}^\ast) \geq s(\mathbf{f}), and therefore it should be that \mathbf{f}^\ast\in M: a contradiction.

B: Chebyshev coverage of Pareto front

Here I provide a simple proof by construction for Pareto front coverage. Note that this is not my original proof.

Proposition B.1: If \mathbf{z} > \mathbf{f} in each dimension, then all Pareto-optimal points maximize s_{\mathbf{w}, \mathbf{z}}(\cdot) for some positive weight vector \mathbf{w}.

Proof: Let \mathbf f^\ast be a Pareto optimal point, and construct \mathbf w by w_i = \frac{1}{z_i - f^\ast_i}\ . Note that w_i>0 for all i because z_i>f_i^\ast by assumption. Then s_{\mathbf{w}, \mathbf{z}}(\mathbf f ^\ast) = - \max_i \left[ w_i|f_i^\ast - z_i|\right]=-\max_i \left[ \frac{z_i-f_i^\ast}{z_i-f_i^\ast} \right] = -1\ .

Now, assume that for some other \mathbf{f}, s_{\mathbf{w}, \mathbf{z}}(\mathbf f) > -1. For this to happen, for all 0\leq i \leq k \frac{z_i-f_i}{z_i - f_i^\ast} < 1 \Rightarrow f_i^\ast < f_i\ . This would imply \mathbf f \succ \mathbf f^\ast, contradicting the assumption that \mathbf f^\ast is Pareto optimal!

Some references

Here are some references which I relied on to write this post:

Wikipedia has a page (link), but it describes 2 different kinds of function, and only describes the function for minimizing \mathbf{f}.
An old paper by Steuer & Choo (1983) describes random Chebyshev scalarization and basically covers the same material as this post, but with denser notation and more theorems (such that the key points are harder to ascertain).
This NeurIPS paper by Lin et al re-states some of this content more clearly, but jumps to augmented Chebyshev scalarization without fully explaining its consequences.

Footnotes

Since real-world drug design problems are always multi-objective (e.g. potency, toxicity, cost), this shouldn’t be very surprising!↩︎
More information about my sources is in the references section.↩︎
Aided by Gemini 2.5, which was surprisingly helpful!↩︎
Some papers write \max_{x\in\mathcal{X}} \mathbf{f}(x). This captures the spirit of the problem, but is not a well-defined mathematical expression. I always cringe when I see this 😬↩︎
We use \succ instead of > because “domination” is only a partial ordering.↩︎
Is the Pareto set a subset of \mathcal{X} or \mathbb{R}^k? It can be both depending on the context. For this post I will define the Pareto set to be a subset of \mathbb{R}^k, specifically the set of all non-dominated vectors in the image of \mathcal{X} under \mathbf{f}. \\{\mathbf{f}(x)| x\in\mathcal{X} \text{ and } \not\exists y\in\mathcal{X} : \mathbf{f}(y)\succ \mathbf{f}(x)\\} However, in the broader context of MOO I think the Pareto set should really be the set of all inputs x which achieve these Pareto optimal values. I’ve mainly chosen the first definition for simplicity in this post.↩︎
Sometimes also spelled Tschebyschev or Tchebycheff, but I will just use Chebyshev.↩︎
I think “\ell_\infty-scalarization” would be a better name than Chebyshev scalarization, but it’s hard to change an established name 🫠↩︎
First use definition of \ell_1 norm, then note that f_i-z_i always negative by the assumption that \mathbf{z} > \mathbf{f} in every dimension, then cancel with minus sign at front and distribute the product.↩︎
I’m counting points on a corner as “tangent” too.↩︎
Note that nothing is intrinsically special about the value 0.5, the switch only happens here because the feasible region is symmetric between f_1 and f_2. For other concave objectives the critical value might be different.↩︎