Free Lunch for optimisation under the universal distribution

Publisher: IEEE

Abstract:
Function optimisation is a major challenge in computer science. The No Free Lunch theorems state that if all functions with the same histogram are assumed to be equally probable then no algorithm outperforms any other in expectation. We argue against the uniform assumption and suggest a universal prior exists for which there is a free lunch, but where no particular class of functions is favoured over another. We also prove upper and lower bounds on the size of the free lunch.
Date of Conference: 06-11 July 2014
Date Added to IEEE Xplore: 22 September 2014
ISBN Information:
ISSN Information:
Publisher: IEEE
Conference Location: Beijing, China

SECTION I.

Introduction

Finite black-box optimisation is the problem of finding an optimal value (usually the maximum or minimum) of a target function f:XY where X and Y are finite. A wide range of tasks may be formulated in this setting. For instance, drug-design may be viewed as the task of finding a mix of chemicals that maximises recovery chances. Since experimentation is expensive it is crucial that the best drug be found as soon as possible.

It is desirable to find optimisation algorithms that perform well on a wide variety of target functions, as this minimises the need for fine-tuning the algorithm to the problem. Indeed, several such algorithms exist and are regularly employed in practice; examples include hill-climbing and simulated annealing, as well as genetic algorithms. However, the theoretical understanding of the conditions permitting such “universal” algorithms remains limited [CO01], [Str03], [WR06], [JC11]. To approach this problem, we derive bounds for expected optimisation-performance under assumptions justified in all (or virtually all) optimisation settings.

The original No Free Lunch (NFL) theorems state that when the global performance of an optimisation algorithm is measured by taking a uniform average of its performance over all functions from X to Y, then no algorithm is better than random (assuming no point is sampled more than once) [WM97]. The uniform assumption is justified by assuming the absence of prior knowledge and the results are often used to claim that no optimisation algorithm can be universal.

There is, however, another viewpoint. If we assume that the function f:XY to be optimised is generated by some (unknown) computer program, then taking a uniform prior over programs is arguably more natural. This is a reasonable assumption based on the commonly held view that the universe is likely to be (stochastically) computable [Fre92], [Wo102], [Hut12]. The distribution on functions induced by this approach is the famous universal lower-semicomputable semi-distribution1 developed by Solomonoff and others [LV08]. The universal distribution satisfies many nice properties, both theoretical and philosophical. It is a natural choice when formalising Occam's razor in combination with Epicurus' principle of multiple explanations since it favours simplicity over complexity without disregarding the possibility that the truth is complex [Hut05], [RH11]. The universal distribution also exhibits a range of other desirable properties, discussed further in Section III.

If performance is measured in expectation with respect to the universal distribution, then the no free lunch theorems can no longer be applied. Indeed, under some highly technical conditions Streeter [Str03] showed that there is a free lunch for optimisation under Solomonoff's universal distribution. Tightly related to the universal distribution is Kolmogorov complexity: Borenstein and Po Ii [BP06] discuss Kolmogorov complexity and optimisation, and also give a good account of previous research in this area (see also [McG06]). Several authors report on Kolmogorov complexity not being perfectly related to searchability [SVW01], [DJW02], [BP06], but except for [Str03], implications for search performance under the universal distribution have not been investigated. The relation between the universal distribution and the NFL theorems for supervised learning has been studied by Lattimore and Hutter [LH11]. In sequence prediction and reinforcement learning, the universal distribution has been extensively researched [Hut05].

We first improve on [Str03] by presenting the first easily interpretable theorem that there is a free lunch if performance is measured in expectation with respect to Solomonoff's universal distribution rather than the uniform distribution originally used by Wolpert and Macready (Section VI). Unfortunately the size of the free lunch turns out to be somewhat limited. Under only weak assumptions we show that no computable algorithm can perform much better than random, even when performance is averaged with respect to the universal distribution (Section VII). This result is then extended to arbitrary (possibly non-computable) optimisation algorithms for a commonly used performance measure.

SECTION II.

Preliminaries

A (finite binary) string is a finite sequence x=b1b2bn with biB={0,1} and length (x)=n. The set of all finite binary strings is denoted by B. Strings may be concatenated in the obvious way. Power notation is used to represent multiple concatenations: for example, 0140=011110

A problem context is a pair X,Y of finite subsets of B, both containing at least 0 and 1 (to avoid degenerate cases). In a problem context X,Y, the search space is X and the range is Y. We let X and Y be the sets of all search spaces and all ranges respectively.

Definition 1:

An optimisation problem is a collection P={PXY}, where PXY is a measure over the finite set YX={f:XY} of functions from X to Y.

A search trace Tn on X,Y is an ordered n-tuple (x1,y1),,(xn,yn)(X×Y)n, representing a search history. The empty search trace will be denoted . Let Tn(X,Y) be the set of all search traces of length n, and let T(X,Y):=|X|i=0Ti(X,Y) be the set of traces of any length. Further, let T:=X,YX×YT(X,Y) be the set of search traces on any context.’ If Tn=(x1,y1),,(xn,yn), then Txn:=x1,,xn and Tyn:=y1,,yn.

An optimiser is a function a:X×Y×TX where a(X,Y,T)XTx for all (X,Y,T). The optimiser selects new, unvisited search points in the search space based on previously seen data and the problem context. That the optimiser is only permitted to sample unvisited points is standard in the literature, and non-restrictive in the noise-free setting considered in this paper.

The setup is this: A problem context X,Y is fixed, and a target function f:XY is sampled from YX according to the problem distribution PXY. The optimiser is initialised with the empty search trace and the problem context, and outputs a search point x1X by α(X,Y,)=x1. The search trace becomes x1,f(x1). The new search trace is fed to the optimiser, which produces a new search point x2 via a(X,Y,(x1,f(x1))), and so on. Observe that the search trace is a function of the optimiser, the problem context and the sampled function f. We write TXY(a,f) for the “full” trace of length |X| that a generates on f and X,Y; when X,Y is clear from the context, T(a,f) will suffice. The Y-components TyXY(a,f) will be called the result vector of a on f and X,Y We will also use R to denote result vectors. Let ‘R(X,Y be the set of all result vectors on X,Y, and let R=X,YX×YR(X,Y).

The performance of an optimiser a on a problem P is measured in terms of the result vectors it produces. A function M:X×Y×R[0,) defines a performance measure by the P-expected value of M for a on each problem context X,Y:

MPXY(a):=fYXPXY(f)MXY(Ty(a,f))(1)
View SourceRight-click on figure for MathML and additional features.

We use [s] for the Iverson bracket that is 1 when s is true, and 0 otherwise. For any list R,R[i] extracts its ith element.

SECTION III.

The Universal Distribution

We now give a short introduction to Kolmogorov complexity and the universal distribution. Detailed references are [LV08] and [Ca102].

Prefix codes are central elements in algorithmic information theory. A prefix code is a set of code words (formally, strings) where no code word is a prefix of another. This makes prefix codes uniquely decodable: in a sequence of appended code words it is possible to tell where one code word ends another begins. Kraft's inequality gives a lower bound on the length of the code words in a prefix code [LV08, p. 76]. Definition 2 gives some commonly used prefix encodings of strings, numbers, lists and functions.

Definition 2: String Encodings

Let x be a binary string, n a natural number and Z=z1,,zn. a list of strings. Then x¯:=1(x)0x,n¯:=1n0 and Z¯:=n¯z¯1z¯n defines prefix codes for x,n and Z. The code for lists may be applied recursively to lists of lists. Target functions f:XY are encoded by lists f(x1),, f(xn) where x1,, Xn are the elements of X in order.

For technical reasons, regular Turing machines are not suitable for defining the universal distribution, so prefix machines are often used instead.

Definition 3: Prefix Machines

A prefix machine V is a Turing machine with one unidirectional input tape, one unidirectional output tape, and some bidirectional work tapes. Input tapes are read only, output tapes are write only. All tapes are binary and work tapes are initialised with zeros. We say v halts with output x on input p given s and write V(p|s)=x, if s¯p is to the left of the input head and x is to the left of the output head when V halts. For any sB, the inputs on which V(|s) halts form a prefix code. Also, just as for regular Turing machines, there are universal prefix machines that can simulate any other prefix machine.

Definition 4: Prefix Complexity

Let x,yB be finite binary strings and U a universal prefix machine, then the Kolmogorov complexity of x conditioned on y is the length of the shortest program that given y outputs x.

KU(x|y):=min{(p):U(p|y)=x}(2)
View SourceRight-click on figure for MathML and additional features.

A simple but fundamental theorem is that KU depends on U only up to constant factors, so from now on, as is usual in algorithmic information theory, we fix an arbitrary universal prefix machine as a reference machine and simply write K(x) for KU(x).

Definition 5: Function Complexity

Let f:XY, then the complexity of f is K(f|X,Y), with f and X,Y encoded by strings according to Definition 2.

The Martin-Lof-Chaitin Thesis states that randomness may be defined as incompressibility [GTW+ 11, p. 705]. A target function is incompressible or random if K(f|X,Y)|X|log|Y|. A classical result in algorithmic information theory shows that almost all functions are (nearly) random. Thus, the uniform distribution puts the majority of its weight on random functions, which is one explanation for why it is hard to optimise under the uniform distribution. In contrast, the universal distribution puts more weight on “simple”, nonrandom functions:

Definition 6: Universal Distribution

For each context X,Y, the universal distribution is defined as

mXY(f):=cmXY2K(f|X,Y),(3)
View SourceRight-click on figure for MathML and additional features. where cmXY=1/f:XY2K(f|X,Y) is just a normalising constant. In the literature, unnormalised versions of m are often considered. Although cmXY may fluctuate with X,Y, there is a constant cm depending only on the reference machine such that 1cmXYcm for all X,Y. Note that m is an optimisation problem, since mXY is a distribution over YX for each X,Y.

Somewhat surprisingly, there is an equivalent definition of m as the distribution obtained by feeding random coin-flips into a universal prefix machine with access to X,Y.

mXY(f)pB2(p)[V(p|X,Y)=f].(4)
View SourceRight-click on figure for MathML and additional features.

The approximation holds up to irrelevant multiplicative factors, so (4) is often used in place of (3) as the definition of the universal distribution.2 Feeding a universal prefix-machine random coin-flips is a natural formalisation of the uniform prior over computer programs advocated in the introduction. Thus m may be justified as a subjective prior for the assumption that the target function has been computably generated.3

The bias away from randomness also aligns with our intuition of how functions “ought to be optimised”. If the first hundred observations are predicted by a simple polynomial, then common sense (and Occam's razor) suggests that the best prediction of unseen points is that they follow the polynomial. In general, the “simplest” structure perceivable in the data should be the most likely extrapolation. The universal distribution is consistent with this intuition. A detailed discussion of the philosophical justification for the universal prior can be found in [RH11].

To summarise, we have argued for the universal distribution on the following grounds:

  • (Weak assumptions): If the target function is generated by a computer program, then a uniform prior over computer programs is justified. Formalised as in (4), this yields the universal distribution.

  • (Downweighs randomness): A uniform prior over target functions puts the (vast) majority of the weight on (nearly) random functions. The universal distribution concentrates on structured functions, without favouring any particular class of functions.

  • (Aligns with Occam): Intuition and Occam's razor suggests that the best extrapolation is the continuation of the “simplest” pattern observable in data, which corresponds well with the relative probabilities of the universal distribution.

Next we will present some background on the NFL theorems and introduce our performance measure Mot, before taking a closer look at optimisation under m.

SECTION IV.

No Free Lunch

The NFL theorems provide important insights into the possibility of universal optimisation. They show that for certain distributions PXY all optimisers perform identically with respect to some (or all) performance measure(s). This is often phrased as “there is no free lunch available for PXY”. For example, if NFL holds for a performance measure depending on how many function evaluations are required to find the maximum, then this implies that in expectation a hill-climbing optimiser will find the maximum as slowly as a hill-descending optimiser.

Definition 7: Performance Measure-NFL

NFL holds for a distribution PXY and a performance measure M if MPXY(a)=MPXY(b) for all optimisers a and b. If NFL holds for all performance measures M, then NFL simply holds for PXY. If NFL does not hold for PXY (and M), then we say that there is free lunch for PXY (and M).

The stronger statement that NFL holds for all performance measures may equivalently be defined in terms of result vectors. The proof of the equivalence is a straightforward application of the definitions.

Lemma 8: Result Vector-NFL

Let pXYa(R) be the probability fYXPXY(f)[Ty(a,f)=R] that an optimiser a generates result vector R, then NFL holds for PXY if and only if PXYa(R)=PXYb(R)) for every pair of optimisers a and b and every result vector RY|X|.

A. The NFL Theorems

Igel and Toussaint [IT04] showed that the precise condition for NFL is block uniformity of PXY.

Definition 9: Block Uniformity

A histogram for a function f is a function hf:YN defined as hf(y)=|f1(y), indicating how many x'S map to every y. The subset of all functions XY with histogram h is called the base class of h, and is denoted Bh. The distribution PXY is block uniform if for every h it holds that f,gBhPXY(f)=PXY(g).

Theorem 10: Non-Uniform NFL (IT041)

NFL holds for PXY if and only if PXY is block uniform.

The original NFL theorem by Wolpert and Macready [WM97] showed that NFL holds when PXY is uniform on YX. As uniform distributions are a special case of block uniform distributions, Wolpert and Macready's result follows from Igel and Toussaint's.

Another special case is the NFL theorem for uniform optimisation problems over function classes closed under permutation (c.u.p.) by Schumacher et al. [SVW01]. A permutation is a bijective function σ:XX that permutes functions via (σf)(x)=f(σ1(x)). A class FYX is c.u.p. if fFσfF for all permutations σ. The uniform distribution uF over F is defined by uF(f):=1/|F| if fF and 0 otherwise.

Theorem 11: NFL for C.U.D. classes [SVW01]

If PXY is the uniform distribution over a class FYX, then NFL holds for PXY if and only if F is c.u.p.

A simple consequence of the NFL theorems is that all optimisers produce the same result vectors. We state this as a lemma for future reference.

Lemma 12:

([SVW01]): The set of result vectors {RR(X,Y):Ty(a,f)=R for some fYX} ever produced by an optimiser a on X,Y is the same for all optimisers.

The NFL theorems have also been investigated in infinite and continuous domains. Depending on the generalisation, free lunches mayor may not emerge in those settings [AT07], [RVW09].

SECTION V.

Performance Measures

So far we have only considered problems for which either NFL holds for all performance measures, or for which a free lunch is available for some performance measures. Often, however, we are interested in performing well under a fixed performance measure of interest. One natural choice of performance measure is optimisation time, which in this context means the number of function evaluations required to find the maximum. 4

Definition 13:

The optimisation time performance measure Mot is defined as

Mot,XY(R):=mini(R[i]=maxY),MPot,XY(a):=fYXPXY(f)Mot,XY(Ty(a,f))
View SourceRight-click on figure for MathML and additional features. for result vectors and optimisers, respectively. Under Mot a low score is better than a high score.

A variety of performance measures have been considered in the literature. Some use properties of the k first function evaluations, for example the number of values exceeding a certain threshold [CO01], [WR06], [JC11], or the probability that some seen value exceed the threshold [WM97]. Griffiths and Orponnen [GO05] use a performance measure Mmax depending on the size of the greatest value of the first k observations. Others, such as [BP06], [Jan13], use Mot. The main reason we prefer Mot to the other alternatives is that it is better suited for the asymptotic results we will aim for.

Results about particular performance measures often have greater practical interest than their arbitrary-measure coun-terparts. In addition, particular performance measures may also have theoretical interest. Under particular performance measures, NFL may hold for classes that are not c.u.p. [GO05]. This does not contradict Theorem 11, which only claims that for every non-c.u.p. class, there is some performance measure permitting a free lunch. Indeed, it is unsurprising that NFL will apply to wider ranges of function classes when a fixed performance measure is used. The conditions for NFL under Griffiths and Orponnen's performance measure Mmax turn out to be significantly more intricate compared to the standard NFL case [GO05].

Another difference is found in the “cleverness” required to exploit a free lunch. Optimisers that choose the next point to probe irrespective of previous observations are called non-adaptive; such optimisers can only exhibit a limited amount of sophistication. Proposition 14 shows that when a free lunch is available and arbitrary measures are allowed, then there is free lunch for a non-adaptive optimiser. In contrast, under particular measures such as Mot and Mmax, adaptive optimisers may differ in performance while all non-adaptive optimisers perform the same. In this sense, “smarter” algorithms may be required for exploiting a free lunch when using a particular performance measure, compared to when arbitrary performance measures are permitted.

Proposition 14:

If NFL does not hold for a distribution PXY, then there is free lunch for a non-adaptive optimiser under some performance measure.

Proof:

Since NFL does not hold for PXY we have by Theorem 10 that PXY is not block uniform. Hence there are two functions f and σf in the same base class Bh such that PXY(f)>PXY(σf)’ where σ is a permutation on X. Let e and eσ be non-adaptive optimisers, with e searching X={x1,,xn} in order, and eσ searching X in the order of σX={σ(x1),,σ(xn}. Now e generates the result vector Rf=f(x11,,f(xn), exactly when f is the true function, and eσ generates Rf exactly when σf is the true function. An immediate consequence is that PXYe(Rf)=PXY(f)>PXY(σf)=PXYeσ(Rf). That is, the non-adaptive algorithms e and eσ generate Rf with different probability, which means that there is free lunch for some non-adaptive optimiser under some performance measure by Lemma 8.

In conclusion, specific performance measures can be considered for both practical and theoretical reasons. They are more practically relevant in the sense that they measure aspects we care about in practice (such as how long it takes to find a maximum). But they also have theoretical interest, as they expose aspects that are invisible from an arbitrary-measure perspective.

SECTION VI.

Universal Free Lunch

We now turn to the question of whether or not a free lunch is available under m, which we will answer in the affirmative for both arbitrary performance measures and Mot.

The universal distribution solves the induction problem for sequence prediction [Hut05], [RH11]. Black-box optimisation also include an induction problem in the extrapolation of target-function behaviour from the points already evaluated. Although successful inference of the target-function behaviour may not be strictly necessary, it will typically enable better choices of future search points.

There are several important differences between sequence prediction and optimisation. First, optimisation is an active setting: the choices of the optimiser affect both the learning outcome and the reward. This entails an exploration/exploitation tradeoff in the choice between potentially informative points and points likely to mean high performance (e.g. points likely to be a maximum). Further, optimisation is a finite setting, which yields less time to exploit a good model (compared to sequence prediction where infinite sequences are considered). There are also major differences in the hypothesis classes and in how performance is measured.

Fig. 1. - The function $f$ has complexity bounded by a constant $c_{f}$ independent of $X$ and $Y$. In contrast, the complexity of $g$ grows logarithmically with $\vert X\vert$. See the proof of Theorem 15 for details.
Fig. 1.

The function f has complexity bounded by a constant cf independent of X and Y. In contrast, the complexity of g grows logarithmically with |X|. See the proof of Theorem 15 for details.

Section VII presents a number of results bounding the amount of free lunch under m. Perhaps surprisingly, only a small amount of free lunch is available under the universal distribution.

A. Free Lunch under Arbitrary Performance Measures

Streeter [Str03] showed that there is free lunch for m under certain technical conditions. We prove a similar result, but with more interpretable conditions (in terms of the size of X, only). We also use a different proof than Streeter.

Theorem 15: Universal free lunch

There is free lunch for the problem m for all problem contexts with sufficiently large search space (the required size depending on the reference machine only).

Proof:

It will be shown that mXY is not block uniform for problem contexts with sufficiently large X, which by Theorem 10 implies that NFL does not hold.

Pick a problem context X,Y. Consider two functions f and g in the base class BhYX of functions with one value 1 and the rest of the values 0. Let f be 1 at x1 and let g be 1 at some point xk chosen so that K(g|X,Y)log2|X|1 (see Fig. 1). To see that such a g exists, note that there are |X| different functions in Bh. As the halting programs for the reference machine constitute a prefix code, there can be at most |X|/2 halting programs of length log,|X|1 by Kraft's inequality. Thus at least one of the Bh-functions must have a shortest program longer than log2|X|1, and therefore complexity K(g|X,Y)log2|X|1. Meanwhile, K(f|X,Y)cf for some constant cf independent of the problem context. So for search spaces with log2(|X|)1>cf, this means that f will have lower complexity than g, and thus that mXY will assign different probabilities to f and g’ As f and g are elements of the same base class, this shows that m is not block uniform for search spaces greater than 2cf+1. By Theorem 10, this implies a free lunch for m under some performance measure.

Indeed, m is not even close to block uniform for large search spaces in the sense that the functions of type f and g will receive substantially different weights. However, this does not necessarily imply a big free lunch, as we shall see in Section VII.

B. Free Lunch Under Mot

As has been discussed, in practice we often care about a particular performance measure such as Mot.

Theorem 16:

There is free lunch for the problem m under the performance measure Mot for all problem contexts with sufficiently large search space (the required size depending on the reference machine only).

The proof is similar to Theorem 15, but more work is required to ensure a complexity difference between two potential maximums, rather than between two specific functions. A full proof is included in the Appendix.

SECTION VII.

Upper Bounds

Theorems 15 and 16 show that there is free lunch under the universal distribution. This section will bound the amount of free lunch available, and show that it is only possible to outperform random search by a constant factor. First we show that the performance of computable optimisers deteriorates linearly with the worst-case scenario and the size of the search space. This result applies to decidable performance measures in general, and has a concrete interpretation for Mot, where it implies that as the size of the domain is increased, a non-zero fraction of the domain must be probed before a maximum is found in expectation. This does not contradict the free lunches above, as the required fraction may differ between optimisers.

We also consider possible ways to circumvent the negative result described above by means of incomputable search procedures. A further negative result for Mot is obtained: It does not appear possible to find the maximum with only o(|X|) target function evaluations. That is, the expected number of probes required to find the maximum grows linearly with the size of the search space, but again, the proportion may differ substantially between optimisers.

A. Computable Optimisers

To bound the amount of free lunch available for computable optimisers, we will adapt a proof-technique for showing that average-case complexity is equal to the worst-case complexity under the universal distribution [LV08, Section 4.4]. Although no formal theorem relies on it, we will think of greater M-values as worse performance.

Lemma 17.

A function fbad:XY is maximally bad for an optimiser a on the problem context X,Y with respect to a performance measure M if MXY(Ty(a,fbad))=maxRR(X,Y)MXY(R),. There always exists a maximally bad function fbad:XY for a with respect to M, regardless of the performance measure M, the optimiser a and the problem context X,Y.

Proof:

By Lemma 12, all optimisers produce the same result vectors, so it suffices to show that some optimiser has a maximally bad function. The non-adaptive optimiser e that searches X in order has a maximally bad function. To see this, let Rbad be a maximally bad result vector on X,Y, and let f be the function satisfying f(xi)=Rbad[i] for all xiX. Then e produces Rbad on f.

A performance measure M for which there is an algorithm deciding whether MXY(R1)<MXY(R2) for every X,Y and every pair of result vectors R1,R2R(X,Y) is decidable. For any decidable performance measure M, it is possible to create a procedure FINDWORST (a,X,Y) that given a computable optimiser a (specified by some binary string) and a context X,Y, returns a maximally bad function fbad:XY for a. FINDWORST is a computable operation since a is computable and M is decidable: FINDWORST need only simulate a on all possible functions in YX, and output one that yields a worst result vector. This shows that for all decidable performance measures, all computable optimisers a and all contexts X,Y, there is a maximally bad function fbad:XY for a with complexity

K(fbad|X,Y)(FINDWORST)+(a)+c,(5)
View SourceRight-click on figure for MathML and additional features. where the c term depends only on the reference machine, and absorbs the cost for initialising FINDWORST with a,X and Y. Pivotally, the bound is independent of X,Y. This is the central observation behind the following theorem, which shows that expected performance always deteriorates linearly with the worst-case scenario. The theorem's prime relevance is for performance measures whose worst-case value grows with X.

Theorem 18: Almost NFL for m

For every decidable performance measure M and every computable optimiser a there exists a constant ca>0 such that for all X.Y

MmXY(a)camaxRR(X,Y)MXY(R).
View SourceRight-click on figure for MathML and additional features.

Proof:

Let X,Y be a problem context and fbad be the output of FINDWORST (a,X,Y), then

MmXY(a)=fYXmXY(f)MXY(Ty(Y,a,f))cmXY2K(fbad|X,Y)MXY(Ty(a,fbad)),
View SourceRight-click on figure for MathML and additional features. where we have first used the definition (1) of performance measures, and then that the sum of non-negatives is greater than all of its terms. But cmXY2K(fbad|X,Y)ca for some ca>0 independent of X.Y due to cmXY1 and the complexity bound (5). And Ty(a,fbad) was a worst result vector by the construction of FINDWORST. Combined, this gives the bound MmXY(a)camaxRR(X,Y)MXY(R).

This theorem shows that for every performance measure M, there is only a constant amount of free lunch available in an asymptotic sense. It has no impact on performance measures whose worst-case value does not grow unboundedly with either X or Y. However, the “semi-assumption” of higher values being worse is not necessary: If the converse is the case and high values are better, then the proposition shows that all optimisers will do well. Indeed, this is also an NFL result, as it implies that random search (and even optimisers designed to do poorly!) will perform well.

Applied to the performance measure Mot, Theorem 18 has a fairly concrete interpretation: For any computable optimiser a, the expected number of evaluations to find the maximum grows linearly with |X|. Corollary 19 follows immediately from Theorem 18 and the observation that for any context X,Y, the worst-case scenario is to find a maximum only at the very last probe; that is, maxRR(X,Y)Mot,XY(a)=|X|.

Corollary 19:

For every computable optimiser a there exists a constant ca>0 such that MmotXY(a)ca|X| for all optimisers a and all problem contexts X,Y.

The implications of this result should not be overstated. The constant Ca may be very small; for example, if the description of the optimiser a is 100 bits long, then ca becomes of the order 2100. The fact that the expected number of probes is forced to grow linearly with such a constant is mainly of theoretical importance. Nonetheless, the result does illustrate a fundamental hardness of optimisation, and shows that the universal distribution does not provide enough bias for efficient (sublinear) maximum finding.

B. Needle-in-a-haystack functions

A problematic class of functions is the class of so-called needle-in-a-haystack (NIAH) functions. We will use them to generalise Corollary 19 to incomputable optimisers. A NIAH-function is a target function that is 0 in all points except one where it equals 1. The exception point is called the needle. It should be intuitively clear that it is hard to find the maximum of a NIAH-function. Probing a NIAH-function, the output will generally just turn out to be 0 and provide no clues to where the needle might be.

Formally, for any X,Y let NIAHXY be the class of NIAH functions on X,Y and let uNIAH be the uniform NIAH problem defined as uNIAH,XY(f):=1/ NIAH XY| if fNIAHXY and 0 otherwise. The function class NIAH XY is c.u.p. for any X,Y, so NFL holds for uNIAH,XY by Theorem 11. The expected performance (of any optimiser) on the uniform NIAH-problem can be calculated from a general result of Igel and Toussaint [IT03]. They show that for any c.u.p. problem uF where F only contains functions with exactly m maxima, the expected number of probes to find a maximum is (|X|+1)/(m+1). The NIAH-functions have exactly one maximum, which gives the following lemma.

Lemma 20:

Under uNIAHXY, the expected optimisation time is MuNIAHot,XY(a)=(|X|+1)/2 for any optimiser a.

One feature that makes the NIAH-class more problematic than other c.u.p. function classes is that the NIAH-functions all have fairly low complexity (as remarked by [SVW01], [BP06]). The NIAH-functions have low complexity, since to encode a NIAH-function one only needs to encode that it is NIAH (which takes a constant number of bits) and the position of the needle (which requires at most log2|X| bits). A NIAH-function thus has complexity of order O(log2|X|); in comparison, a random function has exponentially greater complexity (above |X|log2|Y|).

The NIAH-measure is computable. This is intuitively obvious, and easily verified against the formal definitions of computable functions. Definitions of real-valued computable functions can be found in [LV08]. It is well-known that m dominates any computable measure in the following sense.

Lemma 21: m dominates uNIAH

There is a constant cNIAH>0 such that for all X,Y and all functions f:XY, it holds that mXY(f)cNIAHuNIAH,XY(f).

C. Incomputable Optimisers

Theorem 18 and Corollary 19 were proven for computable optimisers. We now show that even incomputable optimisers suffer a linearly growing loss in |X| when the performance measure is Mot. Incomputable search procedures may seem like remote objects of concern, but for example the (Bayes)optimal procedure for m is incomputable due to the incomputability of m. Therefore, incomputable procedures do at least have theoretical interest.

The following theorem generalises Corollary 19 to incomputable search procedures, showing that they also must search a linearly growing portion of X to find the maximum. The theorem does not generalise to arbitrary performance measures however, so the analogous generalisation of Theorem 18 may not be true.

Theorem 22: Almost NFL for m and Mot

Under m, the expected optimisation time grows linearly with |X|, regardless of the optimisation strategy.

Proof:

The dominance of m over uNIAH is used in (7), between an expansion (6) and a contraction (8) according to the definition (1) of performance measures:

Mmot,XY(a)=cNIAH=cNIAHfYXmXY(f)Mot,XY(Ty(a,f))fYXuNIAH,XY(f)Mot,XY(Ty(a,f))MuNIAHot,XY(a)(6)(7)(8)
View SourceRight-click on figure for MathML and additional features.

Lemma 20 established (8) to be cNIAH(|X|+1)/2 for all optimisers a. Thus the expected Mot-performance is always bounded below by cNIAH(|X|+1)/2, which grows linearly with |X|.

Since optimisation time can never grow faster than linearly with |X|, the bound is asymptotically tight. In this sense, Theorem 22 may be viewed as an asymptotic almost-NFL theorem for the universal distribution and Mot. The constant cNIAH in the proof may be very small however, so Theorem 22 does not rule out that expected optimisation time differ substantially between optimisers.

SECTION VIII.

Conclusion

In this paper we investigated the No Free Lunch theorems when the performance of an algorithm is measured in expectation with respect to Solomonoff's universal distribution. We showed in Theorem 15 that there is a free lunch with respect to this distribution.

Somewhat surprisingly, despite the bias away from randomness exhibited by the universal distribution, the size of the free lunch turns out to be quite small, at least asymptotically (Theorems 18 and 22). The reason for this is that there are many functions that are both simple and hard to optimise. Most notably the needle-in-a-haystack functions, which have complexity of at most O(log|X|), but for which a maximum cannot be found without O(|X|) probes.

It should be emphasised that there is little need to be too gloomy about the negative results. The upper bounds on the size of the free lunch given in both negative theorems depend on constants that in practise are likely to be very small. Optimisation is a hard problem, so we should not be too surprised if there are some reasonably frequently occurring functions that cannot be efficiently optimised.

The fact that simplicity is not a sufficient characterisation of the difficulty of optimising a function is unfortunate. This is not true in other domains such as supervised learning and sequence prediction where approaches based on Solomonoff's universal prior are theoretically optimal in a certain sense [Hut05]. One difficulty of optimisation lies in the exploration/exploitation problem, which occurs because at each time-step an optimisation algorithm must make a choice between trying to learn the true function and probing the point that it believes to be the maximum.

Since Kolmogorov complexity is (by itself) insufficient for characterising the difficulty of optimising a function, a new criterion is required. We are currently unsure what this should look like and consider this interesting future research.

Appendix

We here include a proof of Theorem 16. The proof builds on the following definitions and lemmas.

Definition 23:

A point xX is incompressible with respect to the context X,Y if K(x|X,Y)log(|X|).

At least half of the points of any search space will be incompressible. Functions that only have incompressible maxima (except, possibly, for a maximum at x1) will play an important role since they are guaranteed to have high complexity. The reason for excluding x1 will be apparent in the proof of Theorem 16.

Lemma 24:

Let X,Y be a problem context, and let DX be a non-empty set of incompressible points. Let g:XY have at least one maximum in D, and no maximum outside D{x1}. Then K(g|X,Y)log2(|X|)c, where c depends only on the reference machine and not on g,X or Y.

Proof:

Let g be as in the Lemma statement, and let xmX{x1} be the first maximum of g not at x1. Then Xm can be coded by means of g with constant length procedure FIRSTMAX(g) that computes the first maximum not at x1 for a given function g. Hence K(xm|X,Y)K(g|X,Y)+(FIRSTMAX)+c The constant c depends only on the reference machine, and absorbs the cost of initialising Firstmax with a provided description of g.

By assumption Xm was incompressible, so K(xm|X,Y)log2|X|. Combined and rearranged, this gives K(g|X,Y)log2|X|(FIRSTMAX)c. The lemma now follows by absorbing (FIRSTMAX) into c.

We are now ready for the proof of Theorem 16 that shows that there is free lunch for Mot on the problem m. The key idea is to show that there is a trace after which two unexplored points have different probability of being the maximum.

Theorem 16:

There is free lunch for the Problem m under the performance measure Mot for all problem contexts with sufficiently large search space (the required size depending on the reference machine only).

Proof:

Let k>2 and let X,Y be a problem context with |X|2k. Let DkX be of size k and only include incompressible points. Let Q=XDk{x1} Let G={gYX:xQq(x)=0} contain all functions that are 0 on Q. Let f be 0 everywhere, except at x1, where f is 1. The complexity of f is upper bounded by a constant cf independent of X. Since fG, we get mXY(G)mXY(f)2cf.

Let xmDk be an incompressible point, and let Gm={gG:g(xm)=maxg}. As the functions in G are all 0 on Q, the cardinality of Gm is at most |Y||XQ|=|Y|k+1. Also, the functions in Gm all have complexity above log|X|c for some c independent of X,Y, by Lemma 24.

We will now show that mXY(Gm|G) tends to 0 with growing |X|, while mXY(f|G) remains bounded away from O. This will establish that provided G, a maximum at x1 is more likely than a maximum at Xm. Provided G, the probability of a maximum at x1 is always above 2cf, since mXY(maxatx1|G)mXY(f|G)mXY(f)2cf. A maximum at xm, on the other hand, is less likely since only functions in Gm can have a maximum there:

mXY(maxatxm|G)=mXY(Gm)/mXY(G)mXY(Gm)/2cf=2cfcmXYgGm2K(g|X,Y)(9)
View SourceRight-click on figure for MathML and additional features.

Using the lower bound on the complexity from Lemma 24, (9) is bounded by

2cfcmXYgGm2log2|X|+c=2cfcmXY|Gm|2log2|X|+c(10)
View SourceRight-click on figure for MathML and additional features. and since the cardinality of G1 is less than |Y|k+1 (10) is bounded by
2cfcmXY|Y|k+12log2|X|+c=2cf+ccmXY|Y|k+1|X|(11)
View SourceRight-click on figure for MathML and additional features.
the last equality by elementary simplification.

As cmXY is bounded above by a constant cm for all X,Y, (11) goes to 0 with growing search space (and fixed k and Y). This shows that for large enough search spaces, x1 is more likely to host a maximum than Xm.

Now all that remains is to use this to create two algorithms that perform differently under Mot. Let a start by searching Q in order. If the perceived function points are consistent with f (i.e., the event G is verified), then a proceeds at x1 and then at xm, whereafter a searches the remaining points XDk{x1} in order. If the trace is not consistent with f, then a directly proceeds to search all remaining points in order. Define b the same way, with the only exception that after Q it searches xm before x1 in case the trace is consistent with f.

This way, a and b will perform the same except when encountering a function in G, in which case a will have a strictly better chance of finding the maximum at step |Q|+1. If neither a nor b finds a maximum at step |Q|+1, then neither x1 nor xm is a maximum, so neither a nor b will find a maximum at step |Q|+2 either. Finally, on step |Q|+3 and onwards their behaviour will again be identical, and therefore also their Mot performance. So a has a strictly better chance at step |Q|+1 and a and b's performance is identical on all other steps and in all other situations. This shows that there is a (possibly small) free lunch for Mot on m for sufficiently large search spaces.

    References

    1.
    Anne Auger and Olivier Teytaud. Continuous lunches are free! In GECCO'07, 2007.
    2.
    Yossi Borenstein and Riccardo Poli. Kolmogorov complexity, optimization and hardness. In CEC'06, pages 112-119, 2006.
    3.
    Cristian Calude. Information and randomness: An algorithmic perspective. Springer, 2002.
    4.
    Steffen Christensen and Franz Oppacher. What can we learn from no free lunch? a first attempt to characterize the concept of a searchable function. In GECCO'01, pages 1219-1226, 2001.
    5.
    Stefan Droste, Thomas Jansen, and Ingo Wegener. Optimization with randomized search heuristics-the (A)NFL Theorem, realistic scenarios, and difficult functions. Theoretical Computer Science, 287(1):131-144, 2002.
    6.
    Edward Fredkin. Finite nature. XXVIIth Rencotre de Moriond, 1992.
    7.
    Evan J Griffiths and Pekka Orponen. Optimisation, block designs and no free lunch theorems. Information Processing Letters, 94(2):55-61, 2005.
    8.
    Dov M Gabbay, Paul Thagard, John Woods, Prasanta S Bandyopadhyay, and Malcolm R Forster. Philosophy of Statistics. Elsevier, 2011.
    9.
    Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Lecture Notes in Artificial Intelligence (LNAI 2167). Springer, 2005.
    10.
    Marcus Hutter. On Universal Prediction and Bayesian Confirmation. Theoretical Computer Science, 384(1):33-48, 2007.
    11.
    Marcus Hutter. The subjective computable universe. In Hector Zenil, editor, A Computable Universe: Understanding and Exploring Nature as Computation, chapter 21, pages 399-416. World Scientific, 2012.
    12.
    Christian Igel and Marc Toussaint. Neutrality and self-adaptation. Natural Computing, 2(2):117-132, 2003.
    13.
    Christian Igel and Marc Toussaint. A no-free-lunch theorem for non-uniform distributions of target functions. Journal of Mathematical Modelling and Algorithms, 3:312-322, 2004.
    14.
    Thomas Jansen. Analyzing Evolutionary Algorithms: The Computer Science Perspective. Springer Berlin Heidelberg, 2013.
    15.
    Pei Jiang and Ying-ping Chen. Free lunches on the discrete Lipschitz class. Theoretical Computer Science, 412(17):1614-1628, April 2011.
    16.
    Tor Lattimore and Marcus Hutter. No free lunch versus Occam's razor in supervised learning. In Proceedings of the Solomonoff 85th Memorial Conference, Melbourne, Australia, November 2011. Springer.
    17.
    Ming Li and Paul Vitanyi. Kolmogorov Complexity and its Applications. Springer Verlag, third edition, 2008.
    18.
    Simon McGregor. No free lunch and algorithmic randomness. In GECCO'06, pages 2-4, 2006.
    19.
    Samuel Rathmanner and Marcus Hutter. A philosophical treatise of universal induction. Entropy, 13(6):1076-1136, 2011.
    20.
    Jonathan E Rowe, Michael D Vose, and Alden H Wright. Reinterpreting no free lunch. Evolutionary computation, 17(1):117-129, January 2009.
    21.
    Matthew J Streeter. Two broad classes of functions for which a no free lunch result does not hold. In GECCO'03, pages 1418-1430, 2003.
    22.
    Christopher W Schumacher, Michael D Vose, and L Darrell Whitley. The no free lunch and problem description length. In GECCO'01, pages 565-570, 2001.
    23.
    David H Wolpert and William G Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):270-283, 1997.
    24.
    Stephen Wolfram. A New Kind of Science. Wolfram Media, 2002.
    25.
    L Darrell Whitley and Jonathan E Rowe. Subthreshold-seeking local search. Theoretical Computer Science, 361(1):2-17, 2006.