SFB 823 Time efficient optimization of instance based problems with application to tone onset detection D iscussion P aper Nadja Bauer, Klaus Friedrichs, Claus Weihs Nr. 85/2016 Time efficient optimization of instance based problems with application to tone onset detection Nadja Bauera,∗, Klaus Friedrichsa, Claus Weihsa aDepartment of Statistics, TU Dortmund University, D-44221 Dortmund, Germany Abstract A time efficient optimization technique for instance based problems is proposed, where for each parameter setting the target function has to be evaluated on a large set of problem instances. Computational time is reduced by beginning with a performance estimation based on the evaluation of a representative subset of instances. Subsequently, only promising settings are evaluated on the whole data set. As application a comprehensive music onset detection algorithm is intro- duced where several numerical and categorical algorithm parameters are opti- mized simultaneously. Here, problem instances are music pieces of a data base. Sequential model based optimization is an appropriate technique to solve this optimization problem. The proposed optimization strategy is compared to the usual model based approach with respect to the goodness measure for tone onset detection. The performance of the proposed method appears to be competitive with the usual one while saving more than 84% of instance evaluation time on average. One other aspect is a comparison of two strategies for handling categorical parameters in Kriging based optimization. Keywords: model based optimization, instance optimization, Kriging, onset detection, categorical parameters 1. Introduction Parameter optimization is an important issue for industrial optimizations as well as for computer based applications. The established industrial optimization approach is based on a design of experiments where the trials are conducted according to a pre-defined plan [1]. The so called plan matrix should satisfy some special statistical criteria (like A- or D-optimality). After evaluation of the plan a relationship between the target variable and the influencing parameters ∗Correspondence to: Department of Statistics, Vogelpothsweg 87, D-44227 Dortmund, TU Dortmund University, Germany. Tel.: +49231 755 7222; Fax: +49231 755 4387. Email address: bauer@statistik.tu-dortmund.de (Nadja Bauer) is fitted by a regression model. Usually, a linear model with interactions and quadratic terms is used for this task. The optimal parameter setting is then defined as the point with best model prediction value. Despite the long success history of the optimization approach mentioned above [2, p. 467] it shows two essential weak points. First, the real experiments are often time and cost consuming. In some cases the real processes can be replaced by a sophisticated computer simulation. However, even one run of such simulation can take several days or months [3]. Hence, it is crucial to avoid trials in not promising areas and conduct them in the neighborhood of the supposed optimum instead. Second, the assumptions of the linear model are often violated so that it does not represent the true complex relationship between the target and the influencing factors. The sequential Model Based Optimization (MBO) – also called as efficient global optimization – which gained popularity with the seminal work of [4] is a modern nonlinear optimization technique. The main idea of MBO is: After the initial phase – evaluation of some randomly chosen starting points – new points are proposed and evaluated iter- atively with respect to a surrogate model fitted to all previous evaluations and an appropriate infill criterion to decide which point is the most promising. The most prominent surrogate model and infill criterion are the Kriging model in combination with the Expected Improvement (EI) criterion [4]. EI looks for a compromise of surrogate model uncertainty in one point and its estimated function value compared to the current optimum. We will call this combination classical MBO which is explained in detail in Section 2. Nevertheless, also classical MBO has a number of shortcomings. For this work two of them are especially relevant: limitation to only numerical param- eters and the absent concept for instance based function evaluations. The first problem is mainly caused by the standard Kriging model which works only with numerical parameters. In this paper we aim to compare two simple approaches for handling categor- ical parameters, which we call naive Kriging and dummy Kriging, respectively. By naive Kriging categorical parameters are converted to numerical ones. This leads to an unnatural order with equal distances between the parameter levels. The dummy Kriging solution is more intuitive from a statistical point of view. Here, every categorical parameter is replaced by a set of dummy variables. In each MBO iteration only one dummy variable can have the value 1 while the others are set to 0. Regarding the second MBO shortcoming the definition of an instance based problem is most important. Such situation occurs in applications where in each MBO iteration the target function has to be evaluated on a set of problem instances. The mean performance on this set is mostly considered as the tar- get function value. However, it is not always advisable to evaluate the target function on all instances (as their quantity can be very high) and sampling a ran- dom subset in each iteration will lead to noise in the target function landscape. Hence, a strategy for a sensible selection of instances is needed. In [5] we proposed such instance based optimization strategy. The main idea is that instead of evaluating of all instances, a few instances might already be 2 sufficient to recognize unpromising parameter settings. Therefore, a subset of representative instances is chosen which is used in the consecutive stages for the classification of the proposed parameter settings into “good” and “bad” set- tings. Only if the probability of beating the current optimum is above a specified threshold, the setting is classified as “good” and evaluation is continued on all instances. For an adequate performance prediction, the representative subset should be as diverse as possible. This can be achieved, e.g., by clustering the instances with respect to their individual target function performance or other features. In this paper, we provide a new concept for reducing the represen- tative subset’s size without loosing its predictive ability. The instance based optimization is introduced more deeply in Section 3. Here, the application for comparing classical MBO with the modified in- stance based MBO is tone Onset Detection (OD). Other applications for in- stance based optimization are, e.g., speech recognition (training of algorithms for different voices) or algorithm configuration (see, e.g, [6]). A tone onset is the time point of the beginning of a musical note. OD in music signals is an important step for many subsequent applications like music transcription and rhythm recognition. Several approaches have been proposed, but most of them can be reduced to the same basic algorithm just differing in the parameter set- tings [7, 8, 9, 10]. Most of them follow the same scheme: windowing the signal, calculating an Onset Detection Function (ODF) for each window and localizing the tone onsets by picking the relevant peaks of the ODF. Many numerical and categorical parameters are involved in this procedure like the window size, the window overlap and the applied ODF which have a strong influence on the algo- rithm performance. However, neither these influences nor the optimal parameter values are realistically quantified in most related publications (s. Section 4.4 for more details). Furthermore, the optimization results are mostly not validated on an additional test data base which can lead to over-fitting ([8]). Obviously, optimization of OD is an instance based problem as each param- eter setting is rated by its average performance over all music pieces of a data base. Hence, it seems to be an appropriate application for testing the proposed instance based approach. In contrast to our previous study in [5], the number of parameters of the optimized algorithm is enlarged. Among other things, the extended algorithm also enables online variants of OD. Furthermore, the op- timization is conducted on a larger data base combining almost all data sets frequently used in previous studies. The onset detection algorithm, which will be optimized here, is briefly described in Section 4. Validation of the considered optimization approaches is conducted in a so- phisticated manner by repeatedly dividing the data base into training and test data. Section 5 presents research questions, comparison experiments, validation scheme and statistical analysis methods. Afterwards, the experimental results are analyzed in Section 6. Finally, in Section 7 the main findings are summarized and several ideas for future research are discussed. 3 2. Classical model based optimization The aim of model based optimization is the minimization of a time in- tensive and highly complex target function f : X ⊂ Rd → R, f(x) = y, x = (x1, . . . , xd) T . Each function parameter xi is assumed to be a value of a con- tinuous feature1 in a pre-fixed optimization interval [`i, ui]. The parameter space is the Cartesian product of the individual intervals: X = [`1, u1]× . . .× [`d, ud]. A possible parameter setting xi ∈ X is called a point and yi is the target value in this point. A set of n points is called a design and is denoted with D = (x1, . . . ,xn)T . Moreover, y = (f(x1), . . . , f(xn))T is the vector of the target values on D. Here, we assume only the deterministic target functions. If optimization of a noisy function is required, the readers are advised to acquaint themselves with the extensions provided in [11]. Algorithm 1 provides the scheme of MBO which can be summarized as fol- lows: In the first step, an initial design with n points is evaluated and a surrogate model is fitted. The surrogate model is then used for the prediction of a new design point. As long as the optimization budget is not exhausted, a new point is chosen in the parameter space based on a so-called infill criterion derived from the surrogate model. The target function is evaluated in this point. The surrogate model is then updated on the design extended by the new point. The updated model is used for the next iteration. The point with the minimal target function value is taken as the result of the optimization. In what follows, these steps will be discussed in more detail. Algorithm 1: Sequential model based optimization. 1 generate an initial design D ⊂ X ; 2 evaluate f on the initial design: y = (f(x1), . . . , f(xn)) T ; 3 while optimization budget is not exhausted do 4 fit the surrogate model on D and y; 5 find x∗ with the best infill criterion value; 6 evaluate f on x∗: y∗ = f(x∗); 7 update D ← (D,x∗)T and y ← (y, y∗)T ; 8 end 9 return ymin = min(y) and the corresponding xmin . 2.1. Initial design and optimization budget The optimization budget should depend on the dimension of the target func- tion (number of function parameters d) and is a trade-off between a good opti- mization result and the required computational time effort. The choice of the initial design should take into account that too small designs are not able to sufficiently scan the target function which might lead to local convergence. In 1The treatment of categorical features will be discussed later. 4 [12] the influence of the initial design size (4d vs. 10d) and the optimization budget (20d vs. 40d) is studied with respect to the optimization results. The au- thors recommend small initial designs and larger optimization budgets. Similar results are also concluded in [13] and [14]. Here, Latin Hypercube Sampling (LHS, [15]) designs are used for the initial- ization step. LHS designs are very popular in the computational optimization community due to their properties of uniformly covering the interesting param- eter space and arbitrary selectable size (in contrast to the classical orthogonal designs). Because of the complexity of the optimization problem and the cor- responding high computation time required for function evaluations, the size of the initial design is set to 5d and the number of sequential steps (iterations) is set to 20d. 2.2. Surrogate model A surrogate model is used for proposing a new design point. Theoretically, an arbitrary regression model can be used as a surrogate model. Very popular is the Kriging model (as proposed by [16]) which can model high dimensional multimodal function landscapes in good quality already with few points. We use the ordinary Kriging modell [12]: Y (x) = µ+ Z(x). Y (x) is a random variable and the error term Z(x) is a Gaussian process ex- pressing the uncertainty in Y (x) having the properties: E(Z(x)) = 0, K(x, x˜) = Cov(Z(x), Z(x˜)) = σ2zk(x, x˜), where K(x, x˜) is the so-called covariance kernel, k(x, x˜) the spatial Correlation Function (CF), and σ2z the process variance [17]. The CF models the structure in the data points and is, obviously, the most decisive part of Gaussian process specification ([18]). For multidimensional data the product correlation rule is applied: k(x, x˜) = d∏ j=1 kj(xj , x˜j). Many different CFs were proposed in the last decades. See [18] for a detailed summary. We use the 3/2 - Mate´rn CF: kj(xj , x˜j) = ( 1 + √ 3|xj − x˜j | θj ) exp ( − √ 3|xj − x˜j | θj ) . The greater the distance between xj and x˜j , the smaller is the value of kj(xj , x˜j). Therefore, the influence of already evaluated points on the prediction of new points shrinks with increasing distance. The parameter θj controls the speed of influence reduction: the greater θj , the greater is the influence region. The 5 CF parameters (θ1, . . . , θd) T as well as the process variance σ2z are estimated by means of the Maximum-Likelihood method. See [12] for a detailed description. The Kriging mean (model prediction, fˆ(x)D) and the Kriging variance (model uncertainty, sˆ(x)2D) for any new point x ∈ X are defined as: fˆ(x)D = E [Y (x)|y] , sˆ(x)2D = Var [Y (x)|y] . The index D clarifies the dependency of these two functions on already evaluated points. The detailed derivation of fˆ(x)D and sˆ(x)2D can be found in [12]. 2.3. Adapting Kriging for categorical parameters One of the most important disadvantages of the standard Kriging model is its limitation to numerical influencing parameters. If there are also categorical parameters to be optimized, Kriging can be extended in several manners. A naive variant is to treat categorical parameters as if they were contin- uous. First, each level is assigned to an integer. The proposed values of the corresponding continuous parameter in the sequential steps are rounded and converted back to the nearest categorical level. In this manner, the structure of the input x is not affected since it is an internal modification. This procedure has to be treated with care as we artificially define order and intervals between the levels which actually do not exist. Another possibility is dummy Kriging, where each categorical parameter with m possible values is expressed by m different parameters (called dummy variables) which take the value 1 if the corresponding level is taken and 0 other- wise. These dummy variables are included into the vector of influence variables x. Hence, the number of parameters in the surrogate model increases leading to longer run times for model fits. 2.4. Infill criterion The infill criterion is a rating function which estimates how promising a target function evaluation at a given point is. Very intuitive criteria relate to the goodness of model predictions. However, a disadvantage of such a criterion would be a possible convergence to a local optimum. Instead, typically the expected improvement criterion, as proposed in [4], is used as a compromise between exploitation of the surrogate model and explo- ration of the function landscape. The EI criterion supports global convergence [19] and becomes the standard criterion in many applications. The EI in a point x is defined as the expected value of a positive improvement of the target function in this point: EI(x) = E[max{0, ymin − fˆ(x)D}] = (ymin − fˆ(x)D) Φ ( ymin − fˆ(x)D sˆ(x)D ) + sˆ(x)D φ ( ymin − fˆ(x)D sˆ(x)D ) , 6 where φ and Φ are the density and the cumulative distribution function of the standard normal distribution. ymin denotes the, so far, minimal value of the target function. The expected improvement should be maximized leading to the new design point x∗ = arg max x∈X EI(x). 2.5. Optimization of infill criterion In each MBO iteration we would like to choose a new point by maximizing the infill criterion. To solve the corresponding non-linear optimization problem, we use the focus search algorithm implemented in the R package mlrMBO [20] which successively focuses the parameter space on the most promising regions. The main idea is: Generate a random LHS design of the size Npoints on X and calculate the corresponding values of the infill criterion. Then, shrink the parameter space in each dimension to the environment of the best point. Iterate the shrinking Nmaxit times. As this procedure can lead to a local optimum, focus search should be replicated several (Nrestarts) times. As our final new point x ∗ we will take the best point over all iterations and repetitions. We will use the following settings: Npoints = 10 000, Nmaxit = 5 and Nrestarts = 3. 3. Model based optimization of instance based problems In many applications the target function does not have to be evaluated just once in a point xi ∈ X , but on a set of k problem instances I = (I1, . . . , Ik). Let yki = f(xi, Ik) define the individual performance measure of the kth in- stance in the ith point. Mostly, the mean response over all instances y¯i = 1/k ∑k j=1 f (xi, Ij) should be optimized. Obviously, one can apply MBO and calculate the mean response as the target. However, the problem with many in- stances is the high computational time needed for the evaluation. Therefore, we are looking for a short cut by excluding unpromising parameter settings with- out evaluating all instances. For simplification we will call promising settings “good” and unpromising ones “bad” in the following. One possible approach is the instance based SMAC method (Sequential Model-based optimization for general Algorithm Configuration) of [6], where the initialization step is only based on the evaluation of a few randomly se- lected instances and newly proposed points are iteratively evaluated on random instances until there is an indication that the point is worse than the so far best. However, instead of a randomly chosen subset a systematic selection pro- cedure seems natural. Therefore, in [5] we proposed a novel method for the optimization of instance based problems which is summarized in Algorithm 2 and explained in the following. The first two steps of Algorithm 2 are the same as in Algorithm 1: an initial design D is generated and all instances I are evaluated on all points of D. Now, our aim is to define a representative subset of instances Ipretest ⊆ I in order to pretest the proposed new points on them. Here, we propose to cluster the k problem instances in kpretest clusters according to some features. Note, the 7 Algorithm 2: Model based optimization of instance based problems. 1 generate an initial design D ⊂ X ; 2 evaluate D on all problem instances: y = (y¯i, . . . , y¯n)T ; 3 cluster all instances in kpretest clusters according to their features; 4 choose class representatives randomly for the pretest set Ipretest ⊆ I; 5 fit Mpretest : y ∼ f(x, Ipretest1 ) + · · ·+ f(x, Ipretestkpretest ); 6 apply forward variable selection for Mpretest regarding the R 2 adj measure by choosing k ′ pretest ≤ kpretest instances so that M ′ pretest has at least R2adj = 0.98 (if achievable); 7 while budget is not exceeded do 8 fit surrogate model Msur on D and y; 9 find x∗ by infill criterion optimization; 10 evaluate x∗ on k ′ pretest selected instances; 11 predict mean performance: ˆ¯y∗ = M ′ pretest(x ∗); 12 calculate 99% confidence interval: [ˆ¯y∗low , ˆ¯y ∗ upp ]; 13 if ˆ¯y∗low ≤ ymin then 14 evaluate x∗ on remaining k − k′pretest instances: y¯∗ = 1/k ∑k j=1 f(x ∗, Ij); 15 update D and y: D ← (D,x∗)T and y ← (y, y¯∗)T ; 16 update model M ′ pretest using the new observation; 17 update elements of y corresponding to “bad” points by the prediction values of M ′ pretest; 18 else 19 update D and y: D ← (D,x∗)T and y ← (y, ˆ¯y∗); 20 end 21 end 22 return ymin = min(y) and corresponding xmin . kpretest number should be chosen by the user and is a trade-off between good representatives (for large kpretest) and computational time in sequential steps. If instances have a set of special features, they could be used for clustering. Otherwise, their individual target function performance can be used as feature. Subsequently, the pretest set could be compounded by randomly chosen repre- sentatives of each cluster. A linear pretest model Mpretest is fitted on D with the mean performance over all instances as the target and the individual performance of the pretest instances as influencing factors (line 5): 1/k k∑ j=1 f (xi, Ij) = β0 + β1 · f(x, Ipretest1 ) + . . .+ βkpretest · f(x, Ipretestkpretest ) + ε. In this manner we aim to build a model for predicting the overall performance 8 of a new parameter setting by observing just the individual performance of the selected instances. In our previous experiments in [5] we found that Mpretest has mostly very high values of the adjusted coefficient of determination measure (R2adj ), even if the number of influencing factors is reduced. Hence, a variable selection step seems to be meaningful in order to further reduce the subset size (line 6). Here, forward variable selection regarding the R2adj measure is newly proposed. Stopping criterion is achieving of R2adj = 0.98. Through this step the number of pretest instances can be reduced to k ′ pretest . If R 2 adj = 0.98 is not achievable, k ′ pretest = kpretest . The model on k ′ pretest instances is denoted with M ′ pretest . In the following repeated steps (lines 7-21), the surrogate model is fitted on D and y (line 8) and the new parameter setting x∗ is proposed by optimizing the infill criterion (line 9). Then, the proposed point is first evaluated on the k ′ pretest instances (line 10) and classified as “good” or “bad”. This can be done in different ways. We base our decision on prediction intervals: for the point x∗ we calculate the prediction ˆ¯y∗ and the 99% prediction interval2 [ˆ¯y∗low , ˆ¯y ∗ upp ] based on M ′ pretest (lines 11 - 12). If the lower limit of the prediction interval is smaller than the so far reached performance minimum, x∗ is classified as a “good” point, otherwise as “bad”. The “good” points are evaluated on the remaining k − k′pretest problem in- stances (line 14), D and y are updated (line 15), and the pretest model is newly estimated in order to include the new information of the additional observation (line 16). If a “bad” point is found, y¯∗ is estimated only by ˆ¯y∗ and given back to the MBO loop (line 19). It is important to notice that these estimations are updated after each new update of the pretest model (line 16). Obviously, the proposed approach leads to saving of function evaluations compared to the classical MBO. On the other hand, the ability to find an optimal solution could be affected thereby. 4. Application problem: onset detection The aim of onset detection (OD) is to recognize the time points of the beginnings of new musical tones. Figure 1 shows the amplitude of the same tone sequence played by piano (left plot) and flute (right plot), respectively. The onset times are marked by vertical lines. This figure illustrates that depending on the music instrument different signal features might to be sensible for OD task. At first, the classical OD algorithm is briefly introduced. Next, the goodness measures of OD and the data sets used for optimization are discussed. Finally, some details regarding the application of the proposed MBO to OD are provided. 2In [5] we compared 99% and 95% prediction intervals. The 95% prediction interval led to more as “bad” classified points. As better results were achieved with the 99% prediction interval, we propose to use this interval here. 9 0 1 2 3 4 time a m pl itu de − 70 00 0 70 00 0 1 2 3 time a m pl itu de −5000 0 5000 Figure 1: Amplitude of the same tone sequence played by piano (left) and flute (right). The vertical lines mark the true tone onset times. 4.1. Onset detection algorithm The classical onset detection approach is presented in Figure 2 (see, e.g., [7, 21, 10, 5]). In each step, many algorithm parameters are defined which have to be optimized in the experimental part of this work. The considered ranges for the parameter values are discussed at the end of this section. Figure 2: Eight steps of the classical onset detection approach. Each step has a several parameters to optimize which are given in parantheses. There are two kinds of onset detection: online and offline. For online OD – in contrast to offline OD – no future signal information is allowed for distinguishing between ‘onset’ and ‘no onset’ in a current signal frame. We assume a digital audio signal sampled with a rate of Fs = 44.1 kHz. This signal is split in the first step into l (overlapping) windows (or frames) of length N samples. If required 10 by the used OD feature, a Short-Time Fourier Transformation (STFT, [22]) is applied in each signal frame: Xstft [n, µ] = 1 N N∑ k=1 xsig [h · (n− 1) + k]wN (k)e−2piiµk/N , (1) where xsig is the music signal and Xstft [n, µ] is the Fourier coefficient (a complex number) of the µth frequency bin in nth frame, n = 1, . . . , l. The hop size parameter h determines the distance in samples between the windows. wN () is the window function (parameter window .fun for optimization) which is used to weight the signal amplitude in the frames. See [23] for an overview of such functions. In the second step, the output of the STFT can be optionally modified. According to [21] we apply spectral filtering and logarithmize the spectral mag- nitudes. Hence, pre-processing is used only for spectral based OD featured. For spectral filtering (parameter spec.filt) a filter bank is applied to the spec- tral magnitudes (absolute values of the Fourier coefficients) which bounds the frequency bins according to the semitones of the western music scale: |Xfilt [n, ν]| = N/2∑ j=1 |Xstft [n, µ]| · F [µ, ν]. (2) Spectral filtering reduces the number of frequency bins for the subsequent fea- ture calculation. Furthermore, the (optionally filtered) spectral coefficients can be logarith- mized (parameter spec.log in step 3): |X log [n, ν]| = log10(`|Xstft [n, ν]|+ 1) (3) where ` is a compression parameter. The computation of an onset detection function in a window of the pre- processed signal is often called reduction (step 4), since after this step not the signal is analyzed anymore but only the ODF values. Many ODFs are based on the comparison of neighboring windows. An increase of an ODF generally indicates a tone onset, a decrease a tone offset. However, also offset information can improve onset detection (see [24]). The algorithm parameter od .fun has 18 levels corresponding to 18 ODFs considered here. These ODF’s are based on the following signal features, some of them use tone offset information: zero-crossing rate, absolute maximum [25], amplitude energy, weighted spectral energy [26], spectral centroid [27], spectral spread [27], spectral skewness [27], spectral flux [7, 8], spectral Euclidean distance [28], phase deviation [29, 8] and complex domain [8]. For example, one of the most popular OD features – spectral flux – is defined as follows: Spec.Flux (n) = N/2∑ j=1 H(|X[n, ν]| − |X[n− 1, ν]|), (4) 11 where H(x) = (x + |x|)/2 is a filter which sets the decrease of the spectral magnitude in a frequency bin to zero. The aim of the normalization (step 5) is to transform the odf feature vector into a standardized form for the subsequent thresholding. At first, exponential smoothing with parameter α can be applied, where for α = 1 the time series stays unchanged and for α = 0 all values of a feature are equal: sm.odf 1 = odf 1, sm.odf n = α · odf n + (1− α) · sm.odf n−1. (5) Moreover, for offline OD the normalization of sm.odf is meaningful utilizing e.g., the maximum of the odf feature vector. Since not every local maximum of an ODF represents an onset, the threshold function aims at the distinction between relevant and non-relevant peaks (step 6). A fixed value for the threshold is unfavorable since the method could then not react to dynamic changes of the signal. Instead, moving threshold functions are widespread [10]: Tn = δ + λ ·mov .fun(|sm.odf n−lT |, . . . , |sm.odf n+rT |), where the parameter mov .fun (moving function) is either the ‘median’, ‘mean’ or ‘p-quantile’. lT and rT are the numbers of windows to the left and to the right, respectively, of the nth window. In the following step the tone onsets are localized in the windows where sm.odf values exceed the threshold and are a local maximum of a certain window sequence (step 7). Following [21] we also use a third condition: A minimum distance min.dist (in number of windows) between the actual window and the window of the previous tone onset iprev .onset should be exceeded. To summarize: On =  1, if sm.odf n > Tn and sm.odf n = max(sm.odf n−lO , . . . , sm.odf n+rO ) and n > nprev .onset + min.dist , 0, otherwise. O = (O1, . . . , Ol) T is the tone onset vector and lO and rO are additional param- eters, namely the number of windows to the left or right of the actual window, respectively, which define the region for local maxima. The left limits of signal frames with On = 1 are taken as the time points of the tone onsets. In [21], it is proposed to report the onsets one window later as actually detected. The authors argue that some features can increase earlier than a human listener would firstly recognize and note a tone onset. For the window length used in [21] this would correspond to a fix time shift of 10 ms. In our work we will consider the parameter onset .shift for optimization (step 8). In contrast to the most papers on the topic, we do not fix window length N and hop size h a priori but optimize them. This means, that parameter settings corresponding to the number of windows (like rT ) could stand for very 12 different time periods depending on N and h. Therefore, all such parameters are re-defined according to the desired time length, N and h. Hence, we will not consider the parameters rT , lT , rO, lO, and min.dist , but the times t(rT ), t(lT ), t(rO), t(lO), and t(min.dist). In what follows, the regions of interests for all parameter are defined. We will consider N = 512, 1024, 2048 and 4096 samples as frame length (powers of 2 are chosen in order to apply the fast Fourier transform ). The hop size h lies between N/10 and N . The following window functions are considered for window .fun: ‘Uniform’, ‘Hamming’, ‘Blackmann’ and ‘Gauss’ (with standard deviation σ = 0.4). Parameters spec.filt and spec.log have two settings ‘yes’ and ‘no’. Furthermore, ` ∈ [0.01, 20] and the smoothing parameter α varies in [0, 1]. Regarding the thresholding, we optimize δ ∈ [0, 10] and λ ∈ [1.1, 2.6] for mov .fun = ‘median’ or ‘mean’ and fix λ to 1 while optimizing p ∈ [0.8, 0.98] for mov .fun = ‘p-quantile’. The shifting parameter onset .shift lies in [−0.01, 0.02] s. Negative values are allowed as it can not be excluded that some features report tone onsets with a certain time delay. For online applications, t(rO) and t(rT ) are set to 0 s. In the offline case and universally for t(lO) and t(lT ) these intervals are set to [0, 0.5] s. The region of interest for the parameter t(min.dist) is [0, 0.05] s. To summarize, there are two application problems to optimize: online and offline OD. Offline OD has a set of 17 parameters while for online OD two parameters are fixed to 0 (i.e., 15 parameters remain for the optimization). In both cases four parameters are categorical and the remaining are numerical. 4.2. Goodness measures for onset detection In the most onset detection related publications the goodness of onset de- tection is measured by the F -measure (e.g., see [8]): F = 2 · TP 2 · TP + FP + FN , F ∈ [0, 1], where TP , FP , and FN stand for the number of true positive cases, false positive cases, and false negative cases, respectively, and F = 1 represents an optimal detection. A found tone onset is correctly identified if it is inside a tolerance interval around the true onset. We use here ±25 ms as the tolerance [21] while ±50 ms setting is also frequently applied in the literature [7, 8]. Note that the true negative cases are not taken into account in the calculation of the F -measure. Another disadvantage is that the distance between true and estimated onsets is only taken into account via the tolerance region. An alternative evaluation measure is therefore the mean (relative) deviation of the estimated onset times from the true ones. This will be called D-measure in what follows. We will optimize the F -measure and use the D-measure only as an additional evaluation feature, e.g., for the clustering in Section 4.4. 4.3. Music data base The greatest challenge with respect to the creation of a music data base is the necessity of information about the onset times. There are at least two ways to 13 generate audio data with the corresponding onset times. In the literature, music pieces are manually annotated most of the time, leading to only small numbers of annotated pieces and possible annotation errors. The second possibility is the use of the MIDI format, offering the advantage to have many such music pieces at one’s disposal. There are different programs available which can generate audio recordings in the so-called wave format from MIDI files using instrument specific signal models. Naturally, a music piece generated synthetically in this way will normally not realistically mimic real music recordings. The aim of the composition of the data base for our optimization is to cover as many music aspects as possible, e.g., way of generation (annotated real music, synthesized MIDI), type of instrument, degree of polyphony, tempo, music genre, and music style. However, it is not meaningful to combine the labeled (real) music pieces with recordings generated from MIDI format as they differ in the onset annotation principle. While for manually annotated pieces the definition of the perceptual tone onset can be applied (i.e., time point where a human listener can firstly recognize a tone onset, [8]), for MIDI pieces the definition of the physical onset is more appropriate (i.e., time point of the first amplitude rise of the new tone, [8]). Hence, we expect that due to delayed annotation in the first case the optimal parameter set can differ from the second case. For this reason, we consider two data bases which will be referred to as WAV and MIDI data sets. The WAV data base consists of three frequently used manually annotated data sets: data base introduced in [7] with 23 pieces, in [9] with 92 recordings and in [21] with 206 music pieces. Altogether it consists of 2 750 tone onsets. Many music instruments (like wind or string) and music styles (like European or oriental) are represented in this data set. According to [21], we aggregated true onset times reported within 30 ms to only one tone onset. The MIDI data base includes the German folk song data base proposed in [28] with 24 music pieces, the MIDI data collection introduced in [5] with 200 pieces and the music epoch data base mentioned in [14] with 22 recordings. Furthermore, 20 pieces were additionally generated for this work. The most MIDI files were converted to wave files using the MIDI to WAV Converter3 while also the RealConverter proposed in [25] was used. There are overall 63 067 tone onsets in the MIDI data base. 4.4. Onset detection as instance based problem Although the necessity of parameter optimization arises in nearly all studies on onset detection, they usually just examine a subset of all possible parame- ters, while other parameters are set to some fixed empirical values, which were determined in preview studies or are frequently used in the OD community. Merely a few numerical parameters are then optimized via grid search or some generic algorithms. This procedure could lead to a local optimum. We aim, 3http://www.maniactools.com/soft/midi_converter/index.shtml. 14 in contrast, to optimize all parameters simultaneously in order to enable lobal optimization. The goal of OD optimization is to find the optimal values of the algorithm parameters mentioned in Section 4.1 (input x) with respect to the mean F -value (output y). The parameter space X of the offline onset detection problem is the Cartesian product of the regions of interest of d = 17 parameters introduced above (in the online case: d = 15). For each problem instance (music piece) the target function f : X ⊂ Rd1 × Nd20 → [0, 1] is defined as the F -value of the associated OD algorithm (characterized by a point x ∈ X ) for this instance, where d1 ans d2 are the numbers of the numerical and the categorical param- eters, respectively. The levels of the categorical parameters are coded with integer numbers (e.g., levels of the parameter window .fun are coded as follows: 0 = ‘Uniform’, 1 = ‘Hamming’, 2 = ‘Blackmann’ and 3 = ‘Gauss’). For the classical MBO approach the OD performance in x is the averaged F -value over the considered data base of k music pieces I = (I1, . . . , Ik): f(x) = 1/k ∑k j=1 f (x, Ij). As the F -value has to be maximized, we will minimize the negated F -value. With respect to Algorithm 2 more details have to by clarified regarding the OD application. The same surrogate model and infill criterion as mentioned in Section 2 for the classical MBO are used. Although an instance based prob- lem is given, the target variable is defined as the mean performance over all instances (real performance values for “good” points and estimated values for “bad” points), so that the usual Kriging model and expected improvement cri- terion can be applied. Categorical parameters are transformed into numeric or dummy variables according to the applied approach (naive or dummy Kriging). In line 3 all instances should be clustered in kpretest clusters. In [5] we used only the individual F -values of instances in the initial design as classification features. Here, we extend this step by considering also the individual D-values (see Section 4.2). Instances in the same cluster should have similar responses (F - and D-values) for most parameter configurations. In this way we strive to reach the greatest diversity of the music pieces in this pretest subset. The k-means algorithm [30] is used as the clustering approach in line 3. The size of the pretest subset (kpretest) is set to 5% of the training data volume. This corresponds to 10 pieces for the WAV data set and 8 pieces for the MIDI data set (as in the training phase only 2/3 of the whole data set is used, s. Section 5). 5. Research questions and experimental design The following main research questions will be studied in this work which can be divided in two classes: optimization questions (OQ) and application questions (AQ): • OQ1: Which surrogate model results in better optimization performance: naive or dummy Kriging? (Section 2.3) 15 • OQ2: Does the proposed MBO for instance based problems achieve the same values of the target function as the classical MBO, despite the fewer number of function evaluations? (Sections 2 and 3) • AQ1: Is there a difference in performance between online and offline OD? (Section 4.1) • AQ2: Is there a difference in performance between WAV and MIDI data? (Section 4.3) This results in 24 = 16 ‘optimization strategy - application problem‘ combi- nations. In order to show the efficiency of MBO, a random search with the same number of function evaluations (generated via an LHS design) is also conducted for online onset detection. Furthermore, a simple reference strategy for the pro- posed MBO is also implemented: after the evaluation of the sequential design the optimization in sequential steps is done merely on the kpretest randomly chosen instances. A very important aspect when comparing many optimization strategies is dealing with over-fitting to the training data. In order to avoid a very good fit on the training data and a much worse performance on new test data, it is essential to use the resampling technique [31]. The most simple resampling technique is the holdout approach where a part of the data (usually 2/3) is used for training and the remaining part for testing. Very popular is the k-fold-cross- validation procedure: Split the data in k disjunct blocks of equal size, conduct the optimization procedure on the training data of k − 1 blocks and validate the best found parameter setting on the remaining block (test data). In this manner, k goodness values (i.e., for each test part) are produced which either can be analyzed as a vector or averaged to only one measure. In the former case two competitive approaches can be compared according to the distribution of the goodness values. As remarked in [32], the k goodness values are not independent through the block structure of the cross-validation. However, independence is assumed in almost all statistical test. In order to achieve the independence mentioned above, we replicate the holdout approach 30 times and get, hence, for each optimization strategy a vector of 30 goodness values. In our case, the training data set is used for finding the optimal parameter set of the online or offline OD algorithm while this setting is then applied to the test data. The mean F -value over the music pieces of the test data is then the goodness value of the applied optimization strategy in the respective holdout-replication. In one replication of the holdout approach the same initial design is used for all strategies in order to ensure uniform starting conditions. As the F -values are not assumed to be normally distributed, the Wilcoxon signed rank test [33, p. 128 ff.] is considered as a non-parametric alternative to the t-test. In accordance with [33] (p. 132) the sample size of 30 observations is sufficient for the desired asymptotic property of the test statistics. Note that although a multiple testing problem occurs here (as some samples have to be applied for testing several times) we waive the methods for holding the global 16 significance level (like Bonferroni-Holm procedure). Instead, the results of the statistical tests will not be over-interpreted but used rather in a descriptive way. The significance level is assumed to be 5%. The experiments were executed in parallel using the BatchExperiments R package [34] on the Linux-HPC cluster system4 of TU Dortmund University. The classical MBO optimization is conducted using a developing version of mlrMBO R package [20]. The OD algorithm as well as MBO for instance based problems are implemented in the R programming language [35] and can be provided on request. 6. Results Here, we successively answer the research questions mentioned in Section 5. Figure 3 illustrates the distribution of the validated F -values for the classical MBO optimization strategy with naive and dummy Kriging as surrogate models. The left five boxplots correspond to the results on the WAV data bank while the right five boxplots consider the MIDI data set. The x-axis provides the OD kind: online or offline. The reference strategy – random search – was applied just for the online OD detection in order to demonstrate the essential advantage of model based optimization. The first optimization question and both application questions can be an- alyzed descriptively (s. Figure 3). It seems that there is no big difference in the optimization performance between the two surrogate models (see the first optimization question, OQ1). Moreover, naive Kriging shows slightly better F -values than the statistically more sensible dummy Kriging. The Wilcoxon signed rank test for the equality of median was applied to four hypotheses (on- line and offline OD on both data sets) while all tests show p-values noticeably greater than 0.05. Hence, no significant difference can be stated between the optimization results of naive and dummy Kriging models. As a further comparison aspect regarding the surrogate models, we will consider the optimization time. Table 1 shows the mean time distribution over the three optimization elements: fitting the surrogate model, EI optimization and function evaluations (see lines 4, 5 and 6 of Algorithm 1, respectively). Time analysis was conducted only for the WAV data set and considers only the sequential steps of MBO (lines 3-8 of Algorithm 1). As the EI optimization algorithm (focus search, see Section 2.5) depends on the number of influencing parameters which is much higher for the dummy Kriging approach, a relevant additional time expense can be noticed for dummy Kriging compared to naive Kriging (3.5 hours more on average for one optimization run). Furthermore, 84.6% of the computational time is needed on average for function evaluations in the case of online OD while for offline OD this ratio is 71%. This illustrates that OD is a typical case of time expensive optimization problems. 4http://lidong.itmc.tu-dortmund.de/ldw/index.php?title=System_overview&oldid= 259. 17 ll l l 0. 55 0. 65 0. 75 0. 85 kind of onset detection F− va lu e o n lin e o n lin e o n lin e o ffl in e o ffl in e o n lin e o n lin e o n lin e o ffl in e o ffl in e WAV data bank MIDI data bank random search MBO_naivKM MBO_dummyKM Figure 3: Validated results of the classical MBO with naive and dummy Kriging as surrogate models for online and offline OD application problems. The left five boxplots: results on the WAV data bank; the right five boxplots: results on the MIDI data bank. Table 1: Mean time in hours for surrogate model fit, EI optimization and function evaluation in one MBO run on the WAV data base. Time for function evaluations on initial design is not considered here. naive: naive Kriging surrogate model, dummy: dummy Kriging. model fit EI optimization function evaluation OD kind naive dummy naive dummy naive dummy online 1.866 3.602 4.348 6.081 34.214 33.274 offline 3.028 5.251 5.922 7.942 21.876 23.580 Results of the statistical tests and the time analysis in Table 1 indicate that naive Kriging seems to be more beneficial than dummy Kriging because of its simplicity and efficiency. Therefore, further optimization experiments will be conducted just with naive Kriging model. However, this finding might not be generalized to all optimization problems. The good performance of naive Kriging can probably be explained through the size and type of the used initial design. For online OD, e.g., the initial design consists of 75 points, so that each level of the categorical parameter od .fun (with 18 levels) occurs approximately four times (according to the construction of LHS designs). This seems to be sufficient for fitting the function landscape appropriately. It would be interesting for further research to study the behavior of naive and dummy Kriging models under different circumstances. Regarding the first application question (AQ1), as expected, offline OD achieves clearly better F -values than the offline OD while this difference seems 18 l l l l l 0. 50 0. 60 0. 70 0. 80 kind of onset detection F− va lu e o n lin e o n lin e o n lin e o ffl in e o ffl in e o n lin e o n lin e o ffl in e o ffl in e WAV data bank MIDI data bank reference classical MBO time efficient MBO Figure 4: Validated F -values of the classical and proposed MBO applied to the online and the offline OD on the WAV and the MIDI data sets. Surrogate model: naive Kriging. to be more remarkable on the WAV data base. The p-value of the Wilcoxon signed rank test for online vs. offline OD performance comparison on the WAV data set (for naive Kriging model) is 1.94 · 10−08 and on the MIDI data sets 0.007. The observed difference is also significant and can be explained due to additional information about the future signal behavior. Further on, Figure 3 reveals that noticeable better F -values can be observed on the MIDI data set compared to the WAV data (see application question 2, AQ2) This difference is highly significant according to the Wilcoxon test both for online and offline OD (p-values are 1.33 ·10−15 and 9.54 ·10−10, respectively). A possible explanation could be the artificially regular patterns of tone begin- nings in the synthetically generated recordings. Such pieces can not reproduce many musical aspects like development of overtones or legato (smoothly playing of many tones) so that the time points of tone onsets can be easier identified compared to real pieces. A more difficult problem is how to compare classical MBO with the pro- posed time efficient instance based approach (optimization question 2, OQ2). The proposed MBO can be seen as an approximation of the classical MBO which, on the one hand, is probably a bit worse but, on the other hand, much faster. There are two possible points of view to compare the two strategies. One possibility is based on the same number of iterations and other one is based on the same number of instance evaluations. Which approach is better depends on the run-time costs of the instance evaluations versus the run-time costs of fitting the Kriging model. Taking the above time analysis into account, the 19 more sophisticated comparison seems to be based on the number of instance evaluations. However, to achieve a better generalization to other optimization tasks, we will consider both points of views. ll 0 20 40 60 80 10 0 kind of onset detection ho ur s o n lin e o n lin e o ffl in e o ffl in e o n lin e o n lin e o ffl in e o ffl in e WAV data bank MIDI data bank classical MBO time efficient MBO Figure 5: Time (in hours) for function evaluations in sequential steps for the considered optimization stretgies on the WAV (left) and MIDI (right) data base. Figure 4 illustrates the distribution of the validated F -values for both opti- mization strategies in case of the same number of iterations (for online and offline OD on WAV and MIDI data sets). Moreover, a simple time efficient reference strategy is applied for online OD on the WAV data base: after complete evalua- tion of the initial design, kpretest music pieces were chosen randomly so that the optimization in the sequential steps was done just on these instances. As can be seen, the reference strategy shows the worst results. It can be stated for all pairwise comparisons that the proposed instance based approach is slightly or noticeably worse than the classical MBO. The observed difference is significant only on the MIDI data set with p-values of 0.037 for the online OD and 0.042 for the offline OD. Figure 5 shows the time efficiency of the proposed variant compared to clas- sical MBO. Here, the total computational time in hours on the used computer cluster system for function evaluations in sequential steps is presented over 30 replications of the associated optimization approaches. The mean time saving amounts approx. 23.5 hours on the WAV data set and approx. 51.3 hours on the MIDI data set. This corresponds in both cases to the astonishingly high mean time saving of 84.3%. This saving depends mainly on the number of “good” points, where the complete data set should be evaluated: the more such points, the lower is the saving. However, also the size of the pretest subset influences 20 ll l l l l l 0. 50 0. 60 0. 70 0. 80 kind of onset detection F− va lu e o n lin e o n lin e o ffl in e o ffl in e o n lin e o n lin e o ffl in e o ffl in e WAV data bank MIDI data bank MBO_cut time efficient MBO Figure 6: F -values of the proposed time efficitent optimization compared to MBO cut for the online and the offline OD on the WAV and the MIDI data bases. the function evaluation time. To remain, in line 6 of Algorithm 2 we proposed a strategy for reducing the subset size. As the result for this application, the num- ber of selected problem instances was reduced from 5% to 3.4% of the training data set on average. After we compared both MBO strategies on the same number of MBO iter- ations (Figure 4), let us now compare them based on the same number of music piece evaluations. For this reason, the number of instance evaluations made by an application of the time efficient MBO is divided by the size of the training data set and rounded up. In this manner, the number of equivalent complete data set evaluations is calculated. Afterwards, the optimization path of the associated classical MBO run (in the same validation step and with the same initial design) is cut to this number of iterations and its best already achieved F -value is saved. This approach is abbreviated by MBO cut. The comparison of the proposed MBO and MBO cut in Figure 6 shows a substantial superiority of the proposed approach for both data sets (especially for offline OD). These superiority is highly significant according to the Wilcoxon tests. Although the number of music piece evaluations is nearly the same, the computational time for the instance based optimization might be slightly larger due to modeling overhead. This overhead can be ignored if the computational time for instance evaluations is noticeable larger than the M ′ pretest and Msur model fitting time. This is the case for our application, as already illustrated in Table 1. 21 7. Conclusion The main idea of the proposed time efficient variant of MBO for instance based problems is choosing a representative set of problem instances which can accurately estimate the overall performance. When allowing the same number of instance evaluations, this approach performs significantly better than classical MBO – according to the Wilcoxon signed rank test – for online and offline onset detection. However, this comparison assumes that time for instance evaluations is the only important one while time for surrogate modeling and infill criterion optimization can be neglected. Note, under this condition, the time efficient strategy needs more MBO iterations. For the onset detection application on a large music data base this assumption is fulfilled. Nevertheless, we also compare both MBO variants concerning the same num- ber of iterations to get a more generalized result also for other applications. Then, classical MBO is slightly better while requiring approximately six times more computational time for instance evaluations. To conclude, the more in- stances the instance based problem contains and the more time expansive they are, the more preferable the proposed MBO approach becomes. Additionally, since classical MBO with the popular Kriging surrogate model is limited only to continuous parameters, two extensions for categorical param- eters are compared: (1) naively handling them as continuous variables and (2) building dummy variables. No significant difference in the optimization perfor- mance can be stated between both extensions while the optimization runs with the naive Kriging model require on average 3.5 hours fewer computational time than with the dummy Kriging model. However, to get a more general state- ment also for other applications with categorical parameters, further simulation studies are important to investigate the naive Kriging performance. Other interesting results with respect to the optimization problem are, firstly, the significantly better performance of offline onset detection (due to the using of the future signal information) compared to the online variant. Secondly, synthetically generated recordings (MIDI data) show significantly better results than the real music pieces (WAV data). This might be caused by inevitable simplifications of the sound properties made by the artificial signal generation. In future work, we aim, on the one hand, to improve the proposed instance based MBO since there are many sensible alternatives both for the selection of the pretest instances and for the choice of the pretest model. For example, also the application of a classification method to select “good” and “bad” points might be sensible. On the other hand, other techniques for optimizing applica- tion problems with categorical parameters should be compared to the proposed ones. For instance, in [6] a random forest surrogate model and another defini- tion of model uncertainty are used for handling this problem. But also the usual covariance kernels can be applied by utilizing an appropriate distance measure (like the gover metric, [36]). Finally, we are interested to test the time efficient MBO on further instance based problems. 22 Acknowledgment This paper is based on investigations of the projects C2 and B3 of SFB 823, which are kindly supported by the German Research Foundation (DFG). References [1] C. Weihs, J. Jessenberger, Statistische Methoden zur Qualita¨tssicherung und -optimierung in der Industrie, Wiley-VCH, 1999. [2] J. C. Spall, Introduction to Stochastic Search and Optimization: Estima- tion, Simulation, and Control, Wiley, 2003. [3] S. Herbrandt, C. Weihs, U. Ligges, M. Ferreira, C. Rautert, D. Biermann, W. Tillmann, Optimization of a simulation for inhomogeneous mineral sub- soil machining, in: Proceedings of the European Conference on Data Anal- ysis, Springer International Publishing, 2015. [4] D. R. Jones, M. Schonlau, W. J. Welch, Efficient global optimization of ex- pensive black-box functions, Journal of Global Optimization 13 (4) (1998) 455–492. [5] N. Bauer, K. Friedrichs, B. Bischl, C. Weihs, Fast model based optimization of tone onset detection by instance sampling, in: A. Wilhelm, H. A. Adal- bert (Eds.), Analysis of Large and Complex Data, Springer International Publishing, 2015, p. to appear. [6] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based opti- mization for general algorithm configuration, in: C. A. C. Coello (Ed.), Learning and Intelligent Optimization, Vol. 6683 of Lecture Notes in Com- puter Science, Springer, 2011, pp. 507–523. [7] J. P. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. E. Davies, M. B. Sandler, A tutorial on onset detection in music signals, IEEE Transactions on Speech and Audio Processing 13 (5-2) (2005) 1035–1047. [8] S. Dixon, Onset detection revisited, in: Proceedings of the 9th International Conference on Digital Audio Effects (DAFx’09), 2006, pp. 133–137. [9] A. Holzapfel, Y. Stylianou, A. Gedik, B. Bozkurt, Three dimensions of pitched instrument onset detection, IEEE Transactions on Audio, Speech, and Language Ppocessing 18 (6) (2010) 1517–1527. [10] C. Rosa˜o, R. Ribeiro, D. Martins De Matos, Influence of peak selection methods on onset detection, in: Proceedings of the 13th International So- ciety for Music Information Retrieval Conference (ISMIR’12), 2012, pp. 517–522. 23 [11] V. Picheny, D. Ginsbourger, Noisy kriging-based optimization methods: A unified implementation within the diceoptim package, Computational Statistics & Data Analysis 71 (C) (2014) 1035–1053. [12] V. Picheny, T. Wagner, D. Ginsbourger, A benchmark of kriging-based infill criteria for noisy optimization, Structural and Multidisciplinary Op- timization 48 (3) (2013) 607–626. [13] N. Bauer, J. Schiffner, C. Weihs, Comparison of classical and sequential design of experiments in note onset detection, in: B. Lausen, D. V. den Poel, A. Ultsch (Eds.), Data Analysis, Machine Learning and Knowledge Discovery, Springer International Publishing, 2013, pp. 501–509. [14] N. Bauer, J. Schiffner, C. Weihs, Comparison of parameter optimization techniques for a music tone onset detection algorithm, in: Proceedings of the 4th Meeting on Statistics and Data Mining (MSDM), 2013, pp. 28–34. URL http://www.tasa-online.com/ [15] M. D. McKay, R. J. Beckman, W. J. Conover, Comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 21 (2) (1979) 55–61. [16] D. G. Krige, A statistical approach to some basic mine valuation problems on the witwatersrand, Metallurgical and Mining Society of South Africa 52 (6) (1951) 119–139. [17] M. Schonlau, Computer experiments and global optimization, Ph.D. thesis, University of Waterloo (1997). [18] C. E. Rasmussen, C. K. I. Williams, Gaussian processes for machine learn- ing, The MIT Press, 2006. [19] D. R. Jones, A taxonomy of global optimization methods based on response surfaces, Journal of Global Optimization 21 (4) (2001) 345–383. [20] B. Bischl, J. Bossek, D. Horn, M. Lang, mlrMBO: Model-Based Optimiza- tion for mlr, R package version 1.0 (2015). URL https://github.com/berndbischl/mlrMBO [21] S. Bo¨ck, F. Krebs, M. Schedl, Evaluating the online capabilities of onset detection methods, in: Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR’12), 2012. [22] J. Cooley, J. Tukey, An algorithm for the machine calculation of complex fourier series, Mathematics of Computation 19 (90) (1965) 297–301. [23] F. J. Harris, On the use of windows for harmonic analysis with the discrete fourier transform, IEEE 66 (1) (1978) 51–83. 24 [24] E. Benetos, S. Dixon, Polyphonic music transcription using note onset and offset detection, in: Proceedings of the International Conference on Acous- tics, Speech and Signal Processing (ICASSP), IEEE, 2011, pp. 37–40. [25] N. Bauer, J. Schiffner, C. Weihs, Einfluss der Musikinstrumente auf die Gu¨te der Einsatzzeiterkennung, Tech. Rep. 10/12 (2012). [26] P. Masri, Computer modeling of sound for transformation and synthesis of musical signal, Ph.D. thesis, University of Bristol (1996). [27] G. Peeters, X. Rodet, A large set of audio feature for sound description (similarity and classification) in the cuidado project, Tech. rep., Ircam (2004). [28] N. Bauer, K. Friedrichs, D. Kirchhoff, J. Schiffner, C. Weihs, Tone onset de- tection using an auditory model, in: M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (Eds.), Data Analysis, Machine Learning and Knowledge Dis- covery, Springer International Publishing, 2014, pp. 315–324. [29] F. Eyben, S. Bo¨ck, B. Schuller, A. Graves, Universal onset detection with bidirectional long short-term memory neural networks, in: J. S. Downie, R. C. Veltkamp (Eds.), Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR’10), International Society for Music Information Retrieval, 2010, pp. 589–594. [30] J. A. Hartigan, M. A. Wong, Algorithm as 136: A k-means clustering algo- rithm, Journal of the Royal Statistical Society, Series C (Applied Statistics) 28 (1979) 100–108. [31] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer New York Inc., New York, 2001. [32] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation, Journal of Machine Learning Research 5 (2004) 1089–1105. [33] S. Siegel, N. Castellan, Nonparametric statistics for the behavioral sciences, 2nd Edition, McGraw–Hill, Inc., 1988. [34] B. Bischl, M. Lang, O. Mersmann, BatchExperiments: Statistical experi- ments on batch computing clusters., R package version 1.2 (2014). URL http://CRAN.R-project.org/package=BatchExperiments [35] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2014). URL http://www.R-project.org/ [36] J. C. Gower, A general coefficient of similarity and some of its properties, Biometrics 26 (1971) 857–874. 25