Input Sample Generation
Introduction
Why Use Sampled Monte Carlo Input?
Monte Carlo simulations are valuable when model inputs are uncertain, random, or when future developments are unknown.
The core concept is simple: run the model many times (called realizations), each with different input values, and then analyze the outputs statistically.
This approach allows you to estimate metrics such as:
- Mean
- Standard deviation
- Confidence intervals
However, when input variables are continuous, it becomes impossible to test every possible combination.
Even using a discrete mesh grid of possible inputs quickly becomes infeasible — a challenge often referred to as the curse of dimensionality.
Illustrative Example:
Suppose the model has five input parameters (dimensions):
| Levels per Parameter | Total Simulations Required |
|---|---|
| 2 | 32 |
| 3 | 243 |
| 4 | 1,024 |
| 10 | 100,000 |
As shown, computational cost grows exponentially with the number of parameters and levels.
Therefore, the goal is to select a representative set of input samples that sufficiently explore the parameter space — while keeping the total computational effort manageable.
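The figures in the table follow directly from raising the number of levels per parameter to the power of the number of parameters. A short Python check of the arithmetic:

```python
# Number of model runs needed to evaluate every combination of a full
# mesh grid, for d = 5 input parameters (the table above).
d = 5
for levels in (2, 3, 4, 10):
    print(f"{levels} levels per parameter -> {levels ** d:,} simulations")
```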
How to Sample Monte Carlo Input?
Before sampling can begin, each input parameter must be clearly defined, including the uncertainty associated with it. This uncertainty is expressed using a distribution function across the parameter’s possible range. For example, if every value within a range is equally likely (or if the distribution is unknown), a uniform distribution is typically used. The SA-Toolbox supports the following seven distribution types:
The SA-Toolbox also includes seven sampling methods. The choice of sampling method affects
- whether and to what extent clustering or gaps appear in the sample;
- the number of samples required to cover the parameter space, and therefore the computational cost;
- whether the probability distributions of the input parameters are used;
- how many parameters (dimensions) can be included;
- the likelihood of artifacts appearing when the dataset is resampled.
Note: In this context, “dataset resampling” refers to reducing the size of an existing dataset by selecting entries based on their original order in the file (e.g. row number), without considering parameter values. For more details, see Step 4 in the chapter “Subset / Filter” of the User Guide.
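Independently of the sampling method, each parameter's uncertainty is thus described by a distribution over its plausible range. The following sketch illustrates the idea with SciPy distribution objects; the parameter names and distributions are invented for this example, and the SA-Toolbox's own parameter-definition interface is not shown here.

```python
import scipy.stats as st

# Hypothetical input parameters, each with an assumed uncertainty distribution.
parameters = {
    "porosity":     st.uniform(loc=0.05, scale=0.25),  # equally likely anywhere in [0.05, 0.30]
    "conductivity": st.lognorm(s=0.5, scale=1e-6),     # skewed, strictly positive
    "temperature":  st.norm(loc=25.0, scale=2.0),      # symmetric uncertainty around 25 degC
}

# Sampling methods that respect the distributions map numbers from the unit
# interval [0, 1] onto them via the inverse CDF (percent-point function).
u = 0.5
print({name: dist.ppf(u) for name, dist in parameters.items()})
```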
Sampling Methods Overview
The SA-Toolbox provides several methods for parameter sampling. Conceptually, these can be broadly divided into three categories:
- (Pseudo)Random Methods;
- Structured Methods;
- Quasi-Monte Carlo Methods.

Random sampling methods are robust and easy to understand, but due to clustering effects, they typically require a large number of samples to achieve good coverage of the parameter space. The word “random” implies that the selection of the next sample point does not depend on the data points that have already been selected, and that each run of the sample generator creates a different set of data points (sample). However, since random number generators rely on deterministic algorithms, the term “pseudo-random” is more accurate.
Structured methods improve coverage by dividing each parameter range into intervals and ensuring that each interval is sampled at least once. This reduces clustering and gaps, therefore providing a more even distribution of samples across the parameter space. The sample points are not statistically independent, because choosing one point may limit where other points can be placed. This means that some areas of the parameter space may be excluded from further sampling based on previous selections. Although the methods are deterministic in nature, most algorithms introduce an element of randomness, so that re-running the sample generator will produce a different set of samples each time.
Quasi-Monte Carlo (QMC) methods are highly structured sampling techniques. They use low-discrepancy sequences to minimize both clustering and gaps, resulting in a more even distribution of sample points. They are deterministic – meaning that running the sample generator multiple times will produce the same set of sample points each time, unless a random component is explicitly added. QMC methods are computationally efficient. However, with very large numbers of parameters, the sequences may lose their efficiency – this issue is not based on an increase in computational cost, but is inherent in the mathematics itself. Depending on the method, stability issues may also arise. In such cases, random or structured methods may offer better performance.
In practice, individual sampling methods cannot always be assigned cleanly to one of these categories. Nevertheless, the categories are helpful for understanding the underlying principles of the various sampling methods and for recognizing the reasons for their respective strengths and weaknesses.
Simple Random Sampling
This is the “classic” Monte Carlo sampling method that generates sample points by selecting values from the parameter distributions without any structure. While it is statistically similar to true randomness, it can suffer from clustering and gaps, which can lead to inefficient coverage of the parameter space.
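A minimal sketch of simple random sampling with NumPy, assuming independent parameters with hypothetical distributions (not taken from the SA-Toolbox):

```python
import numpy as np

rng = np.random.default_rng()              # pseudo-random generator (seedable)
n_samples = 1000                           # number of realizations

# Every value is drawn independently from its parameter's distribution.
sample = np.column_stack([
    rng.uniform(0.0, 1.0, n_samples),          # parameter 1: uniform on [0, 1]
    rng.normal(10.0, 2.0, n_samples),          # parameter 2: normal, mean 10, sd 2
    rng.triangular(0.0, 0.2, 1.0, n_samples),  # parameter 3: triangular
])
print(sample.shape)                        # (1000, 3): one row per realization
```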

Advantages
- Easy to understand
- Respects the parameter distributions
- Convergence rate is independent of the input dimensionality
- Sample generation is very fast even for large samples (no complex computation required)
Disadvantages
- Slow convergence: a large number of samples is required to cover the entire parameter range
- Parameter space is not covered evenly
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
- Knuth, Donald Ervin (1981): The Art of Computer Programming. Volume 2. Seminumerical algorithms. 2. ed. Reading, Mass.: Addison-Wesley (Addison-Wesley series in computer science and information processing).
Grid Sampling
In grid sampling, the range of each parameter is divided into equal intervals.
The number of intervals is the same for every parameter.

The number of intervals \( n \) is determined from the chosen sample size \( s \) and the number of parameters \( d \) as the smallest integer satisfying \( n^d \ge s \), i.e. \( n = \lceil s^{1/d} \rceil \).
This means each parameter will take on \( n \) discrete values.
All possible combinations of these values form a d-dimensional grid with a total of \((n+1)^d\) grid points (also called nodes or vertices).
Important: Not the vertices themselves, but the \( n^d \) midpoints of the grid cells (“middle-of-box placement”) form the set of points from which the sample is drawn. If the requested sample size exactly matches \( s = n^d \), all cell midpoints are selected for the sample — this is equivalent to a full factorial design at the chosen interval size. In this special case, re-running the sample generator returns the same sample each time. Otherwise, a new random subsample is created during each run of the sample generator.
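The construction can be sketched as follows; the exact rounding of \( n \) and the uniform random subsampling of the cell midpoints are illustrative assumptions, not the toolbox's exact code:

```python
import itertools
from math import ceil
import numpy as np

def grid_sample(s, bounds, rng=None):
    """Middle-of-box grid sample: s points in len(bounds) dimensions."""
    rng = rng or np.random.default_rng()
    d = len(bounds)
    n = ceil(s ** (1.0 / d))                    # intervals per parameter
    # Midpoints of the n intervals, scaled to each parameter's [lo, hi] range.
    axes = [lo + (hi - lo) * (np.arange(n) + 0.5) / n for lo, hi in bounds]
    midpoints = np.array(list(itertools.product(*axes)))   # all n**d cell centres
    if s == len(midpoints):                     # full factorial case: deterministic
        return midpoints
    idx = rng.choice(len(midpoints), size=s, replace=False)  # random subsample
    return midpoints[idx]

print(grid_sample(9, [(0.0, 1.0), (10.0, 20.0)]))  # n = 3, 3**2 = 9: all midpoints
```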
Advantages
- Provides systematic coverage of the parameter space
- Simple and easy to understand
Disadvantages
- Can become computationally expensive for a large number of parameters or very fine grids
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
Latin Hypercube Sampling (LHS)
In Latin Hypercube Sampling, the parameter space is divided into intervals of equal probability—not equal size, as in grid sampling. Each interval is then sampled exactly once. This ensures that the entire range of each parameter is used, and that the underlying probability distribution for each parameter is respected. The sampled values for each parameter are then randomly combined with values from other parameters. As a result, while each parameter is sampled evenly according to its distribution, some clustering or gaps may still appear in the overall multidimensional sample space.
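A minimal illustration of Latin Hypercube Sampling using SciPy's qmc module (the parameter ranges and distributions are invented for this example; this is not necessarily how the SA-Toolbox implements LHS internally):

```python
from scipy.stats import norm, qmc

sampler = qmc.LatinHypercube(d=2)      # two input parameters
u = sampler.random(n=100)              # 100 points in the unit square [0, 1)^2

# Map the uniform [0, 1) values onto each parameter's own distribution.
x1 = qmc.scale(u[:, [0]], l_bounds=[2.0], u_bounds=[8.0])  # uniform on [2, 8]
x2 = norm(loc=0.0, scale=1.0).ppf(u[:, 1])                 # standard normal

print(x1.shape, x2.shape)              # (100, 1) (100,)
```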

Advantages
- Provides much better coverage of the marginal distributions than random sampling
- Requires fewer samples (lower computational cost) than grid sampling
- Scales better to higher dimensions by avoiding the exponential increase in sample size inherent in many other methods
- Improved convergence compared to MC, if the simulation model is additive
Disadvantages
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
- Helton, J. C.; Davis, F. J. (2002): Illustration of sampling-based methods for uncertainty and sensitivity analysis. In: Risk Analysis: An Official Publication of the Society for Risk Analysis, 22(3), 591–622. DOI: 10.1111/0272-4332.00041
- Helton, J. C.; Davis, F. J. (2003): Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. In: Reliability Engineering & System Safety, 81(1), 23–69. DOI: 10.1016/s0951-8320(03)00058-9
Quasi-Monte Carlo Methods
The Sobol and Halton low-discrepancy sequences are designed to minimize gaps in the sample space. For each parameter they produce values that are evenly distributed, not correlated with values from any of the other parameters, and do not form patterns (e.g. repeating).

Sobol Sampling
The Sobol method generates sample points using binary-based computations.
A unique set of direction numbers is assigned to each of the \(d\) parameters. These act as seed values, and the literature provides precomputed sets for practical use.

Each parameter is linked to a primitive polynomial, which determines how the direction numbers evolve.
Using the initial direction numbers, the polynomial, and the sample index, the sample points are computed; with \(m\) direction numbers per parameter, up to \(2^m\) sample points can be generated.
Larger sample sets can be realized efficiently with bit-level operations such as bit shifting and XOR.
Each new sample point is constructed based on the previous ones.
Since the Sobol method uses binary computations, the ideal sample size is a power of 2.
In SA-Toolbox, this is implemented as a hard requirement.
This method is deterministic: re-running the sample generator will always create the same sample.
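As a rough illustration (using SciPy's Sobol generator rather than the SA-Toolbox itself), note the power-of-2 sample size and the deterministic behaviour:

```python
from scipy.stats import qmc

sampler = qmc.Sobol(d=5, scramble=False)   # plain (unscrambled) Sobol sequence
sample = sampler.random_base2(m=8)         # 2**8 = 256 points in the unit hypercube
print(sample.shape)                        # (256, 5)

# Re-creating the generator reproduces exactly the same points (deterministic).
again = qmc.Sobol(d=5, scramble=False).random_base2(m=8)
print((sample == again).all())             # True
```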
Advantages
- Very fast to generate large numbers of samples due to efficient bit-level operations
- Often requires fewer samples than methods like LHS for variance-based sensitivity analysis
- More stable than Halton in high dimensions; works well even with hundreds of input parameters
- Widely used as a standard method for Quasi-Monte Carlo sampling
Disadvantages
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are correlated
- In very high dimensions, performance may decrease
References
Sobol-Scramble
The Sobol-Scramble method is based on the standard Sobol sequence and preserves its structure, but introduces a random component by applying a scrambling algorithm (a combination of a linear transformation and a random shift) that changes each parameter value slightly.
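A brief illustration with SciPy's scrambled Sobol generator (again an illustration only, not the SA-Toolbox implementation):

```python
import numpy as np
from scipy.stats import qmc

# Scrambled Sobol: same low-discrepancy structure plus a random component.
sample = qmc.Sobol(d=5, scramble=True, seed=42).random_base2(m=8)  # 256 points in [0, 1)^5

# A different seed gives a different, but equally well-spread, point set.
other = qmc.Sobol(d=5, scramble=True, seed=7).random_base2(m=8)
print(np.allclose(sample, other))          # False: the scrambling shifts the points
```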
Advantages
- Reduces artifacts compared to Sobol due to the introduced randomness
- Scrambling method is computationally cheap
- Sample size can be expanded
- Very fast to generate large numbers of samples due to efficient bitwise operations
- Often requires fewer samples than methods like LHS for variance-based sensitivity analysis
- More stable than Halton in higher dimensions; works well even with hundreds of parameters
- Commonly used as a standard method for Quasi-Monte Carlo sampling
Disadvantages
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
- In very high dimensions, performance may degrade
- Can cause problems for analysis methods that assume white-noise errors (e.g. nearest-neighbor methods)
References
Halton
The Halton method assigns a unique prime number to each parameter and uses the radical inverse function to generate quasi-random numbers. For each sample, a natural number (the sample's index) is written in the assigned prime base, its digits are reversed and placed after the radix point, yielding a fraction in the unit interval [0, 1). This value is then scaled to the parameter's defined range and distribution. Repeating this for every parameter produces a multi-dimensional sample point. The sample points are generated in a fixed order based on the index; the Halton sequence is therefore reproducible, and re-running the sample generator will always return the same sample.
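The radical inverse construction described above can be sketched in a few lines; the scaling to each parameter's range and distribution is omitted here:

```python
def radical_inverse(index, base):
    """Write `index` in `base`, mirror the digits about the radix point,
    and return the resulting fraction in [0, 1)."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * f
        f /= base
    return result

# Two parameters with bases 2 and 3 (the first two primes): the i-th sample
# point combines the radical inverses of the index i in both bases.
for i in range(1, 6):
    print(i, radical_inverse(i, 2), round(radical_inverse(i, 3), 4))
```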
Advantages
- Can be computed analytically and is generally faster than the recursive calculations required by the Sobol method
- Good uniformity in low-dimensional spaces
Disadvantages
- Only suitable for low dimensions (typically <10 parameters); above that, correlations and visible patterns can occur
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
Stick-Breaking
The Stick-Breaking method was developed specifically with geochemical compositions in mind, i.e. for an application where the parameters are not independent and the parameter values must add up to 1 for each realization (e.g. concentrations or mass fractions). The parameter ranges must be defined within the interval [0, 1], and the defined bounds must allow a total sum of 1 (i.e. a feasible region must exist). The method uses the fact that the minimum of several uniform samples follows a beta distribution, which can be sampled directly. Parameters are then generated one by one in a recursive process: Each new value is sampled from [0, 1] and then scaled by the remaining part of the stick (i.e., what's left after previous values are subtracted from 1). This ensures the sum stays below 1. The last parameter takes the remaining amount to exactly reach a total of 1.
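A minimal sketch of the recursive construction, assuming unrestricted bounds on [0, 1]; the SA-Toolbox's handling of user-defined bounds and feasibility checks is not reproduced here:

```python
import numpy as np

def stick_breaking(d, rng=None):
    """One realization of d non-negative fractions that sum to 1."""
    rng = rng or np.random.default_rng()
    values, remaining = [], 1.0
    for k in range(d - 1):
        # Break off a share of what is left of the "stick". Beta(1, m) is the
        # distribution of the minimum of m uniform variables, with m = d - 1 - k.
        fraction = rng.beta(1.0, d - 1 - k)
        values.append(fraction * remaining)
        remaining -= values[-1]
    values.append(remaining)               # last parameter closes the sum to 1
    return np.array(values)

x = stick_breaking(4)
print(x, x.sum())                          # four fractions summing to 1 (up to rounding)
```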
Advantages
- Takes parameter dependencies into account
Disadvantages
- Does not respect parameter distributions
References
- Ng, Kai Wang; Tian, Guo-Liang; Tang, Man-Lai (2011): Dirichlet and related distributions. Theory, methods and applications. 1st ed., Chichester: Wiley (Wiley Series in Probability and Statistics)