Input Sample Generation
Introduction
Why Use Sampled Monte Carlo Input?
Monte Carlo simulations are valuable when model inputs are uncertain, random, or when future developments are unknown.
The core concept is simple: run the model many times (called realizations), each with different input values, and then analyze the outputs statistically.
This approach allows you to estimate metrics such as:
- Mean
- Standard deviation
- Confidence intervals
However, when input variables are continuous, it becomes impossible to test every possible combination.
Even using a discrete mesh grid of possible inputs quickly becomes infeasible — a challenge often referred to as the curse of dimensionality.
Illustrative Example:
Suppose the model has five input parameters (dimensions):
| Levels per Parameter | Total Simulations Required |
|---|---|
| 2 | 32 |
| 3 | 243 |
| 4 | 1,024 |
| 10 | 100,000 |
As shown, computational cost grows exponentially with the number of parameters and levels.
Therefore, the goal is to select a representative set of input samples that sufficiently explore the parameter space — while keeping the total computational effort manageable.
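The figures in the table follow directly from raising the number of levels per parameter to the power of the number of parameters. A short Python check of the arithmetic:

```python
# Number of model runs needed to evaluate every combination of a full
# mesh grid, for d = 5 input parameters (the table above).
d = 5
for levels in (2, 3, 4, 10):
    print(f"{levels} levels per parameter -> {levels ** d:,} simulations")
```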
How to Sample Monte Carlo Input?
Before sampling can begin, each input parameter must be clearly defined, including the uncertainty associated with it. This uncertainty is expressed using a distribution function across the parameter’s possible range. For example, if every value within a range is equally likely (or if the distribution is unknown), a uniform distribution is typically used. The SA-Toolbox supports the following seven distribution types:
The SA-Toolbox also includes seven sampling methods. The choice of sampling method affects
- whether and to what extent clustering or gaps appear in the sample;
- the number of samples required to cover the parameter space, and therefore the computational cost;
- whether the probability distributions of the input parameters are used;
- how many parameters (dimensions) can be included;
- the likelihood of artifacts appearing when the dataset is resampled.
Note: In this context, “dataset resampling” refers to reducing the size of an existing dataset by selecting entries based on their original order in the file (e.g. row number), without considering parameter values. For more details, see Step 4 in the chapter “Subset / Filter” of the User Guide.
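Independently of the sampling method, each parameter's uncertainty is thus described by a distribution over its plausible range. The following sketch illustrates the idea with SciPy distribution objects; the parameter names and distributions are invented for this example, and the SA-Toolbox's own parameter-definition interface is not shown here.

```python
import scipy.stats as st

# Hypothetical input parameters, each with an assumed uncertainty distribution.
parameters = {
    "porosity":     st.uniform(loc=0.05, scale=0.25),  # equally likely anywhere in [0.05, 0.30]
    "conductivity": st.lognorm(s=0.5, scale=1e-6),     # skewed, strictly positive
    "temperature":  st.norm(loc=25.0, scale=2.0),      # symmetric uncertainty around 25 degC
}

# Sampling methods that respect the distributions map numbers from the unit
# interval [0, 1] onto them via the inverse CDF (percent-point function).
u = 0.5
print({name: dist.ppf(u) for name, dist in parameters.items()})
```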
Sampling Methods Overview
The SA-Toolbox provides several methods for parameter sampling. Conceptually, these can be broadly divided into three categories:
- (Pseudo)Random Methods;
- Structured Methods;
- Quasi-Monte Carlo Methods.

Random sampling methods are robust and easy to understand, but due to clustering effects, they typically require a large number of samples to achieve good coverage of the parameter space. The word “random” implies that the selection of the next sample point does not depend on the data points that have already been selected, and that each run of the sample generator creates a different set of data points (sample). However, since random number generators rely on deterministic algorithms, the term “pseudo-random” is more accurate.
Structured methods improve coverage by dividing each parameter range into intervals and ensuring that each interval is sampled at least once. This reduces clustering and gaps, therefore providing a more even distribution of samples across the parameter space. The sample points are not statistically independent, because choosing one point may limit where other points can be placed. This means that some areas of the parameter space may be excluded from further sampling based on previous selections. Although the methods are deterministic in nature, most algorithms introduce an element of randomness, so that re-running the sample generator will produce a different set of samples each time.
Quasi-Monte Carlo (QMC) methods are highly structured sampling techniques. They use low-discrepancy sequences to minimize both clustering and gaps, resulting in a more even distribution of sample points. They are deterministic – meaning that running the sample generator multiple times will produce the same set of sample points each time, unless a random component is explicitly added. QMC methods are computationally efficient. However, with very large numbers of parameters, the sequences may lose their efficiency – this issue is not based on an increase in computational cost, but is inherent in the mathematics itself. Depending on the method, stability issues may also arise. In such cases, random or structured methods may offer better performance.
In practice, individual sampling methods cannot always be assigned cleanly to one of these categories. Nevertheless, the categories are helpful for understanding the underlying principles of the various sampling methods and for recognizing the reasons for their respective strengths and weaknesses.
Simple Random Sampling
This is the “classic” Monte Carlo sampling method that generates sample points by selecting values from the parameter distributions without any structure. While it is statistically similar to true randomness, it can suffer from clustering and gaps, which can lead to inefficient coverage of the parameter space.
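A minimal sketch of simple random sampling with NumPy, assuming independent parameters with hypothetical distributions (not taken from the SA-Toolbox):

```python
import numpy as np

rng = np.random.default_rng()              # pseudo-random generator (seedable)
n_samples = 1000                           # number of realizations

# Every value is drawn independently from its parameter's distribution.
sample = np.column_stack([
    rng.uniform(0.0, 1.0, n_samples),          # parameter 1: uniform on [0, 1]
    rng.normal(10.0, 2.0, n_samples),          # parameter 2: normal, mean 10, sd 2
    rng.triangular(0.0, 0.2, 1.0, n_samples),  # parameter 3: triangular
])
print(sample.shape)                        # (1000, 3): one row per realization
```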

Advantages
- Easy to understand
- Respects the parameter distributions
- Convergence rate is independent of the input dimensionality
- Sample generation is very fast even for large samples (no complex computation required)
Disadvantages
- Slow convergence: a large number of samples is required to cover the entire parameter range
- Parameter space is not covered evenly
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
- Knuth, Donald Ervin (1981): The Art of Computer Programming. Volume 2. Seminumerical algorithms. 2. ed. Reading, Mass.: Addison-Wesley (Addison-Wesley series in computer science and information processing).
Grid Sampling
In grid sampling, the range of each parameter is divided into equal intervals.
The number of intervals is the same for every parameter.

The number of intervals \( n \) is determined from the chosen sample size \( s \) and the number of parameters \( d \) as the smallest integer satisfying \( n^d \ge s \), i.e. \( n = \lceil s^{1/d} \rceil \).
This means each parameter will take on \( n \) discrete values.
All possible combinations of these values form a d-dimensional grid with a total of \((n+1)^d\) grid points (also called nodes or vertices).
Important: Not the vertices themselves, but the \( n^d \) midpoints of the grid cells (“middle-of-box placement”) form the set of points from which the sample is drawn. If the requested sample size exactly matches \( s = n^d \), all cell midpoints are selected for the sample — this is equivalent to a full factorial design at the chosen interval size. In this special case, re-running the sample generator returns the same sample each time. Otherwise, a new random subsample is created during each run of the sample generator.
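The construction can be sketched as follows; the exact rounding of \( n \) and the uniform random subsampling of the cell midpoints are illustrative assumptions, not the toolbox's exact code:

```python
import itertools
from math import ceil
import numpy as np

def grid_sample(s, bounds, rng=None):
    """Middle-of-box grid sample: s points in len(bounds) dimensions."""
    rng = rng or np.random.default_rng()
    d = len(bounds)
    n = ceil(s ** (1.0 / d))                    # intervals per parameter
    # Midpoints of the n intervals, scaled to each parameter's [lo, hi] range.
    axes = [lo + (hi - lo) * (np.arange(n) + 0.5) / n for lo, hi in bounds]
    midpoints = np.array(list(itertools.product(*axes)))   # all n**d cell centres
    if s == len(midpoints):                     # full factorial case: deterministic
        return midpoints
    idx = rng.choice(len(midpoints), size=s, replace=False)  # random subsample
    return midpoints[idx]

print(grid_sample(9, [(0.0, 1.0), (10.0, 20.0)]))  # n = 3, 3**2 = 9: all midpoints
```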
Advantages
- Provides systematic coverage of the parameter space
- Simple and easy to understand
Disadvantages
- Can become computationally expensive for a large number of parameters or very fine grids
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
Latin Hypercube Sampling (LHS)
In Latin Hypercube Sampling, the parameter space is divided into intervals of equal probability—not equal size, as in grid sampling. Each interval is then sampled exactly once. This ensures that the entire range of each parameter is used, and that the underlying probability distribution for each parameter is respected. The sampled values for each parameter are then randomly combined with values from other parameters. As a result, while each parameter is sampled evenly according to its distribution, some clustering or gaps may still appear in the overall multidimensional sample space.
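A minimal illustration of Latin Hypercube Sampling using SciPy's qmc module (the parameter ranges and distributions are invented for this example; this is not necessarily how the SA-Toolbox implements LHS internally):

```python
from scipy.stats import norm, qmc

sampler = qmc.LatinHypercube(d=2)      # two input parameters
u = sampler.random(n=100)              # 100 points in the unit square [0, 1)^2

# Map the uniform [0, 1) values onto each parameter's own distribution.
x1 = qmc.scale(u[:, [0]], l_bounds=[2.0], u_bounds=[8.0])  # uniform on [2, 8]
x2 = norm(loc=0.0, scale=1.0).ppf(u[:, 1])                 # standard normal

print(x1.shape, x2.shape)              # (100, 1) (100,)
```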

Advantages
- Provides much better coverage of the marginal distributions than random sampling
- Requires fewer samples (lower computational cost) than grid sampling
- Scales better to higher dimensions by avoiding the exponential increase in sample size inherent in many other methods
- Improved convergence compared to MC, if the simulation model is additive
Disadvantages
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
- Helton, J. C.; Davis, F. J. (2002): Illustration of sampling-based methods for uncertainty and sensitivity analysis. In: Risk Analysis: An Official Publication of the Society for Risk Analysis, 22(3), 591–622. DOI: 10.1111/0272-4332.00041
- Helton, J. C.; Davis, F. J. (2003): Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. In: Reliability Engineering & System Safety, 81(1), 23–69. DOI: 10.1016/s0951-8320(03)00058-9
Quasi-Monte Carlo Methods
The Sobol and Halton low-discrepancy sequences are designed to minimize gaps in the sample space. For each parameter they produce values that are evenly distributed, not correlated with values from any of the other parameters, and do not form patterns (e.g. repeating).

Sobol Sampling
The Sobol method generates sample points using binary-based computations.
A unique set of direction numbers is assigned to each of the \(d\) parameters. These act as seed values, and the literature provides precomputed sets for practical use.

Each parameter is linked to a primitive polynomial, which determines how the direction numbers evolve.
Using the initial direction numbers, the polynomial, and the sample index, the sample points are computed; with \(m\) direction numbers per parameter, up to \(2^m\) sample points can be generated.
Larger sample sets can be realized efficiently with bit-level operations such as bit shifting and XOR.
Each new sample point is constructed based on the previous ones.
Since the Sobol method uses binary computations, the ideal sample size is a power of 2.
In SA-Toolbox, this is implemented as a hard requirement.
This method is deterministic: re-running the sample generator will always create the same sample.
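As a rough illustration (using SciPy's Sobol generator rather than the SA-Toolbox itself), note the power-of-2 sample size and the deterministic behaviour:

```python
from scipy.stats import qmc

sampler = qmc.Sobol(d=5, scramble=False)   # plain (unscrambled) Sobol sequence
sample = sampler.random_base2(m=8)         # 2**8 = 256 points in the unit hypercube
print(sample.shape)                        # (256, 5)

# Re-creating the generator reproduces exactly the same points (deterministic).
again = qmc.Sobol(d=5, scramble=False).random_base2(m=8)
print((sample == again).all())             # True
```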
Advantages
- Very fast to generate large numbers of samples due to efficient bit-level operations
- Often requires fewer samples than methods like LHS for variance-based sensitivity analysis
- More stable than Halton in high dimensions; works well even with hundreds of input parameters
- Widely used as a standard method for Quasi-Monte Carlo sampling
Disadvantages
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are correlated
- In very high dimensions, performance may decrease
References
Sobol-Scramble
The Sobol-Scramble method is based on the standard Sobol sequence and preserves its structure, but introduces a random component by applying a scrambling algorithm (a combination of a linear transformation and a random shift) that changes each parameter value slightly.
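A brief illustration with SciPy's scrambled Sobol generator (again an illustration only, not the SA-Toolbox implementation):

```python
import numpy as np
from scipy.stats import qmc

# Scrambled Sobol: same low-discrepancy structure plus a random component.
sample = qmc.Sobol(d=5, scramble=True, seed=42).random_base2(m=8)  # 256 points in [0, 1)^5

# A different seed gives a different, but equally well-spread, point set.
other = qmc.Sobol(d=5, scramble=True, seed=7).random_base2(m=8)
print(np.allclose(sample, other))          # False: the scrambling shifts the points
```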
Advantages
- Reduces artifacts compared to Sobol due to the introduced randomness
- Scrambling method is computationally cheap
- Sample size can be expanded
- Very fast to generate large numbers of samples due to efficient bitwise operations
- Often requires fewer samples than methods like LHS for variance-based sensitivity analysis
- More stable than Halton in higher dimensions; works well even with hundreds of parameters
- Commonly used as a standard method for Quasi-Monte Carlo sampling
Disadvantages
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
- In very high dimensions, performance may degrade
- Can cause problems for analysis methods that assume white-noise errors (e.g. nearest-neighbor methods)
References
Halton
The Halton method assigns a unique prime number to each parameter and uses the radical inverse function to generate quasi-random numbers. For each sample, a natural number (the sample's index) is written in the assigned prime base, its digits are reversed and placed after the radix point, yielding a fraction in the unit interval [0, 1). This value is then scaled to the parameter's defined range and distribution. Repeating this for every parameter produces a multi-dimensional sample point. The sample points are generated in a fixed order based on the index; the Halton sequence is therefore reproducible, and re-running the sample generator will always return the same sample.
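The radical inverse construction described above can be sketched in a few lines; the scaling to each parameter's range and distribution is omitted here:

```python
def radical_inverse(index, base):
    """Write `index` in `base`, mirror the digits about the radix point,
    and return the resulting fraction in [0, 1)."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * f
        f /= base
    return result

# Two parameters with bases 2 and 3 (the first two primes): the i-th sample
# point combines the radical inverses of the index i in both bases.
for i in range(1, 6):
    print(i, radical_inverse(i, 2), round(radical_inverse(i, 3), 4))
```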
Advantages
- Can be computed analytically and is generally faster than the recursive calculations required by the Sobol method
- Good uniformity in low-dimensional spaces
Disadvantages
- Only suitable for low dimensions (typically <10 parameters); above that, correlations and visible patterns can occur
- Assumes input parameters are independent, which can lead to unrealistic combinations if parameters are actually correlated or dependent
References
Stick-Breaking
The Stick-Breaking method was developed specifically with geochemical compositions in mind, i.e. for an application where the parameters are not independent and the parameter values must add up to 1 for each realization (e.g. concentrations or mass fractions). The parameter ranges must be defined within the interval [0, 1], and the defined bounds must allow a total sum of 1 (i.e. a feasible region must exist). The method uses the fact that the minimum of several uniform samples follows a beta distribution, which can be sampled directly. Parameters are then generated one by one in a recursive process: Each new value is sampled from [0, 1] and then scaled by the remaining part of the stick (i.e., what's left after previous values are subtracted from 1). This ensures the sum stays below 1. The last parameter takes the remaining amount to exactly reach a total of 1.
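A minimal sketch of the recursive construction, assuming unrestricted bounds on [0, 1]; the SA-Toolbox's handling of user-defined bounds and feasibility checks is not reproduced here:

```python
import numpy as np

def stick_breaking(d, rng=None):
    """One realization of d non-negative fractions that sum to 1."""
    rng = rng or np.random.default_rng()
    values, remaining = [], 1.0
    for k in range(d - 1):
        # Break off a share of what is left of the "stick". Beta(1, m) is the
        # distribution of the minimum of m uniform variables, with m = d - 1 - k.
        fraction = rng.beta(1.0, d - 1 - k)
        values.append(fraction * remaining)
        remaining -= values[-1]
    values.append(remaining)               # last parameter closes the sum to 1
    return np.array(values)

x = stick_breaking(4)
print(x, x.sum())                          # four fractions summing to 1 (up to rounding)
```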
Advantages
- Takes parameter dependencies into account
Disadvantages
- Does not respect parameter distributions
References
- Ng, Kai Wang; Tian, Guo-Liang; Tang, Man-Lai (2011): Dirichlet and related distributions. Theory, methods and applications. 1st ed., Chichester: Wiley (Wiley Series in Probability and Statistics)