Satellite remote sensing of snow and ice requires solving an inverse problem: given a measured reflectance spectrum, determine the physical properties of the surface — grain size, impurity loading, density, layer structure. The forward direction is handled by radiative transfer models like BioSNICAR, which solve the equations of light scattering and absorption through a layered snowpack. But forward models are too slow (~0.5 s per evaluation) for the millions of evaluations needed during optimisation and uncertainty quantification.
This notebook teaches every data-science concept behind the BioSNICAR emulator and inversion scheme, from first principles, using toy examples that run without BioSNICAR installed. Each technique is motivated by a concrete physical problem, demonstrated on a toy snow/ice spectrum, and then connected to the exact line of BioSNICAR source code where it is used.
| Act | Theme | Sections | Core question |
|---|---|---|---|
| I | Sampling & the Curse of Dimensionality | 1 – 4 | How do we efficiently explore parameter space? |
| II | Dimensionality Reduction & Emulation | 5 – 9 | How do we make the forward model fast enough? |
| III | Cost Functions & Optimisation | 10 – 16 | How do we find the best-fit parameters? |
| IV | Uncertainty Quantification | 17 – 22 | How confident should we be in the answer? |
| V | Putting It All Together | 23 – 25 | Full retrieval pipeline, end to end |
To run the notebook locally, clone the repository and open the `.ipynb` file:

```shell
git clone https://github.com/jmcook1186/biosnicar-py.git
cd biosnicar-py/notebooks
jupyter notebook emulator_and_inversions.ipynb
```
A forward model is a function that maps physical parameters (things we want to know) to observables (things we can measure). In cryospheric remote sensing the parameters describe the snowpack — grain size, density, impurity concentration, layer thickness — and the observable is the spectral albedo: the fraction of incoming sunlight reflected at each wavelength.
$$\alpha(\lambda) = f(r_{\mathrm{eff}},\; \rho,\; c_{\mathrm{imp}},\; \mathrm{d}z,\; \theta_s,\; \ldots)$$

where $r_{\mathrm{eff}}$ is effective grain radius, $\rho$ is density, $c_{\mathrm{imp}}$ is impurity concentration, $\mathrm{d}z$ is layer thickness, and $\theta_s$ is solar zenith angle.
Snow and ice albedo is controlled by two competing processes:
Grain size controls the path length between scattering events. Coarser grains mean longer paths through ice, so more absorption in the NIR — but the visible is barely affected because ice is almost transparent at those wavelengths.
Impurities (black carbon, dust, algae) absorb strongly in the visible but are spectrally flat or decreasing into the NIR. Their effect is concentrated below ~0.7 µm.
BioSNICAR solves this physics rigorously via a multi-layer adding-doubling radiative transfer scheme:
- Single-layer optical properties from Mie theory (`biosnicar/optical_properties/column_OPs.py:get_layer_OPs`)
- Impurity mixing into the bulk optical properties (`biosnicar/optical_properties/column_OPs.py:mix_in_impurities`)
- Multi-layer adding-doubling radiative transfer (`biosnicar/rt_solvers/adding_doubling_solver.py`)
- Orchestration by the driver (`biosnicar/drivers/run_model.py:run_model`)

A single evaluation takes ~0.5 seconds. This is fine for one-off predictions but far too slow for inverse problems (which need millions of evaluations). The entire purpose of Acts I–II is to build a fast surrogate (emulator) that replaces this 0.5 s call with a ~50 µs neural network prediction.
We define a simplified 2-parameter function that captures the essential spectral physics:
BioSNICAR connection — The real forward model lives in
`biosnicar/drivers/run_model.py`. It accepts keyword overrides like `rds` (grain radius, µm), `rho` (density, kg m⁻³), `black_carbon` (ppb), `snow_algae` (cells mL⁻¹), and `solzen` (solar zenith, degrees). It returns an `Outputs` object with `.albedo` (480-band spectral albedo) and `.BBA` (broadband albedo). The pipeline is: `setup_snicar()` → `get_layer_OPs()` → `mix_in_impurities()` → `adding_doubling_solver()`. Each step has a direct physical interpretation: Mie theory for grain optics, MAC mixing for impurities, and the adding-doubling method for multi-layer radiative transfer with Fresnel surface reflectance.

Our toy function replaces this entire pipeline with a parametric approximation that runs in microseconds — fast enough for interactive exploration.
To build an emulator (Section 8), we need to evaluate the forward model at a set of training points scattered across the parameter space. The quality of the emulator depends critically on how well those points cover the space.
There are three natural strategies:
| Strategy | Idea | Weakness |
|---|---|---|
| Random | Uniform i.i.d. draws | Gaps and clusters — some regions are poorly sampled |
| Grid | Regular lattice | Explodes exponentially in high dimensions (Section 3) |
| LHS | Stratified per dimension | Space-filling and cheap; linear cost in $d$ |
The key idea is borrowed from stratified sampling in statistics. For $n$ samples in $d$ dimensions:

1. Divide each dimension into $n$ equal-width bins.
2. For each dimension independently, draw a random permutation of the bins so that exactly one sample is assigned to each bin.
3. Place each sample at a uniformly random position within its assigned bin.
This guarantees that the marginal projection onto any single dimension is perfectly uniform — there is exactly one point in every bin. No clustering, no gaps, regardless of $n$ or $d$.
Formally, for dimension $j$, sample $i$ is placed at:
$$x_{ij} = \frac{\pi_j(i) + u_{ij}}{n}, \quad u_{ij} \sim \mathrm{Uniform}(0,1)$$

where $\pi_j$ is a random permutation of $\{0, 1, \ldots, n-1\}$.
We implement this from scratch below, matching the algorithm in `biosnicar/emulator.py:_latin_hypercube()` (line 20).
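As a stand-in for the notebook cell, here is a minimal NumPy sketch of the permutation-plus-jitter algorithm described above (the `seed` argument is an addition for reproducibility; this is not the BioSNICAR source):

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, seed=None):
    """Stratified Latin hypercube design in the unit cube [0, 1]^d.

    Each dimension is split into n_samples equal bins; a random
    permutation assigns exactly one sample to each bin, and the sample
    is jittered uniformly within its bin: x_ij = (perm_j(i) + u_ij) / n.
    """
    rng = np.random.default_rng(seed)
    samples = np.empty((n_samples, n_dims))
    for j in range(n_dims):
        perm = rng.permutation(n_samples)   # one sample per bin
        u = rng.uniform(size=n_samples)     # jitter within the bin
        samples[:, j] = (perm + u) / n_samples
    return samples

pts = latin_hypercube(100, 2, seed=0)
```

Flooring `pts * 100` along either axis recovers each bin index exactly once, which is the "one point per bin" guarantee in action.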
Key insight: LHS guarantees that each row and column of the stratified grid contains exactly one sample — so marginal projections are perfectly uniform, while the interior is space-filling. This is analogous to a Latin square in combinatorics (hence the name): no two samples share the same row or column of any 2-D projection.
Pure random sampling suffers from clustering and gaps that become worse in higher dimensions. LHS eliminates marginal gaps by construction. While LHS doesn't guarantee optimal joint coverage (that would require more expensive designs like maximin or minimax), it provides an excellent balance of coverage quality and computational simplicity.
A grid provides perfect coverage but costs $k^d$ evaluations for $k$ points per dimension. For 8 parameters (a typical BioSNICAR emulator), even a modest 20-point grid requires $20^8 \approx 25.6$ billion forward model evaluations — each taking ~0.5 s. Section 3 quantifies this "curse of dimensionality" in detail.
BioSNICAR connection —
`biosnicar/emulator.py:_latin_hypercube()` (line 20) uses this exact algorithm. It takes `n_samples` and `n_dims` (derived from the length of the `params` dictionary passed to `Emulator.build()`), and returns a matrix of samples in $[0,1]^d$. These are then rescaled to physical bounds (e.g. grain radius 100–5000 µm, black carbon 0–500,000 ppb) before evaluating the forward model on each sample.
Richard Bellman coined the phrase "curse of dimensionality" in 1961 to describe a pervasive problem: the volume of a space grows exponentially with the number of dimensions. A regular grid with $k$ points per dimension needs $k^d$ total samples. This means:
| Dimensions ($d$) | Grid points ($20^d$) | At 0.5 s each |
|---|---|---|
| 2 | 400 | 3 minutes |
| 4 | 160,000 | 22 hours |
| 6 | 64,000,000 | 370 days |
| 8 | 25,600,000,000 | 406 years |
The BioSNICAR emulator typically has 4–8 free parameters (grain radius, density, layer thickness, solar zenith angle, plus up to 4 impurity concentrations). Grid-based training is completely infeasible.
LHS decouples sample count from dimensionality. Whether we have 2 parameters or 20, we can choose any fixed budget $n$ (say 2,000 samples) and still get uniform marginal coverage. The cost is $O(n)$ forward model evaluations regardless of $d$. Of course, covering a high-dimensional space with finite samples inevitably leaves gaps — but LHS ensures those gaps are distributed as evenly as possible, rather than concentrating in low-probability corners.
BioSNICAR connection —
`Emulator.build()` in `biosnicar/emulator.py` (line 261) uses LHS to generate training samples. The default is typically 2,000–10,000 samples regardless of dimensionality. At ~0.5 s per forward model evaluation, 2,000 samples take ~17 minutes — compared to the 406 years needed for an 8-D grid. This is why LHS is essential, not optional. The emulator parameters are defined as a dictionary passed to `build()`:

```python
emu = Emulator.build(
    params={
        "rds": (100, 5000),
        "rho": (200, 917),
        "black_carbon": (0, 500000),
        "solzen": (0, 89),
    },
    n_samples=2000,
)
```

Each key becomes one dimension of the LHS design.
Not all parameters are created equal. Grain radius might range from 100 to 5,000 µm — a factor of 50. But impurity concentrations span from 0 to 500,000 ppb — a factor of 500,000. If we sample uniformly in linear space, the distribution of training points is disastrously skewed: almost every sample lands at high concentrations, while the physically common low-concentration regime is barely represented.
We need training data that represents the full dynamic range, with adequate resolution at both the low and high ends.
Sampling in $\log_{10}(x+1)$ space maps the range $[0, 500000]$ onto $[0, 5.7]$. A uniform draw in this compressed space gives roughly equal density per order of magnitude in the original space — far more points near zero, which is exactly what we need.
$$x_{\mathrm{log}} \sim \mathrm{Uniform}\big(\log_{10}(x_{\min}+1),\; \log_{10}(x_{\max}+1)\big)$$

$$x_{\mathrm{linear}} = 10^{x_{\mathrm{log}}} - 1$$

The $+1$ offset ensures that $x=0$ maps to $\log_{10}(1) = 0$ rather than $-\infty$, so the transformation is well-defined for physically meaningful zero-concentration scenarios.
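A minimal sketch of this transform, using a hypothetical helper name `sample_log_space` (not a BioSNICAR function):

```python
import numpy as np

def sample_log_space(x_min, x_max, n, seed=None):
    """Draw uniformly in log10(x + 1) space, then map back to linear units.

    The +1 offset keeps x = 0 in-range, so clean-snow (zero-impurity)
    scenarios remain representable.
    """
    rng = np.random.default_rng(seed)
    x_log = rng.uniform(np.log10(x_min + 1), np.log10(x_max + 1), size=n)
    return 10.0 ** x_log - 1.0

bc = sample_log_space(0, 500_000, 10_000, seed=0)
frac_low = np.mean(bc < 1_000)   # roughly half the draws fall below 1,000 ppb
```

Because the log-range is ~5.7 decades and roughly 3 of them lie below 1,000 ppb, about half of all draws land there, exactly the low-end resolution that linear sampling would miss.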
BioSNICAR connection —
`biosnicar/emulator.py` (line 68) defines:

```python
_LOG_SAMPLE_PARAMS = {"black_carbon", "snow_algae", "glacier_algae", "dust"}
```

When `Emulator.build()` generates LHS samples, any parameter whose name appears in `_LOG_SAMPLE_PARAMS` is sampled in $\log_{10}(x+1)$ space (line 331). This ensures that the training set includes adequate clean-ice scenarios alongside heavily polluted ones. Parameters not in this set (like `rds`, `rho`, `dz`, `solzen`) are sampled linearly because their ranges are moderate (e.g. $r$ from 100 to 5,000 µm — only a factor of 50) and there is no special need to oversample the low end.

The same log transform reappears in the optimiser (Section 12), where `_LOG_SPACE_PARAMS` controls which parameters are optimised in log space.
Exercise 1: Modify `latin_hypercube` to accept a dictionary of bounds and a set of log-space parameter names. Generate 200 samples for `{"grain_size": (0, 1), "impurity": (0, 500000)}` where impurity is log-sampled. Plot the 2-D scatter to confirm the grain axis is uniform while the impurity axis is log-distributed.
The forward model produces a 50-band spectrum (480 in real BioSNICAR). Training an emulator to predict all 50 bands independently would be wasteful — most of the information is redundant. This act shows how to compress the output using PCA and then predict only the compressed representation with a neural network.
A snow albedo spectrum is not 50 independent numbers. Adjacent wavelength bands are driven by the same underlying physics — ice absorption varies smoothly with wavelength, and a change in grain size affects a broad swath of the spectrum, not individual bands. This means neighbouring bands carry nearly identical information.
Formally, if we collect many spectra (varying the input parameters) and compute the Pearson correlation between every pair of bands, we find large blocks of nearly perfect correlation ($r \approx 1$). The intrinsic dimensionality of the spectral output is much lower than the number of bands — typically 3–10 independent "modes" suffice to describe 99.9% of the variance.
This is why we compress: instead of training a neural network to predict 50 (or 480) correlated outputs, we project onto 3–5 orthogonal principal components and predict those. This is faster, more numerically stable, and produces smoother reconstructions.
The bright red blocks show groups of highly correlated bands. Bands within the visible plateau (0.3–0.7 µm) are almost perfectly correlated because they are all controlled by the same impurity darkening. The NIR bands beyond 1.0 µm correlate with each other because they are all driven by ice absorption (which depends on grain size).
The anti-correlations (blue) between visible and NIR bands reflect the physical fact that impurities darken the visible while grain size darkens the NIR — these are partially independent effects.
A 50-dimensional output with such strong inter-band correlation can be faithfully represented in far fewer dimensions — that is the job of Principal Component Analysis (PCA).
Principal Component Analysis is a linear dimensionality reduction technique. It finds the orthogonal directions along which the data varies most, then discards the directions with negligible variance. For spectral data, these directions are physical "modes" of variability — the first PC typically captures the overall brightness level, the second captures the VIS-vs-NIR contrast (related to grain size), etc.
Given a data matrix $\mathbf{X}$ of shape ($n$ samples × $p$ bands):

1. Centre the data by subtracting the mean spectrum from every row.
2. Form the $p \times p$ covariance matrix of the centred data.
3. Eigendecompose it, sorting eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ together with their eigenvectors (the principal components).
4. Project each spectrum onto the first $k$ eigenvectors to obtain $k$ scores per spectrum.
The variance explained by the first $k$ components is $\sum_{i=1}^k \lambda_i \;/\; \sum_{i=1}^p \lambda_i$. If 3 components capture 99.9% of variance, the remaining 47 dimensions are mostly noise.
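The steps above can be sketched directly with NumPy's `eigh`. The toy spectra below are synthetic stand-ins driven by two latent factors, loosely mimicking the grain-size and impurity modes (illustrative, not BioSNICAR output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectra": 200 samples of a 50-band signal driven by 2 latent factors.
bands = np.linspace(0.3, 2.5, 50)
a = rng.uniform(size=(200, 1))                  # latent factor 1 (NIR darkening)
b = rng.uniform(size=(200, 1))                  # latent factor 2 (VIS darkening)
X = 0.9 - 0.3 * a * (bands > 1.0) - 0.4 * b * (bands < 0.7)
X += rng.normal(scale=1e-3, size=X.shape)       # small measurement noise

# PCA by eigendecomposition of the covariance matrix
mean = X.mean(axis=0)
Xc = X - mean
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.999) + 1)  # components for 99.9 % variance

# Project onto k components and reconstruct
scores = Xc @ eigvecs[:, :k]
X_rec = scores @ eigvecs[:, :k].T + mean
rmse = np.sqrt(np.mean((X - X_rec) ** 2))
```

Because only two latent factors drive the signal, `k` comes out as 2–3 and the reconstruction error is at the noise floor, mirroring the dramatic compression claimed for the real spectra.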
This decomposition is powerful because it separates physically distinct effects into orthogonal components, making the subsequent regression (neural network) much easier.
Key insight: Just 2–3 principal components capture >99.9% of the variance in our toy spectra. This dramatic compression is possible because the spectral output is controlled by only 2 underlying parameters (grain size and impurity). For BioSNICAR's 480-band output with more free parameters, typically 5–10 components suffice.
Other dimensionality reduction methods exist (autoencoders, t-SNE, UMAP), but PCA has key advantages for emulation: it is linear, deterministic, and cheap to fit; its components are physically interpretable; and its `inverse_transform` is lossless up to truncation.

BioSNICAR connection — `biosnicar/emulator.py` (line 395) uses `PCA(n_components=0.999)` — scikit-learn automatically determines the number of components needed to retain 99.9% of variance. The PCA mean vector and component matrix are saved alongside the neural network weights in the `.npz` file (line 662), so reconstruction at inference time needs only NumPy — no scikit-learn dependency.
Exercise 2: Try `n_components=1` and `n_components=5`. At what point does the reconstruction RMSE become negligible (<1e-4)? How does this relate to the scree plot?
After PCA, the emulator's job reduces to learning a mapping from a small input (2–8 physical parameters) to a small output (3–10 PCA coefficients). This is a multi-output regression problem, and we solve it with the simplest neural network architecture: the Multi-Layer Perceptron (MLP).
The mapping from physical parameters to PCA coefficients is nonlinear — doubling the grain size doesn't simply double the spectral change. The relationship involves exponential absorption, multiple scattering, and nonlinear mixing of optical properties. Linear regression can't capture this. Polynomial regression can, but requires high degree and extrapolates poorly. An MLP with ReLU activations is a universal approximator: given enough hidden units, it can fit any continuous function to arbitrary precision (Leshno et al., 1993).
Single neuron: applies a linear transformation followed by a nonlinear activation:
$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$where $\mathbf{w}$ are weights, $b$ is bias, and $\sigma$ is an activation function. With ReLU ($\sigma(z) = \max(0, z)$), the neuron outputs zero for negative inputs and passes positive inputs through.
Layer: many neurons in parallel, each with its own weights: $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$
MLP: layers stacked in sequence. Each layer's output feeds the next. The final layer has no activation (linear output) so the network can predict any real-valued PCA coefficient. The hidden layers use ReLU to introduce nonlinearity — without it, the entire network would collapse to a single linear transformation.
The MLP is trained by backpropagation + gradient descent: compute the
mean squared error between predicted and true PCA coefficients, then
adjust all weights to reduce the error. Scikit-learn's MLPRegressor
handles this automatically with the Adam optimiser and early stopping.
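To make that training loop concrete, here is a hand-rolled sketch of backpropagation and gradient descent on a tiny 1–16–1 ReLU network. It uses plain gradient descent rather than Adam, and the toy target `sin(3x)` is an arbitrary choice, so this is an illustration of the mechanics, not what `MLPRegressor` literally does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: fit y = sin(3x) on [0, 1] with a 1-16-1 ReLU MLP.
X = rng.uniform(size=(256, 1))
y = np.sin(3.0 * X)

W1 = rng.normal(scale=1.0, size=(1, 16)); b1 = rng.uniform(-1.0, 1.0, size=16)
W2 = rng.normal(scale=0.3, size=(16, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(5000):
    # forward pass
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)            # ReLU hidden layer
    y_hat = h @ W2 + b2                   # linear output layer
    err = y_hat - y
    loss = np.mean(err ** 2)              # mean squared error

    # backpropagation: chain rule from the MSE back to each weight
    g_out = 2.0 * err / len(X)            # dL/dy_hat
    gW2 = h.T @ g_out; gb2 = g_out.sum(axis=0)
    g_pre = (g_out @ W2.T) * (h_pre > 0)  # ReLU gradient mask
    gW1 = X.T @ g_pre; gb1 = g_pre.sum(axis=0)

    # gradient descent update
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

After training, the loss sits far below the variance of the target, demonstrating that a single ReLU layer can bend a linear map into a nonmonotonic curve.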
BioSNICAR connection —
`biosnicar/emulator.py` uses `MLPRegressor` with `hidden_layer_sizes=(128, 128, 64)` (line 406). This architecture was chosen empirically: 3 hidden layers with decreasing width provide enough capacity to fit the nonlinear parameter-to-spectrum mapping without overfitting on typical training set sizes (1,000–10,000 samples).

At prediction time, the code extracts the raw weight matrices (`mlp.coefs_`, `mlp.intercepts_`) and runs a pure-NumPy forward pass (`_forward_pass()`, line 505). This is critical: it means the saved emulator (`.npz` file) has no dependency on scikit-learn. Users only need NumPy to load and evaluate the emulator, and the forward pass is just matrix multiplies + ReLU — about 50 µs per evaluation.
Input (2) Hidden 1 (64) Hidden 2 (64) Hidden 3 (32) Output (3 PCA coeffs)
o─────┐ o──────┐ o──────┐ o──────┐ o → PCA inverse
o─────┤──[W1]──▶ o──────┤──[W2]──▶ o──────┤──[W3]──▶ o──────┤──[W4]──▶ o → transform
ReLU ReLU ReLU Linear o → 50-band albedo
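The inference chain in the diagram fits in a few lines. The `forward_pass` name and signature here are illustrative, mirroring the idea behind `_forward_pass()` rather than reproducing it:

```python
import numpy as np

def forward_pass(x, weights, biases):
    """ReLU MLP inference from raw weight matrices (no scikit-learn needed).

    `weights`/`biases` mirror sklearn's mlp.coefs_ / mlp.intercepts_:
    one (n_in, n_out) matrix and one bias vector per layer, with ReLU on
    every hidden layer and a linear output layer.
    """
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)      # hidden layers: affine + ReLU
    return h @ weights[-1] + biases[-1]     # output layer: affine only

# Tiny hand-built example: 2 inputs -> 3 hidden units -> 1 output
Ws = [np.array([[1.0, 0.0, -1.0],
                [0.0, 1.0,  1.0]]),
      np.array([[1.0], [1.0], [1.0]])]
bs = [np.zeros(3), np.zeros(1)]
out = forward_pass(np.array([[1.0, 2.0]]), Ws, bs)   # hidden = [1, 2, 1] -> 4.0
```

The entire cost is one matrix multiply and one `np.maximum` per layer, which is why a saved emulator needs nothing beyond NumPy.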
In NumPy, ReLU is simply `np.maximum(x, 0)`.

An emulator is a chain of transformations that replaces the expensive forward model with a fast approximation:
$$\hat{\alpha}(\lambda) = \mathrm{PCA}^{-1}\Big(\mathrm{MLP}\big(\mathrm{scale}(\theta)\big)\Big)$$

The full pipeline is:
The class below implements this, matching biosnicar/emulator.py:Emulator.build().
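As a stand-in for that class, here is a minimal sketch of the same prediction chain, with all fitted pieces (scaler bounds, MLP weights, PCA basis) passed in by hand as if loaded from a saved archive; the class name and constructor are illustrative, not the real `Emulator` API:

```python
import numpy as np

class ToyEmulator:
    """Prediction chain only: scale(theta) -> MLP -> inverse PCA."""

    def __init__(self, x_min, x_max, weights, biases, pca_components, pca_mean):
        self.x_min = np.asarray(x_min, float)
        self.x_max = np.asarray(x_max, float)
        self.weights, self.biases = weights, biases
        self.pca_components = np.asarray(pca_components, float)  # (k, n_bands)
        self.pca_mean = np.asarray(pca_mean, float)

    def predict(self, theta):
        # 1. min-max scale the physical parameters to [0, 1]
        x = (np.atleast_2d(theta) - self.x_min) / (self.x_max - self.x_min)
        # 2. MLP forward pass: ReLU hidden layers, linear output
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = np.maximum(x @ W + b, 0.0)
        scores = x @ self.weights[-1] + self.biases[-1]   # predicted PCA scores
        # 3. inverse PCA transform back to a full spectrum
        return scores @ self.pca_components + self.pca_mean

# Hand-built toy: 2 parameters -> 2 PCA scores -> 5 bands
emu = ToyEmulator(
    x_min=[0.0, 0.0], x_max=[1.0, 1.0],
    weights=[np.eye(2), np.eye(2)], biases=[np.zeros(2), np.zeros(2)],
    pca_components=np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
                             [0.0, 1.0, 0.0, 1.0, 0.0]]),
    pca_mean=np.zeros(5),
)
spectrum = emu.predict([0.5, 0.5])
```

Every stage is a NumPy array operation, so the whole `predict` call is portable and fast.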
BioSNICAR connection —
`Emulator.build()` follows this exact pipeline. Key implementation details:

- Weight extraction (line 505): `_forward_pass()` takes raw NumPy arrays `weights` and `biases` extracted from `mlp.coefs_` and `mlp.intercepts_`. This decouples inference from scikit-learn.
- `.npz` save format (line 662): the emulator saves PCA components, PCA mean, scaler parameters, MLP weights, MLP biases, and metadata (R², parameter names, bounds) into a single compressed NumPy archive. Loading is `np.load()` + a few attribute assignments.
- Portability: a saved `.npz` file can be shared with collaborators who only have NumPy installed. No retraining, no scikit-learn, no BioSNICAR needed to make predictions.
- Speed: the real BioSNICAR forward model takes ~0.5 s per evaluation. The emulator forward pass takes ~50 µs — a 10,000× speedup that makes iterative optimisation and MCMC sampling feasible.
Exercise 3: Build a second emulator with only 100 training samples. Compare its R² and RMSE against the 2000-sample version. At what sample count does accuracy saturate?
Many regression methods could serve as emulators. The choice matters because the emulator sits in the inner loop of the optimiser — it is evaluated thousands to millions of times during retrieval. The key requirements are smoothness (for gradient-based optimisation), fast prediction, native multi-output support, and manageable training cost:
| Method | Smooth? | Predict speed | Multi-output | Training cost |
|---|---|---|---|---|
| MLP | Yes (ReLU piecewise-linear) | ~50 µs | Natural | Minutes |
| Random Forest | No (step function) | ~100 µs | Per-output | Seconds |
| Gaussian Process | Yes (kernel-smooth) | ~1 ms | Expensive | $O(n^3)$ |
| Polynomial | Yes | ~10 µs | Natural | Instant |
Let us test these head-to-head on our toy problem.
MLP produces a smooth, continuous surface that closely tracks the true function. The ReLU activation creates piecewise-linear segments that are fine enough to appear smooth. Crucially, the gradients are well-defined everywhere (except at ReLU kinks), which is essential for L-BFGS-B.
Random Forest produces a step function — it predicts the average of nearby training points within each tree leaf. These jumps create artefacts when used as a cost function target (the optimiser "bounces" between steps rather than converging smoothly). This disqualifies RF as an emulator for gradient-based optimisation.
Gaussian Process gives excellent fit quality and provides posterior uncertainty for free (the predictive variance at each point). However, GP training requires inverting the kernel matrix, which costs $O(n^3)$ in time and $O(n^2)$ in memory. At $n = 2000$ this is marginal; at $n = 10000$ it is impractical. Sparse GP approximations exist but add complexity and lose the exact uncertainty guarantee.
Polynomial (degree 4 with Ridge regularisation) is fast and smooth but fails for complex, high-dimensional functions. The number of terms grows as $\binom{d + \mathrm{deg}}{\mathrm{deg}}$, and high-degree polynomials oscillate wildly outside the training domain (Runge's phenomenon).
BioSNICAR connection — The emulator uses MLP + PCA because it must: (a) predict 480-band spectra smoothly for gradient-based optimisation, (b) train on 1,000–10,000 samples in reasonable time, (c) run millions of evaluations during MCMC and optimisation. The pure-NumPy forward pass (`emulator.py:_forward_pass`, line 505) sits in the innermost loop of every retrieval, so it must be as fast as possible.
Acts I–II built a fast emulator that predicts spectra from parameters. Now we reverse the arrow: given an observed spectrum (from a satellite or field spectrometer), find the parameters that best explain it. This is the inverse problem, and it is the core of any retrieval scheme.
The forward problem is well-posed: given parameters $\theta$, there is a unique spectrum $\alpha = f(\theta)$. The inverse problem is harder: different parameter combinations can produce nearly identical spectra (non-uniqueness), the observation is corrupted by measurement noise, and some parameters barely affect the spectrum at all (insensitivity).
We can't simply "invert" the forward model analytically. Instead, we frame the inverse problem as optimisation: find the parameters $\theta^*$ that minimise the mismatch between the observed spectrum and the model prediction:
$$\theta^* = \arg\min_\theta \; J\big(f(\theta),\; \alpha^{\mathrm{obs}}\big)$$

where $J$ is a cost function (also called loss function or objective function) that measures the misfit. The shape of the cost surface $J(\theta)$ determines how hard the optimisation is and what algorithm we should use.
The cost surface shows a single, well-defined minimum near the true parameters. The elliptical contours suggest the surface is approximately quadratic near the minimum — this is important for both optimisation (gradient methods work well on quadratic surfaces) and uncertainty quantification (the Hessian approximation in Act IV assumes quadratic shape).
Note the elongation of the contours: the cost function is more sensitive to grain size (steeper gradient) than to impurity (shallower gradient). This is because grain size affects a larger portion of the spectrum (the entire NIR) while impurity only affects the visible.
BioSNICAR connection —
`biosnicar/inverse/cost.py` implements two cost functions:

- `spectral_cost()` (line 23): full-spectrum chi-squared over all 480 bands.
- `band_cost()` (line 71): chi-squared over satellite band reflectances (e.g., Sentinel-2 B3, B4, B11).

Both compute $\chi^2 = \sum_i (y_i^{\mathrm{pred}} - y_i^{\mathrm{obs}})^2 / \sigma_i^2$. The `retrieve()` function in `optimize.py` dispatches to the appropriate cost function based on whether a `platform` argument is provided.
The cost function encodes our assumptions about measurement uncertainty and prior knowledge. Different choices lead to different retrieved values and different uncertainty estimates. Getting it right is as important as choosing the optimiser.
Unweighted least squares: treats all spectral bands equally. Simple, but problematic when bands have different noise levels — a noisy band and a precise band contribute equally, allowing the noisy band to bias the solution.
Uncertainty-weighted chi-squared: each band's contribution is normalised by its measurement uncertainty $\sigma_i$. Bands with large uncertainty (e.g. in strong water absorption regions where the signal is weak) are down-weighted. At the true parameters, $\chi^2 \approx n$ (the number of bands) for a good fit.
This is the maximum likelihood solution under the assumption that measurement errors are Gaussian: $y_i^{\mathrm{obs}} \sim \mathcal{N}(y_i^{\mathrm{true}}, \sigma_i^2)$.
Regularised cost: adds a penalty that pulls parameters toward prior estimates $\mu_k$. This is useful when the data alone cannot constrain all parameters — for example, snow density has very little spectral signature, so without a prior it would be poorly constrained. The regularisation term is equivalent to a Gaussian prior in Bayesian inference.
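The three variants collapse into one toy cost function. The argument names and the prior dictionary layout are illustrative choices, not the BioSNICAR signature:

```python
import numpy as np

def chi2_cost(y_pred, y_obs, sigma=None, theta=None, priors=None):
    """Chi-squared misfit with optional per-band uncertainty weighting
    and optional Gaussian-prior regularisation (illustrative sketch)."""
    y_pred, y_obs = np.asarray(y_pred, float), np.asarray(y_obs, float)
    if sigma is None:
        sigma = np.ones_like(y_obs)          # unweighted least squares fallback
    cost = np.sum((y_pred - y_obs) ** 2 / np.asarray(sigma, float) ** 2)
    if priors is not None:                   # {param index: (mu, prior_sigma)}
        for k, (mu, s) in priors.items():
            cost += (theta[k] - mu) ** 2 / s ** 2
    return cost

# A one-sigma misfit in band 0 contributes chi2 = 1
cost_plain = chi2_cost([0.9, 0.7], [0.8, 0.7], sigma=[0.1, 0.1])

# A parameter one prior-sigma from its prior mean adds another unit
cost_reg = chi2_cost([0.9, 0.7], [0.8, 0.7], sigma=[0.1, 0.1],
                     theta=[500.0], priors={0: (400.0, 100.0)})
```

The symmetry is the point: data misfit and prior misfit are both measured in units of their respective sigmas, so the regularisation term really is a Gaussian prior in disguise.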
Notice how the uncertainty-weighted cost surface has a rounder shape — the elongation caused by the 1.5 µm water-ice absorption band (which is noisy and unreliable) has been suppressed by down-weighting those bands.
BioSNICAR connection — Both `spectral_cost()` and `band_cost()` in `biosnicar/inverse/cost.py` accept `obs_uncertainty` (per-band $\sigma$) and `regularization` (dictionary of `{param: (prior_mean, prior_sigma)}`) keyword arguments. When `obs_uncertainty` is `None`, the cost function falls back to unweighted least squares. The regularisation is particularly important for band-mode retrieval (Sentinel-2, Landsat) where only 3–5 broadband observations must constrain multiple parameters.
Section 4 introduced log-space sampling for the training set. The same problem arises during optimisation: when parameters span orders of magnitude, the cost surface is extremely elongated in linear space.
A gradient step of $\Delta x = 1$ is tiny when $x = 100{,}000$ but enormous when $x = 10$. The optimiser either overshoots at small values or crawls at large values. The finite-difference gradient estimate (used by L-BFGS-B) also becomes unreliable: a fixed step $\epsilon$ is too coarse for small $x$ and too fine for large $x$.
By reparametrising to $x_{\log} = \log_{10}(x + 1)$, the cost surface becomes more isotropic (similar curvature in all directions) and the gradient step size adapts automatically to the parameter magnitude.
BioSNICAR connection —
`biosnicar/inverse/optimize.py` (line 32) defines:

```python
_LOG_SPACE_PARAMS = {"black_carbon", "snow_algae", "glacier_algae", "dust", "ssa"}
```

The `_to_log()` and `_from_log()` functions (line 35) perform the $\log_{10}(x+1)$ / $10^x - 1$ transformation. Inside `retrieve()`, the optimiser always works in the transformed space:

- Bounds are transformed before passing to scipy.
- The cost function wrapper transforms back to linear before calling the emulator.
- The result is transformed back to linear before returning.

This is transparent to the user — they provide and receive linear values. The only visible effect is better convergence and more meaningful uncertainty estimates for skewed parameters.
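Illustrative stand-ins for the two transforms (the real `_to_log()`/`_from_log()` live in `optimize.py` and may differ in detail):

```python
import numpy as np

def to_log(x):
    """log10(x + 1): maps [0, inf) to [0, inf), with 0 -> 0."""
    return np.log10(np.asarray(x, float) + 1.0)

def from_log(x_log):
    """Inverse transform: 10**x_log - 1."""
    return 10.0 ** np.asarray(x_log, float) - 1.0
```

The round trip `from_log(to_log(x))` recovers `x` to floating-point precision, which is what lets the optimiser work entirely in the transformed space while the user sees only linear values.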
BioSNICAR's `retrieve()` function supports four optimisation methods, each suited to a different regime:
| Method | Type | Pros | Cons | BioSNICAR usage |
|---|---|---|---|---|
| L-BFGS-B | Gradient-based, local | Fast, respects bounds | Can get stuck in local minima | Default for 1-param; local refinement in hybrid |
| Nelder-Mead | Simplex, derivative-free | No gradients needed | Slow in high dimensions; no native bounds | Fallback when gradients are unreliable |
| Differential Evolution | Population-based, global | Finds global minimum | Expensive (thousands of evaluations) | Global pre-search in hybrid |
| Hybrid DE + L-BFGS-B | Two-stage | Best of both worlds | Slightly slower than pure L-BFGS-B | Default for ≥2 parameters |
The hybrid approach is the BioSNICAR default for multi-parameter retrieval: a short DE run (100 iterations, small population) identifies the right valley, then L-BFGS-B converges rapidly to the precise minimum. This avoids the classic failure mode where a gradient method starts in the wrong basin.
The code below demonstrates all four on the same 2-parameter cost surface.
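As a reduced stand-in, the sketch below exercises the hybrid strategy on a synthetic two-basin surface; the basin depths, widths, and starting point are arbitrary choices made for illustration:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# Toy two-basin cost: global minimum near (0, 0), shallower local minimum
# near (3, 3). A gradient method started near the shallow basin tends to
# fall into it; the hybrid DE + L-BFGS-B strategy does not.
def cost(p):
    x, y = p
    return (1.0
            - 1.0 * np.exp(-(x**2 + y**2) / 4.0)               # global basin
            - 0.5 * np.exp(-((x - 3)**2 + (y - 3)**2) / 4.0))  # local basin

bounds = [(-5.0, 5.0), (-5.0, 5.0)]

# Stage 1: short global pre-search, no internal polishing (mirrors polish=False)
de = differential_evolution(cost, bounds, maxiter=100, popsize=10,
                            seed=0, polish=False)

# Stage 2: local gradient-based refinement from the DE candidate
hybrid = minimize(cost, de.x, method="L-BFGS-B", bounds=bounds)

# For comparison: L-BFGS-B alone from a bad start, which typically
# converges to the nearby shallow basin instead (try it).
local_only = minimize(cost, [4.0, 4.0], method="L-BFGS-B", bounds=bounds)
```

DE only needs to land anywhere in the correct valley; L-BFGS-B then does the precise convergence, which is exactly the division of labour the text describes.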
Why this works so well: DE identifies the right valley quickly (it doesn't need to converge precisely, just get close). L-BFGS-B then converges rapidly because it starts near the global minimum, where the cost surface is approximately quadratic and the gradient points straight to the bottom.
BioSNICAR connection — This is the default strategy for ≥2 free parameters in `retrieve()` (line 445):

```python
de_result = differential_evolution(cost_fn, bounds, maxiter=100, popsize=10,
                                   seed=seed, polish=False)
x0_refined = de_result.x if de_result.fun < cost_fn(x0) else x0
result = minimize(cost_fn, x0_refined, method="L-BFGS-B", bounds=bounds)
```

The `polish=False` flag tells scipy's DE not to run its own internal L-BFGS-B polish — BioSNICAR runs a separate L-BFGS-B step with explicit control over tolerances and callback logging. For single-parameter retrievals, the hybrid is unnecessary — the 1-D cost surface has no local minima, so L-BFGS-B alone suffices.
Exercise 4: Create a cost surface with two local minima (e.g. add a secondary Gaussian dip to the toy spectrum). Show that L-BFGS-B alone finds the wrong minimum depending on x0, while the hybrid always finds the global minimum.
BioSNICAR connection —
`biosnicar/inverse/optimize.py` dispatches to the appropriate method based on the `method` parameter: L-BFGS-B (line 424), Nelder-Mead (line 426), differential evolution (line 338), or the default hybrid (line 445, which chains a quick DE pre-search into L-BFGS-B). The cost function is either `spectral_cost()` (cost.py:19) or `band_cost()` (cost.py:76).
A point estimate without an error bar is incomplete. BioSNICAR provides two approaches to uncertainty: a fast Hessian-based Gaussian approximation, and full posterior sampling with MCMC.
The curvature of the cost function at the optimum encodes how tightly the data constrain each parameter. The Hessian matrix (matrix of second derivatives) captures this curvature; its inverse is the covariance matrix of the parameter estimates:
$$\sigma_k^2 \approx \left[H^{-1}\right]_{kk}, \quad H_{jk} = \frac{\partial^2 J}{\partial \theta_j \, \partial \theta_k}$$

This is fast (one evaluation of the Hessian at the optimum) and works well when the posterior is approximately Gaussian — i.e., when the cost surface is parabolic near the minimum. It breaks down for degenerate problems (elongated valleys) or multimodal posteriors.
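A sketch of the idea, not the BioSNICAR implementation: estimate the Hessian by central finite differences and invert it. One convention detail is made explicit here: for a chi-squared cost the negative log-likelihood is $J/2$, so the code inverts $H/2$; with $H$ defined directly on the negative log-likelihood this reduces to the formula above.

```python
import numpy as np

def hessian_sigma(cost, theta_opt, eps=1e-4):
    """1-sigma parameter uncertainties from the curvature at the optimum.

    `cost` is a chi-squared misfit; its Hessian is estimated with central
    finite differences, halved to get the negative log-likelihood Hessian,
    and inverted to give the Gaussian-approximation covariance."""
    theta_opt = np.asarray(theta_opt, float)
    n = len(theta_opt)
    H = np.zeros((n, n))
    for j in range(n):
        for k in range(n):
            ej = np.zeros(n); ej[j] = eps
            ek = np.zeros(n); ek[k] = eps
            H[j, k] = (cost(theta_opt + ej + ek) - cost(theta_opt + ej - ek)
                       - cost(theta_opt - ej + ek) + cost(theta_opt - ej - ek)
                       ) / (4 * eps**2)
    cov = np.linalg.inv(H / 2.0)   # covariance of the Gaussian approximation
    return np.sqrt(np.diag(cov))

# Quadratic chi-squared in 2 parameters: exact sigmas are 0.1 and 0.2
sigmas = hessian_sigma(lambda t: (t[0] / 0.1)**2 + (t[1] / 0.2)**2, [0.0, 0.0])
```

On an exactly quadratic cost the finite-difference Hessian is exact up to roundoff, which is why the recovered sigmas match the scales built into the toy cost.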
When the Hessian approximation is insufficient — typically for 3+
parameter problems or when degeneracies create non-Gaussian posteriors —
Markov Chain Monte Carlo sampling maps the full posterior distribution.
BioSNICAR uses the emcee ensemble sampler, which runs multiple "walkers"
in parallel to explore parameter space efficiently.
MCMC produces chains of parameter samples whose density is proportional to the posterior probability. From these chains you can extract:
The cost: MCMC requires thousands of forward evaluations — feasible only with the emulator (~50 µs each), not the full solver (~0.5 s each).
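For intuition, a single-chain random-walk Metropolis sampler (much simpler than the ensemble sampler BioSNICAR uses, and with an arbitrary Gaussian toy posterior) fits in a dozen lines:

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_steps, step=0.1, seed=None):
    """Random-walk Metropolis: Gaussian proposals, accept/reject by the
    log-posterior ratio. A single-chain stand-in for the ensemble sampler."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    lp = log_post(x)
    chain = np.empty((n_steps, len(x)))
    for i in range(n_steps):
        prop = x + rng.normal(scale=step, size=len(x))
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        chain[i] = x                                # rejected moves repeat x
    return chain

# Sample a 2-D Gaussian posterior with sigmas (1.0, 0.5)
log_post = lambda t: -0.5 * (t[0]**2 / 1.0**2 + t[1]**2 / 0.5**2)
chain = metropolis_hastings(log_post, [0.0, 0.0], 20_000, step=0.5, seed=0)
```

After discarding burn-in, the sample standard deviations recover the posterior sigmas, and histograms of the chain recover the marginal distributions, with no Gaussian assumption required.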
BioSNICAR connection —
`biosnicar/inverse/optimize.py:_run_mcmc()` implements MCMC using `emcee.EnsembleSampler`. If emcee is not installed, it raises `ImportError`. The ensemble sampler uses multiple "walkers" that communicate — this is much more efficient than single-chain Metropolis-Hastings because the walkers can explore the posterior in parallel and adapt their step sizes based on the ensemble spread. The MCMC chain is stored in `RetrievalResult.chains`, and the corner plot can be generated directly from the result object.

The default Hessian uncertainty is always computed (it's cheap). MCMC is run when `method="mcmc"` is passed to `retrieve()`, and provides a more complete picture of the posterior — especially important for degenerate parameter combinations like grain size vs density. The default is 32 walkers and 2,000 steps (plus 500 burn-in).
Exercise 5: Run the Metropolis-Hastings sampler on the Rosenbrock (banana) function from Section 18 with 100,000 steps. Compare the resulting chain to the Hessian ellipse — the MCMC should capture the curved shape that the Hessian misses.
Retrieved parameters (rds, ρ, BC) are often intermediate — the quantity of interest is a derived value like SSA = 3(1 − ρ/917) / (r·ρ). Uncertainty must be propagated through the derivation.
BioSNICAR handles this automatically: RetrievalResult.ssa_uncertainty
(result.py:137) uses first-order error propagation from the Hessian, and
MCMC chains can be transformed directly (compute SSA for each sample,
take the standard deviation).
An additional subtlety: parameters optimised in log₁₀(x+1) space have
uncertainties in log space. The back-transform σ_linear ≈ (x+1)·ln(10)·σ_log
is applied automatically by retrieve() (optimize.py:354–358).
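The back-transform is one line; `sigma_to_linear` is an illustrative name, not the BioSNICAR function:

```python
import numpy as np

def sigma_to_linear(x_linear, sigma_log):
    """First-order propagation of a log10(x+1)-space uncertainty back to
    linear units: since x = 10**L - 1, dx/dL = (x + 1) * ln(10), so
    sigma_linear ~ (x + 1) * ln(10) * sigma_log."""
    return (x_linear + 1.0) * np.log(10.0) * sigma_log

# e.g. x = 999 with sigma_log = 0.01 gives sigma_linear ~ 23.0
sig = sigma_to_linear(999.0, 0.01)
```

The factor $(x+1)\ln 10$ is just the derivative of the inverse transform, so the error bar grows in proportion to the retrieved value, as expected for a log-scaled parameter.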
BioSNICAR connection — Hessian uncertainty is computed by
`_hessian_uncertainty()` (optimize.py:582) using finite-difference second derivatives. MCMC uses `_run_mcmc()` (optimize.py:519) wrapping `emcee.EnsembleSampler`. SSA error propagation is in `_ssa_uncertainty()` (result.py:39). Log-space back-transform is at optimize.py:354–358.
We now compare all three uncertainty methods on the same problem to understand when each is appropriate. The key question is: does the Hessian (fast, cheap, Gaussian assumption) agree with MCMC (slow, asymptotically exact, no Gaussian assumption)? If yes, use the Hessian and save time. If not, the posterior is non-Gaussian and MCMC is needed.
For this well-behaved (near-Gaussian) problem, Hessian and MCMC agree closely — the MCMC contours overlap the Hessian ellipse. This is the best-case scenario for the Hessian: the cost surface is approximately quadratic near the minimum, so the Gaussian approximation holds.
| Method | Speed | Non-Gaussian? | Correlations? | When to use |
|---|---|---|---|---|
| Hessian | Fast (~ms) | No | Yes (local) | Default; first check |
| MCMC | Slow (~min) | Yes | Yes (global) | Degenerate params, multi-modal |
| MC propagation | Medium (~s) | N/A | Inherited | Derived quantities |
Practical rule: Always compute Hessian uncertainty (it's nearly free). If you suspect non-Gaussianity (e.g., strong grain size–density degeneracy, or band-mode retrieval with few observations), run MCMC to verify. If the MCMC and Hessian agree, trust the Hessian for future retrievals with similar settings.
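The Hessian route itself is easy to demonstrate on a toy cost where the answer is known exactly. The sketch below builds a central finite-difference Hessian of a Gaussian negative log-likelihood and inverts it to recover the covariance; it illustrates the same idea as `_hessian_uncertainty()` but is not that function's actual code:

```python
import numpy as np

# Toy negative log-likelihood: a Gaussian with known covariance,
# so the Hessian route has an exact answer to check against.
true_cov = np.array([[0.04, 0.018], [0.018, 0.01]])
prec = np.linalg.inv(true_cov)
mode = np.array([0.5, 0.35])

def nll(p):
    d = p - mode
    return 0.5 * d @ prec @ d

def finite_diff_hessian(f, p, h=1e-4):
    """Central finite-difference Hessian of f at point p."""
    n = len(p)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

H = finite_diff_hessian(nll, mode)
cov = np.linalg.inv(H)           # covariance = inverse Hessian of the NLL
sigmas = np.sqrt(np.diag(cov))
print(sigmas)                    # should recover sqrt(diag(true_cov))
```

For this quadratic cost the recovery is essentially exact; for a real retrieval the same recipe gives the local Gaussian approximation around the minimum, which is precisely what the MCMC comparison above is testing.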
BioSNICAR connection —
`retrieve()` always computes Hessian uncertainty. MCMC is available via `method="mcmc"` and uses `emcee.EnsembleSampler` with 32 walkers and 2,000 steps per walker (plus 500 burn-in, by default). For a typical 4-parameter retrieval, this takes ~10 seconds with the emulator (compared to ~70 hours with the full forward model).
Exercise 6: Modify the toy spectrum to have a cost surface with two modes. Compare Hessian and MCMC uncertainty estimates — the Hessian should give a single ellipse while MCMC captures both modes.
The previous acts developed each component in isolation. This act
assembles them into a complete retrieval pipeline — the same sequence
used by BioSNICAR's retrieve() function — and validates it end-to-end.
A complete retrieval consists of six steps, from sampling the training set through to uncertainty estimation (with the trained emulator saved to a `.npz` file along the way). Let us run this pipeline on our toy problem with known truth.
The retrieval successfully recovers both parameters within 3σ uncertainty. This validates the end-to-end pipeline:
| Step | Method | Section | BioSNICAR code |
|---|---|---|---|
| Sampling | LHS + log-space | 2, 4 | emulator.py:_latin_hypercube |
| Compression | PCA (99.9% variance) | 6 | emulator.py:PCA(n_components=0.999) |
| Emulation | MLP (128, 128, 64) | 7, 8 | emulator.py:Emulator.build |
| Optimisation | Hybrid DE + L-BFGS-B | 15, 16 | optimize.py:retrieve |
| Uncertainty | Hessian finite-diff | 18 | optimize.py:_hessian_uncertainty |
Each technique was motivated by a specific physical or computational challenge in snow/ice retrieval, and the whole is greater than the sum of its parts: the emulator makes the optimiser fast, the log transform makes the optimiser accurate, and the Hessian makes the uncertainty cheap.
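The whole loop can be caricatured in pure NumPy. The sketch below substitutes a toy piece for each stage: a fake one-parameter forward model, a uniform training sweep in log space instead of LHS, linear interpolation instead of PCA + MLP, a grid search instead of the hybrid optimiser, and a second-difference Hessian for the uncertainty. Everything here is illustrative, not BioSNICAR code:

```python
import numpy as np

rng = np.random.default_rng(1)
wvl = np.linspace(0.4, 2.5, 120)             # wavelengths (micron)

def forward(log_r):
    """Toy 'radiative transfer': albedo decays faster for larger grains."""
    r = 10 ** log_r                           # grain radius in micron
    return np.exp(-np.sqrt(r * 1e-6) * (wvl ** 2) * 40)

# 1. Sample parameter space in log space (stand-in for LHS, 1-D here)
train_x = np.linspace(np.log10(50), np.log10(2000), 40)
train_y = np.array([forward(x) for x in train_x])

# 2-3. "Emulator": linear interpolation between training spectra
def emulate(log_r):
    return np.array([np.interp(log_r, train_x, train_y[:, k])
                     for k in range(len(wvl))])

# 4. Synthetic observation with noise; truth is r = 400 micron
truth = np.log10(400)
obs = forward(truth) + rng.normal(0, 0.005, wvl.size)

# 5. Optimise a chi-squared cost by coarse grid search
def cost(log_r):
    return np.sum((emulate(log_r) - obs) ** 2 / 0.005 ** 2)

grid = np.linspace(np.log10(50), np.log10(2000), 400)
best = grid[np.argmin([cost(g) for g in grid])]

# 6. Hessian (1-D second difference) -> 1-sigma uncertainty
h = 1e-3
d2 = (cost(best + h) - 2 * cost(best) + cost(best - h)) / h ** 2
sigma = np.sqrt(2.0 / d2)        # variance = 2 / C'' for a chi-squared cost
print(f"retrieved r = {10 ** best:.0f} um (truth 400), "
      f"sigma_log10 = {sigma:.4f}")
```

Even in this caricature the structure of the real pipeline is visible: the emulator is only ever queried inside `cost()`, so its speed directly sets how affordable the optimisation and uncertainty steps are.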
The BioSNICAR inverse module provides sensible defaults, but understanding why those defaults were chosen helps you adapt to new situations.
| Situation | Sampling | Surrogate | Optimiser | Uncertainty |
|---|---|---|---|---|
| 1–2 params, fast model | Grid | None needed | L-BFGS-B | Hessian |
| 3–5 params, slow model | LHS | PCA + MLP | Hybrid DE + L-BFGS-B | Hessian |
| >5 params or non-Gaussian | LHS (log) | PCA + MLP | Hybrid | MCMC |
| Satellite bands (≤5 obs) | LHS (log) | PCA + MLP | Hybrid | MCMC* |
| Noisy/discontinuous cost | LHS | PCA + MLP | Nelder-Mead or DE | MCMC |
* With only 5 broadband observations, you cannot constrain more than ~3
free parameters. Fix the rest (e.g., density, layer thickness) via
fixed_params in retrieve().
This section requires biosnicar to be installed. If available, it
demonstrates a real retrieval using the full radiative transfer model:
This closes the loop from toy examples to production code — every concept from the previous 24 sections is used here on real physics.
This notebook taught every data-science technique underlying the BioSNICAR emulator and inversion scheme, from first principles:
| # | Technique | Physical motivation | BioSNICAR module |
|---|---|---|---|
| 1 | Latin Hypercube Sampling | Explore 4–8D parameter space efficiently | emulator.py:_latin_hypercube |
| 2 | Log-space sampling | Cover clean-ice through polluted-ice regimes | emulator.py:_LOG_SAMPLE_PARAMS |
| 3 | PCA compression | Exploit spectral band correlation to reduce output dimension | emulator.py:PCA(n_components=0.999) |
| 4 | MLP neural network | Fast, smooth surrogate for iterative optimisation | emulator.py:MLPRegressor(128,128,64) |
| 5 | Pure-NumPy forward pass | Deploy emulator without scikit-learn | emulator.py:_forward_pass |
| 6 | Chi-squared cost | Weight spectral bands by measurement confidence | cost.py:spectral_cost |
| 7 | Log-space optimisation | Isotropic cost surface for skewed parameters | optimize.py:_to_log/_from_log |
| 8 | L-BFGS-B | Fast gradient-based refinement with bounds | optimize.py (via scipy) |
| 9 | Nelder-Mead | Derivative-free fallback for noisy surfaces | optimize.py (penalty bounds) |
| 10 | Differential Evolution | Global search avoiding local minima | optimize.py (DE pre-search) |
| 11 | Hybrid DE + L-BFGS-B | Best of global + local for multi-param retrieval | optimize.py:retrieve |
| 12 | Hessian uncertainty | Cheap uncertainty from cost curvature | optimize.py:_hessian_uncertainty |
| 13 | MCMC sampling | Full posterior for non-Gaussian cases | optimize.py:_run_mcmc |
| 14 | Error propagation | Uncertainty on derived quantities (SSA) | result.py:_ssa_uncertainty |
The fundamental challenge is this: a physically rigorous radiative transfer model is too slow for iterative inversion (~0.5 s × millions of evaluations). The entire data science pipeline exists to make inversion tractable:
Each technique was demonstrated on a toy problem that captures the essential physics of snow/ice spectroscopy, then connected to the exact BioSNICAR source code.
- `docs/INVERSION.md`
- `biosnicar.inverse.retrieve()` on synthetic and real observations
- `Emulator.build()`