
Data Science for Snow & Ice Spectral Modelling

A Training Manual: From Sampling to Inversion

Why this notebook exists

Satellite remote sensing of snow and ice requires solving an inverse problem: given a measured reflectance spectrum, determine the physical properties of the surface — grain size, impurity loading, density, layer structure. The forward direction is handled by radiative transfer models like BioSNICAR, which solve the equations of light scattering and absorption through a layered snowpack. But forward models are too slow (~0.5 s per evaluation) for the millions of evaluations needed during optimisation and uncertainty quantification.

This notebook teaches every data-science concept behind the BioSNICAR emulator and inversion scheme, from first principles, using toy examples that run without BioSNICAR installed. Each technique is motivated by a concrete physical problem, demonstrated on a toy snow/ice spectrum, and then connected to the exact line of BioSNICAR source code where it is used.

Structure

| Act | Theme | Sections | Core question |
|-----|-------|----------|---------------|
| I | Sampling & the Curse of Dimensionality | 1–4 | How do we efficiently explore parameter space? |
| II | Dimensionality Reduction & Emulation | 5–9 | How do we make the forward model fast enough? |
| III | Cost Functions & Optimisation | 10–16 | How do we find the best-fit parameters? |
| IV | Uncertainty Quantification | 17–22 | How confident should we be in the answer? |
| V | Putting It All Together | 23–25 | Full retrieval pipeline, end to end |
How to use this page
You're reading a static version of a BioSNICAR tutorial notebook. All the text, equations, and figures are here, but the code cells are hidden and you can't edit or re-run them.

To experiment with the code — change parameters, re-run cells, and see how the results change — open the full interactive notebook instead:
  • View on GitHub — browse the notebook with code and outputs, or download the .ipynb file
  • Run locally: git clone https://github.com/jmcook1186/biosnicar-py.git && cd biosnicar-py/notebooks && jupyter notebook emulator_and_inversions.ipynb

Act I: Sampling & the Curse of Dimensionality

Section 1 — The Forward-Model Problem

What is a forward model?

A forward model is a function that maps physical parameters (things we want to know) to observables (things we can measure). In cryospheric remote sensing the parameters describe the snowpack — grain size, density, impurity concentration, layer thickness — and the observable is the spectral albedo: the fraction of incoming sunlight reflected at each wavelength.

$$\alpha(\lambda) = f(r_{\mathrm{eff}},\; \rho,\; c_{\mathrm{imp}},\; \mathrm{d}z,\; \theta_s,\; \ldots)$$

where $r_{\mathrm{eff}}$ is effective grain radius, $\rho$ is density, $c_{\mathrm{imp}}$ is impurity concentration, $\mathrm{d}z$ is layer thickness, and $\theta_s$ is solar zenith angle.

Why does snow albedo look the way it does?

Snow and ice albedo is controlled by two competing processes:

  1. Scattering at grain boundaries redirects photons back out of the snowpack. More grain boundaries (finer snow) = more scattering = higher albedo.
  2. Absorption by ice and impurities removes photons. Ice absorption increases steeply with wavelength above ~0.7 µm, creating the characteristic NIR rolloff and the discrete water-ice absorption bands at ~1.03, 1.27, 1.50, and 2.0 µm.

Grain size controls the path length between scattering events. Coarser grains mean longer paths through ice, so more absorption in the NIR — but the visible is barely affected because ice is almost transparent at those wavelengths.

Impurities (black carbon, dust, algae) absorb strongly in the visible but are spectrally flat or decreasing into the NIR. Their effect is concentrated below ~0.7 µm.

The BioSNICAR forward model

BioSNICAR solves this physics rigorously via a multi-layer adding-doubling radiative transfer scheme:

  1. Pre-computed Mie/geometric-optics lookup tables provide single-scattering properties (extinction, SSA, asymmetry parameter) from grain size and refractive indices (biosnicar/optical_properties/column_OPs.py:get_layer_OPs)
  2. Impurity mixing adds absorbers using mass-absorption coefficients (biosnicar/optical_properties/column_OPs.py:mix_in_impurities)
  3. Adding-doubling solver propagates light through layers with Fresnel reflection at the surface (biosnicar/rt_solvers/adding_doubling_solver.py)
  4. Orchestrator ties it together (biosnicar/drivers/run_model.py:run_model)

A single evaluation takes ~0.5 seconds. This is fine for one-off predictions but far too slow for inverse problems (which need millions of evaluations). The entire purpose of Acts I–II is to build a fast surrogate (emulator) that replaces this 0.5 s call with a ~50 µs neural network prediction.

Our toy spectrum

We define a simplified 2-parameter function that captures the essential spectral physics:

  • Flat, high visible region (~0.3–0.7 µm): clean snow reflects >95 %
  • Steep NIR rolloff beyond 0.7 µm due to increasing ice absorption
  • Water-ice absorption bands at ~1.03, 1.27, 1.50, and 2.0 µm
  • Grain-size effect: coarser grains deepen NIR absorption (VIS barely changes)
  • Impurity effect: darkens the visible while barely touching the NIR
Wavelength grid: 50 bands, 0.3–2.5 µm
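A toy function with these ingredients might look like the following sketch. The functional form here is invented purely for illustration — it mimics the five features listed above but is not the notebook's hidden cell, nor real radiative transfer physics:

```python
import numpy as np

def toy_albedo(grain_size, impurity, n_bands=50):
    """Toy 2-parameter snow albedo spectrum (illustrative sketch only).

    grain_size, impurity: both scaled to [0, 1].
    """
    wl = np.linspace(0.3, 2.5, n_bands)             # wavelength grid, um
    # Steep NIR rolloff beyond ~0.7 um, deepened by coarser grains
    rolloff = (0.3 + 0.6 * grain_size) / (1 + np.exp(-8 * (wl - 1.1)))
    # Water-ice absorption bands at ~1.03, 1.27, 1.50, 2.0 um
    bands = sum(d * np.exp(-0.5 * ((wl - c) / 0.04) ** 2)
                for c, d in [(1.03, 0.05), (1.27, 0.08), (1.5, 0.15), (2.0, 0.10)])
    # Impurities darken only the visible (< ~0.7 um)
    vis = 0.4 * impurity * np.clip((0.7 - wl) / 0.4, 0, 1)
    return wl, np.clip(0.98 - rolloff - (0.5 + 0.5 * grain_size) * bands - vis, 0, 1)

wl, albedo = toy_albedo(grain_size=0.3, impurity=0.1)
```

Varying `grain_size` moves only the NIR end of the curve, while varying `impurity` moves only the visible plateau — the same separation of effects the real model exhibits.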

BioSNICAR connection — The real forward model lives in biosnicar/drivers/run_model.py. It accepts keyword overrides like rds (grain radius, µm), rho (density, kg m⁻³), black_carbon (ppb), snow_algae (cells mL⁻¹), and solzen (solar zenith, degrees). It returns an Outputs object with .albedo (480-band spectral albedo) and .BBA (broadband albedo).

The pipeline is: setup_snicar() → get_layer_OPs() → mix_in_impurities() → adding_doubling_solver(). Each step has a direct physical interpretation: Mie theory for grain optics, MAC mixing for impurities, and the adding-doubling method for multi-layer radiative transfer with Fresnel surface reflectance.

Our toy function replaces this entire pipeline with a parametric approximation that runs in microseconds — fast enough for interactive exploration.

Section 2 — Latin Hypercube Sampling

The training data problem

To build an emulator (Section 8), we need to evaluate the forward model at a set of training points scattered across the parameter space. The quality of the emulator depends critically on how well those points cover the space.

There are three natural strategies:

| Strategy | Idea | Trade-off |
|----------|------|-----------|
| Random | Uniform i.i.d. draws | Gaps and clusters — some regions are poorly sampled |
| Grid | Regular lattice | Explodes exponentially in high dimensions (Section 3) |
| LHS | Stratified per dimension | Space-filling and cheap; linear cost in $d$ |

How Latin Hypercube Sampling works

The key idea is borrowed from stratified sampling in statistics. For $n$ samples in $d$ dimensions:

  1. Divide each dimension into $n$ equal strata (bins).
  2. For each dimension independently, place exactly one sample in each stratum at a random position within it.
  3. Randomly associate the per-dimension positions across dimensions.

This guarantees that the marginal projection onto any single dimension is perfectly uniform — there is exactly one point in every bin. No clustering, no gaps, regardless of $n$ or $d$.

Formally, for dimension $j$, sample $i$ is placed at:

$$x_{ij} = \frac{\pi_j(i) + u_{ij}}{n}, \quad u_{ij} \sim \mathrm{Uniform}(0,1)$$

where $\pi_j$ is a random permutation of $\{0, 1, \ldots, n-1\}$.

We implement this from scratch below, matching the algorithm in biosnicar/emulator.py:_latin_hypercube() (line 20).
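A minimal from-scratch sketch of that algorithm (the real implementation is `_latin_hypercube()` in biosnicar/emulator.py; details such as the RNG interface may differ):

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, seed=None):
    """Minimal LHS in the unit hypercube [0, 1]^d."""
    rng = np.random.default_rng(seed)
    samples = np.empty((n_samples, n_dims))
    for j in range(n_dims):
        perm = rng.permutation(n_samples)           # random stratum assignment
        u = rng.random(n_samples)                   # jitter within each stratum
        samples[:, j] = (perm + u) / n_samples      # x_ij = (pi_j(i) + u_ij) / n
    return samples

pts = latin_hypercube(20, 2, seed=0)
# Marginal check: exactly one point in each of the 20 strata per dimension
for j in range(2):
    strata = np.sort((pts[:, j] * 20).astype(int))
    assert np.array_equal(strata, np.arange(20))
```

The final check is the defining LHS property: projecting onto any single axis gives one point per bin, no matter how the strata were paired across dimensions.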

Shape: (20, 2),  range: [0.023, 0.985]

Key insight: LHS guarantees that each row and column of the stratified grid contains exactly one sample — so marginal projections are perfectly uniform, while the interior is space-filling. This is analogous to a Latin square in combinatorics (hence the name): no two samples share the same row or column of any 2-D projection.

Why not just random sampling?

Pure random sampling suffers from clustering and gaps that become worse in higher dimensions. LHS eliminates marginal gaps by construction. While LHS doesn't guarantee optimal joint coverage (that would require more expensive designs like maximin or minimax), it provides an excellent balance of coverage quality and computational simplicity.

Why not a regular grid?

A grid provides perfect coverage but costs $k^d$ evaluations for $k$ points per dimension. For 8 parameters (a typical BioSNICAR emulator), even a modest 20 points per dimension requires $20^8 \approx 25.6$ billion forward-model evaluations — each taking ~0.5 s. Section 3 quantifies this "curse of dimensionality" in detail.

BioSNICAR connectionbiosnicar/emulator.py:_latin_hypercube() (line 20) uses this exact algorithm. It takes n_samples and n_dims (derived from the length of the params dictionary passed to Emulator.build()), and returns a matrix of samples in $[0,1]^d$. These are then rescaled to physical bounds (e.g. grain radius 100–5000 µm, black carbon 0–500,000 ppb) before evaluating the forward model on each sample.

Section 3 — Curse of Dimensionality

The fundamental scaling problem

Richard Bellman coined the phrase "curse of dimensionality" in 1961 to describe a pervasive problem: the volume of a space grows exponentially with the number of dimensions. A regular grid with $k$ points per dimension needs $k^d$ total samples. This means:

| Dimensions ($d$) | Grid points ($20^d$) | At 0.5 s each |
|------------------|----------------------|---------------|
| 2 | 400 | 3 minutes |
| 4 | 160,000 | 22 hours |
| 6 | 64,000,000 | 370 days |
| 8 | 25,600,000,000 | 406 years |

The BioSNICAR emulator typically has 4–8 free parameters (grain radius, density, layer thickness, solar zenith angle, plus up to 4 impurity concentrations). Grid-based training is completely infeasible.

Why LHS breaks the curse

LHS decouples sample count from dimensionality. Whether we have 2 parameters or 20, we can choose any fixed budget $n$ (say 2,000 samples) and still get uniform marginal coverage. The cost is $O(n)$ forward model evaluations regardless of $d$. Of course, covering a high-dimensional space with finite samples inevitably leaves gaps — but LHS ensures those gaps are distributed as evenly as possible, rather than concentrating in low-probability corners.

Grid at d=8: 25,600,000,000 samples  vs  LHS: 500 samples

BioSNICAR connectionEmulator.build() in biosnicar/emulator.py (line 261) uses LHS to generate training samples. The default is typically 2,000–10,000 samples regardless of dimensionality. At ~0.5 s per forward model evaluation, 2,000 samples take ~17 minutes — compared to the 406 years needed for an 8-D grid. This is why LHS is essential, not optional.

The emulator parameters are defined as a dictionary passed to build():

emu = Emulator.build(
    params={"rds": (100, 5000), "rho": (200, 917),
            "black_carbon": (0, 500000), "solzen": (0, 89)},
    n_samples=2000,
)

Each key becomes one dimension of the LHS design.

Section 4 — Log-Space Sampling for Skewed Parameters

The clean-ice problem

Not all parameters are created equal. Grain radius might range from 100 to 5,000 µm — a factor of 50. But impurity concentrations span from 0 to 500,000 ppb — a factor of 500,000. If we sample uniformly in linear space, the distribution of training points is disastrously skewed:

  • A uniform draw from $[0, 500000]$ has only a 0.02% chance of landing below 100 ppb.
  • But clean-ice scenarios (concentrations near zero) are extremely common in nature and critical for accurate retrieval — a satellite image of the Greenland ice sheet is mostly clean ice.

We need training data that represents the full dynamic range, with adequate resolution at both the low and high ends.

The log-transform solution

Sampling in $\log_{10}(x+1)$ space maps the range $[0, 500000]$ onto $[0, 5.7]$. A uniform draw in this compressed space gives roughly equal density per order of magnitude in the original space — far more points near zero, which is exactly what we need.

$$x_{\mathrm{log}} \sim \mathrm{Uniform}\big(\log_{10}(x_{\min}+1),\; \log_{10}(x_{\max}+1)\big)$$

$$x_{\mathrm{linear}} = 10^{x_{\mathrm{log}}} - 1$$

The $+1$ offset ensures that $x=0$ maps to $\log_{10}(1) = 0$ rather than $-\infty$, so the transformation is well-defined for physically meaningful zero-concentration scenarios.
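The transform pair can be sketched in a few lines (illustrative only; the bounds are the black-carbon range quoted above):

```python
import numpy as np

rng = np.random.default_rng(42)
x_min, x_max = 0.0, 500_000.0

lo, hi = np.log10(x_min + 1), np.log10(x_max + 1)   # [0, ~5.7]
x_log = rng.uniform(lo, hi, size=10_000)            # uniform in log space
x_linear = 10 ** x_log - 1                          # invert back to ppb

# Roughly equal density per decade: ~35% of draws land below 100 ppb,
# versus 0.02% for a uniform draw in linear space
frac_below_100 = float((x_linear < 100).mean())
```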

Points below 100 ppb — Linear: 0.1%,  Log-space: 35.8%

BioSNICAR connectionbiosnicar/emulator.py (line 68) defines:

_LOG_SAMPLE_PARAMS = {"black_carbon", "snow_algae", "glacier_algae", "dust"}

When Emulator.build() generates LHS samples, any parameter whose name appears in _LOG_SAMPLE_PARAMS is sampled in $\log_{10}(x+1)$ space (line 331). This ensures that the training set includes adequate clean-ice scenarios alongside heavily polluted ones.

Parameters not in this set (like rds, rho, dz, solzen) are sampled linearly because their ranges are moderate (e.g. $r$ from 100 to 5,000 µm — only a factor of 50) and there is no special need to oversample the low end.

The same log transform reappears in the optimiser (Section 12) where _LOG_SPACE_PARAMS controls which parameters are optimised in log space.


Exercise 1: Modify latin_hypercube to accept a dictionary of bounds and a set of log-space parameter names. Generate 200 samples for {"grain_size": (0, 1), "impurity": (0, 500000)} where impurity is log-sampled. Plot the 2-D scatter to confirm the grain axis is uniform while the impurity axis is log-distributed.


Act II: Dimensionality Reduction & Emulation

The forward model produces a 50-band spectrum (480 in real BioSNICAR). Training an emulator to predict all 50 bands independently would be wasteful — most of the information is redundant. This act shows how to compress the output using PCA and then predict only the compressed representation with a neural network.

Section 5 — Why Compress Spectra? Band Correlation

The redundancy in spectral data

A snow albedo spectrum is not 50 independent numbers. Adjacent wavelength bands are driven by the same underlying physics — ice absorption varies smoothly with wavelength, and a change in grain size affects a broad swath of the spectrum, not individual bands. This means neighbouring bands carry nearly identical information.

Formally, if we collect many spectra (varying the input parameters) and compute the Pearson correlation between every pair of bands, we find large blocks of nearly perfect correlation ($r \approx 1$). The intrinsic dimensionality of the spectral output is much lower than the number of bands — typically 3–10 independent "modes" suffice to describe 99.9% of the variance.

This is why we compress: instead of training a neural network to predict 50 (or 480) correlated outputs, we project onto 3–5 orthogonal principal components and predict those. This is faster, more numerically stable, and produces smoother reconstructions.

Spectra matrix: (500, 50)  (samples × bands)

The bright red blocks show groups of highly correlated bands. Bands within the visible plateau (0.3–0.7 µm) are almost perfectly correlated because they are all controlled by the same impurity darkening. The NIR bands beyond 1.0 µm correlate with each other because they are all driven by ice absorption (which depends on grain size).

The anti-correlations (blue) between visible and NIR bands reflect the physical fact that impurities darken the visible while grain size darkens the NIR — these are partially independent effects.

A 50-dimensional output with such strong inter-band correlation can be faithfully represented in far fewer dimensions — that is the job of Principal Component Analysis (PCA).

Section 6 — PCA from First Principles

What is PCA and why do we need it?

Principal Component Analysis is a linear dimensionality reduction technique. It finds the orthogonal directions along which the data varies most, then discards the directions with negligible variance. For spectral data, these directions are physical "modes" of variability — the first PC typically captures the overall brightness level, the second captures the VIS-vs-NIR contrast (related to grain size), etc.

The algorithm

Given a data matrix $\mathbf{X}$ of shape ($n$ samples × $p$ bands):

  1. Centre: $\bar{\mathbf{X}} = \mathbf{X} - \boldsymbol{\mu}$, where $\boldsymbol{\mu}$ is the mean spectrum.
  2. Covariance: $\mathbf{C} = \frac{1}{n-1}\bar{\mathbf{X}}^\top \bar{\mathbf{X}}$ — a $p \times p$ matrix measuring how each pair of bands co-varies.
  3. Eigendecompose: $\mathbf{C} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$ — eigenvalues $\lambda_i$ give variance along each direction; eigenvectors $\mathbf{v}_i$ are the principal component "basis vectors".
  4. Project: $\mathbf{Z} = \bar{\mathbf{X}} \mathbf{V}_k$ — keep only the top-$k$ components (those with the largest eigenvalues).
  5. Reconstruct: $\hat{\mathbf{X}} = \mathbf{Z} \mathbf{V}_k^\top + \boldsymbol{\mu}$

The variance explained by the first $k$ components is $\sum_{i=1}^k \lambda_i \;/\; \sum_{i=1}^p \lambda_i$. If 3 components capture 99.9% of variance, the remaining 47 dimensions are mostly noise.
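Steps 1–5 can be sketched directly in NumPy (an illustrative from-scratch implementation; BioSNICAR itself uses scikit-learn's `PCA`):

```python
import numpy as np

def pca_fit_transform(X, k):
    """PCA via eigendecomposition of the band covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu                                     # 1. centre
    C = Xc.T @ Xc / (len(X) - 1)                    # 2. covariance (p x p)
    eigvals, eigvecs = np.linalg.eigh(C)            # 3. eigendecompose
    order = np.argsort(eigvals)[::-1]               #    sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = Xc @ eigvecs[:, :k]                         # 4. project onto top-k
    explained = eigvals[:k].sum() / eigvals.sum()
    return Z, eigvecs[:, :k], mu, explained

# Rank-2 synthetic "spectra": 2 latent factors, 50 correlated bands
rng = np.random.default_rng(0)
X = rng.random((200, 2)) @ rng.random((2, 50))
Z, Vk, mu, explained = pca_fit_transform(X, k=2)
X_hat = Z @ Vk.T + mu                               # 5. reconstruct
```

Because the synthetic data has only 2 latent factors, 2 components reconstruct it essentially exactly — the same mechanism that lets a handful of components describe a 480-band spectrum.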

Physical interpretation for snow spectra

  • PC1 (dominant mode): the overall albedo level — this correlates strongly with impurity loading.
  • PC2: the VIS-to-NIR slope — this is primarily grain size.
  • PC3+: fine structure (absorption band depths, spectral curvature).

This decomposition is powerful because it separates physically distinct effects into orthogonal components, making the subsequent regression (neural network) much easier.

Top 5 eigenvalues: [0.3713 0.1332 0.0163 0.0039 0.001 ]
Cumulative variance: [0.7055 0.9586 0.9895 0.9968 0.9987]
RMSE over all spectra: 0.010523

Key insight: Just 3–5 principal components capture ~99% or more of the variance in our toy spectra (the cumulative-variance output above reaches 98.9% at 3 components and 99.9% at 5). This dramatic compression is possible because the spectral output is controlled by only 2 underlying parameters (grain size and impurity). For BioSNICAR's 480-band output with more free parameters, typically 5–10 components suffice.

Why PCA and not other methods?

Other dimensionality reduction methods exist (autoencoders, t-SNE, UMAP), but PCA has key advantages for emulation:

  • Linear: reconstruction is a single matrix multiply — extremely fast.
  • Exact inverse: inverse_transform is lossless up to truncation.
  • Interpretable: each PC is a spectral basis vector with physical meaning.
  • No hyperparameters when using variance threshold (e.g. 99.9%).

BioSNICAR connectionbiosnicar/emulator.py (line 395) uses PCA(n_components=0.999) — scikit-learn automatically determines the number of components needed to retain 99.9% of variance. The PCA mean vector and component matrix are saved alongside the neural network weights in the .npz file (line 662), so reconstruction at inference time needs only NumPy — no scikit-learn dependency.


Exercise 2: Try n_components=1 and n_components=5. At what point does the reconstruction RMSE become negligible (<1e-4)? How does this relate to the scree plot?

Section 7 — Neural Networks from Scratch

The regression problem

After PCA, the emulator's job reduces to learning a mapping from a small input (2–8 physical parameters) to a small output (3–10 PCA coefficients). This is a multi-output regression problem, and we solve it with the simplest neural network architecture: the Multi-Layer Perceptron (MLP).

Why a neural network?

The mapping from physical parameters to PCA coefficients is nonlinear — doubling the grain size doesn't simply double the spectral change. The relationship involves exponential absorption, multiple scattering, and nonlinear mixing of optical properties. Linear regression can't capture this. Polynomial regression can, but requires high degree and extrapolates poorly. An MLP with ReLU activations is a universal approximator: given enough hidden units, it can fit any continuous function to arbitrary precision (Leshno et al., 1993).

Anatomy of an MLP

Single neuron: applies a linear transformation followed by a nonlinear activation:

$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

where $\mathbf{w}$ are weights, $b$ is bias, and $\sigma$ is an activation function. With ReLU ($\sigma(z) = \max(0, z)$), the neuron outputs zero for negative inputs and passes positive inputs through.

Layer: many neurons in parallel, each with its own weights: $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$

MLP: layers stacked in sequence. Each layer's output feeds the next. The final layer has no activation (linear output) so the network can predict any real-valued PCA coefficient. The hidden layers use ReLU to introduce nonlinearity — without it, the entire network would collapse to a single linear transformation.

Training

The MLP is trained by backpropagation + gradient descent: compute the mean squared error between predicted and true PCA coefficients, then adjust all weights to reduce the error. Scikit-learn's MLPRegressor handles this automatically with the Adam optimiser and early stopping.
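The mechanism can be sketched from scratch: a single-hidden-layer MLP trained by backpropagation with full-batch gradient descent on a toy 1-D target. This is an illustrative toy only — plain gradient descent rather than the Adam + early stopping that `MLPRegressor` uses:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X)                                   # a nonlinear 1-D target

W1 = rng.normal(0, 0.5, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)
lr = 0.1

for epoch in range(5000):
    # Forward pass: linear -> ReLU -> linear
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)
    y_hat = h @ W2 + b2
    # Backward pass: chain rule from the MSE back through each layer
    d_out = 2.0 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (h_pre > 0)              # ReLU gradient mask
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

y_hat = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
mse = float(np.mean((y_hat - y) ** 2))
```

A purely linear model cannot fit $\sin(3x)$ well; the ReLU hidden layer is what lets the loss fall far below that baseline.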

Training R²: 0.988608
Architecture: (64, 64, 32)
Loss curve length: 104 epochs

BioSNICAR connectionbiosnicar/emulator.py uses MLPRegressor with hidden_layer_sizes=(128, 128, 64) (line 406). This architecture was chosen empirically: 3 hidden layers with decreasing width provide enough capacity to fit the nonlinear parameter-to-spectrum mapping without overfitting on typical training set sizes (1,000–10,000 samples).

At prediction time, the code extracts the raw weight matrices (mlp.coefs_, mlp.intercepts_) and runs a pure-NumPy forward pass (_forward_pass(), line 505). This is critical: it means the saved emulator (.npz file) has no dependency on scikit-learn. Users only need NumPy to load and evaluate the emulator, and the forward pass is just matrix multiplies + ReLU — about 50 µs per evaluation.
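A sketch of such a pure-NumPy forward pass (illustrative — the exact `_forward_pass()` signature in emulator.py may differ):

```python
import numpy as np

def forward_pass(x, weights, biases):
    """MLP inference from raw arrays: ReLU hidden layers, linear output."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)              # hidden layer + ReLU
    return h @ weights[-1] + biases[-1]             # final layer is linear

# Random weights shaped like a (2 -> 64 -> 64 -> 32 -> 3) network
rng = np.random.default_rng(1)
sizes = [2, 64, 64, 32, 3]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]
coeffs = forward_pass(np.ones(2), weights, biases)  # 3 "PCA coefficients"
```

In the real emulator, `weights` and `biases` come from `mlp.coefs_` and `mlp.intercepts_` saved in the `.npz` archive; the whole evaluation is a handful of matrix multiplies.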

Architecture diagram

Input (2)         Hidden 1 (64)      Hidden 2 (64)     Hidden 3 (32)     Output (3 PCA coeffs)
  o─────┐          o──────┐          o──────┐          o──────┐          o  → PCA inverse
  o─────┤──[W1]──▶ o──────┤──[W2]──▶ o──────┤──[W3]──▶ o──────┤──[W4]──▶ o  → transform
         ReLU              ReLU              ReLU              Linear      o  → 50-band albedo

Why ReLU and not sigmoid/tanh?

  • ReLU is computationally trivial: just np.maximum(x, 0).
  • It avoids the "vanishing gradient" problem that plagues sigmoid/tanh in deeper networks.
  • The piecewise-linear nature means the MLP output is a piecewise-linear function of the inputs — smooth enough for spectral modelling, and the gradients are well-defined almost everywhere (important for optimisation in Act III).

Section 8 — Complete Emulator Pipeline

The emulator as a function composition

An emulator is a chain of transformations that replaces the expensive forward model with a fast approximation:

$$\hat{\alpha}(\lambda) = \mathrm{PCA}^{-1}\Big(\mathrm{MLP}\big(\mathrm{scale}(\theta)\big)\Big)$$

The full pipeline is:

  1. LHS generates parameter samples $\{\theta_i\}_{i=1}^n$ covering the space.
  2. Forward model evaluates each sample: $\alpha_i = f(\theta_i)$.
  3. PCA compresses spectra: $z_i = \mathrm{PCA}(\alpha_i)$ — from 50 bands to ~3 coefficients.
  4. MinMaxScaler normalises inputs to $[0,1]$ — this helps MLP training converge.
  5. MLP learns the mapping: $\hat{z} = \mathrm{MLP}(\mathrm{scale}(\theta))$.
  6. Predict: for new $\theta^*$, compute $\hat{z}^*$, then reconstruct $\hat{\alpha}^* = \mathrm{PCA}^{-1}(\hat{z}^*)$.

The class below implements this, matching biosnicar/emulator.py:Emulator.build().
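A condensed version of the pipeline on synthetic data might look like this. The "forward model" and the training design below are stand-ins invented for this sketch — not BioSNICAR's `Emulator.build()`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
wl = np.linspace(0.3, 2.5, 50)

def forward(theta):                                 # theta = (grain, impurity) in [0,1]^2
    g, c = theta
    return np.clip(0.95 - 0.5 * g / (1 + np.exp(-8 * (wl - 1.1)))
                   - 0.3 * c * (wl < 0.7), 0, 1)

theta_train = rng.random((2000, 2))                 # 1-2. stand-in for an LHS design
spectra = np.array([forward(t) for t in theta_train])

pca = PCA(n_components=0.999).fit(spectra)          # 3. compress the spectra
Z = pca.transform(spectra)
scaler = MinMaxScaler().fit(theta_train)            # 4. normalise the inputs
mlp = MLPRegressor(hidden_layer_sizes=(64, 64, 32), max_iter=2000,
                   random_state=0)                  # 5. learn theta -> Z
mlp.fit(scaler.transform(theta_train), Z)

def emulate(theta):                                 # 6. predict + reconstruct
    z = mlp.predict(scaler.transform(np.atleast_2d(theta)))
    return pca.inverse_transform(z)[0]

max_err = float(np.abs(emulate([0.5, 0.5]) - forward([0.5, 0.5])).max())
```

The composition order at prediction time — scale, MLP, inverse PCA — is exactly the equation above read inside out.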

Building toy emulator...
  PCA: 6 components retain 99.94% variance
Done!
Forward model: 109.4 ms for 1000 evaluations (109 µs each)
Emulator:      785.6 ms for 1000 evaluations (786 µs each)
(In BioSNICAR the real forward model is ~0.5 s, so the speedup is ~10,000×)
Test R²: 0.999187,  RMSE: 0.002905

BioSNICAR connectionEmulator.build() follows this exact pipeline. Key implementation details:

  • Weight extraction (line 505): _forward_pass() takes raw NumPy arrays weights and biases extracted from mlp.coefs_ and mlp.intercepts_. This decouples inference from scikit-learn.
  • .npz save format (line 662): the emulator saves PCA components, PCA mean, scaler parameters, MLP weights, MLP biases, and metadata (R², parameter names, bounds) into a single compressed NumPy archive. Loading is np.load() + a few attribute assignments.
  • Portability: a saved .npz file can be shared with collaborators who only have NumPy installed. No retraining, no scikit-learn, no BioSNICAR needed to make predictions.
  • Speed: the real BioSNICAR forward model takes ~0.5 s per evaluation. The emulator forward pass takes ~50 µs — a 10,000× speedup that makes iterative optimisation and MCMC sampling feasible.

Exercise 3: Build a second emulator with only 100 training samples. Compare its R² and RMSE against the 2000-sample version. At what sample count does accuracy saturate?

Section 9 — Why MLP? Comparing Surrogate Methods

The emulator design space

Many regression methods could serve as emulators. The choice matters because the emulator sits in the inner loop of the optimiser — it is evaluated thousands to millions of times during retrieval. The key requirements are:

  1. Smooth output: The optimiser (L-BFGS-B) needs well-defined gradients. Step functions or noisy surfaces cause convergence failures.
  2. Fast prediction: Each optimisation step evaluates the emulator once. MCMC samples may need $10^5$+ evaluations. Prediction must be $O(\mu s)$.
  3. Multi-output: We predict 3–10 PCA coefficients simultaneously.
  4. Scales to $n \approx 10^3$–$10^4$ training samples: Some methods become impractical at these sizes.

| Method | Smooth? | Predict speed | Multi-output | Training cost |
|--------|---------|---------------|--------------|---------------|
| MLP | Yes (ReLU piecewise-linear) | ~50 µs | Natural | Minutes |
| Random Forest | No (step function) | ~100 µs | Per-output | Seconds |
| Gaussian Process | Yes (kernel-smooth) | ~1 ms | Expensive | $O(n^3)$ |
| Polynomial | Yes | ~10 µs | Natural | Instant |

Let us test these head-to-head on our toy problem.

  MLP                   R² = 0.99868  RMSE = 0.00214
  Random Forest         R² = 0.99660  RMSE = 0.00343
  GP (n=200)            R² = 0.99831  RMSE = 0.00242
  Poly (deg 4)          R² = 0.99890  RMSE = 0.00195

Interpreting the results

  • MLP produces a smooth, continuous surface that closely tracks the true function. The ReLU activation creates piecewise-linear segments that are fine enough to appear smooth. Crucially, the gradients are well-defined everywhere (except at ReLU kinks), which is essential for L-BFGS-B.

  • Random Forest produces a step function — it predicts the average of nearby training points within each tree leaf. These jumps create artefacts when used as a cost function target (the optimiser "bounces" between steps rather than converging smoothly). This disqualifies RF as an emulator for gradient-based optimisation.

  • Gaussian Process gives excellent fit quality and provides posterior uncertainty for free (the predictive variance at each point). However, GP training requires inverting the kernel matrix, which costs $O(n^3)$ in time and $O(n^2)$ in memory. At $n = 2000$ this is marginal; at $n = 10000$ it is impractical. Sparse GP approximations exist but add complexity and lose the exact uncertainty guarantee.

  • Polynomial (degree 4 with Ridge regularisation) is fast and smooth but fails for complex, high-dimensional functions. The number of terms grows as $\binom{d + \mathrm{deg}}{\mathrm{deg}}$, and high-degree polynomials oscillate wildly outside the training domain (Runge's phenomenon).
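That combinatorial growth in the polynomial term count is easy to check numerically:

```python
from math import comb

# Number of monomials in a full polynomial of degree <= 4 in d variables:
# C(d + 4, 4)
for d in (2, 4, 8):
    print(f"d={d}: {comb(d + 4, 4)} terms")
# d=2: 15 terms, d=4: 70 terms, d=8: 495 terms
```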

BioSNICAR connection — The emulator uses MLP + PCA because it must: (a) predict 480-band spectra smoothly for gradient-based optimisation, (b) train on 1,000–10,000 samples in reasonable time, (c) run millions of evaluations during MCMC and optimisation. The pure-NumPy forward pass (emulator.py:_forward_pass, line 505) sits on the critical path of every one of those evaluations — it must be as fast as possible.



Act III: Cost Functions & Optimisation

Acts I–II built a fast emulator that predicts spectra from parameters. Now we reverse the arrow: given an observed spectrum (from a satellite or field spectrometer), find the parameters that best explain it. This is the inverse problem, and it is the core of any retrieval scheme.

Section 10 — The Inverse Problem

Forward vs inverse

The forward problem is well-posed: given parameters $\theta$, there is a unique spectrum $\alpha = f(\theta)$. The inverse problem is harder:

  • Non-unique: different parameter combinations may produce similar spectra (e.g., larger grains with more impurities vs smaller grains with fewer).
  • Ill-conditioned: small changes in the observed spectrum can cause large changes in retrieved parameters, especially when noise is present.
  • Noisy observations: real measurements include instrument noise, atmospheric correction residuals, and sub-pixel heterogeneity.

We can't simply "invert" the forward model analytically. Instead, we frame the inverse problem as optimisation: find the parameters $\theta^*$ that minimise the mismatch between the observed spectrum and the model prediction:

$$\theta^* = \arg\min_\theta \; J\big(f(\theta),\; \alpha^{\mathrm{obs}}\big)$$

where $J$ is a cost function (also called loss function or objective function) that measures the misfit. The shape of the cost surface $J(\theta)$ determines how hard the optimisation is and what algorithm we should use.

The cost surface shows a single, well-defined minimum near the true parameters. The elliptical contours suggest the surface is approximately quadratic near the minimum — this is important for both optimisation (gradient methods work well on quadratic surfaces) and uncertainty quantification (the Hessian approximation in Act IV assumes quadratic shape).

Note the elongation of the contours: the cost function is more sensitive to grain size (steeper gradient) than to impurity (shallower gradient). This is because grain size affects a larger portion of the spectrum (the entire NIR) while impurity only affects the visible.

BioSNICAR connectionbiosnicar/inverse/cost.py implements two cost functions:

  • spectral_cost() (line 23): full-spectrum chi-squared over all 480 bands.
  • band_cost() (line 71): chi-squared over satellite band reflectances (e.g., Sentinel-2 B3, B4, B11).

Both compute: $\chi^2 = \sum_i (y_i^{\mathrm{pred}} - y_i^{\mathrm{obs}})^2 / \sigma_i^2$

The retrieve() function in optimize.py dispatches to the appropriate cost function based on whether a platform argument is provided.

Section 11 — Least Squares, Chi-Squared, and Regularisation

Why the cost function matters

The cost function encodes our assumptions about measurement uncertainty and prior knowledge. Different choices lead to different retrieved values and different uncertainty estimates. Getting it right is as important as choosing the optimiser.

Least squares (unweighted)

$$J = \sum_i (y_i^{\mathrm{pred}} - y_i^{\mathrm{obs}})^2$$

Treats all spectral bands equally. Simple, but problematic when bands have different noise levels — a noisy band and a precise band contribute equally, allowing the noisy band to bias the solution.

Chi-squared (uncertainty-weighted)

$$\chi^2 = \sum_i \left(\frac{y_i^{\mathrm{pred}} - y_i^{\mathrm{obs}}}{\sigma_i}\right)^2$$

Each band's contribution is normalised by its measurement uncertainty $\sigma_i$. Bands with large uncertainty (e.g. in strong water absorption regions where the signal is weak) are down-weighted. At the true parameters, $\chi^2 \approx n$ (the number of bands) for a good fit.

This is the maximum likelihood solution under the assumption that measurement errors are Gaussian: $y_i^{\mathrm{obs}} \sim \mathcal{N}(y_i^{\mathrm{true}}, \sigma_i^2)$.

Regularisation (Gaussian prior)

$$J_{\mathrm{reg}} = \chi^2 + \sum_k \left(\frac{\theta_k - \mu_k}{\sigma_k^{\mathrm{prior}}}\right)^2$$

Adds a penalty that pulls parameters toward prior estimates $\mu_k$. This is useful when the data alone cannot constrain all parameters — for example, snow density has very little spectral signature, so without a prior it would be poorly constrained. The regularisation term is equivalent to a Gaussian prior in Bayesian inference.
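These three cost variants can be sketched in a few lines of NumPy. This is an illustrative toy, not the BioSNICAR cost.py implementation; the function names and values are made up, but the structure mirrors the equations above (priors map a parameter index to a (prior_mean, prior_sigma) pair):

```python
import numpy as np

def chi_squared(y_pred, y_obs, sigma=None):
    """Chi-squared cost; falls back to unweighted least squares when sigma is None."""
    resid = np.asarray(y_pred) - np.asarray(y_obs)
    if sigma is None:
        return float(np.sum(resid ** 2))
    return float(np.sum((resid / np.asarray(sigma)) ** 2))

def regularised_cost(y_pred, y_obs, sigma, theta, priors):
    """Chi-squared plus a Gaussian-prior penalty on selected parameters."""
    cost = chi_squared(y_pred, y_obs, sigma)
    for k, (mu, sig) in priors.items():
        cost += ((theta[k] - mu) / sig) ** 2
    return cost

# Toy example: 5 bands with heteroscedastic noise, one weakly constrained
# parameter (say, density) pulled toward a prior of 300 +/- 50.
y_obs  = np.array([0.95, 0.90, 0.60, 0.30, 0.15])
y_pred = y_obs + 0.01                    # uniform 0.01 residual in every band
sigma  = np.array([0.01, 0.01, 0.02, 0.05, 0.05])
theta  = np.array([350.0])

print(chi_squared(y_pred, y_obs, sigma))                            # ~2.33
print(regularised_cost(y_pred, y_obs, sigma, theta, {0: (300.0, 50.0)}))  # ~3.33
```

Note how the two precise bands (σ = 0.01) contribute 1.0 each while the noisy bands contribute far less, exactly the down-weighting described above.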

Notice how the uncertainty-weighted cost surface has a rounder shape — the elongation caused by the 1.5 µm water-ice absorption band (which is noisy and unreliable) has been suppressed by down-weighting those bands.

BioSNICAR connection — Both spectral_cost() and band_cost() in biosnicar/inverse/cost.py accept obs_uncertainty (per-band $\sigma$) and regularization (dictionary of {param: (prior_mean, prior_sigma)}) keyword arguments. When obs_uncertainty is None, the cost function falls back to unweighted least squares. The regularisation is particularly important for band-mode retrieval (Sentinel-2, Landsat) where only 3–5 broadband observations must constrain multiple parameters.

Section 12 — Log-Space Optimisation

Why the optimiser struggles with large dynamic range

Section 4 introduced log-space sampling for the training set. The same problem arises during optimisation: when parameters span orders of magnitude, the cost surface is extremely elongated in linear space.

A gradient step of $\Delta x = 1$ is tiny when $x = 100{,}000$ but enormous when $x = 10$. The optimiser either overshoots at small values or crawls at large values. The finite-difference gradient estimate (used by L-BFGS-B) also becomes unreliable: a fixed step $\epsilon$ is too coarse for small $x$ and too fine for large $x$.

The solution: optimise in log space

By reparametrising to $x_{\log} = \log_{10}(x + 1)$, the cost surface becomes more isotropic (similar curvature in all directions) and the gradient step size adapts automatically to the parameter magnitude.

         0 → log: 0.0000 → back: 0.0
         1 → log: 0.3010 → back: 1.0
       100 → log: 2.0043 → back: 100.0
     10000 → log: 4.0000 → back: 10000.0
    500000 → log: 5.6990 → back: 500000.0

BioSNICAR connection — biosnicar/inverse/optimize.py (line 32) defines:

_LOG_SPACE_PARAMS = {"black_carbon", "snow_algae", "glacier_algae", "dust", "ssa"}

The _to_log() and _from_log() functions (line 35) perform the $\log_{10}(x+1)$ / $10^x - 1$ transformation. Inside retrieve(), the optimiser always works in the transformed space:

  • Bounds are transformed before passing to scipy.
  • The cost function wrapper transforms back to linear before calling the emulator.
  • The result is transformed back to linear before returning.

This is transparent to the user — they provide and receive linear values. The only visible effect is better convergence and more meaningful uncertainty estimates for skewed parameters.

Sections 13–16 — Optimisation: From Local to Global

BioSNICAR's retrieve() function supports four optimisation methods, each suited to a different regime:

Method Type Pros Cons BioSNICAR usage
L-BFGS-B Gradient-based, local Fast, respects bounds Stuck in local minima Default for 1-param; local refinement in hybrid
Nelder-Mead Simplex, derivative-free No gradients needed Slow in high dimensions; no native bounds Fallback when gradients are unreliable
Differential Evolution Population-based, global Finds global minimum Expensive (thousands of evaluations) Global pre-search in hybrid
Hybrid DE + L-BFGS-B Two-stage Best of both worlds Slightly slower than pure L-BFGS-B Default for ≥2 parameters

The hybrid approach is the BioSNICAR default for multi-parameter retrieval: a short DE run (100 iterations, small population) identifies the right valley, then L-BFGS-B converges rapidly to the precise minimum. This avoids the classic failure mode where a gradient method starts in the wrong basin.

The code below demonstrates all four on the same 2-parameter cost surface.

DE pre-search:  x = [0.6000, 0.3000]  (nfev = 2020)
L-BFGS-B final: x = [0.6000, 0.3000]  (nfev = 3)
True values:        [0.6, 0.3]
Total nfev: 2023

Why this works so well: DE identifies the right valley quickly (it doesn't need to converge precisely, just get close). L-BFGS-B then converges rapidly because it starts near the global minimum, where the cost surface is approximately quadratic and the gradient points straight to the bottom.
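The failure mode and the fix can be seen on a deliberately bimodal toy surface. This is a hypothetical double-well cost (not a snow spectrum); the two-stage call pattern mirrors the retrieve() snippet shown in the BioSNICAR connection below:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

def cost(x):
    """Double well in x[0]: local minimum near +1, global minimum near -1."""
    return (x[0] ** 2 - 1.0) ** 2 + 0.2 * x[0] + (x[1] - 0.5) ** 2

bounds = [(-2.0, 2.0), (-2.0, 2.0)]
x0 = np.array([0.9, 0.0])                # starts in the wrong basin

# Pure L-BFGS-B: slides downhill into the nearest (local) minimum.
local = minimize(cost, x0, method="L-BFGS-B", bounds=bounds)

# Hybrid: a short DE pre-search locates the global basin, L-BFGS-B refines.
de = differential_evolution(cost, bounds, maxiter=100, popsize=10,
                            seed=0, polish=False)
hybrid = minimize(cost, de.x, method="L-BFGS-B", bounds=bounds)

print("L-BFGS-B alone:", local.x)        # trapped near x[0] = +1
print("Hybrid:        ", hybrid.x)       # global minimum near x[0] = -1
```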

BioSNICAR connection — This is the default strategy for ≥2 free parameters in retrieve() (line 445):

de_result = differential_evolution(cost_fn, bounds, maxiter=100,
                                   popsize=10, seed=seed, polish=False)
x0_refined = de_result.x if de_result.fun < cost_fn(x0) else x0
result = minimize(cost_fn, x0_refined, method="L-BFGS-B", bounds=bounds)

The polish=False flag tells scipy's DE not to run its own internal L-BFGS-B polish — BioSNICAR runs a separate L-BFGS-B step with explicit control over tolerances and callback logging.

For single-parameter retrievals, the hybrid is unnecessary — the 1-D cost surfaces encountered here are unimodal (no spurious local minima), so L-BFGS-B alone suffices.


Exercise 4: Create a cost surface with two local minima (e.g. add a secondary Gaussian dip to the toy spectrum). Show that L-BFGS-B alone finds the wrong minimum depending on x0, while the hybrid always finds the global minimum.

BioSNICAR connection — biosnicar/inverse/optimize.py dispatches to the appropriate method based on the method parameter: L-BFGS-B (line 424), Nelder-Mead (line 426), differential evolution (line 338), or the default hybrid (line 445, which chains a quick DE pre-search into L-BFGS-B). The cost function is either spectral_cost() (cost.py:19) or band_cost() (cost.py:76).


Act IV: Uncertainty Quantification

A point estimate without an error bar is incomplete. BioSNICAR provides two approaches to uncertainty:

Sections 17–18 — Hessian Uncertainty (Fast, Gaussian)

The curvature of the cost function at the optimum encodes how tightly the data constrain each parameter. The Hessian matrix (matrix of second derivatives) captures this curvature; its inverse is the covariance matrix of the parameter estimates:

$$\sigma_k^2 \approx \left[H^{-1}\right]_{kk}, \quad H_{jk} = \frac{\partial^2 J}{\partial \theta_j \, \partial \theta_k}$$

This is fast (one evaluation of the Hessian at the optimum) and works well when the posterior is approximately Gaussian — i.e., when the cost surface is parabolic near the minimum. It breaks down for degenerate problems (elongated valleys) or multimodal posteriors.

Hessian:
[[9.5 1.9]
 [1.9 3.7]]

Covariance matrix:
[[ 0.11711359 -0.05905232]
 [-0.05905232  0.29814463]]

1σ uncertainties: grain = 0.3422, impurity = 0.5460
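A central finite-difference Hessian and its inversion can be sketched as follows. This is a toy quadratic cost with known curvature, not the BioSNICAR _hessian_uncertainty() source; for a quadratic, the finite-difference estimate should recover the curvature matrix almost exactly:

```python
import numpy as np

def fd_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of scalar function f at point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps ** 2)
    return H

# Toy quadratic cost J = 0.5 x^T A x, whose true Hessian is A.
A = np.array([[9.5, 1.9], [1.9, 3.7]])
cost = lambda x: 0.5 * x @ A @ x
x_opt = np.zeros(2)                      # the minimum of this toy cost

H = fd_hessian(cost, x_opt)
cov = np.linalg.inv(H)                   # covariance = inverse Hessian
sigmas = np.sqrt(np.diag(cov))          # 1-sigma parameter uncertainties
print(H)
print(sigmas)
```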

Section 19 — MCMC (Rigorous, Arbitrary Shape)

When the Hessian approximation is insufficient — typically for 3+ parameter problems or when degeneracies create non-Gaussian posteriors — Markov Chain Monte Carlo sampling maps the full posterior distribution. BioSNICAR uses the emcee ensemble sampler, which runs multiple "walkers" in parallel to explore parameter space efficiently.

MCMC produces chains of parameter samples whose density is proportional to the posterior probability. From these chains you can extract:

  • Marginal distributions (histograms per parameter)
  • Credible intervals (e.g. 95% highest-density interval)
  • Correlations between parameters (corner plots)
  • Diagnostics (trace plots, autocorrelation time, acceptance fraction)

The cost: MCMC requires thousands of forward evaluations — feasible only with the emulator (~10 µs each), not the full solver (~0.5 s each).

Acceptance rate: 95.8%
Post-burn-in samples: 18000
Mean: grain = 0.6501, impurity = 0.4668
Std:  grain = 0.2506, impurity = 0.2802
MCMC mean ± std: grain = 0.6501 ± 0.2506
                 impurity = 0.4668 ± 0.2802
Hessian ± 1σ:    grain = 0.6000 ± 0.3422
                 impurity = 0.3000 ± 0.5460
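The numbers above come from an ensemble sampler, but the core accept/reject idea can be shown with a minimal single-chain Metropolis–Hastings sampler in pure NumPy — a simplified stand-in for emcee (no ensemble moves), on a hypothetical Gaussian posterior:

```python
import numpy as np

def metropolis(log_post, x0, n_steps=20_000, step=0.15, seed=0):
    """Random-walk Metropolis: accept a proposal with prob min(1, p'/p)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    accepted = 0
    for i in range(n_steps):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis criterion
            x, lp = prop, lp_prop
            accepted += 1
        chain[i] = x
    return chain, accepted / n_steps

# Toy Gaussian posterior centred on "retrieved" parameters (0.6, 0.3), sigma = 0.1.
mu = np.array([0.6, 0.3])
log_post = lambda th: -0.5 * np.sum(((th - mu) / 0.1) ** 2)

chain, acc = metropolis(log_post, x0=[0.5, 0.5])
burn = chain[5000:]                       # discard burn-in
print(burn.mean(axis=0), burn.std(axis=0), acc)
```

The chain mean and standard deviation recover the posterior's centre and width; trace plots and autocorrelation would be checked on `chain` exactly as listed in the diagnostics above.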

BioSNICAR connection — biosnicar/inverse/optimize.py:_run_mcmc() implements MCMC using emcee.EnsembleSampler. If emcee is not installed, it raises ImportError. The ensemble sampler uses multiple "walkers" that communicate — this is much more efficient than single-chain Metropolis-Hastings because the walkers can explore the posterior in parallel and adapt their step sizes based on the ensemble spread. The MCMC chain is stored in RetrievalResult.chains, and the corner plot can be generated directly from the result object.

The default Hessian uncertainty is always computed (it's cheap). MCMC is run when method="mcmc" is passed to retrieve(), and provides a more complete picture of the posterior — especially important for degenerate parameter combinations like grain size vs density. The default is 32 walkers and 2,000 steps (plus 500 burn-in).


Exercise 5: Run the Metropolis-Hastings sampler on the Rosenbrock (banana) function from Section 18 with 100,000 steps. Compare the resulting chain to the Hessian ellipse — the MCMC should capture the curved shape that the Hessian misses.

Sections 20–21 — Propagating Uncertainty to Derived Quantities

Retrieved parameters (rds, ρ, BC) are often intermediate — the quantity of interest is a derived value like SSA = 3(1 − ρ/917) / (r·ρ). Uncertainty must be propagated through the derivation.

BioSNICAR handles this automatically: RetrievalResult.ssa_uncertainty (result.py:137) uses first-order error propagation from the Hessian, and MCMC chains can be transformed directly (compute SSA for each sample, take the standard deviation).

An additional subtlety: parameters optimised in log₁₀(x+1) space have uncertainties in log space. The back-transform σ_linear ≈ (x+1)·ln(10)·σ_log is applied automatically by retrieve() (optimize.py:354–358).
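Both steps can be sketched numerically. All values here are hypothetical; the SSA expression is the one quoted above, and the gradient is taken by finite differences purely to illustrate first-order propagation (this is not the result.py code):

```python
import numpy as np

# Derived quantity from the text: SSA = 3 * (1 - rho/917) / (r * rho)
def ssa(r, rho):
    return 3.0 * (1.0 - rho / 917.0) / (r * rho)

# Hypothetical retrieved values and (diagonal) covariance.
theta = np.array([500.0, 350.0])          # r, rho
cov = np.array([[50.0 ** 2, 0.0], [0.0, 30.0 ** 2]])

# First-order propagation: sigma_f^2 = g^T C g, with g the gradient of f.
eps = 1e-3
g = np.array([
    (ssa(theta[0] + eps, theta[1]) - ssa(theta[0] - eps, theta[1])) / (2 * eps),
    (ssa(theta[0], theta[1] + eps) - ssa(theta[0], theta[1] - eps)) / (2 * eps),
])
sigma_ssa = np.sqrt(g @ cov @ g)
print(ssa(*theta), sigma_ssa)

# Log-space back-transform from the text: sigma_lin ~ (x+1) * ln(10) * sigma_log
x, sigma_log = 5000.0, 0.05               # hypothetical impurity retrieval
sigma_lin = (x + 1.0) * np.log(10.0) * sigma_log
print(sigma_lin)
```

The MCMC alternative mentioned above needs no gradient at all: apply `ssa()` to every chain sample and take the standard deviation of the results.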

BioSNICAR connection — Hessian uncertainty is computed by _hessian_uncertainty() (optimize.py:582) using finite-difference second derivatives. MCMC uses _run_mcmc() (optimize.py:519) wrapping emcee.EnsembleSampler. SSA error propagation is in _ssa_uncertainty() (result.py:39). Log-space back-transform is at optimize.py:354–358.

Section 22 — Head-to-Head: Hessian vs MCMC vs MC Propagation

We now compare all three uncertainty methods on the same problem to understand when each is appropriate. The key question is: does the Hessian (fast, cheap, Gaussian assumption) agree with MCMC (slow, exact, no assumptions)? If yes, use the Hessian and save time. If not, the posterior is non-Gaussian and MCMC is needed.

Interpretation

For this well-behaved (near-Gaussian) problem, Hessian and MCMC agree closely — the MCMC contours overlap the Hessian ellipse. This is the best-case scenario for the Hessian: the cost surface is approximately quadratic near the minimum, so the Gaussian approximation holds.

When to use which method

Method Speed Non-Gaussian? Correlations? When to use
Hessian Fast (~ms) No Yes (local) Default; first check
MCMC Slow (~min) Yes Yes (global) Degenerate params, multi-modal
MC propagation Medium (~s) N/A Inherited Derived quantities

Practical rule: Always compute Hessian uncertainty (it's nearly free). If you suspect non-Gaussianity (e.g., strong grain size–density degeneracy, or band-mode retrieval with few observations), run MCMC to verify. If the MCMC and Hessian agree, trust the Hessian for future retrievals with similar settings.

BioSNICAR connectionretrieve() always computes Hessian uncertainty. MCMC is available via method="mcmc" and uses emcee.EnsembleSampler with 32 walkers and 2,000 steps per walker (plus 500 burn-in, by default). For a typical 4-parameter retrieval, this takes ~10 seconds with the emulator (compared to ~70 hours with the full forward model).


Exercise 6: Modify the toy spectrum to have a cost surface with two modes. Compare Hessian and MCMC uncertainty estimates — the Hessian should give a single ellipse while MCMC captures both modes.


Act V: Putting It All Together

The previous acts developed each component in isolation. This act assembles them into a complete retrieval pipeline — the same sequence used by BioSNICAR's retrieve() function — and validates it end-to-end.

Section 23 — Complete Retrieval Pipeline on Toy Problem

The retrieval workflow

A complete retrieval consists of six steps:

  1. Define the problem: Choose which parameters to retrieve and their bounds.
  2. Generate/load observation: A spectrum (full spectral or satellite bands) plus measurement uncertainty.
  3. Build/load emulator: LHS + forward model + PCA + MLP (or load a pre-trained .npz file).
  4. Optimise: Hybrid DE + L-BFGS-B in log space to find $\theta^*$.
  5. Quantify uncertainty: Hessian (fast) and/or MCMC (thorough).
  6. Validate: Check that true parameters (if known) fall within uncertainty.
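Steps 4–6 of this recipe can be sketched end to end on a toy problem. Here a cheap analytic forward model stands in for the emulator of step 3, the parameter names and feature positions are invented, and the uncertainty uses only the diagonal of the Hessian (ignoring parameter correlations) for brevity:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

rng = np.random.default_rng(42)
grid = np.linspace(0.0, 1.0, 50)

def forward(theta):
    """Toy 'emulator': two absorption dips on a flat continuum."""
    g, imp = theta
    return (1.0 - g * np.exp(-((grid - 0.7) / 0.15) ** 2)
                - imp * np.exp(-((grid - 0.3) / 0.05) ** 2))

# Step 2: synthetic observation with known truth and sigma = 0.01 noise.
truth = np.array([0.55, 0.20])
sigma = 0.01
y_obs = forward(truth) + sigma * rng.standard_normal(grid.size)

# Step 4: hybrid DE + L-BFGS-B on the chi-squared cost.
cost = lambda th: float(np.sum(((forward(th) - y_obs) / sigma) ** 2))
bounds = [(0.0, 1.0), (0.0, 1.0)]
de = differential_evolution(cost, bounds, maxiter=100, popsize=10,
                            seed=42, polish=False)
fit = minimize(cost, de.x, method="L-BFGS-B", bounds=bounds)

# Step 5: diagonal-Hessian uncertainty (correlations ignored for brevity).
def sigma_k(k, eps=1e-3):
    xp, xm = fit.x.copy(), fit.x.copy()
    xp[k] += eps
    xm[k] -= eps
    h_kk = (cost(xp) - 2.0 * cost(fit.x) + cost(xm)) / eps ** 2
    return np.sqrt(1.0 / h_kk)

sigmas = np.array([sigma_k(0), sigma_k(1)])

# Step 6: validate against the known truth.
print("retrieved:", fit.x, "+/-", sigmas)
print("within 3 sigma:", np.all(np.abs(fit.x - truth) < 3 * sigmas))
```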

Let us run this pipeline on our toy problem with known truth.

============================================================
COMPLETE RETRIEVAL PIPELINE
============================================================

1. True parameters: {'grain_size': 0.55, 'impurity': 0.2}
2. Observation: 50 bands, noise σ = 0.01
3. Emulator: ready (built in Section 8)
4. Optimisation: grain = 0.5463, impurity = 0.1993
   (True: grain = 0.55, impurity = 0.2)
   Total nfev: 298
5. Uncertainty: grain ± 0.0034, impurity ± 0.0055
6. Parameters recovered within 3σ: True

The retrieval successfully recovers both parameters within 3σ uncertainty. This validates the end-to-end pipeline:

Step Method Section BioSNICAR code
Sampling LHS + log-space 2, 4 emulator.py:_latin_hypercube
Compression PCA (99.9% variance) 6 emulator.py:PCA(n_components=0.999)
Emulation MLP (128, 128, 64) 7, 8 emulator.py:Emulator.build
Optimisation Hybrid DE + L-BFGS-B 15, 16 optimize.py:retrieve
Uncertainty Hessian finite-diff 18 optimize.py:_hessian_uncertainty

Each technique was motivated by a specific physical or computational challenge in snow/ice retrieval, and the whole is greater than the sum of its parts: the emulator makes the optimiser fast, the log transform makes the optimiser accurate, and the Hessian makes the uncertainty cheap.

Section 24 — Decision Guide: When to Use What

Choosing the right tools for your retrieval

The BioSNICAR inverse module provides sensible defaults, but understanding why those defaults were chosen helps you adapt to new situations.

Situation Sampling Surrogate Optimiser Uncertainty
1–2 params, fast model Grid None needed L-BFGS-B Hessian
3–5 params, slow model LHS PCA + MLP Hybrid DE + L-BFGS-B Hessian
>5 params or non-Gaussian LHS (log) PCA + MLP Hybrid MCMC
Satellite bands (≤5 obs) LHS (log) PCA + MLP Hybrid MCMC*
Noisy/discontinuous cost LHS PCA + MLP Nelder-Mead or DE MCMC

* With only 5 broadband observations, you cannot constrain more than ~3 free parameters. Fix the rest (e.g., density, layer thickness) via fixed_params in retrieve().

Key principles

  1. Match information to parameters: You need at least as many independent observations as free parameters. With 480 spectral bands, you have plenty of information. With 5 satellite bands, you're limited.
  2. Log-space everything for impurities: Impurity concentrations always benefit from log-space sampling and optimisation.
  3. Start with Hessian, upgrade to MCMC: Hessian is nearly free and sufficient for most well-constrained retrievals. Use MCMC when you suspect non-Gaussian posteriors or need to publish rigorous uncertainties.
  4. Validate on synthetic data: Before applying to real observations, test the pipeline on synthetic data where the truth is known.

Section 25 — Capstone: Real BioSNICAR Retrieval (Optional)

This section requires biosnicar to be installed. If available, it demonstrates a real retrieval using the full radiative transfer model:

  1. Build a real emulator for grain radius and black carbon.
  2. Generate a synthetic observation from the full forward model.
  3. Retrieve the parameters using the same hybrid pipeline taught above.
  4. Optionally perform a Sentinel-2 band-mode retrieval.

This closes the loop from toy examples to production code — every concept from the previous 24 sections is used here on real physics.

BioSNICAR not installed — skipping capstone.
Install with: pip install -e /path/to/biosnicar-py

Summary

The complete BioSNICAR data science stack

This notebook taught every data-science technique underlying the BioSNICAR emulator and inversion scheme, from first principles:

# Technique Physical motivation BioSNICAR module
1 Latin Hypercube Sampling Explore 4–8D parameter space efficiently emulator.py:_latin_hypercube
2 Log-space sampling Cover clean-ice through polluted-ice regimes emulator.py:_LOG_SAMPLE_PARAMS
3 PCA compression Exploit spectral band correlation to reduce output dimension emulator.py:PCA(n_components=0.999)
4 MLP neural network Fast, smooth surrogate for iterative optimisation emulator.py:MLPRegressor(128,128,64)
5 Pure-NumPy forward pass Deploy emulator without scikit-learn emulator.py:_forward_pass
6 Chi-squared cost Weight spectral bands by measurement confidence cost.py:spectral_cost
7 Log-space optimisation Isotropic cost surface for skewed parameters optimize.py:_to_log/_from_log
8 L-BFGS-B Fast gradient-based refinement with bounds optimize.py (via scipy)
9 Nelder-Mead Derivative-free fallback for noisy surfaces optimize.py (penalty bounds)
10 Differential Evolution Global search avoiding local minima optimize.py (DE pre-search)
11 Hybrid DE + L-BFGS-B Best of global + local for multi-param retrieval optimize.py:retrieve
12 Hessian uncertainty Cheap uncertainty from cost curvature optimize.py:_hessian_uncertainty
13 MCMC sampling Full posterior for non-Gaussian cases optimize.py:_run_mcmc
14 Error propagation Uncertainty on derived quantities (SSA) result.py:_ssa_uncertainty

The big picture

The fundamental challenge is this: a physically rigorous radiative transfer model is too slow for iterative inversion (~0.5 s × millions of evaluations). The entire data science pipeline exists to make inversion tractable:

  • Acts I–II (emulator): Replace the slow model with a fast surrogate, achieving ~10,000× speedup with <0.1% accuracy loss.
  • Act III (optimisation): Navigate the cost surface efficiently, combining global exploration with local refinement.
  • Act IV (uncertainty): Quantify how much we should trust the result, from cheap Hessian estimates to rigorous MCMC posteriors.

Each technique was demonstrated on a toy problem that captures the essential physics of snow/ice spectroscopy, then connected to the exact BioSNICAR source code.

Next steps

  • Explore the BioSNICAR inversion documentation: docs/INVERSION.md
  • Try biosnicar.inverse.retrieve() on synthetic and real observations
  • Experiment with different parameter combinations and uncertainty methods
  • Build custom emulators for your specific retrieval scenarios using Emulator.build()