
Impact of Intensity Distribution Transforms on 3D Brain Tumor Segmentation

BraTS 2023 GLI · nnU-Net v2 · 956 training patients (fold 0) · 4 MRI modalities

Independent research project — no academic supervision

1. Problem Statement

MRI intensity distributions in brain tumor imaging are heavily right-skewed (per-patient mean skewness of +2.5 to +5.8, depending on modality). The standard preprocessing in nnU-Net (per-patient z-score) normalizes mean and variance but does not correct the distribution shape. We hypothesize that this skewness affects gradient stability, particularly for small tumor subregions (NCR, ET) where the intensity signal is dominated by the long tail of healthy tissue.

| Modality | Raw skew | After z-score | Physical basis |
| --- | --- | --- | --- |
| T1 | +2.48 | +2.48 | T1 relaxation: gray/white contrast |
| T1ce | +5.76 | +5.76 | Gadolinium enhancement: tumor vasculature |
| T2 | +3.67 | +3.67 | T2 relaxation: fluid/edema bright |
| FLAIR | +4.01 | +4.01 | Fluid suppression: perilesional edema |

Measured on 956 training patients (fold 0), per-patient mean skewness. Z-score does not change skewness (linear transform).
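
The invariance stated above can be checked numerically. The sketch below uses synthetic log-normal intensities (an assumption for illustration, not BraTS data) and hypothetical helper names `skewness` and `zscore`; it only shows that a z-score, being a linear map, leaves the skewness unchanged.

```python
import math
import random


def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Synthetic right-skewed intensities (log-normal); NOT real MRI data.
rng = random.Random(0)
intensities = [rng.lognormvariate(0.0, 0.8) for _ in range(10_000)]

skew_raw = skewness(intensities)
skew_z = skewness(zscore(intensities))
print(f"raw skew = {skew_raw:.4f}, after z-score = {skew_z:.4f}")
```

The two printed values agree up to floating-point error, because both the third central moment and the cube of the standard deviation scale by the same factor under a linear transform.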

2. Literature Gap

To our knowledge, no study systematically compares intensity distribution transforms for brain tumor segmentation.

Key references

Reinhold et al. (2019) — Compares 7 normalization methods (z-score, WhiteStripe, Nyul, FCM, KDE, etc.) for MR synthesis. Does not include distribution transforms (Box-Cox, log, quantile).

Durso-Finley et al. (2024) — “Negligible effect of brain MRI preprocessing for tumor segmentation.” Tests skull-stripping, bias field, histogram matching, denoising. Concludes InstanceNorm compensates. Critical distinction: their transforms are linear or quasi-linear (scale, shift, resampling) — trivially compensated by InstanceNorm (mean/std normalization). Does not test nonlinear distribution transforms (Box-Cox, log, quantile) that modify skewness (3rd moment), which InstanceNorm does not correct.

Isensee et al. (2021) — nnU-Net uses per-case z-score for MRI. No skewness correction.

BraTS 2023/2024 winners — All use z-score via nnU-Net. No preprocessing innovation.

Key theoretical distinction: Durso-Finley's transforms (skull-strip, bias field, histogram matching) are linear or quasi-linear — they change scale, shift, or resample intensities. InstanceNorm trivially compensates these by normalizing mean and variance (moments 1-2).

Our transforms (Box-Cox, log, quantile) are nonlinear — they reshape the distribution itself (skewness = moment 3, kurtosis = moment 4). InstanceNorm does not correct higher moments. The open question: does skewness survive 7 layers of InstanceNorm + LeakyReLU + Conv3D? If yes, nonlinear transforms could matter where linear ones don't.

Our contribution: First systematic benchmark of nonlinear distribution transforms on BraTS, with explicit distinction from the linear transforms tested by Durso-Finley. Whether the result is positive or negative, it extends the InstanceNorm compensation hypothesis to a previously untested class of transforms.

3. Methods Compared

| ID | Method | Description | Parametric? | Addresses skew? |
| --- | --- | --- | --- | --- |
| C1 | z-score | Per-patient (x - mu) / sigma | No | No |
| C2 | Box-Cox MLE + z-score | Power transform with lambda_MLE, then z-score | Yes (lambda) | Partially (-55%) |
| C3 | log(1+x) + z-score | Log compression, then z-score | No | Partially (-35%) |
| C4 | Quantile-to-Gaussian | Phi^-1(F(x)): inverse normal CDF of the empirical CDF, giving N(0,1) | No | Yes (100%) |
| C5 | Clip 99th + z-score | Remove outliers, then z-score | No | Partially (-25%) |
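
As a rough illustration of the five methods in the table, here is a minimal pure-Python sketch applied to a flat list of positive intensities. All function names are hypothetical, the data are synthetic log-normal samples, the Box-Cox lambda is found by a coarse grid search rather than a proper optimizer, and the percentile is approximated by index. It is not the project's pipeline and is not expected to reproduce the skew reductions quoted in the table.

```python
import math
import random
from statistics import NormalDist


def skewness(x):
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m3 = sum((v - mu) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def zscore(x):                                    # C1
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [(v - mu) / sd for v in x]


def boxcox(x, lam):
    """Box-Cox power transform; inputs must be strictly positive."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]


def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda under a Gaussian model."""
    y = boxcox(x, lam)
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in x)


def boxcox_mle_zscore(x):                         # C2
    """Pick lambda on a grid over [-2, 2] (stand-in for a 1-D optimizer)."""
    grid = [i / 20 for i in range(-40, 41)]
    lam = max(grid, key=lambda l: boxcox_loglik(x, l))
    return zscore(boxcox(x, lam)), lam


def log_zscore(x):                                # C3
    return zscore([math.log1p(v) for v in x])


def quantile_to_gaussian(x):                      # C4
    """Send each value's empirical-CDF position through the inverse normal CDF.
    Ties are broken by input order in this sketch."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    inv_cdf = NormalDist().inv_cdf
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = inv_cdf((rank + 0.5) / n)
    return out


def clip_zscore(x, pct=99.0):                     # C5
    cutoff = sorted(x)[min(len(x) - 1, int(len(x) * pct / 100.0))]
    return zscore([min(v, cutoff) for v in x])


# Synthetic right-skewed intensities; NOT real MRI data.
rng = random.Random(0)
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(2000)]

c2_values, lam = boxcox_mle_zscore(raw)
results = {
    "C1 z-score": zscore(raw),
    "C2 Box-Cox": c2_values,
    "C3 log(1+x)": log_zscore(raw),
    "C4 Quantile": quantile_to_gaussian(raw),
    "C5 Clip 99th": clip_zscore(raw),
}
skews = {name: skewness(vals) for name, vals in results.items()}
for name, s in skews.items():
    print(f"{name:13s} skew = {s:+.3f}")
print(f"Box-Cox lambda (grid MLE) = {lam:+.2f}")
```

On real volumes the transforms would typically be applied to brain-mask voxels only and per modality; that masking step is omitted here.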

4. Methodological Innovation: Zero-Cost Proxy Metrics

Hypothesis: An ensemble of zero-cost metrics measured on labeled data can predict segmentation performance, enabling preprocessing selection without training. Adapted from NAS zero-cost proxies (architecture evaluation) to data evaluation — a novel application.

5 metrics computed (956 training patients, fold 0)

| Metric | What it measures | Cost | Reference |
| --- | --- | --- | --- |
| FDR | Static inter-class separability: (mu1-mu2)²/(var1+var2) | Minutes | Fisher 1936 |
| Gradient SNR | Gradient signal strength per class (1 backward pass) | ~15 min / method | Similar to GraSP (ICLR 2020) |
| NASWOT | Diversity of ReLU activation patterns (log-determinant of the Gram matrix K) | ~15 min / method | Mellor (ICML 2021) |
| Linear Probe | F1 of logistic regression on random encoder features | ~15 min / method | Standard ML |
| Rank Aggregation | Average rank across all metrics | 0 sec | AZ-NAS (CVPR 2024) |
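
The FDR formula in the table is simple enough to write out directly. The sketch below assumes a two-class setting (for example background vs. one tumor subregion) and uses toy numbers chosen so the result can be checked by hand; how the project aggregates FDR over several classes or modalities is not specified here, and `fisher_discriminant_ratio` is a hypothetical name.

```python
from statistics import fmean, pvariance


def fisher_discriminant_ratio(class_a, class_b):
    """FDR = (mu_a - mu_b)^2 / (var_a + var_b) for two sets of intensities."""
    mu_a, mu_b = fmean(class_a), fmean(class_b)
    return (mu_a - mu_b) ** 2 / (pvariance(class_a) + pvariance(class_b))


# Toy intensities (hypothetical values, not BraTS data).
background = [0.0, 1.0, 2.0, 3.0, 4.0]   # mean 2, variance 2
tumor = [4.0, 5.0, 6.0, 7.0, 8.0]        # mean 6, variance 2

fdr = fisher_discriminant_ratio(background, tumor)
print(fdr)  # (2 - 6)^2 / (2 + 2) = 4.0
```

Because the ratio is unchanged when the same linear map is applied to both classes, a final z-score step does not alter FDR; only the nonlinear part of each transform does.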

5. Results: Zero-Cost Proxies (956 training patients)

Ensemble ranking (lower = better)

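| Rank | Method | FDR (R) | SNR (R) | Loss (R) | NASWOT (R) | Probe (R) | Ensemble |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | C3 log(1+x) | 2.452 (3) | 0.801 (2) | 2.424 (3) | -282 (4) | 0.956 (1) | 2.6 |
| 2 | C2 Box-Cox | 3.430 (1) | 0.801 (1) | 2.408 (2) | -284 (5) | 0.918 (5) | 2.8 |
| 2 | C5 Clip 99th | 2.959 (2) | 0.800 (3) | 2.425 (4) | -278 (3) | 0.956 (2) | 2.8 |
| 4 | C4 Quantile | 1.703 (5) | 0.799 (5) | 2.406 (1) | -245 (1) | 0.946 (3) | 3.0 |
| 5 | C1 z-score | 1.704 (4) | 0.800 (4) | 2.433 (5) | -277 (2) | 0.943 (4) | 3.8 |
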
Key findings:
1. No single proxy dominates. FDR ranks C2 first, NASWOT ranks C4 first, Probe ranks C3 first. Each metric captures a different aspect of data quality.
2. Ensemble rank breaks the tie. C3 log(1+x) wins by being consistently good: it is first on only one metric (Linear Probe) and never last on any. This matches the AZ-NAS (CVPR 2024) rationale that a method “good across all proxies” outperforms one “excellent on one proxy.”
3. Gaussianization ≠ quality. C4 Quantile has perfect skew=0 and best NASWOT/Loss, but worst FDR — it collapses inter-class differences. Skewness alone does not predict performance.
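4. z-score baseline is last on ensemble rank (3.8). Every transform improves on the nnU-Net default in aggregate, although z-score still ranks 2nd on NASWOT and is marginally ahead of C4 on FDR (1.704 vs 1.703).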
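
The ensemble column can be recomputed from the per-metric ranks in the table above. The snippet below is a small sketch that takes those ranks as given (the values in parentheses in the table) and averages them; the dictionary layout is an illustrative choice.

```python
# Per-metric ranks copied from the results table (1 = best).
ranks = {
    "C1 z-score":   {"FDR": 4, "SNR": 4, "Loss": 5, "NASWOT": 2, "Probe": 4},
    "C2 Box-Cox":   {"FDR": 1, "SNR": 1, "Loss": 2, "NASWOT": 5, "Probe": 5},
    "C3 log(1+x)":  {"FDR": 3, "SNR": 2, "Loss": 3, "NASWOT": 4, "Probe": 1},
    "C4 Quantile":  {"FDR": 5, "SNR": 5, "Loss": 1, "NASWOT": 1, "Probe": 3},
    "C5 Clip 99th": {"FDR": 2, "SNR": 3, "Loss": 4, "NASWOT": 3, "Probe": 2},
}

# Rank aggregation: unweighted mean of the individual ranks (lower = better).
ensemble = {method: sum(r.values()) / len(r) for method, r in ranks.items()}

for method, score in sorted(ensemble.items(), key=lambda kv: kv[1]):
    print(f"{method:13s} {score:.1f}")
```

An unweighted mean treats all five proxies as equally informative; weighting them differently would be a separate design decision.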

Individual proxy contradictions

| Proxy | Winner | Loser | Interpretation |
| --- | --- | --- | --- |
| FDR | C2 Box-Cox | C4 Quantile | Box-Cox best separates classes in intensity space |
| Gradient SNR | C2 Box-Cox | C4 Quantile | Consistent with FDR: gradient follows static separability |
| NASWOT | C4 Quantile | C2 Box-Cox | Quantile creates most diverse activation patterns |
| Loss | C4 Quantile | C1 z-score | Quantile gives best initial optimization landscape |
| Linear Probe | C3 log | C2 Box-Cox | Log features are most linearly separable in random features |

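FDR/SNR and NASWOT disagree most sharply: C2 Box-Cox is first on FDR and SNR but last on NASWOT, while C4 Quantile is last on FDR and SNR but first on NASWOT and Loss. The reversal is not complete, since Loss still ranks C2 second. This suggests that static separability (intensity space) and network-level separability (activation space) are different properties. The ensemble rank integrates both perspectives.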

6. Multi-Level Validation Protocol

Three levels of evaluation, from fastest to most reliable. Each level validates the previous. If rankings are consistent across levels, proxy metrics can replace training for preprocessing selection.

| Level | Method | Time | What it measures | Status |
| --- | --- | --- | --- | --- |
| L1 | Proxy metrics (FDR, CV, skew) | 5 min / 5 methods | Static class separability | Done |
| L2a | Gradient SNR (956 training patients) | ~20 min / 5 methods | Gradient signal per class | Done |
| L2b | NASWOT (log-determinant of the Gram matrix K) | ~20 min / 5 methods | Activation diversity | Done |
| L2c | Linear Probe (random features) | ~20 min / 5 methods | Feature separability | Done |
| L2d | Rank Aggregation (AZ-NAS style) | 0 sec | Combined proxy ranking | Done |
| L3 | Mini-training (5 epochs) | ~42 min / method | Early convergence speed | Done (proxy ranking INVERTED) |
| L4 | Optuna/BoTorch per-modality tuning | ~10-15h | Optimal preprocessing params | Planned |
| L5 | Full training (40+ epochs) | ~5.5h / method | Actual Dice performance | C1 done |

Ranking consistency across levels

| Level | Ranking | Consistent? |
| --- | --- | --- |
| L1 (FDR) | C2 Box-Cox > C5 Clip > C3 log > C4 ≈ C1 | Baseline |
| L2a (SNR) | C2 ≈ C3 > C5 > C1 > C4 | Partial match L1 |
| L2b (NASWOT) | C4 > C1 > C5 > C3 > C2 | ✗ contradicts L1 |
| L2c (Probe) | C3 ≈ C5 > C4 > C1 > C2 | ✗ contradicts L1 |
| L2d (Ensemble) | C3 log > C2 ≈ C5 > C4 > C1 | Novel ranking |
| L3 (5-ep) | C1 0.767 > C5 0.660 > C3 0.624 > C2 0.552 | INVERTED vs L2d |
| L5 (40-ep) | C1 z-score: EMA Dice 0.867; others pending after HP tuning | Pending |

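If rankings were consistent across L1, L2, L3 and L5, this would validate FDR as a zero-cost proxy for preprocessing selection, a methodological contribution applicable beyond BraTS. (L4 is a tuning step and yields no ranking of its own.) With default hyperparameters the L3 ranking is inverted, so this validation does not currently hold.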

6b. Training Results (nnU-Net v2, fold 0)

Mini-training 5 epochs (default HP: lr=0.01, SGD Nesterov)

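| Dice Rank | Run | Preprocessing | Val Dice 5ep | Proxy Rank | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | C1 | z-score (baseline) | 0.767 | 5th (3.8) | Baseline winner |
| 2 | C5 | Clip 99th + z-score | 0.660 | 2nd (2.8) | Complete |
| 3 | C3 | log(1+x) + z-score | 0.624 | 1st (2.6) | Complete |
| 4 | C2 | Box-Cox MLE + z-score | 0.552 | 2nd (2.8) | Complete |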

C1 z-score 40 epochs: EMA Dice 0.867, Val Dice 0.824. Others at 5 epochs only.

Critical finding: proxy ranking is INVERTED vs actual Dice.
Predicted: C3 > C2 = C5 > C1. Observed: C1 >> C5 > C3 > C2.
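
One way to quantify the inversion is a Spearman rank correlation between the proxy ensemble score and the 5-epoch validation Dice. The sketch below uses only the four runs reported in the table above and hypothetical helper names (`average_ranks`, `pearson`); with four points it is a descriptive number, not a significance test.

```python
import math


def average_ranks(values, higher_is_better):
    """Rank 1 = best; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # mean position of the tied block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


methods = ["C1", "C5", "C3", "C2"]
proxy_score = [3.8, 2.8, 2.6, 2.8]        # ensemble rank, lower = better
dice_5ep = [0.767, 0.660, 0.624, 0.552]   # validation Dice, higher = better

proxy_ranks = average_ranks(proxy_score, higher_is_better=False)
dice_ranks = average_ranks(dice_5ep, higher_is_better=True)

# Spearman's rho is the Pearson correlation of the two rank vectors.
rho = pearson(proxy_ranks, dice_ranks)
print(f"Spearman rho = {rho:.3f}")
```

A negative rho means that methods ranked better by the proxies tended to score worse after 5 epochs, which is the inversion described above.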

Confound: hyperparameter bias. nnU-Net default HP (lr=0.01, momentum=0.99, PolyLR, augmentations) were tuned for z-score preprocessing. Custom transforms change the gradient distribution, making the same lr suboptimal. This is a comparison of “preprocessing + fixed HP”, not “preprocessing + optimal HP”.

Next step: HP tuning per-preprocessing (lr sweep) to eliminate this confound before concluding.

7. Experimental Protocol

Phase 1: Proxy metrics (FDR, CV, skew), 5 methods × 956 training patients
Phase 2: Zero-cost proxies (SNR, NASWOT, Linear Probe, Rank Aggregation), 956 training patients
Phase 3: Mini-training 5 epochs; proxy ranking INVERTED vs Dice (HP bias)
Phase 4: Optuna/BoTorch, tune preprocessing params per modality using proxy as objective
Phase 5: Training 40 epochs, optimized preprocessing vs baseline
Phase 6: Full training 200+ epochs, best preprocessing
Phase 7: 5-fold cross-validation + Wilcoxon tests
Phase 8: Alpha=1/d loss on best preprocessing (second contribution)
Phase 9: Correlation analysis: proxy ranking → 5ep → 40ep → 200ep Dice

8. Per-Modality Optimal Preprocessing

Each MRI modality measures a different physical property. The optimal preprocessing differs per modality — this is scientifically justified because the intensity distributions have different underlying causes.

| Modality | Best method | FDR | Why |
| --- | --- | --- | --- |
| T1 | Box-Cox | 2.683 | Power transform handles moderate right skew from WM/GM contrast |
| T1ce | Box-Cox | 3.209 | Compresses gadolinium enhancement peak, preserves tumor signal |
| T2 | Box-Cox | 4.113 | Strong skew from fluid; Box-Cox normalizes effectively |
| FLAIR | Clip 99th | 6.596 | Hyperintense lesions preserved by clipping, whereas Box-Cox compresses them |

Proposed hybrid: Box-Cox for T1/T1ce/T2 + Clip for FLAIR. To be validated by training after uniform method validation.

8b. Bayesian Optimization of Preprocessing (Planned)

Instead of choosing between fixed methods, optimize preprocessing hyperparameters per modality using zero-cost proxies as the objective function. Optuna/BoTorch explores the parameter space efficiently.

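| Parameter | Method | Current value | Search range | Per modality? |
| --- | --- | --- | --- | --- |
| lambda | Box-Cox | MLE (optimizes Gaussianity) | [-2, 2] | Yes |
| percentile | Clip | 99th | [90, 99.9] | Yes |
| shift | log(a+x) | a=1 | [0.01, 10] | Yes |
| clip + lambda | Clip + Box-Cox | 99th + MLE | Combined search | Yes |
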
Key insight: MLE optimizes lambda for Gaussianity. But our proxies show that Gaussianity ≠ class separability (Quantile has perfect Gaussianity but worst FDR). Optimizing lambda for the proxy ensemble score instead of Gaussianity targets the transform that maximizes class separability, which is a fundamentally different objective.

Contribution: To our knowledge, the first application of Bayesian optimization to MRI preprocessing hyperparameter selection using zero-cost proxy metrics as surrogate objectives.
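
To make the change of objective concrete, the sketch below selects the Box-Cox lambda that maximizes a two-class FDR instead of the Gaussian likelihood. It is deliberately simplified: a plain grid search over [-2, 2] stands in for Optuna/BoTorch, FDR alone stands in for the proxy ensemble score, and both classes are synthetic log-normal samples rather than BraTS voxels. All names are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def boxcox(x, lam):
    """Box-Cox power transform; inputs must be strictly positive."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]


def fdr(class_a, class_b):
    """Fisher discriminant ratio: (mu_a - mu_b)^2 / (var_a + var_b)."""
    diff = fmean(class_a) - fmean(class_b)
    return diff ** 2 / (pvariance(class_a) + pvariance(class_b))


# Synthetic stand-ins for healthy-tissue and tumor intensities (assumption).
rng = random.Random(0)
healthy = [rng.lognormvariate(0.0, 0.8) for _ in range(2000)]
tumor = [rng.lognormvariate(1.0, 0.5) for _ in range(500)]

# Search range taken from the parameter table: lambda in [-2, 2], step 0.1.
lambda_grid = [i / 10 for i in range(-20, 21)]
fdr_by_lambda = {
    lam: fdr(boxcox(healthy, lam), boxcox(tumor, lam)) for lam in lambda_grid
}

best_lambda = max(fdr_by_lambda, key=fdr_by_lambda.get)
print(f"best lambda for FDR = {best_lambda:+.1f}, "
      f"FDR = {fdr_by_lambda[best_lambda]:.3f}")
# lambda = 1 is an affine map, so its FDR equals that of the raw intensities.
print(f"FDR at lambda = 1 (untransformed) = {fdr_by_lambda[1.0]:.3f}")
```

A Bayesian optimizer would replace the exhaustive grid with a sampler that proposes lambda values adaptively, which mainly matters once several parameters (lambda, clip percentile, log shift) are searched jointly per modality.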

8c. The InstanceNorm Compensation Hypothesis

Durso-Finley et al. (2024) — “Negligible effect of brain MRI preprocessing”

Tested skull-stripping, bias field correction, histogram matching, and denoising on BraTS. Found negligible effect on Dice. Conclusion: InstanceNorm layers compensate for preprocessing differences by re-normalizing features internally at every network stage.

nnU-Net uses InstanceNorm3d at all 7 encoder/decoder stages. Each layer re-centers and re-scales the feature maps, potentially nullifying input distribution differences.

Linear vs nonlinear transforms — why our case is different

| Property | Durso-Finley transforms | Our transforms |
| --- | --- | --- |
| Type | Linear / quasi-linear | Nonlinear |
| Examples | Skull-strip, bias field, histogram match | Box-Cox, log(1+x), quantile-to-Gaussian |
| What changes | Scale, shift, which voxels present | Distribution shape (skewness, kurtosis) |
| Moments affected | 1st (mean), 2nd (variance) | 3rd (skew), 4th (kurtosis) |
| InstanceNorm corrects? | Yes: normalizes mean+std | Not directly: higher moments persist |

Open question: does input skewness survive 7 layers of InstanceNorm + LeakyReLU + Conv3D, or do the nonlinearities reshape it regardless? This is what our experiment tests.
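
A toy version of this question can be posed on a 1-D signal: normalize to zero mean and unit variance, apply a LeakyReLU, and track the skewness at each step. The sketch below has no convolution and no learned weights, so it says nothing about a trained nnU-Net; the slope 0.01 and eps 1e-5 are assumed values, and all names are hypothetical. It only illustrates that the normalization step preserves skewness while the nonlinearity changes it.

```python
import math


def skewness(x):
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m3 = sum((v - mu) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def instance_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over one channel (no affine)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def leaky_relu(x, slope=0.01):
    return [v if v >= 0 else slope * v for v in x]


signals = {
    "symmetric": [float(v) for v in range(-50, 51)],          # skew = 0
    "right-skewed": [math.exp(v / 25.0) for v in range(-50, 51)],
}

report = {}
for name, signal in signals.items():
    normed = instance_norm(signal)
    activated = leaky_relu(normed)
    report[name] = (skewness(signal), skewness(normed), skewness(activated))
    print(f"{name:13s} input {report[name][0]:+.3f}  "
          f"after norm {report[name][1]:+.3f}  "
          f"after LeakyReLU {report[name][2]:+.3f}")
```

In this toy setting even a perfectly symmetric input becomes right-skewed after one LeakyReLU, which is why the effect of input skewness after several stages has to be measured rather than assumed.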

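Our 5-epoch result (with default HP): z-score baseline dominates (0.767 vs 0.552-0.660). This gives no evidence that nonlinear transforms help, in line with Durso-Finley's negligible-effect finding, although a gap this large goes beyond what compensation alone would predict (full compensation would leave all methods at about the same Dice). Our comparison is also confounded: nnU-Net HP (lr=0.01, SGD Nesterov, PolyLR) were tuned for z-score data. Custom transforms change the gradient landscape, making the same lr suboptimal.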

Current experiment: Optuna + BoTorch lr tuning per preprocessing (10 trials each), intended to remove the HP confound. Three possible outcomes:
1. Transforms still don't help after HP tuning → confirms and extends Durso-Finley to distribution transforms. Publishable negative result (replication + extension).
2. Some transforms improve with tuned lr → InstanceNorm compensation is partial, and distribution shape matters when HP are adapted. Challenges Durso-Finley's generality.
3. Optimal lr differs per preprocessing but final Dice converges → preprocessing affects convergence speed, not final performance. InstanceNorm compensates asymptotically.

9. Known Limitations

InstanceNorm may compensate for preprocessing
  Impact: 5-epoch results show z-score baseline wins with default HP, consistent with Durso-Finley et al. (2024)
  Mitigation: Optuna lr tuning per-preprocessing to eliminate HP confound

Proxy ranking inverted vs Dice
  Impact: Zero-cost proxies do not predict 5-epoch Dice
  Mitigation: Negative result documented; proxies may predict after HP tuning or at convergence

HP confound in 5-epoch comparison
  Impact: lr=0.01 tuned for z-score, suboptimal for custom transforms
  Mitigation: Optuna + BoTorch lr sweep per preprocessing (in progress)

Single dataset (BraTS 2023 GLI)
  Impact: Results may not generalize
  Mitigation: Limitation documented, MedNeXt generalization planned

Box-Cox assumes unimodal distribution
  Impact: Brain MRI is multimodal (CSF/GM/WM/tumor)
  Mitigation: Per-patient MLE finds best compromise

10. Technical Setup

| Component | Choice | Justification |
| --- | --- | --- |
| Architecture | nnU-Net v2 (PlainConvUNet, 7 stages) | SOTA, reproducible, Nature Methods 2021 |
| Dataset | BraTS 2023 GLI (1196 patients, 4 modalities) | Multi-institutional, recent |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) | Full-brain patches 160×256×256 |
| Precision | BF16 (no GradScaler) | Native Blackwell sm_120 |
| Optimizer | SGD Nesterov, lr=0.01, PolyLR | nnU-Net default |
| Evaluation | Dice (WT/TC/ET), HD95, Wilcoxon | BraTS challenge standard |
| Viewer | Next.js + FastAPI + Three.js | Clinical interface + research dashboards |

11. Skills Demonstrated

Research Methodology
Hypothesis-driven experimental design, factorial protocol, proxy metric validation, honest limitation reporting
Deep Learning Engineering
nnU-Net customization, BF16 optimization, torch.compile, custom trainers, distributed data augmentation
Medical Image Analysis
MRI physics understanding, BraTS challenge protocol, clinical report generation (RANO 2.0, volumetrics)
Data Engineering
1196-patient pipeline, memmap optimization, pre-computation caching, gzip/WebP compression
Software Engineering
Full-stack viewer (React/Three.js/FastAPI), clinical presets, lossless compression, deployment-ready
Scientific Rigor
Corrected overestimated claims (Box-Cox skew), negative results documented, statistical testing planned

Project: BraTS Brain Tumor Segmentation — independent research

Data: BraTS 2023 GLI Challenge (1196 patients, multi-institutional)

Framework: nnU-Net v2 (Isensee et al., Nature Methods 2021)

Last updated: 2026-03-26