Impact of Intensity Distribution Transforms on 3D Brain Tumor Segmentation

BraTS 2023 GLI · nnU-Net v2 · 956 training patients (fold 0) · 4 MRI modalities

Independent research project — no academic supervision

1. Problem Statement

MRI intensity distributions in brain tumor imaging are heavily right-skewed (skewness +2.5 to +5.8 per-patient mean across modalities). The standard preprocessing in nnU-Net (per-patient z-score) normalizes mean and variance but does not correct the distribution shape. This skewness affects gradient stability, particularly for small tumor subregions (NCR, ET) where the intensity signal is dominated by the long tail of healthy tissue.

Modality	Raw skew	After z-score	Physical basis
T1	+2.48	+2.48	T1 relaxation — gray/white contrast
T1ce	+5.76	+5.76	Gadolinium enhancement — tumor vasculature
T2	+3.67	+3.67	T2 relaxation — fluid/edema bright
FLAIR	+4.01	+4.01	Fluid suppression — perilesional edema

Measured on 956 training patients (fold 0), per-patient mean skewness. Z-score does not change skewness (linear transform).
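
The invariance noted above follows from skewness being a standardized moment, so any positive affine map leaves it unchanged. A minimal sketch on synthetic log-normal values (not BraTS data; the function names are illustrative only):

```python
import math
import random

def skewness(values):
    """Third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def zscore(values):
    """Per-sample z-score: (x - mu) / sigma."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

rng = random.Random(0)
# Synthetic right-skewed "intensities"; a stand-in, not real MRI voxels.
intensities = [rng.lognormvariate(0.0, 0.8) for _ in range(20000)]

skew_raw = skewness(intensities)
skew_z = skewness(zscore(intensities))
print(f"raw skew = {skew_raw:.3f}, after z-score = {skew_z:.3f}")
```

The two printed values agree up to floating-point error, which is why the "Raw skew" and "After z-score" columns are identical.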

2. Literature Gap

To our knowledge, no study systematically compares intensity distribution transforms for brain tumor segmentation.

Key references

Reinhold et al. (2019) — Compares 7 normalization methods (z-score, WhiteStripe, Nyul, FCM, KDE, etc.) for MR synthesis. Does not include distribution transforms (Box-Cox, log, quantile).

Durso-Finley et al. (2024) — “Negligible effect of brain MRI preprocessing for tumor segmentation.” Tests skull-stripping, bias field, histogram matching, denoising. Concludes InstanceNorm compensates. Critical distinction: their transforms are linear or quasi-linear (scale, shift, resampling) — trivially compensated by InstanceNorm (mean/std normalization). Does not test nonlinear distribution transforms (Box-Cox, log, quantile) that modify skewness (3rd moment), which InstanceNorm does not correct.

Isensee et al. (2021) — nnU-Net uses per-case z-score for MRI. No skewness correction.

BraTS 2023/2024 winners — All use z-score via nnU-Net. No preprocessing innovation.

Key theoretical distinction: Durso-Finley's transforms (skull-strip, bias field, histogram matching) are linear or quasi-linear — they change scale, shift, or resample intensities. InstanceNorm trivially compensates these by normalizing mean and variance (moments 1-2).

Our transforms (Box-Cox, log, quantile) are nonlinear — they reshape the distribution itself (skewness = moment 3, kurtosis = moment 4). InstanceNorm does not correct higher moments. The open question: does skewness survive 7 layers of InstanceNorm + LeakyReLU + Conv3D? If yes, nonlinear transforms could matter where linear ones don't.

Our contribution: First systematic benchmark of nonlinear distribution transforms on BraTS, with explicit distinction from the linear transforms tested by Durso-Finley. Whether the result is positive or negative, it extends the InstanceNorm compensation hypothesis to a previously untested class of transforms.

3. Methods Compared

ID	Method	Description	Parametric?	Addresses skew?
C1	z-score	Per-patient (x - mu) / sigma	No	No
C2	Box-Cox MLE + z-score	Power transform lambda_MLE, then z-score	Yes (lambda)	Partially (-55%)
C3	log(1+x) + z-score	Log compression, then z-score	No	Partially (-35%)
C4	Quantile-to-Gaussian	Φ^-1(F(x)), with F the empirical CDF and Φ the standard normal CDF → exact N(0,1)	No	Yes (100%)
C5	Clip 99th + z-score	Remove outliers, then z-score	No	Partially (-25%)
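
The five candidates can be sketched as follows. This is a stdlib-only illustration on synthetic log-normal values, not the project's pipeline: the function names, the grid-search MLE for lambda, the nearest-rank percentile, and the tie-free rank transform are all assumptions made for the example, and the skew reductions it prints will differ from the percentages in the table.

```python
import math
import random
from statistics import NormalDist

def zscore(x):
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [(v - mu) / sigma for v in x]

def skewness(x):
    z = zscore(x)
    return sum(v ** 3 for v in z) / len(z)

def box_cox(x, lam):
    """Box-Cox power transform; requires strictly positive inputs."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def box_cox_mle(x, grid):
    """Pick the lambda on `grid` that maximizes the Box-Cox profile log-likelihood."""
    n = len(x)
    sum_log = sum(math.log(v) for v in x)

    def llf(lam):
        y = box_cox(x, lam)
        mu = sum(y) / n
        var = sum((v - mu) ** 2 for v in y) / n
        return (lam - 1.0) * sum_log - 0.5 * n * math.log(var)

    return max(grid, key=llf)

def clip_percentile(x, pct):
    """Cap values at the given percentile (nearest-rank definition)."""
    ordered = sorted(x)
    idx = min(len(ordered) - 1, max(0, math.ceil(pct / 100.0 * len(ordered)) - 1))
    cap = ordered[idx]
    return [min(v, cap) for v in x]

def quantile_to_gaussian(x):
    """Rank-based inverse normal transform (assumes no tied values)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = normal.inv_cdf((rank + 0.5) / n)
    return out

rng = random.Random(0)
voxels = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]  # synthetic, not MRI

lam = box_cox_mle(voxels, [i / 20 for i in range(-40, 41)])
results = {
    "C1": skewness(zscore(voxels)),
    "C2": skewness(zscore(box_cox(voxels, lam))),
    "C3": skewness(zscore([math.log1p(v) for v in voxels])),
    "C4": skewness(quantile_to_gaussian(voxels)),
    "C5": skewness(zscore(clip_percentile(voxels, 99.0))),
}
for name, value in results.items():
    print(f"{name}: skew = {value:+.3f}")
```

Real MRI volumes contain background zeros and tied intensities, so a production version would need a positive offset before Box-Cox and tie handling in the rank transform.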

4. Methodological Innovation: Zero-Cost Proxy Metrics

Hypothesis: An ensemble of zero-cost metrics measured on labeled data can predict segmentation performance, enabling preprocessing selection without training. Adapted from NAS zero-cost proxies (architecture evaluation) to data evaluation — a novel application.

5 metrics computed (956 training patients, fold 0)

Metric	What it measures	Cost	Reference
FDR	Static inter-class separability: (mu1-mu2)²/(var1+var2)	Minutes	Fisher 1936
Gradient SNR	Gradient signal strength per class (1 backward pass)	~15 min / method	Similar to GraSP (ICLR 2020)
NASWOT	Diversity of ReLU activation patterns (log|K| of the Gram matrix)	~15 min / method	Mellor (ICML 2021)
Linear Probe	F1 of logistic regression on random encoder features	~15 min / method	Standard ML
Rank Aggregation	Average rank across all metrics	0 sec	AZ-NAS (CVPR 2024)
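
The FDR formula in the table, (mu1-mu2)²/(var1+var2), can be written directly. A minimal two-class sketch with toy values; how the project pools FDR over tumor classes and modalities is not stated here, so that part is left out:

```python
def fisher_discriminant_ratio(class_a, class_b):
    """FDR = (mu_a - mu_b)^2 / (var_a + var_b), using population variances."""
    def mean_var(values):
        n = len(values)
        mu = sum(values) / n
        var = sum((v - mu) ** 2 for v in values) / n
        return mu, var

    mu_a, var_a = mean_var(class_a)
    mu_b, var_b = mean_var(class_b)
    return (mu_a - mu_b) ** 2 / (var_a + var_b)

# Toy intensities: means 5 and 1, population variance 1 each.
tumor = [4.0, 6.0]
healthy = [0.0, 2.0]
fdr = fisher_discriminant_ratio(tumor, healthy)
print(fdr)  # (5 - 1)^2 / (1 + 1) = 8.0
```

FDR is unchanged by a shared affine rescaling of both classes, so a z-score applied after a transform does not alter it.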

5. Results: Zero-Cost Proxies (956 training patients)

Ensemble ranking (Ensemble = mean of the five per-metric ranks R; lower = better)

Rank	Method	FDR (R)	SNR (R)	Loss (R)	NASWOT (R)	Probe (R)	Ensemble
1	C3 log(1+x)	2.452 (3)	0.801 (2)	2.424 (3)	-282 (4)	0.956 (1)	2.6
2	C2 Box-Cox	3.430 (1)	0.801 (1)	2.408 (2)	-284 (5)	0.918 (5)	2.8
2	C5 Clip 99th	2.959 (2)	0.800 (3)	2.425 (4)	-278 (3)	0.956 (2)	2.8
4	C4 Quantile	1.703 (5)	0.799 (5)	2.406 (1)	-245 (1)	0.946 (3)	3.0
5	C1 z-score	1.704 (4)	0.800 (4)	2.433 (5)	-277 (2)	0.943 (4)	3.8
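
The Ensemble column can be reproduced from the per-metric ranks in the table above. A short sketch, assuming a plain unweighted mean of ranks (the dictionary layout is illustrative):

```python
# Per-metric ranks (R) copied from the table above.
ranks = {
    "C3 log(1+x)": {"FDR": 3, "SNR": 2, "Loss": 3, "NASWOT": 4, "Probe": 1},
    "C2 Box-Cox": {"FDR": 1, "SNR": 1, "Loss": 2, "NASWOT": 5, "Probe": 5},
    "C5 Clip 99th": {"FDR": 2, "SNR": 3, "Loss": 4, "NASWOT": 3, "Probe": 2},
    "C4 Quantile": {"FDR": 5, "SNR": 5, "Loss": 1, "NASWOT": 1, "Probe": 3},
    "C1 z-score": {"FDR": 4, "SNR": 4, "Loss": 5, "NASWOT": 2, "Probe": 4},
}

# Unweighted mean rank per method; lower is better.
ensemble = {method: sum(r.values()) / len(r) for method, r in ranks.items()}

for method, score in sorted(ensemble.items(), key=lambda kv: kv[1]):
    print(f"{method:14s} {score:.1f}")
```

With only five methods and five metrics, a single rank swap can reorder the ensemble, so the 2.6 vs 2.8 gap should be read as a weak preference.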

Key findings:
1. No single proxy dominates. FDR ranks C2 first, NASWOT ranks C4 first, Probe ranks C3 first. Each metric captures a different aspect of data quality.
2. Ensemble rank breaks the tie. C3 log(1+x) wins by being consistently competitive: first on Probe, never last, and no worse than 4th (NASWOT) on any individual metric. This matches the rationale behind AZ-NAS (CVPR 2024): a method that is “good across all proxies” can outrank one that is “excellent on one proxy.”
3. Gaussianization ≠ quality. C4 Quantile has perfect skew=0 and best NASWOT/Loss, but worst FDR — it collapses inter-class differences. Skewness alone does not predict performance.
4. z-score baseline is last on ensemble rank (3.8). Every transform has a better ensemble rank than the nnU-Net default, although z-score is 2nd on NASWOT.

Individual proxy contradictions

Proxy	Winner	Loser	Interpretation
FDR	C2 Box-Cox	C4 Quantile	Box-Cox best separates classes in intensity space
Gradient SNR	C2 Box-Cox	C4 Quantile	Consistent with FDR — gradient follows static separability
NASWOT	C4 Quantile	C2 Box-Cox	Quantile creates most diverse activation patterns
Loss	C4 Quantile	C1 z-score	Quantile gives best initial optimization landscape
Linear Probe	C3 log	C2 Box-Cox	Log features are most linearly separable in random features

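FDR/SNR and NASWOT give near-opposite rankings: the SNR and NASWOT orders are exactly reversed, and FDR vs NASWOT has Spearman ρ = -0.9. Loss agrees with NASWOT on the winner (C4) but is otherwise uncorrelated with it (ρ = 0). This suggests that static separability (intensity space) and network-level separability (activation space) are different properties. The ensemble rank integrates both perspectives.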
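
The degree of disagreement between proxies can be quantified with Spearman's rank correlation on the per-metric ranks from Section 5. A minimal sketch using the tie-free formula (valid here because each metric assigns ranks 1 to 5 without ties):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two tie-free rank lists of equal length."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Ranks from the Section 5 table, method order: C3, C2, C5, C4, C1.
proxy_ranks = {
    "FDR": [3, 1, 2, 5, 4],
    "SNR": [2, 1, 3, 5, 4],
    "Loss": [3, 2, 4, 1, 5],
    "NASWOT": [4, 5, 3, 1, 2],
    "Probe": [1, 5, 2, 3, 4],
}

names = list(proxy_ranks)
rho = {
    (a, b): spearman_rho(proxy_ranks[a], proxy_ranks[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for (a, b), value in rho.items():
    print(f"{a:7s} vs {b:7s} rho = {value:+.2f}")
```

With n = 5 these correlations have wide uncertainty; they describe this table rather than establish a general relationship between the proxies.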

6. Multi-Level Validation Protocol

Five levels of evaluation (L1 to L5), from fastest to most reliable. Each level validates the previous one. If rankings are consistent across levels, proxy metrics can replace training for preprocessing selection.

Level	Method	Time	What it measures	Status
L1	Proxy metrics (FDR, CV, skew)	5 min / 5 methods	Static class separability	Done
L2a	Gradient SNR (956 training patients)	~20 min / 5 methods	Gradient signal per class	Done
L2b	NASWOT (log|K| Gram matrix)	~20 min / 5 methods	Activation diversity	Done
L2c	Linear Probe (random features)	~20 min / 5 methods	Feature separability	Done
L2d	Rank Aggregation (AZ-NAS style)	0 sec	Combined proxy ranking	Done
L3	Mini-training (5 epochs)	~42 min / method	Early convergence speed	Done — proxy INVERTED
L4	Optuna/BoTorch per-modality tuning	~10-15h	Optimal preprocessing params	Planned
L5	Full training (40+ epochs)	~5.5h / method	Actual Dice performance	C1 done
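
For reference, the NASWOT score listed as L2b can be sketched in a few lines, following the definition in Mellor et al.: each sample gets a binary activation code, the kernel entry K[i][j] is the number of units on which samples i and j agree, and the score is log|det K|. The toy pre-activations and helper names below are illustrative assumptions; the project applies this to a randomly initialized network, which is not reproduced here.

```python
import math

def activation_codes(pre_activations):
    """Binary code per sample: 1 where a unit's pre-activation is positive."""
    return [[1 if v > 0 else 0 for v in row] for row in pre_activations]

def log_abs_det(matrix):
    """log|det| via Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return float("-inf")  # singular kernel, e.g. duplicate codes
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total

def naswot_score(codes):
    """log|det K| with K[i][j] = n_units - Hamming(code_i, code_j)."""
    n_units = len(codes[0])
    kernel = [
        [n_units - sum(u != v for u, v in zip(ci, cj)) for cj in codes]
        for ci in codes
    ]
    return log_abs_det(kernel)

# Toy pre-activations: 3 samples x 3 units (hypothetical values).
pre_acts = [[0.3, -1.2, 0.8], [-0.5, 0.9, 0.1], [0.7, 0.4, -0.6]]
codes = activation_codes(pre_acts)
score_diverse = naswot_score(codes)
score_collapsed = naswot_score([codes[0]] * 3)
print(score_diverse, score_collapsed)
```

Distinct codes give a well-conditioned kernel and a higher score; identical codes make the kernel singular, which is the degenerate case the metric penalizes.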

Ranking consistency across levels

Level	Ranking	Consistent?
L1 (FDR)	C2 Box-Cox > C5 Clip > C3 log > C4 ≈ C1	Baseline
L2a (SNR)	C2 ≈ C3 > C5 > C1 > C4	Partial match L1
L2b (NASWOT)	C4 > C1 > C5 > C3 > C2	✗ contradicts L1
L2c (Probe)	C3 ≈ C5 > C4 > C1 > C2	✗ contradicts L1
L2d (Ensemble)	C3 log > C2 ≈ C5 > C4 > C1	Novel ranking
L3 (5-ep)	C1 0.767 > C5 0.660 > C3 0.624 > C2 0.552	INVERTED vs L2d
L5 (40-ep)	C1 z-score: EMA Dice 0.867 (Val Dice 0.824). Others: pending after HP tuning.	n/a

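If rankings were consistent across L1→L2→L3→L5 (L4 is a tuning step and produces no ranking), this would validate zero-cost proxies such as FDR for preprocessing selection, a methodological contribution applicable beyond BraTS. The L3 inversion means this does not currently hold with default HP; it will be re-tested after HP tuning.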

6b. Training Results (nnU-Net v2, fold 0)

Mini-training 5 epochs (default HP: lr=0.01, SGD Nesterov)

Dice Rank	Run	Preprocessing	Val Dice 5ep	Proxy Rank	Status
1	C1	z-score (baseline)	0.767	5th (3.8)	Baseline winner
2	C5	Clip 99th + z-score	0.660	2nd (2.8)	Complete
3	C3	log(1+x) + z-score	0.624	1st (2.6)	Complete
4	C2	Box-Cox MLE + z-score	0.552	2nd (2.8)	Complete

C1 z-score 40 epochs: EMA Dice 0.867, Val Dice 0.824. Others at 5 epochs only.

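Critical finding: the proxy ranking is largely INVERTED vs actual 5-epoch Dice (4 of the 6 method pairs are ordered oppositely; only C3 > C2 agrees).
Predicted: C3 > C2 = C5 > C1. Observed: C1 >> C5 > C3 > C2. C4 has no 5-epoch result listed.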
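
The mismatch between proxy rank and 5-epoch Dice can be summarized by counting concordant and discordant method pairs (Kendall's tau-b, which handles the C2/C5 tie in the proxy score). A minimal sketch using the values from the table above:

```python
import math
from itertools import combinations

# Values from the 5-epoch table: ensemble proxy rank (lower = better)
# and validation Dice (higher = better).
ensemble_rank = {"C1": 3.8, "C2": 2.8, "C3": 2.6, "C5": 2.8}
dice_5ep = {"C1": 0.767, "C2": 0.552, "C3": 0.624, "C5": 0.660}

concordant = discordant = proxy_ties = 0
for a, b in combinations(sorted(ensemble_rank), 2):
    proxy_diff = ensemble_rank[b] - ensemble_rank[a]  # > 0: proxy prefers a
    dice_diff = dice_5ep[a] - dice_5ep[b]             # > 0: Dice prefers a
    if proxy_diff == 0:
        proxy_ties += 1
    elif proxy_diff * dice_diff > 0:
        concordant += 1
    else:
        discordant += 1

n_pairs = concordant + discordant + proxy_ties
# tau-b with ties only on the proxy side (Dice values are all distinct).
tau_b = (concordant - discordant) / math.sqrt((n_pairs - proxy_ties) * n_pairs)
print(f"concordant={concordant} discordant={discordant} ties={proxy_ties} tau_b={tau_b:.3f}")
```

With four methods there are only six pairs, so this statistic is descriptive; it cannot support a significance claim on its own.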

Confound: hyperparameter bias. nnU-Net default HP (lr=0.01, momentum=0.99, PolyLR, augmentations) were tuned for z-score preprocessing. Custom transforms change the gradient distribution, making the same lr suboptimal. This is a comparison of “preprocessing + fixed HP”, not “preprocessing + optimal HP”.

Next step: HP tuning per-preprocessing (lr sweep) to eliminate this confound before concluding.
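
A sketch of what an lr sweep has to cover: log-spaced candidate learning rates around the default, each decayed by a polynomial schedule. The exponent 0.9 is assumed to match nnU-Net's default PolyLR, and the sweep range and candidate count are hypothetical choices for illustration, not the project's settings.

```python
def poly_lr(initial_lr, epoch, max_epochs, exponent=0.9):
    """Polynomial decay: lr * (1 - epoch / max_epochs) ** exponent."""
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent

def log_spaced(low, high, count):
    """`count` values from low to high, evenly spaced on a log scale."""
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** i for i in range(count)]

# Hypothetical sweep: one decade either side of the default lr=0.01.
candidates = log_spaced(1e-3, 1e-1, 5)

for lr in candidates:
    schedule = [poly_lr(lr, epoch, 5) for epoch in range(5)]
    print(f"lr0={lr:.4f} -> " + ", ".join(f"{v:.5f}" for v in schedule))
```

In a 5-epoch run the schedule decays to zero very quickly, so the effective lr seen by each preprocessing variant depends strongly on the chosen max_epochs.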

7. Experimental Protocol

Phase 1: Proxy metrics (FDR, CV, skew), 5 methods × 956 training patients

Phase 2: Zero-cost proxies (SNR, NASWOT, Linear Probe, Rank Aggregation), 956 training patients

Phase 3: Mini-training 5 epochs, proxy ranking INVERTED vs Dice (HP bias)

Phase 4: Optuna/BoTorch, tune preprocessing params per modality using proxy as objective

Phase 5: Training 40 epochs, optimized preprocessing vs baseline

Phase 6: Full training 200+ epochs, best preprocessing

Phase 7: 5-fold cross-validation + Wilcoxon tests

Phase 8: Alpha=1/d loss on best preprocessing (second contribution)

Phase 9: Correlation analysis: proxy ranking → 5ep → 40ep → 200ep Dice

8. Per-Modality Optimal Preprocessing

Each MRI modality measures a different physical property. The optimal preprocessing differs per modality — this is scientifically justified because the intensity distributions have different underlying causes.

Modality	Best method	FDR	Why
T1	Box-Cox	2.683	Power transform handles moderate right skew from WM/GM contrast
T1ce	Box-Cox	3.209	Compresses gadolinium enhancement peak, preserves tumor signal
T2	Box-Cox	4.113	Strong skew from fluid — Box-Cox normalizes effectively
FLAIR	Clip 99th	6.596	Hyperintense lesions are preserved by clipping, whereas Box-Cox compresses them

Proposed hybrid: Box-Cox for T1/T1ce/T2 + Clip for FLAIR. To be validated by training after uniform method validation.

8b. Bayesian Optimization of Preprocessing (Planned)

Instead of choosing between fixed methods, optimize preprocessing hyperparameters per modality using zero-cost proxies as the objective function. Optuna/BoTorch explores the parameter space efficiently.

Parameter	Method	Current value	Search range	Per modality?
lambda	Box-Cox	MLE (optimizes Gaussianity)	[-2, 2]	Yes
percentile	Clip	99th	[90, 99.9]	Yes
shift	log(a+x)	a=1	[0.01, 10]	Yes
clip + lambda	Clip + Box-Cox	99th + MLE	Combined search	Yes

Key insight: MLE optimizes lambda for Gaussianity. But our proxies show that Gaussianity ≠ class separability (Quantile has perfect Gaussianity but worst FDR). By optimizing lambda for the proxy ensemble score instead of Gaussianity, we aim to find the transform that maximizes class separability, which is a fundamentally different objective.
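
The difference between the two objectives can be illustrated with a grid search over the [-2, 2] range from the table: one lambda chosen by Box-Cox likelihood (Gaussianity), one chosen by FDR between two classes. This is a stdlib sketch on synthetic two-class data with FDR standing in for the full proxy ensemble; it does not use Optuna/BoTorch, and all names and class parameters are hypothetical.

```python
import math
import random

def box_cox(x, lam):
    """Box-Cox power transform; requires strictly positive inputs."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def mean_var(x):
    n = len(x)
    mu = sum(x) / n
    return mu, sum((v - mu) ** 2 for v in x) / n

def fdr(a, b):
    (mu_a, var_a), (mu_b, var_b) = mean_var(a), mean_var(b)
    return (mu_a - mu_b) ** 2 / (var_a + var_b)

def box_cox_llf(x, lam):
    """Box-Cox profile log-likelihood (higher = closer to Gaussian)."""
    _, var = mean_var(box_cox(x, lam))
    return (lam - 1.0) * sum(math.log(v) for v in x) - 0.5 * len(x) * math.log(var)

rng = random.Random(1)
# Synthetic stand-ins for healthy and tumor voxel intensities.
healthy = [rng.lognormvariate(0.0, 0.6) for _ in range(4000)]
tumor = [rng.lognormvariate(0.7, 0.4) for _ in range(400)]
grid = [i / 10 for i in range(-20, 21)]  # search range [-2, 2]

# Objective 1: Gaussianity of the pooled (unlabeled) intensities.
lam_mle = max(grid, key=lambda lam: box_cox_llf(healthy + tumor, lam))

# Objective 2: class separability, which needs labels.
fdr_by_lambda = {lam: fdr(box_cox(tumor, lam), box_cox(healthy, lam)) for lam in grid}
lam_fdr = max(fdr_by_lambda, key=fdr_by_lambda.get)

print(f"lambda (MLE) = {lam_mle:+.1f}, FDR = {fdr_by_lambda[lam_mle]:.3f}")
print(f"lambda (FDR) = {lam_fdr:+.1f}, FDR = {fdr_by_lambda[lam_fdr]:.3f}")
```

By construction the FDR-selected lambda is at least as separable as the MLE one on this grid; whether that gain transfers to Dice is exactly what the inverted 5-epoch result calls into question.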

Contribution: First application of Bayesian optimization to MRI preprocessing hyperparameter selection using zero-cost proxy metrics as surrogate objectives.

8c. The InstanceNorm Compensation Hypothesis

Durso-Finley et al. (2024) — “Negligible effect of brain MRI preprocessing”

Tested skull-stripping, bias field correction, histogram matching, and denoising on BraTS. Found negligible effect on Dice. Conclusion: InstanceNorm layers compensate for preprocessing differences by re-normalizing features internally at every network stage.

nnU-Net uses InstanceNorm3d at all 7 encoder/decoder stages. Each layer re-centers and re-scales the feature maps, potentially nullifying input distribution differences.

Linear vs nonlinear transforms — why our case is different

Property	Durso-Finley transforms	Our transforms
Type	Linear / quasi-linear	Nonlinear
Examples	Skull-strip, bias field, histogram match	Box-Cox, log(1+x), quantile-to-Gaussian
What changes	Scale, shift, which voxels present	Distribution shape (skewness, kurtosis)
Moments affected	1st (mean), 2nd (variance)	3rd (skew), 4th (kurtosis)
InstanceNorm corrects?	Yes — normalizes mean+std	Not directly — higher moments persist

Open question: does input skewness survive 7 layers of InstanceNorm + LeakyReLU + Conv3D, or do the nonlinearities reshape it regardless? This is what our experiment tests.
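
A single-channel toy version of this question: normalization alone leaves skewness untouched, while the LeakyReLU that follows reshapes it, even for a symmetric input. The sketch below uses synthetic 1D features and an assumed negative slope of 0.01; it ignores convolutions and channel mixing, so it illustrates the mechanism and says nothing about the outcome of the experiment.

```python
import math
import random

def skewness(x):
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m3 = sum((v - mu) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

def instance_norm(x, eps=1e-5):
    """Per-instance normalization of one channel: (x - mean) / sqrt(var + eps)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]

rng = random.Random(0)
features = {
    "symmetric": [rng.gauss(0.0, 1.0) for _ in range(20000)],
    "right-skewed": [rng.lognormvariate(0.0, 0.8) for _ in range(20000)],
}

report = {}
for name, x in features.items():
    normed = instance_norm(x)
    # (input skew, skew after norm, skew after norm + LeakyReLU)
    report[name] = (skewness(x), skewness(normed), skewness(leaky_relu(normed)))
    print(name, ", ".join(f"{v:+.3f}" for v in report[name]))
```

The symmetric input comes out of the activation strongly right-skewed, which is one reason input skewness may matter less after a few layers than the moment argument alone suggests.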

Our 5-epoch result (with default HP): z-score baseline dominates (0.767 vs 0.552-0.660). This is consistent with the InstanceNorm compensation hypothesis. However, our comparison is confounded: nnU-Net HP (lr=0.01, SGD Nesterov, PolyLR) were tuned for z-score data. Custom transforms change the gradient landscape, making the same lr suboptimal.

Current experiment: Optuna + BoTorch lr tuning per preprocessing (10 trials each). This targets the lr component of the HP confound. Three possible outcomes:
1. Transforms still don't help after HP tuning → confirms and extends Durso-Finley to distribution transforms. Publishable negative result (replication + extension).
2. Some transforms improve with tuned lr → InstanceNorm compensation is partial, and distribution shape matters when HP are adapted. Challenges Durso-Finley's generality.
3. Optimal lr differs per preprocessing but final Dice converges → preprocessing affects convergence speed, not final performance. InstanceNorm compensates asymptotically.

9. Known Limitations

InstanceNorm compensates preprocessing

Impact: 5-epoch results show the z-score baseline wins with default HP, consistent with Durso-Finley et al. (2024)

Mitigation: Optuna lr tuning per-preprocessing to eliminate HP confound

Proxy ranking inverted vs Dice

Impact: Zero-cost proxies do not predict 5-epoch Dice

Mitigation: Negative result documented; proxies may predict after HP tuning or at convergence

HP confound in 5-epoch comparison

Impact: lr=0.01 tuned for z-score, suboptimal for custom transforms

Mitigation: Optuna + BoTorch lr sweep per preprocessing (in progress)

Single dataset (BraTS 2023 GLI)

Impact: Results may not generalize

Mitigation: Limitation documented, MedNeXt generalization planned

Box-Cox assumes unimodal distribution

Impact: Brain MRI is multimodal (CSF/GM/WM/tumor)

Mitigation: Per-patient MLE finds best compromise

10. Technical Setup

Component	Choice	Justification
Architecture	nnU-Net v2 (PlainConvUNet, 7 stages)	SOTA reproducible, Nature Methods 2021
Dataset	BraTS 2023 GLI (1196 patients, 4 modalities)	Multi-institutional, recent
GPU	NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM)	Full-brain patches 160x256x256
Precision	BF16 (no GradScaler)	Native Blackwell sm_120
Optimizer	SGD Nesterov, lr=0.01, PolyLR	nnU-Net default
Evaluation	Dice (WT/TC/ET), HD95, Wilcoxon	BraTS challenge standard
Viewer	Next.js + FastAPI + Three.js	Clinical interface + research dashboards
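
For the Dice (WT/TC/ET) row, a minimal region-wise sketch on flattened label lists. The label convention (1 = NCR, 2 = ED, 3 = ET) and the choice to score an empty-vs-empty region as 1.0 are assumptions for this example; the official BraTS evaluation may handle empty regions differently.

```python
# Assumed BraTS 2023 label convention: 1 = NCR, 2 = ED, 3 = ET.
REGIONS = {
    "WT": {1, 2, 3},  # whole tumor
    "TC": {1, 3},     # tumor core
    "ET": {3},        # enhancing tumor
}

def dice(pred, ref, labels):
    """Dice = 2 * |P and R| / (|P| + |R|) for the union of the given labels."""
    p = [v in labels for v in pred]
    r = [v in labels for v in ref]
    intersection = sum(1 for a, b in zip(p, r) if a and b)
    denominator = sum(p) + sum(r)
    # Assumption: both masks empty counts as a perfect score.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator

# Toy flattened volumes (6 voxels each).
prediction = [0, 1, 2, 3, 3, 0]
reference = [0, 1, 2, 3, 0, 0]

scores = {name: dice(prediction, reference, labels) for name, labels in REGIONS.items()}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```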

11. Skills Demonstrated

Research Methodology

Hypothesis-driven experimental design, factorial protocol, proxy metric validation, honest limitation reporting

Deep Learning Engineering

nnU-Net customization, BF16 optimization, torch.compile, custom trainers, distributed data augmentation

Medical Image Analysis

MRI physics understanding, BraTS challenge protocol, clinical report generation (RANO 2.0, volumetrics)

Data Engineering

1196-patient pipeline, memmap optimization, pre-computation caching, gzip/WebP compression

Software Engineering

Full-stack viewer (React/Three.js/FastAPI), clinical presets, lossless compression, deployment-ready

Scientific Rigor

Corrected overestimated claims (Box-Cox skew), negative results documented, statistical testing planned

Project: BraTS Brain Tumor Segmentation — independent research

Data: BraTS 2023 GLI Challenge (1196 patients, multi-institutional)

Framework: nnU-Net v2 (Isensee et al., Nature Methods 2021)

Last updated: 2026-03-26