# Evaluation Once you have a model, evaluate it on the test split. The evaluation API expects per-image predictions in the same format as the ground truth targets: a dict with keys `y` (geometry tensor), `labels`, and `scores`. ## Expected Prediction Format Each prediction is a dict with the following keys: | Task | Key | Shape / Dtype | Description | |--------------|------------|-------------------------------|---------------------------------------------| | TreeBoxes | `"y"` | `Tensor[N, 4]` float32 | Bounding boxes `[xmin, ymin, xmax, ymax]` | | TreePoints | `"y"` | `Tensor[N, 2]` float32 | Point coordinates `[x, y]` | | TreePolygons | `"y"` | `Tensor[N, H, W]` uint8 | Binary instance masks | | All | `"labels"` | `Tensor[N]` int64 | Class labels (typically all 0 for "tree") | | All | `"scores"` | `Tensor[N]` float32 | Confidence scores (required for predictions)| ```python from milliontrees.common.data_loaders import get_eval_loader import torch # Use the test split for evaluation test_dataset = dataset.get_subset("test") test_loader = get_eval_loader("standard", test_dataset, batch_size=16) all_y_pred = [] # list[dict] all_y_true = [] # list[dict] for metadata, images, targets in test_loader: # Run your model to produce predictions for this batch batch_preds = MyModel(images) # Accumulate per-image predictions and targets for pred, target in zip(batch_preds, targets): all_y_pred.append(pred) all_y_true.append(target) # Evaluate. Pass the metadata array from the same subset used for evaluation results, results_str = dataset.eval(all_y_pred, all_y_true, metadata=test_dataset.metadata_array) print(results_str) ``` The evaluation returns a dictionary of metrics and a formatted string with per-source breakdowns and averages. ## TreePolygons train / checkpoint eval Polygon training and `training/polygons/eval.py` use **`--eval-mode stream`** by default: metrics are updated each batch instead of building full `y_pred` / `y_true` lists (much lower peak RAM on large test splits). Metrics match **`--eval-mode legacy`**, which keeps the old “accumulate everything, then `dataset.eval()`” flow. For custom scripts, the pattern above (lists + `dataset.eval()`) is unchanged. ## Evaluation visualizations For qualitative debugging, pass **`viz_dir`** and optionally **`viz_n_per_source`** (default `4`) to `eval()`. The library writes PNGs under `viz_dir`, grouped in subfolders by source name, with up to `viz_n_per_source` images per source (in dataloader order). - **Purple**: ground-truth geometry (boxes, points, or mask fill). - **Orange**: predictions with `scores` above the dataset’s eval score threshold (same rule as metrics). Images are resized to the dataset’s eval `image_size` so coordinates match `y_pred` and `y_true` from `get_eval_loader`. When `viz_dir` is set, **`results["eval_visualization_paths"]`** lists the written PNG paths (strings). ```python results, results_str = dataset.eval( all_y_pred, all_y_true, metadata=test_dataset.metadata_array[: len(all_y_pred)], viz_dir="work/eval_viz", viz_n_per_source=3, ) ``` ### Example overlay (TreeBoxes + DeepForest) Below: one TreeBoxes test image from the onboarding split, ground-truth boxes in purple, and boxes from the pretrained **`weecology/deepforest-tree`** model in orange (same resize as eval). ![TreeBoxes evaluation overlay: purple ground truth, orange DeepForest predictions](public/eval_visualization_sample.png) To regenerate this image after changing overlay styles or the sample batch, install dev extras and run: ```bash uv sync --dev uv run python docs/scripts/generate_eval_viz_sample.py --root-dir onboarding_data ``` The script uses **`include_unsupervised=True`** so a local tree layout such as `onboarding_data/TreeBoxes_v0.12/` is found (the default supervised-only layout uses a different directory name). Use **`--mini --download`** only if you rely on the mini zip URL instead. ## External Model Adapter Example (Segmentation / DeepTrees-style) If your model outputs instance masks (for example, a segmentation model such as DeepTrees), use the adapter example: `existing_models/external_segmentation_adapter.py` Run a full smoke test without any external dependency: ```bash python existing_models/external_segmentation_adapter.py \ --mini --download --mock --root-dir onboarding_data ``` Then replace `run_external_model_batch(images)` in the script with your model call. The adapter function `adapt_segmentation_prediction(...)` handles conversion to MillionTrees format: - `masks` -> `y` (`Tensor[N, H, W]`, uint8) - optional `scores` -> `scores` (`Tensor[N]`, float32, defaults to ones) - optional `labels` -> `labels` (`Tensor[N]`, int64, defaults to zeros) ## TreeBoxes The following can be recreated in the git repo: https://github.com/weecology/MillionTrees/blob/main/existing_models/deepforest/eval_boxes.py ### Accuracy First are the detection accuracy results. Detection accuracy measures the proportion of ground truth objects that are correctly detected (matched to a prediction above the IoU threshold), averaged across images. $$ \text{Detection Accuracy} = \frac{1}{|I|} \sum_{i \in I} \frac{\text{matched}_i}{\text{total\_gt}_i} $$ where $|I|$ is the number of images, $\text{matched}_i$ is the number of ground truth objects in image $i$ that have a matching prediction (above the IoU threshold), and $\text{total\_gt}_i$ is the total number of ground truth objects in image $i$. ``` ============================================================ ACCURACY RESULTS ============================================================ Source-wise Results: -------------------------------------------------- Source Score Count -------------------------------------------------- NEON_benchmark 0.576 0 Radogoshi et al. 2021 0.469 0 Kwon et al. 2023 0.426 0 SelvaBox 0.320 0 Weecology_University_Florida 0.313 0 Velasquez-Camacho et al. 2023 0.227 0 Zamboni et al. 2021 0.210 0 Dumortier et al. 2025 0.154 0 Reiersen et al. 2022 0.114 0 Sun et al. 2022 0.097 0 OAM-TCD 0.059 0 World Resources Institute 0.057 0 Santos et al. 2019 0.001 0 Summary Statistics: ---------------------------------------- Average accuracy: 0.211 Worst-group accuracy: 0.001 Min accuracy: 0.001 Max accuracy: 0.576 Std accuracy: 0.171 ``` ### Recall Recall is proportion of correctly predicted true positives. ### Mask-aware precision (partial-annotation aware) `TreeBoxes`, `TreePoints`, and `TreePolygons` now report a mask-aware precision metric (`maskaware_detection_precision`, `maskaware_keypoint_precision`, `maskaware_mask_precision`), using optional per-image `tree_coverage_mask` to avoid penalizing plausible trees that are present in imagery but missing from annotation boxes. For each prediction above score threshold: - Matched predictions are true positives (same task-specific matching as existing metrics: IoU for boxes/masks, distance-based matching for points). - Duplicate predictions are still false positives. - Unmatched predictions are counted as false positives only when the predicted geometry does not sufficiently overlap tree pixels in `tree_coverage_mask`. Per image: $$ \text{MaskAwarePrecision} = \frac{TP}{TP + FP_{\text{adjusted}}} $$ where `FP_adjusted` excludes unmatched predictions with tree-pixel fraction above the configured threshold (default `0.5`). If `tree_coverage_mask` is unavailable for an image, the metric falls back to strict precision behavior for that image. This adjustment applies to precision only. Recall metrics are unchanged. ``` ============================================================ RECALL RESULTS ============================================================ Source-wise Results: -------------------------------------------------- Source Score Count -------------------------------------------------- NEON_benchmark 0.779 0 Radogoshi et al. 2021 0.594 0 Kwon et al. 2023 0.506 0 Zamboni et al. 2021 0.484 0 Velasquez-Camacho et al. 2023 0.483 0 Weecology_University_Florida 0.444 0 SelvaBox 0.426 0 Dumortier et al. 2025 0.228 0 Reiersen et al. 2022 0.183 0 World Resources Institute 0.108 0 Sun et al. 2022 0.104 0 OAM-TCD 0.103 0 Santos et al. 2019 0.016 0 Summary Statistics: ---------------------------------------- Average recall: 0.311 Worst-group recall: 0.016 Min recall: 0.016 Max recall: 0.779 Std recall: 0.224 Average detection_acc across source: nan Average detection_accuracy: 0.211 source_id = 0 [n = 94]: detection_accuracy = 0.154 source_id = 1 [n = 5]: detection_accuracy = 0.426 source_id = 2 [n = 33]: detection_accuracy = 0.576 source_id = 3 [n = 784]: detection_accuracy = 0.059 source_id = 4 [n = 286]: detection_accuracy = 0.469 source_id = 5 [n = 12]: detection_accuracy = 0.114 source_id = 6 [n = 95]: detection_accuracy = 0.001 source_id = 7 [n = 253]: detection_accuracy = 0.320 source_id = 8 [n = 27]: detection_accuracy = 0.097 source_id = 9 [n = 176]: detection_accuracy = 0.227 source_id = 10 [n = 461]: detection_accuracy = 0.313 source_id = 11 [n = 94]: detection_accuracy = 0.057 source_id = 12 [n = 42]: detection_accuracy = 0.210 Worst-group detection_accuracy: 0.001 Average detection_recall: 0.311 source_id = 0 [n = 94]: detection_recall = 0.228 source_id = 1 [n = 5]: detection_recall = 0.506 source_id = 2 [n = 33]: detection_recall = 0.779 source_id = 3 [n = 784]: detection_recall = 0.103 source_id = 4 [n = 286]: detection_recall = 0.594 source_id = 5 [n = 12]: detection_recall = 0.183 source_id = 6 [n = 95]: detection_recall = 0.016 source_id = 7 [n = 253]: detection_recall = 0.426 source_id = 8 [n = 27]: detection_recall = 0.104 source_id = 9 [n = 176]: detection_recall = 0.483 source_id = 10 [n = 461]: detection_recall = 0.444 source_id = 11 [n = 94]: detection_recall = 0.108 source_id = 12 [n = 42]: detection_recall = 0.484 Worst-group detection_recall: 0.016 ``` To see more examples for formatted and output of models, see examples/ in the git repo.