Having designed the sensor, object recognition, and location services described in this section, how does one test these components? The fundamentals are consistent with the discussion in Chapter 2: one defines an ODD, builds tests under that ODD, applies the tests, and determines correctness. The tests can be applied virtually (simulation), physically (test track), or in a component-dependent mix (hardware-in-the-loop or software-in-the-loop). The population of tests needs to be complete enough to show sufficient coverage. The introduction of sensors and AI adds significant complexity to this process.
Testing sensors in safety-critical systems is particularly challenging when viewed through the lens of verification, validation (V&V), and certification, because sensors are both hardware devices and context-dependent measurement systems. Verification—ensuring the sensor meets its design specifications—can be addressed through laboratory calibration, environmental stress testing, and compliance with standards such as ISO 16750 (environmental conditions), DO-160 (avionics), and MIL-STD-810 (defense systems). However, validation—ensuring the sensor performs adequately in real operational contexts—is far more complex. Sensor performance depends heavily on the operational design domain (ODD), including weather, lighting, clutter, and interference conditions, which are difficult to fully replicate or bound. This gap between controlled verification and real-world validation is especially acute for perception sensors (e.g., cameras, radar, lidar), where performance is probabilistic rather than deterministic and strongly influenced by environmental variability. Today there is a great deal of innovation in mechanical test apparatus that mimics physical movement inside anechoic chambers to recreate difficult test scenarios. In the outdoor environment, swarms of drones acting as EM sensors and noise sources provide a similar function for test tracks.
We decompose the stack into Perception (object detection/tracking), Mapping (HD map/digital twin creation and consistency), and Localization (GNSS/IMU and vision/LiDAR aiding) and validate each with targeted KPIs and fault injections. The evidence is organized into a safety case that explains how module results compose at system level. Tests are derived from the ODD and instantiated as logical/concrete scenarios (e.g., with a scenario language like Scenic) over the target environment. This gives you systematic coverage and reproducible edge-case generation while keeping hooks for standards-aligned arguments (e.g., ISO 26262/SOTIF) and formal analyses where appropriate.
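As a minimal sketch of how a logical scenario is instantiated as concrete ones, the following Python fragment samples a cut-in scenario over ODD-derived parameter ranges; the class, field names, and ranges are hypothetical placeholders rather than the API of Scenic or any other tool:

<code python>
import random
from dataclasses import dataclass

@dataclass
class CutInScenario:
    ego_speed_mps: float       # ego speed at scenario start
    relative_speed_mps: float  # cut-in vehicle speed relative to ego
    initial_gap_m: float       # longitudinal gap at cut-in onset
    occlusion_level: float     # fraction of the actor initially occluded

# Logical scenario: parameter ranges derived from the ODD.
RANGES = {
    "ego_speed_mps": (8.0, 20.0),
    "relative_speed_mps": (-5.0, 5.0),
    "initial_gap_m": (5.0, 40.0),
    "occlusion_level": (0.0, 0.8),
}

def sample_concrete(n: int, seed: int = 0) -> list[CutInScenario]:
    """Draw n concrete scenarios from the logical parameter space."""
    rng = random.Random(seed)  # fixed seed keeps scenario sets reproducible
    return [
        CutInScenario(**{k: rng.uniform(*v) for k, v in RANGES.items()})
        for _ in range(n)
    ]

scenarios = sample_concrete(100)
</code>

A dedicated scenario language adds distributions, behaviors, and map bindings on top of this idea, but the logical-to-concrete sampling step is the same.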
The objective is to quantify detection performance—and its safety impact—across the ODD. In end-to-end, high-fidelity (HF) simulation, we log both simulator ground truth and the stack’s detections, then compute per-class statistics as a function of distance and occlusion. Near-field errors are emphasized because they dominate braking and collision risk. Scenario sets should include partial occlusions, sudden obstacle appearances, vulnerable road users, and adverse weather/illumination, all realized over the site map so that failures can be replayed and compared.
Figure 1 illustrates the object comparison: green boxes mark objects captured in the ground truth, while red boxes mark objects detected by the AV stack. Threshold-based rules compare the two sets of objects and are expected to provide specific indicators of detectable vehicles at different ranges within the safety and danger areas.
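The comparison can be sketched as follows, assuming axis-aligned (x_min, y_min, x_max, y_max) boxes in a bird's-eye frame; the greedy matching rule and the IoU threshold are illustrative choices, and in practice the counts would be accumulated per class and per distance bin so that near-field recall can be reported separately:

<code python>
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_frame(gt_boxes, det_boxes, iou_thresh=0.5):
    """Greedy one-to-one matching; returns (TP, FN, FP) counts."""
    unmatched = list(det_boxes)
    tp = 0
    for g in gt_boxes:
        best = max(unmatched, key=lambda d: iou(g, d), default=None)
        if best is not None and iou(g, best) >= iou_thresh:
            unmatched.remove(best)
            tp += 1
    fn = len(gt_boxes) - tp   # ground-truth objects the stack missed
    fp = len(unmatched)       # detections with no ground-truth match
    return tp, fn, fp
</code>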
Validation begins with how the map and digital twin are produced. Aerial imagery or LiDAR is collected with RTK geo-tagging and surveyed control points, then processed into dense point clouds and classified to separate roads, buildings, and vegetation. From there, you export OpenDRIVE (for lanes, traffic rules, and topology) and a 3D environment for HF simulation. The twin should be accurate enough that perception models do not overfit artifacts and localization algorithms can achieve lane-level continuity.
Key checks include lane-topology fidelity versus the survey, geo-consistency at the centimeter level, and semantic consistency (e.g., correct placement of occluders, signs, and crosswalks). The scenarios used for perception and localization are bound to this twin so that results can be reproduced and shared across teams or vehicles. Over time, you add change management: detect and quantify drift when the real world changes (construction, foliage, signage) and re-validate the affected scenarios.
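One way to realize the centimeter-level geo-consistency check is to compare surveyed control points against their coordinates in the twin; the sketch below assumes paired (x, y, z) points in meters and is illustrative rather than a specific tool's interface:

<code python>
import math

def geo_consistency_cm(surveyed_pts, twin_pts):
    """RMSE and max error in cm between paired (x, y, z) points in meters."""
    errs = [math.dist(s, t)  # Euclidean distance per control point
            for s, t in zip(surveyed_pts, twin_pts, strict=True)]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return rmse * 100.0, max(errs) * 100.0
</code>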
Here, the focus is on the robustness of the ego-pose estimate to sensor noise, outages, and map inconsistencies. In simulation, you inject GNSS multipath, IMU bias, packet dropouts, or short GNSS blackouts and observe how quickly the estimator diverges and re-converges. Similar tests perturb the map (e.g., small lane-mark misalignments) to examine the estimator's sensitivity to mapping error.
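A toy version of such a fault-injection run is sketched below: a 1-D dead-reckoning estimate with a small odometry bias is corrected by noisy GNSS fixes, a blackout window is injected, and the peak error and re-convergence time are reported. The filter, gain, and thresholds are illustrative assumptions, not a production estimator:

<code python>
import random

DT, GAIN = 0.1, 0.2        # time step [s], GNSS correction gain
BLACKOUT = (20.0, 25.0)    # injected GNSS outage window [s]
rng = random.Random(42)

truth = est = 0.0
v_true, v_est = 10.0, 10.3  # 0.3 m/s odometry bias drives drift
log = []
t = 0.0
while t < 60.0:
    truth += v_true * DT
    est += v_est * DT                        # dead reckoning (biased)
    if not (BLACKOUT[0] <= t < BLACKOUT[1]):
        gnss = truth + rng.gauss(0.0, 0.3)   # GNSS fix with 0.3 m noise
        est += GAIN * (gnss - est)           # simple correction step
    log.append((t, abs(est - truth)))
    t += DT

peak = max(err for _, err in log)
# time after the blackout ends until the error drops back under 0.5 m
reconv = next(t for t, err in log if t >= BLACKOUT[1] and err < 0.5)
print(f"peak error {peak:.2f} m, re-converged at t = {reconv:.1f} s")
</code>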
The following is a short KPI list:
  * per-frame deviation between the expected and actual vehicle position;
  * minimum, maximum, and mean deviation over a run;
  * divergence magnitude and re-convergence time under injected faults (e.g., GNSS blackouts);
  * estimator sensitivity to map perturbations (e.g., lane-mark misalignments).
The current validation methods perform a one-to-one matching between the expected and actual locations. As shown in Fig. 2, the vehicle position deviation is computed for each frame and recorded in the validation report. Aggregate parameters, such as the min/max/mean deviations, are then calculated from the same report. In the validation procedure, it is also possible to modify the simulator to inject noise into the localization process in order to check robustness and validate performance under degraded conditions.
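The deviation computation and its aggregates can be sketched as follows, assuming paired (x, y) ground-truth and estimated poses in meters; the function name and output structure are illustrative:

<code python>
import math

def deviation_report(gt_poses, est_poses):
    """Per-frame deviations plus min/max/mean aggregates."""
    devs = [math.dist(g, e)
            for g, e in zip(gt_poses, est_poses, strict=True)]
    return {
        "per_frame": devs,
        "min": min(devs),
        "max": max(devs),
        "mean": sum(devs) / len(devs),
    }
</code>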
A two-stage workflow balances coverage and realism. First, use low-fidelity (LF) tools (e.g., planner-in-the-loop with simplified sensors and traffic) to sweep large grids of logical scenarios and identify risky regions in parameter space (relative speed, initial gap, occlusion level). Then, promote the most informative concrete scenarios to HF simulation with photorealistic sensors for end-to-end validation of perception and localization interactions. Where appropriate, a small, curated set of scenarios is carried to closed-track trials. Success criteria are consistent across all stages, and post-run analyses attribute failures to perception, localization, prediction, or planning so that fixes are targeted rather than generic.
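The sweep-and-promote step might look like the following sketch, where run_lf_scenario is a hypothetical stand-in for an LF planner-in-the-loop run and minimum time-to-collision (TTC) serves as the risk proxy:

<code python>
import itertools

def run_lf_scenario(rel_speed, gap, occlusion):
    """Placeholder LF run returning a minimum-TTC proxy in seconds."""
    closing = max(rel_speed, 0.1)  # clamp to avoid division by zero
    return gap / closing * (1.0 - 0.5 * occlusion)

grid = itertools.product(
    [2.0, 5.0, 8.0],     # relative speed [m/s]
    [5.0, 15.0, 30.0],   # initial gap [m]
    [0.0, 0.4, 0.8],     # occlusion level
)
results = [(params, run_lf_scenario(*params)) for params in grid]
# promote the five lowest-TTC parameter sets to HF simulation
promoted = sorted(results, key=lambda r: r[1])[:5]
</code>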