EC M202A Final Project

Course project website for EC M202A / CS M213A


Urban Traffic Inference: Perimeter Vision & Encrypted-Camera Side-Channel Extraction

Project Banner

👥 Team


👥 Mentors


📝 Abstract

Modern cities face a tension between traffic monitoring needs and privacy concerns. This project explores vehicle tracking in urban zones by fusing perimeter vision with encrypted-camera side-channel signals. We leverage entry/exit detections from edge cameras (YOLOv8 + ByteTrack + OSNet ReID for multi-camera association) and interior motion cues inferred from WiFi packet traces (LSTM-based bitrate analysis) to estimate vehicle counts, locations, and dwell times without direct observation of private feeds. Implemented in the CARLA simulator with Mininet-WiFi emulation, our Kalman-filter fusion system tracks multiple vehicles through realistic urban scenarios, achieving trajectory estimates within acceptable error bounds while operating near real-time. Results demonstrate that side-channel data can meaningfully augment visual tracking when carefully synchronized and fused, opening a new avenue for privacy-conscious urban sensing, though with important caveats about appearance variability, temporal alignment, and ethical deployment.


📑 Slides


1. Introduction

Urban environments increasingly rely on camera networks for traffic management, safety, and city planning. Yet cities face a fundamental tension: comprehensive visual surveillance enables effective situational awareness but raises serious privacy concerns, while privacy-preserving approaches often sacrifice visibility and utility. This project explores a middle ground by asking: Can we maintain awareness of vehicles moving through an urban zone using only perimeter cameras and indirect signals from interior private feeds?

1.1 Motivation & Objective

Urban mobility applications (e.g. congestion control, emergency response) require knowing how many vehicles are in a given area, where they are, and how long they dwell. Today’s cities often rely on fixed cameras or sensor networks, but full coverage is impractical and raises privacy issues. For instance, automated license plate reader (ALPR) systems can log vehicles entering or exiting a town, but cannot monitor movements once inside and pose serious privacy concerns (Automated License Plate Readers, n.d.). Our objective is to maintain real-time awareness of vehicles in a city zone without blanket surveillance. We propose to achieve this by fusing perimeter camera vision with encrypted-camera side-channel data. Encrypted video streams, such as those from private interior cameras, inadvertently leak information through their bitrate patterns, which correlate with scene motion (Apthorpe et al., 2017). By monitoring these side-channel signals, we can infer when and where vehicle motion occurs inside the zone, then combine that with multi-vantage perimeter detections to track vehicles continuously. This approach is important because it could enable traffic monitoring that is less intrusive yet effective in complex urban environments.

1.2 State of the Art & Its Limitations

Current solutions for city-scale vehicle tracking fall short in either coverage or privacy. Surveillance camera networks with multi-camera tracking algorithms have been studied extensively; for example, the CityFlow benchmark demonstrated tracking vehicles across 40 cameras in a city environment (Tang et al., 2019). However, such systems assume extensive camera deployment with open video feeds, which is costly and invasive. Many cities instead deploy ALPR cameras at key points to record license plates, providing entry/exit logs that can reveal travel patterns (Automated License Plate Readers, n.d.) but failing to continuously localize vehicles. Another line of work uses IoT and wireless signals: researchers have shown that even when video feeds are encrypted, the network traffic rates can indicate activity. For instance, Apthorpe et al. found that a home security camera's encrypted traffic spikes when it detects motion or is being viewed (Apthorpe et al., 2017). Similarly, Ji et al. demonstrated that user presence and daily routines can be inferred from smart camera network traffic (Ji et al., 2018). Li et al. extended this to show that even fine-grained activities (e.g., moving, eating) can be distinguished via encrypted video stream patterns (Li et al., 2016). These studies highlight a privacy loophole: the bitrate of compressed video is strongly influenced by scene motion. To mitigate this, systems like Microsoft's Visor have proposed padding or shaping video traffic to mask bitrate variations (Shokri et al., 2019). In practice, however, most cameras do not employ such padding, and the side channel remains available. No existing traffic-monitoring approach has combined these modalities, vision and encrypted traffic, for multi-vehicle tracking. The state of the art either uses direct observation (many cameras or sensors) or coarse traffic counts, but cannot both protect privacy and maintain continuous tracking. This gap motivates our approach: fusing sparse, high-trust data (perimeter video feeds) with ubiquitous, low-trust signals (encrypted network feeds) to overcome these limitations.

1.3 Novelty & Rationale

Our approach is novel in that it treats encrypted camera traffic as a sensor for vehicle tracking. This is enabled by the key insight that encrypted camera streams (which one cannot and should not decrypt) still reveal some information through their network traffic patterns. When a video encoder compresses a scene with motion, the bitrate increases; when the scene is static, bitrate drops. An observer monitoring only packet statistics (not content) can leverage these properties to infer that something moved without ever seeing the video. We repurpose this side-channel phenomenon, traditionally studied as a privacy vulnerability, as a useful sensor for vehicle tracking.

Prior research on encrypted traffic focused on privacy attacks or anomaly detection, whereas we repurpose the phenomenon effectively using private cameras’ data rates to “peek” at motion on streets without seeing video content. This has not been explored before in multi-target tracking. We then fuse this side-channel with traditional vision-based tracking at the perimeter, incorporating both types of motion events into a unified tracking model. Each vehicle entering the zone is initialized via a perimeter camera detection, and thereafter updates its state either when seen by another edge camera or when triggered by interior side-channel events.

We also implement a multi-camera appearance matching technique using deep learning: by extracting appearance embeddings (e.g., using an OSNet CNN) for vehicles at edge cameras, we perform cross-camera re-identification based on cosine similarity (Zhou et al., 2019). This improves the consistency of vehicle identities across different camera views. The rationale is that combining these complementary sources will succeed where each alone fails: perimeter cameras give reliable IDs, temporal position, and entry/exit times but no internal visibility, while interior encrypted feeds give continuous presence hints without identification. Together, they can maintain a coarse but continuous track of each vehicle. In summary, the novelty lies in the multimodal fusion of vision and encrypted-network signals for tracking, and in demonstrating a working prototype of such a system in a realistic simulator.

1.4 Potential Impact

This project has the potential to influence the design of future smart-city sensing infrastructure by suggesting that useful situational awareness can be obtained without blanket CCTV coverage. Cities could leverage privately owned cameras as privacy-preserving sensors, tapping only metadata like traffic rates rather than live video feeds. This alleviates the need for new hardware installation and allows monitoring in sensitive areas where video is restricted. Technically, the approach opens up a new avenue in multimodal tracking, fusing conventional sensors with unconventional side channels. It may also inform privacy policy: by highlighting that encrypted camera streams still convey significant information, our work may encourage stricter standards (e.g., mandatory traffic shaping) or, conversely, demonstrate a constructive use case of the side channel (improving public services without additional surveillance). In domains beyond traffic, the concept could be applied to track occupancy in smart buildings or public spaces using encrypted IoT data, balancing utility and privacy. Finally, our system contributes to embedded and IoT research by showcasing inference under resource constraints, using minimal additional hardware (just a WiFi sniffer and a few cameras) to achieve a complex sensing task.

1.5 Challenges

Implementing the proposed system presents several challenges.

1.6 Metrics of Success

Our success criterion is the event error count. An erroneous event is a false positive, a false negative, or a misclassified event (an event attributed to a different car than the one that triggered it). Because of time constraints, this metric is evaluated heuristically, but it could be computed automatically with further processing of the ground-truth data.


2. Related Work

Our project builds upon and intersects multiple research threads, including multi-camera tracking, network traffic analysis for inference, and privacy-preserving sensing. In this section, we review representative prior works in these areas, highlighting how they relate to our solution and what gaps remain. Full citations are provided in Section 6.

In summary, while we draw on ideas from related work in multi-camera tracking, side-channel analysis, and sensor fusion, our system addresses a unique combination of these problems that has not been fully explored in the literature. This positions our work as a stepping stone toward more integrated, privacy-conscious urban sensing solutions.


3. Technical Approach

3.1 Assumptions & Proposed Solutions

Throughout this project, we make the following assumptions:

3.2 System Architecture

System Pipeline

The system architecture is organized into three major subsystems: edge-camera perception, inner-camera wireless side-channel sensing, and a central fusion/inference module. Edge cameras handle visual detection, tracking, and global ID assignment, while inner cameras contribute anonymized event streams derived from encrypted Wi-Fi traffic. Events are processed by a final inference loop, implemented either with a Kalman + Hungarian tracker (as shown) or with a graph-based optimization approach. The final layer integrates these heterogeneous event sources into coherent vehicle position inferences.

3.3 Data Pipeline

CARLA Setup

The CARLA simulator is instantiated via the pylot Docker container with all ports exposed on the host. The simulator runs within a Docker container to ensure reproducibility and easy sharing of dependencies. All ephemeral code and data collection is executed on the host itself.

Car Control & Scenario Generation

Vehicle trajectories are generated deterministically to ensure reproducible experiments across trials. The system employs a multi-point route structure where each vehicle follows spawn → interior → destination waypoints, with the spawn and destination points constrained to lie outside the monitoring zone (perimeter) and the interior waypoints required to pass through it. This design guarantees that every vehicle both enters and exits the region of interest, enabling comprehensive evaluation of entry detection, interior tracking, and exit confirmation. For ease of development, all waypoints in routes are selected from the list of legitimate spawn points specified by CARLA, since these represent locations that are well aligned with road lanes, minimizing undesired car behaviors.

Each vehicle is controlled by a CARLA BehaviorAgent, a rule-based local planner that wraps the Traffic Manager's low-level steering/throttle control with higher-level waypoint-following logic. Since our version of CARLA's BehaviorAgent does not provide an end-of-route method as later versions do, we detect arrival at a destination by catching the queueing error that the BehaviorAgent raises when the car approaches its destination. This ad hoc solution proved highly deterministic and produced useful behavior by our project's standards.
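A minimal sketch of this workaround is shown below; the helper is illustrative, and the exact exception raised when the internal waypoint queue empties depends on the CARLA version in use.

```python
# Hypothetical sketch; the BehaviorAgent API and the exception type
# raised near the destination vary across CARLA versions.
from agents.navigation.behavior_agent import BehaviorAgent

def drive_route(world, vehicle, destination):
    agent = BehaviorAgent(vehicle, behavior="normal")
    agent.set_destination(destination)   # signature varies by version
    while True:
        world.tick()                     # synchronous-mode stepping
        try:
            control = agent.run_step()
        except IndexError:
            # The agent's waypoint queue empties as the car approaches
            # its destination; we treat this error as "route complete".
            break
        vehicle.apply_control(control)
```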

For multi-vehicle scenarios, any combination of pre-defined routes can be specified when running the multi_car_route.sh script. Colors are deterministically assigned from a restricted set consisting of high contrast colors.

Storage of route files (written by the one_car_route.py with the appropriate flags, or manually) provides a certain reproducibility guarantee, limited by the variation in behavior due to the slow synchronous ticks. CARLA's BehaviorAgent may make slightly different decisions in steering aggression between runs, but car paths throughout the world are deterministic between runs and across machines.

Camera Capture

camera_layout

Static RGB cameras are placed at fixed, repeatable poses so every trial observes identical viewpoints; controlled camera geometry improves cross-run comparability and is standard practice in multi-camera tracking benchmarks. Each camera is assigned a unique ID to be tracked throughout data processing. Cameras sample at 20 FPS with 1280×720 resolution and a 90° field of view to balance spatial detail with real-time throughput (similar rates are used in KITTI/nuScenes to match perception pipelines) (Caesar et al., 2020). Frames arrive as raw bytes and are first buffered in per-camera queues to decouple acquisition from storage, a common technique in real-time vision systems to prevent frame drops when I/O stalls.

Frames are compressed on the fly with HEVC, with raw frames fed to the encoder via stdin. Piping raw frames directly into the encoder avoids intermediate disk writes and aligns with recommendations from the video systems literature for reducing latency and preserving quality in real-time capture (e.g., FFmpeg-based pipelines in robotics and teleoperation studies). Queue draining is synchronized with the simulator tick.
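A minimal sketch of this pattern, assuming one encoder process per camera (the flags below are illustrative, not the project's exact settings):

```python
import subprocess

WIDTH, HEIGHT, FPS = 1280, 720, 20

def open_encoder(out_path):
    """Spawn an ffmpeg process that reads raw BGRA frames from stdin."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "bgra",
        "-s", f"{WIDTH}x{HEIGHT}", "-r", str(FPS),
        "-i", "-",                 # raw frames arrive on stdin
        "-c:v", "libx265",         # HEVC encoding
        "-pix_fmt", "yuv420p",
        out_path,
    ]
    return subprocess.Popen(cmd, stdin=subprocess.PIPE)

# enc = open_encoder("cam_4.mp4")
# enc.stdin.write(frame_bytes)     # one 1280*720*4-byte BGRA frame per tick
```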

Vehicle poses are logged once per tick in world coordinates (x, y; ground plane), assuming a flat road surface. Pre- and post-padding with zeros yields a fixed-length trajectory per vehicle, which simplifies later alignment between tracks and videos; fixed-length temporal representations are widely used in trajectory forecasting datasets to enable direct frame-indexed fusion.

Mininet WiFi

Camera-to-access-point links are emulated in software with Mininet-WiFi to retain the full Linux TCP/IP stack while avoiding specialized radio hardware. A minimal topology with one access point and two stations mirrors common sender/receiver lab setups and keeps contention controlled. The emulator’s wmediumd interference model injects realistic wireless effects (loss, rate adaptation) instead of idealized zero-loss pipes, following recommendations from the Mininet-WiFi authors for fidelity in SDN and wireless research.

Over-the-air traffic is captured via a monitor-mode interface and recorded per video as PCAPs; this mirrors methodology in traffic-analysis and side-channel studies where timing, burstiness, and packet sizes—not payloads—are the primary signals. Transmissions are replayed at source frame rate using application-layer streaming so temporal characteristics match the encoded video. Each capture is closed after a clip to ensure one-to-one alignment between videos and PCAPs, improving downstream feature extraction reliability.

Mininet-WiFi is chosen because it offers repeatable experiments, real kernel code paths, and scriptable control at low cost, as documented in the SDN and wireless emulation literature (e.g., Mininet-WiFi design and evaluation papers). It also enables rapid iteration compared with physical testbeds while providing more realism than packet-level simulators.
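A minimal sketch of such a topology, assuming the documented Mininet-WiFi API (node names and positions are illustrative; the project's actual script is two_stations_wifi.py):

```python
from mininet.log import setLogLevel
from mn_wifi.net import Mininet_wifi
from mn_wifi.link import wmediumd
from mn_wifi.wmediumdConnector import interference
from mn_wifi.cli import CLI

def topology():
    # wmediumd with the interference model injects realistic wireless effects
    net = Mininet_wifi(link=wmediumd, wmediumd_mode=interference)
    sta_tx = net.addStation('sta1', position='10,20,0')  # ffmpeg streamer
    sta_rx = net.addStation('sta2', position='30,20,0')  # ffmpeg sink
    ap1 = net.addAccessPoint('ap1', ssid='cams', mode='g', channel='1',
                             position='20,20,0', failMode='standalone')
    net.configureWifiNodes()
    net.build()
    ap1.start([])
    # e.g. sta_tx.cmd('ffmpeg -re -i clip.mp4 -f mpegts udp://<sta2-ip>:5000 &')
    # while tcpdump on a monitor-mode interface records one PCAP per clip
    CLI(net)
    net.stop()

if __name__ == '__main__':
    setLogLevel('info')
    topology()
```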

Reproducibility

Refer to the setup instructions provided here.

3.4 Algorithm / Model Details

Edge Camera: Multi-Vantage Tracking

camera_layout

We run two prerecorded edge videos (cameras 4 and 5) frame by frame. Each frame goes through YOLOv8x for detection plus ByteTrack for short-term tracking, which yields a box and a per-camera track ID for each vehicle. We crop the box and pass it to an OSNet ReID network to get a 512-D appearance embedding; OSNet is used because it is lightweight and pretrained for person/vehicle re-identification, so it works well without heavy fine-tuning. The bottom-center of each box is projected into world coordinates using calibrated intrinsics/extrinsics so both cameras report positions in the same frame. A global appearance tracker keeps a cross-camera gallery: cosine similarity (threshold 0.65) links new embeddings to existing global IDs; otherwise it spawns a new one. To smooth noise, gallery embeddings are updated with a running average (80% previous, 20% new). We log per-frame JSON with camera pose, global/local IDs, and estimated world positions; this compact log is later used by the fusion step without needing to replay video. Key design choices: YOLO + ByteTrack gives robust boxes and stable short tracks; appearance-only matching (no timing/GPS) avoids the need for cross-camera synchronization; the similarity threshold trades off false merges against identity splits; and the running average keeps IDs stable even if a single frame is noisy.
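A condensed sketch of the gallery logic described above (the threshold and averaging weights come from the text; the class itself is illustrative):

```python
import numpy as np

SIM_THRESHOLD = 0.65  # cosine similarity required to reuse a global ID
ALPHA = 0.8           # weight on the stored embedding in the running average

class GlobalGallery:
    """Cross-camera ID assignment via cosine similarity on ReID embeddings."""
    def __init__(self):
        self.embeddings = {}  # global_id -> 512-D unit vector
        self.next_id = 0

    def assign(self, emb):
        emb = emb / np.linalg.norm(emb)
        best_id, best_sim = None, -1.0
        for gid, ref in self.embeddings.items():
            sim = float(ref @ emb)        # cosine similarity of unit vectors
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is not None and best_sim >= SIM_THRESHOLD:
            # 80/20 running average keeps identities stable under noisy frames
            mixed = ALPHA * self.embeddings[best_id] + (1 - ALPHA) * emb
            self.embeddings[best_id] = mixed / np.linalg.norm(mixed)
            return best_id
        self.embeddings[self.next_id] = emb  # unseen vehicle: new global ID
        self.next_id += 1
        return self.next_id - 1
```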

Inner-Camera: Side Channel PCAP Feature Extraction

camera_layout

Deterministic Approach

We implement a deterministic signal-processing pipeline to detect vehicle presence solely from the encrypted bitrate patterns of the inner cameras. Raw 802.11 data packets are first aggregated into 20 Hz discrete-time signals representing bytes per frame. A median filter (kernel size 5) is applied to this raw signal to suppress periodic high-bandwidth keyframes (I-frames) and transient transmission spikes that do not correspond to physical motion. To normalize performance across heterogeneous camera hardware, we employ an iterative statistical calibration routine that recursively clips signal outliers (beyond 3 standard deviations in this case) to isolate and calculate the true "noise floor" (mean and variance) of the camera stream. Detection is performed using dual-threshold hysteresis: a high "Trigger" threshold (mean + 2*sigma) initiates an event, preventing random noise from starting a detection, while a lower "Sustain" threshold maintains the active state during minor bitrate dips. This prevents a single vehicle event from fracturing into multiple detection fragments. By analyzing the signal-to-noise ratio and variance of the initial pass, the model identifies specific cameras with high noise floors or weak signals; we adjusted the smoothing window and trigger thresholds for these specific streams.
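A compact sketch of this pipeline (the sustain-threshold formula and iteration count are illustrative assumptions; the trigger threshold and kernel size come from the text):

```python
import numpy as np
from scipy.signal import medfilt

def noise_floor(signal, k=3.0, iters=5):
    """Iteratively clip outliers beyond k standard deviations to estimate
    the stream's baseline mean and deviation."""
    x = np.asarray(signal, dtype=float)
    for _ in range(iters):
        mu, sigma = x.mean(), x.std()
        x = x[np.abs(x - mu) <= k * sigma]
    return x.mean(), x.std()

def detect_events(bytes_per_frame, sustain_k=1.0):
    # Median filter (kernel 5) suppresses periodic I-frame spikes
    smoothed = medfilt(np.asarray(bytes_per_frame, dtype=float), kernel_size=5)
    mu, sigma = noise_floor(smoothed)
    trigger, sustain = mu + 2 * sigma, mu + sustain_k * sigma
    events, active, start = [], False, 0
    for i, v in enumerate(smoothed):
        if not active and v > trigger:    # high threshold starts an event
            active, start = True, i
        elif active and v < sustain:      # low threshold ends the event
            active = False
            events.append((start, i))
    return events
```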

Machine Learning Approach

Packet traces are converted into per-frame descriptors by isolating 802.11 data traffic after the first high-volume video packet and aggregating over the camera frame period (1/FPS). Each frame records packet count, total bytes, mean and variance of packet sizes, inter-arrival statistics, and index bounds; empty frames are zero-filled to preserve alignment. Paired video-derived features are matched by camera identifier, truncated to equal length, and standardized after discarding the first 500 frames to suppress startup transients—steps consistent with traffic-analysis practice. Three packet-side channels (count, total bytes, size variability) serve as predictors, while video embeddings supply supervision.

Overlapping windows of 16 frames are generated with stride 1 to retain fine temporal structure while expanding the effective dataset. An 80/20 split yields train/test partitions. A two-layer bidirectional LSTM (hidden size 128, dropout 0.1) maps each window to per-timestep predictions, optimized with mean squared error and Adam (learning rate 3e-4) for 75 epochs. This bi-directional, windowed formulation mirrors standard sequence-to-sequence regression setups, enabling the model to leverage both past and future context within each clip.

Model Architecture:

 BiLSTMRegressor(
  (lstm): LSTM(3, 128, num_layers=2, batch_first=True, dropout=0.1, bidirectional=True)
  (head): Sequential(
    (0): Linear(in_features=256, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=1, bias=True)
  )
)
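For concreteness, a minimal PyTorch definition consistent with the printed module above (training settings follow the text; the windowing code is omitted):

```python
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    """3 packet-side features in, 1 per-timestep target out."""
    def __init__(self, n_features=3, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, dropout=0.1,
                            bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),  # 256 -> 128 (bidirectional)
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):          # x: (batch, 16, 3) windows
        out, _ = self.lstm(x)      # (batch, 16, 256)
        return self.head(out)      # per-timestep prediction

model = BiLSTMRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.MSELoss()           # trained for 75 epochs per the text
```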

Event Parsing

Events are parsed on a per-camera basis using the methods described above, then synthesized into two final lists, all_edge_events.json and all_inner_events.json, sorted by time. These represent all the scenario data that will be provided to the final algorithm(s).

Fusion Algorithm

We compare two approaches that perform inference upon the final lists of edge and inner events.

Kalman Filter + Hungarian Algorithm

To integrate data from the edge cameras with the anonymous location data from the inner cameras, we designed a tracking algorithm using a Kalman filter and softmax-based data association. We model the vehicle state as a four-dimensional vector representing position and velocity in the 2D plane, assuming a constant-velocity motion model. As sensing characteristics differ between the two types of cameras, we assign separate measurement noise covariance matrices. The edge cameras provide ground-truth localization, so they are assigned a low noise variance, whereas the inner cameras are subject to more noise and estimation error and are assigned a higher variance. This ensures the filter trusts the edge data significantly more while still allowing the inner data to smooth the trajectory and update velocity estimates throughout the blind zone.
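As a point of reference, a minimal constant-velocity filter consistent with this description might look as follows (the specific noise magnitudes are illustrative tuning choices, not the project's exact values):

```python
import numpy as np

DT = 1 / 20.0  # one tick at the 20 FPS capture rate

# Constant-velocity model: state = [x, y, vx, vy]
F = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])
H = np.array([[1, 0, 0, 0],        # both camera types measure position only
              [0, 1, 0, 0]])

R_EDGE = np.eye(2) * 0.5           # edge cameras: low measurement noise
R_INNER = np.eye(2) * 25.0         # inner side-channel events: high noise

def kf_predict(x, P, Q=np.eye(4) * 0.1):
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def kf_update(x, P, z, R):
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P
```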

To associate inner camera events to existing vehicle tracks, we use a softmax-weighted Mahalanobis Distance. For each incoming inner camera event, we compute the distance between the event location and each active track’s predicted state, normalized by the track’s positional uncertainty. These costs are converted into association probabilities using the softmax function and the event is assigned to the track with the highest probability, provided it exceeds a confidence threshold of 0.2.
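A sketch of this association step (the sign and scaling inside the softmax are assumptions; the 0.2 confidence threshold comes from the text):

```python
import numpy as np

CONF_THRESHOLD = 0.2  # minimum association probability to accept a match

def associate(event_xy, tracks):
    """tracks: list of (track_id, predicted_xy, 2x2 position covariance)."""
    ids, dists = [], []
    for tid, pred_xy, P in tracks:
        d = np.asarray(event_xy) - np.asarray(pred_xy)
        m2 = float(d @ np.linalg.inv(P) @ d)  # squared Mahalanobis distance
        ids.append(tid)
        dists.append(np.sqrt(m2))
    probs = np.exp(-np.asarray(dists))
    probs /= probs.sum()                      # softmax over negated distances
    best = int(np.argmax(probs))
    if probs[best] < CONF_THRESHOLD:
        return None                           # likely noise: leave unassigned
    return ids[best]
```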

Track management is handled through the edge-camera IDs. When an edge camera detects a car ID that does not exist in the current state, a new Kalman filter instance is initialized at that location. Then, to prevent premature track deletion due to sensor noise near the boundaries, we implement a robust exit logic. A track is only deleted if the vehicle is at the extreme north or south limits within the lane width and has been active for a minimum duration of 10 frames. This ensures reliable termination of tracks without sacrificing continuity.

Kalman Filter Algorithm pseudocode:

# sorted by timestamp
edge_cam_data = [[frame_num, location, car id], [frame_num, location, car id]...]
inner_cam_data = [[frame_num, location], [frame_num, location]...]

class Car_KF:
    car_id

    # factor velocity into KF but only really output position

    known_position
    known_confidence

    pred_position
    pred_confidence

    curr_position
    curr_confidence

    predict()
        calculate pred_current_position
        calculate pred_confidence

        curr_position = pred_current_position
        curr_confidence = pred_confidence 

    update(location)
        calculate known_position
        calculate known_confidence

        curr_position = known_position
        curr_confidence = known_confidence
        

main:
    CURR_STATE = { 
        car_1: Car_KF,
        car_2: Car_KF,
        ....
    } # in implementation, starts empty {}

    for each curr_frame in video_frames: 

        # update car positions every frame, whether there was an event or not
        for each car in CURR_STATE:
            car.predict()

        # case inner event at current frame
        while inner_cam_data[0].frame_num == curr_frame:
            inner_event = inner_cam_data[0]

            event_car = identify_event(CURR_STATE, inner_event.location)
            output "event at {curr_frame} was triggered by car {event_car.car_id}"

            event_car.update(inner_event.location)

            inner_cam_data.pop(0) # consume the handled event

        # case edge event at current frame
        while edge_cam_data[0].frame_num == curr_frame:
            edge_event = edge_cam_data[0]

            if edge_event.car_id in CURR_STATE:
                remove CURR_STATE[edge_event.car_id] # car exited
                output "car {edge_event.car_id} exited at location {edge_event.location}"
            else:
                add a new Car_KF initialized to edge_event.location to CURR_STATE

            edge_cam_data.pop(0) # consume the handled event

identify_event(CURR_STATE, event_location):
    cost = {}
    for car in CURR_STATE:
        cost[car.car_id] = mahalanobis_distance(car, event_location)

    return the Car_KF with lowest cost # Hungarian algorithm step; for a single event this reduces to the minimum of the cost vector

Note that for the sake of extracting interesting information from the inner event data, identification of each inner anonymous event is treated as ground truth with respect to updating car locations.

In implementation, the location of each event is abstracted in the event data as a camera_id, then mapped to coordinates that represent the camera's location. We did attempt world projection at the edge cameras, but realized that this information is not needed, since the point of an edge event is to identify an entry point and report the ground truth about a car's identity. The small difference between using the camera's position and the car's real position on the road is minimal compared to the much larger camera spacing; in other words, there is no ambiguity about which entry point a car used. For the inner cameras, the reporting camera's position likewise serves as the event location.

This approach relies heavily upon the quality of the input event sequences; in particular, it assumes that:

  1. Events are correctly sorted by time
  2. Events are unique (no duplicate reports of the same car/camera/time encounter)
  3. Edge camera car_id's are correctly assigned
  4. camera_id's that reported each event are accurate

Global Graph-Based Optimization Approach

In this approach, we solve tracking by formulating a global graph-based Mixed Integer Program (MIP). To robustly associate observed car trajectories with inner sensor events, we employ a global optimization framework rooted in graph theory. Rather than relying on local or greedy assignments, our method builds a global graph where all plausible assignments—consistent with physical and temporal constraints—are considered simultaneously. This approach allows us to jointly determine the most likely set of car paths, as well as to detect and label sensor events that cannot be explained by any car as noise.

Comparison of Approaches

In summary, our system implements two distinct algorithms for associating inner sensor events with vehicle identities:

  1. A Kalman Filter with Hungarian algorithm for per-event greedy matching.
  2. A global Graph Optimization approach for jointly optimal trajectory assignment.

The Kalman + Hungarian algorithm is computationally efficient and intuitive, updating each vehicle's belief state frame by frame and making assignment decisions based on predicted positions. We expect it to perform well in straightforward, low-density settings, but it may struggle with ambiguous or noisy events that would benefit from considering global context.

The global Graph Optimization approach considers all possible associations and constraints simultaneously, optimizing for the globally most consistent set of trajectories and noise rejections. This should make it more robust to ambiguous or missing detections, at the cost of higher computational complexity.

Between the two, the Kalman + Hungarian algorithm approach is much more strongly dependent on the quality of event data. Because it makes decisions based only on the current filter state and immediate event observations, degraded event quality (such as dropped, noisy, or out-of-order events) can rapidly undermine tracking accuracy. The MIP approach, conversely, is able to use global context to compensate for some level of poor or missing data, making it more robust to event imperfections.

Finally, there is a key distinction in inference timing: the Kalman + Hungarian tracker runs online, producing estimates as events arrive, whereas the Graph Optimization runs offline over the complete event log and can exploit future context before committing to assignments.

3.5 Hardware / Software Implementation

Docker/PyLot Development Environment

create_dev_cont.sh and run_cont.sh provision a GPU-enabled, privileged, host-networked container with X11 passthrough, enabling CARLA/pylot to render while leveraging hardware acceleration. Containerization standardizes dependencies and runtime behavior across simulation and networking experiments.

CARLA and Camera Capture

The CARLA simulation serves as a controlled environment in which multiple RGB sensors are instantiated from blueprints with fixed resolution, field of view, and frame rate. In spawn_world5_cameras.py, each camera is spawned at prescribed poses, streams raw BGRA frames into per-camera queues, and feeds those frames to downstream processing without display overhead.

FFmpeg (HEVC) Encoding

Raw camera frames are piped directly into individual ffmpeg processes, which encode to H.264 MP4 (yuv420p) at the simulation frame rate. This design decouples simulation ticks from disk I/O and preserves per-camera isolation. The same script records vehicle trajectories per frame, padding temporal gaps so position logs align with the total number of simulation ticks.

Mininet Wi‑Fi Emulation

two_stations_wifi.py constructs a minimal Wi‑Fi topology (two stations, one WPA2 access point) with interference modeling to approximate over‑the‑air behavior. One station streams local MP4s over UDP/MPEG‑TS via ffmpeg, the other runs an ffmpeg sink, and tcpdump captures 802.11 traffic to PCAPs for subsequent analysis—providing a reproducible network path for video delivery experiments.

Google OR-Tools (Graph Optimization)

The global graph-based optimization approach leverages Google OR-Tools, an open-source operations research library. We formulate vehicle tracking as a Mixed Integer Program (MIP) where decision variables represent whether a car traverses each edge in the assignment graph, and constraints enforce flow conservation and physical feasibility. The OR-Tools solver efficiently finds the globally optimal trajectory assignment and noise classification, balancing trajectory smoothness against the cost of misassigning or ignoring observations. This approach is implemented in graph_algorithm.py.
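A simplified sketch of such a formulation with OR-Tools (variable names, costs, and the omission of per-car source/sink constraints are all illustrative; see graph_algorithm.py for the actual implementation):

```python
from ortools.linear_solver import pywraplp

def solve_tracking(events, edges, edge_cost, noise_penalty=10.0):
    """events: interior event IDs; edges: feasible (u, v) transitions
    already filtered by temporal/physical constraints; edge_cost: cost of
    each transition. Per-car entry/exit (source/sink) constraints are
    omitted for brevity."""
    solver = pywraplp.Solver.CreateSolver("SCIP")
    use = {e: solver.BoolVar(f"use_{e[0]}_{e[1]}") for e in edges}
    noise = {n: solver.BoolVar(f"noise_{n}") for n in events}

    for n in events:
        inflow = sum(use[e] for e in edges if e[1] == n)
        outflow = sum(use[e] for e in edges if e[0] == n)
        solver.Add(inflow == outflow)        # flow conservation through events
        solver.Add(inflow + noise[n] == 1)   # explained by a car, or noise

    solver.Minimize(sum(edge_cost[e] * use[e] for e in edges)
                    + noise_penalty * sum(noise[n] for n in events))

    if solver.Solve() == pywraplp.Solver.OPTIMAL:
        kept = [e for e in edges if use[e].solution_value() > 0.5]
        flagged = [n for n in events if noise[n].solution_value() > 0.5]
        return kept, flagged
    return None
```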

End-to-End Pipeline Perspective

CARLA supplies synchronized, configurable sensor data; ffmpeg transforms those data streams into compressed video frames; Mininet Wi-Fi emulates wireless link characteristics while capturing ground-truth network traces; and the Docker-based PyLot environment provides the reproducible substrate tying simulation, encoding, and emulation together for controlled, measurable experiments.

3.6 Key Design Decisions & Rationale


We chose CARLA's preset Town 5 because of its inherent structure as a perimeter with inner areas, with two specific entry points at the east and west sides of the town. This constrains our environment nicely, while still providing multiple entry points to confirm that global tracking works. The inner structure is mostly grid-like, lending itself to easy spacing of cameras with non-overlapping fields of view.

For our data fusion step, we decided against a machine-learning approach due to the team's limited ML experience. Both algorithmic approaches were suggested by our mentors, then researched and implemented. Kalman filters are a basic, well-studied model for object tracking and have been examined with surveillance-tracking applications in view [Gunjal et al.]. Similarly, tracking of interacting objects has been modeled as a network-flow Mixed Integer Program [Wang et al.].


4. Evaluation & Results

Test Scenarios

We evaluate our system across five scenarios, each named by the number of cars in the scenario, the route ID of each car, and the color of the corresponding car.

| Demo Case | Description | Purpose |
| --- | --- | --- |
| one_car_2 | A single car moves straight across from camera 5 to camera 4, exiting at a different edge camera. | Tests a straight trajectory across cameras; simple case of exiting at a different edge camera. |
| one_car_6 | A single car exits at the same edge camera but completes a loop inside the town. | Tests looping within the town; returns to the same edge camera. |
| two_cars_6_cyan_5_black | Two cars share some similar path components but in succession. | Tests successive, partially overlapping paths for different cars; demonstrates lack of direct interaction. |
| two_cars_6_green_8_black | Two cars enter at similar times from opposite edges, travel spatially far-apart routes, meet briefly, and exit the way they entered. | Tests overlap in timing but spatial diversity; demonstrates brief points of proximity and independent exit routes. |
| three_cars_1_cyan_6_purple_8_white | Builds on two_cars_6_green_8_black with a third car taking a longer route and exiting on the opposite side. | Adds complexity with a third car; highlights longer and independent paths. |

Inner Camera Event Data

As discussed before, our algorithms depend heavily upon the quality of the data provided to them. Below are our results in inner event data accuracy:

| Scenario | Ground Truth | True Positives (Matches) | False Negatives (Missing) | False Positives (Ghosts) |
| --- | --- | --- | --- | --- |
| one_car_2 | 4 | 4 | 0 | 1 |
| one_car_6 | 8 | 6 | 2 | 1 |
| two_cars_6_cyan_5_black | 14 | 14 | 0 | 0 |
| two_cars_6_green_8_black | 13 | 12 | 1 | 3 |
| three_cars_1_cyan_6_purple_8_white | 24 | 23 | 1 | 1 |
| TOTAL | 63 | 59 | 4 | 6 |

Performance Metrics

Edge Camera Event Data

Edge camera data was highly reliable, with the only error being a single ghost event in the three car scenario incorrectly indicating that car 3 exited from camera 4 after it had already exited at camera 5.

Fusion Algorithm

To evaluate the performance trade-offs between the Kalman Filter and the Graph Optimization, we employed three core metrics: Root Mean Square Error (RMSE) for accuracy, Maximum Error for robustness, and Max Drift for stability.

To find the error, we established a frame ID for all three data sets (Ground Truth, Kalman Filter, and Graph Optimization) by mapping the floating-point timestamps to a 20 FPS rate and applying a frame offset to the Ground Truth data. The Root Mean Square Error (RMSE) is then computed as the direct Euclidean distance between the estimated coordinate and the true coordinate at that exact Frame ID. Importantly, to ensure a fair comparison between the high frequency Kalman Filter (10 Hz) and the Graph Optimization output, we subsampled the Kalman predictions to align with the exact timestamps of the Graph Optimization nodes. This ensures that our smoothness metric evaluates the trajectory decisions at identical moments in time, rather than simply penalizing the real-time filter for its higher sampling rate.
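In code, the alignment and error computation reduce to a few lines (a sketch; the frame offset is determined per dataset):

```python
import numpy as np

FPS = 20

def to_frame(timestamp, offset=0):
    """Map a floating-point timestamp to a frame ID at 20 FPS."""
    return int(round(timestamp * FPS)) + offset

def rmse(gt, est):
    """gt, est: dicts frame_id -> (x, y); compare only shared frame IDs."""
    frames = sorted(set(gt) & set(est))
    diffs = np.array([np.subtract(gt[f], est[f]) for f in frames])
    return float(np.sqrt((diffs ** 2).sum(axis=1).mean()))
```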


Test Results

We now present a discussion of each demo scenario, its significance, and failures.

Demo 1: One Car, Simple Route

Kalman Filter (Left) vs. Graph Optimization (Right)

| Frame | Time (s) | Ground Truth | Kalman [Distance Error] | Graph [Distance Error] |
| --- | --- | --- | --- | --- |
| 724 | 36.20 | (34.2, 176.1) | (30.0, 202.5) [E: 26.69] | (30.0, 202.5) [E: 26.69] |
| 794 | 39.70 | (35.1, 152.9) | (30.0, 143.3) [E: 10.89] | (30.0, 142.5) [E: 11.61] |
| 1012 | 50.60 | (35.1, 80.5) | (30.0, 39.9) [E: 40.94] | (30.0, 40.0) [E: 40.86] |
| 1366 | 68.30 | (35.4, -37.0) | (30.0, -49.6) [E: 13.70] | (30.0, -50.0) [E: 14.07] |
| 1670 | 83.50 | (34.6, -137.8) | (35.0, -150.1) [E: 12.27] | (35.0, -150.0) [E: 12.20] |

Comparison of Coordinates Between the Three Datasets

| Car ID | Metric | Kalman | Graph |
| --- | --- | --- | --- |
| 1 | Total Path Length | 721.73 m | 721.73 m |
|   | RMSE (Meters) | 28.07 m | 26.82 m |
|   | Error % | 3.89% | 3.72% |
|   | Max Drift (Meters) | 43.39 m | 43.43 m |

In this simple scenario, both approaches successfully detect all events in order, with no missing, added, or misclassified events. The quantitative metrics show comparable performance, with error rates hovering near 4% of the total path length (3.89% for Kalman vs. 3.72% for Graph). This scenario is mostly a sanity check of our system, since with only one car in the area, identification of anonymous events is trivial. It also exercises global ID tracking, as the car is not assigned a new identity upon exiting at the opposite edge camera. This test confirms that for simple, linear paths with sparse traffic, the Kalman Filter provides sufficient precision without the computational cost of the Graph Optimization.

Demo 2: One Car, Moderate Route

Kalman Filter (Left) vs. Graph Optimization (Right)

| Car ID | Metric | Kalman | Graph |
| --- | --- | --- | --- |
| 1 | Total Path Length | 714.25 m | 714.25 m |
|   | RMSE (Meters) | 19.31 m | 16.17 m |
|   | Error % | 2.70% | 2.26% |
|   | Max Drift (Meters) | 37.21 m | 23.19 m |

This scenario varies from the first in that the car exits from the same edge camera that it entered from. In addition, the route becomes slightly more complex, introducing turns and crossing the same camera (camera 1) at two different times. Both models continue to give similar results, with the Kalman Filter giving a slightly larger error.

Demo 3: Two Cars, Spatially Sparse

Kalman Filter (Left) vs. Graph Optimization (Right)

| Car ID | Metric | Kalman | Graph |
| --- | --- | --- | --- |
| 1 | Total Path Length | 1222.73 m | 1222.73 m |
|   | RMSE (Meters) | 208.00 m | 107.09 m |
|   | Error % | 17.01% | 8.76% |
|   | Max Drift (Meters) | 311.81 m | 242.91 m |
| 2 | Total Path Length | 714.28 m | 714.28 m |
|   | RMSE (Meters) | 220.88 m | 73.88 m |
|   | Error % | 30.92% | 10.34% |
|   | Max Drift (Meters) | 460.63 m | 192.25 m |

This scenario introduces a second car. Both cars enter the town within a few seconds of each other from different edge cameras and traverse paths inside the town. They near the center of the map at similar times, providing a test of spatial ambiguity. Due to an error in the inner camera events, where camera 9 reported 3 additional spurious events, both algorithms predicted an event at the wrong location, resulting in the high Max Drift. However, the Graph Optimization method was able to recover the trajectory using global consistency, maintaining 8.76% and 10.34% error rates for the two cars respectively despite the difficult conditions, compared to the Kalman Filter's 17.01% and 30.92%. This demonstrates the Graph method's superior ability to recover from noisy data.

Demo 4: Two Cars, Temporally Sparse

Kalman Filter (Left) vs. Graph Optimization (Right)

| Car ID | Metric | Kalman | Graph |
| --- | --- | --- | --- |
| 1 | Total Path Length | 714.18 m | 714.18 m |
|   | RMSE (Meters) | 15.11 m | 15.80 m |
|   | Error % | 2.12% | 2.21% |
|   | Max Drift (Meters) | 22.25 m | 22.29 m |
| 2 | Total Path Length | 843.86 m | 843.86 m |
|   | RMSE (Meters) | 12.30 m | 12.05 m |
|   | Error % | 1.46% | 1.43% |
|   | Max Drift (Meters) | 21.29 m | 21.29 m |

This scenario demonstrates the dependence of our algorithms on event data quality. Here, we were able to obtain accurate inner camera events. As a result, both algorithms achieved minimal error rates, confirming that when data quality is high, the Kalman Filter can be just as effective as Graph Optimization.

Demo 5: Three Cars

Kalman Filter (Left) vs. Graph Optimization (Right)

| Car ID | Metric | Kalman | Graph |
| --- | --- | --- | --- |
| 1 | Total Path Length | 1222.69 m | 1222.69 m |
|   | RMSE (Meters) | 343.13 m | 25.83 m |
|   | Error % | 28.06% | 2.11% |
|   | Max Drift (Meters) | 694.01 m | 61.08 m |
| 2 | Total Path Length | 714.25 m | 714.25 m |
|   | RMSE (Meters) | 74.62 m | 16.44 m |
|   | Error % | 10.45% | 2.30% |
|   | Max Drift (Meters) | 206.56 m | 26.84 m |
| 3 | Total Path Length | 2164.59 m | 2164.59 m |
|   | RMSE (Meters) | 30.97 m | 14.24 m |
|   | Error % | 1.43% | 0.66% |
|   | Max Drift (Meters) | 72.70 m | 27.15 m |

The three-car scenario contained our only error in edge camera data, where one car incorrectly had three edge events, with a second, later exit being recorded at camera 4 after its correct exit at camera 5. Both algorithms incorrectly selected this faulty exit, which produced the high maximum drift seen above. However, as in Demo 3, the Graph Optimization was able to recover the correct trajectory and maintain a low error despite the conditions.

The close encounter between cars 1 and 3, when they pass through the same intersection and proceed in opposite directions, proved challenging for both algorithms. The Kalman filter incorrectly assigned a ghost event at camera 8 belonging to car 1 to car 3, and the Graph Optimization incorrectly assigned the event at camera 3 to car 1 instead of car 3. For the Kalman Filter approach, the mistake might have been mitigated if the missing event at camera 12 had been present, modifying the trajectory from camera 10 and picking up the camera 9 event instead of car 3's. For the Graph Optimization approach, encoding more knowledge of the road structure, such as the fact that car 1 could not have passed camera 3 immediately after camera 10 (it must first pass camera 9 or 7), could deterministically fix the misclassification.


5. Discussion

This project successfully validated the core feasibility of fusing heterogeneous data sources for urban vehicle tracking. By demonstrating that encrypted 802.11 Wi-Fi traffic can be combined with sparse video data, we proved that vehicle trajectories can be reliably reconstructed even in challenging "blind" urban zones. Overall, the results highlight a core trade-off between latency and accuracy, as well as the strengths and weaknesses of different fusion strategies.

Graph Optimization proved to be the superior method for trajectory reconstruction. By leveraging future context to satisfy global geometric constraints, it consistently maintained a low error rate and effectively mitigated sensor noise and false-positive inner-camera events. Kalman Filtering (real-time) demonstrated efficiency and precision in sparse, linear scenarios (Demos 1 and 4), achieving error rates as low as roughly 1.5% when data quality was high. However, its reliance on a constant-velocity model and lack of future context caused higher error rates and left it unable to compensate for inaccurate camera data, as demonstrated by severe drift in complex scenarios (e.g., Demo 5).

The single edge-event ghost error we observed could be combated by algorithmically ignoring any duplicate exit events for a particular car ID once the first exit is recorded. This does not protect against incorrectly early ghost events, however, and high-volume testing is recommended for further robustness optimization.

Our initial assumption was that all cameras would share a similar noise baseline, but we found that some cameras (9 and 19) exhibited a noise floor nearly double that of the others, causing standard thresholds to trigger false positives constantly. In an attempt to capture fewer events we made the settings stricter, but that caused the algorithm to detect no events at all. With more time we could address this by collecting more data and deriving average settings for these cameras rather than custom-tuning them, or by tracing where this elevated noise originates. Our features currently focus on bitrate, but future iterations could incorporate signals such as packet size variance or inter-arrival times to better distinguish signal patterns and improve the model.

Overall, this project confirms that combining encrypted Wi-Fi traffic with sparse video can yield reliable cross-zone vehicle tracking, provided the reconstruction algorithm is capable of enforcing global consistency. With stronger environmental modeling and richer signal features, this fusion framework could scale into a practical, city-wide blind-zone tracking system.

5.1 Future Work

Future development of the tracking fusion algorithm could focus on enriching the current cost function, which relies primarily on the Mahalanobis distance, by incorporating motion-model penalties, temporal consistency terms, and appearance-based similarity scores from the edge-camera ReID system. These additions would allow the association stage to weigh not only geometric proximity but also velocity continuity, expected travel time, and the likelihood that two events correspond to the same vehicle identity. Introducing environmental constraints into the grid, such as road topology, lane connectivity, and turn restrictions, would further enable path-feasibility checks that rule out implausible assignments. By simulating CARLA’s road network directly within the fusion module, the tracker could verify that a vehicle could realistically move from its previous state to the event location within the allowed time interval, which is especially important in dense or ambiguous scenarios. Together, these enhancements would shift the fusion algorithm from a purely distance-based decision process toward a structured, constraint-aware inference system that better reflects the physical and semantic properties of urban movement.

As a next step beyond the Kalman and graph-based fusion, we propose a multimodal deep learning model that jointly consumes perimeter-camera detections and inner-camera encrypted-traffic signals for vehicle-level inference. At each time step, we construct a feature vector that concatenates (i) spatial context from the perimeter cameras, such as a low-resolution occupancy grid, a one-hot encoding of active cameras, recent per-vehicle world coordinates, velocities, and time since last visual observation, with (ii) side-channel features derived from the inner cameras' packet traces, including per-frame total bytes, packet count, mean and standard deviation of packet size, and mean and standard deviation of inter-arrival time. These sequences are fed into a hybrid architecture in which a small CNN encodes the spatial occupancy grid, a two-layer bidirectional LSTM (hidden size 128 with dropout) encodes the temporal evolution of the side-channel features, and the resulting embeddings are fused via concatenation. This is followed by two fully connected layers with ReLU activations to produce per-vehicle outputs. Depending on the task, the decoder can predict either continuous future positions for each vehicle or a discrete occupancy distribution over the inner-camera grid. Training data can be generated entirely in simulation by recording synchronized edge detections, ground-truth trajectories, and per-camera 802.11 traces in CARLA and Mininet-WiFi, then slicing them into overlapping windows of length T (for example 16 frames) where the first T frames form the input and the next K frames (for example 5–10) form the prediction target. To improve robustness to real-world conditions such as packet loss and noise, we can augment these sequences by randomly dropping packets, jittering inter-arrival times, perturbing bitrate statistics, or synthetically adding low-amplitude background traffic while preserving the alignment between the side-channel features and ground-truth vehicle motion.
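A hypothetical sketch of this proposed architecture (all layer sizes and the six side-channel features are illustrative; nothing here is implemented yet):

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """CNN over the occupancy grid + BiLSTM over side-channel sequences,
    fused by concatenation; predicts K future (x, y) positions."""
    def __init__(self, grid=16, side_feats=6, hidden=128, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Flatten(),
            nn.Linear(8 * (grid // 2) ** 2, 64),
        )
        self.lstm = nn.LSTM(side_feats, hidden, num_layers=2,
                            batch_first=True, dropout=0.1,
                            bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(64 + 2 * hidden, 128), nn.ReLU(),
            nn.Linear(128, horizon * 2),
        )

    def forward(self, grid_img, side_seq):
        g = self.cnn(grid_img)                   # (batch, 64)
        s, _ = self.lstm(side_seq)               # (batch, T, 256)
        fused = torch.cat([g, s[:, -1]], dim=1)  # last-timestep summary
        return self.head(fused).view(-1, self.horizon, 2)
```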


6. References


7. Supplementary Material

7.a. Software

External Libraries & Models

Internal Modules

External Tools & Services

Dataset

The dataset used by this report is provided in the GitHub repository under the ./demo folder and in this Google Drive Folder. To generate these datasets, please refer to ./README.md in the GitHub repository.