Phase 2E: Location Clustering
Clusters raw GPS coordinates into categorical Place IDs using DBSCAN.
Run
Output
data/processed/place_ids.csv with the following columns:
| Column | Description |
|---|---|
| id | Unique record identifier |
| user_id | User identifier |
| raw_lat | Original latitude |
| raw_lon | Original longitude |
| place_id | Cluster identifier (e.g., place_01) |
| centroid_lat | Cluster centroid latitude |
| centroid_lon | Cluster centroid longitude |
| is_new_cluster | For incremental processing |
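
As a rough illustration of consuming this file, the sketch below counts records per `place_id` using only the standard library. The function name `visits_per_place` and the sample rows (including the `False`/`True` spelling of `is_new_cluster`) are made up for the example; only the column names come from the table above.

```python
import csv
import io
from collections import Counter


def visits_per_place(csv_file):
    """Count how many records fall into each place_id."""
    reader = csv.DictReader(csv_file)
    return Counter(row["place_id"] for row in reader)


# Hypothetical sample rows; the real file is data/processed/place_ids.csv.
SAMPLE = """id,user_id,raw_lat,raw_lon,place_id,centroid_lat,centroid_lon,is_new_cluster
1,u1,52.52007,13.40495,place_01,52.5201,13.4050,False
2,u1,52.52021,13.40512,place_01,52.5201,13.4050,False
3,u2,48.85661,2.35222,place_02,48.8566,2.3522,True
"""

counts = visits_per_place(io.StringIO(SAMPLE))
print(counts.most_common())
```

To run it against the real output, pass an open file handle instead of the `io.StringIO` sample, for example `open("data/processed/place_ids.csv", newline="")`.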
Algorithm
- Snap-to-grid: Truncate coordinates to 4 decimal places (~11 m spatial buffer)
- DBSCAN: Haversine metric, ε ≈ 7.85×10⁻⁶ radians (~50 m on the Earth's surface), min_samples=1 (every point joins a cluster, so no point is labeled as noise)
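
The two steps above can be sketched in plain Python. This is not the project's implementation: it relies on the fact that with min_samples=1 every point is a core point, so DBSCAN's clusters are exactly the connected components of the graph linking points within ε, and it finds those components with a small union-find instead of calling a library. The function names (`snap`, `haversine_rad`, `cluster`), truncation toward zero, and the order in which `place_NN` labels are assigned are assumptions of this sketch.

```python
import math

EPS_RAD = 7.85e-6  # epsilon from this document, ~50 m as an angle in radians


def snap(value, places=4):
    """Truncate a coordinate at `places` decimals (toward zero; an assumption)."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor


def haversine_rad(p, q):
    """Great-circle distance between two (lat, lon) pairs in degrees, in radians."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def cluster(points, eps=EPS_RAD):
    """Snap points to the grid, then label connected components of the eps graph."""
    snapped = [(snap(lat), snap(lon)) for lat, lon in points]
    parent = list(range(len(snapped)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise scan: fine for a sketch, too slow for large inputs.
    for i in range(len(snapped)):
        for j in range(i + 1, len(snapped)):
            if haversine_rad(snapped[i], snapped[j]) <= eps:
                parent[find(i)] = find(j)

    labels = {}  # component root -> running cluster number
    return [f"place_{labels.setdefault(find(i), len(labels) + 1):02d}"
            for i in range(len(snapped))]


points = [(52.52007, 13.40495), (52.52021, 13.40512), (48.85661, 2.35222)]
print(cluster(points))
```

The first two points lie roughly 26 m apart after snapping and end up in the same place, while the third is far away and gets its own. For real data volumes a library DBSCAN backed by a spatial index would be the more practical choice than this quadratic scan.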
Design Choice
Location is treated as categorical context (a Place ID), not as a continuous coordinate vector. This reduces the risk of overfitting to GPS noise.
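
To make the categorical treatment concrete, here is one possible encoding: each `place_id` becomes a one-hot vector, so two nearby fixes in the same place map to an identical feature regardless of GPS jitter. One-hot encoding and the helper name `one_hot_places` are assumptions for illustration; the source does not say how downstream stages encode the Place ID.

```python
def one_hot_places(place_ids):
    """Return (vocabulary, one-hot rows) for a sequence of place_id strings."""
    vocab = sorted(set(place_ids))
    index = {place: i for i, place in enumerate(vocab)}
    rows = [[1 if index[place] == i else 0 for i in range(len(vocab))]
            for place in place_ids]
    return vocab, rows


vocab, rows = one_hot_places(["place_02", "place_01", "place_02"])
print(vocab)
print(rows)
```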
Parameters
| Parameter | Value | Description |
|---|---|---|
| Grid precision | 4 decimal places | ~11 m spatial buffer |
| DBSCAN ε | 7.85×10⁻⁶ rad | ~50 m radius |
| min_samples | 1 | Single-point clusters allowed |
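
The metric equivalents in the table can be checked with a few lines of arithmetic. The sketch assumes a spherical Earth with mean radius 6,371 km, which the source does not state; the helper names are hypothetical. Note that the ~11 m figure holds for latitude everywhere but for longitude only at the equator, since east-west spacing shrinks with the cosine of the latitude.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres


def metres_to_radians(metres):
    """Convert a surface distance to the angle it subtends at the Earth's centre."""
    return metres / EARTH_RADIUS_M


def grid_step_metres(places, lat_deg=0.0):
    """Approximate (north-south, east-west) size in metres of one grid step."""
    step_rad = math.radians(10 ** -places)
    north_south = step_rad * EARTH_RADIUS_M
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


eps = metres_to_radians(50)               # roughly 7.85e-6 rad
ns, ew = grid_step_metres(4)              # roughly 11.1 m each at the equator
ns60, ew60 = grid_step_metres(4, 60.0)    # east-west step halves at 60 degrees
print(f"eps={eps:.3e} rad, step={ns:.2f} m, step at 60 deg={ew60:.2f} m")
```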