A novel cross-lingual benchmark dataset comprising nearly 99,869 real traffic incident records from Vienna (2013-2023) to evaluate LLMs’ Spatio and Temproal robustness as multilingual agents for hallucination problem
+⭐ 08th Sep. News: You can now access our dataset using Google Cloud Storage as well! if you want to utilize our data with LlamaIndex Vertex AI for RAG!
Website: H&PS Hallucination Traffic Incidents Dataset
The H&PS Traffic Incidents Dataset comprises 99,869 real incident records in Vienna public transportation system, categorized into 14 distinct scenarios: Faulty Vehicles, Acute Track Damages, Acute Switch Damages, Overhead Line Faults, Signal Faults, Rescue Operations, Police Interventions, Fire Brigade Interventions, Illegal Parking, Traffic Accidents, Demonstrations, Events, Delays, and Other Incidents.
This dataset has been/could be downloaded via Kaggle, paper with code, and Hugging face!
Category | Details |
---|---|
LLM Models Covered | GPT series include GPT-4, TinyLlama, Claude-3-Haiku, Claude-3-Sonnet, Gemini-Pro 1.0, Mistral Medium, Mistral-8x7B, Llama-3-70B |
Dataset Complexity | - Both Temporal and Spatio domain logical reasoning tasks. |
Number of Records | - 99,869 real traffic incident records. |
Year of Records | - Over ten years (2013 to 2023). |
Covered Variants | - Over 500 tramcars, more than 131 bus lines. |
Covered Variants | - 5 underground lines (U1, U2, U3, U4, U6). |
Covered Variants | - 24 night lines. |
Covered Variants | - More than 1,076 Tram Stop Stations. |
Covered Variants | - 4,291 Bus Stop Stations. |
Prompt Torken Length | - Daily sentence tokens > 4K. |
Language Types | - Both in German and English. |
Format of Representation | json format |
Sample of Dataset | { |
IncidentID | "id": 1, |
Incident Category | "title": "U3: Polizeieinsatz", |
Incident Description | "description": "Wegen eines Polizeieinsatzes in der Station Landstrasse S U ist die Linie U3 in Fahrtrichtung Simmering an der Weiterfahrt gehindert...Das Störungsende ist derzeit nicht absehbar.", |
Incident Start Time | "start": "2023-11-21 12:26:12", |
Traffic Delay Start time | "traffic_start": "2023-11-21 12:27:42", |
Incident End Time | "end": "", |
Effect Lines | "lines": "U3" |
Sample of Dataset | } |
- The actual start time of a traffic delay can sometimes differ from the recorded start time. The recorded start time refers to when support services first receive the incident report. Due to the current legacy manual system for incident reporting and monitoring, the end time is optional and is sometimes not documented in the database, adding to the complexity.
- The total number of incident categories could exceed 20, given that definitions have evolved over the past decade. However, for this study, we focus on testing LLMs against 14 distinct scenarios: Faulty Vehicles, Acute Track Damages, Acute Switch Damages, Overhead Line Faults, Signal Faults, Rescue Operations, Police Interventions, Fire Brigade Interventions, Illegal Parking, Traffic Accidents, Demonstrations, Events, Delays, and Other Incidents.
Störungen nach Jahr | 2013* | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Gesamt |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Schadhafte Fahrzeuge (Faulty Vehicles) | 132 (446) | 477 | 592 | 966 | 1282 | 921 | 949 | 1062 | 1527 | 1753 | 2326 | 11987 |
Akute Gleisschäden (Acute Track Damages) | 11 (37) | 46 | 38 | 63 | 48 | 53 | 33 | 52 | 50 | 54 | 70 | 518 |
Akute Weichenschäden (Acute Switch Damages) | 1 (3) | 4 | 11 | 57 | 58 | 59 | 68 | 64 | 45 | 57 | 69 | 493 |
Fahrleitungsgebrechen (Overhead Line Faults) | 16 (54) | 69 | 77 | 94 | 104 | 114 | 101 | 70 | 108 | 80 | 100 | 933 |
Signalstörungen (Signal Faults) | 2 (7) | 20 | 21 | 45 | 25 | 27 | 28 | 21 | 40 | 48 | 65 | 342 |
Rettungseinsätze (Rescue Operations) | 198 (669) | 701 | 912 | 1247 | 1341 | 1224 | 1378 | 1188 | 1693 | 1955 | 2413 | 14250 |
Polizeieinsätze (Police Interventions) | 54 (183) | 266 | 442 | 783 | 759 | 702 | 653 | 679 | 1062 | 1236 | 1289 | 7925 |
Feuerwehreinsätze (Fire Brigade Interventions) | 17 (57) | 84 | 152 | 267 | 274 | 305 | 325 | 287 | 325 | 323 | 403 | 2562 |
Falschparker (Illegal Parking) | 137 (463) | 507 | 778 | 953 | 975 | 1017 | 1124 | 932 | 1246 | 1328 | 1229 | 10126 |
Verkehrsunfälle (Traffic Accidents) | 394 (1332) | 1386 | 1466 | 1749 | 1549 | 1457 | 1528 | 1292 | 1761 | 1879 | 2102 | 16563 |
Demonstrationen (Demonstrations) | 0 (0) | 25 | 40 | 44 | 40 | 89 | 147 | 122 | 215 | 239 | 252 | 1213 |
Veranstaltungen (Events) | 0 (0) | 0 | 0 | 20 | 70 | 71 | 70 | 107 | 122 | 107 | 81 | 648 |
Verspätungen (Delays) | 1675 (5661) | 4838 | 2608 | 1213 | 325 | 468 | 944 | 1137 | 3320 | 8048 | 2502 | 26976 |
Sonstige (Other Incidents) | 220 (744) | 651 | 655 | 339 | 410 | 394 | 428 | 503 | 647 | 943 | 1724 | 6914 |
Störungen gesamt (Total Incidents) | 2863 (9676) | 9074 | 7812 | 7890 | 7261 | 6900 | 7813 | 7431 | 12258 | 17877 | 14625 | 102804 |
Please note: Data collection for 2013 covers only from September 14th to December 31st. The numbers in parentheses represent a hypothetical projection for the entire year.
├── hypothesize # Our hypothesized dataset based on hypothesis: given sentence indexing, date-to-text conversion, and German-to-English translation.
├── original # Original incident records in a JSON file, includes real traffic records from Vienna.
├── results-1 # Results files and logs for hypothesis 1.
├── results-2 # Results files and logs for hypothesis 2.
├── results-3 # Results files and logs for hypothesis 3.
├── Spatio # Results files and logs for spatial-related experiments, includes prompt samples, and all the traffic logs and ground truth.
├── Temporal # Results files and logs for time/temporal-related experiments, includes prompt samples, and all the traffic logs and ground truth.
├── webfile # Web files used for building the website and README, and incidents.
├── database.json # Simple dataset you can easily play with using LLMs, since the original file includes 99K incidents, major LLMs won't accept such token length input.
├── gptapt-ipynb # GPT Jupyter notebook for experiments, typically closed-source, enterprise LLM. You will need to have an API key (key.txt) for experiments.
├── tinyllama.ipynb # TinyLlama Jupyter notebook for experiments, open-source LLM. You can download the model weights and do local testing.
├── Licence and legal declaration # CC BY 4.0 but with imporant Declaration
└── README.md
If you find this dataset useful in your research, please consider citing it. We extend our sincere thanks to those who contributed to our paper, helped significantly with the development of GenAI applications, dataset cleaning, and hypothesis brainstorming sessions.
We strongly against using this dataset for any kind of warfare or activities targeting human beings. We reserve the right to revoke or completely remove our dataset for such purposes and will not be responsible for any kind of loss resulting from such actions.
This dataset is intended to encourage public thinking about how LLMs might responsibly answer spatio-temporal queries, thereby helping to build robust systems for our community.
Ongoing XAI Project with targeting of exploring llm hallucination problem
-
Open_LLM_leaderboard for select LLM: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
-
Opensource LLama model: https://github.com/jzhang38/TinyLlama/blob/main/EVAL.md
🌟LangFlow and Vector Store RAG With our Dataset as Context Embedding + Spatio/ Temporal Query Embedding 🌟
| Category | Prompt/Questions | GPT 4 Response | RAG embedded GPT-4
|--------------------|------------------------------------------------------------------------------------------------------|----------------| ----------------
| **Space** | From Schloss Schönbrunn to Musikverein Wien on 21st Nov 2023, am my trip affected? | ✓ | x
| | From Haus des Meeres to U-Bahn-Station Roßauer Lände on 21st Nov 2023, am my trip affected? | ~ | x
| | From Theater in der Josefstadt to Naturhistorisches Museum Wien on 19th September 2023, am my trip affected? | ~ | x
| | From Museum für angewandte Kunst to Wiener Kriminalmuseum on 19th September 2023, am my trip affected? | ✓ | x
| **Time** | List of disruption causes per hour? | ✓ | ✗
| | Lines with most disruptions during peak hours? | ✗ | ✓
| | Time spans with most disruptions? | ✗ | ~
| | First and last disruption of the year? | ✗ | ✓
| | 3 disruptions with the greatest impact? | ✗ | ~
| | 3 events with the longest duration? | ✓ | ✗
| | The average duration of all events? | ✓ | ✗
| | All events starting between 6 AM and 6 PM | ✗ | ✗
| | All 'Long events' and their average duration | ✗ | ✗
| | The total duration of events by time of day? | ✗ | ✗
| Question (selective) | RAG OUTPUT sample |
|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Average duration of all events | "To calculate the average duration of all events, we need to determine the duration for each event from the provided context. Identify the events: Each event has a start time and, where applicable, an estimated end time or an indication of how long it lasts. Calculate the duration of each event: For events with a specified end time, subtract the start time from the end time. For those with a duration stated (like ""until X"" or ""lasts until X""), use that information to calculate the duration. Convert the duration to a consistent unit: For this calculation, we can convert all durations to minutes. Average the durations: Sum all the durations and divide by the total number of events. However, since the detailed calculation requires processing each event, I can provide a general outline of how you would do this, but I cannot compute the exact average without all the specific durations calculated from the data provided. If you need further assistance in calculating this or need a specific approach, please let me know!" |
| 3 disruptions with the greatest impact? | "Based on the provided context, the three disruptions with the greatest impact are: U3: Polizeieinsatz - The U3 line in the direction of Simmering is unable to proceed due to a police operation at the Landstraße S U station, with no estimated end time. 40A: Falschparker - The 40A line is hindered in the direction of Döblinger Friedhof due to a parked vehicle in the Sechsschimmelgasse area, with an expected disruption until 13:15. 40 , 41: Falschparker - Both the 40 and 41 lines are blocked in the direction of Gersthof and Pötzleinsdorf due to a parked vehicle in the Gentzgasse area, with an expected disruption until 12:45. These disruptions are significant as they involve major lines and ongoing police actions, which tend to affect larger numbers of passengers." |
| Total duration of events by time of day | "To calculate the total duration of events by time of day, we need to extract the start and end times from the provided context and determine the duration of each event. Here's a breakdown of the events along with their respective durations: 40A: Falschparker Start: 12:46:57 End: 13:15:00 (assumed based on the duration stated) Duration: 28 minutes U3: Polizeieinsatz Start: 12:26:12 Duration: Unknown (stated as not foreseeable) 40 , 41: Falschparker Start: 12:20:54 End: 12:45:00 (assumed based on the duration stated) Duration: 24 minutes U6: Betriebsstörung Start: 12:11:12 End: 12:26:02 Duration: 15 minutes 44: Schadhaftes Fahrzeug Start: 12:02:47 End: 12:30:00 (assumed based on the duration stated) Duration: 27 minutes 66A: Rettungseinsatz Start: 11:27:16 End: 11:30:00 (assumed based on the duration stated) Duration: 3 minutes U3: Schadhaftes Fahrzeug Start: 10:51:03 End: 12:25:01 Duration: 94 minutes 10: Feuerwehreinsatz Start: 10:08:10 End: 10:40:00 (assumed based on the duration stated) Duration: 32 minutes 12A: Schadhaftes Fahrzeug Start: 10:02:33 End: 10:36:00 (assumed based on the duration stated) Duration: 34 minutes O: Demonstration Start: 09:55:25 End: 10:30:00 (assumed based on the duration stated) Duration: 35 minutes ... and so forth for each event listed. The total duration of events can be calculated by summing the durations of all events that have an explicit duration. For events without a clear end time, we cannot contribute to the total. To summarize, if we were to categorize all of them by hour and add the durations, we would have a total duration of the events per hour. Finally, the results would look something like this: 09:00 - 10:00: Total Duration = X minutes 10:00 - 11:00: Total Duration = Y minutes 11:00 - 12:00: Total Duration = Z minutes 12:00 - 13:00: Total Duration = W minutes You would need to sum all the durations for each time slot accordingly." |
| Time spans with most disruptions | "Based on the provided context, the time spans with the most disruptions can be identified as follows: 12:00 PM - 1:00 PM (UTC+1): Multiple disruptions affecting lines 40A, U3, and 41 due to various issues including false parking and police actions. 11:00 AM - 12:00 PM (UTC+1): Disruptions noted for lines U3, 66A, and 44, including police actions and faulty vehicles. 10:00 AM - 11:00 AM (UTC+1): Multiple incidents including police actions affecting lines O and U3, along with disruptions on lines 10, 12A, and 66A. These time spans indicate periods where multiple transit lines experienced significant service interruptions due to various incidents." |
| Lines with most disruptions during peak hours | "Based on the provided context, here are the lines with the most disruptions during peak hours (which typically include busy times such as mornings and late afternoons): U3: This line has multiple disruptions due to police activity, a faulty vehicle, and irregular service noted on several occasions. U4: There are numerous mentions of disruptions due to rescue operations and faulty vehicles impacting service. 10: This line has faced disruptions due to both a fire department operation and a traffic accident. 43: This line has been affected by police interventions and traffic accidents, leading to service interruptions. 40A: This line has seen disruptions due to parking violations and traffic congestion. 49: This line has been impacted by traffic accidents causing delays. 2: This line has experienced several disruptions from rescue operations and vehicle issues. These lines have shown a pattern of disturbances during peak hours, making them significant in terms of service reliability." |
| First and last disruption of the year | "The first disruption of the year was on January 1, 2023, with a title related to a 'Falschparker' (wrongfully parked vehicle) affecting the line 5. The last disruption recorded was on November 21, 2023, involving a 'Falschparker' that hindered the line 40A, expected to last until 13:15." |
Install LangFlow using pip. Ensure you have Python 3.10 or above installed.
- Command to check Python version:
python --version
or
python3 --version
- If Python is not installed:
- Download and install it from python.org.
- Open your terminal and run the following command:
pip install langflow --pre --force-reinstall
Explanation:
pip install
is the command to install Python packages.langflow
is the name of the package.--pre
allows the installation of pre-release versions.--force-reinstall
forces the reinstallation of the package even if it's already installed.
- Command to run LangFlow:
langflow run
- Open LangFlow in your browser:
- Navigate to
http://localhost:7860
if it doesn't open automatically. - Can also insert our H&PS langflow Json
- Navigate to
- Steps:
- Click "New Project."
- Select "Blank Flow."
- From the sidebar, drag and drop the "Text Input" and "Chat Input" components.
- Rename "Text Input" to "Name":
- Set it to capture the user's name.
- Connect the output of "Name" to the sender name input of "Chat Input."
- Add a "Prompt" component.
- Edit the template to include placeholders for context, question, and history:
Hey, answer the user's question based on the following context:
The context is this: {context}
And this is the message History: {history}
The users question is this: {question}
- Drag and drop the "Chat Memory" component.
- Connect the name input to the session ID of the chat memory.
- Drag and drop the "Chat Output" component.
- Connect the output of the OpenAI component to the chat output.
- Set the sender name to "AI."
- Sign up at OpenAI.
- Steps:
- Go to API keys in your OpenAI dashboard.
- Click "Create new secret key."
- Name the key to LangFlow-API (Makes things easier if you use more than one OpenAI API key)
- Give it access to everything.
- Copy the generated key.
- Add an "OpenAI" component.
- Name it OpenAI_Key_New.
- Paste your OpenAI API key in the Value field (Box).
- Make the 'Type' Credential.
- Click Save Variable.
- Drag the Chat Output from the menu and drag it to the left of OpenAI and place it beside.
- Drag and link the connection from OpenAI 'text' to the Chat Output 'Message'.
- Connect the prompt to the OpenAI component.
- Go to DataStax Astra and sign up.
- Steps:
- Click on "Create Database."
- Select "Serverless Vector Database."
- Name your database (e.g., " you can directly use our H&PS LangFlow").
- Choose your cloud provider and region.
- Click "Create Database."
- Add an "Astra DB" component.
- Enter your Astra DB endpoint, token, and collection name.
- Connect the embeddings to the Astra DB component.
- Add a "File Loader" component.
- Upload your Json file (e.g., monthly or Yearly dataset).
- Add a "Split Text" component.
- Connect the file loader to the split text component.
- Add an "OpenAI Embeddings" component.
- Connect the split text output to the embeddings input.
- Connect the chat input to the Vector Search component.
- Connect the embeddings to the Vector Search component.
- Connect the output of the Vector Search component to the prompt context input.
If you find our Dataset useful in your research, please consider citing:
@inproceedings{
li2025how,
title={How {LLM}s React to Industrial Spatio-Temporal Data? Assessing Hallucination with a Novel Traffic Incident Benchmark Dataset},
author={Qiang Li and Mingkun Tan and Xun Zhao and Dan Zhang and Daoan Zhang and Shengzhao Lei and Anderson S. Chu and Lujun Li and Porawit Kamnoedboon},
booktitle={2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics Industry Track},
year={2025},
url={https://openreview.net/forum?id=gL5i5HUin1}
}