
Traffic Injury Prevention

Comparison and Validation of Injury Risk Classifiers for Advanced Automated Crash Notification Systems

Kristofer Kusano Virginia Tech 440 Kelly Hall, 325 Stanger St (MC 0194) Blacksburg, VA 24061 [email protected]

Hampton C. Gabler Virginia Tech 445 Kelly Hall, 325 Stanger St (MC 0194) Blacksburg, VA 24061 [email protected]

APPENDIX A: DETAILS OF SELECTED CASES

In NASS/CDS years 2002 to 2011, there were a total of 48,826 crashes, corresponding to 22.4 million weighted collisions. These crashes involved 88,002 vehicles, corresponding to 40.4 million weighted vehicles. Vehicles used to fit models were restricted in a similar way as in Kononen et al. (2012). Data exclusions are summarized in Table 1.

Rollovers were excluded because predictors such as ∆V are not estimated for most rollover crashes and, even if they were, they may not describe injury risk as well as they do in planar collisions. Because injury risk in rollover crashes is so great, it would be reasonable to assume that all rollovers would be treated as serious injury crashes in an AACN system. Vehicles in which the primary damage was to the top or undercarriage of the vehicle were also excluded because they are not planar collisions. The largest proportion of excluded vehicles were those with a missing or unknown clock direction of force. This value is associated with the reconstruction of ∆V. We also excluded cases with extreme weights, defined as case weights greater than 5,000. Kononen et al. (2012) argued that these extreme-weight cases introduce more variance into estimates than their exclusion adds bias. The high-weight cases accounted for less than 1% of raw vehicles yet 20% of weighted cases. Finally, vehicles that were not passenger vehicles (e.g., heavy vehicles, motorcycles, buses) were excluded. In total, 48,401 vehicles, corresponding to 17.9 million weighted vehicles or 44% of all weighted vehicles in NASS/CDS, were included in the sample.

Table 1. Number of Vehicles Meeting Each Exclusion Criterion as a Proportion of All Vehicles in NASS/CDS 2002-2011

Exclusion                                      N          Freq    % of Vehicles
All Vehicles                              88,002    40,352,597    100%
Rollover                                   7,051     2,171,496    5.4%
Primary Damage is Top or Undercarriage     3,635     1,228,813    3.0%
Unknown Clock Direction of Force          36,286    17,142,268    42%
Extreme Weight (>5,000)                      742     8,066,132    20%
Not a Passenger Vehicle                    3,490     1,290,941    3.1%
Total Vehicles Included                   48,401    17,946,628    44%
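The contrast between raw counts and weighted frequencies in Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from Table 1, while the names `TABLE1`, `raw_share`, and `weighted_share` are hypothetical helpers introduced here, not part of the study's code.

```python
# Raw counts (N) and weighted frequencies (Freq) copied from Table 1.
TABLE1 = {
    "All Vehicles": (88_002, 40_352_597),
    "Rollover": (7_051, 2_171_496),
    "Unknown Clock Direction of Force": (36_286, 17_142_268),
    "Extreme Weight (>5,000)": (742, 8_066_132),
    "Total Vehicles Included": (48_401, 17_946_628),
}


def raw_share(label):
    """Percentage of raw (unweighted) vehicle records in the given row."""
    return 100.0 * TABLE1[label][0] / TABLE1["All Vehicles"][0]


def weighted_share(label):
    """Percentage of weighted vehicles in the given row."""
    return 100.0 * TABLE1[label][1] / TABLE1["All Vehicles"][1]


if __name__ == "__main__":
    for label in TABLE1:
        print(f"{label:35s} raw {raw_share(label):5.1f}%  "
              f"weighted {weighted_share(label):5.1f}%")
```

For the extreme-weight row this reproduces the statement in the text: fewer than 1% of raw vehicles but about 20% of weighted vehicles. The exclusion rows are not mutually exclusive, so their counts do not sum to the difference between all vehicles and included vehicles.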

Some vehicles that met the inclusion criteria had missing values for some key predictors, as summarized in Table 2. In 28% of vehicles, there was no occupant information on injury outcome, age, or sex. In 34% of vehicles, ∆V was not reconstructed by the NASS/CDS investigators. The ∆V is reconstructed using a program called WinSmash that correlates measured vehicle crush to absorbed energy (Sharma et al. 2007). Sometimes the investigators are unable to locate the vehicle to measure crush, or the vehicle has already been repaired. The ∆V also cannot be estimated by this methodology if there is more than one impact to the same location on the vehicle, or in sliding impacts. Finally, seat belt use was not estimated in 30% of vehicles. Excluding cases with missing values left 25,353 vehicles, corresponding to 8.4 million weighted vehicles. This is 47% of cases that met the inclusion criteria and 20% of all vehicle records in NASS/CDS.

Table 2 Included Vehicles with Missing Information

Missing Value                                  n         Freq    % of Included Vehicles
No Occupant Information (ISS/Age/Sex)     11,055    5,035,697    28%
Delta-V                                   15,587    6,166,663    34%
Front Seat Belt Status                    12,024    5,345,913    30%

AACN will only activate in crashes where there is a significant impact. Kononen et al. (2012) defined the cases where their hypothetical ACN would notify emergency responders as crashes where the ∆V was greater than 15 mph or where an airbag deployed. In our sample, 27% of crashes had a ∆V greater than 15 mph, 40% had an airbag deployment, and 51% had either. The 16,398 vehicles with a ∆V greater than 15 mph or an airbag deployment, corresponding to 4.31 million collisions, formed the final dataset used in the model evaluation.

Table 3 compares the frequency of the model predictors between the non-notification and notification datasets. Frequencies for the 25,353 cases with no missing information are shown in Table 4. The distribution of the number of impacts was identical in the two sets. Frontal impacts were far more common in the notification dataset than in the non-notification dataset. The large number of frontal impacts in the notification set could be due to the prevalence of frontal airbags, compared to side or curtain airbags, in the NASS/CDS vehicle population. Distributions of vehicle type, age, sex, and seat belt use were similar between sets. Almost no vehicles in the non-notification set resulted in serious injury.

Table 3 Frequency of Predictor Variables in Non-Notification and Notification Datasets.

Variable               Value             Non-Notification Set    Notification Set
Number of Impacts      Single            72%                     72%
                       Multiple          28%                     28%
Damage Side            Front             56%                     80%
                       Right             15%                     7%
                       Left              16%                     6%
                       Rear              13%                     7%
Vehicle Type           Car               65%                     73%
                       SUV               16%                     13%
                       Van               8%                      6%
                       Pickup            11%                     8%
Age                    All < 55          79%                     83%
                       Some > 55         21%                     17%
Sex                    No females        42%                     44%
                       Females Present   58%                     56%
Front Seat Belt Use    All Belted        87%                     80%
                       Some Unbelted     13%                     20%
ISS15+ in Veh.         True              0.6%                    2.6%
                       False             99.4%                   97.4%
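The notification rule described above (∆V greater than 15 mph, or any airbag deployment) is simple enough to state as a predicate. The following is a minimal sketch; the function name `would_notify` and the example crashes are hypothetical and are not taken from the study's code or data.

```python
DELTA_V_THRESHOLD_MPH = 15.0  # threshold stated in the text (Kononen et al. 2012)


def would_notify(delta_v_mph, airbag_deployed):
    """Return True if a crash falls in the notification set.

    A crash qualifies when delta-V is strictly greater than 15 mph
    or when an airbag deployed, regardless of delta-V.
    """
    return delta_v_mph > DELTA_V_THRESHOLD_MPH or airbag_deployed


if __name__ == "__main__":
    # Hypothetical crashes, for illustration only.
    examples = [(22.0, False), (8.0, True), (12.0, False)]
    for delta_v, airbag in examples:
        print(delta_v, airbag, would_notify(delta_v, airbag))
```

Because the rule is a logical OR, low-speed crashes with an airbag deployment enter the notification set, which is consistent with the cluster of notification cases below 15 mph.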

Figure 1 shows the distribution of ∆V for vehicles in the notification dataset. There are two spikes in the data: one corresponding to vehicles with airbag deployments at ∆V less than 15 mph, and another at approximately 15 mph, including some crashes without an airbag deployment. The median ∆V was 15.5 mph (25 kph), and 85% of ∆V values were below 21.7 mph (35 kph).

[Figure: histogram of Number of Observations (×10^5, axis from 0 to 2.5) versus ∆V (mph, axis from 0 to 100).]

Figure 1 Distribution of Total ∆V for All Notification Cases.

Overall, in the notification set 2.62% of vehicles contained at least one seriously injured occupant (ISS15+). Figure 2 shows the proportion of vehicles with a seriously injured occupant for categories of total ∆V. As total ∆V increases, the risk of injury greatly increases.

[Figure: bar chart of % of vehicles with serious injury (axis from 0 to 50) by ∆V category (mph): 0-10, 11-20, 21-30, 31-40, 41-50, 50+.]

Figure 2 Proportion of Vehicles with Serious Injury (ISS15+).
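A grouping like the one in Figure 2 can be sketched as a weighted proportion per ∆V category. This is an illustration under stated assumptions, not the authors' procedure: it assumes ∆V is rounded to whole mph before binning, that a value of exactly 50 mph falls in the 41-50 category, and that the ratio weights are applied to each vehicle. The helper names and the demo records are hypothetical.

```python
from collections import defaultdict

BIN_LABELS = ["0-10", "11-20", "21-30", "31-40", "41-50", "50+"]


def delta_v_bin(delta_v_mph):
    """Map a delta-V in mph to one of the Figure 2 category labels."""
    v = round(delta_v_mph)  # assumption: categories apply to whole-mph values
    if v <= 10:
        return "0-10"
    if v > 50:
        return "50+"
    return BIN_LABELS[(v - 1) // 10]  # 11..20 -> 1, 21..30 -> 2, ...


def weighted_injury_rate(records):
    """Return {category: weighted percent of vehicles with ISS15+}.

    Each record is (delta_v_mph, weight, injured), with injured 0 or 1.
    Categories with no records are omitted.
    """
    injured_weight = defaultdict(float)
    total_weight = defaultdict(float)
    for delta_v_mph, weight, injured in records:
        label = delta_v_bin(delta_v_mph)
        total_weight[label] += weight
        injured_weight[label] += weight * injured
    return {label: 100.0 * injured_weight[label] / total_weight[label]
            for label in BIN_LABELS if total_weight[label] > 0}


# Hypothetical records for illustration only; not taken from the dataset.
DEMO_RECORDS = [
    (8.0, 100.0, 0), (9.0, 300.0, 0),
    (18.0, 150.0, 0), (19.0, 50.0, 1),
    (55.0, 10.0, 1), (60.0, 10.0, 0),
]

if __name__ == "__main__":
    print(weighted_injury_rate(DEMO_RECORDS))
```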

Table 4 shows, for each level of the other predictor variables, the proportion of vehicles at that level and the proportion of those vehicles that had at least one seriously injured occupant (ISS15+). The presence of multiple impacts and of occupants over the age of 55 increased the observed injury rates. Right and left side impacts had higher injury rates than front and rear impacts. There were no large differences between vehicle types, with vans having slightly lower and pickup trucks having slightly higher rates than cars and SUVs. The difference between vehicles with and without at least one female was also small.

Table 4 Proportion of Vehicles with Serious Injury Rates for Predictor Variables in Notification Dataset.

Variable               Value             % of Vehicles    % Injured
Number of Impacts      Single            72%              1.95%
                       Multiple          28%              4.35%
Damage Side            Front             80%              2.11%
                       Right             7%               4.82%
                       Left              7%               8.54%
                       Rear              7%               0.84%
Vehicle Type           Car               73%              2.75%
                       SUV               12%              2.14%
                       Van               6%               1.37%
                       Pickup            8%               3.25%
Age                    All < 55          83%              2.03%
                       Some > 55         17%              5.45%
Sex                    No females        44%              2.42%
                       Females Present   56%              2.79%
Front Seat Belt Use    All Belted        80%              1.70%
                       Some Unbelted     20%              6.35%
All Vehicles                             100%             2.63%

APPENDIX B: DATASET AND EXAMPLE LOGISTIC REGRESSION MODELS

The dataset used to compare AACN risk curve performance was composed of 16,398 vehicles corresponding to 4.31 million collisions. NASS/CDS years 2002 to 2011 were aggregated to form this sample. See the manuscript text and Appendix A for details on the case selection. This online supplement includes two files related to the dataset used in this study:

AACN_dataset.csv: The dataset used in this study to compare AACN injury risk curves.

sample_models.m: A MATLAB code file that reads in the dataset and produces sample models like the ones used in the study.

Description of Variables

Table 5 lists and describes the variables in the output dataset. Columns 1 to 5 contain identifying information for each vehicle in the dataset. Columns 6 to 12 contain the predictor variables used in the models. Columns 13 to 22 contain the cross validation index (i.e. 1 to 10) for each of the 10 partitions used in the study. Finally, column 23 contains the outcome variable, whether any occupant had ISS15 or greater injuries.

Table 5. Variable Names and Descriptions for AACN_dataset.csv

Column   Column Name       Description
1        caseyear          NASS/CDS case year
2        psu               NASS/CDS Primary Sampling Unit (PSU)
3        caseno            NASS/CDS Case number
4        vehno             NASS/CDS vehicle number
5        ratwgt            NASS/CDS ratio weights
6        log_deltaV        Natural logarithm of the delta-V from the most harmful event in mph
7        GAD               General Area of Damage (1 = Front, 2 = Right, 3 = Left, 4 = Back)
8        all_belted        Driver and right front passenger (if present) were belted (1 = Yes, 0 = No)
9        body_type         Vehicle body type (1 = Car, 2 = SUV, 3 = Van, 4 = Pickup)
10       multiple_events   Vehicle experienced more than one event (1 = Yes, 0 = No)
11       veh55_or_older    Vehicle has at least one occupant that was 55 years or older (1 = Yes, 0 = No)
12       veh_all_male      All occupants in the vehicle were male (1 = Yes, 0 = No)
13       cv1               Index of cross validation set for partition 1
14       cv2               Index of cross validation set for partition 2
15       cv3               Index of cross validation set for partition 3
16       cv4               Index of cross validation set for partition 4
17       cv5               Index of cross validation set for partition 5
18       cv6               Index of cross validation set for partition 6
19       cv7               Index of cross validation set for partition 7
20       cv8               Index of cross validation set for partition 8
21       cv9               Index of cross validation set for partition 9
22       cv10              Index of cross validation set for partition 10
23       vehicle_iss15     At least one occupant in vehicle had an ISS of 15 or greater (1 = Yes, 0 = No)
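The column layout in Table 5 suggests how a reader might load the file and form one training/held-out split from a cross validation index. The sketch below is in Python rather than MATLAB and rests on assumptions that should be checked against the actual file: that AACN_dataset.csv has a header row with the column names listed in Table 5, and that the value in a `cv` column identifies the fold in which a vehicle is held out. The function names and the demo rows are hypothetical.

```python
import csv
import io
import math

# Column names as listed in Table 5 (12 fixed columns, cv1..cv10, outcome).
COLUMNS = (
    ["caseyear", "psu", "caseno", "vehno", "ratwgt", "log_deltaV", "GAD",
     "all_belted", "body_type", "multiple_events", "veh55_or_older",
     "veh_all_male"]
    + [f"cv{i}" for i in range(1, 11)]
    + ["vehicle_iss15"]
)


def load_rows(handle):
    """Parse CSV text with a header row into a list of {column: float}."""
    reader = csv.DictReader(handle)
    return [{name: float(row[name]) for name in COLUMNS} for row in reader]


def split_fold(rows, partition, held_out_fold):
    """Split rows into (train, test) using the index in column cv<partition>.

    Assumption: rows whose index equals held_out_fold form the test set.
    """
    key = f"cv{partition}"
    train = [r for r in rows if int(r[key]) != held_out_fold]
    test = [r for r in rows if int(r[key]) == held_out_fold]
    return train, test


def demo_csv():
    """Build four hypothetical rows; these are NOT values from the dataset."""
    lines = [",".join(COLUMNS)]
    for i in range(1, 5):
        folds = [str((i + p) % 10 + 1) for p in range(10)]
        values = (["2005", "1", str(i), "1", "100.0",
                   f"{math.log(10.0 * i):.6f}", "1", "1", "1", "0", "0", "1"]
                  + folds + ["0"])
        lines.append(",".join(values))
    return "\n".join(lines)


if __name__ == "__main__":
    rows = load_rows(io.StringIO(demo_csv()))
    train, test = split_fold(rows, partition=1, held_out_fold=3)
    # log_deltaV is a natural logarithm, so exp() recovers delta-V in mph.
    print(len(train), len(test), math.exp(test[0]["log_deltaV"]))
```

To read the real file, the `io.StringIO(...)` argument would be replaced with an open file handle, for example `open("AACN_dataset.csv", newline="")`.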

Sample Code File

A code file is provided to demonstrate the different models evaluated in the study (logistic regression, Random Forests, AdaBoost, Naïve Bayes, Support Vector Machines, and classification K-Nearest Neighbors). The sample code file reads in the dataset from the CSV file and fits models to the entire dataset. The cross validation samples are not used in the example script. The example code does not evaluate the models. The “results” cell array contains model fit objects for each model that can be used to make predictions using future data. The example code was run using MATLAB 2013b and requires a license to the Statistics Toolbox.
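To give a sense of what fitting one of these models involves, the sketch below fits a single-predictor logistic regression on the natural logarithm of ∆V and then predicts injury risk. It is not the contents of sample_models.m: it is written in Python, uses plain gradient ascent instead of a statistics library, includes only one predictor, and assumes the ratio weights enter as observation weights. All data values are hypothetical, and the fitted coefficients have no relation to the study's results.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logistic(xs, ys, weights, learning_rate=0.1, iterations=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by weighted gradient ascent.

    The gradient of the weighted log-likelihood is averaged over the
    total weight so the step size does not depend on the weight scale.
    """
    b0 = b1 = 0.0
    total = sum(weights)
    for _ in range(iterations):
        g0 = g1 = 0.0
        for x, y, w in zip(xs, ys, weights):
            error = y - sigmoid(b0 + b1 * x)
            g0 += w * error
            g1 += w * error * x
        b0 += learning_rate * g0 / total
        b1 += learning_rate * g1 / total
    return b0, b1


def predict_risk(b0, b1, delta_v_mph):
    """Predicted probability of ISS15+ for a delta-V given in mph."""
    return sigmoid(b0 + b1 * math.log(delta_v_mph))


# Hypothetical toy data for illustration only; not taken from the dataset.
DEMO_DELTA_V = [10, 15, 20, 25, 30, 40, 50, 60]
DEMO_INJURED = [0, 0, 0, 1, 0, 1, 1, 1]
DEMO_WEIGHTS = [1.0, 2.0, 1.0, 2.0, 2.0, 1.0, 2.0, 1.0]

B0, B1 = fit_weighted_logistic(
    [math.log(v) for v in DEMO_DELTA_V], DEMO_INJURED, DEMO_WEIGHTS
)

if __name__ == "__main__":
    for delta_v in (10, 30, 60):
        print(delta_v, round(predict_risk(B0, B1, delta_v), 3))
```

The fitted pair `(B0, B1)` plays the role of a model fit object here: once estimated, it can be applied to new ∆V values to produce a predicted risk.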