The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurised air that is utilized in various functions of a truck, such as braking and gear changes. The dataset's positive class consists of component failures for a specific component of the APS; the negative class consists of trucks with failures in components not related to the APS. The data is a subset of all available data, selected by experts.
Creators: Scania CV AB, Vagnmakarvägen 1, 151 32 Södertälje, Stockholm, Sweden; Donors: Tony Lindgren (tony@dsv.su.se) and Jonas Biteus (jonas.biteus@scania.com); Date: September 2016
| | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ag_003 | ... | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76698.0 | NaN | 2.130706e+09 | 280.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 493384.0 | 721044.0 | 469792.0 | 339156.0 | 157956.0 | 73224.0 | 0.0 | 0.0 | 0.0 | b'false' |
| 1 | 38016.0 | NaN | 2.130706e+09 | 704.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 194988.0 | 406226.0 | 319432.0 | 184390.0 | 113950.0 | 144250.0 | 2832.0 | 0.0 | 0.0 | b'false' |
| 2 | 12566.0 | NaN | 2.130706e+09 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 267790.0 | 168402.0 | 115292.0 | 7158.0 | 2044.0 | 1358.0 | 0.0 | 0.0 | 0.0 | b'false' |
| 3 | 40056.0 | NaN | 2.130706e+09 | 11268.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 118466.0 | 317164.0 | 339570.0 | 365458.0 | 177766.0 | 1520.0 | 0.0 | 0.0 | 0.0 | b'false' |
| 4 | 97098.0 | NaN | 2.130706e+09 | 396.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 386786.0 | 701182.0 | 513864.0 | 465926.0 | 390008.0 | 881294.0 | 61634.0 | 0.0 | 0.0 | b'false' |
5 rows × 171 columns
Observation: In the data set there are many attributes with missing values marked by NaN.
aa_000 ab_000 ac_000 ad_000 ae_000 \
count 6.000000e+04 13671.000000 5.666500e+04 4.513900e+04 57500.000000
mean 5.933650e+04 0.713189 3.560143e+08 1.906206e+05 6.819130
std 1.454301e+05 3.478962 7.948749e+08 4.040441e+07 161.543373
min 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000
25% 8.340000e+02 0.000000 1.600000e+01 2.400000e+01 0.000000
50% 3.077600e+04 0.000000 1.520000e+02 1.260000e+02 0.000000
75% 4.866800e+04 0.000000 9.640000e+02 4.300000e+02 0.000000
max 2.746564e+06 204.000000 2.130707e+09 8.584298e+09 21050.000000
af_000 ag_000 ag_001 ag_002 ag_003 \
count 57500.000000 5.932900e+04 5.932900e+04 5.932900e+04 5.932900e+04
mean 11.006817 2.216364e+02 9.757223e+02 8.606015e+03 8.859128e+04
std 209.792592 2.047846e+04 3.420053e+04 1.503220e+05 7.617312e+05
min 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
75% 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
max 20070.000000 3.376892e+06 4.109372e+06 1.055286e+07 6.340207e+07
... ee_002 ee_003 ee_004 ee_005 \
count ... 5.932900e+04 5.932900e+04 5.932900e+04 5.932900e+04
mean ... 4.454897e+05 2.111264e+05 4.457343e+05 3.939462e+05
std ... 1.155540e+06 5.433188e+05 1.168314e+06 1.121044e+06
min ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% ... 2.936000e+03 1.166000e+03 2.700000e+03 3.584000e+03
50% ... 2.337960e+05 1.120860e+05 2.215180e+05 1.899880e+05
75% ... 4.383960e+05 2.182320e+05 4.666140e+05 4.032220e+05
max ... 7.793393e+07 3.775839e+07 9.715238e+07 5.743524e+07
ee_006 ee_007 ee_008 ee_009 ef_000 \
count 5.932900e+04 5.932900e+04 5.932900e+04 5.932900e+04 57276.000000
mean 3.330582e+05 3.462714e+05 1.387300e+05 8.388915e+03 0.090579
std 1.069160e+06 1.728056e+06 4.495100e+05 4.747043e+04 4.368855
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
25% 5.120000e+02 1.100000e+02 0.000000e+00 0.000000e+00 0.000000
50% 9.243200e+04 4.109800e+04 3.812000e+03 0.000000e+00 0.000000
75% 2.750940e+05 1.678140e+05 1.397240e+05 2.028000e+03 0.000000
max 3.160781e+07 1.195801e+08 1.926740e+07 3.810078e+06 482.000000
eg_000
count 57277.000000
mean 0.212756
std 8.830641
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1146.000000
[8 rows x 170 columns]
Observation: Many attributes contain mostly zeros.
Shape of remaining dataframe:
(591, 170)
Observation: Dropping rows with missing values is not applicable in our case, because too many values are missing; only 591 of the 60,000 rows would remain.
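The observations above (missing values, zero-dominated columns, and the uselessness of dropping incomplete rows) can be reproduced with a few pandas one-liners. This is a sketch on a hypothetical mini-frame standing in for the real 60000 × 171 DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the full APS DataFrame.
df = pd.DataFrame({
    "aa_000": [76698.0, 38016.0, 12566.0, 40056.0],
    "ab_000": [np.nan, np.nan, np.nan, np.nan],
    "ag_000": [0.0, 0.0, 0.0, 2832.0],
})

# Missing values per attribute, worst first.
print(df.isna().sum().sort_values(ascending=False))

# Fraction of zeros per attribute (many APS columns are mostly zero).
print((df == 0).mean())

# Rows surviving a naive dropna() -- far too few on the real data.
print(df.dropna().shape)
```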
While the best approach would be to evaluate each model with each method of imputing missing values, that triples the training time. To overcome this problem I tested with a single model (a Naive Bayes model, because it performed well in a previous test done in Weka, and also performed fairly well as a submission at Kaggle), and then used the best performing imputation method for all subsequently tested models, as it is very likely that the imputing methods perform similarly across models. Here are the test results:
FP: 1809 (cost: 10) FN: 127 (cost: 500) Score: 1.3598333333333332
FP: 1821 (cost: 10) FN: 126 (cost: 500) Score: 1.3535
FP: 1908 (cost: 10) FN: 122 (cost: 500) Score: 1.3346666666666667
Verdict: An imputing method which replaces missing values with the most frequent value of the given column scored the lowest (i.e. it is the best in our case). Intuitively, this result also makes sense: the data set contains many zeros, so this method puts zeros into the missing values of the columns that contained mostly zeros (in some cases 75% of the values in a column were 0).
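The comparison above can be sketched along these lines. The reported scores are consistent with total cost (10 per FP, 500 per FN) divided by the number of rows, so that formula is used here; the data is synthetic (`make_classification`) rather than the APS set, and `SimpleImputer` strategies are my assumption about the imputing methods tested:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB

def aps_score(y_true, y_pred):
    """Average misclassification cost: FP costs 10, FN costs 500."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (10 * fp + 500 * fn) / len(y_true)

# Synthetic, imbalanced stand-in for the APS data (real set: 60000 x 170).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.98], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values

for strategy in ("mean", "median", "most_frequent"):
    X_imp = SimpleImputer(strategy=strategy).fit_transform(X)
    y_pred = GaussianNB().fit(X_imp, y).predict(X_imp)
    print(strategy, aps_score(y, y_pred))
```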
Methods used:

- MaxAbsScaler is specifically designed for scaling sparse data, and is the recommended way to go about this. The motivation for this scaling includes robustness to very small standard deviations of features and preservation of zero entries in sparse data.
- QuantileTransformer puts all features into the same, known range or distribution. By performing a rank transformation, it smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.

Source: https://scikit-learn.org/stable/modules/preprocessing.html
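A minimal sketch of the two scalers on toy data (mostly zeros plus one large outlier), illustrating why they suit sparse, outlier-heavy columns; the toy values are my own, not from the APS data:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, QuantileTransformer

# Sparse-looking toy column: mostly zeros, one large outlier.
X = np.array([[0.0], [0.0], [0.0], [2.0], [1000.0]])

# MaxAbsScaler divides by the maximum absolute value, so zeros stay zero.
X_maxabs = MaxAbsScaler().fit_transform(X)
print(X_maxabs.ravel())

# QuantileTransformer maps values onto a uniform distribution by rank,
# which tames the outlier but distorts distances between values.
qt = QuantileTransformer(n_quantiles=5, random_state=0)
X_quant = qt.fit_transform(X)
print(X_quant.ravel())
```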
In the following I tested multiple machine learning methods and compared their scores. Because of the abundance of models implemented in sklearn, I first intentionally made a methodological error: I trained and tested on the same dataset, without cross-validation. That way I could try out a multitude of algorithms and filter out those not able to learn our data set well. In the next "round", I tested the best performing methods with cross-validation to filter out overfitting models.
FP: 1908 (cost: 10) FN: 122 (cost: 500) Score: 1.3346666666666667
FP: 29819 (cost: 10) FN: 21 (cost: 500) Score: 5.144833333333334
FP: 29819 (cost: 10) FN: 21 (cost: 500) Score: 5.144833333333334
FP: 5227 (cost: 10) FN: 150 (cost: 500) Score: 2.121166666666667
I used Gaussian Naive Bayes because it performed the best among the Naive Bayes classifiers.
FP: 606 (cost: 10)
FN: 40 (cost: 500)
Score: 1.316161616161616
10-fold cross validation:
Min: 0.956082786471479
Max: 0.9722222222222222
Avg: 0.9651013651462981
FP: 682 (cost: 10)
FN: 39 (cost: 500)
Score: 1.3292929292929292
10-fold cross validation:
Min: 0.9484848484848485
Max: 0.9656565656565657
Avg: 0.9604042932643058
FP: 681 (cost: 10)
FN: 38 (cost: 500)
Score: 1.3035353535353535
10-fold cross validation:
Min: 0.9565656565656566
Max: 0.9676767676767677
Avg: 0.9626264135849232
FP: 2318 (cost: 10)
FN: 38 (cost: 500)
Score: 2.1303030303030304
10-fold cross validation:
Min: 0.7267676767676767
Max: 0.8575757575757575
Avg: 0.8341929315089993
Verdict: While the training score was lowest with MaxAbsScaler, 10-fold cross validation showed that the cross validation score decreased with all normalization methods, so normalization rather encourages overfitting here.
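The normalization comparison above can be sketched with sklearn pipelines and `cross_val_score` (accuracy, matching the min/max/avg figures reported). The data here is synthetic and the pipeline setup is an assumption about how the comparison was run, not the original notebook code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "no scaling": GaussianNB(),
    "standard": make_pipeline(StandardScaler(), GaussianNB()),
    "maxabs": make_pipeline(MaxAbsScaler(), GaussianNB()),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold accuracy
    print(f"{name}: min={scores.min():.3f} "
          f"max={scores.max():.3f} avg={scores.mean():.3f}")
```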
FP: 0 (cost: 10) FN: 5 (cost: 500) Score: 0.04166666666666666
That is surely overfitting...
FP: 4 (cost: 10) FN: 139 (cost: 500) Score: 1.159
FP: 17 (cost: 10) FN: 773 (cost: 500) Score: 6.4445
FP: 51 (cost: 10) FN: 451 (cost: 500) Score: 3.7668333333333335
FP: 58 (cost: 10) FN: 29 (cost: 500) Score: 0.25133333333333335
Verdict: Among the SVM classifiers, the one with standard normalization and class weighting performed the best.
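A sketch of that winning combination, assuming sklearn's `SVC` with `class_weight="balanced"` behind a `StandardScaler`; the data is a synthetic imbalanced stand-in (~2% positives, like the APS failure class):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced toy data, roughly mimicking the APS class ratio.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.98], random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency,
# pushing the SVM to avoid the expensive false negatives.
model = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
model.fit(X, y)
print((model.predict(X) == y).mean())
```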
FP: 0 (cost: 10) FN: 0 (cost: 500) Score: 0.0
It would be awesome to believe that result, but I'm totally sure that it is overfitting...
FP: 0 (cost: 10) FN: 41 (cost: 500) Score: 0.3416666666666667
By far the best so far, but I'm afraid that it might be overfitting as well.
Parameters: max_samples=0.5, max_features=0.5
FP: 65 (cost: 10) FN: 556 (cost: 500) Score: 4.644166666666667
Parameters: max_samples=0.7, max_features=0.3
FP: 47 (cost: 10) FN: 478 (cost: 500) Score: 3.9911666666666665
Parameters: max_samples=0.3, max_features=0.7
FP: 80 (cost: 10) FN: 621 (cost: 500) Score: 5.1883333333333335
Verdict: While it is visible that tuning the parameters would improve the result, this method still lags far behind the others...
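A minimal sketch of the bagging setup with the first parameter pair above, assuming sklearn's `BaggingClassifier` (decision trees as the default base estimator; synthetic data for self-containment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each estimator sees 50% of the samples and 50% of the features,
# matching the max_samples=0.5, max_features=0.5 setting above.
clf = BaggingClassifier(max_samples=0.5, max_features=0.5, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, as in the first "round"
```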
FP: 134 (cost: 10) FN: 282 (cost: 500) Score: 2.372333333333333
FP: 143 (cost: 10) FN: 675 (cost: 500) Score: 5.648833333333333
After a few trials, layer sizes of (100, 100, 10) appeared to be a good setup. While I'm sure that a much better result could be achieved with many more trials, these values should be enough to compare this method with the others.
FP: 26083 (cost: 10) FN: 583 (cost: 500) Score: 9.2055
FP: 0 (cost: 10) FN: 8 (cost: 500) Score: 0.06666666666666667
FP: 90 (cost: 10) FN: 193 (cost: 500) Score: 1.6233333333333333
FP: 5946 (cost: 10) FN: 380 (cost: 500) Score: 4.157666666666667
Verdict: The Multilayer Perceptron seems to work well with MaxAbsScaler normalization, as the result with standard normalization is very likely overfitting.
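A sketch of that MLP configuration, assuming sklearn's `MLPClassifier` with the (100, 100, 10) hidden layers settled on above and MaxAbsScaler in front; the data is synthetic and `max_iter=500` is my addition to help convergence:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Three hidden layers of 100, 100 and 10 units, with MaxAbsScaler in front,
# which worked best for the MLP in the comparison above.
model = make_pipeline(
    MaxAbsScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 100, 10),
                  max_iter=500, random_state=0),
)
model.fit(X, y)
print(model.score(X, y))
```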
Parameters: loss="hinge", penalty="l2", tol=0.1
FP: 1031 (cost: 10) FN: 240 (cost: 500) Score: 2.1718333333333333
Parameters: loss="hinge", penalty="l1", tol=0.1
FP: 304 (cost: 10) FN: 338 (cost: 500) Score: 2.8673333333333333
Parameters: loss="hinge", penalty="elasticnet", tol=0.1
FP: 1911 (cost: 10) FN: 100 (cost: 500) Score: 1.1518333333333333
Parameters: loss="modified_huber", penalty="l2", tol=0.1
FP: 435 (cost: 10) FN: 370 (cost: 500) Score: 3.1558333333333333
Parameters: loss="modified_huber", penalty="l1", tol=0.1
FP: 294 (cost: 10) FN: 327 (cost: 500) Score: 2.774
Parameters: loss="modified_huber", penalty="elasticnet", tol=0.1
FP: 198 (cost: 10) FN: 352 (cost: 500) Score: 2.9663333333333335
Parameters: loss="log", penalty="l2", tol=0.1
FP: 708 (cost: 10) FN: 293 (cost: 500) Score: 2.5596666666666668
Parameters: loss="log", penalty="l1", tol=0.1
FP: 275 (cost: 10) FN: 337 (cost: 500) Score: 2.8541666666666665
Parameters: loss="log", penalty="elasticnet", tol=0.1
FP: 1888 (cost: 10) FN: 153 (cost: 500) Score: 1.5896666666666666
Verdict: Stochastic Gradient Descent performed the best with hinge loss and elasticnet penalty.
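A sketch of that best SGD configuration, assuming sklearn's `SGDClassifier` with the parameters named above (synthetic data for self-containment):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The best-performing combination above: hinge loss, elasticnet penalty.
clf = SGDClassifier(loss="hinge", penalty="elasticnet", tol=0.1,
                    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```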
Now I test the best performing methods from the first round (those scoring between 0.1 and 2.0) with 5-fold cross validation:
Min: 0.8330833333333333 Max: 0.995 Avg: 0.9535500000000001
Min: 0.887 Max: 0.9878333333333333 Avg: 0.9669833333333333
Min: 0.9470833333333334 Max: 0.994 Avg: 0.9819833333333333
Min: 0.9288333333333333 Max: 0.99525 Avg: 0.9793999999999998
Min: 0.944 Max: 0.9919166666666667 Avg: 0.9799
As Random Forest was the winner of the cross validation, I narrowed the research down to fine-tuning the parameters of the Random Forest classifier.
From the documentation of sklearn module:
The main parameters to adjust when using these methods is n_estimators and max_features. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also the greater the increase in bias. Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks (where n_features is the number of features in the data). Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees).
So, I started to play a bit with the parameters, and found that max_features, max_depth, and min_samples_split are better left at their default values, as any change increased the score significantly. Also, because Random Forest is not sensitive to scaling, I skipped cross-validating the results of the different normalization methods. Reference
Finally, I had one parameter left that has a large impact on the cross validation score: n_estimators. Additionally, I read the article of the winners of the IDA 2016 Industrial Challenge, who reported that the key to their success was adjusting the prediction threshold of the Random Forest classifier. Reference
A quick search for a good starting-point threshold value using the default n_estimators parameter (=10). Calculated 10-fold cross validation scores for the given thresholds:
Threshold: 0.05 average score: 0.9756666666666666
Threshold: 0.1 average score: 0.9756666666666666
Threshold: 0.15 average score: 1.2061666666666666
Threshold: 0.2 average score: 1.2061666666666666
Threshold: 0.25 average score: 1.6945000000000001
Threshold: 0.3 average score: 1.6945000000000001
Threshold: 0.35 average score: 2.274
Threshold: 0.4 average score: 2.274
Threshold: 0.45 average score: 3.0678333333333336
Now search for optimal n_estimators value using threshold=0.05. Calculated 10-fold cross validation scores for given n_estimators values:
n_estimators: 5 average score: 1.2908333333333335
n_estimators: 10 average score: 0.9756666666666666
n_estimators: 15 average score: 0.9504999999999999
n_estimators: 20 average score: 0.9906666666666668
n_estimators: 25 average score: 0.8776666666666666
n_estimators: 30 average score: 0.883
n_estimators: 35 average score: 0.8876666666666667
n_estimators: 40 average score: 0.8878333333333334
n_estimators: 45 average score: 0.8695
n_estimators: 50 average score: 0.8568333333333333
n_estimators: 55 average score: 0.8734999999999999
n_estimators: 60 average score: 0.8783333333333333
n_estimators: 65 average score: 0.8871666666666667
n_estimators: 70 average score: 0.9046666666666667
n_estimators: 75 average score: 0.8821666666666668
n_estimators: 80 average score: 0.8601666666666669
n_estimators: 85 average score: 0.8998333333333333
n_estimators: 90 average score: 0.9033333333333333
n_estimators: 95 average score: 0.8868333333333334
The n_estimators=50 setting has the lowest score. Let's fine-tune the threshold value again around that parameter! Calculated 10-fold cross validation scores for the given thresholds:
Threshold: 0.01 average score: 1.1179999999999999
Threshold: 0.02 average score: 1.1179999999999999
Threshold: 0.03 average score: 0.923
Threshold: 0.04 average score: 0.923
Threshold: 0.05 average score: 0.8568333333333333
Threshold: 0.06 average score: 0.8568333333333333
Threshold: 0.07 average score: 0.8661666666666668
Threshold: 0.08 average score: 0.8661666666666668
Threshold: 0.09 average score: 0.8886666666666667
Verdict: The minimal value here is at threshold=0.05, so I use it to predict the test data. The winning submission under the name hakkelt was produced by this model.
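The final model's threshold trick can be sketched as follows: instead of the classifier's default 0.5 cut-off, an example is flagged as a failure as soon as 5% of the trees vote for it, trading cheap false positives for expensive false negatives. The data here is a synthetic imbalanced stand-in, not the APS set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.98], random_state=0)

# Final configuration: 50 trees, decision threshold lowered to 0.05.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
proba = forest.predict_proba(X)[:, 1]   # fraction of trees voting "failure"
y_pred = (proba >= 0.05).astype(int)    # flag at 5% agreement, not 50%

print("positives at default threshold:", int((proba >= 0.5).sum()))
print("positives at threshold 0.05:  ", int(y_pred.sum()))
```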