Data Mining & Machine Learning

Assignment 2

Hakkel Tamás
2018.12.08.

Table of Contents

1. Data Set

1.1. Information

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurised air that is utilized in various functions of a truck, such as braking and gear changing. The dataset's positive class consists of component failures for a specific component of the APS system; the negative class consists of trucks with failures in components not related to the APS. The data is a subset of all available data, selected by experts.

Creators: Scania CV AB Vagnmakarvägen 1, 151 32 Södertälje, Stockholm, Sweden; Donor: Tony Lindgren (tony@dsv.su.se) and Jonas Biteus (jonas.biteus@scania.com) Date: September, 2016

1.1.0.1. First few lines

aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ag_003 ... ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 eg_000 class
0 76698.0 NaN 2.130706e+09 280.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 493384.0 721044.0 469792.0 339156.0 157956.0 73224.0 0.0 0.0 0.0 b'false'
1 38016.0 NaN 2.130706e+09 704.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 194988.0 406226.0 319432.0 184390.0 113950.0 144250.0 2832.0 0.0 0.0 b'false'
2 12566.0 NaN 2.130706e+09 NaN 0.0 0.0 0.0 0.0 0.0 0.0 ... 267790.0 168402.0 115292.0 7158.0 2044.0 1358.0 0.0 0.0 0.0 b'false'
3 40056.0 NaN 2.130706e+09 11268.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 118466.0 317164.0 339570.0 365458.0 177766.0 1520.0 0.0 0.0 0.0 b'false'
4 97098.0 NaN 2.130706e+09 396.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 386786.0 701182.0 513864.0 465926.0 390008.0 881294.0 61634.0 0.0 0.0 b'false'

5 rows × 171 columns

Observation: In the data set there are many attributes with missing values marked by NaN.

1.1.0.2. Statistics

             aa_000        ab_000        ac_000        ad_000        ae_000  \
count  6.000000e+04  13671.000000  5.666500e+04  4.513900e+04  57500.000000   
mean   5.933650e+04      0.713189  3.560143e+08  1.906206e+05      6.819130   
std    1.454301e+05      3.478962  7.948749e+08  4.040441e+07    161.543373   
min    0.000000e+00      0.000000  0.000000e+00  0.000000e+00      0.000000   
25%    8.340000e+02      0.000000  1.600000e+01  2.400000e+01      0.000000   
50%    3.077600e+04      0.000000  1.520000e+02  1.260000e+02      0.000000   
75%    4.866800e+04      0.000000  9.640000e+02  4.300000e+02      0.000000   
max    2.746564e+06    204.000000  2.130707e+09  8.584298e+09  21050.000000   

             af_000        ag_000        ag_001        ag_002        ag_003  \
count  57500.000000  5.932900e+04  5.932900e+04  5.932900e+04  5.932900e+04   
mean      11.006817  2.216364e+02  9.757223e+02  8.606015e+03  8.859128e+04   
std      209.792592  2.047846e+04  3.420053e+04  1.503220e+05  7.617312e+05   
min        0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
25%        0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
50%        0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
75%        0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
max    20070.000000  3.376892e+06  4.109372e+06  1.055286e+07  6.340207e+07   

           ...             ee_002        ee_003        ee_004        ee_005  \
count      ...       5.932900e+04  5.932900e+04  5.932900e+04  5.932900e+04   
mean       ...       4.454897e+05  2.111264e+05  4.457343e+05  3.939462e+05   
std        ...       1.155540e+06  5.433188e+05  1.168314e+06  1.121044e+06   
min        ...       0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
25%        ...       2.936000e+03  1.166000e+03  2.700000e+03  3.584000e+03   
50%        ...       2.337960e+05  1.120860e+05  2.215180e+05  1.899880e+05   
75%        ...       4.383960e+05  2.182320e+05  4.666140e+05  4.032220e+05   
max        ...       7.793393e+07  3.775839e+07  9.715238e+07  5.743524e+07   

             ee_006        ee_007        ee_008        ee_009        ef_000  \
count  5.932900e+04  5.932900e+04  5.932900e+04  5.932900e+04  57276.000000   
mean   3.330582e+05  3.462714e+05  1.387300e+05  8.388915e+03      0.090579   
std    1.069160e+06  1.728056e+06  4.495100e+05  4.747043e+04      4.368855   
min    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00      0.000000   
25%    5.120000e+02  1.100000e+02  0.000000e+00  0.000000e+00      0.000000   
50%    9.243200e+04  4.109800e+04  3.812000e+03  0.000000e+00      0.000000   
75%    2.750940e+05  1.678140e+05  1.397240e+05  2.028000e+03      0.000000   
max    3.160781e+07  1.195801e+08  1.926740e+07  3.810078e+06    482.000000   

             eg_000  
count  57277.000000  
mean       0.212756  
std        8.830641  
min        0.000000  
25%        0.000000  
50%        0.000000  
75%        0.000000  
max     1146.000000  

[8 rows x 170 columns]

Observation: Many attributes contain mostly zeros.

2. Missing values

2.0.1. Drop rows with missing values

Shape of remaining dataframe:
(591, 170)

Observation: Not applicable in our case because too many values are missing.
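The shape above comes from dropping every row that contains at least one missing value. A minimal sketch of that step on a toy frame (column names only illustrate the real 60000×170 data, assuming it is loaded into a pandas DataFrame):

```python
import numpy as np
import pandas as pd

# Tiny frame standing in for the real 60000x170 data (values illustrative)
df = pd.DataFrame({"aa_000": [76698.0, 38016.0, np.nan],
                   "ab_000": [2.0, np.nan, 1.0]})

# Drop every row that has at least one missing value
complete = df.dropna()
print(complete.shape)  # only the first row has no NaN
```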

2.0.2. Replace missing values (imputing)

While the best approach would be to evaluate each model with each imputation method, that would triple the training time. To avoid this, I tested the imputation methods with a single model (a Naive Bayes model, because it performed well in previous tests done in Weka and also scored fairly well as a submission at Kaggle), and then used the best-performing method for all subsequently tested models, since it is very likely that the imputation methods perform similarly across models. Here are the test results:

2.0.2.1. Replacing with mean:

FP: 1809 (cost: 10)
FN: 127 (cost: 500)
Score: 1.3598333333333332

2.0.2.2. Replacing with median:

FP: 1821 (cost: 10)
FN: 126 (cost: 500)
Score: 1.3535

2.0.2.3. Replacing with most frequent value:

FP: 1908 (cost: 10)
FN: 122 (cost: 500)
Score: 1.3346666666666667

Verdict: The imputation method that replaces missing values with the most frequent value of the given column scored the lowest (i.e. it is the best in our case). Intuitively, this result makes sense because the data set contains many zeros, so this method puts zeros into the missing values of the columns that contained mostly zeros (in some cases 75% of the values in a column were 0).
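The winning strategy can be sketched with sklearn's imputer API (current versions expose it as SimpleImputer; the toy matrix below is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing values; column 1 is mostly zeros, like many APS columns
X = np.array([[1.0, 0.0],
              [np.nan, 0.0],
              [3.0, np.nan],
              [1.0, 0.0]])

# Replace each NaN with the most frequent value of its column
imputer = SimpleImputer(strategy="most_frequent")
X_imp = imputer.fit_transform(X)
# column 0 -> 1.0 fills the gap, column 1 -> 0.0 fills the gap
```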

3. Normalization

Methods used:

Source: https://scikit-learn.org/stable/modules/preprocessing.html
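The scalers compared later in the report come from that preprocessing module; a minimal sketch of the three on a toy matrix (n_quantiles is lowered here only to fit the tiny sample):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, QuantileTransformer

X = np.array([[0.0, 100.0],
              [1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_abs = MaxAbsScaler().fit_transform(X)     # scale each column into [-1, 1]
X_qt = QuantileTransformer(n_quantiles=4).fit_transform(X)  # map to uniform [0, 1]
```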

4. Methods - First round

In the following, I tested multiple machine learning methods and compared their scores. Because of the abundance of models implemented in sklearn, I first intentionally made a methodological error: I trained and tested on the same dataset, without cross-validation. That way I could try out a multitude of algorithms and filter out those not able to learn our data set well. In the next "round", I tested the best-performing methods with cross-validation to filter out overfitting models.
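The scores reported below follow the challenge's cost model: each false positive costs 10, each false negative costs 500, and the total is averaged over the 60 000 training instances (e.g. for Gaussian Naive Bayes, (10·1908 + 500·122)/60000 ≈ 1.3347). A minimal sketch of such a scorer:

```python
def aps_score(fp, fn, n_instances=60000):
    """Average misclassification cost: each FP costs 10, each FN costs 500."""
    return (10 * fp + 500 * fn) / n_instances

print(aps_score(1908, 122))  # reproduces the Gaussian Naive Bayes score below
```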

4.1. Naive Bayes

4.1.1. Gaussian Naive Bayes

FP: 1908 (cost: 10)
FN: 122 (cost: 500)
Score: 1.3346666666666667

4.1.2. Multinomial Naive Bayes

FP: 29819 (cost: 10)
FN: 21 (cost: 500)
Score: 5.144833333333334

4.1.3. Complement Naive Bayes

FP: 29819 (cost: 10)
FN: 21 (cost: 500)
Score: 5.144833333333334

4.1.4. Bernoulli Naive Bayes

FP: 5227 (cost: 10)
FN: 150 (cost: 500)
Score: 2.121166666666667

4.1.5. Test effect of normalization

I used Gaussian Naive Bayes because it performed the best among the Naive Bayes classifiers.

4.1.5.1. Without normalization

FP: 606 (cost: 10)
FN: 40 (cost: 500)
Score: 1.316161616161616
10-fold cross validation:
    Min: 0.956082786471479
    Max: 0.9722222222222222
    Avg: 0.9651013651462981

4.1.5.2. Standard

FP: 682 (cost: 10)
FN: 39 (cost: 500)
Score: 1.3292929292929292
10-fold cross validation:
    Min: 0.9484848484848485
    Max: 0.9656565656565657
    Avg: 0.9604042932643058

4.1.5.3. MaxAbsScaler

FP: 681 (cost: 10)
FN: 38 (cost: 500)
Score: 1.3035353535353535
10-fold cross validation:
    Min: 0.9565656565656566
    Max: 0.9676767676767677
    Avg: 0.9626264135849232

4.1.5.4. QuantileTransformer

FP: 2318 (cost: 10)
FN: 38 (cost: 500)
Score: 2.1303030303030304
10-fold cross validation:
    Min: 0.7267676767676767
    Max: 0.8575757575757575
    Avg: 0.8341929315089993

Verdict: While the score was lowest in the case of MaxAbsScaler, 10-fold cross validation showed that the cross-validation score decreased with every normalization method, so here normalization rather promotes overfitting.

4.2. SVM

4.2.0.5. Without normalization

FP: 0 (cost: 10)
FN: 5 (cost: 500)
Score: 0.04166666666666666

That is surely overfitting...

4.2.0.6. With Standard normalization

FP: 4 (cost: 10)
FN: 139 (cost: 500)
Score: 1.159

4.2.0.7. With MaxAbsScaler normalization

FP: 17 (cost: 10)
FN: 773 (cost: 500)
Score: 6.4445

4.2.0.8. With QuantileTransformer normalization

FP: 51 (cost: 10)
FN: 451 (cost: 500)
Score: 3.7668333333333335

4.2.0.9. With Standard normalization and weighting "false" class

FP: 58 (cost: 10)
FN: 29 (cost: 500)
Score: 0.25133333333333335

Verdict: Among the SVM classifiers, the one with standard normalization and class weighting performed the best.
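The winning combination can be sketched as a pipeline; the exact class weights used in the report are not stated, so class_weight="balanced" and the toy data below are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced toy data standing in for the APS set (~5% positives)
X, y = make_classification(n_samples=400, weights=[0.95, 0.05], random_state=0)

# Standard normalization followed by an SVM with class weighting
clf = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
clf.fit(X, y)
pred = clf.predict(X)
```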

4.3. Decision Tree

FP: 0 (cost: 10)
FN: 0 (cost: 500)
Score: 0.0

It would be awesome to believe that result, but I'm totally sure that it is overfitting...

4.4. Ensemble Learning

4.4.1. Random Forest

FP: 0 (cost: 10)
FN: 41 (cost: 500)
Score: 0.3416666666666667

By far the best so far, but I'm afraid it might be overfitting as well.

4.4.2. Bagging meta-estimator

Parameters: max_samples=0.5, max_features=0.5

FP: 65 (cost: 10)
FN: 556 (cost: 500)
Score: 4.644166666666667

Parameters: max_samples=0.7, max_features=0.3

FP: 47 (cost: 10)
FN: 478 (cost: 500)
Score: 3.9911666666666665

Parameters: max_samples=0.3, max_features=0.7

FP: 80 (cost: 10)
FN: 621 (cost: 500)
Score: 5.1883333333333335

Verdict: While it is visible that tuning the parameters would improve the result, it is still far behind the others...

4.4.3. AdaBoost

FP: 134 (cost: 10)
FN: 282 (cost: 500)
Score: 2.372333333333333

4.4.4. Gradient Tree Boosting

FP: 143 (cost: 10)
FN: 675 (cost: 500)
Score: 5.648833333333333

4.5. Multilayer Perceptron

After a few trials, layer size (100, 100, 10) appeared to be a good setup. While I'm sure a much better result could be achieved with many more trials, these values should be enough to compare this method with the others.
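That layer setup maps directly onto sklearn's hidden_layer_sizes parameter; a minimal sketch on toy data (max_iter is an assumption to keep the run short, and the net may stop before full convergence):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Three hidden layers of 100, 100, and 10 neurons, as in the report
mlp = MLPClassifier(hidden_layer_sizes=(100, 100, 10), max_iter=300, random_state=0)
mlp.fit(X, y)
```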

FP: 26083 (cost: 10)
FN: 583 (cost: 500)
Score: 9.2055

4.5.0.1. With standard normalization

FP: 0 (cost: 10)
FN: 8 (cost: 500)
Score: 0.06666666666666667

4.5.0.2. With MaxAbsScaler normalization

FP: 90 (cost: 10)
FN: 193 (cost: 500)
Score: 1.6233333333333333

4.5.0.3. With QuantileTransformer normalization

FP: 5946 (cost: 10)
FN: 380 (cost: 500)
Score: 4.157666666666667

Verdict: Multilayer Perceptron seems to work well with MaxAbsScaler normalization, as the result of standard normalization is very likely overfitting.

4.6. Stochastic Gradient Descent

Parameters: loss="hinge", penalty="l2", tol=0.1

FP: 1031 (cost: 10)
FN: 240 (cost: 500)
Score: 2.1718333333333333

Parameters: loss="hinge", penalty="l1", tol=0.1

FP: 304 (cost: 10)
FN: 338 (cost: 500)
Score: 2.8673333333333333

Parameters: loss="hinge", penalty="elasticnet", tol=0.1

FP: 1911 (cost: 10)
FN: 100 (cost: 500)
Score: 1.1518333333333333

Parameters: loss="modified_huber", penalty="l2", tol=0.1

FP: 435 (cost: 10)
FN: 370 (cost: 500)
Score: 3.1558333333333333

Parameters: loss="modified_huber", penalty="l1", tol=0.1

FP: 294 (cost: 10)
FN: 327 (cost: 500)
Score: 2.774

Parameters: loss="modified_huber", penalty="elasticnet", tol=0.1

FP: 198 (cost: 10)
FN: 352 (cost: 500)
Score: 2.9663333333333335

Parameters: loss="log", penalty="l2", tol=0.1

FP: 708 (cost: 10)
FN: 293 (cost: 500)
Score: 2.5596666666666668

Parameters: loss="log", penalty="l1", tol=0.1

FP: 275 (cost: 10)
FN: 337 (cost: 500)
Score: 2.8541666666666665

Parameters: loss="log", penalty="elasticnet", tol=0.1

FP: 1888 (cost: 10)
FN: 153 (cost: 500)
Score: 1.5896666666666666

Verdict: Stochastic Gradient Descent appeared to perform the best with hinge loss and elasticnet penalty.
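The winning parameter set can be sketched as follows (toy data; the random_state is an assumption added for reproducibility):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Best-performing combination from the grid above
sgd = SGDClassifier(loss="hinge", penalty="elasticnet", tol=0.1, random_state=0)
sgd.fit(X, y)
```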

5. Methods - Second Round

Now, I test the best-performing methods from the first round (those scoring between 0.1 and 2.0) with 5-fold cross validation.
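The min/max/avg figures below can be produced with cross_val_score; a minimal sketch with one of the tested models (toy data, default scorer assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

# One score per fold; report the min, max, and average
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores.min(), scores.max(), scores.mean())
```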

5.0.1. Gaussian Naive Bayes without normalization

Min: 0.8330833333333333
Max: 0.995
Avg: 0.9535500000000001

5.0.2. SVM with Standard normalization

Min: 0.887
Max: 0.9878333333333333
Avg: 0.9669833333333333

5.0.3. Random Forest without normalization

Min: 0.9470833333333334
Max: 0.994
Avg: 0.9819833333333333

5.0.4. Multilayer Perceptron with MaxAbsScaler normalization

Min: 0.9288333333333333
Max: 0.99525
Avg: 0.9793999999999998

5.0.5. Stochastic Gradient Descent without normalization

Min: 0.944
Max: 0.9919166666666667
Avg: 0.9799

6. Fine Tuning

As Random Forest was the winner of cross validation, I narrowed down research to fine tuning parameters of Random Forest classifier.

From the documentation of sklearn module:

The main parameters to adjust when using these methods is n_estimators and max_features. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also the greater the increase in bias. Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks (where n_features is the number of features in the data). Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees).

So, I started to play a bit with the parameters, and found that max_features, max_depth, and min_samples_split are better left at their default values, as any change increased the score significantly. Also, because Random Forest is not sensitive to scaling, I skipped cross-validating the different normalization methods. Reference

Finally, one parameter remained that has a large impact on the cross-validation score: n_estimators. Additionally, I read the article of the winners of the IDA 2016 Industrial Challenge, who reported that the key to their success was adjusting the prediction threshold of the Random Forest classifier. Reference
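That threshold trick replaces the default 0.5 cut on predict_proba with a much lower value, so borderline trucks are flagged as failures (cheap false positives) rather than missed (expensive false negatives). A minimal sketch on toy data, with illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data standing in for the APS set
X, y = make_classification(n_samples=400, weights=[0.95, 0.05], random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Flag a failure whenever the positive-class probability exceeds the
# custom threshold, instead of the default 0.5 cut
threshold = 0.05
proba = rf.predict_proba(X)[:, 1]
y_pred = (proba > threshold).astype(int)
```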

First, a quick search for a good starting-point threshold value, using the default n_estimators parameter (=10). Calculated 10-fold cross validation scores for the given thresholds:

Threshold: 0.05
	average score: 0.9756666666666666

Threshold: 0.1
	average score: 0.9756666666666666

Threshold: 0.15
	average score: 1.2061666666666666

Threshold: 0.2
	average score: 1.2061666666666666

Threshold: 0.25
	average score: 1.6945000000000001

Threshold: 0.3
	average score: 1.6945000000000001

Threshold: 0.35
	average score: 2.274

Threshold: 0.4
	average score: 2.274

Threshold: 0.45
	average score: 3.0678333333333336

Now search for optimal n_estimators value using threshold=0.05. Calculated 10-fold cross validation scores for given n_estimators values:

n_estimators: 5
	average score: 1.2908333333333335

n_estimators: 10
	average score: 0.9756666666666666

n_estimators: 15
	average score: 0.9504999999999999

n_estimators: 20
	average score: 0.9906666666666668

n_estimators: 25
	average score: 0.8776666666666666

n_estimators: 30
	average score: 0.883

n_estimators: 35
	average score: 0.8876666666666667

n_estimators: 40
	average score: 0.8878333333333334

n_estimators: 45
	average score: 0.8695

n_estimators: 50
	average score: 0.8568333333333333

n_estimators: 55
	average score: 0.8734999999999999

n_estimators: 60
	average score: 0.8783333333333333

n_estimators: 65
	average score: 0.8871666666666667

n_estimators: 70
	average score: 0.9046666666666667

n_estimators: 75
	average score: 0.8821666666666668

n_estimators: 80
	average score: 0.8601666666666669

n_estimators: 85
	average score: 0.8998333333333333

n_estimators: 90
	average score: 0.9033333333333333

n_estimators: 95
	average score: 0.8868333333333334

The n_estimators=50 parameter setting has the lowest score. Let's fine-tune the threshold value again around that setting! Calculated 10-fold cross validation scores for the given thresholds:

Threshold: 0.01
	average score: 1.1179999999999999

Threshold: 0.02
	average score: 1.1179999999999999

Threshold: 0.03
	average score: 0.923

Threshold: 0.04
	average score: 0.923

Threshold: 0.05
	average score: 0.8568333333333333

Threshold: 0.06
	average score: 0.8568333333333333

Threshold: 0.07
	average score: 0.8661666666666668

Threshold: 0.08
	average score: 0.8661666666666668

Threshold: 0.09
	average score: 0.8886666666666667

Verdict: The minimum here is at threshold=0.05, so I use that to predict the test data. The winning submission under the name hakkelt was predicted by that model.