Mercedes-Benz Car Testing Time

SUJIT
13 min read · Mar 30, 2020

Speedier testing will contribute towards lower emissions and a more efficient manufacturing line

GitHub link of the project. LinkedIn profile

Content

  1. Business Problem
  2. Problem Statement
  3. Prerequisites
  4. Performance Metrics
  5. Business Constraints
  6. About Data
  7. Data Collection
  8. Data Loading
  9. Let's find some patterns in the data
  10. Preprocessing — X0 feature(Clustering)
  11. Train the model
  12. Important features
  13. Model implementation
  14. Final Prediction Function
  15. My final approach summary
  16. Things that didn't work for me
  17. Conclusion
  18. Future work

1. Business Problem

Since the first automobile, the Benz Patent Motor Car of 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Daimler's Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler's engineers have developed a robust testing system. But optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world's biggest manufacturers of premium cars, Daimler treats safety and efficiency as paramount on its production lines.

There are two other aspects to this case study:

If a car takes more time in testing, it affects the price of the car, because Daimler is spending money on electricity, man-power, warehouse costs, and other materials. So the longer the test time, the higher the cost of the car.

The more time a car spends in testing, the more CO2 is produced. If we compare the CO2 emissions of the whole world with those of Daimler's testing machines, Daimler's contribution is small; but if the whole testing industry adopted this approach, it would decrease CO2 emissions on a large scale.

The objective of the Mercedes-Benz Greener Manufacturing competition is to develop a machine learning model that can accurately predict the time a car will spend on the test bench based on the vehicle configuration. The vehicle configuration is defined as the set of customization options and features selected for the particular vehicle. The motivation behind the problem is that an accurate model will be able to reduce the total time spent testing vehicles by allowing cars with similar testing configurations to be run successively.

2. Problem Statement

  • The motivation behind the problem is that an accurate model will be able to reduce the total time spent testing vehicles by allowing cars with similar testing configurations to be run successively.
  • In this problem, we have to predict a continuous value (time in seconds) for a given query, which makes it a regression problem. The continuous value is how much time a car needs to pass testing.

3. Prerequisites

This post assumes familiarity with basic machine learning concepts like regression algorithms, clustering algorithms, optimization, and probability, plus Python syntax and the pandas, numpy, seaborn, and matplotlib libraries.

4. Performance metrics

  • R² (Coefficient of determination)
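R² measures how much of the variance in y the model explains (1 is a perfect fit, 0 is no better than predicting the mean). A minimal sketch of how it is computed, with made-up numbers purely for illustration:

import numpy as np
from sklearn.metrics import r2_score

# R^2 = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares
# and SS_tot is the total variance of y around its mean
y_true = np.array([100.5, 88.2, 95.0, 110.3])   # actual test-bench times (made up)
y_pred = np.array([101.0, 90.1, 94.2, 108.9])   # model predictions (made up)

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print('manual R2 :', 1 - ss_res / ss_tot)
print('sklearn R2:', r2_score(y_true, y_pred))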

5. Business Constraints

The company has given no constraint related to testing-time prediction, but predictions should still not take more than a few seconds per query.

If it takes 1 to 2 seconds to predict a single test point, the model will be of little use, because with more data to predict the company would be wasting time just getting results. For example, if the input has 100k points and the model takes 1 second per query, prediction alone takes roughly 28 hours; at 2 seconds per query it is over 55 hours. That is far too long for a manufacturing company.

6. About data

  • The dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.
  • There are 8 categorical features, 1 ID feature, and 368 binary features, plus the y column, which is the time in seconds.

7. Data Collection

Kaggle provides a ZIP file which contains train.csv, test.csv, and a sample submission file (see the competition's Data page).

8. Loading data

import pandas as pd

train_data = pd.read_csv('train.csv')
print('Train data shape : ', train_data.shape)
train_data.head(2)
fig 1
  1. 8 categorical features
  2. 368 binary features
  3. 1 ID feature
  4. Target value y

Just by looking at the data we cannot judge anything, because it contains only binary numbers and some categorical values without any labels.

9. Let's find some patterns in the data

a. Analysis of target variable y

  • Histogram
fig 2
  1. The bulk of the y values lies between 80 and 120
  2. The plot is skewed
  • Scatter plot
fig 4

1. We can see in the box plot and the scatter plot that there is one point which is very far away. This point must be an outlier.

  • CDF
fig 5
  1. We can consider y > 150 to be an outlier.

2. Almost 99.83% of the data is covered before 150.

Remove all the 'y' values which are greater than 150.
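A one-line version of this step, assuming the train_data dataframe from section 8:

# keep only the rows whose testing time is at most 150 sec (~99.83% of the data)
train_data = train_data[train_data['y'] <= 150].reset_index(drop=True)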

b. Analysis of categorical feature

fig 6

I checked Daimler's official website to try to understand what the categorical values represent, but it was not of much use to me. I only learned that Mercedes gives its customers multiple options, such as the selection of wheels, trim, wireless charging, online navigation system, ventilation, brake system, airbags, etc.


Let's make a strip plot for each categorical feature, one by one.
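A minimal sketch of how such a plot can be drawn with seaborn (shown here for X0; repeat for the other categorical columns):

import matplotlib.pyplot as plt
import seaborn as sns

# strip plot: every car is a dot, grouped by its X0 category,
# so we can see the time range each category occupies
plt.figure(figsize=(14, 5))
sns.stripplot(x='X0', y='y', data=train_data)
plt.title('Testing time (y) per X0 category')
plt.show()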

  • X0 feature
fig 7
  • This feature is well separated: each unique value lies in some specific range. E.g. all 'a' cars have times greater than 110, 'aa' cars have times greater than 135, and 'az' times lie below 110. With this information we can guess or predict the approximate time easily. E.g. if my query point has 'a', I don't know the exact value, but I know it will be more than 110 sec.

We can put each unique value into a set by looking at the unique values' medians.

Think of it this way: all values whose median comes above the red line get one set, all values whose median falls between the green and red lines get a second set, and all values below the green line get a third set, as shown in fig 7.

To finalize the thresholds, we tried several different threshold values and kept the ones that maximize the correlation between the target variable and the new feature, as shown in the code below.

Make a new feature
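The original gist is not reproduced here; below is a sketch of the idea. The two thresholds (the red and green lines of fig 7) are placeholder values that you would tune to maximize correlation with y:

# median testing time of each X0 category
med = train_data.groupby('X0')['y'].median()

# placeholder thresholds -- try several values and keep the pair that
# maximizes the correlation between the new feature and y
upper, lower = 115, 95

def x0_set(cat):
    # set 2: median above the red line, set 1: between the two lines,
    # set 0: below the green line
    m = med.get(cat, med.median())   # unseen categories fall back to the global median
    if m > upper:
        return 2
    elif m > lower:
        return 1
    return 0

train_data['X0_n'] = train_data['X0'].map(x0_set)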

  • X2 feature
fig 8

The values 'ak', 'as', and 'ae' do not have a specific range of time, and together 'ak', 'as', 'ae' cover 69% of the X2 feature, as shown in fig 8.

We can also make sets as follows:

fig 9
  • X3 feature
fig 10
  • X5 feature
fig 11
  • X6 feature
fig 12
  • X8 feature
fig 13
  • X1 feature
fig 14

1. The features X3, X1, X5, X6, X8, and X4 cannot be converted into sets, because their unique values do not lie in specific ranges; there is no set pattern. For example, in the X1 plot, the 'b' values can lie anywhere from 83 to 140+, and similarly the 'c' values from 86 to 138. That is why we cannot put the values into sets according to their medians.

2. The features X3, X1, X5, X6, X8, and X4 are less informative compared to X0.

Label encoding of the categorical features
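A sketch of this step using sklearn's LabelEncoder. Fitting on the union of train and test values (so unseen test categories do not break the transform) is an assumption about how the original code handled it; test_data is the dataframe loaded from test.csv:

from sklearn.preprocessing import LabelEncoder

cat_cols = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
for col in cat_cols:
    le = LabelEncoder()
    # fit on train + test values together so both encode consistently
    le.fit(pd.concat([train_data[col], test_data[col]]))
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])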

  • Checking correlation

Model-based approach

Here we take a linear regression model and fit each feature one by one. After fitting, we check the mean squared error (MSE) we get for each individual feature. Looking at the plot above, we observe that X0_n is the most correlated with the target variable, because it has a low MSE compared to the other features.

The correlation order looks like this:

X0_n > X2_n > X0 > X3 > X2 > X5 > X4 > X1 > X6 > X8
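A sketch of the model-based check described above: fit a one-feature linear regression per column and compare the MSEs (the lower the MSE, the more informative the feature). X0_n and X2_n are the set features built earlier, and all columns are assumed to be label-encoded already:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ['X0_n', 'X2_n', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
scores = {}
for col in features:
    lr = LinearRegression()
    X_col = train_data[[col]]                 # single-feature design matrix
    lr.fit(X_col, train_data['y'])
    scores[col] = mean_squared_error(train_data['y'], lr.predict(X_col))

# sort ascending: the feature with the lowest MSE is the most correlated with y
for col, mse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f'{col}: {mse:.2f}')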

c. Analysis of Binary feature

Total no. of binary features

fig 16 where blue=0 and orange=1

There are 13 features which have all-zero values, which means no customer has chosen those options for their car. So we can remove the all-zero features, because they do not contain any information.

  • The features which contain only zeros:

Remove all the all-zero columns that exist in the training data.
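A sketch of this step (dropping the same columns from test keeps the two schemas aligned):

# columns (other than ID and y) whose values are all zero carry no information
zero_cols = [c for c in train_data.columns
             if c not in ('ID', 'y')
             and train_data[c].nunique() == 1
             and train_data[c].iloc[0] == 0]
print('all-zero columns:', zero_cols)

train_data = train_data.drop(columns=zero_cols)
test_data = test_data.drop(columns=zero_cols)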

d. Analysis of ID feature

Plot between ID and y

fig 17

The linear fit line (the dark blue line) goes down as the ID value increases, which means that as the ID increases, the time taken by a car also reduces (slightly).

Another way to look at this:

We know we can use an averaging method to smooth a zig-zag curve. So if we smooth the plot between the target variable and ID, it will look like this.

By looking at this plot we get a better intuition: as the ID increases, there is a slight decrease in time.

So, observing this, we can make a new column in which we give less weight to newer IDs: the 1st ID gets more weight compared to the 10th ID.

One way of doing this is 1/x, where x is the ID number.

I took the log because as the ID increases, 1/x decreases very quickly; the values then get rounded, and when that happens the current value and the next value can become indistinguishable. Using 1/(log(x) + 3) makes the decay much gentler.
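A sketch of the transform (np.log is the natural log; the +3 keeps the denominator positive and the decay gentle):

import numpy as np

# plain 1/x decays so fast that large IDs round to nearly identical values;
# 1/(log(x) + 3) decays slowly, so neighbouring IDs stay distinguishable
# (if any ID is 0, shift the IDs by 1 first, since log(0) is undefined)
train_data['ID_new'] = 1 / (np.log(train_data['ID']) + 3)
test_data['ID_new'] = 1 / (np.log(test_data['ID']) + 3)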

10. Preprocessing — X0 feature

We can also try to make clusters of X0 via the following steps:

fig 18


  • for training data
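The clust helper used below is not shown in the original gist; here is a minimal sketch of what it might look like, assuming KMeans over each X0 category's median testing time (k comes from the elbow search shown next, and the 'X'/'label' columns match the test-side code below):

from sklearn.cluster import KMeans

def clust(df, k=4):
    # one row per X0 category, with that category's median testing time
    med = df.groupby('X0')['y'].median().reset_index()
    med.columns = ['X', 'y_med']
    # cluster the categories by their median time
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    med['label'] = km.fit_predict(med[['y_med']])
    return med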

Finding best k

We can see the best k = 4.

  • for testing data
dc = clust(train_data)
# map each test-set X0 category to the (median) cluster label it received on train
data_te['X0_clus'] = test['X0'].map(dc.groupby('X')['label'].median())

After making the clusters, we look at the percentage distribution of data in each cluster.

We can see the maximum number of points belongs to cluster 0, with 49.7%.

11. Train the model

Train the model on all the pre-processed data (everything we have done above).

12. Important features

Let's see both models' important features, in order.

  • XG boost feature importance
fig 19
  • Random Forest feature importance
fig 20

Now let's do some feature engineering using interaction variables:

We can make new features this way: ['X314_plus_X315'] and ['X118_plus_X314_plus_X315'], as sketched below.
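A sketch of these interaction features (sums of the binary flags that both importance plots rank highly):

# X314 + X315 and X118 + X314 + X315 as new columns, on train and test alike
for df in (train_data, test_data):
    df['X314_plus_X315'] = df['X314'] + df['X315']
    df['X118_plus_X314_plus_X315'] = df['X118'] + df['X314'] + df['X315']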

13. Model Implementation

Note: there are certain hyper-parameters in this code that can be fine-tuned according to one's requirements. I have tuned them to find a sweet spot between speed and accuracy.

We made 4 models:

  1. Simple Linear Regression
  2. Ridge Regression
  3. Decision Tree
  4. XG Boost

Simple Linear Regression

Note: we have not used all the features in linear regression, because linear regression assumes that there is some relationship between the target variable and the independent variables. So we first found the most correlated independent features using a linear regression model: if an independent variable fits the training data better and increases the metric value, we can conclude that there is some relationship between it and the dependent variable.

  • Before building the model, we find the combination of features which fits best on the training data.

We can see there are 7 features which give the maximum metric value, so we take these features to build our linear regression model. A sketch of one way to run this selection follows.
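The selection code is not reproduced here; a sketch of one way to do it, assuming a greedy forward search that adds one feature at a time and keeps it only while the cross-validated R² keeps improving (all features are assumed to be numeric by this point):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

candidates = [c for c in train_data.columns if c not in ('ID', 'y')]
selected, best_score = [], float('-inf')

improved = True
while improved:
    improved = False
    for col in candidates:
        if col in selected:
            continue
        # score the current set plus one trial feature
        score = cross_val_score(LinearRegression(),
                                train_data[selected + [col]], train_data['y'],
                                cv=5, scoring='r2').mean()
        if score > best_score:
            best_score, best_col, improved = score, col, True
    if improved:
        selected.append(best_col)

print(len(selected), 'features selected:', selected)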

Result of linear regression

Ridge Regression

  • Before building the model, we find the combination of features which fits best on the training data.

We can see there are 18 features which give the maximum metric value.

  • After this, we ran a grid search CV to find the best alpha parameter, and after fitting we got best alpha = 6. So we take these features to build our Ridge regression model; a sketch of the search follows.
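A sketch of that grid search (selected_18 is a stand-in name for the 18 features found above, and the alpha grid is assumed):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 1, 2, 4, 6, 8, 10]}
grid = GridSearchCV(Ridge(), params, scoring='r2', cv=5)
grid.fit(train_data[selected_18], train_data['y'])
print('best alpha:', grid.best_params_['alpha'])   # 6 in our run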

Note — values not present in any cluster were assigned to cluster zero, because the maximum number of points lies in cluster zero anyway.

Result of Ridge regression

Decision Tree

Before building the model, we found the best 'depth' hyperparameter value for the decision tree, and we got the best depth as 3. So now we build the model.

Result of Decision Tree

XG Boost

  • First, we ran a grid search CV to find the best parameters.
  • Second, we built the model:
from xgboost import XGBRegressor

X = data_tr.drop('y', axis=1)
y = data_tr.y
xg = XGBRegressor(learning_rate=.01, max_depth=3, n_estimators=600,
                  colsample_bytree=.55, subsample=.85, gamma=.65,
                  colsample_bylevel=.95)
# fit the tuned regressor on the pre-processed training data
xg.fit(X, y)

After building the model, we saved it to the local drive and then tested it on the X_test data, where we got an R²-score of 0.605.

But after adding a small constant to the predicted values, we noticed that at 0.4 we got the maximum R² score.

Note: after getting a result, add 0.4 seconds. E.g. if my model predicts 98.2, then add 0.4 to it, so my final y_pred will be 98.6 sec, as shown in fig 21.

fig 21
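A sketch of how such an offset can be found, assuming X_test/y_test are the held-out split used above — scan a small grid and keep the value that maximizes R²:

import numpy as np
from sklearn.metrics import r2_score

y_pred = xg.predict(X_test)

best_offset, best_r2 = 0.0, float('-inf')
for offset in np.arange(0, 1.01, 0.1):          # try offsets 0.0, 0.1, ..., 1.0
    r2 = r2_score(y_test, y_pred + offset)
    if r2 > best_r2:
        best_offset, best_r2 = offset, r2

print(f'best offset: {best_offset:.1f}, R2: {best_r2:.4f}')   # 0.4 in our run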

Final summary of all models

We can see the linear regression model worked quite well.

14. Final Prediction Function
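The gist with the final function is not embedded here; a minimal sketch of what it does end to end. preprocess is a stand-in name for the full pipeline above (sets, label encoding, new ID, zero-column removal, clusters, interactions), and the model path is assumed:

import joblib

def final_predict(raw_df):
    # raw_df: query rows with the same schema as test.csv
    model = joblib.load('xgb_model.pkl')   # the XGBoost model saved after training
    X = preprocess(raw_df)                 # hypothetical helper: applies all steps above
    return model.predict(X) + 0.4          # add the 0.4 s offset found earlier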

15. My final approach summary:

  • Step 1: Removed all 'y' values greater than 150.
  • Step 2: Made two set features, for X0 and X2.
  • Step 3: Label-encoded the features X0, X1, X2, X3, X4, X5, X6, X8.
  • Step 4: Made a new ID feature: 1/(log(x) + 3).
  • Step 5: Removed all columns which contain only zeros.
  • Step 6: Made clusters for the X0 feature.
  • Step 7: Added the interaction variables ['X314_plus_X315'] and ['X118_plus_X314_plus_X315'].
  • Step 8: Finally used the XGBoost regressor with the tuned parameters.

16. Things that didn't work for me

fig 22
  • Removing only the last (single most extreme) outlier value.
  • Making sets for the X1, X3, X4, X5, X8 features.


This is my final Kaggle score:

Private = 0.55329 and public = 0.55621.

With this solution, I got 42nd position.

17. Conclusion

  1. The data is heavily anonymized.
  2. The data contains outliers.
  3. Making simple sets of X0 was good, because:

a. We notice that in the train data, among cars whose testing time was less than 80, 95.77% have the X0 value 'az', and 100% have X27 = 1 and X10 = 0.

Similarly, among predictions whose testing time was less than 80, 96.15% have X0 = 'az', and 100% have X27 = 1 and X10 = 0.

b. We notice that in the train data, among cars whose testing time is between 90 and 100, 72.1% have an X0 value in ['y', 'z', 't', 'o', 'f', 'n', 's', 'al', 'e'].

Similarly, among predictions whose testing time is between 90 and 100, 84.4% have an X0 value in ['y', 'z', 't', 'o', 'f', 'n', 's', 'al', 'e'].

c. We notice that in the train data, among cars whose testing time is between 100 and 120, 79.4% have an X0 value in ['ak', 'x', 'ay', 'w', 'j', 'aj', 'ap', 'h', 'd', 'v'].

Similarly, among predictions whose testing time is between 100 and 120, 89.3% have an X0 value in ['ak', 'x', 'ay', 'w', 'j', 'aj', 'ap', 'h', 'd', 'v'].

4. Simple feature engineering with interaction variables was also good.

18. Future Work

Thanks a lot if you have reached this far. This is my first attempt at blogging, so I hope readers will be a bit generous and ignore the minor mistakes I might have made.

We can make many modifications to improve the solution. Some of them are:

  1. There are some duplicate binary columns which we can remove.
  2. We can try some other algorithms; if this business idea saves a lot of money for the company, we can also try a model based on a neural network.
  3. We can also try to remove those features which are highly correlated with each other.
  4. There can be another way to modify the ID column: instead of using the log in the denominator, we can use a weighted exponential moving average.

