Severstal Steel defect detection

Can you detect and classify defects in steel?

SUJIT
16 min read · Aug 26, 2020

LinkedIn profile and GitHub profile

Before going deep, watch the deployment video of this project on YouTube. It will spark curiosity about the internals of this case study.

Content

  1. Business Problem
  2. Problem Statement
  3. Prerequisites
  4. Performance Metrics
  5. Business Constraints
  6. About Data
  7. Data Collection
  8. Objective
  9. Strategic plan to find a defect and segment
  10. Data Loading
  11. Let’s find some pattern in the data
  12. Binary Classification
  13. Multi-Label Classification
  14. Segmentation
  15. Approach to predicting queries
  16. Future Scope

1. Business Problem

Steel is one of the most important building materials of modern times. Steel buildings are resistant to natural and man-made wear, which has made the material ubiquitous around the world. To help make the production of steel more efficient, this competition aims to identify defects.

fig 1

Severstal is leading the charge in efficient steel mining and production. They believe the future of metallurgy requires development across the economic, ecological, and social aspects of the industry, and they take corporate responsibility seriously. The company recently created the country's largest industrial data lake, with petabytes of data that were previously discarded. Severstal is now looking to machine learning to improve automation, increase efficiency, and maintain high quality in their production.

fig 2

The production process of flat sheet steel is especially delicate. From heating and rolling, to drying and cutting, several machines touch flat steel by the time it’s ready to ship.

fig 3

Today, Severstal uses images from high-frequency cameras to power a defect detection algorithm.

2. Problem Statement

The objective is to predict the location and type of defects present in steel using the images provided.

That is, with the help of rollers the steel sheet moves forward while high-frequency cameras take photos. Those images are sent to the system, which checks whether there is any defect. If the system finds any major defect, it raises an alert, as shown in fig 3.

3. Prerequisites

This post assumes familiarity with basic Deep Learning concepts like Multi-layered Perceptrons, Convolution Neural Networks, Segmentation, Transfer Learning, Optimisation, Backpropagation, Overfitting, Probability, Python syntax and Keras library, etc.

4. Performance metrics

  1. F1 score
  2. Dice_coef

Short Note on Dice Metric

The Dice coefficient can be used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The formula is given by:

Dice(X, Y) = 2 * |X ∩ Y| / (|X| + |Y|)

where X is the predicted set of pixels and Y is the ground truth. The Dice coefficient is defined to be 1 when both X and Y are empty.

5. Business Constraints

There is no explicit constraint mentioned in this competition, but in the real world it is important to detect whether the steel has a defect, and if so, which type of defect it has and where the defect is located. This matters because if Severstal knows all three answers, product quality improves: knowing the defect type tells them which machining operation can remove it, and knowing the defect region lets them apply that machining only where it is needed. If a defect is not removed, it reduces the strength of the steel and also increases the chances of rusting.

So when we build a classification model, we have to keep in mind that the F1 score should be high and there should be no overfitting.

Some types of defects

To know more in-depth, read here.

There are 4 main types of defects:

  • Casting defects
  • Rolling defects
  • Forging defects
  • Welding defects

Subcategories of the defects:

fig 4

6. About data

  • Each image may have no defects, a defect of a single class, or defects of multiple classes (ClassId = [1, 2, 3, 4]).
  • For each defect class present in an image (i.e. 1, 2, 3, 4), there is a segmentation mask provided in encoded form.
  • The total number of train and test images is 12571 and 5506 respectively.

7. Data Collection

Kaggle provides a ZIP file which contains 2 folders and 2 CSV files.

fig 5

The first folder contains the train images and the second folder contains the test images.

The 1st CSV file contains each ImageId and its encoded pixels, and the 2nd CSV file contains the submission format. Data here.

8. Objective

For any query image, we have to classify the defect and locate it by segmentation. For each image, we must segment the defects of every class it belongs to (ClassId = [1, 2, 3, 4]).

fig 6

In this competition, we have to maximise the Dice coefficient for segmentation and the F1 score for classification.

Note: the segmentation masks are provided in run-length encoded form. E.g. if an image row were 'aaabbccccddd', its encoded form would be 'a3b2c4d3'. We can observe the length has decreased from 12 to 8 characters. You can see the video for more info here. A minimal sketch of this encoding idea is shown below.
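
Below is a minimal Python sketch of the run-length idea from the example above; the function name is just for illustration. Note that the competition's EncodedPixels column applies the same idea as space-separated 'start length' pairs over flattened pixel positions, not characters.

from itertools import groupby

# Collapse each run of identical characters into '<char><run length>'.
# Illustrative only: the real EncodedPixels column stores 'start length' pairs.
def rle_encode_chars(s):
    return ''.join(f'{ch}{len(list(group))}' for ch, group in groupby(s))

print(rle_encode_chars('aaabbccccddd'))  # prints 'a3b2c4d3'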

Reason for using the tf.data API

tf.data improves performance by prefetching the next batch of data asynchronously so that the GPU does not need to wait for data. You can also parallelise preprocessing and dataset loading. Read this.

The tf.data API helps the model train faster; a minimal pipeline sketch is below.
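
As an illustration, here is a minimal tf.data pipeline sketch, assuming TensorFlow 2.x; the image size, batch size, and decode_and_preprocess helper are assumptions for illustration, not the exact pipeline used in this case study.

import tensorflow as tf

def decode_and_preprocess(path, label):
    # Read and decode a JPEG, then scale pixel values to [0, 1].
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img, label

def make_dataset(image_paths, labels, batch_size=16):
    ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    ds = ds.shuffle(buffer_size=1024)
    # Parallelise the preprocessing across CPU cores.
    ds = ds.map(decode_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prefetch the next batch asynchronously so the GPU never waits for data.
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds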

Let’s get started with data……….

9. Strategic plan to find a defect and segment

fig 7

Let’s execute our plan step by step……..

10. Data Loading

According to our plan, we will first load the DataFrame which Kaggle has provided.

DataFrame.

  • Read ‘train.csv’ with the help of pandas, as sketched below.
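
A minimal sketch, assuming the Kaggle files are extracted into the working directory; the column names follow the competition's train.csv.

import pandas as pd

# train.csv has columns: ImageId, ClassId, EncodedPixels.
df = pd.read_csv('train.csv')
print(df.shape)
print(df.head())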

11. Let’s find some pattern in the data

  1. Count plot
fig 8
  • The data is fairly balanced for classifying whether steel has a defect or not.
fig 9
  • The data is very imbalanced across defect classes; in the dataset, class 3 defects occur far more often than the others.
  • Due to the imbalanced data, we cannot rely much on the accuracy metric.

2. Let’s check the encoded pixels

Before going further, we will write a function to calculate the sum of the encoded run lengths, i.e. the total number of defect pixels per mask.

E.g. if the encoded pixel string were ‘a3b2c4d3’, the sum of encoding of the defect pixels would be 3+2+4+3=12. In the actual CSV, EncodedPixels is a space-separated sequence of ‘start length’ pairs, so the function below sums every second number (the run lengths).

def sum_enc(i):
    # Sum every second number (the run lengths) in the
    # space-separated "start length start length ..." string.
    return sum([int(k) for k in i.split(' ')[1::2]])

a. Sum of encoding class 1

The distplot looks like this:

fig10

Let’s see value from 0 to 100 percentile

fig11
fig12
  • In class 1, the sum of encoding starts from 163, i.e., the zero percentile.
  • We can observe that after the 99.5th percentile there is a sudden increase of about 2K.

This is how the defect image and its segmentation look:

fig13

Note: these values will help us set a threshold for classification and segmentation. It means that if an image belongs to class 1, its sum of defect pixels should lie between [163, 17983].

b. Sum of encoding class 2

fig15
fig16
  • In class 2, the sum of encoding starts from 316 (zero percentile).
  • We can observe that after the 98th percentile there is a sudden increase of about 2K.
  • For defect 2, the sum of defect pixels should lie between [316, 8163].

This is how the defect image and its segmentation look:

fig17

c. Sum of encoding class 3

fig18
fig19
  • The plot is very skewed, which means there are only a few images whose sum_encoding is greater than 100K.
  • In class 3, the sum of encoding starts from 115.
  • We can observe that after the 98th percentile there is a sudden increase of about 3K.

This is how the defect image and its segmentation look:

fig20

d. Sum of encoding class 4

fig 21
fig 22
  • The plot is skewed.
  • In class 4, the sum of encoding starts from 491.
  • We can set a threshold of 128K.

This is how the defect image and its segmentation look:

fig 23

Note: we will use this sum of defect pixels as a threshold after the segmentation model predicts a mask.

Observation

By visualising all the defect images, we concluded that:

  1. Steel with defect 1 has small pinholes on the surface, as shown in fig 4.
  2. Steel with defect 2 has small transverse cracks.
  3. Steel with defect 3 has large transverse cracks.
  4. Steel with defect 4 has large patches.

5. ‘No defect’ steel does not necessarily mean the steel is flawless. There may be some kind of defect on the steel, but of a type other than defects 1, 2, 3, and 4; the ‘no defect’ set also contains images with such surface patterns as well as steel with no defect at all.

Note: in terms of typical defect size, defect 4 > defect 3 > defect 2 > defect 1.

12. Binary Classification

We will follow this flowchart to find whether steel has a defect or not.

  • Data preparation for binary classification

For this, we take all the image names from the ‘train_images’ folder. If a filename is present in ‘train.csv’ then the image has a defect; otherwise the image does not have a defect.

fig 24
  • Splitting the data
  • Data distribution in train and validation
fig 25

Observation — Data is almost balanced.

  • Model Architecture

In our case study, we have used a pre-trained Xception model.

Note: we used the Xception model because it requires less computation. It uses depthwise separable convolutions instead of standard convolutions, and it also gets the benefits of ResNet-style residual connections. Read this for depthwise separable convolutions and read this for the Xception structure.

We save the weights to disk whenever the test F1 score improves over the previous epoch, and we also log to TensorBoard so that we can analyse how the loss changes at every epoch. A minimal sketch of this setup is below.
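
A minimal sketch of this setup, assuming Keras/TensorFlow 2.x; the input size, head layers, and monitored metric are illustrative assumptions rather than the exact architecture and callbacks of this case study (which checkpoints on the F1 score via a custom callback).

import tensorflow as tf
from tensorflow.keras import layers, Model

# Pre-trained Xception backbone without its classification head.
base = tf.keras.applications.Xception(
    include_top=False, weights='imagenet', input_shape=(299, 299, 3))

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dropout(0.3)(x)
out = layers.Dense(1, activation='sigmoid')(x)  # P(defect)
binary_model = Model(base.input, out)

binary_model.compile(optimizer='adam', loss='binary_crossentropy',
                     metrics=[tf.keras.metrics.Precision(),
                              tf.keras.metrics.Recall()])

# Save the best weights to disk and log training curves for TensorBoard.
callbacks = [
    tf.keras.callbacks.ModelCheckpoint('binary_xception.h5',
                                       monitor='val_loss',
                                       save_best_only=True),
    tf.keras.callbacks.TensorBoard(log_dir='logs/binary'),
]
# binary_model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=callbacks)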

  • Training Procedure
fig 25
  • Plot for F1score and loss
fig 26
fig 27

We can observe that the F1 score is increasing and the loss is decreasing. This gives us confirmation that the model is not overfitting the training data.

  • Choosing the best threshold
fig 28

We tried threshold values of 0.4, 0.45, 0.5, 0.55, 0.6, and 0.7 to check which threshold gives the maximum F1 score on the training data. The best score was obtained with a threshold of 0.45. The total training data was 9731 images, as shown in fig 28.

Now we check how the 0.45 threshold performs on the test data. At 0.45 we get a 0.984 F1 score; the total test data was 1509 images. A minimal sketch of this threshold search is below.
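
A minimal sketch of the threshold search, assuming scikit-learn is available; y_true holds the binary labels, y_prob the model's predicted probabilities, and the candidate thresholds follow the ones listed above.

import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob,
                   candidates=(0.4, 0.45, 0.5, 0.55, 0.6, 0.7)):
    # Compute the F1 score for every candidate threshold and keep the best one.
    scores = {t: f1_score(y_true, (y_prob >= t).astype(int)) for t in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Example usage:
# best, scores = best_threshold(y_train, binary_model.predict(train_ds).ravel())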

fig 29
  • Some prediction on testing data
fig 30

We took some random samples from the test dataset, and the actual and predicted labels match. The model works very well on the test dataset.

13. Multi-Label Classification

This flowchart helps to find which types of defect the steel has.

  • Data preparation for Multi-label classification
fig 31

Note: in the above figure, ImageLabel just helps us do stratified sampling while splitting into train, test, and validation.

While splitting the data into train/test/validation we must use stratified sampling so that the class distribution is the same in train, test, and validation.

  • Data distribution in train and validation
fig 32

Looking at the train and validation class distributions, we can see there is a class imbalance; class 3 images are far more frequent than the other classes.

  • Model Architecture

Here also we have used a pre-trained Xception model.

Note: we used sigmoid as the activation function for the last layer so that we get a probability for each defect individually. For multi-label classification we need an individual probability for each class, because one image can have more than one defect class, e.g. [0.001, 0.005, 0.98, 0.93].

We save the weights to disk whenever the F1 score improves over the previous epoch and log to TensorBoard to analyse how the loss changes at every epoch. A minimal sketch of the multi-label head is below.
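
A minimal sketch of the multi-label head, again assuming Keras/TensorFlow 2.x; the input size and pooling layer are assumptions for illustration.

import tensorflow as tf
from tensorflow.keras import layers, Model

base = tf.keras.applications.Xception(
    include_top=False, weights='imagenet', input_shape=(299, 299, 3))

x = layers.GlobalAveragePooling2D()(base.output)
# One sigmoid unit per defect class: each output is an independent probability,
# so a single image can be assigned several defect types at once.
out = layers.Dense(4, activation='sigmoid')(x)
multi_model = Model(base.input, out)

# Binary cross-entropy treats each of the 4 outputs as its own yes/no problem.
multi_model.compile(optimizer='adam', loss='binary_crossentropy')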

  • Plot for F1 score and loss

fig 33
fig 34

Once training completed, we evaluated the model's F1 score on the train, validation, and test images.

  • Choosing the best threshold

We tried threshold values of 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.75, 0.8, 0.85, and 0.9 to check which threshold gives the maximum F1 score on the training data. The best score was obtained when the threshold values for Class 1, Class 2, Class 3, and Class 4 were [0.8, 0.5, 0.65, 0.35]. The total training data was 4759 images.

fig 35

Now we check how the [0.8, 0.5, 0.65, 0.35] thresholds perform on the test data. The total test data was 1000 images.

fig 36
  • Some prediction on testing data
fig 37

14. Segmentation

fig 38

This is how multi-label segmentation looks. In this example we can see how one image has multiple masks, i.e., classes 0, 1, 2, 3, 4, and 5. Similarly, in our case, steel can contain one, several, or all 4 defects, so we will also do multi-label segmentation.

We know the type of defect, but how does the system know where the defect is present on the steel? We discuss this below.

This flow chart will help us to find where the defect is present in the steel.

a. Model for Defect 1

  • Making Dataframe

In the DataFrame there will be only two columns: the 1st will be ImageId and the 2nd will be EncodedPixels for class 1.

seg_type_1_data = df[df.ClassId == 1]
seg_type_1_data['ImageId'] = ['train_images/' + i for i in seg_type_1_data.ImageId]
  • Metrics and loss

a. Dice Coefficient

The Dice coefficient is essentially a measure of overlap between two samples. It ranges from 0 to 1, where a Dice coefficient of 1 denotes perfect and complete overlap. Check the video here.

The Dice coefficient is roughly equal to the harmonic mean of precision and recall; on pixels, the Dice coefficient equals the F1 score. Read this.

b. Dice Loss

Dice loss = 1 - Dice coefficient.

Read this to learn more about the loss. A minimal sketch of both functions is below.
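
A minimal sketch of both as Keras functions; the smoothing constant is an assumption to avoid division by zero, not necessarily the value used in this case study.

from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # Pixel-wise overlap between the ground-truth mask and the predicted mask.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)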

fig 39
  • Model Architecture
fig 40

In this case study, we have used U-Net with a pre-trained EfficientNetB0 encoder architecture.

Note: we used EfficientNet because it reduces the number of computations in the model by scaling depth, width (channels), and resolution efficiently. Read this and see this video. A minimal sketch of the segmentation model is below.
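
A minimal sketch, assuming the segmentation_models Keras library is used to build the U-Net; the input size, channels, and compile settings are assumptions for illustration rather than the exact configuration of this case study.

import segmentation_models as sm

sm.set_framework('tf.keras')

# U-Net decoder on top of a pre-trained EfficientNetB0 encoder.
model_defect1 = sm.Unet(
    backbone_name='efficientnetb0',
    encoder_weights='imagenet',
    input_shape=(256, 800, 3),   # must be divisible by 32
    classes=1,
    activation='sigmoid',
)

# dice_loss and dice_coef are the functions from the sketch above.
model_defect1.compile(optimizer='adam', loss=dice_loss, metrics=[dice_coef])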

  • Training procedure
fig 41
  • Plot for Dice and loss
fig 42
fig 43

We can see the dice coefficient is increasing and loss is decreasing at every epoch.

  • Some prediction on testing data
fig 44

Note: the model output will be a mask, which we have to convert into RLE. A minimal sketch of this conversion is below.
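
A minimal sketch of converting a predicted mask into the competition's run-length encoding (column-major order, space-separated 'start length' pairs); the 0.5 binarisation threshold is an assumption.

import numpy as np

def mask_to_rle(mask, threshold=0.5):
    # Binarise, flatten column-major (as the competition expects), and find the
    # positions where runs of 1s start and end.
    pixels = (mask > threshold).astype(np.uint8).flatten(order='F')
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]   # turn end positions into run lengths
    return ' '.join(str(x) for x in runs)

# Example: rle = mask_to_rle(model_defect1.predict(x)[0, :, :, 0])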

b. Model for Defect 2

  • Making DataFrame

In the DataFrame there will be only two columns: the 1st will be ImageId and the 2nd will be EncodedPixels for class 2.

seg_type_2_data = df[df.ClassId == 2]
seg_type_2_data['ImageId'] = ['train_images/' + i for i in seg_type_2_data.ImageId]
  • Model

We have used the same U-Net with pre-trained EfficientNetB0 encoder architecture.

  • Plot for Dice and loss
fig 45
fig 46

We can see the dice coefficient is increasing and loss is decreasing at every epoch.

  • Some prediction on testing data
fig 47

Note: the model output will be a mask, which we have to convert into RLE (as above).

c. Model for Defect 3

  • Making DataFrame

In the DataFrame there will be only two columns: the 1st will be ImageId and the 2nd will be EncodedPixels for class 3.

seg_type_3_data = df[df.ClassId == 3]
seg_type_3_data['ImageId'] = ['train_images/' + i for i in seg_type_3_data.ImageId]
  • Model

We have used the same U-Net with pre-trained EfficientNetB0 encoder architecture.

  • Plot for Dice and loss
fig 48
fig 49

We can see the dice coefficient is increasing and loss is decreasing.

  • Some prediction on testing data
fig 50

Note: the model output will be a mask, which we have to convert into RLE (as above).

d. Model for Defect 4

  • Making DataFrame

In the DataFrame there will be only two columns: the 1st will be ImageId and the 2nd will be EncodedPixels for class 4.

seg_type_4_data = df[df.ClassId == 4]
seg_type_4_data['ImageId'] = ['train_images/' + i for i in seg_type_4_data.ImageId]
  • Model

We have used the same U-Net with pre-trained EfficientNetB0 encoder architecture.

  • Plot for Dice and loss
fig 51
fig 52

We can see the dice coefficient is increasing and loss is decreasing at every epoch.

  • Some prediction on testing data
fig 53

Note: the model output will be a mask, which we have to convert into RLE (as above).

Oh yeah!! All models are ready…….

15. Approach to predicting queries: putting the models in sequence

fig 54
  1. Take a query steel image of any shape (a, b, 3).
  2. The query image is resized from (a, b, 3) to (1, 299, 299, 3). Then this image is passed to the binary model.

The binary model gives a probability value as output, e.g. [[0.98]]. We then check whether the probability is greater than 0.45. If it is, the steel has a defect (label = 1); otherwise it has no defect (label = 0).

3. If the steel has a defect, then the image of size (1, 299, 299, 3) is passed into the multi-label model.

The multi-label model gives 4 probability values as output, e.g. [[0.99, 0.88, 0.001, 0.002]]. This signifies a 99% chance that the image has a class 1 defect, an 88% chance that it also has a class 2 defect, a 0.1% chance of a class 3 defect, and a 0.2% chance of a class 4 defect.

4. Let a = [[0.99, 0.88, 0.001, 0.002]]

Now,

if a[0][0] >= 0.8:
    pass the same image, of size (1, 256, 800, 1), to the segmentation model that detects type 1 defects. The output of this model will be (256, 1600, 1); convert it into (256, 1600).
if a[0][1] >= 0.5:
    pass the same image, of size (1, 256, 800, 1), to the segmentation model that detects type 2 defects. The output of this model will be (256, 1600, 1); convert it into (256, 1600).
if a[0][2] >= 0.65:
    pass the same image, of size (1, 256, 800, 1), to the segmentation model that detects type 3 defects. The output of this model will be (256, 1600, 1); convert it into (256, 1600).
if a[0][3] >= 0.35:
    pass the same image, of size (1, 256, 800, 1), to the segmentation model that detects type 4 defects. The output of this model will be (256, 1600, 1); convert it into (256, 1600).

Finally, concatenate all 4 masks and we get an output of shape (256, 1600, 4). A minimal sketch of this whole inference sequence is below.
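
A minimal sketch of the whole inference sequence; the model variables, preprocessing, and resize choices are assumptions for illustration, while the thresholds follow the values above.

import numpy as np
import tensorflow as tf

SEG_THRESHOLDS = [0.8, 0.5, 0.65, 0.35]   # per-class thresholds found earlier

def predict_steel(image, binary_model, multi_model, seg_models):
    # 1. Binary defect / no-defect check on a 299x299 resized copy.
    x_cls = tf.image.resize(image, (299, 299))[tf.newaxis, ...] / 255.0
    if binary_model.predict(x_cls)[0][0] < 0.45:
        return None   # no defect detected

    # 2. Multi-label probabilities for the 4 defect classes.
    probs = multi_model.predict(x_cls)[0]

    # 3. Run a per-class segmentation model only where that class fires.
    x_seg = tf.image.resize(image, (256, 800))[tf.newaxis, ...] / 255.0
    masks = []
    for p, thr, seg_model in zip(probs, SEG_THRESHOLDS, seg_models):
        if p >= thr:
            mask = seg_model.predict(x_seg)[0, :, :, 0]
            mask = tf.image.resize(mask[..., None], (256, 1600))[..., 0].numpy()
        else:
            mask = np.zeros((256, 1600), dtype=np.float32)
        masks.append(mask)

    # 4. Stack the 4 per-class masks into a (256, 1600, 4) output.
    return np.stack(masks, axis=-1)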

  • Code

LinkedIn profile and GitHub profile

  • Prediction results on new data

Deployment video of this project using Streamlit on YouTube

16. Future Scope

  • We can use a different pipeline, such as Faster R-CNN, which may perform better than this one.
  • We can try other losses, such as focal loss.
  • Try different thresholds.
  • Increase the batch size if you have good computational resources.
  • Use more data augmentation.

References

  1. https://www.appliedaicourse.com
  2. https://www.kaggle.com/ekhtiar/defect-area-segments-eda-with-plotly-fp-mining
  3. https://www.kaggle.com/paulorzp/rle-functions-run-lenght-encode-decode
  4. https://arxiv.org/abs/2006.14822
  5. https://udibhaskar.github.io/practical-ml/debugging%20nn/neural%20network/overfit/underfit/2020/02/03/Effective_Training_and_Debugging_of_a_Neural_Networks.html
  6. https://www.tensorflow.org/guide/data
  7. https://stackoverflow.com/questions/58693261/create-a-rle-run-lenth-encoding-mask-with-tensorflow-datasets
