Why Should I Trust You?
In this blog, we will discuss the LIME concept. LIME helps us interpret any Machine Learning or Deep Learning model by providing local model interpretability.
Why Was LIME Introduced?
There is one major issue: as Machine Learning models get more complex (SVM with kernels, Deep Learning, etc.), we lose the interpretability of the model. This means we cannot tell on what basis the model predicts class 1, class 2, and so on.
For example, suppose you trained a model to classify horses and dogs, and the model works really well on the train and test datasets but fails after deployment. One possibility is that all the horse images contained a grass background, and the model learned that wherever grass appears it should predict "horse". That's why we need an interpretable model, to see why the model gives a particular output.
Here are other examples, taken from the LIME paper.
Example 1:
Here you can see that the model predicts correctly, but "Christianity" is learned in the wrong way because the prediction depends on stopwords like "this", "by", "in", etc.
Example 2:
Here you can see the model is working sensibly because it has learned where the "labrador" and the "guitar" are in the image.
NOTE — This kind of interpretation is very useful when working on case studies from the medical or finance domains.
REACTIONS
WOW!!
NICE!!
AWESOME!!
COOL!!
Importance of Interpretability
- We can trust our model predictions.
- We can know exactly how our AI system behaves.
- We can understand the behaviour of the model: is our data behaving as expected?
- We can improve and enhance the model.
- Better decision making through interpretation.
Read more about the importance of interpretability here
Let's digest LIME slowly
Let's first ask a basic question: what is "Local" in LIME (Local Interpretable Model-agnostic Explanations)?
To understand Local, we first have to understand what Global is.
If we build a linear model to classify two classes, the model gives us global feature importance scores. Now suppose we give a test query 'x_q' to this linear model; the interpretation of 'x_q' will be based on the global hyperplane, i.e. the global feature importance scores. Globally, it doesn't matter where 'x_q' lies: the feature importance scores stay the same, as sketched below.
Similarly, in a tree-based model, we look at the tree globally to decide the output of the test query.
Note — Linear and Tree-based models are highly interpretable.
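To make "global" concrete, here is a minimal sketch (sklearn's built-in wine data is just an assumption for illustration): a linear model's learned coefficients act as one fixed, global set of feature importances used for every query point.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

# Illustrative only: load a small tabular dataset
X, y = load_wine(return_X_y=True)

# A linear model's coefficients are GLOBAL feature importances:
# the same weights explain every query point x_q, wherever it lies.
clf = LogisticRegression(max_iter=5000).fit(X, y)
print(clf.coef_)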
Now let's understand what Local is.
Let f() be any function that produces this kind of decision boundary. The key point is that when we get a test query x_q, we do not look at the whole plane; we only look at a small area around x_q to understand why the model assigned x_q to a particular class. That is why it is called a local interpretation, as shown in the figure.
NOTE — LIME works for both regression and classification; the output 'y' is a real number.
Intuition behind LIME
NOTE — The big and small red pluses, and similarly the big and small blue dots, indicate that we give less weight to points the farther they are from the query point x_q.
REACTIONS
What the heck!!
What is this!!
Is this Amoeba?
OK!! For now, just note that there will be one line or hyperplane that classifies the two classes in the local region. This linear model + regularisation is called a surrogate model, represented by g().
The surrogate model is almost equivalent to the black-box model f(), but the input to the surrogate model is not the same input 'x' on which f() was trained. Instead, we give the surrogate model 'x_dash', an interpretable vector. 'x_dash' is interpretable because we convert 'x' into a binary vector (using OHE or BOW), and such vectors are easy to interpret.
NOTE — ‘x’ dim != ‘x_dash’ dim because ‘x_dash’ is a binary vector.
Now let's see how to convert data into an interpretable form for different data types, e.g. TEXT, NUMERICAL, and IMAGE.
TEXT :
To make text data interpretable, we can convert the text into a bag of words. With BOW we can easily see which words appear and how many times they repeat, so it is highly interpretable.
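A minimal sketch of this idea, with toy sentences as an assumption (using sklearn's CountVectorizer with binary=True so each entry just says whether a word is present or not):
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, purely for illustration
docs = ["the dog sat on the mat", "a dog chased the cat"]

# binary=True gives a 0/1 "word present or not" vector -> easy to interpret
vectorizer = CountVectorizer(binary=True)
x_dash = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())
print(x_dash)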
NUMERICAL:
To make a numerical value interpretable, we can first bin it and then apply OHE (one-hot encoding).
Eg. ==>
Input : [1.1, 2.1, 2.7, 7, 8.4, 9.1, 9.7, 10]
Approach :
After binning, we will get categories A, B, C, …
After this, we can convert the categorical input into OHE, as in the sketch below.
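A minimal sketch of this binning + OHE step on the input above (three equal-width bins labelled A, B, C are an assumption):
import pandas as pd

# The numerical input from the example above
values = pd.Series([1.1, 2.1, 2.7, 7, 8.4, 9.1, 9.7, 10])

# Assumption: three equal-width bins labelled A, B, C
binned = pd.cut(values, bins=3, labels=["A", "B", "C"])

# One-hot encode the binned categories -> an interpretable binary vector per value
ohe = pd.get_dummies(binned)
print(ohe)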
IMAGE :
To make an image interpretable, we break the image into superpixels and then build a binary vector by marking whether each superpixel is present or not, as in the sketch below.
Read more about superpixels here
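A minimal sketch of the superpixel idea using scikit-image's slic (the sample image, the number of superpixels, and the random perturbation are all assumptions for illustration):
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import slic

# Sample image (scikit-image's built-in photo), illustration only
image = astronaut()

# Segment the image into roughly 50 superpixels
segments = slic(image, n_segments=50, compactness=10)

# Interpretable representation: one binary entry per superpixel (1 = present, 0 = switched off)
n_superpixels = len(np.unique(segments))
x_dash = np.ones(n_superpixels, dtype=int)   # the original image keeps every superpixel

# A perturbed neighbour z_dash: randomly switch some superpixels off
rng = np.random.default_rng(0)
z_dash = rng.integers(0, 2, size=n_superpixels)
print(n_superpixels, z_dash[:10])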
REACTION
Nice hack for images!!
Before getting into the maths, let's clarify some terms.
1. Surrogate Model
- g() = an approximation of f() in the neighbourhood of x_q.
- Model g ∈ G = the class of easily interpretable models (e.g. a linear model with regularisation, or a decision tree).
- Ω(g) = complexity of g (e.g. the depth of a DT model, or the number of non-zero weights in a linear model).
- g(x_dash) != g(x), because x_dash is the interpretable representation and x is the original feature vector.
2. Proximity Function
- π_x(z) = a measure of how close a sample z is to x; it defines the local region around x.
3. Local Fidelity and Interpretability
- Our goal is to make the g() model behave very much like f() in the region defined by π_x, because g() is interpretable. g() is interpretable because we have converted x into x_dash, and thanks to x_dash we can now interpret our model.
COOOOOOL!!
Optimization
- π_x(z) is an exponential kernel; it lets us focus on the local region. The size of the local region depends on σ²: as σ² increases, the kernel takes more points into account, similar to KNN.
- Capital Z is the set of samples in the local region; lowercase z and z_dash must lie in this set Z.
- We are doing the same thing as in regression optimization, i.e. minimising a squared loss. π_x(z) gives more weight to the points that are closer to x_q.
- Ω(g) is implemented as K-LASSO so that we can reduce the complexity of the linear model. In a linear model the complexity depends on the number of non-zero weights, so if we use L1 regularisation it will push the weights of unimportant features to zero. The full objective is written out below.
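Putting these pieces together, the objective that LIME minimises (as given in the LIME paper) is:

\xi(x) = \operatorname*{argmin}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)

\mathcal{L}(f, g, \pi_x) = \sum_{z, z' \in \mathcal{Z}} \pi_x(z)\,\big(f(z) - g(z')\big)^2

\pi_x(z) = \exp\!\big(-D(x, z)^2 / \sigma^2\big)

Here D(x, z) is a distance function (e.g. cosine distance for text, L2 distance for images), f(z) is the black-box prediction on a perturbed sample z, and g(z') is the surrogate's prediction on its interpretable version z'.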
POWER of L1 reg!!
Advantage:
- Works with image, text, and tabular data.
- LASSO or a shallow decision tree for G is highly interpretable.
Disadvantage:
- Kernel width (σ) — what is the right value?
- It may not always be straightforward to convert 'x' into an interpretable vector 'x_dash'.
- Lack of robustness.
NOTE — The core LIME code is only 4–6 lines. Read this documentation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import lime
from lime import lime_tabular

## Loading data
data = pd.read_csv('wine.csv')
data.head()

## Data splitting
X = data.drop('Wine', axis=1)
y = data['Wine']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
score = rf_model.score(X_test, y_test)

## LIME
explainer = lime_tabular.LimeTabularExplainer(training_data=np.array(X_train),
                                              feature_names=X_train.columns,
                                              class_names=[1, 2, 3],
                                              mode='classification')
exp = explainer.explain_instance(data_row=X_test.iloc[3],
                                 predict_fn=rf_model.predict_proba)
exp.show_in_notebook(show_table=True)
Observation — It predicts class = 1 because the Proline level is greater than 932.75, the colour intensity is between 4.6 and 6.12, and the Alcohol level is greater than 13.68.
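If you are not in a notebook, the same explanation can also be read off programmatically; a small sketch, continuing the code above:
# (feature, weight) pairs behind the explanation shown above;
# as_list() returns them for the label that was explained
for feature, weight in exp.as_list():
    print(feature, round(weight, 3))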
References
- https://www.youtube.com/watch?v=d6j6bofhj2M
- https://www.youtube.com/results?search_query=lime+code
- https://www.appliedaicourse.com
- https://arxiv.org/pdf/1602.04938.pdf
- https://lime-ml.readthedocs.io/en/latest/
- https://towardsdatascience.com/lime-how-to-interpret-machine-learning-models-with-python-94b0e7e4432e