Understanding Support Vector Machine with Kernels
A practical guide on how to predict whether a client will subscribe to a term deposit or not using a bank marketing dataset
In this article, I am going to describe how to build a support vector machine with kernels to predict whether a client will subscribe to a term deposit or not. The dataset I am using is a Bank Marketing dataset. The dataset can be downloaded here.
1. What is Support Vector Machine?
Support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification.
Support vectors are the training observations that lie closest to the decision boundary and therefore determine its position. Classification in SVM is done by separating the classes with a hyperplane (a line in two dimensions) that leaves the largest possible margin to those observations.
SVM performs well with small datasets.
What is Support Vector Machine with kernels?
Kernels let an SVM handle classes that are not linearly separable. A kernel function implicitly maps the low-dimensional input space into a higher-dimensional space, and the class separation is then done in that higher-dimensional space. Because the kernel computes the inner products in that space directly, the SVM never has to compute the transformed coordinates, which keeps the calculation cheap.
Examples of SVM kernels are the polynomial kernel, the Gaussian kernel (also called the radial basis function or RBF kernel), and the sigmoid kernel.
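To make the kernel idea concrete, here is a minimal sketch using a degree-2 polynomial kernel on 2-dimensional inputs. The function names poly_kernel and phi and the sample points are hypothetical and chosen only for illustration. The sketch shows that evaluating the kernel in the original space gives the same number as mapping both points to a 3-dimensional space and taking the dot product there.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map into 3-D space that corresponds to poly_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Two hypothetical 2-D observations
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_direct = poly_kernel(x, z)        # kernel evaluated without any mapping
k_mapped = np.dot(phi(x), phi(z))   # dot product after the explicit mapping

print(k_direct, k_mapped)  # both are 121 (up to floating-point rounding)
```

The SVM only ever needs the value on the left, so it gets the benefit of the higher-dimensional space without constructing it.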
2. Dataset description
The Bank Marketing dataset is obtained from the UCI Machine Learning Repository.
Attribute Information
Input variables:
# bank client data:
1. age (numeric)
2. job : type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
3. marital : marital status (categorical: ‘divorced’,’married’,’single’,’unknown’; note: ‘divorced’ means divorced or widowed)
4. education (categorical: ‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
5. default: has credit in default? (categorical: ‘no’,’yes’,’unknown’)
6. housing: has housing loan? (categorical: ‘no’,’yes’,’unknown’)
7. loan: has personal loan? (categorical: ‘no’,’yes’,’unknown’)
# related with the last contact of the current campaign:
8. contact: contact communication type (categorical: ‘cellular’,’telephone’)
9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10. day_of_week: last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
# social and economic context attributes
16. emp_var_rate: employment variation rate — quarterly indicator (numeric)
17. cons_price_idx: consumer price index — monthly indicator (numeric)
18. cons_conf_idx: consumer confidence index — monthly indicator (numeric)
19. euribor3m: euribor 3 month rate — daily indicator (numeric)
20. nr_employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21. y — has the client subscribed a term deposit? (binary: ‘yes’,’no’)
3. Problem Statement
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
4. Developing the SVM
The target column of our dataset is y. It indicates whether the client has subscribed to a term deposit or not. The values are binary ‘1’ and ‘0’. First, the target column needs to be checked to get an understanding of the class distribution.
X['y'].value_counts()
Output:
The class imbalance is graphically represented below.
It is obvious that there is a considerable class imbalance in the dataset. To handle the class imbalance I will apply SMOTE later.
Data Preprocessing
There were 12 duplicate records in the dataset, and they were removed. The dataset was checked for missing values, and none were found. However, a few columns contain the value ‘unknown’. I did not remove these rows, since the values may have been recorded as unknown for a valid reason, although we do not know for sure.
The dataset description includes the following note for the column duration.
duration: This attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
So I dropped the column ‘duration’.
Handling outliers
In the dataset, there are both numerical and categorical columns. Numerical columns should be checked to detect outliers. Boxplots for all 9 numerical columns are shown below.
When we consider the box plots age, pdays, emp_var_rate, cons_price_idx, euribor3m and nr_employed do not have outliers.
But campaign, cons_conf_idx, previous have outliers.
The code snippet below is used to drop the outliers in the campaign column.
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")
fig, axes = plt.subplots(1, 2)
plt.tight_layout(pad=0.2)
print("Before Shape:", X.shape)
# Keep only rows where campaign is below 50; larger values are treated as outliers
X_df = X[X['campaign'] < 50]
print("After Shape:", X_df.shape)
sns.boxplot(y=X['campaign'], ax=axes[0])
axes[0].title.set_text("Before")
sns.boxplot(y=X_df['campaign'], ax=axes[1])
axes[1].title.set_text("After")
plt.show()
# Continue with the filtered data and a fresh index
X = X_df.reset_index(drop=True)
Output:
The outliers in the other columns were removed in the same way.
Feature Encoding
Feature encoding should be applied to all categorical columns. To identify the categorical columns in the dataset, the following code snippet is used.
cat_df = X1.select_dtypes(include="object")
cat_df.info()
Output:
To check the number of unique values in each of these columns, use the code below.
cat_df.nunique()
Output:
We can apply label encoding to all categorical columns. But the education column has a natural order, since educational qualifications can be ranked, so it can be treated as ordinal. We need to assign each qualification a rank accordingly.
To find the unique values in the education column, the code below is used.
X1.education.unique()
Output:
To assign a rank to each value, the code below can be used.
scale_mapper = {"unknown":0, "illiterate":1, "basic.4y":2, "basic.6y":3, "basic.9y":4, "high.school":5, "professional.course":6, "university.degree":7}
X1["education"] = X1["education"].replace(scale_mapper)
I assigned the lowest value, 0, to unknown and the highest value, 7, to university degree. The other values are mapped in order between them. The resulting column is shown below.
Output:
All the other categorical columns can be encoded using a label encoder.
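As a rough illustration of that step, the sketch below applies scikit-learn's LabelEncoder to two categorical columns. The toy DataFrame and the encoders dictionary are assumptions made for the example and are not the article's data. LabelEncoder assigns integer codes in alphabetical order of the category names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with two of the categorical columns
df = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

encoders = {}  # keep one fitted encoder per column so codes can be decoded later
for col in ["marital", "contact"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
print(list(encoders["marital"].classes_))  # category order behind the codes
```

Keeping the fitted encoders makes it possible to call inverse_transform later and read the codes back as category names.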
The output dataset is shown below.
Transformations
Q-Q plots and histograms for each numerical column are shown below. By analyzing those graphs skewness can be identified.
Both age and campaign show a right skewness, so we need to transform the data to bring it closer to a normal distribution. I applied a square root transformation to reduce the right skewness of age.
Similarly, a square root transformation should be applied to campaign as well.
After the transformation, both age and campaign are closer to a normal distribution.
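The square root transformation described above can be sketched as follows. The age values here are a hypothetical right-skewed sample, not the article's data, and the skewness is measured with pandas' Series.skew() as one possible check alongside the Q-Q plots.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for the age column
age = pd.Series([18, 20, 22, 25, 25, 30, 35, 40, 60, 95], dtype=float)

# The square root compresses large values more than small ones,
# which pulls in the long right tail
age_sqrt = np.sqrt(age)

skew_before = age.skew()
skew_after = age_sqrt.skew()
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

A skewness closer to 0 after the transformation indicates a more symmetric distribution.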
I decided not to apply transformations for pdays since it is distributed only in 2 separate regions. But previous is right-skewed. So I applied square root transformation to it.
We can see that emp_var_rate shows a left skewness, so I applied a square transformation to reduce it.
The output is shown below.
For cons_price_idx transformations were not applied.
The Q-Q plots and histograms for cons_conf_idx and euribor3m are shown below.
Transformations were not applied to either cons_conf_idx or euribor3m.
Since nr_employed shows a left skewness, a square transformation was applied to it. The output is shown below.
Now we have applied the necessary transformations and next, we should do standardization.
Standardization
We apply standardization only for numerical columns. The code used is shown below.
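A minimal sketch of standardizing only the numerical columns, assuming scikit-learn's StandardScaler is the tool used (the article does not say which). The toy DataFrame and the num_cols list are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numerical columns and one encoded categorical column
df = pd.DataFrame({
    "age": [25.0, 35.0, 45.0, 55.0],
    "campaign": [1.0, 2.0, 3.0, 6.0],
    "contact": [0, 1, 0, 1],
})

num_cols = ["age", "campaign"]  # only these columns are standardized

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df)
```

After this step each numerical column has mean 0 and unit variance, while the encoded categorical column is left unchanged.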
Feature Extraction
To identify the significant features we can use the correlation matrix, which shows how strongly each attribute is correlated with the other attributes and with the target.
plt.figure(figsize = (18,10))
sns.heatmap(X3.corr(), annot = True);
Correlation matrix:
When we consider the correlation between attributes, we cannot see any pair of attributes with a considerably high correlation. When we consider the significance of the attributes, some attributes have low significance.
I decided to drop 4 attributes: day_of_week, job, housing, and loan. I tried dropping a few different combinations of attributes, but dropping those 4 gave a higher precision than the other combinations. The recall and F1 score did not show much difference.
SMOTE
I first trained the SVM on the imbalanced data, but the overall performance was low. So I applied SMOTE, and although the performance did not increase dramatically, it was better than without SMOTE.
Next, I separate the target variable, split the dataset, and then handle the class imbalance by applying SMOTE to the training split.
First, I separated the target column ‘y’.
y_true = X4['y']
X_dataset = X4.drop('y', axis=1)
Then the dataset is split into a 0.8 : 0.2 ratio and SMOTE is applied to the training data which was obtained after splitting.
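The split-then-oversample order can be sketched as below. This is a simplified re-implementation of the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) written with NumPy and scikit-learn for illustration. It is not the implementation used in the article. Libraries such as imbalanced-learn provide a full SMOTE. The function smote_sketch, the toy data, and the use of stratify are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)           # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]    # one of their k neighbours
    gap = rng.random((n_new, 1))                             # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Hypothetical imbalanced toy data: 180 negatives and 20 positives
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (180, 3)), rng.normal(2, 1, (20, 3))])
y = np.array([0] * 180 + [1] * 20)

# Split first (0.8 : 0.2), so the test set keeps its original class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority class in the training split only
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_sketch(X_min, n_new)

smoted_X = np.vstack([X_train, X_syn])
smoted_y = np.concatenate([y_train, np.ones(n_new, dtype=int)])

print("train counts before:", np.bincount(y_train))
print("train counts after: ", np.bincount(smoted_y))
```

Oversampling after the split keeps synthetic rows out of the test set, so the test metrics still reflect the real class distribution.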
When we visualize the values of the target variable ‘y’, we can see that the class imbalance problem is resolved.
X_train = smoted_X
y_train = smoted_y
smoted_X is our X_train data and smoted_y is our y_train data. SMOTE was not applied to X_test and y_test.
Principal Component Analysis
We should consider the cumulative explained variance of the highest-variance components. The 9 components with the highest variance together explain 96% of the variance. So, I set n_components equal to 9.
pca.explained_variance_ratio_
Output:
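The way the number of components is chosen above can be sketched as follows, assuming scikit-learn's PCA. The synthetic data and the 0.96 threshold variable are hypothetical stand-ins, so the number of components found here will not necessarily be 9.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 rows, 12 features with decreasing variance
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12)) * np.linspace(3.0, 0.2, 12)

# Fit PCA with all components to inspect the explained variance ratios
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.96
n_components = int(np.searchsorted(cumulative, threshold) + 1)

# Refit with the chosen number of components and project the data
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X)

print("cumulative explained variance:", cumulative.round(3))
print("components kept:", n_components)
```

In a train/test setting, the PCA would be fitted on the training data and then applied to the test data with transform.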
SVM with Kernel
The RBF kernel is used since it works well in practice and is relatively easy to tune.
Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy. In other words, C behaves as a regularization parameter in the SVM.
I used C=1 and gamma=0.05. These values gave better predictions than the other combinations of C and gamma that I tried.
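A minimal sketch of training an RBF-kernel SVM with these settings, assuming scikit-learn's SVC. The data comes from make_classification and is only a stand-in for the preprocessed bank marketing data, so the scores it prints say nothing about the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical imbalanced data with 9 features (standing in for the 9 PCA components)
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# RBF kernel with the C and gamma values reported in the article
clf = SVC(kernel="rbf", C=1, gamma=0.05)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
train_score = clf.score(X_train, y_train)  # mean accuracy on the training data
test_score = clf.score(X_test, y_test)     # mean accuracy on the test data

print("training set score:", train_score)
print("testing set score: ", test_score)
```

Other C and gamma combinations can be compared by refitting with different values, or more systematically with a grid search.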
A confusion matrix is created to evaluate the performance of the classification model.
Output:
- True positives — 560
- True negatives — 6500
- False positives — 830
- False negatives — 380
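For reference, the four counts can be read from scikit-learn's confusion_matrix as sketched below. The y_true and y_pred lists are a small hypothetical example, not the article's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = subscribed, 0 = not subscribed)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# Rows are true classes and columns are predicted classes, both ordered 0 then 1
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() returns the counts in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)  # TN: 4 FP: 1 FN: 1 TP: 2
```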
The classification performance for the positive class is low. It may be due to the class imbalance of the original dataset.
A classification report is used to measure the quality of the predictions of a classification model.
The output is shown below.
The precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.
The support is the number of occurrences of each class in y_true.
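As a worked example of these formulas, the sketch below plugs in the confusion matrix counts listed earlier in this article (560 true positives, 6500 true negatives, 830 false positives, 380 false negatives). The values are for the positive class only and follow from those counts, so they may differ slightly from a classification report computed on the exact predictions.

```python
# Counts taken from the confusion matrix reported in the article
tp, tn, fp, fn = 560, 6500, 830, 380

precision = tp / (tp + fp)                           # 560 / 1390
recall = tp / (tp + fn)                              # 560 / 940
f1 = 2 * precision * recall / (precision + recall)   # F-beta with beta == 1.0
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 7060 / 8270

print(f"precision: {precision:.3f}")  # 0.403
print(f"recall:    {recall:.3f}")     # 0.596
print(f"f1 score:  {f1:.3f}")         # 0.481
print(f"accuracy:  {accuracy:.3f}")   # 0.854
```

The gap between the accuracy and the positive-class precision shows how the large number of true negatives can dominate the overall score.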
Precision, recall, and F1 score for the test data are obtained as below.
Results are shown below.
We can also get training set score and testing set score as below.
The result is shown below.
Here the test set score is high, but that is due to the class imbalance in the testing data. Since the number of true negatives is very high, the test score is high even though the performance on the positive class is low.
Conclusion
In this article, I gave you a step-by-step guide on data preprocessing, feature encoding, transformations, standardization, feature extraction, handling class imbalance, dimensionality reduction and finally building a support vector machine with kernels.
I hope this article helped you to gain a good understanding of the important steps we should follow to build a binary classification model using SVM.
Happy Coding!