Understanding Support Vector Machine with Kernels

A practical guide on how to predict whether a client will subscribe to a term deposit or not using a bank marketing dataset

11 min read · Jun 20, 2021

In this article, I describe how to build a support vector machine (SVM) with kernels to predict whether a client will subscribe to a term deposit. The dataset I am using is the Bank Marketing dataset, which can be downloaded from the UCI Machine Learning Repository.

1. What is Support Vector Machine?

A support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. In practice, it is mostly used for classification problems.

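Support vectors are the training observations that lie closest to the decision boundary; they alone determine where the boundary is placed. Classification in SVM is done by separating the classes with a hyperplane (a line in two dimensions), chosen so that the margin between the classes is as large as possible.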

SVM performs well with small datasets.

What is Support Vector Machine with kernels?

Kernels let an SVM handle classes that are not linearly separable. A kernel function implicitly maps the low-dimensional input space into a higher-dimensional space, where the classes can be separated by a hyperplane. The SVM never has to compute the mapped coordinates explicitly: the kernel returns the inner product of two points in the higher-dimensional space directly, which keeps the calculation cheap. This is often called the kernel trick.

Examples of SVM kernels are the polynomial kernel, the Gaussian kernel (also known as the radial basis function or RBF kernel), and the sigmoid kernel.
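To make the idea concrete, here is a minimal sketch of two kernels written with NumPy. The function names (poly2_kernel, poly2_feature_map, rbf_kernel) and the sample points are illustrative assumptions, not part of the original notebook. It shows that a degree-2 polynomial kernel evaluated in the 2-D input space gives the same number as an inner product in an explicit 3-D feature space.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2, computed in the input space."""
    return float(np.dot(x, z) ** 2)

def poly2_feature_map(x):
    """Explicit 3-D feature map whose inner product equals poly2_kernel for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.sum(diff ** 2)))

# Two illustrative 2-D points
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

k_direct = poly2_kernel(x, z)  # kernel evaluated without any mapping
k_mapped = float(np.dot(poly2_feature_map(x), poly2_feature_map(z)))  # explicit mapping
print(k_direct, k_mapped)      # both are 16 (up to floating-point rounding)
print(rbf_kernel(x, z, gamma=0.05))
```

The RBF kernel corresponds to an infinite-dimensional feature space, so there the explicit mapping cannot be written out at all; only the kernel value is ever computed.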

2. Dataset description

The Bank Marketing dataset is obtained from the UCI Machine Learning Repository.

Attribute Information

Input variables:
# bank client data:
1. age (numeric)
2. job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3. marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5. default: has credit in default? (categorical: 'no', 'yes', 'unknown')
6. housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
7. loan: has personal loan? (categorical: 'no', 'yes', 'unknown')
# related with the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular', 'telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
# social and economic context attributes
16. emp_var_rate: employment variation rate — quarterly indicator (numeric)
17. cons_price_idx: consumer price index — monthly indicator (numeric)
18. cons_conf_idx: consumer confidence index — monthly indicator (numeric)
19. euribor3m: euribor 3 month rate — daily indicator (numeric)
20. nr_employed: number of employees — quarterly indicator (numeric)
Output variable (desired target):
21. y: has the client subscribed to a term deposit? (binary: 'yes', 'no')

3. Problem Statement

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
The classification goal is to predict whether the client will subscribe to a term deposit (variable y, yes/no).

4. Developing the SVM

The target column of our dataset is y. It indicates whether the client has subscribed to a term deposit or not. The values are binary ‘1’ and ‘0’. First, the target column needs to be checked to get an understanding of the class distribution.

X['y'].value_counts()

Output:

The class imbalance is graphically represented below.

Graphical representation of class imbalance

It is obvious that there is a considerable class imbalance in the dataset. To handle the class imbalance I will apply SMOTE later.

Data Preprocessing

There were 12 duplicate records in the dataset, and they were removed. The dataset was checked for missing values, and none were found. However, a few columns contain the value 'unknown'. I did not remove these records, since the value may have been recorded as unknown for a valid reason, although we cannot know for sure.

The dataset description includes the following note for the column duration.

duration: This attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

So I dropped the column ‘duration’.
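The preprocessing steps above can be sketched with pandas as follows. The toy DataFrame is an assumption made only so that the snippet runs on its own; the real dataset has 21 columns. X is used as the DataFrame name, as in the rest of this article.

```python
import pandas as pd

# Toy stand-in for the bank marketing data (assumed values, for illustration only)
X = pd.DataFrame({
    "age": [30, 30, 45, 52],
    "duration": [120, 120, 0, 300],
    "campaign": [1, 1, 3, 2],
    "y": [0, 0, 0, 1],
})

X = X.drop_duplicates().reset_index(drop=True)  # the first two rows are identical
print(X.isnull().sum())                         # number of missing values per column
X = X.drop(columns=["duration"])                # duration is not known before the call
print(X.shape)
```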

Handling outliers

In the dataset, there are both numerical and categorical columns. Numerical columns should be checked to detect outliers. Boxplots for all 9 numerical columns are shown below.

Looking at the box plots, age, pdays, emp_var_rate, cons_price_idx, euribor3m, and nr_employed do not have outliers.

However, campaign, cons_conf_idx, and previous do have outliers.

The code snippet below drops the outliers in the campaign column by keeping only the rows where campaign is below 50.

import warnings
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")
fig, axes = plt.subplots(1, 2)
plt.tight_layout(pad=0.2)
print("Before Shape:", X.shape)
# Keep only rows where campaign is below 50; larger values are treated as outliers
X_df = X[X['campaign'] < 50]
print("After Shape:", X_df.shape)
sns.boxplot(y=X['campaign'], ax=axes[0])
axes[0].title.set_text("Before")
sns.boxplot(y=X_df['campaign'], ax=axes[1])
axes[1].title.set_text("After")
plt.show()
X = X_df.reset_index(drop=True)

Output:

Before and after outliers in the campaign column are removed

The outliers in the other columns were removed in the same way.

Feature Encoding

Feature encoding should be applied to all categorical columns. To identify the categorical columns in the dataset, the following code snippet is used.

cat_df = X1.select_dtypes(include="object")
cat_df.info()

Output:

To check the number of unique values in each of these columns, use the code below.

cat_df.nunique()

Output:

Unique values in each categorical column

We can apply label encoding to all categorical columns. However, educational qualifications have a natural order, so the education column can be treated as ordinal and each qualification needs to be assigned a rank accordingly.

To find the unique values in the education column, the code below is used.

X1.education.unique()

Output:

To assign a rank to each value, the code below can be used.

scale_mapper = {"unknown":0, "illiterate":1, "basic.4y":2, "basic.6y":3, "basic.9y":4, "high.school":5, "professional.course":6, "university.degree":7}
X1["education"] = X1["education"].replace(scale_mapper)

I assigned the lowest value, 0, to unknown and the highest value, 7, to university.degree. The other values are mapped in order in between. The resulting column is shown below.

Output:

All the other categorical columns can be encoded using a label encoder.
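A minimal sketch of this step with scikit-learn's LabelEncoder is shown here. The toy DataFrame and the list of columns are assumptions for illustration; in the real dataset the list would contain every remaining categorical column. One encoder is kept per column so that the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for X1 (assumed values, for illustration only)
X1 = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
    "age": [30, 41, 55, 28],
})

categorical_cols = ["marital", "contact"]  # assumed list of nominal columns
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    # Classes are sorted alphabetically and replaced by 0, 1, 2, ...
    X1[col] = encoders[col].fit_transform(X1[col])

print(X1)
print(list(encoders["marital"].classes_))
```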

The output dataset is shown below.

Transformations

Q-Q plots and histograms for each numerical column are shown below. By analyzing those graphs skewness can be identified.

Q-Q plots and Histograms for age and campaign

Both age and campaign are right-skewed, so we need to transform them to bring them closer to a normal distribution. I applied a square root transformation to reduce the right skewness of age.
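A square root transformation of this kind can be sketched as follows. The DataFrame name df and its values are assumptions for illustration, not the article's data. The skewness is printed before and after so that the effect of the transformation is visible.

```python
import numpy as np
import pandas as pd

# Toy right-skewed column (assumed values, for illustration only)
df = pd.DataFrame({"age": [16, 25, 36, 49, 81]})

skew_before = df["age"].skew()
df["age"] = np.sqrt(df["age"])  # compresses large values more than small ones
skew_after = df["age"].skew()

print("skewness before:", skew_before)
print("skewness after:", skew_after)
```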

After the transformation, the distribution is much closer to normal.

Histogram for age after applying square root transformation

Similarly, a square root transformation should be applied to campaign as well.

Q-Q plots and histograms for pdays and previous

I decided not to apply a transformation to pdays, since its values fall into only 2 separate regions. However, previous is right-skewed, so I applied a square root transformation to it.

Q-Q plots and Histograms for emp_var_rate and cons_price_idx

We can see that emp_var_rate is left-skewed, so I applied a square transformation (squaring the values) to reduce its left skewness.

The output is shown below.

Histogram after applying square transformation to emp_var_rate

For cons_price_idx, no transformation was applied.

The Q-Q plots and histograms for cons_conf_idx and euribor3m are shown below.

Transformations were not applied to either cons_conf_idx or euribor3m.

Since nr_employed is left-skewed, a square transformation was applied to it. The output is shown below.

Histogram after applying square transformation to nr_employed

Now we have applied the necessary transformations and next, we should do standardization.

Standardization

We apply standardization only to the numerical columns.
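Standardizing only the numerical columns can be sketched with scikit-learn's StandardScaler. The DataFrame df, its values, and the numeric_cols list are assumptions for illustration. The encoded categorical column is deliberately left out of the list and stays unchanged.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (assumed values, for illustration only)
df = pd.DataFrame({
    "age": [4.0, 5.0, 6.0, 7.0],
    "campaign": [1.0, 1.0, 2.0, 4.0],
    "education": [2, 5, 7, 7],  # encoded categorical column, not standardized
})

numeric_cols = ["age", "campaign"]  # assumed list of numerical columns
scaler = StandardScaler()
# Each listed column is rescaled to zero mean and unit variance
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df)
```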

Feature Extraction

To identify the significant features, we can use the correlation matrix of the attributes (referred to here as the significance matrix).

plt.figure(figsize = (18,10))
sns.heatmap(X3.corr(), annot = True);

Significance Matrix:

Looking at the correlations between attributes, I did not see any pair of attributes that is very highly correlated. Looking at the significance of each attribute, some attributes have low significance.

I decided to drop 4 attributes: day_of_week, job, housing, and loan. I tried dropping a few different combinations of attributes, but dropping those 4 gave a higher precision than the other combinations. The recall and F1 score did not show much difference.

SMOTE

I first trained the SVM on the imbalanced data, but the overall performance was poor. After applying SMOTE the performance did not increase dramatically, but it was better than without SMOTE.

Next, I separate the target variable, split the dataset, and then handle the class imbalance by applying SMOTE to the training split.

First, I separated the target column ‘y’.

y_true = X4['y']
X_dataset = X4.drop('y', axis=1)

Then the dataset is split into training and test sets in a 0.8 : 0.2 ratio, and SMOTE is applied only to the training data obtained from the split.

When we visualize the values of the target variable 'y' in the training data, we can see that the class imbalance problem is resolved.

Class frequencies in y after handling class imbalance

X_train = smoted_X
y_train = smoted_y

smoted_X is our X_train data and smoted_y is our y_train data. SMOTE was not applied to X_test and y_test.
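The split-then-oversample order described above can be sketched with NumPy alone. Note the assumptions: smote_sketch is a simplified, hand-written version of the SMOTE interpolation idea, not the library implementation (in practice a library such as imbalanced-learn would normally be used), and the data is a random toy set. Only the names X_dataset, y_true, smoted_X, and smoted_y follow the article.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic rows, each on the segment between a minority
    row and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise squared Euclidean distances inside the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))    # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy imbalanced data standing in for X_dataset / y_true (90% class 0)
rng = np.random.default_rng(42)
X_dataset = rng.normal(size=(200, 4))
y_true = np.array([0] * 180 + [1] * 20)

# Stratified 0.8 : 0.2 split: take 80% of each class for training
train_mask = np.zeros(len(y_true), dtype=bool)
for label in np.unique(y_true):
    idx = rng.permutation(np.flatnonzero(y_true == label))
    train_mask[idx[: int(0.8 * len(idx))]] = True

X_train, y_train = X_dataset[train_mask], y_true[train_mask]
X_test, y_test = X_dataset[~train_mask], y_true[~train_mask]  # left untouched

# Oversample the minority class in the training data only
minority = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
synthetic = smote_sketch(minority, n_new)
smoted_X = np.vstack([X_train, synthetic])
smoted_y = np.concatenate([y_train, np.ones(n_new, dtype=int)])

print("before:", np.bincount(y_train), "after:", np.bincount(smoted_y))
```

Splitting before oversampling matters: if synthetic rows were created first, interpolated copies of test observations could end up in the training data and inflate the test score.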

Principal Component Analysis

We keep the dimensions that carry the most variance. The 9 highest-variance components together explain 96% of the variance, so I set n_components to 9.

pca.explained_variance_ratio_

Output:

pca.explained_variance_ratio_
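Choosing the number of components from the cumulative explained variance can be sketched as follows. The random toy data and the variable names other than pca are assumptions for illustration, so the number of components found here will not match the 9 reported above. PCA is fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data with decreasing variance per column (assumed, for illustration only)
rng = np.random.default_rng(0)
scales = np.linspace(3.0, 0.2, 15)
X_train = rng.normal(size=(200, 15)) * scales
X_test = rng.normal(size=(50, 15)) * scales

# Fit a full PCA first to inspect how much variance each component explains
full_pca = PCA().fit(X_train)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 96%
n_components = int(np.searchsorted(cumulative, 0.96) + 1)

pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train)  # fit on the training data only
X_test_pca = pca.transform(X_test)        # reuse the same projection for the test data

print(n_components, pca.explained_variance_ratio_.sum())
```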

SVM with Kernel

RBF kernel is used since it works well in practice and it is relatively easy to tune.

Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy. In other words C behaves as a regularization parameter in the SVM.

I used C=1 and gamma=0.05. These values gave better predictions than the other combinations of C and gamma that I tried.
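A minimal sketch of training and evaluating an RBF-kernel SVM with these parameter values is given here, using scikit-learn's SVC. The two-blob toy data, the helper make_blobs_sketch, and the name svm_model are assumptions for illustration; the scores it prints say nothing about the bank marketing data. It also includes the evaluation calls (confusion matrix, classification report, training and testing scores) that this section goes on to discuss.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_blobs_sketch(n_per_class):
    """Two Gaussian blobs centred at (-2, -2) and (2, 2), one per class."""
    X = np.vstack([rng.normal(-2.0, 1.0, (n_per_class, 2)),
                   rng.normal(2.0, 1.0, (n_per_class, 2))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = make_blobs_sketch(100)
X_test, y_test = make_blobs_sketch(40)

# RBF kernel with the C and gamma values quoted in the article
svm_model = SVC(kernel="rbf", C=1, gamma=0.05)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print(classification_report(y_test, y_pred))
print("Training set score:", svm_model.score(X_train, y_train))
print("Testing set score:", svm_model.score(X_test, y_test))
```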

A confusion matrix is created to evaluate the performance of the classification model.

Output:

True positives — 560
True negatives — 6500
False positives — 830
False negatives — 380

The classification performance on the positive class is low: there are more false positives than true positives. It may be due to the class imbalance of the original dataset.

A classification report is used to measure the quality of the predictions of a classification model.

The output is shown below.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.
The support is the number of occurrences of each class in y_true.
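As a worked example, the definitions above can be applied to the confusion-matrix counts quoted earlier in this section. This is only arithmetic on those four numbers, for the positive class; it is not a reproduction of the classification report, and the variable names are illustrative.

```python
# Confusion-matrix counts quoted in the article
tp, tn, fp, fn = 560, 6500, 830, 380

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

beta = 1.0                  # beta == 1 weights precision and recall equally
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 3))  # 0.403
print(round(recall, 3))     # 0.596
print(round(f_beta, 3))     # 0.481
print(round(accuracy, 3))   # 0.854
```

The accuracy is much higher than the precision or recall of the positive class because the 6500 true negatives dominate the total.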

Precision, recall, and F1 score for the test data are obtained as below.

Results are shown below.

We can also get training set score and testing set score as below.

The result is shown below.

Training set and Testing set scores

Here the test set score is high, but this is due to the class imbalance in the test data: since the number of true negatives is very high, the score is high as well.

Conclusion

In this article, I gave you a step-by-step guide on data preprocessing, feature encoding, transformations, standardization, feature extraction, handling class imbalance, dimensionality reduction and finally building a support vector machine with kernels.

I hope this article helped you to gain a good understanding of the important steps we should follow to build a binary classification model using SVM.

Written by Manusha Priyanjalee