Modeling Credit Risk
May 2021

This analysis was a technical exercise I completed as part of a job application.

It’s a brief demonstration of some of my statistics and programming skills and was written in a Jupyter Notebook. This blog is generated with the Pelican blogging framework, which makes it super easy to convert a notebook into a blog post.

Data Exploration Exercise

Using whichever methods and libraries you prefer, create a notebook with the following:

  1. Data preparation and data exploration
  2. Identify the three most significant data features which drive the credit risk
  3. Modeling the credit risk
  4. Model validation and evaluation using the methods that you find correct for the problem

Your solution should have instructions and be self-contained. For instance, if your choice is a Python notebook, your notebook should install all the required dependencies to run it.

Import and preparation

In [1]:
# display more than 1 output per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [2]:
%%capture 
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install pandas numpy scikit-learn matplotlib seaborn statsmodels
In [3]:
import pandas as pd
pd.options.display.max_columns = None # show all columns in a pandas dataframe
In [4]:
data = pd.read_csv('credit-g.csv')
data.head()
Out[4]:
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 <0 6 critical/other existing credit radio/tv 1169 no known savings >=7 4 male single none 4 real estate 67 none own 2 skilled 1 yes yes good
1 0<=X<200 48 existing paid radio/tv 5951 <100 1<=X<4 2 female div/dep/mar none 2 real estate 22 none own 1 skilled 1 none yes bad
2 no checking 12 critical/other existing credit education 2096 <100 4<=X<7 2 male single none 3 real estate 49 none own 1 unskilled resident 2 none yes good
3 <0 42 existing paid furniture/equipment 7882 <100 4<=X<7 2 male single guarantor 4 life insurance 45 none for free 1 skilled 2 none yes good
4 <0 24 delayed previously new car 4870 <100 1<=X<4 3 male single none 4 no known property 53 none for free 2 skilled 2 none yes bad

Data Exploration

In [5]:
### overview of the dataset
data.shape
data.columns
data.nunique() # unique values per column
print(f'class: {len(data[data["class"]=="good"])} "good" rows')
print(f'class: {len(data[data["class"]=="bad"])} "bad" rows')
Out[5]:
(1000, 21)
Out[5]:
Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', ' other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       ' foreign_worker', 'class'],
      dtype='object')
Out[5]:
checking_status             4
duration                   33
credit_history              5
purpose                    10
credit_amount             921
savings_status              5
employment                  5
installment_commitment      4
personal_status             4
other_parties               3
residence_since             4
property_magnitude          4
age                        53
 other_payment_plans        3
housing                     3
existing_credits            4
job                         4
num_dependents              2
own_telephone               2
 foreign_worker             2
class                       2
dtype: int64
class: 700 "good" rows
class: 300 "bad" rows

In order to visually inspect the data, the categorical features need to be converted from the string datatype to pandas’ category datatype. The same conversion is needed to train the model, so the data is formatted before any further exploration.

Outcome distribution

Ideally there would be an approximately equal number of examples of each outcome class. In this dataset the split is 30% “bad” and 70% “good”. The low number of bad credit risk assessments may limit the model’s ability to accurately predict a bad credit assessment relative to good assessments, because of the limited training examples. This could result in more false positives (bad applicants classified as good) than would typically be expected.
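A common way to compensate for this imbalance is to keep the 70/30 class proportions fixed when splitting the data and to up-weight the rarer “bad” class during training. Below is a minimal illustrative sketch using scikit-learn’s LogisticRegression; it assumes the features have already been numerically encoded (as done in the formatting step that follows) and is not necessarily the model used later in the analysis.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# stratify=y keeps the 70% good / 30% bad ratio identical in both splits;
# class_weight='balanced' makes misclassifying the rarer "bad" class costlier.
X = data.drop(columns=['class'])
y = data['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)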

Data formatting

In [6]:
# remove whitespace around column names
data.columns = [col.strip() for col in data.columns]
In [7]:
# Categorical variables have a limited and usually fixed number of possible values.
# Categorical data might have an order (e.g. 'strongly agree', 'agree', 'disagree', 'strongly disagree')

unordered_category_cols = [
    'credit_history',
    'purpose',
    'personal_status',
    'other_parties',
    'property_magnitude',
    'other_payment_plans',
    'housing',
    'job',
    'own_telephone',
    'foreign_worker',
    'class'
]

for col in unordered_category_cols:
    data[col] = data[col].astype('category')


ordered_category_cols = [
    ('checking_status', ["no checking", "<0", "0<=X<200", ">=200"], True),
    ('savings_status',['no known savings', '<100', '100<=X<500', '500<=X<1000', '>=1000'], True ),
    ('employment', ['unemployed', '<1', '1<=X<4', '4<=X<7', '>=7'], True),
]

for name, categories, ordered in ordered_category_cols:
    data[name] = pd.Categorical(data[name], categories=categories, ordered=ordered)
In [8]:
# convert categories to numerical values, for SelectKBest
cat_columns = data.select_dtypes(['category']).columns
data[cat_columns] = data[cat_columns].apply(lambda x: x.cat.codes)
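Note that replacing the values with cat.codes discards the original string labels. If the code-to-label mapping is needed later (e.g. to interpret model output), it can be captured just before running the conversion above; a minimal sketch, where code_maps is an illustrative name not used elsewhere in the notebook:

# run this *before* the cat.codes replacement above:
# build a {integer code: original label} lookup for every categorical column
code_maps = {
    col: dict(enumerate(data[col].cat.categories))
    for col in data.select_dtypes(['category']).columns
}
# e.g. code_maps['class'] should come out as {0: 'bad', 1: 'good'}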
In [9]:
# all columns are now either categorical (ordered or unordered) and encoded as ints, or numerical.
data.dtypes
Out[9]:
checking_status            int8
duration                  int64
credit_history             int8
purpose                    int8
credit_amount             int64
savings_status             int8
employment                 int8
installment_commitment    int64
personal_status            int8
other_parties              int8
residence_since           int64
property_magnitude         int8
age                       int64
other_payment_plans        int8
housing                    int8
existing_credits          int64
job                        int8
num_dependents            int64
own_telephone              int8
foreign_worker             int8
class                      int8
dtype: object
In [10]:
# this will take a while..
import seaborn as sns

# create the default pairplot, coloured by the outcome class
pairplot = sns.pairplot(
    data, 
    hue="class",
    diag_kind = 'kde',
    plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
    height = 3
)
fig = pairplot.fig
fig.savefig('pairplot.png', dpi=200) # default dpi is 100