Course Code BIT405
Semester Fall 2013
Instructor’s Name Dr. Bilel ELAYEB

Scope and Focus:
- Classification Methods
- The Naive Rule
- Naive Bayes
- k-Nearest Neighbors

Contributing to the following CLOs:
CLO #1 Describe the role of Business Intelligence in an organization.

CLO #2 Understand the data mining process and its related issues.

CLO #3 Create, evaluate and apply different intelligence models.

Question        1    2    Total
Points          8    7    15
Student Mark

Note: This Assignment accounts for 15% of the student’s final grade.

Exercise 1: Naïve Bayes (8 marks)

Personal Loan Acceptance. The file “UniversalBank.xls” contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise we focus on two predictors, Online (whether or not the customer is an active user of online banking services) and Credit Card (abbreviated CC below; whether or not the customer holds a credit card issued by the bank), and on the outcome Personal Loan (abbreviated Loan below).
Partition the data into training (60%) and validation (40%) sets.
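As an illustration only, the 60%/40% partition could be sketched in Python as a random shuffle followed by a cut. The function name `partition`, the seed, and the placeholder customer IDs are assumptions made for this sketch; the assignment itself uses XLMiner, whose exact partitioning procedure is not described here.

```python
import random

def partition(records, train_fraction=0.6, seed=1):
    """Shuffle the records and split them into training and validation lists."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder IDs standing in for the 5000 customer rows (hypothetical data).
customer_ids = range(1, 5001)
train, valid = partition(customer_ids)
print(len(train), len(valid))  # 3000 2000
```

Every record lands in exactly one of the two sets, so the counts in any pivot table built from the training set sum to 3000.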
Table 1 gives the pivot table for the training data with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the cells should convey the count (how many records are in that cell).
Table 1: The pivot table for the training data

1) Consider the task of classifying a customer that owns a bank credit card and is actively using online banking services. Looking at the pivot table, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan=1) conditional on having a bank credit card (CC=1) and being an active user of online banking services (Online=1)).

……………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
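The arithmetic behind question 1) is a single ratio taken from one cell group of the crossed pivot table: acceptors in the (CC=1, Online=1) cell divided by all customers in that cell. A minimal sketch follows; the function name `conditional_acceptance` and the counts are hypothetical placeholders and are not the values of Table 1.

```python
def conditional_acceptance(accepted, rejected):
    """P(Loan=1 | cell) = acceptors in the cell / all customers in the cell."""
    total = accepted + rejected
    if total == 0:
        raise ValueError("cell is empty; the probability is undefined")
    return accepted / total

# Hypothetical counts for the CC=1, Online=1 cell (NOT taken from Table 1).
p = conditional_acceptance(accepted=50, rejected=450)
print(p)  # 0.1
```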

2) Consider the two separate pivot tables for the training data: Table 2 has Loan (rows) as a function of Online (columns), and Table 3 has Loan (rows) as a function of CC (columns).

Table 2: Pivot table for the training data: Loan (rows) as a function of Online (columns)

Table 3: Pivot table for the training data: Loan (rows) as a function of CC (columns)

Compute the following quantities (P(A|B) means “the probability of A given B”):

Answer: (4.5 marks, 0.75 for each)

i. P(CC = 1|Loan = 1) (the proportion of credit card holders among the loan acceptors)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
ii. P(Online = 1|Loan = 1)
…………………………………………………………………………………………………………………………………………………………………………………………………………………………………
iii. P(Loan = 1) (the proportion of loan acceptors)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………

iv. P(CC = 1|Loan = 0)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………

v. P(Online = 1|Loan = 0)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
vi. P(Loan = 0)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
3) Use the quantities computed above to compute the naive Bayes probability:

Answer: (1.5 marks, 0.5 for each)

P(Loan = 1|CC = 1, Online = 1)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
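For reference, the naive Bayes probability in question 3) combines the six quantities (i) to (vi) by treating CC and Online as independent within each Loan class. The sketch below shows that combination; the function and parameter names are chosen for this illustration, and the numbers passed in are placeholders, not the values derived from Tables 2 and 3.

```python
def naive_bayes_posterior(p_cc_given_1, p_online_given_1, p_1,
                          p_cc_given_0, p_online_given_0, p_0):
    """P(Loan=1 | CC=1, Online=1) under the naive independence assumption."""
    # Numerator: P(CC=1|Loan=1) * P(Online=1|Loan=1) * P(Loan=1)
    accept = p_cc_given_1 * p_online_given_1 * p_1
    # Same product for the other class: Loan=0
    reject = p_cc_given_0 * p_online_given_0 * p_0
    # Normalise so the two class probabilities sum to 1
    return accept / (accept + reject)

# Placeholder inputs (NOT the answers to i-vi).
posterior = naive_bayes_posterior(0.3, 0.6, 0.1, 0.3, 0.6, 0.9)
print(round(posterior, 4))  # 0.1
```

In this placeholder case the conditional probabilities are identical in both classes, so the predictors carry no information and the result equals the prior P(Loan = 1).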

4) Compare this value with the one obtained from the crossed pivot table in question 1). Which is a more accurate estimate?

……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………

Exercise 2: k-Nearest Neighbors (7 marks)

Personal Loan Acceptance. Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal of our analysis is to model the previous campaign’s customer behavior to analyze what combination of factors makes a customer more likely to accept a personal loan. This will serve as the basis for the design of a new campaign.
The file “UniversalBank.xls” contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Partition the data into training (60%) and validation (40%) sets.

1). Perform a k-NN classification with all predictors except ID and ZIP code using k = 1.
Remember to transform categorical predictors with more than two categories into dummy variables first. Specify the success class as 1 (loan acceptance), and use the default cutoff value of 0.5. How would this customer be classified?
To classify the customer, refer to Tables 4 and 5 below, taken from the XLMiner output.

Classes in Input Data set
# Classes 2
Class 1 (Success) 1
Class 2 0
Table 4: Classes in Input Data set

Prior class probabilities

Class Prob.
1 0.0953333
0 0.9046667
Table 5: Prior class probabilities
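The k-NN procedure described in question 1) (dummy variables for multi-category predictors, success class 1, cutoff 0.5) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not XLMiner's implementation: the helper names `dummies` and `knn_classify`, the toy records, the use of Euclidean distance, the assumption that numeric predictors have already been rescaled to comparable ranges, and the choice to classify as 1 when the neighbour share is at or above the cutoff are all choices made for this sketch.

```python
import math

def dummies(value, categories):
    """Turn one categorical value into a list of 0/1 indicator variables."""
    return [1.0 if value == c else 0.0 for c in categories]

def knn_classify(train_x, train_y, new_x, k=1, cutoff=0.5):
    """Classify new_x as 1 if the share of class-1 neighbours is >= cutoff."""
    # Pair each training record's Euclidean distance with its class label.
    distances = sorted(
        (math.dist(x, new_x), y) for x, y in zip(train_x, train_y)
    )
    nearest = [y for _, y in distances[:k]]
    share_of_success = sum(nearest) / len(nearest)
    return 1 if share_of_success >= cutoff else 0

# Toy records (hypothetical): one rescaled numeric predictor plus a
# three-level categorical predictor expanded into dummy variables.
levels = [1, 2, 3]
train_x = [
    [0.2] + dummies(1, levels),
    [0.9] + dummies(3, levels),
    [0.8] + dummies(3, levels),
]
train_y = [0, 1, 1]

new_customer = [0.85] + dummies(3, levels)
print(knn_classify(train_x, train_y, new_customer, k=1))  # 1
```

With k = 1 the prediction is simply the class of the single closest training record, which is why this setting tends to follow noise in the training data.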

……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………

2). Given Table 6 from the XLMiner output, which choice of k balances overfitting against ignoring the predictor information?

Table 6: Prior class probabilities
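One common way to approach question 2) is to pick the k with the lowest error on the validation set, since very small k tends to overfit and very large k approaches the naive rule. The sketch below assumes, without confirmation from the text, that the output lists a validation error for each candidate k; the function name `choose_k`, the tie rule, and the error values are hypothetical.

```python
def choose_k(validation_error_by_k):
    """Return the k with the lowest validation error; ties go to the smaller k."""
    return min(validation_error_by_k,
               key=lambda k: (validation_error_by_k[k], k))

# Hypothetical validation errors in percent (NOT the values of Table 6).
errors = {1: 5.0, 3: 4.2, 5: 4.6, 7: 4.2}
print(choose_k(errors))  # 3
```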
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………
3). Given the classification confusion matrix for the validation data that results from using the best k (Table 7), its corresponding error report (Table 8), and the XLMiner output in Table 9, classify the customer using the best k.

Table 7: Validation Data scoring – Summary Report (for k=9)
Table 8: Error report (for k = 9)

Table 9: XLMiner Output

……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………

4). Repartition the data, this time into training (50%), validation (30%) and test (20%) sets. Tables 10, 11 and 12 give the output of XLMiner for the k-NN method with the k chosen above.
Compare the classification matrix of the test set with that of the training and validation sets. Comment on the differences and their reason.
Table 10: Training Data scoring – Summary Report (for k=9)
Table 11: Validation Data scoring – Summary Report (for k=9)
Table 12: Test Data scoring – Summary Report (for k=9)
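When comparing the three classification matrices, it can help to reduce each one to the same few error rates. The following sketch computes the overall error and the per-class errors from a 2x2 confusion matrix; the function name `error_report`, its argument names, and the counts are hypothetical and are not taken from Tables 10 to 12.

```python
def error_report(tp, fn, fp, tn):
    """Summarise a 2x2 confusion matrix (success class = 1).

    tp: actual 1 predicted 1    fn: actual 1 predicted 0
    fp: actual 0 predicted 1    tn: actual 0 predicted 0
    """
    total = tp + fn + fp + tn
    return {
        "overall_error": (fn + fp) / total,
        "class_1_error": fn / (tp + fn),  # acceptors that were missed
        "class_0_error": fp / (fp + tn),  # non-acceptors flagged as acceptors
    }

# Hypothetical counts for one partition (NOT from the XLMiner tables).
report = error_report(tp=60, fn=40, fp=10, tn=890)
print(report["overall_error"])  # 0.05
```

Applying the same function to the training, validation and test matrices puts the three partitions on a common scale, which makes it easier to see whether the error rises on data the model has not seen.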

Answer: (2.5 marks, 1.25 marks for each)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………