---
# <div align="center"><font color='green'> COSC 2673/2793 | Machine Learning  </font></div>
## <div align="center"> <font color='green'> **Example: Week08 Lecture QandA**</font></div>
---

# Feature Selection Demo

**Disclaimer: To simplify the example, this code assumes the following without any investigation:**

> A polynomial classifier with degree 3 is assumed to be appropriate.

> A hold-out test set is used instead of cross-validation, even though this might not be a good strategy.

> The objective is to demonstrate how the techniques are used, not to come up with the best model.


First, let's load the libraries needed.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's now load the dataset.
> Dataset 1 is a modified version of the `PIMA Indians diabetes dataset` (Missing values resolved). The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. 

In [None]:
# Dataset 1
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiaPedigreeFunction', 'Age', 'Class']
att_names = names[:-1]
dataframe = pd.read_csv(url, names=names)
s_features = len(att_names)  # number of input attributes (8)
dataframe.head(5)  # kept last so the notebook displays the preview

Let's split the data into the attributes and the target variable.

In [None]:
Y = dataframe['Class']
X = dataframe.drop(['Class'], axis = 1)
X.describe()

Let's first hold out some data for model evaluation, using an 80-20 split.

In [None]:
from sklearn.model_selection import train_test_split
TrainX, TestX, TrainY, TestY = train_test_split(X, Y, test_size=0.2, random_state=0)

Let's write helper functions for the feature transformation and the model fitting.

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures

def get_trandformed_features(TrainX, TestX):
    # Degree-3 polynomial expansion, fitted on the training data only
    poly = PolynomialFeatures(3).fit(TrainX)
    TrainX_poly = poly.transform(TrainX)
    TestX_poly = poly.transform(TestX)

    # Scale each expanded feature to [0, 1] using the training-set range
    scaler = preprocessing.MinMaxScaler().fit(TrainX_poly)
    TrainX_poly = scaler.transform(TrainX_poly)
    TestX_poly = scaler.transform(TestX_poly)

    return TrainX_poly, TestX_poly
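
It helps to know how many columns the degree-3 expansion produces, because every later method (RFE, the L1 penalty, PCA) works on that expanded matrix rather than on the 8 original attributes. The following is a minimal sketch on a synthetic stand-in with 8 columns; the names `X_demo` and `poly_demo` are illustrative and not part of the notebook.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in with 8 columns, matching the number of attributes in the dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 8))

poly_demo = PolynomialFeatures(3).fit(X_demo)
n_poly_features = poly_demo.n_output_features_

# Closed form: C(n + d, d) monomials of total degree <= d in n variables (bias term included)
expected = comb(8 + 3, 3)
print(n_poly_features, expected)
```

With 8 inputs and degree 3 the expansion has 165 columns, so with a few hundred training rows the unregularized model has a lot of freedom to overfit.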

In [None]:
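from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Expand and scale the features, fit logistic regression, and predict on the test set
def fit_model_predict_test(TrainX, TrainY, TestX):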
    logReg = LogisticRegression(C=100000, max_iter=100, solver='liblinear', class_weight='balanced', random_state=0)
    
    TrainX_poly, TestX_poly = get_trandformed_features(TrainX, TestX)
    
    logReg.fit(TrainX_poly, TrainY)
    pred = logReg.predict(TestX_poly)
    
    return pred

Now let's train a simple logistic regression model that uses all the features, with effectively no regularization (C is set to a high value).

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score


pred = fit_model_predict_test(TrainX, TrainY, TestX)

print("Test results for logistic regression with no feature selection")
print("\tF1 Score: ", f1_score(TestY, pred, average='macro'))
print("\tAccuracy: ", accuracy_score (TestY, pred))

## Filter methods - Example
Let's first try a filter method in scikit-learn. For this we plan to use the mutual information measure to establish the "best" features. The number of features to pick is set to 5.

Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
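
As a quick illustration of this property, the sketch below scores two synthetic features with `mutual_info_classif`: one that depends on the class label and one that is independent noise. The data and the names `informative`, `noise` and `mi_scores` are made up for this example. The estimate is computed from nearest neighbours, so the independent feature scores close to zero rather than exactly zero.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)

# Feature that depends on the class: its mean shifts with the label
informative = y_demo + rng.normal(scale=0.5, size=n)
# Feature drawn independently of the class
noise = rng.normal(size=n)

X_mi = np.column_stack([informative, noise])
mi_scores = mutual_info_classif(X_mi, y_demo, random_state=0)
print(mi_scores)
```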

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif


featureSelector = SelectKBest(score_func=mutual_info_classif, k=5).fit(TrainX, TrainY)

plt.figure(figsize=(8,8))
scores = featureSelector.scores_
plt.xticks(rotation='vertical')
plt.barh(att_names, scores)

# get the selected feature vectors
TrainX_new = featureSelector.transform(TrainX)
TestX_new = featureSelector.transform(TestX)

Let's now see how logistic regression works on the selected features.

In [None]:
# fit_model_predict_test builds its own LogisticRegression, so no model is created here

pred = fit_model_predict_test(TrainX_new, TrainY, TestX_new)

print("Test results for logistic regression with filtered feature selection")
print("\tF1 Score: ", f1_score(TestY, pred, average='macro'))
print("\tAccuracy: ", accuracy_score (TestY, pred))

How will performance change with the hyper-parameter k? Would you use cross-validation?

In [None]:
k_all = np.arange(1,s_features+1)
f1_hold = []
acc_hold = []
for k in k_all:
    # Note: this loop scores features with the ANOVA F-test (f_classif), not mutual information
    featureSelector = SelectKBest(score_func=f_classif, k=k).fit(TrainX, TrainY)
    TrainX_new = featureSelector.transform(TrainX)
    TestX_new = featureSelector.transform(TestX)
    
    # Should consider CV
    pred = fit_model_predict_test(TrainX_new, TrainY, TestX_new)
    
    f1_hold.append(f1_score(TestY, pred, average='macro'))
    acc_hold.append(accuracy_score (TestY, pred))

plt.plot(k_all, f1_hold)
plt.plot(k_all, acc_hold)
plt.legend(['F1-score','Accuracy'])
plt.xlabel('Number of features selected')
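
One way to answer the cross-validation question is to put the selector and the model in a single `Pipeline` and let `GridSearchCV` choose k, so that the test set is never used for tuning. The sketch below is a hedged example on synthetic data from `make_classification`; in the notebook you would pass `TrainX` and `TrainY` instead. The step names, the 5-fold setting and the default C are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the training split (8 attributes, binary target)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, n_informative=4,
                                 n_redundant=2, random_state=0)

# Selection, expansion and scaling are refitted inside every fold,
# so no information from the validation fold leaks into them
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("poly", PolynomialFeatures(3)),
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced", random_state=0)),
])

search = GridSearchCV(pipe, param_grid={"select__k": list(range(1, 9))},
                      scoring="f1_macro", cv=5)
search.fit(X_cv, y_cv)

best_k = search.best_params_["select__k"]
print(best_k, search.best_score_)
```

The held-out test set is then used once, at the end, to evaluate `search.best_estimator_`.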

## Wrapper methods - Example
Next, let's implement `Recursive Feature Elimination`, which is a type of wrapper feature selection method.

In [None]:
from sklearn.feature_selection import RFE

TrainX_poly, TestX_poly = get_trandformed_features(TrainX, TestX)

model = LogisticRegression(C=100000, max_iter=100, solver='liblinear', class_weight='balanced', random_state=0)

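# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=20).fit(TrainX_poly, TrainY)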
print("Num Features: %s" % (rfe.n_features_))
sel_inx = np.ix_(rfe.support_)[0].tolist()
# print("Selected Features: %s" % [att_names[i] for i in sel_inx])

# get the selected feature vectors
TrainX_new = rfe.transform(TrainX_poly)
TestX_new = rfe.transform(TestX_poly)

Let's now see how logistic regression works on the selected features.

In [None]:
logReg = LogisticRegression(C=100000, max_iter=100, solver='liblinear', class_weight='balanced', random_state=0)
logReg.fit(TrainX_new, TrainY)
pred = logReg.predict(TestX_new)

print("Test results for logistic regression with Recursive Feature Elimination")
print("\tF1 Score: ", f1_score(TestY, pred, average='macro'))
print("\tAccuracy: ", accuracy_score (TestY, pred))

## Embedded methods - Example
Well, let's use L1 penalization, also known as the Lasso penalty. Cross-validation should be used to set C.

In [None]:
TrainX_poly, TestX_poly = get_trandformed_features(TrainX, TestX)

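# penalty='l1' is required here: the default penalty is 'l2', which does not zero out coefficients
logReg = LogisticRegression(penalty='l1', C=1, max_iter=100, solver='liblinear', class_weight='balanced', random_state=0)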
logReg.fit(TrainX_poly, TrainY)
pred = logReg.predict(TestX_poly)


print("Test results for logistic regression with l1 penalty")
print("\tF1 Score: ", f1_score(TestY, pred, average='macro'))
print("\tAccuracy: ", accuracy_score (TestY, pred))
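
The reason the L1 penalty counts as an embedded feature selection method is that it drives many coefficients to exactly zero, whereas the L2 penalty only shrinks them. The sketch below compares the number of non-zero coefficients under both penalties on synthetic data; the data and the name `nonzero_counts` are illustrative, and the exact counts will differ on the diabetes dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in, expanded and scaled the same way as in the notebook
X_l1, y_l1 = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_l1_poly = MinMaxScaler().fit_transform(PolynomialFeatures(3).fit_transform(X_l1))

nonzero_counts = {}
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, C=1, solver="liblinear", random_state=0)
    clf.fit(X_l1_poly, y_l1)
    # Features whose coefficient is exactly zero are effectively removed from the model
    nonzero_counts[penalty] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

Smaller values of C strengthen the penalty and remove more features, which is why C should be tuned with cross-validation.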

In [None]:
# plt.figure(figsize=(8,8))
# coef = pd.Series(np.squeeze(logReg.coef_), index = att_names)
# imp_coef = coef.sort_values()
# imp_coef.plot(kind = "barh")
# plt.title("Feature importance using Lasso Linear Model")
# plt.show()

## Tree based feature selection - Feature Importance

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# random_state is fixed so that the importances are reproducible between runs
tree_clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(TrainX, TrainY)
scores = tree_clf.feature_importances_  
plt.figure(figsize=(8,8))
plt.xticks(rotation='vertical')
plt.barh(att_names, scores)

model = SelectFromModel(tree_clf, prefit=True)
TrainX_new = model.transform(TrainX)
TestX_new = model.transform(TestX)
print("Num Features: %s" % (TrainX_new.shape[1]))

tree_clf = ExtraTreesClassifier(n_estimators=100).fit(TrainX_new, TrainY)
pred = tree_clf.predict(TestX_new)

print("Test results for Extra-Trees classifier with tree feature selection")
print("\tF1 Score: ", f1_score(TestY, pred, average='macro'))
print("\tAccuracy: ", accuracy_score (TestY, pred))
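
The cell above does not say how many features `SelectFromModel` keeps. When no `threshold` is given and the estimator has no L1 penalty, scikit-learn keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data; the names `forest`, `selector` and `manual_mask` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X_tree, y_tree = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, random_state=0)
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tree, y_tree)

# Default threshold: mean of the feature importances
selector = SelectFromModel(forest, prefit=True)

importances = forest.feature_importances_
manual_mask = importances >= importances.mean()

print(selector.get_support())
print(manual_mask)
```

Passing `threshold="median"` or a numeric value to `SelectFromModel` changes how aggressive the selection is.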

# Feature construction with PCA

In [None]:
from sklearn.decomposition import PCA

TrainX_poly, TestX_poly = get_trandformed_features(TrainX, TestX)


components = 10
pca = PCA(n_components=components).fit(TrainX_poly)
TrainX_new = pca.transform(TrainX_poly)
TestX_new = pca.transform(TestX_poly)

logReg = LogisticRegression(C=10000, max_iter=100, solver='liblinear', class_weight='balanced')
logReg.fit(TrainX_new, TrainY)
pred = logReg.predict(TestX_new)

print("Test results for logistic regression with PCA feature construction")
print("\tF1 Score: ", f1_score(TestY, pred, average='macro'))
print("\tAccuracy: ", accuracy_score (TestY, pred))

In [None]:
print(pca.explained_variance_ratio_)
plt.bar(np.arange(1,components+1), pca.explained_variance_ratio_)
plt.show()
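
The number of components (10) is fixed by hand above. A common alternative is to pick the smallest number of components whose cumulative explained variance reaches a target fraction. The sketch below does this for a 95% target on synthetic data; the target and the variable names are assumptions for illustration. Passing a float as `n_components` asks scikit-learn to apply the same rule itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in, expanded and scaled the same way as in the notebook
X_pca, _ = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_pca_poly = MinMaxScaler().fit_transform(PolynomialFeatures(3).fit_transform(X_pca))

# Fit all components and accumulate the explained variance ratios
pca_full = PCA(svd_solver="full").fit(X_pca_poly)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95
n_for_95 = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Equivalent shortcut: a float n_components is interpreted as a variance target
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_pca_poly)
print(n_for_95, pca_95.n_components_)
```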