Personalized Cancer Diagnosis

Problem statement : Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

Real-world/Business objectives and constraints :

No low-latency requirement.
Interpretability is important.
Errors can be very costly.
Probability of a data-point belonging to each class is needed

Machine Learning Problem Formulation :

DATA OVERVIEW :

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
We have two data files: one contains the information about the genetic mutations and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
Both these data files are have a common column called ID
Data file's information:

training_variants (ID , Gene, Variations, Class)
training_text (ID, Text)

On the basis of Gene, Variation and Text, we will classify it in 9 classes in which some classes will be for cancer patients.

So this problem can be taken as multi class classification.

Performance Metric :

Multi class log-loss
Confusion matrix

Multi class Log-loss : This performance metric is good in multi class classification. It changes significantly even with the slightest change in predicted value with respect to original value. Log-loss varies from 0 to infinite.

Confusion Matrix : This is also good for multi class classification. This metric is good for visualization.

Machine Learning Objectives and Constraints :

Objective: Predict the probability of each data-point belonging to each of the nine

classes.

 Constraints:

* Interpretability

* Class probabilities are needed.

* Penalize the errors in class probabilites => Metric is Log-loss.

* No Latency constraints

Train, CV and Test Datasets

Split the dataset randomly into three parts train, cross validation and test with 64%,16%, 20% of data 

respectively.

Exploratory Data Analysis

We will be importing bunch of libraries in order to operate on the dataset.

import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics.classification import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from imblearn.over_sampling import SMOTE
from collections import Counter
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold 
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import math
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")
from mlxtend.classifier import StackingClassifier
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

Reading Data

1. Reading Gene/variants data

              data = pd.read_csv('training_variants')

        print('Number of data points : ', data.shape[0])

        print('Number of features : ', data.shape[1])

        print('Features : ', data.columns.values)

You will get the total number of data point (3321) and total number of 

features in this dataset (4 features).

Four features are :

--> ID :the id of the row used to link the mutation to the clinical evidence

--> Gene : the gene where this genetic mutation is located

--> Variation : the aminoacid change for this mutations

--> Class :  1-9 the class this genetic mutation has been classified on

2. Reading Text data

        # note the separator in this file ( It is 2 vertical lines )

        data_text =pd.read_csv("training/training_text",sep="\|\|",

                    engine="python",names=["ID","TEXT"],skiprows=1)

        print('Number of data points : ', data_text.shape[0])

        print('Number of features : ', data_text.shape[1])

        print('Features : ', data_text.columns.values)

You will get the total number of data point (3321) and total number of 
features in this dataset (2 features).

Two features are :

--> ID

--> Text

Preprocessing of Text

NLTK library will be used a lot while preprocessing of text data.

For each line of code, a comment has been written to explain the code.

import nltk
nltk.download('stopwords')   #all the stop words has been downloaded

# loading stop words from nltk library
stop_words = set(stopwords.words('english'))

Create a function called nlp_preprocessing
def nlp_preprocessing(total_text, index, column):
    if type(total_text) is not int:
        string = ""
        # replace every special char with space
        total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)
        # replace multiple spaces with single space
        total_text = re.sub('\s+',' ', total_text)
        # converting all the chars into lower-case.
        total_text = total_text.lower()
        
        for word in total_text.split():
        # if the word is a not a stop word then retain that word from the data
            if not word in stop_words:
                string += word + " "
        
        data_text[column][index] = string

After creating function, call it and apply on text data.

#text processing stage.
start_time = time.clock()
for index, row in data_text.iterrows():
    if type(row['TEXT']) is str:
        nlp_preprocessing(row['TEXT'], index, 'TEXT')
    else:
        print("there is no text description for id:",index)
print('Time took for preprocessing the text :',time.clock() - start_time, "seconds")

Comment in green font has explained the code.

It is required to merge gene_variations and text data based on ID
#merging both gene_variations and text data based on ID
result = pd.merge(data, data_text,on='ID', how='left')

Train Test and Cross Validation (64:20:16)

y_true = result['Class'].values
result.Gene      = result.Gene.str.replace('\s+', '_')
result.Variation = result.Variation.str.replace('\s+', '_')

# split the data into test and train by maintaining same distribution of 
output varaible 'y_true' [stratify=y_true]
X_train, test_df, y_train, y_test = train_test_split(result, y_true,
stratify=y_true, test_size=0.2)
# split the train data into train and cross validation by maintaining
 same distribution of output varaible 'y_train' [stratify=y_train]
train_df, cv_df, y_train, y_cv = train_test_split(X_train, y_train,
 stratify=y_train, test_size=0.2)

We split the data into train, test and cross validation data sets, preserving 

the ratio of class distribution in the original data set.

We will try to plot the class distribution in train, test and CV datasets.

train_class_distribution = train_df['Class'].value_counts().sort_index()

test_class_distribution = test_df['Class'].value_counts().sort_index()

cv_class_distribution = cv_df['Class'].value_counts().sort_index()

# It returns a dict, keys as class labels and values as the number of data points 
in that class

For plotting distribution of train data :

my_colors = 'rgbkymc'

train_class_distribution.plot(kind='bar')

plt.xlabel('Class')

plt.ylabel('Data points per Class')

plt.title('Distribution of yi in train data')

plt.grid()

plt.show()

# -(train_class_distribution.values): the minus sign will give us in decreasing order

sorted_yi = np.argsort(-train_class_distribution.values)

for i in sorted_yi:

    print('Number of data points in class', i+1, ':',

    train_class_distribution.values[i], '(',np.round(

    (train_class_distribution.values[i]/train_df.shape[0]*100), 3), '%)')

For plotting distribution of test data set :

my_colors = 'rgbkymc'

test_class_distribution.plot(kind='bar')

plt.xlabel('Class')

plt.ylabel('Data points per Class')

plt.title('Distribution of yi in test data')

plt.grid()

plt.show()

#-(train_class_distribution.values): the minus sign will give us in decreasing order

sorted_yi = np.argsort(-test_class_distribution.values)

for i in sorted_yi:

    print('Number of data points in class', i+1, ':',

    test_class_distribution.values[i], '(',np.round(

    (test_class_distribution.values[i]/test_df.shape[0]*100), 3), '%)')

For plotting distribution of CV data set

my_colors = 'rgbkymc'

cv_class_distribution.plot(kind='bar')

plt.xlabel('Class')

plt.ylabel('Data points per Class')

plt.title('Distribution of yi in cross validation data')

plt.grid()

plt.show()

# -(train_class_distribution.values): the minus sign will give us 

in decreasing order

sorted_yi = np.argsort(-train_class_distribution.values)

for i in sorted_yi:

    print('Number of data points in class', i+1, ':',

    cv_class_distribution.values[i], '(',np.round(

    (cv_class_distribution.values[i]/cv_df.shape[0]*100), 3), '%)')

Prediction using a 'Random' Model

In Random Model, we generate 9 class probabilities randomly so that they sum to 1.

You can create a function to plot confusion matrix to visualise log loss.

(Here we will not do visualisation. We will just find out log loss.)

--->Each Line in code is explained through a comment.

Random model and Log loss using Random model.

# we need to generate 9 numbers and the sum of numbers should be 1
# one solution is to genarate 9 numbers and divide each of the numbers by their sum
# ref: https://stackoverflow.com/a/18662466/4084039
test_data_len = test_df.shape[0]
cv_data_len = cv_df.shape[0]

# we create a output array that has exactly same size as the CV data
cv_predicted_y = np.zeros((cv_data_len,9))
for i in range(cv_data_len):
    rand_probs = np.random.rand(1,9)
    cv_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Cross Validation Data using Random Model",log_loss
    (y_cv,cv_predicted_y, eps=1e-15))

# Test-Set error.
#we create a output array that has exactly same as the test data
test_predicted_y = np.zeros((test_data_len,9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1,9)
    test_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Test Data using Random Model",log_loss(
    y_test,test_predicted_y, eps=1e-15))

OUTPUT
Log loss on Cross Validation Data using Random Model 2.509805217627685
Log loss on Test Data using Random Model 2.487222032491184

It is observed that random model gives a log loss of 2.5.So in order to
built a efficient model, Log loss has to be less than 2.5.

Univariate Analysis--->code for response coding with Laplace smoothing.
alpha : used for laplace smoothing
feature: ['gene', 'variation']
df: ['train_df', 'test_df', 'cv_df']

ALGORITHM :
-->Consider all unique values and the number of occurances of given feature in 
train data dataframe
--> build a vector (1*9) , the first element = (number of times it occured in 
class1 + 10*alpha / number of time it occurred in total data+90*alpha)
--> gv_dict is like a look up table, for every gene it store a (1*9) representation
 of it
-->for a value of feature in df,if it is in train data:
-->we add the vector that was stored in 'gv_dict' look up table to 'gv_fea'
-->if it is not there is train:
-->we add [1/9, 1/9, 1/9, 1/9,1/9, 1/9, 1/9, 1/9, 1/9] to 'gv_fea'
-->return 'gv_fea'
code :
# get_gv_fea_dict: Get Gene variation Feature Dict
def get_gv_fea_dict(alpha, feature, df):

    value_count = train_df[feature].value_counts()
    # gv_dict : Gene Variation Dict, which contains the probability 
    array for each gene/variation
    gv_dict = dict()
     # denominator will contain the number of time that particular feature 
    occurred in whole data
    for i, denominator in value_count.items():
        # vec will contain (p(yi==1/Gi) probability of gene/variation belongs 
        to particular class
        # vec is 9 dimensional vector
        vec = []
        for k in range(1,10):
            # cls_cnt.shape[0] will return the number of rows
            cls_cnt = train_df.loc[(train_df['Class']==k) &
            (train_df[feature]==i)]
             # cls_cnt.shape[0](numerator) will contain the number of time
             that particular feature occured in whole data
            vec.append((cls_cnt.shape[0] + alpha*10)/ (denominator + 90*alpha))

        # we are adding the gene/variation to the dict as key and vec as value
        gv_dict[i]=vec
    return gv_dict

# Get Gene variation feature
def get_gv_feature(alpha, feature, df):    gv_dict = 
    get_gv_fea_dict(alpha, feature, df)
    # value_count is similar in get_gv_fea_dict
    value_count = train_df[feature].value_counts()
    
    # gv_fea: Gene_variation feature, it will contain the feature for each
     feature value in the data
    gv_fea = []
    # for every feature values in the given data frame we will check if it is there
     in the train data then we will add the feature to gv_fea
    # if not we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea
    for index, row in df.iterrows():
        if row[feature] in dict(value_count).keys():
            gv_fea.append(gv_dict[row[feature]])
        else:
            gv_fea.append([1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9])
    return gv_fea

when we calculate the probability of a feature belongs to any 
particular class,we apply laplace smoothing.

Univariate Analysis on Gene Feature-->Gene is a categorical variable
--->How many categories are there and How they are distributed?????
Code-1
unique_genes = train_df['Gene'].value_counts()
print('Number of Unique Genes :', unique_genes.shape[0])
output-1
Number of Unique Genes : 233

PLOTTING CDF FOR GENE :
Code-2
s = sum(unique_genes.values);
h = unique_genes.values/s;
c = np.cumsum(h)
plt.plot(c,label='Cumulative distribution of Genes')
plt.grid()
plt.legend()
plt.show()

Output-2













Observation : First 50 Genes consists of about 77% of data. Top 50 genes occur more 
frequently.

How to featurize this Gene feature ?
We will choose the appropriate featurization based on the ML model we use.  For this 
problem of multi-class classification with categorical features or for logistic 
regression, one-hot encoding is  better while response coding is better for 
Random Forests.

Code-3
#response-coding of the Gene feature
# alpha is used for laplace smoothing
alpha = 1
# train gene feature
train_gene_feature_responseCoding = np.array(get_gv_feature(alpha, "Gene",
                                     train_df))
# test gene feature
test_gene_feature_responseCoding = np.array(get_gv_feature(alpha, "Gene",
                                 test_df))
# cross validation gene feature
cv_gene_feature_responseCoding = np.array(get_gv_feature(alpha, "Gene",
                         cv_df))
print("train_gene_feature_responseCoding is converted feature using 
response coding method. The shape of gene feature:",
 train_gene_feature_responseCoding.shape)

Output-3
train_gene_feature_responseCoding is converted feature using respone 
coding method.
The shape of gene feature: (2124, 9)

Code-4
# one-hot encoding of Gene feature.
gene_vectorizer = CountVectorizer()
train_gene_feature_onehotCoding = gene_vectorizer.fit_transform(train_df['Gene'])
test_gene_feature_onehotCoding = gene_vectorizer.transform(test_df['Gene'])
cv_gene_feature_onehotCoding = gene_vectorizer.transform(cv_df['Gene']
print("train_gene_feature_onehotCoding is converted feature using one-hot encoding 
method. The shape of gene feature:", train_gene_feature_onehotCoding.shape)

Output-4
train_gene_feature_onehotCoding is converted feature using one-hot 
encoding method.
The shape of gene feature: (2124, 232)

 How good is this gene feature in predicting y_i?
There are many ways to estimate how good a feature is, in predicting y_i.
One of the good methods is to build a proper ML model using just this feature. 
In this case, we will build a logistic regression model using only Gene feature
(one hot encoded) to predict y_i.

Code-5 (Model using only Gene feature)
alpha = [10 ** x for x in range(-5, 1)] # hyperparameter for SGD classifier.
default parameters for SGDClassifier
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, 
fit_intercept=True, max_iter=None, tol=None,shuffle=True, verbose=0, epsilon=0.1,
n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5, 
class_weight=None, warm_start=False, average=False, n_iter=None)
cv_log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_gene_feature_onehotCoding, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_gene_feature_onehotCoding, y_train)
    predict_y = sig_clf.predict_proba(cv_gene_feature_onehotCoding)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_,
     eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",
        log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', 
loss='log', random_state=42)
clf.fit(train_gene_feature_onehotCoding, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_gene_feature_onehotCoding, y_train)

predict_y = sig_clf.predict_proba(train_gene_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",
log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_gene_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation 
log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_gene_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",
log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

Output-5

















Is the Gene feature stable across all the data sets (Test, Train, Cross validation)?
Yes, it is. Otherwise, the CV and Test errors would be significantly more than
train error.
Code-6
print("Q6. How many data points in Test and CV datasets are covered by the ", 
unique_genes.shape[0], " genes in train dataset?")
test_coverage=test_df[test_df['Gene'].isin(list(set(train_df['Gene'])))].shape[0]
cv_coverage=cv_df[cv_df['Gene'].isin(list(set(train_df['Gene'])))].shape[0]
print('Ans\n1. In test data',test_coverage, 'out of',test_df.shape[0], ":",
(test_coverage/test_df.shape[0])*100)
print('2. In cross validation data',cv_coverage, 'out of ',cv_df.shape[0],":" ,
(cv_coverage/cv_df.shape[0])*100)

Output :
Q6. How many data points in Test and CV datasets are covered by the  233  
genes in train dataset?
Ans
1. In test data 648 out of 665 : 97.44360902255639
2. In cross validation data 512 out of  532 : 96.2406015037594

Univariate Analysis on Variation Feature

Variation is a categorical variable.
Code-1
unique_variations = train_df['Variation'].value_counts()
print('Number of Unique Variations :', unique_variations.shape[0])
Output :
Number of Unique Variations : 1929

Code-2
s = sum(unique_variations.values);
h = unique_variations.values/s;
c = np.cumsum(h)
print(c)
plt.plot(c,label='Cumulative distribution of Variations')
plt.grid()
plt.legend()
plt.show()


Obsevartion : 1500 out of 1927
variations consists of 80% data.










How to featurize this Variation feature ?
There are two ways we can featurize this variable :
-->One hot Encoding
-->Response coding
We will be using both these methods to featurize the Variation Feature.

Code-3
# alpha is used for laplace smoothing
alpha = 1
# train gene feature
train_variation_feature_responseCoding = np.array(get_gv_feature(alpha,
 "Variation", train_df))
# test gene feature
test_variation_feature_responseCoding = np.array(get_gv_feature(alpha,
 "Variation", test_df))
# cross validation gene feature
cv_variation_feature_responseCoding = np.array(get_gv_feature(alpha,
 "Variation", cv_df))
print("train_variation_feature_responseCoding is a converted feature using
the response coding method. The shape of Variation feature:",
train_variation_feature_responseCoding.shape)

Output-3
train_variation_feature_responseCoding is a converted feature using the
response coding method. The shape of Variation feature: (2124, 9)

Code-4
# one-hot encoding of variation feature.
variation_vectorizer = CountVectorizer()
train_variation_feature_onehotCoding = variation_vectorizer.fit_transform(
train_df['Variation'])
test_variation_feature_onehotCoding = variation_vectorizer.transform(
test_df['Variation'])
cv_variation_feature_onehotCoding = variation_vectorizer.transform(
cv_df['Variation'])
print("train_variation_feature_onehotEncoded is converted feature using 
the one-hot encoding method. The shape of Variation feature:", 
train_variation_feature_onehotCoding.shape)

Output-4 
train_variation_feature_onehotEncoded is converted feature using the 
one-hot encoding method.The shape of Variation feature: (2124, 1962)

How good is this Variation feature  in predicting y_i??
Let's build a model just like the earlier!

Code-5
Training logistic regression using one hot encoded train dataset.
alpha = [10 ** x for x in range(-5, 1)]
cv_log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_variation_feature_onehotCoding, y_train)
    
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_variation_feature_onehotCoding, y_train)
    predict_y = sig_clf.predict_proba(cv_variation_feature_onehotCoding)
    
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, 
    eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(
    y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log',
random_state=42)
clf.fit(train_variation_feature_onehotCoding, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_variation_feature_onehotCoding, y_train)

predict_y = sig_clf.predict_proba(train_variation_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:"
,log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_variation_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross
validation log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_variation_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:"
,log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

Output-5 















It is observed that Training loss is 1.06 for alpha = 0.001 and at same alpha
test loss and CV loss is 1.72.
This feature may not be stable across all dataset (train,test and cv) but it is
useful as loss is less than random model loss.

Univariate Analysis on Text Feature

1. How many unique words are present in train data?
2. How are word frequencies distributed?
3. How to featurize text field?
4. Is the text feature useful in predicitng y_i?
5. Is the text feature stable across train, test and CV datasets?

Code-1
# cls_text is a data frame
# for every row in data fram consider the 'TEXT'
# split the words by space
# make a dict with those words
# increment its count whenever we see that word

def extract_dictionary_paddle(cls_text):
    dictionary = defaultdict(int)
    for index, row in cls_text.iterrows():
        for word in row['TEXT'].split():
            dictionary[word] +=1
    return dictionary

Code-2: Text response coding
import math
#https://stackoverflow.com/a/1602964
def get_text_responsecoding(df):
    text_feature_responseCoding = np.zeros((df.shape[0],9))
    for i in range(0,9):
        row_index = 0
        for index, row in df.iterrows():
            sum_prob = 0
            for word in row['TEXT'].split():
                sum_prob += math.log(((dict_list[i].get(word,0)+10 )
                            /(total_dict.get(word,0)+90)))
            text_feature_responseCoding[row_index][i] = math.exp(
            sum_prob/len(row['TEXT'].split()))
            row_index += 1
    return text_feature_responseCoding

Code-3
# building a CountVectorizer with all the words that occured minimum 3 times
in train data
text_vectorizer = CountVectorizer(min_df=3)
train_text_feature_onehotCoding = text_vectorizer.fit_transform(train_df['TEXT'])
# getting all the feature names (words)
train_text_features= text_vectorizer.get_feature_names()

# train_text_feature_onehotCoding.sum(axis=0).A1 will sum every row and 
returns (1*number of features) vector
train_text_fea_counts = train_text_feature_onehotCoding.sum(axis=0).A1

# zip(list(text_features),text_fea_counts) will zip a word with its number 
of times it occured
text_fea_dict = dict(zip(list(train_text_features),train_text_fea_counts))
print("Total number of unique words in train data :", len(train_text_features))

Output-3
Total number of unique words in train data : 53434

Code-4
dict_list = []
# dict_list =[] contains 9 dictoinaries each corresponds to a class
for i in range(1,10):
    cls_text = train_df[train_df['Class']==i]
    # build a word dict based on the words in that class
    dict_list.append(extract_dictionary_paddle(cls_text))
    # append it to dict_list

# dict_list[i] is build on i'th  class text data
# total_dict is buid on whole training text data
total_dict = extract_dictionary_paddle(train_df)
confuse_array = []
for i in train_text_features:
    ratios = []
    max_val = -1
    for j in range(0,9):
        ratios.append((dict_list[j][i]+10 )/(total_dict[i]+90))
    confuse_array.append(ratios)
confuse_array = np.array(confuse_array)

Code-5
#response coding of text features
train_text_feature_responseCoding  = get_text_responsecoding(train_df)
test_text_feature_responseCoding  = get_text_responsecoding(test_df)
cv_text_feature_responseCoding  = get_text_responsecoding(cv_df)

Code-6 Normalization
# we convert each row values such that they sum to 1  
train_text_feature_responseCoding = (train_text_feature_responseCoding.T/
train_text_feature_responseCoding.sum(axis=1)).T
test_text_feature_responseCoding = (test_text_feature_responseCoding.T/
test_text_feature_responseCoding.sum(axis=1)).T

# don't forget to normalize every feature
train_text_feature_onehotCoding = normalize(train_text_feature_onehotCoding, axis=0)

# we use the same vectorizer that was trained on train data
test_text_feature_onehotCoding = text_vectorizer.transform(test_df['TEXT'])
# don't forget to normalize every feature
test_text_feature_onehotCoding = normalize(test_text_feature_onehotCoding, axis=0)

# we use the same vectorizer that was trained on train data
cv_text_feature_onehotCoding = text_vectorizer.transform(cv_df['TEXT'])
# don't forget to normalize every feature
cv_text_feature_onehotCoding = normalize(cv_text_feature_onehotCoding, axis=0)
cv_text_feature_responseCoding = (cv_text_feature_responseCoding.T/
cv_text_feature_responseCoding.sum(axis=1)).T

Code-7 : Counting of words
sorted_text_fea_dict = dict(sorted(text_fea_dict.items(), key=lambda x:
 x[1] , reverse=True))
sorted_text_occur = np.array(list(sorted_text_fea_dict.values()))
# Number of words for a given frequency.
print(Counter(sorted_text_occur))

Output-7
Counter({3: 5198, 4: 4101, 6: 2939, 5: 2619, 7: 2121, 8: 1855, 9: 1793, 12: 1554,
 10: 1475,14: 1024, 11: 1022, 13: 953, 18: 811, 15: 806..............})

As we have answered all the questions asked above :
Now, we will train the model.

Code-8
# Train a Logistic regression+Calibration model using text features 
which are on-hot encoded
alpha = [10 ** x for x in range(-5, 1)]
cv_log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_text_feature_onehotCoding, y_train)
    
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_text_feature_onehotCoding, y_train)
    predict_y = sig_clf.predict_proba(cv_text_feature_onehotCoding)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, 
     eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(y_cv,
         predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log',random_state=42)
clf.fit(train_text_feature_onehotCoding, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_text_feature_onehotCoding, y_train)

predict_y = sig_clf.predict_proba(train_text_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:"
,log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_text_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation 
log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_text_feature_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",
log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

Output-8















 
Is the Text feature stable across all the data sets (Test, Train,
 Cross validation)?
Yes, it seems like!

Code-9
def get_intersec_text(df):
    df_text_vec = CountVectorizer(min_df=3)
    df_text_fea = df_text_vec.fit_transform(df['TEXT'])
    df_text_features = df_text_vec.get_feature_names()
    df_text_fea_counts = df_text_fea.sum(axis=0).A1
    df_text_fea_dict = dict(zip(list(df_text_features),df_text_fea_counts))
    len1 = len(set(df_text_features))
    len2 = len(set(train_text_features) & set(df_text_features))
    return len1,len2
len1,len2 = get_intersec_text(test_df)
print(np.round((len2/len1)*100, 3), "% of word of test data appeared in train data")
len1,len2 = get_intersec_text(cv_df)
print(np.round((len2/len1)*100, 3), "% of word of Cross Validation 
appeared in train data")

Output-8 :
96.681 % of word of test data appeared in train data
98.478 % of word of Cross Validation appeared in train data

Machine learning ModelsBefore going to ML models, we will create some function to use while 
building models.

Data preparation for ML models
def predict_and_plot_confusion_matrix(train_x, train_y,test_x, test_y, clf):
    clf.fit(train_x, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x, train_y)
    pred_y = sig_clf.predict(test_x)

    # for calculating log_loss we willl provide the array of probabilities
     belongs to each class
    print("Log loss :",log_loss(test_y, sig_clf.predict_proba(test_x)))
    # calculating the number of data points that are misclassified
    print("Number of mis-classified points :", np.count_nonzero(
        (pred_y- test_y))/test_y.shape[0])
    plot_confusion_matrix(test_y, pred_y)

Getting log loss
def report_log_loss(train_x, train_y, test_x, test_y,  clf):
    clf.fit(train_x, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x, train_y)
    sig_clf_probs = sig_clf.predict_proba(test_x)
    return log_loss(test_y, sig_clf_probs, eps=1e-15)

Stacking the three types of features
Merging gene, variance and text features.
What is stacking ??
a = [[1, 2], 
     [3, 4]]
b = [[4, 5], 
     [6, 7]]
hstack(a, b) = [[1, 2, 4, 5],
               [ 3, 4, 6, 7]]
train_gene_var_onehotCoding = 
hstack((train_gene_feature_onehotCoding,train_variation_feature_onehotCoding))
test_gene_var_onehotCoding = 
hstack((test_gene_feature_onehotCoding,test_variation_feature_onehotCoding))
cv_gene_var_onehotCoding = 
hstack((cv_gene_feature_onehotCoding,cv_variation_feature_onehotCoding))

train_x_onehotCoding = 
hstack((train_gene_var_onehotCoding, train_text_feature_onehotCoding)).tocsr()
train_y = np.array(list(train_df['Class']))

test_x_onehotCoding = 
hstack((test_gene_var_onehotCoding, test_text_feature_onehotCoding)).tocsr()
test_y = np.array(list(test_df['Class']))

cv_x_onehotCoding = 
hstack((cv_gene_var_onehotCoding, cv_text_feature_onehotCoding)).tocsr()
cv_y = np.array(list(cv_df['Class']))

train_gene_var_responseCoding = 
np.hstack((train_gene_feature_responseCoding,train_variation_feature_responseCoding))
test_gene_var_responseCoding = 
np.hstack((test_gene_feature_responseCoding,test_variation_feature_responseCoding))
cv_gene_var_responseCoding = 
np.hstack((cv_gene_feature_responseCoding,cv_variation_feature_responseCoding))

train_x_responseCoding = 
np.hstack((train_gene_var_responseCoding, train_text_feature_responseCoding))
test_x_responseCoding = 
np.hstack((test_gene_var_responseCoding, test_text_feature_responseCoding))
cv_x_responseCoding = 
np.hstack((cv_gene_var_responseCoding, cv_text_feature_responseCoding))

Logistic Regression with Class Balancingdefault parameters
SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, 
fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1,
n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5, 
class_weight=None, warm_start=False, average=False, n_iter=None)

alpha = [10 ** x for x in range(-6, 3)]
cv_log_error_array = []
for i in alpha:
    print("for alpha =", i)
    clf = SGDClassifier(class_weight='balanced', alpha=i, penalty='l2', loss='log',
         random_state=42)
    clf.fit(train_x_onehotCoding, train_y)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_x_onehotCoding, train_y)
    sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)
    cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_,
                             eps=1e-15))
    # to avoid rounding error while multiplying probabilites we use 
        log-probability estimates
    print("Log Loss :",log_loss(cv_y, sig_clf_probs)) 

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],str(txt)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(class_weight='balanced', alpha=alpha[best_alpha], penalty='l2',
                     loss='log', random_state=42)
clf.fit(train_x_onehotCoding, train_y)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(train_x_onehotCoding, train_y)

predict_y = sig_clf.predict_proba(train_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:"
,log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(cv_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation 
log loss is:",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(test_x_onehotCoding)
print('For values of best alpha = ', alpha[best_alpha], "The test 
log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))



Testing the model with best hyper paramtersclf = SGDClassifier(class_weight='balanced', alpha=alpha[best_alpha],
        penalty='l2', loss='log', random_state=42)
predict_and_plot_confusion_matrix(train_x_onehotCoding, train_y,
        cv_x_onehotCoding, cv_y, clf)





Observation : 
1. The best alpha is 0.001.
2. Train log loss is 0.61.
3. Test log loss is 1.21.
4. CV log loss is 1.10
5. Misclassified points are 34.8%.

Logistic regression is highly interpretable model.