ML Project 1 (Post 4) – Becoming A Data Scientist

Wow, this was a tough one!! I actually had the right idea for this Gaussian Bayes Classifier from the start, but I got totally stuck because my Gaussian values were coming out as 2×2 matrixes instead of probabilities. It turns out my x-mu vectors were being stored as arrays, not matrixes, and in Python, arrays don’t have the same dimensions as matrices, apparently. So I don’t know why it didn’t get an error, but the math was coming out all wonky.

I stepped through each piece of that equation, and eventually discovered the “shape” property, which showed me that what I thought was a matrix and a matrix transpose were being seen as the same shape. Red flag! So, I found out that numpy can convert arrays into proper matrices, the it all came together and it looks like it worked! Can you tell I’m excited to have gotten over this hurdle tonight?!? The project is due Thursday, which means I have to finish it tomorrow night.

And it’s 3:15am and I have an important meeting at 8:30. Lovely.

Anyway, here’s my very messy code. (someone needs a tutorial in Python lists vs arrays and indexing, sheesh)

import numpy as np

#bring in data from files
data_train = np.loadtxt(open("train.txt"))
data_test = np.loadtxt(open("test.txt"))
#can't define 3rd column as integer, or it will collapse to 1D array and next line won't work?
Xtrn = data_train[:, 0:2]  # first 2 columns of training set
Ytrn = data_train[:, 2]  # last column, 1/0 labels
Xtst = data_test[:, 0:2]  # first 2 columns of test set
Ytst = data_test[:, 2]  # last column, 1/0 labels
print("Length of training set: %d " % len(Xtrn))
print("Length of test set: %d " % len(Xtst))

#mean of each x in Class 0, and covariance
#initialize
Xtrn_new0 = [[0 for x in range(2)] for y in range(int(len(Xtrn)/2))] #note, currently hardcoded that each class is 1/2 of the total
#loop through training data and find the items labeled 0
n = 0
for train_items in data_train:
    if train_items[2] == 0:
        #setting up new more general way so can do covariance matrix
        Xtrn_new0[n][0] = train_items[0]
        Xtrn_new0[n][1] = train_items[1]
        n=n+1
print("\nNumber of Class 0 items in training set: %d " % len(Xtrn_new0))

#get the means for each column in class 0
col_mean0 = [0,0]
for m in range(0,len(Xtrn_new0)):
    col_mean0[0] += Xtrn_new0[m][0]
    col_mean0[1] += Xtrn_new0[m][1]
col_mean0[0] = col_mean0[0]/n
col_mean0[1] = col_mean0[1]/n

#build covariance matrix for class 0 (hard-coding as 2x2 for now)
E0 = [[0 for x in range(2)] for y in range(2)]
for i in range(0,2):
   for j in range(0,2):
       for x in range (0, len(Xtrn_new0)):
            E0[i][j] += (Xtrn_new0[x][i] - col_mean0[i])*(Xtrn_new0[x][j] - col_mean0[j])
#divide each item in E by N-1
E0[0][0] = E0[0][0] / (n-1) #this should be correct with n in loop above
E0[0][1] = E0[0][1] / (n-1)
E0[1][0] = E0[1][0] / (n-1)
E0[1][1] = E0[1][1] / (n-1)
print("Covariance Matrix Class 0")
print(E0)

#mean of each x in Class 1, and covariance
Xtrn_new1 = [[0 for x in range(2)] for y in range(int(len(Xtrn)/2))] #note, currently hardcoded that each class is 1/2 of the total
#loop through training data and find the items labeled 0
n = 0
for train_items in data_train:
    if train_items[2] == 1:
        #setting up new more general way so can do covariance matrix
        Xtrn_new1[n][0] = train_items[0]
        Xtrn_new1[n][1] = train_items[1]
        n=n+1
print("\nNumber of Class 1 items in training set: %d " % len(Xtrn_new1))

#get the means for each column in class 1
col_mean1 = [0,0]
for m in range(0,len(Xtrn_new1)):
    col_mean1[0] += Xtrn_new1[m][0]
    col_mean1[1] += Xtrn_new1[m][1]
col_mean1[0] = col_mean1[0]/n
col_mean1[1] = col_mean1[1]/n

#build covariance matrix for class 1 (hard-coding as 2x2 for now)
E1 = [[0 for x in range(2)] for y in range(2)]
for i in range(0,2):
   for j in range(0,2):
       for x in range (0, len(Xtrn_new1)):
            E1[i][j] += (Xtrn_new1[x][i] - col_mean1[i])*(Xtrn_new1[x][j] - col_mean1[j])

#divide each item in E by N-1
E1[0][0] = E1[0][0] / (n-1) #this should be correct with n in loop above
E1[0][1] = E1[0][1] / (n-1)
E1[1][0] = E1[1][0] / (n-1)
E1[1][1] = E1[1][1] / (n-1)
print("Covariance Matrix Class 1")
print(E1)

#d-dimensional gaussian for class 0
Gauss0_0 = []
Gauss0_1 = []
Xprob = []

#tests to see what dimensions each of these are
#print((1/((2*np.pi) ** (2/2)))*(1/(np.power(np.linalg.det(E0),1/2))))
#print(np.transpose(np.matrix([i - j for i, j in zip(Xtrn_new0[0], col_mean0)])).shape) #note: AN ARRAY DOESN'T HAVE THE SAME SHAPE AS A MATRIX!!!
#print(np.linalg.inv(E0).shape)
#print(np.matrix([i - j for i, j in zip(Xtrn_new0[x], col_mean0)]).shape)
#print()
#print((-1/2)*np.matrix([i - j for i, j in zip(Xtrn_new0[0], col_mean0)])*np.linalg.inv(E0)*np.transpose(np.matrix([i - j for i, j in zip(Xtrn_new0[x], col_mean0)])))

#get the probability that each X-pair (from the test file) lies within a particular class
for x in range(len(Xtst)):
    #max(p(xm|ci)p(ci))
    #2D Gaussian distributions
   Gauss0_0.append((1/((2*np.pi) ** (2/2)))*(1/(np.power(np.linalg.det(E0),1/2)))*np.exp((-1/2)*np.matrix([i - j for i, j in zip(Xtst[x], col_mean0)])*np.linalg.inv(E0)*np.transpose(np.matrix([i - j for i, j in zip(Xtst[x], col_mean0)]))))
   Gauss0_1.append((1/((2*np.pi) ** (2/2)))*(1/(np.power(np.linalg.det(E1),1/2)))*np.exp((-1/2)*np.matrix([i - j for i, j in zip(Xtst[x], col_mean1)])*np.linalg.inv(E1)*np.transpose(np.matrix([i - j for i, j in zip(Xtst[x], col_mean1)]))))
   if Gauss0_0[x] > Gauss0_1[x]:
       Xprob.append(0)
   else:
       Xprob.append(1)

#check probable-class against the true y
count_correct = 0
for x in range(len(Xprob)):
    if Xprob[x] == Ytst[x]:
        count_correct +=1

print("\nCorrectly classified: %d" % count_correct)
print("Incorrectly classified: %d" % (len(Xprob)-count_correct))

The output is:

Length of training set: 400
Length of test set: 400

Number of Class 0 items in training set: 200
Covariance Matrix Class 0
[[0.9023749470732253, 1.2711079773385794], [1.2711079773385794, 8.3594624640701252]]

Number of Class 1 items in training set: 200
Covariance Matrix Class 1
[[0.90550613040585903, -1.0031029933232485], [-1.0031029933232485, 8.7788991831326726]]

Correctly classified: 352
Incorrectly classified: 48