I finished the Linear Regression & Classification project for Machine Learning class tonight. The part that took me the longest was definitely converting the C sample code to Python! I can read C OK, but I've never written more than the most basic code in it, and I want to learn Python as well as possible during this class anyway. (If you haven't read my past posts, I just started learning Python on my own this semester.) The project gave us the option of using any programming language, so I set out to convert it, and many frustrations with loops and matrices later, I got the code below. The output doesn't exactly match the C code's (it's off by around 0.1, and it isn't the same each time I run it), so something is probably still wrong, but it runs and gives an almost-correct result, and I had to move on in order to finish on time.
I'm not sure yet whether the modifications I made for the rest of the project's tasks are correct (I'll update when I get it back), but I've attached my code files below. You can see in the pasted code that I was printing output at every step to debug (and I actually removed most of my many print statements to clean it up!). Now that it's turned in, let me know if you have any recommendations for improving the code!
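One thing I already suspect about the run-to-run variation: the initial weights come from np.random.normal with no seed, so every run starts from a different random point. Adding something like this at the very top of the script should at least make runs repeatable (a quick, untested note; the seed value itself is arbitrary):

import numpy as np

# fix NumPy's global random state so np.random.normal gives the same
# initial weights (and therefore the same final MSE) on every run
np.random.seed(0)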
import numpy as np

#bring in data from training file
i = 0
x = []   #inputs x
ty = []  #outputs ty
f = open("regression.tra")
for line in f.readlines():
    #every other line in file is 8 input values or 7 output values
    if i%2 == 0:
        x.append(list(map(float, line.split())))
    else:
        ty.append(list(map(float, line.split())))
    i=i+1
print("TRAINING DATA")
print("Length of training set: %d , %d " % (len(x), len(ty)))
#print(i)
#input("Press Enter to view input data...")
#print('x:')
#print(x)
#input("Press Enter to view output data...")
#print('ty:')
#print(ty)

#x-augmented, copy x add a column of all ones
xa = np.append(x, np.ones([len(x), 1]), 1)
print("Shape xa: " + str(np.shape(xa)))
print("Shape ty: " + str(np.shape(ty)))

Nin = 9
Nout = 7

#bring in data from TEST file
i2 = 0
x2 = []       #inputs x
ty_test = []  #outputs ty
f2 = open("regression.tst")
for line in f2.readlines():
    #every other line in file is 8 input values, 7 output values
    if i2%2 == 0:
        x2.append(list(map(float, line.split())))
    else:
        ty_test.append(list(map(float, line.split())))
    i2=i2+1
print("\nTEST DATA")
print("Length of test set: %d , %d " % (len(x2), len(ty_test)))
#print(i)
#input("Press Enter to view input data...")
#print('x2:')
#print(x2)
#input("Press Enter to view output data...")
#print('ty_test:')
#print(ty_test)

#x-augmented, copy x add a column of all ones
xa_test = np.append(x2, np.ones([len(x2), 1]), 1)
print("Shape xa_test: " + str(np.shape(xa_test)))
print("Shape ty_test: " + str(np.shape(ty_test)))

input("\nPress Enter to continue...")

print("Calculating auto-correlation...")
#auto-correlation xTx
R = [[0.0 for j in range(Nin)] for i in range(Nin)]
for xarow in xa:
    for i in range(Nin):
        for j in range(Nin):
            R[i][j] = R[i][j] + (xarow[i] * xarow[j])

print("Calculating cross-correlation...")
#cross-correlation xTty
C = [[0.0 for j in range(Nin)] for i in range(Nout)]
for n in range(len(xa)):
    for i in range(Nout):
        for j in range(Nin):
            C[i][j] = C[i][j] + (ty[n][i] * xa[n][j])
#print("Shape R: " + str(np.shape(R)) + " Shape C: " + str(np.shape(C)))

print("Normalizing correlations...")
#normalize (1/Nv)
for i in range(Nin):
    for j in range(Nin):
        R[i][j] = R[i][j]/(len(xa))
for i in range(Nout):
    for j in range(Nin):
        C[i][j] = C[i][j]/(len(ty))

meanseed = 0.0
stddevseed = 0.5

##set up W
w0 = [[0.0 for j in range(Nin-1)] for i in range(Nout)]
W = [[0.0 for j in range(Nin)] for i in range(Nout)]
for i in range(Nout):
    for j in range(Nin-1):
        #assign random weight for initial value
        w0[i][j] = np.random.normal(meanseed, stddevseed)
        W[i][j] = w0[i][j]
    W[i][Nin-1] = np.random.normal(meanseed, stddevseed)

#conjugate gradient subroutine (this could be called as a function)
#input("Press enter to calculate weights...")
print("Calculating weights...")
for i in range(Nout):
    #loop around CGrad in sample
    passiter = 0
    XD = 1.0
    #copying matrix parts needed
    w = W[i]
    r = R
    c = C[i]
    Nu = Nin
    while passiter < 2:  #2 passes
        p = [0.0 for j in range(Nu)]
        g = [0.0 for j in range(Nu)]
        for j in range(Nu):  #equivalent to "iter" loop in sample code (again, check loop values)
            for k in range(Nu):  #equiv to l loop in sample
                tempg = 0.0
                for m in range(Nu):
                    tempg = tempg + w[m]*r[m][k]
                g[k] = -2.0*c[k] + 2.0*tempg
            XN = 0.0
            for k in range(Nu):
                XN = XN + g[k] * g[k]
            B1 = XN / XD
            XD = XN
            for k in range(Nu):
                p[k] = -g[k] + B1*p[k]
            Den = 0.0
            Num = Den
            for k in range(Nu):
                #numerator of B2
                Num = Num + p[k] * g[k] / -2.0
                #denominator of B2
                for m in range(Nu):
                    Den = Den + p[m] * p[k] * r[m][k]
            B2 = Num / Den
            #update weights
            for k in range(Nu):
                w[k] = w[k] + B2 * p[k]
        passiter += 1
    #after the two passes, store back in W[i] before next i
    W[i] = w

Error = [0.0 for i in range(Nout)]
MSE = 0.0
#input("Press enter to calculate error...")
print('\nCalculating Training MSE...')
#calculate mean squared error
for N in range(len(xa)):
    for i in range(Nout):
        y = 0.0
        for j in range(Nin):
            y += xa[N][j]*W[i][j]
        Error[i] += (ty[N][i]-y)*(ty[N][i]-y)
for i in range(Nout):
    MSE += Error[i]/(len(ty)+1)
    print('Error at node %d: %f' % (i+1, Error[i]/(len(ty)+1)))
print('Total M.S. Error [TRAIN]: %f' % MSE)

print('\nCalculating Testing MSE...')
#calculate mean squared error for test file
Error = [0.0 for i in range(Nout)]
MSE = 0.0
for N in range(len(xa_test)):
    for i in range(Nout):
        y_test = 0.0
        for j in range(Nin):
            y_test += xa_test[N][j]*W[i][j]
        Error[i] += (ty_test[N][i]-y_test)*(ty_test[N][i]-y_test)
for i in range(Nout):
    MSE += Error[i]/(len(ty_test)+1)
    print('Error at node %d: %f' % (i+1, Error[i]/(len(ty_test)+1)))
print('Total M.S. Error [TEST]: %f' % MSE)
The output of that code looks like this:
TRAINING DATA
Length of training set: 1768 , 1768
Shape xa: (1768, 9)
Shape ty: (1768, 7)
TEST DATA
Length of test set: 1000 , 1000
Shape xa_test: (1000, 9)
Shape ty_test: (1000, 7)
Calculating auto-correlation…
Calculating cross-correlation…
Normalizing correlations…
Calculating weights…
Calculating Training MSE…
Error at node 1: 0.015161
Error at node 2: 0.000194
Error at node 3: 0.337191
Error at node 4: 0.000155
Error at node 5: 0.043815
Error at node 6: 0.037067
Error at node 7: 0.000179
Total M.S. Error [TRAIN]: 0.433762
Calculating Testing MSE…
Error at node 1: 0.014133
Error at node 2: 0.000206
Error at node 3: 0.362302
Error at node 4: 0.000159
Error at node 5: 0.051767
Error at node 6: 0.037002
Error at node 7: 0.000192
Total M.S. Error [TEST]: 0.465761
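Looking at it again now that it's submitted, I suspect most of those triple-nested loops could be replaced with NumPy matrix operations, and the weights could be sanity-checked against a closed-form least-squares solve. Here's a rough, untested sketch of what I mean; it assumes the xa, ty, xa_test, and ty_test variables from the script above, and since the script divides the error by Nv+1 rather than Nv, the numbers wouldn't line up exactly even if the weights matched:

import numpy as np

# assumes xa (Nv x 9), ty (Nv x 7), xa_test, ty_test already exist as in the script above
xa, ty = np.asarray(xa), np.asarray(ty)
xa_test, ty_test = np.asarray(xa_test), np.asarray(ty_test)

Nv = len(xa)
R = xa.T @ xa / Nv    # auto-correlation, shape (9, 9)
C = ty.T @ xa / Nv    # cross-correlation, shape (7, 9)

# closed-form normal equations: R @ W.T = C.T, the same system the
# two conjugate-gradient passes are approximating
W = np.linalg.solve(R, C.T).T    # shape (7, 9)

# per-node and total mean squared error (dividing by Nv, not Nv + 1)
train_err = np.mean((ty - xa @ W.T) ** 2, axis=0)
test_err = np.mean((ty_test - xa_test @ W.T) ** 2, axis=0)
print('Train MSE per node:', train_err, 'total:', train_err.sum())
print('Test MSE per node:', test_err, 'total:', test_err.sum())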
And here are the files for the rest of the project, where we had to convert the sample linear regression code into a classifier and then also do a reduced/regularized version.
(I had to change the .py extension to .txt for WordPress to allow me to upload)
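For anyone curious about the classifier part before I post the graded version, here is a generic sketch of the usual approach (not necessarily exactly what my attached files do): one-hot encode the class labels as the target matrix, fit the same linear regression, and take the largest output as the predicted class. The helper names below are just for illustration:

import numpy as np

# hypothetical helper: labels is an (Nv,) array of integer class ids 0..n_classes-1
def to_one_hot(labels, n_classes):
    targets = np.zeros((len(labels), n_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

# xa_rows is the ones-augmented input matrix, W the (n_classes x Nin) weight
# matrix fit by the same regression code; the predicted class is the argmax output
def classify(xa_rows, W):
    outputs = np.asarray(xa_rows) @ np.asarray(W).T
    return np.argmax(outputs, axis=1)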
Any feedback is welcome!