ML Project 3 (Post 2) – Becoming A Data Scientist

Tonight I have learned how to use PyBrain’s Feed-Forward Neural Networks for Classification. Yay!

I had already created a neural network and used it on the project’s regression data set earlier this week. I then used those results to “manually” classify, by picking which class the output was closer to and counting up how many points were correctly classified. Tonight I fully implemented the PyBrain classification, using the 1-of-k method of encoding the classes, and it appears to be working great!
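
For anyone unfamiliar with 1-of-k encoding: each class label becomes a vector with a 1 in that class's column and 0 everywhere else, and the predicted class is whichever output column is largest. Here is a minimal plain-Python sketch of the idea. The helper names `one_hot` and `decode` are made up for illustration and are not part of PyBrain.

```python
def one_hot(label, num_classes):
    """Return a 1-of-k vector for a zero-based class label."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def decode(outputs):
    """Return the index of the largest output, i.e. the predicted class."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


# Four samples from a 2-class problem, labels already shifted to start at 0
labels = [0, 1, 1, 0]
targets = [one_hot(lbl, 2) for lbl in labels]
print(targets)

# A network output of [0.2, 0.8] would be read as class 1
predicted = decode([0.2, 0.8])
print(predicted)
```

In the script, the `_convertToOneOfMany()` call takes care of the encoding step on the data set.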

The neural network still takes a while to train, but it’s much quicker on this 2-input 2-class data than it was on the 8-input 7-output data for part 1 of the project. I’m actually writing this as it trains for the next task (see below).

The code I wrote is:


print("\nImporting training data...")
from pybrain.datasets import ClassificationDataSet
#bring in data from training file
traindata = ClassificationDataSet(2,1,2)
with open("classification.tra") as f:
    for line in f:
        #using classification data set this time (subtracting 1 so first class is 0)
        #parse each row once: first two values are inputs, third is the class
        values = list(map(float, line.split()))
        traindata.appendLinked(values[0:2], int(values[2])-1)
    
 
print("Training rows: %d " %  len(traindata) )
print("Input dimensions: %d, output dimensions: %d" % ( traindata.indim, traindata.outdim))
#convert to have 1 in column per class
traindata._convertToOneOfMany()
#raw_input("Press Enter to view training data...")
#print(traindata)
print("\nFirst sample: ", traindata['input'][0], traindata['target'][0], traindata['class'][0])

print("\nCreating Neural Network:")
#create the network
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
#change the number below for neurons in hidden layer
hiddenneurons = 2
net = buildNetwork(traindata.indim,hiddenneurons,traindata.outdim, outclass=SoftmaxLayer)
print('Network Structure:')
print('\nInput: ', net['in'])
#can't figure out how to get hidden neuron count, so making it a variable to print
print('Hidden layer 1: ', net['hidden0'], ", Neurons: ", hiddenneurons )
print('Output: ', net['out'])

#raw_input("Press Enter to train network...")
#train neural network
print("\nTraining the neural network...")
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(net,traindata)
trainer.trainUntilConvergence(dataset = traindata, maxEpochs=100, continueEpochs=10, verbose=True, validationProportion = .20)

print("\n")
for mod in net.modules:
    for conn in net.connections[mod]:
        print(conn)
        for cc in range(len(conn.params)):
            print(conn.whichBuffers(cc), conn.params[cc])

print("\nTraining Epochs: %d" % trainer.totalepochs)

from pybrain.utilities import percentError
trnresult = percentError( trainer.testOnClassData(dataset = traindata),
                              traindata['class'] )
print("  train error: %5.2f%%" % trnresult)
#result for each class
trn0, trn1 =  traindata.splitByClass(0)
trn0result = percentError( trainer.testOnClassData(dataset = trn0), trn0['class'])
trn1result = percentError( trainer.testOnClassData(dataset = trn1), trn1['class'])
print("  train class 0 samples: %d, error: %5.2f%%" % (len(trn0),trn0result))
print("  train class 1 samples: %d, error: %5.2f%%" % (len(trn1),trn1result))

raw_input("\nPress Enter to start testing...")

print("\nImporting testing data...")
#bring in data from testing file
testdata = ClassificationDataSet(2,1,2)
with open("classification.tst") as f:
    for line in f:
        #using classification data set this time (subtracting 1 so first class is 0)
        #parse each row once: first two values are inputs, third is the class
        values = list(map(float, line.split()))
        testdata.appendLinked(values[0:2], int(values[2])-1)
    

print("Test rows: %d " %  len(testdata) )
print("Input dimensions: %d, output dimensions: %d" % ( testdata.indim, testdata.outdim))
#convert to have 1 in column per class
testdata._convertToOneOfMany()
#raw_input("Press Enter to view testing data...")
#print(testdata)
print("\nFirst sample: ", testdata['input'][0], testdata['target'][0], testdata['class'][0])

print("\nTesting...")
tstresult = percentError( trainer.testOnClassData(dataset = testdata),
                              testdata['class'] )
print("  test error: %5.2f%%" % tstresult)
#result for each class
tst0, tst1 =  testdata.splitByClass(0)
tst0result = percentError( trainer.testOnClassData(dataset = tst0), tst0['class'])
tst1result = percentError( trainer.testOnClassData(dataset = tst1), tst1['class'])
print("  test class 0 samples: %d, error: %5.2f%%" % (len(tst0),tst0result))
print("  test class 1 samples: %d, error: %5.2f%%" % (len(tst1),tst1result))

With 2 neurons in the hidden layer, I got a training result of:
5% Class 0 misclassified
5.5% Class 1 misclassified
Overall 5.25% error

and when run on my test data (200 samples in each class):
3% Class 0 misclassified
6% Class 1 misclassified
Overall 4.5% error
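
As a sanity check on those numbers, the overall error is just the per-class errors weighted by how many samples each class has. Here is a quick sketch using the test figures above (200 samples per class). The function name `overall_error` is made up for this example and is not something from PyBrain.

```python
def overall_error(class_sizes, class_errors_pct):
    """Combine per-class error percentages into one overall percentage."""
    # Turn each percentage back into a count of misclassified samples
    misclassified = sum(n * e / 100.0
                        for n, e in zip(class_sizes, class_errors_pct))
    return 100.0 * misclassified / sum(class_sizes)


# Test set: 200 samples per class, 3% and 6% misclassified
test_error = overall_error([200, 200], [3.0, 6.0])
print("overall test error: %.2f%%" % test_error)  # overall test error: 4.50%
```

The training figures (5% and 5.5% giving 5.25%) fit the same formula if the two training classes are the same size, which is an assumption here since the training class sizes aren't listed above.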

Looks good to me! Now, on to the next task, which is to do this same thing with a data file that has 3000 samples with 16 inputs and 10 classes. This could take a while :)
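
The loading loop above hard-codes two input columns, so it needs a small tweak for the bigger file. Below is a rough sketch, assuming the new file uses the same layout (whitespace-separated inputs followed by a class label that starts at 1). That layout is an assumption, since the file isn't shown here, and `parse_row` is a hypothetical helper rather than anything from PyBrain.

```python
def parse_row(line, n_inputs):
    """Split one whitespace-separated row into (inputs, zero-based class)."""
    values = [float(v) for v in line.split()]
    inputs = values[:n_inputs]
    # subtract 1 so the first class is 0, as in the 2-class script
    label = int(values[n_inputs]) - 1
    return inputs, label


# Made-up row with 16 inputs followed by class label 10
row = " ".join(str(i * 0.5) for i in range(16)) + " 10"
inputs, label = parse_row(row, 16)
print("%d inputs, class %d" % (len(inputs), label))  # 16 inputs, class 9
```

Following the pattern of the script above, the data set would then presumably be created as `ClassificationDataSet(16,1,10)`, with each parsed row passed to `appendLinked`.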

2 comments

Rula says:

April 28, 2014 at 2:01 am

Hi, loving your blog. Have been following you on Twitter – out of interest, what machine learning course are you doing?

Many thanks,
Kind regards
Renee says:
  
  April 28, 2014 at 10:27 am
  
  Hi, thanks!
  
  I was avoiding mentioning the specific course because I don’t have permission from the professor to post the assignments online, and don’t want to have any honor code violations or anything, so I haven’t mentioned the course number. However, you can view it at this link :)
  http://bit.ly/1pGOuJc
  
  The course has focused on the theory and math behind these algorithms. We haven’t done any code in class, so all of the projects have involved a lot of self-teaching. The homework has been heavy statistics and deriving maximum likelihoods and such. I didn’t expect to be doing matrix calculus when I signed up, but I sure have learned a lot!
  
  This is the textbook we use:
  http://www.amazon.com/gp/product/0387310738/ref=as_li_ss_il?ie=UTF8&camp=1789&creative=390957&creativeASIN=0387310738&linkCode=as2&tag=becomingadatascientist-20
  
  Thanks for your comment!
