11 March, 2009 § Leave a Comment
As part of my Natural Language Processing course, I developed a simple part-of-speech tagger using the Penn Treebank. The software reads in a trained corpus of labeled data and then applies what it has learned against a testing corpus of unlabeled data. The unlabeled data is labeled after it is run through a hidden Markov model that calculates the best path that can be followed through the sentence.
What is a hidden Markov model?
In the most general sense, it is an algorithm that can calculate the best path through a series of states. The path that is determined is a set of hidden variables that pertain to each step in the known series. For example, with this application the sentence that the program reads in is known and visible. What is not known is the part-of-speech for each word, such as if it is a verb, pronoun, etc.
The hidden Markov model uses a dynamic programming algorithm called the Viterbi algorithm to turn an O(N^T) in to just O(N^2 * T) (with N: number of words in the sentence and with T: number of possible tags in corpus)
Care to take a look at the source code? I’ve uploaded the entire source code for the project and two other applications that I created to help with the development. One of them was previously released, and the other is an application to compare differences between the expected output and the actual output. It is a .zip file and you’ll have to rename it and remove the .doc and replace it with .zip, since WordPress won’t allow you to upload zip files. It is a Visual Studio 2005 solution. Inside of the zip file you will find a readme that will show the usage of the program.