11 March, 2009 § Leave a comment
As part of my Natural Language Processing course, I developed a simple part-of-speech tagger using the Penn Treebank. The software reads in a trained corpus of labeled data and then applies what it has learned against a testing corpus of unlabeled data. The unlabeled data is labeled after it is run through a hidden Markov model that calculates the best path that can be followed through the sentence.
What is a hidden Markov model?
In the most general sense, it is an algorithm that can calculate the best path through a series of states. The path that is determined is a set of hidden variables that pertain to each step in the known series. For example, with this application the sentence that the program reads in is known and visible. What is not known is the part-of-speech for each word, such as if it is a verb, pronoun, etc.
The hidden Markov model uses a dynamic programming algorithm called the Viterbi algorithm to turn an O(N^T) in to just O(N^2 * T) (with N: number of words in the sentence and with T: number of possible tags in corpus)
Care to take a look at the source code? I’ve uploaded the entire source code for the project and two other applications that I created to help with the development. One of them was previously released, and the other is an application to compare differences between the expected output and the actual output. It is a .zip file and you’ll have to rename it and remove the .doc and replace it with .zip, since WordPress won’t allow you to upload zip files. It is a Visual Studio 2005 solution. Inside of the zip file you will find a readme that will show the usage of the program.
3 February, 2009 § 1 Comment
As part of a Natural Language Processing course I’m taking, we are developing a simple part-of-speech tagger using the Penn Treebank. Our part-of-speech tagger will use a training and a testing corpus. To make life easier, we can start with an already-tagged testing corpus to compare our results with the “truth”.
I wrote a small command-line app that will remove the part-of-speech tags from the testing corpus so you can begin working on the part-of-speech tagger and not have to worry about this orthogonal task.
As part of my new initiative that I am taking upon myself to release more of my code, I have attached the code for this [7kb]. It is a .zip file and you’ll have to rename it and remove the .doc and replace it with .zip, since WordPress won’t allow you to upload zip files. It is a Visual Studio 2005 solution. The code is STD C++ and runs through about 1.1 megs of the truth file in under a second.
It’s usage is: PosTagStripper <testingfile> <outputfile>
Let me know if you have any problems when running the code.