Part-of-speech Tag Stripper
3 February, 2009
As part of a Natural Language Processing course I’m taking, we are developing a simple part-of-speech tagger using the Penn Treebank. Our part-of-speech tagger will use a training and a testing corpus. To make life easier, we can start with an already-tagged testing corpus to compare our results with the “truth”.
I wrote a small command-line app that removes the part-of-speech tags from the testing corpus, producing the untagged text the tagger needs as input, so you can begin working on the part-of-speech tagger itself and not have to worry about this orthogonal task.
As part of a new initiative to release more of my code, I have attached the code for this [7kb]. It is a .zip file, but WordPress won’t allow zip uploads, so after downloading it you’ll have to rename it, replacing the .doc extension with .zip. It is a Visual Studio 2005 solution. The code is standard C++ and runs through about 1.1 MB of the truth file in under a second.
Its usage is: PosTagStripper <testingfile> <outputfile>
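For a rough idea of what stripping the tags involves, here is a minimal sketch in standard C++. It is not the attached code. It assumes the corpus stores each token as word/TAG, separated by whitespace, with the tag starting at the last slash. The real Penn Treebank files may contain other markup that this ignores. The names stripTag and stripTags are made up for illustration.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: remove the tag from a single token.
// Assumes the "word/TAG" convention, with the tag starting at the LAST
// slash, so a token like "1\/2/CD" keeps its escaped slash ("1\/2").
std::string stripTag(const std::string& token) {
    const std::string::size_type slash = token.rfind('/');
    // No slash, or nothing before it: leave the token unchanged.
    if (slash == std::string::npos || slash == 0) {
        return token;
    }
    return token.substr(0, slash);
}

// Hypothetical driver: copy `in` to `out` line by line, stripping the tag
// from every whitespace-separated token. Line breaks are preserved; runs
// of whitespace inside a line collapse to a single space.
void stripTags(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string token;
        bool first = true;
        while (tokens >> token) {
            if (!first) {
                out << ' ';
            }
            out << stripTag(token);
            first = false;
        }
        out << '\n';
    }
}
```

To turn this into a command-line tool, a main function could open a std::ifstream on the testing file and a std::ofstream on the output file and pass both to stripTags. Reading line by line keeps memory use flat regardless of corpus size.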
Let me know if you have any problems when running the code.