July 6

Today was a simple day in which we worked on coming up with helpful graphs, numbers, and conclusions drawn from our ever-expanding corpus annotated for specificity. We looked at various summaries of the text and found that the majority of articles fall between 3 and 5 in average specificity, meaning that, according to our findings, articles are more general than specific on average. We debated whether these findings indicate that the annotations were on the general side or whether the articles really were that general. It’s incredibly hard to tell because the annotator pool is very, very small at the moment, so we hope to discuss the topic with Ani at a meeting tomorrow.

I will tackle the following prompt today: “Did you have to teach yourself how to do something in order to advance your internship? Describe how this went.” I’ve had to teach myself quite a few things for the internship, actually! How to use GitHub (reasonably well), how to use a Linux terminal, how to work with paths and directories through Python, and many different topics relating to the basics of natural language processing. While I had some prior knowledge of most of these, I’d like to focus on one where I did not: teaching myself how to use the LibSVM classifier.

Knowing nothing about NLP upon arrival besides that it involved the analysis of human language, I was ready to be assaulted by massive amounts of information in my first few weeks, and the internship did not disappoint. Ani suggested we focus on using an SVM classifier, specifically LibSVM. It is a classifier that is complex internally but can easily be treated as a “black box”, so that we wouldn’t need to know all the gory details to use it. Supposedly. The website has a handy link to a practical guide primarily for beginners, which I happily clicked, thinking it was the place for me. Oh no. Even the beginner’s guide was full of statistics and terminology that I couldn’t make heads or tails of. I still can’t, but that is neither here nor there. The readme file(s) in the classifier’s folders were slightly more helpful, as they informed me which files to run and what information they required from me: a training file, a testing file, an output file. Straightforward enough? I felt I had made some real progress. I may not have understood anything about what the classifier actually did once it ran, but I could give it what it needed, and hopefully it would give me fair payment in results without needing further assistance. Unfortunately…none of the documentation that I could find (or decipher, more accurately) told me how those input files needed to be formatted. Drat.

Wenli and Lily were equally confused, and all the while we were attempting to make some sense of the instructional papers we were finding. Thankfully, Wenli found just about the simplest tutorial imaginable, which really saved me from the crushing weight of my stress. It came to us in our time of need, and we were so grateful. I had to build upon the basis laid by the tutorial in order to add features besides the binary “bag of words” sort it describes, but I was even prouder for having worked that part out myself. As it turns out, the format is not so complicated: each feature gets a unique id, followed by whatever value you have for it. Amazing. Why couldn’t they have said that in the beginner’s guide? Am I still too much of a beginner for even this, supposedly?
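For the curious, here is a rough sketch of what that input format looks like in practice. It is not the code we actually used; the function name, the label, and the feature ids are all made up for illustration. It assumes the usual LibSVM text layout, where each line holds a label followed by `id:value` pairs, with integer ids starting at 1 and listed in ascending order.

```python
def to_libsvm_line(label, features):
    """Format one example as a LibSVM-style line: '<label> <id>:<value> ...'.

    `features` maps integer feature ids (starting at 1) to numeric values.
    Zero-valued features are left out, since the format is sparse.
    """
    parts = [str(label)]
    for feature_id in sorted(features):  # ids are written in ascending order
        value = features[feature_id]
        if value != 0:
            parts.append(f"{feature_id}:{value}")
    return " ".join(parts)


# Hypothetical example: label 4, a binary "bag of words" feature with id 3,
# a numeric feature with id 17, and a zero-valued feature (id 8) that is skipped.
example_line = to_libsvm_line(4, {17: 0.25, 3: 1, 8: 0})
print(example_line)  # 4 3:1 17:0.25
```

Writing one such line per example gives a training file, and the testing file is laid out the same way.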

No, I don’t think so, and that’s what I take away from this learning experience. With enough searching and inventive tomfoolery with whatever information one can understand, one can learn a lot by oneself even amidst plenty of obstacles. I will be the first to admit that I have a tendency to panic when I cannot understand something intuitively, but it’s something I hope to improve upon in order to make the most of situations like this one.