Week 7

Saturday was Independence Day, and it wasn’t until earlier this week that I realized I was in the city for the holiday. They signed the Declaration here! This is why we even have the 4th of July (even if the date is a little off). How cool is it that I got to be here to celebrate it in view of Independence Hall? I saw a replica Revolutionary War Encampment, a concert right outside Independence Hall, and a fireworks show above the Philadelphia Museum of Art (it couldn’t compare to Thunder over Louisville in my heart, but it was lovely nonetheless). I have a picture of the concert set up on the right; I still find it hilarious that the Declaration and a Founding Father’s statue were the backdrop for a concert.

Because of the holiday, we had a short week at work, but it was not unproductive. For the most part, I took a break from working with the NYT and PubMed data, as I had gotten too frustrated with errors involving the file sizes to properly solve them. Instead, Wenli, Lily, and I worked on ideas to make our data presentable and ran new analyses on our annotated corpus, both to find trends that could represent potential features for specificity classification and to verify annotator agreement on the sentence classification and on the phrases that the annotators believe to be under-specified.

Agreement on the sentence classification has increased since we changed the task instructions, and it looks promising according to pairwise annotator correlations. Agreement on the under-specified phrases is not as strong, but the vast majority of those phrases are noun phrases, which is interesting because it may point to the information readers consider most essential in a sentence, namely the “who and what” of a situation. We’ve looked into classifying adjectives into two categories: those that require prior knowledge to compare against (tall, stronger) and those that do not (equal, round). However, adjectives make up less than 2% of the words selected as under-specified, so this information may not be very useful as a feature for classification. It’s such an interesting idea, but ultimately we should focus only on what will be relevant and useful, and leave ideas like this for another day, since the data suggests it will have little effect on our results. This is especially true this far into the internship, but it’s still a hard lesson to learn!
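For anyone curious what a pairwise annotator correlation check might look like, here is a minimal sketch. It is not our actual analysis code: the annotator names and ratings are made up for illustration, and I am assuming each annotator gave a numeric specificity rating to the same sentences in the same order, with Pearson correlation as the agreement measure.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one annotator gives constant ratings.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(ratings):
    """Map each pair of annotators to the correlation of their ratings."""
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for a, b in combinations(sorted(ratings), 2)
    }


# Hypothetical ratings: one list per annotator, one rating per sentence.
ratings = {
    "annotator_a": [1, 2, 3, 4, 5],
    "annotator_b": [2, 4, 6, 8, 10],
    "annotator_c": [5, 4, 3, 2, 1],
}

scores = pairwise_correlations(ratings)
for pair, r in scores.items():
    print(pair, round(r, 3))
```

Values near 1 mean two annotators rank the sentences similarly, and values near 0 or below suggest they are reading the instructions differently.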