Thursday, February 23, 2012

Why every Pub Quiz team should have a Classicist...

One benefit/drawback of playing pub quiz on Tuesday with a former ABD in Classics is that he just couldn't let this one go:
An arctophile is a lover of what?
Our team gave it a good try, but we fell flat on that question. Our weak, last-second stab-in-the-dark answer was "a lover of ice". Forgive us. Andy the Classicist was virtually inconsolable. To be fair, he nailed KT Tunstall's "Black Horse and the Cherry Tree", but c'mon, pop culture? No one is proud of pop culture. It should go without saying that Andy had his epiphany on the Metro the next day. He couldn't resist emailing us the play-by-play:
It hit me this morning that Arctic, actually "arktos" means "land of bears." I shoulda got this last night, but there aren't a lot of bears in Greece, and the Latin for bear "ursa" doesn't look like it is derived from arktos, though it does. The one that kills me is that I shoulda thought of the star Arcturus which would have led me to bear but I'm not a big astronomy buff.
What have we learned from this, children? Be kind to your favorite ABD Classicist. Their etymologies might be rusty, but they make up for it with knowledge of Indie Scottish singer/songwriters.

FYI, an arctophile is a lover of Teddy Bears.

Tuesday, February 21, 2012

I want my historical sentiment analysis ... and I want it NOW!

The hot new shiny thang in 21st Century NLP is undeniably sentiment analysis. It's being used to track pop star popularity, political approval, and corporate satisfaction. But as far as I can tell, all of the current efforts are focused on contemporary, real-time analysis of emerging trends and changes. However, I'm stuck tonight, desperately in need of automated historical sentiment analysis. Recently, the political fact-checking web site Politifact came under scrutiny by left-leaning US pundits like Rachel Maddow because of a "Mostly False" rating they gave to a promotional ad by MSNBC host Lawrence O'Donnell (video here). Here is as faithful a transcript of O'Donnell's words as I can make (the offending statement is in boldface):
Ya know, when the GIs came home after World War Two, six percent of the adults in this country were college graduates. The GI Bill pushed that up to twenty. The GI Bill put my father through college. He then was able to earn a living to put his five kids through college. It's the most successful educational program that we’ve ever had in this country, and the critics called it welfare.
Politifact rated the ad "Mostly False" here stating:
We found no evidence of critics referring to the GI Bill as welfare. Yet some fretted that the law’s unemployment compensation element would encourage laziness. We see a touch of truth to O’Donnell’s claim, which we rate Mostly False.
I was hoping to use simple, fast, freely available online tools to do some digging on this. I reasoned that Google Ngrams and BYU's COHA would be perfect tools for this job. Alas, they are not quite powerful or nuanced enough to tease apart the issues. Here's what I did:

  • Used Wiki to look up the original GI Bill, actually named the Servicemen's Readjustment Act of 1944.
  • Typed in "GI Bill" to Ngram Viewer. However, this tool does not allow collocates (I really want to see co-occurrences of "GI Bill" & "welfare").
  • Tried to use COHA to find collocates, but, sadly, that tool does not allow multi-word collocates ("You cannot have strings of two or more words in the [COLLOCATES] box.").
  • Tried a Google search on: welfare "GI Bill". This retrieved some hits, though too recent to count as examples of contemporary 1944 criticism.
  • When I Googled simply "Servicemen's Readjustment Act of 1944 welfare", I found several examples, including what purports to be contemporaneous reporting at As They saw It ("News & articles published shortly after events occurred, they reflect the information available at that time and how people reacted"), but I cannot verify the veracity of this site. Nonetheless, it does seem to support two ideas: the word "welfare" was commonly used in reference to the GI Bill, and its use was overwhelmingly positive: 1944: Social Service, Public.
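
The multi-word collocate search I wanted from Ngram Viewer and COHA can be roughed out by hand, assuming you already have plain text to search. This is only a minimal sketch: the function name `find_collocations`, the window size, and the sample sentence are all made up for illustration (the sample is not a real 1944 quote), and neither tool exposes anything like this.

```python
import re

def find_collocations(text, phrase, target, window=5):
    """Return the contexts where `target` appears within `window`
    tokens of the multi-word `phrase` (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)
    target = target.lower()
    hits = []
    for i in range(len(tokens) - n + 1):
        # Does the phrase start at position i?
        if tokens[i:i + n] == phrase_tokens:
            start = max(0, i - window)
            end = min(len(tokens), i + n + window)
            context = tokens[start:end]
            if target in context:
                hits.append(" ".join(context))
    return hits

# Invented sample text, for illustration only.
sample = ("Some critics called the GI Bill a kind of welfare. "
          "Others praised the law.")
hits = find_collocations(sample, "GI Bill", "welfare", window=5)
for hit in hits:
    print(hit)
```

A wider window catches looser associations but also more noise, and nothing here says whether "welfare" is being used as praise or as criticism.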

But this exposed a fundamental problem: in 1944, "welfare" didn't appear to have a pejorative sentiment. In fact, it seems to have had a positive sentiment. Even if I had full access to huge historical corpora from which to extract instances of "GI Bill" and "welfare" collocates, how would I distinguish between positive uses of "welfare" and negative ones? To do this properly, I'd need quality sentiment analysis (also, I fear there'd be no small amount of Perl scripting involved. And, ya know, Perl is like the Merlot of programming. It's okay as a table wine when you're too tired to go to the store for a nice Tempranillo or Cab Franc, but really, you'd never brag about drinking Merlot).
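
To make the problem concrete, here is a minimal sketch of the crudest possible approach: score the words around "welfare" against small lists of cue words. Everything here is an assumption for illustration. The `POSITIVE` and `NEGATIVE` sets are hand-picked and hypothetical, not a real sentiment lexicon, the function `context_polarity` is invented, and a serious historical analysis would need cue words validated against 1940s usage.

```python
import re

# Hypothetical, hand-picked cue words; NOT a real sentiment lexicon.
POSITIVE = {"promote", "benefit", "support", "aid", "protect"}
NEGATIVE = {"laziness", "handout", "dole", "abuse", "idle"}

def context_polarity(sentence, target="welfare", window=6):
    """Label the words within `window` tokens of the first `target`
    as 'positive', 'negative', or 'neutral'; None if target is absent."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if target not in tokens:
        return None
    i = tokens.index(target)
    context = tokens[max(0, i - window): i + window + 1]
    # Count cue words on each side and compare.
    score = (sum(t in POSITIVE for t in context)
             - sum(t in NEGATIVE for t in context))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Invented example sentences, not historical quotes.
print(context_polarity("The act will promote the welfare of returning veterans"))
print(context_polarity("Critics feared this welfare would encourage laziness"))
```

Word counting like this ignores negation, irony, and shifts in meaning over time, which is exactly why the job calls for real sentiment analysis rather than a cue-word list.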

In the end, I was unable to use freely available, online corpus linguistic tools to properly evaluate Politifact's rating because the tools have not yet matured enough to provide quite the right analysis. However, the good news is that these tools could very easily be enhanced using existing technologies to do exactly the job needed.

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000 foot view. I've never directly used them outside a handfu...