Monday, December 16, 2013

Porn for Linguists!

Finally! One thing in linguistics Len Talmy, Paul Postal, Noam Chomsky, and Joan Bresnan can agree on: At $12.99, The Speculative Grammarian Essential Guide to Linguistics is a modest last-minute Christmas gift that takes less effort to purchase than red sweaters with white fluffy trim, yay! Nerdy uncles around the world thank you!

At 10,700 single-spaced pages in 9-point Vera Sans Bold, this thin volume is a reminder of why my dissertation never quite fulfilled its promise, or never quite filled 50 pages, for that matter (can you say Ay Bee Dee, boys and girls?).

This volume of linguistic paraphernalia appears to be an elaborate sting designed to con some otherwise reputable institution into bestowing a commemorative matchbook cover on Trey Jones, a linguist best known for not being Terry Jones.

Out of kindness to the editors, I will refrain from discussing their shocking decision, vis-à-vis two white spaces after a period or one (I'll leave it to you, dear reader, to judge the depth of their depravity on your own). As to their policy regarding the Oxford comma, scandalous!

Am I paranoid, or was the blank page four a none-too-subtle homage to covert logical form? Obvious Chomskyan propaganda; I was disgusted.

'Tis not without its charms, though. A personal fave: Kean Kaufmann's cartoon depiction of when Daniel Jones discovered history's first cardinal vowel by plucking it, virginal and innocent, from his perfectly formed vowel space:

[Image: Kean Kaufmann's cartoon of Daniel Jones and the first cardinal vowel]

The volume also contains some rarely discussed dark moments in linguistics history, such as the catastrophic linguistic consequences of the 2004–5 NHL lockout on Canadian language production. So many "ehs" lost in time, like teardrops in the rain...

Rumor has it that Steven Pinker saw the book and immediately cried out, "Jones? TREY Jones? That guy owes me money!"

There are worse things you can do than spend $12.99 on pure linguistics fun.

Sunday, December 15, 2013

Why Big Data Needs Big Humanities

There's a new book out using Google's Ngrams and Wikipedia to discover the historical significance of people, places, and things: Who's Bigger? I have only taken a cursory glance at the web page, but it doesn't take a genius to see that the results look deeply biased, and it's no surprise why.

The two data sets they used, Wikipedia and Google's Ngrams, are both deeply skewed towards recent, Western data. Wikipedia's authors and editors are famously dominated by young, white, Western males. It's no surprise, then, that the results on the web page lean obviously towards recent, Western people, places, and things (not uniquely so, to be clear, but the bias is obvious imho).

The most glaring example is the complete absence of Genghis Khan from any of the lists. Khan is undeniably one of the most influential humans ever to have existed. In the book Destiny Disrupted: A History of the World Through Islamic Eyes, author Tamim Ansary referred to Khan as the Islamic world's Hitler. But he died in 1227 and mostly influenced what we in the West call the East.

Another example is the appearance of the two most recent US presidents, George W. Bush and Barack Obama, in the top ten of the top fifty most influential things in history. Surely this is a pure recency effect. How can this be taken seriously as historical analysis?

Perhaps these biases are discussed in the book's methodology section; I don't know. Again, this is my first impression based on the web page. But it speaks to a point I blogged about earlier in response to a dust-up between CMU computer scientists and UT Austin grad students:

"NLP engineers are good at finding data and working with it, but often bad at interpreting it. I don't mean they're bad at interpreting the results of complex analysis performed on data. I mean they are often bad at understanding the nature of their data to begin with. I think the most important argument the UT Austin team make against the CMU team is this (important point underlined and boldfaced just in case you're stupid):
By focusing on cinematic archetypes, Bamman et al.’s research misses the really exciting potential of their data. Studying Wikipedia entries gives us access into the ways that people talk about film, exploring both general patterns of discourse and points of unexpected divergence.
In other words, the CMU team didn't truly understand what their data was. They didn't get data about Personas or Stereotypes in film. Rather, they got data about how a particular group of people talk about a topic. This is a well known issue in humanities studies of all kinds, but it's much less understood in sciences and engineering, as far as I can tell."

One of the CMU team members responded to this with the fair point that they were developing a methodology first and foremost, and their conference paper was focused on that. I agree with that point. But it does not apply to the Who's Bigger? project, primarily because it is a book-length work that explicitly claims to be an application of computational methods to "measure historical significance". That is a bold claim.

To their credit, the authors say they use their method to study "the underrepresentation of women in the historical record", but that doesn't seem to be their main point. As the UT Austin grad students suggested above, the cultural nature of the data is the main story, not a charming subplot. Can you acknowledge the cultural inadequacies of a data set at the same time you use it for cultural analysis? This strikes me as unwise.

I acknowledge again that this is a first impression based on a web site.

UPDATE: Cass Sunstein wrote a thorough debunking of the project's methodology a few weeks ago, concluding that the authors "have produced a pretty wacky book, one that offers an important warning about the misuses of quantification."

Monday, December 2, 2013

Dictionary of American Regional English

One of the most useful things any research program in any field can do is provide a resource to other researchers. The Dictionary of American Regional English is a rich linguistic resource decades in the making, and it is now available online.

Here's a description from the project's About page:

The Dictionary of American Regional English (DARE) is a multi-volume reference work that documents words, phrases, and pronunciations that vary from one place to another across the United States.
Challenging the popular notion that our language has been "homogenized" by the media and our mobile population, DARE demonstrates that there are many thousands of differences that characterize the dialect regions of the U.S. 
DARE is based on face-to-face interviews carried out in all 50 states between 1965 and 1970 and on a comprehensive collection of written materials (diaries, letters, novels, histories, biographies, newspapers, government documents, etc.) that cover our history from the colonial period to the present. 
The entries in DARE include regional pronunciations, variant forms, some etymologies, and regional and social distributions of the words and phrases.
A striking feature of DARE is its inclusion in the text of the Dictionary of selected maps that show where words were found in the 1,002 communities investigated during the fieldwork.