Monday, March 31, 2008

Speaking English

Steven Levitt, Freako-economist, posted this tempting morsel recently:

I got an email the other day from a blog reader who tells me that there are now more non-native English speakers than native English speakers.

Having silly expectations of writers, I foolishly assumed Levitt would tell us all WHERE this fact held true. If he is referring to the U.S., then it's quite a remarkable claim. China, not so much. He seems to be claiming that some change has occurred where a once predominantly English-speaking country is no longer so. Unfortunately, his post never answers this; rather, he is just looking for a cute way to transition from a story about Malaysian baby names to a modestly humorous email about Jello. It's a blogger's prerogative to tease readers into reading on, so no harm done.

But I can't help wondering just what he was referring to in his introductory sentence. Has Malaysia ever been predominantly English speaking? As far as I know, no. The current Ethnologue report says this: "National or official language: Malay. Also includes Burmese, Chinese Sign Language, Eastern Panjabi (43,000), Malayalam (37,000), Sylheti, Telugu (30,000)."

No English.

So, can any of you, dear readers, come up with a once predominantly English-speaking country that is no longer so? A nice little challenge.

Monday, March 24, 2008

Google Linguistics

Erin made the following well-taken point in a comment to this earlier post:

This appeal to the authority of Google is troublesome in linguistics, since we often refer to Google results for evidence for hypotheses about usage. That is documents indexed by Google as a data source, rather than its search results as authoritative figure, of course, but this may not be obvious to the average Joe. :\

I have used Google repeatedly to find instances of constructions that I could not find using standard corpus linguistics methods with hand compiled corpora like the BNC. Typically I’m looking for any instance, just to prove people really do say the thing I’m claiming is possible. For example, I needed to find some examples of passivized complements embedded under 60 different barrier verbs following this pattern:

a. I banned John from being examined by the doctor.
b. I banned John from getting examined by the doctor.

Many of the verbs I wanted to search for are low frequency in the BNC (e.g., barricade, derail, hamper), so the likelihood of finding examples of passivized complements using, say, a Tgrep2 search is low. So, I ventured into the scary land of Google Linguistics. I used the search queries “verbed * from being” and “verbed * from getting”. Within a short time, I had multiple examples for most of the verbs I was looking for. I can’t imagine performing this task more efficiently with any other tool. Google really worked well under those circumstances.
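
Query strings of this shape could be generated in bulk along the following lines. This is only a sketch: the verb list is a small sample, and `past_tense` is a hypothetical, deliberately naive helper that handles regular verbs only (an irregular verb like "keep" would come out wrong).

```python
# Sketch: build wildcard phrase queries for a list of barrier verbs.
# The verb list and the past_tense helper are illustrative assumptions.

def past_tense(verb: str) -> str:
    """Naive regular past tense: ban -> banned, barricade -> barricaded."""
    vowels = "aeiou"
    if verb.endswith("e"):
        return verb + "d"
    # Double the final consonant of short consonant-vowel-consonant verbs.
    if (3 <= len(verb) <= 4
            and verb[-1] not in vowels + "wxy"
            and verb[-2] in vowels
            and verb[-3] not in vowels):
        return verb + verb[-1] + "ed"
    return verb + "ed"


def build_queries(verbs):
    """Return one quoted phrase query per verb and per template."""
    templates = ['"{v} * from being"', '"{v} * from getting"']
    return [t.format(v=past_tense(verb)) for verb in verbs for t in templates]


verbs = ["ban", "barricade", "derail", "hamper"]
for query in build_queries(verbs):
    print(query)
```

Each printed line can be pasted into a search box as a quoted phrase; the results still need to be read by hand, as described in the next paragraph.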

Let me note that I have not used Google hit counts or page counts to derive any statistics regarding frequency of occurrence, though. When I do this sort of thing, I’m careful to use my common sense to decide if a return is from a native speaker or not, and often what I do is skim a page to see if there are any obvious ESL errors. Also, I use my own intuition regarding the acceptability of a usage (by pure coincidence, Peter Ludlow from U. Toronto will be here in Buffalo this week giving a talk on the role of linguistic intuitions).

One of the more thorough discussions of the use of search engines in linguistics research is Adam Kilgarriff’s “Googleology is bad science”, a squib from Computational Linguistics (2007, v33, 1).

He writes that the web is attractive to linguists because it is “enormous, free, immediately available, and largely linguistic”. But, he points out four major flaws:

1. search engines do not lemmatise or part-of-speech tag
2. search syntax is limited
3. there are constraints on numbers of queries and numbers of hits per query
4. search hits are for pages, not for instances.
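
The fourth point is easy to see with a toy example. The sketch below uses three made-up "pages" (plain strings, not real search results) and counts the same pattern both ways; the pattern and the page texts are assumptions chosen purely for illustration.

```python
import re

# Three invented "pages"; the first contains the construction twice.
pages = [
    "They banned him from being seen. Later they banned her from being heard.",
    "Nobody banned them from being there.",
    "This page never uses the construction.",
]

pattern = re.compile(r"\bbanned \w+ from being\b")


def page_hits(pages, pattern):
    """Count pages with at least one match (what a search engine reports)."""
    return sum(1 for page in pages if pattern.search(page))


def instance_hits(pages, pattern):
    """Count every match on every page (what a corpus linguist wants)."""
    return sum(len(pattern.findall(page)) for page in pages)


print("pages:", page_hits(pages, pattern))          # pages: 2
print("instances:", instance_hits(pages, pattern))  # instances: 3
```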

Kilgarriff offers this alternative: “work like the search engines, downloading and indexing substantial proportions of the web, but to do so transparently, giving reliable figures, and supporting language researchers’ queries”

The squib goes on to detail how we might go about doing that in a principled way. It’s well worth the read.

Wednesday, March 19, 2008

"According to Google,..."

Being both a poker player and former writing teacher, I am better acquainted than most with just how stupid the average person is. The fear that this day would come has lurked in my mind for some time, but today, I re-discovered the ugly truth that people just don't understand even the most basic tenets of reason, research, skepticism, and critical thinking.

Through a series of blog links, I happened onto the comment thread for a popular TV/radio talk show host's web page (I refuse to link to it). The topic regarded one of the current presidential candidates' alleged ethnicity (clearly false/ridiculous hypotheses peppered the thread). I have long since been accustomed to idiocy regarding high profile public figures, so none of this interested me, until I skimmed past one commenter whose attempt at validating the allegation started with "According to google,..." and proceeded to quote some unspecified website. This would be a classic case of argument from authority were it not for the fact that the Google search engine itself was being treated as the authority in question.

If Google returns it, it must be true.

There is a scary group of idiots out there who, deep in their hearts, believe that Google magically filters their search returns for QUALITY. Hence, Google is being treated as a primary source.

"Burn down the mission, if we're gonna stay alive..."

Monday, March 17, 2008

The Ling-O-Sphere Revisited

In December I posted about an idea regarding my desire to see a linguistics blog aggregator that "automatically checks a given set of linguistics websites, then updates a topic cloud which clusters posts according to relevance for a particular topic" (see my full post and relevant comments here).

I see now that William Cohen at his Cranial Darwinism blog has recently posted two new academic papers on the automatic discovery of blog topics (aka, latent topic modeling) as well as automatic methods of modeling blog influence. Daume has posted on related topics in the past as well (see here for one relevant post).

Having skimmed the first paper a bit, I see lots of scary words and phrases like "Latent Dirichlet Allocation" and "probabilistic framework"; I'm neck deep in finishing my dissertation (or failing to finish it; I'll be able to distinguish the two in about 3 weeks), so my interest in struggling through challenging papers is low, but they look well worth the read ... someday ... sigh.

Sunday, March 16, 2008

On The Cognitive Properties of Skin

After posting on the voiceless phone call story below, I began searching around for more information on how the device actually works. Failing to find any relevant patents pending (suspicious, I thought), I began searching for information on Michael Callahan, the wunderkind who appears to be the principal inventor, though many are probably involved.

After some searching, the most specific information I have yet found on the technology behind the voiceless phone was found in this article from the University of Illinois at Urbana-Champaign Engineering department website. Note the passage I have emphasized:

“Once we hit upon the idea of direct input, we were off and running,” explained Thomas Coleman, a project team member. The young researchers discovered that information sent from the brain can be accurately measured through the conductive properties of the skin. Typically, according to Coleman, these measurements are obtained through rigid metallic electrodes which neither respond to natural movements of the body nor to increasing skin moisture. They often become very uncomfortable under prolonged use.

“Our system uses proprietary technology to gather neurological information through encapsulated conductive gel pads, shielding the embedded electrode from the skin,” Coleman said. “‘The Audeo’ device we developed applies gentle pressure over the vocal cords, while the form-fitting band automatically adjusts in diameter, accommodating head and neck movements to maintain efficient contact.”

From there, team members created a computer program, which reads the intercepted neurological signals, and communicates a ‘response,’ both on the screen and as an audio signal. Initial work centered on determining the differences between a ‘yes’ and a ‘no’ response, which could be recognized by the computer. The software has since been enhanced to effectively ‘learn’ and adapt to the user’s neurological signals without the need of extensive training. The equipment analyzes the user during a one-time calibration process and generates a personalized user identity.
(my emphasis; quote marks had to be manually inserted to replace funny characters, but I tried to represent the original faithfully)

This is how far removed from serious neuroscience I am. I had no clue. I realized some information could be gathered from the skin, like Galvanic skin response, but I must say I’m shocked to learn that phonemic information regarding unarticulated utterances can be retrieved from the skin around a person’s neck. Clearly, there is more to this story. I’ll keep digging.

Friday, March 14, 2008

Wireless Phone Calls and Speech Production

There is a new viral video going around involving a “voiceless phone call”. Tom Simonite writes on

A neckband that translates thought into speech by picking up nerve signals has been used to demonstrate a "voiceless" phone call for the first time.

With careful training a person can send nerve signals to their vocal cords without making a sound. These signals are picked up by the neckband and relayed wirelessly to a computer that converts them into words spoken by a computerised voice.

The system demonstrated at the TI conference can recognise only a limited set of about 150 words and phrases, says Callahan, who likens this to the early days of speech recognition software.

At the end of the year Ambient plans to release an improved version, without a vocabulary limit. Instead of recognising whole words or phrases, it should identify the individual phonemes that make up complete words.

I have no clue how this actually works (there’s an HMM in there somewhere, right?), but its implications for models of speech production ought to be significant. The folks over at Haskins Lab ought to be interested, I should think.

(HT Andrew Sullivan)

Here's the video. Cool stuff.

Tuesday, March 11, 2008

On Crowdsourcing and Linguistics

Rumbling around in my head for some time has been this question: can linguistics take advantage of powerful prediction markets to further our research goals?

It's not clear to me what predictions linguists could compete over, so this remains an open question. However, having just stumbled onto a service designed to harness the power of crowdsourcing called Mechanical Turk (HT Complex Systems Blog), I'm tempted to believe this somewhat related idea could be useful very quickly to complete large scale annotation projects (something I've posted about before), despite the potential for lousy annotations.

The point of crowdsourcing is to complete tasks that are difficult for computers, but easy for humans. For example, here are five tasks currently being listed:

1. Create an image that looks like another image.
2. Extract Meeting Date Information from Websites
3. Your task is to identify your 3 best items for the lists you're presented with.
4. Describe the sport and athlete's race and gender on Sports Illustrated covers
5. 2 pictures to look at and quickly rate subjectively

It should be easy enough to crowdsource annotation tasks (e.g., create a web site people can log in to from anywhere which contains the data with an easy-to-use interface for tagging). "Alas!", says you, "surely the poor quality of annotations would make this approach hopeless!"

Would it?

Recently, Breck Baldwin over at the LingPipe blog discussed the problems of inter-annotator agreement (gasp! there's inter-annotator DIS-agreement even between hip geniuses like Baldwin and Carpenter? Yes ... sigh ... yes there is). However (here's where the genius part comes in) he concluded that, if you're primarily in the business of recall (i.e., making sure the net you cast catches all the fish in the sea, even if you also pick up some hub caps along the way), then the reliability of annotators is not a critical concern. Let's let Breck explain:

The problem is in estimating what truth is given somewhat unreliable annotators. Assuming that Bob and I make independent errors and after adjudication (we both looked at where we differed and decided what the real errors were) we figured that each of us would miss 5% (1/20) of the abstract to gene mappings. If we took the union of our annotations, we end up with .025% missed mentions (1/400) by multiplying our recall errors (1/20*1/20)–this assumes independence of errors, a big assumption.

Now we have a much better upper limit that is in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. Basically we should take annotations from all remotely qualified annotators and not worry about it. We know that is going to push down our precision (accuracy) but we are not in that business anyway.
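
The arithmetic behind that estimate is easy to check. Here is a minimal sketch, keeping the same (big) assumption that annotators miss items independently; the function name is made up for illustration. Note that 1/20 times 1/20 is 1/400, which comes to 0.25% when written as a percentage.

```python
from fractions import Fraction


def union_miss_rate(miss_rates):
    """Chance that every annotator misses an item, assuming independent errors."""
    result = Fraction(1)
    for rate in miss_rates:
        result *= rate
    return result


# Two annotators who each miss 1 in 20 mentions.
two = union_miss_rate([Fraction(1, 20), Fraction(1, 20)])
print(two, float(two))                                # 1/400 0.0025
print(f"recall of the union: {1 - float(two):.4f}")   # recall of the union: 0.9975
```

Adding a third annotator with the same miss rate would multiply in another 1/20, which is why piling up "remotely qualified" annotators pushes recall toward 100% so quickly under the independence assumption.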

Unless I've misunderstood Baldwin's post (I'm just a lousy linguist mind you, not a genius, hehe) then the major issue is adjudicating the error rate of a set of crowdsourced raters. Couldn't a bit of sampling do this nicely? If you restricted the annotators to, say, grad students in linguistics and related fields, the threshold of "remotely qualified" should be met, and there are plenty of grad students floating around the world.
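
One way the sampling idea might look in practice: an expert adjudicates a random sample of a rater's annotations, and the rater's error rate is estimated from that sample alone. The sketch below is hypothetical throughout. The simulated rater, the sample size, the `is_error` callback standing in for the expert, and the 95% normal-approximation interval are all assumptions made for illustration.

```python
import math
import random


def estimate_error_rate(annotations, is_error, sample_size, seed=0):
    """Estimate a rater's error rate from an adjudicated random sample.

    is_error stands in for the expert: it returns True when an annotation
    is judged wrong. Only the sampled annotations are ever adjudicated.
    Returns (estimate, margin) with a 95% normal-approximation margin.
    """
    rng = random.Random(seed)
    sample = rng.sample(annotations, sample_size)
    errors = sum(1 for annotation in sample if is_error(annotation))
    p = errors / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


# Simulated rater: 1,000 annotations, 50 of them wrong (a 5% true error rate).
annotations = [{"id": i, "wrong": i < 50} for i in range(1000)]
p, margin = estimate_error_rate(annotations, lambda a: a["wrong"], sample_size=200)
print(f"estimated error rate: {p:.3f} +/- {margin:.3f}")
```

The normal approximation gets shaky when the sample contains very few errors, so a real project would probably want a larger sample or a different interval for its best raters.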

This approach strikes me as related to the recent revelations that Wikipedia and Digg and other groups that try to take advantage of web democracy/crowd wisdom are actually functioning best when they have a small group of "moderators" or "chaperones" (read Chris Wilson's article on this topic here).

So, take a large group of raters scattered around the whole wide world, give them the task and technology to complete potentially huge amounts of annotations quickly, chaperone their results just a bit, and voilà, large scale annotation projects made easy.

You're welcome, hehe.

Sunday, March 9, 2008

Jason Wins, hehe

As if it wasn’t obvious, I decided to reiterate Jason’s point from the previous post, regarding the ante-previous post, by taking my post and running it through Google’s English to Italian translation. A thing of beauty, haha. Enjoy:

Invece di commentare i miei commenters per quanto riguarda il mio post Blog di Amore, stile italiano, ho deciso di fare questo è un post --

In risposta a Jason's acerbic commento "Credo che la più grande macchina di traduzione è stato solo uno scherzo, la pubblicazione della traduzione automatica. :) ",

Con la presente risposta nel seguente modo:

Non essere talkin 'trash' bout mio prezioso Google traduzioni; senza di loro, non potrei mai leggere la mia e-mail amico spagnolo Ana invia. Il suo inglese è peggiore di quella di Google traduzioni, in modo I'll take Google (rimshot!).

E lei non crede che ci sia qualcosa di poetico nella prima riga. Ho potuto vedere alcuni 20th Century poeta americano Wallace Stevens iscritto come questo:

Abbiamo aspettato mesi e mesi
In attesa di Titlepage dolce,
Il sito dovrebbe offrire conversazioni
(E perché non parlare)
Ardente e appassionato editoriale
Le ultime notizie, un nuovo modello
Algonquin Round Table

On Google Translations

Instead of commenting to my commenters regarding my post Blog Love, Italian Style, I decided to make this its own post:

In response to Jason’s acerbic comment “I think the biggest machine translation joke was just posting the machine translation itself. :)”,

I hereby reply thusly:

Don't be talkin' trash 'bout my precious Google translations; without them, I could never read the emails my Spanish friend Ana sends. Her English is worse than the Google translations, so I'll take Google (rimshot!).

And don't you think there is something poetic in the first line? I could see some 20th Century American poet like Wallace Stevens writing this:

We have waited months and months
In sweet Titlepage Pending,
The site should offer conversations
(and why not talk)
Passionate and fiery editorial
On the latest news, a new model
Algonquin Round Table

Thursday, March 6, 2008

Blog Love, Italian Style

Sitemeter consistently shows referrals to my blog from the Italian language blog Taccuino di traduzione 2.0, which Google translates as Translation Notebook 2.0. Unfortunately, I lack Italian language skills, so I am unable to enjoy the blog's postings. But I thought I'd pass it along to any of you who may wish to indulge. The latest post has a great painting of the famed Algonquin Roundtable titled "A Vicious Circle" by Natalie Ascencios.

I mean no offense to the superior original, but my lack of Italian drove me to Google translate the whole post. Reading this poor translation makes me want to run out, study Italian real quick, then read the rest of the blog:

We have waited months and months in sweet Titlepage pending, the site should offer conversations (and why not talk) passionate and fiery editorial on the latest news, a new model Algonquin Round Table, with videointerviste choirs, forums on different literary genres For readers who do not give up ever, a blog, reviews, reports, awards, cotillons and who knows what else.

All false promises. Although well prepared on the subject, the presenter (which surely read as a young Hamlet in jeans and black sweater, in some alternative theatre company) is uncomfortable in front of the camera (average training, anyone?), The writers guests look around terrified, set design probably is the work of a student to first weapons, the conversation is woody, boring and, above all, language, not to mention the editing of footage (used scissors?). A great sin. But this can only improve.


Tuesday, March 4, 2008

an ear for accents

This woman has a gifted "ear" for accents. She starts in England, moves through Europe, on to Australia, then makes her way from west to east through North America. I'll note that her Texas and South Carolina are pure stereotype, but damn she nails California and Toronto.

(HT Andrew Sullivan)

"yeah right" again

Eureka! I posted about the prosody of the phrase "yeah right" some time ago here. In particular, I claimed there are three interpretations of the phrase, but I don't have one of them in my dialect (Northern California), namely what I called "back-channel (sentiment agreement)", which is roughly equivalent to ‘mm-hmm’. However, I had no sound files. Now I've found a near perfect example of this mystery prosody in the trailer for Juno, about 36 seconds in (here).

You can also read my most excellent review of Juno here.
