Monday, January 31, 2011

the false narrative of small town slang

There is a common critique of journalists that they often let an internal narrative color their reporting, to the point where they simply parrot back the narrative in their head rather than report the facts on the ground (see here for a discussion of this).

My hometown of Chico got its moment in the sun recently because its favorite son Aaron Rodgers is the star quarterback of the Packers, about to play in the Super Bowl. Unfortunately, the NYT's article is a near-perfect example of journalists letting a narrative do the talking when the facts blatantly contradict their claims:

The usual slang words like awesome or cool are not heard much. Nice is in. As in: “You won the lottery? Nice.”

The narrative this spins is that small towns are all Mayberrys where everyone is pure and innocent and righteous and better than them damned city-folk. It has been invoked routinely in political reporting.

I'm a Chico boy. I graduated from Chico Jr. High and walked across the street and graduated from Chico High, then walked across the street and graduated from Chico State*. And I can assure you that awesome and cool are every bit as frequent there as anywhere else (personally, I had an unhealthy fondness for hella back in 1987). And believe me, if you won the lottery in Chico, no one would say nice. They would say, "No fukkin way! No fukkin way! Really! No fukkin way!" ... just like everywhere else.

*no joke, those three schools are literally across the street from each other.

how (not) to do linguistics

Jonah Lehrer, the neuro-blogger, has a mixed track record, as far as I'm concerned. His initial blogging was nice, but a tad lightweight, then he started to sound a bit too Malcolm Gladwell-y (in that I wasn't entirely sure he knew what he was talking about beyond having a few short phone calls with one or two scientists, then babbling on about a topic).

But he's hit a home run with this long New Yorker piece about the failure of the journal review process in science: The Truth Wears Off. He draws examples from medicine, physics, and psychology.

Perhaps the most disappointing part is the realization that the standards of testing and conclusiveness in linguistics are so far from those in more established sciences.

Before the effectiveness of a drug can be confirmed, it must be tested and tested again. Different scientists in different labs need to repeat the protocols and publish their results. The test of replicability, as it’s known, is the foundation of modern research. Replicability is how the community enforces itself. It’s a safeguard for the creep of subjectivity.

Repeating studies is virtually unheard of in linguistics. Lehrer also mentions the publication bias in journals: when a result is first discovered, there is a bias towards positive findings; after a while, once the result is accepted, only negative results get published, because only those are "interesting" anymore. But I would expand this point: the same bias exists at every stage of the research process. We want to find things that happen; we don't care about spending 5 years and thousands of hours discovering that X does NOT cause Y! So when young grad students begin scoping out a new study, they throw away anything that doesn't seem fruitful, where fruitful is defined as yielding positive results. This bias affects the very foundation of the research process, namely answering the basic question: what should I study?

As a side note, engineers seem perfectly happy to follow through on null results. They need to know the full scope of their problem before solving it. Scientists can learn a lot from engineers (and vice versa).

[Psychology professor Jonathan] Schooler recommends the establishment of an open-source database, in which researchers are required to outline their planned investigations and document all their results. “I think this would provide a huge increase in access to scientific work and give us a much better way to judge the quality of an experiment,” Schooler says. “It would help us finally deal with all these issues that the decline effect is exposing.”

Coincidentally, I was recently tweeting with moximer and jasonpriem about this, and we agreed that research wikis are worth exploring. My vision would be something akin to Wikipedia, but where a researcher stores all of their data, stimuli, results, etc., finished or not. The data could be tagged as tentative, draft, failed, successful, etc. As the research goes on, the data get updated. Not only would this record failure (which, as Lehrer points out in the article, is as valuable as success), it would also record change. How did a study evolve over time?

True, the data would become huge over time across many disciplines, but that just means we need better and better data mining tools (and the boys at LingPipe are working away at those tools).

HT rapella

Friday, January 28, 2011

the sociolinguistics of height in China

Ingrid at Language on the Move has some thoughtful comments on the relationship between height and learning English in China. If you're under 1.6 meters, forget it. There are subtle but very real socioeconomic barriers in your way. Money quote:

I have supervised research related to English language learning and teaching in China for almost a decade and have read most of the research on the topic published in English. However, never before have I come across the importance of height. I take this as evidence for the importance of doing ethnographic research. Otherwise, what is the point of doing sociolinguistic research if you can’t discover anything you hadn’t already decided in advance would be important?!

I taught EFL in China back in 1998 at a private school in Guangzhou catering mostly to working professionals. Much has changed since then, as China has changed so much. I don't recall ever talking about height as a factor, but certainly cost and hours were a significant issue that made it virtually impossible for any poor workers to consider taking English classes. As a 6 foot 4 blond American, though, I was treated like a rock star. It was kinda weird.

chomsky and performance art

Artist Annie Dorsen has created a chatbot performance piece around the debate between Noam Chomsky and Michel Foucault on Dutch TV in 1971 (videos here).

This snippet is mildly interesting, but I couldn't help wondering what technology was used, especially the speech synthesis, because, frankly, it's a bit clunky and old-fashioned. Perhaps that's part of the point. The computer screens appear to be running DOS shells too. Nothing wrong with that, purists will likely prefer it even, but combined with the clunky speech, the performance appears to be trading on a very outdated computational linguistic aesthetic. Truly the desert of the real? (okay, I had to throw in some Baudrillard)

Thursday, January 27, 2011

can a machine learn jazz?

There's a contest dedicated to trying to answer that question: ISMIS 2011 Contest: Music Information Retrieval.

Computer scientists and engineers have long used contests and bake-offs to stimulate cutting edge research in linguistics (e.g., MUC), but linguists have lagged in this department. You rarely if ever hear about contests that pit one linguistic theory against another using a standardized data set (or maybe I've just missed them).

Nobel prize-winning economist Joseph Stiglitz argues here that prizes are good for stimulating academic research. I agree wholeheartedly and would like to see more direct competition between theorists. Exactly how a contest would be constructed is up for debate (I have a vague memory of some group trying to devise criteria by which to evaluate linguistic theories, maybe out of UCLA, but I can't seem to track it down; it's a remarkably difficult Google query to form).

HT: Jochen L. Leidner

the linguistics of heaven and hell

The value of pop culture data for legitimate research is being put to the test. Exactly what, if anything, can the reality show Big Brother tell us about language change over time?

Voice Onset Time (VOT) is a measure of how long you wait to begin vibrating your vocal folds after you release a stop consonant. Voiced stop consonants like /b/ and /d/ require two things: 1) stop all airflow from escaping by closing the vocal tract (at the lips for /b/, at the alveolar ridge for /d/) and 2) after the air is released, begin vibrating the vocal folds (using the rushing air). For non-linguists, think of a garden hose. Imagine you use your thumb to stop the water for a second and let the pressure build, then you let go and water rushes out, but then you use your thumb to clamp down just a bit on the water to spray it. This is kinda like the speech production of voiced stop consonants in human language.
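Since VOT is just a duration, the measurement itself is trivial once you have annotated time points. Here's a toy sketch (the function names and numbers are my own invention, and the ~25 ms voiced/voiceless boundary is a rough heuristic, not a universal constant):

```python
# Toy illustration of Voice Onset Time (VOT): the interval between the
# release of a stop closure and the onset of vocal fold vibration.
# Times are in milliseconds, as they might come from hand annotation.
# All names and values here are invented for illustration.

def vot_ms(release_ms, voicing_onset_ms):
    """VOT = voicing onset minus stop release.
    Clearly positive: voicing lags the release (typical English /p t k/).
    Near zero or negative: voicing starts at or before the release
    (typical English /b d g/, or prevoiced stops in other languages)."""
    return voicing_onset_ms - release_ms

def classify_stop(vot):
    """Crude two-way split using a common ~25 ms heuristic boundary."""
    return "voiceless (long-lag)" if vot > 25 else "voiced (short-lag)"

print(vot_ms(100.0, 160.0), classify_stop(vot_ms(100.0, 160.0)))  # 60 ms: long-lag
print(vot_ms(100.0, 110.0), classify_stop(vot_ms(100.0, 110.0)))  # 10 ms: short-lag
```

The hard part in practice is not this subtraction but locating the burst and the voicing onset in the waveform, which is why VOT studies lean on careful acoustic annotation.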


Though I’m no phonetician, I really like VOT as a target of linguistic study for one crucial reason: it’s a clear example of a linguistic feature that varies according to your human language system but which you do NOT have conscious control over. What that means is that you cannot consciously change the length of your own personal VOT. Go ahead, try it. Make your VOT 20 milliseconds longer. Go ahead, I’ll wait…

Of course you can’t. Well, not consciously, but what researchers have found is that your brain, quite independent of conscious will or knowledge, can! Lab studies have found that people will unknowingly alter their VOTs in certain situations, and the results are predictable. For example, when listening to a set of long-VOT stimuli, subjects will begin to lengthen their own VOTs, in essence accommodating the longer VOTs. Over the longer term, it has also been shown that people will lengthen their VOT over their lifetime to accommodate cultural shifts. It has been shown that Queen Elizabeth II herself had a longer VOT in her later life than in her younger days (few other people have been recorded consistently over a long enough period to provide such valuable data).

Here’s what Bane et al. did: They took recordings of confessional sequences from the UK reality TV show Big Brother (where groups of strangers are made to live with each other and occasionally speak to a camera alone, like a video diary) and tested what happened to 4 crucial individuals (the ones that stayed on the show long enough to provide several months’ worth of data points). What they found was that their VOTs did in fact change, though no linear pattern was discovered (i.e., they did not simply get longer in a steady line). This paper is labeled as a progress report because they don't have a firm hypothesis about what actually is happening. Nice trick there boys, ;)

They did find one interesting thing: During part of the show, the house mates were physically divided into basically a caste system where half the people were low caste and half were high (a heaven and a hell). And this seemed to have an effect on VOT as well (sociolinguists are slap happy about this, I'm sure).

I haven’t looked at the actual numbers very closely, but in section 6, they say “Housemate trajectories seem to diverge when the divide is present…” However, just taking a glance at Figure 3, it looks like they diverge at the beginning, then converge at the end, episode 65 (and remain somewhat similar until several episodes of non-DIVIDE have gone by). If my cursory glance is correct, I would assume it takes a while for the convergence to manifest, and then it persists for a while after DIVIDE is gone. But this is just me looking at the picture, not the actual data.
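For what it's worth, here's one crude way to put a number on "divergence" in a plot like this: the mean absolute pairwise difference between housemates' VOTs at each episode. To be clear, this is NOT Bane et al.'s method, and the numbers below are fabricated purely for illustration:

```python
# Sketch: quantify convergence/divergence as the mean |difference| in VOT
# over all speaker pairs at each time point. Not the authors' actual
# analysis; the episode data are invented.
from itertools import combinations

def mean_pairwise_spread(vots):
    """Mean absolute difference over all speaker pairs at one episode."""
    pairs = list(combinations(vots, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# episodes x speakers (VOT in ms), fabricated:
episodes = [
    [60, 62, 58, 61],   # early: tightly clustered
    [55, 65, 50, 68],   # during a divide: spread out
    [59, 61, 57, 62],   # later: converging again
]
spread = [round(mean_pairwise_spread(ep), 1) for ep in episodes]
print(spread)  # a larger value means more divergence at that episode
```

A single summary number like this would also make the "diverge then converge" question checkable against the actual data instead of my cursory glance at the figure.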

Finally, and this is just a readability point, but I would order the names in Figure 3 in the same order as the end point of each trajectory, making it easier to follow who is doing what.
Max Bane, Peter Graff, & Morgan Sonderegger (2011). Longitudinal phonetic variation in a closed system. Linguistic Society of America 2011 Annual Meeting.

Wednesday, January 26, 2011

more jobs for linguists

As the economy continues to grow (Dow over 12,000), so do the non-academic opportunities for linguists. Here's an interesting one for an Analyst in the California* Bay Area:

The Analyst looks for opportunities to improve our Natural Language and Directed-dialog applications using the data logged by them. The Analyst is primarily the responsible team member charged with analyzing the data and making new implementation recommendations [...] Besides analyzing our speech applications and improving our Analytics framework, you will also have the opportunity to carry out independent research, which forms a big part of the success of our speech applications [...]

*living in the metro DC region has taught this Northern California boy that there's more than one "Bay Area."

Call For Participation

More and more researchers are using the web to gather data for serious research, but they need your help as participants. As a proponent, I like to do my part and share the calls for participation that I know about. If you know of any others, I'm happy to add them:

MPI -- The Max Planck Institute for Psycholinguistics: Investigates how people use language.

Cue-word memories -- Clare Rathbone, University of Reading: The study is specifically interested in the way people remember events from their lives. You will be asked to recall 16 memories and then rate them for details, such as vividness and how often you have thought about the memories before. You will also be asked to fill in two short questionnaires. Please note, this questionnaire is for people over the age of 40 only - please do not take part if you are aged 39 or younger.

Phrase Detectives -- University of Essex: Lovers of literature, grammar and language, this is the place where you can work together to improve future generations of technology. By indicating relationships between words and phrases you will help to create a resource that is rich in linguistic information.

Games With Words -- Joshua Hartshorne, Harvard University: Test your language sense! Play a game while participating in cutting-edge research. How good is your language sense?

Color Naming -- Dimitris Mylonas, London College of Communication: This is a multi-lingual colour naming experiment. It is part of research on colour naming and colour categorisation within different cultures, and aims to improve the inter-cultural colour dialogue. By taking part you are helping us to develop an online colour naming model which will be based on the "natural" language provided from your responses.

CogLab 2.0 -- The Cognitive Psychology Online Laboratory: Aggregated set of many online research projects.

do we need parsed corpora?

Maybe not, according to Google: For many tasks, words and word combinations provide all the representational machinery we need to learn from text...invariably, simple models and a lot of data trump more elaborate models based on less data.

I've been wondering about this very issue for 5 years or so. When I first started collecting parsed BNC data for my defunct dissertation, I needed sentences involving various verbs and prepositions, but the examples I found were often of the wrong structural type because of preposition attachment ambiguity. I used Tgrep2 queries to find proper examples, but even then there were false positives, so I did some error correction. One of the more interesting discoveries I made was a relationship between a verb's role in its semantic class and its error rate.

I was trying to find a way to objectively define core members of a semantic verb class and peripheral members. I had a pretty good intuition about which were which, but I wanted to get beyond intuition (yes yes, it's all very Beth Levin).

For example, one of the objective clues for barrier verbs (a class of negative verbs encoding obstruction, like prevent, ban, exclude, etc) was the unusual role of the preposition from in sentences like these:
  • She prevented them from entering the pub.
  • He banned them from the pub.
  • They were excluded from the pub.
The preposition from is usually used to mark sources (He drove here from Buffalo) but in these sentences it's acting much more like a complementizer. This is fairly unique to barrier verbs and I felt it was distinctive of the verb class, so I wanted a bunch of examples. Because I needed to exclude examples involving old-fashioned source from, I used a Tgrep2 search that required the PP to be in a particular relationship to the verb (the BNC parse was a bit odd as I recall, and required some gymnastics).

Again, I had a lot of false positives even with Tgrep2, so I did some manual error analysis and discovered that certain verbs had very low error rates while others had very high rates, and the difference coincided nicely with my intuition about which verbs were core members of the class and which were peripheral: core members like prevent had very low error rates. This means that when prevent is followed by a from-PP, it's almost always the complementizer from. To adults it's obvious that the meaning of a barrier verb doesn't easily include source (necessary for old-fashioned from), but how would a kid learn that? If I ban you from the pub, how does a kid know the pub is NOT where you started (source) but rather the opposite, it's where you're not allowed to end up (goal)? Cool little learning problem, I thought ... and with a data set other than the frikkin dative (which Pinker and Levin have, let's face it, done to death).

I assumed there was something central to the meaning of the verb class that caused this special use of from. Then it occurred to me, if this is true, why do I need the parse? Imagine I ignore structure, take all sentences where from follows a relevant verb, then sample for false positives. That should give me basically the same thing.
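A minimal sketch of what that parse-free method might look like (the verb list and sentences are toy data; a real run would sweep the whole corpus and then hand-check a random sample of hits for false positives like nominal ban):

```python
# Sketch of the parse-free approach described above: instead of querying
# a treebank with Tgrep2, just collect sentences where a barrier verb is
# followed within a few words by "from", then sample hits for manual
# false-positive checking. Verb forms and sentences are invented.
import re

BARRIER_VERBS = ["prevent", "prevents", "prevented",
                 "ban", "bans", "banned",
                 "exclude", "excludes", "excluded"]

# verb, then at most three intervening words, then "from"
pattern = re.compile(
    r"\b(%s)\b(?:\W+\w+){0,3}?\W+from\b" % "|".join(BARRIER_VERBS),
    re.IGNORECASE)

sentences = [
    "She prevented them from entering the pub.",
    "He banned them from the pub.",
    "They were excluded from the pub.",
    "He drove here from Buffalo.",      # source 'from', no barrier verb: no hit
    "The ban from 1990 was lifted.",    # noun 'ban': a false positive to catch by hand
]

hits = [s for s in sentences if pattern.search(s)]
for h in hits:
    print(h)
```

The last example is the point of the sampling step: a bag-of-words search can't tell verbal banned from nominal ban, but if core verbs rarely take source from anyway, the error rate stays low enough to live with.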

I became increasingly fascinated with this methodology. I was now interested in how I was studying language, not what I was studying. And that led me to ask whether or not the parse info was all that valuable for other linguistic studies. But then I realized that when big news stories start getting old, the media always, always starts reporting on themselves, on how the news gets made ... I didn't like where I was heading ...

...and then I got a job and that was that ...

HT: Melodye

Tuesday, January 25, 2011

Obama's State Of The Union and word frequency

In anticipation of President Obama's 2011 State Of The Union speech tonight, and the inevitable bullshit word frequency analysis to follow, I am re-posting my post from last year's SOTU reaction, in the hope that maybe, just maybe, some political pundit might be slightly less stupid than they were last year ... sigh ... here's to hope ...

(cropped image from Huffington Post)

It has long been a grand temptation to use simple word frequency* counts to judge a person's mental state. Like Freudian slips, there is an assumption that this will give us a glimpse into what a person "really" believes and feels, deep inside. This trend came and went within linguistics when digital corpora were first being compiled and analyzed several decades ago. Linguists quickly realized that this was, in fact, a bogus methodology when they discovered that many (most) claims or hypotheses based solely on a person's simple word frequency data were easily refuted upon deeper inspection. Nonetheless, the message of the weakness of this technique never quite reached the outside world, and word counts continue to be cited, even by reputable people, as a window into the mind of an individual. Geoff Nunberg recently railed against the practice here: The I's Don't Have It.

The latest victim of this scam is one of the blogging world's most respected statisticians, Nate Silver, who performed a word frequency experiment on a variety of U.S. presidential State Of The Union speeches going back to 1962 HERE. I have a lot of respect for Silver, but I believe he's off the mark on this one. Silver leads into his analysis talking about his own pleasant surprise at the fact that the speech demonstrated "an awareness of the difficult situation in which the President now finds himself." Then he justifies his linguistic analysis by stating that "subjective evaluations of Presidential speeches are notoriously useless. So let's instead attempt something a bit more rigorous, which is a word frequency analysis..." He explains his methodology this way:

To investigate, we'll compare the President's speech to the State of the Union addresses delivered by each president since John F. Kennedy in 1962 in advance of their respective midterm elections. We'll also look at the address that Obama delivered -- not technically a State of the Union -- to the Congress in February, 2009. I've highlighted a total of about 70 buzzwords from these speeches, which are broken down into six categories. The numbers you see below reflect the number of times that each President used term in his State of the Union address.
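For concreteness, here's a toy version of that buzzword-counting methodology, with one wrinkle raw counts omit: normalizing by speech length, since speeches of different lengths aren't directly comparable. The buzzword list and text below are invented, not Silver's:

```python
# Toy buzzword-frequency counter in the style described above.
# The buzzword set and the "speech" are made up for illustration.
from collections import Counter
import re

BUZZWORDS = {"jobs", "economy", "freedom", "deficit"}

def buzz_rate(text, buzzwords=BUZZWORDS):
    """Return raw buzzword counts and counts per 1,000 words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in buzzwords)
    per_1000 = {w: 1000 * c / len(words) for w, c in counts.items()}
    return counts, per_1000

counts, rate = buzz_rate("Jobs, jobs, jobs. The economy needs jobs "
                         "and a smaller deficit.")
print(counts)   # raw counts
print(rate)     # counts per 1,000 words
```

Even with normalization, of course, the deeper problem remains: a raw tally says nothing about how a word is used, which is exactly why linguists abandoned this methodology.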

The comparisons and analysis he reports are bogus and at least as "subjective" as his original intuition. Here's why:

Saturday, January 22, 2011

the perils of translation: does und mean well?

I'm watching the truly powerful 2009 Oscar winning German film The White Ribbon on Netflix. Even after a few minutes it has grabbed me and impressed me with its simplicity and power, in the style of many great films. Hollywood used to make films like this. Films that mattered. Films that taught deep truths about what it means to be human. Films like "Inherit The Wind", "Guess Who's Coming To Dinner", "To Kill A Mockingbird". Now Hollywood makes three Jennifer Aniston rom-coms a year and panders to fan boys... but I digress ...

My German is pretty rusty, but the film's dialogue is simple enough (in a good way) for me to catch most of it even without the subtitles, which is exactly the source of the linguistic point I want to discuss. In one early scene, the narrator, a teacher, recounts an incident involving himself* and a student wherein the student was endangering himself, so the teacher demands the student explain his actions. When he fails to get a proper response, he says repeatedly Und? ... Und? ... German und is easily translated as and but the film's translators choose to use the English word well instead.
(screen grab from Netflix)

As a native speaker of English, I can see the reasoning behind well, yet I must say, it's equally plausible to use and in that situation as well, maybe even more so. The use of well in English would suggest a certain formality that the translator felt was proper, but it also makes me, as an English speaker, feel a bit awkward, like I'm being fed an anachronism. Perhaps that's appropriate for the movie, I'm not sure, it just struck me as an interesting linguistic choice. It's a nice example of the beautiful ambiguity of lexical items, really. For example, just a few scenes later the teacher encounters Eva and asks her about who she is and what he's heard about her, namely that she's a new nanny in town, and her response is und, but it is translated as English so:

(screen grab from Netflix)

Again, as a native speaker of English, I can "get" the translation, but still, I'd be perfectly happy with and in both. I've never been a translator and I have much respect for the difficult job professional translators do navigating these treacherous waters. I don't mean to second-guess. Rather, it strikes me as an interesting point of discussion.

*why can't I say hisself? Oh, where are you Jeff Runner when I need you!

Friday, January 21, 2011

like wikipedia with a voice?

It can be difficult to get a feel for what some tech start-ups are going for. This demo of Qwiki at a Tech Crunch event asks us to think of information as an experience. I'm pretty sure the voice is synthesized because of some odd prosody and the weird way Yelp is pronounced (oh, and the unlikelihood that they could pre-record all the possible narration ... yeah, that too). At the end, all I could think of was "it's like Wikipedia with a voice..."

Qwiki at TechCrunch Disrupt from Qwiki on Vimeo.

Tuesday, January 18, 2011

oh snap! daume talkin trash 'bout "stupid" penn tree bank

Hal Daume at his excellent NLPers blog is wondering aloud about parsing algorithms doing "real" syntax:

One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them [emphasis added].

Oh no he di'nt!

[UPDATE: hal responds thoughtfully in the comments and properly corrects my misunderstandings of his post.]

It's certainly fair to say that the Penn Treebank is not annotated for everything. Sure. But show me the perfect resource and I'll let you throw all the stones you want. More to the point, once you get beyond deciding what the basic chunks are (NPs, VPs, PPs, etc.), there's little agreement on what is and what is not a "real" syntactic thing. In order to annotate anything above this level, you have to choose a theoretical camp to park your tent in. You have to take sides. Daume is happy to be a Chomskyan. He's taken his side. Good for him.

In order to annotate Daume's beloved deep transformations, one must first admit such things exist. I do not. And if Daume started annotating the Penn Treebank with such things, I wouldn't care. I would argue he is wasting his time chasing unicorns.

Daume may believe that Chomskyan theory is "real" syntax, but I do not. Nor do most linguists (if you surveyed all linguists throughout the world, yes, I do believe a majority would disagree with the statement "I believe in Chomskyan deep structure").

UPDATE: Daume's comments and his responses are well worth reading.

the most difficult linguistics sentence ever?

Imagine I give you the sentence template that follows:
  • If speakers omit X to avoid Y, optional Z should be less likely if W.
Question: What X, Y, Z and W could possibly make that sentence EASIER to understand?

For no particular reason other than (that) I love linguistics and will read any free article that catches my fancy, I've been reading Florian Jaeger's Phonological Optimization and Syntactic Variation: The Case of Optional that. Submitted for Proceedings of 32nd BLS (pdf). I have nothing but respect for Jaeger as a linguist* and this is a very interesting paper that I have enjoyed reading**. But flo*** has a knack for producing very difficult to read sentences. Here's the original that produced the template above:

If speakers omit optional that to avoid segmental OCP violations with the immediately preceding or following segment, optional that should be less likely if the segments was to share some articulatory feature with the adjacent segment of that.

It actually got worse WITH context, right? And I read the actual paper, with all kinds o' context. And I still had to re-read that sentence many many times. I'm still not sure I understand it. I may have to whip out PowerPoint, a laser pointer, and a flashlight before I figure it out for sure. Now, I'm prepared to admit that the three pints of BBC Bourbon Barrel Stout at Galaxy Hut may have influenced my critique ...

...but not entirely for the worse. If I ever get around to typing up my awesome and prodigious commentary, it might make a great blog post ... but don't hold your breath. I have a stack of linguistics articles I've read and reviewed over the last 12 months and yet somehow, I just never get around to typing up my truly awesome comments (including in-depth discussion of flo's partner-in-crime Peter Graff's Longitudinal Phonetic Variation in a Closed System -- I got mad comments on that one). Maybe I should have called this blog The Lazy Linguist?

*I've never met the guy so maybe he's a bastard in person, I dunno, I hope not...
**Not in the least because it has some tangential connection to my somewhat defunct dissertation research.
***Hey, he calls himself that on his site...

Sunday, January 16, 2011

god awful is an odd phrase

I used the phrase god awful in a comment at Language Log and it occurs to me that it's an odd little creature. From the OED*:

Pronunciation:  /ˌgɒdˈɔːfʊl/
Forms:  Also God awful, Godawful.
Etymology:  < god n. + awful adj.

slang (orig. U.S.).

  Terrible; extremely unpleasant. (In quot. 1878 the sense is ‘impressively large’.)

1878    J. H. Beadle Western Wilds xxxvii. 611   Put thirty acres‥into wheat, and went to work with a hurrah in 1874 to make a God-awful crop.
1897    C. M. Flandrau Harvard Episodes 88   Ellis is such a God awful fool.
1930    W. S. Maugham Breadwinner ii. 124   Your affairs are in a god-awful mess.
1946    ‘S. Russell’ To Bed with Grand Music i. 14   Listen to the most godawful programmes on the radio.
1958    R. Graves in Times Lit. Suppl. 15 Aug. p. x/4   The credible and vivid story that any context (red-brick, yellow-brick, or otherwise God-awful) offers.
1959    P. McCutchan Storm South iv. 63,   I heard the most God-awful racket above my head.

The meaning is derived from using god as an intensifier, like very. Fine, I get this analysis; it makes sense. But is god ever used in any other construction to intensify a negative quality like awful?

This is a case where corpora are not terribly useful, because instances of god are so frequent, and so frequently NOT in this kind of construction, that the pattern is difficult to discover automatically. I could go all qualitative and just read a million phrases with god in them, but that would take a really long time and still have a low probability of success.
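One way to make the search at least semi-tractable: only keep hits where god immediately precedes (or is hyphenated to) a candidate negative adjective, throwing away the vast majority of irrelevant god tokens up front. A real study would want a POS-tagged corpus; the adjective list and example lines below are mine, not data:

```python
# Sketch: filter corpus lines to tokens where "god" directly modifies a
# candidate negative adjective, discarding the flood of other uses of
# "god". The adjective list and example lines are invented.
import re

NEG_ADJS = ["awful", "terrible", "horrible", "dreadful", "ugly"]
pattern = re.compile(r"\bgod[- ]?(%s)\b" % "|".join(NEG_ADJS), re.IGNORECASE)

lines = [
    "What a God-awful mess.",
    "That was a god terrible idea.",   # constructed; does anyone actually say this?
    "Thank God it's Friday.",          # frequent but irrelevant use
    "The god of thunder was angry.",   # frequent but irrelevant use
]

hits = []
for line in lines:
    m = pattern.search(line)
    if m:
        hits.append(m.group(0))
print(hits)
```

This would shrink the "read a million phrases" problem down to reading only the god + adjective bigrams, which is exactly the question at issue: does anything other than god awful ever show up in that slot?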

HT to the OED for making their site freely available this month! Use name/password trynwoed/trynewoed.

Saturday, January 15, 2011

true grit phonological ambiguity

Thanks to Jeff Bridges' now infamous mumbling performance, the clever folks at College Humor give True Grit a version of the lip reading treatment that Star Trek received not too long ago.

apologies for the weird embedding, I don't know how to fix it (I just pasted the embed code into the Blogger HTML with no option to adjust size)...and yes, I'll have some more of that woop woop, please...

Thursday, January 13, 2011

how distinctive is app store?

Microsoft is arguing that Apple cannot trademark the term app store because it is a generic term.

"An 'app store' is an 'app store'," Russell Pangborn, Microsoft's associate general counsel, said, according to the BBC. "Like 'shoe store' or 'toy store', it is a generic term that is commonly used by companies, governments and individuals that offer apps."

A commenter at Hacker News begs to differ:

Ngram data shows no usage of "App Store" or "app store" from the time of 1800 to 2008. I was suspicious of this, but using the terms "app,store" separately produced lots of data points. My tentative hypothesis is that Ngram is using data that existed before the App Store went public and thus will not show up in Ngram.

I'm no trademark expert, but the basic idea, as Wikipedia defines it, is distinctiveness: A trademark may be eligible for registration, or registrable, if amongst other things it performs the essential trademark function, and has distinctive character. Registrability can be understood as a continuum, with "inherently distinctive" marks at one end, "generic" and "descriptive" marks with no distinctive character at the other end, and "suggestive" and "arbitrary" marks lying between these two points.

First, I used BYU's Corpus of Contemporary American English and found an instance in 2009 of "app store" being used to describe Zune's product: Oh, the Zune has an app store, all right. As of today, there are exactly nine programs in the Zune App Store.

A quick google search reveals that it commonly gets applied to non-Apple related products as well: Yep, Amazon Launching Their Own App Store For Android Too.

While it may be the case that Apple introduced the term in 2008, it seems to have expanded to generic use in less than a year and now gets used at least semi-regularly for non-Apple products. I'm not an Apple user myself and my own reading of app store is definitely generic. It does not distinctly mean Apple's product at all, to me. I have no clue if a court would agree.

do you despise eReaders and have tons of extra cash

...then this is for you: The Penguin Classics Complete Library is a massive box set consisting of nearly every Penguin Classics book ever published and is available on Amazon for only (only!) $13,413.30.

A rundown:
  • 1,082 titles
  • 52 miles laid end to end
  • 700 pounds in weight
  • 828 feet stacked in a single pile
  • 25 boxes on arrival
My only complaint would be that Penguin Classics tend to be crappy books physically.

HT Kottke.

Tuesday, January 11, 2011

doggie do do at the HuffPo

The Huffington Post is resetting the bar for astoundingly stupid science reporting: They report on a dog, Chaser, who has been trained to accurately fetch over 1,000 toys by name, and they conclude that the dog's abilities, wait for it, place her at an intelligence level equivalent to a three-year-old human child!

My oh my, their view of the cognitive ability of 3-year-olds is as depressing as it is profoundly wrong. Sorry, 3-year-old humans can do more than make one-to-one correspondences between sounds and objects. They can, for example, recognize that the sound swing can mean an object with a seat attached to ropes OR the action you perform when you move your body back and forth on that thing with the ropes; they can watch TV and follow plot developments, ... sigh, I mean fuck it, it's not worth debunking ...

UPDATE: Sean at Replicated Typo reviews the original research involving Chaser.

Do rich families talk to their kids more than poor families?

Are children in professional families talked to three times as much as children in welfare families? That's the underlying assumption behind a new program at Bellevue Hospital designed to coach "poor families on how to talk to their infant children, encouraging more interaction."

At least, that's how the Huffington Post wants you to think about this story:

University of Kansas graduate student Betty Hart and her professor, Todd Risley, wanted to figure out the cause of the education gap between the rich and poor. So, they targeted early education and headed a study that recorded the first three years of 40 infants' lives. The conclusion? Rich families talk to their kids more than poor families.

Pretty impressive, huh? Sounds cutting edge, right? With a little searching I discovered the following:
  • Betty Hart was a grad student at KU in the 1960s.
  • The research data for this study was collected in the early 1980s.
  • The paper publishing these results was published in 1995.
I have no problem with the common sense underlying these notions: talking to babies a lot helps them achieve higher success in academics later in life. Good advice all around, no doubt. But I'm suspicious of several assumptions about the finding of the original paper. From Alix Spiegel:

According to their research, the average child in a welfare home heard about 600 words an hour while a child in a professional home heard 2,100. "Children in professional families are talked to three times as much as the average child in a welfare family," Hart says [emphasis added].

Hearing words in your environment, being talked to, and child-directed speech are three different things and need to be distinguished. All I have are secondary sources, not the 1995 book (Spiegel's article is the most thorough), so I can't tell how the data was coded and what they looked for (did they make the above three distinctions?).

But more to the point is the contemporary rush to paint these old findings as rationale to create new programs aimed at poor parents, as if being poor makes your language use wrong somehow. It strikes me as convoluted logic to take a 15-year-old book (based on 20-year-old data) and decide that poor parents need linguistic intervention. Exactly how much grant money did Dr. Mendelsohn spend on this program? Even if the 3-to-1 ratio holds true (I suspect it would not under close scrutiny), what other factors might be affecting it?

It struck me that people with basically good intentions took a small amount of science out of context and used it to reinforce class stereotypes and class pressure.

Monday, January 10, 2011

The Psychological Functions of Function Words

Here is Chung & Pennebaker's 2007 paper on function words, which crucially relies on Pennebaker's LIWC data: The Psychological Functions of Function Words (pdf). I have long felt that function words have been wrongly ignored by computational linguists and SEO specialists. While the use of stop lists has sped up processing time considerably, it has also wiped out huge amounts of semantically meaningful data. Nonetheless, I also feel that Pennebaker's LIWC corpus is not as transparent or as comprehensive as I would prefer it to be.
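To make the stop-list point concrete, here's a toy sketch (the stop list is a tiny invention of mine, not any real tool's list): stripping stop words throws away exactly the pronouns, articles, and prepositions that Chung & Pennebaker treat as psychological signal.

```python
# A tiny toy stop list of the sort used to speed up text processing.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "i", "we", "you",
              "it", "is", "was", "and", "but", "that", "my", "our"}

def strip_function_words(text):
    """Drop stop-listed tokens -- which are precisely the function
    words (pronouns, articles, prepositions) that carry the
    psychological signal Pennebaker's work depends on."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(strip_function_words("We did it because of the way I see you"))
# → ['did', 'because', 'way', 'see']
```

Notice that the surviving content words tell you almost nothing about the speaker, while the discarded "we", "I", and "you" are the words Pennebaker's analyses care about most.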

Saturday, January 8, 2011

biggest linguistics story of 2010?

I have nothing but respect and admiration for Erin McKean, CEO and co-founder of the awesome Wordnik project as well as the person who has given by far the single greatest lingo-TED-talk ever; nonetheless, I take exception to her most recent column in the Boston Globe, The year in language, an article about the best and worst language stories of 2010. She notes many worthy events, yet...

With no offense meant, I can say that I was shocked, SHOCKED! to discover that no mention whatsoever was made of what I consider to be the single most important and shocking linguistics-related story of 2010: the revelation that Harvard's Marc Hauser fabricated data regarding rule learning by monkeys. For years, Hauser posed as a giant in the Chomsky camp and created an ivy-league cottage industry based on his research. 2010's revelations of his still-unclear-yet-nonetheless-obvious forgery are a shock wave whose full power and ramifications have yet to be understood. Plus, it was the Boston Globe itself, the paper Erin publishes in, that broke the original story.

Language Log's extensive discussions of the Hauser story can be found here.

replace QWERTY with little circles?

Android users can look forward to a new typing layout from 8pen, designed specifically for one-handed typing on hand-held devices. There have long been alternatives to the traditional QWERTY layout, but this one replaces keys with hand motion: rather than landing your finger on the letter you want to type (the conceptual foundation of most keyboard designs), it rests on the idea that you trace little circles on the screen to access different letters.
In the words of the horse from Ren and Stimpy, no sir, I don't like it. Why not?

While inefficient and clumsy, the classic idea of touching the letter you want is fundamentally natural and clear. Any child or lazy adult can grasp it immediately. The little circles idea creates an artificial and unnatural interface that puts you multiple steps away from what you want. I'm not trying to make circles, I'm trying to type a frikkin k. I'm sure with practice anyone could get good at this, but I don't wanna practice typing for frik's sake! That's why I've been a clumsy hunt and pecker for 30 years with the damned QWERTY. I could have practiced typing on this damn thing also, but I didn't for the same reason I'm not gonna practice the little circles: I'm lazy. But at least with keys I can just touch the letter I want and get it. It's clear and obvious. I'm sure the little circles would drive me mad.

Friday, January 7, 2011

adults process language in a baby way!

Do babies process language in a "grown-up" way? First, read this from UCSD:

Babies, even those too young to talk, can understand many of the words that adults are saying – and their brains process them in a grown-up way.

Combining the cutting-edge technologies of MRI and MEG, scientists at the University of California, San Diego show that babies just over a year old process words they hear with the same brain structures as adults, and in the same amount of time. Moreover, the researchers found that babies were not merely processing the words as sounds, but were capable of grasping their meaning [emphasis added].

It certainly is an interesting finding to discover that infant and adult lexical processing may be similar, but why couch it in asymmetrical phrasing? Given the facts as this press release states them, could we equally well say that adults process language in a baby way? This wouldn't get any press attention, though, would it? Or worse, it would be mocked. The author of the press release, Debra Kain, is referred to as a spokesperson for the UCSD Medical Center in this article. But it's not clear she consulted Jeff Elman, a very well respected cognitive scientist who participated in the research. I'm not sure how comfortable he would have been with the somewhat excitable language.

Thursday, January 6, 2011

annals of unnecessary censorship, literary canon edition

Upcoming NewSouth 'Huck Finn' Eliminates the 'N' Word.

Twain scholar Alan Gribben and NewSouth Books plan to release a version of Huckleberry Finn, in a single volume with The Adventures of Tom Sawyer, that does away with the "n" word (as well as the "in" word, "Injun") by replacing it with the word "slave."

"What he suggested," said La Rosa, "was that there was a market for a book in which the n-word was switched out for something less hurtful, less controversial. We recognized that some people would say that this was censorship of a kind, but our feeling is that there are plenty of other books out there—all of them, in fact—that faithfully replicate the text, and that this was simply an option for those who were increasingly uncomfortable, as he put it, insisting students read a text which was so incredibly hurtful."

I'm curious about this notion of replacement as an "option" for two reasons. First, it reminds me of Ted Turner's infamous and ill-fated 1980s colorization project whereby he went back and artificially colorized black and white movies. As I recall, Turner also spoke of it as an "option", but it failed miserably as a cultural movement. Second, now that eReaders are becoming commonplace I wonder if publishers will begin to offer sanitized versions of books as an option. I don't have an eReader, so maybe this is already available, but I could imagine a filter that you click on and magically Henry Miller's Tropic of Cancer becomes a weirdly different novel.

HT kottke

Wednesday, January 5, 2011

jobs for linguists

As the economy slowly starts to wake, I hope and expect to see more jobs like this one popping up where general linguistics skills are being sought by innovative tech companies (these were a dime a dozen in the glory days of the '90s tech boom). Were I a bit younger, and less well-paid, I'd probably consider applying myself.

We are seeking a Linguist interested in joining a rapidly growing organization. The Linguist will work closely with our NLP Team in researching and developing lexica and grammars specific to various languages (“Language Packs”) that will be used for various NLP tasks. She/he will be expected to contribute substantive insight/action with regard to developing language packs and must have a keen eye for understanding the end-user experience.

Specific responsibilities include:
- Research specific languages for their lexical, morphological, and grammatical structures
- Develop original lexicons and reformat acquired lexicons
- Create grammatical rules using the research done above or other sources
- Analyze results from the system for mistakes and plan for improvement
- Willingness to focus research and development of Language Packs on meeting the end-user’s needs

If you're a linguist interested in a non-academic career, you could do worse than apply here.

And for the record, I have no association with this company, have never worked for them, get nothing from posting this, but I do know one of their employees (we went to grad school together).

Tuesday, January 4, 2011

the germans fear my language too, muahahaha

It's a mighty era to be a native speaker of English. It seems the world fears my language and is instituting fruitless policies to protect their languages against my own. First the Chinese banned English words and phrases. Now, the Germans are getting on the banning bandwagon:

Germany's Transport Minister claimed to have struck an important blow for the preservation of the German language yesterday after enforcing a strict ban on the use of all English words and phrases within his ministry.

Peter Ramsauer stopped his staff from using more than 150 English words and expressions that have crept into everyday German shortly after being appointed in late 2009.

His aim, which was backed by Chancellor Angela Merkel, was to defend his language against the spread of "Denglish" – the corruption of German with words such as "handy" for mobile phone and other expressions including "babysitten" and "downloaden". As a result, words such as "laptop", "ticket" and "meeting" are verboten in Mr Ramsauer's ministry. Instead, staff must use their German equivalents: "Klapprechner", "Fahrschein" and "Besprechung" as well as many other common English words that the minister has translated back into German.

naive bayes knows restaurants better than 5,000 mechanical turks

Yelp recently sponsored a bake-off between a Naive Bayes classifier and the online crowd-sourcing site Mechanical Turk. The task was classifying web sites by business category (i.e., is it a restaurant or a doctor's office?). The classifier beat the turkers handily.

Money quote:
In almost every case, the algorithm, which was trained on a pool of 12 million user-submitted Yelp reviews, correctly identified the category of a business a third more often than the humans. In the automotive category, the computer was twice as likely as the assembled masses to correctly identify a business.

There are a variety of qualifications (why did 99% of Turkers who applied for the task fail the basic test? ESL issues perhaps?). But it's an interesting result.
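For the curious, here's a toy sketch of what a multinomial Naive Bayes classifier like the one in Yelp's bake-off actually does. This is not Yelp's code, and the four training "reviews" are invented by me; a real system would train on millions of reviews, not four sentences.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy multinomial Naive Bayes text classifier."""

    def fit(self, docs, labels):
        # Count class frequencies and per-class word frequencies.
        self.priors = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        scores = {}
        for label in self.priors:
            total = sum(self.word_counts[label].values())
            score = math.log(self.priors[label] / sum(self.priors.values()))
            for w in doc.lower().split():
                # Laplace smoothing so unseen words don't zero out a class.
                score += math.log((self.word_counts[label][w] + 1) /
                                  (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Invented training data standing in for Yelp's 12 million reviews.
nb = NaiveBayes().fit(
    ["great tacos and friendly waiters", "delicious pizza slow service",
     "oil change and brake repair", "engine trouble fixed my transmission"],
    ["restaurant", "restaurant", "automotive", "automotive"])
print(nb.predict("amazing pizza and tacos"))  # → restaurant
```

The trick is that "naive" independence assumption: each word votes for a category on its own, which is wildly unrealistic linguistically and yet, as the Yelp result shows, good enough to beat a crowd of humans at this kind of task.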

HT kdnuggets

Monday, January 3, 2011

how we hear ourselves speak

Science Daily has a nice article on new neurolinguistic research out of Cal linking auditory and speech processes:

"We used to think that the human auditory system is mostly suppressed during speech, but we found closely knit patches of cortex with very different sensitivities to our own speech that paint a more complicated picture," said Adeen Flinker, a doctoral student in neuroscience at UC Berkeley and lead author of the study.

"We found evidence of millions of neurons firing together every time you hear a sound right next to millions of neurons ignoring external sounds but firing together every time you speak," Flinker added. "Such a mosaic of responses could play an important role in how we are able to distinguish our own speech from that of others."

HT Linguistic News Feeds

the evolution of journalistic quotes

They're getting shorter:

According to a new article in the academic journal Journalism Studies by David M. Ryfe and Markus Kemmelmeier, both professors at the University of Nevada, newspaper quotations evolved in much the same way as TV sound bites. By 1916, they found, the average political quotation in a newspaper story had fallen to about half the length of the average quotation in 1892.

(HT Daily Dish)
