Thursday, August 22, 2013

corpus data 1: barrier verb frequencies

This is the eighth in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (list of previous posts here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

Corpus Data
All data was extracted from The British National Corpus in roughly 2007 (yeah yeah, I could re-do this ... someday). Below are four tables representing the co-occurrence percentages of the most frequent verbs in each of the four categories for which I extracted barrier verb data.

Recall that barrier verbs can occur in one of four full syntactic templates* (S = clause, or an ING verb) which I call the Barrier Verb Construction (BVC):
  • A: verb X from S — prevent bad guys from stealing the TV.
  • B: verb X from NP — exclude students from the auditorium.
  • C: verb X against S — guard against getting athlete's foot.
  • D: verb X against NP — defend yourself against the police.
Without getting into the greasy details, the BVC data below was extracted from a parsed version of the British National Corpus, so it involved more than mere word frequencies (it required specific syntactic relationships to hold in tree structures). The numbers are sorted by the percentage of total occurrences (this equals the total BVC occurrences divided by the total frequency of each verb as reported by Adam Kilgarriff).

How to read the table: the verb prevent had a total frequency of occurrence of 10286 according to the Kilgarriff data. I found 2152 correct occurrences of prevent in type A of the BVC. I interpret this to mean that about 21% of all instances of the verb prevent (and its morphological variants) within the BNC occur within type A of the Barrier Verb Construction (i.e., with a from ING complement). On the other hand, the word suppress occurred 1311 times overall, but only twice did it occur in the BVC (i.e., with a from ING complement).
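For the record, the percentage column involves nothing fancier than the arithmetic below. The two frequency pairs are the ones quoted above; the rest is a quick Python sketch:

```python
# The two frequency pairs quoted above (Kilgarriff totals vs. counts found in
# type A of the BVC); everything else is just arithmetic.
kilgarriff_freq = {"prevent": 10286, "suppress": 1311}  # total BNC frequency
bvc_type_a_freq = {"prevent": 2152, "suppress": 2}      # occurrences in BVC type A

def bvc_percentage(verb: str) -> float:
    """Percent of a verb's total BNC occurrences that fall in BVC type A."""
    return 100.0 * bvc_type_a_freq[verb] / kilgarriff_freq[verb]

for verb in ("prevent", "suppress"):
    print(f"{verb}: {bvc_percentage(verb):.1f}%")  # prevent: 20.9%, suppress: 0.2%
```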

There is much more to be said about these stats. I offer this as a tantalizing morsel. To be continued...

*These four basic construction types do not include passives or sentences where there is only an implied complement.

Wednesday, August 21, 2013

deep semantics 3: barrier verbs and aktionsart

This is the seventh in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (list of previous posts here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

Barrier Verbs and Aktionsart

Part of the semantic interpretation of barrier verbs involves event duration. Barrier verbs typically represent states (i.e., the temporal extent of the negated event is presupposed to have no necessary end boundary). Dowty described some basic tests for determining the Aktionsart class of a verb in a sentence. These now classic tests include the “occurs with X for an hour, spend an hour Xing” test for states and activities. However, some barrier verbs more readily allow a for an hour duration phrase than others.

For example, detain allows for an hour readily, but ban is less acceptable (to my American English speaking ears):
a. John was detained from entering Canada for an hour.
b. ?John was banned from entering Canada for an hour.
Although (b) is neither strictly ungrammatical nor strictly unacceptable, it intuitively seems like less of a default association between the stative event that the verb ban evokes and the duration phrase. I will stipulate, however, that (b) may be more acceptable to British speakers than American English speakers.

The verb detain seems to suggest a temporary state. However, none of these barrier verbs is strictly telic, as the in an hour test shows:
c. *John was detained from entering Canada in an hour.
d. *John was banned from entering Canada in an hour.
This is truly a fine grained semantic distinction requiring much more detailed analysis.

Tuesday, August 20, 2013

deep semantics 2: entailment vs. invited inference in barrier verbs

This is the sixth in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (list of previous posts here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

Deep Semantics 2: Entailment vs. Invited Inference

Even though barrier verbs appear to be clearly Negative verbs (see here), I will be careful not to overstate the logical relationship between the negative semantics of barrier verbs and the outcome of the complement event because in some cases the relationship seems like entailment, but in others it seems closer to invited inference. For example, in (a) it seems like the verb prevent entails that the car did not get wet; however, in (b), it seems plausible that, while Tom may be exempted from paying taxes, he went ahead and paid them anyway (perhaps by mistake).
(a) The garage prevented the car from getting wet.
(b) The IRS exempted Tom from paying taxes.
Simple presupposition involves the existence of some assertion in the background knowledge of all people involved which allows another assertion to be true. Here is a classic example from Chierchia and McConnell-Ginet 1990:
ASSERTION - Tom stopped smoking.
In order to utter “Tom stopped smoking” felicitously, it must be assumed that the listener already knows that “Tom smoked”. In the example below, the barrier verb ban requires, on some level, the listener to believe that the journalists want to go to the courtroom:
ASSERTION - The judge banned journalists from her courtroom.
PRESUPPOSITION - The journalists want to go to the courtroom.
And, indeed, this belief passes the three primary tests for presuppositions:
NEGATION: The judge did not ban journalists from her courtroom.
QUESTIONING: Did the judge ban journalists from her courtroom?
CONDITIONAL: If the judge banned journalists from her courtroom, then there will be trouble.
The presupposition “the journalists want to enter the judge’s courtroom” survives under all three of these tests, but this alone does not mean that it is presupposed. There are presupposition-like phenomena which produce the same or nearly the same results. For example, if a person I don’t know very well came up to me and said, “my father just stopped smoking recently” I could infer (from Grice’s well known maxims of QUALITY and QUANTITY most likely) that her father had smoked previously and add that assertion to my background knowledge thereby making her utterance felicitous (and she could assume all along that that is exactly what I would do, also making her utterance felicitous). The assertion is not presupposed per se, but it is inferred and added to background knowledge in the moment.

One alternative, typically referred to as invited inference, mirrors many of the properties of presuppositions. Invited inferences are inferences we make based on background knowledge and our desire to follow basic principles of cooperative communication (à la Grice’s maxims). They can be very dependent on the verb they occur with. Saeed cites Levinson 1983 for the following examples:
ASSERTION - He cried before he finished his thesis.
PRESUPPOSITION = He finished his thesis.
ASSERTION - She died before she finished her thesis.
PRESUPPOSITION ≠ She finished her thesis.
Based on our knowledge of the world, we can recover or infer the fact that she did not finish in the second sentence. One possible analysis, that can save the presupposition, is defeasibility.

Defeasibility says, in essence, that we do in fact make the same presupposition for the second sentence, but then we cancel it after checking it against world knowledge. To test whether there is a consistent presupposition across all the barrier verbs, I performed the three presupposition tests above on a subset of the barrier verbs in a preliminary NYT corpus. The goal was to perform the tests on two active sentences for each verb, preferably one with an NP complement and one with a VBG complement. This was not always possible: some verbs showed no variation in complements in this corpus, active sentences were scarce (these verbs often occur in the passive or in nominals), or only one sentence was found in the corpus for a particular verb (e.g., “to guard”). In the cases where there was no variation, two sentences with the same complement type were used. In the cases where there was no good active sentence, one was formed from a passive with minimal adjustment (you will forgive this linguistic sleight of hand, as no change in the relevant meaning resulted; see Nunes* for a relevant discussion of the argument structure of deverbal nominals).

In the cases where only one sentence was available, the tests were performed on that one sentence and then the study proceeded on to the next verb. In all cases, a presupposition was contrived that could survive all the tests. Take, for example, the one sentence involving the verb guard:
Lavish and extensive measures guard the president from myriad threats.
If we take “the president” to have a tendency away from “the myriad threats”, then how is that tendency to be paraphrased so as to test it with the presupposition tests? As these tests are linguistic in nature, the linguistic form of the paraphrase of the situation feature of tendency is rather important to make sure the tests are being performed correctly. We might say that the assertion p is presupposed: p = The president wants to avoid threats.
NEGATION: Lavish and extensive measures do not guard the president from myriad threats.
QUESTIONING: Is it the case that lavish and extensive measures guard the president from myriad threats?
IF/THEN: If lavish and extensive measures guard the president from myriad threats, then there’s going to be trouble.
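If you want to mechanize the frame-building (the judgments, of course, remain human work), a naive string-surgery sketch looks like this. It assumes a plural subject so that "do not" comes out right, and present tense throughout:

```python
def presupposition_tests(subject: str, predicate: str) -> dict[str, str]:
    """Build the NEGATION / QUESTIONING / CONDITIONAL frames for a clause.
    Naive string surgery: assumes a plural subject ("do not") and present tense."""
    return {
        "NEGATION":    f"{subject} do not {predicate}.",
        "QUESTIONING": f"Is it the case that {subject} {predicate}?",
        "CONDITIONAL": f"If {subject} {predicate}, then there will be trouble.",
    }

frames = presupposition_tests(
    "lavish and extensive measures",
    "guard the president from myriad threats",
)
for name, sentence in frames.items():
    print(f"{name}: {sentence}")
```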
One fine distinction can be made regarding speech-act barrier verbs like ban and exempt where it is possible that the undergoer of the prohibition is either not aware of it or chooses to flout it. This allows for the possibility that the prohibited event occurs despite the ban, making the negation of the event an invited inference rather than an entailment.

Thus, the jury is heavily leaning towards entailment for most core barrier verbs, but the jury is still out.

* Nunes, M. 1993. Argument Linking in English Derived Nominals. In Van Valin (ed) Advances in Role and Reference Grammar, John Benjamins, 375-432.

Monday, August 19, 2013

barrier verbs as negative verbs

This is the fifth in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (list of previous posts here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

Barrier Verbs as Negative Verbs

It has been assumed since at least Klima 1964 (pdf) that some verbs are inherently negative. This means that they entail that some event did NOT occur. For example, the (a) sentence is from Laka (1990:105):
a. The witnesses denied [that anybody left the room before dinner].
b. Jean neglected [to turn off the lights].
For (a), it should be intuitively clear to a native speaker of English that the semantics of deny entails that the proposition encoded by the embedded clause is false. Similarly, in (b) the end state of the lights is on, the opposite of off: the verb neglect entails that the event encoded by the embedded clause did not occur. This inherent negativity is a crucial feature in the semantics of barrier verbs.

Core barrier verbs are negative verbs that indicate an event did not happen (some non-barrier verbs can be coerced into a barrier verb interpretation, by being used within the barrier verb construction, but these verbs do not meet barrier verb entailments outside of the construction).
a. The roof prevented the car from [getting wet] → the car did NOT get wet.
b. The law exempted Tom from [paying taxes] → Tom did NOT pay taxes.
Laka draws some testable conclusions about negative verbs based on their interaction with negative polarity items (NPIs). In (a) the NPI anything fails to be licensed by the negative verb deny, while in (b) a negative complementizer is selected that in turn licenses anything in the embedded clause.
a. *The witness denied anything.
b. I deny that the witness denied anything.
The negation entailed by deny is not consistent with the NPI anything. There are two kinds of NPIs, licensed and free, and three criteria distinguish them:

1) 'Just' Force
The adverb just forces a ‘free choice’ interpretation (where ‘free choice’ = “press any key”: your choice, but you must choose one) on licensed NPIs. The adverb just also reverses negation:
I didn’t eat anything = I ate nothing
I didn’t eat just anything = I ate something
2) Verb Force
Negative verbs (N-verbs) force sentences with licensed NPIs to become ungrammatical; the N-verb itself plays no role in licensing any, so it plays no role in the sentence's grammaticality.

3) Affective 'All'
N-verbs license an affective ALL reading of “a single N”; Laka says that “a single N” has “no ‘free choice’ reading available” (110).

CONCLUSION -- NPIs are licensed only in the clausal complements of N-verbs.

My interpretation of Laka is that any means either ALL or ONE. Negated N-verbs entail the ALL reading. So, acceptable examples of a negative verb candidate embedded under a negative verb in a clausal complement with an NPI should establish the legitimacy of that candidate verb as an N-verb (Phew! That's a sentence only a linguist could love).

In order to test the interaction between barrier verbs and NPIs, I performed a set of simple tests. First, I created a template of four sentences, each representing a verb’s interaction with the NPI anything. Then, I inserted each barrier verb into the verb slot and judged the grammaticality of the result. Finally, I ran Google searches of the form "* from [barrier verb] anything", designed to return cases of verbs that took a clausal barrier verb + NPI complement. Two examples here should suffice:
a. *Bob prevented anything.
b. *John prevented Bob from anything.
c. John didn’t prevent Bob from anything.
d. John prevented Bob from preventing anything.

Google results for "* from preventing anything"
  • FEMA must be prevented from preventing anything when hours are lives.
  • What is to stop the govt from preventing anything from being shown "for the good of society"?
  • I stopped my firewall from preventing anything from working and i reinstalled limewire.
a. *Bob protected anything.
b. *John protected Bob from anything.
c. John didn’t protect Bob from anything.
d. John prevented Bob from protecting anything.

Google results for "* from protecting anything"
  • In addition, an "idea/expression dichotomy" in copyright law prevents copyrights from protecting anything on the "idea" level.
  • Far from protecting anything, the technobabble creates a pointless risk.
What these tests show is that barrier verbs by and large do not allow an NPI unless they are first embedded under a negative verb, like the I deny that the witness denied anything example. This, at least at first blush, confirms that English barrier verbs are N-verbs under Laka’s definition. The second Google protect sentence (Far from protecting anything) is particularly interesting in that it seems to be the preposition from which licenses the NPI, suggesting that from has a negative entailment all its own, which conforms to Jackendoff's and Van Valin's analysis (yet to be discussed).
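The template-filling step is trivial to mechanize for regular verbs; here is a sketch. The morphology is naive "-ed"/"-ing" suffixing, so irregular verbs like ban would need a lookup table, and judging the resulting sentences is still manual:

```python
def npi_frames(verb: str) -> list[str]:
    """The four NPI test sentences from the template, filled with a regular verb.
    Morphology is naive suffixing, so only -ed/-ing regular verbs come out right."""
    return [
        f"*Bob {verb}ed anything.",                       # bare NPI object
        f"*John {verb}ed Bob from anything.",             # NPI inside from-PP
        f"John didn't {verb} Bob from anything.",         # rescued by sentential negation
        f"John prevented Bob from {verb}ing anything.",   # rescued by embedding under an N-verb
    ]

def google_query(verb: str) -> str:
    """The wildcard search pattern used to harvest clausal BVC + NPI examples."""
    return f'"* from {verb}ing anything"'

for line in npi_frames("protect"):
    print(line)
print(google_query("prevent"))  # "* from preventing anything"
```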

In the (a) examples below, the verb stop is neutral with respect to the event of barking; it is the presence of the word from which adds the negation in (b):
a. Chris stopped the dogs barking = the dogs were barking, then they stopped
b. Chris stopped the dogs from barking = the dogs were never barking
In the examples below, the verb prevent negatively entails the event of barking, regardless of the presence of the word from:
a) Chris prevented the dogs barking = the dogs were never barking.
b) Chris prevented the dogs from barking = the dogs were never barking.
One of the issues here is the temporal relationship between the event of preventing and the event of barking. Barrier verbs entail no temporal overlap between the two events. This will be taken up in a later post.

Sunday, August 18, 2013

barrier verb subclasses

This is the fourth in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (one here, two here, three here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

In order not to confuse Talmy’s description with mine, I will use different terms from this point forward. In defining the semantics of barrier verbs, I will use the term “blocker” to refer to the participant who initiates or causes the blocking event to occur, the term “blockee” to refer to the participant which is affected by the blocking, the term “barrier” to refer to the participant which actually creates the blockade, and finally the term “outcome” to refer to the result of the event which was blocked (somewhat related to goals). These terms may have some overlap with well known semantic terms (e.g., “actor”, “agent”, “undergoer”, “patient”, “instrument”, “resultative”); however, they are used here as labels of convenience, so they should not be confused with other terms used outside of this dissertation.

I will show that two semantically distinct subclasses of barrier verbs can be described:

Set 1) a prevent subclass, where the syntactic object of the barrier verb is the blockee of the blocked event, and the verb presupposes an intention on the part of the blockee to achieve the outcome of the blocked event.

Example 1: Chris banned Wallis from going to the movies.
  • Blocker = Chris
  • Blockee = Wallis
  • Barrier = speech act ‘ban’
  • Outcome = seeing the movie

Set 2) a protect subclass, where the syntactic object of the barrier verb is the blockee of the blocked event, and the verb presupposes a desire on the part of the blockee to circumvent the outcome.

Example 2: The doctor protects children from the flu with vaccines.
  • Blocker = the doctor
  • Blockee = the children
  • Barrier = vaccines
  • Outcome = getting the flu
The critical difference between the two classes is that prevent-type barrier verbs encode a negative relationship between the blocker and the blockee, while protect-type verbs encode a positive relationship between them. For example, in Example 1 above, it is presupposed that Wallis wants to achieve the outcome of seeing the movie, and the blocker Chris stops Wallis from achieving this goal against Wallis's wishes. In Example 2 above, it is presupposed that children want to avoid the outcome of getting the flu, and the blocker the doctor helps the children avoid this outcome.

The Verbs

Set 1 - prevent class
ban, bar, barricade, block, detain, discourage, enjoin, exclude, hamper, hinder, interrupt, obstruct, occlude, pre-empt, prevent, prohibit, restrain, restrict, thwart

Set 2 - protect class
deflect, exempt, guard, insulate, protect, screen, shield

Wednesday, August 14, 2013

barrier verb construction and selectional preferences

This is the third in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (one here, two here).

The Barrier Verb Construction Template
I will use the term construction loosely to mean a syntactic template composed of slots which constrain their fillers syntactically or semantically. Barrier verbs fit into the following general constructional template:

NP1 verb-bar NP2 from/against NP3/VP

In this construction, the syntactic subject of a barrier verb (NP1) is either the agent which wields the barrier as an instrument, or the instrument itself. NP2 is the goal-directed participant and the complement NP3/VP represents the goal or outcome. For example, in both sentences below the government impedes the refugees from achieving their goal of entering the country.

  • The government barred the refugees from the country.
  • The government barred the refugees from entering the country.

Note that the event of entering the country can be represented by the NP the country and is presupposed to be intended, but not yet achieved (I assume some sort of coercion process allows for the event interpretation.). The PP complement in this construction involves either the preposition from or against (Not all barrier verbs allow the against alternation; this will be taken up in a later post) and represents the goal.
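Just for flavor, here is a toy surface-pattern matcher for the template. The real extraction (discussed in a later post) ran over a parsed corpus with tree-structural constraints; the tiny verb list in this regex is purely illustrative:

```python
import re

# Toy surface matcher for: NP1 verb-bar NP2 from/against NP3/VP.
# The verb list is a small illustrative sample, not the real extraction grammar.
BVC = re.compile(
    r"(?P<blocker>.+?)\s+"                          # NP1 (lazy, up to the verb)
    r"(?P<verb>barred|banned|prevented|protected)\s+"
    r"(?P<blockee>\S+(?:\s\S+)?)\s+"                # NP2 (one or two words)
    r"(?P<prep>from|against)\s+"
    r"(?P<goal>.+)"                                  # NP3/VP
)

m = BVC.match("The government barred the refugees from entering the country.")
if m:
    print(m.group("verb"), m.group("prep"), m.group("goal"))
```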

Many of the verbs I identify as barrier verbs occur in other lexical-semantic classifications like Levin (1993), FrameNet, Korhonen and Briscoe (2004), Brew, and Bresnan. For example, Korhonen and Briscoe list a FORBID class which includes prohibit and ban and a LIMIT class which includes restrict and restrain. Yet, no classification to date has recognized a single, natural class of barrier verbs which exhibit the properties this dissertation discusses. This is a reminder that no classification is perfect. FrameNet includes four frames which overlap somewhat with barrier verbs, namely HINDERING, PREVENTING, PROHIBITING, and THWARTING. However, none of these four frames recognizes a superordinate category frame, something like BARRIER, which classifies the verbs presented in this dissertation as a single coherent class.

Therefore, I argue that barrier verbs constitute a natural, coherent class of verbs with the unique cluster of syntactic, semantic, and lexical properties found in Table 1 (forgive the fuzzy old fashioned MS Word image):

It is the lexical preferences for complement type that intrigue me more than any other fact about these verbs. For example, prevent almost always occurs with an ING complement but can in fact occur with an NP. This is not strictly a selectional restriction because violations are possible, acceptable, and grammatical (and non-metaphorical); they are just low frequency. I'll post more about this later, but it's juicy and weird and cool.

The following verbs are argued to be ‘core’ members of the class because they contain basic barrier verb semantics in their default lexical entries:

ban, bar, barricade, block, deflect, detain, discourage, enjoin, exclude, exempt, guard, hamper, hinder, interrupt, obstruct, occlude, protect, screen, shield, pre-empt, prevent, prohibit, restrain, restrict, thwart

As will be seen below, the construction is productive and many more verbs can take on barrier semantics. As attested by corpus evidence, the following 64 verbs can all occur in the construction with from and some allow against (not only is it OK if you find that some of these are not obviously barrier verbs at first glance, but in fact, that's the point! The construction coerces non-barrier verbs into the barrier verb semantics):

avert, ban, barricade, defend, deflect, derail, detain, exclude, exempt, guard, harbor, hide, insulate, occlude, protect, relegate, screen, secure, shackle, shield, avoid, bar, block, bond, check, constrain, curb, delay, deter, disable, discourage, embarrass, encumber, enjoin, foil, forestall, freeze, frustrate, halt, hamper, hamstring, handicap, hinder, hold, impair, impede, inhibit, interrupt, invalidate, keep, occlude, obstruct, obviate, outlaw, preclude, pre-empt, prevent, prohibit, proscribe, restrain, restrict, retard, staved-off, stay, stop, stymie, suppress, thwart.

More to come...

Tuesday, August 13, 2013

a verb class only a cognitive semanticist could love

Continuing my walk down dissertation memory lane (walk #1 here), this time I revisit the semantics of barrier verbs (and I remind you that this is largely a cut and paste job from my draft chapter on semantics).

Here is my set of "core" barrier verbs (I'll explain later how I distinguish between core members of a verb class and peripheral members as this was a topic of great interest to me).

ban, bar, barricade, block, deflect, detain, discourage, enjoin, exclude, exempt, guard, hamper, hinder, interrupt, obstruct, protect, pre-empt, prevent, prohibit, restrain, restrict, screen, shield, thwart

Note that neither stop nor keep are in the core set, yet either can easily be coerced into the barrier verb class. A keen spidey sense for semantics might also alert you to the fact that there are two sub-classes within that list: protect versus prevent. Oh, sooo much to discuss there. Too much for now, but yes, semantic madness lies that way.

My linguistics dissertation grew out of work by Len Talmy, so I’ll begin with a brief overview of his work on these verbs. Len wrote a 40 page monograph on this verb class and I may in fact possess the only extant copy. I really should scan that. I'll show in a later post how this semantic description impacts the syntactic construction that barrier verbs often occur in, as well as how the semantics impacts some quirky frequency facts. But for now, on with cognitive semantics!

English barrier verbs are causative object control verbs* which encode the relationships between a goal-directed participant (or “agonist” in Len's terms), its goal, and a barrier participant (or “antagonist”). Situations involving barriers are more nuanced than simply one thing being in between two other things: a barrier necessarily impedes the motion of one of the things it is in between, so barrier situations require motion as well. However, we will see that this motion can be extended metaphorically to intentions and goals, if not many other things. If an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb. Some examples:

Physical Blocking 
The fence blocked the car from entering the driveway

Intentional Exclusion 
The club excluded me from membership

Speech Act Pronouncement 
The judge banned journalists from the courtroom

Spybot protected my computer from a virus.

In the examples above, there is an explicit goal directed agonist (the car, me, journalists, a virus) and a goal (the driveway, membership, the courtroom, my computer). Only in the first sentence is there a physical barrier (the fence). In the other sentences there is an implied barrier (the club’s power to exclude, the judge’s ban, Spybot). But in all cases, the barrier interferes with the goal-directed agonist's ability to achieve its intended outcome.

But interference alone is not enough to properly distinguish a barrier situation from a simple in-the-path situation that a verb like place evokes in a sentence like this one:

John placed the table between Chris and the kitchen. 

In this case, to place does not necessarily evoke the notion of interfering with goal-directed motion. One would have to infer (perhaps via Gricean maxims) that Chris wants to enter the kitchen in order to derive a barrier interpretation of this sentence. But that notion is not entailed by the verb place; it is at best added via inference. A member of the barrier verb class should entail the notion that the agonist is goal directed (via motion or metaphorical extensions of motion). Therefore, the two end points must have a particular relationship to each other. Namely, one end point participant must be moving towards the other, or have some sort of tendency towards the other end point (this use of tendency is adapted from Talmy).

Talmy assumes a model of barrier dynamics in which there are three salient participants: A GOAL-directed Agonist X, a barrier-forming Antagonist Y, and a GOAL Z. Figure 1 (Talmy loves figures) represents this state of affairs where the arrow represents the X participant’s tendency towards the Z participant.

Talmy also recognizes the potential inclusion of a SOURCE entity (“an object at which the Agonist begins its path” (unpublished manuscript: 2)), but it is only the three salient X, Y, Z entities which form the necessary basic structure of barrier dynamics.

Note that the inclusion of a GOAL distinguishes this set of situations from simple impeded motion, lexicalized in such English verbs as “stop”:
  • I stopped the lawnmower.
  • The judge banned journalists.
In the first case, the lawnmower is not encoded with any inherent GOAL by virtue of the meaning of stop. This could simply mean that the lawnmower was turned off. Note, however, that adding a complement with the preposition from adds the notion of GOAL:
  • I stopped the lawnmower from destroying the flowers.
In the second example above, the verb ban encodes journalists with an inherent GOAL (presumably the judge’s courtroom, as it would be pragmatically odd for a judge to ban journalists from her kitchen). Is this inherent tendency towards a goal a presupposition, an invited inference, or a semantic entailment? Those arguments must wait for later.

What's crucial is that this tendency toward a GOAL is part of what constitutes the barrier situation and hence acts as a distinguishing feature which separates these verbs from other stopped-motion verbs like stop.

It is important to recognize that this set of situations is not just a possible set of events in the world but is actually lexicalized in certain English verbs, forming a natural class (there are cognate classes in Dutch and German and probably other languages). Identifying the properties of the verbs in this class (particularly with respect to the prepositions from and against) will require empirical, corpus based methods that will be the subject of later posts.

But what will really blow your mind is when I post about the difference between these two sentences:
Chris kept the dogs barking.
Chris kept the dogs from barking.
Chew on that for a while.

*Maybe. I recognize that calling a verb a causative object control verb requires you to believe in such things as causative object control verbs, and some do not. There's really no escaping at least some theoretical stipulations.

Monday, August 12, 2013

On Ennui and Verb Classification Methodologies

Linguists and NLPers alike love word classes, especially verb classes. But linguistic categories are tricky little buggers. They drove me to a deep ennui which led me out of academia and into industry.

Nonetheless, I occasionally retrace my old steps. Recently, I stumbled across an old chapter from my failed dissertation on verb classes and wondered if this little table of mine still holds water:
Here was the motivation (this is a cut and paste job from a draft chapter, largely unedited. Anyone already familiar with standard verb classification can easily skim away): The general goal of any verb classification scheme is to group verbs into sets based on similar properties, either semantic or syntactic. For linguists, the value of these classifications comes from trying to understand how the human language system naturally categorizes verbs within the mental lexicon (the value may be quite different for NLPers). One assumes that the human language system includes some categorical association between verbs within the mental lexicon and one attempts to construct a class of verbs that is consistent with those mental lexicon associations.

Verbs can be categorized into groups based on their semantic similarity. For example, the verbs hit, punch, kick, smack, slap could all be categorized as verbs of HITTING. They could also be grouped based on constructions. For example, verbs like give and send occur in both the ditransitive and double object constructions:
Ditransitive
Chris gave the box to Willy.
Chris sent the box to Willy.
Double Object
Chris gave Willy the box.
Chris sent Willy the box.
Verb classes have long been a central part of linguistics research. However, any set of naturally occurring objects can be carved into different sub-groups using different criteria or features. The unfortunate truth is that we don’t really know how the mental lexicon is organized (this is not to say that patterns of relations have not been found using, say, priming experiments, language acquisition, or fMRI. They have. But the big picture of mental lexicon organization remains fuzzy, if not opaque). Therefore, all verb classifications are speculative and all verb classification methodologies are experimental. Two key challenges face the verb classification enterprise:
  1. Identify the natural characteristics of each class (e.g., defining the frame)
  2. Identify the verbs which invoke the frame (e.g., which verbs are members of the class)
But how do we overcome these two challenges? There is, as yet, no standard method for doing either. Most verb classification projects to date have employed some combination of empirical corpus data collection, automatic induction (e.g., k-means clustering), psycholinguistic judgment tasks, and old-fashioned intuition. Nonetheless, in recent years certain best practices have emerged and appear to be evolving into a de facto standard.
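To make the "automatic induction" option concrete, here is a minimal, self-contained Python toy of k-means clustering over verbs. Caveats: this is not code or data from the dissertation; the verbs, the two syntactic frames, and all the counts are invented purely for illustration.

```python
import math
import random

# Hypothetical counts of how often each verb occurs in two syntactic
# frames: (from-ING complement, against-NP complement). Invented numbers.
frame_counts = {
    "prevent": (2152, 5),
    "stop":    (1580, 12),
    "protect": (310, 890),
    "defend":  (40, 760),
    "guard":   (25, 410),
}

def unit(vec):
    """Scale a count vector to unit length so verbs of very different
    overall frequency become comparable."""
    norm = math.sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec)

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat. Returns one cluster
    label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: dist2(p, centroids[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(c) / len(c) for c in zip(*members))
    return labels

verbs = list(frame_counts)
points = [unit(frame_counts[v]) for v in verbs]
labels = kmeans(points, 2)
clusters = {}
for verb, lab in zip(verbs, labels):
    clusters.setdefault(lab, set()).add(verb)
print(sorted(map(sorted, clusters.values())))
# → [['defend', 'guard', 'protect'], ['prevent', 'stop']]
```

The induced split (from-ING verbs vs. against-NP verbs) falls out of the frame distributions alone, which is exactly the appeal, and the danger, of automatic induction: the clusters are only as meaningful as the features you feed in.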

This emerging de facto standard includes a mixture of intuitive reasoning (about verbs, their meaning, and their relationships to each other) and corpus analysis (e.g., frequencies, collocations). Below is a table detailing methods of verb classification and some of the major researchers associated with the methods:

But how do we know if our speculations about a verb class are "correct" (in the sense that a proposed class should be consistent with a class assumed to exist in the mental lexicon)? The quick answer is that we don’t. Without a better understanding of the mental lexicon, we are left to defend our classes based on our methods only: proposed verb class A is good to the extent that it was constructed using sound methods (a somewhat circular predicament). We also have cross-validation testing methods available. If my class A contains most of the same verbs that your class B contains (using different methods of constructing the classes) this suggests that we have both identified a class that is consistent with a natural grouping. Finally, via consensus, a certain classification can emerge as the most respected, quasi-gold standard classification and further attempts to create classes can be measured by their consistency with that gold standard.
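The cross-validation idea above can be made concrete with a simple overlap measure. A minimal sketch using the Jaccard index; the two verb classes are hypothetical stand-ins, not memberships from any published scheme:

```python
# Two hypothetical barrier-verb classes built by different methods;
# the memberships are invented for illustration.
class_a = {"prevent", "stop", "ban", "prohibit", "protect"}
class_b = {"prevent", "stop", "bar", "prohibit", "protect", "shield"}

def jaccard(a, b):
    """Overlap between two verb classes: |A intersect B| / |A union B|.
    1.0 means identical classes; 0.0 means disjoint."""
    return len(a & b) / len(a | b)

print(round(jaccard(class_a, class_b), 2))  # → 0.57
```

A high score suggests the two methods converged on the same natural grouping; a low score suggests at least one of the methods (or the assumed class itself) is suspect.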

The closest thing to a gold standard for English verb classes is the Berkeley FrameNet project. FrameNet is perhaps the most comprehensive attempt to hand-create a verb classification scheme that is consistent with natural, cognitively salient verb classes. It is based on painstaking annotation of naturally occurring sentences containing target words.

But even FrameNet is ripe for criticism. It's not good at distinguishing exemplar members of a verb class from coerced members, save by arbitrary designation.

For example, I was working on a class of verbs evoking barrier events like prevent, ban, protect. What was curious in my research was how some verbs had a strong statistical correlation with the semantics of the class (like prevent and protect), yet there were others that clearly appeared in the proper semantic and syntactic environments evoking barriers, but were not, by default, verbs of barring. For example, stop. The verb stop by itself does not evoke the existence of a barrier: "Chris stopped singing," or "It stopped raining." Neither of those two events involves a barrier to the singing or raining. Yet in "Chris stopped Willy from opening the door" there is now a clear barrier meaning evoked (yes yes, the from is crucial. I have a whole chapter on that. What will really blow your mind is when you realize that from CANNOT be a preposition in this case...).

The process of coercing verbs into a new verb class with new meaning was a central part of my dissertation. Damned interesting stuff. I found some really weird examples, too, like "Chris joked Willie into going to the movie with us," meaning Chris used the act of joking to convince Willie to do something he otherwise would not have done.

Sunday, August 11, 2013

Linguistic Curiosities in 'Elysium'

Matt Damon's latest hit movie Elysium has a few linguistic oddities worth pointing out. The film takes place in a dystopian future set in 2154.
  • Jodie Foster's weird accent. She speaks French occasionally in the movie, but when she speaks English, she affects a weird accent that is un-placeable, inconsistent, and off-putting. A good director needs to tell a star like Foster that it's just not working: go back to your real voice.

  • Matt Damon's inexplicable bilingualism. His character "Max" is shown growing up speaking Spanish, surrounded by Spanish speakers. The boy inexplicably starts speaking English in one scene. When we meet the adult Max, his English is fluent. One could make the argument that he learned English somewhere in the missing years the movie doesn't show. But here's the thing: we get to hear all of his friends speaking English too, and they all speak with Spanish accents! Max is the only one who manages to grow up in that Spanish-dominant culture and yet speak flawless English.

  • Speech synthesis straight out of 1998. Damon has an early scene with a robo-parole officer that speaks in a stilted, halting robo-voice reminiscent of speech synthesis that's already ten years out of date. This is the same movie that depicts medical science as being so advanced that a machine can diagnose and cure any disease in seconds. Harumph...

  • Horrible dialogue dubbing (non-linguistic, but worth pointing out). Particularly for Jodie Foster, the sound dubbing was awful, destroying any suspension of disbelief I managed to retain in spite of the many ridiculous moments in the film. Not gonna win any sound editing awards anytime soon.
Dystopian futures are a peculiar genre in filmmaking. From cheap Aussie films like Mad Max to slick Hollywood blockbusters like Blade Runner, they've been a staple of filmmakers who want to make a social statement while also giving the audience a fun romp. But for a dystopian film to really work, it needs to care about creating believability in at least three broad areas: 1) the visual world, 2) the social world, and 3) the plot. Sadly, most movies devote all their time to 1 and precious little to 2 and 3. Elysium is clearly one of the many that put all their budget into look and feel and none into the story.

While Neill Blomkamp managed to create a visually beautiful world, the story is pure shit. It's the worst kind of liberal stereotype, where every rich person is an evil sociopath and every poor person is a good-hearted victim. It's too bad, because there is a core of truth to the movie. There really is a tremendous wealth gap and there really are tragic inequalities, but this movie uses these facts as little more than a cheap backdrop to a thin beat-'em-up thriller while pretending to be a socially conscious movie.

But here's the thing: there are no lessons in this movie. You won't gain a deeper understanding of anything. You won't have an "a-ha" moment. It's Fox News for liberals, and it's just as patronizing and empty. It's clear that Blomkamp put all his energy into the look of the film and none into the story. There is tremendous nuance and detail in every item of clothing and every object, but the social structure is barely a cartoon and the plot is wafer thin (crucially depending on a series of coincidences, as bad action films so often do). Ultimately, it was not really worth it. I'd like my $7 matinee ticket back, please.

Thursday, August 1, 2013

in the dark heart of a language model lies madness....

This is the second in a series of posts detailing experiments with the Java Graphical Authorship Attribution Program. The first post is here.

In my first run (seen above), I asked JGAAP to normalize white space, strip punctuation, and turn everything into lowercase. Then I had it run a Naive Bayes classifier on the top 50 tri-grams from the three known authors (Shakespeare, Marlowe, Bacon) and one unknown author (Shakespeare's sonnets).

Based on that sample, JGAAP came to the conclusion that Francis Bacon wrote the sonnets. We know that because it lists its guesses in order from best to worst in the left window in the above image, and Bacon is on top. This alone is cause to start tinkering with the model, but the results didn't look flat-out weird until I looked at the image again today. It lists the probability that the sonnets were written by Bacon as 1. A probability of 1 typically means absolute certainty. So this model, given the top 50 tri-grams, is absolutely certain that Francis Bacon wrote those sonnets ... Bullshit. A probabilistic model is never absolutely certain of anything. That's what makes it probabilistic, right?
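One speculative explanation for that perfect 1.0 (I don't know JGAAP's internals, so this is a guess, not a diagnosis): Naive Bayes multiplies one small probability per feature, and in double-precision floating point a few hundred such multiplications drive every author's score toward zero; after normalization, the least-tiny score can swallow all the probability mass. A toy Python illustration with invented numbers:

```python
# Invented per-feature likelihoods for three candidate authors; these
# are NOT JGAAP's numbers, just plausible-looking stand-ins.
per_feature = {"Bacon": 0.12, "Shakespeare": 0.10, "Marlowe": 0.09}

def naive_posterior(likelihoods, n_features=300):
    """Multiply each author's per-feature likelihood n_features times
    (the naive independence assumption), then normalize. The raw scores
    land around 1e-277 to 1e-314, so after dividing by the total, the
    winner rounds to exactly 1.0 in double precision."""
    raw = {a: p ** n_features for a, p in likelihoods.items()}
    total = sum(raw.values())
    return {a: r / total for a, r in raw.items()}

post = naive_posterior(per_feature)
print(post["Bacon"])  # prints 1.0, even though the model is "probabilistic"
```

The standard fix is to sum log probabilities instead of multiplying raw ones; then the scores stay finite and the normalized posterior only approaches 1.0 when one author genuinely dominates.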

So where's the bug? Turns out, it might have been poor data management on my part. I didn't bother to sample in any kind of fair and reasonable way. Here are my corpora:

Known Authors
  • Bacon - (2 works) - 950 KB
  • Marlowe (Works vol 3) - 429 KB
  • Shakespeare (all plays) - 4.4 MB
Unknown Author
  • (Sonnets) - 113 KB
Clearly, I provided a much larger Shakespeare data set than any of the others. However, keep in mind that JGAAP only used the 50 most common tri-grams from each of these corpora (if I understand their Event Culling tool properly). Is the disparity in corpus size relevant if I'm also sampling just the top 50 tri-grams? Just how different would those tri-grams be if the corpora were equivalent? Let's find out.
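To get a feel for what a "top 50 tri-grams" sample even looks like, here's a minimal Python sketch. Caveats: this is my reconstruction, not JGAAP code; I'm assuming character tri-grams over normalized text, and JGAAP may define both the normalization and the culling differently.

```python
from collections import Counter

def top_trigrams(text, n=50):
    """Roughly mimic the Round 1 preprocessing: lowercase, strip
    punctuation, collapse whitespace; then count overlapping character
    tri-grams and keep the n most frequent."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    cleaned = " ".join(kept.split())
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    return [gram for gram, _ in counts.most_common(n)]

sample = ("Shall I compare thee to a summer's day? "
          "Thou art more lovely and more temperate.")
print(top_trigrams(sample, 5))
```

Run over a full corpus, a list like this is dominated by high-frequency function-word fragments (the, and, spaces around short words), which is part of why the top 50 may not change much even when the corpus sizes differ wildly.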

The Infinite Madness of Language Models 
As far as I can tell, the current version does not have any obvious way of turning on error-reporting logs (though I suspect that is possible, if one had the source code). It also offers no way of printing the features it's using. I'd love to see a list of those top 50 tri-grams for each author. But as of right now, it does not appear to support that. I'll add that to my enhancement requests. However, JGAAP is fast enough to simply run several trial-and-error runs and compare output. My goals are 1) get JGAAP to guess Shakespeare as the unknown author with a high degree of certainty and 2) figure out why it gave such a high confidence score to Bacon during round one.

Here are the results of several follow up experiments. Mostly, I want to tune the language model - in the parlance of JGAAP, Event Drivers (linguistic features) + Event Culling (sampling) = a language model (unless I'm misunderstanding something).

Round 2: Same specifications as Round 1. I used all of the same corpora, except I replaced Shakespeare with a sample of about 500 KB to bring it in line with the others. Then I repeated the analysis using all the same parameters. This time ... drum roll ... Bacon still wins in a landslide. JGAAP remains absolutely confident that Bacon wrote those sonnets.

Round 3: Okay. Let's expand the set of tri-grams. Same everything else as Round 2, but now I'll use the top 100 tri-grams.

D'oh! Well, it's less confident that Marlowe is involved (drunk bastard).

Round 4: For good measure, let's expand the set of tri-grams again. Same everything else as Rounds 2 and 3, but now I'll use the top 200 tri-grams.


Okay, it appears that adding more tri-grams alone gives us nothing. I feel confident dropping back down to 100. Now, I'll add one simple feature - Words (I assume this is a frequency list; again, the Event Culling will choose just the top 100 most frequent words, as well as the top 100 tri-grams, if I'm understanding this right).

We have a winner! The top score above shows that for Words, Shakespeare finally wins (though he still loses on Ngrams, the second highlighted score). As a comparison, I threw in another feature, Rare Words.

No help. My interpretation of these results is that the feature "Words" is the best predictor of Shakespearean authorship (given this set of competing authors with these tiny corpora).

But this is a stacked-deck experiment. I know perfectly well that the "Unknown Author" is Shakespeare. I'm just playing with linguistic features until I get the result I want. The actual problem of determining unknown authorship requires far more sophisticated work than what I did here (again, read Juola's detailed explanation of what he and his team did to out J.K. Rowling).

Nonetheless, I could imagine not sleeping for several days just playing with the different combinations of features to produce different language models just to see how they move the results (mind you, I didn't play with the classifier either, which adds its own dimension of playfulness).

Herein lies the value of JGAAP. More than any other tool I have personally seen, JGAAP gives the average person an easy-to-use platform to splash around and play with language in an NLP environment. When thinking about my first two experiences with JGAAP, the most salient word that jumps out at me is FUN! It's just plain fun to play around. It's fast and simple and fun. I can't say that about R, or Python, or Octave. All three of those are very powerful toolsets, but they are not fun. JGAAP is fun. It's a playground for linguists. Let me note that I beta tested a MOOC for WEKA last March and was very impressed with their interface as well (though I think JGAAP does a better job of making language modeling easy ... and that's the fun part for linguists anyway).

I am reminded of what several Twitter friends have said to me when I say that I'm a cognitive linguist: "Really! I never would have known by your Twitter feed." That's a wake-up call for me. I have been involved in NLP since roughly 2000, but my passion is definitely the blood and guts of language and linguistics. JGAAP appeals to that old linguistics fire in my belly. It makes me want to play with language again.
