Saturday, December 5, 2009

Outsourcing Fact Checking

Paul Spinrad, guest blogging at Boing Boing, floats the idea of outsourcing fact checking (I'll support any proposal whatsoever that improves the fact-checking process, believe me), but he adds the notion of, in essence, crowdsourcing linguistic annotation:

Now, what if these fact-checkers didn't just vet and correct the text? While they dig into the logic and accuracy of everything, as usual, they could also use some simple application to diagram the sentences and disambiguate the semantics into a machine-friendly representation. Just a little extra clicking, and they could bind all the pronouns to their antecedents, and select from a dropdown box to specify whether an instance of the string "Prince" refers to the musician Prince or to Erik Prince-- the president of XE, the company formerly known as Blackwater-- within an article that for whatever reason mentions both of them.

I have zero interest in diagramming sentences, mind you (it's a dated and frankly messy pseudo-logical method of representing the syntax of a sentence), but there is a good idea at the core. While it's true that the web has given us greater access to large corpora, these corpora remain unstructured text. I'd like to see larger parsed corpora available (like the BNC).

With minimal training, editors and fact checkers could mark up text with simple phrase boundaries and labels (this is an NP, this is a VP), as well as PP-attachment ambiguity, co-reference, and so on. There would be messiness in this approach too, but Breck Baldwin has noted that this can be done effectively (for recall, at least), and that the major issue is adjudicating the error rate of a set of crowd-sourced raters (see my previous post here and Baldwin's original post here). Having an expert check a small random sample of the annotations could estimate that error rate nicely.
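To make the sampling idea concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: the `Span` record, the `estimate_error_rate` function, and the toy sentence are hypothetical, not anything Spinrad or Baldwin proposed. It stores each annotation as a labeled token span, draws a random sample of the crowd's spans, checks them against an expert's judgments, and reports the observed error rate with a rough 95% interval (normal approximation, which is crude for small samples).

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One phrase annotation over a tokenized sentence (hypothetical format)."""
    start: int  # index of the first token in the phrase
    end: int    # index one past the last token
    label: str  # phrase label, e.g. "NP", "VP", "PP"


def estimate_error_rate(crowd_spans, expert_spans, sample_size, seed=0):
    """Estimate the crowd's error rate from an expert-checked random sample.

    Returns (rate, low, high), where low/high bound a rough 95% interval
    based on the normal approximation to the binomial.
    """
    if not crowd_spans:
        raise ValueError("no crowd annotations to sample from")
    rng = random.Random(seed)
    n = min(sample_size, len(crowd_spans))
    sample = rng.sample(list(crowd_spans), n)
    # A sampled span counts as an error if the expert did not produce it.
    errors = sum(1 for span in sample if span not in expert_spans)
    rate = errors / n
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# Toy example: "the editor checked the article with care"
tokens = "the editor checked the article with care".split()
expert = {Span(0, 2, "NP"), Span(2, 7, "VP"), Span(3, 5, "NP"), Span(5, 7, "PP")}
# The crowd got the boundaries right but mislabeled the PP as an NP.
crowd = [Span(0, 2, "NP"), Span(2, 7, "VP"), Span(3, 5, "NP"), Span(5, 7, "NP")]

rate, low, high = estimate_error_rate(crowd, expert, sample_size=4)
print(f"estimated error rate: {rate:.2f} (95% interval {low:.2f} to {high:.2f})")
```

In practice the expert would only judge the sampled spans rather than annotate everything, and a real setup would want far more than four items before trusting the interval.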
