dot POS files and Corpus Linguistics

Richmond Mathewson richmondmathewson at gmail.com
Tue Apr 27 14:04:08 EDT 2010


  Well, Yippee-doo; the good folks at the University of
Oxford have sent me the files of the
York-Toronto-Helsinki Parsed Corpus of Old English Prose
(try saying that with your mouth full of cornflakes).

Jolly generous considering it is normally restricted to British
Higher Education Institutions (somehow the University of
Plovdiv, Paisii Hilendarski doesn't fit in that category).

HOWEVER; the corpus comes in .pos files which cheeses me
off immensely; on opening them with the redoubtable
TextWrangler they are heavily formatted in some odd fashion
suggesting some sort of meta-tagging.

The Java-based CS_2.002.74.jar, a.k.a. 'CorpusSearch', doesn't run
for some funny reason on ye olde G4 (have yet to try it on the
Ubu-Box); but that doesn't really fuss me as ye olde academics
have decided the parameters of their stuff in advance and my feet
are too big for their shoes (hey; it's mixed metaphors time again).

So; I am looking to build a Runrev data-miner / chewer / masticator
/ whatever; but, until I can work out what a .pos file can be opened with
(so I can hae a keek at its formatting) the whole thing is on standby.
Once I can see what a .pos file should look like in some sort of POS-file
reader I can cobble together a suitably algorithmic sieve to make the
file look like it should inside a text field prior to 'chewin the fat'.
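
For what it is worth, here is a rough sketch of the sort of sieve I have in
mind, in Python rather than Runrev. It ASSUMES each token in a .pos file is a
word glued to its tag by a single separator character (an underscore here);
that is a guess until I can see the real format. The names split_tagged_tokens
and plain_text, and the sample line, are made up for illustration.

```python
# Sketch only: the token layout (word + separator + tag) is an ASSUMPTION,
# not something confirmed from the corpus files themselves.

def split_tagged_tokens(text, sep="_"):
    """Split whitespace-delimited tokens into (word, tag) pairs.

    Tokens that do not contain the separator are kept with an empty tag,
    so nothing is silently dropped.
    """
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator character stays in one piece.
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag))
        else:
            pairs.append((token, ""))
    return pairs


def plain_text(pairs):
    """Join only the words back together, e.g. for display in a text field."""
    return " ".join(word for word, _ in pairs)


if __name__ == "__main__":
    sample = "se_D cyning_N rad_VBD ._."  # made-up line, NOT from the corpus
    pairs = split_tagged_tokens(sample)
    print(pairs)
    print(plain_text(pairs))
```

If the separator turns out to be something else (a slash, say), only the sep
argument should need changing; anything fancier in the files will need a
proper look first.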

Google comes up with unintentionally witty results about 'point of sale'
and so forth, as well as something about Arabic linguistic corpora,
Chinese linguistic corpora and so forth (well, at least they are going
in the right direction).

Having written one of those slimy messages back, where one thanks people
fulsomely and then shoves in the 'However'; I got a "we cannot comment on
other methods of accessing the corpus" message. Well; at least I signed
my name with my second name (Richmond) otherwise I would have had what the
Americans call a 'Dear John' message . . .  :)

Any help re POS-file readers would be most welcome.

sincerely, Richmond Mathewson.


