determining a plain text file
dsc at swcp.com
Mon Jul 10 13:20:57 CDT 2006
On Jul 9, 2006, at 12:59 AM, Scott Morrow wrote:
> Does anyone have a method for determining whether a file is plain
> text that they would be willing to share?
I don't think "plain text or not" is the right question. How sure do
you want to be? Being very sure can take a lot of processing.
Do you mean plain text vs binary? Plain text vs RTF? Plain text
ASCII vs plain text UTF-8?
For example: I have a function I use that tries to "guess" the
Unicode encoding form of a file. My approach is not to ask "is this
file in this format?" but "is this more likely this format than the
others under consideration?" (That gets hard in some perverse cases of
UTF-16BE vs UTF-16LE. Brag: my Unicode recognizer code beats my
Microsoft programs at encoding guessing.) I have a few hard rules to
handle the easy cases, but for the most part I build up evidence
points and then compare.
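
To make the "evidence points" idea concrete, here is a minimal sketch
in Python. It is not the recognizer described above (that code is not
shown); the candidate list, the weights, and the name guess_encoding
are all assumptions chosen for illustration. Byte order marks are the
hard rules for the easy cases, and everything else adds or subtracts
points before the candidates are compared. UTF-32 is ignored.

```python
import codecs

# Hypothetical candidate set; a real recognizer would consider more.
CANDIDATES = ("ascii", "utf-8", "utf-16-le", "utf-16-be")


def guess_encoding(data: bytes) -> str:
    """Return the candidate that the evidence favours most."""
    # Hard rules: a byte order mark settles the easy cases outright.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if not data:
        return "ascii"

    scores = dict.fromkeys(CANDIDATES, 0.0)

    # Evidence: only 7-bit bytes and no NULs looks like plain ASCII.
    if all(b < 0x80 for b in data) and 0 not in data:
        scores["ascii"] += 3
    else:
        scores["ascii"] -= 5

    # Evidence: the sample decodes cleanly as UTF-8 (illustrative weights).
    try:
        data.decode("utf-8")
        scores["utf-8"] += 2
    except UnicodeDecodeError:
        scores["utf-8"] -= 5

    # Evidence: in mostly-Latin UTF-16 text every other byte is NUL.
    # Little-endian puts the NULs at odd offsets, big-endian at even ones.
    pairs = max(1, len(data) // 2)
    even_nuls = data[0::2].count(0)
    odd_nuls = data[1::2].count(0)
    scores["utf-16-le"] += 10 * (odd_nuls - even_nuls) / pairs
    scores["utf-16-be"] += 10 * (even_nuls - odd_nuls) / pairs

    # Compare: the highest score wins; ties go to the earlier candidate.
    return max(CANDIDATES, key=scores.get)
```

The function never answers "yes" or "no" for a single format. It only
ranks the candidates against each other, so adding a new rule means
adding one more source of points rather than reworking a decision tree.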
Also, I don't look at the whole file (except in some special cases).
I look at only the characters near the end and near the front. That
puts an upper bound on determination time.
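
Sampling only the two ends of a file could look something like the
Python sketch below. The helper name sample_ends and the 4096-byte
chunk size are assumptions, not values from this post. The two pieces
are returned separately so that a caller can examine each one without
gluing together bytes that were never adjacent in the file.

```python
import os


def sample_ends(path, chunk=4096):
    """Return (head, tail) byte samples from the front and end of a file.

    At most 2 * chunk bytes are read, however large the file is.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= 2 * chunk:
            # Small file: the samples would overlap, so read it all once.
            return f.read(), b""
        head = f.read(chunk)
        # Jump straight to the last `chunk` bytes without reading the middle.
        f.seek(-chunk, os.SEEK_END)
        tail = f.read(chunk)
    return head, tail
```

Because the amount read is capped at twice the chunk size, the cost of
the guess stays roughly constant for a 10 KB file and a 10 GB file.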
Is the question really "Should I dump this into a field, or should I
convert to hex first?"