determining a plain text file

Dar Scott dsc at swcp.com
Mon Jul 10 14:20:57 EDT 2006


On Jul 9, 2006, at 12:59 AM, Scott Morrow wrote:

> Does anyone have a method for determining whether a file is plain  
> text that they would be willing to share?

I don't think "plain text or not" is the right question.  How sure do
you want to be?  Being very sure can take a lot of processing.

Do you mean plain text vs binary?  Plain text vs RTF?  Plain text  
ASCII vs plain text UTF-8?

For example:  I have a function I use that tries to "guess" the
Unicode encoding form of a file.  My approach is not to ask "is the
file in this format?" but "is this format more likely than the others
under consideration?".  (That gets hard in some perverse cases of
UTF-16BE vs UTF-16LE.  Brag:  My Unicode recognizer code beats my
Microsoft programs at encoding guessing.)  I have a few hard rules to
handle the easy cases, but for the most part I build up evidence
points for each candidate and then compare the totals.
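
To illustrate the evidence-point idea, here is a minimal sketch in
Python.  It is not the recognizer described above; the function name
guess_encoding, the three candidates, and the point weights are all
assumptions chosen for illustration.  Byte-order marks act as the hard
rules, and everything else just adds or subtracts points.

```python
import codecs

def guess_encoding(sample: bytes) -> str:
    """Return the candidate with the most evidence points.

    Hypothetical sketch: the candidates and weights are illustrative only.
    """
    # Hard rules for the easy cases: a byte-order mark settles it.
    if sample.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if sample.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if sample.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"

    scores = {"utf-8": 0, "utf-16-be": 0, "utf-16-le": 0}

    # Evidence for UTF-8: the sample decodes without error.
    try:
        sample.decode("utf-8")
        scores["utf-8"] += 2
    except UnicodeDecodeError:
        scores["utf-8"] -= 5

    # Evidence for UTF-16: ASCII-range text leaves a zero byte in one
    # half of every 16-bit code unit. Zeros at even offsets suggest
    # big-endian, zeros at odd offsets suggest little-endian.
    even_zeros = sample[0::2].count(0)
    odd_zeros = sample[1::2].count(0)
    scores["utf-16-be"] += even_zeros
    scores["utf-16-le"] += odd_zeros

    # Zero bytes are rare in real UTF-8 text, so they count against it.
    scores["utf-8"] -= even_zeros + odd_zeros

    # Compare the totals instead of asking yes/no about any one format.
    return max(scores, key=scores.get)
```

The point of the design is that no single test has to be conclusive.
A real recognizer would need more kinds of evidence, especially for
the UTF-16BE vs UTF-16LE cases where the text is not mostly ASCII.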

Also, I don't look at the whole file (except in some special cases).   
I look at only the characters near the end and near the front.  That  
puts an upper bound on determination time.
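
Sampling only the two ends might look like the following sketch,
again in Python.  The name sample_ends and the 4096-byte chunk size
are assumptions, not values taken from the code described above.

```python
import os

def sample_ends(path: str, chunk: int = 4096) -> tuple[bytes, bytes]:
    """Read at most `chunk` bytes from each end of the file.

    Hypothetical helper: the chunk size is an arbitrary illustration.
    """
    with open(path, "rb") as f:
        head = f.read(chunk)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Start the tail after the head so short files are not read twice.
        tail_start = max(len(head), size - chunk)
        f.seek(tail_start)
        tail = f.read(chunk)
    return head, tail
```

However large the file is, at most 2 * chunk bytes are read.  The two
pieces are returned separately because joining them could shift the
byte alignment that a UTF-16 check depends on.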


Is the question "Should I dump this into a field or should I convert  
to hex first?" ?

Dar Scott



