determining a plain text file

Cubist at aol.com Cubist at aol.com
Sun Jul 9 20:45:33 EDT 2006


In a message dated 7/9/06 11:38:23 AM, <scott at elementarysoftware.com> writes:
>Does anyone have a method for determining whether a file is plain  
>text that they would be willing to share?
   This is not a simple question to answer. Consider that a *web page* is 
plain text -- what makes it a "web page" is what a browser does with/to it when 
you run across it in the course of your websurfing. So perhaps it might be 
appropriate for you to explain what *you* mean when you say "plain text"? 
Depending on your definition of "plain text", the method of detecting it may well 
vary...
   That said, here are a couple of possible methods which, even if they don't 
do what you want, may help set you on the right road to finding your answer...

# possible answer 1: what's the file extension?
function IsItText1 TheFilename
  # all we care about here is the *name* of the file

  set the itemDelimiter to "."
  put item -1 of TheFilename into Fred
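  # "text" and "txt" are the most common extensions denoting
  # text files; if you know of any others, you can add them in, too
  put "text,txt" into TextExtensions
  # the list above is comma-separated, so switch the itemDelimiter
  # back to a comma before looping through it
  set the itemDelimiter to ","
  repeat for each item ThisExt in TextExtensions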
    if Fred = ThisExt then return true
  end repeat
  return false
end IsItText1

# possible answer 2: does the file contain weird characters?
function IsItText2 TheText
  # assumes that you've already read the file from disc,
  # and are fiddling with the file's content

  put the length of TheText into OldLength
  # garden-variety ASCII text only has characters in it whose
  # ASCII code numbers are 127 *or less*. thus, if there's
  # anything in there with an ASCII code number *greater than 127*,
  # it's prolly not "plain text"
  repeat with K1 = 128 to 255
    put numToChar (K1) into BadChar
    replace BadChar with "" in TheText
  end repeat
  put the length of TheText into NewLength
  return (OldLength = NewLength)
  # if OldLength is the same as NewLength, this will return "true";
  # otherwise, it returns "false". since the only way NewLength *can*
  # be different from OldLength is if some characters got nuked
  # in the loop, you'll get The Right Answer here
end IsItText2

   Neither of these functions is perfect; both of them can be fooled, whether 
by intent or by accident. Suppose some joker slapped the name 
"Budget2006.txt" onto an Excel spreadsheet file, for instance; the IsItText1 function above 
would say "Yes, it's a text file, alright", but IsItText2 would *not* be so 
fooled. As for IsItText2, *that* function will turn up its nose at any file 
which contains curly-quotes rather than straight-quotes, which means that yes, 
there are genuine, honest-to-God *text files* which IsItText2 will *wrongly* deem 
"not plain text".
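   One way to stop penalizing curly-quotes is to look for the wrong *low* 
characters instead of the high ones. What follows is only a sketch, not a 
definitive test: the name IsItText3 is made up for illustration, and it rests 
on the assumption that tab, linefeed, formfeed, and return are the only control 
characters a text file should contain. It also assumes the file was read as raw 
bytes (say, via a "binfile:" URL), and it can be fooled like the others, e.g. by 
a binary file that happens to contain no control characters.

# possible answer 3 (a sketch): does the file contain control characters?
function IsItText3 TheText
  # assumes that you've already read the file from disc,
  # and are fiddling with the file's content

  # ASSUMPTION: 9 (tab), 10 (linefeed), 12 (formfeed), and 13 (return)
  # are the only control characters allowed in "plain text"
  put "9,10,12,13" into OkayCodes
  repeat with K1 = 0 to 31
    if K1 is among the items of OkayCodes then next repeat
    # any other control character suggests binary data
    if numToChar (K1) is in TheText then return false
  end repeat
  return true
end IsItText3

# example of use; TheFilePath is a hypothetical variable
# which holds the full path to the file in question
put URL ("binfile:" & TheFilePath) into TheText
if IsItText3 (TheText) then answer "Looks like plain text to me"
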
   Again, once you know what *you* consider a "plain text file" to be, it'll 
be easier to come up with a solution.

   Hope this helps...


