determining a plain text file
Cubist at aol.com
Sun Jul 9 20:45:33 EDT 2006
In a message dated 7/9/06 11:38:23 AM, <scott at elementarysoftware.com> writes:
>Does anyone have a method for determining whether a file is plain
>text that they would be willing to share?
This is not a simple question to answer. Consider that a *web page* is
plain text -- what makes it a "web page" is what a browser does with/to it when
you run across it in the course of your websurfing. So perhaps it might be
appropriate for you to explain what *you* mean when you say "plain text"?
Depending on your definition of "plain text", the method of detecting it may well
vary...
That said, here are a couple of possible methods which, even if they don't
do what you want, may help set you on the right road to finding your answer...
# possible answer 1: what's the file extension?
function IsItText1 TheFilename
  # all we care about here is the *name* of the file
  set the itemDelimiter to "."
  put item -1 of TheFilename into Fred
  # "text" and "txt" are the most common extensions denoting
  # text files; if you know of any others, you can add them in, too
  put "text,txt" into TextExtensions
  # switch back to commas, or the loop below would see "text,txt"
  # as one single item and never find a match
  set the itemDelimiter to ","
  repeat for each item ThisExt in TextExtensions
    if Fred = ThisExt then return true
  end repeat
  return false
end IsItText1
# possible answer 2: does the file contain weird characters?
function IsItText2 TheText
  # assumes that you've already read the file from disc,
  # and are fiddling with the file's content
  put the length of TheText into OldLength
  # garden-variety ASCII text only has characters in it whose
  # ASCII code numbers are 127 *or less*. thus, if there's
  # anything in there with an ASCII code number *greater than 127*,
  # it's prolly not "plain text"
  repeat with K1 = 128 to 255
    put numToChar (K1) into BadChar
    replace BadChar with "" in TheText
  end repeat
  put the length of TheText into NewLength
  return (OldLength = NewLength)
  # if OldLength is the same as NewLength, this will return "true";
  # otherwise, it returns "false". since the only way NewLength *can*
  # be different from OldLength is if some characters got nuked
  # in the loop, you'll get The Right Answer here
end IsItText2
Neither of these functions is perfect; both of them can be fooled, whether
by intent or by accident. Suppose some joker slapped the name
"Budget2006.txt" onto an Excel spreadsheet file, for instance; the IsItText1 function above
would say "Yes, it's a text file, alright", but IsItText2 would *not* be so
fooled. As for IsItText2, *that* function will turn up its nose at any file
which contains curly-quotes rather than straight-quotes, which means that yes,
there are genuine, honest-to-God *text files* which IsItText2 will *wrongly* deem
"not plain text".
Again, once you know what *you* consider a "plain text file" to be, it'll
be easier to come up with a solution.
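In the meantime, if you want to see how the two functions above vote on a real file, something like the following sketch might do. The name CheckFile is just something I made up for illustration, and it assumes IsItText1 and IsItText2 live in the same script; it reads the file through a "binfile:" URL so the bytes arrive untranslated.

```livecode
# hypothetical helper: runs both checks on one file and reports the verdicts
function CheckFile ThePath
  # "binfile:" reads the raw bytes, with no line-ending translation
  put URL ("binfile:" & ThePath) into TheData
  # after a URL read, "the result" is non-empty if something went wrong
  if the result is not empty then return "couldn't read file:" && the result
  put IsItText1(ThePath) into NameSaysText
  put IsItText2(TheData) into ContentSaysText
  return "name says text:" && NameSaysText & "; content says text:" && ContentSaysText
end CheckFile
```

When the two verdicts disagree, that's your cue to look closer: it could be a mislabeled file, or it could be honest text with curly-quotes in it.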
Hope this helps...