binary vs. text?
Dar Scott
dsc at swcp.com
Mon Dec 11 18:48:20 EST 2006
On Dec 11, 2006, at 3:09 PM, Chris Sheffield wrote:
> Does anyone have a sure fire way to determine if a file is binary
> or text?
>
> I have need to create an import utility that will import data from
> a text file (csv, tab-delimited, etc) into a database, but I'd like
> to check the file before doing anything else just to make sure it
> is in fact text and not binary.
In general, there is no way.
However, all is not lost.
A text file is a special case of a binary file: its bytes are the
encoded representations of a sequence of characters.
For very short files, it is hard to tell. However, if you have some
idea of the pattern you are expecting you can increase your
confidence that some file is binary or text.
Many file formats have magic words and header data that indicate the
type. These provide a hint, and an additional check can add some
confidence. For example, a magic word plus a required element can
identify a .png file: check whether the file starts with
format("\211PNG\r\n\032\n\000\000\000\015IHDR").
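A minimal sketch of that PNG check, written in Python rather than LiveCode so it can be run as is. The names PNG_PREFIX and looks_like_png are made up for illustration; the byte values are the same ones as in the format() string above (8-byte signature, then the IHDR chunk length of 13, then the chunk type).

```python
# Hypothetical names for illustration only.
# 8-byte PNG signature + 4-byte big-endian chunk length (13) + chunk type "IHDR".
PNG_PREFIX = b"\x89PNG\r\n\x1a\n" + b"\x00\x00\x00\x0d" + b"IHDR"


def looks_like_png(data: bytes) -> bool:
    """Return True if data starts with the PNG signature and an IHDR chunk header."""
    return data.startswith(PNG_PREFIX)
```

Reading the first 16 bytes of the file is enough for this test. A match is only a strong hint; the rest of the file could still be damaged.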
Unicode files often have a BOM at the start, but it is not
required in some cases and shouldn't be there in others. I
have a function I use to differentiate among Unicode files, but that
already assumes I know they are Unicode, and even then it has trouble
with some perverse files. (It does get it right more often than
Microsoft programs do.) UTF-8 also restricts which byte sequences
are valid, so that can help.
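As a rough illustration of the two Unicode hints (a BOM at the start, and UTF-8's restrictions on byte sequences), here is a small Python sketch. It is not the function mentioned above; sniff_bom and is_valid_utf8 are hypothetical names, and the BOM constants come from Python's standard codecs module. Note one known ambiguity: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32.

```python
import codecs

# Longer BOMs are listed first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A missing BOM proves nothing, so the UTF-8 validity test is the more useful of the two for files without one. Pure ASCII data also passes it, since ASCII is a subset of UTF-8.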
Text files should have certain patterns. For example, if the file is
ASCII and is comma-delimited or tab-delimited, there are some
indicators. You should see only certain control characters. You
should see the expected delimiter. You should see either CR or LF or
both. All characters should have codes less than 128. You might want
to require the same number of delimiters per line.
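The indicators above can be sketched as a single check. This is a Python sketch under stated assumptions, not a definitive test: the function name looks_like_delimited_ascii is hypothetical, the allowed control characters are taken to be tab, LF, and CR, and quoted fields that contain the delimiter are not handled (they would throw off the per-line count).

```python
def looks_like_delimited_ascii(data: bytes, delimiter: bytes = b",") -> bool:
    """Heuristic: ASCII only, few control characters, line endings present,
    and the same nonzero number of delimiters on every non-empty line."""
    if not data:
        return False

    allowed_controls = {0x09, 0x0A, 0x0D}  # tab, LF, CR (an assumption)
    for byte in data:
        if byte >= 127:  # non-ASCII, or DEL
            return False
        if byte < 32 and byte not in allowed_controls:
            return False

    # Expect at least one CR or LF somewhere in the data.
    if b"\r" not in data and b"\n" not in data:
        return False

    # Normalize CRLF and bare CR to LF, then drop empty lines.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    lines = [line for line in normalized.split(b"\n") if line]
    if not lines:
        return False

    counts = {line.count(delimiter) for line in lines}
    return len(counts) == 1 and counts.pop() > 0
```

For a tab-delimited file, pass delimiter=b"\t". Checking only the first few kilobytes of a large file is usually enough to raise or lower confidence.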
So, given some specified pattern of what you expect in binary or
text, you should be able to differentiate.
However, an alternate approach would be to parse the file and if the
file does not pass, then reject it no matter the form of the data.
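A minimal sketch of that parse-and-reject approach, again in Python. It assumes ASCII data and a fixed number of fields per row; parse_or_reject is a hypothetical name, and the csv and io modules are from the Python standard library.

```python
import csv
import io


def parse_or_reject(data: bytes, delimiter: str = ",", encoding: str = "ascii"):
    """Return the parsed rows, or raise ValueError if the data does not
    parse as delimited text with a consistent number of fields."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid {encoding} text") from exc

    try:
        reader = csv.reader(io.StringIO(text, newline=""),
                            delimiter=delimiter, strict=True)
        rows = [row for row in reader if row]  # skip blank lines
    except csv.Error as exc:
        raise ValueError(f"malformed delimited data: {exc}") from exc

    if not rows:
        raise ValueError("no rows found")

    width = len(rows[0])
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(
                f"row {number} has {len(row)} fields, expected {width}")
    return rows
```

The import utility then needs no separate binary-or-text test: anything that raises ValueError is rejected, whatever the file actually contains.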
Dar