binary vs. text?

Dar Scott dsc at swcp.com
Mon Dec 11 18:48:20 EST 2006


On Dec 11, 2006, at 3:09 PM, Chris Sheffield wrote:

> Does anyone have a sure fire way to determine if a file is binary  
> or text?
>
> I have need to create an import utility that will import data from  
> a text file (csv, tab-delimited, etc) into a database, but I'd like  
> to check the file before doing anything else just to make sure it  
> is in fact text and not binary.

In general, there is no way.

However, all is not lost.

A text file is a special case of a binary file: it is a sequence of  
characters, each stored in some binary encoding.

For very short files, it is hard to tell.  However, if you have some  
idea of the pattern you are expecting you can increase your  
confidence that some file is binary or text.

Many file formats have magic words and header data that indicate the  
type.  The magic word alone is only a hint; checking one more required  
element adds confidence.  For example, a magic word plus a required  
element can identify a .png file, that is, check to see whether it  
starts with this: format("\211PNG\r\n\032\n\000\000\000\015IHDR").
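As a rough sketch of that check, here it is in Python rather than Revolution (my choice for illustration only). The byte string is meant to be the same sequence as the format() string above: the 8-byte PNG signature, then the 4-byte length (13) and the type of the first chunk, IHDR. The function names are made up for this example.

```python
# PNG signature (8 bytes), then the length (13) and type of the
# required first chunk, IHDR: 16 bytes in total.
PNG_PREFIX = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"


def looks_like_png(data: bytes) -> bool:
    """Return True if data starts with the PNG signature and IHDR header."""
    return data.startswith(PNG_PREFIX)


def file_looks_like_png(path: str) -> bool:
    """Read only as many bytes as the prefix needs and test them."""
    with open(path, "rb") as f:
        return looks_like_png(f.read(len(PNG_PREFIX)))
```

A match only says the file claims to be a PNG; it does not prove the rest of the file is well formed.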

Unicode files often have a BOM (byte order mark) at the start, but it  
is not required in some cases and shouldn't be there in others.  I  
have a function I use to differentiate among Unicode files, but that  
already assumes I know they are Unicode and even then it has trouble  
with some perverse files.  (It does get it right more often than  
Microsoft programs do.)  UTF-8 also restricts which byte sequences are  
valid, so checking for invalid sequences can help.
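To illustrate the BOM and UTF-8 points, here is a small Python sketch. It is not the function I mentioned, just a guess at the simplest useful version: look for a leading BOM, and separately test whether the bytes decode as strict UTF-8. The names sniff_bom and is_valid_utf8 are invented for the example.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8 (plain ASCII also passes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A missing BOM proves nothing, and a file that happens to pass the UTF-8 test could still be something else.  If you test only the first part of a large file, a multi-byte character cut off at the end will make the check fail, so allow for that.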

Text files should have certain patterns.  For example, if the file is  
ASCII and is comma-delimited or tab-delimited, there are some  
indicators.  You should see only certain control characters.  You  
should see the expected delimiter.  You should see either CR or LF or  
both.  All characters should have codes less than 128.  You might want  
to require the same number of delimiters per line.
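Those indicators can be sketched as one check.  This is Python and only a heuristic under my own assumptions: the only control characters allowed are TAB, CR and LF, blank lines are ignored, and every remaining line must hold the same nonzero number of delimiters.  The function name is hypothetical.

```python
def looks_like_delimited_ascii(data: bytes, delimiter: bytes = b",") -> bool:
    """Heuristic check for ASCII comma- or tab-delimited text."""
    if not data:
        return False
    allowed_controls = {0x09, 0x0A, 0x0D}  # TAB, LF, CR
    for byte in data:
        if byte >= 127:  # non-ASCII, or DEL
            return False
        if byte < 32 and byte not in allowed_controls:
            return False
    # bytes.splitlines() splits on CR, LF and CRLF.
    lines = [line for line in data.splitlines() if line]
    if not lines:
        return False
    counts = {line.count(delimiter) for line in lines}
    # Every non-blank line must contain the same nonzero delimiter count.
    return len(counts) == 1 and counts.pop() > 0
```

Counting delimiters this way will wrongly reject files with quoted fields that contain the delimiter, so loosen that last test if your data has them.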

So, given some specified pattern of what you expect in binary or  
text, you should be able to differentiate.

However, an alternate approach would be to simply parse the file as  
the format you expect and, if it does not parse, reject it, whether  
the data turns out to be binary or just malformed text.
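A minimal sketch of the parse-and-reject idea, again in Python and using its standard csv module.  The rules are assumptions chosen for the example: the data must decode in the expected encoding, contain no NUL characters, parse in strict mode, and give the same number of fields on every non-empty row.  The name parse_or_reject is invented.

```python
import csv
import io


def parse_or_reject(data: bytes, delimiter: str = ",", encoding: str = "ascii"):
    """Return the parsed rows, or None if data is not acceptable for import."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    if "\x00" in text:
        return None
    try:
        reader = csv.reader(
            io.StringIO(text, newline=""), delimiter=delimiter, strict=True
        )
        rows = [row for row in reader if row]  # skip blank lines
    except csv.Error:
        return None
    # Reject empty input and rows with differing field counts.
    if not rows or len({len(row) for row in rows}) != 1:
        return None
    return rows
```

With this approach the importer never has to answer the binary-or-text question at all; a PNG and a ragged CSV are rejected by the same code path.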

Dar



