Some thoughts on duck typing

Jeff Massung massung at gmail.com
Wed Jan 12 12:55:52 EST 2011


On Wed, Jan 12, 2011 at 4:37 AM, David Bovill <david at vaudevillecourt.tv> wrote:

> If it quacks like a duck it is a duck.
>
> So I have some data in a variable that I want to display. I can use is an
> array/number/date - but for other types of data I'm wondering... xml should
> be easy, but harder would be to distinguish long text files from binary.
> Any ideas for hacks to distinguish:
>
>   1. images
>   2. sounds
>   3. video
>   4. binary blob
>   5. text
>   6. rtftext
>   7. utf8
>
>
This is a pretty well-solved problem (except for the "array" part, which is
an LC-specific data type/format). I wish I had some references for you at the
moment, but here are some things to keep in mind:

- First, use your OS when possible. Images, sounds, video, and often text are
already identified for you via the registry on Windows or the 4-byte type
code on Mac (e.g. 'TEXT').

- Next, determine text vs. binary. This is usually done by grabbing the
first N bytes (where N is ~1000) and looking for any that are < 10 (other
than tab, which is 9) or > 127. If you find any, it's binary - or Unicode.
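The byte-range check above can be sketched in a few lines. This is only an illustration in Python, not LiveCode; the function name and the `sample_size` and `lowest_text_byte` parameters are made up for the example, and letting tab (byte 9) through is an assumption, since tabs are common in ordinary text.

```python
def looks_like_text(data: bytes, sample_size: int = 1000,
                    lowest_text_byte: int = 9) -> bool:
    """Return True if the sampled bytes all fall in the plain-text range.

    Hypothetical helper: only the first `sample_size` bytes are examined.
    Any byte below `lowest_text_byte` (default 9, i.e. tab is allowed)
    or above 127 marks the data as binary or possibly Unicode.
    """
    sample = data[:sample_size]
    return all(lowest_text_byte <= b <= 127 for b in sample)
```

A False result does not say which kind of non-text you have; it only tells you to move on to the header checks.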

- Binary is where you start looking at image vs. video vs. Unicode. Image and video are
pretty simple. You don't need to understand every form of image or video,
just a handful that will hit 99% of all images/videos out there. And they
all - very politely - have a nice header you can examine. For example,
looking at PNG:

http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header

From there, you can see that the first 4 bytes of a PNG file are 0x89 0x50
0x4E and 0x47 (where 50, 4E, and 47 are actually the ASCII letters 'PNG').
Almost every single image and video format you'll care about will have
something very similar you can use. This is a great site you can reference:

http://www.wotsit.org/
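A header check along these lines might look like the following Python sketch. The table is a small hand-picked sample of well-known signatures, not a complete list, and the names `SIGNATURES`, `RIFF_SUBTYPES` and `sniff_header` as well as the returned labels are invented for the example.

```python
# Well-known leading signatures ("magic numbers"); a sample, not exhaustive.
SIGNATURES = [
    (b"\x89PNG", "png image"),      # 0x89 0x50 0x4E 0x47
    (b"\xff\xd8\xff", "jpeg image"),
    (b"GIF87a", "gif image"),
    (b"GIF89a", "gif image"),
    (b"BM", "bmp image"),
    (b"OggS", "ogg audio/video"),
    (b"ID3", "mp3 audio"),          # only MP3 files that carry an ID3v2 tag
]

# RIFF containers start with "RIFF", then a 4-byte size, then a type code.
RIFF_SUBTYPES = {b"WAVE": "wav audio", b"AVI ": "avi video"}


def sniff_header(data: bytes):
    """Return a label for a recognized header, or None if nothing matches."""
    if data[:4] == b"RIFF":
        return RIFF_SUBTYPES.get(data[8:12])
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None
```

Adding a format is just another row in the table, which is why a handful of entries goes a long way.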

If you don't find a header that you understand, then you are looking at
either a straight binary lump/blob or a multi-byte text file (Unicode).
Remember that while UTF-8 is not ASCII, it encodes every ASCII character as
the same single byte, so a UTF-8 file looks exactly like ASCII until a
non-ASCII character shows up. I don't have any advice to give you here on how
to determine if the file is Unicode text or not... as I understand it this
is really a difficult problem to solve. I'm sure Google can help, though.
;-)
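For what it's worth, one common heuristic (not something from this post, and sketched here in Python purely as an assumption) is to look for a byte order mark and otherwise see whether the bytes decode as strict UTF-8. The function name and return labels are made up. Note that low-valued binary bytes such as NUL are technically valid UTF-8, so this only makes sense after the header checks, and a sample cut in the middle of a multi-byte character will be rejected.

```python
import codecs


def guess_unicode_encoding(data: bytes):
    """Return "utf-8-sig", "utf-16", "utf-8", or None as a best guess."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

A None result here leaves you with the "binary lump/blob" case.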

- At this point you've determined that the file is "text" in nature and you
are trying to figure out specifically whether it's RTF, XML, INI, whatever.
This gets a little trickier, as people often omit the optional headers that
could be there (e.g. <?xml ...?>, <!DOCTYPE ...>, ...) and you are left with
either taking your best guess or going off the file extension.

- RTF does have something like a header: a valid RTF file begins with "{\rtf"
followed by a version number (normally "{\rtf1"). Check the start of the file
for that; if you want more confidence, scan for a few other known RTF control
words (e.g. "\ansi", "\par").

- XML/HTML/*ML is a matter of scanning for some known tags (like <BODY>,
<HTML>) that you know should be there near the top or - in the case of XML -
checking for namespaces in the tag names.
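The text-subtype guesses above could be combined roughly as follows. This is a Python sketch with invented names (`guess_text_kind`, the returned labels) and a few assumptions: it treats a leading "{\rtf" control word as the RTF marker, looks only at the first 1000 characters, and uses a prefix in a tag name or an "xmlns" attribute as the namespace hint for XML.

```python
import re


def guess_text_kind(text: str) -> str:
    """Best-guess label for text content: "rtf", "xml", "html" or "text"."""
    head = text.lstrip()[:1000]
    lowered = head.lower()
    if head.startswith("{\\rtf"):
        return "rtf"
    if lowered.startswith("<?xml"):
        return "xml"
    # Optional DOCTYPE, or well-known tags near the top.
    if lowered.startswith("<!doctype html") or re.search(r"<(html|body)\b", lowered):
        return "html"
    # Namespace hints: a prefixed tag such as <svg:rect ...> or an xmlns attribute.
    if re.search(r"<[a-z_][\w.-]*:[\w.-]+", lowered) or "xmlns" in lowered:
        return "xml"
    return "text"
```

Anything that falls through to "text" is where the file extension or a best guess has to take over.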

Hope this helps!

Jeff M.


