Some thoughts on duck typing

David Bovill david at vaudevillecourt.tv
Wed Jan 12 13:08:08 EST 2011


It does, Jeff - thanks. Lots of detail there to translate into good ol' code
:)

On 12 January 2011 17:55, Jeff Massung <massung at gmail.com> wrote:

> On Wed, Jan 12, 2011 at 4:37 AM, David Bovill <david at vaudevillecourt.tv> wrote:
>
> > If it quacks like a duck it is a duck.
> >
> > So I have some data in a variable that I want to display. I can use "is an
> > array/number/date" - but for other types of data I'm wondering... XML
> > should be easy, but harder would be to distinguish long text files from
> > binary. Any ideas for hacks to distinguish:
> >
> >   1. images
> >   2. sounds
> >   3. video
> >   4. binary blob
> >   5. text
> >   6. rtftext
> >   7. utf8
> >
> >
> This is pretty much a solved problem (except for the "array" part, which is
> an LC-specific data type/format). Wish I had some references for you at the
> moment, but here are some things to keep in mind:
>
> - First, use your OS when possible. Images, sounds, video, and often text
> are already identified for you via the registry on Windows or the 4-byte
> type code on Mac (e.g. 'TEXT').
>
> - Next, determine text vs. binary. This is usually done by just grabbing
> the first N bytes (where N is ~1000) and looking for any that are < 10
> or > 127. If you find any, it's binary - or unicode.
>
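
A minimal sketch of that text-vs-binary check, in Python rather than LiveCode. The function name and the sample_size parameter are made up for illustration. It follows the "< 10 or > 127" rule as stated, with one assumed tweak: tab (byte 9) is let through, since otherwise tab-indented text would be flagged as binary.

```python
def looks_binary(data: bytes, sample_size: int = 1000) -> bool:
    """Return True if the first sample_size bytes contain a byte that
    plain ASCII text would not normally contain."""
    sample = data[:sample_size]
    for byte in sample:
        # Rule from the post: anything < 10 or > 127 is suspicious.
        # Assumption added here: tab (9) is allowed, because it is
        # common in ordinary text files.
        if byte > 127 or (byte < 10 and byte != 9):
            return True
    return False
```

A True result only means "not plain ASCII"; as noted above, it may still be unicode text rather than an image or other blob.
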
> - Once you know it's binary, look at image vs. video vs. unicode. Image
> and video are pretty simple. You don't need to understand every form of
> image or video, just a handful that will hit 99% of all images/videos out
> there. And they all - very politely - have a nice header you can examine.
> For example, looking at PNG:
>
> http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
>
> From there, you can see that the first 4 bytes of a PNG file are 0x89 0x50
> 0x4E and 0x47 (where 50, 4E, and 47 are actually the ASCII letters 'PNG').
> Almost every single image and video format you'll care about will have
> something very similar you can use. This is a great site you can reference:
>
> http://www.wotsit.org/
>
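
The header check can be sketched as a small lookup table, again in Python. The table and the function name sniff_header are illustrative, and the list of formats is a small assumed sample, not a complete set. The PNG entry uses the full 8-byte signature, of which the 4 bytes quoted above are the start.

```python
# (signature bytes at offset 0, label) pairs for a few common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),   # starts with 0x89 0x50 0x4E 0x47
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"OggS", "ogg"),
]


def sniff_header(data: bytes):
    """Return a format label based on the leading bytes, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # RIFF is a container: bytes 8..11 say what is inside it.
    if data[:4] == b"RIFF" and len(data) >= 12:
        kind = data[8:12]
        if kind == b"WAVE":
            return "wav"
        if kind == b"AVI ":
            return "avi"
    return None
```

Only the first dozen or so bytes are needed, so this can run on the same sample used for the text-vs-binary test.
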
> If you don't find a header that you understand, then you are looking at
> either a straight binary lump/blob or a multi-byte text file (unicode).
> Remember that while UTF8 is not ASCII, any pure-ASCII text is also valid
> UTF8, so the two are indistinguishable most of the time. I don't have any
> advice to give you here on how to determine if the file is unicode text
> or not... as I understand it this is really a difficult problem to solve.
> I'm sure Google can help, though. ;-)
>
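
One common heuristic for the UTF8 case, offered as a sketch and not as a full answer to unicode detection: try a strict UTF8 decode and see whether it fails. Arbitrary binary data rarely happens to be valid UTF8. The function name is made up, and the check assumes it is given the whole content, since cutting a sample in the middle of a multi-byte character would make valid text fail.

```python
def is_probably_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8.

    Pure ASCII also passes, because ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

This says nothing about other unicode encodings such as UTF-16, which would need their own checks.
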
> - At this point you've determined that the file is "text" in nature and you
> are trying to specifically figure out if it's RTF, XML, INI, whatever. This
> gets a little more tricky, as people often skip the optional headers that
> could be there (e.g. <?xml ...?>, <!DOCTYPE ...>, ...) and you are left
> with either taking your best guess or going off the file extension.
>
> - RTF does have a recognizable start: a well-formed RTF file begins with
> "{\rtf" (normally "{\rtf1"), so check the first few bytes for that.
> Failing that, scan it and look for "{\" in the file followed by some known
> RTF "tags".
>
> - XML/HTML/*ML is a matter of scanning for some known tags (like <BODY>,
> <HTML>) you know should be there near the top or - in the case of XML -
> checking for namespaces in the tag names.
>
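
A rough sketch of the text-flavour guessing, in Python. The function name, the 1000-character window and the exact patterns are all assumptions for illustration. For RTF it checks the "{\rtf" prefix that real RTF files start with; for the markup formats it looks for the optional headers, then known tags, then namespace hints, and falls back to plain "text".

```python
import re


def sniff_text_kind(text: str) -> str:
    """Guess 'rtf', 'xml', 'html' or 'text' from the top of the content."""
    # Look only near the top; skip a BOM and leading whitespace.
    head = text[:1000].lstrip("\ufeff \t\r\n")
    lowered = head.lower()
    if head.startswith("{\\rtf"):
        return "rtf"
    if lowered.startswith("<?xml"):
        return "xml"
    # Known HTML markers: the doctype, or <html> / <body> tags.
    if lowered.startswith("<!doctype html") or re.search(r"<(html|body)\b", lowered):
        return "html"
    # XML hints: a namespace prefix in a tag name, or an xmlns attribute.
    if re.search(r"<[a-z_][\w.-]*:[\w.-]+", lowered) or "xmlns" in lowered:
        return "xml"
    return "text"
```

It is a best guess only; when nothing matches, the file extension is still the fallback.
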
> Hope this helps!
>
> Jeff M.