Unicode is not "everywhere"...

Mark Waddingham mark at livecode.com
Tue Aug 27 07:54:28 EDT 2019


On 2019-08-22 20:53, Paul Dupuis via use-livecode wrote:
> I just want it consistent and documented and able to return more than
> just ASCII data
> 
> Currently, OSX shell returns UTF8 which may mean that it is returning
> binary as it is returning 8-bit bytes where Unicode text has been
> encoded as UTF8

The encoding returned by the terminal commands on macOS is UTF-8 for 
two reasons:

   1) Various environment variables make it so (the 'system encoding')

   2) The terminal commands you are calling are written to respect the 
system encoding and emit text encoded in that way - because they are 
actually emitting text.

In contrast - 'cat' will just dump the contents of the file you specify 
byte by byte - and files could contain data in any encoding.

There is absolutely no way to tell whether a command is 'ls' like and 
thus emits text, or 'cat' like and thus emits binary.
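To illustrate the 'ls' versus 'cat' distinction, here is a minimal Python sketch (not LiveCode). The helper name try_decode and the sample byte strings are made up for the example. It simply attempts to decode some output as UTF-8 and reports failure:

```python
def try_decode(raw: bytes, encoding: str = "utf-8"):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical sample outputs, for illustration only.
ls_like = "résumé.txt\n".encode("utf-8")  # text emitted in the system encoding
cat_like = b"\x89PNG\r\n\x1a\n"           # the first bytes of a PNG file

print(try_decode(ls_like))   # decodes cleanly
print(try_decode(cat_like))  # None: 0x89 cannot start a UTF-8 sequence
```

Note that this is only a heuristic after the fact: binary data which happens to be valid UTF-8 would decode without complaint, so a successful decode does not prove the command meant to emit text.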

> Windows returns CP1252 text, not binary and any Unicode results, which
> DOS displays as Unicode just fine, can be returned without elaborate
> work-arounds.
> 
> That by definition is a bug.

No - that isn't the definition of a bug - it is a difference of behavior 
because you are dealing with platform-specific details.

The /U switch which Dar suggested (and appears to work for DIR and 
friends at least) seems to be only applicable to 'internal commands' 
(according to 
https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/cmd) 
so it isn't clear what, if anything, it would do to an arbitrary Windows 
terminal command.

> I would advocate that shell should return binary data. Text being
> returned should be UTF8 encoded, that way people expecting ASCII do
> not need to do anything, they can just work with the returned text.
> People expecting Unicode can use textDecode to get the UTF8 converted
> to LC native 16-bit Unicode, and people expecting binary can use the
> byte chunk to process what comes back however they want.

The problem here is that it is up to the command being called what it 
outputs - nothing else - so this isn't an achievable goal. You have to 
know what the commands you are calling do, and how they work - and 
ensure you set the environment up when calling them to return what you 
want.
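As a rough illustration of the environment deciding what a command emits, here is a Python sketch (not LiveCode). It uses a Python child process and the PYTHONIOENCODING variable as a stand-in, because that variable behaves the same on every platform. For typical macOS and Linux commands the analogous variables would be things like LANG and LC_ALL, and which locales are installed varies from machine to machine. The function name run_child is made up for the example.

```python
import os
import subprocess
import sys


def run_child(io_encoding: str) -> bytes:
    """Run a child process with a chosen output encoding; return its raw bytes."""
    env = dict(os.environ)
    # Stand-in for the 'system encoding': only a Python child honours this.
    env["PYTHONIOENCODING"] = io_encoding
    code = "print('\\u00e9')"  # the child prints a single e-acute character
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip()  # drop the trailing line ending


utf8_bytes = run_child("utf-8")      # two bytes: C3 A9
latin1_bytes = run_child("latin-1")  # one byte: E9
print(utf8_bytes, latin1_bytes)
```

The same command produces different bytes for the same character, which is why the caller has to know which encoding it asked for before decoding.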

The current situation with shell is irksome though. The internal 
platform-dependent code returns binary data and does nothing to it. The 
higher-level wrapper (i.e. the 'shell()' function implementation) then 
treats that binary data as a native string (native strings and binary 
strings are essentially interchangeable) and performs EOL conversion on 
it on Windows and in server engines. This means it kinda returns text 
but not really. Unfortunately this behavior has existed for so long that 
it is 'just the way things are', so it isn't going to change.
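To show why EOL conversion on data that may be binary is a problem, here is a small Python sketch (not LiveCode). It assumes the conversion is a plain CRLF to LF replacement; the exact conversion the engine performs is not spelled out here, and the name eol_convert is made up for the example.

```python
def eol_convert(data: bytes) -> bytes:
    """Normalise CRLF line endings to LF, as a text-oriented wrapper might."""
    return data.replace(b"\r\n", b"\n")


text_output = b"one\r\ntwo\r\n"       # genuinely textual output
binary_output = b"\x89PNG\r\n\x1a\n"  # PNG signature, which contains a CRLF pair

print(eol_convert(text_output))    # line endings normalised, text is fine
print(eol_convert(binary_output))  # one byte shorter: the signature is damaged
```

For text the conversion is harmless, but any binary output containing the bytes 0D 0A is silently altered.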

Moving forward, a second parameter to shell() would probably be the best 
way to resolve the above anomaly - empty would mean legacy behavior, 
binary would mean do nothing at all.

It would be nice to be able to specify 'text' as well...

On UNIX-based systems it is clear what that should do (textDecode the 
output based on the 'system' encoding, which is determined from the 
environment variables of the calling process).
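The modes discussed above could be sketched roughly as follows, in Python rather than LiveCode. Everything here is an assumption for illustration: the function name convert_output, the system_encoding parameter, the use of Latin-1 as a stand-in for a native string (it maps each byte to one character), and the unconditional CRLF to LF conversion (the real legacy behavior only applies it on Windows and in server engines).

```python
import locale


def convert_output(raw: bytes, mode: str = "", system_encoding: str = ""):
    """Post-process raw command output according to a hypothetical mode.

    ""       -> legacy: bytes mapped 1:1 to characters, then CRLF -> LF
    "binary" -> returned untouched
    "text"   -> decoded using the system encoding, then CRLF -> LF
    """
    if mode == "binary":
        return raw
    if mode == "text":
        # Fall back to the encoding implied by the calling process's environment.
        encoding = system_encoding or locale.getpreferredencoding(False)
        return raw.decode(encoding, errors="replace").replace("\r\n", "\n")
    if mode == "":
        # Latin-1 stands in for a native string here (an assumption).
        return raw.decode("latin-1").replace("\r\n", "\n")
    raise ValueError("unknown mode: " + repr(mode))


raw = "é\r\n".encode("utf-8")  # what a UTF-8 emitting command might send

print(repr(convert_output(raw, "binary")))         # the original bytes
print(repr(convert_output(raw, "text", "utf-8")))  # properly decoded text
print(repr(convert_output(raw)))                   # legacy: mojibake, EOL converted
```

The legacy path shows the 'kinda text but not really' result: the two UTF-8 bytes come back as two separate characters, yet the line ending has been converted.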

On Windows it is not clear to me what such a setting could do - /U 
certainly doesn't sound like it helps arbitrary processes, but it might 
be that there is some way to change the codepage (analogous to the 
'system encoding') of the command being called so some attempt can be 
made to text decode and EOL convert appropriately.

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



