Unicode is not "everywhere"...
Mark Waddingham
mark at livecode.com
Tue Aug 27 07:54:28 EDT 2019
On 2019-08-22 20:53, Paul Dupuis via use-livecode wrote:
> I just want it consistent and documented and able to return more than
> just ASCII data
>
> Currently, OSX shell returns UTF8 which may mean that it is returning
> binary as it is returning 8-bit bytes where Unicode text has been
> encoded as UTF8
The encoding returned by the terminal commands on macOS is UTF-8 for
two reasons:
1) Various environment variables make it so (the 'system encoding')
2) The terminal commands you are calling are written to respect the
system encoding and emit text encoded in that way - because they are
actually emitting text.
In contrast - 'cat' will just dump the contents of the file you specify
byte by byte - and files could contain data in any encoding.
There is absolutely no way to tell whether a command is 'ls' like and
thus emits text, or 'cat' like and thus emits binary.
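To illustrate why the bytes alone cannot settle this, here is a small
sketch in Python (not LiveCode; the byte values are made up for the
example). The same two bytes decode without error as both UTF-8 and
CP1252, so a successful decode says nothing about what the command
actually intended to emit.

```python
# Hypothetical output captured from some command: the two bytes C3 A9.
raw = bytes([0xC3, 0xA9])

# Read as UTF-8, the bytes form a single character (U+00E9).
as_utf8 = raw.decode("utf-8")

# Read as CP1252, the same bytes form two characters (U+00C3, U+00A9).
as_cp1252 = raw.decode("cp1252")

# Both decodes succeed, yet the results differ.
print(len(as_utf8), len(as_cp1252))
```

Neither result is 'wrong' as far as the decoder is concerned - only
knowledge of the command that produced the bytes can pick between them.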
> Windows returns CP1252 text, not binary and any Unicode results, which
> DOS displays as Unicode just fine, can be returned without elaborate
> work-arounds.
>
> That by definition is a bug.
No - that isn't the definition of a bug - it is a difference of behavior
because you are dealing with platform-specific details.
The /U switch which Dar suggested (and appears to work for DIR and
friends at least) seems to be only applicable to 'internal commands'
(according to
https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/cmd)
so it isn't clear what, if anything, it would do to an arbitrary Windows
terminal command.
> I would advocate that shell should return binary data. Text being
> returned should be UTF8 encoded, that way people expecting ASCII do
> not need to do anything, they can just work with the returned text.
> People expecting Unicode can use textDecode to get the UTF8 converted
> to LC native 16-bit Unicode, and people expecting binary can use the
> byte chunk to process what comes back however they want.
The problem here is that it is up to the command being called what it
outputs - nothing else - so this isn't an achievable goal. You have to
know what the commands you are calling do, and how they work - and
ensure you set the environment up when calling them to return what you
want.
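As a rough sketch of 'setting the environment up' for a command, again
in Python rather than LiveCode: the helper name run_with_env is made up
for this example, and the child process is Python itself only so that
the sketch runs on any platform. The point is that the caller picks the
environment and still gets raw bytes back, which it must decode based
on what it knows about the command.

```python
import os
import subprocess
import sys


def run_with_env(argv, overrides):
    """Run argv with a copy of the current environment plus overrides.

    Returns the child's stdout as raw bytes. Decoding is left to the
    caller, who has to know what encoding the command emits.
    (Hypothetical helper, for illustration only.)
    """
    env = dict(os.environ)   # copy, so the parent's environment is untouched
    env.update(overrides)    # e.g. force a particular locale for the child
    result = subprocess.run(argv, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout


if __name__ == "__main__":
    # The child just echoes back the LC_ALL value it was started with.
    child = [sys.executable, "-c",
             "import os, sys; sys.stdout.write(os.environ['LC_ALL'])"]
    print(run_with_env(child, {"LC_ALL": "C"}))
```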
The current situation with shell is irksome though. The internal
platform-dependent code returns binary data and does nothing to it, but
the higher-level wrapper (i.e. the 'shell()' function implementation)
treats that binary data as a native string (native strings and binary
strings are essentially interchangeable) and then performs EOL
conversion on it on Windows and in server engines.
This means it kinda returns text but not really. Unfortunately this
behavior has existed for so long that it is 'just the way things are' so
it isn't going to change.
Moving forward, a second parameter to shell() would probably be the best
way to resolve the above anomaly - empty would mean legacy behavior,
binary would mean do nothing at all.
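A minimal sketch of those two modes, written in Python purely as an
illustration (this is not the engine's implementation, and the function
and parameter names are made up). It assumes a 'native string' can be
modelled as one character per byte (Latin-1 stands in for the native
encoding here) and that EOL conversion means turning CRLF into LF.

```python
def shell_output(raw: bytes, mode: str = "", convert_eol: bool = True):
    """Post-process raw command output according to a hypothetical mode.

    mode == "binary": return the bytes completely untouched.
    mode == "":       legacy-style behavior, modelled as a byte-for-byte
                      mapping to characters plus optional EOL conversion.
    """
    if mode == "binary":
        return raw
    if mode == "":
        # Assumption: Latin-1 maps each byte to exactly one character,
        # standing in for the binary <-> native string equivalence.
        text = raw.decode("latin-1")
        if convert_eol:
            # Assumption: EOL conversion is CRLF -> LF.
            text = text.replace("\r\n", "\n")
        return text
    raise ValueError("unsupported mode: " + repr(mode))


# UTF-8 output passed through the legacy path comes back as two
# characters per non-ASCII letter - it 'kinda' looks like text.
print(repr(shell_output(b"caf\xc3\xa9\r\n")))
print(repr(shell_output(b"caf\xc3\xa9\r\n", "binary")))
```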
It would be nice to be able to specify 'text' as well...
On UNIX-based systems it is clear what that should do (textDecode the
output based on the 'system' encoding, which is determined from the
environment variables of the calling process).
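As a sketch of what 'determined from the environment variables' could
look like, here is the usual POSIX lookup order (LC_ALL, then LC_CTYPE,
then LANG) in Python. The function name and the fallback to 'ascii'
when the locale names no codeset are assumptions made for this
example - a real implementation would more likely ask the C library.

```python
def system_encoding(env, default="ascii"):
    """Guess the text encoding from POSIX locale environment variables.

    The first non-empty variable among LC_ALL, LC_CTYPE and LANG wins.
    Locale names look like language_TERRITORY.codeset@modifier.
    (Hypothetical helper; 'default' is an assumption for locales such
    as "C" that carry no codeset.)
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            value = value.split("@", 1)[0]      # drop any @modifier
            if "." in value:
                return value.split(".", 1)[1]   # the codeset part
            return default
    return default


# Decode command output using the encoding found in the environment.
raw = b"caf\xc3\xa9\n"
encoding = system_encoding({"LANG": "en_US.UTF-8"})
print(encoding, repr(raw.decode(encoding)))
```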
On Windows it is not clear to me what such a setting could do - /U
certainly doesn't sound like it helps arbitrary processes, but there
might be some way to change the codepage (analogous to the 'system
encoding') of the command being called, so that some attempt can be made
to text decode and EOL convert appropriately.
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps