Linux filenames in LC Server

Mark Waddingham mark at livecode.com
Wed Aug 16 03:34:23 EDT 2023


On 2023-08-16 06:37, Neville Smythe via use-livecode wrote:
> So I misunderstood, I thought we were talking about Apache environment 
> variables. Indeed the Terminal app reports
> 
> LANG=C
> 
> as a system env variable. But if this is not specifically a server 
> problem, wouldn’t
> that mean we could see the same behaviour with LC Desktop on Linux 
> machines running
> vanilla Ubuntu or Debian (which is what Dreamhost uses)? I haven’t 
> tried this yet,
> as it is a bit of a pain to fire up my Linux emulator machine.

So the situation here is similar to that which you get on macOS. If you 
open Terminal, then the (UNIX) environment (variable-wise) which you get 
will be different from the one you get when you double-click on an app 
to launch it. In the latter case, the executable is launched via the 
desktop environment's 'launcher' process and will inherit the 
environment provided by that. Presumably, as Linux desktops mandate 
various things (like language settings), the locale and environment 
vars will be set appropriately.
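
If you want to see the difference for yourself, here is a quick sketch - 
the app name is just an illustration, any desktop-launched process will 
do:

   # In a terminal: the locale this shell session was given
   echo $LANG

   # For a GUI-launched process: read its inherited environment from /proc
   tr '\0' '\n' < /proc/$(pgrep -n gedit)/environ | grep -E '^(LANG|LC_)'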

> An experiment, which make me wonder if this counts as a configuration 
> problem or an actual bug in LC Server:
> 
>  In Terminal I type (actually paste) and execute
> 
>  echo "éü😃" > Carré.txt
> 
>     (for Forum users like me who just see ? everywhere, that is 
> [e-acute][u-umlaut][happyface emoji] in the content to be written to a 
> file with [e-acute] in its name)
> 
>    This works without problem. The contents of the file are utf-8 
> encoded, which I didn’t
> need to specify, but I guess that is what the pasteboard provided. 
> Terminal had no problem
> creating or finding the file without needing those env settings. Of 
> course it cannot *display*
> the file name without knowing the encoding, so ls reports the filename 
> as 'Carr'$'\303\251''.txt'
> (readable as an ascii encoding, though not one I have seen before; 
> note the single quotes)

I'm guessing here that this is a remote ssh session to your Linux 
server, and you are using the macOS Terminal app to run and connect? If 
that is the case then the reason this works is that Terminal on macOS is 
UTF-8 (which is the *only* encoding macOS supports in its UNIX 
subsystem, so you don't get the variance problem you do with Linux). 
This means that pasting text from somewhere else will paste the UTF-8 
bytes - i.e. they will get transmitted over SSH to the remote Linux 
machine.

As filenames are just sequences of bytes on Linux this works fine - 
however, when you ask the remote terminal to list the files, it can only 
interpret the ascii chars (as the LANG is C) and thus emits octal 
escapes for the others - here this is 0xC3 0xA9, which is the utf-8 
encoding of e-acute.
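
To make that concrete, here is a rough sketch of the same experiment run 
entirely on the Linux side (the exact ls output varies with the coreutils 
version and quoting style in use):

   echo "éü😃" > Carré.txt       # the shell writes the UTF-8 bytes as-is
   LANG=C ls                      # C locale: non-ASCII bytes become octal escapes
   # 'Carr'$'\303\251''.txt'
   LANG=en_US.UTF-8 ls            # UTF-8 locale: the same bytes render as e-acute
   # Carré.txt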

> If I setup the env variables Mark suggests in the Terminal session
> 
> export LC_ALL="en_US.UTF8"
> export LANG="en_US.UTF8"
> 
> then Terminal is able to display the filename à la française.

So now the remote terminal knows how to interpret the sequences of bytes 
present in the filenames, and thus can emit them appropriately.
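
(If you go down this route it is worth checking that the locale actually 
exists on the server first - the names below are just what glibc 
typically reports:)

   locale -a | grep -i en_us      # e.g. en_US.utf8 on most glibc systems
   export LANG=en_US.UTF-8
   export LC_ALL=en_US.UTF-8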

> Cyberduck reports this filename correctly using the [e-acute] without 
> having to set encoding
> knowledge. And I can also create the file using Cyberduck with no 
> problems. So IT knows about/expects/sets
> up the encoding as needed. I bet other Linux-aware apps would also open 
> or list such files without
> drama or special configuration.

IT doesn't know - it assumes. I suspect that if you used Cyberduck to 
connect to a Linux server which is set up to *not* be utf-8 (so 
filenames are encoded with some other encoding), then it would display 
things incorrectly.

Of course, if the protocols it deals with specify the text encoding as 
utf-8 *and* the daemons running on said server are set up correctly 
(i.e. so that they process the filenames and such relative to the 
server's encoding) *and* they correctly convert the filenames from that 
encoding to the encoding mandated by the protocol, then it would 
display fine.

Certainly FTP treats filenames as sequences of bytes - so at least for 
that protocol the client would have to assume UTF-8 or be told the 
correct encoding to do the correct thing.

> However: in LC Server when I call "the long files" for the enclosing 
> folder: crash!
> (Actually an in-line error reported for this code line). To my mind 
> that qualifies as
> bug, even if the source of the crash is the same as for open file.

I take it by crash you mean a runtime error is logged, and that this 
only happens if the LANG / LC_ALL environment variables are not set?

This is the same issue as opening a file - the low-level text 
conversion from ASCII to the internal encoding used by strings in the 
engine will be failing because it encounters non-ASCII bytes.
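
A rough way to confirm that from the command line (the script name and 
install path below are just placeholders) is to run the same server 
script under both locales:

   # list-files.lc: sets the defaultFolder to the directory in question
   # and puts 'the long files'
   LANG=C /path/to/livecode-server list-files.lc            # runtime error on non-ASCII names
   LANG=en_US.UTF-8 /path/to/livecode-server list-files.lc  # names come back correctly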

>    On the other hand hopefully setting the environment variables as 
> Mark suggests will
> fix everything . Mark, could I clarify exactly how that “launcher 
> script” is to be used…
> I’m guessing the cgi configuration should point to that file to be 
> executed when it wants
> to open myscript.lc instead of pointing to the livecode-server 
> executable (in which case
> it might have to have a .cgi suffix rather than .txt), or is it a shell 
> script to be
> executed by livecode-server?

The provided text should be put into a shell script which should be 
launched *instead* of livecode-server - so configure your CGI 
environment to call said shell script when it encounters a lc server 
script file to run. The shell script will set the environment 
variables, after which 'exec' replaces the shell script with 
livecode-server (in the same process).
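
For clarity, a minimal sketch of what that launcher script looks like 
(the livecode-server path and the exact locale name are assumptions - 
adjust them to your install):

   #!/bin/sh
   # Give the engine a UTF-8 locale so it decodes filenames correctly
   export LANG=en_US.UTF-8
   export LC_ALL=en_US.UTF-8
   # exec replaces this shell with livecode-server in the same process,
   # passing through whatever arguments the web server supplied
   exec /path/to/livecode-server "$@"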

Technically, while what the engine is doing is correct (relative to its 
need to have filenames represented as strings internally, at least), it 
isn't ideal. There are two options to improve the situation (when the 
locale env vars are not set / set to C):

   1) Rather than assume ASCII, assume native - this would preserve the 
bytes in the filename regardless of system encoding.
   2) Rather than assume ASCII, assume utf-8 - this would correctly 
represent filenames which are valid UTF-8, but would still fail on 
filenames with bad encoding.

Here (1) has the advantage that filenames would be preserved; but with 
the slight caveat that if you combined them with other unicode 
characters (in a report, say), the filenames would be displayed 
incorrectly (here 'display' would also include being sent as part of 
some protocol response).

Here (2) has the advantage of everything working as expected assuming 
the server in question is utf-8 - it would still fail on filenames 
which are badly encoded though. However, the latter could be mitigated 
by making the sys-string<->lc-string conversion slightly less strict - 
i.e. bad utf-8 chars map to/from '?' as they do in textEncode/Decode - 
so at least you could see the bad filenames.
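
For reference, a 'badly encoded' filename of that kind is easy to 
produce by hand - a quick sketch (GNU ls output shown; other versions 
may quote differently):

   touch "$(printf 'Carr\xe9.txt')"   # 0xE9 is e-acute in Latin-1, not valid UTF-8 on its own
   LANG=en_US.UTF-8 ls                # a UTF-8 locale cannot decode the stray byte
   # 'Carr'$'\351''.txt'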

I suspect (2) is overall better - its only downside is that you would 
not be able to manipulate files on the server which had badly encoded 
utf-8 names. However, that seems like an extreme edge case; and one 
which you could work around by just setting the LANG env var to a 
native encoding and putting appropriate code in your app to deal with 
it.

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things


