Linux filenames in LC Server

Mark Waddingham mark at livecode.com
Mon Aug 14 06:22:03 EDT 2023


On 2023-08-14 02:45, Neville Smythe via use-livecode wrote:
> OK, so the macOS *is* using utf-8 for its file names - the [e-acute] in 
> the filename Carré.txt is rendered with two bytes [C3A9] not the single 
> byte MacRoman encoding. I got tricked by copying the terminal listing 
> into another program rather than hex dumping within the terminal, and 
> somewhere in the process the native encoding was preferred.
> 
> So one must *not* textEncode a filename to utf-8 before writing a file 
> to disk, LC deals with the encoding, although you *should* textEncode 
> its contents.
> 
> Which leaves the problem of why I can’t get LC Server on Linux to write 
> non-ascii filenames

So I suspect the problem here is normalization, rather than the 
inability of Linux to write non-ascii filenames.

Characters such as e-acute / e-grave have *two* representations in 
Unicode - the composed form and the decomposed form.

The composed form is a direct mapping from the native encodings and is a 
single codepoint; the decomposed form is two codepoints - (e, 
combining-acute/grave).
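
To make the two forms concrete, here is a minimal sketch in Python (not 
LiveCode); it assumes Python's standard `unicodedata.normalize` is a fair 
stand-in for what `normalizeText` does, and the variable names are purely 
illustrative.

```python
import unicodedata

# Composed: e-acute is the single codepoint U+00E9.
composed = "Carr\u00e9.txt"
# Decomposed: plain 'e' followed by the combining acute accent U+0301.
decomposed = "Carre\u0301.txt"

# The two strings render identically but are different codepoint sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing converts one form into the other.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", composed)
print(to_nfc == composed, to_nfd == decomposed)
```

The strings only compare equal once both have been brought to the same 
form.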

Depending on where the string comes from it might either be composed or 
decomposed - macOS filenames are stored decomposed in the FS, but the 
higher-level parts of the OS make either form work (in a similar fashion 
to how macOS filesystems are case-insensitive by default).

Linux filesystems, however, are both case-sensitive and form-sensitive - 
a filename must match byte for byte the name it was created with 
(indeed, Linux filesystems care nothing for encodings: they see 
filenames as sequences of bytes, which are interpreted relative to the 
user's current locale - the default locale encoding on Linux these days 
is UTF-8).
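
As a rough illustration of why the byte-for-byte match fails, this Python 
sketch (again only a stand-in for LiveCode, using the standard 
`unicodedata` module) prints the UTF-8 bytes a Linux filesystem would see 
for each form of the same visible name.

```python
import unicodedata

name = "Carr\u00e9.txt"

# UTF-8 bytes of the composed (NFC) and decomposed (NFD) spellings.
nfc_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", name).encode("utf-8")

print(nfc_bytes.hex(" "))  # e-acute appears as c3 a9
print(nfd_bytes.hex(" "))  # 'e' (65) followed by cc 81

# Different byte sequences, so Linux treats them as different filenames.
print(nfc_bytes == nfd_bytes)
```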

If your app is managing the files completely on Linux (i.e. it is 
creating / deleting them and the filenames are not user-editable) then 
the problem should be fixable by choosing one normalization form and 
using it whenever you create / look up a file - i.e. pass all 
filenames on the server through `normalizeText(<str>, <form>)` - here 
you want form to be either "NFC" (composed) or "NFD" (decomposed).
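
The approach above might be sketched as follows. This is Python rather 
than LiveCode, and `FORM`, `fs_name`, `write_file` and `file_exists` are 
hypothetical helper names invented for the illustration; the only point is 
that every filename passes through the same normalization step before it 
reaches the filesystem.

```python
import os
import tempfile
import unicodedata

# Assumption: the app picks one form and uses it for every filename.
FORM = "NFC"

def fs_name(name):
    """Return the filename in the single form the app has chosen."""
    return unicodedata.normalize(FORM, name)

def write_file(directory, name, data):
    """Create a file, normalizing its name first."""
    path = os.path.join(directory, fs_name(name))
    with open(path, "wb") as f:
        f.write(data)
    return path

def file_exists(directory, name):
    """Look a file up, normalizing the name the same way."""
    return os.path.exists(os.path.join(directory, fs_name(name)))

with tempfile.TemporaryDirectory() as tmp:
    # Created from a decomposed string, looked up with a composed one.
    write_file(tmp, "Carre\u0301.txt", "contents".encode("utf-8"))
    found = file_exists(tmp, "Carr\u00e9.txt")
    print(found)
```

Whether NFC or NFD is chosen matters less than using the same one on 
every create and every lookup.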

Warmest Regards,

Mark.

P.S. For all the gory details about Unicode normalization forms see - 
https://unicode.org/reports/tr15/

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things


