Linux filenames in LC Server
Mark Waddingham
mark at livecode.com
Mon Aug 14 06:22:03 EDT 2023
On 2023-08-14 02:45, Neville Smythe via use-livecode wrote:
> OK, so the macOS *is* using utf-8 for its file names - the [e-acute] in
> the filename Carr.txt is rendered with two bytes [C3A9] not the single
> byte MacRoman encoding. I got tricked by copying the terminal listing
> into another program rather than hex dumping within the terminal, and
> somewhere in the process the native encoding was preferred.
>
> So one must *not* textEncode a filename to utf-8 before writing a file
> to disk, LC deals with the encoding, although you *should* textEncode
> its contents.
>
> Which leaves the problem of why I can't get LC Server on Linux to write
> non-ascii filenames
So I suspect the problem here is normalization, rather than the
inability of Linux to write non-ascii filenames.
Characters such as e-acute / e-grave have *two* representations in
Unicode - the composed and the decomposed form.
The composed form is a direct mapping from the native encodings and is a
single codepoint (U+00E9 for e-acute); the decomposed form is two
codepoints - the base letter followed by a combining mark (e, then
combining-acute/grave, i.e. U+0065 U+0301 for e-acute).
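To make the two representations concrete, here is a small sketch in Python (used purely for illustration, since its standard `unicodedata` module exposes the same normalization forms; the helper name `describe` is made up for this example). It prints the codepoints and UTF-8 bytes of each form of e-acute and shows that normalizing converts one into the other.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single codepoint
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

def describe(s):
    """Return the codepoints of s and the hex of its UTF-8 encoding."""
    return [f"U+{ord(c):04X}" for c in s], s.encode("utf-8").hex()

print(describe(composed))     # one codepoint, two bytes
print(describe(decomposed))   # two codepoints, three bytes

# The two strings render identically but do not compare equal as-is.
print(composed == decomposed)

# Normalizing to a common form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

The composed form is the `C3A9` byte pair mentioned in the quoted message; the decomposed form encodes to three different bytes.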
Depending on where the string comes from it might either be composed or
decomposed - macOS filenames are stored decomposed in the FS, but the
higher-level parts of the OS make either form work (in a similar fashion
to how macOS filesystems are case-insensitive by default).
Linux filesystems, however, are both case-sensitive and form-sensitive -
a filename must match byte for byte with what it was created with
(indeed, Linux filesystems care nothing for encodings, they see
filenames as a sequence of bytes which are interpreted relative to the
user's current locale - the default locale encoding on Linux these days
is UTF-8).
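As a rough illustration of what byte-for-byte matching implies, the Python sketch below models a directory as a set of raw byte strings (a stand-in only, it does not touch a real filesystem) and uses a hypothetical filename. A name created in one form is not found when looked up in the other form.

```python
import unicodedata

name = "Carr\u00e9.txt"   # hypothetical filename containing e-acute

# The same visible name, encoded to UTF-8 in each normalization form.
created_as = unicodedata.normalize("NFD", name).encode("utf-8")
looked_up_as = unicodedata.normalize("NFC", name).encode("utf-8")

# Stand-in for a directory that compares names as raw bytes.
directory = {created_as}

def exists(name_bytes):
    """Byte-exact lookup, with no case folding or normalization."""
    return name_bytes in directory

print(len(created_as), len(looked_up_as))   # the byte lengths differ
print(exists(created_as))                   # found: same bytes as created
print(exists(looked_up_as))                 # not found: different bytes
```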
If your app is managing the files completely on Linux (i.e. it is
creating / deleting them and the filenames are not user-editable) then
the problem should be fixable by choosing a single normalization form
and using it whenever you create / look up a file - i.e. pass all
filenames on the server through `normalizeText(<str>, <form>)` - here
you want form to be either "NFC" (composed) or "NFD" (decomposed).
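The "normalize at every entry point" idea can be sketched as follows, again in Python for illustration only (in LiveCode the normalizing step would be the `normalizeText` call described above). The helper names `stored_name`, `write_file` and `file_exists`, the choice of NFC, and the filename are all assumptions made for this example.

```python
import os
import tempfile
import unicodedata

FORM = "NFC"  # assumption: either form works if it is used everywhere

def stored_name(name):
    """Normalize a filename before it reaches the filesystem."""
    return unicodedata.normalize(FORM, name)

def write_file(folder, name, text):
    """Create the file under its normalized name, contents as UTF-8."""
    path = os.path.join(folder, stored_name(name))
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

def file_exists(folder, name):
    """Look the file up under the same normalized name."""
    return os.path.exists(os.path.join(folder, stored_name(name)))

with tempfile.TemporaryDirectory() as folder:
    write_file(folder, "Carre\u0301.txt", "contents")   # decomposed input
    found = file_exists(folder, "Carr\u00e9.txt")        # composed input

print(found)
```

Because both the create and the lookup path go through the same helper, it no longer matters which form the caller's string arrived in.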
Warmest Regards,
Mark.
P.S. For all the gory details about Unicode normalization forms see -
https://unicode.org/reports/tr15/
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things