Linux file names in LC Server
Neville Smythe
neville.smythe at optusnet.com.au
Sun Aug 13 08:45:39 EDT 2023
As we know, with LC it is pretty straightforward to deal with internationalised text for remote databases and unknown user platforms by converting to UTF-8. But I have come across a problem with Linux filenames containing non-ASCII characters which has me befuddled.
My many-years-old app has until now just required all filenames to be in standard 7-bit ASCII, so it was way past time I brought it up to date.
The app talks to a database, media and web site on a Unix (DreamHost) server, using LC Server as intermediary.
I create a file, say “Carré.txt”, on a Mac. The non-ASCII character in that name is [e-acute]; I shall use this bracket convention from now on to ensure that what is displayed here on the forum is understood.
BTW, as far as I can determine, that character in the Mac file system is the single byte hex [8E], the classic MacRoman encoding, not its 2-byte UTF-8 encoding [C3 A9]. So I don’t understand how macOS handles Unicode in its filesystem, which it certainly does. We are exhorted to textEncode to UTF-8 when exporting anything outside LC, but perhaps not filenames? If I textEncode the filename and save with that name, I get a new file “Carr[squareroot][copyright].txt”. I am befuddled already (how does macOS distinguish MacRoman encoding from Unicode encoding when it displays a file name?) but that is another story for another place.
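The byte values quoted above can be checked outside LC. The following is a minimal sketch in Python, used here purely as a byte-level calculator and not as a statement about how LC or macOS stores names internally; the variable names are illustrative. It shows that [e-acute] is [8E] in MacRoman and [C3 A9] in UTF-8, and that reading the UTF-8 bytes as MacRoman produces the [squareroot][copyright] pair.

```python
# Illustrative check of the encodings of [e-acute]; assumes the
# precomposed character U+00E9, which is an assumption about the name.
name = "Carr\u00e9.txt"

mac_roman_bytes = name.encode("mac_roman")  # classic MacRoman, one byte per char
utf8_bytes = name.encode("utf-8")           # two bytes for U+00E9

print(mac_roman_bytes.hex(" "))
print(utf8_bytes.hex(" "))

# Interpret the UTF-8 bytes as if they were MacRoman text.
misread = utf8_bytes.decode("mac_roman")
print(misread)
```

If this reading is right, the odd filename is simply UTF-8 bytes being displayed, or re-encoded, as though they were MacRoman.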
Oh, and another story: it ain't true that all text in LC is UTF-16. While it’s not possible using LC APIs to determine exactly what is inside the black box of an LC variable in memory, it is evidently platform dependent: that MacRoman [8E] is reported as being the relevant byte in the LC variable. What can be determined is what is on disk when a stack is saved: there, text appears to be encoded as a mixture of 7-bit ASCII where that is possible and UTF-16 for other characters. Not that we as consumers need to know how the magic is performed, as long as it works. Back to my story.
So now I want to upload this file to my remote Linux server. I POST a form, prepared with libURLMultiPartFormData, to an LC Server script, which is supposed to save the received file.
If I attempt to use the original Mac file name, the server responds “Cannot open file Carr[e-acute].txt”.
(This is the error message in the result from “open file tFileName for binary write”.)
If I send textEncode(filename, "utf-8") as the file name, the server responds “Cannot open file Carr[squareroot][copyright].txt”.
If I textEncode at the client end and then textDecode on the server, it responds “Cannot open file Carre[E-grave].txt”. (Where did THAT come from? Is there a bug in textDecode on Linux LCS? The native encoding on Linux is supposed to be ISO-Latin-1, where E-grave is hex [C8]; in MacRoman it is [E9]. I see no apparent connection between these and the UTF-8 bytes.)
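One possible connection, offered as an unverified guess rather than a diagnosis: byte [E9] is [e-acute] in ISO-Latin-1 and [E-grave] in MacRoman, so a name correctly decoded to Latin-1 on the server could show up as [E-grave] if the error text is later displayed as MacRoman on the client. The stray "e" might also hint at a decomposed name (plain "e" plus a combining acute accent), a form in which macOS is known to store filenames on HFS+. The Python sketch below only checks these byte facts; it says nothing about what LC Server actually does.

```python
import unicodedata

# The same byte read under the two legacy encodings.
byte = b"\xe9"
as_latin1 = byte.decode("latin-1")       # expected: e-acute
as_mac_roman = byte.decode("mac_roman")  # expected: E-grave
print(as_latin1, as_mac_roman)

# Decomposed (NFD) form of e-acute: plain "e" followed by U+0301.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print([hex(ord(c)) for c in decomposed])
print(decomposed.encode("utf-8").hex(" "))
```

Comparing the raw bytes of the name at each hop (client before encoding, server after decoding) would confirm or rule out either guess.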
And just as a piece of nonsense: if I send the raw un-encoded Mac file name but then textDecode on the server, the file is happily saved as “Carr.txt”, which is correct, since [8E] followed by “.” is illegal as UTF-8, so the [e-acute] is just skipped by textDecode.
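The same effect can be reproduced outside LC. In this Python sketch (an analogy only; it assumes textDecode behaves like a lenient decoder that drops invalid bytes, which matches what I observed but is not documented behaviour I can cite), the MacRoman byte [8E] is a UTF-8 continuation byte with no lead byte before it, so a strict decoder rejects it and a lenient one drops it.

```python
# MacRoman bytes of the name, sent without any UTF-8 encoding step.
raw = "Carr\u00e9.txt".encode("mac_roman")  # contains the single byte 0x8E

# A strict UTF-8 decoder refuses the lone continuation byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder silently drops the offending byte.
lenient = raw.decode("utf-8", errors="ignore")
print(strict_ok, lenient)
```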
Could it be that LC Server cannot create files on Linux with non-ASCII names?!? That doesn’t seem believable. I can of course directly create files on the server with non-ASCII characters such as [e-acute].
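For comparison, here is a minimal Python sketch of creating such a file directly. It relies on the fact that Linux filenames are plain byte strings, with UTF-8 the usual convention; the temporary directory and the file contents are placeholders, and nothing here involves LC Server.

```python
import os
import tempfile

# UTF-8 bytes of the desired name; 0xC3 0xA9 encodes e-acute.
name_bytes = "Carr\u00e9.txt".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    # Use a bytes path so the locale settings play no part.
    path = os.path.join(os.fsencode(tmp), name_bytes)
    with open(path, "wb") as f:
        f.write(b"test")
    # Listing with a bytes path returns the stored names as raw bytes.
    listing = os.listdir(os.fsencode(tmp))

print(listing)
```

So the filesystem itself accepts the UTF-8 bytes; the question is what bytes LC Server actually hands to it.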
Either I am missing something, or surely our European users have seen this already, so someone should be able to unfuddle me!
Neville Smythe