Spurious characters from html files - text encoding issues?

Mon May 31 10:39:44 EDT 2021

Thanks for posting these.

The latter one (https://quality.livecode.com/show_bug.cgi?id=12205) I was 
already following because I think I raised the issue originally and Mark 
kindly added a bug entry. I was unaware of the former, but it would also 
be a convenient enhancement - especially along with a built-in 
'guessEncoding' function.
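
To make the idea concrete, here is a minimal sketch of the kind of heuristic such a function could apply. It is written in Python rather than LiveCode purely for illustration; the name guess_encoding, the order of the checks and the CP1252 fallback are assumptions of this sketch, not anything taken from the bug reports or the LC engine.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw bytes (a heuristic, not proof)."""
    # 1. A byte-order mark is the strongest hint available.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Text in a legacy 8-bit encoding rarely happens to be valid UTF-8,
    #    so a clean strict decode is good evidence (plain ASCII also lands here).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Assumed fallback for this sketch: a common legacy Windows code page.
    return "cp1252"
```

A real implementation would need more candidates (MacRoman, UTF-16 without a BOM, and so on) and some statistical scoring, which is why the result can only ever be a guess.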

On 5/31/2021 8:39 AM, Ben Rubinstein via use-livecode wrote:
> Also relevant enhancement requests:
> https://quality.livecode.com/show_bug.cgi?id=13581
> https://quality.livecode.com/show_bug.cgi?id=12205
>
> On 21/05/2021 15:57, Paul Dupuis via use-livecode wrote:
>> BBEdit has a built in "guess encoding" function to try to determine 
>> the encoding of a text file.
>>
>> I have had this bug filed with LC for 6 years now: 
>> https://quality.livecode.com/show_bug.cgi?id=14474
>>
>> Even Frasier, who did much of the Unicode work for LC7, agreed there 
>> should be a guessEncoding function in LiveCode. Instead, anyone who 
>> needs one either has to write their own or find someone who has 
>> already written one and get it from them.
>>
>> While you can never tell the encoding of every text file with 100% 
>> accuracy, there are algorithms that make pretty good guesses. I'd 
>> still like to see it as a built-in function in the LC engine.
>>
>>
>> On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:
>>> Hi Ben,
>>> Thanks for the further details and tips - my problem is now solved!
>>>
>>> The BBEdit tip re file 'open-as UTF-8' was a great help. I’d not 
>>> noticed these options before (as I tend to open files from 
>>> PathFinder folder lists not via apps). However, this did indeed 
>>> reveal format errors on these cache files when they were saved with 
>>> the raw (UTF-8 confirmed) htmltext of widget “browser”. Text 
>>> encoding to UTF-8 before saving fixed this issue and re-crawling the 
>>> source pages has resulted in files that BBEdit recognises as 
>>> ‘regular’ UTF-8.
>>>
>>> This reduced the anomaly count but whilst testing, I also noticed 
>>> that the read-write cycle updating the output csv file was spawning 
>>> anomalies and expanding those already present. So I wrapped this 
>>> function to also force UTF-8 decoding/encoding - and now all is 
>>> good.
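
The way a read-write cycle can expand anomalies that are already present can be reproduced in a few lines. This is a sketch in Python, not LiveCode, and it assumes the UTF-8 bytes were being misread as MacRoman on each cycle; that is only a guess based on how the sample string in the original post looks. The helper name misread_cycle is hypothetical.

```python
def misread_cycle(text: str) -> str:
    """Write text out as UTF-8 bytes, then wrongly read them back as MacRoman."""
    return text.encode("utf-8").decode("mac_roman")


original = "\u2019"            # one right curly quote, a typical 'regular' character
once = misread_cycle(original)
twice = misread_cycle(once)

# Each cycle turns every non-ASCII character into two or three new ones,
# so the damage compounds every time the file is rewritten.
for label, s in (("original", original), ("1 cycle", once), ("2 cycles", twice)):
    print(label, len(s), s)

# While no bytes have been lost, the damage can be undone by reversing
# each misread step in turn.
repaired = twice.encode("mac_roman").decode("utf-8")
repaired = repaired.encode("mac_roman").decode("utf-8")
```

Under that assumption one character becomes three after the first cycle and seven after the second, which would explain how a summary file grows so quickly once it is repeatedly read and rewritten without explicit decoding and encoding.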
>>>
>>> No longer will I assume that a simple text file is a simple text 
>>> file! :-)
>>>
>>> Thanks & regards,
>>> Keith
>>>
>>>> On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
>>>> <use-livecode at lists.runrev.com> wrote:
>>>>
>>>> Hi Keith,
>>>>
>>>> This might need input from the mothership, but I think if you've 
>>>> obtained the text from the browser widget's htmlText, it will 
>>>> probably be in the special 'internal' format. I'm not entirely sure 
>>>> what happens when you save that as text - I suspect it depends on 
>>>> the platform.
>>>>
>>>> So for clarity (if you have the opportunity to re-save this 
>>>> material; and if it won't confuse things because existing files are 
>>>> in one format, and new ones another) it would probably be best to 
>>>> textEncode it into UTF-8, then save it as binfile. That way the 
>>>> files on disk should be UTF-8, which is something like a standard.
>>>>
>>>> What I tend to do in this situation where I have text files and I'm 
>>>> not sure what the format is (and I spend quite a lot of time 
>>>> messing with text files from various sources, some unknown and many 
>>>> not under my control) is use a good text editor - I use BBEdit on 
>>>> Mac, not sure what suitable alternatives would be on Windows or 
>>>> Linux - to investigate the file. BBEdit makes a guess when it opens 
>>>> the file, but allows you to try re-opening in different encodings, 
>>>> and then warns you if there are byte sequences that don't make 
>>>> sense with that encoding. So by doing this I can often figure out 
>>>> what the encoding of the file is - once you've got that, you're off 
>>>> to the races.
>>>>
>>>> But if you have the opportunity to re-collect the whole set, then I 
>>>> *think* the above formula of textEncoding from LC's internal format 
>>>> to UTF-8, then saving as binary file; and reversing the process 
>>>> when you load them back in to process; and then doing the same 
>>>> again - possibly to a different format - when you output the CSV, 
>>>> should see you clear.
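
The formula above can be sketched as follows. This is Python rather than LiveCode, so str.encode and bytes.decode stand in for textEncode and textDecode, and binary-mode files stand in for the 'binfile:' URL; the helper names, file names and sample text are hypothetical, and CP1252 is used for the CSV only as an example of "a different format".

```python
import os
import tempfile


def save_text(path: str, text: str, encoding: str = "utf-8") -> None:
    """Encode explicitly and write raw bytes, so nothing is converted implicitly."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))


def load_text(path: str, encoding: str = "utf-8") -> str:
    """Read raw bytes and decode them with the encoding they were saved in."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


page = "<p>Caf\u00e9 \u2018quoted\u2019 text</p>"   # stand-in for the widget's htmlText

tmp_dir = tempfile.mkdtemp()
cache_path = os.path.join(tmp_dir, "page.html")
csv_path = os.path.join(tmp_dir, "summary.csv")

save_text(cache_path, page)                            # crawl: cache as UTF-8
loaded = load_text(cache_path)                         # analysis: decode on the way in
save_text(csv_path, loaded + "\n", encoding="cp1252")  # output: encode for the consumer
```

The point of the design is that the encoding is named at every boundary between memory and disk, so neither side has to guess.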
>>>>
>>>> HTH,
>>>>
>>>> Ben
>>>>
>>>>
>>>> On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
>>>>> Thanks Ben, that’s really interesting. It never occurred to me 
>>>>> that these html files might be anything other than simple plain 
>>>>> text files, as I’d worked with in Coda, etc., for years.
>>>>> The local HTML files are storage of the HTML text pulled from the 
>>>>> LiveCode browser widget, saved using the URL ‘file:’ option. I’d 
>>>>> been working ‘live’ from the Browser widget’s html text until 
>>>>> recently, when I’ve introduced these local files to split page 
>>>>> ‘crawling’ and analysis activities without needing a database.
>>>>> Reading the files back into LiveCode with the URL ‘file:’ option 
>>>>> works quite happily with no text anomalies when put into a field 
>>>>> to read. The problem seems to arise when I load the HTML text into 
>>>>> a variable and then start to extract elements using LiveCode's 
>>>>> text chunking. For example pulling the text between the offsets of 
>>>>> say <p> & </p> tags is when these character anomalies have started 
>>>>> to pop into the strings.
>>>>> A quick test on reading in the local HTML files with the URL 
>>>>> ‘binfile:’ option and then textDecode(tString, “UTF-8”) seems to 
>>>>> reduce the frequency and size of anomalies, but some remain. So, 
>>>>> I’ll see if re-crawling pages and saving the HTML text from the 
>>>>> browser widget as binfiles reduces this further.
>>>>> Thanks & regards,
>>>>> Keith
>>>>>> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
>>>>>> <use-livecode at lists.runrev.com> wrote:
>>>>>>
>>>>>> Hi Keith,
>>>>>>
>>>>>> The thing with character encoding is that you always need to know 
>>>>>> where it's coming from and where it's going.
>>>>>>
>>>>>> Do you know how the HTML documents were obtained? Saved from a 
>>>>>> browser, fetched by curl, fetched by LiveCode? Or generated on 
>>>>>> disk by something else?
>>>>>>
>>>>>> If it was saved from a browser or fetched by curl, then the 
>>>>>> format is most likely to be UTF-8. In order to see it correctly 
>>>>>> in LiveCode, you'd need to do two things:
>>>>>>     - read it in as a binary file, rather than text (e.g. use URL 
>>>>>> "binfile://..." or "open file ... for binary read")
>>>>>>     - convert it to the internal text format FROM UTF-8 - which 
>>>>>> means use textDecode(tString, "UTF-8"), rather than textEncode
>>>>>>
>>>>>> If it was fetched by LiveCode, then it most likely arrived over 
>>>>>> the wire as UTF-8, but if it was saved by LiveCode as text (not 
>>>>>> binary) then it _may_ have got corrupted.
>>>>>>
>>>>>> If you can see the text looking as you expect in LiveCode, you've 
>>>>>> solved half the problem. Then you need to consider where it's 
>>>>>> going: who (or what) is going to consume the CSV. This is the time 
>>>>>> to use textEncode, and then be sure to save it as a binary file. 
>>>>>> If the consumer will be something reasonably modern, then again 
>>>>>> UTF-8 is a good default. If it's something much older, you might 
>>>>>> need to use "CP1252" or similar.
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
>>>>>>> Hi folks,
>>>>>>> I’m using LiveCode to summarise text from HTML documents into 
>>>>>>> csv summary files and am noticing that when I extract strings 
>>>>>>> from html documents stored on disk - rather than visiting the 
>>>>>>> sites via the browser widget & grabbing the HTML text - weird 
>>>>>>> characters are being inserted in place of what appear to be 
>>>>>>> ‘regular’ characters.
>>>>>>> The number of characters inserted can run into the thousands per 
>>>>>>> instance, making my csv ‘summary’ file run into gigabytes! Has 
>>>>>>> anyone seen the following type of string before, happen to know 
>>>>>>> what might be causing it and offer a fix?
>>>>>>> ‚Äö√Ñ√∂‚àö√ë‚àö‚àÇ‚Äö√†√∂‚àö√´‚Äö√†√∂‚Äö√†√á‚Äö√Ñ√∂‚àö‚Ä†‚àö‚àÇ‚Äö√†√∂‚àö¬¥‚Äö√Ñ√∂‚àö‚Ä†‚àö‚àÇ‚Äö√Ñ√∂‚àö‚Ä†‚àö√° 
>>>>>>>
>>>>>>> I’ve tried deliberately setting UTF-8 on the extracted strings, 
>>>>>>> with put textEncode(tString, "UTF-8") into tString. Currently 
>>>>>>> I’m not attempting to force any text format on the local HTML 
>>>>>>> documents.
>>>>>>> Thanks & regards,
>>>>>>> Keith
>>>>>>> _______________________________________________
>>>>>>> use-livecode mailing list
>>>>>>> use-livecode at lists.runrev.com
>>>>>>> Please visit this url to subscribe, unsubscribe and manage your 
>>>>>>> subscription preferences:
>>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode