Spurious characters from html files - text encoding issues?

Ben Rubinstein benr_mc at cogapp.com
Mon May 31 08:39:53 EDT 2021


Also relevant enhancement requests:
https://quality.livecode.com/show_bug.cgi?id=13581
https://quality.livecode.com/show_bug.cgi?id=12205

On 21/05/2021 15:57, Paul Dupuis via use-livecode wrote:
> BBEdit has a built-in "guess encoding" function to try to determine the 
> encoding of a text file.
> 
> I have had this request filed with LC for 6 years now: 
> https://quality.livecode.com/show_bug.cgi?id=14474
> 
> Even Fraser, who did much of the Unicode work for LC7, agreed there should be 
> a guessEncoding function in LiveCode. Instead, anyone who needs one either has 
> to write their own or find someone who has already written one.
> 
> While you can never determine the encoding of every text file with 100% 
> accuracy, there are algorithms that make pretty good guesses. I'd still like 
> to see it as a built-in function in the LC engine.
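
To illustrate the kind of "pretty good guess" described above, here is a minimal sketch in Python (used purely for illustration, since LiveCode has no built-in equivalent). The function name guess_encoding, the order of the checks and the CP1252 fallback are all assumptions of this sketch; it is not BBEdit's algorithm, nor the one proposed in the enhancement request, and it ignores UTF-32 and most legacy encodings.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw file bytes (heuristic only)."""
    # 1. A byte-order mark, when present, is the strongest hint.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # 2. Pure ASCII decodes identically in every candidate considered here.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # 3. Non-ASCII bytes that happen to form valid UTF-8 are rarely anything else.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 4. Assumed fallback: a common single-byte legacy encoding.
    return "cp1252"
```

A real implementation would also weigh character frequencies and other legacy encodings such as MacRoman, which this sketch cannot tell apart from CP1252.
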
> 
> 
> On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:
>> Hi Ben,
>> Thanks for the further details and tips - my problem is now solved!
>>
>> The BBEdit tip re file 'open-as UTF-8' was a great help. I’d not noticed 
>> these options before (as I tend to open files from PathFinder folder lists 
>> not via apps). However, this did indeed reveal format errors on these cache 
>> files when they were saved with the raw (UTF-8 confirmed) htmltext of widget 
>> “browser”. Text encoding to UTF-8 before saving fixed this issue and 
>> re-crawling the source pages has resulted in files that BBEdit recognises as 
>> ‘regular’ UTF-8.
>>
>> This reduced the anomaly count but whilst testing, I also noticed that the 
>> read-write cycle updating the output csv file was spawning anomalies and 
>> expanding those already present. So I wrapped this function to also force 
>> UTF-8 decoding/encoding - and now all is good.
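
The way a read-write cycle can "expand" anomalies can be reproduced with a short Python sketch (Python purely for illustration). It assumes the mismatch is between UTF-8 on disk and MacRoman as the platform's native encoding; that is an assumption, although the sample string quoted in the first message of this thread is consistent with it. The helper name misread_cycle is hypothetical.

```python
def misread_cycle(text: str, wrong: str = "mac_roman") -> str:
    """Write text out as UTF-8, then read the bytes back as the wrong encoding."""
    return text.encode("utf-8").decode(wrong)


original = "\u2019"             # right single quotation mark, common in web pages
once = misread_cycle(original)  # 3 characters: U+201A, U+00C4, U+00F4
twice = misread_cycle(once)     # 7 characters, starting U+201A, U+00C4, U+00F6
thrice = misread_cycle(twice)   # 17 characters

for label, value in (("once", once), ("twice", twice), ("thrice", thrice)):
    print(label, len(value), value)
```

Each pass turns every non-ASCII character into two or three new non-ASCII characters, so a file that goes through the cycle repeatedly grows very quickly. Decoding and encoding explicitly at every read and write breaks the cycle.
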
>>
>> No longer will I assume that a simple text file is a simple text file! :-)
>>
>> Thanks & regards,
>> Keith
>>
>>> On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
>>> <use-livecode at lists.runrev.com> wrote:
>>>
>>> Hi Keith,
>>>
>>> This might need input from the mothership, but I think if you've obtained 
>>> the text from the browser widget's htmlText, it will probably be in the 
>>> special 'internal' format. I'm not entirely sure what happens when you save 
>>> that as text - I suspect it depends on the platform.
>>>
>>> So for clarity (if you have the opportunity to re-save this material; and 
>>> if it won't confuse things because existing files are in one format, and 
>>> new ones another) it would probably be best to textEncode it into UTF-8, 
>>> then save it as binfile. That way the files on disk should be UTF-8, which 
>>> is something like a standard.
>>>
>>> What I tend to do in this situation where I have text files and I'm not 
>>> sure what the format is (and I spend quite a lot of time messing with text 
>>> files from various sources, some unknown and many not under my control) is 
>>> use a good text editor - I use BBEdit on Mac, not sure what suitable 
>>> alternatives would be on Windows or Linux - to investigate the file. BBEdit 
>>> makes a guess when it opens the file, but allows you to try re-opening in 
>>> different encodings, and then warns you if there are byte sequences that 
>>> don't make sense with that encoding. So by doing this I can often figure 
>>> out what the encoding of the file is - once you've got that, you're off to 
>>> the races.
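
The "warns you if there are byte sequences that don't make sense" check can be approximated in a few lines of Python (for illustration only; this is not how BBEdit is implemented, and the function name first_invalid_offset is hypothetical). A strict decode either succeeds or reports where it failed.

```python
def first_invalid_offset(data: bytes, encoding: str):
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode(encoding)  # decoding is strict by default
    except UnicodeDecodeError as err:
        return err.start
    return None


sample = "caf\u00e9".encode("cp1252")         # the bytes b"caf\xe9"
print(first_invalid_offset(sample, "utf-8"))   # offset of the lone 0xE9 byte
print(first_invalid_offset(sample, "cp1252"))  # None: valid in this encoding
```

One caveat: a single-byte encoding that assigns a character to every byte value, such as MacRoman, never fails this test, so a clean decode only rules encodings out; it does not prove the guess is right.
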
>>>
>>> But if you have the opportunity to re-collect the whole set, then I *think* 
>>> the above formula of textEncoding from LC's internal format to UTF-8, then 
>>> saving as binary file; and reversing the process when you load them back in 
>>> to process; and then doing the same again - possibly to a different format 
>>> - when you output the CSV, should see you clear.
>>>
>>> HTH,
>>>
>>> Ben
>>>
>>>
>>> On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
>>>> Thanks Ben, that’s really interesting. It never occurred to me that these 
>>>> html files might be anything other than simple plain text files, like those 
>>>> I’d worked with in Coda, etc., for years.
>>>> The local HTML files are storage of the HTML text pulled from the LiveCode 
>>>> browser widget, saved using the URL ‘file:’ option. I’d been working 
>>>> ‘live’ from the Browser widget’s html text until recently, when I’ve 
>>>> introduced these local files to split page ‘crawling’ and analysis 
>>>> activities without needing a database.
>>>> Reading the files back into LiveCode with the URL ‘file:’ option works 
>>>> quite happily with no text anomalies when put into a field to read. The 
>>>> problem seems to arise when I load the HTML text into a variable and then 
>>>> start to extract elements using LiveCode's text chunking. For example 
>>>> pulling the text between the offsets of say <p> & </p> tags is when these 
>>>> character anomalies have started to pop into the strings.
>>>> A quick test on reading in the local HTML files with the URL ‘binfile:’ 
>>>> option and then textDecode(tString, "UTF-8") seems to reduce the frequency 
>>>> and size of anomalies, but some remain. So, I’ll see if re-crawling pages 
>>>> and saving the HTML text from the browser widget as binfiles reduces this 
>>>> further.
>>>> Thanks & regards,
>>>> Keith
>>>>> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
>>>>> <use-livecode at lists.runrev.com> wrote:
>>>>>
>>>>> Hi Keith,
>>>>>
>>>>> The thing with character encoding is that you always need to know where 
>>>>> it's coming from and where it's going.
>>>>>
>>>>> Do you know how the HTML documents were obtained? Saved from a browser, 
>>>>> fetched by curl, fetched by LiveCode? Or generated on disk by something 
>>>>> else?
>>>>>
>>>>> If it was saved from a browser or fetched by curl, then the format is 
>>>>> most likely to be UTF-8. In order to see it correctly in LiveCode, you'd 
>>>>> need to do two things:
>>>>>     - read it in as a binary file, rather than text (e.g. use URL 
>>>>> "binfile://..." or "open file ... for binary read")
>>>>>     - convert it to the internal text format FROM UTF-8 - which means use 
>>>>> textDecode(tString, "UTF-8"), rather than textEncode
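
The two steps above (read the raw bytes, then decode FROM UTF-8) look like this as a Python sketch, shown only to make the order of operations concrete. In LiveCode the corresponding pieces are the "binfile:" URL and textDecode named above; the function name read_html and the throwaway demo file are assumptions of the sketch.

```python
import tempfile
from pathlib import Path


def read_html(path: str, encoding: str = "utf-8") -> str:
    """Read raw bytes, then decode them explicitly from the given encoding."""
    raw = Path(path).read_bytes()  # binary read: no implicit platform conversion
    return raw.decode(encoding)    # bytes -> text, FROM the stated encoding


# Demo with a temporary file containing a curly apostrophe.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_bytes("<p>It\u2019s fine</p>".encode("utf-8"))
    text = read_html(str(page))

print(text)
```
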
>>>>>
>>>>> If it was fetched by LiveCode, then it most likely arrived over the wire 
>>>>> as UTF-8, but if it was saved by LiveCode as text (not binary) then it 
>>>>> _may_ have got corrupted.
>>>>>
>>>>> If you can see the text looking as you expect in LiveCode, you've solved 
>>>>> half the problem. Then you need to consider where it's going: who (or 
>>>>> what) is going to consume the CSV. This is the time to use textEncode, and 
>>>>> then be sure to save it as a binary file. If the consumer will be something 
>>>>> reasonably modern, then again UTF-8 is a good default. If it's something 
>>>>> much older, you might need to use "CP1252" or similar.
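
The output side can be sketched the same way in Python (illustration only; in LiveCode this is textEncode followed by a "binfile:" write). The function name csv_bytes and the choice to replace unencodable characters with "?" are assumptions of this sketch, not something the thread prescribes.

```python
import csv
import io


def csv_bytes(rows, encoding: str = "utf-8") -> bytes:
    """Serialise rows to CSV text, then encode explicitly for the consumer."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    # errors="replace" substitutes "?" for characters the target encoding lacks.
    return buffer.getvalue().encode(encoding, errors="replace")


rows = [["title", "note"], ["Caf\u00e9", "It\u2019s \u2603"]]
modern = csv_bytes(rows)            # UTF-8 can represent every character
legacy = csv_bytes(rows, "cp1252")  # the snowman (U+2603) becomes "?"
```

The resulting bytes would then be written with a binary write, so that nothing re-encodes them on the way to disk.
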
>>>>>
>>>>> HTH,
>>>>>
>>>>> Ben
>>>>>
>>>>>
>>>>> On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
>>>>>> Hi folks,
>>>>>> I’m using LiveCode to summarise text from HTML documents into csv 
>>>>>> summary files and am noticing that when I extract strings from html 
>>>>>> documents stored on disk - rather than visiting the sites via the 
>>>>>> browser widget & grabbing the HTML text - weird characters are being 
>>>>>> inserted in place of what appear to be ‘regular’ characters.
>>>>>> The number of characters inserted can run into the thousands per 
>>>>>> instance, making my csv ‘summary’ file run into gigabytes! Has anyone 
>>>>>> seen the following type of string before, happen to know what might be 
>>>>>> causing it and offer a fix?
>>>>>> ‚Äö 
>>>>>>
>>>>>> I’ve tried deliberately setting UTF-8 on the extracted strings, with put 
>>>>>> textEncode(tString, "UTF-8") into tString. Currently I’m not attempting 
>>>>>> to force any text format on the local HTML documents.
>>>>>> Thanks & regards,
>>>>>> Keith
>>>>>> _______________________________________________
>>>>>> use-livecode mailing list
>>>>>> use-livecode at lists.runrev.com
>>>>>> Please visit this url to subscribe, unsubscribe and manage your 
>>>>>> subscription preferences:
>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode