Spurious characters from html files - text encoding issues?
Paul Dupuis
paul at researchware.com
Mon May 31 10:39:44 EDT 2021
Thanks for posting these.
The latter one (https://quality.livecode.com/show_bug.cgi?id=12205) I was
already following because I think I raised the issue originally and Mark
kindly added a bug entry. The former I was unaware of, but it would also
be a convenient enhancement - especially along with a built-in
'guessEncoding' function.
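
For illustration only, here is a minimal sketch of the kind of heuristic
such a function might use, written in Python rather than LiveCode. The
name guess_encoding, the order of the checks and the CP1252 fallback are
all assumptions made for this example; it is not LiveCode API and not
BBEdit's algorithm.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw file bytes (heuristic only)."""
    # A byte-order mark, when present, is the strongest hint.
    # The UTF-32 LE mark begins with the UTF-16 LE mark, so test UTF-32 first.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # Pure 7-bit data reads the same in ASCII and in most supersets of it.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Non-ASCII bytes that happen to form valid UTF-8 are rarely anything else.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Fallback to a single-byte legacy encoding; this choice is a guess.
    return "cp1252"
```

A real implementation would probably also score several legacy encodings
against each other, which is where the "never 100% accurate" caveat comes
from.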
On 5/31/2021 8:39 AM, Ben Rubinstein via use-livecode wrote:
> Also relevant enhancement requests:
> https://quality.livecode.com/show_bug.cgi?id=13581
> https://quality.livecode.com/show_bug.cgi?id=12205
>
> On 21/05/2021 15:57, Paul Dupuis via use-livecode wrote:
>> BBEdit has a built-in "guess encoding" function to try to determine
>> the encoding of a text file.
>>
>> I have had this bug in to LC now for 6 years:
>> https://quality.livecode.com/show_bug.cgi?id=14474
>>
>> Even Frasier, who did much of the Unicode work for LC7, agreed there
>> should be a guessEncoding function in LiveCode. Instead, anyone who
>> needs one either has to write their own or find someone who has
>> written one to get one from.
>>
>> While you can never tell the encoding of every text file with 100%
>> accuracy, there are algorithms that make pretty good guesses. I'd
>> still like to see it as a built-in function in the LC engine.
>>
>>
>> On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:
>>> Hi Ben,
>>> Thanks for the further details and tips - my problem is now solved!
>>>
>>> The BBEdit tip re file 'open-as UTF-8' was a great help. I’d not
>>> noticed these options before (as I tend to open files from
>>> PathFinder folder lists not via apps). However, this did indeed
>>> reveal format errors on these cache files when they were saved with
>>> the raw (UTF-8 confirmed) htmltext of widget “browser”. Text
>>> encoding to UTF-8 before saving fixed this issue and re-crawling the
>>> source pages has resulted in files that BBEdit recognises as
>>> ‘regular’ UTF-8.
>>>
>>> This reduced the anomaly count but whilst testing, I also noticed
>>> that the read-write cycle updating the output csv file was spawning
>>> anomalies and expanding those already present. So I wrapped this
>>> function to also force UTF-8 decoding/encoding - and now all is
>>> good.
>>>
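
One possible explanation for anomalies that grow on every read-write
cycle, sketched in Python rather than LiveCode: if text is written as
UTF-8 but read back as a single-byte encoding, each non-ASCII character
becomes several characters, and the next cycle multiplies them again. The
MacRoman mismatch below is an assumption. It is used only because the
sample string '‚Äö' from the original post is what U+201A looks like when
its UTF-8 bytes are read as MacRoman; the thread does not confirm which
encodings were actually involved.

```python
def mismatched_cycle(text: str) -> str:
    """Write as UTF-8, then read the same bytes back as MacRoman (assumed)."""
    return text.encode("utf-8").decode("mac_roman")


def matched_cycle(text: str) -> str:
    """Write as UTF-8 and read back as UTF-8."""
    return text.encode("utf-8").decode("utf-8")


start = "\u2019"  # a single curly apostrophe

# Repeat the mismatched cycle and record how long the string becomes.
damaged = start
lengths = []
for _ in range(4):
    damaged = mismatched_cycle(damaged)
    lengths.append(len(damaged))

# The matched cycle leaves the text untouched however often it runs.
stable = start
for _ in range(4):
    stable = matched_cycle(stable)

print(lengths)
print(stable == start)
```

With a matched decode and encode at both ends, the text survives any
number of cycles unchanged, which fits the fix described above.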
>>> No longer will I assume that a simple text file is a simple text
>>> file! :-)
>>>
>>> Thanks & regards,
>>> Keith
>>>
>>>> On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode
>>>> <use-livecode at lists.runrev.com> wrote:
>>>>
>>>> Hi Keith,
>>>>
>>>> This might need input from the mothership, but I think if you've
>>>> obtained the text from the browser widget's htmlText, it will
>>>> probably be in the special 'internal' format. I'm not entirely sure
>>>> what happens when you save that as text - I suspect it depends on
>>>> the platform.
>>>>
>>>> So for clarity (if you have the opportunity to re-save this
>>>> material; and if it won't confuse things because existing files are
>>>> in one format, and new ones another) it would probably be best to
>>>> textEncode it into UTF-8, then save it as binfile. That way the
>>>> files on disk should be UTF-8, which is something like a standard.
>>>>
>>>> What I tend to do in this situation where I have text files and I'm
>>>> not sure what the format is (and I spend quite a lot of time
>>>> messing with text files from various sources, some unknown and many
>>>> not under my control) is use a good text editor - I use BBEdit on
>>>> Mac, not sure what suitable alternatives would be on Windows or
>>>> Linux - to investigate the file. BBEdit makes a guess when it opens
>>>> the file, but allows you to try re-opening in different encodings,
>>>> and then warns you if there are byte sequences that don't make
>>>> sense with that encoding. So by doing this I can often figure out
>>>> what the encoding of the file is - once you've got that, you're off
>>>> to the races.
>>>>
>>>> But if you have the opportunity to re-collect the whole set, then I
>>>> *think* the above formula of textEncoding from LC's internal format
>>>> to UTF-8, then saving as binary file; and reversing the process
>>>> when you load them back in to process; and then doing the same
>>>> again - possibly to a different format - when you output the CSV,
>>>> should see you clear.
>>>>
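
The formula above, sketched in Python rather than LiveCode, assuming the
cache files are UTF-8 and the CSV consumer wants CP1252. The helper names
load_text and save_text, the file names and the sample markup are all
hypothetical. In LiveCode the same roles are played by textDecode and
textEncode combined with the 'binfile:' URL scheme.

```python
import os
import tempfile


def load_text(path: str, encoding: str = "utf-8") -> str:
    """Read raw bytes, then decode them with an explicitly named encoding."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


def save_text(path: str, text: str, encoding: str = "utf-8") -> None:
    """Encode with an explicitly named encoding, then write raw bytes."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))


with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "page.html")   # hypothetical cache file
    csv_path = os.path.join(folder, "summary.csv")   # hypothetical output file

    page = "<p>caf\u00e9 \u2019quoted\u2019</p>"

    save_text(cache_path, page)               # crawl step: store as UTF-8
    reloaded = load_text(cache_path)          # analysis step: decode UTF-8
    save_text(csv_path, reloaded, "cp1252")   # output step: consumer's format
    csv_text = load_text(csv_path, "cp1252")  # what the consumer would see

print(reloaded == page, csv_text == page)
```

The point of the design is that every read and every write names its
encoding, so nothing is left to a platform default.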
>>>> HTH,
>>>>
>>>> Ben
>>>>
>>>>
>>>> On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
>>>>> Thanks Ben, that’s really interesting. It never occurred to me
>>>>> that these html files might be anything other than simple plain
>>>>> text files, as I’ve worked with in Coda, etc., for years.
>>>>> The local HTML files are storage of the HTML text pulled from the
>>>>> LiveCode browser widget, saved using the URL ‘file:’ option. I’d
>>>>> been working ‘live’ from the Browser widget’s html text until
>>>>> recently, when I’ve introduced these local files to split page
>>>>> ‘crawling’ and analysis activities without needing a database.
>>>>> Reading the files back into LiveCode with the URL ‘file:’ option
>>>>> works quite happily with no text anomalies when put into a field
>>>>> to read. The problem seems to arise when I load the HTML text into
>>>>> a variable and then start to extract elements using LiveCode's
>>>>> text chunking. For example pulling the text between the offsets of
>>>>> say <p> & </p> tags is when these character anomalies have started
>>>>> to pop into the strings.
>>>>> A quick test on reading in the local HTML files with the URL
>>>>> ‘binfile:’ option and then textDecode(tString, "UTF-8") seems to
>>>>> reduce the frequency and size of anomalies, but some remain. So,
>>>>> I’ll see if re-crawling pages and saving the HTML text from the
>>>>> browser widget as binfiles reduces this further.
>>>>> Thanks & regards,
>>>>> Keith
>>>>>> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode
>>>>>> <use-livecode at lists.runrev.com> wrote:
>>>>>>
>>>>>> Hi Keith,
>>>>>>
>>>>>> The thing with character encoding is that you always need to know
>>>>>> where it's coming from and where it's going.
>>>>>>
>>>>>> Do you know how the HTML documents were obtained? Saved from a
>>>>>> browser, fetched by curl, fetched by LiveCode? Or generated on
>>>>>> disk by something else?
>>>>>>
>>>>>> If it was saved from a browser or fetched by curl, then the
>>>>>> format is most likely to be UTF-8. In order to see it correctly
>>>>>> in LiveCode, you'd need to do two things:
>>>>>> - read it in as a binary file, rather than text (e.g. use URL
>>>>>> "binfile://..." or "open file ... for binary read")
>>>>>> - convert it to the internal text format FROM UTF-8 - which
>>>>>> means use textDecode(tString, "UTF-8"), rather than textEncode
>>>>>>
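
The two steps above, sketched in Python rather than LiveCode: read the
file as raw bytes, then decode those bytes from UTF-8. The last step shows
the same bytes decoded with a single-byte encoding instead. MacRoman is
used as an assumed example, and the file name and sample text are
hypothetical.

```python
import os
import tempfile

# Hypothetical sample text; the curly apostrophe takes three bytes in UTF-8.
original = "I\u2019d"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "page.html")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(original.encode("utf-8"))

    # Step 1: read the file as binary, so no conversion is applied.
    with open(path, "rb") as f:
        raw = f.read()

# Step 2: convert FROM UTF-8 into the language's own text type.
decoded = raw.decode("utf-8")

# For contrast: the same bytes interpreted as a single-byte encoding.
misread = raw.decode("mac_roman")

print(len(raw), decoded, misread)
```

Three characters of text occupy five bytes on disk here, and the misread
version turns the single apostrophe into three unrelated characters.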
>>>>>> If it was fetched by LiveCode, then it most likely arrived over
>>>>>> the wire as UTF-8, but if it was saved by LiveCode as text (not
>>>>>> binary) then it _may_ have got corrupted.
>>>>>>
>>>>>> If you can see the text looking as you expect in LiveCode, you've
>>>>>> solved half the problem. Then you need to consider where it's
>>>>>> going: who (or what) is going to consume the CSV. This is the time
>>>>>> to use textEncode, and then be sure to save it as a binary file.
>>>>>> If the consumer will be something reasonably modern, then again
>>>>>> UTF-8 is a good default. If it's something much older, you might
>>>>>> need to use "CP1252" or similar.
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
>>>>>>> Hi folks,
>>>>>>> I’m using LiveCode to summarise text from HTML documents into
>>>>>>> csv summary files and am noticing that when I extract strings
>>>>>>> from html documents stored on disk - rather than visiting the
>>>>>>> sites via the browser widget & grabbing the HTML text - weird
>>>>>>> characters are being inserted in place of what appear to be
>>>>>>> ‘regular’ characters.
>>>>>>> The number of characters inserted can run into the thousands per
>>>>>>> instance, making my csv ‘summary’ file run into gigabytes! Has
>>>>>>> anyone seen the following type of string before, happen to know
>>>>>>> what might be causing it and offer a fix?
>>>>>>> ‚Äö
>>>>>>>
>>>>>>> I’ve tried deliberately setting UTF-8 on the extracted strings,
>>>>>>> with put textEncode(tString, "UTF-8") into tString. Currently
>>>>>>> I’m not attempting to force any text format on the local HTML
>>>>>>> documents.
>>>>>>> Thanks & regards,
>>>>>>> Keith
>>>>>>> _______________________________________________
>>>>>>> use-livecode mailing list
>>>>>>> use-livecode at lists.runrev.com
>>>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>>>> subscription preferences:
>>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode