Spurious characters from html files - text encoding issues?
Ben Rubinstein
benr_mc at cogapp.com
Mon May 17 07:57:32 EDT 2021
Hi Keith,
The thing with character encoding is that you always need to know where it's
coming from and where it's going.
Do you know how the HTML documents were obtained? Saved from a browser,
fetched by curl, fetched by LiveCode? Or generated on disk by something else?
If it was saved from a browser or fetched by curl, then the format is most
likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to
do two things:
- read it in as a binary file, rather than text (e.g. use URL "binfile://..."
or "open file ... for binary read")
- convert it to the internal text format FROM UTF-8 - which means use
textDecode(tString, "UTF-8"), rather than textEncode
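The two steps above are not specific to LiveCode, so here is a minimal sketch of them in Python, where the raw bytes are easy to inspect. The file name and the helper read_html_utf8 are made up for illustration, and the sketch assumes the file really is UTF-8. In LiveCode the equivalent is the binary read followed by textDecode(tString, "UTF-8").

```python
import tempfile
from pathlib import Path


def read_html_utf8(path):
    """Read a file as raw bytes, then decode those bytes from UTF-8."""
    raw = Path(path).read_bytes()   # step 1: binary read, no text conversion
    return raw.decode("utf-8")      # step 2: UTF-8 bytes -> internal text


# Hypothetical sample file, written as UTF-8 bytes the way a browser or curl would.
expected = "<p>It\u2019s fine</p>"   # \u2019 is a right single quotation mark
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.html"
    sample.write_bytes(expected.encode("utf-8"))
    html = read_html_utf8(sample)

print(html == expected)
```

The point of the binary read is that nothing gets a chance to guess at the encoding before you decode it explicitly.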
If it was fetched by LiveCode, then it most likely arrived over the wire as
UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ have
got corrupted.
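As a hedged illustration of what that corruption can look like: the string in Keith's message is what you get if UTF-8 bytes are read as MacRoman text. That the legacy encoding involved is MacRoman is only a guess that fits the characters shown; the thread does not confirm it. The sketch below (Python, helper name misread_once is hypothetical) also shows why repeating the mistake makes the text grow on every pass.

```python
def misread_once(text, wrong_encoding="mac_roman"):
    """Encode text as UTF-8, then wrongly decode the bytes as a legacy encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


original = "\u2019"                  # a right single quotation mark
once = misread_once(original)        # three characters, the first is U+201A

# Misreading U+201A itself yields the three characters quoted in the report.
reported = misread_once("\u201a")

# Each further pass multiplies the damage, so the length keeps growing.
lengths = [len(original)]
current = original
for _ in range(4):
    current = misread_once(current)
    lengths.append(len(current))

print(once, reported, lengths)
```

If something like this happens once per save or per processing step, a single curly quote can turn into a very long run of junk, which would be consistent with a summary file ballooning in size.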
If you can see the text looking as you expect in LiveCode, you've solved half
the problem. Then you need to consider where it's going: who (or what) is going
to consume the CSV. This is the time to use textEncode, and then be sure to
save it as a binary file. If the consumer will be something reasonably modern,
then again UTF-8 is a good default. If it's something much older, you might
need to use "CP1252" or similar.
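The output side can be sketched the same way, again in Python only so that the bytes are visible. The helper write_csv_bytes, the file names and the sample row are all invented for the example. In LiveCode this corresponds to textEncode with the chosen encoding followed by a binary write.

```python
import tempfile
from pathlib import Path


def write_csv_bytes(path, text, encoding="utf-8"):
    """Encode text explicitly, then write the resulting bytes unchanged."""
    Path(path).write_bytes(text.encode(encoding))


# Hypothetical CSV content containing two non-ASCII characters.
row = "title,summary\r\nCaf\u00e9,It\u2019s fine\r\n"

with tempfile.TemporaryDirectory() as tmp:
    modern = Path(tmp) / "summary_utf8.csv"
    legacy = Path(tmp) / "summary_cp1252.csv"
    write_csv_bytes(modern, row)             # default for modern consumers
    write_csv_bytes(legacy, row, "cp1252")   # for an older Windows consumer
    utf8_bytes = modern.read_bytes()
    cp1252_bytes = legacy.read_bytes()

# The same text produces different bytes depending on the target encoding.
print(len(utf8_bytes), len(cp1252_bytes))
```

Either file is correct as long as the consumer reads it with the encoding it was written in; the trouble only starts when the two sides disagree.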
HTH,
Ben
On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
> Hi folks,
> I’m using LiveCode to summarise text from HTML documents into csv summary files and am noticing that when I extract strings from html documents stored on disk - rather than visiting the sites via the browser widget & grabbing the HTML text - weird characters are being inserted in place of what appear to be ‘regular’ characters.
>
> The number of characters inserted can run into the thousands per instance, making my csv ‘summary’ file run into gigabytes! Has anyone seen the following type of string before, happen to know what might be causing it and offer a fix?
> ‚Äö
>
> I’ve tried deliberately setting UTF-8 on the extracted strings, with put textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to force any text format on the local HTML documents.
>
> Thanks & regards,
> Keith