use-livecode Digest, Vol 249, Issue 24

Sat Jun 29 13:49:49 EDT 2024

Thanks, Mark, for the gory details, which I found fascinating. Unicode is even more complicated than I realized, and I thought I had a pretty good understanding of it.

Actually, I thought my Test 2 demonstrated that matchChunk performed very well on Unicode; I was not trying to show it was part of the problem.

As to my back of the envelope analysis, I realized after I hit the Send button that my sloppy code computed the end condition 
the number of lines of fff
in the inner loop as well as the outer, which makes the timing computation incorrect.

So, end of story: my original problem is resolved, and I have that nifty array trick for random access to lines of large text data, which is going to be invaluable, plus a tutorial on Unicode. All worth the embarrassment of exposing my ignorance in front of God and everyone (God in this case being Mark).
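
For anyone reading this later, here is a minimal sketch of the two points above: the line count is stored once rather than being recomputed in a loop condition, and the text is split into an array so each lookup is keyed instead of a scan. The handler name matchingLineNumbers and its parameters are hypothetical, and matchText stands in for the matchChunk call discussed in the thread; this is not the code from the original stack.

```livecode
-- Hypothetical handler: returns the numbers of the lines of pText
-- that match the regular expression pPattern, one number per line.
function matchingLineNumbers pText, pPattern
   local tLineCount, tLinesA, tFound
   -- store the count once so no loop condition has to recount lines
   put the number of lines of pText into tLineCount
   -- split gives an array keyed 1..N with one element per line
   put pText into tLinesA
   split tLinesA by return
   repeat with k = 1 to tLineCount
      -- tLinesA[k] is a keyed lookup, whereas "line k of pText" has to
      -- search the text from the start for the preceding delimiters
      if matchText(tLinesA[k], pPattern) then
         put k & return after tFound
      end if
   end repeat
   delete the last char of tFound
   return tFound
end matchingLineNumbers
```

Called as, say, put matchingLineNumbers(fff, "^abc") in the message box, where fff holds the text and the pattern is only an example.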

Neville Smythe

> On 30 Jun 2024, at 2:01 am, use-livecode-request at lists.runrev.com wrote:
> 
> Send use-livecode mailing list submissions to
>   use-livecode at lists.runrev.com
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>   http://lists.runrev.com/mailman/listinfo/use-livecode
> or, via email, send a message with subject or body 'help' to
>   use-livecode-request at lists.runrev.com
> 
> You can reach the person managing the list at
>   use-livecode-owner at lists.runrev.com
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of use-livecode digest..."
> 
> 
> you can find the archives for this list at:
> 
> http://lists.runrev.com/pipermail/use-livecode/
> 
> and search them using this link:
> 
> https://www.mail-archive.com/use-livecode@lists.runrev.com/
> 
> 
> Today's Topics:
> 
>  1. Re: Socket Packaging (Bob Sneidar)
>  2. url no longer working as expected (Hugh Senior)
>  3. Re: url no longer working as expected (Bob Sneidar)
>  4. Re: url no longer working as expected (Paul Dupuis)
>  5. Re: url no longer working as expected (Bob Sneidar)
>  6. Re: url no longer working as expected (Paul Dupuis)
>  7. Re: url no longer working as expected (Richard Gaskin)
>  8. Re: Slow stack problem (Neville Smythe)
>  9. Re: Slow stack problem (Mark Waddingham)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 28 Jun 2024 16:44:24 +0000
> From: Bob Sneidar <bobsneidar at iotecdigital.com>
> To: How to use LiveCode <use-livecode at lists.runrev.com>
> Subject: Re: Socket Packaging
> Message-ID: <2D093B01-C351-4AA7-B2F3-AE44CDC3503E at iotecdigital.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Added error checking. Also the payload can now be a string or an array. 
> 
> command packagePayload @pPayload, pUseEncryption
>  try   
>     if pPayload is an array then \
>           put arrayEncode(pPayload) into pPayload
> 
>     if pUseEncryption then \
>           put slyEncrypt(pPayload) into pPayload
> 
>     put base64Encode(pPayload) into pPayload
>  catch tError
>     return "ERROR:" && tError
>  end try
> end packagePayload
> 
> command unpackPayload @pPayload, pUseEncryption
>  try
>     put base64Decode(pPayload) into pPayload
> 
>     if pUseEncryption is true or pPayload begins with "salted" then \
>           put slyDecrypt(pPayload) into pPayload
> 
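>     -- after base64Decode pPayload is always a string, so "is an array"
>     -- can never be true here; attempt the decode instead (this assumes
>     -- arrayDecode throws on data that is not an encoded array)
>     try
>        put arrayDecode(pPayload) into tArray
>        put tArray into pPayload
>     catch tDecodeError
>        -- not an encoded array: leave pPayload as a string
>     end try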
>  catch tError
>     return "ERROR:" && tError
>  end try
> end unpackPayload
> 
> Bob S
> 
> 
>> On Jun 24, 2024, at 9:21 AM, Bob Sneidar <bobsneidar at iotecdigital.com> wrote:
>> Hi all.
>> I came up with deceptively simple wrappers for packaging data for transmission over raw sockets. I can't send the slyEncrypt and slyDecrypt handlers because I use methods no one else knows. But you can roll your own or else eliminate encryption altogether.
>> And to answer the question before it's asked, I don't use SSL because I don't like having to deal with certificates, and also because I use a method for encryption that I don't think anyone else has thought of, or at least I can't find any info online.
>> Bob S
>> command packagePayload @pPayload, pUseEncryption
>> if pPayload is an array then \
>>       put arrayEncode(pPayload) into pPayload
>> if pUseEncryption then \
>>       put slyEncrypt(pPayload) into pPayload
>> put base64Encode(pPayload) into pPayload
>> end packagePayload
>> command unpackPayload @pPayload
>> put base64Decode(pPayload) into pPayload
>> if pPayload begins with "salted" then \
>>       put slyDecrypt(pPayload) into pPayload
>> try
>>    put arrayDecode(pPayload) into tResult
>>    put tResult into pPayload
>> catch tError
>>    -- not an array
>> end try
>> end unpackPayload
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 28 Jun 2024 18:04:09 +0100
> From: "Hugh Senior" <admin at flexiblelearning.com>
> To: <use-livecode at lists.runrev.com>
> Subject: url no longer working as expected
> Message-ID: <000001dac97d$3292afe0$97b80fa0$@flexiblelearning.com>
> Content-Type: text/plain;    charset="us-ascii"
> 
> 
> Platform: Windows 11, LC 9.6.12
> Query: Using URL to access a web page
> 
> Problem:
> Enter "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into any web
> browser and the page is displayed as expected.
> 
> Use LC's URL command to access the same page direct returns a 404
> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
> 
> Anyone got any insights?
> 
> Hugh Senior
> 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 28 Jun 2024 17:07:55 +0000
> From: Bob Sneidar <bobsneidar at iotecdigital.com>
> To: How to use LiveCode <use-livecode at lists.runrev.com>
> Subject: Re: url no longer working as expected
> Message-ID: <7CFEFE88-133D-48B6-9E28-3728C5A7AB37 at iotecdigital.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> I get the HTML of the page. Are you trying to open the page in a browser? 
> 
> Bob S
> 
> 
>> On Jun 28, 2024, at 10:04 AM, Hugh Senior via use-livecode <use-livecode at lists.runrev.com> wrote:
>> Platform: Windows 11, LC 9.6.12
>> Query: Using URL to access a web page
>> Problem:
>> Enter "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into any web
>> browser and the page is displayed as expected.
>> Use LC's URL command to access the same page direct returns a 404
>> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
>> Anyone got any insights?
>> Hugh Senior
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
> 
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Fri, 28 Jun 2024 13:50:35 -0400
> From: Paul Dupuis <paul at researchware.com>
> To: use-livecode at lists.runrev.com
> Subject: Re: url no longer working as expected
> Message-ID: <499fc022-eafb-49dd-96df-1dcd18afb8fb at researchware.com>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
> I get a response from Yahoo that is an HTML page with 404 information 
> as part of it.
> 
> This happens under LC 9.6.12 and 9.6.11
> 
> I think this is Yahoo Finance not being able to detect the browser type 
> and intentionally returning a 404 as a method of deterring screen scraping.
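
One way to probe this hypothesis is to send a browser-like User-Agent header with the request, via the httpHeaders property. This is only a sketch: the header value below is an arbitrary example string, and there is no guarantee the server's check depends on the User-Agent alone, so the 404 page may still come back.

```livecode
-- Hypothetical test handler: fetch the page with a browser-like
-- User-Agent. The header value is an example, not a known-good one.
on mouseUp
   local tOldHeaders, tPage, tResult
   put the httpHeaders into tOldHeaders
   set the httpHeaders to "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
   put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into tPage
   -- capture any error report before another command changes the result
   put the result into tResult
   -- restore the previous headers so later requests are unaffected
   set the httpHeaders to tOldHeaders
   put tResult & return & char 1 to 500 of tPage
end mouseUp
```

Comparing the first few hundred characters with and without the header should show whether the server's response changes.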
> 
> 
>> On 6/28/2024 1:04 PM, Hugh Senior via use-livecode wrote:
>> Platform: Windows 11, LC 9.6.12
>> Query: Using URL to access a web page
>> Problem:
>> Enter "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into any web
>> browser and the page is displayed as expected.
>> Use LC's URL command to access the same page direct returns a 404
>> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
>> Anyone got any insights?
>> Hugh Senior
> 
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Fri, 28 Jun 2024 18:03:28 +0000
> From: Bob Sneidar <bobsneidar at iotecdigital.com>
> To: How to use LiveCode <use-livecode at lists.runrev.com>
> Subject: Re: url no longer working as expected
> Message-ID: <F7E08D99-E0F9-43DE-A333-95F60571FB36 at iotecdigital.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> Did you try that in the message box? 
> 
> Bob S
> 
> 
>> On Jun 28, 2024, at 10:50 AM, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
>> I get a response from Yahoo that is an HTML page with 404 information as part of it.
>> This happens under LC 9.6.12 and 9.6.11
>> I think this is Yahoo Finance not being able to detect the browser type and intentionally returning a 404 as a method of deterring screen scraping.
>>> On 6/28/2024 1:04 PM, Hugh Senior via use-livecode wrote:
>>> Platform: Windows 11, LC 9.6.12
>>> Query: Using URL to access a web page
>>> Problem:
>>> Enter "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into any web
>>> browser and the page is displayed as expected.
>>> Use LC's URL command to access the same page direct returns a 404
>>> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
>>> Anyone got any insights?
>>> Hugh Senior
> 
> 
> 
> 
> ------------------------------
> 
> Message: 6
> Date: Fri, 28 Jun 2024 14:23:35 -0400
> From: Paul Dupuis <paul at researchware.com>
> To: use-livecode at lists.runrev.com
> Subject: Re: url no longer working as expected
> Message-ID: <30ffa56e-1cc0-4f83-9590-e1525a14e613 at researchware.com>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
> Yes.
> 
> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
> 
> In the message box on 9.6.11 and 9.6.12 under Windows 11. Both return a 
> pile of HTML text that is all the formatting and CSS linked stuff to 
> show a "404" page.
> 
> This suggests that put URL is working and it is the Yahoo server that is 
> returning a different page of HTML/CSS for the put vs when you enter the 
> URL in a browser (Firefox in my case, where I get the Yahoo finance data 
> for Shell, although I did have to respond to a Cookies dialog first).
> 
> 
>> On 6/28/2024 2:03 PM, Bob Sneidar via use-livecode wrote:
>> Did you try that in the message box?
>> Bob S
>>>> On Jun 28, 2024, at 10:50 AM, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
>>> I get a response from Yahoo that is an HTML page with 404 information as part of it.
>>> This happens under LC 9.6.12 and 9.6.11
>>> I think this is Yahoo Finance not being able to detect the browser type and intentionally returning a 404 as a method of deterring screen scraping.
>>> On 6/28/2024 1:04 PM, Hugh Senior via use-livecode wrote:
>>>> Platform: Windows 11, LC 9.6.12
>>>> Query: Using URL to access a web page
>>>> Problem:
>>>> Enter "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into any web
>>>> browser and the page is displayed as expected.
>>>> Use LC's URL command to access the same page direct returns a 404
>>>> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
>>>> Anyone got any insights?
>>>> Hugh Senior
> 
> 
> 
> 
> ------------------------------
> 
> Message: 7
> Date: Fri, 28 Jun 2024 18:59:34 +0000
> From: "Richard Gaskin" <ambassador at fourthworld.com>
> To: use-livecode at lists.runrev.com
> Subject: Re: url no longer working as expected
> Message-ID: <c28ce0b853e9fb2de1711c302105cb6dba567317 at fourthworld.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Likely the case. The expense of collating all that data and presenting it to their site visitors is considerable.  They use advertising to cover those costs.  If the data were easily scrapable, scrapers diminish revenue, putting the resource itself at risk.
> 
> Some data providers offer APIs.  When you see anti-scraping effects, look for API options (I saw none there but I didn't look deeply).  APIs take fewer resources to deliver, and may have strategic benefit for some data brokers.
> 
> But if they have scrape-prevention and no API, they're sending a clear signal: "We need to pay our bills, please send your traffic to our page so we can do that."
> 
> That said, I've come across stock APIs before, and while I don't recall many free ones there likely are some.
> 
> Richard Gaskin
> FourthWorld.com
> 
> 
> 
> Paul Dupuis wrote:
> 
>> I get a response from Yahoo that is an HTML page with 404
>> information as part of it.
> ...
>> I think this is Yahoo Finance not being able to detect the
>> browser type and intentionally returning a 404 as a method
>> of deterring screen scraping.
> On 6/28/2024 1:04 PM, Hugh Senior via use-livecode wrote:
> ...
>>> Problem:
>>> Enter "https://uk.finance.yahoo.com/quote/SHEL.L/history/" into
>>> any web browser and the page is displayed as expected.
>>> Use LC's URL command to access the same page direct returns a
>>> 404
>>> put url "https://uk.finance.yahoo.com/quote/SHEL.L/history/"
>>> Anyone got any insights?
> 
> 
> 
> ------------------------------
> 
> Message: 8
> Date: Sat, 29 Jun 2024 17:53:58 +1000
> From: Neville Smythe <neville.smythe at optusnet.com.au>
> To: How to use LiveCode <use-livecode at lists.runrev.com>
> Subject: Re: Slow stack problem
> Message-ID: <76E596CE-7A40-4060-8A40-A9BD9DC0FFA8 at optusnet.com.au>
> Content-Type: text/plain;    charset=utf-8
> 
> Thanks Mark for pointing me (as ever) in the right direction, away from the red herring but to the hidden inner loop entailed in evaluating
> 
> line k of fff
> 
> A bit embarrassing because I was party to the discussion some time ago about the slowness of lineOffset when working with unicode text.
> 
> And thanks for the neat array trick which I wouldn't have thought of.
> 
> There is indeed just one line in the text which contains a single Unicode character. Without that character evidently the variable fff would internally be an ascii (or native?) string, but otherwise is entirely utf16 and we hit the LineOffset problem.
> 
> But I am still bemused. 
> 
> Is it not the case that the processing time for looping over the number of lines and getting the k-th line in each iteration, for some arbitrary k, is going to be
> 
> Order(N^2) * C * T
> 
> where
> 
> N = the number of lines in the text
> 
> C = the number of codepoints in each line 
> 
> T =  the (average) time for processing each codepoint to check for a return character
> 
> Now N and C are the same whether the text is ascii or unicode
> 
> Test 1
> If I get rid of the red herring by replacing the matchChunk call with a simple
> 
> put line k of fff into x; put true into found - as if matchChunk always finds a match on the first line tested, so that I am timing just the getting of line endings:
> 
> The time for processing plain ascii is  0.008 seconds
> The time for processing unicode is      0.84 seconds
> 
> Which would appear to mean that processing unicode codepoints for the apparently simple task of checking for return takes 100 times as much time as processing ascii. That seems excessive, to put it mildly!
> 
> Test 2
> With the original code using matchChunk, which of course is going to have its own internal loop on code points so multiply by another 8 (it only searches the first few characters)
> and will not always return true so more lines must be processed - but again these are the same multipliers whether ascii or unicode
> 
> Plain ascii takes   0.07 seconds
> Unicode takes    19.9 seconds, a multiplier of nearly 300. I can easily believe matchChunk takes 3 times as long to process unicode as ascii; this is the sort of factor I would have expected in Test 1.
> 
> OK Mark, hit me again, I am a glutton for punishment, what is wrong with this analysis?
> 
> Neville Smythe
> 
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 9
> Date: Sat, 29 Jun 2024 10:27:19 +0100
> From: Mark Waddingham <mark at livecode.com>
> To: How to use LiveCode <use-livecode at lists.runrev.com>
> Subject: Re: Slow stack problem
> Message-ID: <0aa021b2188068c191e11ba49db857cc at livecode.com>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
>> On 2024-06-29 08:53, Neville Smythe via use-livecode wrote:
>> Is it not the case that the processing time for looping over the number
>> of lines and getting the k-th line in each iteration, for some
>> arbitrary k, is going to be
>> Order(N^2) * C * T
>> where
>> N = the number of lines in the text
>> C = the number of codepoints in each line
>> T =  the (average) time for processing each codepoint to check for a
>> return character
>> Now N and C are the same whether the text is ascii or unicode
> 
> Largely - yes - although for stuff like this you need to think in terms 
> of bytes not codepoints (as memory throughput becomes 'a thing' when the 
> strings are anything longer than a few characters) - so unicode is 
> 2*ascii in this regard
> 
> [ It's actually more than 2x for longer strings but how much more depends 
> on CPU/memory architecture - CPUs can only read from their level 1 
> cache, and there's a cost to a cache miss, and you get 2x as many cache 
> misses with unicode data as native data, assuming the data is larger 
> than a single level 1 cache line. ]
> 
>> Test 1
>> If I get rid of the red herring by replacing the matchChunk call with a
>> simple
>> ...
>> Which would appear to mean that processing unicode codepoints for the
>> apparently simple task of checking for return takes 100 times as much
>> time as processing ascii. That seems excessive, to put it mildly!
> 
> It's a lot slower certainly, but then searching unicode text for a string 
> is (in the general case) a lot more complex than searching native/ascii 
> text for a string.
> 
>> Test 2
>> With the original code using matchChunk, which of course is going to
>> have its own internal loop on code points so multiply by another 8 (it
>> only searches the first few characters)
>> and will not always return true so more lines must be processed - but
>> again these are the same multipliers whether ascii or unicode
>> ...
>> Plain ascii takes   0.07 seconds
>> Unicode takes    19.9 seconds, a multiplier of nearly 300. I can
>> easily believe matchChunk takes 3 times as long to process unicode as
>> ascii, this is the sort of factor I would have expected in Test 1.
> 
> So 'Test 2' is slightly misleading - as it still suggests matchChunk is 
> causing a slowdown - which it isn't.
> 
> The difference here is Test 2 is doing more work as it isn't always 
> exiting. If you test:
> 
>  get line k of fff
>  put true into tFound
> 
> I suspect you'll find the time to process your data if it contains 
> unicode is pretty similar to that when matchChunk is also called.
> 
> In my quick test (which is 32 index lines, 200 fff lines) I get about 
> 10ms (no unicode) vs 1400ms (unicode)
> 
>> OK Mark, hit me again, I am a glutton for punishment, what is wrong
>> with this analysis?
> 
> Nothing in particular - apart from thinking that matchChunk is actually 
> a relevant factor here ;)
> 
> There are two reasons this delimiter search operation on unicode strings 
> is so much slower than on native strings:
>  1) We (well, I) heavily optimized the core native/ascii string 
> operations in 2015 to make sure they were as fast as possible
>  2) Searching a unicode string for another string (which is what is 
> going on here) is much more complex than doing the same for native/ascii
> 
> Native/ascii strings have some very pleasant properties:
>  - one byte (codeunit) is one character - always.
>  - each character has only one representation - its byte value
>  - casing is a simple mapping between lower and upper case characters - 
> and only about 25% of characters are affected
> 
> Unicode has none of these properties
>  - a unicode character (grapheme) can be arbitrarily many codeunits (2 
> byte quantities) long
>  - characters can have multiple representations - e.g. e-acute vs 
> e,combining-acute
>  - casing is not (in general) a simple mapping of one codeunit to 
> another
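
The first two points can be seen from script. The sketch below builds both spellings of e-acute and compares their counts; the variable names are illustrative. My expectation is that each string reports one char (grapheme), that the decomposed form reports two codepoints, and that the two compare equal once normalized to NFC, but treat that as an assumption to check rather than a documented result.

```livecode
-- Sketch: two representations of the same character
on mouseUp
   local tComposed, tDecomposed, tReport
   put numToCodepoint(0xE9) into tComposed           -- U+00E9, e-acute
   put "e" & numToCodepoint(0x301) into tDecomposed  -- e + combining acute
   put "composed chars:" && (the number of chars in tComposed) & return into tReport
   put "composed codepoints:" && (the number of codepoints in tComposed) & return after tReport
   put "decomposed chars:" && (the number of chars in tDecomposed) & return after tReport
   put "decomposed codepoints:" && (the number of codepoints in tDecomposed) & return after tReport
   -- normalizing both to the same form makes them directly comparable
   put "equal after NFC:" && (normalizeText(tDecomposed, "NFC") is normalizeText(tComposed, "NFC")) after tReport
   put tReport
end mouseUp
```

A general-purpose search has to allow for both spellings, which is part of the extra work described above even when the target is a plain LF.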
> 
> Currently the unicode operations in the engine are largely unoptimized - 
> they assume the general case in all things so even searching a string 
> for LF (which is the case here) is still done under the assumption that 
> it might need that (very hefty) extra processing.
> 
> Of course it would be nice to have highly optimized low-level unicode 
> string operations but you have to pick your battles (particularly when 
> the battles are incredibly technical ones!) but the reality is that this 
> (admittedly large!) speed difference is only really noticeable 'at 
> scale' and when scale is involved, there's pretty much always an 
> algorithmic change which can make those 'low-level' performance 
> differences irrelevant.
> 
> The case here is a really good example.
> 
> The line X based code gives (no matchChunk / with matchChunk):
> 
>  ASCII 300 lines - 13ms / 22ms
>  ASCII 3000 lines - 986ms / 1104ms
>  ASCII 10000 lines - 10804ms / 11213ms
> 
> The array based code gives (no matchChunk / with matchChunk):
> 
>  ASCII 300 lines - 2ms / 11ms
>  ASCII 3000 lines - 19ms / 101ms
>  ASCII 10000 lines - 69ms / 336ms
> 
>  UNICODE 300 lines - 7ms / 12ms
>  UNICODE 3000 lines - 52ms / 108ms
>  UNICODE 10000 lines - 170ms / 359ms
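
A rough way to reproduce this kind of comparison is sketched below. The handler names and the generated test data are hypothetical, not the harness used for the figures above, and the timings will differ by machine and engine version. It relies on the observation earlier in the thread that a single non-native character is enough to make the whole string unicode internally.

```livecode
-- Hypothetical helper: builds pLineCount short lines, optionally
-- appending one non-native character (U+4E2D) to force unicode.
function makeTestText pLineCount, pWithUnicode
   local tText
   repeat with i = 1 to pLineCount
      put "line" && i & return after tText
   end repeat
   if pWithUnicode then put numToCodepoint(0x4E2D) after tText
   return tText
end makeTestText

-- Hypothetical command: times chunk access against array access.
command timeLineAccess pLineCount, pWithUnicode
   local tText, tLinesA, tStart, tLineTime, tArrayTime, tLine
   put makeTestText(pLineCount, pWithUnicode) into tText

   -- chunk access: each "line k of" searches from the start of the text
   put the milliseconds into tStart
   repeat with k = 1 to pLineCount
      put line k of tText into tLine
   end repeat
   put the milliseconds - tStart into tLineTime

   -- array access: split once, then look each line up by key
   put the milliseconds into tStart
   put tText into tLinesA
   split tLinesA by return
   repeat with k = 1 to pLineCount
      put tLinesA[k] into tLine
   end repeat
   put the milliseconds - tStart into tArrayTime

   put "line k of:" && tLineTime && "ms, array:" && tArrayTime && "ms"
end timeLineAccess
```

From the message box, timeLineAccess 3000, false and timeLineAccess 3000, true give the native and unicode cases respectively. The array timing includes the cost of the split itself.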
> 
> Warmest Regards,
> 
> Mark.
> 
> -- 
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Build Amazing Things
> 
> 
> 
> ------------------------------
> 
> Subject: Digest Footer
> 
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> http://lists.runrev.com/mailman/listinfo/use-livecode
> 
> 
> ------------------------------
> 
> End of use-livecode Digest, Vol 249, Issue 24
> *********************************************