OMG text processing performance 6.7 - 9.5

Thu Jan 30 10:03:05 EST 2020

On 2020-01-30 14:38, Ben Rubinstein via use-livecode wrote:
> Hi Mark,
> 
> Thanks for taking the time to reply!
> 
> I'm indeed currently in the process of seeing whether I can persuade
> the client's IT department to install the 32-bit drivers on the new
> VM. I'm optimistic that will buy me some time, but it won't be a
> complete solution because they outsource support to a third company,
> which has warned that it doesn't intend to support the 32-bit drivers
> much longer (apparently they're just waiting for Crystal Reports to be
> updated!).

Ah! From that I'm guessing you are using the ODBC revDB driver - which
needs a third-party ODBC connector :)

> And if that fails, one of my options is as you suggest to use the LC
> 9.5-built app to retrieve the data through the 64-bit drivers, and the
> LC 6.7-built app to process and (probably) build it. It will be
> shonky.

It doesn't have to be 'shonky' - if the fetch-from-database part is
already separated from the data-processing part through cache files
(i.e. fetch writes to files on disk, data-process reads said files and
processes them) then you could build a 64-bit Windows standalone for the
fetch-from-database part, which is then called by the data-process part
using shell (or open process).

Of course, it would be slightly cleaner for it all to be one app :)
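
As a rough sketch of the split version (untested against your setup): the
handler name, both parameters and processData are made up for
illustration, and it assumes paths without spaces so that quoting for the
Windows command line can be ignored.

```livecode
-- Hypothetical handler in the 6.7-built processing app.
-- pFetcherPath: path to the 64-bit fetcher standalone (assumed name).
-- pCacheFile: file the fetcher is assumed to write its results to.
command refreshAndProcess pFetcherPath, pCacheFile
   local tOutput, tData

   -- Remove any stale cache so we can tell whether the fetch worked.
   if there is a file pCacheFile then delete file pCacheFile

   -- shell() blocks until the fetcher exits; tOutput holds whatever
   -- output (if any) came back from the command.
   set the hideConsoleWindows to true
   put shell(pFetcherPath && pCacheFile) into tOutput

   if there is no file pCacheFile then
      answer "Fetch failed:" && tOutput
      exit refreshAndProcess
   end if

   -- Read the cache file as text and carry on as before.
   put url ("file:" & pCacheFile) into tData
   processData tData -- hypothetical existing handler
end refreshAndProcess
```

If you'd rather not block while the fetch runs, open process would
probably be the better fit.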

> However, what you say certainly makes me feel more optimistic that
> something should be possible. There's really very little going on in
> the way of binary<->text conversion; there probably is a fair amount
> of word chunking - although half the work is about tracing
> cross-references etc, there's also a fair amount of processing of
> 'prose' and prose-like text. However, the nature of the text is that
> although 99% of it is probably ASCII, in any given table of text there
> will be just a few 'extended' characters - does that mean it all gets
> treated as four-byte data?

Binary<->text can be quite subtle - as it isn't something you had to
think about before 6.7. For example, if you are fetching using *b via
revDB from the database, then *that* will now be binary data - not text.
(Indeed, what accessors are you using to get the data?)

Things like binfile and reading for binary (from a file) will also
produce binary rather than text.

You can test for binary data using 'is strictly a binary string'.
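
For instance, something along these lines could sit between your
file/database reads and the text processing in the 9.5-built app. It's
only a sketch: ensureText is a made-up name, and the UTF-8 encoding is an
assumption - use whatever encoding the data is really stored in.

```livecode
-- Hypothetical helper: return pValue as text, decoding it first if it
-- arrived as binary data (e.g. from binfile: or a binary read).
function ensureText pValue
   if pValue is strictly a binary string then
      -- "UTF-8" is an assumption about how the data was encoded.
      return textDecode(pValue, "UTF-8")
   end if
   -- Already text (native or unicode), so pass it through untouched.
   return pValue
end ensureText
```

You'd then wrap the reads, e.g.
put ensureText(url ("binfile:" & tPath)) into tText
where tPath is whatever file you are loading.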

Native encoding means (on Windows at least) anything which fits into
Latin-1, so any text you are getting out of revDB from the database
should come through as native strings.

Native strings get converted to Unicode internally when they are
combined with a string which contains Unicode, and in two other places:
   1) Using matchText / replaceText (because we use the UTF-16 variant
      of PCRE)
   2) When put into a field (because all text layout APIs on all
      platforms use UTF-16)

What sort of text operations are you using for 'tracing cross-references
etc' and 'processing of 'prose' and prose-like text'?

> I'll see how the negotiations with IT get on...

Good luck!

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps