New chunks

Paul Dupuis paul at researchware.com
Wed Mar 12 21:32:07 EDT 2014


Actually, no, it is not based on simple period delimiters. It uses a
code library with built-in rules for sentence structure in most
languages, so it can recognize real sentence endings. It handles
numbers, abbreviations, etc. correctly.

Could someone construct some sequence of characters that could be
called a sentence and still gets mis-parsed? Possibly - I know the
library RunRev is using only by reputation, so I can't say for sure.
However, for most text you will work with, where you want to return
"sentence 2 of paragraph 5 of fld X", you will get exactly what you expect.
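
To see why simple period splitting falls short, here is a small sketch.
It is written in Python purely for illustration; it is not LiveCode and
it is not the library the engine uses. The function names and the
placeholder character are made up for the example. It compares a plain
split on periods with the placeholder trick Bob describes below.

```python
import re

# Hypothetical placeholder: any character that cannot occur in the text.
PLACEHOLDER = "\x00"


def naive_sentences(text):
    """Split on every period (the behaviour Bob guessed at)."""
    # Note: the periods themselves are dropped by the split.
    return [part.strip() for part in text.split(".") if part.strip()]


def protected_sentences(text):
    """Hide periods that sit between two digits, split, then restore them."""
    hidden = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, text)
    parts = [part.strip() for part in hidden.split(".") if part.strip()]
    return [part.replace(PLACEHOLDER, ".") for part in parts]


sample = "The sample weighed 3.5 grams. Dr. Smith recorded it."

print(naive_sentences(sample))
# ['The sample weighed 3', '5 grams', 'Dr', 'Smith recorded it']

print(protected_sentences(sample))
# ['The sample weighed 3.5 grams', 'Dr', 'Smith recorded it']
```

The placeholder pass rescues the decimal but still breaks after "Dr.",
which is the kind of case a rule-based sentence breaker is meant to
handle.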

On 3/12/2014 7:58 PM, Bob Sneidar wrote:
> Pretty sure LiveCode is going to split on the period as a simple delimiter. You would have to prep the data first by replacing the period in any number with a placeholder, processing your sentences, then restoring the periods (if you need to).
>
> You could get fancy by setting the lineDelimiter to space, then finding every line that ends in a period and processing everything in-between. It’s doubtful a number would end in a period without it being the end of a sentence. 
>
> Bob
>
>
> On Mar 11, 2014, at 15:34 , Jim Hurley <jhurley0305 at sbcglobal.net> wrote:
>
>> Can someone explain how the “sentence” chunk would work?
>> How are decimal points and points in an abbreviation distinguished from the “period” that delineates the end of a “sentence”?
>> Does it presume that the existing text has special embedded “periods”?
>>
>> I’ve written my own, but it is very cumbersome and not flawless. I use it to do manuscript analysis.
>> Like: Find all sentences in which “time” and “party” occur anywhere in the same sentence.
>>
>> My ignorance of Unicode is profound.
>> Jim
>>
>>> Message: 15
>>> Date: Tue, 11 Mar 2014 18:15:18 +0000
>>> From: Benjamin Beaumont <ben at runrev.com>
>>> To: LiveCode Developer List <livecode-dev at lists.runrev.com>,
>>> 	How to use LiveCode <use-livecode at lists.runrev.com>
>>> Subject: New chunks
>>> Message-ID:
>>> 	<CADd0_Txbhdem4PbKXifXUsujqPLs9HROME6vKhF=Sk1zNp29cQ at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> Hi All,
>>>
>>> We're in the process of adding some new chunk types in LiveCode 7 and we
>>> would appreciate suggestions for a particular chunk name.
>>>
>>> The new chunk types are:
>>>
>>> naturalword (breaks on unicode word boundaries)
>>> sentence (breaks on unicode sentence boundaries)
>>> paragraph (Same behaviour as current 'line' chunk)
>>>
>>> The first chunk is called 'naturalword' because 'word' is already in use.
>>> Renaming the current 'word' chunk to 'token' to free up 'word' is not an
>>> option for backward compatibility. We are also limited by the current
>>> parser which doesn't allow us to use the form:
>>>
>>> put natural word 1 of "this is a string of words"
>>>
>>> 'naturalword' is the clearest internal suggestion at the moment and we'd
>>> love to get the input from community members if there is an even clearer
>>> option.
>>>
>>> Warm regards and thank you for your input.
>>>
>>> Ben
>>>
>>> _____
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>