Looking for parser for Email (MIME)
Roland Huettmann
roland.huettmann at gmail.com
Tue Mar 22 10:36:39 EDT 2016
Hello Mark,
Thank you for the explanation. It is very nice.
--- MBOX file format
Yes, as you also suggest, I am already reading the MBOX file in chunks which
are separated by the string CR & "From ", as defined for that file format.
So, it goes "read from file <filename> at <position> until <string>".
The only drawback is that at the end of the file there is no such string,
so the last message needs another way of reading, but that is possible.
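
A minimal sketch of this chunked read, including the last message that has no
trailing delimiter. The handler names splitMbox and handleMessage are
hypothetical, and the sketch assumes the file uses LF line endings. It relies
on "read ... until <string>" leaving the delimiter at the end of "it", and on
"the result" being "eof" once the end of the file is reached.

```livecode
-- Sketch only: splitMbox and handleMessage are hypothetical names.
-- Assumes LF line endings, so return & "From " starts each new message.
on splitMbox pPath
   local tDelim, tChunk, tReadResult, tIsFirst
   put return & "From " into tDelim
   put true into tIsFirst
   open file pPath for binary read
   if the result is not empty then
      answer "Cannot open file:" && the result
      exit splitMbox
   end if
   repeat forever
      read from file pPath until tDelim
      put the result into tReadResult   -- "eof" at the end of the file
      put it into tChunk
      -- The delimiter is part of the data read; drop its "From " part
      -- (5 chars) but keep the return that ends the previous message.
      if tChunk ends with tDelim then
         delete char -5 to -1 of tChunk
      end if
      if tChunk is not empty then
         -- Every chunk after the first lost its leading "From " to the
         -- previous read, so put it back.
         if not tIsFirst then put "From " before tChunk
         handleMessage tChunk
      end if
      put false into tIsFirst
      -- The final chunk arrives together with "eof", so exit only after
      -- it has been handled.
      if tReadResult is not empty then exit repeat
   end repeat
   close file pPath
end splitMbox

on handleMessage pMessage
   -- Placeholder: parse or store the message here.
   put line 1 of pMessage & return after msg
end handleMessage
```

Only one message is held in memory at a time, so the size of the file itself
does not matter, only the size of the largest single message.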
Another way, as you suggest, is reading line by line and checking each line
for that string to separate messages. I just do not know yet which will be
more efficient in terms of speed. I will be testing.
--- Checking available physical memory (in RAM, not on disk)
Also a good way would be to check the amount of available *physical*
memory. This way one could limit the chunks read into memory, and processing
would be pretty straightforward and fast when also knowing the limitations of
the OS (32-bit, 64-bit, available RAM, etc., all as you suggested).
Is there a function to find the available physical memory in LiveCode? I
could not find one yet.
--- Reading backwards in a file
--Well, reading backwards in that way is equivalent to knowing how long the
file is:
-- read ... at -1000 until EOF
-- is the same as
-- read ... at (fileSize - 1000) until EOF
With reading backwards I meant starting from EOF or any position and having
the pointer going backward char by char to whatever other previous
position. Syntax could be: "read from file <filename> at <position> down to
<position>". But I am not sure if there are many use cases for this.
--- Storing large number of messages
You are right about storing the retrieved messages in a database. It is the
best way. That is what I was preparing to do, as it is obviously the only
solution which makes sense for such large amounts of data. Only then does it
allow all kinds of post-processing in an easier way. I will be using
both: SQLite first, and later a remote database system.
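
A minimal sketch of the SQLite side, using the revOpenDatabase and
revExecuteSQL calls from LiveCode's database library. The handler names
openMailDb and storeMessage, the table name and the single "raw" column are
assumptions for illustration only.

```livecode
-- Sketch only: openMailDb and storeMessage are hypothetical names,
-- and the table layout is an assumption.
function openMailDb pDbPath
   local tConn
   put revOpenDatabase("sqlite", pDbPath, , , , ) into tConn
   -- On failure tConn holds an error string instead of a connection id.
   if tConn is not an integer then return empty
   revExecuteSQL tConn, "CREATE TABLE IF NOT EXISTS messages (id INTEGER PRIMARY KEY, raw TEXT)"
   return tConn
end openMailDb

on storeMessage pConn, pMessage
   -- :1 is bound to the variable whose NAME is passed as the next parameter.
   revExecuteSQL pConn, "INSERT INTO messages (raw) VALUES (:1)", "pMessage"
   if the result is not an integer then
      answer "Insert failed:" && the result
   end if
end storeMessage
```

Opening the connection once and reusing it for every message avoids paying
the open and close cost per insert. Wrapping batches of inserts in a
transaction would probably speed things up further.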
--- The detailed files
I was not aware of "the detailed files" function. Something new I
learned. Again, thank you. I checked the dictionary. It could be much more
explicit about this function. Searching for "detailed" only finds the keyword
"detailed". Searching for "detailed files" finds nothing.
But I found something in the forums with a good explanation. Maybe it is
worth writing an enhancement request to document this function in the
dictionary of LiveCode.
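
For reference, a minimal sketch of getting a file's size this way. The
function name fileSizeOf is hypothetical. "the detailed files" lists the
files in the defaultFolder, one per line, as comma-separated items: item 1 is
the URL-encoded file name and item 2 is the size in bytes.

```livecode
-- Sketch only: fileSizeOf is a hypothetical helper name.
-- Returns the size in bytes of pFileName inside pFolder, or empty if not found.
function fileSizeOf pFolder, pFileName
   local tOldFolder, tListing
   put the defaultFolder into tOldFolder
   set the defaultFolder to pFolder
   put the detailed files into tListing
   set the defaultFolder to tOldFolder   -- restore the previous folder
   repeat for each line tLine in tListing
      -- Names are URL-encoded in the listing, so decode before comparing.
      if urlDecode(item 1 of tLine) is pFileName then
         return item 2 of tLine
      end if
   end repeat
   return empty
end fileSizeOf
```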
Cheers to all :), Roland
On 22 March 2016 at 14:16, Mark Waddingham <mark at livecode.com> wrote:
> On 2016-03-22 12:45, Roland Huettmann wrote:
>
>> How to know how much we can read into memory? Is there any function to know
>> this? Is there a size limit for variables?
>>
>
> LiveCode has a limit of 2Gb characters for strings but that depends on how
> much memory a single process can have on your system.
>
> On 32-bit systems, you're generally limited to 768Mb-1Gb contiguous block
> of memory (32-bit Windows has an address space of 3Gb for a user process
> which also has to include all mapped resources such as executables and
> shared libraries; Mac has a user process address space of 4Gb which also
> has to include all mapped resources so you can generally get up to around
> 1.5Gb contiguous allocated memory block).
>
> On 64-bit systems you should be able to have many 2Gb strings (or similar
> in LiveCode), although obviously how fast they will operate will depend on
> the amount of physical RAM in the machine (with disk-paged virtual memory
> taking up the slack).
>
>> It is not possible to read backwards - which could be a nice way of reading a
>> file in some special cases. So "read from file fName at eof until -1000"
>> does not work.
>>
>
> Well, reading backwards in that way is equivalent to knowing how long the
> file is:
>
> read ... at -1000 until EOF
>
> is the same as
>
> read ... at (fileSize - 1000) until EOF
>
>> So, the only way of reading a very large file is reading a chunk of data of n
>> bytes (whatever is allowed in memory), processing this, and then reading
>> the next chunk until the remaining part of the file is small enough to be
>> read until eof.
>>
>
> For such a large file (38gb) your only solution is to read and parse it in
> chunks. MBOX files are a sequence of records, so you need to use a process
> which reads in blocks from the file when there is not enough data left to
> find the current record boundary - that way you only load into memory (at
> any one time) enough of the file to process completely the next record.
>
> In terms of finding the size of a file in LiveCode you can use 'the
> detailed files'.
>
> It is worth pointing out that using 'open file' and 'read from file' are
> *stream* based in approach. From memory, the MBOX format is essentially
> line-based, so you should be able to write a relatively simple parsing loop
> with that in mind:
>
> open file ...
> repeat forever
>    read from file ... until return
>    if the result is not empty then
>       exit repeat
>    end if
>    if *it is a new message boundary* then
>       ... finish processing current message ...
>       ... start processing new boundary ...
>    else
>       ... append line to current message ...
>    end if
> end repeat
>
> Of course, one thing to bear in mind, is that with a 38Gb file you are
> never going to fit all of that into memory; so the best approach would
> probably be to parse your mail messages and then store them into a storage
> scheme which doesn't require everything to appear in memory at once - e.g.
> an sqlite db or a more traditional dbms, or even lots of discrete files in
> a filesystem in some suitable hierarchy.
>
> Warmest Regards,
>
> Mark.
>
> --
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps
>
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>