Subject: Re: one-time-iterators
Date: Sat, 28 May 2011 09:58:03 +0100
To: Austin Clements
Cc: notmuch
From: Patrick Totzke

Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011:
> On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
> <patricktotzke@googlemail.com> wrote:
> > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> >> >> > > Have you tried simply calling list() on your thread
> >> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> >> >> > > both memory-wise and CPU-wise.
> >> >> > Funny thing:
> >> >> >  q=Database().create_query('*')
> >> >> >  time tlist = list(q.search_threads())
> >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> >> >> > the list constructor must read more than once from the iterator.
> >> >> > So this is not an option, but even if it worked, it would show
> >> >> > the same behaviour as my above test..
> >> >>
> >> >> Interesting.  Looks like the Threads class implements __len__ and that
> >> >> its implementation exhausts the iterator.  Which isn't a great idea in
> >> >> itself, but it turns out that Python's implementation of list() calls
> >> >> __len__ if it's available (presumably to pre-size the list) before
> >> >> iterating over the object, so it exhausts the iterator before even
> >> >> using it.
> >> >>
> >> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> >> >> better performance than your experiment above.
> > true. Nevertheless I think that list(q.search_threads())
> > should be equivalent to [t for t in q.search_threads()], which is
> > something to be fixed in the bindings. Should I file an issue somehow?
> > Or is it enough to state this as a TODO here on the list?
> 
> Yes, they should be equivalent.
> 
> Sebastian was thinking about fixing the larger issue of generator
> exhaustion, which would address this, though the performance would
> depend on the cost of iterating twice.  This is why generators
> shouldn't support __len__.  Unfortunately, it's probably hard to get
> rid of at this point and I doubt there's a way to tell list() to
> overlook the presence of a __len__ method.
Why not simply remove support for __len__ from the Threads and Messages classes?
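Just to illustrate the effect with a toy class (this is only a minimal
sketch, not the actual bindings code: an object whose __len__ walks a
one-shot iterator, the way Threads.__len__ walks the C-level iterator):

  class OneShot(object):
      """A one-pass iterator whose __len__ consumes it."""
      def __init__(self, items):
          self._it = iter(items)
      def __iter__(self):
          return self
      def __next__(self):
          return next(self._it)
      next = __next__              # python2 spelling of the protocol
      def __len__(self):
          # counting walks (and exhausts) the underlying iterator
          return sum(1 for _ in self._it)

  print(list(OneShot('abc')))          # [] -- list() consulted len() first
  print([x for x in OneShot('abc')])   # ['a', 'b', 'c']

Dropping __len__ (or moving the count into a separate method) would make
list() fall back to plain iteration, so both forms would give the same result.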

> >> >> > would it be very hard to implement a Query.search_thread_ids() ?
> >> >> > This name is a bit off because it would have to be done at a lower level.
> >> >>
> >> >> Lazily fetching the thread metadata on the C side would probably
> >> >> address your problem automatically.  But what are you doing that
> >> >> doesn't require any information about the threads you're manipulating?
> >> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> >> > ids or a reliable iterator thereof by using the current python bindings.
> >> > It would be enough for me to have the ids because then I could
> >> > search for the few threads I actually need individually on demand.
> >>
> >> There's no way to do that from the C API either, so don't feel left
> >> out.  ]:--8)  It seems to me that the right solution to your problem
> >> is to make thread information lazy (effectively, everything gathered
> >> in lib/thread.cc:_thread_add_message).  Then you could probably
> >> materialize that iterator cheaply.
> > Alright. I'll put this on my mental notmuch wish list and
> > hope that someone will have addressed this before I run out of
> > ideas how to improve my UI and have time to look at this myself.
> > For now, I'll go with the [t.get_thread_id() for t in q.search_threads()]
> > approach to cache the thread ids myself and live with the fact that
> > this takes time for large result sets.
> >
> >> In fact, it's probably worth
> >> trying a hack where you put dummy information in the thread object
> >> from _thread_add_message and see how long it takes just to walk the
> >> iterator (unfortunately I don't think profiling will help much here
> >> because much of your time is probably spent waiting for I/O).
> > I don't think I understand what you mean by dummy info in a thread
> > object.
> 
> In _thread_add_message, rather than looking up the message's author,
> subject, etc, just hard-code some dummy values.  Performance-wise,
> this would simulate making the thread metadata lookup lazy, so you
> could see if making this lazy would address your problem.
Thanks for the clarification. I did that, and also commented out the
lower parts of _notmuch_thread_create, and this did indeed improve
the performance, but not as much as I had hoped:

In [10]: q=Database().create_query('*')
In [11]: time T=[t for t in q.search_threads()]
CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s
Wall time: 2.66 s

And I have only about 8000 mails in my index.
Making thread lookups lazy would help, but here one would still
create a lot of unused (empty) thread objects.
The easiest solution to my problem would, in my opinion, be
a function that queries only for thread ids without instantiating the threads themselves.
But I can't think of any use case other than mine for this,
so I guess many of you would be against adding it to the API?
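Roughly what I have in mind, expressed against the current bindings
(the helper name is made up; note that every Thread object still gets
built underneath, and skipping exactly that work on the C side would be
the whole point of a real search_thread_ids()):

  from notmuch import Database

  def thread_ids(db, querystring):
      """Yield only the thread ids matching querystring.

      With the current bindings each Thread is still created under the
      hood; this merely avoids keeping the objects around.
      """
      q = db.create_query(querystring)
      for t in q.search_threads():
          yield t.get_thread_id()

  # cache the ids once, then fetch single threads on demand later, e.g.:
  # ids = list(thread_ids(Database(), '*'))
  # q = Database().create_query('thread:' + ids[0])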
/p