Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011:
> On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
> <patricktotzke@googlemail.com> wrote:
> > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> >> >> > > Have you tried simply calling list() on your thread
> >> >> > > iterator to see how expensive it is? My bet is that it's quite cheap,
> >> >> > > both memory-wise and CPU-wise.
> >> >> > Funny thing:
> >> >> >   q = Database().create_query('*')
> >> >> >   time tlist = list(q.search_threads())
> >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> >> >> > the list constructor must read more than once from the iterator.
> >> >> > So this is not an option, but even if it worked, it would show
> >> >> > the same behaviour as my above test.
> >> >>
> >> >> Interesting. Looks like the Threads class implements __len__ and that
> >> >> its implementation exhausts the iterator. Which isn't a great idea in
> >> >> itself, but it turns out that Python's implementation of list() calls
> >> >> __len__ if it's available (presumably to pre-size the list) before
> >> >> iterating over the object, so it exhausts the iterator before even
> >> >> using it.
> >> >>
> >> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> >> >> better performance than your experiment above.
> > True. Nevertheless, I think that list(q.search_threads())
> > should be equivalent to [t for t in q.search_threads()], which is
> > something to be fixed in the bindings. Should I file an issue somehow?
> > Or is it enough to state this as a TODO here on the list?
>
> Yes, they should be equivalent.
>
> Sebastian was thinking about fixing the larger issue of generator
> exhaustion, which would address this, though the performance would
> depend on the cost of iterating twice. This is why generators
> shouldn't support __len__. Unfortunately, it's probably hard to get
> rid of at this point and I doubt there's a way to tell list() to
> overlook the presence of a __len__ method.

Why not simply remove support for __len__ from the Threads and
Messages classes?

> >> >> > would it be very hard to implement a Query.search_thread_ids()?
> >> >> > The name is a bit off because it would have to be done at a lower level.
> >> >>
> >> >> Lazily fetching the thread metadata on the C side would probably
> >> >> address your problem automatically. But what are you doing that
> >> >> doesn't require any information about the threads you're manipulating?
> >> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> >> > ids, or a reliable iterator thereof, using the current python bindings.
> >> > It would be enough for me to have the ids, because then I could
> >> > search for the few threads I actually need individually on demand.
> >>
> >> There's no way to do that from the C API either, so don't feel left
> >> out. ]:--8) It seems to me that the right solution to your problem
> >> is to make thread information lazy (effectively, everything gathered
> >> in lib/thread.cc:_thread_add_message). Then you could probably
> >> materialize that iterator cheaply.
> > Alright. I'll put this on my mental notmuch wish list and
> > hope that someone will have addressed this before I run out of
> > ideas for how to improve my UI and have time to look at this myself.
> > For now, I'll go with the [t.get_thread_id() for t in q.search_threads()]
> > approach to cache the thread ids myself and live with the fact that
> > this takes time for large result sets.
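
That caching pass is itself the expensive part, of course. Spelled out,
the workaround looks roughly like this; a sketch that only uses calls
already seen in this thread plus notmuch's thread:<id> search prefix,
and where lookup_thread is a helper of my own, not part of the bindings:

  from notmuch import Database

  db = Database()

  # Expensive one-off pass: a Thread object is instantiated for every
  # result just to read its id. This is exactly the cost that a
  # hypothetical search_thread_ids() would avoid.
  thread_ids = [t.get_thread_id()
                for t in db.create_query('*').search_threads()]

  def lookup_thread(db, tid):
      # Materialize a single thread on demand via a thread: query.
      q = db.create_query('thread:' + tid)
      return next(iter(q.search_threads()))

  # Only the few threads actually displayed get instantiated:
  for tid in thread_ids[:20]:
      thread = lookup_thread(db, tid)
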
> >
> >> In fact, it's probably worth
> >> trying a hack where you put dummy information in the thread object
> >> from _thread_add_message and see how long it takes just to walk the
> >> iterator (unfortunately I don't think profiling will help much here
> >> because much of your time is probably spent waiting for I/O).
> > I don't think I understand what you mean by dummy info in a thread
> > object.
>
> In _thread_add_message, rather than looking up the message's author,
> subject, etc, just hard-code some dummy values. Performance-wise,
> this would simulate making the thread metadata lookup lazy, so you
> could see if making this lazy would address your problem.

Thanks for the clarification. I did that, and also commented out the
lower parts of _notmuch_thread_create. This did indeed improve the
performance, but not as much as I had hoped:

  In [10]: q = Database().create_query('*')
  In [11]: time T = [t for t in q.search_threads()]
  CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s
  Wall time: 2.66 s

And I have only about 8000 mails in my index. Making thread lookups
lazy would help, but one would still create a lot of unused (empty)
thread objects. The easiest solution to my problem would, in my
opinion, be a function that queries only for thread ids without
instantiating them. But I can't think of any use case other than mine
for this, so I guess many of you would be against adding it to the API?

/p
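
P.S.: The list()/__len__ problem is easy to reproduce without a notmuch
index. A minimal sketch, with a made-up class standing in for Threads
(the real bindings raise NotmuchError(STATUS.NOT_INITIALIZED) rather
than silently yielding nothing):

  class BrokenThreads(object):
      # An iterator that, like Threads, also implements a destructive
      # __len__: counting consumes the underlying iterator.
      def __init__(self, items):
          self._iter = iter(items)

      def __iter__(self):
          return self

      def __next__(self):
          return next(self._iter)
      next = __next__  # Python 2 spelling

      def __len__(self):
          return sum(1 for _ in self._iter)

  # list() consults __len__ (to pre-size the result) before iterating,
  # which drains the iterator, so nothing is left to collect:
  print(list(BrokenThreads(['id1', 'id2'])))         # []
  # A comprehension never calls __len__ and behaves as expected:
  print([t for t in BrokenThreads(['id1', 'id2'])])  # ['id1', 'id2']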