Tomi Ollila <tomi.ollila@iki.fi> writes:

> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:
>
>> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>>
>>> Hi,
>>>
>>> I want to import bigger chunk of archived messages into my notmuch
>>> database. It's about 100k messages. The problem is, that I most probably
>>> have quite a lot of those messages in the DB. Basically I would like to
>>> add only those I don't have already.
>>>
>>> There are two possibilities
>>>
>>> a) I will add all the 100k messages and then remove the duplicities.
>>>
>>> b) I will write a script which will parse the message ID's of the
>>> to-be-added messages and try to match them to the notmuch DB. Adding
>>> only files I can't find already.
>>>
>>> Ad b) might be better option, but I started to play with the idea of
>>> deduplication. I'm thinking about listing all the message IDs stored in
>>> DB, listing all files belonging to the IDs and deleting all but one.
>>> Also I'm thinking about implementing some simple algorithm telling me
>>> whether the messages are really very similar. Just to be sure I don't
>>> delete something I don't want to.
>>>
>>> Was anyone playing with the idea?
>>
>> I am not sure what your use case is but notmuch automatically
>> deduplicates: that is if the message-id is one it has already seen no
>> further indexing takes place. The only thing that happens is the new
>> filename gets added to the list of filenames for the message.
>>
>> Thus importing should be almost as fast as if the message were not
>> there, and the database should be almost identical to what you would get
>> if you only imported the genuine new messages.
>>
>> If you want to save disk space then you could delete the duplicates
>> after with something like
>>
>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>> xargs -0
>
> What if there are 3 duplicates (or 4... ;)

I was assuming that it was merging 2 duplicate-free bunches of messages,
but I guess the new 100000 might not be. In that case running the above
repeatedly (ie until it is a no-op) would be fine.

>> (but please test it carefully first!)
>
> One should also have some message content heuristics to determine that the
> content is indeed duplicate and not something totally different (not that
> we can see the different content anyway... but...)

That would be nice.

Best wishes

Mark

>> I would think something like this is better than trying to parse the
>> message-ids yourself.
>>
>> Best wishes
>>
>> Mark
>
> Tomi
>
>>> --
>>> Vlad
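
The content heuristic discussed above could be sketched roughly as follows.
This is only an illustration in Python, not something notmuch provides: the
function names body_fingerprint and removable_duplicates are hypothetical,
and it assumes the list of files for one message-id has already been
collected (for example with notmuch search --output=files and an id: query).
It hashes only the text parts, so copies that differ just in headers such as
Received: still compare equal.

```python
import hashlib
from email import message_from_bytes


def body_fingerprint(raw: bytes) -> str:
    """Hash the text parts of a message, ignoring headers and whitespace runs."""
    msg = message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        # Multipart containers and attachments are skipped; only text is compared.
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            # Collapse whitespace so trailing blank lines do not matter.
            chunks.append(b" ".join(payload.split()))
    return hashlib.sha256(b"\n".join(chunks)).hexdigest()


def removable_duplicates(paths):
    """Return the paths after the first whose body matches the first file's body."""
    if not paths:
        return []
    keep, *rest = paths
    with open(keep, "rb") as fh:
        reference = body_fingerprint(fh.read())
    removable = []
    for path in rest:
        with open(path, "rb") as fh:
            if body_fingerprint(fh.read()) == reference:
                removable.append(path)
    return removable
```

Files that share a message-id but have a different body are left out of the
returned list, so they can be inspected by hand instead of being deleted.
Whether whitespace-insensitive comparison of text parts is strict enough is
a judgment call; please test it carefully on a copy of the mail first.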