Michal Nazarewicz <mina86@mina86.com> writes:

>>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
>>>> +class MailComparator:
>>>> +    """Checks if mail files are duplicates."""
>>>> +    def __init__(self, filename):
>>>> +        self.filename = filename
>>>> +        self.mail = self.readFile(self.filename)
>>>> +
>>>> +    def isDuplicate(self, filename):
>>>> +        return self.mail == self.readFile(filename)
>>>> +
>>>> +    @staticmethod
>>>> +    def readFile(filename):
>>>> +        with open(filename) as f:
>>>> +            data = ""
>>>> +            while True:
>>>> +                line = f.readline()
>>>> +                for header in IGNORED_HEADERS:
>>>> +                    if line.startswith(header):
>
>> Michal Nazarewicz <mina86@mina86.com> writes:
>>> Case of headers should be ignored, but this does not ignore it.
>
> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
>> It does.
>
> Wait, how?  If line is “received:” how does it start with “Received:”?
>

Sorry, I misunderstood your comment.  It does not ignore the case indeed.

>>>> +        if os.path.realpath(comparator.filename) == os.path.realpath(filename):
>>>> +            print "Message '%s' has filenames pointing to the same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename, filename)
>>>
>>> So why aren't those removed?
>>>
>>
>> Because it is the same file indexed twice (probably because of
>> symlinks).  We do not want to remove the only message file.
>
> Ah, right, with symlinks this is troublesome, but then again, we can
> check if there is at least one non-symlink.  If there is, delete
> everything else, if there is not, delete all but one arbitrarily chosen
> symlink.
>

Sure, we could do that.
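
For reference, a minimal and untested sketch of the two points conceded above: matching ignored headers without regard to case, and choosing which paths to drop when several filenames resolve to the same file.  The function names and the `ignored_headers` parameter are hypothetical and not part of the posted patch.  The sketch is also more conservative than the suggestion in the thread: it only ever proposes removing paths that are themselves symlinks, because two non-symlink paths can still reach the same file through a symlinked directory.

```python
import os


def is_ignored_header(line, ignored_headers):
    """Return True if line starts with any ignored header, ignoring case.

    ignored_headers is assumed to be an iterable of prefixes such as
    "Received:"; the real IGNORED_HEADERS list is not shown in the thread.
    """
    prefixes = tuple(h.lower() for h in ignored_headers)
    return line.lower().startswith(prefixes)


def plan_removals(filenames):
    """Split paths that resolve to the same file into (keep, remove).

    If at least one path is not a symlink, keep every non-symlink path
    and mark the symlinks for removal.  If all paths are symlinks, keep
    the first one (an arbitrary choice) and mark the rest for removal.
    """
    links = [f for f in filenames if os.path.islink(f)]
    regular = [f for f in filenames if not os.path.islink(f)]
    if regular:
        # Non-symlink paths may be the only real copy (for example when
        # reached through a symlinked directory), so none are removed.
        return regular, links
    return links[:1], links[1:]
```

The caller would still have to run `os.remove()` on the second list; keeping the decision separate from the deletion makes the policy easy to check on a scratch maildir first.
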
>>>> +        elif comparator.isDuplicate(filename):
>>>> +            os.remove(filename)
>>>> +            duplicates_count += 1
>>>> +        else:
>>>> +            #print "Potential duplicates: %s" % msg.get_message_id()
>>>> +            suspected_duplicates_count += 1
>>>> +
>>>> +    new_timestamp = time.time()
>>>> +    if new_timestamp - timestamp > 1:
>>>> +        timestamp = new_timestamp
>>>> +        sys.stdout.write("\rProcessed %s messages, removed %s duplicates..." % (msg_count, duplicates_count))
>>>> +        sys.stdout.flush()
>>>> +
>>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg_count, duplicates_count)
>>>> +if duplicates_count > 0:
>>>> +    print "You might want to run 'notmuch new' now."
>>>> +
>>>> +if suspected_duplicates_count > 0:
>>>> +    print
>>>> +    print "Found %s messages with duplicate IDs but different content." % suspected_duplicates_count
>>>> +    print "Perhaps we should ignore more headers."
>>>
>>> Please consider the following instead (not tested):
>
>> Thanks for reviewing my poor python code :) I am afraid I do not have
>> enough interest in improving it.  I just implemented a simple solution
>> for my problem.  Though it looks like you already took time to rewrite
>> the script.  Would be great if you send it as a proper patch obsoleting
>> this one.
>
> Bah, I probably won't have time to properly test it.
>

Same problem :)

Regards,
  Dmitry

> --
> Best regards,                                         _     _
> .o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
> ..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
> ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--