Dupes pool?

jxh2154

2008-01-01 22:40:00 UTC

admin2, I noticed you started a dupes pool. What are the criteria for a dupe?

For example, post #11715 was entered in the pool as a dupe, when in fact it's a higher quality version of post #6910

So I made the former the parent of the latter, and we definitely don't want to delete post #11715

Ditto on post #11716 being not a real duplicate of post #7084 or post #6909 but in fact a higher quality version that I made the parent image.

So knowing what we're defining as a true dupe will help, as will knowing if the preference is to delete dupes or parent them instead.

admin2

2008-01-01 23:22:47 UTC

Parenting is probably the best idea; I created the pool sorta as an afterthought. We've been working on a dupe detector, but it hit a few road bumps. More details eventually, I guess.

aoie_emesai

2008-01-02 05:17:36 UTC

There are lots of images, and it does require the work of members here to detect mishaps like that. If the image is of better quality and size, simply remove it and add the lesser one into the dupe pool.

viiv

2008-01-02 14:19:11 UTC

It's a bit strange, but when I see posts about image dupes on imageboards, I never see the name of an extremely powerful (and open-source!) tool: imgseek. I think it's the only program that searches for dupes by wavelet decomposition, and it's amazingly efficient at finding exact dupes (which is the case here)!
Everyone who has a large collection of images (and so dupes...) should take a look at it!
You can even draw a sketch and imgseek will find an image that looks like it; it's really powerful!
P.S.: the version available on Windows is 0.8.5, which is sometimes a bit unstable; version 0.8.6 is better but runs only under Linux (but I think moe.imouto runs under Linux, right?)

admin2

2008-01-02 17:37:26 UTC

We're working on a wavelet program to match dupes, but if one already exists maybe we can integrate it. Thanks for letting me know.

viiv

2008-01-02 17:57:30 UTC

I think imgseek would really do a good job, but if you want another point of view on wavelet programs for finding dupes, here is one in Delphi: http://www.delphifr.com/codes/RECHERCHE-SUPPRESSION-IMAGES-DOUBLE-BASEE-SUR-COMPARAISON-INTELLIGENTE_38711.aspx
The comments are in French, but I think a translation tool is sufficient to understand what the author and the forum members say.
Hope it helps.

MDGeist

2008-01-02 18:49:42 UTC

i use dupdetector to detect dups...

Well no idea how to run that on moe and its structure
but since i have most of the files in 1 folder, running checks would be possible...

Radioactive

2008-01-02 19:35:32 UTC

MDGeist said:
i use dupdetector to detect dups...

Well no idea how to run that on moe and its structure
but since i have most of the files in 1 folder, running checks would be possible...

dupdetector chokes on the amount of images I throw at it. You should try the de-duper built into GQView.

Of course this doesn't help moe in any way...

MDGeist

2008-01-02 19:57:04 UTC

it's easier to search for dupes if you have all files on hdd.
one paged works fine too, since moe doesn't have that many images.

9k pics can be checked by eye in one day ... Will get troublesome after 10k though...

viiv

2008-01-03 12:19:43 UTC

Radioactive said: "Of course this doesn't help moe in any way..." --> why? Any method of finding dupes (as long as it's based on content and not checksum) is good, after all.

MDGeist said: "9k pics can be checked by eye in one day ..." --> good luck!

Radioactive

2008-01-03 18:25:01 UTC

viiv said:
Radioactive said: "Of course this doesn't help moe in any way..." --> why? Any method of finding dupes (as long as it's based on content and not checksum) is good, after all.

It depends on whether GQView's de-duper could be implemented for the website. I'm only saying that it is superior to dupdetector.

MDGeist said: "9k pics can be checked by eye in one day ..." --> good luck!

MDGeist is not even human, so it is quite possible....

cyanoacry

2008-01-03 19:02:24 UTC

I'm currently coding up a new normalized Haar wavelet decomposer/sorter/coefficient generator for use in the image detector, because rq's doesn't really work.

E.g., according to the algorithm in the tree right now, http://moe.imouto.org/post/show/628 and http://moe.imouto.org/post/show/3443 are identical. This needs some work.

The algorithm that was being tested was the multiresolution image query described in Jacobs et al. (http://grail.cs.washington.edu/projects/query/). The fuzziness of the search didn't even come into play here: the two posts above had the -exact- same wavelet coefficients. All 120 of them.

rq wasn't kidding when he said the storage requirements were considerable. The coefficient table with 8600 posts will contain over a million rows (120 rows/image).
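To make the storage figure concrete: one way to end up with a fixed number of rows per image is to keep only the largest-magnitude wavelet coefficients and store just their position and sign. The sketch below is an illustration under assumptions, not the code in the tree. `signature` and `table_rows` are hypothetical names, and skipping index 0 (treated here as the overall average) is an assumption about how the average is handled.

```python
def signature(coeffs, keep):
    """Return sorted (position, sign) pairs for the `keep` largest-magnitude
    coefficients. Index 0 is skipped on the assumption that it holds the
    overall average and is stored separately."""
    ranked = sorted(range(1, len(coeffs)),
                    key=lambda i: abs(coeffs[i]),
                    reverse=True)
    return sorted((i, 1 if coeffs[i] > 0 else -1) for i in ranked[:keep])


def table_rows(num_images, rows_per_image):
    """Rows needed when every image stores the same number of coefficients."""
    return num_images * rows_per_image


# The figures quoted above: 8600 posts at 120 rows per image.
print(table_rows(8600, 120))  # 1032000
```

With the numbers from this thread, 8600 x 120 = 1,032,000 rows, which matches the "over a million" estimate.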

My query code also needs a lot of work: I'm looking at about 15 seconds to match one image out of eight thousand. Jacobs et al. did it in half a second on a database of 20,000 images. In 1995.
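One common way to avoid scanning every image's rows on each query is an inverted index: map each (position, sign) pair to the set of images that contain it, then only touch the entries the query image actually has. The sketch below is a guess at a workable structure, not the code being tested here and not taken from the paper. `build_index` and `rank` are hypothetical names, the image ids are made up, and the score is a plain count of shared coefficients with no weighting.

```python
from collections import defaultdict


def build_index(signatures):
    """signatures: {image_id: [(position, sign), ...]}.
    Returns {(position, sign): set of image_ids}."""
    index = defaultdict(set)
    for image_id, sig in signatures.items():
        for key in sig:
            index[key].add(image_id)
    return index


def rank(index, query_sig, top=10):
    """Count shared (position, sign) pairs per image and return the best
    `top` matches as (image_id, score), highest score first."""
    scores = defaultdict(int)
    for key in query_sig:
        for image_id in index.get(key, ()):
            scores[image_id] += 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Made-up signatures for three images.
sigs = {1: [(1, 1), (3, -1)], 2: [(1, 1), (2, 1)], 3: [(5, -1)]}
index = build_index(sigs)
print(rank(index, [(1, 1), (3, -1)]))  # [(1, 2), (2, 1)]
```

The work per query then depends on how many images share each coefficient with the query, not on the total number of rows in the table.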

That being said, no, I really don't want to use an imported algorithm not least because:

a) if I don't understand it, I won't be able to fix it
b) I'm a kernel-side C programmer, so I should be able to make it fast and small enough to be usable on a near-interactive basis.

viiv, I also took a look at imgseek. Their web-based demo turned up some pretty crappy results (e.g., a similar image search for snowflakes turns up... grass and other random things, but certainly not snowflakes). Plus it's GPL'd, while danbooru is intentionally BSD-licensed. I don't want to get into any tainted code issues.

Things to look forward to (assuming I have enough time to finish):

The ability to see possible dupes as you upload
The ability to search for portions of images:

--> You can search for a certain artist's eyes
--> Or, perhaps, panty styles etc.

If anybody's got a handy reference for wavelet theory that isn't filled with math junk way over the top of my head (I really don't need a comparison to Fourier transforms, and a coverage of the continuous wavelet transform is immaterial) I'd really, really appreciate it.

So far I've been working off of this article:
http://www.spelman.edu/~colm/wav.pdf

Normalized Haar wavelets are the easiest due to the averaging-and-differencing method you can use to implement them.
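For anyone following along, the averaging-and-differencing idea can be sketched in one dimension like this. It is a minimal illustration, not the decomposer being written for moe: `haar_1d` and `inverse_haar_1d` are hypothetical names, the input length is assumed to be a power of two, and a real image would need the same passes applied along rows and columns.

```python
import math


def haar_1d(values):
    """Full normalized 1-D Haar decomposition.
    Each pass replaces pairs (a, b) with (a + b) / sqrt(2) and
    (a - b) / sqrt(2), then repeats on the averaged half."""
    out = [float(v) for v in values]
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / s for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / s for i in range(half)]
        out[:n] = avgs + diffs
        n = half
    return out


def inverse_haar_1d(coeffs):
    """Undo haar_1d by rebuilding pairs from averages and differences."""
    out = [float(c) for c in coeffs]
    s = math.sqrt(2.0)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) / s, (a - d) / s]
        out[:2 * n] = merged
        n *= 2
    return out
```

For example, `haar_1d([9, 7, 3, 5])` gives roughly `[12, 4, 1.414, -1.414]`. The division by sqrt(2) is what makes it "normalized": the sum of squares is the same before and after (164 in this example), so coefficient magnitudes can be compared across levels.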

If anybody's got tips/experience in this field, I'd love to hear it!

viiv

2008-01-03 19:49:17 UTC

I agree that their web-based demo (isk-daemon, I think) gives crappy results like you said, but imgseek itself gives good results when you set a high threshold (>98%, I would say). It does give false results on a huge database like moe, but the dupes you want to find are among the results, because the images on moe that are dupes (see those in the dupes pool) are different scans of the same source, so the scans are really similar and it is easy for imgseek to find them (fast).
I'm just saying that imgseek is good! But it seems you're going to build a tool like that yourself; that's courageous!
I think that "* The ability to search for portions of images:
--> You can search for a certain artist's eyes
--> Or, perhaps, panty styles etc." will be pretty hard to do, but it would be amazing!
And have you tried the Delphi code I found? (I haven't, because I don't know how to build a program from source code...) It contains functions that do the Haar transform, and the author gives a link about the Haar transform: http://home.versateladsl.be/epm6604b/ondelette.html (in (simple) French, but Babelfish is your friend!), and that page links to http://www.amara.com/current/wavelet.html (in English this time), which seems thorough on wavelets... good luck!!!
btw, who is this rq???

cyanoacry

2008-01-03 20:12:30 UTC

rq is the guy who runs danbooru.donmai.us and the actual danbooru coder

cyanoacry

2008-01-03 20:28:04 UTC

and actually, no. sometimes it's possible for the intended dupe not to show up in the top 10-20 results, which GREATLY reduces the chance that anybody will find it

we need to find an algorithm that'll consistently rank the target image in the top ten

viiv

2008-01-04 11:21:25 UTC

I admit it, the true dupe doesn't show up in the top 10 results, because it seems the results aren't sorted by relevance but by order of appearance... it's a bit dumb.
And thanks for the info about rq (shame on me!)
So good luck with your algo!

P.S.: As for Jacobs: in 1995 the images were 320*240 and weighed 5k to 100k, so I think it wasn't difficult to search for dupes in small databases like that... (on moe the images are 5000*5000 and 10M...)

viiv

2008-01-26 19:34:29 UTC

Like MDGeist said in another thread, a useful dupe search tool which works (for the moment) on danbooru has been released: http://danbooru.donmai.us/forum/show/4346
It looks like it's what we were talking about here! So maybe cyanoacry could take this program and run it on the moe.imouto database.
(btw, it's based on imgseek, yeepee!)

Radioactive

2008-01-26 19:47:10 UTC

(btw, it's based on imgseek, yeepee!)

I'll have to download imgseek and see how the dup detector runs on my collection.

viiv

2008-01-26 19:52:30 UTC

I don't know if you're under Linux or not (I'm unfortunately not... but it'll come!) but you'd better try version 0.8.6 (which hasn't been released for Windows), because 0.8.5 hangs on some images in my moe.imouto folder (only 5 or 6 in 10000 pictures are buggy, it's a pain!)

Radioactive

2008-01-26 20:21:38 UTC

viiv said:
I don't know if you're under Linux or not (I'm unfortunately not... but it'll come!) but you'd better try version 0.8.6 (which hasn't been released for Windows), because 0.8.5 hangs on some images in my moe.imouto folder (only 5 or 6 in 10000 pictures are buggy, it's a pain!)

I'm using 0.8.6 under Kubuntu; I'm going to give the dup detector a test run on a 90 GB collection of images I have.

cyanoacry

2008-01-27 03:31:51 UTC

i talked to the developer of this app (the imgseek danbooru search hybrid) and went through his code, it's a really neat idea and actually does everything that i thought it did--and more!

but other than that i've been slapped with another project of d00m which takes precedence for the moment, sorry.

hopefully (one day, i promise) it will be up on moe.

Radioactive

2008-01-27 10:26:14 UTC

hopefully (one day, i promise) it will be up on moe.

Good news.

viiv

2008-01-27 12:38:18 UTC

Yeah, good news!
And Radioactive, I don't know if it's too late or not, but running imgseek on 90 GB of pictures may take a looooooooong time. I've run it on 2 GB and it took 4-5 minutes, so... I'll let you imagine for 90 GB...
But if you succeed in running the dupe check completely, let us know (cause it would mean imgseek 0.8.6 is really reliable!)

Radioactive

2008-01-27 15:03:30 UTC

But if you succeed in running the dupe check completely, let us know (cause it would mean imgseek 0.8.6 is really reliable!)

I'm running a 99% match scan on them as we speak.

Radioactive

2008-01-28 08:22:24 UTC

Radioactive said:
I'm running a 99% match scan on them as we speak.

I gave up after waiting for 3 hours...

The dimension/size match works pretty fast.

viiv

2008-01-28 16:09:42 UTC

Radioactive said:
I gave up after waiting for 3 hours...

The dimension/size match works pretty fast.

Too bad!
You should have tried it first on just the images from moe.imouto.org (29 GB is enough to begin with)!
And the really interesting feature in imgseek is the "by similarity" dupe checker, though the dimension/size match does let you delete the exact dupes (which is already good!).

Radioactive

2008-01-28 18:49:58 UTC

I'll target a smaller set of images on my next attempt.

petopeto

2008-02-03 07:16:24 UTC

admin2 said:
Parenting is probably the best idea; I created the pool sorta as an afterthought. We've been working on a dupe detector, but it hit a few road bumps. More details eventually, I guess.

What's the rationale behind parenting and dupe pools, instead of just deleting dupes?

aoie_emesai

2008-02-03 07:50:59 UTC

Parenting is to save gallery bandwidth and space, since some parents contain as many as 80 images, or even 180. It saves space for other images, and child posts don't appear in the gallery.

I don't remember what the dupe pool was, lol ^_^

Radioactive

2008-02-03 10:14:29 UTC

I don't remember what the dupe pool was, lol ^_^

So if you spotted a dupe you could add both images to it for mod deletion. Never really used.

If you see a dupe and want it deleted please make sure to mark it for deletion as well as parenting the post. It is easier to find both images and compare them.

petopeto

2008-02-12 03:43:18 UTC

Is there a better approach for dupes than parenting them? Having dupes parented causes a problem: there are so many of them that it drowns out the whole parenting system. Whenever I see the "this post has children", I'm inclined not to click it, because the majority of the time it's going to just be a dupe. That means I'm likely to miss the occasional parented posts that are actually interesting.

What's the rationale for keeping dupes around, anyway? I understand keeping ones that have already been favorited (though migrating the favorites when deleting a dupe would be better--no interface to do that, I think), and leaving originals when the new one is a fix (some people may not like color adjustments or crops and want the original), but I don't understand most of them.

Hmm. It'd help if the text said how many posts were in the group: "This post belongs to a parent post with 10 children." (Dupes will usually have just 1.) But it seems better to just flag it and stick a post #link to the better/original in a no-bump comment.
