MySQL, removing duplicates... but not blindly

Tue, 17 Aug 2010, 08:02
HoboBen	I have a table, tblReleases, which may look like this:

    artistId   albumTitle   releaseDate
    -----------------------------------
    100        Example1     20100813
    100        Example1     20100815
    500        Example1     19990512

As you can see, multiple artists can create an album with the same title. However, sometimes an artist has a duplicate copy of their album, but with a different date.

I'm trying to select all records where an albumTitle is shared by the same artist, then choose the minimum releaseDate. I can't simply make (artistId, albumTitle) unique, because I need the minimum releaseDate.

Here's my current idea:

    select albumTitle, count(albumTitle) as n
    from tblReleases
    group by albumTitle
    having n > 1
    limit 50;

This gives me:

    albumTitle   n
    --------------
    Example1     2

So then, for each result, select the album by albumTitle, compare artistIds, and if any artistIds match, delete the one(s) with the later date.

I think that this is probably a terrible way to go about it. In particular, I'm sure there should be a way to improve my first query to consider artistIds. Hopefully even a query that can do it all in one go.

Can anyone offer a suggestion? (I definitely need to buy a book on this subject.)

Thanks!

Edit: I now have:

    select * from tblReleases
    where txtTitle in (
        select txtTitle from tblReleases
        group by txtTitle
        having count(txtTitle) > 1
    )
    limit 10;

but it's taking almost a second per row!
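One way the first query could take artistIds into account is to group on the (artistId, albumTitle) pair instead of on the title alone. The following is only a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the column types are guesses, and the `firstRelease` alias is a made-up name.

```python
import sqlite3

# SQLite stands in for MySQL here; the column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblReleases (artistId INTEGER, albumTitle TEXT, releaseDate INTEGER)"
)
conn.executemany(
    "INSERT INTO tblReleases VALUES (?, ?, ?)",
    [
        (100, "Example1", 20100813),
        (100, "Example1", 20100815),
        (500, "Example1", 19990512),
    ],
)

# Group on the (artistId, albumTitle) pair so that two artists who share a
# title are not counted as duplicates of each other.
duplicates = conn.execute(
    """
    SELECT artistId, albumTitle, COUNT(*) AS n, MIN(releaseDate) AS firstRelease
    FROM tblReleases
    GROUP BY artistId, albumTitle
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicates)  # [(100, 'Example1', 2, 20100813)]
```

Artist 500 drops out because it has only one release of that title, and the minimum date comes back in the same pass, so no per-row follow-up query is needed. On a large MySQL table, an index covering (artistId, albumTitle) would probably help the grouping, though that has not been measured here.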
Tue, 17 Aug 2010, 09:17
JL235	One thing that might make it massively faster is if you push the album details out into their own table. Something like:

    tblReleases = { artistID, albumID, releaseDate }
    tblAlbums   = { albumID, title, other stuff ... }

That should make finding duplicate album names _much_ faster for this scenario, and should also lower your memory footprint (cos the name itself is only stored once).
Tue, 17 Aug 2010, 09:22
HoboBen	That's a good idea. Ideally the name would only be stored once, but this requires removing the duplicates first. I think though, for the moment, I'm just going to add unique(artistId, albumTitle). It's cheating, but at least it will ignore any future duplicate albums inserted with later dates. Thanks!
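A unique constraint on (artistId, albumTitle) can only be created once the existing duplicates are gone, so the later-dated rows would need deleting first. Below is a rough sketch of that order of operations, again using sqlite3 as a stand-in for MySQL. The index name is hypothetical, and the syntax differs in MySQL: `INSERT IGNORE` replaces SQLite's `INSERT OR IGNORE`, and since MySQL typically rejects a subquery on the table being deleted from, a multi-table DELETE with a self-join would be the usual equivalent there.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblReleases (artistId INTEGER, albumTitle TEXT, releaseDate INTEGER)"
)
conn.executemany(
    "INSERT INTO tblReleases VALUES (?, ?, ?)",
    [
        (100, "Example1", 20100813),
        (100, "Example1", 20100815),
        (500, "Example1", 19990512),
    ],
)

# Step 1: delete every row that has an earlier release of the same album
# by the same artist. The earliest row never matches, so it survives.
conn.execute(
    """
    DELETE FROM tblReleases
    WHERE EXISTS (
        SELECT 1
        FROM tblReleases AS earlier
        WHERE earlier.artistId = tblReleases.artistId
          AND earlier.albumTitle = tblReleases.albumTitle
          AND earlier.releaseDate < tblReleases.releaseDate
    )
    """
)

# Step 2: with the duplicates gone, the unique index can be created.
# The index name is a hypothetical choice.
conn.execute(
    "CREATE UNIQUE INDEX uniqArtistAlbum ON tblReleases (artistId, albumTitle)"
)

# Step 3: a later-dated duplicate is now skipped instead of stored.
conn.execute("INSERT OR IGNORE INTO tblReleases VALUES (100, 'Example1', 20100901)")

remaining = conn.execute(
    "SELECT artistId, albumTitle, releaseDate FROM tblReleases ORDER BY artistId"
).fetchall()
print(remaining)  # [(100, 'Example1', 20100813), (500, 'Example1', 19990512)]
```

Two caveats: rows that share the same artist, title and date are not removed by step 1, and once the constraint is in place an earlier-dated duplicate arriving later would be ignored as well, so the constraint alone does not guarantee that the stored date is the minimum.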
Tue, 17 Aug 2010, 09:29
JL235	... and the SQL you need should be something like:

    SELECT *
    FROM (
        SELECT *
        FROM `tblReleases`
        WHERE artistID = $artistID
        ORDER BY albumID, releaseDate DESC
    ) AS sorted_releases
    GROUP BY artistID, albumID

Not sure if it's 'releaseDate DESC' or 'releaseDate ASC', so you might want to try flipping it if you're getting the latest rather than the earliest.

Edit: I think I misread your original problem; I think the SQL above is what you truly want. It should (hopefully) print out all albums for a given artistID but only contain the minimum release date.
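The sorted-subquery approach depends on MySQL returning the first row of each group, which MySQL does not guarantee for columns that are neither grouped nor aggregated. A variant that asks for the minimum date directly avoids that dependency. This is only a sketch: it runs on sqlite3 in place of MySQL, groups by albumTitle because the sample table has no albumID column, and the `artist_id` value stands in for `$artistID`.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblReleases (artistId INTEGER, albumTitle TEXT, releaseDate INTEGER)"
)
conn.executemany(
    "INSERT INTO tblReleases VALUES (?, ?, ?)",
    [
        (100, "Example1", 20100813),
        (100, "Example1", 20100815),
        (500, "Example1", 19990512),
    ],
)

artist_id = 100  # hypothetical value standing in for $artistID

# MIN() picks the earliest date per album explicitly, so the result does not
# depend on the order in which rows reach the GROUP BY.
rows = conn.execute(
    """
    SELECT artistId, albumTitle, MIN(releaseDate) AS releaseDate
    FROM tblReleases
    WHERE artistId = ?
    GROUP BY artistId, albumTitle
    """,
    (artist_id,),
).fetchall()

print(rows)  # [(100, 'Example1', 20100813)]
```

Binding the artist id as a parameter, instead of interpolating it into the SQL string, also sidesteps quoting problems.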
Tue, 17 Aug 2010, 10:58
waroffice	Cool, didn't realise you could nest the select statements. That's ace. PS: I'm a n00b when it comes to PHP/MySQL, so I have no idea how to solve your problem.
Tue, 17 Aug 2010, 13:42
Stealth	This may be what you need. It's untested so it might not even work. The DISTINCT parameter selects only a distinct field. I've ordered by releaseDate so that the oldest one will be the distinctly selected one. Play around with this and let me know if you have any luck.

    SELECT DISTINCT `artistId`, DISTINCT `albumTitle`, `releaseDate`
    FROM `tblReleases`
    ORDER BY ABS(`releaseDate`) ASC
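For anyone trying this: in standard SQL, DISTINCT may appear only once, directly after SELECT, and it applies to the whole selected row, not to a single column, so the query above is likely to fail to parse as written. The sketch below, run on sqlite3 as a stand-in for MySQL, shows the effect on the sample rows: while releaseDate is in the select list the two rows for artist 100 still differ, so both are returned.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblReleases (artistId INTEGER, albumTitle TEXT, releaseDate INTEGER)"
)
conn.executemany(
    "INSERT INTO tblReleases VALUES (?, ?, ?)",
    [
        (100, "Example1", 20100813),
        (100, "Example1", 20100815),
        (500, "Example1", 19990512),
    ],
)

# DISTINCT compares whole rows: the differing dates keep both of artist
# 100's rows in the result.
with_date = conn.execute(
    """
    SELECT DISTINCT artistId, albumTitle, releaseDate
    FROM tblReleases
    ORDER BY releaseDate ASC
    """
).fetchall()

# Without releaseDate the pairs collapse, but the date is no longer available.
pairs_only = conn.execute(
    "SELECT DISTINCT artistId, albumTitle FROM tblReleases ORDER BY artistId"
).fetchall()

print(len(with_date))  # 3
print(pairs_only)      # [(100, 'Example1'), (500, 'Example1')]
```

Getting one row per pair together with its earliest date therefore seems to need GROUP BY with MIN(releaseDate), not DISTINCT.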
Tue, 17 Aug 2010, 13:59
HoboBen	Thanks very much Stealth, I will give that a go!
Wed, 18 Aug 2010, 13:35
Stealth	Did this work for you?
Thu, 19 Aug 2010, 10:15
HoboBen	Not yet; I've been playing too much Borderlands! Should have time on the weekend though. Cheers