Databases -- must be a better way

-=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- (c) WidthPadding Industries 1987 0\|77\|0 -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=-
Socoder -> Web Development -> Databases -- must be a better way
Fri, 26 Aug 2011, 15:14
HoboBen	For a CMS, I will have articles, and each article can have several revisions. I was thinking of having an article table and a revision table: Articles: id created (unique, auto-increment) (timestamp) Revisions: id articleId title body (unique, auto-increment) (link to articles table) (string) (string) --v As you can see, this is denormalised so that no data is duplicated (generally a good thing, I think?) But if I want to display a list of articles, I first have to query the articles table, then for each article, query the revision to get the title. This might mean 50 queries on a page! Surely there's a better way? I could duplicate the revision title column into the articles table, so that the revision table doesn't need to be queried. I could combine the article and revision tables together, and do one query (possibly using DISTINCT or similar on the articleId field to only get the latest revision). Does that sound like a good idea? In which case, how do I safely get an auto-incremented id for new articles? (I would guess by locking the table first before a MAX query?) I haven't had much practice with databases, so I thought it best to ask in case I was doing this completely the wrong way. Thanks! -=-=- blog \| work \| code \| more code
Fri, 26 Aug 2011, 15:41
JL235	You need to use 'joins'. There are two ways, the common way is: SELECT * FROM Articles, Revisions WHERE Article.id = Revisions.articleID ORDER BY Article.created --v I select from two tables, but the 'id = articleID' specifies I am only interested in the data that crosses over. In practice that will pull out a row containing both the Article and Revision data, for each Article and each Revision. So if you have 5 articles, and each has 2 revisions, then you have 10 rows returned. Another way of thinking about it is that this will return a list of all revisions, but each revision row has it's article data included. This means you have a lot of repeated article data, but that is _FAR_ cheaper then performing multiple SQL queries. The other way to do it is using the proper JOIN syntax, which allows you to do funky stuff like pulling out data even if the join fails (the fields will have null). However, if you use some frameworks (like Ruby on Rails) then it can handle all of these joins for you, hiding them away. It allows you to get more of your CMS built, in less time. I use my own PHP database code (which I'm actually editing right now) to handle joins for me. You might also want to move the creation time out into the revision. Each revision should probably have a time stamp, and the article creation time will be the same as the first revision.
Fri, 26 Aug 2011, 17:14
HoboBen	Thanks, that makes a lot of sense. I've read up on joins. You're right that article creation time is the same as the first revision time stamp - it might make sense to remove the articles table completely. I guess getting a list of say 50 distinct articles with the latest revision data would then look something like: select * from Revisions where articleId > $offset // for pagination group by articleId // unique article id order by max(timestamp) limit 0,50 Although I'm a bit dodgy on using max. I'll have to test, but I don't know if it should be in order by or part of the select. Thanks for the help. -=-=- blog \| work \| code \| more code
Fri, 26 Aug 2011, 17:49
JL235	I don't think you should use articleID for pagination. It would be very easy to mess up your pagination, such as deleting or hiding an article. You'll also find in the future you will have more bits of data to store, which is specific to an article. For example storing a pretty url, a publication date, or a flag to say if the article is hidden or public. In contrast to what I said above, it is probably a good idea to also keep the creation time in the Article table as well. So it would be stored twice. Otherwise you will have to pull out the first revision, every time you pull out an article, which I'd imagine you'll be doing every time the article is displayed. That is known as 'de-nomalization', where you go against normalization, in order to improve performance and efficiency. Normalization is a bit like functional programming. It's mainly from a theoretical and aesthetic point of view, to help build DB designs where it is difficult to end up with common DB issues. For example you don't store data in multiple places, to ensure that if you update the data, you can't end up with all those locations going out of sync. Good DB design is just as difficult as programming.
Fri, 26 Aug 2011, 19:07
HoboBen	Normally I'd agree that it's not the best way to do pagination generally, but it will work in this specific case (mainly because there are no deletions (so that you can always revert)). I'd like to avoid using a limit offset as it gets slower as the table gets bigger. I also intend to use static HTML as much as possible so that you only hit the DB if you're logged in. On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules? Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary? (if nothing else, I assume this would help reduce fragmentation?) Thanks again; this is all very helpful. -=-=- blog \| work \| code \| more code
Fri, 26 Aug 2011, 22:59
Stealth	HoboBen On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules? You probably don't need to worry about this unless you're running a high traffic database. In fact, JOINing tables is worst performance wise as there is more overhead. HoboBen Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary? Datatypes such as longtext are stored separately on the disk from the row data. If you don't request these fields then the database isn't going to bother with them. If you're concerned with performance then definitely look in to caching. A simple file cache can be used and you only need to update it when someone modifies the database. -=-=- Quit posting and try Google.
Fri, 26 Aug 2011, 23:47
HoboBen	That's very helpful, thanks Stealth!
Sat, 27 Aug 2011, 08:32
JL235	However if you pull out all data, like using *, then it's much faster to use VARCHAR over Text. That is because VARCHAR values are stored within the table, and so only require going to disk once, rather then twice.