Different database systems offer substantially different ways of storing and retrieving information, and deciding which to use requires asking yourself how your data needs to be stored and retrieved.
This set of articles tries to explore this by looking at three "classes" of database systems. This article will deal with SQL databases. A follow-up article will discuss xBase descendants and keyed tables like DBM.
One of the most common database questions that come up is "Where is the MS Access clone?"
One thing that should be made clear is that there is no such thing. "Experts" tend to regard this as a good thing, as MS Access tries to do so many things all at once that it cannot be particularly successful at them all.
MS Access combines several roles in one package: a data storage and retrieval engine, a report writer, and a GUI environment for building forms to browse and update data.
Most of the database tools that will be described here focus primarily on the first of these. By doing just storage and retrieval, they certainly gain in reliability.
Mind you, there would be considerable merit in having a single integrated environment for data access, writing reports, and providing GUIed access to update data. There are ongoing projects to provide those sorts of capabilities, but none are yet quite as "GUI-pretty and newbie-friendly" as MS Access.
There are a lot of database systems that run on Linux using SQL data access schemes.
These databases are often fairly "heavyweight", requiring considerable disk and memory resources and providing data access capabilities of considerable sophistication.
Since they provide a (somewhat) common query language, a suitably designed application can often be ported readily to a different database system, letting you take advantage of differing performance characteristics and avoid being forcibly dependent on any one vendor.
Unfortunately, virtually all SQL database systems offer one "extension" or another that tends to tempt developers to tune their applications specifically for one database system.
One of the benefits of using a reasonably "abstract" query language is that the database engine can do a lot of work for you. For instance, rather than having to write code (adding temporary variables, loop structures, and such) to walk through several tables, you may construct a more complex SQL query that lets the database engine join the tables together for you. The hopes typically expressed are that writing less code means introducing fewer bugs, and that the engine's query optimizer will execute the query more efficiently than hand-written navigation code would.
In practice, software bugs are sufficiently ubiquitous that you'll still need to do some debugging, and there are certainly some overheads in terms of the cost of parsing queries and submitting them to the DBMS engine.
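To make the join example concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers and orders tables are invented purely for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customers(id),
                             total REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (10, 1, 25.00), (11, 1, 14.50), (12, 2, 99.99);
    """)

    # No temporary variables or nested loops in the application; one query
    # asks the engine to join and aggregate the two tables for us.
    query = """SELECT c.name, SUM(o.total)
                 FROM customers c JOIN orders o ON o.customer_id = c.id
                GROUP BY c.name"""
    for name, total in conn.execute(query):
        print(name, total)

The engine's query planner decides how to carry out the join; the application never sees the intermediate steps.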
Other notable merits of SQL database systems include transaction handling with the classic "ACID" properties:
Atomicity: All transactions are either performed completely (committed), or are not done at all; a partial transaction that is aborted must be rolled back.
Consistency: The effects of a transaction must preserve required system properties. For instance, if funds are transferred between accounts, a deposit and a withdrawal must both be committed to the database, so that the accounting system does not fall out of balance. (A small sketch of this follows the list.)
In double-entry accounting, the "staying in balance" property is usually not overly difficult to maintain. The more thorny issue comes when the property is something like "Cash Balance Cannot Drop Below Zero", or "We can't ship inventory we don't have." In such cases, if you have two transactions being submitted concurrently, it could be that either could be accepted, but not both. If one of the transactions would cause balance requirements to be violated, the transaction management system needs to reject one of the transactions.
Isolation: Intermediate stages must not be made visible to other transactions. Thus, in the case of a transfer of funds between accounts, both sides of the double-entry bookkeeping system must change together for each transaction. This means that transactions appear to execute serially even if some of the work is done concurrently.
Durability: Once a transaction is committed, the change must persist, except in the face of a truly catastrophic failure.
If Mongol hordes ride through and lay waste to your server room (or, sadly more likely, an unfortunate plane crash takes place), you can hardly expect a transaction system to guarantee that all is well, but a good transaction processing system should be resistant to moderately traumatic sorts of system failures such as a network link breaking down or perhaps even something as traumatic as a disk drive malfunctioning.
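To make the atomicity and consistency points concrete, here is a small sketch, again using Python's sqlite3 as a stand-in for a full client/server DBMS; the accounts table and the transfer helper are invented for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is the engine enforcing "balance cannot drop below zero".
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY,"
                 " balance REAL CHECK (balance >= 0))")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("checking", 100.0), ("savings", 50.0)])
    conn.commit()

    def transfer(amount, src, dst):
        """Move funds between accounts; both updates commit or neither does."""
        try:
            with conn:  # begins a transaction; commits on success, rolls back on error
                conn.execute("UPDATE accounts SET balance = balance - ?"
                             " WHERE name = ?", (amount, src))
                conn.execute("UPDATE accounts SET balance = balance + ?"
                             " WHERE name = ?", (amount, dst))
        except sqlite3.IntegrityError:
            print("transfer rejected; balances unchanged")

    transfer(30.0, "checking", "savings")   # succeeds: both rows change together
    transfer(500.0, "checking", "savings")  # violates the CHECK; rolled back

The first transfer commits both updates as a unit; the second would drive the balance negative, so the engine rejects it and the rollback leaves both accounts untouched.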
There are really a lot of SQL database systems available to run on Linux. At one time, there was great excitement at the thought that some people at Oracle had an internal port they were fiddling with; today, almost any database vendor that offers a version on some Unix variation sells licenses for Linux. Pretty much the only major database vendor that doesn't offer a version on Linux is Microsoft. The top-tier industry names, Oracle, Sybase, Informix, and IBM DB2, are all available on Linux.
Development licenses are typically available inexpensively or even for free, but production licenses tend to be quite expensive. These systems tend to be "heavyweights" in terms of feature sets, use of memory and disk, and licensing costs. They provide robust access to large amounts of data, at considerable price; if you are building an "enterprise" system, they are the common choices.
They are not suitable for every purpose; other database systems often offer superior characteristics in one area or another.
These database systems tend to put the data in a single compact location, whether that be in files in a single directory hierarchy, or even in a single file.
If the database is to be used as part of an application, it is attractive for the data to stay highly localized; this contrasts with some of the DBMSes that, to maximize speed and robustness, manage raw disk partitions themselves.
A number of database systems are available under Free Software licenses such as the GPL. Most notable are MySQL and PostgreSQL, which are available pre-packaged for many Linux distributions, and which are widely used to support Web-based applications.
There are a number of others; they include "toy" databases as well as some that used to have proprietary licenses. Firebird was once Borland InterBase, and SAPDB was once called Adabas-D. These may become of greater interest in the future, once "systems integration" efforts get further along, but they are not yet being widely integrated with Linux distributions or with Free applications.
"Embedded" databases are intended to be embedded in applications, and tend to be designed with a view to being easy to install and to require little, if any, attention to administration or tuning.
These database systems store data primarily in memory, as compared to more traditional architectures that involve "paging" data from disk as needed. At first glance, this would appear a crippling reduction in robustness, but reality lies elsewhere, as nothing prevents these systems from being tremendously "paranoid" in logging updates to more permanent storage.
The point of the exercise here is not so much to provide robustness as it is to take advantage of the fact that memory space on modern computers has grown astoundingly. These systems assume that enough physical RAM is available to hold the entire database, which means that queries proceed without worrying about what is or isn't cached.
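The logging idea can be sketched in a few lines of Python; this toy key-value store is purely illustrative and glosses over nearly everything a real in-memory DBMS must handle:

    import json

    class TinyMemoryStore:
        """All reads are served from RAM; every write is appended to a log
        file first, so the state can be rebuilt after a crash."""
        def __init__(self, logfile):
            self.data = {}
            self.logfile = logfile
            try:  # replay the log to rebuild state after a restart
                with open(logfile) as f:
                    for line in f:
                        key, value = json.loads(line)
                        self.data[key] = value
            except FileNotFoundError:
                pass

        def put(self, key, value):
            # "Paranoid" write-ahead logging: persist the change before
            # applying it in memory. (A production system would also fsync.)
            with open(self.logfile, "a") as f:
                f.write(json.dumps([key, value]) + "\n")
            self.data[key] = value

        def get(self, key):
            return self.data[key]  # pure memory access; no disk I/O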
In-Memory databases should be particularly useful for applications like data warehousing, or for providing fast responses for things like catalog queries.
There are many, many commercial database systems available on Linux; for a good number of them, it is quite difficult to say why they should be considered interesting, as they amount to "Yet Another SQL DBMS conforming to some reasonable set of standards". Some have, as "claims to fame", the ability to do sophisticated text-oriented queries, or integration between a database server and a Web server.
It's not much use having a neat new database system if you have no way of querying or updating the information in it from a language you actually want to work in.
There are a number of common ways to access data in an SQL database. Some are standardized to the point that it is not difficult to plug in a different database system as needed; others are less so.
One common approach is "embedded SQL", where SQL statements are written directly into a host language such as C and translated by a preprocessor. This should allow programs to be written more compactly and to run more efficiently than would be the case with SQL/CLI. Unfortunately, it requires working in two languages simultaneously, and tends to be somewhat nonportable.
Perl programmers have a rich set of database modules built around DBI; aficionados of Python or Tcl can find similar sorts of libraries, though the somewhat more widespread popularity of Perl means that there are somewhat more options for it.
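On the Python side, the common calling convention is the DB-API; drivers for different engines share the same shape, so switching databases is largely a matter of changing the connect call. A minimal sketch using the built-in sqlite3 driver (the customers table is invented, and SQL dialects and parameter styles do still vary between drivers):

    import sqlite3
    # import psycopg2  # a PostgreSQL driver with the same DB-API shape

    def fetch_names(conn):
        # This function neither knows nor cares which engine is underneath.
        cur = conn.cursor()
        cur.execute("SELECT name FROM customers")
        return [row[0] for row in cur.fetchall()]

    conn = sqlite3.connect(":memory:")        # swapping engines largely means
    # conn = psycopg2.connect("dbname=shop")  # changing this one line
    cur = conn.cursor()
    cur.execute("CREATE TABLE customers (name TEXT)")
    cur.execute("INSERT INTO customers VALUES ('Ada')")
    print(fetch_names(conn))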
Note that this discussion has not even tried to address the issue of which database system is "fastest."
The problem is that evaluating this is genuinely daunting. People commonly claim that one DBMS or another is much faster than the others. Unfortunately, this sort of thing is very difficult to evaluate in a scientific manner. We might find that MySQL was providing vastly better performance than PostgreSQL, until an extra table key or buffer was added, at which point the tables might turn. Or you might find that PostgreSQL has some feature to support your application that no other database system offers, so it is the only option that actually performs acceptably.
Unfortunately, when it comes to benchmarks, everyone is pretty partisan. Database vendors have often been known to send developers out to work with hardware vendors to tweak performance on industry benchmarks, and there is some indication that vendors have even tuned database engines to be specifically aware of certain benchmarks, in much the same way that compiler vendors were once accused of writing code to recognize the Byte Magazine Prime Number Benchmark, and then generate hand-tuned assembly language. Faking a benchmark like that obviously goes well past what is reasonable, but things get less clear if you find that using some vendor-specific extension dramatically improves performance on some part of a benchmark.
The way that locking is used can be easily cited as a place where performance will vary; a major merit of the more sophisticated database systems is that they cope well with having many users working with and modifying data concurrently. Correctly handling that requires doing some locking of data against modification. There are several different granularities of locking, with different costs and benefits:
Table locks, where the entire table is locked against modification by others: there are times when this is reasonable or even necessary, particularly when data conversions or cleanup are underway, but if a system has a lot of users trying to do updates, they'll get extremely irritated at being blocked from doing their work.
Row locks, where only the particular rows being modified are held: this is definitely a lot less "antisocial" than the table lock; anybody being blocked by a row lock is likely trying to modify the very same bit of data as someone else. (Both granularities are sketched in code after this list.)
Unfortunately, the DBMS has to manage and track these locks, so behind the scenes, there will be a (probably invisible to the user) lock table with one entry for each row that is locked. If there are a lot of table rows being worked on, that's a lot of locks!
Page locks fall in between: in many database systems, space is allocated to tables in "pages", where a page is a fixed block (often 2K) of storage devoted to rows in a specific table. If each row is 80 bytes, a 2K page would hold about 25 rows.
Sybase would lock the whole page, which is less "antisocial" than locking the whole table, but if you're merely modifying one row, the lock might affect 25 times as much data as is necessary.
Careful design of the application can keep this from being a problem. For instance, users might queue updates and submit them to a centralized "transaction update" process. If that central process is the only thing updating the tables, the page contention goes away, and there may be a performance increase because this form of locking is a bit less expensive than row locking. But it does require careful application design...
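To make the granularities above concrete: the statements below are standard PostgreSQL, reached here through Python's psycopg2 driver; the stock table and its columns are invented for illustration.

    import psycopg2

    conn = psycopg2.connect("dbname=inventory")
    cur = conn.cursor()

    # Table lock: nobody else may modify 'stock' until we commit.
    # Reasonable for a bulk cleanup, antisocial in normal operation.
    cur.execute("LOCK TABLE stock IN EXCLUSIVE MODE")
    cur.execute("UPDATE stock SET on_hand = 0 WHERE discontinued")
    conn.commit()  # releases the table lock

    # Row lock: only the row being changed is held, so other sessions
    # can keep updating other rows concurrently.
    cur.execute("SELECT on_hand FROM stock WHERE item_id = %s FOR UPDATE", (42,))
    cur.execute("UPDATE stock SET on_hand = on_hand - 1 WHERE item_id = %s",
                (42,))
    conn.commit()

(psycopg2 opens a transaction implicitly on the first statement, so each commit here ends one transaction and releases its locks.)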
Another daunting issue is that a number of notable commercial database systems have, as a specific license clause, the condition that you are forbidden to publicly report performance benchmarks.