November 08, 2009

my War On SQL

Around three weeks ago, I had a data disaster. In a surprise attack, two weeks' worth of my SQL database was wiped out, right after my first weekend demo of some new code. The horror!

On that Monday morning I went into shell-shock for hours as I tried to trace where the results of the demo -- the very first review of my panzer code -- had disappeared to. By 11:00 there was no answer, and the finger of blame pointed squarely at some SQL database snafu. The decision was reached to replace the weaponry with the tried and trusty blades of all wars previous: flat files, the infantry of data. By the end of the day, the code was written to rebuild the vanguard from its decimated remains, and the next day, work-outs and field exercises proved the results. Two tables entirely replaced, reliably.

That left the main body, a complex object split across many tables, and a rearguard of sundry administrative units. It took another week to write the object saving & restoring framework, including streaming, model objects along the lines of MVC for each element, the conversions between them, and unit testing. (I need a name for this portable object pattern. It came from SOX and I didn't think much of it at the time, but it seems nobody else does it.) Then, some days of unit tests, package tests, field tests, and so forth. Finally, four or five days of reworking the application to use the object database methods, not SQL.
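
For the curious, the shape of that portable object pattern is roughly this: each model object knows how to stream itself to and from a single line of text, and the flat-file store does nothing more than write and read those lines. A minimal sketch, in Python with illustrative names (the real framework is not shown here):

    import json

    class Model:
        """Base class: subclasses declare FIELDS and get streaming for free."""
        FIELDS = ()

        def to_line(self):
            # Stream the object out as one line of text.
            return json.dumps({f: getattr(self, f) for f in self.FIELDS})

        @classmethod
        def from_line(cls, line):
            # Restore the object without running its constructor.
            obj = cls.__new__(cls)
            for f, v in json.loads(line).items():
                setattr(obj, f, v)
            return obj

    class Payment(Model):
        FIELDS = ("payment_id", "amount", "payee")

        def __init__(self, payment_id, amount, payee):
            self.payment_id, self.amount, self.payee = payment_id, amount, payee

    # Saving appends one line per object; restoring parses it back.
    with open("payments.dat", "a") as store:
        store.write(Payment("p1", 100, "alice").to_line() + "\n")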

Sixteen days later, up and going. The army is on the march; SQL is targeted, acquired, destroyed. Defeated, wiped off the map, no longer a blot on the territory of my application. Two days of mop-up and I'm back to the demo.

Why go on a holy crusade against SQL? There are several motives for this war:

  • Visibility. Like all databases, SQL is a black box. How exactly do you debug an application sitting on a black box?
  • Visibility #2. When it goes wrong, the caller/user/owner is powerless to fix it. It's a specialist task to even see inside.
  • Generalist. This database is lightweight; indeed, it's called SQLite. The point isn't that it is "worse" than the rest, but that it will have to be replaced one day. Why? All of these things are generalists and cannot cope with narrow, precise requirements. In other words, they look good on the parade ground, but when the shooting starts, the only order is "retreat!"
  • Achilles heel: backups have to be manually created, recovered, and tested on a routine basis. Yet this never gets done, and when the data corrupts, it is too late. Making backups easy is the #1 priority of all databases. Did you know that? Funnily enough, neither did any of the database providers.

And then there's the interface. Let us not shame the mere technical _casus belli_ above; let us put the evil that is SQL in a section of abomination, all its own:

SQL is in the top 5 of the computing world's most awful anachronisms. It's right up there with ASN.1, X.509, LDAP, APL, and other embarrassments to the world of computer science. In this case, there is one reason why SQL stands out like the sideways slice of death of a shamed samurai: data! These things, SQL included, were all designed when data was king, when we all had to bow before the august power of our corporate bytes, while white-suited priests inserted the holy decks and the tapes of glory into the bright shining temples of mainframe enlightenment.

But those imperial times are over. The false god of data was slain, discarded and buried, in the glorious & bloody revolution of the Object, that heathen neologism that rose up and enslaved data during the early 1980s. Data-only is slain; data is dead. Data is captured, enslaved, owned. It is now objects, objects, objects. Data, let out from its rightful context of semantic control, is an immoral ghost of its former self.

These are the reasons why I leapt to the field to do battle to the death with the beast that is SQL. My report from the field is as follows:

  • Operating from a flat-file data store is slower and faster, depending on the type of action. Direct actions by my code are slower than SQL, but complicated actions are quicker. Overall, we are talking 10-50ms, so it is all in the "whatever" rating.
  • Code overall is dramatically simpler. There is no need to emasculate the soul of one's object model. Simply store the object, and get back to the fight. The structure of the code, the design, is simplified, as the inherently senseless interface of OO to SQL is gone; the code is now more OO, more top-to-bottom.
  • Joins have to be done in main code. This is an advantage for the coder, because the coder knows the main code, and the main language.
  • Debugging is much easier, because the data can be read, on those occasions when that is necessary, and the data can be seen, which is necessary all the time.
  • Object transactions are trivial. Cross-object transactions are tricky. This forces the general in the field to be much more balanced.
  • No data is ever lost. At least, in my design, it can't be lost by any action of the code, as everything is append-only (see the sketch after this list).
  • It uses about five times more space on the battlefield of your diskspace. Another "whatever..."
  • The code length is about the same. What was additionally required in database code (1kloc) was balanced by the awfully complicated SQL interfacing code that melted away. The "Model" code, that is, the objects to be saved, is an additional 1kloc, but that is required anyway for clean OO design.
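
To make the append-only and join points concrete, here is a minimal sketch of the store, in Python with illustrative names (not my actual code): updates are only ever appended, the latest record wins on replay, and a join is just a loop in the main language.

    import json

    # Updates are only ever appended; nothing is overwritten, so no
    # earlier data can be destroyed by any action of the code.
    def append_record(path, key, fields):
        with open(path, "a") as f:
            f.write(json.dumps({"key": key, **fields}) + "\n")

    # The current state is recovered by replaying the file: later
    # records simply shadow earlier ones.
    def load_latest(path):
        latest = {}
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                latest[rec["key"]] = rec
        return latest

    # A "join" in main code is nothing more than a loop over dicts:
    def payments_with_accounts(payments, accounts):
        return [(p, accounts[p["payee"]])
                for p in payments.values() if p["payee"] in accounts]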

I continue to mop up. What's the bottom line? Control. The application controls the objects and the objects control their data.

So what went wrong with SQL? Fundamentally, it was designed in a day when data was seriously different. Those days are gone; now data is integrally related to code. It's called "object oriented", and even if you don't know how to do it, it is how it is being done to you. The object owns the data, not the database. Data seen naked is an abomination, and SQL is just rags on a beast; it is still a beast.

Sadly, the world is still a decade or two away from this. And, to be fair, a hat-tip to Jeroen van Gelderen for the OO database he designed and built for Webfunds. Using that was a lesson in what a delight it is to be OO all the way down to the disk-byte.

Posted by iang at November 8, 2009 02:53 AM
Comments

There was an article around mid-90s about how rdbms & sql set the computing business back (at least) 20 yrs.

disclaimer: i was in sjr in the 70s & 80s as well as handling some of the technology transfer to endicott for sql/ds ... misc. past posts mentioning original rdbms/sql
http://www.garlic.com/~lynn/subtopic.html#systemr

this is an old post about a jan92 meeting in ellison's conference room ... one of the people mentioned claimed to have done much of the technology transfer from endicott back to stl for (mainframe) db2.

as an aside, two other people also named at the same meeting ... later left and showed up at a small client/server startup responsible for something called commerce server (the startup had also invented this technology called "SSL" that they wanted to use). We were brought in to consult because they wanted us to do payment transactions ... the result is now frequently referred to as "electronic commerce"

during those early years of electronic commerce, RDBMS-based webservers tended to have a significantly larger number of problems ... use of an rdbms significantly increased the skill level and effort required, and the added complexity significantly increased the probability of mistakes, security vulnerabilities ... you name it ... it was always much worse.

i do some playing with sqlite3 because it is used by mozilla ... and i do a whole bunch of processing with the firefox files (outside of firefox, and back up the information in various non-relational formats)

Posted by: Lynn Wheeler at November 7, 2009 10:53 PM

"Cross-object transactions are tricky."

I use a "parallel update" routine for changing multiple files atomically. You give it a set of distinct file names, and for each file name you specify both a new and an old value. The routine automatically locks the files in ASCII order to avoid deadlock, then verifies that all the files still have their specified old values. If so, then all the new values are written and the locks released. If not, then no changes are made, and the caller must repeat its logic from scratch. The caller does this in an infinite loop until the parallel update finally succeeds. Statistically it will *always* eventually succeed, but you can abort after 1000 tries or so if you like.

This works extremely well in a highly concurrent system. I have a Perl test suite where you specify any number of processes (e.g. 3000) and any number of files (e.g. 5). Each process chooses two files at random, subtracting 1 from one file and adding 1 to the other. Each process loops until its parallel update succeeds. Often the update succeeds on the first try, but clashes occur more frequently with a larger number of processes and a smaller number of files.

After all the child processes are done, the main test program verifies that the sum of all the file values is precisely 0. (Some will be negative, some positive.)

The parallel update routine also does convenient things like auto-creating directory paths when it needs to create a new file. I also have it automatically delete a file when it is set to a null value, and delete any enclosing directories which become empty as a result. That way you never have to worry about "mkdir" and "rmdir".

I plan to use this technique in the Loom.cc software soon. Right now loom.cc uses a single locked GDBM file (a simple key-value store written by the good folks at GNU). Using parallel update will allow massively concurrent updates with no single point of locking. In a real system clashes are rare -- but utterly disastrous if they do occur and you don't handle them correctly. Parallel update is a simple and powerful concept here.

When I roll this out at loom.cc you can look at the source code and see the "update" routine in all its glory, along with the demanding stress test. (It's written in Perl, but "beautiful" Perl, not script-kiddie Perl.)

Posted by: Patrick Chkoreff at November 8, 2009 08:45 AM

You forget, sir, our old conversation re securing the communications between app and DB... No secure JDBC/ODBC drivers afaik... And the dependency on the black box to secure its end of the deal...

Posted by: AC2 at November 10, 2009 06:52 AM

for the fun of it:

Developers: The NoSQL Ecosystem
http://developers.slashdot.org/story/09/11/09/2335214/The-NoSQL-Ecosystem

and

NoSQL Ecosystem
http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/

for something completely different ... an old post in comp.database.theory on 3-value logic
http://www.garlic.com/~lynn/2003g.html# How to cope with missing values - NULLS?

now there is something of a dispute between rdbms and xml databases. the original markup language, GML, was invented at the science center in 1969 (precursor to sgml, html, xml, etc). GML (generalized markup language) actually stands for the first letters of the last names of the inventors. the science center was also responsible for early virtual machine systems (cp40, cp67, vm370 ... gml originally ran on cms under cp67). misc. past posts mentioning the science center
http://www.garlic.com/~lynn/subtopic.html#545tech

some number of people transferred from the science center to sjr ... where the original rdbms/sql implementation went on under vm370 ("L" from gml did some amount of work on "blobs" in the r-star time-frame).

Posted by: Lynn Wheeler at November 10, 2009 02:03 PM

...
Some people object to the NoSQL term because it sounds like we’re defining ourselves based on what we aren’t doing rather than what we are. That’s true, to a degree, but the term is still valuable because when a relational database is the only tool you know, every problem looks like a thumb. NoSQL is making people aware that there are other options out there. But we’re not anti-relational-database for when that really is the best tool for the job; it’s “Not Only SQL,” rather than “No SQL at all.”
...

Posted by: NoSQL Ecosystem at November 10, 2009 03:06 PM

"Applications traditionally changed faster than relational data, but social data changes faster than applications"

Posted by: seen on the net by Gunnar at November 11, 2009 09:31 AM

Have people forgotten the real reason why we had to "normalise", not the later excuse?

Back in 1980, core memory was measured in Kbytes, and magnetic media likewise, or, if you were very wealthy, Mbytes. As for CPU speed, well, that was still measured in MHz, and as for flops, you went to second silicon in the form of lookup tables or maths coprocessors.

Normalisation is, and always will be, a form of compression. "Data" is compressible for many reasons, but... in compressing by normalisation you optimise out some data relations in favour of others.

It is why flat-file DBs still have advantages over relational systems when it comes to either complex or ad hoc enquiries.

Have a look at some SQL and you will see that the first thing being done is to re-build to get at the relationships that were normalised out...
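
A toy illustration of the point, in Python (purely illustrative):

    # The flat form repeats the customer name on every order; the
    # normalised form stores each name once, but the flat view must
    # then be re-built with a join.
    flat_orders = [
        {"order": 1, "customer": "Alice Smith", "item": "sword"},
        {"order": 2, "customer": "Alice Smith", "item": "shield"},
        {"order": 3, "customer": "Bob Jones", "item": "helmet"},
    ]

    # Normalised: the repeated names are "compressed out" to a table.
    customers = {1: "Alice Smith", 2: "Bob Jones"}
    orders = [
        {"order": 1, "cust_id": 1, "item": "sword"},
        {"order": 2, "cust_id": 1, "item": "shield"},
        {"order": 3, "cust_id": 2, "item": "helmet"},
    ]

    # Re-building the flat view is exactly the join the SQL does first:
    rebuilt = [{"order": o["order"],
                "customer": customers[o["cust_id"]],
                "item": o["item"]} for o in orders]
    assert rebuilt == flat_orders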

Yes, OO does have some advantages over relational, but the problem has been shifted, not removed. The problem now is methods...

In 20 years I expect we will be having a conversation about "My War with OO". And guess what: the "flat file DB" will still be there, faithfully holding it all up. As was once said, "the more things change, the more they stay the same" ;)

Posted by: Clive Robinson at November 13, 2009 10:00 AM

I started out programming in the real world on DEC's VMS using both DEC's Basic (a nifty language) and C (just for low level stuff).

I miss the built-in RMS, or Record Management Services. It was a sort of flat-file scheme that DEC provided interfaces to. I was in a multi-project/multi-programmer team and it worked quite well for us. Good times.

Posted by: Purpleslog at May 14, 2010 02:58 PM

FiteClub, a London finance-tech evening get-together, says:

"NOSQL is presently a hot topic, with developers finding that traditional relational database management systems aren't always appropriate to their needs. An emergent definition is that NOSQL should stand for Not Only SQL, with databases that bring the best of both worlds - NOSQL for the speed and scalability of eventually consistent architecture, and SQL for where it's needed to integrate with legacy systems (or just the query language of choice for developers on the edge of the system)."

Posted by: FiteClub at June 1, 2010 05:01 PM