Do Not Go Gentle into that “Good Enough”
At one of our company lunches, my colleague Paul Brown commented (somewhat in a rage) about how foolish engineering choices can sometimes come back to bite you (or your users) hard. In addition to being a noted database developer-geek, Paul is an avid online gamer. He claims to have put more bugs in databases beginning with “I” (Ingres, Illustra, Informix, IBM) than anyone else, and to have “pwnd newbs” from Azeroth to Zelda’s Hyrule. (For you non-gamers out there, apparently that actually means something.) We can wonder how many of Paul’s bugs persist, but he had a more interesting point about the perils of “eventual” consistency and of relying on “reliable” file systems. This is a world where “good enough” is just not good. He was rather passionate about these perils because such shortcomings had cost him dearly.
What does all this have to do with a Paradigm4 blog post on SciDB? Well, it turns out that the issue Paul was so passionate about (and that fellow online gamer Curt Monash encountered) is precisely what drives users to Paradigm4 and SciDB. Curt points out that in Elder Scrolls Online, “there’s been a major bug in which players’ ‘banks’ shrank, losing items and so on. Days later, the data still hasn’t been recovered.”
Real DBMS platforms make strong quality-of-service guarantees about the data they manage. File systems, even “highly reliable” file systems, don’t. Some modern data management platforms throw up their hands at these fundamentally hard engineering problems and opt for “eventual” consistency (which in practice means, “not consistent right now, maybe come back later and things will be better”). And if your data is corrupt or inconsistent or incorrect, it doesn’t matter how good your analysis algorithms are. Wrong answers are…wrong.
Many users overlook the importance of this. Regrettably, it’s a point most appreciated by those who have learned the lesson firsthand. So, in the interest of saving some of us the pain of firsthand learning, I asked Paul if the rivers of his passions run deep enough that he would blog on this important topic. Read what Paul had to say.
How Poor Engineering Choices Turned a Dragon Knight into a Newbie
So. Here we go again.
Curt Monash, on his DBMS2 blog, last week burnt off what I am going to guess (reading between the lines) was an ample portion of gamer rage about what he (rightly) sees as the poor engineering choices a lot of software engineers make concerning data management in general, and Elder Scrolls Online’s sucky persistence story in particular.
Hell hath no fury like a Dragon Knight who took an arrow to the inventory, I suppose.
Heartfelt as Curt’s feelings undoubtedly are, the technical point he makes is important and profound, and it helps explain a principal early design choice we made with SciDB. In a world where software developers debate the relative merits of file formats like JSON, XML and CSV, and store them on distributed file systems like HDFS, we at Paradigm4 chose to go the extra ten miles to build a storage manager that provides strong quality-of-service guarantees.
Why in Tamriel would we do that? After all, isn’t SciDB supposed to be a computational platform? Well, it’s an ACID database too, and consequently:
1. Every change a user makes in SciDB, whether it’s adding new data or updating existing data, leaves the database in a state that’s always compliant with whatever rules the user wants to enforce. And that means whatever analysis you’re doing is addressing self-consistent, timely information, and that data is never lost.
The opposite, “who needs rules?” mentality explains why online games are so bedeviled by “cloning” bugs and glitches. If there’s only ever supposed to be one of something, then tell the database that if it sees more than one, it should say something! Take this simple precaution, and you won’t see toons dueling on dragons two hours after launch. [Ed: Paul refers here to bugs, arising from a lack of database rules, that allow players to multiply the number of items in their inventory.]
2. Dibella can be writing data to the database at the same time that Mara is reading from it. This happens frequently when you have multiple users, but even a single user will experience this if an automated process is also writing to or accessing the database. And further, if Mara and Arkay are both reading at the same time, their views are consistent with one another’s, regardless of what Dibella’s up to. No need for data in your HDFS repository to be unavailable while today’s data dump is loaded (however long that takes, and that could be a while). And no need to create copies of data so that multiple users can each work with it without standing on each other’s toes.
Why is this so important, you ask? This “file forking” is exactly what happens when users work with manually updated files. It was forked Excel worksheets that cost JP Morgan about $2 billion and their London Whale his job.
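To make the first point concrete, here is a minimal sketch of "telling the database" that an item may exist only once. It uses Python's built-in sqlite3 module purely for illustration; it is not SciDB, and the `bank` table and the `make_bank`, `deposit` and `count_items` helpers are hypothetical names invented for this example.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with one rule: each item exists once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bank ("
        " item_id INTEGER PRIMARY KEY,"  # the rule: no two rows share an item_id
        " owner TEXT NOT NULL)"
    )
    return conn


def deposit(conn, item_id, owner):
    """Try to store an item. Return True if stored, False if the rule rejected it."""
    try:
        # 'with conn' commits on success and rolls back if an exception is raised.
        with conn:
            conn.execute(
                "INSERT INTO bank (item_id, owner) VALUES (?, ?)",
                (item_id, owner),
            )
        return True
    except sqlite3.IntegrityError:
        # The database "said something": a second copy of the item was refused.
        return False


def count_items(conn):
    """Count the rows currently stored in the bank."""
    return conn.execute("SELECT COUNT(*) FROM bank").fetchone()[0]


if __name__ == "__main__":
    conn = make_bank()
    print(deposit(conn, 1, "Dibella"))  # first copy is accepted
    print(deposit(conn, 1, "Mara"))     # cloned copy is rejected
    print(count_items(conn))
```

The design point is that the rule lives in the database rather than in application code, so a buggy client cannot clone the item no matter how it tries.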
Of course, we designed SciDB with the ultimate goal of providing a very large-scale computational analytics platform. So what we have implemented is deliberately simple. It’s basic no-overwrite, multi-version concurrency control that creates redundant data copies to ensure reliability. We’re not an online transaction processing system. Use something else to build your MMORPG. (But please… use something.)
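The idea behind no-overwrite, multi-version storage can be sketched in a few lines. This is a toy illustration under heavy assumptions, not SciDB's storage manager: the `VersionedStore` class and its methods are hypothetical, it lives entirely in memory, and it copies the whole data set on every write, which a real system would avoid.

```python
import threading


class VersionedStore:
    """Toy no-overwrite store: every write appends a new version."""

    def __init__(self, initial):
        self._versions = [dict(initial)]  # version 0; never modified afterwards
        self._lock = threading.Lock()

    def snapshot(self):
        """Return the id of the latest committed version."""
        with self._lock:
            return len(self._versions) - 1

    def read(self, version, key):
        """Read a key as of a given version, ignoring any later writes."""
        with self._lock:
            return self._versions[version].get(key)

    def write(self, changes):
        """Commit changes as a new version; older versions stay untouched."""
        with self._lock:
            new_version = dict(self._versions[-1])  # copy, never overwrite
            new_version.update(changes)
            self._versions.append(new_version)
            return len(self._versions) - 1


if __name__ == "__main__":
    store = VersionedStore({"gold": 100})
    mara_view = store.snapshot()       # Mara starts reading
    store.write({"gold": 50})          # Dibella writes in the meantime
    print(store.read(mara_view, "gold"))         # Mara still sees her snapshot
    print(store.read(store.snapshot(), "gold"))  # a new reader sees the update
```

Because old versions are never changed, Mara and Arkay can read the same snapshot and get the same answer while Dibella's write lands in a new version.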
However, anyone in the business of building value for an enterprise through data (or just building an empire) is going to be better off getting the basics right. That means more than just fancy algorithms, analytics and visualization. It means starting at the beginning by making bedrock assurances about your data’s correctness and consistency.
Strong consistency matters, Highness. Anyone who says differently is selling something.