How to version your data keys in a NoSQL storage environment

A question came across my plate today that intrigued me, and I wanted to share some of the discussion. The question was “how do you version your keys in a NoSQL data environment?” An example use case is releasing a new version of a feature that needs the existing data but will save it back in a new format, such as changing the data from a string to an array or vice versa.

At its simplest, a NoSQL data store is a key/value storage engine. In a general sense there is only one column in the database, and the schema for that column doesn’t change (there are actually several columns behind the scenes, but the rest are general metadata). How, then, do you version the NoSQL data layer?

In this post let’s use the following example: Version 1 of your User Login History might simply save the last login timestamp as a simple string.  You are about to release Version 2 of the Login History feature which will need to save the last 10 login timestamps as an Array.

Without a versioning system in place for your data keys, you would need to brute-force the problem: each time you read the key, you would check whether the stored value is a string or an array and then act accordingly.
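To make the brute-force approach concrete, here is a minimal sketch in Python. The in-memory dictionary standing in for the data store, the key name “login_5005”, and the function name are all assumptions for illustration, not part of any particular NoSQL client.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for a key/value store.
store: Dict[str, Any] = {"login_5005": "2013-04-01T09:30:00"}


def get_login_history(user_id: str) -> List[str]:
    """Brute-force read: inspect the stored value's type on every access."""
    value = store.get(f"login_{user_id}")
    if value is None:
        return []
    if isinstance(value, str):
        # Version 1 format: a single timestamp string.
        return [value]
    # Version 2 format: a list holding up to the last 10 timestamps.
    return value[-10:]
```

Note that the type check runs on every call, no matter how many times the same key has already been read.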

Problem #1: The evaluation would occur every time the key is accessed. This results in a lot of wasted processing, depending on how often the key is accessed.

Problem #2: The upgraded data in essence replaces the older data. There is no way to go back to the older format if, for example, the upgraded feature is not working and a rollback needs to occur.

Keep in mind this is just a simple example to illustrate a point (and the brute-force approach is pretty lightweight here). However, if the feature change is more complex, this brute-force approach can easily cripple or damage a system.

How, then, do you version your data in a NoSQL environment? The short answer is to include a version string in the key name.

In many NoSQL environments, teams store all of a user’s data as one giant blob. The key name is then simply the unique identifier for the user, such as a user id or profile id. If my user id were “5005”, my key name might literally be “5005”.

In other NoSQL environments, teams use a multi-key approach, with multiple keys for different data points about the same user, perhaps separated by the major functions of the site. For example, you may have your primary user profile data stored in a key named “profile_5005” while your login history is stored in “login_5005”, and so forth.

Regardless of which situation you are in, it is very easy to add a simple version string to the end of your key names. Using our example of migrating from version 1 to version 2, our key names might look like “login_5005_1” and “login_5005_2”. We simply attach a version identifier to the end of the key name.
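Building these key names in one place keeps the format consistent across the code base. A small sketch, where the helper name versioned_key and its parameters are hypothetical:

```python
def versioned_key(feature: str, user_id: str, version: int) -> str:
    """Build a key name like 'login_5005_2' from its three parts."""
    return f"{feature}_{user_id}_{version}"


old_key = versioned_key("login", "5005", 1)
new_key = versioned_key("login", "5005", 2)
```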

So what are the benefits? In your code you simply need to check whether the upgraded key “login_5005_2” exists in the data store. On the first request the lookup will fail, which is by design. You then know to retrieve the old version of the data, perform any appropriate “migration” routines on it, and save it back to the data store under the new key. On subsequent requests the new data is retrieved on the first attempt.
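That read path might look something like the sketch below. It assumes a plain dictionary in place of a real data store, and the names migrate_v1_to_v2 and get_login_history are made up for this example.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for a key/value store,
# seeded with one user's version 1 data.
store: Dict[str, Any] = {"login_5005_1": "2013-04-01T09:30:00"}


def migrate_v1_to_v2(old_value: str) -> List[str]:
    """Version 1 held one timestamp string; version 2 holds a list of them."""
    return [old_value]


def get_login_history(user_id: str) -> List[str]:
    new_key = f"login_{user_id}_2"
    old_key = f"login_{user_id}_1"

    current = store.get(new_key)
    if current is not None:
        # Normal path after the first request: a single get call.
        return current

    # First request for this user: the new key is missing by design.
    old_value = store.get(old_key)
    history = migrate_v1_to_v2(old_value) if old_value is not None else []

    # Save under the new key only; the version 1 key is left untouched.
    store[new_key] = history
    return history
```

The migration cost is paid once per user, and the version 1 key still holds the original string afterwards.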

The benefits of this approach are speed and data integrity. Using this method you take the expense hit of migrating the data once, and from then on it is always a normal get call. This approach does not require you to check data types on each request or perform the migration over and over. In addition, you get the data integrity benefit of keeping the old data in its pre-migrated state. If problems occur with the new version of the feature, you can revert to the old state with ease. Once your migration is complete and you are confident in the new version, you can delete the old keys from the store.
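Rollback and cleanup then come down to deleting one key or the other. Here is a rough sketch, again using a dictionary as a stand-in for the store; the function names rollback and cleanup are hypothetical.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-in for a key/value store,
# holding both versions for one already-migrated user.
store: Dict[str, Any] = {
    "login_5005_1": "2013-04-01T09:30:00",
    "login_5005_2": ["2013-04-01T09:30:00"],
}


def rollback(user_id: str) -> None:
    """Discard the version 2 copy; the version 1 data was never modified."""
    store.pop(f"login_{user_id}_2", None)


def cleanup(user_id: str) -> None:
    """Delete the version 1 key, but only if the version 2 key exists."""
    if f"login_{user_id}_2" in store:
        store.pop(f"login_{user_id}_1", None)
```

One thing to keep in mind is that anything written only to the version 2 key after the migration is lost on rollback, so how long you keep the old keys around is a judgment call.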

Another bonus of this versioning system is that it makes A/B testing very easy. You can run half of your user population on version 1 of the data and the other half on version 2, then settle on whichever version of the feature performs better in the long run. Since they are two different keys behind the scenes, you are able to switch between them at will.
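One possible way to split the population is to hash the user id into a bucket and pick the key version from that. This is only a sketch under my own assumptions: the function names and the v2_percent parameter are invented here, and hashing is just one of several ways to assign users to groups.

```python
import hashlib


def data_version_for(user_id: str, v2_percent: int = 50) -> int:
    """Deterministically assign a user to version 1 or version 2."""
    # Use a stable hash so a user lands in the same group on every request
    # (Python's built-in hash() for strings can vary between processes).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return 2 if bucket < v2_percent else 1


def login_key(user_id: str, v2_percent: int = 50) -> str:
    """Return the versioned key this user should read and write."""
    return f"login_{user_id}_{data_version_for(user_id, v2_percent)}"
```

Setting v2_percent to 0 or 100 sends everyone to version 1 or version 2 respectively, which is one way to do the final switch.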

