T-SQL Tuesday #136 - Your Favorite Datatype (Or Least Favorite)
Brent Ozar put out a call for blog posts about your datatype of choice. This might be a favorite (or least favorite) datatype. Naturally, varchar(max) and nvarchar(max) are going to take their punishment, and deservedly so. The datatype I'm not so sure will get the bludgeoning it deserves is varbinary(max). Sure, there are valid uses for the data type, but the ability to store just shy of 2GB of binary LOB data per value in a database carries some real downsides. Let's take a look at some of these pitfalls.
Varbinary, much like its data cousins varchar and nvarchar, allows for the storage of variable-length data. Unlike its cousins varchar and nvarchar, varbinary data is presented to the client as, one would expect, binary in form. The varbinary(max) specification allows for storage of up to 2^31-1 bytes (one byte short of 2GB) and is intended for values longer than 8000 bytes. Binary datatypes are useful for many use cases. A few examples come to mind: a hash of some value that you need to store, perhaps a secure string, a document hash, a query plan hash, and so on. Certain data warehouse designs call for hashes to be used for determining relationships. By no means is this list exhaustive, nor are any of these design patterns universally appropriate.
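As a quick illustration of the hash pattern (the table and column names here are hypothetical): HASHBYTES with the SHA2_256 algorithm always produces 32 bytes, so a fixed-width binary(32) column is actually a better fit than varbinary(max) for this use case.

-- Hypothetical table storing a SHA2_256 hash per document.
-- SHA2_256 output is always 32 bytes, so binary(32) is exact.
CREATE TABLE dbo.DocumentHash
(
    DocumentId   int IDENTITY(1,1) PRIMARY KEY,
    DocumentName nvarchar(260) NOT NULL,
    ContentHash  binary(32)    NOT NULL
);

INSERT dbo.DocumentHash (DocumentName, ContentHash)
VALUES (N'contract.pdf', HASHBYTES('SHA2_256', N'sample document contents'));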
There is one use case, though, that is overrepresented and poorly executed: varbinary for file storage. SharePoint used this model at one time, and other vendor applications have done it as well. This pattern is too easy a trap to fall into. The code to read and write a varbinary column is simple, and that simplicity may lull an unsuspecting architect into the belief that a single data repository is easier to manage. Depending on the circumstances, a single repository may be appropriate, provided that the downsides of the design have been examined.
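To show just how little code the trap requires, here is a minimal sketch. The table, file name, and path are hypothetical; OPENROWSET with the SINGLE_BLOB option reads an entire file as a single varbinary(max) value.

-- Hypothetical file-storage table.
CREATE TABLE dbo.StoredFile
(
    FileId   int IDENTITY(1,1) PRIMARY KEY,
    FileName nvarchar(260)  NOT NULL,
    Content  varbinary(max) NOT NULL
);

-- Writing: the entire file arrives as one column named BulkColumn.
INSERT dbo.StoredFile (FileName, Content)
SELECT N'invoice-2021-03.pdf', BulkColumn
FROM OPENROWSET(BULK N'C:\Files\invoice-2021-03.pdf', SINGLE_BLOB) AS f;

-- Reading: a plain SELECT streams the binary back to the client.
SELECT Content
FROM dbo.StoredFile
WHERE FileId = 1;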
There is a linear relationship between the average "file" size, the number of records stored, and the total storage requirement, and that linear relationship is the inherent danger behind this design pattern. Imagine a hypothetical application in which a workflow generates a file. Many applications across industries have this pattern for various legal, regulatory, or business requirements. Where the single storage repository breaks down is in the volume the workflow generates combined with the file sizes. If the workflow is high volume, storage growth will necessarily be steep in proportion to the average file size. Likewise, if the average file size is large but the workflow is low volume, storage growth will still be steep. In math terms, total storage is roughly the product of workflow volume and average file size, so growth climbs whenever either input climbs. The steeper the growth, the more painful the application design can become.
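For a concrete, made-up sense of scale: 10,000 files a day at an average of 500KB each adds roughly 5GB a day, or about 1.8TB a year, before indexes and backups. Both inputs to that product are easy to measure directly. Here is a sketch against the hypothetical dbo.StoredFile table from above:

-- The two inputs to the growth equation: record count (workflow volume)
-- and average blob size. Total storage is roughly their product.
SELECT  COUNT(*)                                              AS FileCount,
        AVG(DATALENGTH(Content)) / 1024.0                     AS AvgFileKB,
        SUM(CAST(DATALENGTH(Content) AS bigint)) / 1048576.0  AS TotalMB
FROM dbo.StoredFile;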
So far, I’ve discussed the varbinary(max) issue in terms of raw storage. Storage prices have been dropping for years, so why should you care about wasted space in the database? The answer is that database storage is often the single most expensive storage within the enterprise. To compound matters, every extra byte adds a small but measurable overhead to integrity checks, backups, statistics updates, and other routine maintenance. By extension, if backups take longer to complete, so do restores, which can jeopardize RTO. If your company is planning a migration to the cloud, or is already there, you can directly measure the cost these wasteful choices may be causing you. And if none of those reasons were enough by themselves, varbinary(max) values stored off-row get no benefit from row or page compression. This means bytes wasted on disk are bytes wasted in RAM, potentially meaning the need for larger hardware.
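If you want to see how much of a table's footprint sits in LOB pages, where row and page compression will never reach, the system catalog can tell you. This sketch uses real catalog views (sys.allocation_units, sys.partitions, sys.objects); only the results depend on your own tables.

-- Space consumed by LOB pages per table, in MB.
-- For allocation unit type 2 (LOB_DATA), container_id joins to partition_id.
SELECT  o.name                          AS TableName,
        au.type_desc                    AS AllocationType,
        SUM(au.used_pages) * 8 / 1024.0 AS UsedMB
FROM sys.allocation_units AS au
JOIN sys.partitions AS p
      ON au.container_id = p.partition_id
     AND au.type = 2                    -- 2 = LOB_DATA
JOIN sys.objects AS o
      ON p.object_id = o.object_id
WHERE o.is_ms_shipped = 0
GROUP BY o.name, au.type_desc
ORDER BY UsedMB DESC;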
To summarize, varbinary(max) isn't a bad datatype (it's just misunderstood), but it does get, and deserves, a bad reputation from poor design decisions. Designs that seem innocuous at first conception can grow into monsters that are hard to undo.