Indexing Bit Columns

In previous versions of SQL Server (prior to 2000), you were unable to index a column with a bit datatype ie: a true/false, yes/no, boolean value. In Books Online for SQL Server 2000, under the description of the bit data type, it still said:
Remarks

Columns of type bit cannot have indexes on them.

Microsoft® SQL Server™ optimizes the storage used for bit columns. If there are 8 or fewer bit columns in a table, the columns are stored as 1 byte. If there are from 9 through 16 bit columns, they are stored as 2 bytes, and so on.

Well it turns out that you can. Apparently it’s covered in an update to Books Online and it did get changed in SQL Server 2000.

One of the reasons we have always avoided bit data types is because we couldn’t build indexes on them. Yes, I know most of the Microsoft documentation says it’s pointless because there are only two possible values but they always assume an even distribution eg: Male vs Female. We often have boolean values like Finalised/Unfinalised that might have less than 1% of the values unfinalised. In that case, you’d certainly be sad if you didn’t have an index on the column.

There are other issues though. As mentioned in the quote above, a bit column may well end up consuming a byte anyway and I’m guessing that masking out a bit from a byte is going to be slower than just using the byte. I’ll bet when they get used in an index they end up using a byte regardless.

Then there are development environments, some of which store -1 for True, some of which store 1. There are even tools from Microsoft that do each of these. If we use smallint, it doesn’t matter unless we are indexing the column and then only if we’re interested in the True values, not the False values. In our case, we’re after the False values most of the time anyway.

 

Static/Shared Methods

I had been disappointed that when you use a CLR-based UDT, you didn’t have access to shared/static methods, only instance methods. In the zVarchar example, I thought it would be good to be able to use the following:


DECLARE @TextVar zVarchar


SELECT @TextVar.Compress(‘Some value           I       want        to          compress’)


This would be similar to how you’d use Decimal.Round() in the framework.


Well, Venkatesh (from MS) pointed out to me that you can. The shared/static methods are available via the older :: syntax directly off the class name. That’s the syntax that was used for all method calls in Beta 1. So you can do the following:


SELECT zVarChar::Compress(‘Some       value     I        want     to           compress’)


That’s even better as it doesn’t require an instantiation.

zVarchar example

As mentioned in a previous entry, I decided to try to build a compressed text data type as an example of a UDT. So here it is. I built a datatype called zVarchar. To keep things as a reasonable demo, I didn’t incorporate complex compression, just a simple run length encoding scheme and built it in VB so most should be able to read it. But it’s interesting. If you want to try it, start a new SQL Server project in VS2005, add a User Defined Type and include the code below. After building and deploying it, you can do the following:


CREATE TABLE GregTest(


    RecID int IDENTITY(1,1),


    TextValue zVarchar


)


I then tried storing values in it. On small varchar values, the overhead of the UDT meant the data was actually larger than a varchar. But, I tried some examples with large varchar values (around 1000 characters per row and about 40% compressible) and got good results. My table using the zVarchar data type was around 50% of the size of a table using varchar, with 10,000 rows in the table. To keep it interesting, I added some additional functions to allow access to the compressed value, the length (both compressed and uncompressed), etc.


You can then execute queries like:


SELECT TextValue.CompressedLength() FROM GregTest


SELECT AVG(TextValue.CompressionPercentage()) FROM GregTest


Enjoy!


Imports System
Imports System.Data.Sql
Imports System.Data.SqlTypes
Imports System.Text


<Serializable()> _
<SqlUserDefinedType(Format.SerializedDataWithMetadata, MaxByteSize:=8000)> _
Public Class zVarchar
    Implements INullable


    Private ValueIsNull As Boolean = True
    Private CurrrentValue As String = “”


    Public Overrides Function ToString() As String
        If Me.ValueIsNull Then
            Return “NULL”
        Else
            Return Decompress(Me.CurrrentValue)
        End If
    End Function


    Public ReadOnly Property IsNull() As Boolean Implements INullable.IsNull
        Get
            Return Me.ValueIsNull
        End Get
    End Property


    Public Function CompressedLength() As Integer
        If Me.ValueIsNull Then
            Return 0
        Else
            Return Me.CurrrentValue.Length
        End If
    End Function


    Public Function CompressedValue() As String
        If Me.ValueIsNull Then
            Return “NULL”
        Else
            Return Me.CurrrentValue
        End If
    End Function


    Public Function DeCompressedLength() As Integer
        If Me.ValueIsNull Then
            Return 0
        Else
            Return zVarchar.Decompress(Me.CurrrentValue).Length
        End If
    End Function


    Public Function CompressionPercentage() As Decimal
        If Me.ValueIsNull Then
            Return 0
        Else
            Dim compressedLength As Integer
            Dim decompressedLength As Integer


            compressedLength = Me.CurrrentValue.Length
            decompressedLength = Me.DeCompressedLength
            If decompressedLength = 0 Then
                Return 0
            Else
                Return Decimal.Round(CType(compressedLength, Decimal) * 100 / CType(decompressedLength, Decimal), 2)
            End If
        End If
    End Function


    Public Shared ReadOnly Property Null() As zVarchar
        Get
            Dim newNullValue As zVarchar = New zVarchar


            Return newNullValue
        End Get
    End Property


    Public Const EscapeCharacter As Char = “\”c


    Public Const MaximumRepeatValue As Integer = 250


    Public Shared Function Parse(ByVal s As SqlString) As zVarchar
        If s.IsNull Or s.Value.ToLower().Equals(“null”) Then
            Dim newNullValue As New zVarchar


            Return newNullValue
        Else
            Dim zValue As New StringBuilder
            Dim objectToReturn As New zVarchar


            objectToReturn.ValueIsNull = False
            objectToReturn.CurrrentValue = zVarchar.Compress(s.ToString())


            Return objectToReturn
        End If
    End Function


    Public Shared Function Compress(ByVal ValueToCompress As String) As String
        Dim compressedValue As New StringBuilder
        Dim lastValue As Char
        Dim valueCount As Integer = 0
        Dim currentValue As Char
        Dim characterCounter As Integer


        For characterCounter = 0 To (ValueToCompress.Length – 1)
            currentValue = CType(ValueToCompress.Substring(characterCounter, 1), Char)
            If currentValue = zVarchar.EscapeCharacter Then
                If valueCount > 0 Then ‘ we have something to output first
                    If valueCount < 3 Then
                        compressedValue.Append(lastValue, valueCount)
                    Else
                        compressedValue.Append(zVarchar.EscapeCharacter)
                        compressedValue.Append(lastValue)
                        compressedValue.Append(Chr(valueCount))
                    End If
                    valueCount = 0
                End If
                compressedValue.Append(zVarchar.EscapeCharacter, 2)
                compressedValue.Append(Chr(1))
            Else ‘ if we don’t have an escape character
                If valueCount = 0 Then
                    lastValue = currentValue
                    valueCount = 1
                Else ‘ could be a repeated character
                    If lastValue = currentValue Then
                        If valueCount > zVarchar.MaximumRepeatValue Then ‘ needs to still fit in a byte
                            compressedValue.Append(zVarchar.EscapeCharacter)
                            compressedValue.Append(lastValue)
                            compressedValue.Append(Chr(valueCount))
                            valueCount = 0
                        End If
                        valueCount += 1
                    Else ‘ different character
                        ‘ output the cached value first
                        If valueCount < 3 Then
                            compressedValue.Append(lastValue, valueCount)
                        Else
                            compressedValue.Append(zVarchar.EscapeCharacter)
                            compressedValue.Append(lastValue)
                            compressedValue.Append(Chr(valueCount))
                        End If
                        lastValue = currentValue
                        valueCount = 1
                    End If
                End If


            End If
        Next characterCounter
        If valueCount > 0 Then ‘ we have something cached still to output
            If valueCount < 3 Then
                compressedValue.Append(lastValue, valueCount)
            Else
                compressedValue.Append(zVarchar.EscapeCharacter)
                compressedValue.Append(lastValue)
                compressedValue.Append(Chr(valueCount))
            End If
        End If


        Return compressedValue.ToString()
    End Function


    Public Shared Function Decompress(ByVal ValueToDecompress As String) As String
        Dim decompressedValue As New StringBuilder
        Dim characterCounter As Integer = 0


        Do While characterCounter < ValueToDecompress.Length
            If CType(ValueToDecompress.Substring(characterCounter, 1), Char) = zVarchar.EscapeCharacter Then
                decompressedValue.Append(CType(ValueToDecompress.Substring(characterCounter + 1, 1), Char), Asc(ValueToDecompress.Substring(characterCounter + 2, 1)))
                characterCounter += 2
            Else
                decompressedValue.Append(ValueToDecompress.Substring(characterCounter, 1))
            End If
            characterCounter += 1
        Loop
        Return decompressedValue.ToString()
    End Function
End Class


 

Yukon Ascend

In Sydney this week for the Yukon (SQL Server 2005) Ascend training. Covering a lot of the scalability and high-availability aspects of the product. I continue to be impressed by it.


In the material on partitioning tables, it says you can’t use more than a single column in the partition function. The idea is that you could partition a sales table on say, the region column and potentially store the data from each partition on a separate filegroup.


However, we noticed today that you can use a calculated column as the input to your partition function, as long as you make it a “persisted” calculated column. Persisting a calculated column is a new feature in Yukon too. So, this means that if you want to partition a table based upon more than one column, you could just create a calculated persisted column based upon those other columns and then use that new column in the partition function.


I was also talking to Chris Hewitt today and we kicked around the concept of creating a CLR-based user defined data type that compresses the input. I’m going to try to build one as a new example of a CLR-based type.