4 in 1: EF won’t fire DBAs, Velocity, SQL CLR and SQL Dependency

Imagine that we have a table Cars with a PK Id (nvarchar(8)) based on the following regex pattern [1-9][0-9]{1,2}[A-Z]{2,3}[0-9]{2}. The last two numbers are a region code.

The increasement is done like this: 10AA[region], 11AA[region], …, 999AA[region], 10AB[region], …, 999ZZ[region], 10AAA[region], …, 999ZZZ[region]

This table can be huge.

In a first time, we will write a LINQ To Entities query to get regions with 3 letters:

var q1 = (from c in context.Cars
          where c.Id.Contains("AAA")
          select c.Id.Substring(c.Id.Length – 2)).Distinct();

The SQL generated query is the following:

SELECT
[Distinct1].[C1] AS [C1]
FROM ( SELECT DISTINCT

      SUBSTRING([Extent1].[ Id ], ((LEN([Extent1].[ Id ])) - 2) + 1, (LEN([Extent1].[ Id ])) - ((LEN([Extent1].[ Id ])) - 2)) AS [C1]
      FROM [dbo].[Cars] AS [Extent1]
      WHERE [Extent1].[ Id ] LIKE N’%AAA%’
)  AS [Distinct1]

If I write it myself, I would probably write this:

SELECT DISTINCT
      RIGHT(Id, 2) AS [C1]
      FROM [Cars]
      WHERE [ Id ] LIKE N’%AAA%’

Note that the execution plan is the same!

Imagine that the DBA wants that I use the RIGHT function. With EF4, we can do it.

The Right doesn’t exist in C# but it does in ESQL. So we will use CSDL Functions.

In the CSDL, we will add the following code:

<Function Name="GetRegion" ReturnType="String">
  <Parameter Name="car" Type="Self.Car" />
  <DefiningExpression>
    Right(car.Id, 2)
  </DefiningExpression>
</Function>

Then,we will add an extension method to be able to use it in our LINQ To Entities queries:

public static class CarExtension
{
    [EdmFunction("CarsModel", "GetRegion")] 
    public static string GetRegion(this Car car)
    {
        throw new NotImplementedException("Only used by LINQ To Entities");
    }
}

I now can write my query like this:

var q1 = (from c in context.Cars
          where c.Id.Contains("AAA")
          select c.GetRegion()).Distinct();

This LINQ query is translated as the following SQL query:

SELECT
[Distinct1].[C1] AS [C1]
FROM ( SELECT DISTINCT
      RIGHT([Extent1].[ Id ], 2) AS [C1]
      FROM [dbo].[Cars] AS [Extent1]
      WHERE [Extent1].[ Id ] LIKE N’%AAA%’
)  AS [Distinct1]

Cool!

Now I want to get the last Id per region. I don’t think that there is a clean way to do it with SQL. Indeed, I don’t think that we can use Regex in SQL Server and we need it to determine which is the last id.

So to do this, I will use a LINQ To Object query. The “normal” way can be to write this:

var qL2E = from c in context.Cars
           orderby c.GetRegion()
           select c.Id;
var qL2O = from id in qL2E.AsEnumerable()
           group id by id.Substring(id.Length – 2) into g
           select new
           {
               Region = g.Key,
               Id = (from id in g
                     let letters = Regex.Match(id, "[A-Z]{2,3}")
                     orderby letters.Length descending, letters descending, Regex.Match(id, "^[0-9]{2,3}") descending
                     select id).FirstOrDefault()
           }.ToDictionary(id => id.Region, id => id.Id);

However, with this way, we will probably have an OutOfMemoryException. Indeed, with this query, all the Cars records are loaded in memory and we can have hundreds millions of rows.

In a first time, we will try a dichotomous approach.

var ids = (from c in context.Cars
           select c.GetRegion()).Distinct().AsEnumerable().ToDictionary(region => region, region => GetLastId(region, "", (new[] { "" }.Union(Enumerable.Range(‘A’, 26).Select(i => char.ConvertFromUtf32(i)))), 0));

private static string GetLastId(string region, string letters, IEnumerable<string> possibleChars, int index)
{
    if (!possibleChars.Skip(1).Any()) // Count() == 1
    {
        letters += possibleChars.First();
        if (++index == 3)
        {
            using (var context = new CarsContainer())
            {
                return
                    (from c in context.Cars
                     where c.Id.EndsWith(letters + region)
                     orderby c.Id.IndexOf(letters) descending, c.Id descending
                     select c.Id).FirstOrDefault();
            }
        }
        return GetLastId(region, letters, Enumerable.Range(‘A’, 26).Select(i => char.ConvertFromUtf32(i)), index);
    }
    else
    {
        var possibleCharsList = possibleChars.ToList();
        int middle = possibleCharsList.Count / 2;
        using (var context = new CarsContainer())
        {
            string idBeginning = string.Format("10{0}{1}{2}", letters, possibleCharsList[middle], "AA".Substring(index));
            if ((from c in context.Cars
                 where c.Id.StartsWith(idBeginning) && c.Id.EndsWith(region)
                 select c.Id).Any())
                return GetLastId(region, letters, possibleCharsList.Skip(middle), index);
            return GetLastId(region, letters, possibleCharsList.Take(middle), index);
        }
    }
}

The execution of this query is very slow (26 minutes and 18 seconds in my test with 562 012 347 records).

So what can we do?

The CPU is used by the DB at 100% and the slowly is only because of the DB. So it useless to imagine to parallelize our code.

We can imagine to resize the Data Base server but it is not the goal of this post.

In our case, the DBA (I specify that I am not a DBA) will probably want to change the Cars table schema in order to split into 3 columns the Id. However, in a lot of cases, it can be interesting to keep the Id column, which keeps the PK. Indeed, imagine that we have some existing applications. The idea is to add these columns with no impact on them.

So we will have four columns (all not nullable) :

  • Id
  • Number
  • Letters
  • Region

To set them, we will use CLR functions:

public class CarsFunction
{
    [SqlFunction]
    public static SqlInt16 GetNumber(SqlString id)
    {
        return new SqlInt16(short.Parse(Regex.Match(id.Value, "^[0-9]{2,3}").Value));
   

    [SqlFunction]
    public static SqlString GetLetters(SqlString id)
    {
        return new SqlString(Regex.Match(id.Value, "[A-Z]{2,3}").Value);
   

    [SqlFunction]
    public static SqlInt16 GetRegion(SqlString id)
    {
        return new SqlInt16(short.Parse(Regex.Match(id.Value, "[0-9]{2}$").Value));
    }
}

Then, in the DB, we will register the assembly and we will create some SQL functions from previous ones:

CREATE ASSEMBLY CarsFunctionsAssembly
FROM ‘D:\documents\visual studio 2010\Projects\CarsFunctions\CarsFunctions\bin\Debug\CarsFunctions.dll’
GO   

CREATE FUNCTION GetNumber(@id AS nvarchar(8)) RETURNS smallint AS EXTERNAL NAME CarsFunctionsAssembly.CarsFunction.GetNumber
GO

CREATE FUNCTION GetLetters(@id AS nvarchar(8)) RETURNS nvarchar(3) AS EXTERNAL NAME CarsFunctionsAssembly.CarsFunction.GetLetters
GO

CREATE
FUNCTION GetRegion(@id AS nvarchar(8)) RETURNS smallint AS EXTERNAL NAME CarsFunctionsAssembly.CarsFunction.GetRegion

Ok, we have the functions, so we can be back on the new columns creation. Instead of creating “normal” columns, we will use computed columns. We can do it with the SSMS designer by setting the Computed Column Specification Formula to dbo.GetNumber(Id) (resp dbo.GetLetters(Id), dbo.GetRegion(Id)). In our case, the idea isn’t to calculate them each time but to keep their values. So we will set the “Is Persisted” property to true.

But there is an issue: the function has to be deterministic. Our function is. However, because it is a CLR function, SQL Server can’t know it. So we will tell it explicitly.

To do this, we will change the CLR functions code like this:

[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlInt16 GetNumber(SqlString id)
{
    return new SqlInt16(short.Parse(Regex.Match(id.Value, "^[0-9]{2,3}").Value));
}

[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString GetLetters(SqlString id)
{
    return new SqlString(Regex.Match(id.Value, "[A-Z]{2,3}").Value);
}

[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlInt16 GetRegion(SqlString id)
{
    return new SqlInt16(short.Parse(Regex.Match(id.Value, "[0-9]{2}$").Value));
}

Now, we can persist the three columns.

Be careful, if we create an edmx from the DB, we will have a StoreGeneratedPattern="Computed" on the columns Number, Letters and Region. It implies that in each update, EF, we reload the properties Number, Letters et Region. In our case, the calculation is only done on the key. With Entity Framework, it is not possible to change the value of the key. So, it is useful to change the StoreGeneratedPattern from Computed to Identity.

Cool, we have fixed the INSERT issue.

Let’s now concentrate on the goal: our query.

We don’t need a dichotomous approach, a simple LINQ To Entities query is enough:

var ids = (from c in context.Cars
           group c by c.Region into g
           select new
           {
               Region = g.Key,
               LastId = (from c in g
                         orderby c.Letters.Length descending, c.Letters descending, c.Number descending
                         select c.Id).FirstOrDefault() 
           }).AsEnumerable().ToDictionary(region => region.Region, region => region.LastId);

With this, our query runs in 1 minute and 42 seconds instead of 26 minutes and 18 seconds!

It’s better but it isn’t finished yet.

In the DB, we will set the Cars table cluster on the Region column (instead of the Id column).

In this case, the query execution falls to 12 seconds!

However, be careful! It’s really great in our sample but in the real life, don’t forget to study all the DB query on the table Cars before changing the cluster.

So the DBA role is important in the success of the project.

 

Now, imagine that we want to keep in cache the last id per region. To do it, we can use a static Dictionary. But there are two issues:

  • Imagine a N-Tiers scenario. If our application is deployed on more than one frontal server, it’s a shame to have to initialize the cache for each of them. Moreover, we have to maintain each of them with the last values
  • If another application adds or removes a car in the DB, how to maintain our cache with the last values?

To fix the first issue, we will Velocity. For the second, we will use SQLDependency.

Velocity is a distributed cache which allows a lot of very interesting things like load balancing. This distributed cache will be shared by all the frontal servers. It means that we won”t have to manage one cache per server.

What do we have to do to use Velocity? First point: download it and install it.[:)]

When the installation is done, we will add the following references in our project:

  • CacheBaseLibrary.dll
  • CASBase.dll
  • CASMain.dll
  • ClientLibrary.dll
  • FabricCommon.dll

Then, we have to start the Velocity cluster. In this CTP, all administration commands are written with the console “Administration Tool – Microsoft Distributed Cache”. Then we write the following command line:

start-cachecluster

Then, in the config file, we wiil add these lines:

<configSections>
  <
section name="dataCacheClient" type="Microsoft.Data.Caching.DataCacheClientSection, CacheBaseLibrary" allowLocation="true" allowDefinition="Everywhere"/>
</
configSections>
<
dataCacheClient deployment="routing">
  <
localCache isEnabled="true" sync="TTLBased" ttlValue="60000"/>
  <
hosts>
    <
host name="MATTHIEU-PRO" cachePort="22233" cacheHostName="DistributedCacheService"/>
    <
host name="MATTHIEU-LAPTOP1" cachePort="22233" cacheHostName="DistributedCacheService"/>
    <
host name="MATTHIEU-LAPTOP2" cachePort="22233" cacheHostName="DistributedCacheService"/>
    <
host name="MATTHIEU-PC1" cachePort="22233" cacheHostName="DistributedCacheService"/>
    <
host name="MATTHIEU-PC2" cachePort="22233" cacheHostName="DistributedCacheService"/>
  </
hosts>
</
dataCacheClient>

Now we can imagine to query the cache in our application (server tier).

return new DataCacheFactory().GetDefaultCache().GetObjectsInRegion("LastImmatPerRegion").ToDictionary(keyValuePair => short.Parse(keyValuePair.Key), keyValuePair => (string)keyValuePair.Value);

With the data in the cache, the query executes in 30 ms!

Now, we have to fix the last point: initializing the cache and maintaining it.

To initialize it, it’s very easy. We just have to use the above query:

var cache = new DataCacheFactory().GetDefaultCache();

try
{
  
cache.RemoveRegion(LAST_IMMAT_PER_REGION);
}
catch
{
} 
cache.CreateRegion(LAST_IMMAT_PER_REGION, false);
foreach (var car in from c in context.Cars
                    group c by c.Region into g
                    select new
                    {
                        Region = g.Key,
                        LastId = (from c in g
                                  orderby c.Letters.Length descending, c.Letters descending, c.Number descending
                                  select c.Id).FirstOrDefault()
                    })
    cache.Put(car.Region.ToString(), car.LastId, LAST_IMMAT_PER_REGION);

Now to maintain the cache with the last values, it’s harder. My idea is to use a SQLDependency. The issue is the fact that we can’t, a priori, get the modifications with SQLDependency and, of course, I don’t want to refresh all my cache. So my idea is to add a Trigger on the Cars table which will fill a new table (for updates).

To begin, we will create this new table: CarsModificationsTmp with six columns:

  • TmpId (int Identity(1,1), PK),
  • CarId (nvarchar(8), not nullable)
  • Number (smallint, not nullable)
  • Letters (nvarchar(3), not nullable)
  • Region (smallint, not nullable),
  • Deleted (bit, not nullable, default=0)

Then, we have to define triggers on Cars table:

CREATE TRIGGER CarsInserted
ON Cars
FOR INSERT
AS
BEGIN
  DECLARE @Id AS nvarchar(8)
  DECLARE @Number AS smallint
  DECLARE @Letters AS nvarchar(3)
  DECLARE @Region AS smallint
  SELECT @Id = Id, @Number = Number, @Letters = Letters, @REGION = Region FROM Inserted
  INSERT INTO CarsModificationsTmp(CarId, Number, Letters, Region) VALUES(@Id, @Number, @Letters, @Region)
END
GO
 
CREATE TRIGGER CarsDeleted
ON Cars
FOR DELETE
AS
BEGIN
  DECLARE @Region AS smallint
  DECLARE @Id AS nvarchar(8)
  DECLARE @Number AS smallint
  DECLARE @Letters AS nvarchar(3)
  SELECT @Id = Id, @Number = Number, @Letters = Letters, @REGION = Region FROM Deleted
  INSERT INTO CarsModificationsTmp(CarId, Number, Letters, Region, Deleted) VALUES(@Id, @Number, @Letters, @Region, 1)
END

Last, we will modify our code to use SQL Dependency.

private void Load()
{
    using (var context = new CarsContainer())
    {
        _cache = new DataCacheFactory().GetDefaultCache();
        try
        {
            _cache.RemoveRegion(LAST_IMMAT_PER_REGION);
        }
        catch
        {
        }
        _cache.CreateRegion(LAST_IMMAT_PER_REGION, false);
        foreach (var car in from c in context.Cars
                            group c by c.Region into g
                            select new
                            {
                                Region = g.Key,
                                LastId = (from c in g
                                          orderby c.Letters.Length descending, c.Letters descending, c.Number descending
                                          select c.Id).FirstOrDefault()
                            })
            _cache.Put(car.Region.ToString(), car.LastId, LAST_IMMAT_PER_REGION);

 

        _connectionString = ((EntityConnection)context.Connection).StoreConnection.ConnectionString;
        SqlDependency.Stop(_connectionString);
        SqlDependency.Start(_connectionString);
        DefineCarsNotification();
    }
}

 

private
void DefineCarsNotification()
{
    var connection = new SqlConnection(_connectionString);
    var command = connection.CreateCommand();
    command.CommandText = "SELECT TmpId, Region, CarId, Deleted FROM CarsModificationsTmp";
    command.CommandType = CommandType.Text;
    var sqlDependency = new SqlDependency(command);
    sqlDependency.OnChange += SqlDependency_OnChange;
    connection.Open();
    command.ExecuteNonQuery();
    connection.Close();
}

 

private
void SqlDependency_OnChange(object sender, SqlNotificationEventArgs e)
{
    using (var contextModifications = new CarsContainer())
    {
        bool any = false;
        foreach (var cm in contextModifications.CarsModificationsTmps)
        {
            any = true;
            var cacheModifications = new DataCacheFactory().GetDefaultCache();
            var cacheItem = cacheModifications.GetCacheItem(cm.Region.ToString(), LAST_IMMAT_PER_REGION);
            if (cm.Deleted)
            {
                if (cacheItem != null && cacheItem.Value.ToString() == cm.CarId)
                {
                    var newLastId = (from c in contextModifications.Cars
                                     where c.Region == cm.Region
                                     orderby c.Letters.Length descending, c.Letters descending, c.Number descending
                                     select c.Id).FirstOrDefault();
                    if (newLastId == null)
                        _cache.Remove(cm.Region.ToString(), LAST_IMMAT_PER_REGION);
                    else
                        _cache.Put(cm.Region.ToString(), newLastId, LAST_IMMAT_PER_REGION);
                }
            }
            else
            {
                string letters;
                int lettersCompare = 0;
                if (cacheItem == null || (letters = Regex.Match(cacheItem.Value.ToString(), "[A-Z]{2,3}").Value).Length < cm.Letters.Length || (letters.Length == cm.Letters.Length && ((lettersCompare = String.Compare(letters, cm.Letters)) < 0 || lettersCompare == 0 && short.Parse(Regex.Match(cacheItem.Value.ToString(), "^[0-9]{2,3}").Value) < cm.Number)))
                    _cache.Put(cm.Region.ToString(), cm.CarId, LAST_IMMAT_PER_REGION);
            }
            contextModifications.DeleteObject(cm);
        }
        if (any)
            contextModifications.SaveChanges();
    }
    DefineCarsNotification();

}

We have a possible bug with this. Indeed, SqlDependency_OnChange is asynchronous and so we can have some parallels calls which implies some concurrent issues. Velocity DataCache is thread-safe. But, with the previous code, the concurrency doesn’t allow us to be sure we have the last id in the cache. To fix this, we will use a lock.

if (cm.Deleted)
{
    lock (_lockObject)
    {
        if (cacheItem != null && cacheItem.Value.ToString() == cm.CarId)
        {
            var newLastId = (from c in contextModifications.Cars
                             where c.Region == cm.Region
                             orderby c.Letters.Length descending, c.Letters descending, c.Number descending
                             select c.Id).FirstOrDefault();
            if (newLastId == null)
                _cache.Remove(cm.Region.ToString(), LAST_IMMAT_PER_REGION);
            else
                _cache.Put(cm.Region.ToString(), newLastId, LAST_IMMAT_PER_REGION);
        }
    }
}
else
{
    string letters;
    int lettersCompare = 0;
    lock (_lockObject)
    {
        if (cacheItem == null || (letters = Regex.Match(cacheItem.Value.ToString(), "[A-Z]{2,3}").Value).Length < cm.Letters.Length || (letters.Length == cm.Letters.Length && ((lettersCompare = String.Compare(letters, cm.Letters)) < 0 || lettersCompare == 0 && short.Parse(Regex.Match(cacheItem.Value.ToString(), "^[0-9]{2,3}").Value) < cm.Number)))
            _cache.Put(cm.Region.ToString(), cm.CarId, LAST_IMMAT_PER_REGION);
    }
}

We have another thing like this with the SaveChanges. Indeed, with concurrency, we can try to delete an already deleted DataRow which will throw an OptimisticConcurrencyException. My idea is: if the SaveChanges throws an exception, we will detach all the entities which implies the exception (ie: which are already deleted).

if (any)
    for (; ; )
        try
        {
            contextModifications.SaveChanges();
            break;
        }
        catch (OptimisticConcurrencyException ex)
        {
            foreach (var ose in ex.StateEntries)
                contextModifications.Detach(ose.Entity);
        }

Our solution is now thread-safe and is (I hope so [:)]) bug free. However, we can improve our code. If we have two update notifications in parallel which aren’t in the same region, it’s a shame to wait that the first ends before treating the second (what we do with our single lock). In fact, my idea is to use a dictionnary of locks instead of a single one. However, the Dictionary class is not thread-safe! We can use another object to lock the Dictionary but with .NET 4, we have a very useful new class: ConcurrentDictionary.

private ConcurrentDictionary<short, object> _concurrentDictionary = new ConcurrentDictionary<short,object>();

Now, we will modify our code to use a lock per region:

object lockRegion = _concurrentDictionary.GetOrAdd(cm.Region, new object());
if (cm.Deleted)
{
    lock (lockRegion)
    {
        if (cacheItem != null && cacheItem.Value.ToString() == cm.CarId)
        {
            var newLastId = (from c in contextModifications.Cars
                             where c.Region == cm.Region
                             orderby c.Letters.Length descending, c.Letters descending, c.Number descending
                             select c.Id).FirstOrDefault();
            if (newLastId == null)
                _cache.Remove(cm.Region.ToString(), LAST_IMMAT_PER_REGION);
            else
                _cache.Put(cm.Region.ToString(), newLastId, LAST_IMMAT_PER_REGION);
        }
    }
}
else
{
    string letters;
    int lettersCompare = 0;
    lock (lockRegion)
    {
        if (cacheItem == null || (letters = Regex.Match(cacheItem.Value.ToString(), "[A-Z]{2,3}").Value).Length < cm.Letters.Length || (letters.Length == cm.Letters.Length && ((lettersCompare = String.Compare(letters, cm.Letters)) < 0 || lettersCompare == 0 && short.Parse(Regex.Match(cacheItem.Value.ToString(), "^[0-9]{2,3}").Value) < cm.Number)))
            _cache.Put(cm.Region.ToString(), cm.CarId, LAST_IMMAT_PER_REGION);
    }

}

It’s now finish. Enjoy [:)]

This entry was posted in 10500, 12229, 7671, 7674, 7675, 9104. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>