One book that I recommend the reading is Clean Code, by Robert Martin. It is a well written book with wonderful techniques to create better code and improve your current programs, so they become easier to read, maintain and understand.

While going through it again, I found an excellent opportunity to improve my skills trying to do some refactoring: in listing 4.7 there is a prime generator function that he uses to show some refactoring concepts and turn int listing 4.8. I then thought do do the same and show my results here.

We can start with the listing converted to C#. This is a very easy task. The original program is written in  Java, but converting it to C# is just a matter of one or two small fixes:

using System;

namespace PrimeNumbers
{
/**
* This class Generates prime numbers up to a user specified
* maximum. The algorithm used is the Sieve of Eratosthenes.
* <p>
* Eratosthenes of Cyrene, b. c. 276 BC, Cyrene, Libya --
* d. c. 194, Alexandria. The first man to calculate the
* circumference of the Earth. Also known for working on
* calendars with leap years and ran the library at Alexandria.
* <p>
* The algorithm is quite simple. Given an array of integers
* starting at 2. Cross out all multiples of 2. Find the next
* uncrossed integer, and cross out all of its multiples.
* Repeat untilyou have passed the square root of the maximum
* value.
*
* @author Alphonse
* @version 13 Feb 2002 atp
*/
    public class GeneratePrimes
    {
        /**
        * @param maxValue is the generation limit.
*/
        public static int[] generatePrimes(int maxValue)
        {
            if (maxValue >= 2) // the only valid case
            {
                // declarations
                int s = maxValue + 1; // size of array
                bool[] f = new bool[s];
                int i;

                // initialize array to true.
                for (i = 0; i < s; i++)
                    f[i] = true;
                // get rid of known non-primes
                f[0] = f[1] = false;
                // sieve
                int j;
                for (i = 2; i < Math.Sqrt(s) + 1; i++)
                {
                    if (f[i]) // if i is uncrossed, cross its multiples.
                    {
                        for (j = 2 * i; j < s; j += i)
                            f[j] = false; // multiple is not prime
                    }
                }
                // how many primes are there?
                int count = 0;
                for (i = 0; i < s; i++)
                {
                    if (f[i])
                        count++; // bump count.
                }
                int[] primes = new int[count];
                // move the primes into the result
                for (i = 0, j = 0; i < s; i++)
                {
                    if (f[i]) // if prime
                        primes[j++] = i;
                }
                return primes; // return the primes
            }
            else // maxValue < 2
                return new int[0]; // return null array if bad input.
        }
    }
}

The first step is to put in place some tests, so we can be sure that we are not breaking anything while refactoring the code. In the solution, I added a new Class Library project, named it GeneratePrimes.Tests and added the packages NUnit, NUnit3TestAdapter and FluentAssertions to get fluent assertions in a NUnit test project. Then I added these tests:

using NUnit.Framework;
using FluentAssertions;

namespace PrimeNumbers.Tests
{
    [TestFixture]
    public class GeneratePrimesTests
    {
        [Test]
        public void GeneratePrimes0ReturnsEmptyArray()
        {
            var actual = GeneratePrimes.generatePrimes(0);
            actual.Should().BeEmpty();
        }

        [Test]
        public void GeneratePrimes1ReturnsEmptyArray()
        {
            var actual = GeneratePrimes.generatePrimes(1);
            actual.Should().BeEmpty();
        }

        [Test]
        public void GeneratePrimes2ReturnsArrayWith2()
        {
            var actual = GeneratePrimes.generatePrimes(2);
            actual.Should().BeEquivalentTo(new[] { 2 });
        }

        [Test]
        public void GeneratePrimes10ReturnsArray()
        {
            var actual = GeneratePrimes.generatePrimes(10);
            actual.Should().BeEquivalentTo(new[] { 2,3,5,7 });
        }

        [Test]
        public void GeneratePrimes10000ReturnsArray()
        {
            var actual = GeneratePrimes.generatePrimes(10000);
            actual.Should().HaveCount(1229).And.EndWith(9973);
        }
    }
}

These tests check that there are no primes for 0 and 1, one prime for 2, the primes for 10 are 2, 3, 5, 7 and that there are 1229 primes less than 10,000 and the largest one is 9973. Once we run the tests, we can see that the pass and we can start doing our changes.

The easiest fix we can do is to revise the comments at the beginning. We don’t need the history of Erasthotenes (you can go to Wikipedia for that). We don’t need the author and version, thanks to source control technology :-). We don’t need either the initial comment:

/**
    * This class Generates prime numbers up to a user specified
    * maximum. The algorithm used is the Sieve of Eratosthenes.
    *  https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes   
*/
public class GeneratePrimes
{
    public static int[] generatePrimes(int maxValue)

Then we can invert the initial test, to reduce nesting. If we hover the mouse in the line of the first if, an arrow appears at the border, indicating a quick fix:

We can do the quick fix, then eliminate the else clause (don’t forget to remove the extra comments that are not needed):

public static int[] generatePrimes(int maxValue)
{
    if (maxValue < 2) 
        return new int[0]; 

    // declarations
    int s = maxValue + 1; // size of array
    bool[] f = new bool[s];
    int i;

Save the code and check that all tests pass. The next step is to rename the variables:

  • s can be renamed to sizeOfArray
  • f can be renamed as isPrimeArray

Go to the declaration of s and press Ctrl-R-R to rename and rename it to sizeOfArray. Do the same with the f variable. Don’t forget to remove the comments (and to run the tests):

int sizeOfArray = maxValue + 1; 
bool[] isPrimeArray = new bool[sizeOfArray];
int i;

To go to the next refactorings, we can use the comments as indicators for extracting methods. We can extract the InitializeArray method:

The extracted code isn’t what I expected, so I change it to:

private static bool[] InitializeArray(int sizeOfArray)
{
    bool[] isPrimeArray = new bool[sizeOfArray];
    // initialize array to true.
    for (var i = 0; i < sizeOfArray; i++)
        isPrimeArray[i] = true;
    return isPrimeArray;
}

I can use the code like this:

var isPrimeArray = InitializeArray(sizeOfArray);

After passing the tests, I can refactor the code of InitializeArray to:

private static bool[] InitializeArray(int sizeOfArray)
{
    return Enumerable
        .Range(0, sizeOfArray)
        .Select(n => true)
        .ToArray();
}

The next step is the sieve:

The code for the sieve is really bad:

private static void Sieve(int sizeOfArray, bool[] isPrimeArray, 
    out int i, out int j)
{
    // get rid of known non-primes
    isPrimeArray[0] = isPrimeArray[1] = false;
    for (i = 2; i < Math.Sqrt(sizeOfArray) + 1; i++)
    {
        if (isPrimeArray[i]) // if i is uncrossed, cross its multiples.
        {
            for (j = 2 * i; j < sizeOfArray; j += i)
                isPrimeArray[j] = false; // multiple is not prime
        }
    }
}

It has two out parameters (which, for me, is a code smell), and has an error (the out parameter j must be assigned) before exiting the method. So we can change it to remove the out parameters and remove the sizeOfArray parameter:

private static void Sieve(bool[] isPrimeArray)
{
    var sizeOfArray = isPrimeArray.Length;

    isPrimeArray[0] = isPrimeArray[1] = false;

    for (int i = 2; i < Math.Sqrt(sizeOfArray) + 1; i++)
    {
        if (isPrimeArray[i]) // if i is uncrossed, cross its multiples.
        {
            for (int j = 2 * i; j < sizeOfArray; j += i)
                isPrimeArray[j] = false; 
        }
    }

Then, we can extract the method to count primes:

CountPrimes has the same flaws as Sieve, so we change it to:

private static int CountPrimes(bool[] isPrimeArray)
{
    var sizeOfArray = isPrimeArray.Length;
    var count = 0;
    for (var i = 0; i < sizeOfArray; i++)
    {
        if (isPrimeArray[i])
            count++; 
    }
    return count;
}

We can refactor it to:

private static int CountPrimes(bool[] isPrimeArray) => 
    isPrimeArray.Count(i => i);

The next step is MovePrimes:

After we tweak the MovePrimes code, we get:

private static int[] MovePrimes(bool[] isPrimeArray, int count)
{
    var sizeOfArray = isPrimeArray.Length;
    var primes = new int[count];
    for (int i = 0, j = 0; i < sizeOfArray; i++)
    {
        if (isPrimeArray[i]) // if prime
            primes[j++] = i;
    }
    return primes;
}

Then we can refactor MovePrimes:

 private static int[] MovePrimes(bool[] isPrimeArray, int count) =>
     isPrimeArray
         .Select((p, i) => new { Index = i, IsPrime = p })
         .Where(v => v.IsPrime)
         .Select(v => v.Index)
         .ToArray();

Notice that we aren’t using the primes count in this case, so we can remove the calculation of the count and the parameter. After some cleaning and name changing, we get:

public static int[] GetPrimes(int maxValue)
{
    if (maxValue < 2)
        return new int[0];

    bool[] isPrimeArray = InitializeArray(maxValue);
    Sieve(isPrimeArray);
    return MovePrimes(isPrimeArray);
}

Much cleaner, no? Now, it’s easier to read the method, the details are hidden, but the code still runs the same way. We have a more maintainable method, and it shows clearly what it does.

But there is a change we can do here: we are using static methods only. We can then use extension methods and add the keyword this to allow the methods to be used as extension methods. For example, if we change MovePrimes and Sieve to:

private static int[] MovePrimes(this bool[] isPrimeArray) =>
    isPrimeArray
        .Select((p, i) => new { Index = i, IsPrime = p })
        .Where(v => v.IsPrime)
        .Select(v => v.Index)
        .ToArray();

private static bool[] Sieve(this bool[] isPrimeArray)
{
    var sizeOfArray = isPrimeArray.Length;

    isPrimeArray[0] = isPrimeArray[1] = false;

    for (int i = 2; i < Math.Sqrt(sizeOfArray) + 1; i++)
    {
        if (isPrimeArray[i]) // if i is uncrossed, cross its multiples.
        {
            for (int j = 2 * i; j < sizeOfArray; j += i)
                isPrimeArray[j] = false;
        }
    }
    return isPrimeArray;

We can have the GetPrimes method to be changed to:

public static int[] PrimesSmallerOrEqual(this int maxValue)
{
    if (maxValue < 2)
        return new int[0];

    return maxValue.InitializeArray()
        .Sieve()
        .MovePrimes();
}

Cool, no? With this change, the tests become:

public class GeneratePrimesTests
{
    [Test]
    public void GeneratePrimes0ReturnsEmptyArray()
    {
        0.PrimesSmallerOrEqual().Should().BeEmpty();
    }

    [Test]
    public void GeneratePrimes1ReturnsEmptyArray()
    {
        1.PrimesSmallerOrEqual().Should().BeEmpty();
    }

    [Test]
    public void GeneratePrimes2ReturnsArrayWith2()
    {
        2.PrimesSmallerOrEqual()
            .Should().BeEquivalentTo(new[] { 2 });
    }

    [Test]
    public void GeneratePrimes10ReturnsArray()
    {
        10.PrimesSmallerOrEqual()
            .Should().BeEquivalentTo(new[] { 2, 3, 5, 7 });
    }

    [Test]
    public void GeneratePrimes10000ReturnsArray()
    {
        10000.PrimesSmallerOrEqual()
            .Should().HaveCount(1229).And.EndWith(9973);
    }
}

The full code is at https://github.com/bsonnino/PrimeNumbers. Each commit there is a phase of the refactoring.

Sometimes, when we open an Explorer window in the main computer, we see red bars in some disks, telling us that the disk is almost full and that we need to do some cleanup. We call the system cleanup, that removes some unused space, but this isn’t enough to  make things better.

So, we try to find the duplicate files in the disk to remove some extra space, but we have a problem: where are the duplicate files? The first answer is to check the files with the same name and size, but that isn’t enough – files can be renamed, and still be duplicates.

So, the best thing to do is to find a way to find and list all duplicates in the disk. But how can we do this?

The naive approach is to get all files with the same size and compare them one with the other. But this is really cumbersome, because if there are 100 files in the group, there will be 100!/(2!*98!) = 100*99/2 = 4950 comparisons and has a complexity of O(n^2).

One other approach is to get a checksum of the file and compare checksums. That way, you will still have the O(n^2) complexity, but you’ll have less data to compare (but you will have to compute the time to calculate the checksums). A third approach would be to use a dictionary to group the files with the same hash. The search in a dictionary has a O(1) complexity, so this would do a O(n) complexity.

Now, we only have to choose the checksum. Every checksum has a number of bits and, roughly, the larger the number of bits, the longer it takes to compute it. But the larger number of bits make it more difficult to get wrong results: if you are using CRC16 checksum (16 bits), you will have 65,535 combinations and the probability of two different files have the same checksum is very large. CRC32 allows 2,147,483,647 combinations and, thus, is more difficult to have a wrong result. You can use other algorithms, like MD5 (128 bits), SHA1 (196 bits) or SHA256 (256 bits), but computing these will be way longer than computing the CRC32 bits. As we are not seeking for huge accuracy, but for speed, we’ll use the CRC32 algorithm to compute the hashes. A fast implementation of this algorithm can be found here , and you can use it by installing the CRC32C.NET NuGet package.

From there, we can create our program to find and list the duplicates in the disk. In Visual Studio, create a new WPF application. In the Solution Explorer, right-click on the references node and select the WpfFolderBrowser and Crc32C.NET packages. Then add this code in MainWindow.xaml:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Button Width="85" Height="30" Content="Start" Click="StartClick"
                HorizontalAlignment="Right" Margin="5" Grid.Row="0"/>
    <Grid Grid.Row="1">
        <Grid.RowDefinitions>
            <RowDefinition Height="*"/>
            <RowDefinition Height="30"/>
        </Grid.RowDefinitions>
        <ScrollViewer HorizontalScrollBarVisibility="Disabled">
        <ItemsControl x:Name="FilesList" HorizontalContentAlignment="Stretch">
            <ItemsControl.ItemTemplate>
                <DataTemplate>
                    <Grid HorizontalAlignment="Stretch">
                        <Grid.RowDefinitions>
                            <RowDefinition Height="30" />
                            <RowDefinition Height="Auto" />
                        </Grid.RowDefinitions>
                        <TextBlock Text="{Binding Value[0].Length, StringFormat=N0}"
                                   Margin="5" FontWeight="Bold"/>
                        <TextBlock Text="{Binding Key, StringFormat=X}"
                                   Margin="5" FontWeight="Bold" HorizontalAlignment="Right"/>
                        <ItemsControl ItemsSource="{Binding Value}" Grid.Row="1" 
                                      HorizontalAlignment="Stretch"
                                      ScrollViewer.HorizontalScrollBarVisibility="Disabled"
                                      Background="Aquamarine">
                            <ItemsControl.ItemTemplate>
                                <DataTemplate>
                                    <TextBlock Text="{Binding FullName}" Margin="15,0"  />
                                </DataTemplate>
                            </ItemsControl.ItemTemplate>
                        </ItemsControl>
                    </Grid>
                </DataTemplate>
            </ItemsControl.ItemTemplate>
        </ItemsControl>
        </ScrollViewer>
        <StackPanel Grid.Row="1" Orientation="Horizontal">
            <TextBlock x:Name="TotalFilesText" Margin="5,0" VerticalAlignment="Center"/>
            <TextBlock x:Name="LengthFilesText" Margin="5,0" VerticalAlignment="Center"/>
        </StackPanel>
    </Grid>
</Grid>

In the button’s click event handler, we will open a Folder browser dialog and, if the user selects a folder, we will process it, enumerating the files and  finding the ones that have the same size. Then, we calculate the Crc32 for these files and add them to a dictionary, grouped by hash:

private async void StartClick(object sender, RoutedEventArgs e)
{
    var fbd = new WPFFolderBrowserDialog();
    if (fbd.ShowDialog() != true)
        return;
    FilesList.ItemsSource = null;
    var selectedPath = fbd.FileName;

    var files = await GetPossibleDuplicatesAsync(selectedPath);
     FilesList.ItemsSource = await GetRealDuplicatesAsync(files);
}

The GetPossibleDuplicatesAsync will enumerate the files and group them by size, returning only the groups that have more than one file:

private async Task<List<IGrouping<long, FileInfo>>> GetPossibleDuplicates(string selectedPath)
{
    List<IGrouping<long, FileInfo>> files = null;
    await Task.Factory.StartNew(() =>
    {
        files = GetFilesInDirectory(selectedPath)
                       .OrderByDescending(f => f.Length)
                         .GroupBy(f => f.Length)
                         .Where(g => g.Count() > 1)
                         .ToList();
    });
    return files;
}

GetFilesInDirectory enumerates the files in the selected directory:

private List<FileInfo> GetFilesInDirectory(string directory)
{
    var files = new List<FileInfo>();
    try
    {
        var directories = Directory.GetDirectories(directory);
        try
        {
            var di = new DirectoryInfo(directory);
            files.AddRange(di.GetFiles("*"));
        }
        catch
        {
        }
        foreach (var dir in directories)
        {
            files.AddRange(GetFilesInDirectory(System.IO.Path.Combine(directory, dir)));
        }
    }
    catch
    {
    }

    return files;
}

After we have the duplicate files grouped, we can search the real duplicates with GetRealDuplicatesAsync:

private static async Task<Dictionary<uint,List<FileInfo>>> GetRealDuplicatesAsync(
    List<IGrouping<long, FileInfo>> files)
{
    var dictFiles = new Dictionary<uint, List<FileInfo>>();
    await Task.Factory.StartNew(() =>
    {
        foreach (var file in files.SelectMany(g => g))
        {
            var hash = GetCrc32FromFile(file.FullName);
            if (hash == 0)
                continue;
            if (dictFiles.ContainsKey(hash))
                dictFiles[hash].Add(file);
            else
                dictFiles.Add(hash, new List<FileInfo>(new[] { file }));
        }
    });
    return dictFiles.Where(p => p.Value.Count > 1).ToDictionary(p => p.Key, p => p.Value);
}

The GetCrc32FromFile method with use the Crc32C library to compute the Crc32 hash from the file. Note that we can’t compute the hash in one pass, by reading the whole file, as this will fail with files with more than 2Gb. So, we read chunks of 10,000 bytes and process them.

public static uint GetCrc32FromFile(string fileName)
{
    try
    {
        using (FileStream file = new FileStream(fileName, FileMode.Open))
        {
            const int NumBytes = 10000;
            var bytes = new byte[NumBytes];
            var numRead = file.Read(bytes, 0, NumBytes);
            if (numRead == 0)
                return 0;
            var crc = Crc32CAlgorithm.Compute(bytes, 0, numRead);
            while (numRead > 0)
            {
                numRead = file.Read(bytes, 0, NumBytes);
                if (numRead > 0)
                    Crc32CAlgorithm.Append(crc, bytes, 0, numRead);
            }
            return crc;
        }
    }
    catch (Exception ex) when (ex is UnauthorizedAccessException || ex is IOException)
    {
        return 0;
    }
}

Now, when you run the app, you will get something like this:

You can then verify the files you want to remove and then go to Explorer and remove them. But there is one thing to do here: the time to compute the hash is very large, especially if you have a lot of data to process (large files, large number of files or both). Could it be improved?

This issue is somewhat complicated to solve. Fortunately, .NET provide us with an excellent tool to improve performance in this case: Parallel programming. By making a small change in the code, you can calculate the CRC of the files in parallel, thus improving the performance. But there is a catch: we are using classes that are not thread safe. If you use the common Dictionary and List to store the data, you will end up with wrong results. But, once again, .NET comes to rescue us: it provides the ConcurrentDictionary and ConcurrentBag to replace the common classes, so we can store the data in a thread safe way. We can then change the code to this:

private static async Task<Dictionary<uint, List<FileInfo>>> GetRealDuplicatesAsync(
    List<IGrouping<long, FileInfo>> files)
{
    var dictFiles = new ConcurrentDictionary<uint, ConcurrentBag<FileInfo>>();
    await Task.Factory.StartNew(() =>
    {
        Parallel.ForEach(files.SelectMany(g => g), file =>
        {
            var hash = GetCrc32FromFile(file.FullName);
            if (hash != 0)
            {
                if (dictFiles.ContainsKey(hash))
                    dictFiles[hash].Add(file);
                else
                    dictFiles.TryAdd(hash, new ConcurrentBag<FileInfo>(new[] { file }));
            }
        });
    });
    return dictFiles.Where(p => p.Value.Count > 1)
        .OrderByDescending(p => p.Value.First().Length)
        .ToDictionary(p => p.Key, p => p.Value.ToList());
}

When we do that and run our program again, we will see that more CPU is used for the processing and the times to get the list come to 46 seconds from 78 seconds (for 18GB of duplicate files).

Conclusions

With this program, we can show the largest duplicates in a folder and see what can be safely deleted in our disk, thus retrieving some space (in our case, we would have potentially got 9Gb extra). We’ve done some optimization in the code by parallelizing the calculations using the parallel extensions in .NET.

The source code for this article is in https://github.com/bsonnino/FindDuplicates

Sometimes, you need to parse some html data to do some processing and present it to the user. That may be a daunting task, as some pages can become very complex and it may be difficult to do it.

For that, you can use an excellent tool, named HTML Agility Pack. With it, you can parse HTML from a string, a file, a web site or even from a WebBrowser: you can add a WebBrowser to your app, navigate to an URL and parse the data from there.

In this article, I’ll show how to make a query in Bing, retrieve and parse the response. For that, we need to create the query url and pass it to Bing. You may ask why I’m querying Bing and not Google – I’m doing that because Google makes it difficult to get its data, and I want to show you how to use HTML Agility Pack, and not how to retrieve data from Google :-). The query should be something like this:

https://www.bing.com/search?q=html+agility+pack&count=100

We will use the Query (q) and the number of results (count) parameters. With them, we can create our program. We will create a WPF program that gets the query text, parses it and presents the results in a Listbox.

Create a new WPF program and name it BingSearch.

The next step is to add the HTML Agility Pack to the project. Right-click the References node in the Solution Explorer and select Manage NuGet Packages. Then add the Html Agility Pack to the project.

Then, in the main window, add this XAML code:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <StackPanel Grid.Row="0" Orientation ="Horizontal" 
                Margin="5,0" VerticalAlignment="Center">
        <TextBlock Text="Search" VerticalAlignment="Center"/>
        <TextBox x:Name="TxtSearch" Width="300" Height="30" 
                 Margin="5,0" VerticalContentAlignment="Center"/>
    </StackPanel>
    <Button Grid.Row="0" HorizontalAlignment="Right" 
            Content="Search" Margin="5,0" VerticalAlignment="Center"
            Width="65" Height="30" Click="SearchClick"/>
    <ListBox Grid.Row="1" x:Name="LbxResults" />
</Grid>

Right click in the button’s click event handler in the XAML and press F12 to add the handler in code and go to it. Then, add this code to the handler:

private async void SearchClick(object sender, RoutedEventArgs e)
{
    if (string.IsNullOrWhiteSpace(TxtSearch.Text))
        return;
    var queryString = WebUtility.UrlEncode(TxtSearch.Text);
    var htmlWeb = new HtmlWeb();
    var query = $"https://bing.com/search?q={queryString}&count=100";
    var doc = await htmlWeb.LoadFromWebAsync(query);
    var response = doc.DocumentNode.SelectSingleNode("//ol[@id='b_results']");
    var results = response.SelectNodes("//li[@class='b_algo']");
    if (results == null)
    {
        LbxResults.ItemsSource = null;
        return;
    }
    var searchResults = new List<SearchResult>();
    foreach (var result in results)
    {
        var refNode = result.Element("h2").Element("a");
        var url = refNode.Attributes["href"].Value;
        var text = refNode.InnerText;
        var description = result.Element("div").Element("p").InnerText;
        searchResults.Add(new SearchResult(text, url, description));
    }
    LbxResults.ItemsSource = searchResults;
}

Initially we encode the text to search to add it to the query and create the query string. Then we call the LoadFromWebAsync method to load the HTML data from the query response. When the response comes, we get the response node, from the ordered list with id b_results and extract from it the individual results. Finally, we parse each result and add it to a list of SearchResult, and assign the list to the items in the ListBox. You can note that we can finde the nodes using XPath, like in

var results = response.SelectNodes("//li[@class='b_algo']");

Or we can traverse the elements and get the text of the resulting node with something like:

var refNode = result.Element("h2").Element("a");
var url = refNode.Attributes["href"].Value;
var text = refNode.InnerText;
var description = WebUtility.HtmlDecode(
    result.Element("div").Element("p").InnerText);

SearchResult is declared as:

internal class SearchResult
{
    public string Text { get; }
    public string Url { get; }
    public string Description { get; }

    public SearchResult(string text, string url, string description)
    {
        Text = text;
        Url = url;
        Description = description;
    }
}

if you run the program, you will see something like this:

The data isn’t displayed because we haven’t defined any data template for the list items. You can define an item template like that in the XAML:

<ListBox.ItemTemplate>
    <DataTemplate>
        <StackPanel Margin="0,3">
            <TextBlock Text="{Binding Text}" FontWeight="Bold"/>
            <TextBlock >
              <Hyperlink NavigateUri="{Binding Url}" RequestNavigate="LinkNavigate">
                 <TextBlock Text="{Binding Url}"/>
              </Hyperlink>
            </TextBlock>
            <TextBlock Text="{Binding Description}" TextWrapping="Wrap"/>
        </StackPanel>
    </DataTemplate>
</ListBox.ItemTemplate>

The LinkNavigate event handler is:

private void LinkNavigate(object sender, RequestNavigateEventArgs e)
{
    System.Diagnostics.Process.Start(e.Uri.AbsoluteUri);
}

Now, when you run the program, you will get something like this:

You can click on the hyperlink and it will open a browser window with the selected page. We can even go further and add a WebBrowser to our app that will show the selected page when you click on an item. For that, you have to modify the XAML code with something like this:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Grid.ColumnDefinitions>
        <ColumnDefinition Width="*"/>
        <ColumnDefinition Width="*"/>
    </Grid.ColumnDefinitions>
    <StackPanel Grid.Row="0" Orientation ="Horizontal" 
                Margin="5,0" VerticalAlignment="Center">
        <TextBlock Text="Search" VerticalAlignment="Center"/>
        <TextBox x:Name="TxtSearch" Width="300" Height="30" 
                 Margin="5,0" VerticalContentAlignment="Center"/>
    </StackPanel>
    <Button Grid.Row="0" HorizontalAlignment="Right" 
            Content="Search" Margin="5,0" VerticalAlignment="Center"
            Width="65" Height="30" Click="SearchClick"/>
    <ListBox Grid.Row="1" x:Name="LbxResults" 
             ScrollViewer.HorizontalScrollBarVisibility="Disabled"
             SelectionChanged="LinkChanged">
        <ListBox.ItemTemplate>
            <DataTemplate>
                <StackPanel Margin="0,3">
                    <TextBlock Text="{Binding Text}" FontWeight="Bold"/>
                    <TextBlock >
                      <Hyperlink NavigateUri="{Binding Url}" RequestNavigate="LinkNavigate">
                         <TextBlock Text="{Binding Url}"/>
                      </Hyperlink>
                    </TextBlock>
                    <TextBlock Text="{Binding Description}" TextWrapping="Wrap"/>
                </StackPanel>
            </DataTemplate>
        </ListBox.ItemTemplate>
    </ListBox>
    <WebBrowser Grid.Column="1" Grid.RowSpan="2" x:Name="WebPage"  />
</Grid>

We’ve added a second column to the window and added a WebBrwser to it, then added a SelectionChanged event to the listbox, so we can navigate to the selected page.

The SelectionChanged event handler is:

private void LinkChanged(object sender, SelectionChangedEventArgs e)
{
    if (e.AddedItems?.Count > 0)
    {
        WebPage.Navigate(((SearchResult)e.AddedItems[0]).Url);
    }
}

Now, when you run the app and click on a result, it will show the page in the WebBrowser. One thing that happened is that, sometimes a Javascript error pops up. To remove these errors, I used the solution obtained from here:

public MainWindow()
{
    InitializeComponent();
    WebPage.Navigated += (s, e) => SetSilent(WebPage, true);
}

public static void SetSilent(WebBrowser browser, bool silent)
{
    if (browser == null)
        throw new ArgumentNullException("browser");

    // get an IWebBrowser2 from the document
    IOleServiceProvider sp = browser.Document as IOleServiceProvider;
    if (sp != null)
    {
        Guid IID_IWebBrowserApp = new Guid("0002DF05-0000-0000-C000-000000000046");
        Guid IID_IWebBrowser2 = new Guid("D30C1661-CDAF-11d0-8A3E-00C04FC9E26E");

        object webBrowser;
        sp.QueryService(ref IID_IWebBrowserApp, ref IID_IWebBrowser2, out webBrowser
        if (webBrowser != null)
        {
            webBrowser.GetType().InvokeMember("Silent", 
                BindingFlags.Instance | BindingFlags.Public | 
                BindingFlags.PutDispProperty, null, webBrowser, 
                new object[] { silent });
        }
    }
}


[ComImport, Guid("6D5140C1-7436-11CE-8034-00AA006009FA"), 
    InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
private interface IOleServiceProvider
{
    [PreserveSig]
    int QueryService([In] ref Guid guidService, [In] ref Guid riid, 
        [MarshalAs(UnmanagedType.IDispatch)] out object ppvObject);
}

With this code, the Javascript errors disappear and when you run the app, you will see something like this:

As you can see, the HTML Agility Pack makes it easy to process and parse HTML Pages, allowing you to manipulate them the way you want.

The full source code for this article is in https://github.com/bsonnino/BingSearch

One thing that I use a lot is sample data. Every article I write needs some data to explain the concepts, I need some data to see how it fits in my designs or even sample data for testing. This is a real trouble, as I must find some reliable data for my programs. Sometimes, I go to databases (Northwind and AdventureWorks are my good friends), sometimes, I use Json or XML data and other times I create the sample data by myself.

None of them are perfect, and it’s not consistent. Every time I get a new way of accessing data (yes, it can be good for learning purposes, but it’s a nightmare for maintenance). Then, looking around, I found Bogus (https://github.com/bchavez/Bogus), It’s a simple data generator for C#. All you have to do is create rules for your data and generate it. Simple as that! Then, you’ve got the data to use in your programs. It can be fixed (every time you run your program, you have the same data) or variable (every time you get a different set of data), and once you got it, you can serialize it to whichever data format you want: json files, databases, xml or plain text files.

Generating sample data

The first step to generate sample data is to create your classes. Create a new console app and add these two classes:

public class Customer
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
    public string City { get; set; }
    public string Country { get; set; }
    public string ZipCode { get; set; }
    public string Phone { get; set; }
    public string Email { get; set; }
    public string ContactName { get; set; }
    public IEnumerable<Order> Orders { get; set; }
}
public class Order
{
    public Guid Id { get; set; }
    public DateTime Date { get; set; }
    public Decimal OrderValue { get; set; }
    public bool Shipped { get; set; }
}

Once you’ve got the classes, you can add the repositories to get the sample data. To use the sample data generator, you must add the Bogus NuGet package to your project, with the command Install-Package Bogus, in the package manager console. Then we can add the repository class to retrieve the data. Add a new class to the project and name it SampleCustomerRepository. Then add this code in the class:

public IEnumerable<Customer> GetCustomers()
{
    Randomizer.Seed = new Random(123456);
    var ordergenerator = new Faker<Order>()
        .RuleFor(o => o.Id, Guid.NewGuid)
        .RuleFor(o => o.Date, f => f.Date.Past(3))
        .RuleFor(o => o.OrderValue, f => f.Finance.Amount(0, 10000))
        .RuleFor(o => o.Shipped, f => f.Random.Bool(0.9f));
    var customerGenerator = new Faker<Customer>()
        .RuleFor(c => c.Id, Guid.NewGuid())
        .RuleFor(c => c.Name, f => f.Company.CompanyName())
        .RuleFor(c => c.Address, f => f.Address.FullAddress())
        .RuleFor(c => c.City, f => f.Address.City())
        .RuleFor(c => c.Country, f => f.Address.Country())
        .RuleFor(c => c.ZipCode, f => f.Address.ZipCode())
        .RuleFor(c => c.Phone, f => f.Phone.PhoneNumber())
        .RuleFor(c => c.Email, f => f.Internet.Email())
        .RuleFor(c => c.ContactName, (f, c) => f.Name.FullName())
        .RuleFor(c => c.Orders, f => ordergenerator.Generate(f.Random.Number(10)).ToList());
    return customerGenerator.Generate(100);
}

In line 3, we have set the Randomizer.Seed to a fixed seed, so the data is always the same for all runs. If we don’t want it, we just don’t need to set it up. Then we set the rules for the order and customer generation. Then we call the Generate method to generate the sample data. Easy as that.

As you can see, the generator has a lot of classes to generate data. For example, the Company  class generates data for the company, like CompanyName. You can use this data as sample data for your programs. I can see some uses for it:

  • Test data for unit testing
  • Sample data for design purposes
  • Sample data for prototypes

but I’m sure you can find some more.

To use the data, you can add this code in the main program:

static void Main(string[] args)
{
    var repository = new SampleCustomerRepository();
    var customers = repository.GetCustomers();
    Console.WriteLine(JsonConvert.SerializeObject(customers, 
        Formatting.Indented));
}

We are serializing the data to Json, so you must add the Newtonsoft.Json Nuget package to your project. When you run this project, you will see something like this:

As you can see, it generated a whole set of customers with their orders, so you can use in your programs.

You may say that this is just dummy data, it will be cluttering your project, and it this data will remain unused when the program goes to production, and you’re right. But that’s not the way I would recommend to use this kind of data. It is disposable and will clutter the project. So, a better way would be create another project, a class library with the repository. That solves one problem, the maintenance one. When entering in production, you replace the sample dll with the real repository and you’re done. Or no? Not too fast. If you see the code in the main program, you will see that we are instantiating an instance of the SampleCustomerRepository, that won’t exist anymore. That way, you must also change the code for the main program. Not a good solution.

You can use conditional compilation to instantiate the repository you want, like this:

        static void Main(string[] args)
        {
#if SAMPLEREPO
            var repository = new SampleCustomerRepository();
#else
            var repository = new CustomerRepository();
#endif
            var customers = repository.GetCustomers();
            Console.WriteLine(JsonConvert.SerializeObject(customers, 
                Formatting.Indented));
        }

That’s better, but still not optimal: you need to compile the program twice: one time to use the sample data and the other one for going to production. With automated builds that may be less than a problem, but if you need to deploy by yourself, it can be a nightmare.

The solution? Dependency injection. Just create an interface for the repository, use a dependency injection framework (I like to use Unity or Ninject, but there are many others out there, just check this list). That way, the code will be completely detached from the data and you won’t need to recompile the project to use this or that data. Just add the correct dll and you are using the data you want. This is a nice approach, but the topic gives space for another post, just wait for it!

The code for this article is at https://github.com/bsonnino/SampleData

 

Many times I need to enumerate the files in my disk or in a folder and subfolders, but that always has been slow. All the file enumeration techniques go through the disk structures querying the file names and going to the next one. With the Windows file indexing, this has gone to another level of speed: you can query and filter your data almost instantaneously, with one pitfall: it only works in the indexed parts of the disk (usually the libraries and Windows folders), being unusable for your data folders, unless you add them to the indexer:

Wouldn’t it be nice to have a database that stores all files in the system and is updated as the files change? Some apps, like Copernic Desktop Search or X1 Search do exactly that: they have a database that indexes your system and can do fast queries for you. The pitfall is that you don’t have an API to integrate to your programs, so the only way you have is to query the files is to use their apps.

At some time, Microsoft thought of doing something like a database of files, creating what was called WinFS – Windows Future Storage, but the project was cancelled. So, we have to stick with our current APIs to query files. Or no? As a matter of fact there is something in the Windows system that allows us to query the files in a very fast way, and it’s called the NTFS MFT (NT file system master file table).

The NTFS MFT is a file structure use internally by Windows that allows querying files in a very fast way. It was designed to be fast and safe (there are two copies of the MFT, in case one of them gets corrupt), and we can access it to get our files enumerated. But some things should be noted when accessing the MFT:

  • The MFT is only available for NTFS volumes. So, you cannot access FAT drives with this API
  • To access the MFT structures, you must have elevated privileges – a normal user won’t be able to access it
  • With great power comes great responsibility (this is a SpiderMan comic book quote), so you should know that accessing the internal NTFS structures may harm you system irreversively – use the code with care, and don’t blame me if something goes wrong (but here’s a suggestion to Windows API designers: why not create some APIs that query the NTFS structures safely for normal users? That could be even be added to UWP programming).

Acessing the MFT structure

There is no formal API to access the MFT structure. You will have to decipher the structure (there is a lot of material here) and access the data using raw disk data read (that’s why you need elevated privileges). This is a lot of work and hours of trial and error.

Fortunately, there are some libraries that do that in C#, and I’ve used this one, which is licensed as LGPL. You can use the library in your compiled work as a library, with no restriction. If you include the library source code in your code, it will be “derived work” and you must distribute all code as LGPL.

We will create a WPF program that will show the disk usage. It will enumerate all files and show them in the list, so you can see what’s taking space in your disk. You will be able to select any of the NTFS disks in your machine.

Open Visual Studio with administrative rights (this is very important, or you won’t be able to debug your program). Then create a new WPF project and add a new item. Choose Application Manifest File, you will have an app.manifest file added to your project. Then, you must change the requestedExecutionLevel tag of the file to:

<requestedExecutionLevel level="requireAdministrator" uiAccess="false" />

The next step is to detect the NTFS disks in your system. This is done with this code:

var ntfsDrives = DriveInfo.GetDrives()
    .Where(d => d.DriveFormat == "NTFS").ToList();

Then, add the NtfsReader project to the solution and add a reference to it in the WPF project. Then, add the following UI in MainWindow.xaml.cs:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
        <RowDefinition Height="30"/>
    </Grid.RowDefinitions>
    <StackPanel Orientation="Horizontal" Margin="5">
        <TextBlock Text="Drive" VerticalAlignment="Center"/>
        <ComboBox x:Name="DrvCombo" Margin="5,0" Width="100" 
                  VerticalContentAlignment="Center"/>
    </StackPanel>
    <ListBox x:Name="FilesList" Grid.Row="1" 
             VirtualizingPanel.IsVirtualizing="True"
             VirtualizingPanel.IsVirtualizingWhenGrouping="True"
             >
        <ListBox.ItemTemplate>
            <DataTemplate>
                <StackPanel Orientation="Horizontal">
                    <TextBlock Text="{Binding FullName}" 
                               Margin="5,0" Width="450"/>
                    <TextBlock Text="{Binding Size,StringFormat=N0}" 
                               Margin="5,0" Width="150" TextAlignment="Right"/>
                    <TextBlock Text="{Binding LastChangeTime, StringFormat=g}" 
                               Margin="5,0" Width="200"/>
                </StackPanel>
            </DataTemplate>
        </ListBox.ItemTemplate>
    </ListBox>
    <TextBlock x:Name="StatusTxt" Grid.Row="2" HorizontalAlignment="Center" Margin="5"/>
</Grid>

We will have a combobox with all drives in the first row and a listbox with the files. The listbox has an ItemTemplate that will show the name, size and date of last change of each file. To fill this data, you will have to add this code in MainWindow.xaml.cs:

public MainWindow()
{
    InitializeComponent();
    var ntfsDrives = DriveInfo.GetDrives()
        .Where(d => d.DriveFormat == "NTFS").ToList();
    DrvCombo.ItemsSource = ntfsDrives;
    DrvCombo.SelectionChanged += DrvCombo_SelectionChanged;
}

private void DrvCombo_SelectionChanged(object sender, 
    System.Windows.Controls.SelectionChangedEventArgs e)
{
    if (DrvCombo.SelectedItem != null)
    {
        var driveToAnalyze = (DriveInfo) DrvCombo.SelectedItem;
        var ntfsReader =
            new NtfsReader(driveToAnalyze, RetrieveMode.All);
        var nodes =
            ntfsReader.GetNodes(driveToAnalyze.Name)
                .Where(n => (n.Attributes & 
                             (Attributes.Hidden | Attributes.System | 
                              Attributes.Temporary | Attributes.Device | 
                              Attributes.Directory | Attributes.Offline | 
                              Attributes.ReparsePoint | Attributes.SparseFile)) == 0)
                .OrderByDescending(n => n.Size);
        FilesList.ItemsSource = nodes;
    }
}

It gets all NTFS drives in your system and fills the combobox. In the SelectionChanged event handler, the reader gets all nodes in the drive. These nodes are filtered to remove all that are not normal files and then ordered descending by size and added to the listbox.

If you run the program you will see some things:

  • If you look at the output window in Visual Studio, you will see these debug messages:
1333.951 MB of volume metadata has been read in 26.814 s at 49.748 MB/s
1324082 nodes have been retrieved in 2593.669 ms

This means that it took 2.6s to read and analyze all files in the disk (pretty fast for 1.3 million files, no?).

  • When you change the drive in the combobox, the program will freeze for some time and the list will be filled with the files. The freezing is due to the fact that you are blocking the main thread while you are analyzing the disk. To avoid this, you should run the code in a secondary thread, like this code:
private async void DrvCombo_SelectionChanged(object sender, 
    System.Windows.Controls.SelectionChangedEventArgs e)
{
    if (DrvCombo.SelectedItem != null)
    {
        var driveToAnalyze = (DriveInfo) DrvCombo.SelectedItem;
        DrvCombo.IsEnabled = false;
        StatusTxt.Text = "Analyzing drive";
        List<INode> nodes = null;
        await Task.Factory.StartNew(() =>
        {
            var ntfsReader =
                new NtfsReader(driveToAnalyze, RetrieveMode.All);
            nodes =
                ntfsReader.GetNodes(driveToAnalyze.Name)
                    .Where(n => (n.Attributes &
                                 (Attributes.Hidden | Attributes.System |
                                  Attributes.Temporary | Attributes.Device |
                                  Attributes.Directory | Attributes.Offline |
                                  Attributes.ReparsePoint | Attributes.SparseFile)) == 0)
                    .OrderByDescending(n => n.Size).ToList();
        });
        FilesList.ItemsSource = nodes;
        DrvCombo.IsEnabled = true;
        StatusTxt.Text = $"{nodes.Count} files listed. " +
                         $"Total size: {nodes.Sum(n => (double)n.Size):N0}";
    }
}

This code creates a task and runs the analyzing code in it, and doesn’t freeze the UI. I just took care of disabling the combobox and putting a warning for the user. After the code is run, the nodes list is assigned to the listbox and the UI is re-enabled.

This code can show you the list of the largest files in your disk, but you may want to analyze it by other ways, like grouping by extension or by folder. WPF has an easy way to group and show data: the CollectionViewSource. With it, you can do grouping and sorting in the ListBox. We will change our UI to add a new ComboBox to show the new groupings:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
        <RowDefinition Height="30"/>
    </Grid.RowDefinitions>
    <StackPanel Orientation="Horizontal" Margin="5">
        <TextBlock Text="Drive" VerticalAlignment="Center"/>
        <ComboBox x:Name="DrvCombo" Margin="5,0" Width="100" 
                  VerticalContentAlignment="Center"/>
    </StackPanel>
    <StackPanel Grid.Row="0" HorizontalAlignment="Right" Orientation="Horizontal" Margin="5">
        <TextBlock Text="Sort" VerticalAlignment="Center"/>
        <ComboBox x:Name="SortCombo" Margin="5,0" Width="100" 
                  VerticalContentAlignment="Center" SelectedIndex="0" 
                  SelectionChanged="SortCombo_OnSelectionChanged">
            <ComboBoxItem>Size</ComboBoxItem>
            <ComboBoxItem>Extension</ComboBoxItem>
            <ComboBoxItem>Folder</ComboBoxItem>
        </ComboBox>
    </StackPanel>
    <ListBox x:Name="FilesList" Grid.Row="1" 
             VirtualizingPanel.IsVirtualizing="True"
             VirtualizingPanel.IsVirtualizingWhenGrouping="True" >
        <ListBox.ItemTemplate>
            <DataTemplate>
                <StackPanel Orientation="Horizontal">
                    <TextBlock Text="{Binding FullName}" 
                               Margin="5,0" Width="450"/>
                    <TextBlock Text="{Binding Size,StringFormat=N0}" 
                               Margin="5,0" Width="150" TextAlignment="Right"/>
                    <TextBlock Text="{Binding LastChangeTime, StringFormat=g}" 
                               Margin="5,0" Width="200"/>
                </StackPanel>
            </DataTemplate>
        </ListBox.ItemTemplate>
        <ListBox.GroupStyle>
            <GroupStyle>
                <GroupStyle.HeaderTemplate>
                    <DataTemplate>
                        <StackPanel Orientation="Horizontal">
                            <TextBlock Text="{Binding Name}" FontSize="15" FontWeight="Bold" 
                                       Margin="5,0"/>
                            <TextBlock Text="(" VerticalAlignment="Center" Margin="5,0,0,0" />
                            <TextBlock Text="{Binding Items.Count}" VerticalAlignment="Center"/>
                            <TextBlock Text=" files - " VerticalAlignment="Center"/>
                            <TextBlock Text="{Binding Items, 
                                Converter={StaticResource ItemsSizeConverter}, StringFormat=N0}"
                                         VerticalAlignment="Center"/>
                            <TextBlock Text=" bytes)" VerticalAlignment="Center"/>
                        </StackPanel>
                    </DataTemplate>
                </GroupStyle.HeaderTemplate>
            </GroupStyle>
        </ListBox.GroupStyle>
    </ListBox>
    <TextBlock x:Name="StatusTxt" Grid.Row="2" HorizontalAlignment="Center" Margin="5"/>
</Grid>

The combobox has three options, Size, Extension and Folder. The first one is the same thing we’ve had until now; the second will group the files by extension and the third will group the files by top folder. We’ve also added a GroupStyle to the listbox. If we don’t do that, the data will be grouped, but the groups won’t be shown. If you notice the GroupStyle, you will see that we’re adding the name, then the count of the items (number of files in the group), then we have a third TextBox where we pass the Items and a converter. That’s because we want to show the total size in bytes of the group. For that, I’ve created a converter that converts the Items in the group to the sum of the bytes of the file:

public class ItemsSizeConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, 
        CultureInfo culture)
    {
        var items = value as ReadOnlyObservableCollection<object>;
        return items?.Sum(n => (double) ((INode)n).Size);
    }

    public object ConvertBack(object value, Type targetType, object parameter, 
        CultureInfo culture)
    {
        throw new NotImplementedException();
    }
}

The code for the SelectionChanged for the sort combobox is:

private void SortCombo_OnSelectionChanged(object sender, 
    SelectionChangedEventArgs e)
{
    if (_view == null)
        return;
    _view.GroupDescriptions.Clear();
    _view.SortDescriptions.Clear();
    switch (SortCombo.SelectedIndex)
    {
        case 1:
            _view.GroupDescriptions.Add(new PropertyGroupDescription("FullName", 
                new FileExtConverter()));
            break;
        case 2:
            _view.SortDescriptions.Add(new SortDescription("FullName",
                ListSortDirection.Ascending));
            _view.GroupDescriptions.Add(new PropertyGroupDescription("FullName", 
                new FilePathConverter()));
            break;
    }
}

We add GroupDescriptions for each kind of group. As we don’t have the extension and top path properties in the nodes shown in the listbox, I’ve created two converters to get these from the full name. The converter that gets the extension from the name is:

class FileExtConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, 
        CultureInfo culture)
    {
        var fileName = value as string;
        return string.IsNullOrWhiteSpace(fileName) ? 
            null : 
            Path.GetExtension(fileName).ToLowerInvariant();
    }

    public object ConvertBack(object value, Type targetType, object parameter, 
        CultureInfo culture)
    {
        throw new NotImplementedException();
    }
}

The converter that gets the top path of the file is:

class FilePathConverter :IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, 
        CultureInfo culture)
    {
        var fileName = value as string;
        return string.IsNullOrWhiteSpace(fileName) ?
            null :
            GetTopPath(fileName);
    }

    private string GetTopPath(string fileName)
    {
        var paths = fileName.Split(Path.DirectorySeparatorChar).Take(2);
        return string.Join(Path.DirectorySeparatorChar.ToString(), paths);
    }

    public object ConvertBack(object value, Type targetType, object parameter, 
        CultureInfo culture)
    {
        throw new NotImplementedException();
    }
}

One last thing is to create the _view field, when we are filling the listbox:

 private async void DrvCombo_SelectionChanged(object sender, 
     System.Windows.Controls.SelectionChangedEventArgs e)
 {
     if (DrvCombo.SelectedItem != null)
     {
         var driveToAnalyze = (DriveInfo) DrvCombo.SelectedItem;
         DrvCombo.IsEnabled = false;
         StatusTxt.Text = "Analyzing drive";
         List<INode> nodes = null;
         await Task.Factory.StartNew(() =>
         {
             var ntfsReader =
                 new NtfsReader(driveToAnalyze, RetrieveMode.All);
             nodes =
                 ntfsReader.GetNodes(driveToAnalyze.Name)
                     .Where(n => (n.Attributes &
                                  (Attributes.Hidden | Attributes.System |
                                   Attributes.Temporary | Attributes.Device |
                                   Attributes.Directory | Attributes.Offline |
                                   Attributes.ReparsePoint | Attributes.SparseFile)) == 0)
                     .OrderByDescending(n => n.Size).ToList();
         });
         FilesList.ItemsSource = nodes;
         _view = (CollectionView)CollectionViewSource.GetDefaultView(FilesList.ItemsSource);

         DrvCombo.IsEnabled = true;
         StatusTxt.Text = $"{nodes.Count} files listed. " +
                          $"Total size: {nodes.Sum(n => (double)n.Size):N0}";
     }
     else
     {
         _view = null;
     }
 }

With all these in place, you can run the app and get a result like this:

Conclusions

As you can see, there is a way to get fast file enumeration for your NTFS disks, but you must have admin privileges to use it. We’ve created a WPF program that uses this kind of enumeration and allows you to group the data in different ways, using the WPF resources. If you need to enumerate  your files very fast, you can consider this way to do it.

The full source code for this article is at https://github.com/bsonnino/NtfsFileEnum

As an MVP, I sometimes receive licenses to software from the vendors for my usage. Some of them become indispensable for me and I feel in the obligation to write a review (yes, it’s a biased review, as I really like the tool and use it on a daily basis :-)) as a way to say thank you!

One of these tools is Linqpad (https://www.linqpad.net/). It’s a simple tool, with a small footprint, but I have used it in so many ways that I find it incredible. There is a free version that has a lot of features to start, but I really recommend the paid version (if you have the $95 to spend, the Premium edition has even a debugger to debug your snippets).

Introduction

Once you open Linqpad, you will see a simple desktop like this:

At first, the name of the tool may indicate that this is a notepad for linq queries, but it’s much more than that! If you take a look at the Samples pane, you can see that there’s even an Interactive Regex Evaluator.

A closer look at that pane shows that you are not tied to C#: you can also use F# there. In fact, there is a full F# tutorial there. If you open the Language combo, you can see that you can use also VB or SQL queries.

My first usages in Linqpad were to learn Linq (the name is Linqpad, no?). At the beginning, Linq seems a little bit daunting, with all those extension methods and lambdas. So, I started to try some Linq queries, making them more difficult as my knowledge was improving. In Linqpad, you have three flavors of code: Expressions, where you have a single expression evaluated; Statements, where you have some statements evaluated and Program, where you can have a full program run in Linqpad (I use this when I want to run a console program and don’t want to open Visual Studio and create a new project).

In the Expression mode, you can enter a single expression, like this:

from i in Enumerable.Range(1,1000)
  where i % 2 == 0
  select i

If you run it, you will see the result in the Results pane:

As you can see, all the results are there, there is no need to open a console window or anything else. And, what’s better, you can export the results to Excel, Word or HTML. You can also use the other Linq format, the functional one:

Enumerable.Range(1,1000).Where(i => i %2 == 0)

After that, you can start tweaking your code and clicking on the Run button and observing the results. If you have the paid version, you also have Intellisense in the code, so you can check the syntax.

For example, to get the sum of the squares of the even numbers, we can do something like this:

If we have something more complicated than a single expression, we can run it using the C# statements. For example, to get all methods and parameters of the methods in the Directory class, we can use these statements:

var methodInfos = typeof(Directory).GetMethods(BindingFlags.Public | 
  BindingFlags.Static);

methodInfos.Select(m => new 
{
  m.Name, 
  Parameters = m.GetParameters() 
}).Dump();

You may have noticed something different in the code above: the Dump method. Linqpad adds this method to dump the values to the results pane. It is very powerful, you don’t need to know the type of the object, all the properties are shown there:

And you are not limited to old C#, you can also use C#7 features and even async programming. For example, this code (based on https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/walkthrough-accessing-the-web-by-using-async-and-await) will download asynchronously some pages from the web and will display their sizes:

async Task Main()
{
	await SumPageSizesAsync().Dump();
}

private async Task<List<string>> SumPageSizesAsync()
{
	var results = new List<string>();
	// Declare an HttpClient object and increase the buffer size. The
	// default buffer size is 65,536.
	HttpClient client =
		new HttpClient() { MaxResponseContentBufferSize = 1000000 };

	// Make a list of web addresses.
	List<string> urlList = SetUpURLList();

	var total = 0;

	foreach (var url in urlList)
	{
		// GetByteArrayAsync returns a task. At completion, the task
		// produces a byte array.
		byte[] urlContents = await client.GetByteArrayAsync(url);

		// The following two lines can replace the previous assignment statement.
		//Task<byte[]> getContentsTask = client.GetByteArrayAsync(url);
		//byte[] urlContents = await getContentsTask;

		results.Add(DisplayResults(url, urlContents));

		// Update the total.
		total += urlContents.Length;
	}

	// Display the total count for all of the websites.
	results.Add(
		$"\r\n\r\nTotal bytes returned:  {total}\r\n");
	return results;
}

private List<string> SetUpURLList()
{
	List<string> urls = new List<string>
			{
				"https://msdn.microsoft.com/library/windows/apps/br211380.aspx",
				"https://msdn.microsoft.com",
				"https://msdn.microsoft.com/library/hh290136.aspx",
				"https://msdn.microsoft.com/library/ee256749.aspx",
				"https://msdn.microsoft.com/library/hh290138.aspx",
				"https://msdn.microsoft.com/library/hh290140.aspx",
				"https://msdn.microsoft.com/library/dd470362.aspx",
				"https://msdn.microsoft.com/library/aa578028.aspx",
				"https://msdn.microsoft.com/library/ms404677.aspx",
				"https://msdn.microsoft.com/library/ff730837.aspx"
			};
	return urls;
}

private string DisplayResults(string url, byte[] content)
{
	// Display the length of each website. The string format
	// is designed to be used with a monospaced font, such as
	// Lucida Console or Global Monospace.
	var bytes = content.Length;
	// Strip off the "https://".
	var displayURL = url.Replace("https://", "");
	return $"\n{displayURL,-58} {bytes,8}";
}

When you run it, you will see something like this:

And you are not tied to the default C# libraries. If you have the Developer or Premium versions, you can download and use NuGet packages in your queries. For example in this previous article, I’ve shown how to use the Microsoft.SqlServer.TransactSql.ScriptDom package to parse your Sql Server code. You don’t even need to open Visual Studio for that. Just put this code in the Linqpad window:

static void Main()
{
	using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
	{
		con.Open();
		var procTexts = GetStoredProcedures(con)
		  .Select(n => new { ProcName = n, Tree = ParseSql(GetProcText(con, n)) })
		  .Dump();
	}
}

private static List<string> GetStoredProcedures(SqlConnection con)
{
	using (SqlCommand sqlCommand = new SqlCommand("select s.name+'.'+p.name as name from sys.procedures p " +
	  "inner join sys.schemas s on p.schema_id = s.schema_id order by name", con))
	{
		using (DataTable procs = new DataTable())
		{
			procs.Load(sqlCommand.ExecuteReader());
			return procs.Rows.OfType<DataRow>().Select(r => r.Field<String>("name")).ToList();
		}
	}
}

private static string GetProcText(SqlConnection con, string procName)
{
	using (SqlCommand sqlCommand = new SqlCommand("sys.sp_helpText", con)
	{
		CommandType = CommandType.StoredProcedure
	})
	{
		sqlCommand.Parameters.AddWithValue("@objname", procName);
		using (var proc = new DataTable())
		{
			try
			{
				proc.Load(sqlCommand.ExecuteReader());
				return string.Join("", proc.Rows.OfType<DataRow>().Select(r => r.Field<string>("Text")));
			}
			catch (SqlException)
			{
				return null;
			}
		}
	}
}

private static (TSqlFragment sqlTree, IList<ParseError> errors) ParseSql(string procText)
{
	var parser = new TSql150Parser(true);
	using (var textReader = new StringReader(procText))
	{
		var sqlTree = parser.Parse(textReader, out var errors);

		return (sqlTree, errors);
	}
}

You will see some missing references. Just press F4 and it will open the following screen:

Click the Add NuGet button and add the Microsoft.SqlServer.TransactSql.ScriptDom package, then run the program. You will see something like this:

You can even click on the ScriptTokenStream result, to see the list of tokens in the procedure:

You can also simplify the query by using the connections available in Linqpad. Just go to the connections pane, add a new connection and point it to the WorldWideImporters database. Then select the connection in the connections combo and use this code:

void Main()
{
	ExecuteQuery<string>("select s.name+'.'+p.name as name from sys.procedures p " +
	  "inner join sys.schemas s on p.schema_id = s.schema_id order by name")
		  .Select(n => new 
		    { 
			  ProcName = n, 
			  Tree = ParseSql(ExecuteQuery<string>("exec sys.sp_helpText @objname={0}",n).FirstOrDefault()) 
			})
		  .Dump();
}

private static (TSqlFragment sqlTree, IList<ParseError> errors) ParseSql(string procText)
{
	var parser = new TSql150Parser(true);
	using (var textReader = new StringReader(procText))
	{
		var sqlTree = parser.Parse(textReader, out var errors);

		return (sqlTree, errors);
	}
}

You will see the same results. As you can see, you don’t even need to open the connection and create the command to run it. You can run your queries against your databases the same way you would do with any data. And if you are a SQL guy, you can run your queries directly using the SQL language. And, if you are brave and want to learn F#, you have here a really nice tool to learn.

Conclusions

At first, the size and appearance of Linqpad may fool you, but it’s a very nice tool to work, saving you a lot of time to try and debug your code. If you have some code snipped that you want to test and improve, this is the tool to use. And, one feature that I didn’t mention that’s invaluable when you are optimizing your code is the timing feature. After the execution of each query, Linqpad shows the execution time, so you can know how long did it take to execute it.

Some time ago, I’ve written this post about sending parametrized emails with C#. After some time, the emails weren’t arriving to their destination and I didn’t know why that happened – there was no message and the program was out in production, so I didn’t have an easy way to debug it.

Then, I went to the oracle that knows everything and asked “what is the easiest way to debug a C# program that sends emails in production?”. The oracle answered: “go to this page and you will find the answers you are looking for”. Once again, it was right.

Yes, the post pointed me to the easiest way to add debugging to your app that sends emails. There is not even the need to change the app. Just add a new clause to your app.config file and that’s it. Your app will generate a log file with all the data sent and received. The clause to add is system.diagnostics and you should add the traces you want and send them to a listener:

<system.diagnostics>
  <trace autoflush="true" />
  <sources>
    <source name="System.Net" >
      <listeners>
        <add name="MyTraceFile"/>
      </listeners>
    </source>
    <source name="System.Net.Sockets">
      <listeners>
        <add name="MyTraceFile"/>
      </listeners>
    </source>
  </sources>
  <sharedListeners>
    <add  name="MyTraceFile" type="System.Diagnostics.TextWriterTraceListener" 
          initializeData="System.Net.trace.log"   />
  </sharedListeners>
  <switches>
    <add name="System.Net" value="Verbose" />
    <add name="System.Net.Sockets" value="Verbose" />
  </switches>
</system.diagnostics>

The clause above will trace all calls to System.Net and System.Net.Sockets, with a Verbose level, to a file named System,Net.trace.log.

You can change the level of the tracing to something less verbose, but one note here: if you try to use the Info level, as stated here you will get an error. You should use the Information level. Using the number levels don’t show an error, but don’t generate a log either. So, you should use the named levels.

After that, I could see the logs and notice what has happened:

System.Net Error: 0 : [22720] Exception in SmtpClient#58328727::Send - The SMTP server requires a secure connection or the client was not authenticated. The server response was: 5.5.1 Authentication Required. Learn more at.
   at System.Net.Mail.MailCommand.CheckResponse(SmtpStatusCode statusCode, String response)
   at System.Net.Mail.MailCommand.Send(SmtpConnection conn, Byte[] command, MailAddress from, Boolean allowUnicode)
   at System.Net.Mail.SmtpTransport.SendMail(MailAddress sender, MailAddressCollection recipients, String deliveryNotify, Boolean allowUnicode, SmtpFailedRecipientException& exception)
   at System.Net.Mail.SmtpClient.Send(MailMessage message)
System.Net.Sockets Verbose: 0 : [22720] Entering Socket#24827179::Dispose()
System.Net Error: 0 : [22720] Exception in AppDomain#41560081::UnhandledExceptionHandler - The SMTP server requires a secure connection or the client was not authenticated. The server response was: 5.5.1 Authentication Required. Learn more at.
   at System.Net.Mail.MailCommand.CheckResponse(SmtpStatusCode statusCode, String response)
   at System.Net.Mail.MailCommand.Send(SmtpConnection conn, Byte[] command, MailAddress from, Boolean allowUnicode)
   at System.Net.Mail.SmtpTransport.SendMail(MailAddress sender, MailAddressCollection recipients, String deliveryNotify, Boolean allowUnicode, SmtpFailedRecipientException& exception)
   at System.Net.Mail.SmtpClient.Send(MailMessage message)
   at HtmlEmail.Program.Main(String[] args) in D:\Documentos\Artigos\Artigos\CSharp\HtmlEmail\HtmlEmail\HtmlEmail\Program.cs:line 41

Just a change in the password!

In the last post, I’ve shown how you can use the Microsoft libraries to parse your SQL Server code in C#. You can use the same parser to reformat your code, but that would be a lot of work: parse the code, check all structures and reformat the code according to your needs.

But there is an easier way: the SqlScriptGenerator class. This class makes a breeze to get your unformatted code and reformat it according to your options. In this article, we will develop a WPF program that takes a SQL Code snippet typed in TextBox and reformats it using the options you give.

Create a WPF program and name it SqlReformatter. In the solution explorer, add the NuGet package Microsoft.SqlServer.TransactSql.ScriptDom. In the XAML file for the Main page, add this code:

<Window x:Class="SqlReformatter.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        mc:Ignorable="d"
        Title="MainWindow" Height="700" Width="1024">
    <Grid>
        <Grid.ColumnDefinitions>
            <ColumnDefinition Width="200"/>
            <ColumnDefinition Width="*"/>
        </Grid.ColumnDefinitions>
        <StackPanel Grid.Column="0" Margin="5" x:Name="stackChecks">
            <CheckBox Content="AlignClauseBodies" Click="OptionClick"/>
            <CheckBox Content="AlignColumnDefinitionFields" Click="OptionClick"/>
            <CheckBox Content="AlignSetClauseItem" Click="OptionClick"/>
            <CheckBox Content="AsKeywordOnOwnLine" Click="OptionClick"/>
            <CheckBox Content="IncludeSemicolons" Click="OptionClick"/>
            <CheckBox Content="IndentSetClause" Click="OptionClick"/>
            <CheckBox Content="IndentViewBody" Click="OptionClick"/>
            <CheckBox Content="MultilineInsertSourcesList" Click="OptionClick"/>
            <CheckBox Content="MultilineInsertTargetsList" Click="OptionClick"/>
            <CheckBox Content="MultilineSelectElementsList" Click="OptionClick"/>
            <CheckBox Content="MultilineSetClauseItems" Click="OptionClick"/>
            <CheckBox Content="MultilineViewColumnsList" Click="OptionClick"/>
            <CheckBox Content="MultilineWherePredicatesList" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeCloseParenthesisInMultilineList" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeFromClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeGroupByClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeHavingClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeJoinClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeOffsetClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeOpenParenthesisInMultilineList" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeOrderByClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeOutputClause" Click="OptionClick"/>
            <CheckBox Content="NewLineBeforeWhereClause" Click="OptionClick"/>
            <StackPanel Margin="0,5">
                <TextBlock Text="KeywordCasing"/>
                <ComboBox SelectionChanged="CaseChanged" x:Name="cbxCase">
                    <ComboBoxItem>LowerCase</ComboBoxItem>
                    <ComboBoxItem>UpperCase</ComboBoxItem>
                    <ComboBoxItem>PascalCase</ComboBoxItem>
                </ComboBox>
            </StackPanel>
            <StackPanel Orientation="Horizontal" Margin="0,5">
                <TextBlock Text="IndentationSize"/>
                <TextBox Width="60" Margin="5,0" TextChanged="IndentChanged" x:Name="txtIdent"/>
            </StackPanel>
        </StackPanel>
        <Button VerticalAlignment="Bottom" HorizontalAlignment="Right" Grid.Column="0"
                Margin="5" Content="Reformat" Click="ReformatSqlClick"/>
        <Grid Grid.Column="1">
            <Grid.RowDefinitions>
                <RowDefinition Height="*"/>
                <RowDefinition Height="*"/>
            </Grid.RowDefinitions>
            <TextBox x:Name="SourceBox" Margin="5" Grid.Row="0"/>
            <TextBox x:Name="DestBox" Margin="5" Grid.Row="1" IsReadOnly="True"/>
        </Grid>
    </Grid>
</Window>

As you can see, we are adding at the right a set of Checkboxes, each one with one property of the SqlScriptOptions class. When we check each box, we will change the reformatting options. To do that, we must use this code:

private void OptionClick(object sender, RoutedEventArgs e)
{
    if (sender is CheckBox checkBox)
    {
        var option = checkBox.Content.ToString();
        PropertyInfo pinfo = typeof(SqlScriptGeneratorOptions).GetProperty(option);
        pinfo?.SetValue(_options, checkBox.IsChecked == true);
    }
}

In this case, we are using reflection to get the property corresponding to the clicked Checkbox and we set its value according to the IsChecked value. We must also add two event handlers, for the case combobox and for the indent textbox:

private void CaseChanged(object sender, SelectionChangedEventArgs e)
{
    var selectedCase = (sender as ComboBox)?.SelectedIndex;
    if (selectedCase != null)
        _options.KeywordCasing = (KeywordCasing) selectedCase;
}

private void IndentChanged(object sender, TextChangedEventArgs e)
{
    if (int.TryParse((sender as TextBox)?.Text, out int size))
        _options.IndentationSize = size;
}

Once we have this set, we need to add the code to the click of the Reformat  button:

private void ReformatSqlClick(object sender, RoutedEventArgs e)
{
    var sqlSrc = SourceBox.Text;
    if (string.IsNullOrWhiteSpace(sqlSrc))
        return;
    var processed = ParseSql(sqlSrc);
    if (processed.errors.Any())
    {
        var sb = new StringBuilder("Errors found:");
        foreach (var error in processed.errors)
        {
            sb.AppendLine($"     Line: {error.Line}  Col: {error.Column}: {error.Message}");
        }
    }
    else
    {
        var scriptGenerator = new Sql150ScriptGenerator(_options);
        scriptGenerator.GenerateScript(processed.sqlTree, out string sqlDst);
        DestBox.Text = sqlDst;
    }
}

We parse the code in the source box, and if there are any errors, we show them in the destination box. If there are no errors, the code is reformatted according to the selected options. The ParseSql method is similar to the one shown in the last article:

private static (TSqlFragment sqlTree, IList<ParseError> errors) ParseSql(string procText)
{
    var parser = new TSql150Parser(true);
    using (var textReader = new StringReader(procText))
    {
        var sqlTree = parser.Parse(textReader, out var errors);

        return (sqlTree, errors);
    }
}

Now, we only need to initialize the UI in the beginning, so all the options are up-to-date with the SqlScriptGeneratorOptions instance:

public MainWindow()
{
    InitializeComponent();
    _options = new SqlScriptGeneratorOptions();
    txtIdent.Text = _options.IndentationSize.ToString();
    cbxCase.SelectedIndex = (int) _options.KeywordCasing;
    foreach (var child in stackChecks.Children)
    {
        if (child is CheckBox check)
        {
            var checkContent = check.Content.ToString();
            PropertyInfo pinfo = typeof(SqlScriptGeneratorOptions).GetProperty(checkContent);
            check.IsChecked = (bool?)pinfo?.GetValue(_options) == true;
        }

    }
}

This code also uses reflection to get the property values and fill the data in the boxes. When you run this program, you have something like this:

You can change any options and click the Reformat button and the code will be reformatted accordingly.

Conclusions

As you can see, with very little code you can create a SQL Reformatter to reformat your code and make sure that it agrees with your standards. This formatter has many options and can be used also to parse Sql source code.

All the source code for this article is at https://github.com/bsonnino/SqlReformatter

In the last post, I’ve shown how to parse SQL Server code with C# and get all tokens in it, showing their types. This is very nice, there is a lot you can do with that, but there is a pitfall: you don’t have the context the token was used. For example, you have an identifier, but you don’t know if it is a parameter or a variable declared in the procedure. You don’t know its type and where it is used.

Sometimes you need more information about the token and you can’t get it unless you analyze the code. Fortunately, Microsoft has already done that for us and provided a wonderful tool, so we can get the context of the tokens: you can create a new class that inherits from TSqlFragmentVisitor and override its methods, to visit the node types you want (if this doesn’t make sense to you right now, keep on reading, you’ll see it’s very simple).

The TSqlFragmentVisitor class is a massive class that uses the Visitor Pattern to analyze the SQL tree and visit all the nodes, so you can take action on the nodes that come to your attention. If you take a look at it, you will see that there are a huge number of Visit and ExplicitVisit overrrides, one for each kind of node you can have in a SQL Server procedure. Using Visit is similar to ExplicitVisit, with the difference that with ExplicitVisit you can control if the children nodes are also visited.

To use it, you must parse the tree in the same way you did in the last article and then create a visitor instance and pass it in the Accept method. This code shows how this is done:

static void Main(string[] args)
{
    var sql = "Select * from customer";
    var parsed = ParseSql(sql);
    if (parsed.errors.Any())
        return;
    var visitor = new SelectVisitor();
    parsed.sqlTree.Accept(visitor);
    Console.ReadLine();
}

private static (TSqlFragment sqlTree, IList<ParseError> errors) ParseSql(string procText)
{
    var parser = new TSql150Parser(true);
    using (var textReader = new StringReader(procText))
    {
        var sqlTree = parser.Parse(textReader, out var errors);

        return (sqlTree, errors);
    }
}

The visitor class is something like this:

internal class SelectVisitor : TSqlFragmentVisitor
{
    public override void Visit(SelectStatement node)
    {
        Console.WriteLine($"Visiting Select: {node}");
    }

    public override void Visit(QueryExpression node)
    {
        Console.WriteLine($"Visiting QueryExpression: {node}");
    }

    public override void Visit(QuerySpecification node)
    {
        Console.WriteLine($"Visiting QuerySpecification: {node}");
    }

    public override void Visit(SelectStarExpression node)
    {
        Console.WriteLine($"Visiting SelectStarExpression: {node}");
    }
}

Running this code, you will get something like this:

As you can see, the node and its children are visited and we have no control on that. If you want to have some control, you must use the ExplicitVisit:

internal class SelectVisitor : TSqlFragmentVisitor
{
    public override void ExplicitVisit(SelectStatement node)
    {
        Console.WriteLine($"Visiting Select: {node}");
    }

    public override void Visit(QueryExpression node)
    {
        Console.WriteLine($"Visiting QueryExpression: {node}");
    }

    public override void Visit(QuerySpecification node)
    {
        Console.WriteLine($"Visiting QuerySpecification: {node}");
    }

    public override void Visit(SelectStarExpression node)
    {
        Console.WriteLine($"Visiting SelectStarExpression: {node}");
    }
}

As you can see, the child nodes are not visited. To visit them, you must call the base method, like this:

internal class SelectVisitor : TSqlFragmentVisitor
{
    public override void ExplicitVisit(SelectStatement node)
    {
        Console.WriteLine($"Visiting Select: {node}");
        base.ExplicitVisit(node);
    }

    public override void ExplicitVisit(QueryExpression node)
    {
        Console.WriteLine($"Visiting QueryExpression: {node}");
    }

    public override void ExplicitVisit(QuerySpecification node)
    {
        Console.WriteLine($"Visiting QuerySpecification: {node}");
    }

    public override void ExplicitVisit(SelectStarExpression node)
    {
        Console.WriteLine($"Visiting SelectStarExpression: {node}");
    }
}

Nice, no? With this simple code, you can write something to check if any procedure in your database has a “select *”, which can be flagged as an error. Just override the Visit for the SelectStarExpression and flag the error if it’s visited.

Generating statistics from the database procedures

With this knowledge, we can write a program that generates statistics from the database procedures. We want to know how many inserts, deletes and updates there are in the procedures and what are the tables used with these commands. We also can know how many tables are created and dropped in the procedures.

For that, create a new console application and name it DataStats Add the NuGet package Microsoft.SqlServer.TransactSql.ScriptDom and add this code to Program.cs:

     static void Main(string[] args)
     {
         using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
         {
             con.Open();
             var procTexts = GetStoredProcedures(con)
                 .Select(n => new { ProcName = n, ProcText = GetProcText(con, n) })
                 .ToList();
             var procTrees = procTexts.Select(p =>
             {
                 var processed = ParseSql(p.ProcText);
                 var visitor = new StatsVisitor();
                 if (!processed.errors.Any())
                     processed.sqlTree.Accept(visitor);
                 return new { p.ProcName, processed.sqlTree, processed.errors, visitor };
             }).ToList();
             foreach (var procTree in procTrees)
             {
                 Console.WriteLine(procTree.ProcName);
                 if (procTree.errors.Any())
                 {
                     Console.WriteLine("   Errors found:");
                     foreach (var error in procTree.errors)
                     {
                         Console.WriteLine($"     Line: {error.Line}  Col: {error.Column}: {error.Message}");
                     }
                 }
                 else
                 {
                     var visitor = procTree.visitor;
                     Console.WriteLine($"  Inserts: {visitor.Inserts}");
                     foreach (var table in visitor.InsertTables)
                     {
                         Console.WriteLine($"      {table}");
                     }

                     Console.WriteLine($"  Updates: {visitor.Updates}");
                     foreach (var table in visitor.UpdateTables)
                     {
                         Console.WriteLine($"      {table}");
                     }

                     Console.WriteLine($"  Deletes: {visitor.Deletes}");
                     foreach (var table in visitor.DeleteTables)
                     {
                         Console.WriteLine($"      {table}");
                     }

                     Console.WriteLine($"  Creates: {visitor.Creates}");
                     foreach (var table in visitor.CreateTables)
                     {
                         Console.WriteLine($"      {table}");
                     }

                     Console.WriteLine($"  Drops: {visitor.Drops}");
                     foreach (var table in visitor.DropTables)
                     {
                         Console.WriteLine($"      {table}");
                     }
                 }
             }
         }

         Console.ReadLine();
     }

     private static List<string> GetStoredProcedures(SqlConnection con)
     {
         using (SqlCommand sqlCommand = new SqlCommand(
             "select s.name+'.'+p.name as name from sys.procedures p " + 
             "inner join sys.schemas s on p.schema_id = s.schema_id order by name",
             con))
         {
             using (DataTable procs = new DataTable())
             {
                 procs.Load(sqlCommand.ExecuteReader());
                 return procs.Rows.OfType<DataRow>().Select(r => r.Field<String>("name")).ToList();
             }
         }
     }

     private static string GetProcText(SqlConnection con, string procName)
     {
         using (SqlCommand sqlCommand = new SqlCommand("sys.sp_helpText", con)
         {
             CommandType = CommandType.StoredProcedure
         })
         {
             sqlCommand.Parameters.AddWithValue("@objname", procName);
             using (var proc = new DataTable())
             {
                 try
                 {
                     proc.Load(sqlCommand.ExecuteReader());
                     return string.Join("", proc.Rows.OfType<DataRow>().Select(r => r.Field<string>("Text")));
                 }
                 catch (SqlException)
                 {
                     return null;
                 }
             }
         }
     }

     private static (TSqlFragment sqlTree, IList<ParseError> errors) ParseSql(string procText)
     {
         var parser = new TSql150Parser(true);
         using (var textReader = new StringReader(procText))
         {
             var sqlTree = parser.Parse(textReader, out var errors);

             return (sqlTree, errors);
         }
     }
 }

This code is very similar to the one in the last post, but it will create a visitor and will call the Accept method to visit its nodes. The visitor’s code is:

internal class StatsVisitor : TSqlFragmentVisitor
{
    public int Inserts { get; private set; }
    public int Updates { get; private set; }
    public int Deletes { get; private set; }
    public int Creates { get; private set; }
    public int Drops { get; private set; }
    public List<string> InsertTables { get; }
    public List<string> UpdateTables { get; }
    public List<string> DeleteTables { get; }
    public List<string> CreateTables { get; }
    public List<string> DropTables { get; }

    public StatsVisitor()
    {
        InsertTables = new List<string>();
        UpdateTables = new List<string>();
        DeleteTables = new List<string>();
        CreateTables = new List<string>();
        DropTables = new List<string>();
    }

    public override void Visit(InsertStatement node)
    {
        Inserts++;
        InsertTables.Add((node.InsertSpecification.Target as NamedTableReference)?.
            SchemaObject.BaseIdentifier.Value);
    }

    public override void Visit(UpdateStatement node)
    {
        Updates++;
        UpdateTables.Add((node.UpdateSpecification.Target as NamedTableReference)?.
            SchemaObject.BaseIdentifier.Value);
    }

    public override void Visit(DeleteStatement node)
    {
        Deletes++;
        UpdateTables.Add((node.DeleteSpecification.Target as NamedTableReference)?.
            SchemaObject.BaseIdentifier.Value);
    }

    public override void Visit(CreateTableStatement node)
    {
        Creates++;
        CreateTables.Add(node.SchemaObjectName.BaseIdentifier.Value);
    }

    public override void Visit(DropTableStatement node)
    {
        Drops++;
        DropTables.AddRange(node.Objects.Select(o => o.BaseIdentifier.Value));
    }

}

When there is an Insert, Delete, Update, Create Table or Drop Table, the corresponding method will be called and the properties referring to the node will be updated. At the end you will have the statistics for all the procedures in the database:

There is just a minor glitch in this code: as you can see in the Inserts, there is a blank line. That is due to the fact that the insert target isn’t a table, but a variable. This code solves the issue:

public override void Visit(InsertStatement node)
{
    Inserts++;
    var targetName = (node.InsertSpecification.Target as NamedTableReference)?.
        SchemaObject.BaseIdentifier.Value ??
      (node.InsertSpecification.Target as VariableTableReference)?.Variable.Name;
    InsertTables.Add(targetName);
}

Conclusions

As you can see, Microsoft has put a lot of work in this parser, you can have a lot of power when working with SQL Server code in C# with this parser. The options are endless: check quality of the code, refactor code, make statistics, and so on. In the next post we will see a simple way to reformat your SQL Server code, so you can ensure that the code follows your standards. See you then.

The code for this article is in https://github.com/bsonnino/SqlParsing

When you are using SQL Server stored procedures, you get a lot of flexibility in your code – changing the code doesn’t need a recompile, deploying the new or updated procedures is as easy as sending and running text scripts, and so on. I won’t enter on the pros and cons of keeping code in the database, as this will open a very long discussion, with no winners.

In this article, I will focus on one aspect of the SQL Server code, the lack of rigid rules for the code. It all starts with being case insensitive, something like “SELECT * from customer” is the same of “select * FROM customer”. It’s not a great deal, but when you have a lot of stored procedures and many people maintaining it, you may need to enforce some standards. In this article I will show you how to use C# to parse SQL Server code and enforce some coding standards in your SQL Server procedures.

Retrieving the stored procedures from the database

The first step for processing the stored procedures is to retrieve them from the database. For this article, we will be using the WorldWideImporters sample database, that can be downloaded from https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0.

To get the names of all procedures in the database, you must use a command like this one:

select s.name+'.'+p.name as name from sys.procedures p 
inner join sys.schemas s on p.schema_id = s.schema_id 
order by name

Once you have the names of all the procedures in the database, you can get their code with the procedure sys.sp_helpText, passing the procedure name as a parameter. This code will get all procedures from the database and store them in a list:

static void Main()
{
	using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
	{
		con.Open();
		var procTexts = GetStoredProcedures(con).Select(n => new {ProcName = n, ProcText = GetProcText(con, n)}).ToList();
	}
}

private static List<string> GetStoredProcedures(SqlConnection con)
{
	using (SqlCommand sqlCommand = new SqlCommand("select s.name+'.'+p.name as name from sys.procedures p " +
	  "inner join sys.schemas s on p.schema_id = s.schema_id order by name", con))
	{
		using (DataTable procs = new DataTable())
		{
			procs.Load(sqlCommand.ExecuteReader());
			return procs.Rows.OfType<DataRow>().Select(r => r.Field<String>("name")).ToList();
		}
	}
}

private static string GetProcText(SqlConnection con, string procName)
{
	using (SqlCommand sqlCommand = new SqlCommand("sys.sp_helpText", con)
	{
		CommandType = CommandType.StoredProcedure
	})
	{
		sqlCommand.Parameters.AddWithValue("@objname", procName);
		using (var proc = new DataTable())
		{
			try
			{
				proc.Load(sqlCommand.ExecuteReader());
				return string.Join("", proc.Rows.OfType<DataRow>().Select(r => r.Field<string>("Text")));
			}
			catch (SqlException)
			{
				return null;
			}
		}
	}
}

This code will open a connection to the database, get all procedure names and then get their text, returning a list of an anonymous class with two members: ProcName and ProcText. You can then process all the procedures in the database.

Parsing the procedures

The next step is to parse the procedures to split into the containing parts. This is not an easy task to make from scratch. The SQL Server language is not easy and evolves with each version of SQL Server. There are many ways to do the same thing, so writing a parser is not something I would recommend. Fortunately, Microsoft has already written a parser for us, and it’s made available via NuGet package. To use it, just add the Microsoft.SqlServer.TransactSql.ScriptDom package to your app. When you add the NuGet package to the app you will see it’s not an official package: this dll is part of a larger package, Microsoft.SqlServer.DacFx.x64 and it’s there only to make the number of dlls added smaller.

Once you add the package, you can use this code to parse the code of the procedure:

private static (TSqlFragment sqlTree, IList<ParseError> errors) ParseSql(string procText)
{
	var parser = new TSql150Parser(true);
	using (var textReader = new StringReader(procText))
	{
		var sqlTree = parser.Parse(textReader, out var errors);

		return (sqlTree, errors);
	}
}

This method uses a new feature of C# 7.0, ValueTuples. If you are using .NET version 4.7 or higher, you don’t need to add anything to use it. If you are using an older version, you must add the System.ValueTuple NuGet package to your project. At the beginning, we instantiate a TSql150Parser class. Depending on the version of SQL server that you want to parse, you can choose a different parser class. Then, we call the Parse method, that will parse the code, return any errors in the errors variable and return a TSqlFragment with the parsed data. Just with this simple code, you already have many benefits: you can parse the procedure for many versions of SQL Server, check if the procedure has errors in it and parse the procedure into its components. The returned tree is of type TSqlScript, that has some interesting properties:

  • Batches – gets the batches in the procedure, so you can process them individually
  • FirstTokenIndex – shows where the fragment begins – for the script, it usually starts at 0, but when you are working with other kinds of fragments, that may not be the case
  • FragmentLength – size of the fragment, in bytes
  • LastTokenIndex – shows where the fragment ends
  • ScriptTokenStream – the list of parsed tokens
  • StartColumn – start column of the fragment
  • StartLine – start line of the fragment
  • Offset – offset in bytes from the start

As you can see, there is already a lot of information there. When you take a look at the ScriptTokenStream, you can see a lot more there:

The SQL code is parsed into tokens and every token is categorized. With this code, you can group the tokens and have brief statistics of your code:

static void Main()
{
	using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
	{
		con.Open();
		var procTexts = GetStoredProcedures(con).Select(n => 
		  new { ProcName = n, ProcText = GetProcText(con, n) }).ToList();
		var procTrees = procTexts.Select(p =>
		{
			var processed = ParseSql(p.ProcText);
			return new { p.ProcName, processed.sqlTree, processed.errors };
		});
		var procStats = procTrees
		  .SelectMany(p => p.sqlTree.ScriptTokenStream)
		  .GroupBy(t => t.TokenType)
		  .Select(g => new { Key = g.Key, Count = g.Count()})
		  .OrderByDescending(g => g.Count);
	}
}

With this code, we can see something like this in the database:

WhiteSpace              15693
Identifier               3571
Variable                 1976
Plus                     1667
UnicodeStringLiteral     1192
Dot                      1085
Semicolon                1051
Comma                     957
LeftParenthesis           937
RightParenthesis          937

We can see the variable names used with this code:

static void Main()
{
	using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
	{
		con.Open();
		var procTexts = GetStoredProcedures(con).Select(n => 
		  new { ProcName = n, ProcText = GetProcText(con, n) }).ToList();
		var procTrees = procTexts.Select(p =>
		{
			var processed = ParseSql(p.ProcText);
			return new { p.ProcName, processed.sqlTree, processed.errors };
		});
		var procVariables = procTrees
		  .SelectMany(p => p.sqlTree.ScriptTokenStream)
		  .Where(t => t.TokenType == TSqlTokenType.Variable)
		  .GroupBy(t => t.Text)
		  .Select(g => g.Key)
		  .OrderBy(t => t);
	}
}

If you run this code, you will see that, although there are 1976 tokens of type Variable, there are only 76 different names used.One thing that called me my attention is that there are two similar variables used, @Crlf and @CrLf. We can use this parsing to know in which procedures these variables are used. Note that this is different than a text search: a text search could find the variables inside a comment or in a string. In this case, we will only find the real SQL variables:

static void Main()
{
	using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
	{
		con.Open();
		var procTexts = GetStoredProcedures(con).Select(n => new { ProcName = n, ProcText = GetProcText(con, n) }).ToList();
		var procTrees = procTexts.Select(p =>
		{
			var processed = ParseSql(p.ProcText);
			return new { p.ProcName, processed.sqlTree, processed.errors };
		});
		Func<TSqlParserToken, bool> queryVariables = t => t.TokenType == TSqlTokenType.Variable
			  && (t.Text == "@CrLf" || t.Text == "@Crlf");
		var variablesFound = procTrees
		  .Where(p => p.sqlTree.ScriptTokenStream.Any(queryVariables))
		   .Select(t => new { t.ProcName, FoundVariables = t.sqlTree.ScriptTokenStream.Where(queryVariables)});
	}
}

If you run this code, you will see that most of the variables is written as @CrLf, while some are written as @Crlf. We will change all variables to @CrLf with this code:

static void Main()
{
	using (var con = new SqlConnection("Server=.;Database=WideWorldImporters;Trusted_Connection=True;"))
	{
		con.Open();
		var procTexts = GetStoredProcedures(con).Select(n => new { ProcName = n, ProcText = GetProcText(con, n) }).ToList();
		var procTrees = procTexts.Select(p =>
		{
			var processed = ParseSql(p.ProcText);
			return new { p.ProcName, processed.sqlTree, processed.errors };
		});
		var variablesFound = procTrees
		  .Where(p => p.sqlTree.ScriptTokenStream.Any(queryVariables))
		   .Select(t => new { t.ProcName, FoundVariables = t.sqlTree.ScriptTokenStream.Where(queryVariables)});
		foreach (var procName in variablesFound.Select(p => p.ProcName))
		{
			var sqlTree = procTrees.First(p => p.ProcName == procName).sqlTree;
			foreach (var token in sqlTree.ScriptTokenStream.Where(t => t.Text == "@Crlf"))
			{
				token.Text = "@CrLf";
			}
			var changedSql = string.Join("", sqlTree.ScriptTokenStream.Select(t => t.Text).ToArray());
		}

	}
}

ChangedSql has the changed SQL, that can be applied to the database, making the variable names uniform.

Conclusions

As you can see, you can do a lot with this parser, like have statistics, change variable names and so on. But still we haven’t scratched the surface of what can be done with the parser. We will see more in the next article. See you then!