[PowerShell] How to Join Data in faster way?

Last week, one of Hong Kong ex-MVP asked a question to our ex-MVPs private group. He has a PowerShell script and 2 csv data files. Both csv data file has a common column and he could like to join them and combine into a single csv data file. his code  like this,

Get csv1 | ForEach-Object -Parallel Where-Object $_.id -match $csv2.id

He complaints that it will take a very long time to handle large amount of data. And there are some comments,

  • change from -match to -eq
  • try to use hash
  • something about add-member
    …etc

From me, I just say, why not do it with LINQ? So I try to write 2 PowerShell scripts, one will be using traditional PowerShell command and another one will be using LINQ.

First, in my testing code, I will try to create 2 arrays and then generate some data into it. Here is my code,

#Create empty arrays
$datasetA = @()
$datasetB = @()
#Initialize "status" arrays to pull random values from
$genderArray = @('Male','Female','Decline to Answer')
#Loop 10 times to populate our separate datasets
1..10 | Foreach-Object {
    #Set the name with the current iteration attached
    $thisName = "Person_$_" 
    #Create an object with the name property and a random gender
    $rowA = @{
        Name = $thisName
        Gender = $genderArray[(Get-Random -Minimum 0 -Maximum 3)]
    }
    $datasetA += New-Object -Type PSObject -Property $rowA
    #Create a second object with the same name and a random age
    $rowB = @{
        Name = $thisName
        Age = Get-Random -Minimum 1 -Maximum 120
    }
    $datasetB += New-Object -Type PSObject -Property $rowB
}

It will create 2 array, each contains 10 rows. The common column will be the Name column.  One array contains a Gender column, another array contains Age column. Simple data, right?

In traditional PowerShell, we might iterate through one of the arrays while doing a filter on the second array. And then either add property members to the first array or create a new objects with a combination of properties from both arrays. My code then will be something like this:

$joinedDataset =@()
foreach($rowA in $datasetA) {
    $rowB = $datasetB | Where-Object Name -eq $rowA.Name
    $joinedRow = @{
        Name = $rowA.Name
        Gender = $rowA.Gender
        Age = $rowB.Age
    }
    $joinedDataset += New-Object -Type PSObject -Property $joinedRow
}

It works fine and it only takes 32.6 milliseconds to complete. But if I increase the iterate number from 10 to 100, it then takes 174.1 milliseconds. And then 7025.7 milliseconds for 1000 records, 638429.8 milliseconds (10 min 38 sec) for 10000 records.
(P.S., it include the time it take to generate iteration arrays)

Now, let’s do the same with using LINQ:

$linqJoinedDataset = [System.Linq.Enumerable]::Join(
    $datasetA,
    $datasetB,
    [System.Func[Object,string]] {param ($x);$x.Name},
    [System.Func[Object,string]]{param ($y);$y.Name},
    [System.Func[Object,Object,Object]]{
        param ($x,$y);
        New-Object -TypeName PSObject -Property @{
        Name = $x.Name;
        Gender = $x.Gender;
        Age = $y.Age}
    }
)

$OutputArray = [System.Linq.Enumerable]::ToArray($linqJoinedDataset)

The time it takes will be as following,

  • 37.8 milliseconds for 10 records
  • 79.8 milliseconds for 100 records
  • 809.4 milliseconds for 1000 records
  • 16026.2 milliseconds for 10000 records

In this code, I am calling the Join method on System.Linq.Enumerable and then passing it with 5 parameters,

  1. The first dataset that it is going to join
  2. The second dataset that it joins
  3. The delegate which defines the key to compare against on the first dataset
  4. The delegate which defines the key to compare against on the second dataset
  5. Finally, we pass in the delegate which defines what the output should be

Because the Join method is a Deferred Execution. it is not actually joining the data at that time. It is just building an expression tree which defines the relational algebra needed to perform the join. So the above example code, I will need to call “ToArray()”, so that the execution could start. If you have any further codes that you will handle with the result data, you could skip the “ToArray()” code. For example, calling like this,

$linqJoinedDataset.Where({($_.Age -gt 20) -and ($_.Gender -eq "Male")})

Then the join query would execute at that time and then “Where()” would filter down to just the objects that matching the conditions.

Here is the table on the execution time on each way,

# iterate Traditional PS LINQ
10 32.6 37.8
100 174.1 79.8
1000 7025.7 809.4
1000 638429.8 16026.2

So I can say, LINQ saved more than 40 times of timing to complete the join. With this figure, I am sure that you will love to update and use LINQ if you need a join operation on data.

Hope this blog could help you to get a faster execution script in PowerShell.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.