Can you optimize this code ? (Is GoTo evil ?)

yesterday I posted a bit of code that was checkign to see if a string was whitespace only.  Let’s assume the assumption I made was correct, that checkign the middle of the string then spreading out in each direction is the best way of doing that. 


 


So the code I worte yesterday had some branching “issues”.  As I wrote it, I really felt I was missing something obvious… you know that, “this just doesn’t feel right” kind of feeling.  To do the branching I used a boolean flag, hasNonWhitespace.  The only alternative I could think of was to use a GoTo statement.  So here’s the first cut, and also a second version using GoTo.   Is GoTo evil ?  Can you optimize this code some other way ????


 


Public Function IsWhitespaceOnly(ByVal stringToCheck As String, ByVal ParamArray whitespaceChars() As Char) As Boolean
   Dim length As Int32 = stringToCheck.Length
   Dim middle As Int32 = length \ 2
   Dim first As Int32 = middle
   Dim second As Int32 = middle + 1
   Dim chr As Char

   If whitespaceChars Is Nothing OrElse whitespaceChars.Length = 0 Then
      whitespaceChars = New Char() {” “c, ChrW(&H3000)}
   End If


   Dim whitespaceUbound As Int32 = whitespaceChars.Length – 1
   Dim hasNonWhitespace As Boolean = False


   For i As Int32 = 0 To middle

      If first >= 0 Then
         hasNonWhitespace = True        
         chr = stringToCheck.Chars(first)
         For j As Int32 = 0 To whitespaceUbound
            If chr = whitespaceChars(j) Then
               hasNonWhitespace = False
               Exit For
            End If
         Next
         first -= 1
      End If


      If hasNonWhitespace Then Return False



      If second < length Then
         hasNonWhitespace = True
         chr = stringToCheck.Chars(second)
         For j As Int32 = 0 To whitespaceUbound
            If chr = whitespaceChars(j) Then
               hasNonWhitespace = False
               Exit For
            End If
         Next
         second += 1
      End If


      If hasNonWhitespace Then Return False


   Next


   Return Not hasNonWhitespace


End Function


 


‘ using GoTo



   Public Function IsWhitespaceOnly(ByVal stringToCheck As String, ByVal ParamArray whitespaceChars() As Char) As Boolean


      Dim length As Int32 = stringToCheck.Length
      Dim middle As Int32 = length \ 2
      Dim first As Int32 = middle
      Dim second As Int32 = middle + 1


      Dim chr As Char


      If whitespaceChars Is Nothing OrElse whitespaceChars.Length = 0 Then
         whitespaceChars = New Char() {” “c, ChrW(&H3000)}
      End If


      Dim whitespaceUbound As Int32 = whitespaceChars.Length – 1



      For i As Int32 = 0 To middle


         If first >= 0 Then
            chr = stringToCheck.Chars(first)
               For j As Int32 = 0 To whitespaceUbound
                  If chr = whitespaceChars(j) Then
                     first -= 1
                     GoTo secondpart
                  End If
               Next
               Return False
         End If



secondpart:
         If second < length Then


            chr = stringToCheck.Chars(second)
            For j As Int32 = 0 To whitespaceUbound
               If chr = whitespaceChars(j) Then
                  second += 1
                  GoTo nextpart
               End If
            Next
            Return False
         End If


nextpart:
      Next


      Return True


   End Function


 


 



21 Comments so far

  1.   Geoff Appleby on June 6th, 2005          

    I don’t see GoTo’s as evil, so long as you have a damn good reason for using them.

    And I’ve tried. Really tried. Really hard.

    I can’t come up with anything that runs faster.

    I’m still trying tho smile

  2.   Bill McCarthy on June 6th, 2005          

    Hey Geoff,

    I think you’ve hit the nail on the head … it’s not that they are evil, more it’s the way that we use them that can be. In complicated code, or large methods I think GoTo should be avoided. With complicated code they can actually make it hard to restructure, and can easily lend themselves to spaghetti style code. In large methods, they tend to be used instead of breaking the method out into smaller methods. But in this kind of code where the method is pretty simple, unlikely to need major changes so I am thinking that maybe goto is not so bad in this case. What I found interesting though was my instinct was to avoid using GoTo, rather use a flag, or look at even using While True blocks etc.

    BTW: I’m glad you haven’t been able to optimize it Makes me feel happier about that choice.

  3.   Matthew Douglass on June 7th, 2005          

    I’m going to explain my thoughts first, then show the code since I wrote it without really knowing VB.NET and without a compiler:

    To me the first problem with your function is the duplication of code where you check if the first character is whitespace and where you check if the second character is whitespace. If you pull that out into its own function that simplies the code and removes the need for the goto:

    Please excuse any obvious typos/mistakes.

    Public Function IsWhitespaceCharacter(ByVal charToCheck As Char, ByVal ParamArray whitespaceChars() As Char) As Boolean

    Dim whitespaceUbound As Int32 = whitespaceChars.Length – 1

    For j As Int32 = 0 To whitespaceUbound

    If charToCheck = whitespaceChars(j) Then

    Return True

    End If

    Next

    Return False

    End Function

    Public Function IsWhitespaceOnly(ByVal stringToCheck As String, ByVal ParamArray whitespaceChars() As Char) As Boolean

    Dim length As Int32 = stringToCheck.Length

    Dim middle As Int32 = length \ 2

    Dim first As Int32 = middle

    Dim second As Int32 = middle + 1

    If whitespaceChars Is Nothing OrElse whitespaceChars.Length = 0 Then

    whitespaceChars = New Char() {" "c, ChrW(&H3000)}

    End If

    For i As Int32 = 0 To middle

    If first >= 0 Then

    If Not IsWhitespaceCharacter(stringToCheck.Chars(first), whitespaceChars) Return False;

    first -= 1

    End If

    If second < length Then

    If Not IsWhitespaceCharacter(stringToCheck.Chars(second), whitespaceChars) Return False;

    second += 1

    End If

    Next

    Return True ‘ I believe a empty string is considered whitespace only, correct as necessary.

    End Function

  4.   Bill McCarthy on June 7th, 2005          

    Hey Matthew,

    The code you posted looks pretty good to me (I haven’t tested it). The problems I see with it though are increased call stack usage, and the ubbound for the character array loop is recalculated each time the IsWhitespaceCharacter function is called. So my guess is it would result in a performance penalty.

  5.   Matthew Douglass on June 7th, 2005          

    The ubound recalculated bothered me too — I almost passed it in as an extra parameter. But I thought their might be some use for the IsWhitespaceCharacter as a standalone function, so I didn’t. If you knew it was internal only, then I’d add it as a parameter at the small cost of increased coupling between the two.

    The call stack usage doesn’t bother me too much — mainly because I think there is a decent chance it would get inlined. I’d love if someone who has a VB.NET compiler could do a quick perf analysis of the two.

  6.   Bill McCarthy on June 7th, 2005          

    I can’t right now, but I’ll run some tests later tonight and post back here. It sounds like Geoff may have run some tests… I actually haven’t run any yet.

    BTW: I actually think my base assumption of starting searching in the middle is probably not a good assumption. I tend to find strings generally have leading whitespace rather than trailing whitespace, but it would depend on usage. Take for exampel reading from a text file line by line, if the lines actually have the carriage returns and or line feeds on the and of them, and they are considered whitespace, the optimal may be searching from "near" the end but not at the end.

  7.   Geoff Appleby on June 7th, 2005          

    Bill, no, i think your right in starting from the middle.

    Remember, the way the mothod is structured is to find one single non-whitespace char ASAP, and bail out immediately if it does. If whitespace mostly appears on the left or right, it doesn’t matter. most of the time (and we can only talk probabilities here) a non-whitespace char is highly likely to be somewhere near the middle of the string, or not too far away from there.

    As for tests, yeah, I have been. I’ll run through Matthew code when I get a chance.

  8.   Bill McCarthy on June 7th, 2005          

    Hey Geoff,

    I think you’re wrong about me being right. It’s probably not a bad assumption. I think when I look at strings, there’s a reasonable probability they will start with whitespace. Chances of them having more than one consequtive whitespace char in the middle is low, but only if the ratio of test to whitespace is generally large. One example that comes to mind is say parsing an indented XML file. LEt’s assume they use "space" or similar ,not tabs. There the likelihood of the ratio of whitespace to non-whitespace on any given line is likely to be > 1:1,( depending on the depth of the object model and the type of xml output used) in many cases. So for those lines, the ideal spot to start would be near the end. Definetly not at the start, not so the middle, and maybe not the end. We could optimise this better is we knew the whitespace char list was optimised accordingly. That is if it checks for cr’s and lf’ early in the list, then it’s only going to be one or two comparisons for the ending chars, rather than the 20 ro so comparisons it can take ot declare a character as not being whitespace. Obviously the relative efficeincy depends a lot on the whitespace shar list though, whether cr and lf are considered to be whitespace, whether the whitespace lsit is large etc. But If I take the two standard lsits, VB’s 2 char one, and String’s 21 char one, then I think searching from the end in will genrally be the most efficient, as we can reduce the complexity of your code, and with the VB char set, as cr or lf is considered non whitespace, and with the string char set the cr and lf are pretty early on in the list, so the cost of those likely couple of comparisons shoudl be trivial compared to the cost of settign up to say work from the middle out.

    Okay, maybe, maybe not… it’s a theory anyway

  9.   Geoff Appleby on June 7th, 2005          

    binary search perhaps?

  10.   Geoff Appleby on June 7th, 2005          

    Here’s an interesting one for you. based on a few sample strings of random gumpf, this one seemd to deal better with longer strings…

    Protected Overrides Function IsWhitespaceOnly(ByVal stringToCheck As String, ByVal ParamArray whitespaceChars() As Char) As Boolean

    Dim length As Int32 = stringToCheck.Length

    Dim middle As Int32 = length \ 2

    Dim quarter As Int32 = middle \ 2

    Dim first As Int32 = middle

    Dim second As Int32 = middle + 1

    Dim chr As Char

    If whitespaceChars Is Nothing OrElse whitespaceChars.Length = 0 Then

    whitespaceChars = New Char() {" "c, ChrW(&H3000)}

    End If

    Dim whitespaceUbound As Int32 = whitespaceChars.Length – 1

    ‘check first char

    chr = stringToCheck.Chars(0)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo firstpart

    End If

    Next

    Return False

    firstpart:

    If length = 1 Then Return True

    ‘check last char

    chr = stringToCheck.Chars(length – 1)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo secondpart

    End If

    Next

    Return False

    secondpart:

    If length = 2 Then Return True

    ‘check middle char

    chr = stringToCheck.Chars(middle)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo thirdpart

    End If

    Next

    Return False

    thirdpart:

    If length = 3 Then Return True

    ‘check first quarter char

    chr = stringToCheck.Chars(quarter)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo sixthpart

    End If

    Next

    Return False

    sixthpart:

    ‘check third quarter char

    chr = stringToCheck.Chars(middle + quarter)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo seventhpart

    End If

    Next

    Return False

    seventhpart:

    ‘check first quarter range backwards

    For i As Int32 = quarter – 1 To 1 Step -1

    chr = stringToCheck.Chars(i)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo eightpart

    End If

    Next

    Return False

    eightpart:

    Next

    ‘check last quarter range forwards

    For i As Int32 = middle + quarter + 1 To length – 2

    chr = stringToCheck.Chars(i)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo ninepart

    End If

    Next

    Return False

    ninepart:

    Next

    ‘check second quarter range forwards

    For i As Int32 = quarter + 1 To middle – 1

    chr = stringToCheck.Chars(i)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo fourthpart

    End If

    Next

    Return False

    fourthpart:

    Next

    ‘check third quarter range backwards

    For i As Int32 = middle + quarter – 1 To middle + 1 Step -1

    chr = stringToCheck.Chars(i)

    For j As Int32 = 0 To whitespaceUbound

    If chr = whitespaceChars(j) Then

    GoTo fifthpart

    End If

    Next

    Return False

    fifthpart:

    Next

    Return True

    End Function

  11.   Bill McCarthy on June 8th, 2005          

    Geoff …. uhm… I’m speechless

  12.   Jacob Grass on June 9th, 2005          

    Char.IsWhitespace(myChar) …

    This uses a Select Case statement. Returns true if the character is

    ChrW(9)

    ChrW(10)

    ChrW(11)

    ChrW(12)

    ChrW(13)

    " "

    Or if the unicode category is 11,12,13 ..

    Probably not the most efficient, but it’s in there.

  13.   Geoff Appleby on June 9th, 2005          

    Speechless huh?

    So tell me…NOW are gotos evil? *grin* I think all it really proves is that you have to be willing to wear the penalties of anything other than walking the entire string from one end to the other. Statisitical analysis (which is what i was trying to emulate while not really doing it) might allow us to find in advance more _likely_ locations of non-whitespace chars, but it is it really worth the penalty of either slower or harder to maintain code?

  14.   Bill McCarthy on June 10th, 2005          

    hey Jacob,

    Actually that is probalby very efficient. I think I might switch to that not only becuase of that, but also it adds some consistency. Although one has to ask why Vb’s Trim, and String.Trim don’t use that ?? I wonder if they did some benchmarks on calling CharacterInfo.IsWhitespace(char) or not ? Then again maybe it’s just because String.Trim is designed to do other things. I think that may be it. I haven’t tested, but I think the CharacterInfo.Iswhitespace(char) is probalby the better choice… well it’s my bet at this moment anyway

  15.   Bill McCarthy on June 10th, 2005          

    Hey Geoff,

    uhm… speechless…. totally …

    <slap><slap> Ah, no I’m better…. yes, you have convinced me, GoTo is evil. give it an inch and someone will take it for a mile <bg>

    i’m going for the easier to maintain code. I know the following is different code, but simple is good, you’ve convinced me. (well unless you tell me the perf suxs that is)

    Public Function IsWhitespace(ByVal stringToCheck As String) As Boolean

    IsWhitespace = True

    Dim i As Int32

    If stringToCheck Is Nothing OrElse stringToCheck.Length = 0 Then

    Return True

    End If

    i = stringToCheck.Length

    While IsWhitespace = True And i > 0

    i -= 1

    IsWhitespace = Char.IsWhiteSpace(stringToCheck.Chars(i))

    End While

    Return IsWhitespace

    End Function

  16.   Jacob Grass on June 10th, 2005          

    With respect to the statistical analysis as to where one should start the inquiry, if you’re dealing with a random string, then the odds are highly in your favor that any character you start with will be non-whitespace. Though, no character has any more likelihood than any other.

    However, if you have some context about the string and its contents (i.e. it’s not random) then some analysis could be performed to increase the likelihood. Though, I’d be willing to bet that the first or last character would be the most fruitful.

  17.   Jacob Grass on June 10th, 2005          

    Hey Bill, I haven’t done any testing with this, but in my head this is a little more efficient:

    Public Function IsWhitespace(ByVal stringToCheck As String) As Boolean

    IsWhitespace = True

    Dim i As Int32

    If Not stringToCheck is Nothing Then

    i = stringToCheck.Length -1

    Do

    IsWhitespace = Char.IsWhiteSpace(stringToCheck.Chars(i))

    i-=1

    Loop Until Not IsWhitespace OrElse i < 0

    End If

    Return IsWhitespace

    End Function

  18.   Bill McCarthy on June 10th, 2005          

    Hey Jacob,

    On the code, you need to test for a zero length string, or set up your loop such that it isn’t entered in that case. Looking at mine, I could drop the explicit test for a zero length string and it should work okay, which would be about the same in code. I would do an early exit rather than nest the hwile loop… I just prefer them…. I doubt there is any performance difference between the two on that count.

    The statistical analysis of strings, I think given the predonimance of "code", html and xml, leading whitespace is very common. Given the parsing on a line by line basis is also common, then carriage returns and line feeds can be expected to be very common at the end of a string. Ideally we want to find a non whitespace character as soon as possible, so looking at the ends is actually the worst place to start. My guess is, that two or three characters in from the end is the most likely place, BUT considering we have to test all the characters if we find the string is purely whitespace, then startign from other than the ends adds extra complexity.

    anyway, if it’s raining on Sunday and I get bored on Sunday, what I’ll do is recursively parse all plain text files in( my documents or something like that) and cache all the strings on a line by line basis, and then test some code with them smile

  19.   Bill McCarthy on June 10th, 2005          

    oh btw: did you notice I used And, not AndAlso. Given that during the loop you do have to check both (only on exit of the loop can you short circuit), And is actually more efficient than the conditional branching AndAlso

  20.   Jacob Grass on June 11th, 2005          

    Yup, my fault on the zero length string. I saw that it could be taken out of yours, so I left it out of mine.

    re: Stats

    Well, what you’re talking about is the context of that which you are parsing. Parsing code/html/xml makes sense on that front to start in the middle and work your way out to the ends. Now, assume that you have some generic text file, you have no knowledge whatsoever of its contents. You randomly take a line from that file and you randomly take a contiguous sequence of characters from that line. Any character in that sequence is as likely to be (non)whitespace as any other. That’s the point I was trying to make.

    Oh, and that’s an interesting note on the short-circuit. I’ll have to keep that in mind. I have reams of code that looks like:

    Do Until blnFound OrElse i = Count

    ..

    Loop

    What’s up with your human proof, btw? It never works the first time for me.

  21.   Bill McCarthy on June 12th, 2005          

    Hey Jacob,

    At least on developer’s machines, "text" is more likely to have leading whitespace. As office moves to XML this is going to spread. For use input, leading whitespace is rere (usually only a sinlge space), but trailing whitespace can be common as it is visually not as obvious. Even then the usual I see is people cut and pasting and getitng on extra trailing whitespace.

    So i get what you mean about it any character can pobbily be whitespace or not, but I think there is a definetly a probability that the start is less likely than the end for non whitespace. It’s like the alphabet.. although any letter could be any legal char, we usually find a higher percentage of a and e, s and t compared to q or z etc. So if you had to test for each different letter, you’d be better to test for say s and t earlier on, than alphabetically.

    Re: short circuited versus non, the simple rule I use is if the second expression is not dependant on the first to be legal (eg testing for null first) then bitwise operators can be more performant than branching ones. The toehr factors are the cost of evaluating the second expression. But if you expect that to be evaluated the majority of the time anyway then that cost is the same for both.

    As ot the CAPTCHA, yeh don’t even try first time. Apparently it has a rediculous short time out. (sorry, but not my doing)