Fun with templates: parsing command line arguments

Sooner or later, everyone has to parse command line arguments into program variables. Even the most trivial command line application needs some input to tell it what to do.


C and C++ lack a unified approach to command line arguments. This leaves everyone free to do what they want, but unfortunately it also forces everybody to figure it out for themselves.


I have done this several times, and tried several approaches. My last few projects all needed moderately complex command line configurations, so I finally solved this problem in a reusable way. Originally I used the Boost libraries for this (www.boost.org), specifically the regular expressions. But now that VS2008 SP1 is out, I can use the TR1 library to do this for me.


What does a command line argument look like?


The exact syntax doesn’t matter because we are using regular expressions. What is important is that the identifier and the value are recognized by the shell to belong together.


The only way to do this is of course to make sure that they are in the same string. For example, such command line arguments can look like this:


-val:12.5
/text=blabla
-file="C:\Documents and Settings\me\desktop\article.doc"



It really doesn’t matter. As long as all the information is recognized as 1 argument, it’s fine. If the value contains spaces, you need to use quotes so that the shell will not treat


-file="C:\Documents and Settings\me\desktop\article.doc"


as 3 arguments.
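
To make the "identifier and value in one string" idea concrete, here is a small sketch that splits such an argument into its two parts. It uses std::regex from C++11 rather than the TR1 classes, and the function name SplitArg and its pattern are made up for this illustration; they are not part of the code discussed in this article.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: splits "-name:value" or "/name=value" into its parts.
// Returns false if the argument does not have that shape.
bool SplitArg(std::string const& arg, std::string& name, std::string& value)
{
  // Group 1 is the identifier, group 2 is everything after the delimiter.
  static std::regex const pattern("^[-/]([A-Za-z]+)[:=](.*)$");
  std::smatch matches;
  if (!std::regex_match(arg, matches, pattern))
    return false;
  name = matches[1].str();
  value = matches[2].str();
  return true;
}

// By the time the program sees a quoted argument, the quotes have normally
// been stripped already, so the value arrives as one string with spaces.
void SplitArgExample()
{
  std::string name, value;
  assert(SplitArg("-val:12.5", name, value));
  assert(name == "val" && value == "12.5");
  assert(SplitArg("/file=C:\\Documents and Settings\\me\\article.doc", name, value));
  assert(name == "file" && value == "C:\\Documents and Settings\\me\\article.doc");
  assert(!SplitArg("article.doc", name, value));
}
```

The rest of the article uses one regular expression per argument instead of a generic split like this, because that lets each argument validate its own value format.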


Requirements


The requirements for my solution are as follows:


  • 1 function which can take a command parameter and a regular expression, and which can parse any given type (bool, int, string, …) from the parameter, according to the specified regex.
  • The function should be able to handle string and wstring parameters.
  • It should be able to put values into regular data types, as well as into TriStateValue template types (see here http://msmvps.com/blogs/vandooren/archive/2007/10/11/fun-with-templates-implementing-a-tri-state-value.aspx for more info).
  • The function should return a bool to notify the user if the argument was parsed.
  • The part of the regex that identifies the actual value has to be captured. I.e. it has to be enclosed between ( ) symbols.

First attempt


The first version of such a function was ready fairly quickly.


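  // Requires <regex>, <sstream> and <string>.
  template <typename T>
  bool ParseArg(
    std::tr1::regex const& pattern,
    std::string const& arg,
    T& value)
  {
    // If the argument does not match the pattern, it is not meant for us.
    std::tr1::smatch matches;
    if( !std::tr1::regex_search(arg, matches, pattern))
      return false;

    // matches[1] is the captured value text; convert it via a stream.
    std::istringstream iss(matches[1].str());
    return ! (iss >> std::dec >> value).fail();
  }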


The function takes a regex and an argument. The value type is a template parameter, which allows the function to be used for any type. The type itself is deduced at compile time, i.e. the user does not have to specify the template type explicitly between angle brackets.


A string stream is used for converting the captured text into the value. This is very handy because this way, the text-to-value conversion has no dependency on the value type. By using the string stream, the function remains generic.
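
The conversion step on its own can be sketched as follows. The helper name FromText is hypothetical and only isolates the stream part of ParseArg. The same few lines convert text to an int, a double, or any other type that supports operator>>.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper showing only the conversion step of ParseArg:
// the same code converts text to any type that supports operator>>.
template <typename T>
bool FromText(std::string const& text, T& value)
{
  std::istringstream iss(text);
  return !(iss >> std::dec >> value).fail();
}

void FromTextExample()
{
  int rows = 0;
  assert(FromText("42", rows) && rows == 42);

  double val = 0.0;
  assert(FromText("12.5", val) && val == 12.5);

  int bad = 0;
  assert(!FromText("abc", bad));   // not a number, so the stream fails

  // operator>> for strings stops at the first whitespace character,
  // so only the first word of a value with spaces is extracted.
  std::string file;
  assert(FromText("C:\\My Docs\\a.txt", file) && file == "C:\\My");
}
```

The last case is worth keeping in mind for string values that contain spaces, such as file paths. A string-specific overload that copies the captured text directly would be one possible way around it.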


This function has an overload that uses a TriStateVal type for the resulting value.


  template <typename T>
  bool ParseArg(
    std::tr1::regex const& pattern,
    std::string const& arg,
    TriStateVal<T>& value)
  {
    std::tr1::smatch matches;
    if( !std::tr1::regex_search(arg, matches, pattern))
      return false;

    // Stream into a temporary of the wrapped type first.
    std::istringstream iss(matches[1].str());
    T val;
    if((iss >> std::dec >> val).fail())
      return false;

    // Assign only after a successful conversion.
    value = val;
    return true;
  }


The TriStateVal type allows the user of that value to know whether it had been assigned or not. Unfortunately, there is no support yet for streaming directly into a TriStateVal type, so I have to use a temporary value first.
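
For readers who have not seen the TriStateVal article, the sketch below shows the general idea with a simplified stand-in. The class MaybeAssigned and the function AssignFromText are invented for this example and are not the real TriStateVal interface. The sketch only shows why a temporary is needed: the wrapper itself cannot be streamed into, so the text is converted first and assigned afterwards.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in for illustration only; NOT the TriStateVal class from the
// linked article. It just remembers whether a value was ever assigned.
template <typename T>
class MaybeAssigned
{
public:
  MaybeAssigned() : m_value(), m_assigned(false) {}
  MaybeAssigned& operator=(T const& v) { m_value = v; m_assigned = true; return *this; }
  bool IsAssigned() const { return m_assigned; }
  T const& Get() const { return m_value; }
private:
  T m_value;
  bool m_assigned;
};

// Stream into a temporary first; assign only when the conversion succeeded.
template <typename T>
bool AssignFromText(std::string const& text, MaybeAssigned<T>& value)
{
  std::istringstream iss(text);
  T tmp;
  if ((iss >> std::dec >> tmp).fail())
    return false;
  value = tmp;
  return true;
}

void MaybeAssignedExample()
{
  MaybeAssigned<int> cols;
  assert(!cols.IsAssigned());
  assert(!AssignFromText("abc", cols) && !cols.IsAssigned());
  assert(AssignFromText("8", cols) && cols.IsAssigned() && cols.Get() == 8);
}
```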


Of course, these two functions only work for string. What about wstring? Well, for wstring we need another two overloads.


  template <typename T>
  bool ParseArg(
    std::tr1::wregex const& pattern,
    std::wstring const& arg,
    T& value)
  {
    // Same body as the string version, using std::tr1::wsmatch and std::wistringstream.
  }

  template <typename T>
  bool ParseArg(
    std::tr1::wregex const& pattern,
    std::wstring const& arg,
    TriStateVal<T>& value)
  {
    // Same body as the string TriStateVal version, using std::tr1::wsmatch and std::wistringstream.
  }


These are almost identical, except that they use wregex, wstring and wistringstream instead of regex, string and istringstream.
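
Spelled out, the first wide overload could look roughly like this. The sketch is written against std::wregex from C++11 instead of std::tr1::wregex so that it compiles with current compilers, and the example pattern uses [0-9]+ to require at least one digit. Both choices are assumptions of this sketch.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>

// Wide-character counterpart of the first ParseArg overload, written
// against std::wregex (C++11) instead of std::tr1::wregex.
template <typename T>
bool ParseArg(std::wregex const& pattern, std::wstring const& arg, T& value)
{
  std::wsmatch matches;
  if (!std::regex_search(arg, matches, pattern))
    return false;

  // Only the stream and match types differ from the narrow version.
  std::wistringstream iss(matches[1].str());
  return !(iss >> std::dec >> value).fail();
}

void WideParseArgExample()
{
  std::wregex const reRows(L"^[-/]rows[:=]([0-9]+)$");
  int rows = 0;
  assert(ParseArg(reRows, std::wstring(L"-rows=25"), rows) && rows == 25);
  assert(!ParseArg(reRows, std::wstring(L"-cols=25"), rows));
}
```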


The overload mechanism


If you are familiar with the template rules, by now you may have wondered how the function overloading works.


There are 4 overloads (the template specifier template <typename T> is removed for clarity):


bool ParseArg(regex const&, string const&, T&);
bool ParseArg(regex const&, string const&, TriStateVal<T>&);
bool ParseArg(wregex const&, wstring const&, T&);
bool ParseArg(wregex const&, wstring const&, TriStateVal<T>&);


It is simple enough to see how the compiler chooses based on the types of the pattern and the argument. That is via normal overloading rules.


But how does it know when to pick the generic T template function or the more specific TriStateVal<T> template? Both can be valid.


As it turns out, the compiler follows the rules laid out in section 14.5.5.2 of the ISO C++ standard (partial ordering of function templates), which state that if multiple function templates are valid for the supplied argument types, the most specialized one is selected.


The wording is of course more complex and formal than the previous paragraph, but that is what it boils down to. Section 14.5.5.2.5 contains a number of helpful examples, and one of them is equivalent to the ParseArg case.


TriStateVal<T> is more specific than T, so the function with that parameter type will be used whenever possible.
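
The selection rule is easy to try out in isolation. In the sketch below, Wrapper and Which are hypothetical names that play the roles of TriStateVal and ParseArg. Both overloads accept a Wrapper<int>, but the compiler picks the one whose parameter is the more specialized.

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper type, used only to demonstrate overload selection.
template <typename T>
struct Wrapper { T value; };

// Generic overload: accepts any type, including Wrapper<T>.
template <typename T>
std::string Which(T&) { return "T"; }

// More specialized overload: only accepts Wrapper<something>.
template <typename T>
std::string Which(Wrapper<T>&) { return "Wrapper<T>"; }

void WhichExample()
{
  int plain = 0;
  Wrapper<int> wrapped = { 0 };
  assert(Which(plain) == "T");
  assert(Which(wrapped) == "Wrapper<T>");
}
```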


Evaluation of the first attempt


The first implementation works very well. You use the ParseArg function like this:


wregex reFile(L"^[-/]file[:=]\"{0,1}([^\\?\\*/<>\\\"]*)\"{0,1}$");
wregex reRows(L"^[-/]rows[:=]([0-9]*)$");
wregex reCols(L"^[-/]cols[:=]([0-9]*)$");

int _tmain(int argc, _TCHAR* argv[])
{
  int numRows = 0;
  TriStateVal<int> numCols;
  wstring file;
 
  for(int i=0; i< argc; i++)
  {
    wstring arg(argv[i]);
    if(ParseArg(reRows, arg, numRows))
      continue;
    if(ParseArg(reCols, arg, numCols))
      continue;
    if(ParseArg(reFile, arg, file))
      continue;
  }

  wcout << "numRows : " << numRows << endl;
  wcout << "numCols : " << numCols << endl; // assumes TriStateVal<int> can be streamed or converts to int
  wcout << "file : " << file << endl;

  return 0;
}


As you can see, no matter how many command parameters there are and how complex their formatting is, parsing them remains trivial if you can correctly specify the regular expressions.


The regular expressions in my example also take care of the fact that users can use either - or / to precede identifiers, and that the value delimiter can be either = or :.


The great benefit of the TriStateVal is that you don’t have to keep track of Boolean variables, indicating the status of a variable. I.e. you don’t have to manually keep track of which variables were assigned.


If your application can do several things, based on a ‘command’ parameter for example, then you can check the TriStateVal variables for that command to see if they have been assigned. It will make your whole program easier to understand.


Problems with the first attempt


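One issue is that it sadly doesn’t work as intended if T is of type ‘char’ or ‘wchar_t’. These are the character types that make up ‘string’ and ‘wstring’, so a string stream treats them as characters and not as numbers: streaming into the stream’s own character type extracts a single character, and streaming into the other character type (for example a ‘char’ from a wistringstream) does not compile.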


Another possible issue is that in the case that T is of type bool, the text value has to be 1 or 0. Perhaps we would also like to be able to use ‘on’ and ‘off’ as valid value texts, or ‘high’ and ‘low’.


Both problems can be solved by providing more specific template functions, like these (declarations only):


template <> bool ParseArg(regex const&, string const&, bool &);
template <> bool ParseArg(regex const&, string const&, char &);


But if we want to implement those functions, we not only have to implement the string-regex function, but also the wstring-wregex function, and the functions that use TriStateVal parameters.


So instead of having to write 2 additional functions, we would have to implement 8 functions.


We could partially fix that by making the regex and string parameters template parameters instead of concrete types, but that does not fix the problem of the types that are used internally (smatch and istringstream).


Of course, we could make those template parameters as well, but they do not appear in the function’s parameter list, so the compiler could no longer deduce them. That would force the programmer to specify the template arguments explicitly at every call, which would be ugly and confusing.


So instead we use a trick that is also used in the STL itself: we use a helper class (a traits class) that supplies those internal types for us.


Second attempt


The key to the second attempt is to use a helper class that, through specialization, will specify the dependent types.


The coat hanger for this mechanism is the empty class


  template <typename strType>
  struct __pca_typehelper{};


which has 2 specializations:


  template <>
  struct __pca_typehelper<std::string>
  {
    typedef std::tr1::smatch MatchType;
    typedef std::istringstream SStreamType;
    typedef std::tr1::regex RegexType;
  };

  template <>
  struct __pca_typehelper<std::wstring>
  {
    typedef std::tr1::wsmatch MatchType;
    typedef std::wistringstream SStreamType;
    typedef std::tr1::wregex RegexType;
  };


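Depending on which specialization is picked, MatchType, SStreamType and RegexType are typedef’ed to the correct STL and TR1 types. (Strictly speaking, names that contain a double underscore, such as __pca_typehelper, are reserved for the compiler and the standard library, so a name without the leading underscores would be safer in production code.)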


And the nice thing is that if the user supplies something other than string or wstring, compilation will fail at this stage already, because those typedefs don’t exist.
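
The effect of the helper class can be checked at compile time. The sketch below repeats the helper under the hypothetical name ArgTypes, with the C++11 std:: types instead of the std::tr1:: ones, and uses static_assert (also C++11) to confirm which types each specialization selects.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <type_traits>

// Same idea as the helper class above, under a hypothetical name and using
// the C++11 std:: types instead of the std::tr1:: ones.
template <typename StrType>
struct ArgTypes {};  // deliberately empty: no typedefs for unknown types

template <>
struct ArgTypes<std::string>
{
  typedef std::smatch MatchType;
  typedef std::istringstream SStreamType;
  typedef std::regex RegexType;
};

template <>
struct ArgTypes<std::wstring>
{
  typedef std::wsmatch MatchType;
  typedef std::wistringstream SStreamType;
  typedef std::wregex RegexType;
};

// The dependent types follow from the string type at compile time.
static_assert(std::is_same<ArgTypes<std::string>::RegexType, std::regex>::value,
              "narrow strings use std::regex");
static_assert(std::is_same<ArgTypes<std::wstring>::SStreamType, std::wistringstream>::value,
              "wide strings use std::wistringstream");

// ArgTypes<int>::RegexType would not compile: the primary template is empty.
```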


Using that helper class, the ParseArg function can be written like this:


  template <typename Targ, typename Tresult>
  bool ParseArg(
    typename __pca_typehelper<Targ>::RegexType const& pattern,
    Targ const& arg,
    Tresult& value)
  {
    typename __pca_typehelper<Targ>::MatchType matches;
    if( !std::tr1::regex_search(arg, matches, pattern))
      return false;

    typename __pca_typehelper<Targ>::SStreamType
      iss((Targ) matches[1]);
    return ! (iss >> std::dec >> value).fail();
  }


As you can see, there are now 2 template arguments: Targ and Tresult.


The meaning of Tresult is still the same as in the previous implementation.


Targ is new, and is the type of the argument that needs to be parsed. In our example it is either string or wstring. The compiler deduces Targ from the arg parameter, and that type then selects the matching __pca_typehelper specialization, which in turn supplies the type of the regular expression:


typename __pca_typehelper<Targ>::RegexType


The same goes for the internal variables ‘matches’ and ‘iss’.


The implementation of the TriStateVal<Tresult> overload is very similar to the one for Tresult, so I am not going to repeat it here.
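
For completeness, a sketch of what such a wrapper overload could look like follows. It is not the downloadable code: AssignedVal is a simplified stand-in for TriStateVal, ArgTypes is a hypothetical name for the type helper, and the std:: regex classes from C++11 replace the TR1 ones.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>

// Stand-in for illustration only; NOT the TriStateVal class from the linked article.
template <typename T>
struct AssignedVal
{
  AssignedVal() : value(), assigned(false) {}
  AssignedVal& operator=(T const& v) { value = v; assigned = true; return *this; }
  T value;
  bool assigned;
};

// Type helper in the style of the second attempt (hypothetical name, std:: types).
template <typename StrType> struct ArgTypes {};
template <> struct ArgTypes<std::string>
{
  typedef std::smatch MatchType;
  typedef std::istringstream SStreamType;
  typedef std::regex RegexType;
};
template <> struct ArgTypes<std::wstring>
{
  typedef std::wsmatch MatchType;
  typedef std::wistringstream SStreamType;
  typedef std::wregex RegexType;
};

// One overload now serves both string and wstring arguments.
template <typename Targ, typename Tresult>
bool ParseArg(typename ArgTypes<Targ>::RegexType const& pattern,
              Targ const& arg,
              AssignedVal<Tresult>& value)
{
  typename ArgTypes<Targ>::MatchType matches;
  if (!std::regex_search(arg, matches, pattern))
    return false;

  typename ArgTypes<Targ>::SStreamType iss(matches[1].str());
  Tresult tmp;
  if ((iss >> std::dec >> tmp).fail())
    return false;

  value = tmp;  // marks the wrapper as assigned
  return true;
}

void SecondAttemptExample()
{
  AssignedVal<int> cols;
  assert(ParseArg(std::regex("^[-/]cols[:=]([0-9]+)$"), std::string("-cols=8"), cols));
  assert(cols.assigned && cols.value == 8);

  AssignedVal<int> rows;
  assert(ParseArg(std::wregex(L"^[-/]rows[:=]([0-9]+)$"), std::wstring(L"/rows:3"), rows));
  assert(rows.assigned && rows.value == 3);
}
```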


Evaluation of the second attempt


We got rid of the code duplication (which is good) and managed to have some fun in the process of doing so. At least I did. :-)


The code still behaves as it did previously, so using the ParseArg function hasn’t changed at all.


We still have the issues with bool and char, but at least we can now solve this problem with a smaller number of specializations (4 instead of 8).


Conclusion


My implementation of ParseArg works well and is easy to use. I’ve been using it for some time, although until now with the Boost regex classes instead of the TR1 ones.


I would like to try and get the TriStateVal<T> and T scenarios to be handled by the same function. That would require providing a >> stream operator for TriStateVal<T> and I don’t know how much work that is.
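
As a rough idea of what that could involve, the sketch below defines an extraction operator for a simplified stand-in class. AssignedVal is invented for this example and is not the real TriStateVal, so the real class may need more than this. The operator is a template on the stream's character type, so it would serve both istringstream and wistringstream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Stand-in for illustration only; NOT the TriStateVal class from the linked article.
template <typename T>
struct AssignedVal
{
  AssignedVal() : value(), assigned(false) {}
  AssignedVal& operator=(T const& v) { value = v; assigned = true; return *this; }
  T value;
  bool assigned;
};

// Extraction operator: works for narrow and wide streams alike.
template <typename CharT, typename Traits, typename T>
std::basic_istream<CharT, Traits>&
operator>>(std::basic_istream<CharT, Traits>& is, AssignedVal<T>& value)
{
  T tmp;
  if (is >> tmp)   // leave 'value' untouched when extraction fails
    value = tmp;
  return is;
}

void ExtractionExample()
{
  AssignedVal<int> cols;
  std::istringstream good("8");
  assert(!(good >> std::dec >> cols).fail());
  assert(cols.assigned && cols.value == 8);

  AssignedVal<int> rows;
  std::wistringstream bad(L"abc");
  assert((bad >> rows).fail() && !rows.assigned);
}
```

With an operator like this, the generic T& version of ParseArg could stream directly into the wrapper, and the separate overload with its temporary would no longer be needed.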


I feel like giving this a try, but I am not going to delay publishing this article for it. For one thing, the article is already lengthy enough, and for another, it won’t matter from the user’s point of view. The usage of ParseArg will not change, so I feel it is perfectly acceptable to leave this optimization for another article.


The code of this article is available for download as always.


And as always, if you have any comments or feedback, please leave a comment at the bottom of the page.