The more experienced you are with data imports the simpler you want to make the process. Last week I was at the Microsoft MVP summit in Redmond Washington. While I was there I spent a good deal of time with my Dynamics CRM MVP buddies among others. At breakfast one morning I sat with a bunch of Access MVPs. On the way back I ran into a great SQL Server MVP in the Washington Dulles airport. It was during these conversations that I realized people were using the same complex methodologies I had in the past to clean data before importing it in to CRM or other database applications. For that reason, I thought I would document this simple technique to help ready data for importation in to CRM.
The old faithful in database imports is the trusty comma separated value (CSV) file where data is kept in a text file and Fields are separated by commas. When all else fails in exporting data, comma separated values is the base format that every program seems to support. If you are working with CSV files, Microsoft Excel is your best friend.
What is a non-printable character?
The most common non-printable characters are a space, tab or carriage return. In total there are at least 27 different non-printable characters. They are tricky because you can’t see them in the text and often they disrupt normal processing of the import tools you are using. For more information on non-printable characters please see the following resource:
Wikipedia: Control Characters
MSDN: Non-Printable in Jscript
Using Excel to Clean Data:
Excel has several great features for working with text. I am going to focus on the ‘Clean’ function.
In the following example I have a sample text block which has non-printing characters to cause line breaks in the cell. These are common in Address fields and note fields.
The following shows the second sheet which references the first. Note the clean function in the data value box.
Note that the entire value is on the same line and the non-printable characters have been stripped out.
Here is the description of the Clean function from Excel 2010 help:
Removes all non-printable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed.
Important The CLEAN function was designed to remove the first 32 non-printing characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set (Unicode: A character encoding standard developed by the Unicode Consortium. By using more than one byte to represent each character, Unicode enables almost all of the written languages in the world to be represented by using a single character set.), there are additional non-printing characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional non-printing characters. For an example of how to remove these additional non-printing characters from text, see Remove spaces and non-printing characters from text.
CLEAN(text)The CLEAN function syntax has the following arguments (argument: A value that provides information to an action, an event, a method, a property, a function, or a procedure.):
Text Required. Any worksheet information from which you want to remove non-printable characters.
The example may be easier to understand if you copy it to a blank worksheet.
Subscribe in a reader