3.3 The most common formulas for data cleaning
TABLE OF CONTENTS
- 3.3.1 Removing unnecessary spaces
- 3.3.2 Pivoting data
- 3.3.3 Merging / Splitting Columns
- 3.3.4 Finding and replacing text
- 3.3.5 Attaching data
- 3.3.6 Copying the formula result
There may be times when you have to process databases containing incorrectly entered data, or data imported from web pages, from another computer system or from applications using a particular, non-standardised character encoding. As a result, extra spaces and invisible control characters can pollute the data you want to use, making it difficult or even impossible to work with in Excel. Data cleaning is then essential.
3.3.1 Removing unnecessary spaces
The TRIM() formula removes all spaces from a text string except for single spaces between words. This function is used to clean text with irregular spacing.
In B2, we have text that was entered with extra spaces at the beginning. The TRIM() formula enables you to clean this up by eliminating all these extra spaces as can be seen in the result displayed in cell C2.
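As a minimal sketch, assuming the raw text is in B2 as described above, the formula entered in C2 could look like the first line below. The second line is a common variant that also strips non-printable control characters with CLEAN(). The third line is an assumption for web-sourced data: TRIM() only removes the ordinary space character, so non-breaking spaces (character code 160 on Windows) are first converted to ordinary spaces with SUBSTITUTE().

```excel
=TRIM(B2)
=TRIM(CLEAN(B2))
=TRIM(SUBSTITUTE(B2,CHAR(160)," "))
```

Each formula leaves B2 untouched and returns the cleaned text in the cell where it is entered.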
3.3.2 Pivoting data
You may occasionally need to transpose a range, that is, turn its rows into columns and its columns into rows.
You can do this with copy/paste: select the range of data you want to pivot and copy it, then right-click the cell where the transposed range should start (make sure there is enough empty space so as not to overwrite existing data) and select Transpose under Paste Options.
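A formula-based alternative is the TRANSPOSE() function, which keeps the transposed copy linked to the source data. The sketch below assumes the source data sits in A1:D5 (a hypothetical range of 5 rows and 4 columns).

```excel
=TRANSPOSE(A1:D5)
```

In recent versions of Excel with dynamic arrays, the result spills automatically into 4 rows and 5 columns. In older versions, first select a destination range of that size, type the formula, and confirm with Ctrl+Shift+Enter.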
3.3.3 Merging / Splitting Columns
One common task in data cleaning is to merge multiple columns into one or to split a column into two or more columns. For example, you might need to create a single Full Name column by combining the First Name and Last Name columns.
Example 1 combines the above last and first names, separated by a space. It is also possible to add the character you want as a separator, such as a hyphen (example 2).
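The two examples can be sketched as follows, assuming (hypothetically) that the last name is in A2 and the first name in B2. The first line joins them with a space, the second with a hyphen, and the third shows the equivalent CONCATENATE() form of the first.

```excel
=A2&" "&B2
=A2&"-"&B2
=CONCATENATE(A2," ",B2)
```

The separator is simply a text string in double quotes, so any other character can be used in its place.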
To replace the formulas with their results, select the cells, then under the Home tab, in the Clipboard group, click Copy, click the arrow below Paste, and then click Paste Values.
You may also need to split a Full Name column into two columns with first and last names separately. There are several techniques in Excel to achieve this result:
Select the data you want to convert, and then in the Data tab, click Text to Columns. In the Convert Text to Columns Wizard, select Delimited > Next and select the delimiters that your data contains, for example Comma and Space. A preview of your data appears in the Data preview window. If you are satisfied with the preview, select Next, choose the destination for the split data on your worksheet, and then select Finish.
For the first name, the LEFT() formula returns the characters before the space, whose position is identified by the FIND() formula. For the last name, the RIGHT() formula extracts the characters in cell A2 starting from the right and stopping at the space; the number of characters to extract is the total length given by LEN() minus the position of the space.
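A minimal sketch of these two formulas, assuming A2 contains a full name in the form "First Last" with exactly one space. The first line returns the first name and the second the last name.

```excel
=LEFT(A2,FIND(" ",A2)-1)
=RIGHT(A2,LEN(A2)-FIND(" ",A2))
```

If A2 contains no space, FIND() returns a #VALUE! error. If it contains several spaces, the second formula returns everything after the first one.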
You can use the LEFT(), RIGHT(), FIND() and LEN() formulas to manipulate text strings in your data. For example, you can distribute the first name, middle name, and last name from a single cell into three separate columns.
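For a three-part name, one possible sketch is shown below. It assumes A2 contains exactly two spaces (for example "First Middle Last") and additionally uses the MID() function, which is not mentioned above. The lines return the first, middle and last name respectively; the nested FIND(" ",A2,FIND(" ",A2)+1) locates the second space.

```excel
=LEFT(A2,FIND(" ",A2)-1)
=MID(A2,FIND(" ",A2)+1,FIND(" ",A2,FIND(" ",A2)+1)-FIND(" ",A2)-1)
=RIGHT(A2,LEN(A2)-FIND(" ",A2,FIND(" ",A2)+1))
```

Names with fewer or more than three parts would need a different approach, such as Text to Columns.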
How do I enter multiple choice questions into a database?
For multiple-choice questions (i.e., when multiple answers are possible), a column should be assigned for each answer to facilitate the analysis and descriptive statistics, such as frequency calculations.
The example below is taken from a WASH survey in Myanmar: one of the variables collects different household water treatment methods, with several possible answers.
The variable is stored in the database as a set of dichotomous variables: each possible response has its own column, coded “1” if selected or “0” if not. This coding allows the use of mathematical functions for statistical analysis. The data could also have been recorded as “Yes” for “1” and “No” for “0”.
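As an illustration of such calculations, the sketch below assumes that one response option is stored in B2:B101 (a hypothetical range covering 100 households). With 1/0 coding, the first line counts the households that selected the option and the second gives the proportion that did. With Yes/No coding, the count can be obtained with the third line.

```excel
=SUM(B2:B101)
=AVERAGE(B2:B101)
=COUNTIF(B2:B101,"Yes")
```

AVERAGE() ignores empty cells, so the proportion is calculated over the households that actually have a value recorded.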
Some data collection tools store data related to multiple choice questions in a single cell. In this case, the data must be distributed into separate cells in the same row. The data below illustrates this situation: the waterTreatmentMethod variable combines all of the answers within a single cell.
Using the Text to Columns function to split the cell:
In the Data tab, under Data Tools, the Text to Columns function enables you to distribute the contents of a single cell into several separate cells. Simply specify the delimiter, i.e., the character used to separate each response (in the example above, a space) and the destination of the results.
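A formula-based alternative, which produces the 1/0 coding directly, can be sketched as follows. It assumes the combined answers are in A2, separated by spaces, and that "boil" is one of the answer codes (a hypothetical name, not taken from the survey). The second line assumes instead that the answer code is typed in the column header B1, so the formula can be copied across and down.

```excel
=IF(ISNUMBER(SEARCH(" boil "," "&A2&" ")),1,0)
=IF(ISNUMBER(SEARCH(" "&B$1&" "," "&$A2&" ")),1,0)
```

Padding both the code and the cell content with spaces avoids false matches when one answer code is contained in another. SEARCH() is not case-sensitive.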
3.3.4 Finding and replacing text
It is possible to replace or delete a text string that is obsolete or unnecessary. To do this, simply search for instances of this text and replace them with no text or other text. In the Home tab, under the Editing group, click Find & Select, then click Replace, and enter the text you want to find and replace.
In the example above, each occurrence of the search criterion “ë” is replaced by “e” when you click Replace All, while Replace updates one occurrence at a time.
It is also possible to replace a specific character (in the example above, a dash) with a space. In this case, you must specify the space in the Replace with field.
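The same replacements can also be made with a formula, which leaves the original data unchanged. A minimal sketch using the SUBSTITUTE() function, assuming the text to clean is in A2 (a hypothetical cell):

```excel
=SUBSTITUTE(A2,"ë","e")
=SUBSTITUTE(A2,"-"," ")
```

SUBSTITUTE() is case-sensitive, so the first formula would not replace an uppercase “Ë”.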
3.3.5 Attaching data
Sometimes you need to retrieve information from other Excel files and integrate it into your database within a single sheet. The VLOOKUP() formula allows you to “join” elements in Excel. From your original database, a cell calls the VLOOKUP() formula, which will look for an element from the original database in another spreadsheet. There is absolutely no need for databases to be the same size (neither in number of rows nor number of columns).
In the example below, the variable “Number of active health centres” was looked up in database 1 in order to be added to database 2. The following formula was therefore entered in C2, in an empty column:
=VLOOKUP(B2,'Database1'!$A$1:$B$37,2,FALSE)
With the following arguments:
B2: the desired element (a cell in general).
'Database1'!$A$1:$B$37: the search range (the lookup table), locked with $ signs so that it does not shift when the formula is copied. The searched item must be in the first column of the search range.
2: the index. This is the number of the column in the search range that contains the value you want to return, 1 being the first column of the range.
FALSE: the Boolean (TRUE or FALSE) specifies whether the search is approximate (TRUE) or exact (FALSE). Most of the time we do an exact search, and the Boolean takes the value FALSE.
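With an exact search, VLOOKUP() returns the #N/A error when the desired element is not found in the search range. As a sketch, the formula assembled from the arguments above can be wrapped in IFERROR() to display a message instead; the text "Not found" is an arbitrary choice.

```excel
=IFERROR(VLOOKUP(B2,'Database1'!$A$1:$B$37,2,FALSE),"Not found")
```

Note that IFERROR() hides every type of error, not only #N/A, so it is best added once the formula itself has been checked.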
The =VLOOKUP function has a limitation: it can only search from left to right. This means that the column containing the value you are looking for must always be to the left of the column containing the return value. If your spreadsheet is not designed this way, do not use =VLOOKUP; use the combination of =INDEX and =MATCH instead.
The example below shows that the value on which to search for the corresponding province is not in the leftmost column. We will therefore use the =MATCH function instead to find the province in the B1:B38 range.
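A minimal sketch of the =INDEX and =MATCH combination is given below. Only the B1:B38 range comes from the example; the cell holding the value to search for (D2) and the column to return (A1:A38) are assumptions for illustration.

```excel
=INDEX($A$1:$A$38,MATCH(D2,$B$1:$B$38,0))
```

=MATCH returns the position of D2 within B1:B38 (the final 0 requests an exact match), and =INDEX returns the value at that position in A1:A38. Because the two ranges are given separately, the return column can be on either side of the search column.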
The following exercise is only available on the French page at this stage; translation coming soon!
In order to practice using the VLOOKUP function, nested in other functions, here is a practical exercise (Bonus 3) using clean data from the case study developed by CartONG as part of the Data Analysis Toolbox publication. A more complete presentation of this case study can be found in the “Data Analysis Toolbox”, Part “7 The case study”.
In this example, we propose to reconstruct a dataset indicating the duration of food security assistance as well as the size of households, among those having received this aid. To do this, you will use the nested IF and VLOOKUP functions.
Use the [“Excel Toolbox - Analysis 4.5 - Practice exercise - V1 - Statement”] file to practice using the VLOOKUP function.
In case of difficulties, or to compare your results with those of the case study, refer to the file [“Excel Toolbox - Analysis 4.5 - Practical exercise - V1 – Solutions”].
3.3.6 Copying the formula result
When you copy and paste in Excel, all cell properties (data, formatting, formulas, etc.) are pasted into the destination cell by default. When copying and pasting cells with formulas:
- If you want to keep these formulas, you may need to correct the cell references.
- If the database cleaning process is complete, you can paste the result of the formulas instead of the formulas themselves to reduce the risk of automatic recalculation in the event of subsequent database changes.
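To illustrate how cell references behave when a formula is copied, the sketch below assumes (hypothetically) that the first line is entered in C2 and then copied down to C3, where it becomes the second line.

```excel
=VLOOKUP(B2,$A$1:$B$37,2,FALSE)
=VLOOKUP(B3,$A$1:$B$37,2,FALSE)
```

The relative reference B2 shifts to B3, while the range locked with $ signs stays fixed. If a reference is expected to stay in place but shifts after pasting, adding $ signs before its column letter and row number is usually the correction needed.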