3 Data cleaning


Data are unorganised raw facts or figures that need to be processed and analysed. A variable is a characteristic whose value can differ from one record to another. Together, data and variables form the basis of most analyses performed to understand situations, trends and linkages.
Data and variables can take different forms: they may appear simple and random, or statistical and complex. In either case, they are of no use until they are processed, analysed and finally converted into information. Before being analysed, the data must first be checked for possible errors. Data cleaning is therefore primarily a logical process, which includes checking the data for consistency and triangulating them with other available information.
Some errors are difficult to detect before the analysis begins; for example, some outliers can be identified only once the data are better known. It is nevertheless preferable to detect as many errors as possible beforehand, to avoid having to backtrack while analysing the data.
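As an illustration of a consistency check, the sketch below flags records that break a simple logical rule. It is only a minimal example in Python: the field names (`id`, `household_size`, `children_under_5`), the sample records and the two rules are assumptions made for the illustration, not part of this toolbox. The same logic can be applied with spreadsheet formulas.

```python
def find_inconsistencies(records):
    """Return (record id, message) pairs for records that break a logical rule."""
    problems = []
    for rec in records:
        # Rule 1 (assumed): a household needs at least one member.
        if rec["household_size"] <= 0:
            problems.append((rec["id"], "household size must be at least 1"))
        # Rule 2 (assumed): children under 5 cannot outnumber household members.
        elif rec["children_under_5"] > rec["household_size"]:
            problems.append((rec["id"], "more children under 5 than household members"))
    return problems


# Hypothetical sample data: one dictionary per surveyed household.
records = [
    {"id": "HH001", "household_size": 5, "children_under_5": 2},
    {"id": "HH002", "household_size": 3, "children_under_5": 4},
    {"id": "HH003", "household_size": 0, "children_under_5": 0},
]

issues = find_inconsistencies(records)
for record_id, message in issues:
    print(record_id, message)
```

The check only flags suspicious records; deciding whether a flagged value is a real error, and how to correct it, still requires going back to the source or triangulating with other information.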
Make sure that all changes made to your dataset have been recorded in a “change log”.
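A change log can be as simple as a table with one row per correction. The sketch below, in Python, shows one possible layout; the column names, the `log_change` helper and the sample records are hypothetical and should be adapted to your own dataset. A spreadsheet tab with the same columns serves the same purpose.

```python
from datetime import date


def log_change(change_log, record_id, variable, old_value, new_value, reason, author):
    """Append one correction to the change log and return the new entry."""
    entry = {
        "date": date.today().isoformat(),
        "record_id": record_id,
        "variable": variable,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
        "author": author,
    }
    change_log.append(entry)
    return entry


# Hypothetical dataset: one dictionary per surveyed household.
dataset = [
    {"id": "HH001", "household_size": 5},
    {"id": "HH002", "household_size": 55},  # suspected typing error
]
change_log = []

# Record the correction first, then apply it, so that no change goes unlogged.
record = dataset[1]
log_change(
    change_log,
    record_id=record["id"],
    variable="household_size",
    old_value=record["household_size"],
    new_value=5,
    reason="Typing error confirmed with the enumerator",
    author="data officer",
)
record["household_size"] = 5
```

Keeping the old value and the reason in the log makes every correction reversible and lets another person understand why the dataset differs from the raw data.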
This module consists of 7 sub-parts:
- 3.1 Formatting a database,
- 3.2 Conditional formatting,
- 3.3 The most common formulas for data cleaning,
- 3.4 Identifying duplicates,
- 3.5 Identifying extreme values,
- 3.6 Identifying missing values,
- 3.7 Summary of Data Quality Control.
This section often refers to:
- The Getting Started in Programme Data Management toolbox, which provides a starting point for learning about data management in general,
- The Data Analysis toolbox, which provides an understanding of the basic principles and tricks of data analysis.