What is Data Preprocessing?
Most statistical theory concentrates on data modelling, prediction, and statistical inference, and it is usually assumed that the data are already in the correct state for analysis. In practice, however, a data analyst spends most of their time, typically 50%-80% of it, preparing the data before performing any statistical operation [1].
Despite the amount of time it takes, there has been surprisingly little emphasis on how to preprocess data well (Wickham et al., 2014). Real-world data are commonly incomplete, noisy, and inconsistent, and they often lack the correct labels and codes required for the analysis.
Data preprocessing, also commonly referred to as data wrangling, data manipulation, or data cleaning, is the process and collection of operations needed to prepare all forms of untidy data (incomplete, noisy, and inconsistent) for statistical analysis.
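As an illustration of such operations, the following is a minimal Python sketch using pandas on a hypothetical untidy dataset; the column names, values, and imputation choices are illustrative assumptions, not a prescription.

```python
import pandas as pd
import numpy as np

# A small, hypothetical untidy dataset: missing values, inconsistent
# category labels, and an implausible (noisy) measurement.
df = pd.DataFrame({
    "age":    [34, np.nan, 29, 411],          # missing value and an implausible outlier
    "gender": ["M", "male", "F", "female"],   # inconsistent coding
    "income": [52000, 48000, np.nan, 61000],  # missing value
})

# Treat values outside a plausible range as missing (simple noise handling).
df.loc[~df["age"].between(0, 120), "age"] = np.nan

# Impute missing numeric values with the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Harmonise inconsistent category labels into a single coding scheme.
df["gender"] = df["gender"].str.lower().map(
    {"m": "male", "f": "female", "male": "male", "female": "female"}
)

print(df)
```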
Reasons
Preprocessing of datasets takes place for two main reasons [2]:
- reduction of the size of the dataset in order to achieve more efficient analysis (dimension reduction; see the sketch after this list)
- adaptation of the dataset to best suit the selected analysis method.
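To make the first reason concrete, here is a minimal sketch of dimension reduction using principal component analysis with scikit-learn. PCA is only one common choice, and [2] reviews feature selection methods as an alternative route; the synthetic dataset, random seed, and 95% variance threshold here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 100 observations with 20 correlated features
# (constructed from only 5 underlying sources, so it is compressible).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))

# Standardise first so that no feature dominates purely by scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the components that together explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)  # e.g. (100, 20) -> (100, 5)
```

Standardising before PCA is a deliberate choice: without it, features measured on large scales would dominate the principal components regardless of how informative they are.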
Reference List
- Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. Wiley-Interscience.
- Jović, A., Brkić, K., & Bogunović, N. (2015, May). A review of feature selection methods with applications. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 1200-1205). IEEE.