There is a lot of confusion about the difference between data cleansing (also known as data screening) and data transformation. Both are important to any data analysis, whether or not machine learning is to be used. In this article, we break down the difference between data cleansing and data transformation.
What is Data Cleansing and Why is it Important?
Data Cleansing, also known as data cleaning or data screening, is the process of preparing data for analysis, statistical modeling, or machine learning algorithms. This is done by deleting or modifying incomplete, incorrect, irrelevant, or inconsistent data. Data cleaning addresses factors such as outliers, noise, missing data, inconsistency, relevance and redundancy. This process is critical for ensuring consistent data quality and realizing accurate, actionable insights. “Garbage in, garbage out”, as they say.
What is Data Transformation?
Data Transformation is the process of converting your data into a usable format, optimized for your specific business goals. The type of transformation needed depends on whether you plan to use statistical modeling or machine learning algorithms. If the data is to be fed into machine learning algorithms, transformation also involves adding “features” in a process called “feature engineering”.
Data cleansing and data transformation work hand-in-hand and are absolutely essential to any data pipeline. For a detailed article on Data Transformation, see our blog article: Data Transformation: An Executive’s Guide to Affordable AI.
Data Screening and Transformation Process
Once the business goals have been identified, data cleansing is just one of the stages in the data screening and transformation process shown below.
This process leverages both SEMMA and CRISP-DM principles and is detailed below.
Data usually comes from multiple sources in different formats, which makes it difficult to analyze until it has been transformed into a homogeneous format. This difference in formats is known as syntactic heterogeneity.
When data sets for the same domain are developed by different people, this can lead to differences in meaning and interpretation. This is known as semantic heterogeneity.
Various techniques to address both types of heterogeneity can be used for data fusion.
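As a rough illustration of both kinds of heterogeneity, the sketch below fuses two small record sets into one common schema. The sources, field names, and the "thousands of dollars" convention are all hypothetical, chosen only to show one syntactic fix (date formats, field names) and one semantic fix (what "revenue" means).

```python
from datetime import datetime

# Hypothetical source A: ISO dates, revenue in dollars, ids stored as text.
source_a = [{"customer_id": "17", "signup": "2021-03-05", "revenue": 1200.0}]
# Hypothetical source B: US-style dates, other field names,
# and revenue recorded in thousands of dollars.
source_b = [{"CustID": 42, "SignupDate": "03/09/2021", "Revenue_k": 3.5}]


def from_a(row):
    """Map a source A row onto the common schema."""
    return {
        "customer_id": int(row["customer_id"]),
        "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date(),
        "revenue_usd": float(row["revenue"]),
    }


def from_b(row):
    """Map a source B row onto the common schema."""
    return {
        "customer_id": int(row["CustID"]),
        # Syntactic fix: parse the US-style date into the same date type.
        "signup": datetime.strptime(row["SignupDate"], "%m/%d/%Y").date(),
        # Semantic fix: this source means thousands of dollars.
        "revenue_usd": float(row["Revenue_k"]) * 1000,
    }


fused = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
for row in fused:
    print(row)
```

Keeping one small mapping function per source makes each source's assumptions explicit and easy to review with the people who own that data.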
Data Quality Assessment
Establishing quality criteria for your data involves considering outliers, noise, inconsistency, incompleteness and redundancy within data sets.
Outliers are values that deviate so far from the rest of the data that their validity is in question. They should be eliminated to avoid skewing the results.
Noise is meaningless data that is irrelevant to the analysis being performed. These data can hinder machine learning and should also be eliminated.
Inconsistency means the data contradicts itself somehow due to mislabeling or mistakes.
Incompleteness is data with missing values that must be eliminated or filled in by various statistical methods.
Redundancy is when there are multiple copies of the same data in databases, which can skew results.
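Some of these criteria can be checked mechanically. The sketch below counts missing values and redundant rows and flags outliers in a small, made-up data set. The 1.5 × IQR fence used for outliers is one common rule of thumb and an assumption here, not something this process prescribes. Noise and inconsistency usually need domain-specific checks and are left out.

```python
from collections import Counter
from statistics import quantiles

# Made-up records for illustration only.
records = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 12},
    {"id": 3, "value": 11},
    {"id": 4, "value": 13},
    {"id": 5, "value": 12},
    {"id": 6, "value": 95},    # suspiciously large
    {"id": 7, "value": None},  # missing
    {"id": 8, "value": 11},
    {"id": 8, "value": 11},    # exact copy of the previous row
]


def assess(rows, field="value"):
    """Return simple counts for incompleteness, redundancy and outliers."""
    present = [r[field] for r in rows if r[field] is not None]
    missing = len(rows) - len(present)

    # Redundancy: rows that repeat an earlier row exactly.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(c - 1 for c in counts.values())

    # Outliers: values outside the 1.5 * IQR fences (assumed rule).
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = sorted({v for v in present if v < low or v > high})

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


report = assess(records)
print(report)
```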
Data selection determines what data is used for analysis or modeling and includes an assessment of the data relevance. Lots of data is eliminated in this phase to prevent costly data cleansing and/or feature engineering on irrelevant data.
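A minimal sketch of data selection, assuming a hypothetical customer data set in which only three fields and only active customers are relevant to the business goal. The field names and the relevance rule are illustrative.

```python
# Fields assumed to be relevant to the (hypothetical) business goal.
RELEVANT_FIELDS = ("customer_id", "region", "monthly_spend")

records = [
    {"customer_id": 1, "region": "EU", "monthly_spend": 40.0,
     "fax_number": "n/a", "active": True},
    {"customer_id": 2, "region": "US", "monthly_spend": 55.0,
     "fax_number": "n/a", "active": False},
    {"customer_id": 3, "region": "US", "monthly_spend": 70.0,
     "fax_number": "n/a", "active": True},
]


def select(rows, fields=RELEVANT_FIELDS):
    """Keep only in-scope rows (active customers) and relevant fields."""
    return [{f: r[f] for f in fields} for r in rows if r["active"]]


selected = select(records)
print(selected)
```

Everything dropped here never reaches the cleansing or feature engineering steps, which is where the cost saving comes from.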
Data quality is improved, based on criteria set forth in the quality assessment phase. Issues related to outliers, noise, inconsistency, incompleteness and redundancy are fixed.
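The sketch below applies three such fixes to a small, made-up data set: it normalizes inconsistent labels, drops redundant rows, and fills missing values. The alias table and the choice of the median as a fill value are assumptions for illustration. The right fix always depends on the data and the quality criteria agreed earlier.

```python
from statistics import median

# Made-up rows for illustration only.
rows = [
    {"id": 1, "state": "NY", "amount": 100.0},
    {"id": 2, "state": " ny", "amount": None},         # odd label, missing amount
    {"id": 3, "state": "New York", "amount": 140.0},
    {"id": 3, "state": "New York", "amount": 140.0},   # redundant copy
    {"id": 4, "state": "CA", "amount": 120.0},
]

# Hypothetical mapping from spelling variants to one canonical label.
STATE_ALIASES = {"ny": "NY", "new york": "NY", "ca": "CA"}


def clean(data):
    # Inconsistency: map label variants to a single canonical value.
    normalized = [
        {**r, "state": STATE_ALIASES.get(r["state"].strip().lower(), r["state"])}
        for r in data
    ]

    # Redundancy: keep the first occurrence of each identical row.
    seen, unique = set(), []
    for r in normalized:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Incompleteness: fill missing amounts with the median of known amounts.
    known = [r["amount"] for r in unique if r["amount"] is not None]
    fill = median(known)
    return [
        {**r, "amount": fill if r["amount"] is None else r["amount"]}
        for r in unique
    ]


cleaned = clean(rows)
for row in cleaned:
    print(row)
```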
Feature Engineering uses domain knowledge of the data to extract the most useful features used by algorithms. These are fed into machine learning algorithms to supercharge performance. Feature engineering consists of the creation, transformation, extraction and selection of features from raw data.
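To make the four activities concrete, the sketch below derives a few features from hypothetical order records. The field names and the specific features (a weekend flag, a log-scaled amount, an amount per item) are illustrative choices, not recommendations for any particular model.

```python
import math
from datetime import date

# Hypothetical raw order records.
raw = [
    {"order_date": date(2024, 1, 6), "amount": 250.0, "items": 5, "notes": "gift"},
    {"order_date": date(2024, 1, 8), "amount": 40.0, "items": 1, "notes": ""},
]


def engineer(row):
    """Turn one raw record into a small feature dictionary."""
    return {
        # Extraction: pull a signal out of a raw field (Sat/Sun -> True).
        "is_weekend": row["order_date"].weekday() >= 5,
        # Transformation: compress a skewed numeric range.
        "log_amount": math.log1p(row["amount"]),
        # Creation: combine two fields into a new one.
        "amount_per_item": row["amount"] / row["items"],
        # Selection: the free-text "notes" field is deliberately left out.
    }


features = [engineer(r) for r in raw]
for f in features:
    print(f)
```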
Data Screening and Transformation are important steps to ensure the data used for analysis or modeling will produce the most accurate and actionable insights for business or operational decision-making.
Accelerating your AI and Machine Learning Initiatives
Why dive into an AI or Machine Learning program before making sure you can get the ROI you need?
If you would like to kick-start your Artificial Intelligence or Machine Learning initiative, Cloud App Developers, LLC has created a valuable offering in our Machine Learning Proof of Concept.
Who Would Benefit?
Organizations that:
- Have lots of legacy data and want to launch an AI or Machine Learning Program, but don’t know where to start.
- Need to validate if business goals are achievable through data science, Machine Learning or Deep Learning.
- Want to know what is possible with AI and Machine Learning.
What does the program include?
The Cloud App Developers’ Machine Learning Proof of Concept is designed to provide a useful assessment of your data, validate your business goals against available data and models, and help identify other business goals that might be achievable through Machine Learning. Our Machine Learning Consultants will then compile a comprehensive report and present the findings to your stakeholders. A typical program would include the following stages:
|Accelerator Program Stages||Questions Answered|
|Review of your Business Goals||“What are the business goals you hope to address with AI or Machine Learning?”|
|Assessment of existing data||“What is the nature and quality of your existing data?” “Is there any missing data?” “What data preparation is needed?”|
|Top-Level Validation of Business Goals||“Can your existing data support your Business Goals?” “What other goals might be achievable?”|
|ML Model Recommendations||“Which ML Algorithms or Analytical Models are best suited to meet my business goals?”|
|Data Transformation of subset of data (see below)||“How much will it cost to prepare all of my data for modeling?”|
|Run Models against one business goal and a subset of data||“How do I validate how AI and Machine Learning can help me?” “Can I accomplish some goals with Analytical Modeling or do I need sophisticated ML Modeling?”|
|Generate ML Report with Recommendations||Full Report on Machine Learning Readiness and Goal Validation. Includes recommendations and cost estimates for full Data Transformation and ML Program.|