Source Of The Section
Sec1_DM_ANU_Spring2025.pdf
Data Preprocessing
- Data preprocessing is the process of preparing raw data for analysis by cleaning and transforming it into a usable format.
- In data mining, this means cleaning, transforming, and organizing the raw data into a format that mining algorithms can work with.
Goals of Data Preprocessing:
- Improve data quality.
- Handle missing values, remove duplicates, and normalize data.
- Ensure accuracy and consistency in the dataset.
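The tasks named in these goals can be sketched on a tiny table. This is a minimal illustration only: the `toy` DataFrame and its columns are hypothetical, and median imputation and min-max scaling are just one common choice for each task, not necessarily the method used later in this section.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch (not the dataset from the section)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "score": [10.0, 20.0, 20.0, None, 30.0],
})

# Remove duplicates: the two rows with id 2 are identical, so one is dropped
deduped = toy.drop_duplicates().reset_index(drop=True)

# Handle missing values: fill the missing score with the column median
# (assumption: median imputation; other strategies are equally valid)
filled = deduped.assign(score=deduped["score"].fillna(deduped["score"].median()))

# Normalize: min-max scaling maps the scores onto the range [0, 1]
lo, hi = filled["score"].min(), filled["score"].max()
normalized = filled.assign(score_scaled=(filled["score"] - lo) / (hi - lo))

print(normalized)
```

Each step returns a new DataFrame rather than modifying `toy`, which makes it easy to compare the data before and after a step.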
Stages of Data Preprocessing:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
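To make the four stages concrete, here is a short sketch that applies one representative operation per stage. The `customers` and `orders` tables, their columns, and the choice of operation for each stage are assumptions made for illustration; real pipelines usually involve several operations per stage.

```python
import pandas as pd

# Hypothetical tables, invented for this sketch
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Paris", "Rome", "Rome", None],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "order_date": ["2020-12-01", "2020-12-02", "2020-12-03"],
    "amount": [500, 400, 700],
    "notes": ["", "", ""],
})

# 1. Data cleaning: drop duplicate rows and fill the missing city
customers_clean = customers.drop_duplicates().fillna({"city": "Unknown"})

# 2. Data integration: combine the two sources on their shared key
merged = customers_clean.merge(orders, on="customer_id", how="inner")

# 3. Data transformation: convert date strings to datetime values
merged["order_date"] = pd.to_datetime(merged["order_date"], format="%Y-%m-%d")

# 4. Data reduction: keep only the columns needed for analysis
reduced = merged[["customer_id", "city", "order_date", "amount"]]

print(reduced)
```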
Creating Data
import pandas as pd
# Define the data as a dictionary
data_dict = {
    "CustomerID": [1001, 1002, 1003, 1004, 1005, 1005, 1006],
    "Gender": ["M", "F", None, "M", "F", "F", "F"],
    "Income": [75000, 40000, 10000000, 50000, 99999, 99999, 45000],
    "Age": [30, 40, 45, 20, 30, 30, None],
    "MaritalStatus": ["M", "W", "s", "S", "D", "D", "M"],
    "Transaction Amount": [5000, 4000, 7000, None, 3000, 3000, 1000],
    "Date": ["12/1/2020", "12/2/2020", "12/3/2020", "12/4/2020", "12/5/2020", "12/5/2020", "12/6/2020"],
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data_dict)
# Display the DataFrame
df
Exploratory Data Analysis
# Check dataset dimensions
df.shape
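Beyond the dimensions, a few further checks are commonly run at this stage. The block below is a hedged sketch of such checks rather than a record of what the source does next; it rebuilds the same data so that it runs on its own, and the variable names `missing`, `n_duplicates`, `summary`, and `marital_values` are introduced here for illustration.

```python
import pandas as pd

# Same data as in the "Creating Data" step, repeated so this block is self-contained
df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, 1004, 1005, 1005, 1006],
    "Gender": ["M", "F", None, "M", "F", "F", "F"],
    "Income": [75000, 40000, 10000000, 50000, 99999, 99999, 45000],
    "Age": [30, 40, 45, 20, 30, 30, None],
    "MaritalStatus": ["M", "W", "s", "S", "D", "D", "M"],
    "Transaction Amount": [5000, 4000, 7000, None, 3000, 3000, 1000],
    "Date": ["12/1/2020", "12/2/2020", "12/3/2020", "12/4/2020",
             "12/5/2020", "12/5/2020", "12/6/2020"],
})

# Column types and non-null counts
df.info()

# Number of missing values in each column
missing = df.isna().sum()
print(missing)

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Summary statistics for the numeric columns; extreme Income values show up here
summary = df.describe()
print(summary)

# Distinct labels in a categorical column can reveal inconsistent coding
marital_values = df["MaritalStatus"].unique()
print(marital_values)
```

On this dataset the checks surface one missing value each in `Gender`, `Age`, and `Transaction Amount`, one duplicated row (the second `CustomerID` 1005 record), and a lowercase `"s"` alongside `"S"` in `MaritalStatus`.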