Source Of The Section

Sec1_DM_ANU_Spring2025.pdf

Data Preprocessing

Data preprocessing prepares raw data for analysis by cleaning and transforming it into a usable format.
In data mining specifically, this means cleaning, transforming, and organizing the data into a form that mining algorithms can consume.

Goals of Data Preprocessing:

Improve data quality.
Handle missing values, remove duplicates, and normalize data.
Ensure accuracy and consistency in the dataset.

Stages of Data Preprocessing:

Data Cleaning
Data Integration
Data Transformation
Data Reduction
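
The four stages can be sketched on a small example. This is a minimal illustration, not the method used in the source notes: the `customers` and `purchases` tables, the median fill, the min-max scaling, and the column selection are all assumptions chosen to show one simple instance of each stage.

```python
import pandas as pd

# Hypothetical toy data, not taken from the source notes
customers = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Income": [50000.0, 40000.0, 40000.0, None],
})
purchases = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Amount": [500, 400, 700],
})

# 1. Data cleaning: drop exact duplicate rows, then fill the missing
#    income with the median of the remaining values (one possible choice)
cleaned = customers.drop_duplicates().copy()
cleaned["Income"] = cleaned["Income"].fillna(cleaned["Income"].median())

# 2. Data integration: combine the two sources on their shared key
merged = cleaned.merge(purchases, on="CustomerID", how="left")

# 3. Data transformation: min-max scale Income into the range [0, 1]
income = merged["Income"]
merged["IncomeScaled"] = (income - income.min()) / (income.max() - income.min())

# 4. Data reduction: keep only the columns needed downstream
reduced = merged[["CustomerID", "IncomeScaled", "Amount"]]
print(reduced)
```

In practice each stage involves more decisions than shown here (for example, which fill strategy or scaling method suits the data), so treat the choices above as placeholders.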

Creating Data

import pandas as pd

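# Define the data as a dictionary.
# Note the data quality issues in these values: missing entries (None),
# a repeated row for CustomerID 1005, inconsistent casing in MaritalStatus
# ("s" vs "S"), and an extreme Income value (10000000).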
data_dict = {
    "CustomerID": [1001, 1002, 1003, 1004, 1005, 1005, 1006],
    "Gender": ["M", "F", None, "M", "F", "F", "F"],
    "Income": [75000, 40000, 10000000, 50000, 99999, 99999, 45000],
    "Age": [30, 40, 45, 20, 30, 30, None],
    "MaritalStatus": ["M", "W", "s", "S", "D", "D", "M"],
    "Transaction Amount": [5000, 4000, 7000, None, 3000, 3000, 1000],
    "Date": ["12/1/2020", "12/2/2020", "12/3/2020", "12/4/2020", "12/5/2020", "12/5/2020", "12/6/2020"]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data_dict)

# Display the DataFrame
df

Exploratory Data Analysis

# Check dataset dimensions; returns a (rows, columns) tuple, here (7, 7)
df.shape
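
Beyond the dimensions, a few further checks are commonly run at this point to see what needs cleaning. The sketch below rebuilds the same DataFrame so that it runs on its own; the particular set of checks is an assumption, not something listed in the source notes.

```python
import pandas as pd

# Same data as in "Creating Data", repeated so this block runs standalone
df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, 1004, 1005, 1005, 1006],
    "Gender": ["M", "F", None, "M", "F", "F", "F"],
    "Income": [75000, 40000, 10000000, 50000, 99999, 99999, 45000],
    "Age": [30, 40, 45, 20, 30, 30, None],
    "MaritalStatus": ["M", "W", "s", "S", "D", "D", "M"],
    "Transaction Amount": [5000, 4000, 7000, None, 3000, 3000, 1000],
    "Date": ["12/1/2020", "12/2/2020", "12/3/2020", "12/4/2020",
             "12/5/2020", "12/5/2020", "12/6/2020"],
})

# Column dtypes and non-null counts
df.info()

# Number of missing values in each column
missing = df.isna().sum()
print(missing)

# Number of rows that exactly repeat an earlier row
n_dupes = df.duplicated().sum()
print(n_dupes)

# Summary statistics for the numeric columns (helps spot extreme values)
print(df.describe())

# Distinct category labels, to spot inconsistent casing such as "s" vs "S"
print(df["MaritalStatus"].unique())
```

On this data the checks surface one missing value each in Gender, Age, and Transaction Amount, and one duplicated row (the second CustomerID 1005 entry).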