Data Science Tools & Software - Assignment No. 2


Question 1: Data Preprocessing

Part (a): Consider the following dataset:

index  math  physics
x1     85    0.7
x2     65    0.8
x3     80    0.2
x4     75    0.9

(i) Compute the Euclidean distances:

Compute the distance between x1 and x3, i.e., (85, 0.7) vs. (80, 0.2), and the distance between x2 and x4, i.e., (65, 0.8) vs. (75, 0.9).
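The two distances can be checked with a short script. This is only a sketch: the names `points` and `euclidean` are illustrative and not part of the assignment.

```python
import math

# Dataset from Part (a): each point is (math, physics).
points = {
    "x1": (85, 0.7),
    "x2": (65, 0.8),
    "x3": (80, 0.2),
    "x4": (75, 0.9),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# d(x1, x3) = sqrt(5^2 + 0.5^2) = sqrt(25.25), about 5.025
d13 = euclidean(points["x1"], points["x3"])
# d(x2, x4) = sqrt(10^2 + 0.1^2) = sqrt(100.01), about 10.0005
d24 = euclidean(points["x2"], points["x4"])

print(f"d(x1, x3) = {d13:.4f}")
print(f"d(x2, x4) = {d24:.4f}")
```

Note that in both cases the math difference contributes almost the entire distance, because the math column is on a much larger scale than the physics column.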

  1. Which option correctly represents the Euclidean distances?

(ii) Comment on the computed distances:

  1. Which comment best describes these computed distances?

(iii) Normalize the dataset using min-max normalization:

For the math column, min = 65 and max = 85. For the physics column, min = 0.2 and max = 0.9.
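Min-max normalization maps each value x to (x - min) / (max - min), so every column ends up in the range [0, 1]. The sketch below applies this to both columns; the helper name `min_max` is an assumption made for illustration, and it presumes the column is not constant (max > min).

```python
# Columns from Part (a), in the order x1, x2, x3, x4.
math_scores = [85, 65, 80, 75]
physics_scores = [0.7, 0.8, 0.2, 0.9]

def min_max(values):
    """Rescale values to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

math_norm = min_max(math_scores)        # uses min = 65, max = 85
physics_norm = min_max(physics_scores)  # uses min = 0.2, max = 0.9

for name, m, p in zip(["x1", "x2", "x3", "x4"], math_norm, physics_norm):
    print(f"{name}: math = {m:.3f}, physics = {p:.3f}")
```

For example, x1 becomes (85 - 65) / 20 = 1.0 in math and (0.7 - 0.2) / 0.7, roughly 0.714, in physics.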

  1. What are the normalized values for each attribute?

Part (b): Missing Data Imputation

The dataset X with missing values is given as:

x1 = [ a, 60]
x2 = [11, 75]
x3 = [ 5, 75]
x4 = [ 5, 80]
x5 = [ 7,  b]

For each method below, select the option that correctly replaces the missing values (a and b):

(i) Using the Mean Value:

For the first attribute, the known values are 11, 5, 5, 7 → mean = (11+5+5+7)/4 = 7.

For the second attribute, the known values are 60, 75, 75, 80 → mean = (60+75+75+80)/4 = 72.5.
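The column means above can be reproduced with a minimal sketch such as the following. Using `None` to mark the missing entries and the helper name `impute_mean` are assumptions made for illustration.

```python
from statistics import mean

# Dataset from Part (b); None marks the missing entries a (in x1) and b (in x5).
X = [[None, 60], [11, 75], [5, 75], [5, 80], [7, None]]

def impute_mean(rows):
    """Replace each None with the mean of the known values in its column."""
    n_cols = len(rows[0])
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]

X_mean = impute_mean(X)
print(X_mean[0])  # x1 with a filled in
print(X_mean[4])  # x5 with b filled in
```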

  1. Replacing missing values using the mean yields:

(ii) Using the Most Probable (Mode) Value:

For the first attribute, among {11, 5, 5, 7}, the mode is 5.

For the second attribute, among {60, 75, 75, 80}, the mode is 75.
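The same idea works for the mode, swapping the mean for `statistics.mode`. Again this is only a sketch; the `None` markers and the name `impute_mode` are illustrative assumptions.

```python
from statistics import mode

# Dataset from Part (b); None marks the missing entries a (in x1) and b (in x5).
X = [[None, 60], [11, 75], [5, 75], [5, 80], [7, None]]

def impute_mode(rows):
    """Replace each None with the most frequent known value in its column."""
    n_cols = len(rows[0])
    col_modes = [
        mode([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)
    ]
    return [
        [col_modes[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]

X_mode = impute_mode(X)
print(X_mode[0])  # x1 with a filled in
print(X_mode[4])  # x5 with b filled in
```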

  1. Replacing missing values using the most probable method yields:

(iii) Using kNN Regression with k = 2:

For x1 (with missing a and known 60), the two nearest neighbors based on the second attribute are x2 and x3 (second value 75 each, at distance 15); x4 (second value 80) is farther away, at distance 20. Thus, a ≈ (11 + 5)/2 = 8.

For x5 (with missing b and known 7), the two nearest neighbors based on the first attribute are x3 and x4 (first value 5 each, at distance 2); x2 (first value 11) is farther away, at distance 4. Thus, b ≈ (75 + 80)/2 = 77.5.
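The kNN reasoning above can be sketched in code as follows. The function `knn_impute` is a hypothetical helper written for this two-attribute dataset: it measures distance on the one known attribute, ignores rows that themselves have a missing value, and averages the target attribute over the k nearest rows.

```python
# Dataset from Part (b); None marks the missing entries a (in x1) and b (in x5).
X = [[None, 60], [11, 75], [5, 75], [5, 80], [7, None]]

def knn_impute(rows, target_idx, missing_col, k=2):
    """Estimate rows[target_idx][missing_col] from its k nearest complete rows.

    Assumes exactly two attributes, so the known column is the other one.
    """
    known_col = 1 - missing_col
    target = rows[target_idx]
    # Only rows with both attributes present can serve as neighbors.
    candidates = [
        r for i, r in enumerate(rows)
        if i != target_idx and None not in r
    ]
    # Sort by absolute difference on the known attribute (stable for ties).
    candidates.sort(key=lambda r: abs(r[known_col] - target[known_col]))
    nearest = candidates[:k]
    return sum(r[missing_col] for r in nearest) / k

a = knn_impute(X, target_idx=0, missing_col=0)  # neighbors x2 and x3
b = knn_impute(X, target_idx=4, missing_col=1)  # neighbors x3 and x4
print(a, b)
```

With k = 2 there are no ties at the cutoff in this dataset, so the choice of neighbors is unambiguous; on other data a tie-breaking rule would need to be stated.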

  1. Using kNN regression with k = 2 yields:

Part (c): Normalized Dissimilarity Between Symbolic Objects

You are given two objects with 4 attributes:

Objects: