Data Science Tools & Software - Assignment No. 2

Question 1: Data Preprocessing

Part (a): Consider the following dataset:

index	math	physics
x1	85	0.7
x2	65	0.8
x3	80	0.2
x4	75	0.9

(i) Compute the Euclidean distances:

Compute the distance between x1 and x3, i.e., (85, 0.7) vs. (80, 0.2), and between x2 and x4, i.e., (65, 0.8) vs. (75, 0.9).

Which option correctly represents the Euclidean distances?
A. d_e(x1, x3) ≈ 5.00 and d_e(x2, x4) ≈ 10.00
B. d_e(x1, x3) ≈ 5.02 and d_e(x2, x4) ≈ 10.00
C. d_e(x1, x3) ≈ 5.50 and d_e(x2, x4) ≈ 9.50
D. d_e(x1, x3) ≈ 4.80 and d_e(x2, x4) ≈ 10.20
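
The distance computation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the dictionary `points` and the helper `euclidean` are names introduced here for illustration and are not part of the assignment.

```python
import math

# Rows of the dataset in Part (a): (math, physics)
points = {
    "x1": (85, 0.7),
    "x2": (65, 0.8),
    "x3": (80, 0.2),
    "x4": (75, 0.9),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d13 = euclidean(points["x1"], points["x3"])  # sqrt(5^2 + 0.5^2)
d24 = euclidean(points["x2"], points["x4"])  # sqrt(10^2 + 0.1^2)

print(f"d(x1, x3) = {d13:.2f}")
print(f"d(x2, x4) = {d24:.2f}")
```

Because the math scores span tens of points while the physics scores stay within [0, 1], the math differences dominate both distances on the raw data.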

(ii) Comment on the computed distances:

Which comment best describes these computed distances?

A. x1 and x3 are more similar (i.e., closer) than x2 and x4.
B. x2 and x4 are more similar than x1 and x3.
C. Both pairs show similar distances and similarity.
D. The distances are too small to indicate any similarity.

(iii) Normalize the dataset using min-max normalization:

Min-max normalization maps each value x to x' = (x - min) / (max - min). For the math column, min = 65 and max = 85; for the physics column, min = 0.2 and max = 0.9.

What are the normalized values for each attribute?

A. math: [1, 0, 0.75, 0.5], physics: [0.714, 0.857, 0, 1]
B. math: [0.75, 0, 1, 0.5], physics: [0.714, 0.857, 0, 1]
C. math: [1, 0, 0.5, 0.75], physics: [0.857, 0.714, 0, 1]
D. math: [1, 0, 0.75, 0.5], physics: [0.714, 0.857, 1, 0]
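
As an illustration, min-max normalization of the two columns can be sketched as follows. The function name `min_max` and the variable names are assumptions made for this example, and the sketch assumes each column has at least two distinct values (otherwise max - min would be zero).

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Columns of the dataset in Part (a), in the order x1, x2, x3, x4
math_scores = [85, 65, 80, 75]
physics_scores = [0.7, 0.8, 0.2, 0.9]

math_norm = min_max(math_scores)
physics_norm = min_max(physics_scores)

print([round(v, 3) for v in math_norm])
print([round(v, 3) for v in physics_norm])
```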

Part (b): Missing Data Imputation

The dataset X with missing values is given as:

x1 = [ a, 60 ]
x2 = [11, 75]
x3 = [ 5, 75]
x4 = [ 5, 80]
x5 = [ 7, b ]

For each method below, select the option that correctly replaces the missing values (a and b):

(i) Using the Mean Value:

For the first attribute, the known values are 11, 5, 5, 7 → mean = (11+5+5+7)/4 = 7.

For the second attribute, the known values are 60, 75, 75, 80 → mean = (60+75+75+80)/4 = 72.5.

Replacing missing values using the mean yields:

A. a = 7, b = 72.5
B. a = 7, b = 75
C. a = 6, b = 72
D. a = 5, b = 70
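
Mean imputation can be sketched in Python as below. Missing entries are encoded as `None`, and the helper `impute` is a hypothetical name introduced for this example; it fills each column with a statistic computed over that column's observed values only.

```python
from statistics import mean

# Dataset X from Part (b); None marks the missing values a and b
X = [[None, 60], [11, 75], [5, 75], [5, 80], [7, None]]

def impute(rows, fill):
    """Replace None in each column with fill(observed values of that column)."""
    columns = list(zip(*rows))
    fills = [fill([v for v in col if v is not None]) for col in columns]
    return [[f if v is None else v for v, f in zip(row, fills)] for row in rows]

X_mean = impute(X, mean)
a, b = X_mean[0][0], X_mean[4][1]
print(a, b)
```

Passing a different statistic as `fill` (for example `statistics.median`) reuses the same helper for other simple imputation strategies.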

(ii) Using the Most Probable (Mode) Value:

For the first attribute, among {11, 5, 5, 7}, the mode is 5.

For the second attribute, among {60, 75, 75, 80}, the mode is 75.

Replacing missing values using the most probable (mode) value yields:

A. a = 7, b = 72.5
B. a = 5, b = 75
C. a = 7, b = 75
D. a = 5, b = 72.5
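
The mode of each attribute can be checked with `statistics.mode` from the Python standard library. This is a small sketch; the list names are introduced here for illustration and hold only the observed (non-missing) values.

```python
from statistics import mode

first_attr = [11, 5, 5, 7]      # observed values of attribute 1 (x2 to x5)
second_attr = [60, 75, 75, 80]  # observed values of attribute 2 (x1 to x4)

a = mode(first_attr)   # most frequent value of attribute 1
b = mode(second_attr)  # most frequent value of attribute 2
print(a, b)
```

Both attributes here have a single most frequent value; with ties, a rule for choosing among the candidate modes would be needed.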

(iii) Using kNN Regression with k = 2:

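For x1 (a missing, second attribute = 60), neighbors are chosen by distance on the second attribute; x5 is excluded because its second attribute is missing. The two nearest are x2 and x3 (both at 75, distance 15; x4 is at distance 20). Thus, a ≈ (11 + 5)/2 = 8.

For x5 (b missing, first attribute = 7), neighbors are chosen by distance on the first attribute; x1 is excluded because its first attribute is missing. The two nearest are x3 and x4 (both at 5, distance 2; x2 is at distance 4). Thus, b ≈ (75 + 80)/2 = 77.5.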

Using kNN regression with k = 2 yields:

A. a = 8, b = 77.5
B. a = 7, b = 75
C. a = 8, b = 75
D. a = 7, b = 77.5
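
The kNN procedure can be sketched as follows. The function `knn_impute` is a hypothetical helper written for this two-attribute dataset: it measures distance on the one known attribute, considers only complete rows as neighbors, and averages the neighbors' values of the missing attribute. Ties in distance are broken by row name, which is an assumption of this sketch.

```python
# Dataset X from Part (b); None marks the missing values a and b
X = {
    "x1": (None, 60),
    "x2": (11, 75),
    "x3": (5, 75),
    "x4": (5, 80),
    "x5": (7, None),
}

def knn_impute(data, target, missing_idx, k=2):
    """Estimate data[target][missing_idx] as the mean over the k nearest complete rows."""
    known_idx = 1 - missing_idx  # the other of the two attributes
    query = data[target][known_idx]
    candidates = [
        (abs(row[known_idx] - query), name)
        for name, row in data.items()
        if name != target and None not in row  # skip rows with missing values
    ]
    nearest = sorted(candidates)[:k]
    return sum(data[name][missing_idx] for _, name in nearest) / k

a = knn_impute(X, "x1", missing_idx=0)
b = knn_impute(X, "x5", missing_idx=1)
print(a, b)
```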

Part (c): Normalized Dissimilarity Between Symbolic Objects

You are given two objects with 4 attributes:

Attribute 1: A string of 5 characters
Attribute 2: An interval
Attribute 3: A set
Attribute 4: A binary number of 5 bits

Objects: