Data Science Tools & Software - Assignment No. 2
Question 1: Data Preprocessing
Part (a): Given the following dataset
| index | math | physics |
|-------|------|---------|
| x1    | 85   | 0.7     |
| x2    | 65   | 0.8     |
| x3    | 80   | 0.2     |
| x4    | 75   | 0.9     |
(i) Compute the Euclidean distances:
For points X1 and X3 (i.e., (85, 0.7) vs. (80, 0.2)) and for X2 and X4 (i.e., (65, 0.8) vs. (75, 0.9)).
- Which option correctly represents the Euclidean distances?
- A. de(X1, X3) ≈ 5.00 and de(X2, X4) ≈ 10.00
- B. de(X1, X3) ≈ 5.02 and de(X2, X4) ≈ 10.00
- C. de(X1, X3) ≈ 5.50 and de(X2, X4) ≈ 9.50
- D. de(X1, X3) ≈ 4.80 and de(X2, X4) ≈ 10.20
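To check the arithmetic, the two distances can be computed directly. This is a minimal sketch using only the Python standard library; the names `points` and `euclidean` are illustrative and not part of the assignment.

```python
import math

# Rows of the table as (math, physics) pairs.
points = {
    "x1": (85, 0.7),
    "x2": (65, 0.8),
    "x3": (80, 0.2),
    "x4": (75, 0.9),
}

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d13 = euclidean(points["x1"], points["x3"])  # sqrt(5^2 + 0.5^2)
d24 = euclidean(points["x2"], points["x4"])  # sqrt(10^2 + 0.1^2)
print(f"{d13:.2f} {d24:.2f}")  # 5.02 10.00
```

Note that the math column contributes almost all of each distance because its scale is much larger than that of the physics column.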
(ii) Comment on the computed distances:
- Which comment best describes these computed distances?
- A. X1 and X3 are more similar (i.e., closer) than X2 and X4.
- B. X2 and X4 are more similar than X1 and X3.
- C. Both pairs show similar distances and similarity.
- D. The distances are too small to indicate any similarity.
(iii) Normalize the dataset using min-max normalization:
For the math column, min = 65 and max = 85. For the physics column, min = 0.2 and max = 0.9.
- What are the normalized values for each attribute?
- A. math: [1, 0, 0.75, 0.5], physics: [0.714, 0.857, 0, 1]
- B. math: [0.75, 0, 1, 0.5], physics: [0.714, 0.857, 0, 1]
- C. math: [1, 0, 0.5, 0.75], physics: [0.857, 0.714, 0, 1]
- D. math: [1, 0, 0.75, 0.5], physics: [0.714, 0.857, 1, 0]
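Min-max normalization maps each value v to (v - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1. The sketch below applies this to both columns; the helper name `min_max` is illustrative, and it assumes the column is not constant (otherwise max - min would be zero).

```python
def min_max(values):
    """Rescale values linearly to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

math_scores = [85, 65, 80, 75]          # x1..x4
physics_scores = [0.7, 0.8, 0.2, 0.9]   # x1..x4

# Round to three decimals to match the precision used in the options.
math_norm = [round(v, 3) for v in min_max(math_scores)]
physics_norm = [round(v, 3) for v in min_max(physics_scores)]
print(math_norm)     # [1.0, 0.0, 0.75, 0.5]
print(physics_norm)  # [0.714, 0.857, 0.0, 1.0]
```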
Part (b): Missing Data Imputation
The dataset X with missing values is given as:
x1 = [ a, 60 ]
x2 = [11, 75]
x3 = [ 5, 75]
x4 = [ 5, 80]
x5 = [ 7, b ]
For each method below, select the option that correctly replaces the missing values (a and b):
(i) Using the Mean Value:
For the first attribute, the known values are 11, 5, 5, 7 → mean = (11+5+5+7)/4 = 7.
For the second attribute, the known values are 60, 75, 75, 80 → mean = (60+75+75+80)/4 = 72.5.
- Replacing missing values using the mean yields:
- A. a = 7, b = 72.5
- B. a = 7, b = 75
- C. a = 6, b = 72
- D. a = 5, b = 70
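The mean-value calculation above can be reproduced column by column. This is a sketch only: representing missing entries as `None` and the helper name `impute` are assumptions made for illustration.

```python
from statistics import mean

# None marks a missing entry (a in x1, b in x5).
X = [[None, 60], [11, 75], [5, 75], [5, 80], [7, None]]

def impute(rows, fill):
    """Replace None in each column with fill(known values of that column)."""
    cols = list(zip(*rows))
    fills = [fill([v for v in col if v is not None]) for col in cols]
    return [[f if v is None else v for v, f in zip(row, fills)] for row in rows]

X_mean = impute(X, mean)
a, b = X_mean[0][0], X_mean[4][1]  # a -> 7, b -> 72.5
```

Passing a different statistic as `fill` (for example `statistics.median`) gives the corresponding imputation without changing the helper.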
(ii) Using the Most Probable (Mode) Value:
For the first attribute, among {11, 5, 5, 7}, the mode is 5.
For the second attribute, among {60, 75, 75, 80}, the mode is 75.
- Replacing missing values using the most probable method yields:
- A. a = 7, b = 72.5
- B. a = 5, b = 75
- C. a = 7, b = 75
- D. a = 5, b = 72.5
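The mode is simply the most frequent known value in each column. A small sketch with `collections.Counter` makes the frequencies explicit; the variable names are illustrative.

```python
from collections import Counter

first_known = [11, 5, 5, 7]       # first attribute of x2..x5
second_known = [60, 75, 75, 80]   # second attribute of x1..x4

# most_common(1) returns a list holding the (value, count) pair with the highest count.
a, a_count = Counter(first_known).most_common(1)[0]
b, b_count = Counter(second_known).most_common(1)[0]
print(a, a_count)  # 5 2
print(b, b_count)  # 75 2
```

Each column here has a single most frequent value, so the result is unambiguous; a column with tied counts would need an explicit tie-breaking rule.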
(iii) Using kNN Regression with k = 2:
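For x1 (a missing, second attribute = 60), neighbors are ranked by distance on the second attribute: x2 and x3 (both 75) are at distance 15, while x4 (80) is at distance 20. The two nearest neighbors are therefore x2 and x3, and a ≈ (11 + 5)/2 = 8.
For x5 (b missing, first attribute = 7), neighbors are ranked by distance on the first attribute: x3 and x4 (both 5) are at distance 2, while x2 (11) is at distance 4; x1 cannot be used because its first attribute is missing. The two nearest neighbors are therefore x3 and x4, and b ≈ (75 + 80)/2 = 77.5.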
- Using kNN regression with k = 2 yields:
- A. a = 8, b = 77.5
- B. a = 7, b = 75
- C. a = 8, b = 75
- D. a = 7, b = 77.5
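The kNN procedure described above can be sketched as follows. This is an illustrative implementation for two-attribute data only, not a general library routine: the function name `knn_impute`, the use of `None` for missing entries, and the restriction of candidate neighbors to complete rows are all assumptions.

```python
def knn_impute(rows, target, missing_col, k=2):
    """Estimate rows[target][missing_col] as the mean of that column over the
    k complete rows closest to the target on the other (known) attribute."""
    known_col = 1 - missing_col  # valid for two-attribute data only
    query = rows[target][known_col]
    candidates = [r for r in rows if None not in r]  # complete rows only
    # sorted() is stable, so ties in distance keep their original row order.
    nearest = sorted(candidates, key=lambda r: abs(r[known_col] - query))[:k]
    return sum(r[missing_col] for r in nearest) / k

# None marks a missing entry (a in x1, b in x5).
X = [[None, 60], [11, 75], [5, 75], [5, 80], [7, None]]

a = knn_impute(X, target=0, missing_col=0)  # neighbors x2, x3
b = knn_impute(X, target=4, missing_col=1)  # neighbors x3, x4
print(a, b)  # 8.0 77.5
```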
Part (c): Normalized Dissimilarity Between Symbolic Objects
You are given two objects with 4 attributes:
- Attribute 1: A string of 5 characters
- Attribute 2: An interval
- Attribute 3: A set
- Attribute 4: A binary number of 5 bits
Objects: