Euclidean Distance For Beginners
We are familiar with multiple types of distance in the world of machine learning, including Manhattan distance, cosine distance, etc. But the distance under discussion today is Euclidean distance. We will look into how it works and see how it helps in solving clustering problems. We will not focus on complex mathematical formulas here. We will simply use Python and a Jupyter Notebook to calculate the distances on a self-created dataset.
# Let's create a dataframe using pandas first for our calculation
import pandas as pd
df = pd.DataFrame(data=[[12,95],
[10,86],
[9,75],
[11,98],
[5,45],
[6,59],
[4,28]],columns=['Hours Studied','Marks Obtained'])
df.head()
Hours Studied Marks Obtained
0 12 95
1 10 86
2 9 75
3 11 98
4 5 45
Now, just for our understanding, let's take the rows at index 0 and 1 as the starting centroids of our two clusters, c1 and c2. We will then calculate the Euclidean distance of the other data points from these centroids. By cluster, we simply mean a region where data points are grouped close together. Let us visualize the data first to understand the cluster concept here.
from matplotlib import pyplot as plt
%matplotlib inline
#display the data
plt.scatter(df['Hours Studied'],df['Marks Obtained'])
We can roughly identify 2 clusters in the data. Let's call these clusters c1 and c2, and pick the first two rows as their starting centroids, from which we will find the Euclidean distance of the other data points.
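import numpy as np
# wrap each row in a 2D numpy array so it works with both numpy and sklearn
c1 = np.array([df.iloc[0]])
c2 = np.array([df.iloc[1]])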
Now let's go ahead and calculate the Euclidean distance of a data point from our centroids. We will check out both the NumPy procedure and the scikit-learn method here.
print(f"the c1 cluster data points are {c1} and its data type is {type(c1)}")
print(f"the c2 cluster data points are {c2} and its data type is {type(c2)}")
the c1 cluster data points are [[12 95]] and its data type is <class 'numpy.ndarray'>
the c2 cluster data points are [[10 86]] and its data type is <class 'numpy.ndarray'>
Now, from our main dataframe, let's take the row at index 2 (the third row) and convert it into a NumPy array. Then we will check this data point's Euclidean distance from both c1 and c2.
d1 = np.array([df.iloc[2]])
Numpy Method of Checking the Euclidean Distance
# first let's find out the sum of squares
sum_sq_c1 = np.sum(np.square(d1-c1))
sum_sq_c2 = np.sum(np.square(d1-c2))
# Then let's square root the result and find out the euclidean distance
ed_c1d1 = np.sqrt(sum_sq_c1)
ed_c2d1 = np.sqrt(sum_sq_c2)
# Let's print our results
print(f"the euclidean distance between c1 cluster and d1 datapoint is {ed_c1d1}")
print(f"the euclidean distance between c2 cluster and d1 datapoint is {ed_c2d1}")
the euclidean distance between c1 cluster and d1 datapoint is 20.223748416156685
the euclidean distance between c2 cluster and d1 datapoint is 11.045361017187261
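The two steps above (sum of squares, then square root) can also be collapsed into a single call. As a minimal sketch, assuming the same values for c1, c2 and d1 as in this walkthrough, `np.linalg.norm` applied to the difference gives the same distances. By hand, the first one is (9-12)² + (75-95)² = 9 + 400 = 409, and the square root of 409 is about 20.22. The variable names ending in `_norm` are introduced here only for illustration.

```python
import numpy as np

# Same values as in the walkthrough: two starting centroids and one data point
c1 = np.array([[12, 95]])
c2 = np.array([[10, 86]])
d1 = np.array([[9, 75]])

# np.linalg.norm returns the square root of the sum of squared entries,
# so applying it to the difference gives the Euclidean distance directly
ed_c1d1_norm = np.linalg.norm(d1 - c1)
ed_c2d1_norm = np.linalg.norm(d1 - c2)

print(f"distance from c1 to d1 with np.linalg.norm: {ed_c1d1_norm}")
print(f"distance from c2 to d1 with np.linalg.norm: {ed_c2d1_norm}")
```

The values should match the step-by-step calculation, roughly 20.22 for c1 and 11.05 for c2.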
We can see from the above example that d1's Euclidean distance to c2 is smaller than its distance to c1, so d1 is assigned to the c2 cluster. And as the c2 cluster is getting an added data point, its centroid needs to be calculated again. But first, let's check the Euclidean distance with the sklearn method as well.
Sklearn Method of Finding the Euclidean Distance
# importing the module
from sklearn.metrics.pairwise import euclidean_distances
ed_c1d1_sklearn = euclidean_distances(d1,c1)
ed_c2d1_sklearn = euclidean_distances(d1,c2)
#printing the result
print(f"the euclidean distance between c1 cluster and d1 datapoint with sklearn method is {ed_c1d1_sklearn}")
print(f"the euclidean distance between c2 cluster and d1 datapoint with sklearn method is {ed_c2d1_sklearn}")
the euclidean distance between c1 cluster and d1 datapoint with sklearn method is [[20.22374842]]
the euclidean distance between c2 cluster and d1 datapoint with sklearn method is [[11.04536102]]
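The sklearn function is not limited to one point at a time. As a rough sketch, `euclidean_distances` can take the whole dataset and both starting centroids in one call and return a matrix with one row per data point and one column per centroid. Note the assumption here: both centroids stay fixed at rows 0 and 1, so this is only a one-shot comparison and not the step-by-step centroid update described in this article. The names `centroids`, `all_distances` and `nearest` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Rebuild the same dataframe so this block runs on its own
df = pd.DataFrame(data=[[12, 95], [10, 86], [9, 75], [11, 98],
                        [5, 45], [6, 59], [4, 28]],
                  columns=['Hours Studied', 'Marks Obtained'])

# Stack the two starting centroids (rows 0 and 1) into one 2D array
centroids = df.iloc[[0, 1]].to_numpy()

# One row per data point, one column per centroid
all_distances = euclidean_distances(df.to_numpy(), centroids)

# Position of the smaller distance in each row: 0 means c1, 1 means c2
nearest = np.argmin(all_distances, axis=1)

print(all_distances)
print(nearest)
```

Row 2 of `all_distances` holds the same two values computed above for d1, and `nearest` marks that row with 1, meaning c2.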
Now that both the NumPy and sklearn methods give the same result, we can say that the data point d1 belongs to the c2 cluster. Let's calculate the new centroid for the c2 cluster now. Since the cluster now has 2 elements, the new centroid is simply the sum of the existing centroid and the new point, divided by 2. (In general, a centroid is the mean of all the points in its cluster.) First, let's work it out by hand: the c2 array holds [10,86] and the d1 array holds [9,75], so our new centroid for c2 should be ((10+9)/2, (86+75)/2), or (9.5, 80.5). Let's find out:
c2 = (c2+d1)/2
print(f"The new centroid values for c2 cluster is {c2}")
The new centroid values for c2 cluster is [[ 9.5 80.5]]
The next calculations will be based on the new centroid of c2. Each time a data point gets added to a cluster, the centroid of that cluster changes. I hope this gives you a good insight into Euclidean distance and how it is used in clustering.
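Dividing by 2 only works while the cluster holds exactly two points. Here is a small sketch of how the update could look once more points join, assuming we keep track of how many points the centroid already represents. The helper `update_centroid` and the choice of the row at index 4 as the next point are hypothetical and used only to illustrate the idea.

```python
import numpy as np

def update_centroid(centroid, n_points, new_point):
    """Return the mean of a cluster after one more point joins it."""
    # Weight the old centroid by the number of points it already represents
    return (centroid * n_points + new_point) / (n_points + 1)

c2 = np.array([[10, 86]])
d1 = np.array([[9, 75]])

# The cluster held 1 point, so this reduces to (c2 + d1) / 2
c2_new = update_centroid(c2, 1, d1)
print(f"centroid after adding d1: {c2_new}")

# Hypothetical next step: suppose the row at index 4, (5, 45), also joins c2
d2 = np.array([[5, 45]])
c2_next = update_centroid(c2_new, 2, d2)
print(f"centroid after adding d2: {c2_next}")
```

The second result is the plain average of the three points (10, 86), (9, 75) and (5, 45), which is what a centroid is meant to be.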
For more such content please follow my Medium profile and LinkedIn page — https://www.linkedin.com/in/chandan-sengupta/