Calculate the distance between all other points in a TSV file? - python

I have a TSV file filled with n data points and I want to calculate the distances between all of the points. I have something like this:
I thought about using the .iloc feature:
import pandas as pd
x = pd.read_csv('data.tsv', sep='\t')
print (x)
while True: xcord= (int)
I was thinking of incrementing the row index by 1 on each iteration to step through the points, but I don't know how to do that.

Solution using distance_matrix
You can proceed using scipy.spatial.distance_matrix.
Suppose your DataFrame is my_dataframe.
import pandas as pd
import scipy as sp
import scipy.spatial  # load the spatial submodule explicitly so sp.spatial is available
points = pd.DataFrame(my_dataframe, columns=["X", "Y", "Z"]).astype(float)
distance_matrix = sp.spatial.distance_matrix(points, points)
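For a self-contained illustration, the sketch below reads a small TSV from an in-memory string instead of a file and labels the resulting matrix. The sample rows and the column names points, X, Y and Z are assumptions chosen so the distances are easy to check by hand; they are not taken from the asker's file.

```python
import io

import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Hypothetical TSV content (assumed layout: a label column plus X, Y, Z)
tsv_text = "points\tX\tY\tZ\nA\t0\t0\t0\nB\t3\t4\t0\nC\t0\t0\t12\n"
frame = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Keep only the coordinate columns as a float array
coords = frame[["X", "Y", "Z"]].astype(float).to_numpy()

# Pairwise Euclidean distances between every pair of rows
dist = distance_matrix(coords, coords)

# Attach the point labels to rows and columns for readability
dist_frame = pd.DataFrame(dist, index=frame["points"], columns=frame["points"])
print(dist_frame)
```

With these made-up points the distances are 5 (A to B), 12 (A to C) and 13 (B to C), which makes it easy to confirm the matrix is symmetric with zeros on the diagonal.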
Visualising the result
We can use seaborn.heatmap to visualise the obtained results:
from matplotlib import pyplot as plt
import seaborn as sns
labels = my_dataframe["points"]
plt.rcParams['figure.figsize'] = [10, 10]
plt.axis('scaled')
sns.heatmap(distance_matrix,
            annot=True,
            cbar=False,
            fmt="0.2f",
            cmap="YlGnBu",
            xticklabels=labels,
            yticklabels=labels)
plt.title("Distance matrix")
The result is:
A small textual example
A small textual example makes the inputs and outputs of each step easier to follow. Let's consider a DataFrame with just two points:
Generating an example dataframe
import pandas as pd
import numpy as np
a = np.random.uniform(0, 100, size=(2, 3))  # two random points with coordinates in [0, 100)
my_dataframe = pd.DataFrame(np.hstack([[["A"], ["B"]], a]), columns=["points", "X", "Y", "Z"])
The DataFrame we have generated looks like:
Splitting points and labels
We split the labels and the points:
points = pd.DataFrame(my_dataframe, columns=["X", "Y", "Z"]).astype(float)
labels = my_dataframe["points"]
So points looks like:
And labels looks like:
Calculating the distance matrix
Now we can calculate the distance matrix with scipy.spatial.distance_matrix:
distance_matrix = sp.spatial.distance_matrix(points, points)
The resulting matrix is:
array([[ 0. , 93.43955419],
[93.43955419, 0. ]])
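Each off-diagonal entry is simply the Euclidean norm of the difference between two rows. A minimal check with NumPy broadcasting, using two hypothetical fixed points in place of the random ones above, might look like this:

```python
import numpy as np
from scipy.spatial import distance_matrix

# Two hypothetical points (not the random ones from the example above)
pts = np.array([[1.0, 2.0, 3.0],
                [4.0, 6.0, 3.0]])

# Broadcasting builds every pairwise difference: shape (2, 2, 3)
diff = pts[:, None, :] - pts[None, :, :]

# Square, sum over the coordinate axis, take the root
manual = np.sqrt((diff ** 2).sum(axis=-1))

# Same computation through SciPy, for comparison
reference = distance_matrix(pts, pts)
print(manual)
print(reference)
```

Here the two points differ by (3, 4, 0), so both matrices should hold 5 off the diagonal.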
Visualising the obtained matrix
Using the same code as above, we obtain:

Related

How to draw the Probability Density Function (PDF) plot in Python?

I'd like to ask how to draw the Probability Density Function (PDF) plot in Python.
This is my code.
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df
I generated a data frame. Then, I tried to draw a PDF graph.
df["AGW"].sort_values()
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
I obtained the graph above. What did I do wrong? Could you let me know how to draw the Probability Density Function (PDF) plot, which is also known as the normal distribution graph?
Could you let me know which codes (or library) I need to use to draw the PDF graph?
Always many thanks!!
You just need to sort the values (I have not really checked what comes after the edit):
pdf = stats.norm.pdf(df["AGW"].sort_values(), df_mean, df_std)
plt.plot(df["AGW"].sort_values(), pdf)
And it will work.
The line df["AGW"].sort_values() doesn't change df. Maybe you meant df.sort_values(by=['AGW'], inplace=True).
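A minimal check of this behaviour, using a hypothetical three-row frame rather than the 1000-row one from the question:

```python
import pandas as pd

# Hypothetical tiny frame, deliberately out of order
frame = pd.DataFrame({"AGW": [52.0, 47.0, 50.0]})

# sort_values returns a new, sorted Series; frame is left untouched
sorted_copy = frame["AGW"].sort_values()
original_order = frame["AGW"].tolist()
print(original_order)        # still in the original order
print(sorted_copy.tolist())  # sorted copy

# With inplace=True the frame itself is reordered
frame.sort_values(by=["AGW"], inplace=True)
print(frame["AGW"].tolist())
```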
In that case the full code will be :
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(50, 3, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
df.sort_values(by=['AGW'], inplace=True)
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"], df_mean, df_std)
plt.plot(df["AGW"], pdf)
Which gives :
Edit :
I think here we already have the distribution (x is normally distributed), so we don't need to generate the pdf of x. The pdf is typically used for something like this:
import math
mu = 50
variance = 3
sigma = math.sqrt(variance)
x = np.linspace(mu - 5*sigma, mu + 5*sigma, 1000)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
Here we don't need to generate the distribution from x points; we only need to plot the density of the distribution we already have.
So you might use this :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.normal(50, 3, 1000) #Generating Data
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source) #Converting to pandas DataFrame
df.plot(kind = 'density'); # or df["AGW"].plot(kind = 'density');
Which gives :
You might use other packages if you want, like seaborn :
import seaborn as sns
plt.figure(figsize = (5,5))
sns.kdeplot(df["AGW"], bw_method=0.5, fill=True)  # bw_method replaces the deprecated bw argument
plt.show()
Or this (note that sns.distplot is deprecated since seaborn 0.11 in favor of sns.histplot and sns.displot):
import seaborn as sns
sns.set_style("whitegrid") # Setting style(Optional)
plt.figure(figsize = (10,5)) #Specify the size of figure
sns.distplot(x = df["AGW"] , bins = 10 , kde = True , color = 'teal'
, kde_kws=dict(linewidth = 4 , color = 'black')) #kde for normal distribution
plt.show()
Check this article for more.
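As a complement to the density plots above, here is a sketch that avoids the sorting problem altogether: it estimates the mean and standard deviation from the sample and evaluates the normal PDF on an evenly spaced grid, so the curve is smooth regardless of the order of the data. The seed, the variable names and the 5-sigma grid width are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data with the same parameters as in the question (seed is arbitrary)
rng = np.random.default_rng(0)
sample = rng.normal(50, 3, 1000)

# Estimate the parameters of the normal distribution from the sample
mu_hat = sample.mean()
sigma_hat = sample.std()

# Evaluate the PDF on an ordered grid instead of on the unordered sample
grid = np.linspace(mu_hat - 5 * sigma_hat, mu_hat + 5 * sigma_hat, 1000)
density = stats.norm.pdf(grid, mu_hat, sigma_hat)

# Sanity check: a Riemann sum of the density should be close to 1
step = grid[1] - grid[0]
area = density.sum() * step
print(area)
```

Passing grid and density to plt.plot then draws a single smooth bell curve, because the x values are already in increasing order.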

How to get the intersection of 2 lines in a plot?

I would like to determine the intersection of two Matplotlib plots.
The input data for the first plot is stored in a CSV file that looks like this:
Time;Channel A;Channel B;Channel C;Channel D
(s);(mV);(mV);(mV);(mV)
0,00000000;-16,28006000;2,31961900;13,29508000;-0,98889020
0,00010000;-16,28006000;1,37345900;12,59309000;-1,34293700
0,00020000;-16,16408000;1,49554400;12,47711000;-1,92894600
0,00030000;-17,10414000;1,25747800;28,77549000;-1,57489900
0,00040000;-16,98205000;1,72750600;6,73299900;0,54327920
0,00050000;-16,28006000;2,31961900;12,47711000;-0,51886220
0,00060000;-16,39604000;2,31961900;12,47711000;0,54327920
0,00070000;-16,39604000;2,19753400;12,00708000;-0,04883409
0,00080000;-17,33610000;7,74020200;16,57917000;-0,28079600
0,00090000;-16,98205000;2,31961900;9,66304500;1,48333500
This is the shortened CSV file. The Original has a lot more Data.
I got this code so far to get the FFT of Channel D:
import matplotlib.pyplot as plt
import pandas as pd
from numpy.fft import rfft, rfftfreq
a=pd.read_csv('20210629-0007.csv', sep = ';', skiprows=[1,2],usecols = [4],dtype=float, decimal=',')
dt = 1/10000 #time increment in each data
#print(a.head())
n=len(a)
acc=a.values.flatten() #to convert DataFrame to 1D array
#acc value must be in numpy array format for half way mirror calculation
fft=rfft(acc)*dt
freq=rfftfreq(n,d=dt)
FFT=abs(fft)
plt.plot(freq,FFT)
plt.axvline(x=150, color = 'red')
plt.show()
Does anybody know how to get the intersection of those 2 plots ( red line and blue line at the same frequency ) ?
I would be very grateful for any help!
manually
This is not really a programming question, rather basic mathematics.
Here is your plot:
Let's call (x1,y1) and (x2,y2) the first two points of your blue line and (x,y) the coordinates of the intersection.
You have this relationship between the points: (x-x1)/(x2-x1) = (y-y1)/(y2-y1)
Thus: y=y1+(x-x1)*(y2-y1)/(x2-x1)
Which gives FFT[0]+(150-0)*(FFT[1]-FFT[0])/(freq[1]-freq[0])
Coordinates of the intersection are (150, 0.000189)
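The same linear interpolation can be written out in code and compared with np.interp, which locates the two neighbouring points by itself. The freq and amplitude arrays below are hypothetical stand-ins for the FFT output, and the sketch assumes the vertical line lies strictly inside the frequency range.

```python
import numpy as np

# Hypothetical spectrum (stand-in for freq and FFT from the question)
freq = np.array([0.0, 100.0, 200.0, 300.0])
amplitude = np.array([1.0, 3.0, 2.0, 0.5])
x_line = 150.0  # position of the vertical line

# Find the two samples that bracket x_line (freq must be sorted ascending)
i = np.searchsorted(freq, x_line) - 1
x1, x2 = freq[i], freq[i + 1]
y1, y2 = amplitude[i], amplitude[i + 1]

# Formula from the answer: y = y1 + (x - x1) * (y2 - y1) / (x2 - x1)
y_manual = y1 + (x_line - x1) * (y2 - y1) / (x2 - x1)

# np.interp performs the same piecewise-linear interpolation
y_interp = np.interp(x_line, freq, amplitude)
print(y_manual, y_interp)
```

With these made-up values the line at 150 falls halfway between (100, 3.0) and (200, 2.0), so both results should be 2.5.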
programmatically
You can use the pd.Series.interpolate method
import numpy as np
import pandas as pd
np.random.seed(0)
s = pd.Series(np.random.randint(0,100,20),
index=sorted(np.random.choice(range(100), 20))).sort_index()
ax = s.plot()
ax.axvline(35, color='r')
s.loc[35] = np.nan
ax.plot(35, s.sort_index().interpolate(method='index').loc[35], marker='o')
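Because the random data above makes the result hard to verify by eye, here is a smaller deterministic sketch of the same idea; the three data points are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical curve sampled at x = 0, 30 and 50
s = pd.Series([10.0, 20.0, 40.0], index=[0, 30, 50])

# Add the x position of the vertical line with a missing value
s.loc[35] = np.nan

# method="index" interpolates linearly using the index values as x coordinates
y_at_35 = s.sort_index().interpolate(method="index").loc[35]
print(y_at_35)
```

The value at 35 lies a quarter of the way from (30, 20) to (50, 40), so the interpolated result should be 25.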

KMeans clustering won't work on a dataframe with more than 4 columns

I have asked a similar question here: How to apply KMeans to get the centroid using dataframe with multiple features and I received some valuable responses. However, I have not succeeded in getting KMeans clustering working on a dataframe with more than 4 columns.
The dataframe in question has 5 columns as below:
col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22
I have a simple KMeans clustering python code that I try to apply on the 5 column dataframe like below.
from numpy import unique
from numpy import where
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
X = np.array(df)
model = KMeans(n_clusters=5)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4])
pyplot.show()
When I run the code it complains about the line pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4]), with the error message 'ValueError: Unrecognized marker style [[0.14 0.44 0.22]]'. However, if I remove the 5th column from the dataframe (i.e. col5) and remove X[row_ix, 4] from the code, the clustering works.
What do I need to do to get KMeans working on my example dataframe?
[Updated: 2 or 3 dimension at a time]
From the previous post, it was suggested I could split the task by representing 2 or 3 dimensions at a time using the below function. However, the function does not produce the expected clustering output (see attached output.png)
def plot(self):
    import itertools
    combinations = itertools.combinations(range(self.K), 2) # generate all combinations of features
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1) # initialise one subplot for each feature combination
    for (x,y), ax in zip(combinations, axes.ravel()): # loop through combinations and subplots
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)
        for point in self.centroids:
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py, marker="x", color='black', linewidth=2)
        ax.set_title('feature {} vs feature {}'.format(x,y))
    plt.show()
How can I fix the above function to get the clustering output?
As mentioned in the other answer and comments, you cannot plot all 5 axes together. One way is to use a dimensionality reduction technique such as PCA to reduce the data to 2 dimensions and plot:
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
from sklearn.decomposition import PCA
df = pd.read_csv('test.csv')
model = KMeans(n_clusters=5)
model.fit(df)
yhat = model.predict(df)
clusters = np.unique(yhat)
dims = PCA(n_components=2).fit_transform(df)  # reduce the features to 2 principal components
dims = pd.DataFrame(dims,columns=['PC1','PC2'])
fig,ax = pyplot.subplots(1,1)
for cluster in clusters:
    ix = yhat == cluster
    ax.scatter(x=dims.loc[ix,'PC1'],y=dims.loc[ix,'PC2'],label=cluster)
ax.legend()
Or you can use seaborn to visualize all your variables pairwise, which is fine if you only have 5 variables:
import seaborn as sns
df['cluster'] = yhat
sns.pairplot(data=df,hue='cluster',diag_kind=None)
Your KMeans works, but the way you are trying to display the result is not valid.
If you look at the documentation of the matplotlib scatter function (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html), you will see that the first four arguments of the function can accept an array-like, while the fifth only accepts a 'MarkerStyle'. That's why you get an error only when you add the fifth argument.
Actually, you are trying to plot a 5-dimensional dataset in a 2-dimensional plane, which is not possible without performing a dimensionality reduction beforehand.
A PCA or a PLSDA could be a good option to reduce the dimensionality of your dataset.
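To make the point about positional arguments concrete, the sketch below names each argument of scatter explicitly. The first two columns are the plot coordinates, and mapping col3 to marker size and col4 to colour is purely an illustrative choice, not something the question asked for; col5 is not shown at all, which is why a dimensionality reduction is still needed to use every feature.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import numpy as np

# First five rows of the data from the question
X = np.array([
    [0.54, 0.68, 0.46, 0.98, 0.15],
    [0.52, 0.44, 0.19, 0.29, 0.44],
    [1.27, 1.15, 1.32, 0.60, 0.14],
    [0.88, 0.79, 0.63, 0.58, 0.18],
    [1.39, 1.15, 1.32, 0.41, 0.44],
])

fig, ax = plt.subplots()

# Positional order of scatter is x, y, s (sizes), c (colours), marker.
# Naming them avoids passing a data column where a marker style is expected.
points = ax.scatter(x=X[:, 0], y=X[:, 1], s=100 * X[:, 2], c=X[:, 3], marker="o")
fig.colorbar(points, ax=ax, label="col4")
ax.set_xlabel("col1")
ax.set_ylabel("col2")

n_points = len(points.get_offsets())
print(n_points)
```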

How to scale histogram bar heights in matplotlib / seaborn?

My dataset is a numpy array of size (m, 1) and I need 2 plots: a) one of the whole data normalized (probability) b) one of a subset of the dataset keeping the same normalization as before.
The problem is that in case b) the normalization options provided by matplotlib and seaborn only "see" the subset so they cannot normalize based on the whole data.
Essentially what I want to do is:
bar_height = bar_count / m
Sample data:
array([[-0.00996642],
[ 0.00407526],
[ 0.00547561],
...,
[ 0.05205999],
[ 0.00224144],
[ 0.01201942]])
You can use np.histogram() to calculate the histogram and then draw the bars with plt.bar() :
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
m = 1000  # size of the whole dataset
samples = np.random.rand(200)  # the subset to plot
hist_values, bin_edges = np.histogram(samples)
plt.bar(x=bin_edges[:-1], height=hist_values / m, width=np.diff(bin_edges), align='edge')
plt.show()
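To keep the two plots comparable, it may also help to compute the bin edges once from the whole dataset and reuse them for the subset. The sketch below does that with hypothetical data (a normal sample, with the positive values as the subset) and divides both sets of counts by the size m of the whole dataset.

```python
import numpy as np

# Hypothetical whole dataset and subset
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
m = data.size
subset = data[data > 0]

# Bin edges are derived from the whole dataset and shared by both histograms
bin_edges = np.histogram_bin_edges(data, bins=20)
counts_all, _ = np.histogram(data, bins=bin_edges)
counts_sub, _ = np.histogram(subset, bins=bin_edges)

# bar_height = bar_count / m for both plots
heights_all = counts_all / m
heights_sub = counts_sub / m
print(heights_all.sum(), heights_sub.sum())
```

The heights of the whole dataset sum to 1, while those of the subset sum to the fraction of the data it contains. Either array can be passed to plt.bar with x=bin_edges[:-1], width=np.diff(bin_edges) and align='edge', as in the answer above.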

Dendrogram using pandas and scipy

I wish to generate a dendrogram based on correlation using pandas and scipy. I use a dataset (as a DataFrame) consisting of returns, which is of size n x m, where n is the number of dates and m the number of companies. Then I simply run the script
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_matrix = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_matrix, index=dates)
z = hc.linkage(dataframe.values.T, method='average', metric='correlation')
dendrogram = hc.dendrogram(z, labels=dataframe.columns)
plt.show()
and I get a nice dendrogram. Now, the thing is that I'd also like to use other correlation measures apart from just ordinary Pearson correlation, which is a feature that's incorporated in pandas by simply invoking DataFrame.corr(method='<method>'). So, I thought at first that it was to simply run the following code
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_returns = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = dataframe.corr()
z = hc.linkage(corr.values, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
However, if I do this I get strange values on the y-axis as the maximum value > 1.4. Whereas if I run the first script it's about 1. What am I doing wrong? Am I using the wrong metric in hc.linkage?
EDIT I might add that the shape of the dendrogram is exactly the same. Do I have to normalize the third column of the resulting z with the maximum value?
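Found the solution. If you have already calculated a distance matrix (be it correlation or whatever), you simply have to condense the matrix using distance.squareform. Otherwise, linkage treats each row of the square matrix as an observation vector and computes Euclidean distances between the rows. That is,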
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = 1 - dataframe.corr()
corr_condensed = hc.distance.squareform(corr) # convert to condensed
z = hc.linkage(corr_condensed, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
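A slightly more defensive sketch of the same approach with a non-Pearson measure follows. Spearman is chosen only as an example, and the explicit clean-up of the diagonal and the symmetry is an assumption on my part, added because squareform validates its input and floating-point round-off could otherwise trip that check.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

# Hypothetical returns: 365 dates, 5 companies
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(365, 5)))

# Correlation distance based on Spearman correlation (example choice)
dist = 1 - returns.corr(method="spearman")

# Work on a writable copy; force an exact zero diagonal and exact symmetry
dist_values = dist.to_numpy().copy()
dist_values = (dist_values + dist_values.T) / 2
np.fill_diagonal(dist_values, 0.0)

# Condense the square matrix to the 1-D form that linkage expects for distances
condensed = squareform(dist_values)
z = hc.linkage(condensed, method="average")

# no_plot=True returns the tree layout without needing a figure
tree = hc.dendrogram(z, no_plot=True, labels=list(dist.columns))
print(tree["ivl"])
```

Since 1 minus a correlation lies between 0 and 2, the merge heights in the third column of z stay within that range.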
