Plot clusters of similar words from pandas dataframe - python

I have a big dataframe, here's a small subset:
key_words prove have read represent lead replace
be 0.58 0.49 0.48 0.17 0.23 0.89
represent 0.66 0.43 0 1 0 0.46
associate 0.88 0.23 0.12 0.43 0.11 0.67
induce 0.43 0.41 0.33 0.47 0 0.43
It shows how close each word in key_words is to each of the column words (based on the distance between their embeddings).
I want to visualize this dataframe so that I can see the clusters formed by the words that are closest to each other.
Is there a simple way to do this, given that the key_words column holds string values?

One option is to set the key_words column as index and to use seaborn.clustermap to plot the clusters:
# pip install seaborn
import seaborn as sns
sns.clustermap(df.set_index('key_words'),  # data
               vmin=0, vmax=1,             # color scale limits (0 = white, 1 = black)
               cmap='Greys',               # color palette
               figsize=(5, 5)              # plot size
               )
[Figure: clustermap output]
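To see which rows end up next to each other without drawing anything, the row clustering can be sketched directly with SciPy. This is a minimal sketch that rebuilds the small subset from the question and assumes seaborn's default clustermap settings (average linkage, Euclidean distance); the variable names are illustrative only.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list

# Small subset from the question, rebuilt by hand
df = pd.DataFrame({
    'key_words': ['be', 'represent', 'associate', 'induce'],
    'prove':     [0.58, 0.66, 0.88, 0.43],
    'have':      [0.49, 0.43, 0.23, 0.41],
    'read':      [0.48, 0.00, 0.12, 0.33],
    'represent': [0.17, 1.00, 0.43, 0.47],
    'lead':      [0.23, 0.00, 0.11, 0.00],
    'replace':   [0.89, 0.46, 0.67, 0.43],
})
data = df.set_index('key_words')

# Hierarchical clustering of the rows (assumed to match clustermap's defaults)
row_linkage = linkage(data.to_numpy(), method='average', metric='euclidean')

# Order of the rows along the dendrogram leaves
row_order = data.index[leaves_list(row_linkage)].tolist()
print(row_order)
```

Words that appear next to each other in row_order are the ones the dendrogram groups together first.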

Related

Plotting pcolormesh in python from csv data

I am trying to make a pcolormesh plot in Python from my CSV file, but I am stuck on a dimension error.
My csv looks like this:
ratio 5% 10% 20% 30% 40% 50%
1.2 0.60 0.63 0.62 0.66 0.66 0.77
1.5 0.71 0.81 0.75 0.78 0.76 0.77
1.8 0.70 0.82 0.80 0.73 0.80 0.78
1.2 0.75 0.84 0.94 0.84 0.76 0.82
2.3 0.80 0.92 0.93 0.85 0.87 0.86
2.5 0.80 0.85 0.91 0.85 0.87 0.88
2.9 0.85 0.91 0.96 0.96 0.86 0.87
I want to make a pcolormesh plot where the x-axis shows ratio, the y-axis shows the CSV header (i.e. 0.05, 0.1, 0.2, 0.3, 0.4, 0.5), and the cells are colored by the values from the second column onward.
I tried the following in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('./result.csv')
xlabel = df['ratio']
ylabel = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
plt.figure(figsize=(8, 6))
df = df.iloc[:, 1:]
plt.pcolormesh(df, xlabel, ylabel, cmap='RdBu')
plt.colorbar()
plt.xlabel('ratio')
plt.ylabel('threshold')
plt.show()
But it doesn't work.
Can I get some help making the plot I want?
Thank you.
First off: ignoring warnings is a really bad idea, especially in code that doesn't work as expected.
X and Y in plt.pcolormesh define the mesh, i.e. the edges of the cells, not the cells themselves. There is one more edge both horizontally and vertically than there are cells. You'll need to label the centers in a separate step.
Apart from that, you would have to change the order: when there are 3 unnamed parameters, the first is X, the second would be Y and the third the values for the colors.
Also, the columns of the dataframe will be the columns of the mesh. You seem to want them to be the rows of the mesh. Therefore, the dataframe should be transposed.
This is how your code could work:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from io import StringIO
df_str = '''ratio 5% 10% 20% 30% 40% 50%
1.2 0.60 0.63 0.62 0.66 0.66 0.77
1.5 0.71 0.81 0.75 0.78 0.76 0.77
1.8 0.70 0.82 0.80 0.73 0.80 0.78
1.2 0.75 0.84 0.94 0.84 0.76 0.82
2.3 0.80 0.92 0.93 0.85 0.87 0.86
2.5 0.80 0.85 0.91 0.85 0.87 0.88
2.9 0.85 0.91 0.96 0.96 0.86 0.87'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')  # whitespace-separated columns
xlabel = df['ratio']
ylabel = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
plt.figure(figsize=(8, 6))
df = df.iloc[:, 1:]
plt.pcolormesh(df.T, cmap='RdBu')
plt.xticks(np.arange(len(xlabel)) + 0.5, xlabel)
plt.yticks(np.arange(len(ylabel)) + 0.5, ylabel)
plt.colorbar()
plt.xlabel('ratio')
plt.ylabel('threshold')
plt.show()
Note that your code would be a lot more straightforward if you'd use seaborn, which builds on matplotlib and pandas to easily create statistical plots.
Seaborn's heatmap uses the index of the dataframe to label the y-axis, and the columns to label the x-axis. So, you can set the 'ratio' column as index and transpose the dataframe. A colorbar will be generated by default, and optionally the cells can be annotated with their values.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# df = pd.read_csv(...)
plt.figure(figsize=(8, 6))
ax = sns.heatmap(df.set_index('ratio').T, annot=True, cmap='RdBu')
ax.set_ylabel('threshold')
plt.show()
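The point about edges versus cells can be made explicit by passing the edge coordinates yourself. The following is only a sketch: the 6x7 array is placeholder data standing in for the transposed dataframe, and the Agg backend is selected just so it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values: 6 rows (thresholds) x 7 columns (ratios)
values = np.arange(42).reshape(6, 7)
n_rows, n_cols = values.shape

# One more edge than cells in each direction
x_edges = np.arange(n_cols + 1)
y_edges = np.arange(n_rows + 1)

fig, ax = plt.subplots(figsize=(8, 6))
mesh = ax.pcolormesh(x_edges, y_edges, values, cmap='RdBu')  # order: X, Y, C

# Put the ticks at the cell centers, halfway between consecutive edges
ax.set_xticks(x_edges[:-1] + 0.5)
ax.set_yticks(y_edges[:-1] + 0.5)
fig.colorbar(mesh, ax=ax)
```

Tick labels (the ratio and threshold values) can then be attached with ax.set_xticklabels and ax.set_yticklabels.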

Percent of total clusters per cluster per season using pandas

I have a pandas DataFrame that looks like this with 12 clusters in total. Certain clusters don't appear in a certain season.
I want to create a multi-line graph showing, for each season, the share of teams that fall in each cluster. So if there are 30 teams in the 97-98 season and 10 of them are in Cluster 1, then that value would be .33, since Cluster 1 has one third of the total possible spots.
It'll look like this
I want the dataset to look like this, where each cluster gets its share of that season's total. I've tried using the pandas groupby method to get a bunch of lists and then calling value_counts() on them, but that doesn't work, since looping through df.groupby(['SEASON']) yields tuples, not a Series.
Thanks so much
Use .groupby combined with .value_counts and .unstack:
temp_df = df.groupby(['SEASON'])['Cluster'].value_counts(normalize=True).unstack().fillna(0.0)
temp_df.plot()
print(temp_df.round(2))
Cluster    0     1     2     4     5    6     7    10    11
SEASON
1996-97  0.1  0.21  0.17  0.21  0.07  0.1  0.03  0.07  0.03
1997-98  0.2  0.00  0.20  0.20  0.00  0.0  0.20  0.20  0.00
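Here is a small self-contained sketch of the same chain on made-up data, to show what each step produces. The six rows are hypothetical; only the column names SEASON and Cluster come from the question.

```python
import pandas as pd

# Hypothetical data: one row per team per season
df = pd.DataFrame({
    'SEASON':  ['1996-97'] * 4 + ['1997-98'] * 2,
    'Cluster': [0, 0, 1, 2, 1, 2],
})

shares = (
    df.groupby('SEASON')['Cluster']
      .value_counts(normalize=True)  # fraction of each cluster within a season
      .unstack()                     # clusters become columns
      .fillna(0.0)                   # clusters absent from a season get 0
)
print(shares)
```

Each row of shares sums to 1, and shares.plot() draws one line per cluster across the seasons.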

Create line plot from dataframe with two columns index

I have the following dataframe:
>>> mean_traf_tie
a     d     c
0.22  0.99  0.11    22
            0.23    21
            0.34    34
            0.46    45
0.44  0.99  0.11    45
            0.23    65
            0.34    66
            0.46    68
0.50  0.50  0.11    22
            0.23    12
            0.34    34
            0.46    37
...
I want to create a plot from this dataframe in which c is the x axis, y is the mean velocity, and there is one line per combination of the a and d columns. For example, one line is for a=0.22 and d=0.99, with c on the x axis and the mean velocity on the y axis; the second line is for a=0.44 and d=0.99, etc.
I have tried to do it like this:
df.plot()
(The values are different in the original dataframe.)
As you can see, for some reason it plots a,d on the x axis and creates only one line.
I have tried to fix it like this:
df.unstack(level=0).plot(figsize=(10,6))
but then I got a very weird graph, with the correct lines by a and d but the wrong x axis:
As you can see, it somehow plots the a,d values, but that is not what I want. I want the x axis to be the c column, with lines based on the a,d columns, which should give continuous lines.
I have tried this as well:
df[('mean_traf_tie')].unstack(level=0).plot(figsize=(10,6))
plt.xlabel('C')
plt.ylabel('mean_traf_tie')
but again got:
The desired output has the c column as the x axis, mean_traf_tie as the y axis, and lines generated based on the a and d columns (a line for 0.22 and 0.99, a line for 0.44 and 0.99, etc.).
Update:
I have managed to get around it by concatenating the two index columns into one before plotting, like this:
df['a,d'] = list(zip(df.a, df.d))
df=df.groupby(['a,d','C']).mean()
df.unstack(level=0).plot(figsize=(10,6))
The legend is still not ideal, but I got the lines and axes I wanted.
If anyone has a better idea of how to do it with the original columns, I'm still open to learning.
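One possible way to keep the original index levels is to unstack both a and d at once, so that only c remains as the row index. This is a sketch, assuming a, d and c are levels of a MultiIndex and the value column is named mean_traf_tie; the rows are copied from the excerpt above.

```python
import pandas as pd

# Rows copied from the excerpt in the question: (a, d, c, value)
rows = [
    (0.22, 0.99, 0.11, 22), (0.22, 0.99, 0.23, 21),
    (0.22, 0.99, 0.34, 34), (0.22, 0.99, 0.46, 45),
    (0.44, 0.99, 0.11, 45), (0.44, 0.99, 0.23, 65),
    (0.44, 0.99, 0.34, 66), (0.44, 0.99, 0.46, 68),
    (0.50, 0.50, 0.11, 22), (0.50, 0.50, 0.23, 12),
    (0.50, 0.50, 0.34, 34), (0.50, 0.50, 0.46, 37),
]
df = (pd.DataFrame(rows, columns=['a', 'd', 'c', 'mean_traf_tie'])
        .set_index(['a', 'd', 'c']))

# Move a and d into the columns; c stays as the index (the future x axis)
wide = df['mean_traf_tie'].unstack(level=['a', 'd'])
print(wide)
```

Calling wide.plot(figsize=(10, 6)) on the result should then draw one line per (a, d) pair with c on the x axis.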

Delete values over the diagonal in a matrix with python

I have the following problem with a matrix in Python and NumPy.
Given this matrix:
       Cmpd1  Cmpd2  Cmpd3  Cmpd4
Cmpd1  1      0.32   0.77   0.45
Cmpd2  0.32   1      0.14   0.73
Cmpd3  0.77   0.14   1      0.29
Cmpd4  0.45   0.73   0.29   1
I want to obtain this:
       Cmpd1  Cmpd2  Cmpd3  Cmpd4
Cmpd1  1
Cmpd2  0.32   1
Cmpd3  0.77   0.14   1
Cmpd4  0.45   0.73   0.29   1
I was trying with np.diag() but it doesn't work.
Thanks!
Use np.tril(a) to extract the lower triangular matrix.
See: https://docs.scipy.org/doc/numpy/reference/generated/numpy.tril.html
Make sure you convert your matrix to a NumPy array first; then, to extract the lower part of the matrix (entries above the diagonal are set to zero), use:
np.tril(a)
where 'a' is your matrix. https://numpy.org/doc/stable/reference/generated/numpy.tril.html#numpy.tril
Equally, if you need the upper part of the matrix, use:
np.triu(a)
https://numpy.org/doc/stable/reference/generated/numpy.triu.html
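As a small sketch on the matrix from the question: np.tril fills the upper triangle with zeros, so if the entries should look empty instead, one option (an assumption about what is wanted) is to replace them with NaN using a boolean mask.

```python
import numpy as np

a = np.array([
    [1.00, 0.32, 0.77, 0.45],
    [0.32, 1.00, 0.14, 0.73],
    [0.77, 0.14, 1.00, 0.29],
    [0.45, 0.73, 0.29, 1.00],
])

# Lower triangle including the diagonal; everything above becomes 0
lower = np.tril(a)

# Same selection, but with NaN above the diagonal instead of 0
keep = np.tril(np.ones(a.shape, dtype=bool))
blanked = np.where(keep, a, np.nan)

print(lower)
print(blanked)
```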

Calculation is done only on part of the table

I am trying to calculate the kurtosis and skewness of my data. I managed to create a table, but for some reason the result covers only a few columns and not all the fields.
For example, as you can see, I have many fields (columns):
I calculate the skewness and kurtosis using the following code:
sk=pd.DataFrame(data.skew())
kr=pd.DataFrame(data.kurtosis())
sk['kr']=kr
sk.rename(columns ={0: 'sk'}, inplace =True)
but then I get a result that contains only about half of the columns I have:
I have tried head(10), but it doesn't change the fact that some columns disappeared.
How can I calculate this for all the columns?
It is really hard to reproduce the error since you did not give the original data. Probably your dataframe contains non-numerical values in the missing columns which would result in this behavior.
import pandas as pd

# The last row holds a string ('po') in column lg3
dat = {"1": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "2": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "3": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "4": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "5": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 'po', 'lg4': 0.45}}
df = pd.DataFrame.from_dict(dat).T
print(df)
lg1 lg2 lg3 lg4
1 0.12 0.23 0.34 0.45
2 0.12 0.23 0.34 0.45
3 0.12 0.23 0.34 0.45
4 0.12 0.23 0.34 0.45
5 0.12 0.23 po 0.45
print(df.kurtosis())
lg1 0
lg2 0
lg4 0
The solution would be to preprocess the data.
One word of advice would be to check for consistency in the error, i.e. are the same lines always missing?
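One way such preprocessing could look, as a sketch on the same toy data: coerce every column to numeric so that stray strings become NaN, then compute the statistics. Whether turning bad values into NaN is acceptable depends on the real data, so treat this as an assumption rather than the fix.

```python
import pandas as pd

# Same toy data as above: column lg3 contains one stray string
df = pd.DataFrame({
    'lg1': [0.12] * 5,
    'lg2': [0.23] * 5,
    'lg3': [0.34, 0.34, 0.34, 0.34, 'po'],
    'lg4': [0.45] * 5,
}, index=['1', '2', '3', '4', '5'])

# Convert each column to numbers; anything unparseable becomes NaN
numeric = df.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)

# lg3 now takes part in the calculation (NaN values are skipped)
kurt = numeric.kurtosis()
print(kurt)
```

Counting the NaN values per column with numeric.isna().sum() shows how many entries had to be coerced.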
