Plotting pcolormesh in python from csv data - python

I am trying to make a pcolormesh plot in Python from my CSV file, but I'm stuck with a dimension error.
My csv looks like this:
ratio 5% 10% 20% 30% 40% 50%
1.2 0.60 0.63 0.62 0.66 0.66 0.77
1.5 0.71 0.81 0.75 0.78 0.76 0.77
1.8 0.70 0.82 0.80 0.73 0.80 0.78
1.2 0.75 0.84 0.94 0.84 0.76 0.82
2.3 0.80 0.92 0.93 0.85 0.87 0.86
2.5 0.80 0.85 0.91 0.85 0.87 0.88
2.9 0.85 0.91 0.96 0.96 0.86 0.87
I want to make a pcolormesh plot where the x-axis shows the ratio, the y-axis shows the CSV header values, i.e. 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, and the cells are colored by the values in the CSV.
I tried to do following in python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('./result.csv')
xlabel = df['ratio']
ylabel = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
plt.figure(figsize=(8, 6))
df = df.iloc[:, 1:]
plt.pcolormesh(df, xlabel, ylabel, cmap='RdBu')
plt.colorbar()
plt.xlabel('rati0')
plt.ylabel('threshold')
plt.show()
But it doesn't work.
Can I get some help to make the plot I want?
Thank you.

First off: ignoring warnings is a really bad idea, especially in code that doesn't work as expected.
X and Y in plt.pcolormesh define the mesh, i.e. the edges of the cells, not the cells themselves. There is one more edge than there are cells, both horizontally and vertically, so you'll need to label the cell centers in a separate step.
Apart from that, you would have to change the order: with three unnamed parameters, the first is X, the second Y, and the third the values for the colors.
Also, the columns of the dataframe will be the columns of the mesh. You seem to want to have them to be the rows of the mesh. Therefore, the dataframe should be transposed.
This is how your code could work:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from io import StringIO
df_str = '''ratio 5% 10% 20% 30% 40% 50%
1.2 0.60 0.63 0.62 0.66 0.66 0.77
1.5 0.71 0.81 0.75 0.78 0.76 0.77
1.8 0.70 0.82 0.80 0.73 0.80 0.78
1.2 0.75 0.84 0.94 0.84 0.76 0.82
2.3 0.80 0.92 0.93 0.85 0.87 0.86
2.5 0.80 0.85 0.91 0.85 0.87 0.88
2.9 0.85 0.91 0.96 0.96 0.86 0.87'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')
xlabel = df['ratio']
ylabel = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
plt.figure(figsize=(8, 6))
df = df.iloc[:, 1:]
plt.pcolormesh(df.T, cmap='RdBu')
plt.xticks(np.arange(len(xlabel)) + 0.5, xlabel)
plt.yticks(np.arange(len(ylabel)) + 0.5, ylabel)
plt.colorbar()
plt.xlabel('ratio')
plt.ylabel('threshold')
plt.show()
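Alternatively, instead of relabeling the tick positions, you can pass explicit edge arrays, since pcolormesh expects one more edge than cells in each direction. A minimal sketch with a random grid (not the question's data):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# 6 rows x 7 columns of cell values (thresholds x ratios)
C = np.random.rand(6, 7)

# cell *edges*: one more than the number of cells in each direction
x_edges = np.arange(C.shape[1] + 1)  # 8 edges for 7 columns
y_edges = np.arange(C.shape[0] + 1)  # 7 edges for 6 rows

plt.pcolormesh(x_edges, y_edges, C, cmap='RdBu')
plt.colorbar()
```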
Note that your code would be a lot more straightforward if you'd use seaborn, which builds on matplotlib and pandas to easily create statistical plots.
Seaborn's heatmap uses the index of the dataframe to label the y-axis, and the columns to label the x-axis. So, you can set the 'ratio' column as index and transpose the dataframe. A colorbar will be generated by default, and optionally the cells can be annotated with their values.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# df = pd.read_csv(...)
plt.figure(figsize=(8, 6))
ax = sns.heatmap(df.set_index('ratio').T, annot=True, cmap='RdBu')
ax.set_ylabel('threshold')
plt.show()

Related

Matching to a specific year column in pandas

I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I am trying to match to the closest year to the given value before calculating the time to the first "year" above 70%. So row 0 would match to "year 3", and we can see in the same row that it will take two years until "year 5", which is the first occurrence in the row above 70%.
For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data, so at the moment I can't think of a way to begin other than some use of .abs() to start the matching process. All help appreciated.
Vectorized Pandas Approach:
Use reset_index(drop=True) together with .T so that both dataframes share the same column labels and can be subtracted from each other in a vectorized way. pd.concat() with * duplicates the first column across a dataframe, so you can take the absolute difference of the dataframes without looping through columns.
Use idxmax and idxmin to identify the column numbers according to your criteria.
Use np.select according to your criteria.
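As a quick illustration of how np.select resolves the conditions (a toy series, not the question's data):

```python
import numpy as np
import pandas as pd

s = pd.Series([2, -1, 0])
# the first condition that matches wins; the last argument is the default
out = np.select([s.gt(0), s.lt(0)], ['pos', 'neg'], 'zero')
print(out)  # ['pos' 'neg' 'zero']
```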
import pandas as pd
import numpy as np
# identify 70% columns
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)
# identify column number of lowest absolute difference to Given
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T
- pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
.T.reset_index(drop=True).T)).abs().idxmin(axis=1)
# Generate an output series
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
[output, 'full', pct_70],np.nan)
df
Out[1]:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
Here you go!
import numpy as np
def output(df):
    output = []
    for i in df.iterrows():
        row = i[1].to_list()
        given = row[0]
        compare = np.array(row[1:])
        first_70 = np.argmax(compare > 0.7)
        if np.isnan(given):
            output.append(first_70 + 1)
            continue
        if given > 0.7:
            output.append('full')
            continue
        diff = np.abs(compare - given)
        closest_year = diff.argmin()
        output.append(first_70 - closest_year)
    return output
df['output'] = output(df)

Plot clusters of similar words from pandas dataframe

I have a big dataframe, here's a small subset:
key_words prove have read represent lead replace
be 0.58 0.49 0.48 0.17 0.23 0.89
represent 0.66 0.43 0 1 0 0.46
associate 0.88 0.23 0.12 0.43 0.11 0.67
induce 0.43 0.41 0.33 0.47 0 0.43
It shows how close each word from key_words is to each of the other columns (based on their embedding distance).
I want to find a way to visualize this dataframe so that I see the clusters that are being formed among the words that are closest to each other.
Is there a simple way to do this, considering that the key_word column has string values?
One option is to set the key_words column as index and to use seaborn.clustermap to plot the clusters:
# pip install seaborn
import seaborn as sns
sns.clustermap(df.set_index('key_words'), # data
vmin=0, vmax=1, # values of min/max here white/black
cmap='Greys', # color palette
figsize=(5,5) # plot size
)
output:
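If you also want the cluster ordering programmatically, the ClusterGrid returned by clustermap exposes the reordered row indices; a sketch on a hypothetical two-column subset of the data:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'key_words': ['be', 'represent', 'associate', 'induce'],
                   'prove': [0.58, 0.66, 0.88, 0.43],
                   'have': [0.49, 0.43, 0.23, 0.41]})
g = sns.clustermap(df.set_index('key_words'), cmap='Greys', figsize=(4, 4))
# rows that cluster together end up adjacent in this order
order = [df['key_words'][i] for i in g.dendrogram_row.reordered_ind]
```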

Creating interaction terms in python

I'm trying to create interaction terms in a dataset. Is there another (simpler) way of creating interaction terms of columns in a dataset? For example, creating interaction terms in combinations of columns 4:98 and 98:106. I tried looping over the columns using numpy arrays, but with the following code, the kernel keeps dying.
col1 = df.columns[4:98]    # 94 columns
col2 = df.columns[98:106]  # 8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the dataframe and df_np is df in NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to build the new column name from each product result. After that, use pandas prod to return the product of the two columns along axis 1 (across the columns).
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
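For the full 94×8 case, assigning 752 new columns one at a time is what strains the kernel; a single NumPy broadcasting step builds all products at once. A sketch using the same small setup (the 3:5 and 5:8 column ranges mirror the demo above, not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
df = pd.DataFrame(rng.random((5, 10)).round(2),
                  columns=[f'col_{c}' for c in range(10)])

a = df.iloc[:, 3:5].to_numpy()  # shape (n, 2)
b = df.iloc[:, 5:8].to_numpy()  # shape (n, 3)

# (n, 2, 1) * (n, 1, 3) broadcasts to (n, 2, 3); flatten to (n, 6)
prods = (a[:, :, None] * b[:, None, :]).reshape(len(df), -1)
names = [f'{c1}*{c2}' for c1 in df.columns[3:5] for c2 in df.columns[5:8]]
df_new = pd.DataFrame(prods, columns=names, index=df.index)
```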

Multiple time range selection in Pandas Python

I have time-series data in CSV format. I want to calculate the mean for several different time periods in a single run of the script, e.g. 01-05-2017 to 30-04-2018, 01-05-2018 to 30-04-2019, and so on. Below is sample data.
I have a script, but it handles only one given time period per run; I want to supply multiple time periods, as mentioned above.
from datetime import datetime
import pandas as pd
df = pd.read_csv(r'D:\Data\RT_2015_2020.csv', index_col=[0],parse_dates=[0])
z = df['2016-05-01' : '2017-04-30']
# Want to make like this way
#z = df[['2016-05-01' : '2017-04-30'], ['2017-05-01' : '2018-04-30']]
# It will calculate the mean for the selected time period
z.mean()
If you use the dates as the index, you can slice the data directly to the desired range.
import pandas as pd
import numpy as np
import io
data = '''
Date Mean
18-05-2016 0.31
07-06-2016 0.32
17-07-2016 0.50
15-09-2016 0.62
25-10-2016 0.63
04-11-2016 0.56
24-11-2016 0.56
14-12-2016 0.22
13-01-2017 0.22
23-01-2017 0.23
12-02-2017 0.21
22-02-2017 0.21
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
# the dates are dd-mm-yyyy, so parse them day-first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)
df.loc['2016'].head()
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
df.loc['2016-05-01':'2017-01-30']
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
2016-11-04 0.56
2016-11-24 0.56
2016-12-14 0.22
2017-01-13 0.22
2017-01-23 0.23
df.loc['2016-05-01':'2017-01-30'].mean()
Mean 0.417
dtype: float64
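To cover the multiple periods asked for in the question, you can loop over (start, end) pairs in one run; a small sketch on hypothetical data (the period labels are just illustrative):

```python
import io
import pandas as pd

data = '''Date,Mean
18-05-2016,0.31
07-06-2016,0.32
01-06-2017,0.50'''
df = pd.read_csv(io.StringIO(data))
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)

# one mean per (start, end) pair, all in a single run
periods = [('2016-05-01', '2017-04-30'), ('2017-05-01', '2018-04-30')]
means = {f'{start} to {end}': df.loc[start:end, 'Mean'].mean()
         for start, end in periods}
print(means)
```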

pd.to_csv set float_format with list

I need to write a df to a text file. To save some space on disk, I would like to set the number of decimal places for each column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work;
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)?
You can do it this way:
df.iloc[:,0:3] = df.iloc[:,0:3].round(3)
df['d'] = df['d'].round(10)
df.to_csv('path')
Thanks for all the answers, inspired by #Joe I came up with:
df = df.round({'a':3, 'b':3, 'c':3, 'd':10})
or more generically
df = df.round({c:r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question; round modifies the underlying dataframe, which may be undesirable.
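If mutating the dataframe is a concern, another option (a sketch, assuming columns named a–d as in the snippets above) is to pre-format each column as strings and write those instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(3, 4)), columns=list('abcd'))

# one printf-style format per column; df itself is left untouched
fmts = {'a': '%.3f', 'b': '%.3f', 'c': '%.3f', 'd': '%.10f'}
formatted = df.apply(lambda col: col.map(lambda v: fmts[col.name] % v))
csv_text = formatted.to_csv(index=False)
print(csv_text.splitlines()[0])  # a,b,c,d
```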
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap, which applies a function to every value across all rows and columns:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84
