Calculate gap between two datasets (pandas, matplotlib, fill_between already used) - python

I'd like to ask for suggestions on how to calculate the length of the gap between two datasets plotted in matplotlib from a pandas dataframe. Ideally, I would like to have these gap values written in the plot and also, if possible, included in the dataframe.
Here is my simplified example of dataframe:
import pandas as pd
d = {'Mean-1': [0.195842, 0.295069, 0.321345, 0.773725], 'SEM-1': [0.001216, 0.002687, 0.005267, 0.029974], 'Mean-2': [0.143103, 0.250505, 0.305767, 0.960804],'SEM-2': [0.000959, 0.001368, 0.003722, 0.150025], 'Atom Number': [1, 3, 5, 7]}
df=pd.DataFrame(d)
df
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number
0 0.195842 0.001216 0.143103 0.000959 1
1 0.295069 0.002687 0.250505 0.001368 3
2 0.321345 0.005267 0.305767 0.003722 5
3 0.773725 0.029974 0.960804 0.150025 7
Then I made a plot with two lines representing Mean-1 and Mean-2, and a shaded area around each line representing the standard error of the mean. This is done for the selected atom numbers.
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'])
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
plt.xticks(x)
What I would like to do next is calculate the gap for each residue. The gap is the white space only, i.e. the space where neither the lines nor the shaded areas (SEMs) overlap.
I would also like to know whether I can somehow print the gap values in the plot and save them into a column. Thank you for any suggestions.

It's not a compact solution, but you could try something like this (check the order of things). First calculate all the positions (y_i and the upper and lower limits).
import numpy as np
df['y1_upper'] = y_1+error_1
df['y1_lower'] = y_1-error_1
df['y2_upper'] = y_2+error_2
df['y2_lower'] = y_2-error_2
which gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower
0 0.144319 0.141887
1 0.253192 0.247818
2 0.311034 0.300500
3 0.990778 0.930830
The distances (gaps) are calculated differently depending on whether y_1 is above y_2 or vice versa, so put conditions on the upper and lower limits and take the element-wise difference for each case. (np.linalg.norm is not suitable here: applied to a whole column it returns a single number rather than one distance per row.)
conditions = [
(df['y1_lower'] >= df['y2_upper']),
(df['y1_lower'] < df['y2_upper'])]
choices = [df['y1_lower']-df['y2_upper'], df['y2_lower']-df['y1_upper']]
df['dist'] = np.select(conditions, choices)
This gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower dist
0 0.144319 0.141887 0.050307
1 0.253192 0.247818 0.039190
2 0.311034 0.300500 0.005044
3 0.990778 0.930830 0.127131
As I said, check the order, but this is a possible solution.
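The same idea can be written without branching on which line is on top: the gap is the bottom of the higher band minus the top of the lower band. The sketch below is only an illustration under two assumptions that are not in the original post: the second band is built from SEM-2 (the code as posted reuses SEM-1 for both bands), and overlapping bands should be reported as a gap of zero.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Mean-1": [0.195842, 0.295069, 0.321345, 0.773725],
    "SEM-1": [0.001216, 0.002687, 0.005267, 0.029974],
    "Mean-2": [0.143103, 0.250505, 0.305767, 0.960804],
    "SEM-2": [0.000959, 0.001368, 0.003722, 0.150025],
    "Atom Number": [1, 3, 5, 7],
})

# Band edges. Assumption: the second band uses SEM-2, not SEM-1.
lower_1, upper_1 = df["Mean-1"] - df["SEM-1"], df["Mean-1"] + df["SEM-1"]
lower_2, upper_2 = df["Mean-2"] - df["SEM-2"], df["Mean-2"] + df["SEM-2"]

# Bottom of the higher band minus top of the lower band, row by row.
# A negative value means the bands overlap, so it is clipped to zero.
raw_gap = np.maximum(lower_1, lower_2) - np.minimum(upper_1, upper_2)
df["gap"] = raw_gap.clip(lower=0)

print(df[["Atom Number", "gap"]])
```

Because the result is an ordinary column, the values can be passed to ax.annotate or ax.text afterwards to write them into the plot.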

IIUC, you want something like this:
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'], figsize=(15,8))
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
ax.fill_between(df['Atom Number'], y_1+error_1, y_2-error_2, alpha=.2, edgecolor='k', facecolor='blue')
for i in range(len(x)):
    gap = y_1[i]+error_1[i] - y_2[i]-error_2[i]
    ylabel = min(y_1[i], y_2[i]) + abs(gap) / 2
    _ = ax.annotate(f'{gap:0.4f}', xy=(x[i],ylabel), xytext=(x[i]-.14,y_1[i]+gap/abs(gap)*.2), arrowprops=dict(arrowstyle="-"))
plt.xticks(x);
Output:

Related

Determine number of consecutive identical points in a grid

I have a dataframe, and the grid size is 12*8.
I want to count the consecutive red dots (in the vertical direction only) and store the count in a new column (col = consecutive red); for blue dots it will be zero.
for example
X y red/blue consecutive red
1 1 blue 0
1 3 red 3
1 4 red 3
1 2 blue 0
1 5 red 3
9 4 red 5
I already have the data for the first three columns.
from sklearn.neighbors import BallTree
red_points = df[df['red/blue'] == 'red']
blue_points = df[df['red/blue'] != 'red']
tree = BallTree(red_points[['x','y']], leaf_size=40, metric='minkowski')
distance, index = tree.query(df[['x','y']], k=2)
I am not aware of a ready-made algorithm for this (there may very well be one), but writing it isn't that hard. I work with numpy because I'm used to it and because it is easy to accelerate with CUDA and to port to other Python data science tools.
The data (0=blue, 1=red):
import numpy as np
import pandas as pd
# Generating dummy data for testing
ROWS=10
COLS=20
X = np.random.randint(2, size=(ROWS, COLS))
# Visualizing
df = pd.DataFrame(data=X)
bg='background-color: '
df.style.apply(lambda x: [bg+'red' if v>=1 else bg+'blue' for v in x])
The algorithm:
result = np.zeros((ROWS, COLS), dtype=int)  # np.int was removed in NumPy 1.24
for y, x in np.ndindex(X.shape):
    if X[y, x] == 0:
        continue
    cons = 1  # consecutive reds in the vertical direction, including the current cell
    # Going backward while we can
    prev = y - 1
    while prev >= 0:
        if X[prev, x] == 0:
            break
        cons += 1
        prev -= 1
    # Going forward while we can
    nxt = y + 1
    while nxt <= ROWS - 1:
        if X[nxt, x] == 0:
            break
        cons += 1
        nxt += 1
    result[y, x] = cons
df2 = pd.DataFrame(data=result)
df2.style.apply(lambda x: [bg+'red' if v>=1 else bg+'blue' for v in x])
And the result:
Please note that in numpy the first coordinate represents the row index (y in your case), and the second the column (x in your case), you can use transpose on your data if you want to swap to x,y.
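For data that is already in the long format shown in the question (one row per point), the same count can be sketched with pandas alone by labelling vertical runs and measuring their length. This is only an illustration: the helper name consecutive_red and the column names X, y and red/blue are assumptions taken from the example table, and a "run" is assumed to mean red points in the same column X whose y values differ by exactly 1.

```python
import pandas as pd

def consecutive_red(df, x="X", y="y", color="red/blue"):
    """Hypothetical helper: length of the vertical red run each point belongs to."""
    out = df.sort_values([x, y]).copy()
    is_red = out[color].eq("red")
    # A new run starts when the column changes, the previous point is not
    # directly below the current one, or the colour changes.
    new_run = (
        out[x].ne(out[x].shift())
        | out[y].diff().ne(1)
        | is_red.ne(is_red.shift())
    )
    run_id = new_run.cumsum()
    # Every run has a single colour, so summing the red flags gives the
    # run length for red points and 0 for blue points.
    out["consecutive red"] = is_red.astype(int).groupby(run_id).transform("sum")
    return out.sort_index()  # restore the original row order

points = pd.DataFrame({
    "X": [1, 1, 1, 1, 1, 9],
    "y": [1, 3, 4, 2, 5, 4],
    "red/blue": ["blue", "red", "red", "blue", "red", "red"],
})
labelled = consecutive_red(points)
print(labelled)
```

With only these six rows, the point at X=9 forms a run of length 1; in the full grid from the question it would pick up its neighbours.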

Bar chart with customised width in Python

I have this dataframe df which contains -
Name Team Name Category Challenge Points Time
A B 1 1ABC 50 2019-11-04 07:37:02
D B 2 2ACE 150 2019-11-04 09:57:02
X P 4 4PQR 500 2019-11-05 08:45:02
A B 3 3PQR 10 2019-11-04 10:25:20
N P 4 4ABC 120 2019-11-05 08:35:00
C G 1 1ABC 50 2019-11-04 07:37:02
D B 4 4RST 200 2019-11-04 10:57:02
I have this ambitious plan of visualizing this dataset as a customised bar chart where each team has a building (bar) made of blocks of varying width (depending on the points associated with that challenge), and the vertical order of the blocks depends on the time (the first one goes at the bottom). In short, the plot for the above data should roughly look like this -
The different colours represent the different categories here. I know how to group the data by teams and then plot each team's number of attempts by -
df.groupby(['Team Name'])['Challenge'].count().plot.bar()
but beyond that, I'm clueless as to how to change the bar widths. Can someone help with this?
Alternatively, if someone has a better idea of how to visualise it using any of the conventional plots, I'd love to hear your opinions too.
Thanks!
Does this look like what you want?
You can accomplish this by manually plotting the 'blocks' via matplotlib.patches, it just requires some extra manipulation to do so algorithmically. Here is a complete example using the data supplied in the question
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import numpy as np
import pandas as pd
t20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120)]
for i in range(len(t20)):
    r, g, b = t20[i]
    t20[i] = (r / 255., g / 255., b / 255.)
fig, ax = plt.subplots(1)
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values('Time')
cat = df['Category'].unique()
cidx = dict(zip(cat, range(len(cat))))
mw = max(df['Points'])
names = list(df['Team Name'].unique())
nt = len(names)
h = 0.5
hs = [0]*nt  # current stack height for each team
for ii in range(len(df.index)):
    w = float(df['Points'].iloc[ii])/mw
    idx = names.index(df['Team Name'].iloc[ii])
    r = Rectangle((idx - w/2.0, hs[idx]), w, h, color=t20[cidx[df['Category'].iloc[ii]]])
    hs[idx] += h
    ax.add_patch(r)
plt.xlim([-0.5, len(names)-0.5])
plt.ylim([0, max(hs)+3])
plt.xticks(range(len(names)), names)
plt.show()
I used the first 4 colors in the tableau 20 palette in case you were interested.
Edit
You can add a legend with the line
plt.legend(handles=[Patch(facecolor=t20[ii], label=cat[ii]) for ii in range(len(t20))])
as long as the additional import of Patch from matplotlib.patches is included, i.e.
from matplotlib.patches import Rectangle, Patch
And the output will be
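The per-row bookkeeping in the loop (which team, how wide, how high the stack already is) can also be computed up front with pandas, which makes the geometry easy to inspect before drawing anything. The following is a rough sketch, not the answer's method: the helper name block_geometry and the output columns left, bottom and width are made up for illustration, and ties in Time are assumed to keep their original row order.

```python
import pandas as pd

def block_geometry(df, height=0.5):
    """Hypothetical helper: one rectangle (left, bottom, width) per row."""
    out = df.copy()
    out["Time"] = pd.to_datetime(out["Time"])
    out = out.sort_values("Time", kind="stable")  # earliest block goes at the bottom
    # Teams get x positions 0, 1, 2, ... in order of first appearance.
    position = {name: i for i, name in enumerate(out["Team Name"].unique())}
    out["width"] = out["Points"] / out["Points"].max()
    out["left"] = out["Team Name"].map(position) - out["width"] / 2
    # cumcount numbers the blocks within each team in time order.
    out["bottom"] = out.groupby("Team Name").cumcount() * height
    return out[["Team Name", "Challenge", "left", "bottom", "width"]]

df = pd.DataFrame({
    "Name": ["A", "D", "X", "A", "N", "C", "D"],
    "Team Name": ["B", "B", "P", "B", "P", "G", "B"],
    "Category": [1, 2, 4, 3, 4, 1, 4],
    "Challenge": ["1ABC", "2ACE", "4PQR", "3PQR", "4ABC", "1ABC", "4RST"],
    "Points": [50, 150, 500, 10, 120, 50, 200],
    "Time": ["2019-11-04 07:37:02", "2019-11-04 09:57:02", "2019-11-05 08:45:02",
             "2019-11-04 10:25:20", "2019-11-05 08:35:00", "2019-11-04 07:37:02",
             "2019-11-04 10:57:02"],
})
geo = block_geometry(df)
print(geo)
```

Each row of geo can then be turned into Rectangle((left, bottom), width, height) exactly as in the loop above.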

2D bin (x,y) and calculate mean of values (c) of 10 deepest data points (z)

For a data set consisting of:
coordinates x, y
depth z
a certain value c
I would like to do the following more efficiently:
bin the data set in 2D bins based on the coordinates (x, y)
take the 10 deepest data points (z) per bin
calculate the mean value of c of these 10 data points per bin
Finally show a 2d heatmap with the calculated mean values.
I have found a working solution, but this takes too long for small bins and/or large data sets.
Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt
n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that currently takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show result:
index = pd.MultiIndex.from_product([x_range, y_range],
names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
tot_df.update(df)
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
plt.show()
You can use np.searchsorted to bin the rows by x and y, and then use groupby to take the 10 deepest values and calculate the means. Because groupby maintains the row order within each group, you can sort the values before applying the bins. groupby also performs better without apply.
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
df = df.sort_values("z", ascending=False)
bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1
result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
Result
bin_x bin_y
0 0 0.369531
1 0.601803
2 0.554452
3 0.575464
4 0.455198
...
9 5 0.469838
6 0.420772
7 0.367549
8 0.379200
9 0.523083
Name: c, Length: 100, dtype: float64
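To get from this grouped result back to the 2D array that the heatmap needs, the MultiIndex can be unstacked into a grid. The sketch below is one possible way to do it, under the assumptions that "deepest" means largest z, that x and y lie in [0, 1], and that empty bins should show up as NaN; the names n_bins, edges and grid are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({"x": rng.random(n), "y": rng.random(n),
                   "z": rng.random(n), "c": rng.random(n)})

n_bins = 10
edges = np.linspace(0, 1, n_bins + 1)
for axis in ("x", "y"):
    # side="right" puts a value equal to an edge into the bin that starts there;
    # the clip keeps a value of exactly 1.0 inside the last bin.
    idx = np.searchsorted(edges, df[axis].to_numpy(), side="right") - 1
    df[f"bin_{axis}"] = np.clip(idx, 0, n_bins - 1)

# Sort once by depth, keep the 10 deepest rows per bin, then average c.
deepest = df.sort_values("z", ascending=False).groupby(["bin_x", "bin_y"]).head(10)
mean_c = deepest.groupby(["bin_x", "bin_y"])["c"].mean()

# Rows are y bins, columns are x bins; bins without data become NaN.
grid = mean_c.unstack("bin_x").reindex(index=range(n_bins), columns=range(n_bins))
print(grid.shape)
```

grid.to_numpy() can then be flipped and passed to plt.matshow in the same way as zval in the question.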

Analysing height difference from columns and selecting max difference in Python

I have a .csv file containing x y data from transects (.csv file here).
The file can contain a few dozen transects (example only 4).
I want to calculate the elevation change from each transect and then select the transect with the highest elevation change.
x y lines
0 3.444 1
0.009 3.445 1
0.180 3.449 1
0.027 3.449 1
...
0 2.115 2
0.008 2.115 2
0.017 2.115 2
0.027 2.116 2
I've tried to calculate the change with pandas.dataframe.diff but I'm unable to select the highest elevation change from this.
UPDATE: I found a way to calculate the height difference for 1 transect. The goal is now to loop this script through the different other transects and let it select the transect with the highest difference. Not sure how to create a loop from this...
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import savgol_filter, find_peaks, find_peaks_cwt
from pandas import read_csv
import csv
df = pd.read_csv('transect4.csv', delimiter=',', header=None, names=['x', 'y', 'lines'])
df_1 = df ['lines'] == 1
df1 = df[df_1]
plt.plot(df1['x'], df1['y'], label='Original Topography')
#apply a Savitzky-Golay filter
smooth = savgol_filter(df1.y.values, window_length = 351, polyorder = 5)
#find the maximums
peaks_idx_max, _ = find_peaks(smooth, prominence = 0.01)
#reciprocal, so mins will become max
smooth_rec = 1/smooth
#find the mins now
peaks_idx_mins, _ = find_peaks(smooth_rec, prominence = 0.01)
plt.xlabel('Distance')
plt.ylabel('Height')
plt.plot(df1['x'], smooth, label='Smoothed Topography')
#plot them
plt.scatter(df1.x.values[peaks_idx_max], smooth[peaks_idx_max], s = 55,
c = 'green', label = 'Local Max Cusp')
plt.scatter(df1.x.values[peaks_idx_mins], smooth[peaks_idx_mins], s = 55,
c = 'black', label = 'Local Min Cusp')
plt.legend(loc='upper left')
plt.show()
#Export to csv
df['Cusp_max']=False
df['Cusp_min']=False
df.loc[df1.x[peaks_idx_max].index, 'Cusp_max']=True
df.loc[df1.x[peaks_idx_mins].index, 'Cusp_min']=True
data=df[df['Cusp_max'] | df['Cusp_min']]
data.to_csv(r'Cusp_total.csv')
#Calculate height difference
my_data=pd.read_csv('Cusp_total.csv', delimiter=',', header=0, names=['ID', 'x', 'y', 'lines'])
df_1 = df ['lines'] == 1
df1 = df[df_1]
df1_diff=pd.DataFrame(my_data)
df1_diff['Diff_Cusps']=df1_diff['y'].diff(-1)
#Only use positive numbers for average
df1_pos = df1_diff[df1_diff['Diff_Cusps'] > 0]
print("Average Height Difference: ", (df1_pos['Diff_Cusps'].mean()), "m")
Ideally, the script would select the transect with the highest elevation change from an unknown number of transects in the .csv file, which will then be exported to a new .csv file.
You need to group by the column lines.
I'm not sure this is what you mean by elevation change, but this gives the difference in elevation (max(y) - min(y)) for each group, where a group consists of all rows sharing the same value of 'lines'. This should help you with what is missing in your logic (sorry, I can't put more time in).
frame = pd.read_csv('transect4.csv', header=None, names=['x', 'y', 'lines'])
groups = frame.groupby('lines')
groups['y'].max() - groups['y'].min()
# Gives the elevation range (max - min) of each group.
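Building on that, picking the transect with the largest range and exporting it could look like the sketch below. It assumes "elevation change" means max(y) - min(y) per transect, uses the few sample rows from the question instead of the real file, and the output file name in the comment is hypothetical.

```python
import pandas as pd

# Sample rows from the question, standing in for transect4.csv.
frame = pd.DataFrame({
    "x": [0, 0.009, 0.180, 0.027, 0, 0.008, 0.017, 0.027],
    "y": [3.444, 3.445, 3.449, 3.449, 2.115, 2.115, 2.115, 2.116],
    "lines": [1, 1, 1, 1, 2, 2, 2, 2],
})

groups = frame.groupby("lines")["y"]
elevation_change = groups.max() - groups.min()  # one value per transect

best_line = elevation_change.idxmax()           # label of the transect with the largest range
best_transect = frame[frame["lines"] == best_line]

print(elevation_change)
print("Transect with the highest elevation change:", best_line)
# best_transect.to_csv("best_transect.csv", index=False)  # hypothetical file name
```

This works for any number of transects, because groupby discovers the distinct values of lines on its own.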

How do I modify this function to accept multiple Dataframes?

I wrote this function and I would like it to accept more than one DF so that the final plot has multiple plotted lines for the predictions and the coef_DF gets completed with the rest of the coefficients.
The function extracts the needed features and target from a much larger dataset to make predictions using linear regression. It then builds the model, plots the line over the dataset and returns a df with all the coefficients.
(This is just an exercise.)
def prep_model_and_predict(feature, target, dataset, degree):
    # part 1: make a df with the relevant format and features
    # degree >= 1
    poly_df = pd.DataFrame()
    poly_df[str(target)] = dataset[str(target)]
    poly_df['power_1'] = dataset[str(feature)]
    # check if degree > 1
    if degree > 1:
        for power in range(2, degree+1):  # loop over remaining degrees
            name = 'power_' + str(power)
            poly_df[name] = poly_df['power_1'] ** power
    # part 2: make model and predictions
    features = list(poly_df.columns[1:])
    X = poly_df[features]
    y = poly_df[str(target)]
    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)
    # part 3: put weights in a nice df
    # (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
    rows = [{'Name': 'Intercept', 'Value': model.intercept_},
            {'Name': 'Power_1', 'Value': model.coef_[0]}]
    for power in range(2, degree+1):
        rows.append({'Name': 'Power_' + str(power),
                     'Value': '{:.3e}'.format(model.coef_[power-1])})
    coef_df = pd.DataFrame(rows)
    # part 4: plot it
    fig, ax = plt.subplots()
    ax.plot(poly_df['power_1'], poly_df[str(target)], '.',
            poly_df['power_1'], predictions, '-')
    ax.set_xlabel('Square footage, living area')
    ax.set_ylabel('Price per Sqft')
    ax.ticklabel_format(axis='y', style='sci', scilimits=(-2,2))
    return coef_df, ax
and this is the result:
Name Value
0 Intercept 506738
1 Power_1 2.71336e-77
2 Power_2 7.335e-39
3 Power_3 -1.850e-44
4 Power_4 8.437e-50
5 Power_5 0.000e+00
6 Power_6 0.000e+00
7 Power_7 3.645e-55
8 Power_8 1.504e-51
9 Power_9 5.760e-48
10 Power_10 1.958e-44
11 Power_11 5.394e-41
12 Power_12 9.404e-38
13 Power_13 -3.635e-41
14 Power_14 4.655e-45
15 Power_15 -1.972e-49
much appreciated!
I am not sure exactly what you are asking. Next time, please try to ask a question that other people here on SO can easily reproduce and run.
I have tried to answer your questions. Correct me if I have misunderstood.
Pass an arbitrary number of DataFrames to your function and plot them:
I have created three random dataframes to use:
df1 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
df3 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
The function that plots them:
def plot_me(*dfs):
    plt.figure(figsize=(13,9))
    # label each line with the position of its DataFrame in the argument list
    for lab_ind, d in enumerate(dfs):
        plt.plot(d['A'], d['B'], label=lab_ind)
    plt.legend()
    plt.show()
The result plot you get:
Put the results of your model into a DataFrame
Regarding your second question, I am not going to concentrate too much on your exact details - for example the name of the columns of your dataframe, etc.
For this particular example I have generated two random arrays:
X = np.random.randint(0,50 ,size=(50, 2))
y = np.random.randint(0,2 ,size=(50, 1))
Then fit a LinearRegression model on this data.
model=LinearRegression().fit(X,y)
predictions=model.predict(X)
And then add it to a DataFrame:
res_df = pd.DataFrame(predictions,columns = ['Value'])
And if you print res_df
Value
0 0.420395
1 0.459389
2 0.369648
3 0.416058
4 0.644088
5 0.362072
6 0.363157
7 0.468943
. .
. .
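To tie the two parts together for the original question, the coefficients of several datasets can be collected side by side, one column per DataFrame. The sketch below is only an illustration and swaps scikit-learn's LinearRegression for numpy's polynomial least-squares fit to stay dependency-free; the function name fit_many, the keyword-argument labels and the sqft/price column names are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from numpy.polynomial import polynomial as P

def fit_many(feature, target, degree, **datasets):
    """Hypothetical helper: fit one polynomial per DataFrame, return all coefficients."""
    names = ["Intercept"] + [f"Power_{p}" for p in range(1, degree + 1)]
    columns = {}
    for label, data in datasets.items():
        x = data[feature].to_numpy(dtype=float)
        y = data[target].to_numpy(dtype=float)
        # polyfit returns coefficients ordered from the constant term upwards.
        columns[label] = P.polyfit(x, y, degree)
    return pd.DataFrame(columns, index=names)

# Two small made-up datasets with known linear relationships.
x = np.arange(10.0)
set_a = pd.DataFrame({"sqft": x, "price": 1 + 2 * x})
set_b = pd.DataFrame({"sqft": x, "price": 3 - x})

coef_df = fit_many("sqft", "price", 1, a=set_a, b=set_b)
print(coef_df)
```

Inside the same loop, each fitted curve could also be drawn on a shared Axes so that the final plot has one prediction line per dataset.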
