How to smooth a pandas / matplotlib lineplot? - python

I have the following df, weekly spend in a number of shops:
shop1 shop2 shop3 shop4 shop5 shop6 shop7 \
date_week
2 4328.85 5058.17 3028.68 2513.28 4204.10 1898.26 2209.75
3 5472.00 5085.59 3874.51 1951.60 2984.71 1416.40 1199.42
4 4665.53 4264.05 2781.70 2958.25 4593.46 2365.88 2079.73
5 5769.36 3460.79 3072.47 1866.19 3803.12 2166.84 1716.71
6 6267.00 4033.58 4053.70 2215.04 3991.31 2382.02 1974.92
7 5436.83 4402.83 3225.98 1761.87 4202.22 2430.71 3091.33
8 4850.43 4900.68 3176.00 3280.95 3483.53 4115.09 2594.01
9 6782.88 3800.03 3865.65 2221.43 4116.28 2638.28 2321.55
10 6248.18 4096.60 5186.52 3224.96 3614.24 2541.00 2708.36
11 4505.18 2889.33 2937.74 2418.34 5565.57 1570.55 1371.54
12 3115.26 1216.82 1759.49 2559.81 1403.61 1550.77 478.34
13 4561.82 827.16 4661.51 3197.90 1515.63 1688.57 247.25
shop8 shop9
date_week
2 3578.81 3134.39
3 4625.10 2676.20
4 3417.16 3870.00
5 3980.78 3439.60
6 3899.42 4192.41
7 4190.60 3989.00
8 4786.40 3484.51
9 6433.02 3474.66
10 4414.19 3809.20
11 3590.10 3414.50
12 4297.57 2094.00
13 3963.27 871.25
If I plot these in a line plot or "spaghetti plot" It works fine.
The goal is the look at trend in weekly sales over the last three months in 9 stores.
But looks a bit messy:
newgraph.plot()
I had a look at similar questions such as this one which uses df.interpolate() but it looks like I need to have missing values in there first. this answer seems to require a time series.
Is there another method to smoothen out the lines?
It doesn't matter if the values are not exactly accurate anymore, some interpolation is fine. All I am interested in is the trend over the last number of weeks. I have also tried logy=True in the plot() method to calm the lines a bit, but it didn't help.
My df, for pd.DataFrame.fromt_dict():
{'shop1': {2: 4328.849999999999,
3: 5472.0,
4: 4665.530000000001,
5: 5769.36,
6: 6267.0,
7: 5436.83,
8: 4850.43,
9: 6782.879999999999,
10: 6248.18,
11: 4505.18,
12: 3115.26,
13: 4561.82},
'shop2': {2: 5058.169999999993,
3: 5085.589999999996,
4: 4264.049999999997,
5: 3460.7899999999977,
6: 4033.579999999998,
7: 4402.829999999999,
8: 4900.679999999997,
9: 3800.0299999999997,
10: 4096.5999999999985,
11: 2889.3300000000004,
12: 1216.8200000000002,
13: 827.16},
'shop3': {2: 3028.679999999997,
3: 3874.5099999999984,
4: 2781.6999999999994,
5: 3072.4699999999984,
6: 4053.6999999999966,
7: 3225.9799999999987,
8: 3175.9999999999973,
9: 3865.6499999999974,
10: 5186.519999999996,
11: 2937.74,
12: 1759.49,
13: 4661.509999999998},
'shop4': {2: 2513.2799999999997,
3: 1951.6000000000001,
4: 2958.25,
5: 1866.1900000000003,
6: 2215.04,
7: 1761.8700000000001,
8: 3280.9499999999994,
9: 2221.43,
10: 3224.9600000000005,
11: 2418.3399999999997,
12: 2559.8099999999995,
13: 3197.9},
'shop5': {2: 4204.0999999999985,
3: 2984.71,
4: 4593.459999999999,
5: 3803.12,
6: 3991.31,
7: 4202.219999999999,
8: 3483.529999999999,
9: 4116.279999999999,
10: 3614.24,
11: 5565.569999999997,
12: 1403.6100000000001,
13: 1515.63},
'shop6': {2: 1898.260000000001,
3: 1416.4000000000005,
4: 2365.8799999999997,
5: 2166.84,
6: 2382.019999999999,
7: 2430.71,
8: 4115.0899999999965,
9: 2638.2800000000007,
10: 2541.0,
11: 1570.5500000000004,
12: 1550.7700000000002,
13: 1688.5700000000004},
'shop7': {2: 2209.75,
3: 1199.42,
4: 2079.7300000000005,
5: 1716.7100000000005,
6: 1974.9200000000005,
7: 3091.329999999999,
8: 2594.0099999999993,
9: 2321.5499999999997,
10: 2708.3599999999983,
11: 1371.5400000000004,
12: 478.34,
13: 247.25000000000003},
'shop8': {2: 3578.8100000000004,
3: 4625.1,
4: 3417.1599999999994,
5: 3980.7799999999997,
6: 3899.4200000000005,
7: 4190.600000000001,
8: 4786.4,
9: 6433.019999999998,
10: 4414.1900000000005,
11: 3590.1,
12: 4297.57,
13: 3963.27},
'shop9': {2: 3134.3900000000003,
3: 2676.2,
4: 3870.0,
5: 3439.6,
6: 4192.41,
7: 3989.0,
8: 3484.51,
9: 3474.66,
10: 3809.2,
11: 3414.5,
12: 2094.0,
13: 871.25}}

You could show the trend by plotting a regression line for the last few weeks, perhaps separately from the actual data, as the plot is already so crowded. I would use seaborn, because it has the convenient regplot() function:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
df.plot(figsize=[12, 10], style='--')
plt.xlim(2, 18)
last4 = df[len(df)-4:]
plt.gca().set_prop_cycle(None)
for shop in df.columns:
sns.regplot(last4.index + 4, shop, data=last4, ci=None, scatter=False)
plt.ylabel(None)
plt.xticks(list(df.index)+[14, 17], labels=list(df.index)+[10, 13]);

Related

how to add text on each rectangle in collection using matplotlib?

I am using matplotlib for plotting and convenient visualization of some graphs in xy coordinates.
I need to highlight some regions - and I use rectangles for this.
But I am interested to add some text upon each rectangle - to be able to distinguish those regions. How to do it using patches because I have a lot of objects in a plot?
Here is the code I use to plot rectangles:
# sample data for rectangles visualization
windows_df = pd.DataFrame( {'window_index_num': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'left_pulse_num': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'right_pulse_num': {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 7, 6: 8, 7: 9, 8: 10, 9: 11}, 'idx_of_left_pulse': {0: 0, 1: 4036, 2: 4080, 3: 4107, 4: 4368, 5: 4491, 6: 4529, 7: 4624, 8: 4626, 9: 4639}, 'idx_of_right_pulse': {0: 4080, 1: 4107, 2: 4368, 3: 4491, 4: 4529, 5: 4624, 6: 4626, 7: 4639, 8: 4679, 9: 4781}, 'left_pulse_pos_in_E': {0: 10.002042118364418, 1: 40.29395464818188, 2: 41.19356816747343, 3: 41.76060061888303, 4: 47.90221207147802, 5: 51.27679395217831, 6: 52.39165780468267, 7: 55.37561818764979, 8: 55.47294132608167, 9: 55.99635666692289}, 'right_pulse_pos_in_E': {0: 41.19356816747343, 1: 41.76060061888303, 2: 47.90221207147802, 3: 51.27679395217831, 4: 52.39165780468267, 5: 55.37561818764979, 6: 55.47294132608167, 7: 55.99635666692289, 8: 57.33777021469516, 9: 60.984834434908144}, 'idx_window_left_border': {0: 0, 1: 3990, 2: 4058, 3: 4093, 4: 4237, 5: 4429, 6: 4510, 7: 4576, 8: 4625, 9: 4632}, 'idx_window_right_border': {0: 4094, 1: 4238, 2: 4430, 3: 4510, 4: 4577, 5: 4625, 6: 4633, 7: 4659, 8: 4730, 9: 4792}, 'left_win_pos_in_E': {0: 10.002042118364418, 1: 39.38459790393702, 2: 40.74003692229216, 3: 41.46513255508269, 4: 44.66179219947279, 5: 49.53272998148, 6: 51.82972979173252, 7: 53.82159300113625, 8: 55.40803086073492, 9: 55.76645477820397}, 'right_win_pos_in_E': {0: 41.48613320837913, 1: 44.6852679849016, 2: 49.56014983071213, 3: 51.82972979173252, 4: 53.85265044341121, 5: 55.40803086073492, 6: 55.79921126600202, 7: 56.66110947958804, 8: 59.119140585251095, 9: 61.39880967219205}, 'window_width': {0: 4095, 1: 249, 2: 373, 3: 418, 4: 341, 5: 197, 6: 124, 7: 84, 8: 106, 9: 161}, 'window_width_in_E': {0: 31.48409109001471, 1: 5.300670080964579, 2: 8.820112908419965, 3: 10.364597236649828, 4: 9.190858243938415, 5: 5.875300879254915, 6: 3.9694814742695, 7: 2.8395164784517917, 8: 3.7111097245161773, 9: 5.632354893988079}, 'sum_pulses_duration_in_E': {0: 0.5157099691135514, 1: 0.5408987779694527, 2: 0.6869248977656355, 3: 0.7304908951030242, 4: 0.7269657511683718, 5: 0.537271616198268, 6: 0.7609034761658222, 7: 0.6178183490930067, 8: 0.8269277926972265, 9: 0.5591109437337494}, 'sum_pulse_sq': {0: 3.7944375922206044, 1: 3.8756992116858715, 2: 2.9661915477796663, 3: 3.070559830941317, 4: 3.0597037730539385, 5: 10.2020204659669, 6: 45.77535573608872, 7: 45.87630607524008, 8: 39.10335270063814, 9: 3.437205923490125}, 'pulse_to_window_rate': {0: 0.01638001769335214, 1: 0.10204347180781788, 2: 0.07788164447530765, 3: 0.0704794290047244, 4: 0.0790966122938326, 5: 0.09144580460471718, 6: 0.1916883807363909, 7: 0.2175787158769594, 8: 0.22282493757444324, 9: 0.09926770493999569}, 'max_height_in_window': {0: 20.815950580921104, 1: 20.815950580921104, 2: 5.324888970962656, 3: 5.324888970962656, 4: 5.14075603114903, 5: 86.81228155905252, 6: 110.06755904473022, 7: 110.06755904473022, 8: 110.06755904473022, 9: 14.735092268739246}, 'min_height_in_window': {0: -0.011928180619527797, 1: 1.6172637244080776, 2: 1.6172637244080776, 3: 0.8658702248969847, 4: 0.8658702248969847, 5: 0.8658702248969847, 6: 1.8476229914953515, 7: 2.918666252051556, 8: 3.2397786967451707, 9: 2.4893555139463266}, 'windows_sq': {0: 655.3712842149647, 1: 110.33848645112575, 2: 46.96612194869083, 3: 55.19032951390669, 4: 47.24795994896218, 5: 510.0482741740266, 6: 436.911136546121, 7: 312.538647650477, 8: 408.4727887246568, 9: 82.9932690531994}} )
fig_w, axs_w = plt.subplots()
#theoretical cross-section
#axs_w.plot(df_wo_NANS['E'], df_wo_NANS['theo_cs'], marker = "o", markersize = 1, linewidth = 1.0, alpha=0.6, color = 'green', label = 'Theo Cross Section')
axs_w.grid(color = 'grey', linestyle = '--', linewidth = 0.2)
#windows rectangular
from matplotlib.collections import PatchCollection
from matplotlib.patches import Rectangle
boxes = []
for index,row in windows_df.iterrows():
current_rect_left_corner = (row['left_win_pos_in_E'], row['min_height_in_window'])
current_w = row['window_width_in_E']
current_h = row['max_height_in_window']-row['min_height_in_window']
boxes.append(Rectangle(current_rect_left_corner, current_w, current_h))
left = row['left_win_pos_in_E']
right = row['right_win_pos_in_E']
bottom = row['min_height_in_window']
top = row['max_height_in_window']
#mark of the start of the current window
axs_w.text(
left, #left corner, #0.5*(left+right), #middle of the rectangle
top, #top
str(index),
horizontalalignment='center',
verticalalignment='center',
fontsize=5
)
#mark of the end of the current window
axs_w.text(
right, #right corner, #0.5*(left+right), #middle of the rectangle
top+0.5*bottom, #top
str(index)+'e',
horizontalalignment='center',
verticalalignment='center',
fontsize=5
)
pc = PatchCollection(boxes, facecolor='y', alpha=0.2, edgecolor='black')
axs_w.add_collection(pc)
Added text marks using cycle but is it possible to do it using patch and collections to make more efficient code?

How to export to excel with pandas dataframe with multi column

I'm stuck at exporting a multi index dataframe to excel, in the matter what I'm looking for.
This is what I'm looking for in excel.
I know I have to add an extra Index Parameter on the left for the row of SRR (%) and Traction (-), but how?
My code so far.
import pandas as pd
import matplotlib.pyplot as plt
data = {'Step 1': {'Step Typ': 'Traction', 'SRR (%)': {1: 8.384, 2: 9.815, 3: 7.531, 4: 10.209, 5: 7.989, 6: 7.331, 7: 5.008, 8: 2.716, 9: 9.6, 10: 7.911}, 'Traction (-)': {1: 5.602, 2: 6.04, 3: 2.631, 4: 2.952, 5: 8.162, 6: 9.312, 7: 4.994, 8: 2.959, 9: 10.075, 10: 5.498}, 'Temperature': 30, 'Load': 40}, 'Step 3': {'Step Typ': 'Traction', 'SRR (%)': {1: 2.909, 2: 5.552, 3: 5.656, 4: 9.043, 5: 3.424, 6: 7.382, 7: 3.916, 8: 2.665, 9: 4.832, 10: 3.993}, 'Traction (-)': {1: 9.158, 2: 6.721, 3: 7.787, 4: 7.491, 5: 8.267, 6: 2.985, 7: 5.882, 8: 3.591, 9: 6.334, 10: 10.43}, 'Temperature': 80, 'Load': 40}, 'Step 5': {'Step Typ': 'Traction', 'SRR (%)': {1: 4.765, 2: 9.293, 3: 7.608, 4: 7.371, 5: 4.87, 6: 4.832, 7: 6.244, 8: 6.488, 9: 5.04, 10: 2.962}, 'Traction (-)': {1: 6.656, 2: 7.872, 3: 8.799, 4: 7.9, 5: 4.22, 6: 6.288, 7: 7.439, 8: 7.77, 9: 5.977, 10: 9.395}, 'Temperature': 30, 'Load': 70}, 'Step 7': {'Step Typ': 'Traction', 'SRR (%)': {1: 9.46, 2: 2.83, 3: 3.249, 4: 9.273, 5: 8.792, 6: 9.673, 7: 6.784, 8: 3.838, 9: 8.779, 10: 4.82}, 'Traction (-)': {1: 5.245, 2: 8.491, 3: 10.088, 4: 9.988, 5: 4.886, 6: 4.168, 7: 8.628, 8: 5.038, 9: 7.712, 10: 3.961}, 'Temperature': 80, 'Load': 70} }
df = pd.DataFrame(data)
items = list()
series = list()
for item, d in data.items():
items.append(item)
series.append(pd.DataFrame.from_dict(d))
df = pd.concat(series, keys=items)
df.set_index(['Step Typ', 'Load', 'Temperature'], inplace=True).T.to_excel('testfile.xlsx')
The picture below, shows df.set_index(['Step Typ', 'Load', 'Temperature'], inplace=True).T as a dataframe: (somehow close, but not exactly what I'm looking for):
Edit 1:
Found a good solution, not the exact one I was looking for, but it's still worth using it.
df.reset_index().drop(["level_0","level_1"], axis=1).pivot(columns=["Step Typ", "Load", "Temperature"], values=["SRR (%)", "Traction (-)"]).apply(lambda x: pd.Series(x.dropna().values)).to_excel("solution.xlsx")
Can you explain clearely and show the output you are looking for?
To export a table to excel use df.to_excel('path', index=True/False)
where:
index=True or False - to insert or not the index column into the file
Found a good solution, not the exact one I was looking for, but it's still worth using it.
df.reset_index().drop(["level_0","level_1"], axis=1).pivot(columns=["Step Typ", "Load", "Temperature"], values=["SRR (%)", "Traction (-)"]).apply(lambda x: pd.Series(x.dropna().values)).to_excel("solution.xlsx")

Pandas Resampling with delta time after specific starting time

After reading a CSV into a data frame, I am trying to resample my "Value" column to 5 seconds, starting from the first rounded second of the time value. I would like to have the mean for all the values within the next 5 seconds, starting from 46:19.6 (format %M:%S:%f). So the code would give me the mean for 46:20, then 46:25, and so on...Does anybody know how to do this? Thank you!
input:
df = pd.DataFrame({'Time': {0: '46:19.6',
1: '46:20.7',
2: '46:21.8',
3: '46:22.9',
4: '46:24.0',
5: '46:25.1',
6: '46:26.2',
7: '46:27.6',
8: '46:28.7',
9: '46:29.8',
10: '46:30.9',
11: '46:32.0',
12: '46:33.2',
13: '46:34.3',
14: '46:35.3',
15: '46:36.5',
16: '46:38.8',
17: '46:40.0'},
'Value': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 17,
16: 19,
17: 20}})
Assuming your Time field is in datetime64[ns] format, you simply can use pd.Grouper and pass freq=5S:
# next line of code is optional to transform to datetime format if the `Time` field is an `object` i.e. string.
# df['Time'] = pd.to_datetime('00:'+df['Time'])
df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index()
#Depending on what you want to do, you can also replace the above line of code with one of two below:
#df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index().iloc[1:]
#df1 = df.groupby(pd.Grouper(key='Time', freq='5S', base=4.6))['Value'].mean().reset_index()
#In the above line of code 4.6s can be adjusted to whatever number between 0 and 5.
df1
output:
Time Value
0 2020-07-07 00:46:15 0.0
1 2020-07-07 00:46:20 2.5
2 2020-07-07 00:46:25 7.6
3 2020-07-07 00:46:30 12.5
4 2020-07-07 00:46:35 17.0
5 2020-07-07 00:46:40 20.0
Full reproducible code from an example DataFrame I created:
import re
import pandas
df = pd.DataFrame({'Time': {0: '46:19.6',
1: '46:20.7',
2: '46:21.8',
3: '46:22.9',
4: '46:24.0',
5: '46:25.1',
6: '46:26.2',
7: '46:27.6',
8: '46:28.7',
9: '46:29.8',
10: '46:30.9',
11: '46:32.0',
12: '46:33.2',
13: '46:34.3',
14: '46:35.3',
15: '46:36.5',
16: '46:38.8',
17: '46:40.0'},
'Value': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 17,
16: 19,
17: 20}})
df['Time'] = pd.to_datetime('00:'+df['Time'])
df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index()
df1

Plotly: Dodge overlapping points on scatterplot categorical axis

I am trying to use plotly to compare the coefficents of regression models using error bars for the confidence intervals. I used the following code to plot it, using the variable as a categorical y axis in a scatter plot. The problem is that the points are overlapping, and I'd like to dodge them like happens in bar charts when you set barmode='group'. If I had a numerical axis I could manually dodge them, but I can't do that.
fig = px.scatter(
df, y='index', x='coef', text='label', color='model',
error_x_minus='lerr', error_x='uerr',
hover_data=['coef', 'pvalue', 'lower', 'upper']
)
fig.update_traces(textposition='top center')
fig.update_yaxes(autorange="reversed")
Using facets I get almost the result I want, but some of the labels goes off-plot and are not visible:
fig = px.scatter(
df, y='model', x='coef', text='label', color='model',
facet_row='index',
error_x_minus='lerr', error_x='uerr',
hover_data=['coef', 'pvalue', 'lower', 'upper']
)
fig.update_traces(textposition='top center')
fig.update_yaxes(visible=False)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
Somebody has any idea or workaround for either dodging points in the first case or displaying labels in the second case?
Thanks in advance.
PS: Here's the random fake dataframe I made to generate the plots:
df = pd.DataFrame({'coef': {0: 1.0018729737113143,
1: 0.9408864645423858,
2: 0.29796556981484884,
3: -0.6844053575764955,
4: -0.13689631932690113,
5: 0.1473096200402363,
6: 0.9564712505670716,
7: 0.956099003887811,
8: 0.33319108930207175,
9: -0.7022778825729681,
10: -0.1773916842612131,
11: 0.09485417304851751},
'index': {0: 'const',
1: 'x1',
2: 'x2',
3: 'x3',
4: 'x4',
5: 'x5',
6: 'const',
7: 'x1',
8: 'x2',
9: 'x3',
10: 'x4',
11: 'x5'},
'label': {0: '1.002***',
1: '0.941***',
2: '0.298***',
3: '-0.684***',
4: '-0.137',
5: '0.147',
6: '0.956***',
7: '0.956***',
8: '0.333***',
9: '-0.702***',
10: '-0.177',
11: '0.095'},
'lerr': {0: 0.19788416996400904,
1: 0.19972987383410545,
2: 0.0606849959013587,
3: 0.1772734289533593,
4: 0.1988122854078155,
5: 0.21870366703236832,
6: 0.2734783191688098,
7: 0.2760291042678362,
8: 0.08386739920069491,
9: 0.2449940255063039,
10: 0.27476098595116555,
11: 0.3022511162310027},
'lower': {0: 0.8039888037473053,
1: 0.7411565907082803,
2: 0.23728057391349014,
3: -0.8616787865298547,
4: -0.33570860473471664,
5: -0.07139404699213203,
6: 0.6829929313982618,
7: 0.6800698996199748,
8: 0.24932369010137684,
9: -0.947271908079272,
10: -0.45215267021237865,
11: -0.2073969431824852},
'model': {0: 'OLS',
1: 'OLS',
2: 'OLS',
3: 'OLS',
4: 'OLS',
5: 'OLS',
6: 'QuantReg',
7: 'QuantReg',
8: 'QuantReg',
9: 'QuantReg',
10: 'QuantReg',
11: 'QuantReg'},
'pvalue': {0: 1.4211692095019375e-16,
1: 4.3583690618389965e-15,
2: 6.278403727223468e-16,
3: 1.596372747840846e-11,
4: 0.17483151363955116,
5: 0.18433051296752084,
6: 4.877385844808361e-10,
7: 6.665860891682504e-10,
8: 5.476882838731488e-12,
9: 1.4240852942202845e-07,
10: 0.20303143985022934,
11: 0.5347222575215599},
'uerr': {0: 0.19788416996400904,
1: 0.19972987383410556,
2: 0.06068499590135873,
3: 0.1772734289533593,
4: 0.19881228540781554,
5: 0.21870366703236832,
6: 0.27347831916880994,
7: 0.2760291042678362,
8: 0.08386739920069491,
9: 0.2449940255063039,
10: 0.27476098595116555,
11: 0.3022511162310027},
'upper': {0: 1.1997571436753234,
1: 1.1406163383764913,
2: 0.35865056571620757,
3: -0.5071319286231362,
4: 0.0619159660809144,
5: 0.3660132870726046,
6: 1.2299495697358815,
7: 1.2321281081556472,
8: 0.41705848850276667,
9: -0.4572838570666642,
10: 0.09736930168995245,
11: 0.3971052892795202}})
You were very close to a working solution with your second attempt. Just make more room for your labels with:
height=600, width=800
And then place the labels for the traces named 'OLS' within the boundaries of each subplot with:
fig.for_each_trace(lambda t: t.update(textposition='bottom center') if t.name == 'OLS' else ())
Plot:
Complete code:
import plotly.express as px
import pandas as pd
df = pd.DataFrame({'coef': {0: 1.0018729737113143,
1: 0.9408864645423858,
2: 0.29796556981484884,
3: -0.6844053575764955,
4: -0.13689631932690113,
5: 0.1473096200402363,
6: 0.9564712505670716,
7: 0.956099003887811,
8: 0.33319108930207175,
9: -0.7022778825729681,
10: -0.1773916842612131,
11: 0.09485417304851751},
'index': {0: 'const',
1: 'x1',
2: 'x2',
3: 'x3',
4: 'x4',
5: 'x5',
6: 'const',
7: 'x1',
8: 'x2',
9: 'x3',
10: 'x4',
11: 'x5'},
'label': {0: '1.002***',
1: '0.941***',
2: '0.298***',
3: '-0.684***',
4: '-0.137',
5: '0.147',
6: '0.956***',
7: '0.956***',
8: '0.333***',
9: '-0.702***',
10: '-0.177',
11: '0.095'},
'lerr': {0: 0.19788416996400904,
1: 0.19972987383410545,
2: 0.0606849959013587,
3: 0.1772734289533593,
4: 0.1988122854078155,
5: 0.21870366703236832,
6: 0.2734783191688098,
7: 0.2760291042678362,
8: 0.08386739920069491,
9: 0.2449940255063039,
10: 0.27476098595116555,
11: 0.3022511162310027},
'lower': {0: 0.8039888037473053,
1: 0.7411565907082803,
2: 0.23728057391349014,
3: -0.8616787865298547,
4: -0.33570860473471664,
5: -0.07139404699213203,
6: 0.6829929313982618,
7: 0.6800698996199748,
8: 0.24932369010137684,
9: -0.947271908079272,
10: -0.45215267021237865,
11: -0.2073969431824852},
'model': {0: 'OLS',
1: 'OLS',
2: 'OLS',
3: 'OLS',
4: 'OLS',
5: 'OLS',
6: 'QuantReg',
7: 'QuantReg',
8: 'QuantReg',
9: 'QuantReg',
10: 'QuantReg',
11: 'QuantReg'},
'pvalue': {0: 1.4211692095019375e-16,
1: 4.3583690618389965e-15,
2: 6.278403727223468e-16,
3: 1.596372747840846e-11,
4: 0.17483151363955116,
5: 0.18433051296752084,
6: 4.877385844808361e-10,
7: 6.665860891682504e-10,
8: 5.476882838731488e-12,
9: 1.4240852942202845e-07,
10: 0.20303143985022934,
11: 0.5347222575215599},
'uerr': {0: 0.19788416996400904,
1: 0.19972987383410556,
2: 0.06068499590135873,
3: 0.1772734289533593,
4: 0.19881228540781554,
5: 0.21870366703236832,
6: 0.27347831916880994,
7: 0.2760291042678362,
8: 0.08386739920069491,
9: 0.2449940255063039,
10: 0.27476098595116555,
11: 0.3022511162310027},
'upper': {0: 1.1997571436753234,
1: 1.1406163383764913,
2: 0.35865056571620757,
3: -0.5071319286231362,
4: 0.0619159660809144,
5: 0.3660132870726046,
6: 1.2299495697358815,
7: 1.2321281081556472,
8: 0.41705848850276667,
9: -0.4572838570666642,
10: 0.09736930168995245,
11: 0.3971052892795202}})
fig = px.scatter(
df, y='model', x='coef', text='label', color='model',
facet_row='index',
error_x_minus='lerr', error_x='uerr',
hover_data=['coef', 'pvalue', 'lower', 'upper'],
height=600, width=800,
)
fig.update_traces(textposition='top center')
fig.update_yaxes(visible=False)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.for_each_trace(lambda t: t.update(textposition='bottom center') if t.name == 'OLS' else ())
fig.show()

TypeError: unsupported operand type(s) for &: 'str' and 'bool'

All,
I have below Pandas dataframe, and I am trying to filter my dataframe such that my output displays country name along with the year 1989 column whose number is >1000000.For this I am using below code, but it is returning me below error.
{'Country': {0: 'Austria', 1: 'Belgium', 2: 'Denmark', 3: 'Finland', 4: 'France', 5: 'Germany', 6: 'Iceland', 7: 'Ireland', 8: 'Italy', 9: 'Luxemburg', 10: 'Netherland', 11: 'Norway', 12: 'Portugal', 13: 'Spain', 14: 'Sweden', 15: 'Switzerland', 16: 'United Kingdom'}, 'y1989': {0: 7602431, 1: 9927600, 2: 5129800, 3: 4954359, 4: 56269800, 5: 61715000, 6: 253500, 7: 3526600, 8: 57504700, 9: 374900, 10: 14805240, 11: 4226901, 12: 10304700, 13: 38851900, 14: 8458890, 15: 6619973, 16: 57236200}, 'y1990': {0: 7660345.0, 1: 9947800.0, 2: 5135400.0, 3: 4974383.0, 4: 0.0, 5: 62678000.0, 6: 255708.0, 7: 3505500.0, 8: 57576400.0, 9: 379300.0, 10: 14892574.0, 11: 4241473.0, 12: 0.0, 13: 38924500.0, 14: 8527040.0, 15: 6673850.0, 16: 57410600.0}, 'y1991': {0: 7790957, 1: 9987000, 2: 5146500, 3: 4998478, 4: 56893000, 5: 79753000, 6: 259577, 7: 3519000, 8: 57746200, 9: 384400, 10: 15010445, 11: 4261930, 12: 9858500, 13: 38993800, 14: 8590630, 15: 6750693, 16: 57649200}, 'y1992': {0: 7860800, 1: 10068319, 2: 5162100, 3: 5029300, 4: 57217500, 5: 80238000, 6: 262193, 7: 3542000, 8: 57788200, 9: 389800, 10: 15129200, 11: 4273634, 12: 9846000, 13: 39055900, 14: 8644100, 15: 6831900, 16: 58888800}, 'y1993': {0: 7909575, 1: 10100631, 2: 5180614, 3: 5054982, 4: 57529577, 5: 81338000, 6: 264922, 7: 3559985, 8: 57114161, 9: 395200, 10: 15354000, 11: 4324577, 12: 9987500, 13: 39790955, 14: 8700000, 15: 6871500, 16: 58191230}, 'y1994': {0: 7943652, 1: 10130574, 2: 5191000, 3: 5098754, 4: 57847000, 5: 81353000, 6: 266783, 7: 3570700, 8: 57201800, 9: 400000, 10: 15341553, 11: 4348410, 12: 9776000, 13: 39177400, 14: 8749000, 15: 7021200, 16: 58380000}, 'y1995': {0: 8054800, 1: 10143047, 2: 5251027, 3: 5116800, 4: 58265400, 5: 81845000, 6: 267806, 7: 3591200, 8: 57268578, 9: 412800, 10: 15492800, 11: 4370000, 12: 9920800, 13: 39241900, 14: 8837000, 15: 7060400, 16: 58684000}}
My code
df[(df.Country)& (df.y1989>1000000)]
Error:
TypeError: unsupported operand type(s) for &: 'str' and 'bool'
I am not sure what could be the reason, being a newbie to python if you could provide explanation for the error that will be greatly appreciated.
Thanks in advance,
'Country' doesn't form part of your filtering criteria, so don't use it to form your Boolean indexer. Instead, use the loc accessor to give a Boolean condition and specify necessary columns separately:
res = df.loc[df['y1989'] > 1000000, ['Country','y1989']]
Under no circumstances use chained assignment, e.g. via df[df['y1989']>1000000][['Country','y1989']], as this is ambiguous and explicitly discouraged in the docs.

Categories