Adding a total per level-2 index in a multiindex pandas dataframe - python

I have a dataframe:
import numpy as np
import pandas as pd

df_full = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'B',
8: 'B',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B'},
('category', ''): {0: 'Books',
1: 'Candy',
2: 'Pencil',
3: 'Table',
4: 'PC',
5: 'Printer',
6: 'Lamp',
7: 'Books',
8: 'Candy',
9: 'Pencil',
10: 'Table',
11: 'PC',
12: 'Printer',
13: 'Lamp'},
(pd.Timestamp('2021-06-28 00:00:00'),
'Sales_1'): {0: 9.937449997200002, 1: 30.71300000639998, 2: 58.81199999639999, 3: 25.661999978399994, 4: 3.657999996, 5: 12.0879999972, 6: 61.16600000040001, 7: 6.319439989199998, 8: 12.333119997600003, 9: 24.0544100028, 10: 24.384659998799997, 11: 1.9992000012000002, 12: 0.324, 13: 40.69122000000001},
(pd.Timestamp('2021-06-28 00:00:00'),
'Sales_2'): {0: 21.890370397789923, 1: 28.300470581874837, 2: 53.52039700062155, 3: 52.425508769690694, 4: 6.384936971649232, 5: 6.807138946302334, 6: 52.172, 7: 5.916852561, 8: 5.810764652, 9: 12.1243325, 10: 17.88071596, 11: 0.913782413, 12: 0.869207661, 13: 20.9447844},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_week_sales'): {0: np.nan,
1: np.nan,
2: np.nan,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: np.nan},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_orders'): {0: 86.0,
1: 66.0,
2: 188.0,
3: 556.0,
4: 12.0,
5: 4.0,
6: 56.0,
7: 90.0,
8: 26.0,
9: 49.0,
10: 250.0,
11: 7.0,
12: 2.0,
13: 44.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 4390.11,
1: 24825.059999999998,
2: 48592.39999999998,
3: 60629.77,
4: 831.22,
5: 1545.71,
6: 34584.99,
7: 5641.54,
8: 6798.75,
9: 13290.13,
10: 42692.68000000001,
11: 947.65,
12: 329.0,
13: 29889.65},
(pd.Timestamp('2021-07-05 00:00:00'),
'Sales_1'): {0: 13.690399997999998, 1: 38.723000005199985, 2: 72.4443400032, 3: 36.75802000560001, 4: 5.691999996, 5: 7.206999998399999, 6: 66.55265999039996, 7: 6.4613199911999954, 8: 12.845630001599998, 9: 26.032340003999998, 10: 30.1634600016, 11: 1.0203399996, 12: 1.4089999991999997, 13: 43.67116000320002},
(pd.Timestamp('2021-07-05 00:00:00'),
'Sales_2'): {0: 22.874363860953647, 1: 29.5726042895728, 2: 55.926190956481534, 3: 54.7820864335212, 4: 6.671946105284065, 5: 7.113126469779095, 6: 54.517, 7: 6.194107518, 8: 6.083562133, 9: 12.69221484, 10: 18.71872129, 11: 0.956574175, 12: 0.910216433, 13: 21.92632044},
(pd.Timestamp('2021-07-05 00:00:00'), 'last_week_sales'): {0: 4390.11,
1: 24825.059999999998,
2: 48592.39999999998,
3: 60629.77,
4: 831.22,
5: 1545.71,
6: 34584.99,
7: 5641.54,
8: 6798.75,
9: 13290.13,
10: 42692.68000000001,
11: 947.65,
12: 329.0,
13: 29889.65},
(pd.Timestamp('2021-07-05 00:00:00'), 'total_orders'): {0: 109.0,
1: 48.0,
2: 174.0,
3: 587.0,
4: 13.0,
5: 5.0,
6: 43.0,
7: 62.0,
8: 13.0,
9: 37.0,
10: 196.0,
11: 8.0,
12: 1.0,
13: 33.0},
(pd.Timestamp('2021-07-05 00:00:00'), 'total_sales'): {0: 3453.02,
1: 17868.730000000003,
2: 44707.82999999999,
3: 60558.97999999999,
4: 1261.0,
5: 1914.6000000000001,
6: 24146.09,
7: 6201.489999999999,
8: 5513.960000000001,
9: 9645.87,
10: 25086.785,
11: 663.0,
12: 448.61,
13: 26332.7}}).set_index(['group','category'])
I am trying to get a total for each column per category. So in this df example that means adding 2 rows below Lamp (one Total row per group) denoting the totals of each column.
What I've tried:
df_out['total'] = df_out.sum(level=1).loc[:, (slice(None), 'total_sales')]
But get:
ValueError: Wrong number of items passed 4, placement implies 1
I also checked this question but could not apply it to my case.

Let us try a groupby on level=0:
s = df_full.groupby(level=0).sum()
s.index = pd.MultiIndex.from_product([s.index, ['Total']])
df_out = pd.concat([df_full, s]).sort_index()  # DataFrame.append is deprecated; removed in pandas 2.0
print(df_out)
2021-06-28 00:00:00 2021-07-05 00:00:00
Sales_1 Sales_2 last_week_sales total_orders total_sales Sales_1 Sales_2 last_week_sales total_orders total_sales
group category
A Books 9.93745 21.890370 NaN 86.0 4390.11 13.69040 22.874364 4390.11 109.0 3453.020
Candy 30.71300 28.300471 NaN 66.0 24825.06 38.72300 29.572604 24825.06 48.0 17868.730
Lamp 61.16600 52.172000 NaN 56.0 34584.99 66.55266 54.517000 34584.99 43.0 24146.090
PC 3.65800 6.384937 NaN 12.0 831.22 5.69200 6.671946 831.22 13.0 1261.000
Pencil 58.81200 53.520397 NaN 188.0 48592.40 72.44434 55.926191 48592.40 174.0 44707.830
Printer 12.08800 6.807139 NaN 4.0 1545.71 7.20700 7.113126 1545.71 5.0 1914.600
Table 25.66200 52.425509 NaN 556.0 60629.77 36.75802 54.782086 60629.77 587.0 60558.980
Total 202.03645 221.500823 0.0 968.0 175399.26 241.06742 231.457318 175399.26 979.0 153910.250
B Books 6.31944 5.916853 NaN 90.0 5641.54 6.46132 6.194108 5641.54 62.0 6201.490
Candy 12.33312 5.810765 NaN 26.0 6798.75 12.84563 6.083562 6798.75 13.0 5513.960
Lamp 40.69122 20.944784 NaN 44.0 29889.65 43.67116 21.926320 29889.65 33.0 26332.700
PC 1.99920 0.913782 NaN 7.0 947.65 1.02034 0.956574 947.65 8.0 663.000
Pencil 24.05441 12.124332 NaN 49.0 13290.13 26.03234 12.692215 13290.13 37.0 9645.870
Printer 0.32400 0.869208 NaN 2.0 329.00 1.40900 0.910216 329.00 1.0 448.610
Table 24.38466 17.880716 NaN 250.0 42692.68 30.16346 18.718721 42692.68 196.0 25086.785
Total 110.10605 64.460440 0.0 468.0 99589.40 121.60325 67.481717 99589.40 350.0 73892.415
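The groupby-total pattern can be checked on a tiny frame (toy data, illustrative values); in pandas >= 2.0, pd.concat replaces the removed DataFrame.append:

```python
import pandas as pd

# Toy frame with the same (group, category) MultiIndex layout as df_full
df = pd.DataFrame(
    {"total_sales": [10.0, 20.0, 5.0, 15.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", "Books"), ("A", "Lamp"), ("B", "Books"), ("B", "Lamp")],
        names=["group", "category"],
    ),
)

# Sum within each level-0 group, then label the new rows 'Total'
s = df.groupby(level=0).sum()
s.index = pd.MultiIndex.from_product([s.index, ["Total"]], names=["group", "category"])

# Append the totals and interleave them with sort_index
out = pd.concat([df, s]).sort_index()
```

sort_index places "Total" after the category names here only because it sorts last alphabetically; with arbitrary category names the Total rows may land elsewhere.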

Related

Boolean mask unexpected behavior when applying style

I am processing data where values may be of the format '<x', and I want to return x/2; so '<5' would be returned as 2.5. I have columns of mixed numbers and text. The problem is that I want to style the values that have been changed. Dummy data and code:
dummy={'Location': {0: 'Perth', 1: 'Perth', 2: 'Perth', 3: 'Perth', 4: 'Perth', 5: 'Perth', 6: 'Perth', 7: 'Perth', 8: 'Perth', 9: 'Perth', 10: 'Perth', 11: 'Perth', 12: 'Perth', 13: 'Perth', 14: 'Perth', 15: 'Perth', 16: 'Perth', 17: 'Perth'}, 'Date': {0: '11/01/2012 0:00', 1: '11/01/2012 0:00', 2: '20/03/2012 0:00', 3: '6/06/2012 0:00', 4: '14/09/2012 0:00', 5: '17/12/2013 0:00', 6: '1/02/2014 0:00', 7: '1/02/2014 0:00', 8: '1/02/2014 0:00', 9: '1/02/2014 0:00', 10: '1/02/2014 0:00', 11: '1/02/2014 0:00', 12: '1/02/2014 0:00', 13: '1/02/2014 0:00', 14: '1/02/2014 0:00', 15: '1/02/2014 0:00', 16: '1/02/2014 0:00', 17: '1/02/2014 0:00'}, 'As µg/L': {0: '9630', 1: '9630', 2: '8580', 3: '4990', 4: '6100', 5: '282', 6: '21', 7: '<1', 8: '<1', 9: '<1', 10: '<1', 11: '<1', 12: '<1', 13: '<1', 14: '<1', 15: '<1', 16: '<1', 17: '<1'}, 'As': {0: '9.63', 1: '9.63', 2: '8.58', 3: '4.99', 4: '6.1', 5: '0.282', 6: '0.021', 7: '<1', 8: '<1', 9: '<1', 10: '<1', 11: '<1', 12: '<1', 13: '<1', 14: '<1', 15: '<1', 16: '<1', 17: '10'}, 'Ba': {0: 1000.0, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'HCO3': {0: '10.00', 1: '0.50', 2: '0.50', 3: '<22', 4: '0.50', 5: '0.50', 6: '0.50', 7: np.nan, 8: np.nan, 9: np.nan, 10: '0.50', 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Cd': {0: 0.0094, 1: 0.0094, 2: 0.011, 3: 0.0035, 4: 0.004, 5: 0.002, 6: 0.0019, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Ca': {0: 248.0, 1: 248.0, 2: 232.0, 3: 108.0, 4: 150.0, 5: 396.0, 6: 472.0, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: 472.0, 15: np.nan, 16: np.nan, 17: np.nan}, 'CO3': {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5, 7: np.nan, 8: np.nan, 9: 0.5, 10: np.nan, 11: np.nan, 
12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Cl': {0: 2.0, 1: 2.0, 2: 2.0, 3: 2.0, 4: 0.5, 5: 2.0, 6: 5.0, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: 5.0, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}}
import pandas as pd
import numpy as np

df = pd.DataFrame(dummy)
mask = df.applymap(lambda x: (isinstance(x, str) and x.startswith('<')))

def remove_less_thans(x):
    if type(x) is int:
        return x
    elif type(x) is float:
        return x
    elif type(x) is str and x[0] == "<":
        try:
            return float(x[1:]) / 2
        except:
            return x
    elif type(x) is str and len(x) < 10:
        try:
            return float(x)
        except:
            return x
    else:
        return x

def colour_mask(val):
    colour = 'color: red; font-weight: bold' if val in df.values[mask] else ''
    return colour

# perform remove less-thans and divide the remainder by two
df = df.applymap(remove_less_thans)
styled_df = df.style.applymap(colour_mask)
styled_df
The mask looks correct and the remove-< function works, but I get values formatted when they shouldn't be. In the dummy data the HCO3 column has its 0.5 values reformatted even though they do not start with '<' and do not appear as True in the mask. I know that they are numbers stored as text, but that is how the real data might appear, and given that the mask is constructed as expected (i.e. the one True is there and the rest of the values in the column are False) I don't know why they are being formatted. The same goes for column CO3: all the non-NaN values are formatted when none should be. Why is this happening and how do I fix it?
Idea is to pass the mask to Styler.apply with numpy.where:
def colour_mask(x):
    arr = np.where(mask, 'color: red; font-weight: bold', '')
    return pd.DataFrame(arr, index=x.index, columns=x.columns)

styled_df = df.style.apply(colour_mask, axis=None)
Or:
def colour_mask(x, props=''):
    return np.where(mask, props, '')

styled_df = df.style.apply(colour_mask, props='color: red; font-weight: bold', axis=None)

How to Use Melt to Tidy Dataframe in Pandas?

dt = {'Ind': {0: 'Ind1',
1: 'Ind2',
2: 'Ind3',
3: 'Ind4',
4: 'Ind5',
5: 'Ind6',
6: 'Ind7',
7: 'Ind8',
8: 'Ind9',
9: 'Ind10',
10: 'Ind1',
11: 'Ind2',
12: 'Ind3',
13: 'Ind4',
14: 'Ind5',
15: 'Ind6',
16: 'Ind7',
17: 'Ind8',
18: 'Ind9',
19: 'Ind10'},
'Treatment': {0: 'Treat',
1: 'Treat',
2: 'Treat',
3: 'Treat',
4: 'Treat',
5: 'Treat',
6: 'Treat',
7: 'Treat',
8: 'Treat',
9: 'Treat',
10: 'Cont',
11: 'Cont',
12: 'Cont',
13: 'Cont',
14: 'Cont',
15: 'Cont',
16: 'Cont',
17: 'Cont',
18: 'Cont',
19: 'Cont'},
'value': {0: 4.5,
1: 8.3,
2: 6.2,
3: 4.2,
4: 7.1,
5: 7.5,
6: 7.9,
7: 5.1,
8: 5.8,
9: 6.0,
10: 11.3,
11: 11.6,
12: 13.3,
13: 12.2,
14: 13.4,
15: 11.7,
16: 12.1,
17: 12.0,
18: 14.0,
19: 13.8}}
mydt = pd.DataFrame(dt, columns=['Ind', 'Treatment', 'value'])
How can I tidy up my dataframe to make it look like the desired output?
Desired Output
You can use DataFrame.from_dict
pd.DataFrame.from_dict(data, orient='index')
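If the desired output is one row per individual with a column per treatment (the data above is already in long format, so the inverse of melt applies), a pivot sketch under that assumption:

```python
import pandas as pd

# Toy long-format data with the same columns as mydt
mydt = pd.DataFrame({
    "Ind": ["Ind1", "Ind2", "Ind1", "Ind2"],
    "Treatment": ["Treat", "Treat", "Cont", "Cont"],
    "value": [4.5, 8.3, 11.3, 11.6],
})

# Long -> wide: one row per individual, one column per treatment level
wide = mydt.pivot(index="Ind", columns="Treatment", values="value")
```

Going the other way (wide -> long) is what pd.melt does, with id_vars="Ind".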

Default value not invoked when using np.select

I'm using np.select to evaluate a few conditions and am trying to assign a default value equal to the previous value of the array.
For example, if
row[i-1] = True
row[i] = NaN
then
row[i] = True
I have used the following lines
entry_conditions = [
    (df['Close'] > df['Open'] + 100),
    (df['Close'] < df['Open'] - 100)
]
entry_choices = [True, False]
df['entry'] = np.nan
# Need to initialize the column with NaN, or else evaluating the first row's default value throws an error
df['entry'] = np.select(entry_conditions, entry_choices, default=df['entry'].shift(1))
Sample output of df['entry']
True,
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
True,
'nan',
'nan',
'nan',
True,
I don't understand why, even after the default value is specified, the column still shows NaN in the final output.
Sample data obtained by df.to_dict
{'Date': {1: Timestamp('2021-01-01 09:30:00'),
2: Timestamp('2021-01-01 09:45:00'),
3: Timestamp('2021-01-01 10:00:00'),
4: Timestamp('2021-01-01 10:15:00'),
5: Timestamp('2021-01-01 10:30:00'),
6: Timestamp('2021-01-01 10:45:00'),
7: Timestamp('2021-01-01 11:00:00'),
8: Timestamp('2021-01-01 11:15:00'),
9: Timestamp('2021-01-01 11:30:00'),
10: Timestamp('2021-01-01 11:45:00'),
11: Timestamp('2021-01-01 12:00:00'),
12: Timestamp('2021-01-01 12:15:00'),
13: Timestamp('2021-01-01 12:30:00'),
14: Timestamp('2021-01-01 12:45:00'),
15: Timestamp('2021-01-01 13:00:00')},
'Open': {1: 31376.0,
2: 31405.0,
3: 31389.4,
4: 31377.5,
5: 31347.8,
6: 31310.8,
7: 31343.4,
8: 31349.5,
9: 31349.9,
10: 31325.1,
11: 31310.9,
12: 31329.0,
13: 31376.0,
14: 31375.5,
15: 31357.4},
'High': {1: 31425.0,
2: 31411.95,
3: 31389.45,
4: 31382.0,
5: 31350.0,
6: 31354.6,
7: 31359.0,
8: 31370.0,
9: 31364.7,
10: 31350.0,
11: 31337.9,
12: 31378.9,
13: 31419.5,
14: 31377.75,
15: 31360.0},
'Low': {1: 31367.95,
2: 31352.5,
3: 31331.65,
4: 31301.4,
5: 31303.05,
6: 31310.0,
7: 31325.05,
8: 31335.35,
9: 31315.35,
10: 31281.9,
11: 31292.0,
12: 31316.25,
13: 31352.05,
14: 31335.0,
15: 31322.0},
'Close': {1: 31398.3,
2: 31386.0,
3: 31377.0,
4: 31342.3,
5: 31311.7,
6: 31345.0,
7: 31349.0,
8: 31344.2,
9: 31327.6,
10: 31311.3,
11: 31325.6,
12: 31373.0,
13: 31375.0,
14: 31357.4,
15: 31326.0}}
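One likely explanation (a sketch, not verified against the asker's full data): np.select evaluates its default argument once, up front, so df['entry'].shift(1) is computed while the column is still all NaN; the default never sees the values the select is about to assign. Forward-filling after the select achieves the intended carry-forward:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Open": [100.0, 100.0, 100.0],
                   "Close": [250.0, 120.0, 110.0]})

conditions = [df["Close"] > df["Open"] + 100,
              df["Close"] < df["Open"] - 100]
choices = [True, False]

# NaN as default marks rows where no condition fired
# (the NaN forces a float result, so True/False come out as 1.0/0.0)
df["entry"] = np.select(conditions, choices, default=np.nan)
# ...then carry the previous decision forward, which is what
# default=shift(1) cannot do because it is evaluated only once
df["entry"] = df["entry"].ffill()
```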

Create a temporal network with teneto (Python)?

I have the following dataframe (here is the portion of it):
{'date': {0: Timestamp('2020-10-03 00:00:00'),
1: Timestamp('2020-10-03 00:00:00'),
2: Timestamp('2020-10-03 00:00:00'),
3: Timestamp('2020-10-03 00:00:00'),
4: Timestamp('2020-10-24 00:00:00'),
5: Timestamp('2020-10-24 00:00:00'),
6: Timestamp('2020-10-24 00:00:00'),
7: Timestamp('2020-10-24 00:00:00'),
8: Timestamp('2020-10-25 00:00:00'),
9: Timestamp('2020-10-25 00:00:00')},
'from': {0: 7960001,
1: 25500005,
2: 4660001,
3: 91000032,
4: 280001,
5: 26100016,
6: 30001114,
7: 12000016,
8: 79000523,
9: 74000114},
'to': {0: 30000934,
1: 74000351,
2: 4660001,
3: 91000031,
4: 66000413,
5: 26100022,
6: 26100024,
7: 12000016,
8: 79000321,
9: 74000122},
'weight': {0: 17.1,
1: 15.0,
2: 931.6,
3: 145.9,
4: 29.3,
5: 25.8,
6: 15.0,
7: 132.4,
8: 51.5,
9: 492.9}}
And I want to build a temporal network out of this time series - graph/network data.
I would like to build a network with respect to time + clusters.
Here is what I am trying to do:
df is the dataframe above
import teneto
t = list(df.index())
netin = {'i': df['from'], 'j': df['to'], 't': t, 'weight': df['weight']}
df = pd.DataFrame(data=netin)
tnet = TemporalNetwork(from_df=df)
tnet.network
Keep getting:
TypeError: 'RangeIndex' object is not callable
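The error most likely comes from df.index() - DataFrame.index is an attribute, not a method, so calling it raises TypeError: 'RangeIndex' object is not callable. A sketch of the fix on a toy frame (the teneto call itself is left commented, as in the question; whether TemporalNetwork accepts this frame depends on teneto's expected column layout):

```python
import pandas as pd

# Toy subset of the edge data above
df = pd.DataFrame({"from": [7960001, 25500005],
                   "to": [30000934, 74000351],
                   "weight": [17.1, 15.0]})

t = list(df.index)  # attribute access, not df.index()

netin = {"i": df["from"], "j": df["to"], "t": t, "weight": df["weight"]}
net_df = pd.DataFrame(data=netin)
# tnet = teneto.TemporalNetwork(from_df=net_df)
```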

What is the looping code to generate traces for a multi-category x-axis bar chart?

I have spent days trying to plot a bar chart of the following data on plotly using python and jupyter notebooks:
site end_date weight diversion_weight count
0 All Sites 2019-11-30 19609.32 15387.52 977.0
1 All Sites 2019-12-31 13188.50 10283.40 658.0
2 All Sites 2020-01-31 21975.01 17502.41 1124.0
3 All Sites 2020-02-29 1933.06 1535.96 111.0
4 Site 1 2019-11-30 8351.48 5864.38 491.0
5 Site 1 2019-12-31 6746.97 4761.47 393.0
6 Site 1 2020-01-31 12040.58 8748.83 745.0
7 Site 1 2020-02-29 1193.33 900.73 72.0
8 Site 2 2019-11-30 11257.84 9523.14 486.0
9 Site 2 2019-12-31 6441.53 5521.93 265.0
10 Site 2 2020-01-31 9934.43 8753.58 379.0
11 Site 2 2020-02-29 739.73 635.23 39.0
with each time series on the x-axis having, for each site, a bar for diversion_weight that overlays a bar for weight.
I'd prefer to not have a series of subplots.
I saw this post as a potential guide but could not figure out how to adapt the code to my problem.
Dict for the data:
from pandas import Timestamp
dfplot = {'site': {0: 'All Sites',
1: 'All Sites',
2: 'All Sites',
3: 'All Sites',
4: 'Site 1',
5: 'Site 1',
6: 'Site 1',
7: 'Site 1',
8: 'Site 2',
9: 'Site 2',
10: 'Site 2',
11: 'Site 2'},
'end_date': {0: Timestamp('2019-11-30 00:00:00'),
1: Timestamp('2019-12-31 00:00:00'),
2: Timestamp('2020-01-31 00:00:00'),
3: Timestamp('2020-02-29 00:00:00'),
4: Timestamp('2019-11-30 00:00:00'),
5: Timestamp('2019-12-31 00:00:00'),
6: Timestamp('2020-01-31 00:00:00'),
7: Timestamp('2020-02-29 00:00:00'),
8: Timestamp('2019-11-30 00:00:00'),
9: Timestamp('2019-12-31 00:00:00'),
10: Timestamp('2020-01-31 00:00:00'),
11: Timestamp('2020-02-29 00:00:00')},
'weight': {0: 19609.32,
1: 13188.5,
2: 21975.010000000002,
3: 1933.06,
4: 8351.48,
5: 6746.97,
6: 12040.58,
7: 1193.33,
8: 11257.84,
9: 6441.53,
10: 9934.43,
11: 739.73},
'diversion_weight': {0: 15387.52,
1: 10283.400000000001,
2: 17502.41,
3: 1535.96,
4: 5864.38,
5: 4761.47,
6: 8748.83,
7: 900.73,
8: 9523.14,
9: 5521.93,
10: 8753.58,
11: 635.23},
'count': {0: 977.0,
1: 658.0,
2: 1124.0,
3: 111.0,
4: 491.0,
5: 393.0,
6: 745.0,
7: 72.0,
8: 486.0,
9: 265.0,
10: 379.0,
11: 39.0}}
dfplot
