Double header dataframe, sumif (possibly groupby?) with python

Here is an image of what I have and what I want to get: https://imgur.com/a/RyDbvZD
The target values are SUMIF formulas in Excel, and I would like to recreate them in Python. I tried the pandas groupby().sum() function, but I have no clue how to group by two header rows like this, or how to order the resulting data.
Original dataframe:
df = pd.DataFrame( {'Group': {0: 'Name', 1: 20201001, 2: 20201002, 3: 20201003, 4: 20201004, 5: 20201005, 6: 20201006, 7: 20201007, 8: 20201008, 9: 20201009, 10: 20201010}, 'Credit': {0: 'Credit', 1: 65, 2: 69, 3: 92, 4: 18, 5: 58, 6: 12, 7: 31, 8: 29, 9: 12, 10: 41}, 'Equity': {0: 'Stock', 1: 92, 2: 62, 3: 54, 4: 52, 5: 14, 6: 5, 7: 14, 8: 17, 9: 54, 10: 51}, 'Equity.1': {0: 'Option', 1: 87, 2: 30, 3: 40, 4: 24, 5: 95, 6: 77, 7: 44, 8: 77, 9: 88, 10: 85}, 'Credit.1': {0: 'Credit', 1: 62, 2: 60, 3: 91, 4: 57, 5: 65, 6: 50, 7: 75, 8: 55, 9: 48, 10: 99}, 'Equity.2': {0: 'Option', 1: 61, 2: 91, 3: 38, 4: 3, 5: 71, 6: 51, 7: 74, 8: 41, 9: 59, 10: 31}, 'Bond': {0: 'Bond', 1: 4, 2: 62, 3: 91, 4: 66, 5: 30, 6: 51, 7: 76, 8: 6, 9: 65, 10: 73}, 'Unnamed: 7': {0: 'Stock', 1: 54, 2: 23, 3: 74, 4: 92, 5: 36, 6: 89, 7: 88, 8: 32, 9: 19, 10: 91}, 'Bond.1': {0: 'Bond', 1: 96, 2: 10, 3: 11, 4: 7, 5: 28, 6: 82, 7: 13, 8: 46, 9: 70, 10: 46}, 'Bond.2': {0: 'Bond', 1: 25, 2: 53, 3: 96, 4: 70, 5: 52, 6: 9, 7: 98, 8: 9, 9: 48, 10: 58}, 'Unnamed: 10': {0: float('nan'), 1: 63.0, 2: 80.0, 3: 17.0, 4: 21.0, 5: 30.0, 6: 78.0, 7: 23.0, 8: 31.0, 9: 72.0, 10: 65.0}} )
What I want at the end:
df = pd.DataFrame( {'Group': {0: 20201001, 1: 20201002, 2: 20201003, 3: 20201004, 4: 20201005, 5: 20201006, 6: 20201007, 7: 20201008, 8: 20201009, 9: 20201010}, 'Credit': {0: 127, 1: 129, 2: 183, 3: 75, 4: 123, 5: 62, 6: 106, 7: 84, 8: 60, 9: 140}, 'Equity': {0: 240, 1: 183, 2: 132, 3: 79, 4: 180, 5: 133, 6: 132, 7: 135, 8: 201, 9: 167}, 'Stock': {0: 146, 1: 85, 2: 128, 3: 144, 4: 50, 5: 94, 6: 102, 7: 49, 8: 73, 9: 142}, 'Option': {0: 148, 1: 121, 2: 78, 3: 27, 4: 166, 5: 128, 6: 118, 7: 118, 8: 147, 9: 116}} )
Any ideas on where to start would be appreciated.

Here you go. The first row seems to hold the real headers, so we first move it into the column names and set the index to Name:
df2 = df.rename(columns = df.loc[0]).drop(index = 0).set_index(['Name'])
Then we group by column label and sum:
df2.groupby(df2.columns, axis=1, sort = False).sum().reset_index()
and we get
Name Credit Stock Option Bond
0 20201001 127.0 146.0 148.0 125.0
1 20201002 129.0 85.0 121.0 125.0
2 20201003 183.0 128.0 78.0 198.0
3 20201004 75.0 144.0 27.0 143.0
4 20201005 123.0 50.0 166.0 110.0
5 20201006 62.0 94.0 128.0 142.0
6 20201007 106.0 102.0 118.0 187.0
7 20201008 84.0 49.0 118.0 61.0
8 20201009 60.0 73.0 147.0 183.0
9 20201010 140.0 142.0 116.0 177.0
I realise the output is not exactly what you asked for, but since we cannot see your SUMIF formulas, I do not know which columns you want to aggregate.
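On recent pandas versions, groupby(..., axis=1) emits a deprecation warning (the axis argument of DataFrame.groupby has been deprecated since pandas 2.1). A minimal sketch of the same idea that avoids it, grouping the transposed frame by its duplicate labels and transposing back, is shown here on a cut-down version of the question's data. The names raw, body and summed are illustrative only, and the sketch assumes every data column is numeric once the header row is removed.

```python
import pandas as pd

# Cut-down stand-in for the question's frame: row 0 holds the real headers.
raw = pd.DataFrame({
    "Group": ["Name", 20201001, 20201002],
    "Credit": ["Credit", 65, 69],
    "Equity": ["Stock", 92, 62],
    "Equity.1": ["Option", 87, 30],
    "Credit.1": ["Credit", 62, 60],
})

# Drop the header row, index by date, and relabel columns with the real headers.
body = raw.iloc[1:].set_index("Group")
body.columns = raw.iloc[0, 1:].to_list()
body = body.astype(int)

# Group the transposed frame by its (duplicate) labels, then transpose back.
summed = body.T.groupby(level=0, sort=False).sum().T.reset_index()
print(summed)
```

Transposing copies the data, so for very wide or very long frames this may be slower than summing explicitly chosen columns.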
Edit
Following up on your comment: as far as I can tell, the aggregation rules are somewhat irregular, in that the same input column feeds more than one output column (Equity.1, for example, counts towards both Equity and Option). I do not think automation helps much here; you can replicate the SUMIF behaviour by directly referencing the columns you want to add. I think the following gives you what you want:
df = df.drop(index =0)
df2 = df[['Group']].copy()
df2['Credit'] = df['Credit'] + df['Credit.1']
df2['Equity'] = df['Equity'] + df['Equity.1']+ df['Equity.2']
df2['Stock'] = df['Equity'] + df['Unnamed: 7']
df2['Option'] = df['Equity.1'] + df['Equity.2']
df2
produces
Group Credit Equity Stock Option
-- -------- -------- -------- ------- --------
1 20201001 127 240 146 148
2 20201002 129 183 85 121
3 20201003 183 132 128 78
4 20201004 75 79 144 27
5 20201005 123 180 50 166
6 20201006 62 133 94 128
7 20201007 106 132 102 118
8 20201008 84 135 49 118
9 20201009 60 201 73 147
10 20201010 140 167 142 116
This also gives you control over which columns to include in the final output.
If you want this more automated, then you need to do something about the labels of your columns, since each set of columns you want to aggregate needs a unique label. If the same input column is used in more than one calculation, it is probably easiest to duplicate it under each label it belongs to.
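One way to keep the explicit column references while cutting the repetition is to hold the rules in a dictionary and loop over it. This is only a sketch: the rules mapping and the out frame are hypothetical names, the data is the first two rows of the question's frame with the header row already dropped, and the column lists mirror the sums written out above.

```python
import pandas as pd

# First two data rows of the question's frame (header row already dropped).
df = pd.DataFrame({
    "Group": [20201001, 20201002],
    "Credit": [65, 69],
    "Equity": [92, 62],
    "Equity.1": [87, 30],
    "Credit.1": [62, 60],
    "Equity.2": [61, 91],
    "Unnamed: 7": [54, 23],
})

# Output label -> input columns to add up (a column may appear in several rules).
rules = {
    "Credit": ["Credit", "Credit.1"],
    "Equity": ["Equity", "Equity.1", "Equity.2"],
    "Stock": ["Equity", "Unnamed: 7"],
    "Option": ["Equity.1", "Equity.2"],
}

out = df[["Group"]].copy()
for label, cols in rules.items():
    out[label] = df[cols].sum(axis=1)
print(out)
```

Because the rules live in plain data, adding or changing an output column only means editing the dictionary.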

Related

How to average across two dataframes

I have two dataframes:
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
For every 'id' that exists in both dataframes I would like to compute the average of their values in 'total' and have that in a new dataframe.
I tried:
pd.merge(df1, df2, on="id")
with the hope that I could then do:
merged_df[['total']].mean(axis=1)
but it doesn't work at all.
How can you do this?

You could use:
df1.merge(df2, on='id').set_index('id').mean(axis=1).reset_index(name='total')
Or, if you have many columns, a more generic approach:
(df1.merge(df2, on='id', suffixes=(None, '_other')).set_index('id')
.rename(columns=lambda x: x.removesuffix('_other')) # requires python 3.9+
.groupby(axis=1, level=0)
.mean().reset_index()
)
Output:
id total
0 1548638 17.0
1 1953603 38.0
2 1956216 59.0
3 1962245 40.0
4 1981386 40.0
5 1981773 40.0
6 2004787 80.0
7 2017418 44.0
8 2020989 51.0
9 2045043 46.0
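Passing axis=1 to groupby is deprecated in pandas 2.1 and later, so the generic variant above may warn on recent versions. One alternative, sketched here with made-up frames and assuming each id appears at most once per frame, is to stack the two frames, keep only the shared ids, and group by id. This swaps the merge for concat plus groupby; common_ids, stacked, shared and averaged are illustrative names.

```python
import pandas as pd

# Made-up frames with two value columns; ids 2 and 3 exist in both.
df1 = pd.DataFrame({"id": [1, 2, 3], "total": [10, 20, 30], "count": [1, 2, 3]})
df2 = pd.DataFrame({"id": [2, 3, 4], "total": [40, 50, 60], "count": [3, 4, 5]})

# Ids present in both frames.
common_ids = sorted(set(df1["id"]) & set(df2["id"]))

# Stack the frames, keep shared ids, and average every value column per id.
stacked = pd.concat([df1, df2], ignore_index=True)
shared = stacked[stacked["id"].isin(common_ids)]
averaged = shared.groupby("id", as_index=False).mean()
print(averaged)
```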

You can do it like this:
df1 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
df2 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
merged_df = df1.merge(df2, on='id')
merged_df['total_mean'] = merged_df.filter(regex='total').mean(axis=1)
print(merged_df)
Output:
id total_x total_y total_mean
0 1548638 17 17 17.0
1 1953603 38 38 38.0
2 1956216 59 59 59.0
3 1962245 40 40 40.0
4 1981386 40 40 40.0
5 1981773 40 40 40.0
6 2004787 80 80 80.0
7 2017418 44 44 44.0
8 2020989 51 51 51.0
9 2045043 46 46 46.0

Python Color Dataframe cells depending on values

I am trying to color the cells
I have the following Dataframe:
pd.DataFrame({'Jugador': {1: 'M. Sanchez',
2: 'L. Ovalle',
3: 'K. Soto',
4: 'U. Kanu',
5: 'K. Abud'},
'Equipo': {1: 'Houston Dash',
2: 'Tigres UANL',
3: 'Guadalajara',
4: 'Tigres UANL',
5: 'Cruz Azul'},
'Edad': {1: 26, 2: 22, 3: 26, 4: 24, 5: 29},
'Posición específica': {1: 'RAMF, RW',
2: 'LAMF, LW',
3: 'RAMF, RW, CF',
4: 'RAMF, CF, RWF',
5: 'RW, RAMF, LW'},
'Minutos jugados': {1: 2053, 2: 3777, 3: 2287, 4: 1508, 5: 1436},
'Offence': {1: 84, 2: 90, 3: 69, 4: 80, 5: 47},
'Defense': {1: 50, 2: 36, 3: 64, 4: 42, 5: 86},
'Passing': {1: 78, 2: 81, 3: 72, 4: 73, 5: 71},
'Total': {1: 72, 2: 71, 3: 69, 4: 66, 5: 66}})
How can I color the Offence, Defense and Passing cells green if the value is > 60, red if it is < 40, and yellow otherwise?

Use Styler.applymap with a custom function:
def styler(v):
    if v > 60:
        return 'background-color:green'
    elif v < 40:
        return 'background-color:red'
    else:
        return 'background-color:yellow'
df.style.applymap(styler, subset=['Offence','Defense','Passing'])
Alternative solution:
styler = lambda v: 'background-color:green' if v > 60 else 'background-color:red' if v < 40 else 'background-color:yellow'
df.style.applymap(styler, subset=['Offence','Defense','Passing'])
Another approach:
import numpy as np
def highlight(x):
    c1 = 'background-color:green'
    c2 = 'background-color:red'
    c3 = 'background-color:yellow'
    cols = ['Offence','Defense','Passing']
    # DataFrame with the same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # set the styles of the selected columns using boolean masks
    df1[cols] = np.select([x[cols] > 60, x[cols] < 40], [c1, c2], default=c3)
    return df1
df.style.apply(highlight, axis=None)
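In pandas 2.1 and later, Styler.applymap is deprecated in favour of Styler.map, which takes the same arguments. The sketch below, on a cut-down copy of the data, separates the colour rule from the styling call so the rule can be checked without rendering anything; color_for and css are illustrative names.

```python
import pandas as pd

# Cut-down copy of the question's data.
df = pd.DataFrame({
    "Jugador": ["M. Sanchez", "L. Ovalle", "K. Abud"],
    "Offence": [84, 90, 47],
    "Defense": [50, 36, 86],
    "Passing": [78, 81, 71],
})
cols = ["Offence", "Defense", "Passing"]

def color_for(v):
    """Return the CSS rule for a single cell value."""
    if v > 60:
        return "background-color:green"
    if v < 40:
        return "background-color:red"
    return "background-color:yellow"

# Frame of CSS strings, one per cell, useful for checking the rule.
css = df[cols].apply(lambda col: col.map(color_for))
print(css)
```

Passing color_for to df.style.map(color_for, subset=cols), or to applymap on older versions, should then apply these same strings as cell styles.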

How can I best create a new df using index values from another df that are used to retrieve multiple values?

The nn_idx_df contains index values that match the index of xyz_df. How can I get the values from column H in xyz_df and create new columns in nn_idx_df to match the result illustrated in output_df? I could hack my way through this, but would like to see a pandorable solution.
nn_idx_df = pd.DataFrame({'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1}})
print(nn_idx_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx
0 65 64 69 75 70
1 7 9 67 13 66
2 18 64 68 65 1
xyz_df = pd.DataFrame({'X': {1: 6401652.35,
7: 6401845.46,
9: 6401671.93,
13: 6401868.98,
18: 6401889.78,
64: 6401725.71,
65: 6401663.04,
66: 6401655.89,
67: 6401726.33,
68: 6401755.92,
69: 6401755.23,
70: 6401766.23,
75: 6401825.9},
'Y': {1: 1858548.15,
7: 1858375.68,
9: 1858490.83,
13: 1858403.79,
18: 1858423.25,
64: 1858579.25,
65: 1858570.3,
66: 1858569.97,
67: 1858607.8,
68: 1858581.58,
69: 1858591.46,
70: 1858517.48,
75: 1858420.72},
'Z': {1: 467.62,
7: 482.22,
9: 459.15,
13: 485.17,
18: 488.35,
64: 488.88,
65: 465.75,
66: 467.35,
67: 486.12,
68: 490.12,
69: 490.68,
70: 483.96,
75: 467.39},
'H': {1: 47.8791,
7: 45.5502,
9: 46.0995,
13: 41.9554,
18: 41.0537,
64: 47.1215,
65: 46.0047,
66: 45.936,
67: 40.5807,
68: 37.8478,
69: 37.1639,
70: 37.2314,
75: 25.8446}})
print(xyz_df)
X Y Z H
1 6401652.35 1858548.15 467.62 47.8791
7 6401845.46 1858375.68 482.22 45.5502
9 6401671.93 1858490.83 459.15 46.0995
13 6401868.98 1858403.79 485.17 41.9554
18 6401889.78 1858423.25 488.35 41.0537
64 6401725.71 1858579.25 488.88 47.1215
65 6401663.04 1858570.30 465.75 46.0047
66 6401655.89 1858569.97 467.35 45.9360
67 6401726.33 1858607.80 486.12 40.5807
68 6401755.92 1858581.58 490.12 37.8478
69 6401755.23 1858591.46 490.68 37.1639
70 6401766.23 1858517.48 483.96 37.2314
75 6401825.90 1858420.72 467.39 25.8446
output_df = pd.DataFrame(
{'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1},
'nn_1_idx_h': {0: 46.0047, 1: 45.5502, 2: 41.0537},
'nn_2_idx_h': {0: 47.1215, 1: 46.0995, 2: 47.1215},
'nn_3_idx_h': {0: 37.1639, 1:40.5807, 2: 37.8478},
'nn_4_idx_h': {0: 25.8446, 1: 41.9554, 2: 46.0047},
'nn_5_idx_h': {0: 37.2314, 1: 45.9360, 2: 47.8791}})
print(output_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx nn_1_idx_h nn_2_idx_h nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 75 70 46.0047 47.1215 37.1639 25.8446 37.2314
1 7 9 67 13 66 45.5502 46.0995 40.5807 41.9554 45.9360
2 18 64 68 65 1 41.0537 47.1215 37.8478 46.0047 47.8791

Use replace together with join:
df=nn_idx_df.join(nn_idx_df.replace(xyz_df.H).add_suffix('_h'))
df
nn_1_idx nn_2_idx nn_3_idx ... nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 ... 37.1639 25.8446 37.2314
1 7 9 67 ... 40.5807 41.9554 45.9360
2 18 64 68 ... 37.8478 46.0047 47.8791
[3 rows x 10 columns]
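A variation on the same idea, sketched on a subset of the data, looks each index up with Series.map instead of replace. The difference is worth checking against your needs: an index missing from xyz_df becomes NaN here, whereas replace would leave the original integer in place. The names looked_up and out are illustrative.

```python
import pandas as pd

# Subset of the question's frames.
nn_idx_df = pd.DataFrame({"nn_1_idx": [65, 7], "nn_2_idx": [64, 9]})
xyz_df = pd.DataFrame({"H": {7: 45.5502, 9: 46.0995, 64: 47.1215, 65: 46.0047}})

# Map every index column through xyz_df["H"], then suffix the new columns.
looked_up = nn_idx_df.apply(lambda col: col.map(xyz_df["H"])).add_suffix("_h")

# Attach the looked-up values next to the original index columns.
out = nn_idx_df.join(looked_up)
print(out)
```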

Dataframe with column of strings to column of integer lists

I have a dataframe where in one column, the data for each row is a string like this:
[[25570], [26000]]
I want each entry in the series to become a list of integers.
IE:
[25570, 26000]
^ ^
int int
So far I can get it to a list of strings, but retaining empty spaces:
s = s.str.replace("[","").str.replace("]","")
s = s.str.replace(" ","").str.split(",")
Dict for Dataframe:
f = {'chunk': {0: '[72]',
1: '[72, 68]',
2: '[72, 68, 65]',
3: '[72, 68, 65, 70]',
4: '[72, 68, 65, 70, 67]',
5: '[72, 68, 65, 70, 67, 74]',
6: '[68]',
7: '[68, 65]',
8: '[68, 65, 70]',
9: '[68, 65, 70, 67]'},
'chunk_completed': {0: '[25570]',
1: '[26000]',
2: '[26240]',
3: '[26530]',
4: '[26880]',
5: '[27150]',
6: '[26000]',
7: '[26240]',
8: '[26530]',
9: '[26880]'},
'chunk_id': {0: '72',
1: '72-68',
2: '72-68-65',
3: '72-68-65-70',
4: '72-68-65-70-67',
5: '72-68-65-70-67-74',
6: '68',
7: '68-65',
8: '68-65-70',
9: '68-65-70-67'},
'diffs_avg': {0: nan,
1: 430.0,
2: 335.0,
3: 320.0,
4: 327.5,
5: 316.0,
6: nan,
7: 240.0,
8: 265.0,
9: 293.3333333333333},
'sd': {0: nan,
1: nan,
2: 134.35028842544406,
3: 98.48857801796105,
4: 81.80260794538685,
5: 75.3657747256671,
6: nan,
7: nan,
8: 35.355339059327385,
9: 55.075705472861024},
'timecodes': {0: '[[25570]]',
1: '[[25570], [26000]]',
2: '[[25570], [26000], [26240]]',
3: '[[25570], [26000], [26240], [26530]]',
4: '[[25570], [26000], [26240], [26530], [26880]]',
5: '[[25570], [26000], [26240], [26530], [26880], [27150]]',
6: '[[26000]]',
7: '[[26000], [26240]]',
8: '[[26000], [26240], [26530]]',
9: '[[26000], [26240], [26530], [26880]]'}}

Try this:
f = pd.DataFrame().from_dict(s, orient='index')
f.columns = ['timecodes']
f['timecodes'].apply(lambda x: [a[0] for a in eval(x) if a])
Output
Out[622]:
0 [25570]
1 [25570, 26000]
2 [25570, 26000, 26240]
3 [25570, 26000, 26240, 26530]
4 [25570, 26000, 26240, 26530, 26880]
5 [25570, 26000, 26240, 26530, 26880, 27150]
6 [26000]
7 [26000, 26240]
8 [26000, 26240, 26530]
9 [26000, 26240, 26530, 26880]
10 [26000, 26240, 26530, 26880, 27150]
11 [26240]
12 [26240, 26530]
13 [26240, 26530, 26880]
14 [26240, 26530, 26880, 27150]
15 [26530]
16 [26530, 26880]
17 [26530, 26880, 27150]
18 [26880]
19 [26880, 27150]
Name: 0, dtype: object
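Because eval executes whatever expression the string contains, a more cautious variant is ast.literal_eval, which only accepts Python literals. A minimal sketch, assuming every entry is a string holding a list of one-element lists as in the question (the sample series here is a shortened stand-in, and flat is an illustrative name):

```python
import ast

import pandas as pd

# Shortened stand-in for the 'timecodes' column.
s = pd.Series(["[[25570]]", "[[25570], [26000]]"], name="timecodes")

# Parse each string as a literal, then take the single int from each inner list.
flat = s.apply(lambda text: [inner[0] for inner in ast.literal_eval(text) if inner])
print(flat)
```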

Getting the max from a nested default dictionary

I'm trying to obtain the maximum value from every dictionary in a default dictionary of default dictionaries using Python3.
Dictionary Set Up:
d = defaultdict(lambda: defaultdict(int))
My iterator runs through the dictionaries and the csv data I'm using just fine, but when I call max, it doesn't necessarily return the max every time.
Example output:
defaultdict(<class 'int'>, {0: 106, 2: 35, 3: 12})
max = (0, 106)
defaultdict(<class 'int'>, {0: 131, 1: 649, 2: 338, 3: 348, 4: 276, 5: 150, 6: 138, 7: 89, 8: 54, 9: 22, 10: 5, 11: 2})
max = (0, 131)
defaultdict(<class 'int'>, {0: 39, 1: 13, 2: 30, 3: 15, 4: 5, 5: 10, 6: 1, 8: 1})
max = (0, 39)
defaultdict(<class 'int'>, {0: 40, 1: 53, 2: 97, 3: 80, 4: 154, 5: 203, 6: 173, 7: 142, 8: 113, 9: 76, 10: 55, 11: 22, 12: 13, 13: 7})
max = (0, 40)
So sometimes it's right, but far from perfect.
My approach was informed by the answer to this question, but I adapted it to try and make it work for a nested default dictionary. Here's the code I'm using to find the max:
for sub_d in d:
    outer_dict = d[sub_d]
    print(max(outer_dict.items(), key=lambda x: outer_dict.get(x, 0)))
Any insight would be greatly appreciated. Thanks so much.

If you check the values in outer_dict.items(), you will see that they are (key, value) tuples. Since those tuples are not keys in your dictionary, every lookup returns 0, so all items tie and max returns the first one (the item at index 0).
max(a.keys(),key = lambda x: a.get(x,0))
will get you the key of the max value; you can then retrieve the value by looking that key up in the dictionary.
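Applied to a nested defaultdict like the one in the question, that approach might look like the sketch below. The sample counts are taken from the question's output; the outer keys "a" and "b" and the name best are made up for illustration.

```python
from collections import defaultdict

# Nested defaultdict as in the question, filled with sample counts.
d = defaultdict(lambda: defaultdict(int))
d["a"].update({0: 106, 2: 35, 3: 12})
d["b"].update({0: 131, 1: 649, 2: 338, 3: 348})

best = {}
for name, counts in d.items():
    # Iterate over keys and rank them by their stored value.
    top_key = max(counts, key=counts.get)
    best[name] = (top_key, counts[top_key])

print(best)  # {'a': (0, 106), 'b': (1, 649)}
```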

In
max(outer_dict.items(), key=lambda x: outer_dict.get(x, 0))
the outer_dict.items() call returns an iterator that produces (key, value) tuples of the items in outer_dict. So the key function gets passed a (key, value) tuple as its x argument, and then tries to find that tuple as a key in outer_dict, and of course that's not going to succeed, so the get call always returns 0.
Instead, we can use a key function that extracts the value from the tuple. For example:
nested = {
'a': {0: 106, 2: 35, 3: 12},
'b': {0: 131, 1: 649, 2: 338, 3: 348, 4: 276, 5: 150, 6: 138, 7: 89,
8: 54, 9: 22, 10: 5, 11: 2},
'c': {0: 39, 1: 13, 2: 30, 3: 15, 4: 5, 5: 10, 6: 1, 8: 1},
'd': {0: 40, 1: 53, 2: 97, 3: 80, 4: 154, 5: 203, 6: 173, 7: 142,
8: 113, 9: 76, 10: 55, 11: 22, 12: 13, 13: 7},
}
for k, subdict in nested.items():
    print(k, max((t for t in subdict.items()), key=lambda t: t[1]))
output
a (0, 106)
b (1, 649)
c (0, 39)
d (5, 203)
A more efficient alternative to that lambda is to use itemgetter. Here's a version that puts the maxima into a dictionary:
from operator import itemgetter
nested = {
'a': {0: 106, 2: 35, 3: 12},
'b': {0: 131, 1: 649, 2: 338, 3: 348, 4: 276, 5: 150, 6: 138, 7: 89,
8: 54, 9: 22, 10: 5, 11: 2},
'c': {0: 39, 1: 13, 2: 30, 3: 15, 4: 5, 5: 10, 6: 1, 8: 1},
'd': {0: 40, 1: 53, 2: 97, 3: 80, 4: 154, 5: 203, 6: 173, 7: 142,
8: 113, 9: 76, 10: 55, 11: 22, 12: 13, 13: 7},
}
ig1 = itemgetter(1)
maxes = {k: max((t for t in subdict.items()), key=ig1)
for k, subdict in nested.items()}
print(maxes)
output
{'a': (0, 106), 'b': (1, 649), 'c': (0, 39), 'd': (5, 203)}
We define ig1 outside the dictionary comprehension so that we don't call itemgetter(1) on every iteration of the outer loop.
