Compare groupby output for different dataframes

What would be a good way to compare multiple groupby outputs?
I have multiple groupby outputs from different dataframes, like below
>>> tmp1
account place balance type
0 A A1 10 B1
1 A A1 20 B1
2 A A1 30 B1
3 A A1 10 B4
4 A A1 20 B4
5 A A1 10 B5
6 A A1 10 B6
7 B A2 10 B7
8 B A2 20 B1
9 B A2 100 B1
I do
>>> tmp1.groupby(['account','place','type'])['balance'].last().groupby(level=0).sum().astype(int)
account
A 70
B 110
Name: balance, dtype: int64
Similarly
>>> tmp2
account place balance type
0 A A1 100 B1
1 A A1 200 B1
2 A A1 100 B1
3 A A1 100 B4
4 A A1 200 B4
5 A A1 100 B5
6 A A1 100 B6
7 B A2 100 B7
8 B A2 200 B1
9 B A2 200 B1
>>> tmp2.groupby(['account','place','type'])['balance'].last().groupby(level=0).sum().astype(int)
account
A 500
B 300
Name: balance, dtype: int64
# similarly for tmp3, and so on
Is there a way to find the DataFrame with the maximum total balance? For example, in this case tmp2 has the greater sum (70 + 110 < 500 + 300).
My try:
One of the ways I tried was taking the sum and maintaining a list, like below
mylist=[]
mylist.append(tmp1.groupby(['account','place','type'])['balance'].last().sum().astype(int))
mylist.append(tmp2.groupby(['account','place','type'])['balance'].last().sum().astype(int))
>>> mylist
[180,800]
Now I can take the max of the list, but I lose the account information (800 is the max, but I also need to know that account A has 500 and B has 300).
I tried
>>> tmp2.groupby(['account','place','type'])['balance'].last().groupby(level=0).sum().to_dict()
{'A': 500, 'B': 300}
So for every DataFrame I have a dict, and I just need to find the one with the maximum total (I think I have come very close to solving it).
I want to find which DataFrame has the maximum sum, along with its per-account breakdown.

If I understand you correctly, the following also works when you have more than two DataFrames:
import numpy as np
import pandas as pd
tmp1 = pd.DataFrame([{'acount':'A', 'balance':100, 'type':'A1'},
{'acount':'A', 'balance':200, 'type':'A2'},
{'acount':'B', 'balance':200, 'type':'B1'},
{'acount':'B', 'balance':300, 'type':'B2'}])
tmp2 = pd.DataFrame([{'acount':'A', 'balance':100, 'type':'A1'},
{'acount':'A', 'balance':200, 'type':'A2'},
{'acount':'B', 'balance':400, 'type':'B1'},
{'acount':'B', 'balance':300, 'type':'B2'}])
tmplist = [tmp1,tmp2]
tmprlist = [tmp.groupby(['acount','type']).last().groupby(level=0).sum().astype(int) for tmp in tmplist]
tmpslist = [tmp.groupby(['acount','type'])['balance'].last().sum() for tmp in tmplist]
tmprlist[np.argmax(tmpslist)]
Result:
acount balance
A 300
B 700
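As a minimal sketch of the same idea applied to the data from the question, the per-account totals can be kept in a dict keyed by a label for each DataFrame, so the winner and its breakdown are both available. The helper name per_account_totals and the labels 'tmp1' and 'tmp2' are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Data copied from the question
tmp1 = pd.DataFrame({
    'account': list('AAAAAAABBB'),
    'place': ['A1'] * 7 + ['A2'] * 3,
    'balance': [10, 20, 30, 10, 20, 10, 10, 10, 20, 100],
    'type': ['B1', 'B1', 'B1', 'B4', 'B4', 'B5', 'B6', 'B7', 'B1', 'B1'],
})
tmp2 = tmp1.assign(balance=[100, 200, 100, 100, 200, 100, 100, 100, 200, 200])

def per_account_totals(df):
    # Hypothetical helper: last balance per (account, place, type),
    # then summed per account
    last = df.groupby(['account', 'place', 'type'])['balance'].last()
    return last.groupby(level='account').sum()

# Keep a label for every frame so the winner can be identified afterwards
totals = {name: per_account_totals(df)
          for name, df in {'tmp1': tmp1, 'tmp2': tmp2}.items()}

# Pick the label whose per-account totals add up to the largest value
best_name = max(totals, key=lambda name: totals[name].sum())
best = totals[best_name]
print(best_name)
print(best.to_dict())
```

Using groupby(level=...).sum() instead of sum(level=...) keeps the sketch usable on recent pandas releases, where the level argument of sum is no longer available.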

Related

Filter grouped Pandas data frame by column aggregate, when groups are from a MultiIndex level

How can I drop data from one level of a multi-level indexed data frame, based on aggregated information I get from a column within a groupby on that level?
For example, with data frame dfmi:
import numpy as np
import pandas as pd
midx = pd.MultiIndex.from_product([['A0','A1','A2'], ['B0','B1','B2']], names=["index_1", "index_2"])
columns = ['foo', 'bar']
dfmi = pd.DataFrame(np.arange(18).reshape((len(midx), len(columns))),
                    index=midx, columns=columns)
dfmi
                 foo  bar
index_1 index_2
A0      B0         0    1
        B1         2    3
        B2         4    5
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17
Let's say I only want to keep levels of index_1 if the mean for foo exceeds a certain threshold.
Like this:
thresh = 5
for grp, data in dfmi.groupby("index_1"):
    print(data.foo.mean() > thresh)
False <-- drop this level
True
True
Desired output:
                 foo  bar
index_1 index_2
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17
In this toy example I can get what I want with dfmi.loc[pd.IndexSlice["A1":"A2", :]]. But I can't figure out how to use IndexSlice or loc variants to do aggregations inside a grouped MultiIndex and then slice the full data frame based on the results.
My best solution so far is to just keep track of the level values that qualify as keepers (with grp), and then use the accumulated keepers collection with IndexSlice:
keepers = list()
for grp, data in dfmi.groupby("index_1"):
    if data.foo.mean() > thresh:
        keepers.append(grp)
dfmi.loc[pd.IndexSlice[keepers, :]]
I'm looking for a more efficient and/or more elegant way to accomplish that with native Pandas functionality.

You can use loc once you have created your mask, like so:
mask = dfmi.groupby(level=0)['foo'].mean()>thresh
dfmi.loc[mask.index[mask]]
Yields:
                 foo  bar
index_1 index_2
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17
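Two other built-in routes may be worth comparing with the mask approach: GroupBy.filter, which keeps or drops whole groups, and GroupBy.transform, which broadcasts the group mean back to every row. The sketch below rebuilds the dfmi from the question; the variable names kept and kept_alt are assumptions for illustration.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1', 'B2']],
                                  names=['index_1', 'index_2'])
dfmi = pd.DataFrame(np.arange(18).reshape(len(midx), 2),
                    index=midx, columns=['foo', 'bar'])
thresh = 5

# filter() keeps every row of a group when the function returns True for it
kept = dfmi.groupby(level='index_1').filter(lambda g: g['foo'].mean() > thresh)

# transform('mean') returns the group mean aligned to each original row,
# so the comparison yields a row-level boolean mask
kept_alt = dfmi[dfmi.groupby(level='index_1')['foo'].transform('mean') > thresh]

print(kept)
```

The transform variant avoids calling a Python function per group, which tends to matter when there are many groups.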

How can I efficiently replicate a pandas row, changing only one column?

I have a dataframe that looks like this:
v1 v2
0 a A1
1 b A2,A3
2 c B4
3 d A5, B6, B7
I want to modify this dataframe such that any row which has more than one value in the v2 column gets replicated for each value in v2. For example for the above dataframe, the result is as follows:
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
I was able to do this with the following code:
new_df = pd.DataFrame()
for index, row in df.iterrows():
    if len(row["v2"].split(',')) > 1:
        row_base = row
        for r in row["v2"].split(','):
            row_base["v2"] = r
            new_df = new_df.append(row_base, ignore_index=True)
    else:
        new_df = new_df.append(row)
However, it is extremely inefficient on a large dataframe, and I would like to learn how to do it more efficiently.

Pandas solution for 0.25+ version with Series.str.split and DataFrame.explode:
df = df.assign(v2 = df.v2.str.split(',')).explode('v2').reset_index(drop=True)
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
For older versions, and possibly for better performance, use numpy:
from itertools import chain
s = df.v2.str.split(',')
lens = s.str.len()
df = pd.DataFrame({
'v1' : df['v1'].values.repeat(lens),
'v2' : list(chain.from_iterable(s.values.tolist()))
})
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
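One detail to watch: the sample data in the question has spaces after some commas ('A5, B6, B7'), so a plain split(',') would leave leading whitespace in the pieces. A minimal sketch of the explode approach with a strip step added, assuming pandas 0.25 or later and the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'v1': list('abcd'),
                   'v2': ['A1', 'A2,A3', 'B4', 'A5, B6, B7']})

out = (
    df.assign(v2=df['v2'].str.split(','))           # each cell becomes a list
      .explode('v2')                                # one row per list element
      .assign(v2=lambda d: d['v2'].str.strip())     # drop spaces around values
      .reset_index(drop=True)
)
print(out)
```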

pandas.merge with coinciding column names

Consider the following data frames:
import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
A id
0 A0 f
1 A1 g
2 A2 h
3 A3 i
4 A4 j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
B id
0 B0 f
1 B1 g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
B id
0 B3 i
1 B4 j
I want to merge them to get
A id B
0 A0 f B0
1 A1 g B1
2 A2 h NaN
3 A3 i B3
4 A4 j B4
Inspired by this answer I tried
from functools import reduce  # needed on Python 3, where reduce is not a builtin
final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])
but unfortunately it yields
A id B_x B_y
0 A0 f B0 NaN
1 A1 g B1 NaN
2 A2 h NaN NaN
3 A3 i NaN B3
4 A4 j NaN B4
Additionally, I checked out this question but I can't adapt the solution to my problem. Also, I didn't find any options in the docs for pandas.merge to make this happen.
In my real world problem the list of data frames might be much longer and the size of the data frames might be much larger.
Is there any "pythonic" way to do this directly and without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 would overlap (so if there might be multiple candidates for some value in column B of the final data frame).

Consider pd.concat + groupby?
pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()
id A B
0 f A0 B0
1 g A1 B1
2 h A2 NaN
3 i A3 B3
4 j A4 B4
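Note that groupby(...).first() silently picks one value if two frames supply B for the same id. For the requested behaviour of raising on overlap, one possible sketch is to stack the partial frames on the id index with verify_integrity=True, which makes pd.concat raise a ValueError on duplicate index values, and then left-merge the result onto df1. The function name merge_disjoint is a hypothetical helper, not a pandas API.

```python
import pandas as pd

df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})

def merge_disjoint(base, others, key='id'):
    # Hypothetical helper: stack the partial frames on the key;
    # verify_integrity=True raises ValueError if any key occurs twice
    stacked = pd.concat([o.set_index(key) for o in others],
                        verify_integrity=True)
    # Left merge keeps every row of base, leaving NaN where no value exists
    return base.merge(stacked.rename_axis(key).reset_index(),
                      on=key, how='left')

final = merge_disjoint(df1, [df2, df3])
print(final)
```

This assumes the partial frames share the same non-key columns; frames with different column sets would need a different combination step.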

Extract all the following rows in pandas

I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?

One option is to create a mask and use it for selection:
mask = df.B.str.startswith("a")
mask[~mask] = np.nan
df[mask.fillna(method='ffill').fillna(0).astype(int) == 1]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.

Use idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
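A related sketch uses cummax on the boolean mask: once the mask turns True it stays True for all later rows. Unlike idxmax, which returns the first label when no value is True (and would therefore select the whole frame), this returns an empty frame when nothing starts with "a". The sample frame below is reconstructed from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 7, 8],
                   'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})

# True from the first row whose B starts with "a" onwards
mask = df['B'].str.startswith('a').cummax()
result = df[mask]
print(result)

# With no matching row the mask is all False and the result is empty
no_match = df[df['B'].str.startswith('z').cummax()]
```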

If I understand your question correctly, here is how you do it :
df = pd.DataFrame(data={'A':[1,2,3,5,6,7,8],
'B' : ['b0','a0','c0','c1','a1','b1','b2']})
# index label of the first item beginning with "a"
index = df.index[df.B.str.startswith("a")][0]
desired_df = df.loc[index:, ['A', 'B']]
print(desired_df)
and you get:

Clean way of slicing + stacking pandas dataframe

I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another piece of software, but it is not in the proper format. This is why I'm trying to get it into the right format with pandas.
The problem is very simple: df consists of columns of identifiers (7 in the real case, only 3 in the following example), followed by the corresponding results by month. To be clear, it looks like
A B C date1result date2result date3result
a1 b1 c1 12 15 17
a2 b2 c3 5 8 3
But to be processed, I would need it to have one line per result, adding a column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
To be more precise, I manually edited all the date column names (after read_excel they looked like '01/01/2015 0:00:00' or something similar, and I was unable to access them. As a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?), so that the date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, with the first five columns being 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:,('Account','Customer Name','Postcode','segment','Rep')]
df_final=pd.Series([])
for year in [2015,2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year,month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i"%(month,year))
        df_date=pd.Series(date) #I don't know to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That roughly works, but it is very unclean: I get a (26376, 30) shaped df_final, the first column being the dates, then the results, but of course with '2015_01' as the column name, then all the '2015_02' through '2016_12' columns filled with NaN, and finally my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping

I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
Then, if the column names look like 2015_01, convert the date column with to_datetime:
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
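To get the row order asked for in the question (all dates for one identifier combination together), the melted frame can be sorted by the identifier columns and the date. A minimal sketch on the small example; in the real case id_cols would presumably be ['Account', 'Customer Name', 'Postcode', 'segment', 'Rep'], which is an assumption based on the column names given in the question.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'], 'C': ['c1', 'c3'],
                   '2015_01': [12, 5], '2016_10': [15, 8], '2016_12': [17, 3]})

id_cols = ['A', 'B', 'C']  # identifier columns that stay fixed per row

# Wide to long: one row per (identifiers, date) pair
long_df = df.melt(id_vars=id_cols, var_name='date', value_name='result')

# Parse labels such as '2015_01' into timestamps (first day of the month)
long_df['date'] = pd.to_datetime(long_df['date'], format='%Y_%m')

# Group each identifier combination together, ordered by date
long_df = long_df.sort_values(id_cols + ['date']).reset_index(drop=True)
print(long_df)
```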
