Compare two dataframes based on key - python

I have two dataframes, df1 and df2, which have the exact same columns and most of the time the same values for each key.
df1:
Country    A      B     C        D   E    F            G    H    Key
Argentina  xylo   262   4632     0   0    26.12        2    0    Argentinaxylo
Argentina  phone  6860  155811   48  0    4375.87      202  0    Argentinaphone
Argentina  land   507   1803728  2   117  7165.810566  3    154  Argentinaland
Australia  xylo   7650  139472   69  0    16858.42     184  0    Australiaxylo
Australia  mink   1284  2342788  1   0    39287.71     53   0    Australiamink
df2:
Country    A      B     C        D   E    F            G    H    Key
Argentina  xylo   262   4632     0   0    26.12        2    0    Argentinaxylo
Argentina  phone  6860  155811   48  0    4375.87      202  0    Argentinaphone
Argentina  land   507   1803728  2   117  7165.810566  3    154  Argentinaland
Australia  xylo   7650  139472   69  0    16858.42     184  0    Australiaxylo
Australia  mink   1284  2342788  1   0    39287.71     53   0    Australiamink
I want a snippet that compares the keys (key = column Country + column A) in each dataframe against each other and calculates the percent difference for each column B-H, if there is any. If there isn't, output nothing.

I hope the code below helps solve your problem. I compare the two datasets on the Key column: for each dataset I compute (B - H) as a percentage, then merge the two datasets on Key and compare the results; the final difference appears in the df3diff column of df3.
import pandas as pd

df1 = pd.DataFrame([['Argentina', 'xylo', 262, 4632, 0, 0, 26.12, 2, 0, 'Argentinaxylo'],
                    ['Argentina', 'phone', 6860, 155811, 48, 0, 4375.87, 202, 0, 'Argentinaphone'],
                    ['Argentina', 'land', 507, 1803728, 2, 117, 7165.810566, 3, 154, 'Argentinaland'],
                    ['Australia', 'xylo', 7650, 139472, 69, 0, 16858.42, 184, 0, 'Australiaxylo'],
                    ['Australia', 'mink', 1284, 2342788, 1, 0, 39287.71, 53, 0, 'Australiamink']],
                   columns=['Country', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Key'])
df1['df1BH'] = (df1['B'] - df1['H']) / 100.00   # (B - H) as a percentage for df1
print(df1)

df2 = pd.DataFrame([['Argentina', 'xylo', 262, 4632, 0, 0, 26.12, 2, 0, 'Argentinaxylo'],
                    ['Argentina', 'phone', 6860, 155811, 48, 0, 4375.87, 202, 0, 'Argentinaphone'],
                    ['Argentina', 'land', 507, 1803728, 2, 117, 7165.810566, 3, 154, 'Argentinaland'],
                    ['Australia', 'xylo', 97650, 139472, 69, 0, 96858.42, 184, 0, 'Australiaxylo'],
                    ['Australia', 'mink', 1284, 2342788, 1, 0, 39287.71, 53, 0, 'Australiamink']],
                   columns=['Country', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Key'])
df2['df2BH'] = (df2['B'] - df2['H']) / 100.00   # (B - H) as a percentage for df2
print(df2)

# Merge the two results on Key and compare them
df3 = pd.merge(df1[['Key', 'df1BH']], df2[['Key', 'df2BH']], on=['Key'], how='outer')
df3['df3diff'] = df3['df1BH'] - df3['df2BH']
print(df3)
Output:
              Key   df1BH   df2BH  df3diff
0   Argentinaxylo    2.62    2.62      0.0
1  Argentinaphone   68.60   68.60      0.0
2   Argentinaland    3.53    3.53      0.0
3   Australiaxylo   76.50  976.50   -900.0
4   Australiamink   12.84   12.84      0.0
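Side note (not part of the original answer): the snippet above compares a single derived value per Key. If you want the percent difference for every column B-H matched on Key, a minimal sketch building on df1 and df2 defined above might look like this (the _1/_2 suffixes are just illustrative names added by the merge):
# Sketch: per-column percent difference for columns B-H, matched on Key.
# Assumes df1 and df2 from the answer above; division by zero yields inf/NaN and is left as-is.
merged = pd.merge(df1, df2, on='Key', suffixes=('_1', '_2'))
pct_cols = []
for col in list('BCDEFGH'):
    merged[f'{col}_pct_diff'] = (merged[f'{col}_2'] - merged[f'{col}_1']) / merged[f'{col}_1'] * 100
    pct_cols.append(f'{col}_pct_diff')
# Keep only the keys where at least one column actually differs
changed = merged[(merged[pct_cols].fillna(0) != 0).any(axis=1)]
print(changed[['Key'] + pct_cols])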

Related

Pandas read file with no delimiter and with different column widths

I want to read a plaintext file using pandas.
I have entries without delimiters and with different widths like this:
59967Y98Doe John 6211100004545SO20140314- 00024278
N0546664SCHMIDT-PETER 7441100008300AW20140314- 00023643
G4894jmhTAKLONSKY-JUERGEN 4211100005000TB20140315 00023882
34875738PODESBERG-SCHUMPERTS6211100003671SO20140315 00024622
1-8 is a string.
9-28 is a string.
29-31 is numeric.
32-34 is numeric.
35-41 is numeric.
42-43 is a string.
44-51 is a date (yyyyMMdd).
52 is minus or a blank
The rest is a currency amount without a decimal point (the last 2 digits are always after the decimal point). For example: - 00024278 = -242.78 €
I know there is pd.read_fwf
There is an argument width. I could do this:
pd.read_fwf(StringIO(txt), widths=[8], header="Peronal Nr.")
But how could I read my file with different column widths?
As the s in widths suggests, you can pass a list of widths:
pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None)
output:
0 1 2 3 4 5 6 7 8
0 59967Y98 Doe John 621 110 4545 SO 20140314 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 20140314 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 20140315 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 20140315 NaN 24622
If you want names and dtypes:
df = (pd.read_fwf(io.StringIO(txt), widths=[8, 20, 3, 3, 7, 2, 8, 1, 99], header=None,
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                  dtype={'A': str, 'B': str, 'C': int, 'D': int, 'E': int,
                         'F': str, 'G': str, 'H': str, 'I': int})
        .assign(G=lambda d: pd.to_datetime(d['G'], format='%Y%m%d'))
     )
output:
A B C D E F G H I
0 59967Y98 Doe John 621 110 4545 SO 2014-03-14 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 2014-03-14 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 2014-03-15 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 2014-03-15 NaN 24622
df.dtypes
A object
B object
C int64
D int64
E int64
F object
G datetime64[ns]
H object
I int64
dtype: object
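As an aside (not part of the answer above): the question also notes that the trailing amount has an implied two decimal places and that the sign sits in the preceding column. Building on the df above, one way to derive a signed numeric amount might be (treat this as a sketch):
# Sketch: combine the sign flag (column H, '-' or NaN) with the amount (column I,
# implied two decimal places) into a signed amount, e.g. '-' and 24278 -> -242.78.
sign = df['H'].eq('-').map({True: -1, False: 1})
df['amount'] = sign * df['I'] / 100
print(df[['A', 'H', 'I', 'amount']])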

Pandas: use apply to create 2 new columns

I have a dataset where column a holds the total number of values in columns e, i, d, t, which are strings separated by "-":
a e i d t
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
I want to create 8 new columns: 4 with the position-wise SUM of (e, i, d, t) and 4 with the product.
For example:
def funct_two_outputs(E, I, d, t, d_calib=50):
    return E + I + d + t, E * I * d * t
Expected output, first two values for row 0: SUM_0 = 40 + 0.5 + 30 + 1, SUM_1 = 80 + 0.3 + 32 + 1
The sum and product are example functions substituting my functions which are a bit more complicated.
I have written a function expand_on_col that splits all the e, i, d, t values into new columns:
def expand_on_col(df_, col_to_split="namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split;
    return a df with the col split, with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep, expand=True).add_prefix(prefix)
    df1 = pd.concat([df_, df1], axis=1).replace(np.nan, '-')
    return df1
Now I need to create 4 new columns that are the sum of e, i, d, t, and 4 that are the product.
Example output for SUM:
index a e i d t a-0 e-0 e-1 e-2 e-3 i-0 i-1 i-2 i-3 d-0 d-1 d-2 d-3 t-0 t-1 t-2 t-3 sum-0 sum-1 sum-2 sum-3
0 0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
1 1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
2 3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
3 5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
If I run the code with funct_one_outputs (which only returns the sum) it works, but with funct_two_outputs (sum and product) I get an error.
Here is the code:
import numpy as np
import pandas as pd

def expand_on_col(df_, col_to_split="namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split;
    return a df with the col split, with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep, expand=True).add_prefix(prefix)
    df1 = pd.concat([df_, df1], axis=1).replace(np.nan, '-')
    return df1

def funct_two_outputs(E, I, d, t, d_calib=50):  # the function I want to pass
    return E + I + d + t, E * I * d * t

def funct_one_outputs(E, I, d, t, d_calib=50):  # for now I can only use this one, can't use 2 return values
    return E + I + d + t

columns = ['e', 'i', 'd', 't']  # the columns to split
for col in columns:
    df = expand_on_col(df_=df, col_to_split=col, sep='-', prefix=f"{col}-")

cols_ = df.columns.drop(columns)
df[cols_] = df[cols_].apply(pd.to_numeric, errors="coerce")
df["a"] = df["a"].apply(pd.to_numeric, errors="coerce")
df.reset_index(inplace=True)

for i in range(max(df["a"])):
    name_1, name_2 = f"sum-{i}", f"mult-{i}"
    df[name_1] = df.apply(lambda row: funct_one_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
    # if I try to fill 2 outputs it won't work
    df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
OUT:
ValueError Traceback (most recent call last)
<ipython-input-306-85157b89d696> in <module>()
68 df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
69 #if i try and fill 2 outputs it wont work
---> 70 df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
71
72
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __setitem__(self, key, value)
3039 self._setitem_frame(key, value)
3040 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3041 self._setitem_array(key, value)
3042 else:
3043 # set column
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3074 )[1]
3075 self._check_setitem_copy()
-> 3076 self.iloc._setitem_with_indexer((slice(None), indexer), value)
3077
3078 def _setitem_frame(self, key, value):
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
1751 if len(ilocs) != len(value):
1752 raise ValueError(
-> 1753 "Must have equal len keys and value "
1754 "when setting with an iterable"
1755 )
ValueError: Must have equal len keys and value when setting with an iterable
Don't Use apply
If you can help it
s = pd.to_numeric(
df[['e', 'i', 'd', 't']]
.stack()
.str.split('-', expand=True)
.stack()
)
sums = s.sum(level=[0, 2]).rename('Sum')
prods = s.prod(level=[0, 2]).rename('Prod')
sums_prods = pd.concat([sums, prods], axis=1).unstack()
sums_prods.columns = [f'{o}-{i}' for o, i in sums_prods.columns]
df.join(sums_prods)
a e i d t Sum-0 Sum-1 Sum-2 Sum-3 Prod-0 Prod-1 Prod-2 Prod-3
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
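As an aside (not from the answer above): if you do want to keep the original row-wise apply, one common workaround is to collect both outputs and unzip them into the two target columns, for example inside the question's loop:
# Sketch (slots into the loop from the question): apply returns a Series of
# (sum, product) tuples; zip(*...) unpacks them into the two target columns.
out = df.apply(lambda row: funct_two_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'],
                                             d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
df[name_1], df[name_2] = zip(*out)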

Generating range of unique random numbers for each group in pandas

I have to create a range of unique random numbers between 0 and 99999 for each group.
I have the following dataset:
District Prefix Quota
A 98426 783
A 98427 223
A 98446 127
A 98626 51
B 98049 167
B 98079 153
B 98140 120
B 98159 139
B 98169 182
B 98249 86
B 98426 588
B 98446 96
C 98049 104
C 98060 68
C 98149 65
C 98150 68
C 98159 86
C 98160 80
C 98169 113
Code to reproduce:
import pandas as pd
df = pd.DataFrame([
['A', 98426, 783],
['A', 98427, 223],
['A', 98446, 127],
['A', 98626, 51],
['B', 98049, 167],
['B', 98079, 153],
['B', 98140, 120],
['B', 98159, 139],
['B', 98169, 182],
['B', 98249, 86],
['B', 98426, 588],
['B', 98446, 96],
['C', 98049, 104],
['C', 98060, 68],
['C', 98149, 65],
['C', 98150, 68],
['C', 98159, 86],
['C', 98160, 80],
['C', 98169, 113]
],
columns=['District', 'Prefix', 'Quota'])
So I have to generate "Quota" many unique numbers from 0 to 99999 and append each to the prefix, so I can create a 10-digit number.
I tried:
numbers = np.random.choice(range(99999), size=df.Quota.sum(), replace=False)
random = df.Prefix.repeat(df.Quota)*100000 + numbers
The problem is that this draws the unique numbers from 0 to 99999 across the sum of "Quota" for the whole dataframe, but I want the full range 0 to 99999 to be available for each prefix. For example, suppose np.random.choice(range(99999), size=df.Quota.sum(), replace=False) generated 16195 and it was added to the first prefix "98426"; that number would never be generated again, so it would not be available to the other prefixes. The range (0-99999) only needs to be unique within each prefix.
Here's a solution using transform, random.choice, and explode.
import numpy as np

def make_random_numbers(x):
    # x is the Quota series for one Prefix group
    total = x.sum()
    r = np.random.choice(range(99999), total, replace=False)  # unique draws within the group
    chunks = x.cumsum()[:-1]
    res = np.hsplit(r, chunks)  # one chunk of draws per row of the group
    return res

df["rand_items"] = df.groupby("Prefix")["Quota"].transform(make_random_numbers)
df.explode("rand_items")
The result is:
District Prefix Quota rand_items
0 A 98426 783 2681
0 A 98426 783 94952
0 A 98426 783 79496
0 A 98426 783 58361
0 A 98426 783 54883
0 A 98426 783 44819
0 A 98426 783 36209
0 A 98426 783 91710
...
18 C 98169 113 41859
18 C 98169 113 92311
18 C 98169 113 18572
18 C 98169 113 72492
18 C 98169 113 39188
18 C 98169 113 36808
18 C 98169 113 32055
18 C 98169 113 74678
Alternatively, this method returns an array of random choices for each prefix group:
def gen_rand(x):
    # x is the sub-frame for one Prefix group; add the prefix * 1E5 to each draw
    return (x['Prefix'].min() * 1E5
            + np.random.choice(range(99999), size=sum(x['Quota']), replace=False)).astype(int)

df.groupby('Prefix').apply(gen_rand)
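Either way, to get from the per-row draws to the final 10-digit numbers, a small follow-up sketch (assuming the rand_items column produced by the first answer):
# Sketch: attach each draw to its prefix to form the 10-digit number.
# Note: range(99999) above excludes 99999 itself; use range(100000) if the full
# 0-99999 range is required.
out = df.explode("rand_items")
out["number"] = out["Prefix"] * 100000 + out["rand_items"].astype(int)
print(out[["District", "Prefix", "number"]].head())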

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the authors:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the authors output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … Wow. What an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution also provided by @AndyHayden, which takes about 20 seconds to perform the sort. That dataset is 58,000+ rows. The numpy solution returns the sort instantly.
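For completeness, a sketch building on the question's own rename/groupby code (not part of the original thread): after the in-place numpy sort, the original task can be finished like this:
# Sketch: rename the now order-insensitive columns and count flights per city pair.
flights = flights.rename(columns={'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'})
flights_ct2 = flights.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))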

Group by, aggregate, include separate column

Here's my data:
foo = pd.DataFrame({
'accnt' : [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
'gender' : [0, 1 , 0, 1, 0, 0, 1 , 0, 1, 0],
'date' : pd.to_datetime(["2019-01-01 00:10:21", "2019-01-05 00:09:18", "2019-01-05 00:09:30", "2019-02-05 00:05:12", "2019-04-01 00:08:46",
"2019-04-01 00:11:31", "2019-02-06 00:01:39", "2019-01-26 00:15:14", "2019-01-21 00:12:36", "2019-03-01 00:09:31"]),
'value' : [10, 20, 30, 40, 50, 5, 2, 6, 48, 96]
})
Which is:
accnt date gender value
0 101 2019-01-01 00:10:21 0 10
1 102 2019-01-05 00:09:18 1 20
2 103 2019-01-05 00:09:30 0 30
3 104 2019-02-05 00:05:12 1 40
4 105 2019-04-01 00:08:46 0 50
5 101 2019-04-01 00:11:31 0 5
6 102 2019-02-06 00:01:39 1 2
7 103 2019-01-26 00:15:14 0 6
8 104 2019-01-21 00:12:36 1 48
9 105 2019-03-01 00:09:31 0 96
I want to do the following:
- Group by accnt, include gender, take latest date as latest_date, count number of transactions as txn_count; resulting in:
accnt gender latest_date txn_count
101 0 2019-04-01 00:11:31 2
102 1 2019-02-06 00:01:39 2
103 0 2019-01-26 00:15:14 2
104 1 2019-02-05 00:05:12 2
105 0 2019-04-01 00:08:46 2
In R, I can do this using group_by and summarise from dplyr:
foo %>% group_by(accnt) %>%
summarise(gender = last(gender), most_recent_order_date = max(date), order_count = n()) %>% data.frame()
I'm taking last(gender) just to include it; since gender is the same throughout for any accnt, I could take min, max, or mean instead.
How can I do the same in python using pandas?
I've tried:
foo.groupby('accnt').agg({'gender' : ['mean'],
'date': ['max'],
'value': ['count']}).rename(columns = {'gender' : "gender",
'date' : "most_recent_order_date",
'value' : "order_count"})
But this leads to "extra" column names. I'd also like to know the best way to include a non-aggregation column like gender in the result.
In R, summarise corresponds to agg and mutate corresponds to transform.
The reason you get multiple levels in the columns is that you pass the functions as lists, which allows things like {'date': ['mean', 'sum']}.
foo.groupby('accnt').agg({'gender' : 'first',
'date': 'max',
'value': 'count'}).rename(columns = {'date' : "most_recent_order_date",
'value' : "order_count"}).reset_index()
Out[727]:
accnt most_recent_order_date order_count gender
0 101 2019-04-01 00:11:31 2 0
1 102 2019-02-06 00:01:39 2 1
2 103 2019-01-26 00:15:14 2 0
3 104 2019-02-05 00:05:12 2 1
4 105 2019-04-01 00:08:46 2 0
An example: here I call two functions at the same time on one column, so two levels of column index are needed so the output column names do not collide:
foo.groupby('accnt').agg({'gender' : ['first','mean']})
Out[728]:
gender
first mean
accnt
101 0 0
102 1 1
103 0 0
104 1 1
105 0 0
Sorry for the late response. Here's a solution I found.
# Pandas Operations
foo = foo.groupby('accnt').agg({'gender' : ['mean'],
'date': ['max'],
'value': ['count']})
# Drop additionally created column names from Pandas Operations
foo.columns = foo.columns.droplevel(1)
# Rename original column names
foo.rename( columns = { 'date':'latest_date',
'value':'txn_count'},
inplace=True)
If you'd like to include an additional non-aggregate column, you can simply append a new column to the grouped foo dataframe.
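A side note (not from the answers above): pandas 0.25+ also supports named aggregation, which produces flat column names directly and avoids the droplevel/rename step; a minimal sketch:
# Sketch: named aggregation (pandas >= 0.25) gives flat output column names.
out = (foo.groupby('accnt')
          .agg(gender=('gender', 'first'),
               latest_date=('date', 'max'),
               txn_count=('value', 'count'))
          .reset_index())
print(out)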
