Pandas apply function to each row by calculating multiple columns - python

I have been stuck on an easy question, and my question title might not be the best fit.
df = pd.DataFrame(list(zip(['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                           ['a1', 'a2', 'b1', 'b2', 'c1', 'c2', 'c3'],
                           [110, 80, 100, 180, 12],
                           [5, 7, 2, 6, 10])),
                  columns=['name', 'ingredient', 'amount', 'con'])
I want to calculate (df.amount * df.con) / df.groupby('name').agg({'amount': 'sum'}).reset_index().loc[df.name == i].amount (sorry, this line returns an error, but what I want is the total concentration under each name, based on each ingredient's amount and con).
Here is my code:
df['cal'] = df.amount * df.con
df = df.merge(df.groupby('name').agg({'amount': 'sum'}).reset_index(),
              on=['name'], how='left', suffixes=(None, '_y'))
df['what_i_want'] = df['cal'] / df['amount_y']
df.groupby('name').what_i_want.sum()
output:
name
a 5.842105
b 4.571429
c 10.000000
Name: what_i_want, dtype: float64
Is there a shortcut for this calculation?
Thanks in advance.

IIUC, you can use:
out = (df
       .groupby('name')
       .apply(lambda g: g['amount'].mul(g['con']).sum() / g['amount'].sum())
       )
output:
name
a 5.842105
b 4.571429
c 10.000000
dtype: float64

To shortcut the operations (and in particular remove the merge), you can use groupby.transform, which will retain the original index:
df["what_i_want_2"] = (df["amount"] * df["con"]) / (
    df.groupby("name")["amount"].transform("sum")
)
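As a quick sanity check (a sketch, reusing the df built in the question), summing the per-row what_i_want_2 values per name should reproduce the groupby/apply result shown above:
# per-name totals; expected to match the earlier output (a 5.842105, b 4.571429, c 10.0)
df.groupby('name')['what_i_want_2'].sum()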

Related

Apply function for multiple levels of tables/data

I ran into a problem in my work. I have these tables:
import pandas as pd
import numpy as np
level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
                      columns=['name', 'value'])
name value
0 a 3
1 b x
2 c x
I want to sum the value column, but it contains “x”s, so I have to use a second table to calculate the “x”s:
level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'], ['b1', 'b2', 'c1', 'c2', 'c3'], [5, 7, 2, 'x', 9])),
                      columns=['name', 'sub', 'value'])
name sub value
0 b b1 5
1 b b2 7
2 c c1 2
3 c c2 x
4 c c3 9
I should sum b1 and b2 to fill in the “x” for b in the level1 table (x = 12). But for c there is another “x”, so there is a third-level table:
level3 = pd.DataFrame(list(zip(['c', 'c', 'c'], ['c1', 'c2', 'c3'], [2, 4, 9])),
                      columns=['name', 'sub', 'value'])
name sub value
0 c c1 2
1 c c2 4
2 c c3 9
Now we can get the sum of the value column in the level1 table.
My question is: can we use a function to calculate this easily? If there are more levels, how can we loop through them until there is no “x” left?
It is OK to combine level2 and level3.
Here's a way using combine_first and replace:
from functools import reduce
l1 = level1.assign(sub=level1['name']+'1').replace('x', np.nan).set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan).set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan).set_index(['name', 'sub'])
reduce(lambda x, y: x.combine_first(y), [l3,l2,l1]).groupby(level=0).sum()
Output:
value
name
a 3.0
b 12.0
c 15.0
Complete example:
import pandas as pd
import numpy as np
level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
columns=['name', 'value'])
level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'],
['b1', 'b2', 'c1', 'c2', 'c3'],
[5, 7, 2, 'x', 9])),
columns=['name', 'sub', 'value'])
level3 = pd.DataFrame(list(zip(['c', 'c', 'c'],
['c1', 'c2', 'c3'],
[2, 4, 9])),
columns=['name', 'sub', 'value'])
from functools import reduce
l1 = level1.assign(sub=level1['name']+'1')\
           .replace('x', np.nan)\
           .set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan)\
           .set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan)\
           .set_index(['name', 'sub'])
out = reduce(lambda x, y: x.combine_first(y),
             [l3, l2, l1]).groupby(level=0).sum()
print(out)
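To address the "more levels" part of the question: the same reduce pattern extends to any number of levels, since combine_first just keeps filling the remaining NaNs from the next frame. A sketch, assuming the level frames are collected in a list ordered from level1 down to the deepest level (the level-1 frame still needs the synthetic sub key, as above):
levels = [level1.assign(sub=level1['name'] + '1'), level2, level3]  # append deeper levels here as needed
prepared = [lvl.replace('x', np.nan).set_index(['name', 'sub']) for lvl in levels]
# fold from the deepest level up so the most detailed values take priority
out = (reduce(lambda x, y: x.combine_first(y), reversed(prepared))
       .groupby(level=0).sum())
print(out)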
One option is a combination of merge (multiple merges, actually) and a groupby:
(level2
 .merge(level3, on=['name', 'sub'], how='left', suffixes=(None, '_y'))
 .assign(value=lambda df: np.where(df.value.eq('x'), df.value_y, df.value))
 .groupby('name', as_index=False)
 .value
 .sum()
 .merge(level1, on='name', how='right', suffixes=('_x', None))
 .assign(value=lambda df: np.where(df.value.eq('x'), df.value_x, df.value))
 .loc[:, ['name', 'value']]
)
name value
0 a 3
1 b 12.0
2 c 15.0

Selecting rows in dataframe where a value is larger than the categorical mean

I'm trying to find employees whose salaries are above the average salary of their department, but I'm having a bit of trouble in Pandas.
In SQL, my query would look something like this:
SELECT name, department, salary
FROM employees e1
WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e1.department = e2.department)
Here is my attempt in Pandas:
df.groupby(['Department']).filter(lambda x: df['salary'] > x.salary.mean())[['Name', 'Salary']]
I get the following error which I am assuming is coming from df['salary'] in my filter clause:
filter function returned a Series, but expected a scalar bool
This is not as readable as I would like, but I think it works:
import pandas as pd
df = pd.DataFrame(columns=['employees', 'department', 'salary', 'other_features'],
                  data=[['A', 'C1', 1300, 5],
                        ['B', 'C1', 1250, 10],
                        ['C', 'C1', 2000, 18],
                        ['D', 'C3', 1240, 21],
                        ['E', 'C1', 1700, 29],
                        ['F', 'C2', 1550, 11],
                        ['G', 'C3', 2100, 2],
                        ['H', 'C3', 1090, 7],
                        ['I', 'C2', 1400, 13],
                        ['B', 'C2', 1100, 4]])
df.set_index('employees').groupby('department').apply(lambda x: x[x.salary > x.salary.mean()])['salary']
output:
employees salary
department
C1 C 2000
E 1700
C2 F 1550
I 1400
C3 G 2100
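A more direct translation of the SQL correlated subquery (a sketch, not part of the answer above, reusing the same df before set_index) compares each row's salary against its own department's mean via groupby.transform, which keeps row alignment with the original frame:
# mean salary of each row's department, aligned to df's index
dept_mean = df.groupby('department')['salary'].transform('mean')
df.loc[df['salary'] > dept_mean, ['employees', 'department', 'salary']]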

Python dataframe unexpected display error using loc

I'm creating an additional column, "Total_Count", to store the cumulative count based on the Site and Count_Record columns. My code for the total cumulative count is almost done; however, the Total_Count column is shifted for a specific card, as shown below. Could someone help with a code modification? Thank you!
(The expected and current output were shown as images in the original post.)
My Code:
import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])
df_append_temp = []
total = 0
preCard = ''
preSite = ''
preCount = 0
for pc in df1['card'].unique():
    df2 = df1[df1['card'] == pc].sort_values(['date'])
    total = 0
    for i in range(0, len(df2)):
        site = df2.iloc[i]['site']
        count = df2.iloc[i]['count_record']
        if site == preSite:
            total += (count - preCount)
        else:
            total += count
        preCount = count
        preSite = site
        df2.loc[i, 'Total_Count'] = total  # something wrong using loc here
    df_append_temp.append(df2)
df3 = pd.DataFrame(pd.concat(df_append_temp), columns=df2.columns)
df3
To modify the current implementation, we can use groupby to create our df2, which lets us apply a function to each grouped DataFrame to build the new column. (The problem in the original loop is that i is a positional counter while .loc selects by label: for the second card the row labels are 5-9, so df2.loc[i, 'Total_Count'] creates new rows labelled 0-4 instead of filling the existing ones.) This should offer performance similar to the current implementation but produce a correctly aligned Series:
def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    total = 0
    pre_count = 0
    pre_site = ''
    lst = []
    for c, s in zip(df2['count_record'], df2['site']):
        if s == pre_site:
            total += (c - pre_count)
        else:
            total += c
        pre_count = c
        pre_site = s
        lst.append(total)
    return pd.Series(lst, index=df2.index, name='Total_Count')

df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)
Alternatively we can use groupby, then within groups Series.shift to get the previous site, and count_record. Then use np.where to conditionally determine each row's value and ndarray.cumsum to calculate the cumulative total of the resulting values:
def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    return pd.Series(
        np.where(df2['site'] == df2['site'].shift(),
                 df2['count_record'] - df2['count_record'].shift(fill_value=0),
                 df2['count_record']).cumsum(),
        index=df2.index,
        name='Total_Count'
    )

df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)
Either approach produces df3:
site card date count_record Total_Count
0 A C1 12-Oct 5 5
1 A C1 13-Oct 10 10
2 A C1 14-Oct 18 18
3 A C1 15-Oct 21 21
4 A C1 16-Oct 29 29
5 B C2 12-Oct 11 11
6 A C2 13-Oct 2 13
7 A C2 14-Oct 7 18
8 A C2 15-Oct 13 24
9 B C2 16-Oct 4 28
Setup and imports:
import numpy as np # only needed if using np.where
import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])

Change the order of columns using Pandas dataframe and drop a column

I have the following illustrative example dataframe df:
df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'value': [100, 300, 150]})
The real dataframe has many more columns and rows; as I said, this is only an illustrative example.
I want to change the order of the columns, so that I get the following result:
df = pd.DataFrame({'name': ['A', 'C', 'B'],
                   'value': [100, 150, 300]})
How can I do this?
And how can I drop the row with name 'A' after reordering, so that I get the new df:
df = pd.DataFrame({'name': ['C', 'B'],
                   'value': [150, 300]})
You can use sort_values, then slice the df by position with iloc:
out = df.sort_values('value').iloc[1:]
Out[190]:
name value
2 C 150
1 B 300
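If the row to drop should be selected by its name rather than by its position after the sort (a sketch, assuming the row to remove is always the one with name 'A'), a boolean filter works too:
out = df.sort_values('value')
out = out[out['name'] != 'A']  # drop the 'A' row by label rather than by position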

Count values of column A observed only with a single distinct value in columns B and/or C

I'd like to summarise a large dataframe in terms of the distinct values of one column, with respect to whether they are restricted to occurring with single OR multiple distinct values of other column(s). My current approach for doing this is really convoluted, and I'm looking for a pandas pattern for solving these kinds of problems.
Given the following example dataframe:
import pandas as pd
pd.DataFrame({'c': ['x', 'x', 'y', 'y', 'z', 'z'],
              's': ['a1', 'a1', 'a1', 'a1', 'a1', 'a2'],
              't': [1, 1, 1, 2, 1, 1]})
How may I obtain (and count) the distinct values of column c:
1) that are observed only in conjunction with a single value of columns s and t.
Desired output: set('x') and/or its length: 1
2) that are observed only in conjunction with a single value of column s but >1 values of column t.
Desired output: set('y') and/or its length: 1
3) that are observed in conjunction with >1 values of column s and any number of distinct column t values.
Desired output: set('z') and/or its length: 1
Edit:
One more question, using the following revised df!
df = pd.DataFrame({'c': ['x', 'x', 'y', 'y', 'z', 'z', 'z1', 'z1', 'z2'],
                   's': ['a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a3', 'a3', 'a4'],
                   't': [1, 1, 1, 2, 1, 1, 3, 3, 1],
                   'cat': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'a']})
4) observed twice or more, and only in conjunction with a single value of columns s and t, and also restricted to cat 'a'
Desired output: set('x') and/or its length: 1
Thanks!
The idea is to use DataFrame.duplicated on multiple columns with keep=False to flag all dupes, then filter with boolean indexing:
m1 = df.duplicated(['c','s','t'], keep=False)
m2 = df.duplicated(['c','s'], keep=False) & ~m1
m3 = df.duplicated(['c','t'], keep=False) & ~m1
a = df.loc[m1, 'c']
print (a)
0 x
1 x
Name: c, dtype: object
b = df.loc[m2, 'c']
print (b)
2 y
3 y
Name: c, dtype: object
c = df.loc[m3, 'c']
print (c)
4 z
5 z
Name: c, dtype: object
And then convert the Series to sets:
out1, out2, out3 = set(a), set(b), set(c)
print (out1)
{'x'}
print (out2)
{'y'}
print (out3)
{'z'}
And for lengths:
out11, out21, out31 = len(out1), len(out2), len(out3)
print (out11)
print (out21)
print (out31)
1
1
1
Another idea is to create a new column with concat and DataFrame.dot:
df1 = pd.concat([m1, m2, m3], axis=1, keys=('s&t','s','t'))
print (df1)
s&t s t
0 True False False
1 True False False
2 False True False
3 False True False
4 False False True
5 False False True
df['new'] = df1.dot(df1.columns)
And then aggregate with sets and the nunique function:
df2 = (df.groupby('new')['c']
         .agg([('set', lambda x: set(x)), ('count', 'nunique')])
         .reset_index())
print (df2)
new set count
0 s {y} 1
1 s&t {x} 1
2 t {z} 1
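The same three cases can also be expressed directly with groupby and nunique (a sketch, not from the original answer, assuming df is the first example frame used above): count the distinct s and t values per c, then select on those counts:
counts = df.groupby('c').agg(s_n=('s', 'nunique'), t_n=('t', 'nunique'))
case1 = set(counts.index[(counts['s_n'] == 1) & (counts['t_n'] == 1)])  # single s and single t -> {'x'}
case2 = set(counts.index[(counts['s_n'] == 1) & (counts['t_n'] > 1)])   # single s, >1 t -> {'y'}
case3 = set(counts.index[counts['s_n'] > 1])                            # >1 s -> {'z'}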
