Unable to apply where clause properly in Python pandas data frame - python

Hi, I have the following data frame:
S.No  Description  amount
1     a, b, c         100
2     a, c             50
3     b, c             80
4     b, d             90
5     a               150
I want to extract only the values for 'a', for example:
Description  amount
a               100
a                50
a               150
and sum them up as:
Description  amount
a               300
But I am getting this answer:
  Description  amount
1           a     100
2           a      50
3         NaN     NaN
4         NaN     NaN
5           a     150
Please guide me on how to properly use a where clause on pandas DataFrames.
Code:
filter = new_df["Description"] == "a"
new_df.where(filter, inplace=True)
print(new_df)
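For context: DataFrame.where keeps the frame's original shape and fills rows where the condition is False with NaN, which is exactly where the NaN rows come from; it does not drop them. A minimal sketch of two ways to drop non-matching rows instead (note that == "a" only matches rows whose entire Description is "a", so the comma-separated rows still need splitting, as the answers below show):
mask = new_df["Description"] == "a"
print(new_df[mask])                  # boolean indexing keeps only matching rows
print(new_df.where(mask).dropna())   # equivalent: mask to NaN, then drop those rows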

Use df.assign, Series.str.split, df.explode, df.query and GroupBy.sum:
In [703]: df_a = df.assign(Description=df.Description.str.split(',')).explode('Description').query('Description == "a"')
In [704]: df_a
Out[704]:
   S.No Description  amount
0     1           a     100
1     2           a      50
4     5           a     150
In [706]: df_a.groupby('Description')['amount'].sum().reset_index()
Out[706]:
  Description  amount
0           a     300
Or as a one-liner:
df.assign(letters=df['Description'].str.split(r',\s'))\
  .explode('letters')\
  .query('letters == "a"')\
  .groupby('letters', as_index=False)['amount'].sum()

Here you go:
In [3]: df["Description"] = df["Description"].str.split(", ")
In [4]: df.explode("Description").groupby("Description", as_index=False).sum()[["Description", "amount"]]
Out[4]:
  Description  amount
0           a     300
1           b     270
2           c     230
3           d      90
This gives you the sums for every description, not just the 'a' group.
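If you only need the 'a' row from that summary, a boolean filter on the result works; a small sketch, continuing from the snippet above (the name totals is just illustrative):
totals = df.explode("Description").groupby("Description", as_index=False)["amount"].sum()
print(totals[totals["Description"] == "a"])
#   Description  amount
# 0           a     300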

Related

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
   A  B   C
0  0  7  50
1  0  3  51
2  0  3  45
3  1  5  50
4  1  0  50
5  2  6  50
I would like to obtain a dataset grouped by 'A', together with the most common value for 'B' in each group, and the number of occurrences of that value:
A  most_freq  freq
0          3     2
1          5     1
2          6     1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A','most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print(df)
   A  most_freq  freq
0  0          3     2
2  1          0     1
4  2          6     1
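Note that drop_duplicates leaves the original row positions (0, 2, 4) as the index; if a clean 0..n index is wanted, one extra call does it:
df = df.reset_index(drop=True)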

pandas apply User defined function to grouped dataframe on multiple columns

I would like to apply a function f1 by group to a dataframe:
import pandas as pd
import numpy as np

data = np.array([['id1','id2','u','v0','v1'],
                 ['A','A',10,1,7],
                 ['A','A',10,2,8],
                 ['A','B',20,3,9],
                 ['B','A',10,4,10],
                 ['B','B',30,5,11],
                 ['B','B',30,6,12]])
z = pd.DataFrame(data=data[1:,:], columns=data[0,:])

def f1(u, v):
    return u*np.cumprod(v)
The result of the function depends on the column u and columns v0 or v1 (there can be thousands of v columns because I'm doing a simulation on a lot of paths).
The result should be like this
  id1 id2  new_v0  new_v1
0   A   A      10      70
1   A   A      20     560
2   A   B      60     180
3   B   A      40     100
4   B   B     150     330
5   B   B     900    3960
As a start, I tried
output = z.groupby(['id1', 'id2']).apply(lambda x: f1(u = x.u, v = x.v0))
but I can't even get a result with just one column.
Thank you very much!
You can filter the column names starting with v into a list and pass it to groupby:
v_cols = z.columns[z.columns.str.startswith('v')].tolist()
z[['u']+v_cols] = z[['u']+v_cols].apply(pd.to_numeric)

out = z.assign(**z.groupby(['id1','id2'])[v_cols].cumprod()
                 .mul(z['u'],axis=0).add_prefix('new_'))
print(out)
  id1 id2   u  v0  v1  new_v0  new_v1
0   A   A  10   1   7      10      70
1   A   A  10   2   8      20     560
2   A   B  20   3   9      60     180
3   B   A  10   4  10      40     100
4   B   B  30   5  11     150     330
5   B   B  30   6  12     900    3960
The way you create your data frame makes the numeric columns object dtype, so we convert them first, then use groupby + cumprod:
z[['u','v0','v1']]=z[['u','v0','v1']].apply(pd.to_numeric)
s=z.groupby(['id1','id2'])[['v0','v1']].cumprod().mul(z['u'],0)
#z=z.join(s.add_prefix('New_'))
    v0    v1
0   10    70
1   20   560
2   60   180
3   40   100
4  150   330
5  900  3960
If you want to handle more than 2 v columns, it's better not to reference them explicitly.
(
    z.apply(lambda x: pd.to_numeric(x, errors='ignore'))
     .groupby(['id1', 'id2']).apply(lambda x: x.cumprod().mul(x.u.min()))
)
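One caveat: errors='ignore' for pd.to_numeric is deprecated in recent pandas releases, so on newer versions it may be safer to convert the numeric columns explicitly first; a small sketch under that assumption:
num_cols = ['u'] + [c for c in z.columns if c.startswith('v')]
z[num_cols] = z[num_cols].apply(pd.to_numeric)  # convert only u and the v columns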

For loops with multiple results

I have a following dummy data frame:
df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]],columns = ['A','B','C'])
   A     B     C
0  1    50    60
1  5    70    80
2  2   120    30
3  3   125   450
4  5    80    90
5  4   100   200
6  2  1000  2000
7  1    10    20
I am using a for loop in Python at the moment, and I would like to know whether a for loop in Python can generate multiple results. I would like to break the above data frame apart using a for loop: for each unique value in column A I would like a new df, sorted by column B and with column C multiplied by 2:
df1 =
A   B    C
1  10   40
1  50  120
df2 =
A     B     C
2   120    60
2  1000  4000
df3 =
A    B    C
3  125  900
df4 =
A    B    C
4  100  400
df5 =
A   B    C
5  70  160
5  80  180
I am not sure if this can be done in Python. Normally I use MATLAB, and for this I tried the following in my Python script:
def f(df):
    for i in np.unique(df['A'].values):
        df = df.sort_values(['A','B'])
        df = df['C'].assign(C = lambda x: x.C*2)
        print(df)
Of course this is wrong, since it will not generate multiple results as df1, df2, ..., df5 (it is important that these variable names end in 1, 2, ..., 5 so that they can be traced back to column A of the dataframe). Could anyone help me please? I understand that this can easily be done without a for loop (vectorization), but I have many unique values in column A, I would like to run a for loop over them, and I would also like to learn more about loops in Python. Many thanks.
Use DataFrame.groupby; it is faster than iterating over Series.unique.
Optionally you can save the dataframes in a dictionary.
The advantage of a dictionary over a list is that the key matches the value in A:
df2=df.copy()
df2['C']=df2['C']*2
df2=df2.sort_values('B')
dfs={i:group for i,group in df2.groupby('A')}
Access the dictionary based on the value in A:
for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('_'*20)
dfs[1]
   A   B    C
7  1  10   40
0  1  50  120
____________________
dfs[2]
   A     B     C
2  2   120    60
6  2  1000  4000
____________________
dfs[3]
   A    B    C
3  3  125  900
____________________
dfs[4]
   A    B    C
5  4  100  400
____________________
dfs[5]
   A   B    C
1  5  70  160
4  5  80  180
Sort and multiply before chunking into pieces:
df['C'] = 2 * df['C']
[group for name, group in df.sort_values(by=['A','B']).groupby('A')]
Or if you want a dict:
{name: group for name, group in df.sort_values(by=['A','B']).groupby('A')}
I have a similar answer to Ansev's:
df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]],columns = ['A','B','C'])
A = np.unique(df['A'].values)
df_result = []
for a in A:
    df1 = df.loc[df['A'] == a]
    df1 = df1.sort_values('B')
    df1 = df1.assign(C = lambda x: x.C*2)
    df_result += [df1]
I am still unable to automate naming the results df_result1, df_result2, ..., df_result5; what I can do is only access the result of each loop iteration as df_result[0], df_result[1], ..., df_result[4].
What you want to do is group by the column A and then store the resulting dataframes in a dict indexed by the value of A. Code to do that would be:
df_dict = {}
for ix, gp in df.groupby('A'):
    new_df = gp.sort_values('B')
    new_df['C'] = 2*new_df['C']
    df_dict[ix] = new_df
Then df_dict contains all the resulting dataframes, sorted by column B and with column C multiplied by 2. For example:
print(df_dict[1])
A   B    C
1  10   40
1  50  120

Dataframe selecting Max for a column but output values of another

I have a dataframe with values similar to the one below:
A10d  B10d  C10d  A  B  C  Strategy
  20    10     5  3  5  1         3
The Strategy column selects the max of A10d, B10d and C10d and returns the corresponding value of A, B or C.
In this case A10d is the largest, so Strategy returns the value of A, which is 3.
I am not sure how to create this Strategy column properly; can anyone advise, please? Thank you very much for your help.
I think you need iloc to select the first columns by position, then get the column names of the row-wise max with idxmax and replace '10d' with an empty string to match the value columns. Last, create the new column with lookup:
print(df)
   A10d  B10d  C10d  A  B  C
0    20    10     5  3  5  1
1    20   100     5  3  5  1

df1 = df.iloc[:,:3]
print(df1)
   A10d  B10d  C10d
0    20    10     5
1    20   100     5

s = df1.idxmax(axis=1).str.replace('10d','')
print(s)
0    A
1    B
dtype: object

df['Strategy'] = df.lookup(df.index, s)
print(df)
   A10d  B10d  C10d  A  B  C  Strategy
0    20    10     5  3  5  1         3
1    20   100     5  3  5  1         5
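Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0; on current versions, an equivalent sketch uses NumPy indexing (assuming df and s as defined above):
import numpy as np
rows = np.arange(len(df))
cols = df.columns.get_indexer(s)           # column position of each row's target
df['Strategy'] = df.to_numpy()[rows, cols]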

cumsum pandas up to specific value - python pandas

Cumsum until the value exceeds a certain number.
Say that we have two data frames A and B defined like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
  type  value
0    a    100
1    b     50
2    c     30
>>> B
   type  value
0     a     10
1     a     50
2     a     45
3     a     10
4     b     45
5     b     10
6     b      5
7     c      6
8     c      6
9     c      8
10    c     12
11    c     10
For each group in "type" in data frame A, I would like to add up the value column in B up to the number specified in the value column of A, and also count the number of rows in B that were added. I've been trying to use cumsum(), but I don't know exactly how to stop the sum when the value is reached.
The output should be:
  type  value
0    a      3
1    b      2
2    c      4
Thank you,
Merging the two data frames beforehand should help. The shifted cumulative sum compares the running total before each row against the threshold, so the row that crosses the limit is still counted:
import pandas as pd
df = pd.merge(B, A, on = 'type')
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
Assuming B['type'] is sorted, as in the sample case, here's a NumPy-based solution:
IDs = np.searchsorted(A['type'], B['type'])                  # group id of each row in B
count_cumsum = np.bincount(IDs, B['value']).cumsum()         # cumulative total per group
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])   # global running total at which each group's limit is hit
Bv_cumsum = np.cumsum(B['value'])                            # running total over all of B
grp_start = np.unique(IDs, return_index=True)[1]             # first row of each group in B
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
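As a quick check on the sample data, this reproduces the counts from the expected output:
print(A)
#   type  value  output
# 0    a    100       3
# 1    b     50       2
# 2    c     30       4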
