I have the following dummy data frame:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]], columns=['A','B','C'])
A B C
0 1 50 60
1 5 70 80
2 2 120 30
3 3 125 450
4 5 80 90
5 4 100 200
6 2 1000 2000
7 1 10 20
I am using a for loop in Python at the moment, and I would like to know whether a for loop in Python can generate multiple results. I would like to break the above data frame apart using a for loop, so that for each unique value in column A I get a new data frame, sorted by column B and with column C multiplied by 2:
df1 =
A B C
1 10 40
1 50 120
df2 =
A B C
2 120 60
2 1000 4000
df3 =
A B C
3 125 900
df4 =
A B C
4 100 400
df5 =
A B C
5 70 160
5 80 180
I am not sure if this can be done in Python. Normally I use MATLAB, and for this I tried the following in my Python script:
def f(df):
    for i in np.unique(df['A'].values):
        df = df.sort_values(['A','B'])
        df = df['C'].assign(C = lambda x: x.C*2)
        print(df)
Of course this is wrong, since it will not generate multiple results such as df1, df2, ..., df5 (it is important that these variable names end in 1, 2, ..., 5 so that they can be traced back to column A of the data frame). Could anyone help me, please? I understand that this can easily be done without a for loop (vectorization), but I have many unique values in column A, I would like to run a for loop over them, and I would also like to learn more about loops in Python. Many thanks.
Using DataFrame.groupby is faster than iterating over Series.unique.
Optionally, you can save the data frames in a dictionary.
The advantage of using a dictionary rather than a list is that each key matches the corresponding value in column A:
df2 = df.copy()
df2['C'] = df2['C'] * 2                             # multiply column C by 2
df2 = df2.sort_values('B')                          # sort by column B
dfs = {i: group for i, group in df2.groupby('A')}   # one data frame per value of A
Access the dictionary based on the value in A:
for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('_' * 20)
dfs[1]
A B C
7 1 10 40
0 1 50 120
____________________
dfs[2]
A B C
2 2 120 60
6 2 1000 4000
____________________
dfs[3]
A B C
3 3 125 900
____________________
dfs[4]
A B C
5 4 100 400
____________________
dfs[5]
A B C
1 5 70 160
4 5 80 180
Sort and multiply before chunking into pieces:
df['C'] = 2 * df['C']
[group for name, group in df.sort_values(by=['A','B']).groupby('A')]
Or if you want a dict:
{name: group for name, group in df.sort_values(by=['A','B']).groupby('A')}
I have a similar answer to Ansev's:
df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]], columns=['A','B','C'])

A = np.unique(df['A'].values)
df_result = []
for a in A:
    df1 = df.loc[df['A'] == a]
    df1 = df1.sort_values('B')
    df1 = df1.assign(C=lambda x: x.C * 2)
    df_result += [df1]
I am still unable to automate this so that the results are named df_result1, df_result2, ..., df_result5. All I can do is access the result of each loop iteration as df_result[0], df_result[1], ..., df_result[4].
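If the numbered names matter, one possibility (a sketch building on the loop above) is a dictionary keyed by formatted strings, so each result can still be traced back to its value in column A:
# A sketch: store each piece under a key such as 'df_result1' instead of
# creating separate numbered variables, which Python does not encourage.
df_result = {}
for a in np.unique(df['A'].values):
    piece = df.loc[df['A'] == a].sort_values('B').assign(C=lambda x: x.C * 2)
    df_result[f'df_result{a}'] = piece
print(df_result['df_result1'])  # the group where A == 1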
What you want to do is group by column A and then store each resulting data frame in a dict indexed by the value of A. Code to do that would be:
df_dict = {}
for ix, gp in df.groupby('A'):
    new_df = gp.sort_values('B')
    new_df['C'] = 2 * new_df['C']
    df_dict[ix] = new_df
Then the variable df_dict contains all the resulting data frames, sorted by column B and with column C multiplied by 2. For example:
print(df_dict[1])
A B C
7 1 10 40
0 1 50 120
Hi, I have the following data frame:
S.No Description amount
1 a, b, c 100
2 a, c 50
3 b, c 80
4 b, d 90
5 a 150
I want to extract only the values for 'a', for example:
expected answer:
Description amount
a 100
a 50
a 150
and sum them up as
Description amount
a 300
But I am getting this result:
Description amount
1 a 100
2 a 50
3 nan nan
4 nan nan
5 a 150
Please guide me on how to properly use the where clause on pandas data frames.
Code:
filter = new_df["Description"] == "a"
new_df.where(filter, inplace=True)
print (new_df)
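As an aside, DataFrame.where keeps the original shape and fills non-matching rows with NaN, which is where the nan rows in the output above come from; plain boolean indexing drops them instead. A minimal sketch of the difference, assuming Description held exactly one letter per row:
mask = new_df['Description'] == 'a'
print(new_df.where(mask))   # same shape, non-matching rows become NaN
print(new_df[mask])         # only the matching rows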
Use df.assign, Series.str.split, df.explode, df.query and Groupby.sum:
In [703]: df_a = df.assign(Description=df.Description.str.split(',')).explode('Description').query('Description == "a"')
In [704]: df_a
Out[704]:
S.No Description amount
0 1 a 100
1 2 a 50
4 5 a 150
In [706]: df_a.groupby('Description')['amount'].sum().reset_index()
Out[706]:
Description amount
0 a 300
Or as a one-liner:
df.assign(letters=df['Description'].str.split(r',\s'))\
  .explode('letters')\
  .query('letters == "a"')\
  .groupby('letters', as_index=False)['amount'].sum()
Here you go:
In [3]: df["Description"] = df["Description"].str.split(", ")
In [4]: df.explode("Description").groupby("Description", as_index=False).sum()[["Description", "amount"]]
Out[4]:
Description amount
0 a 300
1 b 270
2 c 230
3 d 90
This allows you to get all the sums by each description, not just the 'a' group.
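If only the 'a' group is wanted from that full result, a regular boolean filter recovers it (a small usage sketch; the variable name totals is just illustrative):
totals = df.explode("Description").groupby("Description", as_index=False).sum()[["Description", "amount"]]
print(totals.loc[totals["Description"] == "a"])
#   Description  amount
# 0           a     300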
I would like to apply a function f1 by group to a dataframe:
import pandas as pd
import numpy as np
data = np.array([['id1','id2','u','v0','v1'],
                 ['A','A',10,1,7],
                 ['A','A',10,2,8],
                 ['A','B',20,3,9],
                 ['B','A',10,4,10],
                 ['B','B',30,5,11],
                 ['B','B',30,6,12]])
z = pd.DataFrame(data=data[1:,:], columns=data[0,:])
def f1(u, v):
    return u * np.cumprod(v)
The result of the function depends on column u and on columns v0 or v1 (there can be thousands of v columns, because I'm running a simulation over many paths).
The result should be like this
id1 id2 new_v0 new_v1
0 A A 10 70
1 A A 20 560
2 A B 60 180
3 B A 40 100
4 B B 150 330
5 B B 900 3960
As a start, I tried
output = z.groupby(['id1', 'id2']).apply(lambda x: f1(u=x.u, v=x.v0))
but I can't even get a result with just one column.
Thank you very much!
You can filter the column names starting with v into a list and pass it to groupby:
v_cols = z.columns[z.columns.str.startswith('v')].tolist()
z[['u']+v_cols] = z[['u']+v_cols].apply(pd.to_numeric)
out = z.assign(**z.groupby(['id1','id2'])[v_cols].cumprod()
.mul(z['u'],axis=0).add_prefix('new_'))
print(out)
id1 id2 u v0 v1 new_v0 new_v1
0 A A 10 1 7 10 70
1 A A 10 2 8 20 560
2 A B 20 3 9 60 180
3 B A 10 4 10 40 100
4 B B 30 5 11 150 330
5 B B 30 6 12 900 3960
The way you create your data frame makes the numeric columns object dtype, so we convert them first, then use groupby + cumprod:
z[['u','v0','v1']]=z[['u','v0','v1']].apply(pd.to_numeric)
s=z.groupby(['id1','id2'])[['v0','v1']].cumprod().mul(z['u'],0)
#z=z.join(s.add_prefix('New_'))
v0 v1
0 10 70
1 20 560
2 60 180
3 40 100
4 150 330
5 900 3960
If you want to handle more than two v columns, it's better not to reference them explicitly.
(
    z.apply(lambda x: pd.to_numeric(x, errors='ignore'))
     .groupby(['id1', 'id2']).apply(lambda x: x.cumprod().mul(x.u.min()))
)
I have two Pandas DataFrames A and B.
They have an identical index (weekly dates) up to a point: the series ends at the beginning of the year for A and goes on for a number of further observations in frame B. I need to set data frame A to have the same index as frame B, and fill each column with its own last value.
Thank you in advance.
Tikhon
EDIT: thank you for the advice on the question. What I need is for dfA_before to look at dfB and become dfA_after:
print(dfA_before)
a b
0 10 100
1 20 200
2 30 300
print(dfB)
a b
0 11 111
1 22 222
2 33 333
3 44 444
4 55 555
print(dfA_after)
a b
0 10 100
1 20 200
2 30 300
3 30 300
4 30 300
This should work
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a':[10,20,30],'b':[100,200,300]})
df2 = pd.DataFrame({'a':[11,22,33,44,55],'b':[111,222,333,444,555]})

# solution: repeat the last row of df1 until it matches the length of df2
last = df1.iloc[-1].to_numpy()
df3 = pd.DataFrame(np.tile(last, (len(df2) - len(df1), 1)),
                   columns=df1.columns)
df4 = pd.concat([df1, df3], ignore_index=True)

# method 2: grow df1 row by row in place
for _ in range(len(df2) - len(df1)):
    df1.loc[len(df1)] = df1.loc[len(df1) - 1]

# method 3: concatenate the last row repeatedly
for _ in range(df2.shape[0] - df1.shape[0]):
    df1 = pd.concat([df1, df1.iloc[[-1]]], ignore_index=True)
# result
a b
0 10 100
1 20 200
2 30 300
3 30 300
4 30 300
Probably very inefficient - I am a beginner:
dfA_New = dfB.copy()
dfA_New.loc[:] = 0                             # placeholder values
dfA_New.loc[:] = dfA.loc[:]                    # aligns on index; rows beyond dfA become NaN
dfA_New.fillna(method='ffill', inplace=True)   # carry the last values forward
dfA = dfA_New
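For what it's worth, the same result can be had in one step with reindex and a forward fill; a sketch, assuming dfA and dfB carry the index described in the question:
# Align dfA to dfB's longer index; the new rows start as NaN,
# then ffill propagates each column's last observed value.
dfA_after = dfA.reindex(dfB.index).ffill()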
Cumsum until the value exceeds a certain number:
Say that we have two data frames A and B that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
type value
0 a 100
1 b 50
2 c 30
>>> B
type value
0 a 10
1 a 50
2 a 45
3 a 10
4 b 45
5 b 10
6 b 5
7 c 6
8 c 6
9 c 8
10 c 12
11 c 10
For each group in "type" in data frame A, I would like to add up the column value in B up to the number specified in the column value in A. I would also like to count the number of rows in B that were added. I've been trying to use cumsum(), but I don't know exactly how to stop the sum when the value is reached.
The output should be:
type value
0 a 3
1 b 2
2 c 4
Thank you,
Merging the two data frames beforehand should help:
import pandas as pd
df = pd.merge(B, A, on='type')   # value_x comes from B, value_y from A
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
# keep rows whose running total before this row is still below the target in A
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
Assuming B['type'] is sorted, as in the sample case, here's a NumPy based solution:
IDs = np.searchsorted(A['type'], B['type'])            # group id for each row of B
count_cumsum = np.bincount(IDs, B['value']).cumsum()   # running total of the group sums
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])  # per-group stop value on the global cumsum
Bv_cumsum = np.cumsum(B['value'])                      # global running total over B
grp_start = np.unique(IDs, return_index=True)[1]       # index where each group starts
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
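With the sample A and B above, this fills in the expected counts:
print(A)
#   type  value  output
# 0    a    100       3
# 1    b     50       2
# 2    c     30       4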
I’m trying to replicate in python/pandas what would be fairly straightforward in SQL, but am stuck.
I want to take a data frame with three columns:
dataframe1
Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130
And filter out all score values that are greater than the minimum * 1.2 for each Org-Des combination.
So the output table would be:
output_dataframe
Org Des Score
0 A B 10
1 A B 11
3 A C 4
4 A C 4.5
6 A D 100
7 A D 110
For the first Org-Des combo, A-B, the min Score is 10 and (1.2 * min) = 12. So rows 0 and 1 would be preserved, because Scores 10 and 11 are < 12, and row 2 would be eliminated because 15 > 12.
For A-C, the min Score is 4 and (1.2 * min) = 4.8. So rows 3 and 4 are preserved because 4 and 4.5 are < 4.8, and row 5 is eliminated. And so on...
My approach
I thought I'd use the following approach:
Use a groupby function to create a dataframe with the mins by Org-Des pair:
dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min())
Then do an inner join (or a merge?) between dataframe1 and dataframe2, with the criterion that Score < 1.2 * min for each Org-Des pair.
But I haven't been able to get this to work, for two reasons: 1) dataframe2 ends up being a funky shape, which I would need to figure out how to join or merge with dataframe1, or transform and then join/merge, and 2) I don't know how to set criteria as part of a join/merge.
Is this the right approach or is there a more pythonic way to achieve the same goal?
Edit to reflect @Psidom's answer:
I tried the code you suggested and it gave me an error; here's the full code and output:
In: import pandas as pd
    import numpy as np

In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                        'Des': ['B','B','B','C','C','C','D','D','D'],
                        'Score': ['10','11','15','4','4.5','6','100','110','130']})
Out: Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130
In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df2
Out: Score
Org Des
A B 10
C 4
D 100
In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)
df.loc[df1.Score_x < df1.Score_y, :]
Out: KeyError: 'Org' # It's a big error, but this seems to be the relevant part. Let me know if it would be useful to paste the whole error.
I suspect I may have the df1, df2 and df mixed up? I changed the names from the original answer post to match my code.
You can set up the join criteria like this: for the original data frame, set the join columns to ['Org', 'Des'], and for the aggregated data frame the grouped columns become the index, so you will need to set right_index to True. Then it should work as expected:
import pandas as pd

df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                    'Des': ['B','B','B','C','C','C','D','D','D'],
                    'Score': [10,11,15,4,4.5,6,100,110,130]})

df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df3 = pd.merge(df1, df2, left_on=['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]
# Org Des Score
#0 A B 10.0
#1 A B 11.0
#3 A C 4.0
#4 A C 4.5
#6 A D 100.0
#7 A D 110.0
I did it like this:
df[df.groupby(['Org', 'Des']).Score.apply(lambda x: x < x.min() * 1.2)]
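A variant of the same idea that avoids apply is groupby().transform, which broadcasts each group's minimum back onto the original rows; a small sketch against the df1 defined above:
min_scores = df1.groupby(['Org', 'Des'])['Score'].transform('min')   # per-row group minimum
df1[df1['Score'] < min_scores * 1.2]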