I am trying to use the entries from df1 to cap the amounts in df2, then add them up by type and summarize the results in df3. I'm not sure how to get there; a for loop using iterrows would be my best guess, but mine isn't complete.
Code:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Caps':['25','50','100']})
df2 = pd.DataFrame({'Amounts':['45','25','65','35','85','105','80'], \
'Type': ['a' ,'b' ,'b' ,'c' ,'a' , 'b' ,'d' ]})
df3 = pd.DataFrame({'Type': ['a' ,'b' ,'c' ,'d']})
df1['Caps'] = df1['Caps'].astype(float)
df2['Amounts'] = df2['Amounts'].astype(float)
for index1, row1 in df1.iterrows():
    for index2, row2 in df3.iterrows():
        df3[str(row1['Caps']) + 'limit'] = df2['Amounts'].where(
            df2['Type'] == row2['Type']).where(
            df2['Amounts'] <= row1['Caps'], row1['Caps']).sum()
# My ideal output would be this:
df3 = pd.DataFrame({'Type':['a','b','c','d'],
'Total':['130','195','35','80'],
'25limit':['50','75','25','25'],
'50limit':['95','125','35','50'],
'100limit':['130','190','35','80'],
})
Output:
>>> df3
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
Use numpy to compare all the Amounts values with the Caps by broadcasting to a 2d array a, then create a DataFrame with the constructor, sum per column, transpose with DataFrame.T and append the suffix with DataFrame.add_suffix.
For the aggregated column, use DataFrame.insert to place a Total first column built with GroupBy.sum:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
a = np.where(am <= ca[:, None], am[None, :], ca[:, None])
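# For the sample data, `a` has shape (3, 7): one row per cap, one column
# per amount, with every amount clipped at that row's cap:
# [[ 25  25  25  25  25  25  25]
#  [ 45  25  50  35  50  50  50]
#  [ 45  25  65  35  85 100  80]]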
df1 = (pd.DataFrame(a,columns=df2['Type'],index=df1['Caps'])
.sum(axis=1, level=0).T.add_suffix('limit'))
df1.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
Here is my solution without numpy; however, it is about two times slower than @jezrael's solution: 10.5 ms vs. 5.07 ms.
limcols = df1.Caps.to_list()
# Add one (empty) column per cap; each new column's *name* is the cap value.
df2 = df2.reindex(columns=["Amounts", "Type"] + limcols)
# Inside transform, sc.name is the cap, so clip each amount at that cap.
df2[limcols] = df2[limcols].transform(
    lambda sc: np.where(df2.Amounts.le(sc.name), df2.Amounts, sc.name))
# Summations:
g=df2.groupby("Type")
df3= g[limcols].sum()
df3.insert(0,"Total", g.Amounts.sum())
# Renaming columns:
c_dic={ lim:f"{lim:.0f}limit" for lim in limcols}
df3= df3.rename(columns=c_dic).reset_index()
# Cleanup:
#df2=df2.drop(columns=limcols)
I have a dataframe with three columns containing text. One column (column1) consists of 3 unique entries: "H", "D", "A".
I want to create a new column with the entries from the other two columns (column2 & column3) based on the entry from the column containing "H", "D" or "A".
I tried to write a function:
def func(x):
    if x == "H":
        return column2
    elif x == "A":
        return column3
    else:
        return "D"
I then tried to use the .apply() function:
df["new_col"] = df["column1"].apply(func)
But this doesn't work, as it doesn't recognise column2 & column3. How do I access the entries of columns column2 & column3 inside the function?
You can send the whole row to the function and access its columns:
def func(x):
    if x["column1"] == "H":
        return x["column2"]
    elif x["column1"] == "A":
        return x["column3"]
    else:
        return "D"
df["new_col"] = df.apply(lambda x: func(x), axis=1)
No need to use .apply; you can use np.select to choose elements based on conditions:
Consider the example dataframe:
df = pd.DataFrame({
'column1': ['H', 'D', 'A', 'H', 'A'],
'column2': [1, 2, 3, 4, 5],
'column3': [10, 20, 30, 40, 50]
})
Use:
import numpy as np
conditions = [
df['column1'].eq('H'),
df['column1'].eq('A')
]
choices = [
df['column2'],
df['column3']]
df['new_col'] = np.select(
conditions, choices, default='D')
Result:
# print(df)
column1 column2 column3 new_col
0 H 1 10 1
1 D 2 20 D
2 A 3 30 30
3 H 4 40 4
4 A 5 50 50
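Note that np.select casts the result to a common dtype, so with the string default 'D' the numeric picks in new_col come back as strings rather than numbers.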
Here I retrieve the rows matching the required conditions and alter the corresponding rows in column4. We can achieve this using iloc on a pandas dataframe.
import pandas as pd
d = {"column1":["H","D","A","D", "H", "H", "A"],"column2":[1,2,3,4,5,6,7],"column3":[12,23,34,45,56,67,87]}
df = pd.DataFrame(d)
df["column4"] = None
df.iloc[list(df[df["column1"] == "H"].index), 3] = df[df["column1"] == "H"]["column2"]
df.iloc[list(df[df["column1"] == "A"].index), 3] = df[df["column1"] == "A"]["column3"]
df.iloc[list(df[df["column4"].isnull()].index), 3] = "D"
The output of the above processing is given below:
print(df)
column1 column2 column3 column4
0 H 1 12 1
1 D 2 23 D
2 A 3 34 34
3 D 4 45 D
4 H 5 56 5
5 H 6 67 6
6 A 7 87 87
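For reference, the same effect can be written more concisely with boolean masks and .loc; a minimal sketch, assuming the same df as above:
df["column4"] = "D"  # default, covers the 'else' case
df.loc[df["column1"] == "H", "column4"] = df["column2"]
df.loc[df["column1"] == "A", "column4"] = df["column3"]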
You can use the np.select() function:
import numpy as np
df['column4'] = np.select([df.column1=='H',df.column1=='A'],
[df.column2,df.column3], default = 'D')
It is similar to a SQL CASE WHEN statement: the first argument is the list of conditions to compare, the second argument is the list of outputs corresponding to those conditions, and default is the keyword argument for the 'else' case.
Based on my understanding of your query, I'll illustrate using your example.
Consider this data frame:
d = {
"col1": ["H","D","A","H","D","A"],
"col2": [172,180,190,156,176,182],
"col3":[80,75,53,80,100,92]
}
df = pd.DataFrame(d)
df
col1 col2 col3
0 H 172 80
1 D 180 75
2 A 190 53
3 H 156 80
4 D 176 100
5 A 182 92
apply passes a Series object to the function, whose columns you access with the appropriate indices relative to the dataframe you passed. When calling apply, it is necessary to pass axis=1, since you need the column values for each row. Finally, assign the returned series to the original dataframe.
def func(row):
    # row is a Series; positions 0, 1, 2 hold col1, col2, col3
    if row.iloc[0] == 'H':
        return row.iloc[1]
    elif row.iloc[0] == 'A':
        return row.iloc[2]
    else:
        return "D"
df['col4'] = df.apply(func, axis=1)
df
col1 col2 col3 col4
0 H 172 80 172
1 D 180 75 D
2 A 190 53 53
3 H 156 80 156
4 D 176 100 D
5 A 182 92 92
I have a DataFrame for which I want to calculate, for each row, how many other rows match a given condition (e.g. the number of rows whose value in column C is less than this row's value). Iterating through each row is too slow (I have ~1B rows), especially when the column's dtype is datetime, but this is how it could be done on a DataFrame df with a column labeled C:
df['newcol'] = 0
for row in df.itertuples():
df.loc[row.Index, 'newcol'] = len(df[df.C < row.C])
Is there a way to vectorize this?
Thanks!
Preparation:
import numpy as np
import pandas as pd
count = 5000
np.random.seed(100)
data = np.random.randint(100, size=count)
df = pd.DataFrame({'Col': list('ABCDE') * (count // 5),  # // for Python 3
'Val': data})
Suggestion:
# Sorted unique values and how often each occurs.
u, c = np.unique(data, return_counts=True)
# values[i] = number of elements <= u[i].
values = np.cumsum(c)
# Shift by one unique value, so each value maps to the count of elements
# strictly less than it; the smallest value maps to 0.
dictionary = dict(zip(u[1:], values[:-1]))
dictionary[u[0]] = 0
df['newcol'] = [dictionary[x] for x in data]
It does exactly the same as your example. If it does not help, please ask a more detailed question.
Recommendations:
Pandas vectorization and JIT compilation are available with numba.
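For instance, here is a minimal sketch of a JIT-compiled counter (assuming numba is installed; count_less is a hypothetical helper, not part of the original recommendation):
import numpy as np
from numba import njit

@njit
def count_less(arr):
    # Sort once; each value's left insertion point in the sorted copy
    # equals the number of elements strictly smaller than it.
    order = np.sort(arr)
    out = np.empty(arr.size, dtype=np.int64)
    for i in range(arr.size):
        out[i] = np.searchsorted(order, arr[i])
    return out

df['newcol_nb'] = count_less(data)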
If you work with 1d arrays, use numpy. In many situations it works faster. Just compare:
Pandas
%timeit df['newcol2'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
1 loop, best of 3: 51.1 s per loop
Numpy
%timeit df['newcol3'] = [np.sum(data<x) for x in data]
10 loops, best of 3: 61.3 ms per loop
Use numpy.sum instead of sum!
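If even that is too slow at ~1B rows, the O(n^2) comparison can be replaced by one sort plus a vectorized binary search; a minimal sketch with numpy (not part of the original answer):
# The left insertion point of each value into the sorted data is exactly
# the count of elements strictly smaller than it.
order = np.sort(data)
df['newcol4'] = np.searchsorted(order, data, side='left')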
Consider pandas.DataFrame.apply with a lambda expression to count the rows matching your condition. Admittedly, apply is a loop, and running it across ~1 billion rows may take time to process.
import numpy as np
import pandas as pd
np.random.seed(161)
df = pd.DataFrame({'Col': list('ABCDE') * 3,
'Val': np.random.randint(100, size=15)})
df['newcol'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
# Col Val newcol
# 0 A 78 13
# 1 B 11 2
# 2 C 51 8
# 3 D 31 5
# 4 E 29 4
# 5 A 99 14
# 6 B 65 10
# 7 C 16 3
# 8 D 43 7
# 9 E 10 1
# 10 A 67 11
# 11 B 36 6
# 12 C 1 0
# 13 D 73 12
# 14 E 64 9
I have two DataFrames in Python.
The first one is df1:
'ID' 'B'
AA 10
BB 20
CC 30
DD 40
The second one is df2:
'ID' 'C' 'D'
BB 30 0
DD 35 0
What I finally want to get is like df3:
'ID' 'C' 'D'
BB 30 20
DD 35 40
How do I reach this goal?
My code is:
for i in df.ID:
    if len(df2.ID[df2.ID==i]):
        df2.D[df2.ID==i] = df1.B[df2.ID==i]
but it doesn't work.
So first of all, I've interpreted the question differently, since your description is rather ambiguous. Mine boils down to this:
df1 is this data structure:
ID B <- column names
AA 10
BB 20
CC 30
DD 40
df2 is this data structure:
ID C D <- column names
BB 30 0
DD 35 0
Dataframes have a merge option; if you want to merge based on the ID column, the following code works:
import pandas as pd
df1 = pd.DataFrame(
[
['AA', 10],
['BB', 20],
['CC', 30],
['DD', 40],
],
columns=['ID','B'],
)
df2 = pd.DataFrame(
[
['BB', 30, 0],
['DD', 35, 0],
], columns=['ID', 'C', 'D']
)
df3 = pd.merge(df1, df2, on='ID')
Now df3 only contains rows with IDs present in both df1 and df2:
ID B C D <- column names
BB 20 30 0
DD 40 35 0
Now you were trying to remove D and fill it in with column B's values, i.e.:
ID C D
BB 30 20
DD 35 40
Something that can be done with these simple steps:
df3 = pd.merge(df1, df2, on='ID') # merge them
df3.D = df3['B'] # set D to B's values
del df3['B'] # remove B from df3
Or to summarize:
def match(df1, df2):
    df3 = pd.merge(df1, df2, on='ID')  # merge them
    df3.D = df3['B']                   # set D to B's values
    del df3['B']                       # remove B from df3
    return df3
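Calling it on the example frames reproduces the desired df3:
print(match(df1, df2))
#    ID   C   D
# 0  BB  30  20
# 1  DD  35  40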
The following code will replace the zeros in df1 with the values from df2: df1[df1 != 0] masks the zeros to NaN, and fillna then fills those holes from df2 by index alignment.
df1=pd.DataFrame(['A','B',0,4,6],columns=['x'])
df2=pd.DataFrame(['A','X',3,0,5],columns=['x'])
df3 = df1[df1 != 0].fillna(df2)
# df3['x'] is now: A, B, 3, 4, 6
Say I have some data in a DataFrame df. In particular, df.columns is a MultiIndex where the first level indicates "what kind of data" we are dealing with, and the second level indicates some sort of ID. To begin with, there is only a single unique value in the outermost column level:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(400, 5), columns=list('abcde'))
df.columns = pd.MultiIndex.from_tuples([('raw', c) for c in df.columns],
names=['datum', 'id'])
So say I want to compute a 10-period moving average of this chunk of data. I can easily do that with
df['raw'].rolling(window=10, min_periods=10).mean()
I'd like to assign this to a new section of the existing data frame. I wish the syntax were simply:
df['avg_10'] = df['raw'].rolling(window=10, min_periods=10).mean()
But that doesn't work. Instead, to get the equivalent, I need to do something clunky like:
a = df['raw'].rolling(window=10, min_periods=10).mean()
a.columns = pd.MultiIndex.from_tuples([('avg_10', c) for c in a.columns],
names=['datum', 'id'])
df = pd.concat([df, a], axis=1)
Is there a concise way to do this?
You can add the new columns in one shot like this:
df[df.columns.get_level_values(1)] = df['raw'].rolling(window=10, min_periods=10).mean()
and now let's bring order to the column levels:
df.columns = pd.MultiIndex.from_tuples(
[t if t[0]=='raw' else ('avg_10', t[0]) for t in df.columns.tolist()]
)
Output:
In [121]: df.tail()
Out[121]:
raw avg_10 \
a b c d e a b
35 -0.036381 -0.202369 0.728408 -1.149906 -0.888169 0.174578 0.244956
36 1.700182 -0.957104 -0.005931 -1.035258 0.916398 0.304429 0.025519
37 1.142203 0.198508 -0.568147 0.006620 1.912575 0.408570 0.029939
38 -1.360093 0.638533 -0.899154 1.120311 1.702436 0.109886 0.155383
39 -1.860319 0.863798 0.876608 1.292301 0.547762 -0.069686 0.141820
c d e
35 -0.046456 -0.291078 0.176360
36 0.128143 -0.670730 0.213351
37 0.041724 -0.542027 0.301774
38 -0.147804 -0.363713 0.400007
39 0.005854 -0.164190 0.483140
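A more concise option (a sketch, not from the original answer; it assumes the original df from the question): with MultiIndex columns you can also assign with tuple keys, which creates each ('avg_10', col) column directly:
for c in df['raw'].columns:
    df[('avg_10', c)] = df[('raw', c)].rolling(window=10, min_periods=10).mean()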
Because it uses df.rolling, as in your example, this solution only works with pandas 0.18.0+.
# Create sample data with three columns.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(400, 3), columns=list('abc'))
df.columns = pd.MultiIndex.from_tuples([('raw', c) for c in df.columns],
names=['datum', 'id'])
# Have two window periods (e.g. 10, 30).
windows = [10, 30]
cols = df.columns.get_level_values(1)
for window in windows:
    for col in cols:
        # Select the raw series explicitly so previously added avg_*
        # columns are not swept into the rolling mean.
        df.loc[:, ('avg_{0}'.format(window), col)] = \
            df[('raw', col)].rolling(window=window, min_periods=window).mean()
>>> df.tail()
datum raw avg_10 avg_30
id a b c a b c a b c
395 -0.177813 0.250998 1.054758 0.528226 0.266558 0.123020 0.046781 0.365069 0.233943
396 0.960048 -0.416499 -0.276823 0.459380 0.379910 0.140920 0.067177 0.329077 0.261536
397 1.123905 -0.173464 -0.510030 0.429155 0.268950 0.022079 0.105671 0.270666 0.271052
398 1.392518 1.037586 0.018792 0.485142 0.340002 -0.139202 0.170970 0.315509 0.262711
399 -0.593777 -2.011880 0.589704 0.387988 0.114828 -0.096127 0.133680 0.206199 0.265718
I have the following dataframe:
ID     first  mes1.1  mes1.2  ...  mes1.10  mes2.[1-10]   mes3.[1-10]
123df  John   5.5     130     ...  45       [12,312,...]  [123,346,53]
...
where I have abbreviated columns using [] notation. So in this dataframe I have 31 columns: first, mes1.[1-10], mes2.[1-10], and mes3.[1-10]. Each row is keyed by a unique index: ID.
I would like to form a new table where I replicate all the key column values (represented here by ID and first) and move the mes2 and mes3 columns (20 of them) "down", giving me something like this:
ID first mes1 mes2 ... mes10
123df John 5.5 130 45
123df John 341 543 53
123df John 123 560 567
...
# How I set up your dataframe (please include a reproducible df next time)
df = pd.DataFrame(np.random.rand(6,31), index=["ID" + str(i) for i in range(6)],
columns=['first'] + ['mes{0}.{1}'.format(i, j) for i in range(1,4) for j in range(1,11)])
df['first'] = 'john'
Then there are two ways to do this:
# Generate new underlying array
first = np.repeat(df['first'].values, 3)[:, np.newaxis]
new_vals = df.values[:, 1:].reshape(18,10)
new_vals = np.hstack((first, new_vals))
# Create new df
m = pd.MultiIndex.from_product((df.index, range(1,4)), names=['ID', 'MesNum'])
pd.DataFrame(new_vals, index=m, columns=['first'] + list(range(1,11)))
or using only Pandas
df.columns = ['first'] + list(range(1,11))*3
pieces = [df.iloc[:, i:i+10] for i in range(1,31, 10)]
df2 = pd.concat(pieces, keys = ['first', 'second', 'third'])
df2 = df2.swaplevel(1, 0).sort_index(level=0)  # sortlevel is deprecated; use sort_index
df2.insert(0, 'first', df['first'].repeat(3).values)
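For completeness, a sketch of a third option using a column MultiIndex and stack (assuming the same 'mesX.Y' column layout as in the setup above; not from the original answer):
# Split 'mesX.Y' into a two-level column index, then stack the X level into rows.
meas = df.drop(columns='first')
meas.columns = pd.MultiIndex.from_tuples(
    [tuple(map(int, c[3:].split('.'))) for c in meas.columns],
    names=['MesNum', 'mes'])
out = meas.stack(level='MesNum')
# Re-attach the replicated key column.
out.insert(0, 'first', df['first'].loc[out.index.get_level_values(0)].values)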