How to change values in a dataframe based on another dataframe in pandas? - python

Hi,
I have two dataframes and I want to change values in the first dataframe for the rows whose IDs appear in both dataframes.
Suppose I have:
df1:
ID  price
1   200
4   300
5   120
7   230
8   110
9   90
12  180
and
df2:
ID  price  count
3   340    27
4   60     10
5   290    2
After the replacement, df1 should be:
ID  price
1   200
4   60
5   290
7   230
8   110
9   90
12  180
My first try was:
df1.loc[df1.ID.isin(df2.ID), ['price']] = df2.loc[df2.ID.isin(df1.ID), ['price']].values
but it doesn't give the correct result.

Assuming ID is the index (or can be set as the index), you can just use update():
In []:
df1.update(df2)
df1
Out[]:
price
ID
1 200.0
4 60.0
5 290.0
7 230.0
8 110.0
9 90.0
12 180.0
If you need to set_index():
df = df1.set_index('ID')
df.update(df2.set_index('ID'))
df1 = df.reset_index()
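As an alternative sketch (not from the answer above), you can skip the index round-trip and map prices by ID, assuming the IDs in df2 are unique:
# map-based alternative: look up df2's price for each ID in df1,
# keeping the old price where there is no match
new_price = df1['ID'].map(df2.set_index('ID')['price'])
df1['price'] = new_price.fillna(df1['price'])
Note that, like update() above, this upcasts price to float for the rows that have no match (compare the 200.0, 230.0, ... in the output).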


How can I assign the result of a filtered, grouped aggregation as a new column in the original Pandas DataFrame

I am having trouble making the transition from using R data.table to using Pandas for data munging.
Specifically, I am trying to assign the results of aggregations back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.
To illustrate, I'm trying to replicate what I would do in R with:
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
avg_id = mean(id) ),
.(node, term)]
df
# id term node hours price vwap_ish avg_id
# 1: 1 qtr A 300 107 90600 2
# 2: 2 qtr A 300 104 90600 2
# 3: 3 qtr A 300 91 90600 2
# 4: 4 qtr B 300 89 95400 5
# 5: 5 qtr B 300 113 95400 5
# 6: 6 qtr B 300 116 95400 5
# 7: 7 mth A 50 110 NA NA
# 8: 8 mth A 100 119 NA NA
# 9: 9 mth A 150 99 NA NA
# 10: 10 mth B 50 111 NA NA
# 11: 11 mth B 100 106 NA NA
# 12: 12 mth B 150 108 NA NA
Using Pandas, I can create an object from df that contains all rows of the original df, with the aggregations
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish=(gp.hours * gp.price).sum(),
        avg_id=np.mean(gp.id)
    )
)
df1
"""
id term node hours price vwap_ish avg_id
node term
B mth 9 10 mth B 50 111 32350 10.0
10 11 mth B 100 106 32350 10.0
11 12 mth B 150 108 32350 10.0
qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
A mth 6 7 mth A 50 110 32250 7.0
7 8 mth A 100 119 32250 7.0
8 9 mth A 150 99 32250 7.0
qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
"""
This doesn't really get me what I want because a) it re-orders and creates indices, and b) it has calculated the aggregation for all rows.
I can get the subset easily enough with
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish=(gp.hours * gp.price).sum(),
        avg_id=np.mean(gp.id)
    )
)
df2
"""
id term node hours price vwap_ish avg_id
node term
A qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
B qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish=(gp.hours * gp.price).sum(),
        avg_id=np.mean(gp.id)
    )
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
And also a few variations of .merge and .join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
In writing this I realised I could take my first approach and then just do
df.loc[df.term=='qtr', ['vwap_ish','avg_id']] = np.nan
but this seems quite hacky. It means I have to use a new column rather than overwriting an existing one on the filtered rows, and if the aggregation function were to break, say when term='mth', that would be problematic too.
I'd really appreciate any help with this as it's been a very steep learning curve to try to make the transition from data.table to Pandas and there's so much I would do in a one-liner that is taking me hours to figure out.
You can add the group_keys=False parameter to remove the MultiIndex, so the left join works well:
df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
    lambda gp: gp.assign(
        vwap_ish=(gp.hours * gp.price).sum(),
        avg_id=np.mean(gp.id)
    )
)
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
Solution without left join:
m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
                                         .apply(lambda gp: gp.assign(
                                             vwap_ish=(gp.hours * gp.price).sum(),
                                             avg_id=np.mean(gp.id)
                                         )))
An improved solution uses named aggregation; creating the vwap_ish column before the groupby can also improve performance:
df2 = (df[df.term == 'qtr']
         .assign(vwap_ish=lambda x: x.hours * x.price)
         .groupby(['node','term'], as_index=False)
         .agg(vwap_ish=('vwap_ish','sum'),
              avg_id=('id','mean')))
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
One option is to break it into individual steps, avoiding apply (helpful if you are keen on performance):
Compute the product of hours and price before grouping:
temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)
Get the groupby object after filtering term:
temp = (temp
        .loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
        .groupby([df.node, df.term])
        )
Assign back the aggregated values with transform; pandas will take care of the index alignment:
(df
 .assign(vwap_ish=temp.vwap_ish.transform('sum'),
         avg_id=temp.avg_id.transform('mean'))
)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
This is just an aside, and you can totally ignore it: pydatatable attempts to mimic R's data.table as much as it can. This is one solution with pydatatable:
from datatable import dt, f, by, ifelse, update
DT = dt.Frame(df)
query = f.term == 'qtr'
agg = {'vwap_ish': ifelse(query, (f.hours * f.price), np.nan).sum(),
       'avg_id': ifelse(query, f.id.mean(), np.nan).sum()}
# update is a near equivalent to `:=`
DT[:, update(**agg), by('node', 'term')]
DT
| id term node hours price vwap_ish avg_id
| int64 str32 str32 int64 int64 float64 float64
-- + ----- ----- ----- ----- ----- -------- -------
0 | 1 qtr A 300 107 90600 6
1 | 2 qtr A 300 104 90600 6
2 | 3 qtr A 300 91 90600 6
3 | 4 qtr B 300 89 95400 15
4 | 5 qtr B 300 113 95400 15
5 | 6 qtr B 300 116 95400 15
6 | 7 mth A 50 110 NA NA
7 | 8 mth A 100 119 NA NA
8 | 9 mth A 150 99 NA NA
9 | 10 mth B 50 111 NA NA
10 | 11 mth B 100 106 NA NA
11 | 12 mth B 150 108 NA NA
[12 rows x 7 columns]

Split a data frame into six equal parts based on number of rows without knowing the number of rows - pandas

I have a df as shown below.
df:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
5 A 500
6 A 600
7 A 200
8 B 150
9 C 110
10 B 200
11 B 220
12 A 150
13 C 20
14 B 50
I would like to split the df into 6 equal parts based on the number of rows.
Expected Output
df1:
ID Job Salary
1 A 100
2 B 200
3 B 20
df2:
ID Job Salary
4 C 150
5 A 500
6 A 600
df3:
ID Job Salary
7 A 200
8 B 150
df4:
ID Job Salary
9 C 110
10 B 200
df5:
ID Job Salary
11 B 220
12 A 150
df6:
ID Job Salary
13 C 20
14 B 50
Note: Since there are 14 rows, the first two dfs can have 3 rows and the remaining 4 dfs should have 2 rows.
And I would like to save all the dfs as CSV files dynamically.
You can use np.array_split():
import numpy as np

dfs = np.array_split(df, 6)
for index, part in enumerate(dfs):
    part.to_csv(f'df{index+1}.csv')
>>> print(dfs)
[ ID Job Salary
0 1 A 100
1 2 B 200
2 3 B 20,
ID Job Salary
3 4 C 150
4 5 A 500
5 6 A 600,
ID Job Salary
6 7 A 200
7 8 B 150,
ID Job Salary
8 9 C 110
9 10 B 200,
ID Job Salary
10 11 B 220
11 12 A 150,
ID Job Salary
12 13 C 20
13 14 B 50]
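A small follow-up (not part of the original answer): if you don't want the DataFrame index written into the CSV files, pass index=False to to_csv:
# same loop as above, but without writing the index column
for index, part in enumerate(dfs):
    part.to_csv(f'df{index+1}.csv', index=False)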

Create New Pandas DataFrame Column Equaling Values From Other Row in Same DataFrame

I'm new to python and very new to Pandas. I've looked through the Pandas documentation and tried multiple ways to solve this problem unsuccessfully.
I have a DataFrame with timestamps in one column and prices in another, such as:
d = {'TimeStamp': [1603822620000, 1603822680000,1603822740000, 1603823040000,1603823100000,1603823160000,1603823220000], 'Price': [101,105,102,108,105,101,106], 'OtherData1': [1,2,3,4,5,6,7], 'OtherData2': [7,6,5,4,3,2,1]}
df= pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2
0 1603822620000 101 1 7
1 1603822680000 105 2 6
2 1603822740000 102 3 5
3 1603823040000 108 4 4
4 1603823100000 105 5 3
5 1603823160000 101 6 2
6 1603823220000 106 7 1
In addition to the two columns of interest, this DataFrame also has additional columns with data not particularly relevant to the question (represented with OtherData Cols).
I want to create a new column 'Fut2Min' (Price Two Minutes into the Future). There may be missing data, so this problem can't be solved by simply getting the data from 2 rows below.
I'm trying to find a way to make the value of the Fut2Min column in each row equal the Price at the row with timestamp + 120000 (2 minutes into the future), or null (NaN or whatever) if the corresponding timestamp doesn't exist.
For the example data, the DF should be updated to:
(Code used to mimic desired result)
d = {'TimeStamp': [1603822620000, 1603822680000, 1603822740000, 1603822800000, 1603823040000,1603823100000,1603823160000,1603823220000],
'Price': [101,105,102,108,105,101,106,111],
'OtherData1': [1,2,3,4,5,6,7,8],
'OtherData2': [8,7,6,5,4,3,2,1],
'Fut2Min':[102,108,'NaN','NaN',106,111,'NaN','NaN']}
df= pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102
1 1603822680000 105 2 7 108
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106
5 1603823100000 101 6 3 111
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
Assuming that the DataFrame is:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 0
1 1603822680000 105 2 7 0
2 1603822740000 102 3 6 0
3 1603822800000 108 4 5 0
4 1603823040000 105 5 4 0
5 1603823100000 101 6 3 0
6 1603823160000 106 7 2 0
7 1603823220000 111 8 1 0
Then, if you use pandas.DataFrame.apply with axis=1 (i.e. applying the function to each row):
import pandas as pd

def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if futTimeStamp in df.TimeStamp.values:
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

df['Fut2Min'] = df.apply(Fut2MinFunc, axis=1)
You will get exactly what you describe as:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102.0
1 1603822680000 105 2 7 108.0
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106.0
5 1603823100000 101 6 3 111.0
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
EDIT 2: I have updated the solution since it had some sloppy parts (I replaced the list used for index determination with a dictionary and restricted the search for timestamps).
This (with import numpy as np)
indices = {ts - 120000: i for i, ts in enumerate(df['TimeStamp'])}
df['Fut2Min'] = [
    np.nan
    if (ts + 120000) not in df['TimeStamp'].values[i:]
    else df['Price'].iloc[indices[ts]]
    for i, ts in enumerate(df['TimeStamp'])
]
gives you
TimeStamp Price Fut2Min
0 1603822620000 101 102.0
1 1603822680000 105 108.0
2 1603822740000 102 NaN
3 1603822800000 108 NaN
4 1603823040000 105 106.0
5 1603823100000 101 111.0
6 1603823160000 106 NaN
7 1603823220000 111 NaN
But I'm not sure if that is an optimal solution.
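As an aside (my addition, not one of the original solutions), a fully vectorized sketch is possible by looking up each shifted timestamp with map, assuming the timestamps are unique:
# look up the price at TimeStamp + 2 minutes; timestamps with no match become NaN
price_by_ts = df.set_index('TimeStamp')['Price']
df['Fut2Min'] = (df['TimeStamp'] + 120000).map(price_by_ts)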
EDIT: Inspired by the discussion in the comments I did some timing:
With the sample frame
from itertools import accumulate
import numpy as np
rng = np.random.default_rng()
n = 10000
timestamps = [1603822620000 + t
              for t in accumulate(rng.integers(1, 4) * 60000
                                  for _ in range(n))]
df = pd.DataFrame({'TimeStamp': timestamps, 'Price': n * [100]})
TimeStamp Price
0 1603822680000 100
... ... ...
9999 1605030840000 100
[10000 rows x 2 columns]
and the two test functions
# (1) Other solution
def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if futTimeStamp in df.TimeStamp.values:
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

def test_1():
    df['Fut2Min'] = df.apply(Fut2MinFunc, axis=1)

# (2) Solution here
def test_2():
    indices = list(df['TimeStamp'] - 120000)
    df['Fut2Min'] = [
        np.nan
        if (timestamp + 120000) not in df['TimeStamp'].values
        else df['Price'].iloc[indices.index(timestamp)]
        for timestamp in df['TimeStamp']
    ]
I conducted the experiment
from timeit import timeit
t1 = timeit('test_1()', number=100, globals=globals())
t2 = timeit('test_2()', number=100, globals=globals())
print(t1, t2)
with the result
135.98962861 40.306039344
which seems to imply that the version here is faster? (I also measured directly with time() and without the wrapping in functions and the results are virtually identical.)
With my updated version the result looks like
139.33713767799998 14.178187169000012
I finally did one try with a frame with 1,000,000 rows (number=1) and the result was
763.737430931 175.73120002400003

Pandas calculate aggregate value with respect to current row

Let's say we have this data:
df = pd.DataFrame({
'group_id': [100,100,100,101,101,101,101],
'amount': [30,40,10,20,25,80,40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
It looks like this:
amount
group_id id
100 0 30
1 40
2 10
101 3 20
4 25
5 80
6 40
The goal is to compute a new column that is the sum of all amounts in the same group that are less than the current one, i.e. we want this result:
amount sum_of_smaller_amounts
group_id id
100 0 30 10
1 40 40 # 30 + 10
2 10 0 # smallest amount
101 3 20 0 # smallest
4 25 20
5 80 85 # 20 + 25 + 40
6 40 45 # 20 + 25
Ideally this should be (very) efficient as the real dataframe could be millions of rows.
Better solution (I think):
df_sort = df.sort_values('amount')
df['sum_smaller_amount'] = (df_sort.groupby('group_id')['amount']
                                   .transform(lambda x: x.mask(x.duplicated(), 0).cumsum()) -
                            df['amount'])
Output:
amount sum_smaller_amount
group_id id
100 0 30 10.0
1 40 40.0
2 10 0.0
101 3 20 0.0
4 25 20.0
5 80 85.0
6 40 45.0
Another way to do this is to use a cartesian product and filter:
df.merge(df.reset_index(), on='group_id', suffixes=('_sum_smaller',''))\
  .query('amount_sum_smaller < amount')\
  .groupby(['group_id','id'])[['amount_sum_smaller']].sum()\
  .join(df, how='right').fillna(0)
Output:
amount_sum_smaller amount
group_id id
100 0 10.0 30
1 40.0 40
2 0.0 10
101 3 0.0 20
4 20.0 25
5 85.0 80
6 45.0 40
You want sort_values and cumsum:
df['new_amount'] = (df.sort_values('amount')
                      .groupby(level='group_id')['amount']
                      .cumsum() - df['amount'])
Output:
amount new_amount
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
Update: fix for repeated values:
# the data
df = pd.DataFrame({
'group_id': [100,100,100,100,101,101,101,101],
'amount': [30,40,10,30,20,25,80,40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
# sort values:
df_sorted = df.sort_values('amount')
# cumsum
s1 = df_sorted.groupby('group_id')['amount'].cumsum()
# value counts
s2 = df_sorted.groupby(['group_id', 'amount']).cumcount() + 1
# instead of just subtracting df['amount'], we subtract amount * counts
df['new_amount'] = s1 - df['amount'].mul(s2)
Output (note the two values 30 in group 100)
amount new_amount
group_id id
100 0 30 10
1 40 70
2 10 0
3 30 10
101 4 20 0
5 25 20
6 80 85
7 40 45
I'm intermediate on pandas and not sure about efficiency, but here's a solution:
temp_df = df.sort_values(['group_id','amount'])
temp_df = temp_df.mask(temp_df['amount'] == temp_df['amount'].shift(), other=0).groupby(level='group_id').cumsum()
df['sum'] = temp_df.sort_index(level='id')['amount'] - df['amount']
Result:
amount sum
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
7 40 45
You can substitute the last line with these if they help efficiency somehow:
df['sum'] = df.subtract(temp_df).multiply(-1)
# or
df['sum'] = (~df).add(temp_df + 1)
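As a sanity check (my addition, not from any of the answers above), here is a brute-force version that directly encodes "sum of the group's amounts strictly less than the current one". It is quadratic per group, so use it only to verify the faster solutions on small data:
# brute force: for each amount, sum the amounts in the same group that are strictly smaller
df['sum_smaller_check'] = (df.groupby(level='group_id')['amount']
                             .transform(lambda s: s.apply(lambda v: s[s < v].sum())))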

Create an indicator column based on one column being within +/- 5% of another column

I would like to populate the 'Indicator' column based on both charge columns. If 'Charge1' is within plus or minus 5% of the 'Charge2' value, set the 'Indicator' to RFP, otherwise leave it blank (see example below).
ID Charge1 Charge2 Indicator
1 9.5 10 RFP
2 22 20
3 41 40 RFP
4 65 80
5 160 160 RFP
6 315 320 RFP
7 613 640 RFP
8 800 700
9 759 800
10 1480 1500 RFP
I tried using a .loc approach, but struggled to establish if 'Charge1' was within +/- 5% of 'Charge2'.
In [190]: df.loc[df.eval("Charge2*0.95 <= Charge1 <= Charge2*1.05"), 'RFP'] = 'RFP'
In [191]: df
Out[191]:
ID Charge1 Charge2 RFP
0 1 9.5 10 RFP
1 2 22.0 20 NaN
2 3 41.0 40 RFP
3 4 65.0 80 NaN
4 5 160.0 160 RFP
5 6 315.0 320 RFP
6 7 613.0 640 RFP
7 8 800.0 700 NaN
8 9 759.0 800 NaN
9 10 1480.0 1500 RFP
Pretty simple: create an 'indicator' series of booleans which depends on the percentage difference between Charge1 and Charge2.
df = pd.read_clipboard()
threshold = 0.05
indicator = ( (df['Charge1'] / df['Charge2']) - 1).abs() <= threshold
df.loc[indicator]
Set a threshold figure and compare the values against it.
Wherever the value is within the threshold the comparison returns True, so you can directly use the indicator (a boolean series) as an input to .loc.
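If you also want the textual 'Indicator' column from the question rather than a filtered frame, a small follow-up sketch (assuming numpy is imported as np) is:
# turn the boolean indicator into the RFP/blank column from the question
df['Indicator'] = np.where(indicator, 'RFP', '')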
Try
cond = ((df['Charge2'] - df['Charge1'])/df['Charge2']*100).abs() <= 5
df['Indicator'] = np.where(cond, 'RFP', np.nan)
ID Charge1 Charge2 Indicator
0 1 9.5 10 RFP
1 2 22.0 20 nan
2 3 41.0 40 RFP
3 4 65.0 80 nan
4 5 160.0 160 RFP
5 6 315.0 320 RFP
6 7 613.0 640 RFP
7 8 800.0 700 nan
8 9 759.0 800 nan
9 10 1480.0 1500 RFP
You can use pct_change:
df[['Charge2','Charge1']].T.pct_change().dropna().T.abs().mul(100).astype(int)<=(5)
Out[245]:
Charge1
0 True
1 False
2 True
3 False
4 True
5 True
6 True
7 False
8 True
9 True
Be very careful!
In Python floating-point arithmetic, 9.5/10 - 1 == -0.050000000000000044.
This is one way to explicitly account for this issue via numpy:
import numpy as np
vals = np.abs(df.Charge1.values / df.Charge2.values - 1)
cond1 = vals <= 0.05
cond2 = np.isclose(vals, 0.05, atol=1e-08)
df['Indicator'] = np.where(cond1 | cond2, 'RFP', '')
