I have two dataframes: one shows buys and the other shows sells. I need to pull the sale date for each buy lot. Sometimes a buy is sold across different sale lots, and I need to be able to split the shares in that case (or, if that's not possible, just pull the sell date without splitting). This is what I have:
df1 = pd.DataFrame({'ID': ['AAA', 'AAA', 'AAA', 'BBB', 'CCC'],
                    'Buydate': ['2017-04-13', '2019-12-31', '2019-03-05', '2018-11-04', '2019-12-31'],
                    'Quantity': [100.00, 2000.00, 385.95, 214514.00, 63205.00]})
df2 = pd.DataFrame({'ID': ['AAA', 'AAA', 'BBB'],
                    'Selldate': ['2020-01-25', '2020-10-25', '2020-12-19'],
                    'Quantity': [500.00, 1985.95, 214714.00]})
Output is:
df1
ID Buydate Quantity
0 AAA 2017-04-13 100.00
1 AAA 2019-12-31 2000.00
2 AAA 2019-03-05 385.95
3 BBB 2018-11-04 214514.00
4 CCC 2019-12-31 63205.00
df2
ID Selldate Quantity
0 AAA 2020-01-25 500.00
1 AAA 2020-10-25 1985.95
2 BBB 2020-12-19 214714.00
First I added a cumsum column; then I planned to loop over each group of df1 and look up df2 by ID: if the share count is less than the quantity of the first lot in df2, I use the original quantity from df1; if it's over, I take the remaining quantity and continue with the second lot of df2. I guess I need a concat at some point.
The ideal result is:
ID Buydate Quantity SplitQuantity Selldate
0 AAA 2017-04-13 100.00 100.00 2020-01-25
1 AAA 2019-03-05 385.95 385.95 2020-01-25
2 AAA 2019-12-31 2000.00 14.05 2020-01-25
3 AAA 2019-12-31 2000.00 1985.95 2020-10-25
4 BBB 2018-11-04 214514.00 214514.00 2020-12-19
5 CCC 2019-12-31 63205.00 NaN NaT
This solution is a little messy, but what you're asking is a little complicated, so here comes a working prototype:
# Sort values by date.
df1 = df1.sort_values(by='Buydate').reset_index()
# id_jump is used to skip buy lots that were already consumed by earlier sales.
id_jump = {}
for id_ in df1['ID']:
    id_jump[id_] = 0
new_index = ['ID', 'Buydate', 'Quantity', 'SplitQuantity', 'Selldate']
new_df = []
# For every sale in df2, subtract its quantity from the buy lots in df1 with the same ID.
for index, row in df2.iterrows():
    sum_ = row['Quantity']
    for index2, row2 in df1[df1['ID'] == row['ID']].iterrows():
        if index2 < id_jump[row['ID']]:
            # Skip buy lots already consumed by previous sales.
            continue
        if sum_ > row2['Quantity']:
            # The sale covers this whole buy lot; consume it and keep going.
            sub = row2['Quantity']
            sum_ = sum_ - row2['Quantity']
            id_jump[row['ID']] += 1
            new_df.append(
                [row2['ID'], row2['Buydate'], row2['Quantity'], sub, row['Selldate']])
        else:
            # The sale ends inside this buy lot; record the remainder and stop.
            id_jump[row['ID']] += 1
            new_df.append(
                [row2['ID'], row2['Buydate'], row2['Quantity'], sum_, row['Selldate']])
            break
df3 = pd.DataFrame(new_df, columns=new_index)
# Append the rows of IDs that were never sold (here 'CCC').
for id_ in df1['ID']:
    if id_jump[id_] == 0:
        df4 = pd.concat([df3, df1[df1['ID'] == id_]]).drop(columns='index').reset_index()
print(df4)
# ID Buydate Quantity SplitQuantity Selldate
# 0 AAA 2017-04-13 100.00 100.00 2020-01-25
# 1 AAA 2019-03-05 385.95 385.95 2020-01-25
# 2 AAA 2019-12-31 2000.00 14.05 2020-01-25
# 3 AAA 2019-12-31 2000.00 1985.95 2020-10-25
# 4 BBB 2018-11-04 214514.00 214514.00 2020-12-19
# 5 CCC 2019-12-31 63205.00 NaN NaN
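For what it's worth, here is an alternative sketch of mine (not part of the prototype above) that matches buy and sell lots per ID by overlapping cumulative-quantity intervals; match_lots is a hypothetical helper name, and it assumes FIFO matching with buys and sells sorted by date. On the sample data it reproduces the ideal result, leaving SplitQuantity/Selldate empty for unsold lots:
import pandas as pd

def match_lots(buys, sells):
    # FIFO-match one ID: each buy lot and each sell lot occupies a
    # cumulative-quantity interval; the overlap of the intervals is the split quantity.
    rows = []
    b_hi = buys['Quantity'].cumsum()
    b_lo = b_hi - buys['Quantity']
    s_hi = sells['Quantity'].cumsum()
    s_lo = s_hi - sells['Quantity']
    for (_, b), blo, bhi in zip(buys.iterrows(), b_lo, b_hi):
        matched = False
        for (_, s), slo, shi in zip(sells.iterrows(), s_lo, s_hi):
            overlap = min(bhi, shi) - max(blo, slo)
            if overlap > 1e-9:
                rows.append([b['ID'], b['Buydate'], b['Quantity'], overlap, s['Selldate']])
                matched = True
        if not matched:
            # Buy lot never sold.
            rows.append([b['ID'], b['Buydate'], b['Quantity'], None, None])
    return pd.DataFrame(rows, columns=['ID', 'Buydate', 'Quantity', 'SplitQuantity', 'Selldate'])

result = pd.concat(
    [match_lots(group.sort_values('Buydate'),
                df2[df2['ID'] == key].sort_values('Selldate'))
     for key, group in df1.groupby('ID')],
    ignore_index=True,
)
print(result)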
I have two dataframes:
EDIT:
df1 = pd.DataFrame(index = [0,1,2], columns=['timestamp', 'account_id', 'order_id', 'USD', 'CAD'])
df1['timestamp']=['2022-01-01','2022-01-02','2022-01-03']
df1['account_id']=['usdcad','usdcad','usdcad']
df1['order_id']=['11233123','12313213','12341242']
df1['USD'] = [1,2,3]
df1['CAD'] = [4,5,6]
df1:
timestamp account_id order_id USD CAD
0 2022-01-01 usdcad 11233123 1 4
1 2022-01-02 usdcad 12313213 2 5
2 2022-01-03 usdcad 12341242 3 6
df2 = pd.DataFrame(index = [0,1], columns = ['timestamp','account_id', 'currency','balance'])
df2['timestamp']=['2021-12-21','2021-12-21']
df2['account_id']=['usdcad','usdcad']
df2['currency'] = ['USD', 'CAD']
df2['balance'] = [2,3]
df2:
timestamp account_id currency balance
0 2021-12-21 usdcad USD 2
1 2021-12-21 usdcad CAD 3
I would like to add a row to df1 at index 0, and fill that row with the balance of df2 based on currency. So the final df should look like this:
df:
timestamp account_id order_id USD CAD
0 0 0 0 2 3
1 2022-01-01 usdcad 11233123 1 4
2 2022-01-02 usdcad 12313213 2 5
3 2022-01-03 usdcad 12341242 3 6
How can I do this in a pythonic way? Thank you
Set the index of df2 to currency, transpose so the index becomes columns, then append df1 to that dataframe:
df_out = df2.set_index('currency').T.append(df1, ignore_index=True).fillna(0)
print(df_out)
USD CAD order_id
0 2 3 0
1 1 4 11233123
2 2 5 12313213
3 3 6 12341242
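A side note of mine, not part of the answer above: DataFrame.append was removed in pandas 2.0, so on current pandas the same idea needs pd.concat. The sketch below also keeps only the balance row of the transposed frame (an assumption on my part, so that the prepended row carries just the USD/CAD balances as in the desired output):
# pandas >= 2.0 sketch: transpose only the balance column of df2, then prepend it to df1.
first_row = df2.set_index('currency')[['balance']].T   # one row, columns USD / CAD
df_out = pd.concat([first_row, df1], ignore_index=True).fillna(0)
print(df_out)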
I have a dataframe:
df
Name Date ID Amount
0 Faye 2018-12-31 A 2
1 Faye 2019-03-31 A 1
2 Faye 2019-06-30 B 5
3 Faye 2019-09-30 B 2
4 Faye 2019-09-30 A 4
5 Faye 2020-03-31 A 1
6 Mike 2018-12-31 A 4
7 Mike 2019-03-31 B 2
8 Mike 2019-03-31 C 1
9 Mike 2019-06-30 A 3
And for each Name, Date, ID group I calculate the % change of Amount from the previous Date in a new column. If there was no previous entry, I add New:
df['% Change'] = (df.sort_values('Date').groupby(['Name', 'ID']).Amount.pct_change())
df['% Change'] = df['% Change'].fillna('New')
But if the ID disappears for a few Dates (e.g., ID A is present for Faye in 2018-12-31 and 2019-03-31, but disappears for the 2019-06-30 period), I want the next time it appears to show as New again so that the output looks like:
Name Date ID Amount % Change
0 Faye 2018-12-31 A 2 New
1 Faye 2019-03-31 A 1 -0.5
2 Faye 2019-06-30 B 5 New
3 Faye 2019-09-30 B 2 -0.6
4 Faye 2019-09-30 A 4 New
5 Faye 2020-03-31 A 1 New
6 Mike 2018-12-31 A 4 New
7 Mike 2019-03-31 B 2 New
8 Mike 2019-03-31 C 1 New
9 Mike 2019-06-30 A 3 New
How do I achieve this?
I think your expected result contains an error; sorting it makes this easier to see:
>>> expected.sort_values(['Name', 'ID', 'Date'])
Name Date ID Amount %_Change
0 Faye 2018-12-31 A 2 New
1 Faye 2019-03-31 A 1 -0.5
4 Faye 2019-09-30 A 4 New
5 Faye 2020-03-31 A 1 -0.75 <-- shouldn't this be "New" since 2019-12-31 was missing?
2 Faye 2019-06-30 B 5 New
3 Faye 2019-09-30 B 2 -0.6
6 Mike 2018-12-31 A 4 New
9 Mike 2019-06-30 A 3 New
7 Mike 2019-03-31 B 2 New
8 Mike 2019-03-31 C 1 New
With that said, you can write a custom percent_change function that resamples each (Name, ID) group to quarterly before calculating the percentage change:
def percent_change(group):
    amount = group[['Date', 'Amount']].set_index('Date')
    return amount.pct_change(fill_method=None, freq='Q').rename(columns={'Amount': '% Change'})

result = (
    df.sort_values('Date')
    .groupby(['Name', 'ID'])
    .apply(percent_change)
    .fillna('New')
    .merge(df, how='right', on=['Name', 'ID', 'Date'])
)
Result:
Name ID Date % Change Amount
0 Faye A 2018-12-31 New 2
1 Faye A 2019-03-31 -0.5 1
2 Faye A 2019-09-30 New 4
3 Faye A 2020-03-31 New 1
4 Faye B 2019-06-30 New 5
5 Faye B 2019-09-30 -0.6 2
6 Mike A 2018-12-31 New 4
7 Mike A 2019-06-30 New 3
8 Mike B 2019-03-31 New 2
9 Mike C 2019-03-31 New 1
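One caveat from me, not from the original answer: pandas 2.2+ deprecates the 'Q' frequency alias in favour of 'QE', so on recent versions the helper may need to read:
def percent_change(group):
    amount = group[['Date', 'Amount']].set_index('Date')
    # 'QE' (quarter end) replaces the deprecated 'Q' alias on pandas >= 2.2
    return amount.pct_change(fill_method=None, freq='QE').rename(columns={'Amount': '% Change'})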
A faster version
For the record, I randomly generated a 2M-row dataframe with the following code:
import string
import numpy as np

n = 2_000_000
qend = pd.date_range('2000-01-01', '2019-12-31', freq='Q')

np.random.seed(42)
names = list(map(''.join, np.random.choice(list(string.ascii_uppercase), (n, 3))))
dates = np.random.choice(qend, n)
ids = np.random.choice(list(string.ascii_uppercase), n)
amounts = np.random.randint(1, 100, n)

df = pd.DataFrame({
    'Name': names,
    'Date': dates,
    'ID': ids,
    'Amount': amounts
})
The approach below assumes all dates are quarter-ends and of Timestamp data type. Explanations are provided in the comments:
# Make a sorted copy of the original dataframe
tmp = df.sort_values(['Name', 'ID', 'Date'])
# When we call a GroupBy, we lose the original index
# so let's keep a copy here
tmp['OriginalIndex'] = tmp.index
# Calculate day difference between consecutive rows.
# It is a lot faster than `groupby(...)['Date'].diff()`
# but it gives wrong result for the first row of each
# group. The first row of each group should be NaN. It's
# an easy fix and we will deal with it later
tmp['DayDiff'] = tmp['Date'].diff() / pd.Timedelta(days=1)
# This has the same problem as `DayDiff` above but you will
# see that it's irrelevant to our problem
tmp['% Change'] = tmp['Amount'].pct_change()
# The index of the first row in each group
first_indexes = tmp.groupby(['Name', 'ID'])['OriginalIndex'].first()
# Fix the issue in `DayDiff`: the first row of each group
# should be NaN
tmp.loc[first_indexes, 'DayDiff'] = np.nan
# Now this is the key to the whole problem: a quarter lasts a
# maximum of 92 days. If `DayDiff <= 92`, the percentage change
# formula applies. Otherwise, `DayDiff` is either NaN or >92.
# In both cases, the percentage change is NaN.
pct_change = tmp['% Change'].where(tmp['DayDiff'] <= 92, np.nan).fillna('New')
# Assign the result back to frame
df['% Change'] = pct_change
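To make the 92-day cut-off concrete, here is a small check of mine on quarter-ends taken from the question's data:
# Gap spanning a missing quarter: 183 days, so the row is marked 'New'.
print((pd.Timestamp('2019-09-30') - pd.Timestamp('2019-03-31')) / pd.Timedelta(days=1))  # 183.0
# Adjacent quarter-ends: 92 days, so the percentage change is kept.
print((pd.Timestamp('2019-09-30') - pd.Timestamp('2019-06-30')) / pd.Timedelta(days=1))  # 92.0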
I have a dataframe like the following (here a subset):
df1
ID zone date
0 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
1 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
2 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
3 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
4 6a93b747472484e41f969a0ac02b96161eb0af9edb1fe4... 01529224 2020-01-01
If I count the distinct IDs per day, I have:
tmp = df1.groupby(['date']).agg({"ID": pd.Series.nunique}).reset_index()
tmp.head()
date ID
0 2019-12-31 4653
1 2020-01-01 6656
2 2020-01-02 1
Now if I group by zone and date I have the following:
distinctID = df1.groupby(['date', "zone"]).agg({"ID": pd.Series.nunique}).reset_index()
date zone ID
0 2019-12-31 00023901 1
1 2019-12-31 00025441 2
2 2019-12-31 00025442 2
3 2019-12-31 00025443 3
4 2019-12-31 00025444 2
If I now sum these counts for each day, I have:
tmp1 = distinctID.groupby(['date']).agg({"ID": 'sum'}).reset_index()
tmp1.head()
date ID
0 2019-12-31 5833
1 2020-01-01 11837
2 2020-01-02 1
Why do I not get the same count for each day?
The problem is that the two computations are not the same. I changed the data to make it easier to see:
print (df1)
date zone ID
0 2019-12-31 23901 a
0 2019-12-31 23901 b
0 2019-12-31 25441 b
1 2019-12-31 25441 a
2 2019-12-31 25442 a
#only 2 unique values per date
tmp = df1.groupby(['date']).agg({"ID": pd.Series.nunique}).reset_index()
print (tmp)
date ID
0 2019-12-31 2 <-a, b
# if grouped by 2 columns there are more unique values, because each zone is counted separately
distinctID = df1.groupby(['date', "zone"]).agg({"ID": pd.Series.nunique}).reset_index()
print (distinctID)
date zone ID
0 2019-12-31 23901 2 <-a, b
1 2019-12-31 25441 2 <-a, b
2 2019-12-31 25442 1 <-a
# the sum is different, because the unique values were counted per (date, zone) pair
tmp1 = distinctID.groupby(['date']).agg({"ID": 'sum'}).reset_index()
print (tmp1)
date ID
0 2019-12-31 5 <-a, b, a, b, a
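In other words, the totals diverge whenever the same ID appears in more than one zone on the same day. A quick way to list the IDs that get counted multiple times (my sketch, using the same toy data):
multi_zone = (df1.drop_duplicates(['date', 'zone', 'ID'])
                 .groupby(['date', 'ID'])['zone']
                 .nunique())
print(multi_zone[multi_zone > 1])  # here both 'a' and 'b' span several zones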
I'd like to copy or duplicate the rows of a DataFrame based on the value of a column, in this case orig_qty. So, using pandas==0.24.2, if I have this DataFrame:
import pandas as pd

d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(
    d,
    orient='index',
    columns=['date', 'orig_qty', 'price']
)
Input
>>> print(df)
date orig_qty price
a 2019-04-08 4 115.0
b 2019-04-09 2 103.0
So in the example above the row with orig_qty=4 should be duplicated 4 times and the row with orig_qty=2 should be duplicated 2 times. After this transformation I'd like a DataFrame that looks like:
Desired Output
>>> print(new_df)
date orig_qty price fifo_qty
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-08 4 115.0 1
5 2019-04-09 2 103.0 1
6 2019-04-09 2 103.0 1
Note I do not really care about the index after the transformation. I can elaborate more on the use case for this, but essentially I'm doing some FIFO accounting where important changes can occur between values of orig_qty.
Use Index.repeat, DataFrame.loc, DataFrame.assign and DataFrame.reset_index
new_df = df.loc[df.index.repeat(df['orig_qty'])].assign(fifo_qty=1).reset_index(drop=True)
[output]
date orig_qty price fifo_qty
0 2019-04-08 4 115.0 1
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-09 2 103.0 1
5 2019-04-09 2 103.0 1
Use np.repeat:
import numpy as np
new_df = pd.DataFrame({col: np.repeat(df[col], df.orig_qty) for col in df.columns})
I made a game and got the players' data like this:
StartTime Id Rank Score
2018-04-24 08:46:35.684000 aaa 1 280
2018-04-24 23:54:47.742000 bbb 2 176
2018-04-25 15:28:36.050000 ccc 1 223
2018-04-25 00:13:00.120000 aaa 4 79
2018-04-26 04:59:36.464000 ddd 1 346
2018-04-26 06:01:17.728000 fff 2 157
2018-04-27 04:57:37.701000 ggg 4 78
but I want to group it by day, just like this:
Date 2018/4/24 2018/4/25 2018/4/26 2018/4/27
ID aaa ccc ddd ggg
bbb aaa fff NaN
How do I group by date with pandas?
Use set_index and cumcount:
df.set_index([df['StartTime'].dt.floor('D'),
              df.groupby(df['StartTime'].dt.floor('D')).cumcount()])['Id'].unstack(0)
Output:
StartTime 2018-04-24 2018-04-25 2018-04-26 2018-04-27
0 aaa ccc ddd ggg
1 bbb aaa fff NaN
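Note of mine: this relies on the .dt accessor, so StartTime must already be a datetime column; if it was read in as strings, convert it first:
df['StartTime'] = pd.to_datetime(df['StartTime'])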
You can use cumcount to align the index within each group, then concat to concatenate the series.
# normalize to zero out time
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.normalize()
# get unique days and make index count by group
cols = df['StartTime'].unique()
df.index = df.groupby('StartTime').cumcount()
# concatenate list comprehension of series
res = pd.concat([df.loc[df['StartTime'] == i, 'Id'] for i in cols], axis=1)
res.columns = cols
print(res)
2018-04-24 2018-04-25 2018-04-26 2018-04-27
0 aaa ccc ddd ggg
1 bbb aaa fff NaN
Performance
For smaller dataframes, use @ScottBoston's more succinct solution. For larger dataframes, concat seems to scale better than unstack:
def scott(df):
    df['StartTime'] = pd.to_datetime(df['StartTime'])
    return df.set_index([df['StartTime'].dt.floor('D'),
                         df.groupby(df['StartTime'].dt.floor('D')).cumcount()])['Id'].unstack(0)

def jpp(df):
    df['StartTime'] = pd.to_datetime(df['StartTime']).dt.normalize()
    cols = df['StartTime'].unique()
    df.index = df.groupby('StartTime').cumcount()
    res = pd.concat([df.loc[df['StartTime'] == i, 'Id'] for i in cols], axis=1)
    res.columns = cols
    return res
df2 = pd.concat([df]*100000)
%timeit scott(df2) # 1 loop, best of 3: 681 ms per loop
%timeit jpp(df2) # 1 loop, best of 3: 271 ms per loop
import pandas as pd
df = pd.DataFrame({'StartTime': ['2018-04-01 15:25:11', '2018-04-04 16:25:11', '2018-04-04 15:27:11'], 'Score': [10, 20, 30]})
print(df)
This yields
Score StartTime
0 10 2018-04-01 15:25:11
1 20 2018-04-04 16:25:11
2 30 2018-04-04 15:27:11
Now we create a new column, based on the StartTime column, that contains only the date:
df['Date'] = df['StartTime'].apply(lambda x: x.split(' ')[0])
print(df)
Output:
Score StartTime Date
0 10 2018-04-01 15:25:11 2018-04-01
1 20 2018-04-04 16:25:11 2018-04-04
2 30 2018-04-04 15:27:11 2018-04-04
We can now use the pd.DataFrame.groupby method to group the rows by the values of the new Date column. In the example below, I first group the rows and then iterate over the groups to print the name (the value of the Date column for that group) and the mean score achieved:
for name, group in df.groupby('Date'):
    print(name)
    print(group)
    print(group['Score'].mean())
Gives:
2018-04-01
Score StartTime Date
0 10 2018-04-01 15:25:11 2018-04-01
10.0
2018-04-04
Score StartTime Date
1 20 2018-04-04 16:25:11 2018-04-04
2 30 2018-04-04 15:27:11 2018-04-04
25.0
Edit: Since you initially did not provide the dataframe data in table format, I leave it as an exercise to you to adapt the dataframe in my answer ;-)
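As a small addendum of mine: if StartTime is converted to a real datetime column, the helper Date column can be skipped and the grouping done directly. A minimal sketch:
# Group by calendar day without a helper column (assumes StartTime parses as a datetime).
df['StartTime'] = pd.to_datetime(df['StartTime'])
daily_mean = df.groupby(df['StartTime'].dt.date)['Score'].mean()
print(daily_mean)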