I am trying to find the date diff between my anchor date and the other dates grouping by ID.
Input
ID Date Anchor Date
123 1/5/2018 N
123 4/10/2018 N
123 5/8/2018 Y
123 10/12/2018 N
234 1/4/2018 N
234 1/4/2018 N
234 1/4/2018 Y
456 5/6/2018 N
456 5/6/2018 N
456 5/10/2018 N
456 6/1/2018 Y
567 3/2/2018 N
567 3/2/2018 N
567 3/2/2018 Y
Expected Output:
ID Date Anchor Date Diff
123 1/5/2018 N -123
123 4/10/2018 N -28
123 5/8/2018 Y 0
123 10/12/2018 N 157
234 1/4/2018 N 0
234 1/4/2018 N 0
234 1/4/2018 Y 0
456 5/6/2018 N -26
456 5/6/2018 N -26
456 5/10/2018 N -22
456 6/1/2018 Y 0
567 3/2/2018 N 0
567 3/2/2018 N 0
567 3/2/2018 Y 0
Code Attempt
import pandas as pd
import numpy as np  # needed for np.timedelta64 below

df = pd.read_csv('data.csv')  # placeholder path for your data file
df['Date'] = df.groupby('ID')['Date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('ID')['Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
The error I am receiving is "incompatible index of inserted column with frame index."
Secondly, I am not sure how to incorporate the Anchor Date column to ensure it is used as time zero.
First you need to convert Date into datetime type (the error itself comes from the groupby(...).apply(lambda x: x.sort_values()) line, which returns a Series whose index no longer matches the frame's, so pandas cannot assign it back):
df['Date'] = pd.to_datetime(df['Date'])
After that, you can extract the index of each group's anchor row with idxmax, then use loc to extract the actual anchor dates:
idx = df['Anchor Date'].eq('Y').groupby(df['ID']).transform('idxmax')
df['Diff'] = (df['Date'] - df.loc[idx, 'Date'].values) / np.timedelta64(1, 'D')
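A minimal equivalent sketch (my own variation, not from the original answer): mask the non-anchor dates, broadcast each group's anchor with transform('first'), and take the difference in whole days:
# keep only the anchor row's date (NaT elsewhere), then spread it per ID
anchor = df['Date'].where(df['Anchor Date'].eq('Y')).groupby(df['ID']).transform('first')
df['Diff'] = (df['Date'] - anchor).dt.days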
Another way is to extract the anchor dates with boolean indexing and map them back onto each row by ID:
anchor_dates = df.loc[df['Anchor Date']=='Y', ['ID','Date']].set_index('ID')['Date']
df['Diff'] = (df['Date'] - df['ID'].map(anchor_dates)) / np.timedelta64(1, 'D')
Output:
ID Date Anchor Date Diff
0 123 2018-01-05 N -123.0
1 123 2018-04-10 N -28.0
2 123 2018-05-08 Y 0.0
3 123 2018-10-12 N 157.0
4 234 2018-01-04 N 0.0
5 234 2018-01-04 N 0.0
6 234 2018-01-04 Y 0.0
7 456 2018-05-06 N -26.0
8 456 2018-05-06 N -26.0
9 456 2018-05-10 N -22.0
10 456 2018-06-01 Y 0.0
11 567 2018-03-02 N 0.0
12 567 2018-03-02 N 0.0
13 567 2018-03-02 Y 0.0
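As a quick sanity check (my own addition), the anchor rows should always come out at zero, whichever approach you use:
assert df.loc[df['Anchor Date'].eq('Y'), 'Diff'].eq(0).all()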
Related
I have a dataframe of the following structure:
import pandas as pd
df = pd.DataFrame({'x': [1, 5, 8, 103, 105, 112],
                   'date': pd.DatetimeIndex(('2022-02-01', '2022-02-03', '2022-02-06',
                                             '2022-02-05', '2022-02-05', '2022-02-07'))})
x date
0 1 2022-02-01
1 5 2022-02-03
2 8 2022-02-06
3 103 2022-02-05
4 105 2022-02-05
5 112 2022-02-07
How can I add a new column y that contains x if x < 100, and otherwise the x-value of the row with the nearest earlier date, taken from the subset where x < 100?
What I currently have is this code. It works, but doesn't look very efficient:
df['y'] = df.x
df_ref = df.loc[df.x < 100].sort_values('date').copy()
df_ref.set_index('x', inplace=True)
for ix, row in df.iterrows():
    if row.x >= 100:
        delta = row.date - df_ref.date
        delta_gt = delta.loc[delta > pd.Timedelta(0)]
        if delta_gt.size > 0:
            df.loc[ix, 'y'] = delta_gt.idxmin()
x date y
0 1 2022-02-01 1
1 5 2022-02-03 5
2 8 2022-02-06 8
3 103 2022-02-05 5
4 105 2022-02-05 5
5 112 2022-02-07 8
Sort by date, mask the values at or above 100 and forward-fill, then restore the original order by sorting on the index:
(df.sort_values(by='date')
   .assign(y=df['x'].mask(df['x'].ge(100)))
   .assign(y=lambda d: d['y'].ffill())
   .sort_index()
)
Output:
x date y
0 1 2022-02-01 1
1 5 2022-02-03 5
2 8 2022-02-06 8
3 103 2022-02-05 5
4 105 2022-02-05 5
5 112 2022-02-07 8
We can use merge_asof (note that both inputs must be sorted on the on key; the sort_values('date') below takes care of that, and the filtered right-hand frame inherits the order):
#df.date = pd.to_datetime(df.date)
df = df.sort_values('date')
out = pd.merge_asof(df,
                    df[df['x'] < 100].rename(columns={'x': 'y'}),
                    on='date',
                    direction='backward').sort_values('x')
out
Out[160]:
x date y
0 1 2022-02-01 1
1 5 2022-02-03 5
4 8 2022-02-06 8
2 103 2022-02-05 5
3 105 2022-02-05 5
5 112 2022-02-07 8
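For illustration (my own addition), the right-hand frame that merge_asof scans is just the x < 100 rows with x renamed to y; each left row then picks the latest of these at or before its own date:
ref = df[df['x'] < 100].rename(columns={'x': 'y'})
#    y       date
# 0  1 2022-02-01
# 1  5 2022-02-03
# 2  8 2022-02-06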
If I have a dataframe like
hour x
1 122
1 133
2 234
...
24 232
How can I get a dataframe like
hour x1 x2 ... xn
1 122 133
2 234
.....
24
You could create a cumulated count per group and pivot:
(df.assign(group=df.groupby('hour').cumcount().add(1))
   .pivot(index='hour', columns='group', values='x')
   .add_prefix('x')
   .rename_axis(None)
)
Output:
group x1 x2
1 122.0 133.0
2 234.0 NaN
24 232.0 NaN
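A minimal alternative sketch (my own variation) that does the same reshape with set_index + unstack instead of pivot:
# the second index level is the within-hour running counter, so
# unstacking it spreads repeated hours across x1, x2, ... columns
out = (df.set_index(['hour', df.groupby('hour').cumcount().add(1)])['x']
         .unstack()
         .add_prefix('x'))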
I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
Merging these should update the QUANTITY values of df1 to 13, 4 and 10.
Thanks in advance for any help!
The question is not super clear, but if I understand what you're trying to do, here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0
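A more compact sketch (my own variation, assuming the key columns uniquely identify rows in each frame): index both frames by the keys and use add with fill_value, which sidesteps the _x/_y cleanup:
keys = ['ID', 'FILLER1', 'FILLER2']
out = (df1.set_index(keys)['QUANTITY']
          .add(df2.set_index(keys)['QUANTITY'], fill_value=0)
          .reset_index())
# note: keys present only in df2 would also show up in the result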
The current dataframe:
ID   Date      Start Value  Payment
111  1/1/2018  1000         0
111  1/2/2018               100
111  1/3/2018               500
111  1/4/2018               400
111  1/5/2018               0
222  4/1/2018  2000         200
222  4/2/2018               100
222  4/3/2018               700
222  4/4/2018               0
222  4/5/2018               0
222  4/6/2018               1000
222  4/7/2018               0
This is the dataframe I am trying to get. Basically, I am trying to fill the Start Value for each row. As you can see, every ID has a Start Value on the first day; each following day's Start Value = the previous day's Start Value - the previous day's Payment.
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 1000 100
111 1/3/2018 900 500
111 1/4/2018 400 400
111 1/5/2018 0 0
222 4/1/2018 2000 200
222 4/2/2018 1800 100
222 4/3/2018 1700 700
222 4/4/2018 1000 0
222 4/5/2018 1000 0
222 4/6/2018 1000 1000
222 4/7/2018 0 0
Right now, I use Excel with this formula.
Start Value = if(ID in this row == ID in last row, last row's start value - last row's payment, Start Value)
It works well, but I am wondering if I can do it in Python/pandas. Thank you.
We can use groupby with shift + cumsum: ffill propagates the initial value to every row under the same ID, and then we deduct the cumulative payment made before each row to get the remaining value at that point:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True))
Out[61]:
0 1000.0
1 1000.0
2 900.0
3 400.0
4 0.0
5 2000.0
6 1800.0
7 1700.0
8 1000.0
9 1000.0
10 1000.0
11 0.0
Name: StartValue, dtype: float64
Assign it back by adding inplace=True:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True),inplace=True)
df
Out[63]:
ID Date StartValue Payment
0 111 1/1/2018 1000.0 0
1 111 1/2/2018 1000.0 100
2 111 1/3/2018 900.0 500
3 111 1/4/2018 400.0 400
4 111 1/5/2018 0.0 0
5 222 4/1/2018 2000.0 200
6 222 4/2/2018 1800.0 100
7 222 4/3/2018 1700.0 700
8 222 4/4/2018 1000.0 0
9 222 4/5/2018 1000.0 0
10 222 4/6/2018 1000.0 1000
11 222 4/7/2018 0.0 0
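For readability, here is the same logic unrolled into steps (a sketch, equivalent under the same assumption that only each ID's first row has a StartValue):
g = df.groupby('ID')
initial = g['StartValue'].transform('first')                  # each ID's opening balance
paid = g['Payment'].transform(lambda s: s.shift().cumsum())   # payments made before each row
df['StartValue'] = df['StartValue'].fillna(initial - paid)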
I have a dataframe like below:
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5 days intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in Python, please?
Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, then you may first call assign for a copy based assignment, and then do the groupby:
df['Bin'] = df.assign(
    Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3
One way is to create an array of your date range and use numpy.digitize:
import numpy as np

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')\
                .astype(np.int64).values
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
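Since both answers anchor the bins at the earliest date, a plain-arithmetic sketch (my own addition) gives the same result on this data; note it numbers every 5-day window, whereas ngroup skips empty ones, so the two can diverge on sparser data:
# days elapsed since the first date, integer-divided into 5-day windows
start = df['Date'].min()
df['Bin'] = (df['Date'] - start).dt.days // 5 + 1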