Pandas calculating difference between rows - python

My goal is to calculate, for each Start/Begin row, the number of days until the End row of the same Id. I know I have to group them by Id, but I'm not sure how to compute the difference in Day.
I tried df['length'] = -(df.groupby('Id')['Day'].diff()), but this doesn't compare against End; it only computes the difference between consecutive rows.
df
Id Day Status
111 1 Start
111 5 End
222 2 Begin
222 7 End
333 1 Start
333 3 Begin
333 7 End
Ideal result would be:
Id Day Status Length
111 1 Start 4
111 5 End
222 2 Begin 5
222 7 End
333 1 Start 6 (since we Start on Day 1 and End on day 7)
333 3 Begin 4 (since we Begin on Day 3 and End on day 7)
333 7 End
Thank you
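For reference, a minimal sketch to rebuild the sample frame used below (assuming the columns are exactly Id, Day and Status as shown above):
import pandas as pd

df = pd.DataFrame({
    'Id': [111, 111, 222, 222, 333, 333, 333],
    'Day': [1, 5, 2, 7, 1, 3, 7],
    'Status': ['Start', 'End', 'Begin', 'End', 'Start', 'Begin', 'End'],
})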

Here's another method with groupby + transform -
v = df.groupby('Id').Day.transform('last') - df.Day
df['Length'] = v.mask(v == 0) # or v.mask(df.Status.eq('End'))
df
Id Day Status Length
0 111 1 Start 4.0
1 111 5 End NaN
2 222 2 Begin 5.0
3 222 7 End NaN
4 333 1 Start 6.0
5 333 3 Begin 4.0
6 333 7 End NaN
Timings
df = pd.concat([df] * 1000000, ignore_index=True)
# apply + iloc
%timeit df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
1 loop, best of 3: 1.49 s per loop
# transform + mask
%%timeit
v = df.groupby('Id').Day.transform('last') - df.Day
df['Length'] = v.mask(v == 0)
1 loop, best of 3: 294 ms per loop

By using apply with .iloc
df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
Out[187]:
0 4.0
1 NaN
2 5.0
3 NaN
4 6.0
5 4.0
6 NaN
Name: Day, dtype: float64
After assigning it back:
df['Length']=df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
df
Out[189]:
Id Day Status Length
0 111 1 Start 4.0
1 111 5 End NaN
2 222 2 Begin 5.0
3 222 7 End NaN
4 333 1 Start 6.0
5 333 3 Begin 4.0
6 333 7 End NaN

Related

How to use each row's value as the comparison target to count the rows in the whole DataFrame satisfying a condition?

date data1
0 2012/1/1 100
1 2012/1/2 109
2 2012/1/3 108
3 2012/1/4 120
4 2012/1/5 80
5 2012/1/6 130
6 2012/1/7 100
7 2012/1/8 140
Given the dataframe above, I want to get, for each row, the number of rows whose data1 value is within ±10 of that row's data1 value, and append that count to the row, such that:
date data Count
0 2012/1/1 100.0 4.0
1 2012/1/2 109.0 4.0
2 2012/1/3 108.0 4.0
3 2012/1/4 120.0 2.0
4 2012/1/5 80.0 1.0
5 2012/1/6 130.0 3.0
6 2012/1/7 100.0 4.0
7 2012/1/8 140.0 2.0
Since each row's own value is the comparison target, I used iterrows, although I know this is not elegant:
result = pd.DataFrame(index=df.index)
for i, r in df.iterrows():
    high = r['data1'] + 10
    low = r['data1'] - 10
    df2 = df.loc[(df['data1'] <= high) & (df['data1'] >= low)]
    result.loc[i, 'date'] = r['date']
    result.loc[i, 'data'] = r['data1']
    result.loc[i, 'count'] = df2.shape[0]
result
Is there any more Pandas-style way to do that?
Thank you for any help!
Use NumPy broadcasting to build the boolean mask, then count the True values with sum:
arr = df['data1'].to_numpy()
df['count'] = ((arr[:, None] <= arr + 10) & (arr[:, None] >= arr - 10)).sum(axis=1)
print(df)
date data1 count
0 2012/1/1 100 4
1 2012/1/2 109 4
2 2012/1/3 108 4
3 2012/1/4 120 2
4 2012/1/5 80 1
5 2012/1/6 130 3
6 2012/1/7 100 4
7 2012/1/8 140 2
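As a side note, the same mask can be written a bit more compactly with an absolute difference; this is just an equivalent sketch of the broadcasting idea above, not a different method:
import numpy as np

arr = df['data1'].to_numpy()
# |x_i - x_j| <= 10 is the same condition as x_j - 10 <= x_i <= x_j + 10
df['count'] = (np.abs(arr[:, None] - arr) <= 10).sum(axis=1)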

Create New Pandas DataFrame Column Equaling Values From Other Row in Same DataFrame

I'm new to python and very new to Pandas. I've looked through the Pandas documentation and tried multiple ways to solve this problem unsuccessfully.
I have a DataFrame with timestamps in one column and prices in another, such as:
d = {'TimeStamp': [1603822620000, 1603822680000,1603822740000, 1603823040000,1603823100000,1603823160000,1603823220000], 'Price': [101,105,102,108,105,101,106], 'OtherData1': [1,2,3,4,5,6,7], 'OtherData2': [7,6,5,4,3,2,1]}
df= pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2
0 1603822620000 101 1 7
1 1603822680000 105 2 6
2 1603822740000 102 3 5
3 1603823040000 108 4 4
4 1603823100000 105 5 3
5 1603823160000 101 6 2
6 1603823220000 106 7 1
In addition to the two columns of interest, this DataFrame also has additional columns with data not particularly relevant to the question (represented with OtherData Cols).
I want to create a new column 'Fut2Min' (Price Two Minutes into the Future). There may be missing data, so this problem can't be solved by simply getting the data from 2 rows below.
I'm trying to find a way to make the value of the Fut2Min column in each row equal to the Price at the row whose timestamp is the current timestamp + 120000 (2 minutes into the future), or NaN if that timestamp doesn't exist.
For the example data, the DF should be updated to:
(Code used to mimic desired result)
import numpy as np
d = {'TimeStamp': [1603822620000, 1603822680000, 1603822740000, 1603822800000, 1603823040000, 1603823100000, 1603823160000, 1603823220000],
     'Price': [101, 105, 102, 108, 105, 101, 106, 111],
     'OtherData1': [1, 2, 3, 4, 5, 6, 7, 8],
     'OtherData2': [8, 7, 6, 5, 4, 3, 2, 1],
     'Fut2Min': [102, 108, np.nan, np.nan, 106, 111, np.nan, np.nan]}
df = pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102.0
1 1603822680000 105 2 7 108.0
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106.0
5 1603823100000 101 6 3 111.0
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
Assuming that the DataFrame is:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 0
1 1603822680000 105 2 7 0
2 1603822740000 102 3 6 0
3 1603822800000 108 4 5 0
4 1603823040000 105 5 4 0
5 1603823100000 101 6 3 0
6 1603823160000 106 7 2 0
7 1603823220000 111 8 1 0
Then, if you use pandas.DataFrame.apply row-wise (axis=1):
import pandas as pd

def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if futTimeStamp in df.TimeStamp.values:
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

df['Fut2Min'] = df.apply(Fut2MinFunc, axis=1)
you will get exactly the result you describe:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102.0
1 1603822680000 105 2 7 108.0
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106.0
5 1603823100000 101 6 3 111.0
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
EDIT 2: I have updated the solution since it had some sloppy parts (I replaced the list used for index lookup with a dictionary and restricted the timestamp search).
This (with import numpy as np)
indices = {ts - 120000: i for i, ts in enumerate(df['TimeStamp'])}
df['Fut2Min'] = [
    np.nan
    if (ts + 120000) not in df['TimeStamp'].values[i:] else
    df['Price'].iloc[indices[ts]]
    for i, ts in enumerate(df['TimeStamp'])
]
gives you
TimeStamp Price Fut2Min
0 1603822620000 101 102.0
1 1603822680000 105 108.0
2 1603822740000 102 NaN
3 1603822800000 108 NaN
4 1603823040000 105 106.0
5 1603823100000 101 111.0
6 1603823160000 106 NaN
7 1603823220000 111 NaN
But I'm not sure if that is an optimal solution.
EDIT: Inspired by the discussion in the comments I did some timing:
With the sample frame
from itertools import accumulate
import numpy as np
rng = np.random.default_rng()
n = 10000
timestamps = [1603822620000 + t
              for t in accumulate(rng.integers(1, 4) * 60000
                                  for _ in range(n))]
df = pd.DataFrame({'TimeStamp': timestamps, 'Price': n * [100]})
TimeStamp Price
0 1603822680000 100
... ... ...
9999 1605030840000 100
[10000 rows x 2 columns]
and the two test functions
# (1) Other solution
def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if futTimeStamp in df.TimeStamp.values:
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

def test_1():
    df['Fut2Min'] = df.apply(Fut2MinFunc, axis=1)

# (2) Solution here
def test_2():
    indices = list(df['TimeStamp'] - 120000)
    df['Fut2Min'] = [
        np.nan
        if (timestamp + 120000) not in df['TimeStamp'].values else
        df['Price'].iloc[indices.index(timestamp)]
        for timestamp in df['TimeStamp']
    ]
I conducted the experiment
from timeit import timeit
t1 = timeit('test_1()', number=100, globals=globals())
t2 = timeit('test_2()', number=100, globals=globals())
print(t1, t2)
with the result
135.98962861 40.306039344
which seems to imply that the version here is faster? (I also measured directly with time() and without the wrapping in functions and the results are virtually identical.)
With my updated version the result looks like
139.33713767799998 14.178187169000012
I finally did one try with a frame with 1,000,000 rows (number=1) and the result was
763.737430931 175.73120002400003
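For completeness, a fully vectorized sketch of the same lookup is also possible with Series.map, assuming the TimeStamp values are unique; this is not one of the timed solutions above, just an alternative to compare against:
import pandas as pd

# Index Price by TimeStamp, then look up each row's timestamp + 2 minutes;
# timestamps with no match become NaN automatically.
price_by_ts = df.set_index('TimeStamp')['Price']
df['Fut2Min'] = (df['TimeStamp'] + 120000).map(price_by_ts)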

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframes, df1 and df2, where I need to compute df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 onto df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
An alternative is casting the boolean mask to integers, which converts True/False to 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot to add quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)
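As an aside, the same result can also be sketched with a merge instead of map; this is only an equivalent alternative, assuming the column names shown above:
import pandas as pd

# Total sum_column per c_id, then brought onto df1 by matching code to c_id.
totals = df2.groupby('c_id', as_index=False)['sum_column'].sum()
merged = df1.merge(totals, left_on='code', right_on='c_id', how='left')
df1['seq'] = (merged['amount'] > merged['sum_column']).astype(int).to_numpy()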

get the new value based on the last row and checking the ID

The current dataframe:
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 100
111 1/3/2018 500
111 1/4/2018 400
111 1/5/2018 0
222 4/1/2018 2000 200
222 4/2/2018 100
222 4/3/2018 700
222 4/4/2018 0
222 4/5/2018 0
222 4/6/2018 1000
222 4/7/2018 0
This is the dataframe I am trying to get. Basically, I am trying to fill the Start Value for each row. As you can see, every ID has a Start Value on the first day; the next day's Start Value = the previous day's Start Value - the previous day's Payment.
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 1000 100
111 1/3/2018 900 500
111 1/4/2018 400 400
111 1/5/2018 0 0
222 4/1/2018 2000 200
222 4/2/2018 1800 100
222 4/3/2018 1700 700
222 4/4/2018 1000 0
222 4/5/2018 1000 0
222 4/6/2018 1000 1000
222 4/7/2018 0 0
Right now, I use Excel with this formula.
Start Value = if(ID in this row == ID in last row, last row's start value - last row's payment, Start Value)
It works well, but I am wondering if I can do it in Python/Pandas. Thank you.
We can use groupby with shift + cumsum. ffill sets the initial value for all rows under the same ID; then we deduct the cumulative payments made before each row from that initial value, which gives the remaining value at that point.
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True))
Out[61]:
0 1000.0
1 1000.0
2 900.0
3 400.0
4 0.0
5 2000.0
6 1800.0
7 1700.0
8 1000.0
9 1000.0
10 1000.0
11 0.0
Name: StartValue, dtype: float64
Assign it back by adding inplace=True:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True),inplace=True)
df
Out[63]:
ID Date StartValue Payment
0 111 1/1/2018 1000.0 0
1 111 1/2/2018 1000.0 100
2 111 1/3/2018 900.0 500
3 111 1/4/2018 400.0 400
4 111 1/5/2018 0.0 0
5 222 4/1/2018 2000.0 200
6 222 4/2/2018 1800.0 100
7 222 4/3/2018 1700.0 700
8 222 4/4/2018 1000.0 0
9 222 4/5/2018 1000.0 0
10 222 4/6/2018 1000.0 1000
11 222 4/7/2018 0.0 0
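The one-liner above is fairly dense, so here is the same idea broken into steps; this is only an equivalent sketch, using the StartValue/Payment column names from the answer's output, not a different method:
import pandas as pd

def fill_start(group):
    # Carry the first known StartValue down through the group...
    initial = group['StartValue'].ffill()
    # ...and subtract everything paid on the previous days.
    paid_before = group['Payment'].shift().cumsum().fillna(0)
    return initial - paid_before

df['StartValue'] = df['StartValue'].fillna(
    df.groupby('ID', group_keys=False)[['StartValue', 'Payment']].apply(fill_start)
)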

Indexing/Binning Time Series

I have a dataframe like below:
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin the dates into 5-day intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in Python, please?
Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, you can first call assign for a copy-based assignment and then do the groupby:
df['Bin'] = df.assign(
Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3
One way is to create an array of your date range boundaries and use numpy.digitize:
import numpy as np

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')\
                .astype(np.int64).values
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
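A third option, not taken from the answers above, is plain integer arithmetic on the day offsets; a sketch assuming Date has already been parsed with dayfirst=True:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Days elapsed since the earliest date, integer-divided into 5-day buckets.
df['Bin'] = (df['Date'] - df['Date'].min()).dt.days // 5 + 1
Note that this numbers empty 5-day windows as well, whereas ngroup skips them; for the sample data the two approaches coincide.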
