Pandas closest future value not equal to current row - python

I have a Pandas DataFrame with one column, price, and a DateTimeIndex. I would like to create a new column that is 1 when price increases the next time it changes and 0 if it decreases. Multiple consecutive rows may have the same value of price.
Example:
import pandas as pd
df = pd.DataFrame({"price" : [10, 10, 20, 10, 30, 5]}, index=pd.date_range(start="2017-01-01", end="2017-01-06"))
The output should then be:
2017-01-01 1
2017-01-02 1
2017-01-03 0
2017-01-04 1
2017-01-05 0
2017-01-06 NaN
In practice this DF has ~20 million rows, so I'm really looking for a vectorized way of doing this.

Here is one way to do this:
calculate the price difference and shift it up by one;
use numpy.where to assign one where the price increases, zero where it decreases, and NaN where it does not change;
back-fill the indicator column, so rows where the price does not change take the value of the next available observation.
In code:
import numpy as np
price_diff = df.price.diff().shift(-1)
df['indicator'] = np.where(price_diff.gt(0), 1, np.where(price_diff.lt(0), 0, np.nan))
df['indicator'] = df.indicator.bfill()
df
# price indicator
#2017-01-01 10 1.0
#2017-01-02 10 1.0
#2017-01-03 20 0.0
#2017-01-04 10 1.0
#2017-01-05 30 0.0
#2017-01-06 5 NaN

df['New'] = (df['price'] - df['price'].shift(-1))[:-1].le(0).astype(int)
df
Out[879]:
price New
2017-01-01 10 1.0
2017-01-02 10 1.0
2017-01-03 20 0.0
2017-01-04 10 1.0
2017-01-05 30 0.0
2017-01-06 5 NaN

Use shift:
sh = df['price'].shift(-1)
out = (df['price'] <= sh)[~sh.isnull()]
or
sh = df['price'].shift(-1)
out = np.where(sh.isnull(), np.nan, df['price']<=sh)

Creating a column with moving sum

I have time series data and non-continuous log data with timestamps. I want to merge the latter with the time series data and create new columns from the log's column values.
Let the time series data be:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12*24*5))
df['col'] = np.random.randint(1, 101, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12*24*5))
df2['col'] = np.random.randint(1, 51, size=df2.shape[0])
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3 = df3.rename(columns={'index': 'timestamp'})
timestamp col uid
0 2020-10-10 00:00:00 96 1
1 2020-10-10 00:05:00 47 1
2 2020-10-10 00:10:00 78 1
3 2020-10-10 00:15:00 27 1
...
Let the log data be:
import datetime as dt
df_log = pd.DataFrame(np.array([[100, 1, 3], [40, 2, 6], [50, 1, 5],
                                [60, 2, 9], [20, 1, 2], [30, 2, 5]]),
                      columns=['duration', 'uid', 'factor'])
df_log['timestamp'] = pd.Series([dt.datetime(2020, 10, 10, 15, 21), dt.datetime(2020, 10, 10, 16, 27),
                                 dt.datetime(2020, 10, 11, 21, 25), dt.datetime(2020, 10, 11, 10, 12),
                                 dt.datetime(2020, 10, 13, 20, 56), dt.datetime(2020, 10, 13, 13, 15)])
duration uid factor timestamp
0 100 1 3 2020-10-10 15:21:00
1 40 2 6 2020-10-10 16:27:00
...
I want to merge these two into df_merged and, for each uid, create a new column in the time series data:
df_merged['new'] = df_merged['duration'] * df_merged['factor']
and forward-fill df_merged['new'] with this value until the next log for that uid, then do the same operation on the next log and sum them, keeping a moving 2-day window.
Can anybody point me in the right direction for this problem?
Expected Output:
timestamp col uid duration factor new
0 2020-10-10 15:20:00 96 1 100 3 300
1 2020-10-10 15:25:00 47 1 100 3 300
2 2020-10-10 15:30:00 78 1 100 3 300
...
2020-10-11 21:25:00 .. 1 60 9 540+300
2020-10-11 21:30:00 .. 1 60 9 540+300
...
2020-10-13 20:55:00 .. 1 20 2 40+540
2020-10-13 21:00:00 .. 1 20 2 40+540
..
2020-10-13 21:25:00 .. 1 20 2 40
As I understand it, it's simpler to calculate the new column on df_log before merging. You'd just use rolling to calculate the window for each uid group:
df_log["new"] = df_log["duration"] * df_log["factor"]
# 2 day rolling window summing `new`
df_log = df_log.groupby("uid").rolling("2d", on="timestamp")["new"].sum().to_frame()
Then merging is straightforward:
# prepare for merge
df_log = df_log.sort_values(by="timestamp")
df3 = df3.sort_values(by="timestamp")
df_merged = (
pd.merge_asof(df3, df_log, on="timestamp", by=["uid"])
.dropna()
.reset_index(drop=True)
)
This solution does deviate slightly from your expected output. The first included row from the continuous series (df3) would be at timestamp 2020-10-10 15:25:00 instead of 2020-10-10 15:20:00 since the merge method would look for the last timestamp in df_log before the timestamp in df3.
Alternatively, if you require the first row in the output to have timestamp 2020-10-10 15:20:00, you can use direction="forward" in pd.merge_asof. That would make each row match the first row in df_log with a timestamp at or after the one in df3, so you'd need to remove the extra rows at the beginning for each uid, as sketched below.
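A possible sketch of that forward-matching variant, continuing from the frames above; the 5-minute floor used to trim the leading rows is an assumption based on df3's sampling frequency:
df_fwd = pd.merge_asof(df3, df_log, on="timestamp", by="uid", direction="forward")
# keep only rows at or after the 5-minute bin that holds each uid's first log entry
first_log = df_log.groupby("uid")["timestamp"].min().dt.floor("5min").rename("first_bin")
df_fwd = df_fwd.join(first_log, on="uid")
df_fwd = (
    df_fwd[df_fwd["timestamp"] >= df_fwd["first_bin"]]
    .drop(columns="first_bin")
    .reset_index(drop=True)
)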

How can I count the rows between a date index and a date one month in the future in pandas, vectorized, and add the count as a column?

I have a dataframe (df) with a date index. And I want to achieve the following:
1. Take the Dates index and add one month -> e.g. nxt_dt = df.index + pd.DateOffset(months=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put the count into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, the df only contains dates/rows for which there are observations.
I can think of a few ways to do this with loops, but since the data I work with is quite big, I would really like to avoid looping through rows to fill them.
Is there a way in pandas to get this done vectorized?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
df = df.reindex(pd.date_range('2020-01-01','2020-03-04')).fillna(0)
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
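For reference, here is a sketch of a searchsorted-based alternative on the same observed dates; it avoids the reindex and uses a true one-month offset. Whether the end points are counted is an assumption (adjust side= and the trailing -1), and near the end of the data the window is simply truncated instead of producing NaN:
import numpy as np
import pandas as pd

obs = pd.DataFrame(index=pd.to_datetime(['2020-01-01', '2020-01-08', '2020-01-24',
                                         '2020-01-29', '2020-02-09', '2020-03-04']))
nxt = obs.index + pd.DateOffset(months=1)            # curr_dt plus one month
end_pos = obs.index.searchsorted(nxt, side='left')   # first row >= nxt_dt
obs['30d'] = end_pos - np.arange(len(obs)) - 1       # rows strictly between curr_dt and nxt_dt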

Complete days in data from a DataFrame grouped with groupby

I have this DataFrame (Datos), with columns date, site, value and value2.
I applied df.groupby('site') to group the data by this column.
grouped = Datos.groupby('site')
After grouping, I want to complete the "date" column day by day for every record.
The procedure I think I should follow is:
1. Generate a complete sequence between the start and end dates. (Step completed.)
import datetime
for site in grouped:
    dates = ['2018-01-01', '2020-01-17']
    startDate = datetime.datetime.strptime(dates[0], "%Y-%m-%d")  # parse first date
    endDate = datetime.datetime.strptime(dates[-1], "%Y-%m-%d")   # parse last date
    days = (endDate - startDate).days                             # how many days between?
    allDates = {datetime.datetime.strftime(startDate + datetime.timedelta(days=k), "%Y-%m-%d"): 0
                for k in range(days + 1)}
2. Compare this sequence with the 'date' column of each group ('site') and add the dates that are not present.
3. Write a function or loop that updates the 'date' column with the new dates and fills the missing values with 0.
(grouped.apply(add_days))
So far I have only managed to complete step 1, so I ask for your help to complete steps 2 and 3.
I would very much appreciate your help.
Regards
I had to do quite the same thing for a project.
Maybe it's not the best solution for you, but it can help (and I hope it saves you the headache I had).
Here is how I managed it, with help from https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
df_DateRange = pd.DataFrame()
df_1 = pd.DataFrame()
grouped = pd.DataFrame()

# 1. Create a DataFrame with all days (your step 2):
dates_list = ['2019-12-31', '2020-01-05']
df_DateRange['date'] = pd.date_range(start=dates_list[0], end=dates_list[-1], freq='1D')
df_DateRange['date'] = df_DateRange['date'].dt.strftime('%Y-%m-%d')
df_DateRange.set_index(['date'], inplace=True)

# Set the index of your Datos DataFrame:
Datos.set_index(['date'], inplace=True)

# Join both DataFrames:
df_1 = df_DateRange.join(Datos)

# 2. Replace the NaN:
df_1['site'].fillna("", inplace=True)
df_1['value'].fillna(0, inplace=True)
df_1['value2'].fillna(0, inplace=True)

# 3. Do the calculation:
grouped = df_1.groupby('site').sum()
df_DateRange:
date
0 2019-12-31
1 2020-01-01
2 2020-01-02
3 2020-01-03
4 2020-01-04
5 2020-01-05
Datos:
date site value value2
0 2020-01-01 site1 1 -1
1 2020-01-01 site2 2 -2
2 2020-01-02 site1 10 -10
3 2020-01-02 site2 20 -20
df_1:
site value value2
date
2019-12-31 0.0 0.0
2020-01-01 site1 1.0 -1.0
2020-01-01 site2 2.0 -2.0
2020-01-02 site1 10.0 -10.0
2020-01-02 site2 20.0 -20.0
2020-01-03 0.0 0.0
2020-01-04 0.0 0.0
2020-01-05 0.0 0.0
grouped:
value value2
site
0.0 0.0
site1 11.0 -11.0
site2 22.0 -22.0
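If you need the completion per site rather than one summed frame, here is a possible sketch using reindex inside groupby.apply; it assumes, as above, that the original Datos still has a 'date' column of 'YYYY-MM-DD' strings and that value/value2 are the numeric columns to fill with 0:
full_range = pd.date_range('2018-01-01', '2020-01-17', freq='D').strftime('%Y-%m-%d')
completed = (
    Datos.set_index('date')
         .groupby('site')
         .apply(lambda g: g.drop(columns='site').reindex(full_range, fill_value=0))
         .reset_index()  # the second level may come back as 'level_1' depending on the pandas version
         .rename(columns={'level_1': 'date'})
)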

Time Ranking DataFrame within the constraints of noise

I have a dataframe df with three columns, viz., Date, Time, Name (there can be more columns). df is sorted in ascending order of Time. On any given Date there can be multiple Time values, which are either within 5 minutes of each other or more than 15 minutes apart. On any given day, anything within 5 minutes should be treated as the same time. I want to add a column TimeRank which, on any given day, clusters Times that fall within 5 minutes of each other and gives them the same TimeRank. For example,
Date Name Time TimeRank
0 2017-01-01 Henry 2017-01-01 09:21:01 1
1 2017-01-01 John 2017-01-01 09:23:43 1
2 2017-01-01 Svetlana 2017-01-01 10:15:01 2
3 2017-01-01 Sara 2017-01-01 11:01:01 3
4 2017-01-01 Whitney 2017-01-01 11:03:03 3
5 2017-01-02 Lara 2017-01-02 11:03:03 1
6 2017-01-02 Eugene 2017-01-02 16:46:00 2
7 2017-01-02 Richard 2017-01-02 16:46:00 2
8 2017-01-03 Andy 2017-01-03 11:01:01 1
9 2017-01-03 Paul 2017-01-03 11:03:03 1
Below I have created a sample df. Unfortunately, I am constrained to an older pandas version, 0.16.
import pandas as pd
import numpy as np
from random import randint
from datetime import time
dates = pd.date_range('2017-01-01', '2017-01-04')
dates2 = [dates[i] for i in [randint(0, len(dates) - 1) for i in range(0, 100)]]
timelist = [time(9,20,45), time(9,21,0), time(9,23,43), time(9,50,0), time(10,15,1), time(11,1,1), time(11,3,3), time(16,45,0), time(16,46,0)]
timelist2 = [timelist[i] for i in [randint(0, len(timelist) - 1) for i in range(0, 100)]]
names = ['henry', 'tom', 'andy', 'lara', 'whitney', 'eleanor', 'paloma', 'john', 'james', 'svetlana', 'paul']
names2 = [names[i] for i in [randint(0, len(names) - 1) for i in range(0, 100)]]
df = pd.DataFrame({'Date': dates2, 'Time': timelist2, 'Name': names2})
df['Time'] = df.apply(lambda r: pd.datetime.combine(r['Date'], r['Time']), axis=1)
df.sort('Time', inplace=True)
df.loc[:, 'minutes'] = df.apply(lambda x:x['Time'].minute + 60*x['Time'].hour, axis=1)
df.loc[:, 'delTime'] = df.groupby('Date')['minutes'].diff()
df.loc[(df['delTime'] <=5) & (df['delTime'] >=-5), 'delTime'] = 0
df.loc[np.isnan(df['delTime']), 'delTime'] = 1.
df.loc[(df['delTime']) == 0, 'delTime'] = np.nan
df.loc[~np.isnan(df['delTime']), 'delTime'] = df['minutes']
df = df.ffill()
df.loc[:, 'TimeRank'] = df.groupby('Date')['delTime'].rank(method='dense')
df.drop(['minutes', 'delTime'], inplace=True, axis=1)
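For reference, a more compact sketch of the same idea on a recent pandas version (the question is constrained to 0.16, so this is only illustrative): start a new cluster whenever the gap to the previous row on the same Date exceeds 5 minutes, then use a cumulative sum of those flags as the rank.
gap_min = df.groupby('Date')['Time'].diff().dt.total_seconds().div(60)  # gap to previous row, in minutes
new_cluster = gap_min.isnull() | gap_min.gt(5)                          # first row of a day, or a >5 minute jump
df['TimeRank'] = new_cluster.astype(int).groupby(df['Date']).cumsum()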

Boxplot Pandas data

DataFrame is as follows:
ID1 ID2
0 00:00:01.002 00:00:01.002
1 00:00:01.001 00:00:01.006
2 00:00:01.004 00:00:01.011
3 00:00:00.998 00:00:01.012
4 NaT 00:00:01.000
...
20 NaT 00:00:00.998
What I am trying to do is create a boxplot for each ID. There may or may not be multiple IDs depending on the dataset I provide. For right now I am trying to solve this for 2 datasets. If possible I would like a solution that shows all the data on one boxplot, and another where each ID's data is displayed on its own boxplot.
I am very new to pandas (trying to learn it...) and am just getting frustrated at how long this is taking to figure out... Here is my code...
deltaTime = pd.DataFrame()  # create blank df
for x in range(0, len(totIDs)):
    ID = IDList[x]
    df = pd.DataFrame(data[ID]).T
    deltaTime[ID] = pd.to_datetime(df[TIME_COL]).diff()
deltaTime.boxplot()
Pretty simple, but I just can't seem to get it to do what I want in plotting a boxplot for each ID. I should note that data is given to me by a homegrown file reader that takes several complex files and sorts them into the data dictionary, which is indexed by IDs.
I am running pandas version 0.14.0 and python version 2.7.7
I am not sure how this works in version 0.14.0, because the latest is 0.19.2 - I recommend upgrading if possible:
# sample data
import numpy as np
import pandas as pd

np.random.seed(180)
dates = pd.date_range('2017-01-01 10:11:20', periods=10, freq='T')
cols = ['ID1', 'ID2']
df = pd.DataFrame(np.random.choice(dates, size=(10, 2)), columns=cols)
print (df)
ID1 ID2
0 2017-01-01 10:12:20 2017-01-01 10:17:20
1 2017-01-01 10:16:20 2017-01-01 10:20:20
2 2017-01-01 10:18:20 2017-01-01 10:17:20
3 2017-01-01 10:12:20 2017-01-01 10:16:20
4 2017-01-01 10:14:20 2017-01-01 10:18:20
5 2017-01-01 10:18:20 2017-01-01 10:19:20
6 2017-01-01 10:17:20 2017-01-01 10:12:20
7 2017-01-01 10:13:20 2017-01-01 10:17:20
8 2017-01-01 10:16:20 2017-01-01 10:11:20
9 2017-01-01 10:13:20 2017-01-01 10:19:20
Call DataFrame.diff and then convert timedeltas to total_seconds:
df = df.diff().apply(lambda x: x.dt.total_seconds())
print(df)
ID1 ID2
0 NaN NaN
1 240.0 180.0
2 120.0 -180.0
3 -360.0 -60.0
4 120.0 120.0
5 240.0 60.0
6 -60.0 -420.0
7 -240.0 300.0
8 180.0 -360.0
9 -180.0 480.0
Last, use DataFrame.plot.box:
df.plot.box()
You can also check docs.
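If each ID should also get its own boxplot (my reading of the second part of the question), a small sketch using the subplots flag to split the columns onto separate axes:
import matplotlib.pyplot as plt

df.plot.box(subplots=True, layout=(1, 2), figsize=(8, 4))
plt.show()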
