Group consecutive rises and falls using Pandas Series - python

I want to group consecutive rises and falls in a pandas Series. I have tried this, but it doesn't seem to work:
consec_rises = self.df_dataset.diff().cumsum()
group_consec = consec_rises.groupby(consec_rises)
My dataset:
date
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
2022-01-11 31.791339
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
I want to get a result like the following:
Group #1 (consecutive growth)
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
Group #2 (consecutive fall)
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418

If I understand you correctly:
mask = df["date"].diff().bfill() >= 0
for _, g in df.groupby((mask != mask.shift(1)).cumsum()):
    print(g)
    print("-" * 80)
Prints:
date
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
--------------------------------------------------------------------------------
date
2022-01-11 31.791339
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
--------------------------------------------------------------------------------
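If you also want each group labelled as growth or fall, as in the desired output, here is a minimal sketch building on the same mask (it assumes, like the answer above, that the values live in a column named "date"):
for i, (_, g) in enumerate(df.groupby((mask != mask.shift(1)).cumsum()), start=1):
    kind = "growth" if mask.loc[g.index].all() else "fall"   # mask is constant within each group
    print(f"Group #{i} (consecutive {kind})")
    print(g)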

Related

Python Pandas, add column containing first index where future column value is greater than current row's value

Is there a way to add a column that indicates the next index that meets some condition (e.g. the first index where a future row's val is greater than the current row's val) using a vectorized approach?
I found a number of examples that show how to do this with a fixed value, such as getting the next index where a column is greater than 0, but I want to do this for every row based on that row's own value.
Here's an example of doing this with a simple loop, and I'm curious if there's a Pandas/vectorized approach to do the same:
import pandas as pd

df = pd.DataFrame([0,2,3,2,3,4,5,6,5,4,7,8,7,2,3], columns=['val'],
                  index=pd.date_range('20220101', periods=15))

def add_new_highs(df):
    df['new_high'] = pd.NaT
    for i, v in df.val.iteritems():
        row = df.loc[i:][df.val > v].head(1)
        if len(row) > 0:
            df['new_high'].loc[i] = row.index[0]

add_new_highs(df)
print(df)
Output:
val new_high
2022-01-01 0 2022-01-02
2022-01-02 2 2022-01-03
2022-01-03 3 2022-01-06
2022-01-04 2 2022-01-05
2022-01-05 3 2022-01-06
2022-01-06 4 2022-01-07
2022-01-07 5 2022-01-08
2022-01-08 6 2022-01-11
2022-01-09 5 2022-01-11
2022-01-10 4 2022-01-11
2022-01-11 7 2022-01-12
2022-01-12 8 NaT
2022-01-13 7 NaT
2022-01-14 2 2022-01-15
2022-01-15 3 NaT
One option is to use numpy broadcasting. Since we want an index that appears after the current index, we only need to look at the upper triangle of the comparison array, so we use np.triu. Then, since we need the first such index, we use argmax. Finally, some rows may never be followed by a greater value, so we replace those with NaT using where:
import numpy as np
df['new_high'] = df.index[np.triu(df[['val']].to_numpy() < df['val'].to_numpy()).argmax(axis=1)]
df['new_high'] = df['new_high'].where(lambda x: x.index < x)
Output:
val new_high
2022-01-01 0 2022-01-02
2022-01-02 2 2022-01-03
2022-01-03 3 2022-01-06
2022-01-04 2 2022-01-05
2022-01-05 3 2022-01-06
2022-01-06 4 2022-01-07
2022-01-07 5 2022-01-08
2022-01-08 6 2022-01-11
2022-01-09 5 2022-01-11
2022-01-10 4 2022-01-11
2022-01-11 7 2022-01-12
2022-01-12 8 NaT
2022-01-13 7 NaT
2022-01-14 2 2022-01-15
2022-01-15 3 NaT
Similar to #enke's response
import numpy as np
arr = np.repeat(df.values, len(df), axis=1) # make a matrix
arr = np.tril(arr) # remove values before you
arr = (arr - df.values.T) > 0 # make bool array of larger values
ind = np.argmax(arr, axis=0) # get first larger value index
df['new_high'] = df.iloc[ind].index # use index as new row
df['new_high'] = df['new_high'].replace({df.index[0]: pd.NaT}) # replace ones with no-max as NaT
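Note that this second sketch uses df.values, so it assumes df still contains only the val column when the matrix is built; if new_high already exists (for example after running the loop version above), select the column explicitly. A small adjusted sketch:
vals = df[['val']].to_numpy()                     # shape (n, 1); ignore any extra columns
arr = np.tril(np.repeat(vals, len(df), axis=1))   # lower triangle: only rows at or after each column's row
ind = np.argmax((arr - vals.T) > 0, axis=0)       # first later position with a strictly greater value
df['new_high'] = df.index[ind]
df['new_high'] = df['new_high'].replace({df.index[0]: pd.NaT})   # position 0 (the first date) means no later high exists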

Pandas fill missing Time-Series data. Only if more than one day is missing

I have two time series with different frequencies and would like to fill in values using the lower-frequency data.
Here is what I mean; I hope it is clear this way:
index = [pd.Timestamp(2022,1,10,1),
         pd.Timestamp(2022,1,10,2),
         pd.Timestamp(2022,1,12,7),
         pd.Timestamp(2022,1,14,12)]
df1 = pd.DataFrame([1,2,3,4], index=index)
2022-01-10 01:00:00 1
2022-01-10 02:00:00 2
2022-01-12 07:00:00 3
2022-01-14 12:00:00 4
index = pd.date_range(start=pd.Timestamp(2022,1,9),
                      end=pd.Timestamp(2022,1,15),
                      freq='D')
df2 = pd.DataFrame([n+99 for n in range(len(index))], index=index)
2022-01-09 99
2022-01-10 100
2022-01-11 101
2022-01-12 102
2022-01-13 103
2022-01-14 104
2022-01-15 105
The final df should only fill in values if more than one day is missing from df1. So the result should be:
2022-01-09 00:00:00 99
2022-01-10 01:00:00 1
2022-01-10 02:00:00 2
2022-01-11 00:00:00 101
2022-01-12 07:00:00 3
2022-01-13 00:00:00 103
2022-01-14 12:00:00 4
2022-01-15 00:00:00 105
Any idea how to do this?
You can filter df2 to keep only the new dates and concat to df1:
import numpy as np
idx1 = pd.to_datetime(df1.index).date
idx2 = pd.to_datetime(df2.index).date
df3 = pd.concat([df1, df2[~np.isin(idx2, idx1)]]).sort_index()
Output:
0
2022-01-09 00:00:00 99
2022-01-10 01:00:00 1
2022-01-10 02:00:00 2
2022-01-11 00:00:00 101
2022-01-12 07:00:00 3
2022-01-13 00:00:00 103
2022-01-14 12:00:00 4
2022-01-15 00:00:00 105
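An equivalent sketch that works on the indexes directly, assuming both are already DatetimeIndex so no date conversion is needed:
keep = ~df2.index.normalize().isin(df1.index.normalize().unique())   # days of df2 not covered by df1
df3 = pd.concat([df1, df2[keep]]).sort_index()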

Get the max value of dates in Pandas

Here is my code and the datetime column.
import pandas as pd

xcel_file = pd.read_excel('data.xlsx', usecols=['datetime'])
date = []
time = []
date.append(xcel_file['datetime'].dt.date)
time.append(xcel_file['datetime'].dt.time)
new_file = pd.DataFrame({'a': len(xcel_file['datetime'])}, index=xcel_file['datetime'])
day = new_file.between_time('9:00', '16:00')
day.reset_index(inplace=True)
day = day.drop(columns=['a'])
day['time'] = pd.to_datetime(day['datetime']).dt.date
model_list = day['time'].drop_duplicates()
data_set = []
i = 0
for n in day['datetime']:
    data_2 = max(day['datetime'][day['time'] == model_list[i]])
    i += 1
    data_set.append(data_2)
datetime column
0 2022-01-10 09:30:00
1 2022-01-10 10:30:00
2 2022-01-11 10:30:00
3 2022-01-11 15:30:00
4 2022-01-11 11:00:00
5 2022-01-11 12:00:00
6 2022-01-12 13:00:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
10 2022-01-14 16:00:00
11 2022-01-14 16:30:00
expected result
1 2022-01-10 10:30:00
3 2022-01-11 15:30:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
I'm trying to get the max value for each date in the datetime column, restricted to times between 9am and 4pm.
Is there any way of doing this? Truly thankful for any kind of help.
Use DataFrame.between_time, then aggregate by day with pd.Grouper to get the maximal datetimes:
df = pd.read_excel('data.xlsx',usecols=['datetime'])
df = df.set_index('datetime', drop=False)
df = (df.between_time('9:00','16:00')
.groupby(pd.Grouper(freq='d'))[['datetime']]
.max()
.reset_index(drop=True))
print (df)
datetime
0 2022-01-10 10:30:00
1 2022-01-11 15:30:00
2 2022-01-12 15:30:00
3 2022-01-13 14:00:00
4 2022-01-14 16:00:00
EDIT: If some days have no rows inside the filtered time window, the daily grouping produces missing values; DataFrame.dropna solves this problem. For example, with this data:
print (df)
datetime
0 2022-01-10 17:40:00
1 2022-01-10 19:30:00
2 2022-01-11 19:30:00
3 2022-01-11 15:30:00
4 2022-01-12 19:30:00
5 2022-01-12 15:30:00
6 2022-01-14 18:30:00
7 2022-01-14 16:30:00
df = df.set_index('datetime', drop=False)
df = (df.between_time('17:00','19:30')
.groupby(pd.Grouper(freq='d'))[['datetime']]
.max()
.dropna()
.reset_index(drop=True))
print (df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
An alternative solution:
df = df.set_index('datetime', drop=False)
df = (df.between_time('17:00','19:30')
.sort_index()
.assign(d = lambda x: x['datetime'].dt.date)
.drop_duplicates('d', keep='last')
.drop('d', axis=1)
.reset_index(drop=True)
)
print (df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
EDIT: a solution that filters first by the datetime column, then by datetime2, and finally deduplicates by the dates from the datetime column:
print (df)
datetime datetime2
0 2022-01-10 09:30:00 2022-01-10 17:40:00
1 2022-01-10 10:30:00 2022-01-10 19:30:00
2 2022-01-11 10:30:00 2022-01-11 19:30:00
3 2022-01-11 15:30:00 2022-01-11 15:30:00
4 2022-01-11 11:00:00 2022-01-12 15:30:00
5 2022-01-11 12:00:00 2022-01-14 18:30:00
6 2022-01-12 13:00:00 2022-01-14 16:30:00
7 2022-01-12 15:30:00 2022-01-14 17:30:00
df = (df.set_index('datetime', drop=False)
.between_time('9:00','16:00')
.sort_index()
.set_index('datetime2', drop=False)
.between_time('17:00','19:30')
.assign(d = lambda x: x['datetime'].dt.date)
.drop_duplicates('d', keep='last')
.drop('d', axis=1)
.reset_index(drop=True)
)
print (df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 12:00:00 2022-01-14 18:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00
If you deduplicate by the dates from datetime2 instead, the output is different:
df = (df.set_index('datetime', drop=False)
.between_time('9:00','16:00')
.sort_index()
.set_index('datetime2', drop=False)
.between_time('17:00','19:30')
.assign(d = lambda x: x['datetime2'].dt.date)
.drop_duplicates('d', keep='last')
.drop('d', axis=1)
.reset_index(drop=True)
)
print (df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 10:30:00 2022-01-11 19:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00

How do I change the name of a resampled column?

I have a dataframe with the price fluctuations of the Nasdaq stock index every minute.
In trading it is important to take into account data on different time frames (to know the short-, medium- and long-term trends).
So I used the resample() method of Pandas to get a dataframe with 5-minute prices in addition to the original 1-minute data:
df1m = pd.DataFrame({
'Time' : ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00', '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00', '2022-01-11 09:36:00' , '2022-01-11 09:37:00' , '2022-01-11 09:38:00' ,
'2022-01-11 09:39:00', '2022-01-11 09:40:00'],
'Price' : [1,2,3,4,5,6,7,8,9,10,11]})
df1m['Time'] = pd.to_datetime(df1m['Time'])
df1m.set_index(['Time'], inplace =True)
df5m = df1m.resample('5min').first()
I renamed the column for the 5-minute data:
df5m.rename(columns={'Price' : 'Price5'})
Unfortunately, the renamed column is not taken into account when the two dataframes (1 min and 5 min) are put together:
df_1m_5m = pd.concat([df1m, df5m], axis=1)
How can I permanently rename the columns created for the 5-minute data and avoid having the same column name twice for different data?
You can use:
df5m = df1m.resample('5min').first().add_suffix('5')
df_1m_5m = pd.concat([df1m, df5m], axis=1)
Output:
>>> df_1m_5m
Price Price5
Time
2022-01-11 09:30:00 1 1.0
2022-01-11 09:31:00 2 NaN
2022-01-11 09:32:00 3 NaN
2022-01-11 09:33:00 4 NaN
2022-01-11 09:34:00 5 NaN
2022-01-11 09:35:00 6 6.0
2022-01-11 09:36:00 7 NaN
2022-01-11 09:37:00 8 NaN
2022-01-11 09:38:00 9 NaN
2022-01-11 09:39:00 10 NaN
2022-01-11 09:40:00 11 11.0
You forgot to reassign the result to your dataframe:
df5m = df5m.rename(columns={'Price' : 'Price5'})
# OR
df5m.rename(columns={'Price' : 'Price5'}, inplace=True)
Output:
>>> df5m
Price5
Time
2022-01-11 09:30:00 1
2022-01-11 09:35:00 6
2022-01-11 09:40:00 11
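As a quick check, after reassigning, the concat from the question now picks up the new name (a sketch):
df_1m_5m = pd.concat([df1m, df5m], axis=1)
print(df_1m_5m.columns.tolist())   # ['Price', 'Price5']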
Your issue is that you are missing the inplace=True option in your rename. By default it is False, so rename returns a modified copy rather than editing your existing DataFrame. Setting it to True will edit your existing DataFrame df5m in place:
df5m.rename(columns={'Price' : 'Price5'},inplace=True)
Output of df_1m_5m:
Price Price5
Time
2022-01-11 09:30:00 1 1.0
2022-01-11 09:31:00 2 NaN
2022-01-11 09:32:00 3 NaN
2022-01-11 09:33:00 4 NaN
2022-01-11 09:34:00 5 NaN
2022-01-11 09:35:00 6 6.0
2022-01-11 09:36:00 7 NaN
2022-01-11 09:37:00 8 NaN
2022-01-11 09:38:00 9 NaN
2022-01-11 09:39:00 10 NaN
2022-01-11 09:40:00 11 11.0
Agree with Stephan and Corralien. You can also try this:
df1m['Price5'] = df1m.resample('5T').first()
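This works through index alignment: the resampled values (at 09:30, 09:35 and 09:40) are aligned back onto the 1-minute index and every other row becomes NaN, so the result matches the concat output above. A quick sanity check of that behaviour, as a sketch:
aligned = df1m['Price'].resample('5T').first().reindex(df1m.index)
print(aligned.notna().sum())   # 3 non-NaN rows: 09:30, 09:35, 09:40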

I have a dataframe table with a ticker column and a date column. I want to calculate the price of the ticker at the corresponding date

Here is the table specified as df:
id  ticker  date
 1  PLTR    2022-01-07
 2  GME     2022-01-06
 3  AMC     2022-01-06
 4  GOOD    2022-01-07
 5  GRAB    2022-01-07
 6  ALL     2022-01-06
 7  FOR     2022-01-06
I want to have something like this:
id  ticker  date        Price
 1  PLTR    2022-01-07  $16.56
 2  GME     2022-01-06  $131.03
 3  AMC     2022-01-06  $22.46
 4  GOOD    2022-01-07  $24.76
 5  GRAB    2022-01-07  $6.81
 6  ALL     2022-01-06  $122.40
 7  FOR     2022-01-06  $21.26
I tried df['Price'] = yf.download(df['ticker'],df['date'])['Close']
using the yahoo finance tool but received an error:
AttributeError: 'Series' object has no attribute 'split'
I also tried pandas_datareader (imported as web) and got the same error:
df.assign(Price=web.DataReader(list(df.ticker('\n')), 'yahoo', list(df.date)))['Close']
Any advice/ideas what I am doing wrong?
import pandas as pd
import pandas_datareader.data as web
tickers = list(df.ticker)
prices = ( web.DataReader(tickers, data_source='yahoo', start=df.date.min().date(), end=df.date.max().date() )['Close']
.reset_index()
.melt(id_vars=['Date'])
.rename(columns={'Symbols':'ticker', 'Date':'date'})
)
prices:
    date                 ticker   value
0   2022-01-06 00:00:00  PLTR     16.74
1   2022-01-07 00:00:00  PLTR     16.56
2   2022-01-06 00:00:00  GME     131.03
3   2022-01-07 00:00:00  GME     140.62
4   2022-01-06 00:00:00  AMC      22.46
5   2022-01-07 00:00:00  AMC      22.99
6   2022-01-06 00:00:00  GOOD     25.03
7   2022-01-07 00:00:00  GOOD     24.76
8   2022-01-06 00:00:00  GRAB      6.65
9   2022-01-07 00:00:00  GRAB      6.81
10  2022-01-06 00:00:00  ALL     122.4
11  2022-01-07 00:00:00  ALL     125.95
12  2022-01-06 00:00:00  FOR      21.26
13  2022-01-07 00:00:00  FOR      20.19
Now merge them:
df.merge(prices, on=['ticker','date'], how='left')
   id  ticker  date                  value
0   1  PLTR    2022-01-07 00:00:00   16.56
1   2  GME     2022-01-06 00:00:00  131.03
2   3  AMC     2022-01-06 00:00:00   22.46
3   4  GOOD    2022-01-07 00:00:00   24.76
4   5  GRAB    2022-01-07 00:00:00    6.81
5   6  ALL     2022-01-06 00:00:00  122.4
6   7  FOR     2022-01-06 00:00:00   21.26
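If you also want the column named Price and formatted with a dollar sign as in the desired output, a small follow-up sketch (the formatting step is an assumption, not part of the original answer):
out = df.merge(prices, on=['ticker', 'date'], how='left')
out['Price'] = out.pop('value').map('${:,.2f}'.format)   # e.g. 16.56 -> $16.56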
