Python - how can I detect the steepness of a line?

I'm dealing with some datasets of OHLC data, and I'm trying to find a way to determine the steepness of that data at a given point.
In other words, given a specific point, I need to know whether the line is going up, down, or sideways at that point, so I'm looking for some sort of coefficient that tells me how steep the line is there.
Here is a sample of one of the datasets:
Close
2021-03-04 07:00:00 1571.53
2021-03-04 08:00:00 1571.11
2021-03-04 09:00:00 1547.29
2021-03-04 10:00:00 1557.74
2021-03-04 11:00:00 1561.59
2021-03-04 12:00:00 1561.28
2021-03-04 13:00:00 1567.00
2021-03-04 14:00:00 1574.28
2021-03-04 15:00:00 1577.22
2021-03-04 16:00:00 1584.92
2021-03-04 17:00:00 1562.29
2021-03-04 18:00:00 1507.77
2021-03-04 19:00:00 1526.71
2021-03-04 20:00:00 1531.00
2021-03-04 21:00:00 1517.97
2021-03-04 22:00:00 1529.71
2021-03-04 23:00:00 1539.23
2021-03-05 00:00:00 1478.94
2021-03-05 01:00:00 1463.28
2021-03-05 02:00:00 1480.97
2021-03-05 03:00:00 1481.67
2021-03-05 04:00:00 1465.48
2021-03-05 05:00:00 1472.41
2021-03-05 06:00:00 1479.99
2021-03-05 07:00:00 1474.17
2021-03-05 08:00:00 1456.06
2021-03-05 09:00:00 1469.80
2021-03-05 10:00:00 1471.51
2021-03-05 11:00:00 1480.15
2021-03-05 12:00:00 1482.47
2021-03-05 13:00:00 1489.47
2021-03-05 14:00:00 1493.41
2021-03-05 15:00:00 1481.41
2021-03-05 16:00:00 1477.98
2021-03-05 17:00:00 1507.11
2021-03-05 18:00:00 1521.66
2021-03-05 19:00:00 1533.58
2021-03-05 20:00:00 1533.36
2021-03-05 21:00:00 1531.68
2021-03-05 22:00:00 1536.38
2021-03-05 23:00:00 1528.31
2021-03-06 00:00:00 1542.00
2021-03-06 01:00:00 1534.70
2021-03-06 03:00:00 1531.70
2021-03-06 04:00:00 1529.28
2021-03-06 05:00:00 1574.67
2021-03-06 06:00:00 1578.37
2021-03-06 07:00:00 1589.31
2021-03-06 08:00:00 1574.99
2021-03-06 09:00:00 1564.03
2021-03-06 10:00:00 1563.93
2021-03-06 11:00:00 1564.07
2021-03-06 12:00:00 1542.18
2021-03-06 13:00:00 1528.55
2021-03-06 14:00:00 1541.65
2021-03-06 15:00:00 1544.41
2021-03-06 16:00:00 1580.20
2021-03-06 17:00:00 1600.98
2021-03-06 18:00:00 1615.63
2021-03-06 19:00:00 1620.71
2021-03-06 20:00:00 1653.09
2021-03-06 21:00:00 1642.97
2021-03-06 22:00:00 1665.29
2021-03-06 23:00:00 1650.35
2021-03-07 00:00:00 1671.00
2021-03-07 01:00:00 1641.96
2021-03-07 02:00:00 1656.10
2021-03-07 03:00:00 1648.98
2021-03-07 04:00:00 1656.85
2021-03-07 05:00:00 1663.31
2021-03-07 06:00:00 1662.99
2021-03-07 07:00:00 1684.54
2021-03-07 08:00:00 1668.45
2021-03-07 09:00:00 1684.10
2021-03-07 10:00:00 1675.01
2021-03-07 11:00:00 1678.36
2021-03-07 12:00:00 1669.22
2021-03-07 13:00:00 1663.43
2021-03-07 14:00:00 1648.96
2021-03-07 15:00:00 1652.00
2021-03-07 16:00:00 1669.16
2021-03-07 17:00:00 1675.99
2021-03-07 18:00:00 1655.52
2021-03-07 19:00:00 1658.26
2021-03-07 20:00:00 1649.31
2021-03-07 21:00:00 1664.14
2021-03-07 22:00:00 1678.61
2021-03-07 23:00:00 1726.16
2021-03-08 00:00:00 1742.21
2021-03-08 01:00:00 1736.80
2021-03-08 02:00:00 1723.35
2021-03-08 03:00:00 1723.68
Here is what I tried:
I tried calculating the slope of the line, and it partially works. I used a library called pandas_ta; here is the code it uses:
# Note: verify_series, get_offset, npAtan (numpy's arctan) and npPi (numpy's pi)
# are pandas_ta internals imported elsewhere in the library.
def slope(close, length=None, as_angle=None, to_degrees=None, vertical=None, offset=None, **kwargs):
    """Indicator: Slope"""
    # Validate arguments
    length = int(length) if length and length > 0 else 1
    as_angle = True if isinstance(as_angle, bool) else False
    to_degrees = True if isinstance(to_degrees, bool) else False
    close = verify_series(close, length)
    offset = get_offset(offset)
    if close is None: return

    # Calculate Result
    slope = close.diff(length) / length
    if as_angle:
        slope = slope.apply(npAtan)
        if to_degrees:
            slope *= 180 / npPi

    # Offset
    if offset != 0:
        slope = slope.shift(offset)

    # Handle fills
    if "fillna" in kwargs:
        slope.fillna(kwargs["fillna"], inplace=True)
    if "fill_method" in kwargs:
        slope.fillna(method=kwargs["fill_method"], inplace=True)

    # Name and Categorize it
    slope.name = f"SLOPE_{length}" if not as_angle else f"ANGLE{'d' if to_degrees else 'r'}_{length}"
    slope.category = "momentum"
    return slope
Here is the problem: the output of that function gives me what I'm looking for, but it depends heavily on the magnitude of the data I feed it (of course), and that is a problem because I'm also using other datasets with much lower numbers.
Instead, is there a way to get some sort of coefficient or number that tells me the inclination of the line at every point, regardless of scale? Any kind of advice is appreciated.
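A minimal sketch (illustrative only, not from the original post) of one scale-free variant, assuming the Close data shown above is loaded in a DataFrame df: compute the slope on percentage changes instead of raw prices, so the result no longer depends on the price level.
import numpy as np
import pandas as pd

def steepness(close: pd.Series, length: int = 5, to_degrees: bool = True) -> pd.Series:
    """Slope of the relative change over `length` bars, expressed as an angle.

    Using pct_change() instead of raw price differences makes the result
    independent of the absolute magnitude of the series, so datasets with
    much smaller numbers produce comparable values.
    """
    rise = close.pct_change(length)      # relative change over the window
    angle = np.arctan(rise / length)     # radians per bar, scale-free
    if to_degrees:
        angle = np.degrees(angle)
    return angle

# Positive values -> rising, negative -> falling, near zero -> sideways.
# df["steepness"] = steepness(df["Close"], length=5)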

Related

How to calculate daily average from ERA5 hourly netCDF data?

Hi dear all,
I do apologize for repeating the question. I have downloaded and merged the ERA5 hourly dew-point temperature data (d2m_wb.nc) from the Copernicus web platform. Now I want to calculate the daily mean from the hourly d2m_wb.nc data. The timestamps are 00, 01, 02...23. The ECMWF provides an example for calculating daily total precipitation (https://confluence.ecmwf.int/display/CKB/ERA5%3A+How+to+calculate+daily+total+precipitation). It says that to cover total precipitation for 1st January 2017, we need two days of data:
(a) 1st January 2017 time = 01 - 23 will give you total precipitation data to cover 00 - 23 UTC for 1st January 2017
(b) 2nd January 2017 time = 00 will give you total precipitation data to cover 23 - 24 UTC for 1st January 2017
That means I need to shift the timestamps by -1 hour to account for step (b). Accordingly, I did so using Climate Data Operators (CDO):
cdo daymean -shifttime,-1hour in.nc out.nc
and got the following result.
cdo sinfo d2m_wb.nc
File format : NetCDF2
-1 : Institut Source T Steptype Levels Num Points Num Dtype : Parameter ID
1 : unknown unknown v instant 1 1 475 1 F64 : -1
Grid coordinates :
1 : lonlat : points=475 (19x25)
lon : 85.5 to 90 by 0.25 degrees_east
lat : 21.5 to 27.5 by 0.25 degrees_north
Vertical coordinates :
1 : surface : levels=1
Time coordinate : 25904 steps
RefTime = 1900-01-01 00:00:00 Units = hours Calendar = gregorian Bounds = true
YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss
1949-12-31 23:00:00 1950-01-01 11:00:00 1950-01-02 11:00:00 1950-01-03 11:00:00
1950-01-04 11:00:00 1950-01-05 11:00:00 1950-01-06 11:00:00 1950-01-07 11:00:00
1950-01-08 11:00:00 1950-01-09 11:00:00 1950-01-10 11:00:00 1950-01-11 11:00:00
1950-01-12 11:00:00 1950-01-13 11:00:00 1950-01-14 11:00:00 1950-01-15 11:00:00
1950-01-16 11:00:00 1950-01-17 11:00:00 1950-01-18 11:00:00 1950-01-19 11:00:00
1950-01-20 11:00:00 1950-01-21 11:00:00 1950-01-22 11:00:00 1950-01-23 11:00:00
1950-01-24 11:00:00 1950-01-25 11:00:00 1950-01-26 11:00:00 1950-01-27 11:00:00
1950-01-28 11:00:00 1950-01-29 11:00:00 1950-01-30 11:00:00 1950-01-31 11:00:00
1950-02-01 11:00:00 1950-02-02 11:00:00 1950-02-03 11:00:00 1950-02-04 11:00:00
1950-02-05 11:00:00 1950-02-06 11:00:00 1950-02-07 11:00:00 1950-02-08 11:00:00
1950-02-09 11:00:00 1950-02-10 11:00:00 1950-02-11 11:00:00 1950-02-12 11:00:00
1950-02-13 11:00:00 1950-02-14 11:00:00 1950-02-15 11:00:00 1950-02-16 11:00:00
1950-02-17 11:00:00 1950-02-18 11:00:00 1950-02-19 11:00:00 1950-02-20 11:00:00
1950-02-21 11:00:00 1950-02-22 11:00:00 1950-02-23 11:00:00 1950-02-24 11:00:00
1950-02-25 11:00:00 1950-02-26 11:00:00 1950-02-27 11:00:00 1950-02-28 11:00:00
................................................................................
................................................................................
................................................................................
.................
2020-10-03 11:00:00 2020-10-04 11:00:00 2020-10-05 11:00:00 2020-10-06 11:00:00
2020-10-07 11:00:00 2020-10-08 11:00:00 2020-10-09 11:00:00 2020-10-10 11:00:00
2020-10-11 11:00:00 2020-10-12 11:00:00 2020-10-13 11:00:00 2020-10-14 11:00:00
2020-10-15 11:00:00 2020-10-16 11:00:00 2020-10-17 11:00:00 2020-10-18 11:00:00
2020-10-19 11:00:00 2020-10-20 11:00:00 2020-10-21 11:00:00 2020-10-22 11:00:00
2020-10-23 11:00:00 2020-10-24 11:00:00 2020-10-25 11:00:00 2020-10-26 11:00:00
2020-10-27 11:00:00 2020-10-28 11:00:00 2020-10-29 11:00:00 2020-10-30 11:00:00
2020-10-31 11:00:00 2020-11-01 11:00:00 2020-11-02 11:00:00 2020-11-03 11:00:00
2020-11-04 11:00:00 2020-11-05 11:00:00 2020-11-06 11:00:00 2020-11-07 11:00:00
2020-11-08 11:00:00 2020-11-09 11:00:00 2020-11-10 11:00:00 2020-11-11 11:00:00
2020-11-12 11:00:00 2020-11-13 11:00:00 2020-11-14 11:00:00 2020-11-15 11:00:00
2020-11-16 11:00:00 2020-11-17 11:00:00 2020-11-18 11:00:00 2020-11-19 11:00:00
2020-11-20 11:00:00 2020-11-21 11:00:00 2020-11-22 11:00:00 2020-11-23 11:00:00
2020-11-24 11:00:00 2020-11-25 11:00:00 2020-11-26 11:00:00 2020-11-27 11:00:00
2020-11-28 11:00:00 2020-11-29 11:00:00 2020-11-30 11:00:00 2020-12-31 23:00:00
cdo sinfo: Processed 1 variable over 25904 timesteps [6.03s 37MB
In this case, the timestep shows 11:00:00 (from 1950-01-01 onwards). I guess it should be 12:00:00. What have I done wrong here? Any suggestions would be highly appreciated. Thank you.
This output appears correct. CDO has to make a decision about which timestep to use when averaging. In this case it takes the mid-point of each day, which is 11:00.
You'll notice that in the first day the time is 23:00, as there is only one time.
However, it is not clear why you would want to shift the time back one hour. Your code is not actually calculating the daily mean. Instead it is the mean of the final 23 hours of one day and the first hour of the next. Just change your CDO call to the following and everything should be fine:
cdo daymean in.nc out.nc
Robert Wilson's answer is correct; I just wanted to quickly clarify that the confusion here is due to the difference between
Instantaneous fields: clouds, water vapour, temperatures, winds, etc. These fields are valid at the instant of the timestamp.
Accumulated fields: radiative fluxes, latent and sensible heat fluxes, precipitation, and so on. These are accumulated over a period of time, and the timestamp is placed at the end of the window.
Thus for instantaneous fields Robert is correct that you don't want to shift, if you consider 00Z to belong to the subsequent day; you could equally validly argue that midnight should be included in the previous day (in which case you would need the shift), as it lies on the border. Convention says you don't shift, and count 00...23 as one day.
Concerning the fluxes, there are also more details in this post: Calculating ERA5 Daily Total Precipitation using CDO
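To make the convention concrete, here is a small pandas sketch (illustrative only, it is not part of the CDO workflow above) of how the midnight timestep moves between days depending on whether you shift:
import numpy as np
import pandas as pd

hours = pd.date_range("2017-01-01 00:00", "2017-01-02 23:00", freq="H")
temp = pd.Series(np.arange(len(hours), dtype=float), index=hours)

# Instantaneous fields: hours 00..23 of each calendar day are averaged together.
daily_mean_instant = temp.resample("D").mean()

# Accumulated fields: the stamp marks the *end* of the accumulation window, so
# 00 UTC belongs to the previous day. Shifting the index back one hour before
# averaging reassigns it, which is what `cdo daymean -shifttime,-1hour` does.
daily_mean_shifted = temp.shift(-1, freq="H").resample("D").mean()

print(daily_mean_instant.head(), daily_mean_shifted.head(), sep="\n")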
I also had a similar issue with GLDAS 3-hourly temperature data.
Let's say I use data for 1948. The first file is GLDAS_NOAH025_3H.A19480101.0300.020.nc4, which is the temperature for 1948-01-01 00:00:00 -- 1948-01-01 03:00:00, and the last file with 1948 in its name is GLDAS_NOAH025_3H.A19481231.2100.020.nc4, which is the temperature for 1948-12-31 18:00:00 -- 1948-12-31 21:00:00.
I added GLDAS_NOAH025_3H.A19490101.0000.020.nc4 to the 1948 folder and merged all files into a single netCDF using:
cdo mergetime *.nc4 merge_1948.nc4
Then I try to calculate the daily mean using:
cdo daymean merge_1948.nc4 tmean_1948.nc4
Unfortunately the total number of timesteps is 367; the first covers 1948-01-01 00:00:00 -- 1948-01-01 21:00:00 and the last covers 1948-12-31 21:00:00 -- 1949-01-01 00:00:00.
So I used shifttime, and it solved the problem:
cdo daymean -shifttime,-3hour merge_1948.nc4 temp.nc4
cdo -shifttime,3hour temp.nc4 tmean_1948.nc4
Now the first timestep covers 1948-01-01 00:00:00 -- 1948-01-02 00:00:00 and the last covers 1948-12-31 00:00:00 -- 1949-01-01 00:00:00.

Forward fill seasonal data in pandas

I have hourly observations of several variables that exhibit daily seasonality. I want to fill any missing value with the corresponding variable's value 24 hours prior.
Ideally my function would fill the missing values from oldest to newest. Thus if there are 25 consecutive missing values, the 25th missing value is filled with the same value as the first missing value. Using Series.map() fails in this case.
value desired_output
hour
2019-08-17 00:00:00 58.712986 58.712986
2019-08-17 01:00:00 28.904234 28.904234
2019-08-17 02:00:00 14.275149 14.275149
2019-08-17 03:00:00 58.777087 58.777087
2019-08-17 04:00:00 95.964955 95.964955
2019-08-17 05:00:00 64.971372 64.971372
2019-08-17 06:00:00 95.759469 95.759469
2019-08-17 07:00:00 98.675457 98.675457
2019-08-17 08:00:00 77.510319 77.510319
2019-08-17 09:00:00 56.492446 56.492446
2019-08-17 10:00:00 90.968924 90.968924
2019-08-17 11:00:00 66.647501 66.647501
2019-08-17 12:00:00 7.756725 7.756725
2019-08-17 13:00:00 49.328135 49.328135
2019-08-17 14:00:00 28.634033 28.634033
2019-08-17 15:00:00 65.157161 65.157161
2019-08-17 16:00:00 93.127539 93.127539
2019-08-17 17:00:00 98.806335 98.806335
2019-08-17 18:00:00 94.789761 94.789761
2019-08-17 19:00:00 63.518037 63.518037
2019-08-17 20:00:00 89.524433 89.524433
2019-08-17 21:00:00 48.076081 48.076081
2019-08-17 22:00:00 5.027928 5.027928
2019-08-17 23:00:00 0.417763 0.417763
2019-08-18 00:00:00 29.933627 29.933627
2019-08-18 01:00:00 61.726948 61.726948
2019-08-18 02:00:00 NaN 14.275149
2019-08-18 03:00:00 NaN 58.777087
2019-08-18 04:00:00 NaN 95.964955
2019-08-18 05:00:00 NaN 64.971372
2019-08-18 06:00:00 NaN 95.759469
2019-08-18 07:00:00 NaN 98.675457
2019-08-18 08:00:00 NaN 77.510319
2019-08-18 09:00:00 NaN 56.492446
2019-08-18 10:00:00 NaN 90.968924
2019-08-18 11:00:00 NaN 66.647501
2019-08-18 12:00:00 NaN 7.756725
2019-08-18 13:00:00 NaN 49.328135
2019-08-18 14:00:00 NaN 28.634033
2019-08-18 15:00:00 NaN 65.157161
2019-08-18 16:00:00 NaN 93.127539
2019-08-18 17:00:00 NaN 98.806335
2019-08-18 18:00:00 NaN 94.789761
2019-08-18 19:00:00 NaN 63.518037
2019-08-18 20:00:00 NaN 89.524433
2019-08-18 21:00:00 NaN 48.076081
2019-08-18 22:00:00 NaN 5.027928
2019-08-18 23:00:00 NaN 0.417763
2019-08-19 00:00:00 NaN 29.933627
2019-08-19 01:00:00 NaN 61.726948
2019-08-19 02:00:00 NaN 14.275149
2019-08-19 03:00:00 NaN 58.777087
2019-08-19 04:00:00 NaN 95.964955
2019-08-19 05:00:00 NaN 64.971372
2019-08-19 06:00:00 NaN 95.759469
2019-08-19 07:00:00 NaN 98.675457
2019-08-19 08:00:00 NaN 77.510319
2019-08-19 09:00:00 NaN 56.492446
2019-08-19 10:00:00 NaN 90.968924
2019-08-19 11:00:00 NaN 66.647501
2019-08-19 12:00:00 NaN 7.756725
2019-08-19 13:00:00 61.457913 61.457913
2019-08-19 14:00:00 52.429383 52.429383
2019-08-19 15:00:00 79.016485 79.016485
2019-08-19 16:00:00 77.724758 77.724758
2019-08-19 17:00:00 62.205810 62.205810
2019-08-19 18:00:00 15.841707 15.841707
2019-08-19 19:00:00 72.196028 72.196028
2019-08-19 20:00:00 5.497441 5.497441
2019-08-19 21:00:00 30.737596 30.737596
2019-08-19 22:00:00 65.550690 65.550690
2019-08-19 23:00:00 3.543332 3.543332
import pandas as pd
from dateutil.relativedelta import relativedelta as rel_delta
df['isna'] = df['value'].isna()
df['value'] = df.index.map(lambda t: df.at[t - rel_delta(hours=24), 'value'] if df.at[t,'isna'] and t - rel_delta(hours=24) >= df.index.min() else df.at[t, 'value'])
What is the most efficient way to complete this naive forward fill?
IIUC, just groupby time and ffill()
df['results'] = df.groupby(df.hour.dt.time).value.ffill()
If hour is your index, just do df.index.time instead.
Checking:
>>> (df['results'] == df['desired_output']).all()
True
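For reference, a self-contained sketch of that approach (index and column names assumed to match the question):
import numpy as np
import pandas as pd

idx = pd.date_range("2019-08-17", periods=72, freq="H", name="hour")
df = pd.DataFrame({"value": np.random.rand(72)}, index=idx)
df.iloc[26:61, 0] = np.nan                      # a long run of missing hours

# Group rows by their time of day; within each group, ffill() propagates the
# last observed value, i.e. the value from 24 (or 48, 72, ...) hours earlier.
df["results"] = df.groupby(df.index.time)["value"].ffill()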
Wouldn't this work?
df['value'] = df['value'].fillna(df.index.hour)
Separate Date and Time into two columns as strings. Call it df.
Date Time Value
0 2019-08-17 00:00:00 58.712986
1 2019-08-17 01:00:00 28.904234
2 2019-08-17 02:00:00 14.275149
3 2019-08-17 03:00:00 58.777087
4 2019-08-17 04:00:00 95.964955
Then reshape the data: pivot Time into column headers and forward-fill NaNs along each hour of the day.
(df reshaping)
Date 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00
2019-08-17 58.712986 28.904234 14.275149 58.777087 95.964955
2019-08-18 29.933627 61.726948 NaN NaN NaN
2019-08-19 NaN NaN NaN NaN NaN
(df ffill)
Date 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00
2019-08-17 58.712986 28.904234 14.275149 58.777087 95.964955
2019-08-18 29.933627 61.726948 14.275149 58.777087 95.964955
2019-08-19 29.933627 61.726948 14.275149 58.777087 95.964955
(Code)
(df.set_index(['Date', 'Time'])['Value']
   .unstack()
   .ffill()
   .stack()
   .reset_index(name='Value'))

Create an hourly Series of a year

I'm not able to create a pandas Series of every hour (as datetime objects) of a given year without iterating and adding one hour to the previous value, and that's slow. Is there any way to do this without a loop, in a vectorized way?
My input is a year, and the output should be a pandas Series of every hour of that year.
You can use pd.date_range with freq='H' which is hourly frequency:
Edit: end with 23:00:00, after a comment by @ALollz.
year = 2019
pd.Series(pd.date_range(start=f'{year}-01-01', end=f'{year}-12-31 23:00:00', freq='H'))
0 2019-01-01 00:00:00
1 2019-01-01 01:00:00
2 2019-01-01 02:00:00
3 2019-01-01 03:00:00
4 2019-01-01 04:00:00
5 2019-01-01 05:00:00
6 2019-01-01 06:00:00
7 2019-01-01 07:00:00
8 2019-01-01 08:00:00
9 2019-01-01 09:00:00
10 2019-01-01 10:00:00
11 2019-01-01 11:00:00
12 2019-01-01 12:00:00
13 2019-01-01 13:00:00
14 2019-01-01 14:00:00
15 2019-01-01 15:00:00
16 2019-01-01 16:00:00
17 2019-01-01 17:00:00
18 2019-01-01 18:00:00
19 2019-01-01 19:00:00
20 2019-01-01 20:00:00
21 2019-01-01 21:00:00
22 2019-01-01 22:00:00
23 2019-01-01 23:00:00
24 2019-01-02 00:00:00
25 2019-01-02 01:00:00
26 2019-01-02 02:00:00
27 2019-01-02 03:00:00
28 2019-01-02 04:00:00
29 2019-01-02 05:00:00
30 2019-01-02 06:00:00
31 2019-01-02 07:00:00
32 2019-01-02 08:00:00
33 2019-01-02 09:00:00
34 2019-01-02 10:00:00
35 2019-01-02 11:00:00
36 2019-01-02 12:00:00
37 2019-01-02 13:00:00
38 2019-01-02 14:00:00
39 2019-01-02 15:00:00
40 2019-01-02 16:00:00
41 2019-01-02 17:00:00
42 2019-01-02 18:00:00
43 2019-01-02 19:00:00
44 2019-01-02 20:00:00
45 2019-01-02 21:00:00
46 2019-01-02 22:00:00
47 2019-01-02 23:00:00
48 2019-01-03 00:00:00
49 2019-01-03 01:00:00
...
8711 2019-12-29 23:00:00
8712 2019-12-30 00:00:00
8713 2019-12-30 01:00:00
8714 2019-12-30 02:00:00
8715 2019-12-30 03:00:00
8716 2019-12-30 04:00:00
8717 2019-12-30 05:00:00
8718 2019-12-30 06:00:00
8719 2019-12-30 07:00:00
8720 2019-12-30 08:00:00
8721 2019-12-30 09:00:00
8722 2019-12-30 10:00:00
8723 2019-12-30 11:00:00
8724 2019-12-30 12:00:00
8725 2019-12-30 13:00:00
8726 2019-12-30 14:00:00
8727 2019-12-30 15:00:00
8728 2019-12-30 16:00:00
8729 2019-12-30 17:00:00
8730 2019-12-30 18:00:00
8731 2019-12-30 19:00:00
8732 2019-12-30 20:00:00
8733 2019-12-30 21:00:00
8734 2019-12-30 22:00:00
8735 2019-12-30 23:00:00
8736 2019-12-31 00:00:00
8737 2019-12-31 01:00:00
8738 2019-12-31 02:00:00
8739 2019-12-31 03:00:00
8740 2019-12-31 04:00:00
8741 2019-12-31 05:00:00
8742 2019-12-31 06:00:00
8743 2019-12-31 07:00:00
8744 2019-12-31 08:00:00
8745 2019-12-31 09:00:00
8746 2019-12-31 10:00:00
8747 2019-12-31 11:00:00
8748 2019-12-31 12:00:00
8749 2019-12-31 13:00:00
8750 2019-12-31 14:00:00
8751 2019-12-31 15:00:00
8752 2019-12-31 16:00:00
8753 2019-12-31 17:00:00
8754 2019-12-31 18:00:00
8755 2019-12-31 19:00:00
8756 2019-12-31 20:00:00
8757 2019-12-31 21:00:00
8758 2019-12-31 22:00:00
8759 2019-12-31 23:00:00
Length: 8760, dtype: datetime64[ns]
Note: if your Python version is lower than 3.6, use .format for string formatting:
year = 2019
pd.Series(pd.date_range(start='{}-01-01'.format(year), end='{}-12-31 23:00:00'.format(year), freq='H'))
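A possible alternative (a sketch, not from the original answer) that avoids typing the end-of-year timestamp by hand and handles leap years automatically; note the keyword is inclusive='left' in pandas >= 1.4 and closed='left' in older versions:
import pandas as pd

year = 2020
hours = pd.Series(
    pd.date_range(start=f"{year}-01-01", end=f"{year + 1}-01-01",
                  freq="H", inclusive="left")   # drop the first hour of the next year
)
print(len(hours))  # 8784 for the leap year 2020; 8760 for ordinary years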

Pandas: Copy a value in a dataframe to multiple rows based on a condition

I need to parse a price series in a pandas dataframe for occurrences of two consecutive lower lows, which create a price level we call NLBL. I'm able to detect this with a simple conditional (see below), but instead of the TRUE values I need the High of the third previous candle. PLUS I need to copy that very same level forward four more times.
This is some example data:
Date Time Open High Low Close
datetime
2019-01-22 11:00:00 2019-01-22 11:00:00 2643.99 2647.47 2634.73 2634.73
2019-01-22 12:00:00 2019-01-22 12:00:00 2634.79 2638.55 2632.69 2635.94
2019-01-22 13:00:00 2019-01-22 13:00:00 2635.95 2636.35 2623.30 2631.93
2019-01-22 14:00:00 2019-01-22 14:00:00 2631.92 2632.29 2618.33 2622.66
2019-01-22 15:00:00 2019-01-22 15:00:00 2622.71 2632.90 2617.27 2625.49
2019-01-22 16:00:00 2019-01-22 16:00:00 2625.58 2633.81 2625.58 2633.81
2019-01-23 09:00:00 2019-01-23 09:00:00 2643.48 2652.44 2643.48 2650.97
2019-01-23 10:00:00 2019-01-23 10:00:00 2651.00 2653.19 2632.85 2634.47
2019-01-23 11:00:00 2019-01-23 11:00:00 2634.47 2638.55 2617.36 2617.46
2019-01-23 12:00:00 2019-01-23 12:00:00 2617.47 2627.43 2612.86 2627.31
2019-01-23 13:00:00 2019-01-23 13:00:00 2627.31 2631.70 2621.62 2629.92
2019-01-23 14:00:00 2019-01-23 14:00:00 2629.93 2635.26 2625.34 2629.21
2019-01-23 15:00:00 2019-01-23 15:00:00 2629.25 2639.22 2628.71 2636.61
2019-01-23 16:00:00 2019-01-23 16:00:00 2636.71 2639.54 2636.71 2638.60
2019-01-24 09:00:00 2019-01-24 09:00:00 2638.84 2641.03 2631.06 2636.14
2019-01-24 10:00:00 2019-01-24 10:00:00 2636.18 2647.20 2633.12 2640.49
2019-01-24 11:00:00 2019-01-24 11:00:00 2640.31 2645.37 2633.60 2644.08
2019-01-24 12:00:00 2019-01-24 12:00:00 2644.14 2644.42 2632.79 2634.31
2019-01-24 13:00:00 2019-01-24 13:00:00 2634.34 2635.16 2627.01 2633.62
2019-01-24 14:00:00 2019-01-24 14:00:00 2633.64 2638.47 2630.96 2637.04
2019-01-24 15:00:00 2019-01-24 15:00:00 2637.03 2643.21 2636.46 2642.66
2019-01-24 16:00:00 2019-01-24 16:00:00 2642.63 2643.10 2641.97 2641.99
2019-01-25 09:00:00 2019-01-25 09:00:00 2657.44 2663.57 2657.33 2661.64
2019-01-25 10:00:00 2019-01-25 10:00:00 2661.60 2671.61 2661.60 2669.49
2019-01-25 11:00:00 2019-01-25 11:00:00 2669.47 2670.50 2664.18 2669.13
2019-01-25 12:00:00 2019-01-25 12:00:00 2669.12 2672.38 2661.39 2664.88
2019-01-25 13:00:00 2019-01-25 13:00:00 2664.88 2668.49 2663.76 2667.93
2019-01-25 14:00:00 2019-01-25 14:00:00 2667.95 2669.12 2661.14 2665.27
2019-01-25 15:00:00 2019-01-25 15:00:00 2665.27 2666.52 2658.75 2663.06
2019-01-25 16:00:00 2019-01-25 16:00:00 2662.98 2664.74 2661.64 2664.14
This is how far I've gotten so far:
min_data['NLBL'] = (min_data['Low'] < min_data['Low'].shift(1)) & (min_data['Low'].shift(1) < min_data['Low'].shift(2))
min_data['NLBL'] = min_data['NLBL'].shift(periods=1) # shifting downward as the trigger is valid after the close
print("\nResult:\n %s" % min_data.tail(30))
Result:
Date Time Open High Low Close \
datetime
2019-01-22 11:00:00 2019-01-22 11:00:00 2643.99 2647.47 2634.73 2634.73
2019-01-22 12:00:00 2019-01-22 12:00:00 2634.79 2638.55 2632.69 2635.94
2019-01-22 13:00:00 2019-01-22 13:00:00 2635.95 2636.35 2623.30 2631.93
2019-01-22 14:00:00 2019-01-22 14:00:00 2631.92 2632.29 2618.33 2622.66
2019-01-22 15:00:00 2019-01-22 15:00:00 2622.71 2632.90 2617.27 2625.49
2019-01-22 16:00:00 2019-01-22 16:00:00 2625.58 2633.81 2625.58 2633.81
2019-01-23 09:00:00 2019-01-23 09:00:00 2643.48 2652.44 2643.48 2650.97
2019-01-23 10:00:00 2019-01-23 10:00:00 2651.00 2653.19 2632.85 2634.47
2019-01-23 11:00:00 2019-01-23 11:00:00 2634.47 2638.55 2617.36 2617.46
2019-01-23 12:00:00 2019-01-23 12:00:00 2617.47 2627.43 2612.86 2627.31
2019-01-23 13:00:00 2019-01-23 13:00:00 2627.31 2631.70 2621.62 2629.92
2019-01-23 14:00:00 2019-01-23 14:00:00 2629.93 2635.26 2625.34 2629.21
2019-01-23 15:00:00 2019-01-23 15:00:00 2629.25 2639.22 2628.71 2636.61
2019-01-23 16:00:00 2019-01-23 16:00:00 2636.71 2639.54 2636.71 2638.60
2019-01-24 09:00:00 2019-01-24 09:00:00 2638.84 2641.03 2631.06 2636.14
2019-01-24 10:00:00 2019-01-24 10:00:00 2636.18 2647.20 2633.12 2640.49
2019-01-24 11:00:00 2019-01-24 11:00:00 2640.31 2645.37 2633.60 2644.08
2019-01-24 12:00:00 2019-01-24 12:00:00 2644.14 2644.42 2632.79 2634.31
2019-01-24 13:00:00 2019-01-24 13:00:00 2634.34 2635.16 2627.01 2633.62
2019-01-24 14:00:00 2019-01-24 14:00:00 2633.64 2638.47 2630.96 2637.04
2019-01-24 15:00:00 2019-01-24 15:00:00 2637.03 2643.21 2636.46 2642.66
2019-01-24 16:00:00 2019-01-24 16:00:00 2642.63 2643.10 2641.97 2641.99
2019-01-25 09:00:00 2019-01-25 09:00:00 2657.44 2663.57 2657.33 2661.64
2019-01-25 10:00:00 2019-01-25 10:00:00 2661.60 2671.61 2661.60 2669.49
2019-01-25 11:00:00 2019-01-25 11:00:00 2669.47 2670.50 2664.18 2669.13
2019-01-25 12:00:00 2019-01-25 12:00:00 2669.12 2672.38 2661.39 2664.88
2019-01-25 13:00:00 2019-01-25 13:00:00 2664.88 2668.49 2663.76 2667.93
2019-01-25 14:00:00 2019-01-25 14:00:00 2667.95 2669.12 2661.14 2665.27
2019-01-25 15:00:00 2019-01-25 15:00:00 2665.27 2666.52 2658.75 2663.06
2019-01-25 16:00:00 2019-01-25 16:00:00 2662.98 2664.74 2661.64 2664.14
NLBL
datetime
2019-01-22 11:00:00 True
2019-01-22 12:00:00 True
2019-01-22 13:00:00 True
2019-01-22 14:00:00 True
2019-01-22 15:00:00 True
2019-01-22 16:00:00 True
2019-01-23 09:00:00 False
2019-01-23 10:00:00 False
2019-01-23 11:00:00 False
2019-01-23 12:00:00 True
2019-01-23 13:00:00 True
2019-01-23 14:00:00 False
2019-01-23 15:00:00 False
2019-01-23 16:00:00 False
2019-01-24 09:00:00 False
2019-01-24 10:00:00 False
2019-01-24 11:00:00 False
2019-01-24 12:00:00 False
2019-01-24 13:00:00 False
2019-01-24 14:00:00 True
2019-01-24 15:00:00 False
2019-01-24 16:00:00 False
2019-01-25 09:00:00 False
2019-01-25 10:00:00 False
2019-01-25 11:00:00 False
2019-01-25 12:00:00 False
2019-01-25 13:00:00 False
2019-01-25 14:00:00 False
2019-01-25 15:00:00 False
2019-01-25 16:00:00 True
So here is where I am stuck. What I need to do from here are two things:
Replace each True in min_data['NLBL'] with High.shift(3) - basically the High three candles back. Also set each False to 0.
Copy every min_data['NLBL'] row that is not populated with 0 forward four more times, but stop early if a new level appears.
I assume a lambda expression would be appropriate, but doing all this in the context of pandas has me stumped. Any ideas/insights on how this can be done without resorting to a slow/ugly/annoying if loop?
This is just one example of several similar patterns I will have to implement. So solving this is a big issue for me and any help would be greatly appreciated.
Thanks in advance!
UPDATE: Someone asked for the correct output of the NLBL column:
NLBL
datetime
2019-01-22 14:00:00 2647.47
2019-01-22 15:00:00 2638.55
2019-01-22 16:00:00 2636.35
2019-01-23 09:00:00 0
2019-01-23 10:00:00 0
2019-01-23 11:00:00 0
2019-01-23 12:00:00 2652.44
2019-01-23 13:00:00 2653.19
2019-01-23 14:00:00 2653.19
2019-01-23 15:00:00 2653.19
2019-01-23 16:00:00 2653.19
2019-01-24 09:00:00 2653.19
2019-01-24 10:00:00 0
2019-01-24 11:00:00 0
2019-01-24 12:00:00 0
2019-01-24 13:00:00 0
2019-01-24 14:00:00 2645.37
2019-01-24 15:00:00 2645.37
2019-01-24 16:00:00 2645.37
2019-01-25 09:00:00 2645.37
2019-01-25 10:00:00 2645.37
2019-01-25 11:00:00 0
2019-01-25 12:00:00 0
2019-01-25 13:00:00 0
2019-01-25 14:00:00 0
2019-01-25 15:00:00 0
2019-01-25 16:00:00 2668.49
If it gets to a row with a TRUE value in the 'NLBL' column, it counts three rows backward, grabs the 'High' value there, and replaces TRUE with it. It then copies that same 'High' value to the following four rows.
However, if it finds a new TRUE it stops copying forward and uses the new High value.
Hope this makes sense.
Thanks for the clarification, pretty sure I understood you (though if I understood correctly then there's a small error in the sample output).
Here's my solution: it basically adds a helper column, swaps the False flags out for NaN (if performance is a serious concern you could look into map instead of replace), and uses two fillna calls:
import numpy as np

# High three candles back: the level we want to stamp onto each trigger row
min_data['helper'] = min_data['High'].shift(3)
# Replace each True flag with that level, then drop the helper column
min_data.loc[min_data['NLBL'] == True, 'NLBL'] = min_data.loc[min_data['NLBL'] == True, 'helper']
min_data = min_data.drop(columns=['helper'])
# False -> NaN so it can be filled, carry each level forward at most 4 rows,
# and turn anything still missing into 0
min_data.NLBL = min_data.NLBL.replace(False, np.nan)
min_data['NLBL'] = min_data['NLBL'].fillna(method='ffill', limit=4)
min_data['NLBL'] = min_data['NLBL'].fillna(0)
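To see why the limit=4 forward fill reproduces the desired behaviour, here is a tiny standalone demo using two of the levels from the sample output (the surrounding NaNs are made up):
import numpy as np
import pandas as pd

levels = pd.Series([2647.47, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                    2653.19, np.nan, np.nan])

filled = levels.ffill(limit=4).fillna(0)
# -> 2647.47 is copied into the next four rows only, the two rows after that
#    become 0, and 2653.19 starts a fresh four-row fill of its own.
print(filled.tolist())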

subselect pandas dataframe based on index

I have a dataframe and I want to remove certain specific repeating rows:
import numpy as np
import pandas as pd
nrows = 144
df = pd.DataFrame(np.random.rand(nrows,), pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'), columns=['A'])
The dataframe is continuous in time, providing data every two hours ad infinitum, but I've chosen to show only a subset for brevity. I want to remove the data every 72 hours at 8:00, starting on Mondays, to coincide with an external event that alters the data. For this snapshot of data I want to remove the rows indexed at 2016-02-08 08:00, 2016-02-11 08:00, +3D, etc.
Is there a simple way to do this?
IIUC you could do this:
In [18]:
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
start
Out[18]:
Timestamp('2016-02-08 08:00:00')
In [45]:
df.loc[df.index.difference(pd.date_range(start, end=df.index[-1], freq='3D'))]
Out[45]:
A
2016-02-08 00:00:00 0.323742
2016-02-08 02:00:00 0.962252
2016-02-08 04:00:00 0.706537
2016-02-08 06:00:00 0.561446
2016-02-08 10:00:00 0.225042
2016-02-08 12:00:00 0.746258
2016-02-08 14:00:00 0.167950
2016-02-08 16:00:00 0.199958
2016-02-08 18:00:00 0.808286
2016-02-08 20:00:00 0.288797
2016-02-08 22:00:00 0.508109
2016-02-09 00:00:00 0.980772
2016-02-09 02:00:00 0.995731
2016-02-09 04:00:00 0.742751
2016-02-09 06:00:00 0.392247
2016-02-09 08:00:00 0.460511
2016-02-09 10:00:00 0.083660
2016-02-09 12:00:00 0.273620
2016-02-09 14:00:00 0.791506
2016-02-09 16:00:00 0.440630
2016-02-09 18:00:00 0.326418
2016-02-09 20:00:00 0.790780
2016-02-09 22:00:00 0.521131
2016-02-10 00:00:00 0.219315
2016-02-10 02:00:00 0.016625
2016-02-10 04:00:00 0.958566
2016-02-10 06:00:00 0.405643
2016-02-10 08:00:00 0.958025
2016-02-10 10:00:00 0.786663
2016-02-10 12:00:00 0.589064
... ...
2016-02-17 12:00:00 0.360848
2016-02-17 14:00:00 0.757499
2016-02-17 16:00:00 0.391574
2016-02-17 18:00:00 0.062812
2016-02-17 20:00:00 0.308282
2016-02-17 22:00:00 0.251520
2016-02-18 00:00:00 0.832871
2016-02-18 02:00:00 0.387108
2016-02-18 04:00:00 0.070969
2016-02-18 06:00:00 0.298831
2016-02-18 08:00:00 0.878526
2016-02-18 10:00:00 0.979233
2016-02-18 12:00:00 0.386620
2016-02-18 14:00:00 0.420962
2016-02-18 16:00:00 0.238879
2016-02-18 18:00:00 0.124069
2016-02-18 20:00:00 0.985828
2016-02-18 22:00:00 0.585278
2016-02-19 00:00:00 0.409226
2016-02-19 02:00:00 0.093945
2016-02-19 04:00:00 0.389450
2016-02-19 06:00:00 0.378091
2016-02-19 08:00:00 0.874232
2016-02-19 10:00:00 0.527629
2016-02-19 12:00:00 0.490236
2016-02-19 14:00:00 0.509008
2016-02-19 16:00:00 0.097061
2016-02-19 18:00:00 0.111626
2016-02-19 20:00:00 0.877099
2016-02-19 22:00:00 0.796201
[140 rows x 1 columns]
So this determines the start of the range by matching dayofweek and hour and taking the first index value. We then generate an index using date_range, call difference on the DataFrame's index to remove those rows, and pass the result to loc.
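For copy/paste, the same approach condensed into one self-contained snippet (a sketch using the names from the answer):
import numpy as np
import pandas as pd

nrows = 144
df = pd.DataFrame(np.random.rand(nrows),
                  index=pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'),
                  columns=['A'])

# First Monday 08:00 in the index ...
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
# ... every 72 hours from there ...
drop = pd.date_range(start, end=df.index[-1], freq='3D')
# ... and keep everything else.
df = df.loc[df.index.difference(drop)]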
