Duplicate rows for dates between interval - python

I have a frame like this:
ID  Start       Stop
1   2020-01-01  2020-01-05
2   2020-01-01  2020-01-10
And I want to duplicate the rows so I end up with a table like this:
ID  Start       Stop        Date
1   2020-01-01  2020-01-05  2020-01-01
1   2020-01-01  2020-01-05  2020-01-02
1   2020-01-01  2020-01-05  2020-01-03
1   2020-01-01  2020-01-05  2020-01-04
1   2020-01-01  2020-01-05  2020-01-05
2   2020-01-01  2020-01-10  2020-01-01
2   2020-01-01  2020-01-10  2020-01-02
2   2020-01-01  2020-01-10  2020-01-03
2   2020-01-01  2020-01-10  2020-01-04
2   2020-01-01  2020-01-10  2020-01-05
2   2020-01-01  2020-01-10  2020-01-06
2   2020-01-01  2020-01-10  2020-01-07
2   2020-01-01  2020-01-10  2020-01-08
2   2020-01-01  2020-01-10  2020-01-09
2   2020-01-01  2020-01-10  2020-01-10
I am, however, lost on how to achieve this. Any pointers?

Generate a list of dates per row using date_range(), then expand it into one row per date using explode():
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""ID Start Stop
1 2020-01-01 2020-01-05
2 2020-01-01 2020-01-10
"""), sep=r"\s+", index_col=0)
# both columns need to be datetimes so date_range() can use them
df.Start = pd.to_datetime(df.Start)
df.Stop = pd.to_datetime(df.Stop)
# build a list of dates per row, then explode it into separate rows
df.assign(Date=lambda dfa: dfa.apply(lambda r: pd.date_range(r["Start"], r["Stop"]), axis=1)).explode("Date")

You can also use the DataFrame method iterrows to loop over the rows of your original DataFrame and the function date_range to build the dates between each Start and Stop date.
Create a small DataFrame for each row of your original DataFrame (df), then combine all the created DataFrames into one big DataFrame:
import pandas as pd

expanded_dfs = []
for idx, row in df.iterrows():
    # all dates between Start and Stop, inclusive
    dates = pd.date_range(row["Start"], row["Stop"], freq="D")
    expanded = pd.DataFrame({
        "Start": row["Start"],
        "Stop": row["Stop"],
        "Date": dates,
        "ID": row["ID"]
    })
    expanded_dfs.append(expanded)
pd.concat(expanded_dfs)
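If you prefer a clean 0-based index on the combined frame rather than the repeated per-row indices, you can pass ignore_index=True to concat (a small variation on the snippet above):
pd.concat(expanded_dfs, ignore_index=True)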

Related

Filter rows in DataFrame where certain conditions are met?

I have a DataFrame with relevant stock information that looks like this.
[Screenshot of my dataframe]
I need a new DataFrame that stores the rows where the 'close' of one row is different from the 'open' of the next row, keeping all of the columns from those rows. To clarify, I would like both rows of each such pair to be stored in the new DataFrame.
DataFrame as text as requested:
timestamp open high low close volume
0 2020-01-01 00:00:00 129.16 130.98 128.68 130.24 4.714333e+04
1 2020-01-01 08:00:00 130.24 132.40 129.87 132.08 5.183323e+04
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 4.579396e+04
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 6.606601e+04
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 4.849893e+04
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 9.919212e+04
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 1.276414e+05
This can be accomplished using Series.shift
>>> df['close'] != df['open'].shift(-1)
0    False
1    False
2     True
3     True
4     True
5    False
6     True
dtype: bool
This compares the close value in one row to the open value of the next row ("shifted" one row ahead).
You can then select the rows for which the condition is True.
>>> df[df['close'] != df['open'].shift(-1)]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
This only returns the first row of each mismatched pair; to also pick up the row that follows it, we can shift the condition forward by one and combine the two conditions.
>>> row_condition = df['close'] != df['open'].shift(-1)
>>> row_before = row_condition.shift(1)
>>> df[row_condition | row_before]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 99192.12
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
Providing a textual sample of the DataFrame is useful because it can be copied directly into a Python session; otherwise I would have had to type out the contents of your screenshot by hand.

Most effective method to get the max value from a column based on a timedelta calculated from the current row

I would like to identify the maximum value in a column that occurs within the following X days from the current date.
This is a subset of the DataFrame, showing the daily values for 2020.
Date Data
6780 2020-01-02 323.540009
6781 2020-01-03 321.160004
6782 2020-01-06 320.489990
6783 2020-01-07 323.019989
6784 2020-01-08 322.940002
... ... ...
7028 2020-12-24 368.079987
7029 2020-12-28 371.739990
7030 2020-12-29 373.809998
7031 2020-12-30 372.339996
I would like to find a way to identify the max value within the following 30 days. e.g.
Date Data Max
6780 2020-01-02 323.540009 323.019989
6781 2020-01-03 321.160004 323.019989
6782 2020-01-06 320.489990 323.730011
6783 2020-01-07 323.019989 323.540009
6784 2020-01-08 322.940002 325.779999
... ... ... ...
7028 2020-12-24 368.079987 373.809998
7029 2020-12-28 371.739990 373.809998
7030 2020-12-29 373.809998 372.339996
7031 2020-12-30 372.339996 373.100006
I tried calculating the start and end dates and storing them in the columns. e.g.
df['startDate'] = df['Date'] + pd.to_timedelta(1, unit='d')
df['endDate'] = df['Date'] + pd.to_timedelta(30, unit='d')
before trying to calculate the max. e.g,
df['Max'] = df.loc[(df['Date'] > df['startDate']) & (df['Date'] < df['endDate'])]['Data'].max()
But this results in:
Date Data startDate endDate Max
6780 2020-01-02 323.540009 2020-01-03 2020-01-29 NaN
6781 2020-01-03 321.160004 2020-01-04 2020-01-30 NaN
6782 2020-01-06 320.489990 2020-01-07 2020-02-02 NaN
6783 2020-01-07 323.019989 2020-01-08 2020-02-03 NaN
6784 2020-01-08 322.940002 2020-01-09 2020-02-04 NaN
... ... ... ... ... ...
7027 2020-12-23 368.279999 2020-12-24 2021-01-19 NaN
7028 2020-12-24 368.079987 2020-12-25 2021-01-20 NaN
7029 2020-12-28 371.739990 2020-12-29 2021-01-24 NaN
7030 2020-12-29 373.809998 2020-12-31 2021-01-26 NaN
If I hard-code dates into the loc[] statement it partially works, filling in the max for that static range, but then every row just gets the same value.
Any help on the correct pandas way to achieve this would be appreciated.
Kind regards
df.rolling can do this if you convert the date column to datetime and use it as the index:
df["Date"] = pd.to_datetime(df.Date)
df.set_index("Date").rolling("2d").max()
output:
Data
Date
2020-01-02 323.540009
2020-01-03 323.540009
2020-01-06 320.489990
2020-01-07 323.019989
2020-01-08 323.019989
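Note that the rolling window above looks backwards and uses a 2-day span purely as an illustration. For the forward-looking 30-day maximum the question asks for, one straightforward (if not the fastest) option is a per-row lookup; a sketch, assuming Date is already a datetime column and excluding the current row, as in the startDate/endDate attempt above:
df['Max'] = [
    # max of Data over the rows whose Date falls within the next 30 days
    df.loc[(df['Date'] > d) & (df['Date'] <= d + pd.Timedelta(days=30)), 'Data'].max()
    for d in df['Date']
]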

How to iterate through a time frame?

Okay, so I have some S&P 500 minute data from a CSV file. I am looking to iterate through the timestamps based on the time of day. So far the code looks like this:
import datetime as dt
import numpy as np
import pandas as pd

d = pd.read_csv('/Volumes/Seagate Portable/usindex_2020_all_tickers_awvbxk9/SPX_2020_2020.txt')
d.columns = ['Dates', 'Open', 'High', 'Low', 'Close']
d.Dates = pd.to_datetime(d.Dates)
d = d[(d.Dates.dt.time == dt.time(9, 30)) | (d.Dates.dt.time == dt.time(16, 0))].copy()
d.drop(['High', 'Low'], axis=1, inplace=True)
d.index = range(len(d.Open))
for i in d.index:
    if dt.time(16, 0) in d.Dates[i]:
        d['Open'][i] == np.NaN
The imported csv looks like this:
Date Open Close
0 2020-01-02 16:00:00 3258.14 3257.98
1 2020-01-03 09:30:00 3226.36 3225.79
2 2020-01-03 16:00:00 3234.35 3234.57
3 2020-01-06 09:30:00 3217.55 3215.01
4 2020-01-06 16:00:00 3246.23 3246.28
5 2020-01-07 09:30:00 3241.86 3238.09
6 2020-01-07 16:00:00 3237.13 3237.18
7 2020-01-08 09:30:00 3238.59 3236.82
8 2020-01-08 16:00:00 3253.21 3253.06
9 2020-01-09 09:30:00 3266.03 3270.29
10 2020-01-09 16:00:00 3274.74 3274.66
11 2020-01-10 09:30:00 3281.81 3281.20
12 2020-01-10 16:00:00 3265.39 3265.34
13 2020-01-13 09:30:00 3271.13 3273.28
14 2020-01-13 16:00:00 3287.98 3288.05
15 2020-01-14 09:30:00 3285.35 3285.09
16 2020-01-14 16:00:00 3282.93 3282.89
17 2020-01-15 09:30:00 3282.27 3281.75
18 2020-01-15 16:00:00 3289.76 3289.40
19 2020-01-16 09:30:00 3302.97 3304.34
The error I am getting is TypeError: argument of type 'Timestamp' is not iterable.
What I am trying to do is set all the Open values at 16:00:00 to NaN while keeping the Close values for that time. Can I iterate through the timestamps with the same for loop, or is there another way to go through the rows and fill in the respective NaN values? Thanks!
in tests for membership in a collection or for a substring within a string; you cannot use it to check the time component of a Timestamp.
If you want to use a for loop:
for i in d.index:
    if d.loc[i, 'Dates'].time() == dt.time(16, 0):
        d.loc[i, 'Open'] = np.nan
But it's always better to use a vectorized function:
d['Open'] = d['Open'].mask(d['Dates'].dt.time == dt.time(16, 0))
Note that neither dt.time(16, 0) == d.Dates[i] nor dt.time(16, 0) is d.Dates[i] works inside such a loop: the first compares a time to a full Timestamp and is always False, and is tests object identity rather than equality. Use .time() on the Timestamp, as in the loop above, and assign with d.loc[i, 'Open'] = np.nan rather than chained indexing like d['Open'].loc[i].
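For completeness, the same assignment can also be written as a direct boolean .loc selection, equivalent to the mask call above (assuming Dates is already a datetime column):
d.loc[d['Dates'].dt.time == dt.time(16, 0), 'Open'] = np.nan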

Merge two DataFrames with missing rows

I want to merge two DataFrames:
df1:
dt_object Lng
1 2020-01-01 00:00:00 1.57423
2 2020-01-01 01:00:00 1.57444
3 2020-01-01 02:00:00 1.57465
4 2020-01-01 03:00:00 1.57486
df2:
dt_object Price
0 2020-01-03 10:00:00 256.086667
1 2020-01-03 11:00:00 256.526667
2 2020-01-03 12:00:00 257.386667
3 2020-01-03 13:00:00 256.703333
4 2020-01-03 14:00:00 255.320000
dt_object has type datetime64 in both cases.
df1 never has missing rows, so it has 24 rows per day (one per hour).
But df2 DOES have missing rows.
When I combine them, there is a mismatch.
df = pd.merge(df1, df2, on = 'dt_object')
Merged df:
dt_object Lng Price
0 2020-04-01 10:00:00 1.59270 183.996667
1 2020-04-01 11:00:00 1.59294 184.466667
2 2020-04-01 12:00:00 1.59319 184.810000
3 2020-04-01 13:00:00 1.59343 184.386667
4 2020-04-01 14:00:00 1.59367 184.533333
Problems:
The Lng value 1.59270 is in the wrong place: it ended up at 2020-04-01 10:00:00 but came from 04.01.2020 10:00:00, so the month and day got swapped. The Price 183.996667 is in the correct place. So ALL the Lng values came from the wrong dates, with month and day mixed up.
Prices in df2 start in January (2020-01-03 10:00:00), but the merged dataframe starts in April (2020-04-01).
When I saw this problem, I added this for both dataframes:
df1['dt_object'] = pd.to_datetime(df1['dt_object'], format='%Y-%m-%d %H:%M:%S')
df2['dt_object'] = pd.to_datetime(df2['dt_object'], format='%Y-%m-%d %H:%M:%S')
but it didn't help; nothing changed. There is a strange month/day mix-up inside dt_object, but I cannot pin it down.
Help me to fix it, please!
You have to specify that you want to perform a left join. The pandas documentation explains what the different options for the how parameter do.
>>> df1 = pd.DataFrame({'dt_object': pd.date_range('2020-01-01', '2020-01-04'), 'Lng': [0, 1, 2, 3]})
>>> df1
dt_object Lng
0 2020-01-01 0
1 2020-01-02 1
2 2020-01-03 2
3 2020-01-04 3
>>> df2 = pd.DataFrame({'dt_object': [pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-02'), pd.Timestamp('2020-01-04')], 'Price': [1000, 2000, 3000]})
>>> df2
dt_object Price
0 2020-01-01 1000
1 2020-01-02 2000
2 2020-01-04 3000
>>> df1.merge(df2, how='left')
dt_object Lng Price
0 2020-01-01 0 1000.0
1 2020-01-02 1 2000.0
2 2020-01-03 2 NaN
3 2020-01-04 3 3000.0
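Applied to the frames in the question, where dt_object is already datetime64 in both, that would be:
df = pd.merge(df1, df2, on='dt_object', how='left')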

How to sum up a sequence of values over a span of 3 records? [duplicate]

Using pandas, what is the easiest way to calculate a rolling sum over the previous n elements, for instance to calculate trailing three days' sales:
import numpy
import pandas

df = pandas.Series(numpy.random.randint(0, 10, 10), index=pandas.date_range('2020-01', periods=10))
df
2020-01-01 8
2020-01-02 4
2020-01-03 1
2020-01-04 0
2020-01-05 5
2020-01-06 8
2020-01-07 3
2020-01-08 8
2020-01-09 9
2020-01-10 0
Freq: D, dtype: int64
Desired output:
2020-01-01 8
2020-01-02 12
2020-01-03 13
2020-01-04 5
2020-01-05 6
2020-01-06 13
2020-01-07 16
2020-01-08 19
2020-01-09 20
2020-01-10 17
Freq: D, dtype: int64
You need rolling.sum:
df.rolling(3, min_periods=1).sum()
Out:
2020-01-01 8.0
2020-01-02 12.0
2020-01-03 13.0
2020-01-04 5.0
2020-01-05 6.0
2020-01-06 13.0
2020-01-07 16.0
2020-01-08 19.0
2020-01-09 20.0
2020-01-10 17.0
dtype: float64
min_periods ensures the first two elements are calculated too; with a window size of 3 they would otherwise be NaN by default.
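For comparison, this is what the default behaviour (no min_periods) looks like on the same series:
df.rolling(3).sum()
Out:
2020-01-01     NaN
2020-01-02     NaN
2020-01-03    13.0
2020-01-04     5.0
2020-01-05     6.0
2020-01-06    13.0
2020-01-07    16.0
2020-01-08    19.0
2020-01-09    20.0
2020-01-10    17.0
dtype: float64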
