Query/Replace element in DataFrame with element directly beneath it - python

I have a dataframe in which I need to query and replace 0.00s with a value directly below it if certain conditions are met. I have looked for documentation on such a behavior but have been unable to find an efficient Pythonic solution.
The logic is as follows:
IF [Symbol] = 'VIX' AND [QuoteDateTime] CONTAINS '09:31:00' AND [Close] = '0.00'
THEN I would like to replace the [Close] value with the [Close] value right below it.
+----+--------+---------------------+---------+
| | Symbol | QuoteDateTime | Close |
+----+--------+---------------------+---------+
| 0 | VIX | 2019-04-11 09:31:00 | 0.00 |
| 1 | VIX | 2019-04-11 09:32:00 | 14.24 |
| 2 | VIX | 2019-04-11 09:33:00 | 14.40 |
| 3 | SPX | 2019-04-11 09:31:00 | 2911.09 |
| 4 | SPX | 2019-04-11 09:32:00 | 2911.55 |
| 5 | SPX | 2019-04-11 09:33:00 | 2915.22 |
| 6 | VIX | 2019-04-12 09:31:00 | 0.00 |
| 7 | VIX | 2019-04-12 09:32:00 | 15.64 |
| 8 | VIX | 2019-04-12 09:33:00 | 15.80 |
| 9 | SPX | 2019-04-12 09:31:00 | 2901.09 |
| 10 | SPX | 2019-04-12 09:32:00 | 2901.55 |
| 11 | SPX | 2019-04-12 09:33:00 | 2905.22 |
+----+--------+---------------------+---------+
Expected output would be that Index 0 [Close] is 14.24 and Index 6 [Close] is 15.64. Everything else remains the same.
+----+--------+---------------------+---------+
| | Symbol | QuoteDateTime | Close |
+----+--------+---------------------+---------+
| 0 | VIX | 2019-04-11 09:31:00 | 14.24 |
| 1 | VIX | 2019-04-11 09:32:00 | 14.24 |
| 2 | VIX | 2019-04-11 09:33:00 | 14.40 |
| 3 | SPX | 2019-04-11 09:31:00 | 2911.09 |
| 4 | SPX | 2019-04-11 09:32:00 | 2911.55 |
| 5 | SPX | 2019-04-11 09:33:00 | 2915.22 |
| 6 | VIX | 2019-04-12 09:31:00 | 15.64 |
| 7 | VIX | 2019-04-12 09:32:00 | 15.64 |
| 8 | VIX | 2019-04-12 09:33:00 | 15.80 |
| 9 | SPX | 2019-04-12 09:31:00 | 2901.09 |
| 10 | SPX | 2019-04-12 09:32:00 | 2901.55 |
| 11 | SPX | 2019-04-12 09:33:00 | 2905.22 |
+----+--------+---------------------+---------+

Create a boolean mask with Series.eq (the == comparison), using Series.dt.strftime to get strings from the datetimes, then set the new values with Series.mask combined with Series.shift:
import pandas as pd
import numpy as np

# convert to datetimes if necessary
df['QuoteDateTime'] = pd.to_datetime(df['QuoteDateTime'])

mask = (df['Symbol'].eq('VIX') &
        df['QuoteDateTime'].dt.strftime('%H:%M:%S').eq('09:31:00') &
        df['Close'].eq(0))
df['Close'] = df['Close'].mask(mask, df['Close'].shift(-1))

# alternative 1
# df.loc[mask, 'Close'] = df['Close'].shift(-1)

# alternative 2
# df['Close'] = np.where(mask, df['Close'].shift(-1), df['Close'])
print(df)
   Symbol       QuoteDateTime    Close
0     VIX 2019-04-11 09:31:00    14.24
1     VIX 2019-04-11 09:32:00    14.24
2     VIX 2019-04-11 09:33:00    14.40
3     SPX 2019-04-11 09:31:00  2911.09
4     SPX 2019-04-11 09:32:00  2911.55
5     SPX 2019-04-11 09:33:00  2915.22
6     VIX 2019-04-12 09:31:00    15.64
7     VIX 2019-04-12 09:32:00    15.64
8     VIX 2019-04-12 09:33:00    15.80
9     SPX 2019-04-12 09:31:00  2901.09
10    SPX 2019-04-12 09:32:00  2901.55
11    SPX 2019-04-12 09:33:00  2905.22
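Note that shift(-1) takes the value from the next row regardless of its Symbol. In this sample the row below each flagged 09:31:00 VIX row is also VIX, but if that were not guaranteed, a grouped shift keeps the replacement inside the same symbol. A sketch, reusing the mask above:
# shift within each Symbol so the replacement never crosses into another symbol
df['Close'] = df['Close'].mask(mask, df.groupby('Symbol')['Close'].shift(-1))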

Not an expert, but you could try using the index:
First get the index with this short line:
idx = df.index[(df['Symbol'] == 'VIX') & (df['QuoteDateTime'].str.contains('09:31:00')) & (df['Close'] == 0.00)]
Then use the index to set the values to the values in the rows below:
df.loc[idx, 'Close'] = df.loc[idx+1, 'Close'].values

Related

How do I correctly remove all text from column in Pandas?

I have a dataframe as:
df:
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| | country | league | home_odds | draw_odds | away_odds | home_score | away_score | home_team | away_team | datetime |
+=====+=========================+==============================+=============+=============+=============+==============+==============+==========================+==============================+=====================+
| 63 | Chile | Primera Division | 2.80 | 3.05 | 2.63 | 3 | 1 | Melipilla | O'Higgins | 2021-06-07 00:30:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 64 | North & Central America | CONCACAF Nations League | 2.95 | 3.07 | 2.49 | 3 | 2 ET | USA | Mexico | 2021-06-07 01:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 66 | World | World Cup 2022 | 1.04 | 13.43 | 28.04 | 0 | 1 | Kyrgyzstan | Mongolia | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 65 | World | Friendly International | 1.52 | 3.91 | 7.01 | 1 | 1 | Serbia | Jamaica | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
I want the home_score and away_score columns to be plain integers, and I am trying a regex:
df[['home_score', 'away_score']] = re.sub('\D', '', '.*')
however, all the columns come out blank.
How do I correctly do it?
You can try the str.extract() and astype() methods:
df['away_score'] = df['away_score'].str.extract(r'^(\d+)').astype(int)
df['home_score'] = df['home_score'].str.extract(r'^(\d+)').astype(int)
OR
df['away_score'] = df['away_score'].str.extract(r'([0-9]+)').astype(int)
df['home_score'] = df['home_score'].str.extract(r'([0-9]+)').astype(int)
output:
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| | country | league | home_odds | draw_odds | away_odds | home_score | away_score | home_team | away_team | datetime |
+=====+=========================+==============================+=============+=============+=============+==============+==============+==========================+==============================+=====================+
| 63 | Chile | Primera Division | 2.80 | 3.05 | 2.63 | 3 | 1 | Melipilla | O'Higgins | 2021-06-07 00:30:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 64 | North & Central America | CONCACAF Nations League | 2.95 | 3.07 | 2.49 | 3 | 2 | USA | Mexico | 2021-06-07 01:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 66 | World | World Cup 2022 | 1.04 | 13.43 | 28.04 | 0 | 1 | Kyrgyzstan | Mongolia | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 65 | World | Friendly International | 1.52 | 3.91 | 7.01 | 1 | 1 | Serbia | Jamaica | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
You can also strip the non-digits directly (this needs import re; note that int(float(x)) would fail on values like '2 ET'): df[['home_score', 'away_score']] = df[['home_score', 'away_score']].applymap(lambda x: int(re.sub(r'\D', '', str(x))))
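If a cell could contain no digits at all, casting the extracted match with astype(int) would fail on the resulting NaN; pd.to_numeric tolerates it. A sketch under that assumption, using the same column names:
import pandas as pd

for col in ('home_score', 'away_score'):
    # keep the first run of digits, then convert; a missing match survives as NaN
    df[col] = pd.to_numeric(df[col].astype(str).str.extract(r'(\d+)', expand=False))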

Pandas not sorting datetime columns?

I have a dataframe as:
df:
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| | Unnamed: 0 | country | league | game | home_odds | draw_odds | away_odds | home_score | away_score | datetime |
+=====+==============+=========================+==============================+====================================================+=============+=============+=============+==============+==============+=====================+
| 0 | 0 | Chile | Primera Division | Nublense - A. Italiano | 2.25 | 3.33 | 3.11 | 1 | 0 | 2021-06-08 00:30:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 1 | 1 | China | Jia League | Zibo Cuju - Shaanxi Changan | 11.54 | 4.39 | 1.31 | nan | nan | 2021-06-08 08:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 2 | 2 | Algeria | U21 League | Medea U21 - MC Alger U21 | 2.38 | 3.23 | 2.59 | nan | nan | 2021-06-08 09:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 3 | 3 | Algeria | U21 League | Skikda U21 - CR Belouizdad U21 | 9.48 | 4.9 | 1.25 | nan | nan | 2021-06-08 09:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 4 | 4 | China | Jia League | Zhejiang Professional - Xinjiang Tianshan | 1.2 | 5.92 | 12.18 | nan | nan | 2021-06-08 10:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
I have converted the datetime column to datetime dtype
df['datetime'] = pd.to_datetime(df['datetime'])
and then tried to sort it
df.sort_values(by=['datetime'], ascending=True)
However, the sorting does not work.
Can anybody help me understand why?
Please find the entire dataframe here for reference.
p.s. I am unable to paste the entire dataframe here because of character constraints.
I see in the comments you already found your solution, so I'll add it as an answer: sort_values() returns a new, sorted DataFrame by default and leaves df itself unchanged, which is why nothing appeared to happen.
df.sort_values(by=['datetime'], ascending=True, inplace=True)
This sorts in place, so you don't have to assign the result back to df.
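Equivalently, assign the returned frame back instead of sorting in place:
df = df.sort_values(by=['datetime'], ascending=True)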

How to create rows that fill the time between events in Python

I am building a data frame for survival analysis starting from 2018-01-01 00:00:00 and ending TODAY. I have two columns with start and end times, recorded only for the events that occurred, associated with an ID.
However, I need to add rows covering the times during which no event was observed.
Here I show what I have:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
And what I need is:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2018-01-01 00:00:00 | 2019-12-04 04:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 19:30:00 | 2019-12-08 06:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-20 10:00:00 | 2019-12-22 11:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 23:00:00 | 2019-12-26 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-29 16:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
| State1 | 112 | AA1 | 2018-01-01 00:00:00 | 2018-09-19 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA1 | 2018-09-20 04:30:00 | 2018-09-25 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA1 | 2018-09-26 23:00:00 | 2018-09-27 01:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 10:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
I have tried this code (borrowed from: How to find the start time and end time of an event in python?) and the answer provided by Fredy Montaño (below), but it gives me only the sequence of events, not the desired rows:
fill_date = []
for item in range(1, df.shape[0]):
    # add a filler row only when there is a gap between consecutive events
    if (df['End_Time'][item - 1] - df['Start_Time'][item]) != 0:
        fill_date.append([df['State'][item - 1], df['ID1'][item - 1], df['ID2'][item - 1],
                          df['End_Time'][item - 1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ['State', 'ID1', 'ID2', 'Start_Time', 'End_Time']
df_output = pd.concat([df[['State', 'ID1', 'ID2', 'Start_Time', 'End_Time']], df_add], axis=0)
df_output = df_output.sort_values(['State', 'ID2', 'Start_Time'], ascending=True)
I think I have to put a condition on the State, ID1 and ID2 variables so as not to take times from the previous groups.
Any suggestion?
Maybe this solution works for you.
I sliced the dataframe to keep only the dates, but if it works for you, you can repeat it taking the states and IDs into account:
df = df[['Start_Time', 'End_Time']]

fill_date = []
for item in range(1, df.shape[0]):
    # record the gap between the previous event's end and this event's start
    if (df['Start_Time'][item] - df['End_Time'][item - 1]) != 0:
        fill_date.append([df['End_Time'][item - 1], df['Start_Time'][item]])

df_add = pd.DataFrame(fill_date)
df_add.columns = ['Start_Time', 'End_Time']
Finally, a concat joins your original dataframe with the new dataframe of intervals in which no event was observed:
df_final = pd.concat([df, df_add], axis=0)
df_final = df_final.sort_index()
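To respect the groups and add the boundary rows (2018-01-01 at the start, TODAY at the end), a per-group version can be built on top of groupby. A sketch, assuming Start_Time and End_Time are already datetime columns and the column names match the tables above:
import pandas as pd

def fill_gaps(g, start, end):
    # events sorted in time; gaps run from one End_Time to the next Start_Time,
    # plus the leading gap from `start` and the trailing gap up to `end`
    g = g.sort_values('Start_Time')
    gaps = pd.DataFrame({
        'Start_Time': [start] + list(g['End_Time']),
        'End_Time': list(g['Start_Time']) + [end],
    })
    gaps = gaps[gaps['Start_Time'] < gaps['End_Time']]  # drop zero-length gaps
    for col in ('State', 'ID1', 'ID2'):
        gaps[col] = g[col].iloc[0]
    return pd.concat([g, gaps]).sort_values('Start_Time')

start = pd.Timestamp('2018-01-01')
today = pd.Timestamp.today().normalize()
df_final = (df.groupby(['State', 'ID1', 'ID2'], group_keys=False)
              .apply(fill_gaps, start=start, end=today)
              .reset_index(drop=True))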

Python rolling period returns

I need to develop a rolling 6-month return on the following dataframe
date        Portfolio Performance
2001-11-30               1.048134
2001-12-31               1.040809
2002-01-31               1.054187
2002-02-28               1.039920
2002-03-29               1.073882
2002-04-30               1.100327
2002-05-31               1.094338
2002-06-28               1.019593
2002-07-31               1.094096
2002-08-30               1.054130
2002-09-30               1.024051
2002-10-31               0.992017
A lot of the answers to previous questions describe rolling average returns, which I can do. However, I am not looking for the average. What I need is the following example formula for a rolling 6-month return:
(1.100327 - 1.048134)/1.100327
The formula would then consider the next 6-month block between 2001-12-31 and 2002-05-31 and continue through to the end of the dataframe.
I've tried the following, but it doesn't provide the right answer.
portfolio['rolling'] = portfolio['Portfolio Performance'].rolling(window=6).apply(np.prod) - 1
Expected output would be:
date        Portfolio Performance  Rolling
2001-11-30               1.048134      NaN
2001-12-31               1.040809      NaN
2002-01-31               1.054187      NaN
2002-02-28               1.039920      NaN
2002-03-29               1.073882      NaN
2002-04-30               1.100327   0.0520
2002-05-31               1.094338   0.0422
2002-06-28               1.019593  -0.0280
The current output is:
            Portfolio Performance   rolling
date
2001-11-30               1.048134       NaN
2001-12-31               1.040809       NaN
2002-01-31               1.054187       NaN
2002-02-28               1.039920       NaN
2002-03-29               1.073882       NaN
2002-04-30               1.100327  0.413135
2002-05-31               1.094338  0.475429
2002-06-28               1.019593  0.445354
2002-07-31               1.094096  0.500072
2002-08-30               1.054130  0.520569
2002-09-30               1.024051  0.450011
2002-10-31               0.992017  0.307280
I simply added a column shifted by 6 months and applied the formula you presented. Does this meet the intent of the question?
df['before_6m'] = df['Portfolio Performance'].shift(6)
df['rolling'] = (df['Portfolio Performance'] - df['before_6m']) / df['Portfolio Performance']
df
| | date | Portfolio Performance | before_6m | rolling |
|---:|:--------------------|------------------------:|------------:|------------:|
| 0 | 2001-11-30 00:00:00 | 1.04813 | nan | nan |
| 1 | 2001-12-31 00:00:00 | 1.04081 | nan | nan |
| 2 | 2002-01-31 00:00:00 | 1.05419 | nan | nan |
| 3 | 2002-02-28 00:00:00 | 1.03992 | nan | nan |
| 4 | 2002-03-29 00:00:00 | 1.07388 | nan | nan |
| 5 | 2002-04-30 00:00:00 | 1.10033 | nan | nan |
| 6 | 2002-05-31 00:00:00 | 1.09434 | 1.04813 | 0.042221 |
| 7 | 2002-06-28 00:00:00 | 1.01959 | 1.04081 | -0.0208083 |
| 8 | 2002-07-31 00:00:00 | 1.0941 | 1.05419 | 0.0364767 |
| 9 | 2002-08-30 00:00:00 | 1.05413 | 1.03992 | 0.0134803 |
| 10 | 2002-09-30 00:00:00 | 1.02405 | 1.07388 | -0.0486607 |
| 11 | 2002-10-31 00:00:00 | 0.992017 | 1.10033 | -0.109182 |
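If dividing by the value six months back (rather than by the current value, as in the formula above) is acceptable, pandas has this built in as pct_change; note the different denominator:
# (x_t - x_{t-6}) / x_{t-6}, i.e. the earlier value is the denominator
df['rolling_pct'] = df['Portfolio Performance'].pct_change(periods=6)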

Find accounts with more than N months of activity in a row using Pandas

I want to filter out accounts that do not have a run of N consecutive months of activity.
Example:
a100000001 | 2019-01-31 | NaN
| 2019-02-28 | 40
| 2019-03-31 | 30
| 2019-04-30 | 50
-----------|------------|-----
a100000002 | 2019-01-31 | NaN
| 2019-02-28 | NaN
| 2019-03-31 | 20
| 2019-04-30 | NaN
-----------|------------|-----
... | |
The result for N=3 consecutive months will look like this:
a100000001 | 2019-01-31 | NaN
| 2019-02-28 | 40
| 2019-03-31 | 30
| 2019-04-30 | 50
-----------|------------|-----
... | |
where account "a100000002" was ignored.
I tried df[df.rolling(3)['amount'].min().notna()] but it also removes the NaN rows from the desired accounts.
Something like this should work:
df.groupby('account').filter(lambda g: (g['date'].dt.month.diff() <= n).all())
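Note that comparing raw month numbers with diff() breaks across a year boundary (December is 12, January is 1). An alternative that tests the amount column directly and keeps every row, including the NaN rows, of any account with at least N consecutive non-NaN months; a sketch assuming columns named account and amount:
n = 3
result = df.groupby('account').filter(
    # keep the whole group if any window of n consecutive rows is all non-NaN
    lambda g: g['amount'].notna().rolling(n).sum().eq(n).any()
)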
