Get rolling idxmax for each group and each row? - python

data: https://github.com/zero-jack/data/blob/main/hy_data.csv#L7
Goal
Get the idxmax over the last n rows for each group.
Try
df = df.assign(
    l6d_highest_date=lambda x: x.groupby('hy_code')['high'].transform(lambda x: x.rolling(6).idxmax())
)
AttributeError: 'Rolling' object has no attribute 'idxmax'
Note: week_date is the index.

My solution is based on converting the argmax computed on each sliding window: for each date, that position lets you infer which date the window maximum refers to.
import numpy as np
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/zero-jack/data/main/hy_data.csv",
    sep=",", index_col="week_date"
)

def rolling_idmax(series, n):
    # first compute the index within the sliding window
    ids = series.rolling(n).apply(np.argmax)
    # 0 <= ids <= n-1
    # how many rows have passed since the sliding-window maximum?
    ids = n - 1 - ids
    # 0 <= ids <= n-1
    # subtract `ids` from the actual positions
    ids = np.arange(len(series)) - ids
    # 0 <= ids <= len(series)-1
    # convert the positions stored in `ids` to the corresponding dates (series.index)
    ids.loc[~ids.isna()] = series.index[ids.dropna().astype(int)]
    # "2005-06-10" <= ids <= "2022-03-04"
    return ids

df["l6d_highest_date"] = df.groupby("hy_code").high.apply(rolling_idmax, 6)

Based on this answer, I came up with the following workaround. Note that the linked answer only handles series with the default index; I add x.index[global_index] to deal with a non-default index.
window_size = 6

def get_idxmax_in_rolling(x: pd.Series):
    local_index = x.rolling(window_size).apply(np.argmax)[window_size-1:].astype(int)  # local index; NaN rows removed before astype()
    global_index = local_index + np.arange(len(x) - window_size + 1)
    # return list(x.index[global_index]) + [np.nan]*(window_size-1)
    return [np.nan]*(window_size-1) + list(x.index[global_index])  # add the leading NaNs back

df = df.assign(l6d_highest_date=lambda x: x.groupby('hy_code')['high'].transform(get_idxmax_in_rolling))

You can apply idxmax (for versions of pandas older than 1.0.0 you need to pass raw=False). The only caveat is that rolling must return a float (see the linked docs), not a Timestamp. That's why you need to temporarily reset the index, get the idxmax values and the corresponding week_dates, and then restore the index:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/zero-jack/data/main/hy_data.csv', index_col='week_date', parse_dates=True)
df = df.reset_index()
df['l6d_highest_date'] = df.groupby('hy_code')['high'].transform(lambda x: x.rolling(6).apply(pd.Series.idxmax))
df.loc[df.l6d_highest_date.notna(), 'l6d_highest_date'] = df.loc[df.loc[df.l6d_highest_date.notna(), 'l6d_highest_date'].values, 'week_date'].values
df = df.set_index('week_date')


Pythonic way to insert a DataFrame column and calculate its values from each column in a list

I have a DataFrame column (from my project here) that prints like this:
ticker 2021-02-11 21:04 2021-01-12_close 2020-02-11_close 2016-02-11_close
0 AAPL 134.94 128.607819 79.287888 21.787796
1 MSFT 244.20 214.929993 182.506607 45.343704
This gives a stock ticker and its current price followed by the close price on given dates. I am looking for a pythonic way to, after each X_close column, insert an X_return column and calculate the return between the current price and the X price. What is a good way to do this?
Thanks!
Edit: When I say "calculate the return", I mean, for example, to do:
((134.94 - 128.607819) / 128.607819) * 100
So, simply using div() or sub() isn't quite satisfactory.
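In code, the intended calculation is just (a hypothetical helper, using the AAPL numbers above):
def pct_return(current: float, close: float) -> float:
    return (current - close) / close * 100

pct_return(134.94, 128.607819)  # ~4.92 for the 2021-01-12_close example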
Try:
df.filter to select the close columns,
then .sub to subtract the current-price column from them,
join back,
sort the columns with sort_index. You may need to play with this.
All code:
df.join(df.filter(like='close').sub(df['2021-02-11 21:04'], axis=0)
          .rename(columns=lambda x: x.replace('close', 'return'))
       ).sort_index(axis=1)
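Since the edit above asks for a percentage return rather than the raw difference, the same filter/join pattern can be adapted; a sketch (column names as in the sample frame):
close = df.filter(like='close')
returns = (close.rsub(df['2021-02-11 21:04'], axis=0)  # current - close
                .div(close)                            # divided by close
                .mul(100)                              # as a percentage
                .rename(columns=lambda c: c.replace('close', 'return')))
out = df.join(returns).sort_index(axis=1)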
Good question. The idea is to simply create the new columns first and concatenate them to the dataframe.
# cols is assumed to be the list of *_close columns, e.g. cols = df.filter(like='close').columns
df_returns = df[cols].div(df["2021-02-11 21:04:00"], axis=0).rename(columns=lambda x: x.split('_')[0] + '_return')
df_new = pd.concat([df, df_returns], axis=1).sort_index(axis=1)
Optionally, you could reorder the columns for better readability:
df_new[df_new.columns[:-3:-1].union(df_new.columns[:-2], sort=False)]
For a more customized approach, use the pandas apply method:
def foo(s: pd.Series):
    # Series-specific changes
    ans = pd.Series(index=s.index, dtype=float)
    for i in range(s.shape[0]):
        ans.iloc[i] = some_func(s.iloc[i])
    # rename the series index here for convenience if needed
    return ans

df_returns = df[cols].apply(foo, axis=0)
Hope this helps! You can perform any operations you like in some_func().
Combining ideas from the answers given with my own, here is my solution:
def calculate_returns(df):
    print(df)
    print()
    # Get dataframe of return values
    returns_df = df.apply(calculate_return_row, axis=1)
    # Append returns df to close prices df
    df = pd.concat([df, returns_df], axis=1).sort_index(axis=1, ascending=False)
    # Rearrange columns so that each close price precedes its respective return value
    return_cols = df.columns[2::2]
    close_cols = df.columns[3::2]
    reordered_cols = list(df.columns[0:2])
    reordered_cols = reordered_cols + [col for idx, _ in enumerate(return_cols) for col in [close_cols[idx], return_cols[idx]]]
    df = df[reordered_cols]
    print(df)
    return df

def calculate_return_row(row: pd.Series):
    current_price = row[1]
    close_prices = row[2:]
    returns = [calculate_return(current_price, close_price) for close_price in close_prices]
    index = [label.replace('close', 'return') for label in row.index[2:]]
    returns = pd.Series(returns, index=index)
    return returns

def calculate_return(current_val, initial_val):
    return (current_val - initial_val) / initial_val * 100
This avoids loops, and puts the return columns after the close columns:
ticker 2021-02-12 20:37 2021-01-13_close 2020-02-12_close 2016-02-12_close
0 AAPL 134.3500 130.694702 81.170799 21.855232
1 MSFT 243.9332 216.339996 182.773773 46.082863
ticker 2021-02-12 20:37 2021-01-13_close 2021-01-13_return 2020-02-12_close 2020-02-12_return 2016-02-12_close 2016-02-12_return
0 AAPL 134.3500 130.694702 2.796822 81.170799 65.515187 21.855232 514.726938
1 MSFT 243.9332 216.339996 12.754555 182.773773 33.461818 46.082863 429.336037
Thanks!

How to find and add missing dates in a dataframe of sorted dates (descending order)?

In Python, I have a DataFrame with a column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column, 'Reviews', is made of text reviews of a website. My data can have multiple reviews on a given date, or no reviews on another date. I want to find which dates are missing in column 'Date'. Then, for each missing date, add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', so that I can plot them. How should I do this?
from datetime import date, timedelta
d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')
idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')
#tried this
data = data.reindex(idx, fill_value=0)
data.head()
#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.
#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
#Got ValueError: cannot reindex from a duplicate axis
This is my code:
from datetime import timedelta

for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]
df = df.sort_values("Date", ascending=False)
print(df)
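For reference, a vectorized sketch of the same idea (assuming 'Date' is already datetime64 and duplicate dates must be preserved, which rules out a plain reindex):
import pandas as pd

full_range = pd.date_range(df['Date'].min(), df['Date'].max(), freq='D')
missing = full_range.difference(df['Date'])
filler = pd.DataFrame({'Date': missing, 'Reviews': ''})  # one empty review per missing date
df = (pd.concat([df, filler])
        .sort_values('Date', ascending=False)
        .reset_index(drop=True))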

calculate Geometric mean return for specific rows

I have a dataframe like this.
Date price mid std top btm
..............
1999-07-21 8.6912 8.504580 0.084923 9.674425 8.334735
1999-07-22 8.6978 8.508515 0.092034 8.692583 8.324447
1999-07-23 8.8127 8.524605 0.118186 10.760976 8.288234
1999-07-24 8.8779 8.688810 0.091124 8.871057 8.506563
..............
I want to create a new column called 'diff'.
If, in a row, 'price' > 'top', then I want to fill 'diff' in that row with the geometric mean return of the price in that row and the price 5 rows earlier (the 5-day geometric mean).
For example, in the row for 1999-07-22 the price is greater than top, so I want to fill 'diff' in that row with the geometric mean of 07-22 and 07-17 (note the dates may not be consecutive, since holidays are excluded). Only a small part of the rows meet the condition, so most of the values in 'diff' will be missing.
Could you please tell me how I can do this in Python?
Use Series.diff with Series.where to set NaNs:
df['diff'] = df['price'].diff().where(df['price'] > df['top'])
print (df)
price mid std top btm diff
Date
1999-07-21 8.6912 8.504580 0.084923 9.674425 8.334735 NaN
1999-07-22 8.6978 8.508515 0.092034 8.692583 8.324447 0.0066
1999-07-23 8.8127 8.524605 0.118186 10.760976 8.288234 NaN
1999-07-24 8.8779 8.688810 0.091124 8.871057 8.506563 0.0652
EDIT:
I believe you need:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
from scipy.stats.mstats import gmean
df['gmean'] = (df['price'].rolling('5d')
                          .apply(gmean, raw=True)
                          .where(df['price'] > df['top']))
print (df)
price mid std top btm gmean
Date
1999-07-21 8.6912 8.504580 0.084923 9.674425 8.334735 NaN
1999-07-22 8.6978 8.508515 0.092034 8.692583 8.324447 8.694499
1999-07-23 8.8127 8.524605 0.118186 10.760976 8.288234 NaN
1999-07-24 8.8779 8.688810 0.091124 8.871057 8.506563 8.769546
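Note that rolling('5d') uses a calendar-day window. If a row-based 5-period window is wanted instead (the question counts trading rows, with holidays excluded), rolling(5) is the drop-in change; a sketch, reusing the gmean import above:
df['gmean_rows'] = (df['price'].rolling(5)
                               .apply(gmean, raw=True)
                               .where(df['price'] > df['top']))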
You can achieve that by taking the difference of the price and top columns and then assigning a NaN or zero value to the rows where the difference is <= 0:
import pandas as pd
import numpy as np
df = pd.DataFrame(...)
df['diff'] = df['price'] - df['top']
df.loc[df['diff'] <= 0, 'diff'] = np.NaN # or 0
Here's another solution:
import pandas as pd
from functools import reduce

__name__ = 'RunScript'

ddict = {
    'Date': ['1999-07-21', '1999-07-22', '1999-07-23', '1999-07-24'],
    'price': [8.6912, 8.6978, 8.8127, 8.8779],
    'mid': [8.504580, 8.508515, 8.524605, 8.688810],
    'std': [0.084923, 0.092034, 0.118186, 0.091124],
    'top': [9.674425, 8.692583, 10.760976, 8.871057],
    'btm': [8.334735, 8.324447, 8.288234, 8.506563],
}
data = pd.DataFrame(ddict)

def geo_mean(iterable):
    """
    Geometric mean function. Pass an iterable.
    """
    return reduce(lambda a, b: a * b, iterable) ** (1.0 / len(iterable))

def set_geo_mean(df):
    # Shift the price column down one period
    df['shifted price'] = df['price'].shift(periods=1)
    # Create a masked expression that compares price to top
    masked_expression = df['price'] > df['top']
    # Select rows from the dataframe where the masked expression is true
    masked_data = df[masked_expression]
    # Apply our function to the relevant rows
    df.loc[masked_expression, 'geo_mean'] = geo_mean([masked_data['price'], masked_data['shifted price']])
    # Drop the shifted price column once complete
    df.drop('shifted price', axis=1, inplace=True)

if __name__ == 'RunScript':
    # Call the function and pass the dataframe argument.
    set_geo_mean(data)

How to fill missing dates in a time series

Here's what my data looks like:
There are daily records, except for a gap from 2017-06-12 to 2017-06-16.
df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x: datetime.datetime.strftime(x, '%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric=True)  # note: convert_objects has been removed in modern pandas
df2 = df2.groupby('timestamp', as_index=False).sum()
I need to fill this missing gap and others with values for all fields (e.g. timestamp, temperature, humidity, light, pressure, speed, battery_voltage, etc...).
How can I accomplish this with Pandas?
This is what I have done before:
weektime = pd.date_range(start='06/04/2017', end='12/05/2017', freq='W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'
for i in range(0, len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
    i = i + 1
The sum values are not correct, because the value on 2017-06-17 is a sum over 2017-06-12 to 2017-06-16, and I do not want to add those again. Also, this is not the only gap in the period; I want to fill all of them.
Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.
import numpy as np
import pandas as pd

data_arr = np.array(your_df)
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr)  # avoid altering the thing you're iterating over
    data_no_gaps_idx = 0
    for row_idx in np.arange(1, rows):  # iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0])
        oldtimestamp = np.datetime64(oldtimestamp_str)
        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)
        period = currenttimestamp - oldtimestamp
        if period != np.timedelta64(900, 's') and period != np.timedelta64(3600, 's') and period != np.timedelta64(86400, 's'):
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400
            periods_missing = int(period / np.timedelta64(desired_period, 's'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period, 's')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing),
                                         np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0)  # INSERT VALUES YOU WANT IN THE NEW ROW
            data_no_gaps_idx += (periods_missing-1)  # increment the index (zero-based => -1) in accordance with added rows
        data_no_gaps_idx += 1  # allow index to change as we iterate over the original data array (main for loop)
    # create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None, columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])
    return data_arr_no_gaps
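A hypothetical usage sketch following the suggestion above to interpolate afterwards (your_df stands for your own dataframe, as in the snippet above):
filled = fill_gaps(np.array(your_df), periodicity='daily')
value_cols = ['temp', 'humidity', 'light', 'pressure', 'speed']
filled[value_cols] = filled[value_cols].astype(float).interpolate(method='linear')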
Fill time gaps and nulls
Use the function below to ensure the expected date sequence exists, and then use forward fill to fill in nulls.
import pandas as pd
import os

def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
    A) check for extra dates (out of expected frequency/sequence)
    B) check for missing dates (based on expected frequency/sequence)
    C) use forward fill to fill nulls
    D) use backward fill to fill remaining nulls
    E) append to file
    '''
    # rename the timestamp to 'date'
    df = df.rename(columns={"timestamp": "date"})
    # sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    # create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)
    # record column names
    df_cols = df.columns
    # delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass
    # check for extra dates and/or dates out of order; print a warning statement for the log
    extra_dates = set(df.date).difference(all_dates)
    # if there are extra dates (outside of the expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        # INSERT DESIRED BEHAVIOR HERE
        print('WARNING: Extra date(s):\n\t{}\n\t Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            # shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################
    # check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))
    # if there are time gaps, deal with them
    if len(gaps) > 0:
        # initialize a df of the correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date'))  # len(index) sets the number of rows
        # give the index a name
        gaps_df.index.name = 'date'
        # add the region and type (r and t are assumed to come from the answerer's surrounding script)
        gaps_df.region = r
        gaps_df.type = t
        # remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        # append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        # sort on date
        new_df.sort_values(by='date', inplace=True)
        # fill nulls
        new_df.fillna(method='ffill', inplace=True)
        new_df.fillna(method='bfill', inplace=True)
        # append to file
        new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)
    return df_cols, regions, types, all_dates  # regions/types also come from the answerer's surrounding script

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first one with just a single date ('action_date') and the second one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill new df columns with the number of dates in verification_date whose difference is either over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()  # pd.Grouper(freq='2D') in current pandas

def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation accordingly. You can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
This sets a value at row i in column over_360 or under_360.
You can learn more about it here.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can check dataframe.ix here.
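Note that in recent pandas versions both set_value and .ix have been removed; the label-based .at accessor is the usual replacement in this kind of loop (same variables as above):
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)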
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
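As a quick check on the grouped frame built in the question: the first group (action_date 2017-01-01, verification dates 2016-01-01 and 2015-01-08) has both gaps over 360 days, so over_360 should come out as 2 and under_360 as 0:
print(df[['action_date', 'over_360', 'under_360']])
#   action_date  over_360  under_360
# 0  2017-01-01         2          0
# 1  2017-01-03         0          1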
