Printing row based on datestamp condition of another column - python
Background: I have a DataFrame ('weather_tweets') containing two columns of interest: weather (weather on the planet Mars) and date (the date to which the weather relates).
Objective: I am trying to write code that will find the latest datestamp in the date column and print that row's corresponding weather value. Here is a sample row:
weather_tweets = [
    ('tweet', 'weather', 'date'),
    ('Mars Weather#MarsWxReport·Jul 15InSight sol 58',
     'InSight sol 580 (2020-07-14) low -88.8ºC (-127.8ºF) high -8.4ºC (16.8ºF) winds from the WNW at 5.9 m/s (13.3 mph) gusting to 15.4 m/s (34.4 mph) pressure at 7.80 hPa',
     '2020-07-14')]
My code: Thus far, I have only been able to formulate some messy code that returns the latest dates in order, but it's pretty useless for my expected result:

latest_weather = weather_tweets.groupby(['tweet', 'weather'])['date'].transform(max) == weather_tweets['date']
print(weather_tweets[latest_weather])
Any advice on how to reach the desired result would be much appreciated.
Try:
weather_tweets[weather_tweets.date == weather_tweets.date.max()].weather
You can add to_frame() at the end to obtain a more elegant dataframe result:
weather_tweets[weather_tweets.date == weather_tweets.date.max()].weather.to_frame()
Or create a new dataframe:

df_latest = weather_tweets.loc[weather_tweets.date == weather_tweets.date.max(), ['weather', 'date']]
df_latest.columns = ['latest_weather', 'latest_date']
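A small aside: if the date column holds strings, the max() comparison above relies on ISO-formatted dates sorting lexicographically. A minimal sketch of a more explicit variant (assuming weather_tweets is already a DataFrame with 'weather' and 'date' columns, as in the answer) converts the column to datetime and picks the row with idxmax:

import pandas as pd

# assumes weather_tweets is a DataFrame with 'weather' and 'date' columns
weather_tweets['date'] = pd.to_datetime(weather_tweets['date'])

# idxmax returns the row label of the most recent date; .loc pulls that row's weather value
print(weather_tweets.loc[weather_tweets['date'].idxmax(), 'weather'])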
Related
Pandas find overlapping time intervals in one column based on same date in another column for different rows
I have data that looks like this:

id Date Time assigned_pat_loc prior_pat_loc Activity
0 45546325 2/7/2011 4:29:38 EIAB^EIAB^6 NaN Admission
1 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Observation
2 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Transfer to 8W
3 45546325 2/7/2011 6:01:44 8W^W858^A 8W^W844^A Bed Movement
4 45546325 2/7/2011 7:20:44 8W^W844^A 8W^W858^A Bed Movement
5 45546325 2/9/2011 18:36:03 8W^W844^A NaN Discharge-Observation
6 45666555 3/8/2011 20:22:36 EIC^EIC^5 NaN Admission
7 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Admission
8 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Transfer to 53
9 45666555 3/9/2011 17:03:38 53^5336^A 53^5314^A Bed Movement

I need to find where multiple patients (identified by the id column) were in the same room at the same time, along with the start and end times, the dates, and the room number (assigned_pat_loc). assigned_pat_loc is the patient's current location in the hospital, formatted as "unit^room^bed". So far I've done the following:

# Read in CSV file and remove bed number from patient location
data = pd.read_csv('raw_data.csv')
data['assigned_pat_loc'] = data['assigned_pat_loc'].str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

# Convert Date column to datetime type
patient_data['Date'] = pd.to_datetime(patient_data['Date'])

# Sort dataframe by date
patient_data.sort_values(by=['Date'], inplace=True)

# Identify rows with duplicate room and date assignments, indicating multiple patients shared a room
same_room = patient_data.duplicated(subset=['Date', 'assigned_pat_loc'])

# Assign duplicates to new dataframe
df_same_rooms = patient_data[same_room]

# Remove duplicate patient ids but keep latest one
no_dups = df_same_rooms.drop_duplicates(subset=['id'], keep='last')

# Group patients in the same rooms at the same times together
df_shuf = pd.concat(group[1] for group in df_same_rooms.groupby(['Date', 'assigned_pat_loc'], sort=False))

And then I'm stuck at this point:

id Date Time assigned_pat_loc prior_pat_loc Activity
599359 42963403 2009-01-01 12:32:25 11M^11MX 4LD^W463^A Transfer
296155 42963484 2009-01-01 16:41:55 11M^11MX EIC^EIC^2 Transfer
1373 42951976 2009-01-01 15:51:09 11M^11MX NaN Discharge
362126 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Transfer
362125 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Admission
... ... ... ... ... ... ...
268266 46381369 2011-09-09 18:57:31 54^54X 11M^1138^A Transfer
16209 46390230 2011-09-09 6:19:06 10M^1028 EIAB^EIAB^5 Admission
659699 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Transfer
659698 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Admission
268179 46391644 2011-09-09 17:48:53 64^6412 EIE^EIE^3 Admission

Here you can see different patients in the same room at the same time, but I don't know how to extract the intervals of overlap between two different rows for the same room and the same times, and then format the result so that the start time and end time correspond to the earlier and later times over which two patients shared a room. Below is the desired output, where r_id is the id of the other patient sharing the same room and length is the number of hours that room was shared.
As suggested, you can use groupby. One more thing you need to take care of is finding the overlapping time. Ideally you would use datetimes, which are easy to work with; however, you used a different format, so we need to convert it first to make the solution easier. Since you did not provide a workable example, I will just write the gist here:

# convert current format to datetime
df['start_datetime'] = pd.to_datetime(df.start_date) + df.start_time.astype('timedelta64[h]')
df['end_datetime'] = pd.to_datetime(df.end_date) + df.end_time.astype('timedelta64[h]')

df = df.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])
gb = df.groupby('r_id')
for g, g_df in gb:
    g_df['overlap_group'] = (g_df['end_datetime'].cummax().shift() <= g_df['start_datetime']).cumsum()
    print(g_df)

This is a tentative example, and you might need to tweak the datetime conversion and some other minor things, but this is the gist. The cummax() detects where there is an overlap between the intervals, and cumsum() counts the overlapping groups; since it is a counter, we can use it as a unique identifier. I used the following threads: Group rows by overlapping ranges, and python/pandas - converting date and hour integers to datetime.

Edit

After discussing it with OP, the idea is to take each patient's df and sort it by the date of the event. The first row gives the start_time and the last row the end_time. Unifying the time and date is not necessary for detecting the start and end time, since sorting by date and then by time gives the same order as a unified column would; however, for the overlap detection it does make life easier when it's in one column.

gb_patient = df.groupby('id')
patients_data_list = []
for patient_id, patient_df in gb_patient:
    patient_df = patient_df.sort_values(by=['Date', 'Time'])
    patient_data = {
        "patient_id": patient_id,
        "start_time": patient_df.Date.values[0] + patient_df.Time.values[0],
        "end_time": patient_df.Date.values[-1] + patient_df.Time.values[-1]
    }
    patients_data_list.append(patient_data)

new_df = pd.DataFrame(patients_data_list)

After that they can use the above code for the overlaps.
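To make the cummax()/cumsum() idea concrete, here is a minimal, self-contained sketch on made-up intervals (the column names start_datetime/end_datetime and the toy values are assumptions, not the OP's data):

import pandas as pd

# toy intervals (hypothetical data) to illustrate overlap grouping
df = pd.DataFrame({
    'start_datetime': pd.to_datetime(['2011-02-07 04:00', '2011-02-07 05:00', '2011-02-07 09:00']),
    'end_datetime':   pd.to_datetime(['2011-02-07 06:00', '2011-02-07 07:00', '2011-02-07 10:00']),
})

df = df.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])

# a new group starts whenever the current interval begins after every earlier interval has ended
df['overlap_group'] = (df['end_datetime'].cummax().shift() <= df['start_datetime']).cumsum()
print(df)
# the first two rows overlap (group 0); the third row starts a new group (group 1)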
Comparing three data frames to evaluate multiple criteria
I have three dataframes:

ob (Orderbook) - an orderbook containing part numbers, the week they are due and the hours it takes to build them.

Part Number Due Week Build Hours
A 2022-46 4
A 2022-46 5
B 2022-46 8
C 2022-47 1.6

osm (Operator Skill Matrix) - a skills matrix containing operator names and part numbers.

Operator Part number
Mr.One A
Mr.One B
Mr.Two A
Mr.Two B
Mrs. Three C

ah (Available Hours) - a list containing how many hours an operator can work in a given week.

Operator YYYYWW Hours
Mr.One 2022-45 40
Mr.One 2022-46 35
Mr.Two 2022-46 37
Mr.Two 2022-47 39
Mrs. Three 2022-47 40
Mrs. Three 2022-48 45

I am trying to work out, for each week, whether there are enough operators with the right skills working enough hours to complete all of the orders on the orderbook, and if not, identify the orders that can't be completed. Step by step it would look like this:

1. Take the part number of the first row of the orderbook.
2. Search the skills matrix to find the list of operators who can build that part.
3. Search the hours list and check whether those operators have any hours available for the week the order is due.
4. If an operator has hours available, add their name to that row of the orderbook.
5. Subtract the build hours in the orderbook from the available hours in the Available Hours df.
6. Repeat this for each row in the orderbook until all orders have a name against them or there are no available hours left.

The only thing I could think to try was a bunch of nested for loops, but as there are thousands of rows it takes ~45 minutes to complete one iteration and would take days if not weeks to complete the whole thing.

# for each row in the orderbook
for i, rowi in ob_sum_hours.iterrows():
    # for each row in the operator skill matrix
    for j, rowj in osm.iterrows():
        # for each row in the available operator hours
        for y, rowy in aoh.iterrows():
            if (rowi['Material'] == rowj['MATERIAL'] and rowi['ProdYYYYWW'] == rowy['YYYYWW']
                    and rowj['Operator'] == rowy['Operator'] and rowy['Hours'] > 0):
                rowy['Hours'] -= rowi['PlanHrs']
                rowi['HoursAllocated'] = rowi['Operator']

The final result would look like this:

Part Number Due Week Build Hours Operator
A 2022-46 4 Mr.One
A 2022-46 5 Mr.One
B 2022-46 8 Mr.Two
C 2022-47 1.6 Mrs.Three

Is there a better way to achieve this?
Made with one loop + apply on each line. Orderbook.groupby(Orderbook.index) groups by index, i.e. my_func iterates through each row, which is still better than a plain loop. In 'aaa' we get the list of unique Operators that match. In 'bbb' we filter Avaliable by 'YYYYWW', 'Operator' (using isin against the list of unique Operators) and 'Hours' greater than 0. Further in the loop, using the 'bbb' indices, we check the free time and, if 'ava' is greater than or equal to zero, set the values with explicit .loc indexing.

import pandas as pd

Orderbook = pd.read_csv('Orderbook.csv', header=0)
Operator = pd.read_csv('Operator.csv', header=0)
Avaliable = pd.read_csv('Avaliable.csv', header=0)
Orderbook['Operator'] = 'no'

def my_func(x):
    aaa = Operator.loc[Operator['Part number'] == x['Part Number'].values[0], 'Operator'].unique()
    bbb = Avaliable[(Avaliable['YYYYWW'] == x['Due Week'].values[0]) &
                    (Avaliable['Operator'].isin(aaa)) & (Avaliable['Hours'] > 0)]
    for i in bbb.index:
        ava = Avaliable.loc[i, 'Hours'] - x['Build Hours'].values
        if ava >= 0:
            Avaliable.loc[i, 'Hours'] = ava
            Orderbook.loc[x.index, 'Operator'] = Avaliable.loc[i, 'Operator']
            break  # added loop interrupt

Orderbook.groupby(Orderbook.index).apply(my_func)

print(Orderbook)
print(Avaliable)

Update 18.11.2022

I did it without cycles, but you need to check it; if you find something incorrect, please let me know. You can also measure the exact processing time by putting this at the beginning:

import datetime
now = datetime.datetime.now()

and printing the elapsed time at the end:

time_ = datetime.datetime.now() - now
print('elapsed time', time_)

The code:

Orderbook = pd.read_csv('Orderbook.csv', header=0)
Operator = pd.read_csv('Operator.csv', header=0)
Avaliable = pd.read_csv('Avaliable.csv', header=0)
Orderbook['Operator'] = 'no'

aaa = [Operator.loc[Operator['Part number'] == Orderbook.loc[i, 'Part Number'], 'Operator'].unique()
       for i in range(len(Orderbook))]

def my_func(x):
    bbb = Avaliable[(Avaliable['YYYYWW'] == x['Due Week'].values[0]) &
                    (Avaliable['Operator'].isin(aaa[x.index[0]])) & (Avaliable['Hours'] > 0)]
    fff = Avaliable.loc[bbb.index, 'Hours'] - x['Build Hours'].values
    ind = fff[fff.ge(0)].index
    Avaliable.loc[ind[0], 'Hours'] = fff[ind[0]]
    Orderbook.loc[x.index, 'Operator'] = Avaliable.loc[ind[0], 'Operator']

Orderbook.groupby(Orderbook.index).apply(my_func)

print(Orderbook)
print(Avaliable)
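As a further possible speed-up (a hedged sketch, not part of the answer above; the toy frames below are assumptions based on the tables in the question), the skills lookup can be pre-computed once with a merge, so the allocation step only has to scan the matching operator candidates instead of re-filtering the skill matrix for every order:

import pandas as pd

# hypothetical toy frames mirroring the question's tables
ob = pd.DataFrame({'Part Number': ['A', 'A', 'B', 'C'],
                   'Due Week': ['2022-46', '2022-46', '2022-46', '2022-47'],
                   'Build Hours': [4, 5, 8, 1.6]})
osm = pd.DataFrame({'Operator': ['Mr.One', 'Mr.One', 'Mr.Two', 'Mr.Two', 'Mrs.Three'],
                    'Part number': ['A', 'B', 'A', 'B', 'C']})

# one merge gives every (order, qualified operator) pair up front
candidates = ob.reset_index().merge(osm, left_on='Part Number', right_on='Part number', how='left')
print(candidates[['index', 'Part Number', 'Due Week', 'Operator']])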
Pandas - Fill in Missing Column Values Regression
I have a data frame 'df' that has missing column values. I want to fill in the missing/NaN values in the Avg Monthly Long Distance Charges column through prediction (regression) using the other column values, then replace the NaN values with the newly predicted values.

Data frame 'df':

Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,Tenure in Months,Offer,Phone Service,Avg Monthly Long Distance Charges,Multiple Lines,Internet Service,Internet Type,Avg Monthly GB Download,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,9,None,Yes,42.39,No,Yes,Cable,16,No,Yes,No,Yes,Yes,No,No,Yes,One Year,Yes,Credit Card,65.6,593.3,0,0,381.51,974.81,Stayed,,
0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,9,None,Yes,10.69,Yes,Yes,Cable,10,No,No,No,No,No,Yes,Yes,No,Month-to-Month,No,Credit Card,-4,542.4,38.33,10,96.21,610.28,Stayed,,
0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,4,Offer E,Yes,33.65,No,Yes,Fiber Optic,30,No,No,Yes,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,73.9,280.85,0,0,134.6,415.45,Churned,Competitor,Competitor had better devices
0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,13,Offer D,Yes,27.82,No,Yes,Fiber Optic,4,No,Yes,Yes,No,Yes,Yes,No,Yes,Month-to-Month,Yes,Bank Withdrawal,98,1237.85,0,0,361.66,1599.51,Churned,Dissatisfaction,Product dissatisfaction
0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,3,None,Yes,7.38,No,Yes,Fiber Optic,11,No,No,No,Yes,Yes,No,No,Yes,Month-to-Month,Yes,Credit Card,83.9,267.4,0,0,22.14,289.54,Churned,Dissatisfaction,Network reliability
0013-MHZWF,Female,23,No,3,Midpines,95345,37.581496,-119.972762,0,9,Offer E,Yes,16.77,No,Yes,Cable,73,No,No,No,Yes,Yes,Yes,Yes,Yes,Month-to-Month,Yes,Credit Card,69.4,571.45,0,0,150.93,722.38,Stayed,,
0013-SMEOE,Female,67,Yes,0,Lompoc,93437,34.757477,-120.550507,1,71,Offer A,Yes,9.96,No,Yes,Fiber Optic,14,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Bank Withdrawal,109.7,7904.25,0,0,707.16,8611.41,Stayed,,
0014-BMAQU,Male,52,Yes,0,Napa,94558,38.489789,-122.27011,8,63,Offer B,Yes,12.96,Yes,Yes,Fiber Optic,7,Yes,No,No,Yes,No,No,No,No,Two Year,Yes,Credit Card,84.65,5377.8,0,20,816.48,6214.28,Stayed,,
0015-UOCOJ,Female,68,No,0,Simi Valley,93063,34.296813,-118.685703,0,7,Offer E,Yes,10.53,No,Yes,DSL,21,Yes,No,No,No,No,No,No,Yes,Two Year,Yes,Bank Withdrawal,48.2,340.35,0,0,73.71,414.06,Stayed,,
0016-QLJIS,Female,43,Yes,1,Sheridan,95681,38.984756,-121.345074,3,65,None,Yes,28.46,Yes,Yes,Cable,14,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Credit Card,90.45,5957.9,0,0,1849.9,7807.8,Stayed,,
0017-DINOC,Male,47,No,0,Rancho Santa Fe,92091,32.99356,-117.207121,0,54,None,No,,,Yes,Cable,10,Yes,No,No,Yes,Yes,No,No,Yes,Two Year,No,Credit Card,45.2,2460.55,0,0,0,2460.55,Stayed,,
0017-IUDMW,Female,25,Yes,2,Sunnyvale,94086,37.378541,-122.020456,2,72,None,Yes,16.01,Yes,Yes,Fiber Optic,59,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Credit Card,116.8,8456.75,0,0,1152.72,9609.47,Stayed,,
0018-NYROU,Female,58,Yes,0,Antelope,95843,38.715498,-121.363411,0,5,None,Yes,18.65,No,Yes,Fiber Optic,10,No,No,No,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,68.95,351.5,0,0,93.25,444.75,Stayed,,
0019-EFAEP,Female,32,No,0,La Mesa,91942,32.782501,-117.01611,0,72,Offer A,Yes,2.25,Yes,Yes,Fiber Optic,16,Yes,Yes,Yes,No,Yes,No,No,Yes,Two Year,Yes,Bank Withdrawal,101.3,7261.25,0,0,162,7423.25,Stayed,,
0019-GFNTW,Female,39,No,0,Los Olivos,93441,34.70434,-120.02609,0,56,None,No,,,Yes,DSL,19,Yes,Yes,Yes,Yes,No,No,No,Yes,Two Year,No,Bank Withdrawal,45.05,2560.1,0,0,0,2560.1,Stayed,,
0020-INWCK,Female,58,Yes,2,Woodlake,93286,36.464635,-119.094348,9,71,Offer A,Yes,27.26,Yes,Yes,Fiber Optic,12,No,Yes,Yes,No,No,Yes,Yes,Yes,Two Year,Yes,Credit Card,95.75,6849.4,0,0,1935.46,8784.86,Stayed,,
0020-JDNXP,Female,52,Yes,1,Point Reyes Station,94956,38.060264,-122.830646,0,34,None,No,,,Yes,DSL,20,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,One Year,No,Credit Card,61.25,1993.2,0,0,0,1993.2,Stayed,,
0021-IKXGC,Female,72,No,0,San Marcos,92078,33.119028,-117.166036,0,1,Offer E,Yes,7.77,Yes,Yes,Fiber Optic,22,No,No,No,No,No,No,No,Yes,One Year,Yes,Bank Withdrawal,72.1,72.1,0,0,7.77,79.87,Joined,,
0022-TCJCI,Male,79,No,0,Daly City,94015,37.680844,-122.48131,0,45,None,Yes,10.67,No,Yes,DSL,17,Yes,No,Yes,No,No,Yes,No,Yes,One Year,No,Credit Card,62.7,2791.5,0,0,480.15,3271.65,Churned,Dissatisfaction,Limited range of services

My code:

# Let X = predictor variables and y = target variable
X = pd.DataFrame(df[['Monthly Charge', 'Total Charges', 'Total Long Distance Charges']])
y = pd.DataFrame(df[['Avg Monthly Long Distance Charges']])

# Add a constant variable to the predictor variables
X = sm.add_constant(X)

model01 = sm.OLS(y, X).fit()

df['Avg Monthly Long Distance Charges'].fillna(sm.OLS(y, X).fit())

My code output:

0        42.39
1        10.69
2        33.65
3        27.82
4         7.38
         ...
7038     46.68
7039      16.2
7040     18.62
7041      2.12
7042     <statsmodels.regression.linear_model.Regressio...
Name: Avg Monthly Long Distance Charges, Length: 7043, dtype: object

My code outputs this series, but does not write the values back into the original data frame. How do I do this? Thanks.
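No answer is shown for this one, but the immediate issue is that fillna() is being handed the fitted model object rather than predicted numbers (hence the <statsmodels...Regressio...> entry in the output). A minimal sketch, assuming the same OLS setup as the question, fits on the non-missing rows and assigns the predict() output back into the frame:

import statsmodels.api as sm

# assumes df is the question's data frame; fit only on rows where the target is present
mask = df['Avg Monthly Long Distance Charges'].isna()
X = sm.add_constant(df[['Monthly Charge', 'Total Charges', 'Total Long Distance Charges']])
model01 = sm.OLS(df.loc[~mask, 'Avg Monthly Long Distance Charges'], X[~mask]).fit()

# predict for the missing rows and write the predictions back into df
df.loc[mask, 'Avg Monthly Long Distance Charges'] = model01.predict(X[mask])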
RSI in spyder using data in excel
So I have an Excel file containing data on a specific stock. My Excel file contains about 2 months of data; it monitors the Open price, Close price, High price, Low price and Volume of trades in 5 minute intervals, so there are about 3000 rows in my file. I want to calculate the RSI (or EMA if it's easier) of the stock daily. I'm making a summary table that collects the daily data, so it converts my table of 3000+ rows into a table with only about 60 rows (each row represents one day).

Essentially I want some code that sorts the Excel data by date and then calculates the RSI as a single value for that day. RSI is given by 100 - (100 / (1 + RS)), where RS = average gain of up periods / average loss of down periods.

Note: My Excel file uses 'Datetime', so each row's 'Datetime' looks something like '2022-03-03 9:30-5:00' and the next row would be '2022-03-03 9:35-5:00', etc. So the code needs to just look at the date and ignore the time, I guess.

Some code to maybe help understand what I'm looking for. Here I'm calling my Excel file; I want the code to take the called file, group the data by date and then calculate the RSI of each day using the formula I wrote above.

dat = pd.read_csv('AMD_5m.csv', index_col='Datetime', parse_dates=['Datetime'],
                  date_parser=lambda x: pd.to_datetime(x, utc=True))
dates = backtest.get_dates(dat.index)

# create a summary table
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI']  # add additional fields if necessary
summary_table = pd.DataFrame(index=dates, columns=cols)

# loop backtest by dates

This is the code I used to fill out the other columns in my summary table; I'll put my SMA (simple moving average) function below it.

for d in dates:
    this_dat = dat.loc[dat.index.date == d]

    # find the number of observations in date d
    summary_table.loc[d]['Num. Obs.'] = this_dat.shape[0]

    # get trading (i.e. position holding) signals
    signals = backtest.SMA(this_dat['Close'].values, window=10)

    # find the number of trades in date d
    summary_table.loc[d]['Num. Trade'] = np.sum(np.diff(signals) == 1)

    # find PnLs for 100 shares
    shares = 100
    PnL = -shares * np.sum(this_dat['Close'].values[1:] * np.diff(signals))
    if np.sum(np.diff(signals)) > 0:  # close position at market close
        PnL += shares * this_dat['Close'].values[-1]
    summary_table.loc[d]['PnL'] = PnL

    # find the win ratio
    ind_in = np.where(np.diff(signals) == 1)[0] + 1
    ind_out = np.where(np.diff(signals) == -1)[0] + 1
    num_win = np.sum((this_dat['Close'].values[ind_out] - this_dat['Close'].values[ind_in]) > 0)
    if summary_table.loc[d]['Num. Trade'] != 0:
        summary_table.loc[d]['Win. Ratio'] = 1. * num_win / summary_table.loc[d]['Num. Trade']

This is my function for calculating the simple moving average. I was told to try and adapt this for RSI or for EMA (exponential moving average). Apparently adapting this for EMA isn't too troublesome, but I can't figure it out.
def SMA(p, window=10, signal_type='buy only'):
    # input: price "p", look-back window "window",
    # signal_type = 'buy only' (default) -- gives long signals, 'sell only' -- gives sell signals,
    # 'both' -- gives both long and short signals
    # returns a list of signals: 1 for a long position and -1 for a short position
    signals = np.zeros(len(p))
    if len(p) < window:  # no signal if insufficient data
        return signals
    sma = list(np.zeros(window) + np.nan)  # the first few prices do not give technical indicator values
    sma += [np.average(p[k:k + window]) for k in np.arange(len(p) - window)]
    for i in np.arange(len(p) - 1):
        if np.isnan(sma[i]):
            continue  # skip the open market time window
        if sma[i] < p[i] and (signal_type == 'buy only' or signal_type == 'both'):
            signals[i] = 1
        elif sma[i] > p[i] and (signal_type == 'sell only' or signal_type == 'both'):
            signals[i] = -1
    return signals
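On the EMA part of the question, pandas already ships an exponentially weighted mean, so a minimal sketch (assuming a DataFrame like dat above with a 'Close' column; the span of 10 is an arbitrary choice) could be:

# exponential moving average of the close price; span plays the role of the look-back window
# adjust=False uses the recursive form ema_t = alpha * price_t + (1 - alpha) * ema_{t-1}
dat['EMA_10'] = dat['Close'].ewm(span=10, adjust=False).mean()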
I have two solutions to this. One is to loop through each group and then add the relevant data to the summary_table; the other is to calculate the whole series and set the RSI column to it. I first recreated the data:

import yfinance
import pandas as pd

# initially created similar data through yfinance,
# then copied this to Excel and changed the Datetime column to match yours.
df = yfinance.download("AAPL", period="60d", interval="5m")

# copied it and read it as a dataframe
df = pd.read_clipboard(sep=r'\s{2,}', engine="python")

df.head()
#                 Datetime        Open        High         Low       Close   Adj Close   Volume
#0 2022-03-03 09:30-05:00  168.470001  168.910004  167.970001  168.199905  168.199905  5374241
#1 2022-03-03 09:35-05:00  168.199997  168.289993  167.550003  168.129898  168.129898  1936734
#2 2022-03-03 09:40-05:00  168.119995  168.250000  167.740005  167.770004  167.770004  1198687
#3 2022-03-03 09:45-05:00  167.770004  168.339996  167.589996  167.718094  167.718094  2128957
#4 2022-03-03 09:50-05:00  167.729996  167.970001  167.619995  167.710007  167.710007   968410

Then I formatted the data and created the summary_table:

df["date"] = pd.to_datetime(df["Datetime"].str[:16], format="%Y-%m-%d %H:%M").dt.date

# calculate percentage change from open and close of each row
df["gain"] = (df["Close"] / df["Open"]) - 1

# your summary table, slightly changing the index to use the dates above
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI']  # add additional fields if necessary
summary_table = pd.DataFrame(index=df["date"].unique(), columns=cols)

Option 1:

# loop through each group, calculate the average gain and loss, then RSI
for grp, data in df.groupby("date"):
    # average gain for gain greater than 0
    average_gain = data[data["gain"] > 0]["gain"].mean()
    # average loss for gain less than 0
    average_loss = data[data["gain"] < 0]["gain"].mean()
    # add to relevant cell of summary_table
    summary_table["RSI"].loc[grp] = 100 - (100 / (1 + (average_gain / average_loss)))

Option 2:

# define a function to apply in the groupby
def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

summary_table["RSI"] = df.groupby("date")["gain"].apply(lambda x: rsi_calc(x))

Output (same for each):

summary_table.head()
#           Num. Obs. Num. Trade  PnL Win. Ratio          RSI
#2022-03-03       NaN        NaN  NaN        NaN  -981.214015
#2022-03-04       NaN        NaN  NaN        NaN   501.950956
#2022-03-07       NaN        NaN  NaN        NaN  -228.379066
#2022-03-08       NaN        NaN  NaN        NaN -2304.451654
#2022-03-09       NaN        NaN  NaN        NaN  -689.824739
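Note that with this formula the average loss is a negative number, which is why the RSI values in the output fall outside the usual 0-100 range. A hedged variant of Option 2, following the conventional definition (RS = average gain / absolute average loss), would be:

# conventional RSI: divide by the magnitude of the average loss so the result stays in [0, 100]
def rsi_calc_abs(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].abs().mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

summary_table["RSI"] = df.groupby("date")["gain"].apply(rsi_calc_abs)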
calculating slope on a rolling basis in pandas df python
I have a dataframe:

                  CAT        ^GSPC
Date
2012-01-06  80.435059  1277.810059
2012-01-09  81.560600  1280.699951
2012-01-10  83.962914  1292.079956
....
2017-09-16  144.56653  2230.567646

and I want to find the slope of the stock against the S&P index over the last 63 days for each period. I have tried:

x = 0
temp_dct = {}
for date in df.index:
    x += 1
    max(x, (len(df.index) - 64))
    temp_dct[str(date)] = np.polyfit(df['^GSPC'][0 + x:63 + x].values,
                                     df['CAT'][0 + x:63 + x].values, 1)[0]

However, I feel this is very "unpythonic" and I've had trouble integrating rolling/shift functions into it. My expected output is a column called "Beta" that holds the slope of the S&P (x values) against the stock (y values) for all available dates.
# this will operate on a series
def polyf(seri):
    return np.polyfit(seri.index.values, seri.values, 1)[0]

# you can store the original index in a column in case you need to reset back to it after fitting
df.index = df['^GSPC']
df['slope'] = df['CAT'].rolling(63, min_periods=2).apply(polyf, raw=False)

After running this, there will be a new column storing the fitting result.
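As a hedged alternative (not part of the answer above), the OLS slope of CAT against ^GSPC can also be computed directly as the rolling covariance divided by the rolling variance, which avoids re-indexing the frame:

import pandas as pd

# slope (beta) of y = CAT on x = ^GSPC over a 63-day window:
# beta = cov(x, y) / var(x), evaluated on each rolling window
df['Beta'] = df['CAT'].rolling(63).cov(df['^GSPC']) / df['^GSPC'].rolling(63).var()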