Duplicating rows in dataframe python

Good afternoon everyone,
I am currently writing a thesis on the KMV model in python. I took inspiration from the code here to solve the non-linear equations. Here is the link to the CSV file used to create the dataframe. And this is the code I have so far:
Importing the required modules:
from datetime import datetime
import pandas as pd
import numpy as np
import scipy.optimize as sco
from scipy.stats import norm
df = pd.DataFrame()
df = pd.read_csv("AREX.csv", sep=';', engine = "python", decimal=',')
Functions to prepare the file for the model to run
def clean():
    # df.rename(columns={"Date": "Date"}, inplace=True)
    # df["Date"] = pd.to_datetime(df['Date'])
    df.set_index("Date", inplace=True)
    df['AREX.O'] = df['AREX.O'].astype(float)
    df.drop(['Total Short Term debt'], axis=1, inplace=True)
    return df
def preparation():
    df['e'] = df['AREX.O'] * df['Share Outstanding']
    df['Short Term Debt'] = df['Debt'] - df['Total Long term Debt']
    df['f'] = df['Short Term Debt'] + df['Total Long term Debt'] * 0.5
    df['log_ret'] = np.log(df['AREX.O']) - np.log(df['AREX.O'].shift(1))
    # df['stdev'] = df['log_ret'].rolling(252).std() * m.sqrt(252)
    return df
Algorithm used to solve for a and sigma_a (I only tried to adapt the code to my dataframe here):
def algo1():
    # formatting the values as required
    df["f"] = df["f"].astype(float)
    df["e"] = df["e"].astype(float)
    # computing the key input variable for the model
    df['a'] = df['f'].add(df["e"])

    # defining a function for the Black-Scholes equation
    def bseqn(a, debug=False):
        d1 = (np.log(a/f) + (r + 0.5*sigma_a**2)*T)/(sigma_a*np.sqrt(T))
        d2 = d1 - sigma_a*np.sqrt(T)
        y1 = e - (a*norm.cdf(d1) - np.exp(-r*T)*f*norm.cdf(d2))
        if debug:
            print("d1 = {:.6f}".format(d1))
            print("d2 = {:.6f}".format(d2))
            print("Error = {:.6f}".format(y1))
        return y1

    # Solving the model
    time_horizon = [1]
    timesteps = range(1, len(df))
    results = np.empty((df.shape[0], len(time_horizon)))

    # looping to solve for each row
    for i, years in enumerate(time_horizon):
        T = 1
        results[:, i] = df.loc[:, 'a']
        for i_t, t in enumerate(timesteps):
            a = results[t-10:t, i]
            ra = np.log(a/np.roll(a, 1))
            sigma_a = np.nanstd(ra)  # gives initial value of sigma_a
            if i_t == 0:
                subset_timesteps = range(t-1, t+1)
                print(subset_timesteps)
            else:
                subset_timesteps = [t]
            n_its = 0
            while n_its < 10:
                n_its += 1
                for t_sub in subset_timesteps:
                    r = df.iloc[t_sub]['r']
                    f = df.iloc[t_sub]['f']
                    e = df.iloc[t_sub]['e']
                    sol = sco.fsolve(bseqn, results[t_sub, i])  # if I replace newton with fsolve the code works properly
                    results[t_sub, i] = sol  # stores the new values of a
                # Update sigma_a based on new values of a
                last_sigma_a = sigma_a
                a = results[t-10:t, i]
                ra = np.log(a/np.roll(a, 1))
                sigma_a = np.nanstd(ra)  # new value of sigma
                diff = last_sigma_a - sigma_a
                if abs(diff) < 1e-3:
                    df.loc[t_sub, 'sigma_a'] = sigma_a
                    break
                else:
                    pass
    return df
Run function
def run():
    clean()
    preparation()
    algo1()
    print(df)
    print(list(df))
    # main_df = df.to_csv("AREX_D.csv")
The output should write the results of sigma_a into the created sigma_a column, but instead it adds rows: instead of 1500 rows I end up with 3000 rows, most of them NaN values. I do not understand where the code asks for that...
I suspect it comes from these lines:
diff = last_sigma_a - sigma_a
if abs(diff) < 1e-3:
    df.loc[t_sub, 'sigma_a'] = sigma_a
    break
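For context: assigning with df.loc using a label that is not in the index enlarges the DataFrame instead of overwriting a row, and since clean() sets the Date column as the index while t_sub is an integer position, each df.loc[t_sub, 'sigma_a'] = sigma_a might be creating a brand-new integer-labelled row. A minimal sketch of that behaviour on toy data (unrelated to the AREX file):
import pandas as pd
toy = pd.DataFrame({'sigma_a': [0.1, 0.2]}, index=['2015-01-02', '2015-01-05'])
toy.loc[1, 'sigma_a'] = 0.3   # label 1 is not in the index -> a third row labelled 1 is appended
print(toy)
toy.iloc[1, toy.columns.get_loc('sigma_a')] = 0.3   # writing by position targets an existing row instead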
Does anyone have any insight into what is happening?
Here is a picture of the output:
Thank you very much!

Related

Vectorization with python

My current code is extremely slow with the nested for-loop setup. I would like to speed up the process; my assumption is that the solution is vectorization with Pandas or NumPy, but I do not know how to transfer my current code into that format.
I have created an example code below.
import pandas as pd
import numpy as np
balance = 10000
raw_data = [[1,2,4,1,3],[2,3,7,2,4],[3,4,5,3,4],[4,4,9,1,5],[5,5,6,4,5]]
raw_df = pd.DataFrame(raw_data, columns=['D','O','H','L','C'])
history_data = [[1,1,5,np.nan,4],[0,1,3,np.nan,4],[1,0,4,2,3],[1,0,1,6,0],[0,1,7,np.nan,8]]
history_df = pd.DataFrame(history_data, columns=['TY','ST','OP','CL','SL'])
for n in raw_df.index:
    for p in history_df.index:
        if history_df['ST'][p] == 1 and history_df['TY'][p] == 1 and history_df['SL'][p] >= raw_df['L'][n]:
            history_df['CL'][p] = raw_df['L'][n]
            history_df['ST'][p] = 0
            balance = balance + 20
    if raw_df['C'][n] > 4:
        history_df = history_df.append({'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}, ignore_index=True)
Check out this example and see if it helps:
import numpy as np
# Use NumPy's where function to perform the check for every row of history_df and raw_df at once
mask = np.where((history_df['ST'] == 1) & (history_df['TY'] == 1) & (history_df['SL'] >= raw_df['L']))[0]
history_df.loc[mask, 'CL'] = raw_df.loc[mask, 'L']
history_df.loc[mask, 'ST'] = 0
# Calculate the balance change
balance_change = 20 * len(mask)
balance += balance_change
# Append rows to history_df where raw_df['C'] > 4
new_rows = raw_df[raw_df['C'] > 4].copy()
new_rows['TY'] = 0
new_rows['ST'] = 1
new_rows['OP'] = 5
new_rows['CL'] = np.nan
new_rows['SL'] = 9
history_df = history_df.append(new_rows, ignore_index=True)
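One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the last step will fail on a recent pandas. A sketch of the same idea with pd.concat, building the rows to add as their own frame first (matching the fixed dict appended in the original loop):
# one fixed record per raw_df row with C > 4
n_new = int((raw_df['C'] > 4).sum())
new_rows = pd.DataFrame({'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}, index=range(n_new))
history_df = pd.concat([history_df, new_rows], ignore_index=True)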

Pandas Dataframe masking issues: referring to previous rows and selecting values

I am new to Pandas, and I'm trying to avoid iterating over a DataFrame and use vectorisation instead. I am not able to get the results I want; I need help with the more complicated masking and selection statements.
This is my code:
import random
from datetime import datetime, timedelta
import pandas as pd
dates = []
temp = []
press = []
vel = []
fmt = '%Y-%m-%d %H:%M:%S'
stime = datetime.strptime('2020-01-06 10:28:16', fmt)
etime = datetime.strptime('2020-04-10 03:43:12', fmt)
td = etime - stime
l = set([random.random() for x in range(0, 1000)])
dates = [((td * x) + stime) for x in random.sample(l, 100)]
for i in range(100):
    press.append(random.uniform(14, 95.5))
    temp.append(random.uniform(-15, 45))
    vel.append(random.uniform(50, 153))
measurements = {
    'date': dates,
    'pressure': press,
    'velocity': vel,
    'temperature': temp
}
df = pd.DataFrame(measurements)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df = df.sort_index()
df2 = pd.DataFrame()
# if temp increased from previous row, set flag
df2['temp_inc'] = df['temperature'] - df.shift(1)['temperature'] > 0
df2['temp_inc'] = df2['temp_inc'].replace({True: 1, False: 0})
# need to fetch velocity where pressure has increased from previous row, else 0
press_up_mask = df.where( (df['pressure'] - df.shift(1)['pressure']) > 0)
#df2['press_spike_velocity'] = df[press_up_mask]['velocity']
# Need to perform calc based on 'temp_inc' column: if 'temp_inc' column is 1: calculate pressure * velocity, else 0
temp_inc_mask = df2['temp_inc'] == 1
df2['boyle_fact'] = df[temp_inc_mask]['pressure'] * df[temp_inc_mask]['velocity']
# Get some stats
df2['short_max_temp'] = df['temperature'].rolling(3).max()
df2['long_min_pressure'] = df['pressure'].rolling(30).min()
print(df.head())
print(df2.head())
How do I correctly calculate columns 'press_spike_velocity' and 'boyle_fact' ?
Starting from the computations:
# if temp increased from previous row, set flag
df2['temp_inc'] = df['temperature'] - df.shift(1)['temperature'] > 0
# setting int type instead of replace
df2['temp_inc'] = df2['temp_inc'].astype(int)
# need to fetch velocity where pressure has increased from previous row, else 0
press_up_mask = (df['pressure'] - df['pressure'].shift(1)) > 0
# set column to velocity, then mask in zeros via .loc assignment
df2['press_spike_velocity'] = df['velocity'].copy()
df2.loc[~press_up_mask, 'press_spike_velocity'] = 0
# Need to perform calc based on 'temp_inc' column: if 'temp_inc' column is 1: calculate pressure * velocity, else 0
temp_inc_mask = df2['temp_inc'] == 1
# same masking approach as above
df2['boyle_fact'] = df['pressure'] * df['velocity']
df2.loc[~temp_inc_mask, 'boyle_fact'] = 0
This is the simplest way to solve your problem with minimal changes to the code itself. If you dig into pandas more, you could probably find methods to do this in one or two fewer lines via in-place operations, but I don't know how much performance or readability you would gain from that.
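As a side note, if you would rather build each column in a single step, np.where expresses the same logic without the intermediate assignment (a sketch using the same masks as above):
import numpy as np
# velocity where pressure rose from the previous row, else 0
df2['press_spike_velocity'] = np.where(press_up_mask, df['velocity'], 0)
# pressure * velocity where temperature rose, else 0
df2['boyle_fact'] = np.where(df2['temp_inc'] == 1, df['pressure'] * df['velocity'], 0)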

handling real time data in python , rolling window

I want to create a function that reads a series of time values from a file (with gaps in the sampling rate, that's the problem), takes exactly 200 days of data, and lets me move through the entire data length, say 10000 days, like a rolling window.
I am not sure how to code it. Can I add a statement that calculates the difference between two values of the time variable (the x axis) up to the point where it is exactly 200 days?
Or can I somehow write a function that finds a starting value, say t0, and then finds the element of the array that is closest to t0 + (interval =) 200 days?
What I have so far is:
f = open(reading the file from directory)
lines = f.readlines()
print(len(lines))
tx = np.array([])  # times
y = np.array([])
interval = 200  # days
for li in lines:
    col = li.split()
    t0 = np.array([])
    t1 = np.array([])
    tx = np.append(tx, float(col[0]))
    y = np.append(y, float(col[1]))
    t0 = np.append(t0, np.max(tx))
    t1 = np.append(t1, tx[np.argmin(tx)])
print(t0, t1)
days = [t1 + dt.timedelta(days=float(x)) for x in days]
#y = np.random.randn(len(days))
# use pandas for convenient rolling function:
df = pd.DataFrame({"day": tx, "value": y}).set_index("day")
def closest_value(s):
    if s.shape[0] < 2:
        return np.nan
    X = np.empty((s.shape[0]-1, 2))
    X[:, 0] = s[:-1]
    X[:, 1] = np.fabs(s[:-1] - s[-1])
    min_diff = np.min(X[:, 1])
    return X[X[:, 1] == min_diff, 0][0]
df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))
Output error:
TypeError: float() argument must be a string or a number, not
'datetime.datetime'
Additionally, here are the first 10 tx and y values, respectively:
0 0.003372722575018
0.015239999629557 0.003366515509113
0.045829999726266 0.003385171061055
0.075369999743998 0.003385171061055
0.993219999596477 0.003366515509113
1.022699999623 0.003378941085299
1.05217999964952 0.003369617612836
1.08166999975219 0.003397665493594
3.0025899996981 0.003378941085299
3.04120999993756 0.003394537568711
import numpy as np
import pandas as pd
import datetime as dt
# load data in days and y arrays
# ... or generate them:
N = 1000 # number of days
day_min = dt.datetime.strptime('2000-01-01', '%Y-%m-%d')
day_max = 2000
days = np.sort(np.unique(np.random.uniform(low=0, high=day_max, size=N).astype(int)))
days = [day_min + dt.timedelta(days = int(x)) for x in days]
y = np.random.randn(len(days))
# use pandas for convenient rolling function:
df = pd.DataFrame({"day":days, "value": y}).set_index("day")
def closest_value(s):
    if s.shape[0] < 2:
        return np.nan
    X = np.empty((s.shape[0]-1, 2))
    X[:, 0] = s[:-1]
    X[:, 1] = np.fabs(s[:-1] - s[-1])
    min_diff = np.min(X[:, 1])
    return X[X[:, 1] == min_diff, 0][0]
df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))
Output:
value closest_value
day
2005-06-15 1.668638 1.591505
2005-06-16 0.316645 0.304382
2005-06-17 0.458580 0.445592
2005-06-18 -0.846174 -0.847854
2005-06-22 -0.151687 -0.166404
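If your own time axis is made of fractional day offsets like the tx values you posted, rather than datetimes, a hedged way to reuse the rolling approach above is to convert the offsets to a DatetimeIndex first (the origin date below is an arbitrary assumption):
import datetime as dt
import pandas as pd
# tx: float day offsets, y: values, closest_value: the function above (all from the question)
origin = dt.datetime(2000, 1, 1)                     # arbitrary reference date
day_index = origin + pd.to_timedelta(tx, unit='D')   # DatetimeIndex built from the float offsets
df = pd.DataFrame({"value": y}, index=pd.DatetimeIndex(day_index, name="day"))
df = df.sort_index()
df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)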
You could use pandas, set a datetime range and create a while loop to process the data in batches.
import pandas as pd
from datetime import datetime, timedelta
# Load data into pandas dataframe
df = pd.read_csv(filepath)
# Name columns
df.columns = ['dates', 'num_value']
# Convert strings to datetime
df.dates = pd.to_datetime(df['dates'], format='%d/%m/%Y')
# Print dates within a 200 day interval and move on to the next interval
i = 0
while i < len(df.dates):
    start = df.dates[i]
    end = start + timedelta(days=200)
    print(df.dates[(df.dates >= start) & (df.dates < end)])
    i += 200
For a data file that looks like this:
dates num_value
2004-7-1 1
2004-7-2 5
2004-7-4 8
2004-7-5 11
2004-7-6 17
it can be loaded with:
df = pd.read_table(filepath, sep=r"\s+", skiprows=1)
If the columns don't have headers, you should omit skiprows.
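One caveat about the loop above: i += 200 advances by 200 rows, not 200 days, so the windows only line up when there is exactly one row per day. If the next window should start right after the previous 200-day interval, a sketch (assuming df.dates is sorted) is to jump to the first row at or past end:
i = 0
while i < len(df):
    start = df.dates[i]
    end = start + timedelta(days=200)
    print(df.dates[(df.dates >= start) & (df.dates < end)])
    # jump to the first row whose date is >= end (requires sorted dates)
    i = int(df.dates.searchsorted(end))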

Efficient (fast) way to group continuous data in one DataFrame based on ranges taken from another DataFrame in Python Pandas?

I have experimental data produced by different programs. One is logging the start and end time of a trial as well as the type of trial (a category).
start trial type end
0 6.002987 2 c 7.574240
1 7.967054 3 b 19.084946
2 21.864419 5 b 23.298480
3 23.656995 7 c 24.087210
4 24.194764 9 c 27.960752
The other one records a continuous data stream and logs the time for each observation.
X Y Z
0.0000 0.324963 -0.642636 -2.305040
0.0333 0.025089 -0.480412 -0.637273
0.0666 0.364149 0.966594 0.789467
0.0999 -0.087334 -0.761769 0.399813
0.1332 0.841872 2.306711 -1.059608
I have the two tables as pandas DataFrames and want to retrieve only those parts of the continuous data that fall between the start and end ranges found in the rows of the trials DataFrame. I managed that with a for-loop that iterates over the rows, but I was thinking that there must be more of a "pandas way" of doing this. So I looked into apply, but what I came up with so far was even considerably slower than the loop.
As I'm working on a lot of large datasets I'm looking for the most efficient way in terms of execution time to solve this.
This is a slice of the expected result for the continuous DataFrame:
X Y Z trial type
13.6863 0.265358 0.116529 1.196689 NaN NaN
13.7196 -0.715096 -0.413416 0.696454 NaN NaN
13.7529 0.714897 -0.158183 1.735958 4.0 b
13.7862 -0.259513 0.194762 -0.531482 4.0 b
13.8195 -0.929080 -1.200593 -1.233834 4.0 b
[EDIT:] Here I test the performance of different approaches. I found a way using apply(), but it isn't much faster than using iterrows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def create_trials_df(num_trials=360, max_start=1400.0):
    # First df holds start and end times (as seconds) of a trial as well as type of trial.
    d = {'trial': pd.Series(np.sort(np.random.choice(np.arange(1, 400), replace=False, size=(360,)))),
         'type': pd.Series(np.random.choice(('a', 'b', 'c', 'd'), size=num_trials)),
         'start': pd.Series(np.sort(np.random.random_sample((num_trials,))) * max_start)}
    trials_df = pd.DataFrame(d)
    # Create column for when the trial ended.
    trials_df['end'] = trials_df['start'].shift(-1)
    trials_df.loc[num_trials-1, 'end'] = trials_df['start'].iloc[-1] + 2.0
    trials_df['diff'] = trials_df['end'] - trials_df['start']
    trials_df['end'] = trials_df['end'] - trials_df['diff'] * 0.2
    del trials_df['diff']
    return trials_df
def create_continuous_df(num_trials=360, max_start=1400.0):
    # Second df has continuously recorded data with time as index.
    time_delta = 1.0/30.0
    rows = int((max_start+2) * 1/time_delta)
    idx_time = pd.Index(np.arange(rows) * time_delta)
    continuous_df = pd.DataFrame(np.random.randn(rows, 3), index=idx_time, columns=list('XYZ'))
    print("continuous rows:", continuous_df.index.size)
    print("continuous last time:", continuous_df.last_valid_index())
    return continuous_df
# I want to group the continuous data by trial and type later on.
def iterrows_test(trials_df, continuous_df):
    for index, row in trials_df.iterrows():
        continuous_df.loc[row['start']:row['end'], 'trial'] = row['trial']
        continuous_df.loc[row['start']:row['end'], 'type'] = row['type']

def itertuples_test(trials_df, continuous_df):
    continuous_df['trial'] = np.NaN
    continuous_df['type'] = np.NaN
    for row in trials_df.itertuples():
        continuous_df.loc[slice(row[1], row[4]), ['trial', 'type']] = [row[2], row[3]]

def apply_test(trials_df, continuous_df):
    trial_series = pd.Series([x[0] for x in zip(trials_df.values)])
    continuous_df['trial'] = np.NaN
    continuous_df['type'] = np.NaN
    def insert_trial_data_to_continuous(vals, con_df):
        con_df.loc[slice(vals[0], vals[3]), ['trial', 'type']] = [vals[1], vals[2]]
    trial_series.apply(insert_trial_data_to_continuous, args=(continuous_df,))
def real_slow_index_map(trials_df, continuous_df):
    # Transform trial data to a new df: merge start and end ordered, make it a float index.
    trials_df['pre-start'] = trials_df['start'] - 0.0001
    trials_df['post-end'] = trials_df['end'] + 0.0001
    start_df = pd.DataFrame(data={'type': trials_df['type'].values, 'trial': trials_df['trial'].values},
                            index=trials_df['start'])
    end_df = pd.DataFrame(data={'type': trials_df['type'].values, 'trial': trials_df['trial'].values},
                          index=trials_df['end'])
    # Fill in between trials with NaN.
    pre_start_df = pd.DataFrame({'trial': np.NaN, 'type': np.NaN}, index=trials_df['pre-start'])
    post_end_df = pd.DataFrame({'trial': np.NaN, 'type': np.NaN}, index=trials_df['post-end'])
    new_df = start_df.append([end_df, pre_start_df, post_end_df])
    new_df.sort_index(inplace=True)
    # Each start/end index in new_df has a corresponding value in the type and trial columns.
    def get_tuple(idx):
        res = new_df.iloc[new_df.index.get_loc(idx, method='nearest')]
        # return trial and type column values.
        return tuple(res.values)
    # Apply this to all indices.
    idx_series = continuous_df.index.to_series()
    continuous_df['trial'] = idx_series.apply(get_tuple).values
    continuous_df[['trial', 'type']] = continuous_df['trial'].apply(pd.Series)
def jp_data_analysis_answer(trials_df, continuous_df):
    ranges = trials_df[['trial', 'type', 'start', 'end']].values
    def return_trial(n):
        for i, r in enumerate(ranges):
            if r[2] <= n <= r[3]:
                return tuple((i, r[1]))
        else:
            return np.nan, np.nan
    continuous_df['trial'], continuous_df['type'] = list(zip(*continuous_df.index.map(return_trial)))
def performance_test(func, trials_df, continuous_df):
    return_df = continuous_df.copy()
    time_ref = time.perf_counter()
    func(trials_df, return_df)
    time_delta = time.perf_counter() - time_ref
    print("time delta for {}:".format(func.__name__), time_delta)
    return return_df
# Just to illustrate where this is going:
def plot_trial(continuous_df):
    continuous_df['type'] = continuous_df['type'].astype('category')
    continuous_df = continuous_df.groupby('type').filter(lambda x: x is not np.NaN)
    # Without the NaNs in the column, let's set the trial column to dtype integer.
    continuous_df['trial'] = continuous_df['trial'].astype('int64')
    # Plot the data by trial.
    for key, group in continuous_df.groupby('trial'):
        group.drop(['trial', 'type'], axis=1).plot()
        plt.title('Trial {}, Type: {}'.format(key, group['type'].iloc[0]))
        plt.show()
        break
if __name__ == '__main__':
    import time
    num_trials = 360
    max_start_time = 1400
    trials_df = create_trials_df(max_start=max_start_time)
    data_df = create_continuous_df(max_start=max_start_time)
    # My original approach with a for-loop over iterrows.
    iterrows_df = performance_test(iterrows_test, trials_df, data_df)
    # itertuples test
    itertuples_df = performance_test(itertuples_test, trials_df, data_df)
    # apply() on trial data, continuous data is manipulated therein
    apply_df = performance_test(apply_test, trials_df, data_df)
    # Mapping on index of continuous data. SLOW!
    map_idx_df = performance_test(real_slow_index_map, trials_df, data_df)
    # method from jp_data_analysis' answer. Works well with a small continuous_df, but doesn't scale well.
    jp_df = performance_test(jp_data_analysis_answer, trials_df, data_df)
    plot_trial(apply_df)
I see a factor ~7x improvement with the logic below. The trick is to use index.map(custom_function) on continuous_df and unpack the results, together with the (in my opinion) underused for..else construct. This is still sub-optimal, but may be sufficient for your purposes, and certainly better than iterating over rows.
import numpy as np
import pandas as pd
def test2():
    # First df holds start and end times (as seconds) of a trial as well as type of trial.
    num_trials = 360
    max_start = 1400.0
    d = {'trial': pd.Series(np.sort(np.random.choice(np.arange(1, 400), replace=False, size=(360,)))),
         'type': pd.Series(np.random.choice(('a', 'b', 'c', 'd'), size=num_trials)),
         'start': pd.Series(np.sort(np.random.random_sample((num_trials,))) * max_start)}
    trials_df = pd.DataFrame(d)
    # Create column for when the trial ended.
    trials_df['end'] = trials_df['start'].shift(-1)
    trials_df.loc[num_trials-1, 'end'] = trials_df['start'].iloc[-1] + 2.0
    trials_df['diff'] = trials_df['end'] - trials_df['start']
    trials_df['end'] = trials_df['end'] - trials_df['diff'] * 0.2
    del trials_df['diff']
    # Second df has continuously recorded data with time as index.
    time_delta = 0.0333
    rows = int(max_start+2/time_delta)
    idx_time = pd.Index(np.arange(rows) * time_delta)
    continuous_df = pd.DataFrame(np.random.randn(rows, 3), index=idx_time, columns=list('XYZ'))
    ranges = trials_df[['trial', 'type', 'start', 'end']].values
    def return_trial(n):
        for r in ranges:
            if r[2] <= n <= r[3]:
                return tuple(r[:2])
        else:
            return (np.nan, '')
    continuous_df['trial'], continuous_df['type'] = list(zip(*continuous_df.index.map(return_trial)))
    return trials_df, continuous_df
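For completeness, a fully vectorized alternative (not part of the original answer, and assuming the trial intervals never overlap, which holds for the generated data) is to build a pandas IntervalIndex from the start/end columns and map every timestamp onto its trial with a single get_indexer call:
def interval_index_test(trials_df, continuous_df):
    # one closed interval per trial; get_indexer requires the intervals to be non-overlapping
    intervals = pd.IntervalIndex.from_arrays(trials_df['start'], trials_df['end'], closed='both')
    # position of the containing interval for each timestamp, -1 where there is none
    pos = intervals.get_indexer(continuous_df.index)
    inside = pos >= 0
    continuous_df['trial'] = np.where(inside, trials_df['trial'].values[pos], np.nan)
    continuous_df['type'] = np.where(inside, trials_df['type'].values[pos], None)
It can be timed with the same performance_test harness as the other candidates.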

Apply function to manipulate Python Pandas DataFrame group

I have data in a pandas DataFrame that requires considerable clean up with functions applied to the DataFrame's 'ID' groups. How does one apply any arbitrary function to manipulate Pandas DataFrame groups? A simplified example of the DataFrame is below:
import pandas as pd
import numpy as np
waypoint_time_string = ['0.5&3.0&6.0' for x in range(10)]
moving_string = ['0 0 0&0 0.1 0&1 1 1.2' for x in range(10)]
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,2,2], 'time':[1,2,3,4,5,1,2,3,4,5],
'X':[0,0,0,0,0,1,1,1,1,1],'Y':[0,0,0,0,0,1,1,1,1,1],'Z':[0,0,0,0,0,1,1,1,1,1],
'waypoint_times':waypoint_time_string,
'moving':moving_string})
I would like to apply the function set_group_positions (defined below) to each 'ID' group of df. I have only been successful looping through the DataFrame. It seems that there must be a more 'Pandas.groupby' way of doing this. Here is an example of my implementation that I'm looking to replace:
sub_frames = []
unique_IDs = df['ID'].unique()
for unique_ID in unique_IDs:
    working_df = df.loc[df['ID'] == unique_ID]
    working_df = set_group_positions(working_df)
    sub_frames.append(working_df)
final_df = pd.concat(sub_frames)
And to complete a working example, here are additional helper functions:
def set_x_vel(row):
    return(row['X'] + row['x_movement'])

def set_y_vel(row):
    return(row['Y'] + row['y_movement'])

def set_z_vel(row):
    return(row['Z'] + row['z_movement'])
output_time_list = df['time'].unique().tolist()
#main function to apply to each ID group in the data frame:
def set_group_positions(df):  # pass the combined df here
    working_df = df
    times_string = working_df['waypoint_times'].iloc[0]
    times_list = times_string.split('&')
    times_list = [float(x) for x in times_list]
    points_string = working_df['moving']
    points_string = points_string.iloc[0]
    points_list = points_string.split('&')
    points_x = []
    points_y = []
    points_z = []
    for point in points_list:
        point_list = point.split(' ')
        points_x.append(point_list[0])
        points_y.append(point_list[1])
        points_z.append(point_list[2])
    # get corresponding positions for HPAC times,
    # since there could be mismatches
    points_x = np.cumsum([float(x) for x in points_x])
    points_y = np.cumsum([float(x) for x in points_y])
    points_z = np.cumsum([float(x) for x in points_z])
    x_interp = np.interp(output_time_list, times_list, points_x).tolist()
    y_interp = np.interp(output_time_list, times_list, points_y).tolist()
    z_interp = np.interp(output_time_list, times_list, points_z).tolist()
    working_df.loc[:, 'x_movement'] = x_interp
    working_df.loc[:, 'y_movement'] = y_interp
    working_df.loc[:, 'z_movement'] = z_interp
    working_df.loc[:, 'x_pos'] = working_df.apply(set_x_vel, axis=1)
    working_df.loc[:, 'y_pos'] = working_df.apply(set_y_vel, axis=1)
    working_df.loc[:, 'z_pos'] = working_df.apply(set_z_vel, axis=1)
    return working_df
While my current implementation works, on my real data set, it takes about 20 minutes for me to run, where a simple groupby.apply lambda call on my DataFrame takes only seconds to a minute.
Instead of looping, you can use apply with groupby and a function call:
df = df.groupby('ID').apply(set_group_positions)
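One caveat worth mentioning: depending on your pandas version, groupby(...).apply(...) may prepend the group key to the index of the result. If you want the same flat index as the pd.concat version, you can pass group_keys=False (or reset the index afterwards):
final_df = df.groupby('ID', group_keys=False).apply(set_group_positions)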
