Vectorization with Python

My current code is extremely slow because of its nested for-loop setup. I would like to speed up the process; my assumption is that the solution is vectorization with Pandas or NumPy, but I do not know how to translate my current code into that form.
I have created an example code below.
import pandas as pd
import numpy as np
balance = 10000
raw_data = [[1,2,4,1,3],[2,3,7,2,4],[3,4,5,3,4],[4,4,9,1,5],[5,5,6,4,5]]
raw_df = pd.DataFrame(raw_data, columns=['D','O','H','L','C'])
history_data = [[1,1,5,np.nan,4],[0,1,3,np.nan,4],[1,0,4,2,3],[1,0,1,6,0],[0,1,7,np.nan,8]]
history_df = pd.DataFrame(history_data, columns=['TY','ST','OP','CL','SL'])
for n in raw_df.index:
    for p in history_df.index:
        if history_df['ST'][p] == 1 and history_df['TY'][p] == 1 and history_df['SL'][p] >= raw_df['L'][n]:
            history_df['CL'][p] = raw_df['L'][n]
            history_df['ST'][p] = 0
            balance = balance + 20
    if raw_df['C'][n] > 4:
        history_df = history_df.append({'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}, ignore_index=True)

Check out this example and see if it helps:
import numpy as np
# Build one boolean mask that performs the check for every row of history_df at once
mask = (history_df['ST'] == 1) & (history_df['TY'] == 1) & (history_df['SL'] >= raw_df['L'])
history_df.loc[mask, 'CL'] = raw_df.loc[mask, 'L']
history_df.loc[mask, 'ST'] = 0
# Calculate the balance change
balance_change = 20 * mask.sum()
balance += balance_change
# Append rows to history_df where raw_df['C'] > 4
new_rows = raw_df[raw_df['C'] > 4].copy()
new_rows['TY'] = 0
new_rows['ST'] = 1
new_rows['OP'] = 5
new_rows['CL'] = np.nan
new_rows['SL'] = 9
history_df = history_df.append(new_rows[['TY', 'ST', 'OP', 'CL', 'SL']], ignore_index=True)
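Note that DataFrame.append was removed in pandas 2.0; on newer pandas versions the same step can be written with pd.concat (a small sketch reusing new_rows from above):
import pandas as pd
# Equivalent of the append call on pandas >= 2.0.
history_df = pd.concat([history_df, new_rows[['TY', 'ST', 'OP', 'CL', 'SL']]], ignore_index=True)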

Writing to an exact cell in Excel using Python

Great, thanks for your help! This is my code now. I just need to fire once, so one iteration; is this the solution? Column B leaves one open cell and then adds '2', probably because it looks at the index:
import pandas as pd
from pathlib import Path
data_folder = Path(PATH)
file_to_open = data_folder / "excelbestand.xlsx"
df = pd.read_excel(file_to_open)
data_x = 4
data_y = 2
df.loc[df.index.max()+1, ['A']] = data_x
df.loc[df.index.max()+1,['B']] = data_y
df.to_excel(file_to_open, index = False)
First of all, it looks like you need to read the Excel file into a pandas DataFrame:
import pandas as pd
df = pd.read_excel("name_your_file.xlsx", sheet_name='name_your_sheet')
num_iterations = 10  # number of times you want to perform the action
i = 0
while i < num_iterations:
    data_x = df['A'].count()
    df.loc[df.index.max() + 1, 'A'] = data_x
    i += 1
As you can see, df.loc and df.iloc are very powerful: df.loc[row label, column label] selects by label, while df.iloc[row number, column number] selects by position (columns A, B, C, D are positions 0, 1, 2, 3). Since we are enlarging the frame with a new row here, the code below uses .loc with the column labels:
import pandas as pd
from pathlib import Path
data_folder = Path(PATH)
file_to_open = data_folder / "excelbestand.xlsx"
df = pd.read_excel(file_to_open)
num_iterations = 1
data_x = 4
i = 0
##### In this while loop you append data_x to the end of column "A"
while i < num_iterations:
    df.loc[df.index.max() + 1, 'A'] = data_x
    i += 1
########### Code for adding to the last cell of column B
i2 = 0
while i2 < num_iterations:
    df.loc[df.index.max() + 1, 'B'] = data_x
    i2 += 1
df.to_excel(file_to_open, index=False)
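As for the gap in column B from the question's snippet: the second df.loc[df.index.max()+1, ...] call already sees the row added by the first one, so it targets the next row down. A minimal sketch (reusing the file and variable names from the question) that computes the new row label once and writes both cells into the same row:
import pandas as pd
from pathlib import Path
data_folder = Path(PATH)                  # PATH as in the question
file_to_open = data_folder / "excelbestand.xlsx"
df = pd.read_excel(file_to_open)
data_x = 4
data_y = 2
new_row = df.index.max() + 1              # compute the new row label once
df.loc[new_row, 'A'] = data_x             # creates the new row
df.loc[new_row, 'B'] = data_y             # writes into the same new row, so column B has no gap
df.to_excel(file_to_open, index=False)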

Find when the values of a pandas.Series change by at least x

I have a time series s stored as a pandas.Series and I need to find when the value tracked by the time series changes by at least x.
In pseudocode:
print s(0)
s* = s(0)
for all t in ]t_0, t_max]:
    if |s(t) - s*| > x:
        s* = s(t)
        print s*
Naively, this can be coded in Python as follows:
import pandas as pd
def find_changes(s, x):
    changes = []
    s_last = None
    for index, value in s.iteritems():
        if s_last is None:
            s_last = value
        if value - s_last > x or s_last - value > x:
            changes += [index, value]
            s_last = value
    return changes
My data set is large, so I can't just use the method above. Moreover, I cannot use Cython or Numba due to limitations of the framework I will run this on. I can (and plan to) use pandas and NumPy.
I'm looking for some guidance on what NumPy vectorized/optimized methods to use and how.
Thanks!
EDIT: Changed code to match pseudocode.
I don't know if I am understanding you correctly, but here is how I interpreted the problem:
import pandas as pd
import numpy as np
# Our series of data.
data = pd.DataFrame(np.random.rand(10), columns = ['value'])
# The threshold.
threshold = .33
# For each point t, grab t - 1.
data['value_shifted'] = data['value'].shift(1)
# Absolute difference of t and t - 1.
data['abs_change'] = abs(data['value'] - data['value_shifted'])
# Test against the threshold.
data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)
print(data)
Giving:
value value_shifted abs_change change_exceeds_threshold
0 0.005382 NaN NaN 0
1 0.060954 0.005382 0.055573 0
2 0.090456 0.060954 0.029502 0
3 0.603118 0.090456 0.512661 1
4 0.178681 0.603118 0.424436 1
5 0.597814 0.178681 0.419133 1
6 0.976092 0.597814 0.378278 1
7 0.660010 0.976092 0.316082 0
8 0.805768 0.660010 0.145758 0
9 0.698369 0.805768 0.107400 0
I don't think the pseudocode can be vectorized, because the next state of s* depends on the previous state. There is a pure-Python solution (a single pass):
import random
import pandas as pd
s = [random.randint(0, 100) for _ in range(100)]
res = []  # record the changes
thres = 20
ss = s[0]
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])
df = pd.DataFrame(res, columns=['index', 'value'])
I think there's no way to run faster than O(N) in this case.
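If the plain-Python pass is still too slow on the real data, one modest option (a sketch of my own, not a full vectorization) is to run the same O(N) loop over the underlying NumPy arrays, which avoids the per-element overhead of iterating a pandas Series:
import numpy as np
import pandas as pd

def find_changes_np(s, x):
    # Same sequential logic as above, but on plain NumPy arrays.
    values = s.values
    index = s.index.values
    changes = []
    s_last = values[0]
    for i in range(1, len(values)):
        if abs(values[i] - s_last) > x:
            s_last = values[i]
            changes.append((index[i], values[i]))
    return changes

# Example on random data:
s = pd.Series(np.random.rand(1000))
print(find_changes_np(s, 0.5)[:5])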

Duplicating rows in dataframe python

Good afternoon everyone,
I am currently writing a thesis on the KMV model in python. I took inspiration from the code here to solve the non-linear equations. Here is the link to the CSV file used to create the dataframe. And this is the code I have so far:
Importation of the required modules
from datetime import datetime
import pandas as pd
import numpy as np
import scipy.optimize as sco
from scipy.stats import norm
df = pd.DataFrame()
df = pd.read_csv("AREX.csv", sep=';', engine = "python", decimal=',')
Functions to prepare the file for the model to run
def clean():
    # df.rename(columns={"Date": "Date"}, inplace=True)
    # df["Date"] = pd.to_datetime(df['Date'])
    df.set_index("Date", inplace=True)
    df['AREX.O'] = df['AREX.O'].astype(float)
    df.drop(['Total Short Term debt'], axis=1, inplace=True)
    return df

def preparation():
    df['e'] = df['AREX.O'] * df['Share Outstanding']
    df['Short Term Debt'] = df['Debt'] - df['Total Long term Debt']
    df['f'] = df['Short Term Debt'] + df['Total Long term Debt'] * 0.5
    df['log_ret'] = np.log(df['AREX.O']) - np.log(df['AREX.O'].shift(1))
    # df['stdev'] = df['log_ret'].rolling(252).std()*m.sqrt(252)
    return df
Algorithm used to solve for a and sigma_a.
I only tried to adapt the code to my dataframe here
def algo1():
    # Formatting the values as required
    df["f"] = df["f"].astype(float)
    df["e"] = df["e"].astype(float)
    # Computation of the key input variable for the model
    df['a'] = df['f'].add(df["e"])

    # Defining a function for the Black-Scholes equation
    def bseqn(a, debug=False):
        d1 = (np.log(a/f) + (r + 0.5*sigma_a**2)*T)/(sigma_a*np.sqrt(T))
        d2 = d1 - sigma_a*np.sqrt(T)
        y1 = e - (a*norm.cdf(d1) - np.exp(-r*T)*f*norm.cdf(d2))
        if debug:
            print("d1 = {:.6f}".format(d1))
            print("d2 = {:.6f}".format(d2))
            print("Error = {:.6f}".format(y1))
        return y1

    # Solving the model
    time_horizon = [1]
    timesteps = range(1, len(df))
    results = np.empty((df.shape[0], len(time_horizon)))
    # Looping to solve for each row
    for i, years in enumerate(time_horizon):
        T = 1
        results[:, i] = df.loc[:, 'a']
        for i_t, t in enumerate(timesteps):
            a = results[t-10:t, i]
            ra = np.log(a/np.roll(a, 1))
            sigma_a = np.nanstd(ra)  # gives initial value of sigma_a
            if i_t == 0:
                subset_timesteps = range(t-1, t+1)
                print(subset_timesteps)
            else:
                subset_timesteps = [t]
            n_its = 0
            while n_its < 10:
                n_its += 1
                for t_sub in subset_timesteps:
                    r = df.iloc[t_sub]['r']
                    f = df.iloc[t_sub]['f']
                    e = df.iloc[t_sub]['e']
                    sol = sco.fsolve(bseqn, results[t_sub, i])  # if I replace newton with fsolve the code works properly
                    results[t_sub, i] = sol  # stores the new values of a
                # Update sigma_a based on the new values of a
                last_sigma_a = sigma_a
                a = results[t-10:t, i]
                ra = np.log(a/np.roll(a, 1))
                sigma_a = np.nanstd(ra)  # new value of sigma
                diff = last_sigma_a - sigma_a
                if abs(diff) < 1e-3:
                    df.loc[t_sub, 'sigma_a'] = sigma_a
                    break
                else:
                    pass
    return df
Run function
def run():
    clean()
    preparation()
    algo1()
    print(df)
    print(list(df))
    # main_df = df.to_csv("AREX_D.csv")
The output should write the results into the created sigma_a column, but instead it adds rows: instead of 1,500 rows I end up with 3,000 rows, most of them NaN values. I do not understand where in the code that is requested...
I suspect it comes from these lines:
diff = last_sigma_a - sigma_a
if abs(diff) < 1e-3:
    df.loc[t_sub, 'sigma_a'] = sigma_a
    break
Does anyone have any insight into what is happening?
Here is a picture of the output:
Thank you very much!
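One hedged illustration of the suspected line (a toy frame, not the AREX data): after set_index("Date") the row labels are dates, so an integer t_sub is not an existing label, and assigning to a missing label with .loc enlarges the DataFrame with a new, mostly-NaN row instead of writing into an existing one. Assigning by position avoids the enlargement.
import pandas as pd

# Toy frame whose index mimics df after set_index("Date").
toy = pd.DataFrame({'sigma_a': [0.10, 0.20, 0.30]},
                   index=['2011-01-03', '2011-01-04', '2011-01-05'])

toy.loc[1, 'sigma_a'] = 0.5     # 1 is not an existing label -> pandas appends a new row labelled 1
print(len(toy))                 # 4 rows now

# Writing by position instead updates an existing row:
toy.iloc[1, toy.columns.get_loc('sigma_a')] = 0.5
print(len(toy))                 # still 4 rows; the second row was updated in place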

Efficient (fast) way to group continuous data in one DataFrame based on ranges taken from another DataFrame in Python Pandas?

I have experimental data produced by different programs. One is logging the start and end time of a trial as well as the type of trial (a category).
start trial type end
0 6.002987 2 c 7.574240
1 7.967054 3 b 19.084946
2 21.864419 5 b 23.298480
3 23.656995 7 c 24.087210
4 24.194764 9 c 27.960752
The other one records a continuous data stream and logs the time for each observation.
X Y Z
0.0000 0.324963 -0.642636 -2.305040
0.0333 0.025089 -0.480412 -0.637273
0.0666 0.364149 0.966594 0.789467
0.0999 -0.087334 -0.761769 0.399813
0.1332 0.841872 2.306711 -1.059608
I have the two tables as pandas DataFrames and want to retrieve only those parts of the continuous data that fall between the start and end times found in the rows of the trials DataFrame. I managed that with a for-loop that iterates over the rows, but I was thinking that there must be more of a "pandas way" of doing this. So I looked into apply, but what I came up with so far was even considerably slower than the loop.
As I'm working with a lot of large datasets, I'm looking for the most efficient solution in terms of execution time.
This is a slice of the expected result for the continuous DataFrame:
X Y Z trial type
13.6863 0.265358 0.116529 1.196689 NaN NaN
13.7196 -0.715096 -0.413416 0.696454 NaN NaN
13.7529 0.714897 -0.158183 1.735958 4.0 b
13.7862 -0.259513 0.194762 -0.531482 4.0 b
13.8195 -0.929080 -1.200593 -1.233834 4.0 b
[EDIT:] Here I test performance of different approaches. I found a way using apply(), but it isn't much faster than using iterrows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def create_trials_df(num_trials=360, max_start=1400.0):
    # First df holds start and end times (as seconds) of a trial as well as type of trial.
    d = {'trial': pd.Series(np.sort(np.random.choice(np.arange(1, 400), replace=False, size=(360,)))),
         'type': pd.Series(np.random.choice(('a', 'b', 'c', 'd'), size=num_trials)),
         'start': pd.Series(np.sort(np.random.random_sample((num_trials,))) * max_start)}
    trials_df = pd.DataFrame(d)
    # Create column for when the trial ended.
    trials_df['end'] = trials_df['start'].shift(-1)
    trials_df.loc[num_trials-1, 'end'] = trials_df['start'].iloc[-1] + 2.0
    trials_df['diff'] = trials_df['end'] - trials_df['start']
    trials_df['end'] = trials_df['end'] - trials_df['diff'] * 0.2
    del trials_df['diff']
    return trials_df

def create_continuous_df(num_trials=360, max_start=1400.0):
    # Second df has continuously recorded data with time as index.
    time_delta = 1.0/30.0
    rows = int((max_start+2) * 1/time_delta)
    idx_time = pd.Index(np.arange(rows) * time_delta)
    continuous_df = pd.DataFrame(np.random.randn(rows, 3), index=idx_time, columns=list('XYZ'))
    print("continuous rows:", continuous_df.index.size)
    print("continuous last time:", continuous_df.last_valid_index())
    return continuous_df

# I want to group the continuous data by trial and type later on.
def iterrows_test(trials_df, continuous_df):
    for index, row in trials_df.iterrows():
        continuous_df.loc[row['start']:row['end'], 'trial'] = row['trial']
        continuous_df.loc[row['start']:row['end'], 'type'] = row['type']

def itertuples_test(trials_df, continuous_df):
    continuous_df['trial'] = np.NaN
    continuous_df['type'] = np.NaN
    for row in trials_df.itertuples():
        continuous_df.loc[slice(row[1], row[4]), ['trial', 'type']] = [row[2], row[3]]

def apply_test(trials_df, continuous_df):
    trial_series = pd.Series([x[0] for x in zip(trials_df.values)])
    continuous_df['trial'] = np.NaN
    continuous_df['type'] = np.NaN

    def insert_trial_data_to_continuous(vals, con_df):
        con_df.loc[slice(vals[0], vals[3]), ['trial', 'type']] = [vals[1], vals[2]]

    trial_series.apply(insert_trial_data_to_continuous, args=(continuous_df,))

def real_slow_index_map(trials_df, continuous_df):
    # Transform trial_data to new df: merge start and end ordered, make it float index.
    trials_df['pre-start'] = trials_df['start'] - 0.0001
    trials_df['post-end'] = trials_df['end'] + 0.0001
    start_df = pd.DataFrame(data={'type': trials_df['type'].values, 'trial': trials_df['trial'].values},
                            index=trials_df['start'])
    end_df = pd.DataFrame(data={'type': trials_df['type'].values, 'trial': trials_df['trial'].values},
                          index=trials_df['end'])
    # Fill inbetween trials with NaN.
    pre_start_df = pd.DataFrame({'trial': np.NaN, 'type': np.NaN}, index=trials_df['pre-start'])
    post_end_df = pd.DataFrame({'trial': np.NaN, 'type': np.NaN}, index=trials_df['post-end'])
    new_df = start_df.append([end_df, pre_start_df, post_end_df])
    new_df.sort_index(inplace=True)

    # Each start/end index in new_df has corresponding value in type and trial column.
    def get_tuple(idx):
        res = new_df.iloc[new_df.index.get_loc(idx, method='nearest')]
        # Return trial and type column values.
        return tuple(res.values)

    # Apply this to all indices.
    idx_series = continuous_df.index.to_series()
    continuous_df['trial'] = idx_series.apply(get_tuple).values
    continuous_df[['trial', 'type']] = continuous_df['trial'].apply(pd.Series)

def jp_data_analysis_answer(trials_df, continuous_df):
    ranges = trials_df[['trial', 'type', 'start', 'end']].values

    def return_trial(n):
        for i, r in enumerate(ranges):
            if r[2] <= n <= r[3]:
                return tuple((i, r[1]))
        else:
            return np.nan, np.nan

    continuous_df['trial'], continuous_df['type'] = list(zip(*continuous_df.index.map(return_trial)))

def performance_test(func, trials_df, continuous_df):
    return_df = continuous_df.copy()
    time_ref = time.perf_counter()
    func(trials_df, return_df)
    time_delta = time.perf_counter() - time_ref
    print("time delta for {}:".format(func.__name__), time_delta)
    return return_df

# Just to illustrate where this is going:
def plot_trial(continuous_df):
    continuous_df['type'] = continuous_df['type'].astype('category')
    continuous_df = continuous_df.groupby('type').filter(lambda x: x is not np.NaN)
    # Without the NaNs in column, let's set the trial column to dtype integer.
    continuous_df['trial'] = continuous_df['trial'].astype('int64')
    # Plot the data by trial.
    for key, group in continuous_df.groupby('trial'):
        group.drop(['trial', 'type'], axis=1).plot()
        plt.title('Trial {}, Type: {}'.format(key, group['type'].iloc[0]))
        plt.show()
        break

if __name__ == '__main__':
    import time
    num_trials = 360
    max_start_time = 1400
    trials_df = create_trials_df(max_start=max_start_time)
    data_df = create_continuous_df(max_start=max_start_time)
    # My original approach with a for-loop over iterrows.
    iterrows_df = performance_test(iterrows_test, trials_df, data_df)
    # itertuples test
    itertuples_df = performance_test(itertuples_test, trials_df, data_df)
    # apply() on trial data, continuous data is manipulated therein
    apply_df = performance_test(apply_test, trials_df, data_df)
    # Mapping on index of continuous data. SLOW!
    map_idx_df = performance_test(real_slow_index_map, trials_df, data_df)
    # Method by jp_data_analysis' answer. Works well with small continuous_df, but doesn't scale well.
    jp_df = performance_test(jp_data_analysis_answer, trials_df, data_df)
    plot_trial(apply_df)
I see a factor of ~7x improvement with the logic below. The trick is to use an index.map(custom_function) on continuous_df and unpack the results, together with the (in my opinion) underused for..else.. construct. This is still sub-optimal, but it may be sufficient for your purposes, and it is certainly better than iterating over rows.
import numpy as np
import pandas as pd

def test2():
    # First df holds start and end times (as seconds) of a trial as well as type of trial.
    num_trials = 360
    max_start = 1400.0
    d = {'trial': pd.Series(np.sort(np.random.choice(np.arange(1, 400), replace=False, size=(360,)))),
         'type': pd.Series(np.random.choice(('a', 'b', 'c', 'd'), size=num_trials)),
         'start': pd.Series(np.sort(np.random.random_sample((num_trials,))) * max_start)}
    trials_df = pd.DataFrame(d)
    # Create column for when the trial ended.
    trials_df['end'] = trials_df['start'].shift(-1)
    trials_df.loc[num_trials-1, 'end'] = trials_df['start'].iloc[-1] + 2.0
    trials_df['diff'] = trials_df['end'] - trials_df['start']
    trials_df['end'] = trials_df['end'] - trials_df['diff'] * 0.2
    del trials_df['diff']

    # Second df has continuously recorded data with time as index.
    time_delta = 0.0333
    rows = int(max_start+2/time_delta)
    idx_time = pd.Index(np.arange(rows) * time_delta)
    continuous_df = pd.DataFrame(np.random.randn(rows, 3), index=idx_time, columns=list('XYZ'))

    ranges = trials_df[['trial', 'type', 'start', 'end']].values

    def return_trial(n):
        for r in ranges:
            if r[2] <= n <= r[3]:
                return tuple(r[:2])
        else:
            return (np.nan, '')

    continuous_df['trial'], continuous_df['type'] = list(zip(*continuous_df.index.map(return_trial)))
    return trials_df, continuous_df
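If the per-timestamp Python loop inside return_trial is still the bottleneck, a further option (a sketch under the assumption that the trial intervals never overlap, which holds for the generated trials_df; the function name interval_index_test is mine, not from the original answer) is to let pandas locate the containing interval with an IntervalIndex:
import numpy as np
import pandas as pd

def interval_index_test(trials_df, continuous_df):
    # One interval per trial; get_indexer assumes the intervals do not overlap.
    intervals = pd.IntervalIndex.from_arrays(trials_df['start'], trials_df['end'], closed='both')
    # For every timestamp, the position of the containing interval, or -1 if none contains it.
    pos = intervals.get_indexer(continuous_df.index)
    matched = pos >= 0

    trial_vals = np.full(len(continuous_df), np.nan)
    type_vals = np.full(len(continuous_df), np.nan, dtype=object)
    trial_vals[matched] = trials_df['trial'].values[pos[matched]]
    type_vals[matched] = trials_df['type'].values[pos[matched]]

    continuous_df['trial'] = trial_vals
    continuous_df['type'] = type_vals
It can be timed with the same harness as the other approaches, e.g. performance_test(interval_index_test, trials_df, data_df).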

How to efficiently join a list of values to a list of intervals?

I have a data frame which can be constructed as follows:
df = pd.DataFrame({'value':scipy.stats.norm.rvs(0, 1, size=1000),
'start':np.abs(scipy.stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(scipy.stats.norm.rvs(5, 5, size=1000))
df[:10]
start value end
0 9.521781 -0.570097 17.708335
1 3.929711 -0.927318 15.065047
2 3.990466 0.756413 4.841934
3 20.676291 -1.418172 28.284301
4 13.084246 1.280723 14.121626
5 29.784740 0.236915 32.791751
6 21.626625 1.144663 28.739413
7 18.524309 0.101871 27.271344
8 21.288152 -0.727120 27.049582
9 13.556664 0.713141 22.136275
Each row represents a value assigned to an interval (start, end)
Now, I would like to get a list of the best values occurring at times 10, 13, 15, ..., 70. (It is similar to the geometric index in SQL, if you are familiar with that.)
Below is my first attempt in Python with pandas; it takes 18.5 ms. Can anyone help to improve it? (This procedure would be called 1M or more times with different data frames in my program.)
def get_values(data):
    data.sort_index(by='value', ascending=False, inplace=True)  # this takes 0.2ms
    # can we get rid of it? since we don't really need sort...
    # all we need is the max value for each interval.
    # But if we have to keep it for simplicity it is ok.
    ret = []
    #data = data[(data['end'] >= 10) & (data['start'] <= 71)]
    for t in range(10, 71, 2):
        interval = data[(data['end'] >= t) & (data['start'] <= t)]
        if not interval.empty:
            ret.append(interval['value'].values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret

#%prun -l 10 print get_values(df)
%timeit get_values(df)
The second attempt involves dropping from pandas down to NumPy as much as possible, and it takes around 0.7 ms:
def get_values(data):
    data.sort_index(by='value', ascending=False, inplace=True)
    ret = []
    df_end = data['end'].values
    df_start = data['start'].values
    df_value = data['value'].values
    for t in range(10, 71, 2):
        values = df_value[(df_end >= t) & (df_start <= t)]
        if len(values) != 0:
            ret.append(values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret

#%prun -l 10 print get_values(df)
%timeit get_values(df)
Can we improve further? I guess the next step is at the algorithm level; both of the above are just naive implementations of the logic.
I don't understand the empty-handling in your code, so here is a faster version that ignores it. It sorts the rows by value in descending order, builds a broadcast mask of which intervals contain each target time, and uses argmax along each row to pick the first, i.e. highest-value, match:
import scipy.stats as stats
import pandas as pd
import numpy as np
df = pd.DataFrame({'value':stats.norm.rvs(0, 1, size=1000),
'start':np.abs(stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(stats.norm.rvs(5, 5, size=1000))
def get_value(df, target):
    value = df["value"].values
    idx = np.argsort(value)[::-1]
    start = df["start"].values[idx]
    end = df["end"].values[idx]
    value = value[idx]
    mask = (target[:, None] >= start[None, :]) & (target[:, None] <= end[None, :])
    index = np.argmax(mask, axis=1)
    flags = mask[np.arange(len(target)), index]
    result = value[index]
    result[~flags] = np.nan
    return result
get_value(df, np.arange(10, 71, 2))
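Since the question notes that the sort is only there to get the maximum value per covering interval, one further sketch (my own variant, not part of the answer above) drops the sort entirely and takes a masked maximum instead:
import numpy as np

def get_value_no_sort(df, target):
    # Mask out values whose interval does not contain the target, then take the max directly.
    start = df["start"].values
    end = df["end"].values
    value = df["value"].values
    mask = (target[:, None] >= start[None, :]) & (target[:, None] <= end[None, :])
    masked = np.where(mask, value[None, :], -np.inf)   # -inf where the interval does not apply
    result = masked.max(axis=1)
    result[~mask.any(axis=1)] = np.nan                 # no covering interval at all
    return result

get_value_no_sort(df, np.arange(10, 71, 2))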
