Condensing repeat code with a "for" statement using strings

I am very new to "for" statements in Python, and I can't get something that I think should be simple to work. The code I have is:
import pandas as pd
df1 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
df2 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
df3 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
DF1 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
DF2 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
DF3 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
Then:
A1 = len(df1.loc[df1['Column1'] <= DF1['Column1'].iloc[2]])
Z1 = len(df1.loc[df1['Column1'] >= DF1['Column1'].iloc[3]])
A2 = len(df2.loc[df2['Column1'] <= DF2['Column1'].iloc[2]])
Z2 = len(df2.loc[df2['Column1'] >= DF2['Column1'].iloc[3]])
A3 = len(df3.loc[df3['Column1'] <= DF3['Column1'].iloc[2]])
Z3 = len(df3.loc[df3['Column1'] >= DF3['Column1'].iloc[3]])
As you can see, it is a lot of repeated code with just the identifying numbers being different. So my first attempt at a "for" statement was:
Numbers = [1,2,3]
for i in Numbers:
"A" + str(i) = len("df" + str(i).loc["df" + str(i)['Column1'] <= "DF" + str(i)['Column1'].iloc[2]])
"Z" + str(i) = len("df" + str(i).loc["df" + str(i)['Column1'] >= "DF" + str(i)['Column1'].iloc[3]])
This yielded the SyntaxError: "can't assign to operator". So I tried:
Numbers = [1,2,3]
for i in Numbers:
A = "A" + str(i)
Z = "Z" + str(i)
A = len("df" + str(i).loc["df" + str(i)['Column1'] <= "DF" + str(i)['Column1'].iloc[2]])
Z = len("df" + str(i).loc["df" + str(i)['Column1'] >= "DF" + str(i)['Column1'].iloc[3]])
This yielded the AttributeError: 'str' object has no attribute 'loc'. I tried a few other things like:
Numbers = [1,2,3]
for i in Numbers:
A = "A" + str(i)
Z = "Z" + str(i)
df = "df" + str(i)
DF = "DF" + str(i)
A = len(df.loc[df['Column1'] <= DF['Column1'].iloc[2]])
Z = len(df.loc[df['Column1'] <= DF['Column1'].iloc[3]])
But that just gives me the same errors. Ultimately what I would want is something like:
Numbers = [1,2,3]
for i in Numbers:
    Ai = len(dfi.loc[dfi['Column1'] <= DFi['Column1'].iloc[2]])
    Zi = len(dfi.loc[dfi['Column1'] >= DFi['Column1'].iloc[3]])
Where the output would be equivalent if I typed:
A1 = len(df1.loc[df1['Column1'] <= DF1['Column1'].iloc[2]])
Z1 = len(df1.loc[df1['Column1'] >= DF1['Column1'].iloc[3]])
A2 = len(df2.loc[df2['Column1'] <= DF2['Column1'].iloc[2]])
Z2 = len(df2.loc[df2['Column1'] >= DF2['Column1'].iloc[3]])
A3 = len(df3.loc[df3['Column1'] <= DF3['Column1'].iloc[2]])
Z3 = len(df3.loc[df3['Column1'] >= DF3['Column1'].iloc[3]])

It is "restricted" to generate variables in for loop (you can do that, but it's better to avoid. See other posts: post_1, post_2).
Instead use this code to achieve your goal without generating as many variables as your needs (actually generate only the values in the for loop):
# Lists of your dataframes
Hanimals = [H26, H45, H46, H47, H51, H58, H64, H65]
Ianimals = [I26, I45, I46, I47, I51, I58, I64, I65]
# Generate your series using list comprehensions iterating through the lists above
BPM = pd.DataFrame({
    'BPM_Base': pd.Series([len(i_h.loc[i_h['EKG-evt'] <= i_i[0].iloc[0]]) / 10
                           for i_h, i_i in zip(Hanimals, Ianimals)]),
    'BPM_Test': pd.Series([len(i_h.loc[i_h['EKG-evt'] >= i_i[0].iloc[-1]]) / 30
                           for i_h, i_i in zip(Hanimals, Ianimals)]),
})
UPDATE
A more efficient way (iterate over "animals" lists only once):
# Lists of your dataframes
Hanimals = [H26, H45, H46, H47, H51, H58, H64, H65]
Ianimals = [I26, I45, I46, I47, I51, I58, I64, I65]
# You don't need pd.Series() here:
# just create a list of tuples, [(A26, Z26), (A45, Z45), ...], and iterate over it
BPM = pd.DataFrame(
    {'BPM_Base': i[0], 'BPM_Test': i[1]}
    for i in [(len(i_h.loc[i_h['EKG-evt'] <= i_i[0].iloc[0]]) / 10,
               len(i_h.loc[i_h['EKG-evt'] >= i_i[0].iloc[-1]]) / 30)
              for i_h, i_i in zip(Hanimals, Ianimals)]
)
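Applied to the simplified df1/DF1 frames from the question, a minimal sketch of the same idea keeps the frames in dicts keyed by number, so the results land in dicts A and Z instead of numbered variables:
dfs = {1: df1, 2: df2, 3: df3}
DFs = {1: DF1, 2: DF2, 3: DF3}
A, Z = {}, {}
for i in (1, 2, 3):
    A[i] = len(dfs[i].loc[dfs[i]['Column1'] <= DFs[i]['Column1'].iloc[2]])
    Z[i] = len(dfs[i].loc[dfs[i]['Column1'] >= DFs[i]['Column1'].iloc[3]])
# A[1] and Z[1] now hold what A1 and Z1 held in the repeated version, and so on.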

Figured out a better way to do this that fits my needs. I'm posting it mainly so that I will be able to find my method later.
import pandas as pd
import pybursts

# Change/Add animals and conditions here; make sure they match up directly
Animal = ['26','45','46','47','51','58','64','65','69','72','84']
Cond = ['Stomach','Intestine','Stomach','Stomach','Intestine','Intestine','Intestine','Stomach','Cut','Cut','Cut']

d = []

def CuSO4():
    for i in Animal:
        # Load in Spike data
        A = pd.read_csv('TXT/INJ/' + i + '.txt', delimiter=r"\s+", skiprows=15, header=None, usecols=range(1))
        B = pd.read_csv('TXT/EKG/' + i + '.txt', skiprows=3)
        C = pd.read_csv('TXT/ESO/' + i + '.txt', skiprows=3)
        D = pd.read_csv('TXT/TRACH/' + i + '.txt', skiprows=3)
        E = pd.read_csv('TXT/BP/' + i + '.txt', delimiter=r"\s+").rename(columns={"4 BP": "BP"})
        # Count number of beats before/after injection, divide by 10/30 minutes for average BPM
        F = len(B.loc[B['EKG-evt'] <= A[0].iloc[0]]) / 10
        G = len(B.loc[B['EKG-evt'] >= A[0].iloc[-1]]) / 30
        # Count number of esophageal events before/after injection
        H = len(C.loc[C['Eso-evt'] <= A[0].iloc[0]])
        I = len(C.loc[C['Eso-evt'] >= A[0].iloc[-1]])
        # Find Trach events after injection
        J = D.loc[D['Trach-evt'] >= A[0].iloc[-1]]
        # Count number of breaths before/after injection, divide by 10/30 min for average breaths/min
        K = len(D.loc[D['Trach-evt'] <= A[0].iloc[0]]) / 10
        L = len(J) / 30
        # Use Trach events from J to find the number of EE
        M = pd.DataFrame(pybursts.kleinberg(J['Trach-evt'], s=4, gamma=0.1))
        N = M.last_valid_index()
        # Use N and M to determine the latency; set value to MaxTime (1800 s) if EE = 0
        O = 1800 if N == 0 else M.iloc[1][1] - A[0].iloc[-1]
        # Find BP values before/after injection, then determine the mean value
        P = E.loc[E['Time'] <= A[0].iloc[0]]
        Q = E.loc[E['Time'] >= A[0].iloc[-1]]
        R = P["BP"].mean()
        S = Q["BP"].mean()
        # Combine all factors into one DF
        d.append({'EE' : N, 'EE-lat' : O,
                  'BPM_Base' : F, 'BPM_Test' : G,
                  'Eso_Base' : H, 'Eso_Test' : I,
                  'Trach_Base' : K, 'Trach_Test' : L,
                  'BP_Base' : R, 'BP_Test' : S})

CuSO4()

# Create a shell DF with the animal numbers and their conditions
DF = pd.DataFrame({'Animal' : pd.Series(Animal), 'Cond' : pd.Series(Cond)})
# Pull the list appended to in CuSO4 and make it a pd.DataFrame
Df = pd.DataFrame(d)
# Combine the two DFs
df = pd.concat([DF, Df], axis=1)
df

Selecting columns using [[]] is very inefficient, especially as the size of the dataset increases (Python, pandas)

Created sample data using the function below:
import random
import time

import matplotlib.pyplot as plt
import pandas as pd

def create_sample(num_of_rows=1000):
    # number of records to generate
    data = {
        'var1' : [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
        'other' : [random.uniform(0.0, 1.0) for x in range(num_of_rows)]
    }
    df = pd.DataFrame(data)
    print("Shape : {}".format(df.shape))
    print("Type : \n{}".format(df.dtypes))
    return df

df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using double square brackets
    ####################################################
    temp = df[['var' + str(i + 1), 'var' + str(i)]]
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
The resulting graph is roughly linear.
[plot of per-iteration selection times]
I then used pd.concat to select the columns instead; that graph shows peaks at roughly every 100 iterations. Why is this so?
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using pd.concat
    ####################################################
    temp = pd.concat([df['var' + str(i + 1)], df['var' + str(i)]], axis=1)
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
From the above we can see that the time taken to select columns using [[]] increases linearly with the size of the dataset. However, using pd.concat the time does not increase materially, but it spikes at roughly every 100 new columns. Why does it spike only at every ~100 iterations? This is not obvious to me.
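To separate the cost of the selection itself from the cost of growing the frame inside the loop, one hedged check (a sketch, not a definitive benchmark; the column names follow the code above) is to time both selection styles on a frame of fixed width with timeit:
import random
import timeit

import pandas as pd

# Fixed-width frame, so the timing measures only the selection, not the column growth
df = pd.DataFrame({'var' + str(i): [random.uniform(0.0, 1.0) for _ in range(1000)]
                   for i in range(1, 301)})

t_getitem = timeit.timeit(lambda: df[['var2', 'var1']], number=1000)
t_concat = timeit.timeit(lambda: pd.concat([df['var2'], df['var1']], axis=1), number=1000)
print('df[[...]] :', t_getitem)
print('pd.concat :', t_concat)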

Python Pandas improving calculation time for large datasets currently taking ~400 mins to run

I'm trying to improve the performance of a DataFrame I need to build daily, and I wondered if someone had some ideas. I've created a simplified example below:
First, I have a dict of DataFrames like this. It is time-series data, so it updates daily.
import pandas as pd
import numpy as np
import datetime as dt
from scipy import stats
dates = [dt.datetime.today().date() - dt.timedelta(days=x) for x in range(2000)]
m_list = [str(i) + 'm' for i in range(0, 15)]
names = [i + j for i in m_list for j in m_list]
keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
values = [pd.DataFrame([np.random.rand(225) for x in range(0, 2000)], index=dates, columns=names) for i in range(0, 10)]
df_dict = dict(zip(keys, values)) #this is my time series data
Next I have three lists:
# I will build a dict of DataFrames of calculated attributes for these combos, for each df in df_dict above
combos = ['{}/{}'.format(*np.random.choice(names, 2)) for i in range(750)] + ['{}/{}/{}'.format(*np.random.choice(names, 3)) for i in range(1500)]
periods = [20, 60, 100, 200, 500, 1000, 2000] #num of datapoints to use from time series
benchmarks = np.random.choice(combos, 25) #benchmarks to compare combos to
And then here is where I build the DataFrames I need:
def calc_beta(a_series, b_series):
    covariance = np.cov(a_series, b_series)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta

data_dict = {}
for i in list(df_dict.keys()):
    attr_list = []
    df = df_dict[i]
    for c in combos:
        c_split = c.split('/')
        combo_list = []
        for cs in c_split:
            _list = [int(x) for x in list(filter(None, cs.split('m')))]
            combo_list.append(_list)
        if len(combo_list) == 2:
            combo_list.append([np.nan, np.nan])
        c1a, c1b, c2a, c2b, c3a, c3b = [item for subl in combo_list for item in subl]
        if len(c_split) == 2:
            l1, l2 = c_split
            _series = df[l1] - df[l2]
        if len(c_split) == 3:
            l1, l2, l3 = c_split
            _series = df[l1] - df[l2] - df[l3]
        attr = {
            'name' : c,
            'a' : c1a,
            'b' : c1b,
            'c' : c2a,
            'd' : c2b,
            'e' : c3a,
            'f' : c3b,
            'series' : _series,
            'last' : _series[-1]
        }
        for p in periods:
            _str = str(p)
            p_series = _series[-p:]
            attr['quantile' + _str] = stats.percentileofscore(p_series, attr['last'])
            attr['z_score' + _str] = stats.zscore(p_series)[-1]
            attr['std' + _str] = np.std(p_series)
            attr['range' + _str] = max(p_series) - min(p_series)
            attr['last_range' + _str] = attr['last'] / attr['range' + _str]
            attr['last_std' + _str] = attr['last'] / attr['std' + _str]
            if p > 100:
                attr['5d_autocorr' + _str] = p_series.autocorr(-5)
            else:
                attr['5d_autocorr' + _str] = np.nan
            for b in benchmarks:
                b_split = b.split('/')
                if len(b_split) == 1:
                    b_series = df[b_split[0]]
                elif len(b_split) == 2:
                    b_series = df[b_split[0]] - df[b_split[1]]
                elif len(b_split) == 3:
                    b_series = df[b_split[0]] - df[b_split[1]] - df[b_split[2]]
                b_series = b_series[-p:]
                corr_value = p_series.corr(b_series)
                beta_value = calc_beta(p_series, b_series)
                corr_ticker = '{}_corr{}'.format(b, _str)
                beta_ticker = '{}_beta{}'.format(b, _str)
                attr[corr_ticker] = corr_value
                attr[beta_ticker] = beta_value
                if p > 500:
                    attr[b + '_20rolling_corr_mean' + _str] = p_series.rolling(20).corr(b_series).mean()
                    df1 = pd.DataFrame({c : p_series, b : b_series})
                    attr[b + '_20d_rolling_beta_mean' + _str] = df1.rolling(20) \
                        .cov(df1, pairwise=True) \
                        .drop([c], axis=1) \
                        .unstack(1) \
                        .droplevel(0, axis=1) \
                        .apply(lambda row: row[c] / row[b], axis=1) \
                        .mean()
        attr_list.append(attr)
    data_dict[i] = pd.DataFrame(attr_list)
This is a generic example, but it almost exactly replicates what I'm trying to do, including every type of calculation, although I've reduced the number to keep it simpler.
This last part takes about 40 mins per DataFrame in the dict, i.e. 400 mins total for this dataset.
I haven't worked with large datasets in the past; from what I understand I need to minimize for loops and apply functions, which I have done, but what else should I be doing? I appreciate any input.
Thank you
So, I went to a dark place to figure out some ways to help here :)
The long and short of it is that the two functions at the end of your script, where p > 500, are killing you. When p < 500, I can get some performance gains, which I'll detail later.
Instead of essentially iterating by combination and filling out your dataframe, I took the approach of starting with a dataframe that has all the combos (in your example above, 2,250 rows), then working to the right and vectorizing where I could. I think there is a lot of improvement to be had here, but I couldn't get it to work as well as I'd like, so maybe someone else can help.
Here's the code I ended up with. It starts after the inputs that you entered in your question.
import pandas as pd
import numpy as np
import datetime as dt
from scipy import stats
import time

def calc_beta(a_series, b_series):
    covariance = np.cov(a_series, b_series)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta

dates = [dt.datetime.today().date() - dt.timedelta(days=x) for x in range(2000)]
m_list = [str(i) + 'm' for i in range(0, 15)]
names = [i + j for i in m_list for j in m_list]
#keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
keys = ['A']
values = [pd.DataFrame([np.random.rand(225) for x in range(0, 2000)], index=dates, columns=names) for i in range(0, 10)]
df_dict = dict(zip(keys, values))

combos = ['{}/{}'.format(*np.random.choice(names, 2)) for i in range(750)] + ['{}/{}/{}'.format(*np.random.choice(names, 3)) for i in range(1500)]
#periods = [20, 60, 100, 200, 500, 1000, 2000] #num of datapoints to use from time series
periods = [20]  # num of datapoints to use from time series
benchmarks = np.random.choice(combos, 25)  # benchmarks to compare combos to

data_dict = {}
for i in list(df_dict.keys()):
    df = df_dict[i]
    mydf = pd.DataFrame(combos, columns=['name'])
    mydf[['a','b','c','d','e','f']] = mydf.name.str.replace('/', '').str.replace('m', ',').str[0:-1].str.split(',', expand=True)

    def get_series(a):
        if len(a) == 2:
            l1, l2 = a
            s = df[l1] - df[l2]
            return s.tolist()
        else:
            l1, l2, l3 = a
            s = df[l1] - df[l2] - df[l3]
            return s.tolist()

    mydf['series'] = mydf['name'].apply(lambda x: get_series(x.split('/')))
    mydf['last'] = mydf['series'].str[-1]

    for p in periods:
        _str = str(p)
        mydf['quantile' + _str] = mydf.apply(lambda x: stats.percentileofscore(x['series'][-p:], x['last']), axis=1)
        mydf['z_score' + _str] = mydf.apply(lambda x: stats.zscore(x['series'][-p:])[-1], axis=1)
        mydf['std' + _str] = mydf.apply(lambda x: np.std(x['series'][-p:]), axis=1)
        mydf['range' + _str] = mydf.apply(lambda x: max(x['series'][-p:]) - min(x['series'][-p:]), axis=1)
        mydf['last_range' + _str] = mydf['last'] / mydf['range' + _str]
        mydf['last_std' + _str] = mydf['last'] / mydf['std' + _str]
        if p > 100:
            mydf['5d_autocorr' + _str] = mydf.apply(lambda x: pd.Series(x['series'][-p:]).autocorr(-5), axis=1)
        else:
            mydf['5d_autocorr' + _str] = np.nan

        def get_series(a):
            if len(a) == 1:
                b = df[a[0]]
                return b.tolist()
            elif len(a) == 2:
                b = df[a[0]] - df[a[1]]
                return b.tolist()
            else:
                b = df[a[0]] - df[a[1]] - df[a[2]]
                return b.tolist()

        for b in benchmarks:
            corr_ticker = '{}_corr{}'.format(b, _str)
            beta_ticker = '{}_beta{}'.format(b, _str)
            b_series = get_series(b.split('/'))[-p:]
            mydf[corr_ticker] = mydf.apply(lambda x: stats.pearsonr(np.array(x['series'][-p:]), np.array(b_series))[0], axis=1)
            mydf[beta_ticker] = mydf.apply(lambda x: calc_beta(np.array(x['series'][-p:]), np.array(b_series)), axis=1)
            if p > 500:
                mydf[b + '_20rolling_corr_mean' + _str] = mydf.apply(lambda x: pd.Series(x['series'][-p:]).rolling(20).corr(pd.Series(b_series)).mean(), axis=1)
                mydf[b + '_20d_rolling_beta_mean' + _str] = mydf.apply(
                    lambda x: pd.DataFrame({x['name']: pd.Series(x['series'][-p:]), b: pd.Series(b_series)})
                        .rolling(20)
                        .cov(pd.DataFrame({x['name']: pd.Series(x['series'][-p:]), b: pd.Series(b_series)}), pairwise=True)
                        .drop([x['name']], axis=1)
                        .unstack(1)
                        .droplevel(0, axis=1)
                        .apply(lambda row: row[x['name']] / row[b], axis=1)
                        .mean(),
                    axis=1)
    data_dict[i] = mydf
I only ran one set, 'A', and changed the period. Keeping 'A' constant and changing the period, I get the performance gains shown here. At period = 400, I still get 60% better performance.
A 20
Original: Total Time 25.74614143371582
Revised: Total Time 7.026344299316406
A 200
Original: Total Time 25.56810474395752
Revised: Total Time 10.015231847763062
A 400
Original: Total Time 28.221587419509888
Revised: Total Time 11.064109802246094
Going to period 501, your original code took 1121.6251230239868 seconds. Mine was about the same. Going from 400 to 501 is adding an enormous amount of time for two functions (repeated over each benchmark).
If you need those functions and have to calculate them at the time of analysis, you should spend your time on those two functions. I found that using pandas Series is slow, and you'll notice I used the scipy module for correlation in one instance because the gains are worth it. If you can use numpy directly, or the scipy module, for your last two functions, you'll see gains there as well.
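For example, here is a hedged NumPy-only sketch of the expensive 20-day rolling-beta mean (the rolling_beta_mean name is mine, and it needs NumPy >= 1.20 for sliding_window_view; it computes full windows only, which matches the pandas version because the leading NaN windows are skipped by .mean()):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_beta_mean(a, b, window=20):
    # Mean over all rolling windows of cov(a, b) / var(b), the same quantity
    # as the rolling().cov(..., pairwise=True) construction above
    a_w = sliding_window_view(np.asarray(a, dtype=float), window)  # shape (n - window + 1, window)
    b_w = sliding_window_view(np.asarray(b, dtype=float), window)
    a_c = a_w - a_w.mean(axis=1, keepdims=True)
    b_c = b_w - b_w.mean(axis=1, keepdims=True)
    cov_ab = (a_c * b_c).sum(axis=1)  # the ddof factor cancels in the ratio below
    var_b = (b_c * b_c).sum(axis=1)
    return float(np.mean(cov_ab / var_b))
Something like rolling_beta_mean(p_series, b_series) could then stand in for the rolling .cov() chain.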
The other place to look is where I'm using lambda functions; this is still row by row, like using for loops. I am saving the period series so I can use it in the calculations that follow:
def get_series(a):
    if len(a) == 2:
        l1, l2 = a
        s = df[l1] - df[l2]
        return s.tolist()
    else:
        l1, l2, l3 = a
        s = df[l1] - df[l2] - df[l3]
        return s.tolist()

mydf['series'] = mydf['name'].apply(lambda x: get_series(x.split('/')))
This series is made of lists, which are passed into the lambda functions. I was hoping to find a way to vectorize this by calculating all rows at the same time, but some functions require a Series and some use a list, and I just couldn't figure it out. Here's an example:
mydf['quantile' + _str] = mydf.apply(lambda x: stats.percentileofscore(x['series'][-p:], x['last']), axis=1)
If you can figure out how to vectorize this, and then apply it to those functions where p > 500, you'll see some savings.
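As a hedged sketch of vectorizing that particular column (it assumes all the lists in mydf['series'] have the same length, which they do here, and it matches kind='weak', which differs from percentileofscore's default 'rank' only when there are ties):
import numpy as np

# Stack each row's p-length tail into one 2-D array, then compare every tail
# against its own last value in a single broadcast instead of one
# stats.percentileofscore call per row
S = np.array([s[-p:] for s in mydf['series']])  # shape (n_combos, p)
last = S[:, -1:]                                # each row's most recent value, kept 2-D
mydf['quantile' + _str] = (S <= last).mean(axis=1) * 100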
In the end, with your code or my code, the real issue is those last two functions. Everything else is smaller, but real, savings that add up, but rethinking that last piece is what will save your day.
The other option is to multiprocess, or to break this up onto multiple machines.
I changed one line in the inner-most loop, which gives a 1.6x speedup for a data frame with 2,000 lines.
This won't solve all of the issues, but it might help.
for b in benchmarks:
    ...
    if p > 500:
        attr[b + '_20d_rolling_beta_mean' + _str] = (
            df1
            .rolling(5)
            .cov(df1, pairwise=True)
            .drop([c], axis=1)
            .unstack(1)
            .droplevel(0, axis=1)
            # .apply(lambda row: row[c] / row[b], axis=1)                  # <-- removed
            .assign(result = lambda x: x[c] / x[b]).iloc[:, -1].squeeze()  # <-- added
            .mean()
        )
Crude timing info (elapsed time), for the first 100 combos in A:
apply statement: 142.6 sec
assign statement: 90.1 sec

How can I find the intersection of two lines more efficiently? (nested for loops in Python)

I need to find the intersection point of two data sets, as illustrated here:
I have used the nested loops below to achieve this, but it takes impractically long to run for a dataframe with more rows (~1000). How can I do this more efficiently?
For clarity, here is a screenshot of the CSV used in the example (len=20):
import pandas as pd

data = pd.read_csv("Less_Data.csv")

# Intersection of line A (points 1 & 2) and line B (points 3 & 4)
def findIntersection(x1, y1, x2, y2, x3, y3, x4, y4):
    px = (( (x1*y2 - y1*x2)*(x3 - x4) - (x1 - x2)*(x3*y4 - y3*x4) )
          / ( (x1 - x2)*(y3 - y4) - (y1 - y2)*(x3 - x4) ))
    py = (( (x1*y2 - y1*x2)*(y3 - y4) - (y1 - y2)*(x3*y4 - y3*x4) )
          / ( (x1 - x2)*(y3 - y4) - (y1 - y2)*(x3 - x4) ))
    return [px, py]

# Find intersection of two series
intersections = {}
error_x = {}
error_y = {}
count = 0
print('All intersections found:')
for i in range(len(data)):
    i_storage_modulus = data.iloc[i]['Storage_Modulus_Pa']
    i_storage_stress = data.iloc[i]['Storage_Oscillation_Stress_Pa']
    for j in range(len(data)):
        j_storage_modulus = data.iloc[j]['Storage_Modulus_Pa']
        j_storage_stress = data.iloc[j]['Storage_Oscillation_Stress_Pa']
        if i == j + 1:
            for k in range(len(data)):
                k_loss_stress = data.iloc[k]['Loss_Oscillation_Stress_Pa']
                k_loss_modulus = data.iloc[k]['Loss_Modulus_Pa']
                for l in range(len(data)):
                    l_loss_stress = data.iloc[l]['Loss_Oscillation_Stress_Pa']
                    l_loss_modulus = data.iloc[l]['Loss_Modulus_Pa']
                    if k == l + 1:
                        if (max(k_loss_modulus, l_loss_modulus)
                                <= min(i_storage_modulus, j_storage_modulus)):
                            continue
                        else:
                            sample_intersect = findIntersection(i_storage_stress,
                                                                i_storage_modulus,
                                                                j_storage_stress,
                                                                j_storage_modulus,
                                                                k_loss_stress,
                                                                k_loss_modulus,
                                                                l_loss_stress,
                                                                l_loss_modulus)
                            if (min(i_storage_stress, j_storage_stress)
                                    <= sample_intersect[0]
                                    <= max(i_storage_stress, j_storage_stress)):
                                if (min(k_loss_stress, l_loss_stress)
                                        <= sample_intersect[0]
                                        <= max(k_loss_stress, l_loss_stress)):
                                    print(sample_intersect)
                                    intersections[count] = sample_intersect
                                    error_x[count] = [i_storage_stress,
                                                      j_storage_stress,
                                                      k_loss_stress,
                                                      l_loss_stress]
                                    error_y[count] = [i_storage_modulus,
                                                      j_storage_modulus,
                                                      k_loss_modulus,
                                                      l_loss_modulus]
                                    count += 1

# Determine error bars
min_x_poss = []
max_x_poss = []
for i in error_x[0]:
    if i < intersections[0][0]:
        min_x_poss.append(i)
    if i > intersections[0][0]:
        max_x_poss.append(i)
x_error = (min(max_x_poss) - max(min_x_poss)) / 2

min_y_poss = []
max_y_poss = []
for i in error_y[0]:
    if i < intersections[0][1]:
        min_y_poss.append(i)
    if i > intersections[0][1]:
        max_y_poss.append(i)
y_error = (min(max_y_poss) - max(min_y_poss)) / 2

# Print results
print('\n', 'Yield Stress: ' + str(int(intersections[0][0])) + ' ± ' +
      str(int(x_error)) + ' Pa (' +
      str(int(x_error*100/intersections[0][0])) + '%)')
print('\n', 'Yield Modulus: ' + str(int(intersections[0][1])) + ' ± ' +
      str(int(y_error)) + ' Pa (' +
      str(int(y_error*100/intersections[0][1])) + '%)')
Can you create a new function y = (Storage Modulus - Loss Modulus) vs Oscillation Stress? The point of intersection is where y changes sign from positive to negative. The secant method should find this point in a few iterations.
https://en.wikipedia.org/wiki/Secant_method
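A hedged sketch of that approach, assuming the rows are already ordered by oscillation stress and using the column names from the question (the loss curve is first interpolated onto the storage-stress grid so the two moduli can be subtracted):
import numpy as np
import pandas as pd

data = pd.read_csv("Less_Data.csv")
xs = data['Storage_Oscillation_Stress_Pa'].to_numpy()
ys = data['Storage_Modulus_Pa'].to_numpy()
xl = data['Loss_Oscillation_Stress_Pa'].to_numpy()
yl = data['Loss_Modulus_Pa'].to_numpy()

# Interpolate the loss curve onto the storage-stress grid (np.interp needs increasing x),
# then work with the sign of the difference curve
order = np.argsort(xl)
diff = ys - np.interp(xs, xl[order], yl[order])

# Indices i where the difference changes sign between sample i and i + 1
crossings = np.where(np.diff(np.sign(diff)) != 0)[0]
if len(crossings):
    i = crossings[0]
    # One linear-interpolation (secant) step between the two bracketing points
    yield_stress = xs[i] - diff[i] * (xs[i + 1] - xs[i]) / (diff[i + 1] - diff[i])
    print('Estimated yield stress:', yield_stress)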

Python KeyError: 101 when I try to calculate multiple forecasts for a time series

I want to use a time series in Python and calculate n forecasts.
I tried to use a for loop, but when I use n >= 2 I get an error: "KeyError: 101".
I tried:
import pandas as pd
from dateutil.relativedelta import relativedelta

dateparse = lambda x: pd.datetime.strptime(x, '%YM%m')
df = pd.read_excel('test.csv', sheet_name=f'sheet_1', index_col=2, parse_dates=['date'], date_parser=dateparse)
ad = df['ad']
n = 2
k = 3
for x in range(n):
    tot = len(ad) - 1
    adtf = 7 + 23*ad[tot-1] + 55*ad[tot-2] + 13*nu[tot-1] + 3*nu[tot-2]
    indexf = ad.index[tot]
    indexf += relativedelta(months=+1)
    i = pd.Index([indexf])
    ad = ad.append(pd.DataFrame({0: [adtf]}, index=i))
    nu = nu.append(pd.DataFrame({0: [k]}, index=i))
print(ad)
PS: I have added nu = nu.append(pd.DataFrame({0:[k]}, index=i)) in order to have a value to use in the next cycle.
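A likely cause (this is an assumption, since the data isn't shown): after the first append, ad is no longer the plain Series you started with, so integer lookups such as ad[tot-1] stop resolving positionally and raise a KeyError. A minimal sketch that keeps everything as Series, uses positional .iloc indexing, and uses pd.concat (Series.append is deprecated in recent pandas), assuming nu is a Series aligned with ad:
import pandas as pd
from dateutil.relativedelta import relativedelta

n = 2
k = 3
for _ in range(n):
    # ad[tot-1] / ad[tot-2] were positional lookups; .iloc makes that explicit
    adtf = 7 + 23*ad.iloc[-2] + 55*ad.iloc[-3] + 13*nu.iloc[-2] + 3*nu.iloc[-3]
    next_index = ad.index[-1] + relativedelta(months=+1)
    ad = pd.concat([ad, pd.Series([adtf], index=[next_index])])
    nu = pd.concat([nu, pd.Series([k], index=[next_index])])
print(ad)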

Creating new pandas df from old one

I have a dataframe data and want to append another one at the end. The new dataframe is similar to the previous one, only with the entries swapped. I have the following code, which works and illustrates what I am doing:
listL = data.shape[0]
length = data.shape[1]
mid = (length-1) / 2.0
for j in range(0, 5):
    data.loc[listL+j] = data.iloc[j]
for j in range(0, 5):
    for i in range(start, end):
        left = int(ceil(mid+i)) + 1
        right = int(ceil(mid-i))
        data.iloc[listL+j][left] = data.iloc[j][right]
    data.iloc[listL+j][0] = data.iloc[j][0] + 10
In this example I am adding only the first 5 rows at the end and swapping the columns. This does not scale well at all, and it is very inefficient.
Can you help make this more efficient, eliminate the loops, and make it scale well (I would like to work with dataframes that have tens of thousands of entries)?
In particular, how can I make the swapping more efficient?
Update:
Using one of the answers, I can now do:
tmpdf = data
data = pandas.concat([data, tmpdf])
for j in range(0, listL-1):
    for i in range(start, end):
        left = int(ceil(mid+i)) + 1
        right = int(ceil(mid-i))
        data.iloc[listL+j][left] = data.iloc[listL+j][right]
    data.iloc[listL+j][0] = data.iloc[listL+j][0] + 10
where listL is the number of rows in the original df data. I need to optimise the second part:
listL = data.shape[0]
length = data.shape[1]
mid = (length-1) / 2.0
for j in range(0, listL-1):
    for i in range(start, end):
        left = int(ceil(mid+i)) + 1
        right = int(ceil(mid-i))
        data.iloc[listL+j][left] = data.iloc[listL+j][right]
    data.iloc[listL+j][0] = data.iloc[listL+j][0] + 10
If you have df1 and df2, you can simply use pd.concat to add df2's first five rows, independently of how the columns are ordered:
pd.concat([df1, df2.iloc[:5]])
This is what I ended up doing, thanks to the answers and comments received:
length = data.shape[1]
mid = (length-1) / 2.0
start = -int(floor(mid))
end = int(floor(mid))
#for j in range(0, 5):
#    data.loc[listL+j] = data.iloc[j]
tmpdf = data.copy(deep=True)
for i in range(start, end):
    left = int(ceil(mid+i)) + 1
    right = int(ceil(mid-i))
    tmpdf[data.columns[left]] = data[data.columns[right]]
data = pandas.concat([data, tmpdf])
