I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size from each row and calculates a score using a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
import pandas as pd
from localcider.sequenceParameters import SequenceParameters

omega = []
AA = list('FYW')
for i, row in df.iterrows():
    seq = df['IDRseq'][i]
    b = df['bsize'][i]
    bsize = [b-1, b]
    SeqOb = SequenceParameters(seq, blobsize=bsize)
    omega.append(SeqOb.get_kappa_X(AA))
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)
After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel

# I set nb_workers to (number of CPU cores - 1) so the system stays more stable
pandarallel.initialize(progress_bar=True, nb_workers=n)

def something(x):
    # do stuff with the row x
    return result

df['result'] = df.parallel_apply(something, axis=1)
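Applied to the localcider loop from the question, the same pattern could look roughly like this (a sketch; row_omega is a made-up helper name, and the SequenceParameters/get_kappa_X calls are copied from the loop above rather than checked against the localcider docs):
import os
from pandarallel import pandarallel
from localcider.sequenceParameters import SequenceParameters

n = max(os.cpu_count() - 1, 1)        # leave one core free, as noted above
pandarallel.initialize(progress_bar=True, nb_workers=n)

def row_omega(row):
    # hypothetical helper: the same per-row computation as the loop in the question
    bsize = [row['bsize'] - 1, row['bsize']]
    seq_ob = SequenceParameters(row['IDRseq'], blobsize=bsize)
    return seq_ob.get_kappa_X(list('FYW'))

df['omega'] = df.parallel_apply(row_omega, axis=1)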
I need to iterate over row data in a pandas dataframe. However, I am stuck because the loop takes too long on millions of rows, and I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
The altitude should change gradually in multiples of 400 ft: the first row by 400 ft, the second by 800 ft, and so on.
It's like:
row[1] = 27800+400
row[2] = 27775+800
etc....
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use the pandas .add method:
df['alt'].add((df.index+1)*400)
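For example, with the two altitude values mentioned in the question (a small illustrative frame; this assumes the dataframe has the default 0-based integer index):
import pandas as pd

df = pd.DataFrame({'alt': [27800, 27775]})   # sample values from the question
df['alt_anomaly'] = df['alt'] + (df.index + 1) * 400
print(df)
#      alt  alt_anomaly
# 0  27800        28200
# 1  27775        28575
If the dataframe has a non-default index, the offsets would change accordingly, so resetting the index first may be safer.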
I have a dataframe with over 1000 columns, and I would like to know whether it makes a difference in memory usage and/or speed to run a groupby directly on the dataframe or to first create a smaller column-wise subset of the dataframe.
df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
or,
df2 = df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df = pd.concat([df, df2[['xnew','ynew','znew']]], axis=1)
I would like to test this myself but I am unfamiliar with how to do it. Advice on how to test this would be much appreciated.
The short answer is no, it doesn't matter on either dimension. From a Colab notebook:
%load_ext memory_profiler
import pandas as pd
import numpy as np
d = {'a': [1]*100 + [2]*100, 'b': [3]*50 + [4]*50 + [5]*50 + [6]*50}
for i in range(1000):
    d[i] = np.random.random(200)
for c in 'xyz':
    d[c] = np.random.random(200)
df = pd.DataFrame(d)
%time %memit df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
%%time
%%memit
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
A simple way to test this is to record the start time and subtract it from the current time once the process finishes, which gives the elapsed time.
import time
start = time.time()
# ... the code you want to measure goes here ...
process_time = time.time() - start
print(process_time)
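Applied to the two variants above, the comparison could look like this (a sketch; df_a and df_b are just illustrative copies so the two runs don't interfere, and df is the frame built in the snippet above):
import time
import pandas as pd

df_a = df.copy()
start = time.time()
df_a[['xnew','ynew','znew']] = df_a.groupby(['a','b'])[['x','y','z']].transform(
    lambda f: f.rolling(3).mean().shift())
print('direct groupby:', time.time() - start)

df_b = df.copy()
start = time.time()
df2 = df_b[['a','b','x','y','z']].copy()
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(
    lambda f: f.rolling(3).mean().shift())
df_b = pd.concat([df_b, df2[['xnew','ynew','znew']]], axis=1)
print('subset + concat:', time.time() - start)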
I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this with pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is fairly clear.
import pandas as pd

# read the csv
df = pd.read_csv('pkmn.csv', header=0)
# apply some transformations to extract the date (and hour) from the timestamp
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour   # assumes the csv has no ready-made 'hour' column
# main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas dataframe saved as df this should work:
rats = df.loc[df.Pokemon == "rattata"]         # gives you the subset of rows relating to Rattata
total = rats.Caught.sum()                      # gives you the total number caught
diff = rats.time.iloc[-1] - rats.time.iloc[0]  # difference between the first and last timestamp
hours = diff.total_seconds() / 3600            # span in hours (assumes 'time' is a datetime column)
average = total / hours                        # number caught per hour
I need to optimize this loop, which takes 2.5 seconds, because I call it more than 3000 times in my script.
The aim of this code is to create two matrices which are used afterwards in a linear system.
Does anyone have an idea, in Python or Cython?
import pandas as pd
import time

## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=pd.Timestamp(2010, 1, 1),
                                      end=pd.Timestamp(2020, 1, 1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))
date_indicatrice = [(pd.Timestamp(2010, 1, 1), pd.Timestamp(2010, 4, 1)),
                    (pd.Timestamp(2012, 5, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2013, 4, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2014, 3, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2015, 1, 1), pd.Timestamp(2015, 4, 1)),
                    (pd.Timestamp(2013, 6, 1), pd.Timestamp(2018, 4, 1))]

timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    result = df[(mat.index >= d1) & (mat.index <= d2)]
    result2 = df[(mat.index >= d1) & (mat.index <= d2) & (mat.index.hour >= 8)]
    mat.loc[result.index, j] = 1.
    mat_bp.loc[result2.index, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]
print(time.time() - timer)
Here you go. I tested the following and I get the same resultant matrices in mat and mat_bp as in your original code, but in 0.07 seconds vs. 1.4 seconds for the original code on my machine.
The real slowdown was due to using result.index and result2.index: looking rows up by datetime labels is much slower than looking them up by integer position. I used binary searches where possible to find the right indices.
import pandas as pd
import numpy as np
import time
import bisect
## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=pd.Timestamp(2010, 1, 1),
                                      end=pd.Timestamp(2020, 1, 1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))
date_indicatrice = [(pd.Timestamp(2010, 1, 1), pd.Timestamp(2010, 4, 1)),
                    (pd.Timestamp(2012, 5, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2013, 4, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2014, 3, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2015, 1, 1), pd.Timestamp(2015, 4, 1)),
                    (pd.Timestamp(2013, 6, 1), pd.Timestamp(2018, 4, 1))]
timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    # binary-search the sorted DatetimeIndex for the integer positions of d1 and d2
    ind_start = bisect.bisect_left(mat.index, d1)
    ind_end = bisect.bisect_right(mat.index, d2)
    inds = np.arange(ind_start, ind_end)
    # keep only the positions whose timestamp falls at 08:00 or later
    valid_inds = inds[mat.index[ind_start:ind_end].hour >= 8]
    mat.iloc[ind_start:ind_end, j] = 1.
    mat_bp.iloc[valid_inds, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]
print(time.time() - timer)
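As a further (untested) sketch of the same idea, the boolean masks from the original loop can be assigned directly as column values, which avoids the per-label lookups entirely:
for j, (d1, d2) in enumerate(date_indicatrice):
    # boolean mask of rows whose timestamp lies in [d1, d2]
    in_range = (mat.index >= d1) & (mat.index <= d2)
    mat[j] = in_range.astype(float)
    # additionally require the hour to be 8 or later
    mat_bp[j*2] = (in_range & (mat.index.hour >= 8)).astype(float)
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]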