I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size from each row and calculates a score using a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
import pandas as pd
from localcider.sequenceParameters import SequenceParameters

omega = []
AA = list('FYW')
for i, row in df.iterrows():
    seq = df['IDRseq'][i]
    b = df['bsize'][i]
    bsize = [b-1, b]
    SeqOb = SequenceParameters(seq, blobsize=bsize)
    omega.append(SeqOb.get_kappa_X(AA))
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)
After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel

# I set nb_workers to (number of CPU cores - 1) so the system stays more stable
pandarallel.initialize(progress_bar=True, nb_workers=n)

def something(x):
    # do stuff with the row x
    return result

df['result'] = df.parallel_apply(something, axis=1)
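Applied to the localcider loop from the question, the same pattern could look roughly like this (a sketch; row_omega is a made-up helper name, and the SequenceParameters/get_kappa_X calls are copied from the loop above rather than checked against the localcider docs):
import os
from pandarallel import pandarallel
from localcider.sequenceParameters import SequenceParameters

n = max(os.cpu_count() - 1, 1)        # leave one core free, as noted above
pandarallel.initialize(progress_bar=True, nb_workers=n)

def row_omega(row):
    # hypothetical helper: the same per-row computation as the loop in the question
    bsize = [row['bsize'] - 1, row['bsize']]
    seq_ob = SequenceParameters(row['IDRseq'], blobsize=bsize)
    return seq_ob.get_kappa_X(list('FYW'))

df['omega'] = df.parallel_apply(row_omega, axis=1)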
I need to iterate over row data in a pandas dataframe. However, I am stuck because the loop takes too long on millions of rows, and I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
The altitude should change gradually in multiples of 400 ft: the first row by 400 ft, the second by 800 ft, and so on.
It's like:
row[1] = 27800+400
row[2] = 27775+800
etc....
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use the pandas .add method:
df['alt'].add((df.index+1)*400)
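For example, with the two altitude values mentioned in the question (a small illustrative frame; this assumes the dataframe has the default 0-based integer index):
import pandas as pd

df = pd.DataFrame({'alt': [27800, 27775]})   # sample values from the question
df['alt_anomaly'] = df['alt'] + (df.index + 1) * 400
print(df)
#      alt  alt_anomaly
# 0  27800        28200
# 1  27775        28575
If the dataframe has a non-default index, the offsets would change accordingly, so resetting the index first may be safer.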
I have a dataframe with over 1000 columns, and I would like to know whether it makes a difference in memory usage and/or speed to run a groupby directly on the dataframe or to first create a smaller column-wise subset of the dataframe.
df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
or,
df2 = df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df = pd.concat([df, df2[['xnew','ynew','znew']]], axis=1)
I would like to test this myself but I am unfamiliar with how to do it. Advice on how to test this would be much appreciated.
The short answer is no, it doesn't matter on either dimension. From a Colab notebook:
%load_ext memory_profiler
import pandas as pd
import numpy as np
d = {'a': [1]*100 + [2]*100, 'b': [3]*50 + [4]*50 + [5]*50 + [6]*50}
for i in range(1000):
    d[i] = np.random.random(200)
for c in 'xyz':
    d[c] = np.random.random(200)
df = pd.DataFrame(d)
%time %memit df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
%%time
%%memit
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
A simple way to test this is to record the start time and subtract it from the current time once the process finishes, which gives the elapsed time.
import time
start = time.time()
# ... the code you want to measure goes here ...
process_time = time.time() - start
print(process_time)
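Applied to the two variants above, the comparison could look like this (a sketch; df_a and df_b are just illustrative copies so the two runs don't interfere, and df is the frame built in the snippet above):
import time
import pandas as pd

df_a = df.copy()
start = time.time()
df_a[['xnew','ynew','znew']] = df_a.groupby(['a','b'])[['x','y','z']].transform(
    lambda f: f.rolling(3).mean().shift())
print('direct groupby:', time.time() - start)

df_b = df.copy()
start = time.time()
df2 = df_b[['a','b','x','y','z']].copy()
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(
    lambda f: f.rolling(3).mean().shift())
df_b = pd.concat([df_b, df2[['xnew','ynew','znew']]], axis=1)
print('subset + concat:', time.time() - start)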
I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this with pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is fairly clear.
import pandas as pd

# read the csv
df = pd.read_csv('pkmn.csv', header=0)
# apply some transformations to extract the date (and hour) from the timestamp
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour   # assumes the csv has no ready-made 'hour' column
# main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas dataframe saved as df this should work:
rats = df.loc[df.Pokemon == "rattata"]         # gives you the subset of rows relating to Rattata
total = rats.Caught.sum()                      # gives you the total number caught
diff = rats.time.iloc[-1] - rats.time.iloc[0]  # difference between the first and last timestamp
hours = diff.total_seconds() / 3600            # span in hours (assumes 'time' is a datetime column)
average = total / hours                        # number caught per hour
I need to optimize this loop, which takes 2.5 seconds, because I call it more than 3000 times in my script.
The aim of this code is to create two matrices which are used afterwards in a linear system.
Does anyone have an idea, in Python or Cython?
import pandas as pd
import time

## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=pd.Timestamp(2010, 1, 1),
                                      end=pd.Timestamp(2020, 1, 1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))
date_indicatrice = [(pd.Timestamp(2010, 1, 1), pd.Timestamp(2010, 4, 1)),
                    (pd.Timestamp(2012, 5, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2013, 4, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2014, 3, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2015, 1, 1), pd.Timestamp(2015, 4, 1)),
                    (pd.Timestamp(2013, 6, 1), pd.Timestamp(2018, 4, 1))]

timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    result = df[(mat.index >= d1) & (mat.index <= d2)]
    result2 = df[(mat.index >= d1) & (mat.index <= d2) & (mat.index.hour >= 8)]
    mat.loc[result.index, j] = 1.
    mat_bp.loc[result2.index, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]
print(time.time() - timer)
Here you go. I tested the following and I get the same resultant matrices in mat and mat_bp as in your original code, but in 0.07 seconds vs. 1.4 seconds for the original code on my machine.
The real slowdown was due to using result.index and result2.index: looking rows up by datetime labels is much slower than looking them up by integer position. I used binary searches where possible to find the right indices.
import pandas as pd
import numpy as np
import time
import bisect
## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=pd.Timestamp(2010, 1, 1),
                                      end=pd.Timestamp(2020, 1, 1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))
date_indicatrice = [(pd.Timestamp(2010, 1, 1), pd.Timestamp(2010, 4, 1)),
                    (pd.Timestamp(2012, 5, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2013, 4, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2014, 3, 1), pd.Timestamp(2019, 4, 1)),
                    (pd.Timestamp(2015, 1, 1), pd.Timestamp(2015, 4, 1)),
                    (pd.Timestamp(2013, 6, 1), pd.Timestamp(2018, 4, 1))]
timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    # binary-search the sorted DatetimeIndex for the integer positions of d1 and d2
    ind_start = bisect.bisect_left(mat.index, d1)
    ind_end = bisect.bisect_right(mat.index, d2)
    inds = np.arange(ind_start, ind_end)
    # keep only the positions whose timestamp falls at 08:00 or later
    valid_inds = inds[mat.index[ind_start:ind_end].hour >= 8]
    mat.iloc[ind_start:ind_end, j] = 1.
    mat_bp.iloc[valid_inds, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]
print(time.time() - timer)
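As a further (untested) sketch of the same idea, the boolean masks from the original loop can be assigned directly as column values, which avoids the per-label lookups entirely:
for j, (d1, d2) in enumerate(date_indicatrice):
    # boolean mask of rows whose timestamp lies in [d1, d2]
    in_range = (mat.index >= d1) & (mat.index <= d2)
    mat[j] = in_range.astype(float)
    # additionally require the hour to be 8 or later
    mat_bp[j*2] = (in_range & (mat.index.hour >= 8)).astype(float)
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]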