Parallelizing for loops in Python

I know similar questions on this topic have been asked before, but I'm still struggling to make any headway with my problem.
Basically, I have three dataframes (of sizes 402 x 402, 402 x 3142, and 1 x 402) and I'm combining elements from them into a calculation. I then write the result to another dataframe; see the code below, which uses dummy data. Each calculation takes between 0.3 and 0.8 ms, but there are (402 x 3142)^2 calculations in total, which obviously takes a long time!
Since none of the calculations depends on any other, this is ripe for parallelization, but I'm having a hard time figuring out how to do it. Sorry if the code is ugly; I'm very new to Python and to parallel computing.
One additional thing to note is that the two matrices are sparse (densities of 0.4 and 0.3, respectively), so they could be converted to coordinate or compressed row/column format so that not all of the possible combinations need to be calculated. This might reduce the time by half.
import numpy as np
import pandas as pd
A = pd.DataFrame(np.random.choice([0, 1], size=(402,402), p=[0.6,0.4]))
B = pd.DataFrame(np.random.choice([0, 1], size=(402,3142), p=[0.7,0.3]))
x = A.sum(axis = 1)
col_names = ["R", "I", "S", "J","value"]
results = pd.DataFrame(columns = col_names)
row = 0
for r in B.columns:
    for s in B.columns:
        for i in A.index:
            for j in A.columns:
                results.loc[row,"R"] = r
                results.loc[row,"I"] = i
                results.loc[row,"S"] = s
                results.loc[row,"J"] = j
                results.loc[row, "value"] = A.loc[i,j]*B.loc[j,s]*B.loc[i,r]/x[i]
                row = row + 1
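Because every entry is a product of independent factors, the four nested loops can also be written as a single array expression, which may matter more than parallelizing them. The sketch below is one possible approach, not the poster's code: it uses NumPy's `einsum`, and the helper name `pairwise_values` is hypothetical. It assumes that rows of A summing to zero should give 0 instead of the NaN the loop would produce. At the full problem size the dense result would hold (402 x 3142)^2, roughly 1.6e12, entries, so it would have to be built in slices over r or restricted to the nonzero combinations.

```python
import numpy as np

def pairwise_values(A, B):
    """Return V with V[r, i, s, j] = A[i, j] * B[j, s] * B[i, r] / x[i]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    x = A.sum(axis=1)
    # Assumption: rows whose sum is zero are mapped to 0 instead of NaN.
    A_scaled = np.divide(A, x[:, None], out=np.zeros_like(A), where=x[:, None] != 0)
    # Multiply the three factors and arrange the axes as (r, i, s, j).
    return np.einsum("ij,js,ir->risj", A_scaled, B, B)

# Small dummy data; the real shapes are (402, 402) and (402, 3142).
rng = np.random.default_rng(0)
A = rng.choice([0, 1], size=(4, 4), p=[0.6, 0.4])
B = rng.choice([0, 1], size=(4, 6), p=[0.7, 0.3])
V = pairwise_values(A, B)
print(V.shape)  # (6, 4, 6, 4)
```

Only the nonzero entries of V need to be written to the results table, which is where the sparsity mentioned above pays off.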

Related

How to multicore processing a for loop with iterrows in python

I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size for each row and calculates a score from a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
omega = []
AA=list('FYW')
for i, row in df.iterrows():
    seq = df['IDRseq'][i]
    b = df['bsize'][i]
    bsize = [b-1,b]
    SeqOb = SequenceParameters(seq,blobsize=bsize)
    omega.append(SeqOb.get_kappa_X(AA))
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)
After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers = n)
# nb_workers = n; I set nb_workers to the number of CPU cores minus 1 so the system stays more stable
def something(x):
    #do stuff
    return result
df['result'] = df.parallel_apply(something, axis=1)

How do I add a matrix constraint `Ax=b` to a Pyomo model efficiently?

I want to add the constraints Ax=b to a Pyomo model with my numpy arrays A and b as efficiently as possible. Unfortunately, the performance is currently very bad. For the following example
import time
import numpy as np
import pyomo.environ as pyo
start = time.time()
rows = 287
cols = 2765
A = np.random.rand(rows, cols)
b = np.random.rand(rows)
mdl = pyo.ConcreteModel()
mdl.rows = range(rows)
mdl.cols = range(cols)
mdl.A = A
mdl.b = b
mdl.x_var = pyo.Var(mdl.cols, bounds=(0.0, None))
mdl.constraints = pyo.ConstraintList()
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
mdl.obj = pyo.Objective(expr=sum(mdl.x_var[col] for col in mdl.cols), sense=pyo.minimize)
end = time.time()
print(end - start)
it takes almost 30 seconds because of the add statement and the large number of columns. Is it possible to pass A, x, and b directly and quickly instead of adding the constraints row by row?
The main thing that is slowing down your construction above is the fact that you are building the constraint list elements within a list comprehension, which is unnecessary and causes a lot of bloat.
This line:
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
constructs a list of the captured results of each ConstraintList.add() call, which is a "rich" return value. That list is an unnecessary byproduct of the loop you want to run over add(). Use a plain for loop to avoid that capture:
for row in mdl.rows:
    mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row])
Note that simply swapping the square brackets for parentheses is not equivalent: that creates a lazy generator which is never consumed, so add() is never called and no constraints reach the model. That would explain a "construction time" of about 0.02 seconds with the parenthesized form.
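One caveat about the parenthesized form is worth checking before trusting a timing: a generator expression is lazy, so its body does not run until something iterates over it. The sketch below illustrates this in plain Python; `add` is a hypothetical stand-in for any side-effecting call such as ConstraintList.add(), not Pyomo's API.

```python
calls = []

def add(value):
    """Stand-in for a side-effecting call such as ConstraintList.add()."""
    calls.append(value)
    return value

# A generator expression only describes the work; nothing runs yet.
lazy = (add(i) for i in range(3))
calls_after_generator = len(calls)

# A plain for loop performs the calls without building a throwaway list.
for i in range(3):
    add(i)
calls_after_loop = len(calls)

print(calls_after_generator, calls_after_loop)  # 0 3
```

If the constraints are actually needed, the loop form is the safe choice; the generator form is only fast because it does no work.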

How to multiprocess finding closest geographic point in two pandas dataframes?

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool
from geopy import distance
return_columns = []
return_columns_cb = lambda x: return_columns.append(x)
def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet
    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())
    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol
def combine_yield(field):
    #field is a list of the files for the field I'm working with
    #lots of pre-processing
    #dfs in this case is a list of the dataframes for the current field
    #mdf is the dataframe with the most points which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]), callback=return_columns_cb)
    for col in return_columns:
        mdf = mdf.append(col)
    '''I unzip my points back to longitude and latitude here in the final
    dataframe so I can write to csv without tuples'''
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf
def multiprocess_combine_yield():
    '''do stuff to get dictionary below with each field name as key and values
    as all the files for that field'''
    yield_by_field = {'C01': ('files...'), ...}
    #The farm I'm working on has 30 fields and below is too slow
    for k,v in yield_by_field.items():
        combine_yield(v)
What I need help with is this: I envision using a pool to imap or apply_async over each tuple of files in the dictionary. Then, within combine_yield applied to that tuple of files, I want to be able to run the distance function in parallel. That function bogs the program down because it calculates the distance between every pair of points across the dataframes for each year of yield. The files average around 1200 data points each; multiply that by 30 fields and I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me both the value from gdB and the distance, though, because of what I do later when selecting which rows to use from the 'mdf' dataframe.
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function, and instead of doing a bunch of Series.apply calls I am now using cKDTree from scipy.spatial to find the closest point, then using a vectorized haversine distance to calculate the true distance for each matched pair of points. Much, much quicker. Here are the basics of the code:
import re
from itertools import zip_longest
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    #These column names are done as so due to formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA
def combine_yield(field):
    #same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    #convert the main coordinates to radians too, so both sides of the haversine formula use the same unit
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        #2.0902 * 10 ** 7 is the Earth's radius in feet
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
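The haversine step can be isolated into a small helper to check it on known points. This is a sketch under stated assumptions, not the poster's exact code: the function name `haversine_feet` is hypothetical, all inputs are assumed to be in degrees, and the Earth radius in feet is the same constant used above.

```python
import numpy as np

EARTH_RADIUS_FEET = 2.0902e7  # same constant as in the code above

def haversine_feet(lon1, lat1, lon2, lat2):
    """Great-circle distance in feet between points given in degrees."""
    # Convert every input to radians before applying the formula.
    lon1, lat1, lon2, lat2 = (
        np.deg2rad(np.asarray(v, dtype=float)) for v in (lon1, lat1, lon2, lat2)
    )
    d = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_FEET * np.arcsin(np.sqrt(d))

# Two test pairs: a quarter of the equator, and a point compared with itself.
lon_a = np.array([0.0, 10.0])
lat_a = np.array([0.0, 45.0])
lon_b = np.array([90.0, 10.0])
lat_b = np.array([0.0, 45.0])
print(haversine_feet(lon_a, lat_a, lon_b, lat_b))
```

The first pair should come out as a quarter of the Earth's circumference and the second as zero, which makes unit mix-ups between degrees and radians easy to spot.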

How to speed up iterating through an array / matrix? Tried both pandas and numpy arrays

I would like to do some feature enrichment through a large 2 dimensional array (15,100m).
Working on a sample set with 100'000 records showed that I need to get this faster.
Edit (data model info)
To simplify, let's say we have only two relevant columns:
IP (identifier)
Unix (timestamp in seconds since 1970)
I would like to add a 3rd column, counting how many times this IP has shown up in the past 12 hours.
End edit
My first attempt was using pandas, because it was comfortable working with named dimensions, but too slow:
for index,row in tqdm_notebook(myData.iterrows(),desc='iterrows'):
    # how many times was the IP address (and specific device) around in the prior 12h?
    hours = 12
    seen = myData[(myData['ip']==row['ip'])
                  &(myData['device']==row['device'])
                  &(myData['os']==row['os'])
                  &(myData['unix']<row['unix'])
                  &(myData['unix']>(row['unix']-(60*60*hours)))].shape[0]
    ip_seen = myData[(myData['ip']==row['ip'])
                     &(myData['unix']<row['unix'])
                     &(myData['unix']>(row['unix']-(60*60*hours)))].shape[0]
    myData.loc[index,'seen'] = seen
    myData.loc[index,'ip_seen'] = ip_seen
Then I switched to numpy arrays and hoped for a better result, but it is still too slow to run against the full dataset:
# speed test numpy arrays
for i in np.arange(myArray.shape[0]):
    hours = 12
    ip,device,os,ts = myArray[i,[0,3,4,12]]
    ip_seen = myArray[(np.where((myArray[:,0]==ip)
                                & (myArray[:,12]<ts)
                                & (myArray[:,12]>(ts-60*60*hours) )))].shape[0]
    device_seen = myArray[(np.where((myArray[:,0]==ip)
                                    & (myArray[:,2] == device)
                                    & (myArray[:,3] == os)
                                    & (myArray[:,12]<ts)
                                    & (myArray[:,12]>(ts-60*60*hours) )))].shape[0]
    myArray[i,13]=ip_seen
    myArray[i,14]=device_seen
My next idea would be to iterate only once, and maintain a growing dictionary of the current count, instead of looking backwards in every iteration.
But that would have some other drawbacks (e.g. how to keep track when to reduce count for observations falling out of the 12h window).
How would you approach this problem?
Could it be even an option to use low level Tensorflow functions to involve a GPU?
Thanks
The only way to speed up things is not looping. In your case you can try using rolling with a window of the time span that you want, using the Unix timestamp as a datetime index (assuming that records are sorted by timestamp, otherwise you would need to sort first). This should work fine for the ip_seen:
ip = myData['ip']
ip.index = pd.to_datetime(myData['unix'], unit='s')
myData['ip_seen'] = ip.rolling('5h') \
    .agg(lambda w: np.count_nonzero(w[:-1] == w[-1])) \
    .values.astype(np.int32)
However, when the aggregation involves multiple columns, as in the seen column, it gets more complicated. Currently (see Pandas issue #15095) rolling functions do not support aggregations spanning two dimensions. A workaround could be merging the columns of interest into a single new series, for example a tuple (which may work better if the values are numbers) or a string (which may be better if the values are already strings). For example:
criteria = myData['ip'] + '|' + myData['device'] + '|' + myData['os']
criteria.index = pd.to_datetime(myData['unix'], unit='s')
myData['seen'] = criteria.rolling('5h') \
    .agg(lambda w: np.count_nonzero(w[:-1] == w[-1])) \
    .values.astype(np.int32)
EDIT
Apparently rolling only works with numeric types, which leaves us with two options:
1. Manipulate the data to use numeric types. For the IP this is easy, since it actually represents a 32-bit number (or 128-bit for IPv6). For device and OS, assuming they are strings now, it gets more complicated: you would have to map each possible value to an integer and then merge it with the IP into one long value, e.g. by putting these in the higher bits or something like that (maybe even impossible with IPv6, since the biggest integers NumPy supports right now are 64 bits).
2. Roll over the index of myData (which should now not be a datetime index, because rolling cannot work with that either) and use the index window to get the necessary data and operate:
# Use sequential integer index
idx_orig = myData.index
myData.reset_index(drop=True, inplace=True)
# Index to roll
idx = pd.Series(myData.index)
idx.index = pd.to_datetime(myData['unix'], unit='s')
# Roll aggregation function
def agg_seen(w, data, fields):
    # Use slice for faster data frame slicing
    slc = slice(int(w[0]), int(w[-2])) if len(w) > 1 else []
    match = data.loc[slc, fields] == data.loc[int(w[-1]), fields]
    return np.count_nonzero(np.all(match, axis=1))
# Do rolling
myData['ip_seen'] = idx.rolling('5h') \
    .agg(lambda w: agg_seen(w, myData, ['ip'])) \
    .values.astype(np.int32)
# Write to 'seen' (not 'ip') so the IP column is not overwritten
myData['seen'] = idx.rolling('5h') \
    .agg(lambda w: agg_seen(w, myData, ['ip', 'device', 'os'])) \
    .values.astype(np.int32)
# Put index back
myData.index = idx_orig
This is not how rolling is meant to be used, though, and I'm not sure if this gives much better performance than just looping.
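An alternative that avoids both the Python-level row loop and rolling is to sort the timestamps within each group and count with a binary search. The following is a minimal sketch under assumptions, not part of the original answer: the helper name `count_prior_in_window` is hypothetical, timestamps are assumed numeric (seconds), and the strict inequalities mirror the loop in the question.

```python
import numpy as np
import pandas as pd

def count_prior_in_window(df, key_cols, ts_col, window_seconds):
    """For each row, count earlier rows with the same key inside the window.

    A row counts if its timestamp t' satisfies t - window < t' < t,
    matching the strict comparisons in the question's loop.
    """
    ts_all = df[ts_col].to_numpy()
    counts = np.zeros(len(df), dtype=np.int64)
    # .indices maps each key to the integer positions of its rows.
    for positions in df.groupby(key_cols).indices.values():
        order = np.argsort(ts_all[positions], kind="stable")
        ts = ts_all[positions][order]
        earlier = np.searchsorted(ts, ts, side="left")  # rows with t' < t
        too_old = np.searchsorted(ts, ts - window_seconds, side="right")  # t' <= t - window
        counts[positions[order]] = earlier - too_old
    return counts

myData = pd.DataFrame({
    "ip": [1, 1, 1, 2, 1],
    "unix": [0, 10, 50000, 5, 43210],
})
myData["ip_seen"] = count_prior_in_window(myData, ["ip"], "unix", 12 * 60 * 60)
print(myData)
```

Passing `["ip", "device", "os"]` as `key_cols` would give the multi-column variant, since the grouping handles string columns without any numeric encoding.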
As mentioned in my comment to @jdehesa, I took another approach, which lets me iterate through the entire dataset only once and pull the (decaying) weight from an index.
decay_window = 60*60*12 # 12 hours, in seconds
decay = 0.5 # fall by 50% every window
ip_idx = pd.DataFrame(myData.ip.unique())
ip_idx['ts_seen'] = 0
ip_idx['ip_seen'] = 0
ip_idx.columns = ['ip','ts_seen','ip_seen']
ip_idx.set_index('ip',inplace=True)
for index, row in myData.iterrows(): # all
    # How often was this IP seen?
    prior_ip_seen = ip_idx.loc[(row['ip'],'ip_seen')]
    prior_ts_seen = ip_idx.loc[(row['ip'],'ts_seen')]
    delay_since_count = row['unix']-ip_idx.loc[(row['ip'],'ts_seen')]
    new_ip_seen = prior_ip_seen*decay**(delay_since_count/decay_window)+1
    ip_idx.loc[(row['ip'],'ip_seen')] = new_ip_seen
    ip_idx.loc[(row['ip'],'ts_seen')] = row['unix']
    myData.iloc[index,14] = new_ip_seen-1
That way the result is not the fixed time window as requested initially, but prior observations "fade out" over time, giving frequent recent observations a higher weight.
This feature carries more information than the simplified (and turned out more expensive) approach initially planned.
Thanks for your input!
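The update rule above can be reduced to a few lines with a plain dictionary, which makes the decay behaviour easy to check by hand. This is a sketch for illustration only: the function name `decayed_counts` and the `(key, timestamp)` input format are assumptions, and the events are assumed to arrive sorted by time.

```python
def decayed_counts(events, window=12 * 60 * 60, decay=0.5):
    """Return, per event, the decayed count of earlier events with the same key.

    `events` is an iterable of (key, unix_timestamp) pairs sorted by time.
    """
    state = {}  # key -> (timestamp of last event, decayed count including it)
    features = []
    for key, ts in events:
        last_ts, count = state.get(key, (ts, 0.0))
        # Older observations lose `decay` of their weight per elapsed window.
        prior = count * decay ** ((ts - last_ts) / window)
        features.append(prior)
        state[key] = (ts, prior + 1.0)
    return features

events = [("a", 0), ("b", 100), ("a", 43200), ("a", 43200)]
print(decayed_counts(events))  # [0.0, 0.0, 0.5, 1.5]
```

A sighting exactly one window old contributes 0.5, and a sighting at the same timestamp contributes a full 1, so bursts of recent activity dominate the feature.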
Edit
In the meantime I switched to numpy arrays for the same operation, which now only takes a fraction of the time (loop with 200m updates in <2h).
Just in case somebody looks for a starting point:
%%time
import sys
import time
import numpy as np
import pandas as pd
## temporary lookup
ip_seen_ts = [0]*365000
ip_seen_count = [0]*365000
cnt = 0
window = 60*60*12 # 12h
decay = 0.5
counter = 0
chunksize = 10000000
store = pd.HDFStore('store.h5')
t = time.process_time()
try:
    store.remove('myCount')
except KeyError:
    print("myCount not present.")
for myHdfData in store.select_as_multiple(['myData','myFeatures'],columns=['ip','unix','ip_seen'],chunksize=chunksize):
    print(counter, time.process_time() - t)
    #display(myHdfData.head(5))
    counter+=chunksize
    t = time.process_time()
    sys.stdout.flush()
    keep_index = myHdfData.index.values
    # as_matrix() was removed in pandas 1.0; .values returns the same array
    myArray = myHdfData.values
    for row in myArray[:,:]:
        #for row in myArray:
        i = (row[0].astype('uint32')) # IP as identifier
        u = (row[1].astype('uint32')) # timestamp
        try:
            delay = u - ip_seen_ts[i]
        except:
            delay = 0
        ip_seen_ts[i] = u
        try:
            ip_seen_count[i] = ip_seen_count[i]*decay**(delay/window)+1
        except:
            ip_seen_count[i] = 1
        row[3] = np.tanh(ip_seen_count[i]-1) # tanh to normalize between 0 and 1
    myArrayAsDF = pd.DataFrame(myArray,columns=['c_ip','c_unix','c_ip2','ip_seen'])
    myArrayAsDF.set_index(keep_index,inplace=True)
    store.append('myCount',myArrayAsDF)
store.close()

randomly replace rows by other rows in python

First, the script reads the data from a file and transposes it. Everything is fine here.
Next, it generates two random samples, r1 and r2. Everything is fine here.
Last, it creates a zero matrix with the same shape as data and uses a for loop over r1 and r2 simultaneously, replacing row j with data[j,:] + data[i,:] (using print(sum(data[j,:] + data[i,:]), sum(r_data[j,:])) to check that it works). Everything is fine up to this point.
But finally, when I check r_data, all the elements are 0. I don't know why. Did I make a mistake? I appreciate your help!
PS: I tried replacing data with data = np.ones((n,k)), and it works. I really have no idea how this bug happens. It only happens when I read the data from the file.
import numpy as np
data = np.loadtxt("nonames.txt")
data = np.transpose(data)
n = len(data[:,0])
k = len(data[0,:])
# randomly merging samples
c = np.arange(n)
r1 = np.random.choice(c, 500, replace=False)
r2 = np.random.choice(c, 500, replace=False)
r_data = np.zeros((n,k))
for i,j in zip(r1,r2):
    r_data[j,:] = data[j,:] + data[i,:]
    print(sum(data[j,:] + data[i,:]), sum(r_data[j,:]))
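The loop can be written as a single indexed assignment, and counting the rows that were actually written is a quick way to inspect the result. The sketch below does not diagnose the problem with the file; it uses random stand-in data and smaller sample sizes as assumptions. One thing worth checking is that only the rows listed in r2 are ever written, so if n is much larger than 500 most of r_data legitimately stays zero, and NumPy abbreviates large arrays when printing them.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((8, 3))  # stand-in for the transposed file contents
n, k = data.shape

r1 = rng.choice(n, 5, replace=False)
r2 = rng.choice(n, 5, replace=False)

r_data = np.zeros((n, k))
# r2 has no repeated entries, so one indexed assignment equals the loop.
r_data[r2] = data[r2] + data[r1]

# Rows not listed in r2 are never written and stay zero.
rows_written = np.count_nonzero(r_data.any(axis=1))
print(rows_written)  # 5
```

Running the same count on the real r_data shows whether the assignments took effect, independent of how the array is displayed.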
