randomly replace rows by other rows in python

randomly replace rows by other rows in python - python

First, it reads the data from file and transpose. Everything is fine here.
Next, it generates two random samples r1 and r2. Everything is fine here.
Last, it generates a matrix with the same shape as data, and use for loop to loop r1 and r2 simultaneously, then replace row j by data[j,:] + data[i,:]. (using print(sum(data[j,:] + data[i,:]), sum(r_data[j,:])) to check if it works). Everything is fine till now.
But finally, when I check r_data, all the elements are 0. I don't why, did I make any mistakes? Appreciate for you help!!!
PS, I tried replacing data by data = np.ones((n,k)), and it works. I really have no idea how this bug happens. It only happens when I read the data from file.
import numpy as np
data = np.loadtxt("nonames.txt")
data = np.transpose(data)
n = len(data[:,0])
k = len(data[0,:])
# randomly merging samples
c = np.arange(n)
r1 = np.random.choice(c, 500, replace=False)
r2 = np.random.choice(c, 500, replace=False)
r_data = np.zeros((n,k))
for i,j in zip(r1,r2):
r_data[j,:] = data[j,:] + data[i,:]
print(sum(data[j,:] + data[i,:]), sum(r_data[j,:]))

Related

How to multicore processing a for loop with iterrows in python

I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size for each row and calculates a score from a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
omega = []
AA=list('FYW')
for i, row in df.iterrows():
seq = df['IDRseq'][i]
b = df['bsize'][i]
bsize = [b-1,b]
SeqOb = SequenceParameters(seq,blobsize=bsize)
omega.append(SeqOb.get_kappa_X(AA))
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)

After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers = n)
# nb_workers = n ; I set the nb_workers fo CPU core - 1 so the system is more stable
def something(x):
#do stuff
return result
df['result'] = df.parallel_apply(something, axis=1)

Parallelizing for loops in python

I know similar questions on this topic have been asked before, but I'm still struggling to make any headway with my problem.
Basically, I have three dataframes (of sizes 402 x 402, 402 x 3142, and 1 x 402) and I'm combining elements from them into a calculation. I then write the calculation to another dataframe - see code below using dummy data. Each calculation takes between 0.3-0.8 ms, but there are (402 x 3142)^2 total calculations, which obviously takes a long time!
Since none of the calculations is dependent on any other, this is ripe for parallelization, but I'm really having a hard time figuring out how to do this - sorry the code is probably pretty ugly, very new to python, and parallel computing.
One additional thing to note is that the non-vector matrices are sparse (0.4 and 0.3, respectively), so could be changed to coordinate or compressed row/column format so that not all of the possible combinations of calculations need to be made. This might reduce the time by half.
import pandas as pd
A = pd.DataFrame(np.random.choice([0, 1], size=(402,402), p=[0.6,0.4]))
B = pd.DataFrame(np.random.choice([0, 1], size=(402,3142), p=[0.7,0.3]))
x = A.sum(axis = 1)
col_names = ["R", "I", "S", "J","value"]
results = pd.DataFrame(columns = col_names)
row = 0
for r in B.columns:
for s in B.columns:
for i in A.index:
for j in A.columns:
results.loc[row,"R"] = r
results.loc[row,"I"] = i
results.loc[row,"S"] = s
results.loc[row,"J"] = j
results.loc[row, "value"] = A.loc[i,j]*B.loc[j,s]*B.loc[i,r]/x[i]
row = row + 1

How to multiprocess finding closest geographic point in two pandas dataframes?

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
return_columns = []
return_columns_cb = lambda x: return_columns.append(x)
def getnearestpoint(gdA, gdB, retcol):
dist = lambda point1, point2: distance.great_circle(point1, point2).feet
def find_closest(point):
distances = gdB.apply(
lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
)
return (gdB.loc[distances.idxmin(), retcol], distances.min())
append_retcol = gdA.apply(
lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
)
return append_retcol
def combine_yield(field):
#field is a list of the files for the field I'm working with
#lots of pre-processing
#dfs in this case is a list of the dataframes for the current field
#mdf is the dataframe with the most points which I poppped from this list
p = Pool()
for i in range(0, len(dfs)):
p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]), callback=return_cols_cb)
for col in return_columns:
mdf = mdf.append(col)
'''I unzip my points back to longitude and latitude here in the final
dataframe so I can write to csv without tuples'''
mdf[["Longitude", "Latitude"]] = pd.DataFrame(
mdf["Point"].tolist(), index=mdf.index
)
return mdf
def multiprocess_combine_yield():
'''do stuff to get dictionary below with each field name as key and values
as all the files for that field'''
yield_by_field = {'C01': ('files...'), ...}
#The farm I'm working on has 30 fields and below is too slow
for k,v in yield_by_field.items():
combine_yield(v)
I guess what I need help on is I envision something like using a pool to imap or apply_async on each tuple of files in the dictionary. Then within the combine_yield function when applied to that tuple of files, I want to to be able to parallel process the distance function. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points and then you multiply all of that by 30 fields and I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB, and the distance though because of what I do later on when selecting which rows to use from the 'mdf' dataframe.

Thanks to #ALollz comment, I figured this out. I went back to my getnearestpoint function and instead of doing a bunch of Series.apply I am now using cKDTree from scipy.spatial to find the closest point, and then using a vectorized haversine distance to calculate the true distances on each of these matched points. Much much quicker. Here are the basics of the code below:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
def getnearestpoint(gdA, gdB, retcol):
gdA_coordinates = np.array(
list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
)
gdB_coordinates = np.array(
list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
)
tree = cKDTree(data=gdB_coordinates)
distances, indices = tree.query(gdA_coordinates, k=1)
#These column names are done as so due to formatting of my 'retcols'
df = pd.DataFrame.from_dict(
{
f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
retcol: gdB.loc[indices, retcol].values,
}
)
gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
gdA.drop(columns="key_0", inplace=True)
return gdA
def combine_yield(field):
#same preprocessing as before
for i in range(0, len(dfs)):
mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
main_coords = np.array(list(zip(mdf.Longitude, mdf.Latitude)))
lat_main = main_coords[:, 1]
longitude_main = main_coords[:, 0]
longitude_cols = [
c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
]
latitude_cols = [
c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
]
year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
for i in year_coords:
year = re.search(r"\d{4}", i[0]).group(0)
year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
year_coords = np.deg2rad(year_coords)
lat_year = year_coords[:, 1]
longitude_year = year_coords[:, 0]
diff_lat = lat_main - lat_year
diff_lng = longitude_main - longitude_year
d = (
np.sin(diff_lat / 2) ** 2
+ np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
)
mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.

Ensuring performance of sketching/streaming algorithm (countSketch)

I have implemented what is know as a countSketch in python (page 17: https://arxiv.org/pdf/1411.4357.pdf) but my implementation is currently lacking in performance. The algorithm is to compute the product SA where A is an n x d matrix, S is m x n matrix defined as follows: for every column of S uniformly at random select a row (hash bucket) from the m rows and for that given row, uniformly at random select +1 or -1. So S is a matrix with exactly one nonzero in every column and otherwise all zero.
My intention is to compute SA in a streaming fashion by reading the entries of A. The idea for my implementation is as follows: observe a sequence of triples (i,j,A_ij) and return a sequence (h(i), j, s(i)A_ij) where:
- h(i) is a hash bucket (row of matrix chosen uniformly at random from the m possible rows of S
- s(i) is the random sign function as described above.
I have assumed that the matrix is in row order so that the first row of A arrives in its entirety before the next row of A arrives because this limits the number of calls I need to select a random bucket or the need to use a hash library. I have also assumed that the number of nonzero entries (or the length of the input stream) is known so that I can efficiently encode the iteration.
My problem is that the matrix should compute (1+error)*||Ax||^2 <= ||SAx||^2 <= (1+error)*||Ax||^2 and also have the difference in frobenius norms between A^T S^T S A and A^T A being small. However, while my implementation for the first condition seems to be true, the latter is consistently too small. I was wondering if there is an obvious reason for this that I am missing because it seems to be underestimating the latter quantity.
I am open to feedback on changing the code if there are obvious improvements to be made. The single call to np.choice is made to remove the need to look through a (potentially large) array or hash table containing the row hashes for each row and because the matrix is in row order we can just keep the hash for that row until a new row is seen.
nb. If you don't want to run using numba then just comment out the import and the function decorator and it will run in standard numpy/scipy.
import numpy as np
import numpy.random as npr
import scipy.sparse as sparse
from scipy.sparse import coo_matrix
import numba
from numba import jit
#jit(nopython=True) # comment this if want just numpy
def countSketch(input_rows, input_data,
input_nnz,
sketch_size, seed=None):
'''
input_rows: row indices for data (can be repeats)
input_data: values seen in row location,
input_nnz : number of nonzers in the data (can replace with
len(input_data) but avoided here for speed)
sketch_size: int
seed=None : random seed
'''
hashed_rows = np.empty(input_rows.shape,dtype=np.int32)
current_row = 0
hash_val = npr.choice(sketch_size)
sign_val = npr.choice(np.array([-1.0,1.0]))
#print(hash_val)
hashed_rows[0] = hash_val
#print(hash_val)
for idx in np.arange(input_nnz):
print(idx)
row_id = input_rows[idx]
data_val = input_data[idx]
if row_id == current_row:
hashed_rows[idx] = hash_val
input_data[idx] = sign_val*data_val
else:
# make new hashes
hash_val = npr.choice(sketch_size)
sign_val = npr.choice(np.array([-1.0,1.0]))
hashed_rows[idx] = hash_val
input_data[idx] = sign_val*data_val
return hashed_rows, input_data
def sort_row_order(input_data):
sorted_row_column = np.array((input_data.row,
input_data.col,
input_data.data))
idx = np.argsort(sorted_row_column[0])
sorted_rows = np.array(sorted_row_column[0,idx], dtype=np.int32)
sorted_cols = np.array(sorted_row_column[1,idx], dtype=np.int32)
sorted_data = np.array(sorted_row_column[2,idx], dtype=np.float64)
return sorted_rows, sorted_cols, sorted_data
if __name__=="__main__":
import time
from tabulate import tabulate
matrix = sparse.random(1000, 50, 0.1)
x = np.random.randn(matrix.shape[1])
true_norm = np.linalg.norm(matrix#x,ord=2)**2
tidy_data = sort_row_order(matrix)
sketch_size = 300
start = time.time()
hashed_rows, sketched_data = countSketch(tidy_data[0],\
tidy_data[2], matrix.nnz,sketch_size)
duration_slow = time.time() - start
S_A = sparse.coo_matrix((sketched_data, (hashed_rows,matrix.col)))
approx_norm_slow = np.linalg.norm(S_A#x,ord=2)**2
rel_error_slow = approx_norm_slow/true_norm
#print("Sketch time: {}".format(duration_slow))
start = time.time()
hashed_rows, sketched_data = countSketch(tidy_data[0],\
tidy_data[2], matrix.nnz,sketch_size)
duration = time.time() - start
#print("Sketch time: {}".format(duration))
S_A = sparse.coo_matrix((sketched_data, (hashed_rows,matrix.col)))
approx_norm = np.linalg.norm(S_A#x,ord=2)**2
rel_error = approx_norm/true_norm
#print("Relative norms: {}".format(approx_norm/true_norm))
print(tabulate([[duration_slow, rel_error_slow, 'Yes'],
[duration, rel_error, 'No']],
headers=['Sketch Time', 'Relative Error', 'Dry Run'],
tablefmt='orgtbl'))

How to find which input value in a loop yielded the output min?

I am trying to solve a min value problem, I could obtain the min values from two loops but, what I really need is also the exact values that correspended to output min.
from __future__ import division
from numpy import*
b1=0.9917949
b2=0.01911
b3=0.000840
b4=0.10175
b5=0.000763
mu=1.66057*10**(-24) #gram
c=3.0*10**8
Mler=open("olasiM.txt","w+")
data=zeros(0,'float')
for A in range(1,25):
M2=zeros(0,'float')
print 'A=',A
for Z in range(1,A+1):
SEMF=mu*c**2*(b1*A+b2*A**(2./3.)-b3*Z+b4*A*((1./2.)-(Z/A))**2+(b5*Z**2)/(A**(1./3.)))
SEMF=array(SEMF)
M2=hstack((M2,SEMF))
minm2=min(M2)
data=hstack((data,minm2))
data=hstack((data,A))
datalist = data.tolist()
for i in range (len(datalist)):
Mler.write(str(datalist[i])+'\n')
Mler.close()
Here, what I want is to see the min value of the SEMF and, corresponding A,Z values, For example, it has to be A=1, Z=1 and SEMF= some#
I also don't know how to write these, A and Z values to the document

The big advantage of numpy over using python lists is vectorized operations. Unfortunately your code fails completely in using them. For example the whole inner loop that has Z as index can easily be vectorized. You instead are computing the single elements using python floats and then stacking them one by one in the numpy array M2.
So I'd refactor that part of the code by:
import numpy as np
# ...
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
Here the SEMF array should be exactly what you'd obtain as the final M2 array. Now you can find the minimum and stack that value into your data array:
min_val = SEMF.min()
data = hstack((data,minm2))
data = hstack((data,A))
If you also what to keep track for which value of Z you got the minimum you can use the argmin method:
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data,np.array([min_val, min_pos, A])))
The final code should look like:
from __future__ import division
import numpy as np
b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24) #gram
c = 3.0*10**8
data=zeros(0,'float')
for A in range(1,25):
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data,np.array([min_val, min_pos, A])))
datalist = data.tolist()
with open("olasiM.txt","w+") as mler:
for i in range (len(datalist)):
mler.write(str(datalist[i])+'\n')
Note that numpy provides some functions to save/load array to/from files, like savetxt so I suggest that instead of manually saving the values there to use these functions.
Probably some numpy expert could vectorize also the operations for the As. Unfortunately my numpy knowledge isn't that advanced and I don't know how the handle the fact that we'd have a variable number of dimensions due to the range(1, A+1) thing...

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

randomly replace rows by other rows in python - python

Related

How to multicore processing a for loop with iterrows in python

Parallelizing for loops in python

How to multiprocess finding closest geographic point in two pandas dataframes?

Ensuring performance of sketching/streaming algorithm (countSketch)

How to find which input value in a loop yielded the output min?

Categories

Resources