How to parallelize row dataframe computations with dask - python

I have a dataframe like the following one:
index | paper_id                                 | title                                             | embedding
0     | 000a0fc8bbef80410199e690191dc3076a290117 | PfSWIB, a potential chromatin regulator for va... | [-0.21326999, -0.39155999, 0.18850000, -0.0664...
1     | 000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a | Correlation between antimicrobial consumption ... | [-0.23322999, -0.27436000, -0.10449000, -0.536...
2     | 000b0174f992cb326a891f756d4ae5531f2845f7 | Full Title: A systematic review of MERS-CoV (M... | [0.26385999, -0.07325000, 0.03762100, -0.12043...
Where the "embedding" column is a np.array() of some length, whose elements are floats. I need to compute the cosine similarity between every pair of paper_id, and my aim is to parallelize it, since many of these computations are independent of each other. I thought dask delayed objects would be efficient for this purpose.
The code of my function is:
@dask.delayed
def cosine(vector1, vector2):
    # one can use only the very first elements of the embeddings,
    # i.e. the lengths of the embeddings must coincide
    num_elem = min(len(vector1), len(vector2))
    vec1_norm = np.linalg.norm(vector1[0:num_elem])
    vec2_norm = np.linalg.norm(vector2[0:num_elem])
    try:
        cosine = np.vdot(vector1[0:num_elem], vector2[0:num_elem]) / (vec1_norm * vec2_norm)
    except:
        cosine = 0.
    return cosine
delayed_cosine_matrix = np.eye(len(cosine_df), len(cosine_df))
for x in range(1, len(cosine_df)):
    for y in range(x):
        delayed_cosine_matrix[x, y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
        delayed_cosine_matrix[y, x] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
This however returns an error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Delayed'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-114-90cefc4986d5> in <module>
      3 for x in range(1, len(cosine_df)):
      4     for y in range(x):
----> 5         delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])

ValueError: setting an array element with a sequence.
Moreover, I would stress that I chose np.eye() since the cosine of a vector with itself is one, and I would like to exploit the symmetry of the operator, i.e.
cosine(x, y) == cosine(y, x)
Is there a way to do this efficiently and in parallel, or am I totally off track?
EDIT: I'm adding a small code snippet that reproduces the columns and layout needed for the dataframe (i.e. only "embeddings" and the index):
import numpy as np
import pandas as pd

emb_lengths = np.random.randint(100, 1000, size=100)
elements = [np.random.random(size=(1, x)) for x in emb_lengths]
my_df = pd.DataFrame(elements, columns=['embeddings'])
my_df.embeddings = my_df.embeddings.apply(lambda x: x[0])
my_df
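For what it's worth, one way to make the dask.delayed approach work is to collect the Delayed objects in a plain Python list instead of assigning them into a float matrix (np.eye() allocates a float array, so NumPy tries to cast each Delayed to float, which is exactly the TypeError above), then call dask.compute once at the end. A minimal sketch under that assumption, reusing the cosine function from the question and the my_df frame from this snippet:

import dask
import numpy as np

# one delayed task per pair (x, y) in the lower triangle
tasks, pairs = [], []
n = len(my_df)
for x in range(1, n):
    for y in range(x):
        tasks.append(cosine(my_df.embeddings[x], my_df.embeddings[y]))
        pairs.append((x, y))

results = dask.compute(*tasks)  # runs the tasks in parallel, returns plain floats

# fill the symmetric matrix from the computed values
cosine_matrix = np.eye(n)
for (x, y), value in zip(pairs, results):
    cosine_matrix[x, y] = value
    cosine_matrix[y, x] = value

Since cosine(x, y) == cosine(y, x), each pair is computed only once and the (y, x) entry is filled from the same result as (x, y).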

Related

Python- trying to make new list combining values from other list

I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df = pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])

  id1  id2
0   a    1
1   b    2
2   c    3
I want to make a list that looks like
new_ids = ['a_1', 'b_2', 'c_3']
where each value combines the row-0 value of id1 with the row-0 value of id2, and so on.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
# making separate lists version
# this function works
def get_ids(orig_df):
    id1_list = []
    id2_list = []
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return (id1_list, id2_list)

idlist1, idlist2 = get_ids(df)

# this is the part that doesn't work
new_id = []
for i, j in zip(idlist1, idlist2):
    row = '_'.join(str(idlist1[i]), str(idlist2[j]))
    new_id.append(row)
#---------------------------------------------------------------------------
#AttributeError                            Traceback (most recent call last)
#<ipython-input-44-09983bd890a6> in <module>
#      1 newid_list=[]
#      2 for i in range(len(df)):
#----> 3     n1=df['id1'[i].values]
#      4     n2=df['id2'[i].values]
#      5     nid= str(n1)+"_"+str(n2)
#AttributeError: 'str' object has no attribute 'values'
# skipping making lists (also doesn't work)
newid_list = []
for i in range(len(df)):
    n1 = df['id1'[i].values]
    n2 = df['id2'[i].values]
    nid = str(n1) + "_" + str(n2)
    newid_list.append(nid)
#---------------------------------------------------------------------------
#TypeError                                 Traceback (most recent call last)
#<ipython-input-41-6b0c949a1ad5> in <module>
#      1 new_id=[]
#      2 for i,j in zip(idlist1,idlist2):
#----> 3     row='_'.join(str(idlist1[i]),str(idlist2[j]))
#      4     new_id.append(row)
#      5 #return ', '.join(new_id)
#TypeError: list indices must be integers or slices, not str
You can do this in one vectorized line:
(df.id1 + "_" + df.id2.astype(str)).tolist()
Output:
['a_1', 'b_2', 'c_3']
Your approaches (corrected):
def get_ids(orig_df):
    id1_list = []
    id2_list = []
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return (id1_list, id2_list)

idlist1, idlist2 = get_ids(df)

# this is the part that didn't work, now corrected
new_id = []
for i, j in zip(idlist1, idlist2):
    row = '_'.join([str(i), str(j)])
    new_id.append(row)

newid_list = []
for i in range(len(df)):
    n1 = df['id1'][i]
    n2 = df['id2'][i]
    nid = str(n1) + "_" + str(n2)
    newid_list.append(nid)
Points:
- In the first approach, when you loop over the data, i and j are the data themselves, not indices, so use them directly and convert them to strings.
- join() takes a list, so simply build a list from the two values, [str(i), str(j)], and pass it to join().
- In the second approach, you can get each element of a column with df['id1'][i]; you don't need .values, which returns all elements of the column as a numpy array.
If you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
Try this:
import pandas as pd

df = pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index = 0
newid_list = []
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index += 1
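For completeness, the same result can also be produced with a list comprehension over zip, avoiding the manual counter (a sketch using the same df as above):

new_ids = [str(a) + '_' + str(b) for a, b in zip(df['id1'], df['id2'])]
# ['a_1', 'b_2', 'c_3']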

Duplicate columns & possible reduce dimensionality key error 0 Python Error

I have the following data set (screenshot not reproduced here); as you can see, its shape is 21 rows x 50 columns.
I would like to apply the following condition:
if a row has "defaultstore" == 1, then its "FinalSL" column should receive 4 times the value contained in the "FCST:TOTAL" column.
So I created the following function to do this calculation:
def SLFinal(defaultStore, fcst):
    if (defaultStore == 1):
        return (fcst * 4)
    else:
        return 2

SLFinal(DFstore.iloc[i], FcstList.iloc[i])
The function works, but I would like to apply it to my dataset, so I created the following loops to run over each row and store the data from the "defaultstore" and "FCST:TOTAL" columns:
Fcst = copiedData.iloc[:, 45:46]
FcstList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    FcstList.append(Fcst.loc[i])

DFstore = copiedData.iloc[:, 46:47]
DFstore
DFstoreList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    DFstoreList.append(DFstore.loc[i])
And finally, the new list which will contain the values after the function is applied:
FinalSLlist1 = []
for i in range(0, lenOfRows2):
    Rows = []
    for j in range(45, 50):
        Rows.append(SLFinal(DFstore[i], FcstList[i]))
    FinalSLlist1.append(Rows)
But the following error happens:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696
   2697         # duplicate columns & possible reduce dimensionality
KeyError: 0
What should I do?
You can use boolean indexing and avoid any loops like so:
df.loc[df.defaultstore==1, 'FCST:TOTAL'] *= 4
df.loc[df.defaultstore!=1, 'FCST:TOTAL'] = 2
It might be helpful to look at the pandas documentation on boolean indexing.
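For illustration, here is a minimal, self-contained sketch of that boolean indexing at work; the toy frame below is assumed, since the original data set isn't reproduced in the question:

import pandas as pd

# toy frame with just the two relevant columns (names taken from the question)
df = pd.DataFrame({'defaultstore': [1, 0, 1], 'FCST:TOTAL': [10.0, 20.0, 30.0]})

df.loc[df.defaultstore == 1, 'FCST:TOTAL'] *= 4  # rows where the condition holds
df.loc[df.defaultstore != 1, 'FCST:TOTAL'] = 2   # every other row
print(df['FCST:TOTAL'].tolist())  # [40.0, 2.0, 120.0]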
Just simply use the apply() method:
import pandas as pd

df['FCST:TOTAL'] = df.apply(lambda x: x['FCST:TOTAL']*4 if (x['defaultstore']==1) else 2, 1)
Or, if you are familiar with numpy, use the where() method, as it is more efficient than pandas' apply():
import numpy as np

df['FCST:TOTAL'] = np.where(df['defaultstore']==1, df['FCST:TOTAL']*4, 2)

"IndexError: too many indices for array" while merging VIPERS and PRIMUS

Hi, I'm trying to extract RA, Dec and redshift information from two surveys (PRIMUS and VIPERS) and collect them into a single nd-array.
The code is as follows:
from astropy.io import fits
import numpy as np

hdulist_PRIMUS = fits.open('data/PRIMUS_2013_zcat_v1.fits')
data_PRIMUS = hdulist_PRIMUS[1].data
data_PRIMUS = np.column_stack((data_PRIMUS['RA'], data_PRIMUS['DEC'],
                               data_PRIMUS['Z'], data_PRIMUS['FIELD']))
data_PRIMUS = np.array(filter(lambda x: x[3].strip() == 'xmm', data_PRIMUS))[:, :3]
data_PRIMUS = np.array(map(lambda x: [float(x[0]), float(x[1]), float(x[2])], data_PRIMUS))

hdulist_VIPERS = fits.open('data/VIPERS_W1_SPECTRO_PDR2.fits')
data_VIPERS = hdulist_VIPERS[1].data
data_VIPERS = np.column_stack((data_VIPERS['alpha'], data_VIPERS['delta'], data_VIPERS['zspec']))

from astropy import units as u
from astropy.coordinates import SkyCoord

PRIMUS_catalog = SkyCoord(ra=data_PRIMUS[:, 0]*u.degree, dec=data_PRIMUS[:, 1]*u.degree)
VIPERS_catalog = SkyCoord(ra=data_VIPERS[:, 0]*u.degree, dec=data_VIPERS[:, 1]*u.degree)

idx, d2d, d3d = PRIMUS_catalog.match_to_catalog_sky(VIPERS_catalog)

feasible_indices = np.array(map(
    lambda x: x[0],
    filter(lambda x: x[1].value > 1e-3, zip(idx, d2d))))

data_VIPERS = data_VIPERS[feasible_indices]
data_HZ = np.vstack((data_PRIMUS, data_VIPERS))
When I run this I get an "IndexError: too many indices for array".
Datasets:
PRIMUS Redshift Catalog - https://primus.ucsd.edu/version1.html
VIPERS Redshift Catalog - https://projects.ift.uam-csic.es/skies-universes/VIPERS/photometry/
I think there are a few ways you're doing this where you're making it harder for yourself by not using existing, available tools effectively. For example, since you are working with tabular data from a FITS file, you can take advantage of Astropy's Table interface:
>>> from astropy.table import Table
>>> primus = Table.read('PRIMUS_2013_zcat_v1.fits')
(for this particular file I got some warnings about some of the headers in the table being non-standard, but this can be ignored).
If you want to do some operations on just a few columns of the table, you can do this easily. For example, rather than doing what you did, of selecting a few columns together, and then stacking them into a new array
np.column_stack((data_PRIMUS['RA'], data_PRIMUS['DEC'],
                 data_PRIMUS['Z'], data_PRIMUS['FIELD']))
you can select a subset of columns from the table like so:
>>> primus[['RA', 'DEC', 'Z', 'FIELD']]
<Table length=213696>
        RA                 DEC             Z        FIELD
      degree              degree
     float64             float64        float32    bytes13
------------------ ------------------- ---------- -------------
52.892275339281994 -27.833172368069615  0.3420992         calib
 52.88448889270391  -27.85252305560996  0.4824943         calib
52.880363885710295  -27.86221750021335 0.33976158         calib
 52.88334306466262  -27.86937808271639  0.6134631         calib
   52.8866138857103 -27.871773055662942 0.58744365        calib
52.885607068267845 -27.889578785511922 0.26873255         calib
               ...                 ...        ...           ...
          34.54856             -4.5544  0.8544105           xmm
          34.56942            -4.57564  0.6331108           xmm
34.567412432719756  -4.572718190305209  1.1456184           xmm
          34.57134            -4.56414  0.6346616           xmm
          34.58088            -4.56804   1.081143           xmm
          34.58686            -4.57449  0.7471819           xmm
Then it seems you select the RA, DEC, and Z columns where the field is xmm by using a filter function, but as these are Numpy arrays you can use the filtering capabilities built into Numpy array indexing, as well as Table indexing. The only tricky part is that since these are fixed width character fields you do still need to perform comparisons correctly. You can use Numpy's string functions like np.char.startswith for this:
>>> primus = primus[np.char.startswith(primus['FIELD'], b'xmm')]
In the process of doing a performance comparison, I realized this line is where you're probably getting the error IndexError: too many indices for array:
>>> np.array(filter(lambda x: x[3].strip() == 'xmm', primus))
array(<filter object at 0x7f5170981940>, dtype=object)
In Python 3, the filter function returns an iterable, so wrapping it in np.array() just makes a 0-D array containing this Python object; it's probably not what you intended, so it fails here (this is where looking at the traceback might have been useful). Even if you wrapped the filter() call in list() it wouldn't work, because np.array() only takes homogeneous arrays normally. So an approach like the one I gave is perfectly sufficient (though there may be slightly more efficient ways). It also makes the next line:
np.array(map(lambda x: [float(x[0]), float(x[1]), float(x[2])], data_PRIMUS))
unnecessary. In particular, the first three columns are already in floating point format so this would not be necessary anyways.
Some similar advice applies to the other parts of your code. I'd have written it more like this:
import numpy as np
from astropy.table import Table, vstack
from astropy import units as u
from astropy.coordinates import SkyCoord

primus = Table.read('PRIMUS_2013_zcat_v1.fits')
primus_field = primus['FIELD']
primus = primus[['RA', 'DEC', 'Z']]
primus = primus[np.char.startswith(primus_field, b'xmm')]

vipers = Table.read('VIPERS_W1_SPECTRO_PDR2.fits')[['alpha', 'delta', 'zspec']]

primus_catalog = SkyCoord(ra=primus['RA']*u.degree, dec=primus['DEC']*u.degree)
vipers_catalog = SkyCoord(ra=vipers['alpha']*u.degree, dec=vipers['delta']*u.degree)

idx, d2d, d3d = primus_catalog.match_to_catalog_sky(vipers_catalog)
feasible_indices = idx[d2d > 1e-3 * u.degree]  # d2d is an Angle, so compare against a quantity with units

vipers = vipers[feasible_indices]
vipers.rename_columns(['alpha', 'delta', 'zspec'], ['RA', 'DEC', 'Z'])

hz = vstack([primus, vipers])  # vstack takes a list of tables
Please let me know if there are any parts of this you have questions on.

Ensuring performance of sketching/streaming algorithm (countSketch)

I have implemented what is known as a countSketch in python (page 17: https://arxiv.org/pdf/1411.4357.pdf), but my implementation is currently lacking in performance. The algorithm computes the product SA, where A is an n x d matrix and S is an m x n matrix defined as follows: for every column of S, uniformly at random select a row (hash bucket) from the m rows, and for that given row uniformly at random select +1 or -1. So S is a matrix with exactly one nonzero in every column and zeros everywhere else.
My intention is to compute SA in a streaming fashion by reading the entries of A. The idea for my implementation is as follows: observe a sequence of triples (i, j, A_ij) and return a sequence (h(i), j, s(i)*A_ij), where:
- h(i) is a hash bucket (a row of S chosen uniformly at random from the m possible rows), and
- s(i) is the random sign function as described above.
I have assumed that the matrix is in row order so that the first row of A arrives in its entirety before the next row of A arrives because this limits the number of calls I need to select a random bucket or the need to use a hash library. I have also assumed that the number of nonzero entries (or the length of the input stream) is known so that I can efficiently encode the iteration.
My problem is that the sketch should satisfy (1 - error)*||Ax||^2 <= ||SAx||^2 <= (1 + error)*||Ax||^2, and should also have a small difference in Frobenius norm between A^T S^T S A and A^T A. However, while the first condition seems to hold for my implementation, the latter quantity is consistently too small, and I was wondering if there is an obvious reason I am missing for why it is being underestimated.
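For reference, both properties can be checked directly once the sketch has been formed. A minimal sketch of those checks, assuming A and SA are dense numpy arrays (call .toarray() on the sparse matrices from the code below first) and x is a test vector:

import numpy as np

def embedding_ratio(A, SA, x):
    # should lie in [1 - error, 1 + error] for a valid sketch
    return np.linalg.norm(SA @ x)**2 / np.linalg.norm(A @ x)**2

def frobenius_gap(A, SA):
    # || A^T S^T S A - A^T A ||_F, the quantity reported as too small
    return np.linalg.norm(SA.T @ SA - A.T @ A, ord='fro')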
I am open to feedback on changing the code if there are obvious improvements to be made. The single call to npr.choice per row is made to remove the need to look through a (potentially large) array or hash table containing the row hashes for each row; because the matrix is in row order, we can just keep the hash for that row until a new row is seen.
nb. If you don't want to run using numba then just comment out the import and the function decorator and it will run in standard numpy/scipy.
import numpy as np
import numpy.random as npr
import scipy.sparse as sparse
from scipy.sparse import coo_matrix
import numba
from numba import jit

@jit(nopython=True)  # comment this out if you want just numpy
def countSketch(input_rows, input_data,
                input_nnz,
                sketch_size, seed=None):
    '''
    input_rows: row indices for data (can be repeats)
    input_data: values seen in row location,
    input_nnz : number of nonzeros in the data (can replace with
                len(input_data) but avoided here for speed)
    sketch_size: int
    seed=None : random seed
    '''
    hashed_rows = np.empty(input_rows.shape, dtype=np.int32)
    current_row = 0
    hash_val = npr.choice(sketch_size)
    sign_val = npr.choice(np.array([-1.0, 1.0]))
    hashed_rows[0] = hash_val
    for idx in np.arange(input_nnz):
        row_id = input_rows[idx]
        data_val = input_data[idx]
        if row_id == current_row:
            hashed_rows[idx] = hash_val
            input_data[idx] = sign_val * data_val
        else:
            # make new hashes
            hash_val = npr.choice(sketch_size)
            sign_val = npr.choice(np.array([-1.0, 1.0]))
            hashed_rows[idx] = hash_val
            input_data[idx] = sign_val * data_val
    return hashed_rows, input_data
def sort_row_order(input_data):
    sorted_row_column = np.array((input_data.row,
                                  input_data.col,
                                  input_data.data))
    idx = np.argsort(sorted_row_column[0])
    sorted_rows = np.array(sorted_row_column[0, idx], dtype=np.int32)
    sorted_cols = np.array(sorted_row_column[1, idx], dtype=np.int32)
    sorted_data = np.array(sorted_row_column[2, idx], dtype=np.float64)
    return sorted_rows, sorted_cols, sorted_data
if __name__ == "__main__":
    import time
    from tabulate import tabulate

    matrix = sparse.random(1000, 50, 0.1)
    x = np.random.randn(matrix.shape[1])
    true_norm = np.linalg.norm(matrix @ x, ord=2)**2
    tidy_data = sort_row_order(matrix)

    sketch_size = 300
    start = time.time()
    hashed_rows, sketched_data = countSketch(tidy_data[0],
                                             tidy_data[2], matrix.nnz, sketch_size)
    duration_slow = time.time() - start
    S_A = sparse.coo_matrix((sketched_data, (hashed_rows, matrix.col)))
    approx_norm_slow = np.linalg.norm(S_A @ x, ord=2)**2
    rel_error_slow = approx_norm_slow / true_norm
    #print("Sketch time: {}".format(duration_slow))

    start = time.time()
    hashed_rows, sketched_data = countSketch(tidy_data[0],
                                             tidy_data[2], matrix.nnz, sketch_size)
    duration = time.time() - start
    #print("Sketch time: {}".format(duration))
    S_A = sparse.coo_matrix((sketched_data, (hashed_rows, matrix.col)))
    approx_norm = np.linalg.norm(S_A @ x, ord=2)**2
    rel_error = approx_norm / true_norm
    #print("Relative norms: {}".format(approx_norm/true_norm))

    print(tabulate([[duration_slow, rel_error_slow, 'Yes'],
                    [duration, rel_error, 'No']],
                   headers=['Sketch Time', 'Relative Error', 'Dry Run'],
                   tablefmt='orgtbl'))

Trying to add results to an array in Python

I have 2 matrices and I want to save the Euclidean distance of each row in an array so that afterwards I can work with the data (kNN / KNeighbours). I use a temporary counter named k so I can later create a matrix from that array (2 columns x n rows; each row will contain the distance from position n of the array; in this case, k is that n).
import numpy as np

v1 = np.matrix('1,2;3,4')
v2 = np.matrix('5,6;7,8')
k = 0
for i in v1:
    distancias.append(k) = np.linalg.norm(v2 - v1[k,:])
    print(distancias[k])
    k = k + 1
It gives me an error:
  File "<ipython-input-44-4d3546d9ade5>", line 10
    distancias.append(k)=np.linalg.norm(v2-v1[k,:])
                        ^
SyntaxError: can't assign to function call
And I do not really know what the syntax error is.
I also tried:
import numpy as np

v1 = np.matrix('1,2;3,4')
v2 = np.matrix('5,6;7,8')
k = 0
for i in v1:
    valor = np.linalg.norm(v2 - v1[k,:])
    distancias.append(valor)
    print(distancias[k])
    k = k + 1
And in this case the error is:
AttributeError                            Traceback (most recent call last)
<ipython-input-51-8a48ca0267d5> in <module>()
      9
     10     valor=np.linalg.norm(v2-v1[k,:])
---> 11     distancias.append(valor)
     12     print(distancias[k])
     13     k=k+1

AttributeError: 'numpy.float64' object has no attribute 'append'
You are trying to assign data to a function call, which is not possible. (The AttributeError in your second attempt means distancias was bound to a numpy scalar rather than a list, so it must be defined as a list first.) If you want to add the data computed by linalg.norm() to the list distancias, you can do it as shown below.
import numpy as np

v1 = np.matrix('1,2;3,4')
v2 = np.matrix('5,6;7,8')
k = 0
distancias = []  # define distancias as a list before appending to it
for i in v1:
    distancias.append(np.linalg.norm(v2 - v1[k,:]))
    print(distancias[k])
    k = k + 1
print(distancias)
Output
10.1980390272
6.32455532034
[10.198039027185569, 6.324555320336759]
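As a side note, the loop can be replaced entirely by one broadcast expression. A sketch that reproduces the same two values, assuming the same v1 and v2 (converted to plain arrays):

import numpy as np

v1 = np.asarray(np.matrix('1,2;3,4'))
v2 = np.asarray(np.matrix('5,6;7,8'))

# v2[None] - v1[:, None] has shape (2, 2, 2): one difference matrix per row of v1;
# the norm over the last two axes gives the per-row values the loop printed.
distancias = np.linalg.norm(v2[None, :, :] - v1[:, None, :], axis=(1, 2))
print(distancias)  # [10.19803903  6.32455532]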
