I have two matrices and I want to save the Euclidean distance for each row in an array so that I can work with the data afterwards (k-nearest neighbours). I use a temporary counter named k so that I can later build a matrix from that array (2 columns x n rows, where each row holds the distance stored at position n of the array; here k is that n).
import numpy as np
v1=np.matrix('1,2;3,4')
v2=np.matrix('5,6;7,8')
k=0
for i in v1:
    distancias.append(k)=np.linalg.norm(v2-v1[k,:])
    print(distancias[k])
    k=k+1
It gives me an error:
File "<ipython-input-44-4d3546d9ade5>", line 10
distancias.append(k)=np.linalg.norm(v2-v1[k,:])
^
SyntaxError: can't assign to function call
And I do not really understand what this syntax error means.
I also tried:
import numpy as np
v1=np.matrix('1,2;3,4')
v2=np.matrix('5,6;7,8')
k=0
for i in v1:
    valor=np.linalg.norm(v2-v1[k,:])
    distancias.append(valor)
    print(distancias[k])
    k=k+1
And in this case the error is:
AttributeError Traceback (most recent call last)
<ipython-input-51-8a48ca0267d5> in <module>()
9
10 valor=np.linalg.norm(v2-v1[k,:])
---> 11 distancias.append(valor)
12 print(distancias[k])
13 k=k+1
AttributeError: 'numpy.float64' object has no attribute 'append'
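You are trying to assign a value to a function call, which Python does not allow: distancias.append(k) is a call, so it cannot appear on the left of =. In your second attempt, distancias is evidently a numpy.float64 (presumably left over from an earlier cell), which is why it has no append method. Initialise distancias as an empty list and pass the value computed by np.linalg.norm() to append(), as shown below.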
import numpy as np
v1=np.matrix('1,2;3,4')
v2=np.matrix('5,6;7,8')
k=0
distancias = []
for i in v1:
    distancias.append(np.linalg.norm(v2-v1[k,:]))
    print(distancias[k])
    k=k+1
print(distancias)
Output
10.1980390272
6.32455532034
[10.198039027185569, 6.324555320336759]
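Note that np.linalg.norm() without an axis argument returns the Frobenius norm of the whole difference matrix, so each value above combines every row of v2 with one row of v1. If what you need is one Euclidean distance per row, a minimal sketch could look like the following. It assumes plain 2-D arrays instead of np.matrix, and the names row_distances and pairwise are only illustrative.

```python
import numpy as np

# Plain 2-D arrays with the same values as in the question.
v1 = np.array([[1, 2], [3, 4]])
v2 = np.array([[5, 6], [7, 8]])

# Distance between row k of v1 and row k of v2 (one value per row).
row_distances = np.linalg.norm(v2 - v1, axis=1)

# Distance from every row of v1 to every row of v2, via broadcasting.
# pairwise[i, j] is the distance between v1[i] and v2[j].
pairwise = np.linalg.norm(v1[:, None, :] - v2[None, :, :], axis=2)

print(row_distances)
print(pairwise)
```

For kNN, the pairwise form is usually the one to start from, since each row of pairwise can then be sorted to find the nearest neighbours.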
I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
id1 id2
0 a 1
1 b 2
2 c 3
I want to make a list that looks like
new_ids=['a_1','b_2','c_3'], where each entry combines the id1 value of a row with the id2 value of the same row.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
#making separate lists version
#this function works
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1,idlist2=get_ids(df)
#this is the part that doesn't work
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join(str(idlist1[i]),str(idlist2[j]))
    new_id.append(row)
#------------------------------------------------------------------------
#AttributeError Traceback (most recent call #last)
#<ipython-input-44-09983bd890a6> in <module>
# 1 newid_list=[]
# 2 for i in range(len(df)):
#----> 3 n1=df['id1'[i].values]
# 4 n2=df['id2'[i].values]
# 5 nid= str(n1)+"_"+str(n2)
#AttributeError: 'str' object has no attribute 'values'
#skipping making lists (also doesn't work)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'[i].values]
    n2=df['id2'[i].values]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
#---------------------------------------------------------------------------
#TypeError Traceback (most recent call last)
#<ipython-input-41-6b0c949a1ad5> in <module>
# 1 new_id=[]
# 2 for i,j in zip(idlist1,idlist2):
#----> 3 row='_'.join(str(idlist1[i]),str(idlist2[j]))
# 4 new_id.append(row)
# 5 #return ', '.join(new_id)
#TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
your approaches (corrected):
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1, idlist2=get_ids(df)
#corrected: i and j are the values themselves, and join takes a single list
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join([str(i),str(j)])
    new_id.append(row)
#corrected: index the column directly instead of going through values
newid_list=[]
for i in range(len(df)):
    n1=df['id1'][i]
    n2=df['id2'][i]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
points:
In the first approach, zip yields the values themselves, so i and j are data, not indices; use them directly and convert them to strings.
join takes a single iterable, so build a list from the two values, [str(i),str(j)], and pass that to join.
In the second approach, df['id1'][i] returns element i of the column; you do not need values, which returns the whole column as a NumPy array.
if you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
Try this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index=0
newid_list=[]
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index+=1
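The loops in the answers above can also be collapsed into a single list comprehension over zip. This is only a sketch on the mock data from the question; it assumes the columns can be iterated in row order, which holds for a DataFrame like this one.

```python
import pandas as pd

df = pd.DataFrame(data=[["a", 1], ["b", 2], ["c", 3]], columns=["id1", "id2"])

# zip pairs the id1 and id2 values row by row; the f-string converts
# both values to text and joins them with an underscore.
new_ids = [f"{a}_{b}" for a, b in zip(df["id1"], df["id2"])]

print(new_ids)
```

Because zip hands back the values directly, there is no index bookkeeping and no intermediate lists.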
I have a dataframe like the following one:
index  paper_id                                  title                                               embedding
0      000a0fc8bbef80410199e690191dc3076a290117  PfSWIB, a potential chromatin regulator for va...  [-0.21326999, -0.39155999, 0.18850000, -0.0664...
1      000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a  Correlation between antimicrobial consumption ...  [-0.23322999, -0.27436000, -0.10449000, -0.536...
2      000b0174f992cb326a891f756d4ae5531f2845f7  Full Title: A systematic review of MERS-CoV (M...  [0.26385999, -0.07325000, 0.03762100, -0.12043...
Each entry of the "embedding" column is a NumPy array of floats, and the arrays may differ in length. I need to compute the cosine similarity between every pair of paper_id values, and I would like to parallelize this since many of these computations are independent of each other. I thought dask delayed objects would be a good fit for this purpose:
The code of my function is:
@dask.delayed
def cosine(vector1, vector2):
    #one can use only the very first elements of the embeddings, i.e. lengths of the embeddings must coincide
    num_elem = min(len(vector1), len(vector2))
    vec1_norm = np.linalg.norm(vector1[0:num_elem])
    vec2_norm = np.linalg.norm(vector2[0:num_elem])
    try:
        cosine = np.vdot(vector1[0:num_elem], vector2[0:num_elem])/(vec1_norm*vec2_norm)
    except:
        cosine = 0.
    return cosine
delayed_cosine_matrix = np.eye(len(cosine_df),len(cosine_df))
for x in range(1, len(cosine_df)):
    for y in range(x):
        delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
        delayed_cosine_matrix[y,x] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
This however returns an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Delayed'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-114-90cefc4986d5> in <module>
3 for x in range(1, len(cosine_df)):
4 for y in range(x):
----> 5 delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
ValueError: setting an array element with a sequence.
I chose np.eye() because the cosine similarity of a vector with itself is one, and I would like to exploit the symmetry of the operator, i.e.
cosine(x,y) == cosine(y,x)
Is there a way to compute this efficiently and in parallel, or am I completely off track?
EDIT: I'm adding a small code snippet that reproduces the columns and layout needed for the dataframe (i.e. only "embeddings" and the index)
import numpy as np
import pandas as pd
emb_lengths = np.random.randint(100, 1000, size = 100)
elements = [np.random.random(size = (1,x)) for x in emb_lengths ]
my_df = pd.DataFrame(elements, columns = ['embeddings'])
my_df.embeddings = my_df.embeddings.apply(lambda x: x[0])
my_df
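The traceback above arises because the matrix created by np.eye() holds floats, so a Delayed object cannot be stored in it. As a baseline without dask, here is a minimal sketch in plain NumPy that keeps the pairwise truncation from the question and computes each pair only once, mirroring the result to use the symmetry. The function name cosine_matrix and the short random embeddings are assumptions made for illustration, and nothing here is parallelized.

```python
import numpy as np

def cosine(vector1, vector2):
    # Compare only the leading elements that both vectors share.
    num_elem = min(len(vector1), len(vector2))
    a = np.asarray(vector1[:num_elem], dtype=float)
    b = np.asarray(vector2[:num_elem], dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Return 0.0 for a zero vector instead of dividing by zero.
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def cosine_matrix(embeddings):
    n = len(embeddings)
    sim = np.eye(n)  # the cosine of a vector with itself is 1
    for x in range(1, n):
        for y in range(x):
            # Compute each pair once and write it to both triangles.
            sim[x, y] = sim[y, x] = cosine(embeddings[x], embeddings[y])
    return sim

# Hypothetical embeddings of different lengths, standing in for the column.
rng = np.random.default_rng(0)
embeddings = [rng.random(size=length) for length in (5, 8, 6)]
sim = cosine_matrix(embeddings)
print(sim)
```

If all embeddings could be cut to one common length up front, the double loop could be replaced by a single matrix product of the normalised rows, but that changes the pairwise truncation used in the question.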
I am trying to convert an array into a data frame. The array is shown below.
Array =
[[array([[1.28327719, 0.86585652, 0.66163084],
[1.80828697, 1.24887998, 0.70235812],
[2.66044828, 1.35045788, 0.68215603],
[1.33065745, 1.4577204 , 0.75933679]]),
array([[1.28560483, 0.98658628, 0.67595305],
[1.73489671, 1.482433 , 0.71539607],
[1.29564167, 1.44918617, 0.74288636],
[2.43989581, 1.19118473, 0.64724577]]),
array([[1.27456576, 1.57166264, 0.854981 ],
[1.87001532, 1.57796163, 0.66740871],
[2.74672303, 1.29211241, 0.63669436],
[1.35104199, 0.84856452, 0.69297247]]),
array([[1.38296077, 0.91410661, 0.68056606],
[1.68320947, 1.42367818, 0.6659204 ],
[1.26965674, 1.55126723, 0.73756696],
[2.28880844, 1.27031044, 0.66577891]])],
[array([[1.72877886, 1.47973077, 0.68263402],
[2.28954891, 1.47387583, 0.72014133],
[1.25488202, 1.52890787, 0.72603781],
[1.36624708, 1.02959695, 0.72986648]]),
array([[1.78269554, 1.45968652, 0.65845671],
[1.29550163, 1.56630194, 0.80255398],
[1.33910381, 1.06375653, 0.73887124],
[2.99602633, 1.32380946, 0.71921367]]),
array([[1.32761929, 0.86097994, 0.61124086],
[1.36946819, 1.64210996, 0.66995842],
[1.29004191, 1.69784434, 1.17951575],
[2.29966943, 1.71713578, 0.62684209]]),
array([[1.50548041, 1.56619072, 0.64304549],
[2.38288223, 1.6995361 , 0.62946513],
[1.28558107, 0.78421077, 0.60182813],
[1.22364377, 1.6643322 , 1.00434432]])]]
pd.DataFrame(centroid)
0 1 2 3
0 [[1.283277189792161, 0.8658565155306925, 0.661... [[1.2856048285071469, 0.9865862768448912, 0.67... [[1.274565759781191, 1.5716626415220676, 0.854... [[1.3829607676718185, 0.9141066092756043, 0.68...
1 [[1.7287788611203834, 1.479730766338439, 0.682... [[1.7826955386102115, 1.4596865242143404, 0.65... [[1.3276192850743926, 0.8609799418002607, 0.61... [[1.5054804147099767, 1.566190719572681, 0.643...
If I pass it directly to pd.DataFrame, it is displayed as shown above. I then tried to set the column names with this code:
pd.DataFrame({'Summar':centroid[:,0],'Autumn':centroid[:,1],'Winter':centroid[:,2],'Spring':centroid[:,3]})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-91-5cb5f6e37746> in <module>()
----> 1 pd.DataFrame({'Summar':Array[:,0],'Autumn': Array[:,1],'Winter': Array[:,2],'Spring': Array[:,3]})
TypeError: list indices must be integers or slices, not tuple
But it raises the error shown above.
Your Array is a list of lists containing arrays, and a plain Python list cannot be indexed with a tuple such as [:,0]. Try converting it to a NumPy array using
Array = np.array(Array)
Then check the shape of your Array
print (Array.shape)
# (2, 4, 4, 3)
You should now be able to use slicing such as Array[:,0].
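As a small sketch of that conversion, the following builds a nested list with the same layout as in the question (2 outer entries, 4 arrays each, every array 4 x 3) from placeholder values, converts it, and slices out one block per season. The zero-filled data and the dictionary name seasons are assumptions for illustration only.

```python
import numpy as np

# Placeholder data: 2 lists of 4 arrays, each array of shape (4, 3).
nested = [[np.zeros((4, 3)) + 10 * i + j for j in range(4)] for i in range(2)]

# nested[:, 0] would raise the TypeError from the question;
# after conversion the same slice is valid.
arr = np.array(nested)
print(arr.shape)

# One slice per season, each of shape (2, 4, 3).
seasons = {
    "Summer": arr[:, 0],
    "Autumn": arr[:, 1],
    "Winter": arr[:, 2],
    "Spring": arr[:, 3],
}
print(seasons["Summer"].shape)
```

Each slice is still three-dimensional, so it has to be reshaped or flattened before it can become an ordinary DataFrame column.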
Hi I am having problems with this code:
import numpy as np
# Summarize the data about minutes spent in the classroom
#total_minutes = total_minutes_by_account.values()
total_minutes = list(total_minutes_by_account.values())
type(total_minutes)
# Printing out the samething converting to a list
print('Printing out the samething converting to a list ')
print(type(total_minutes))
print ('Mean:', np.mean(total_minutes))
print ('Standard deviation:', np.std(total_minutes))
print ('Minimum:', np.min(total_minutes))
print ('Maximum:', np.max(total_minutes))
The error I get is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-93-945375bf6098> in <module>()
3 # Summarize the data about minutes spent in the classroom
4 #total_minutes = total_minutes_by_account.values()
----> 5 total_minutes = list(total_minutes_by_account.values())
6 type(total_minutes)
7 #print(total_minutes)
AttributeError: 'list' object has no attribute 'values'
I would really like to know how I can make this work. I can do it with pandas by converting it to a NumPy array and then getting the statistics I want with NumPy.
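The traceback says that total_minutes_by_account is already a list at that point, so it has no values() method. A minimal sketch that accepts either a dict or a list might look like this; the helper name summarize and the sample numbers are made up for illustration and are not from the original notebook.

```python
import numpy as np

def summarize(minutes_by_account):
    # A dict exposes its numbers through .values(); a list already is
    # the sequence of numbers, so it can be used as it stands.
    if isinstance(minutes_by_account, dict):
        minutes = list(minutes_by_account.values())
    else:
        minutes = list(minutes_by_account)
    return {
        "mean": float(np.mean(minutes)),
        "std": float(np.std(minutes)),
        "min": float(np.min(minutes)),
        "max": float(np.max(minutes)),
    }

# Hypothetical sample data in both shapes.
as_dict = {"acct_1": 30.0, "acct_2": 90.0, "acct_3": 60.0}
as_list = [30.0, 90.0, 60.0]
stats_dict = summarize(as_dict)
stats_list = summarize(as_list)
print(stats_dict)
```

If the variable is expected to be a dict, it is worth checking whether an earlier cell reassigned it to a list.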
I'm trying to plot the last 30 days of sst data using a for loop. My code will run through the first loop fine but then give this error on the second:
Traceback (most recent call last):
File "sstt.py", line 20, in <module>
Temp = Temp[i,:,:]
IndexError: too many indices for array
It doesn't matter which index I start on; the second pass of the loop always gives this error. If I start on -29, then -28 fails. If I start on -28, -27 fails, etc.
Code:
import numpy as np
import math as m
import urllib2
from pydap.client import open_url
from pydap.proxy import ArrayProxy
data_url_mean = 'http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/noaa.oisst.v2.highres/sst.day.mean.2015.v2.nc'
dataset1 = open_url(data_url_mean)
# Daily Mean
Temp = dataset1['sst']
timestep = [-29,-28,-27,-26,-25,-24,-23,-22,-21,-20,-19,-18,-17,-16,-15,-14,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1]
for i in timestep:
    # Daily Mean
    Temp = Temp[i,:,:]
    Temp = Temp.array[:]
    Temp = Temp * (9./5.) + 32.
    Temp = Temp.squeeze()
    print i
You're assigning all of your values to the same variable. After the first pass of the loop, Temp is no longer equal to the dataset, and the attempt to perform the operation expecting it to be the dataset fails.
You need to come up with some new names for the variables that you assign values to.
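As a sketch of that renaming, the loop below keeps the dataset under one name and uses separate names for each day's slice. A NumPy array of made-up values stands in for dataset1['sst'] so the example runs without a network connection; the names sst, day_celsius and daily_fahrenheit are illustrative, and it is written for Python 3.

```python
import numpy as np

# Stand-in for dataset1['sst'] with axes (time, lat, lon); values are made up.
sst = np.arange(40 * 2 * 3, dtype=float).reshape(40, 2, 3)

daily_fahrenheit = []
for i in range(-29, 0):
    # A new name for the slice, so sst still refers to the full
    # dataset on the next pass of the loop.
    day_celsius = sst[i, :, :]
    day_fahrenheit = day_celsius * (9.0 / 5.0) + 32.0
    daily_fahrenheit.append(day_fahrenheit.squeeze())

print(len(daily_fahrenheit), daily_fahrenheit[0].shape)
```

With the real pydap dataset, the slice would still need the .array[:] step from the question before the arithmetic.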