How to change array into data frame in pandas - python

I am trying to convert an array into a data frame. Below is the array.
Array =
[[array([[1.28327719, 0.86585652, 0.66163084],
[1.80828697, 1.24887998, 0.70235812],
[2.66044828, 1.35045788, 0.68215603],
[1.33065745, 1.4577204 , 0.75933679]]),
array([[1.28560483, 0.98658628, 0.67595305],
[1.73489671, 1.482433 , 0.71539607],
[1.29564167, 1.44918617, 0.74288636],
[2.43989581, 1.19118473, 0.64724577]]),
array([[1.27456576, 1.57166264, 0.854981 ],
[1.87001532, 1.57796163, 0.66740871],
[2.74672303, 1.29211241, 0.63669436],
[1.35104199, 0.84856452, 0.69297247]]),
array([[1.38296077, 0.91410661, 0.68056606],
[1.68320947, 1.42367818, 0.6659204 ],
[1.26965674, 1.55126723, 0.73756696],
[2.28880844, 1.27031044, 0.66577891]])],
[array([[1.72877886, 1.47973077, 0.68263402],
[2.28954891, 1.47387583, 0.72014133],
[1.25488202, 1.52890787, 0.72603781],
[1.36624708, 1.02959695, 0.72986648]]),
array([[1.78269554, 1.45968652, 0.65845671],
[1.29550163, 1.56630194, 0.80255398],
[1.33910381, 1.06375653, 0.73887124],
[2.99602633, 1.32380946, 0.71921367]]),
array([[1.32761929, 0.86097994, 0.61124086],
[1.36946819, 1.64210996, 0.66995842],
[1.29004191, 1.69784434, 1.17951575],
[2.29966943, 1.71713578, 0.62684209]]),
array([[1.50548041, 1.56619072, 0.64304549],
[2.38288223, 1.6995361 , 0.62946513],
[1.28558107, 0.78421077, 0.60182813],
[1.22364377, 1.6643322 , 1.00434432]])]]
pd.DataFrame(centroid)
0 1 2 3
0 [[1.283277189792161, 0.8658565155306925, 0.661... [[1.2856048285071469, 0.9865862768448912, 0.67... [[1.274565759781191, 1.5716626415220676, 0.854... [[1.3829607676718185, 0.9141066092756043, 0.68...
1 [[1.7287788611203834, 1.479730766338439, 0.682... [[1.7826955386102115, 1.4596865242143404, 0.65... [[1.3276192850743926, 0.8609799418002607, 0.61... [[1.5054804147099767, 1.566190719572681, 0.643...
If I just put them into pd.DataFrame it shows like this, and I tried to change the column names with the following code:
pd.DataFrame({'Summar':centroid[:,0],'Autumn':centroid[:,1],'Winter':centroid[:,2],'Spring':centroid[:,3]})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-91-5cb5f6e37746> in <module>()
----> 1 pd.DataFrame({'Summar':Array[:,0],'Autumn': Array[:,1],'Winter': Array[:,2],'Spring': Array[:,3]})
TypeError: list indices must be integers or slices, not tuple
But it shows this error instead.

Your Array is a list of lists containing arrays. Try converting it to a NumPy array using
Array = np.array(Array)
Then check the shape of your Array:
print(Array.shape)
# (2, 4, 4, 3)
You should now be able to use slicing.
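For example, a minimal sketch (the season column names are taken from the question, with 'Summar' read as 'Summer'; wrapping each slice in list() keeps one (4, 3) array per cell, matching the layout pd.DataFrame produced above):
import numpy as np
import pandas as pd

Array = np.array(Array)                      # shape (2, 4, 4, 3)
df = pd.DataFrame({'Summer': list(Array[:, 0]),
                   'Autumn': list(Array[:, 1]),
                   'Winter': list(Array[:, 2]),
                   'Spring': list(Array[:, 3])})
print(df.shape)                              # (2, 4)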

Related

Python- trying to make new list combining values from other list

I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
id1 id2
0 a 1
1 b 2
2 c 3
I want to make a list that looks like new_ids=['a_1','b_2','c_3'], where each value combines the value in row 0 of id1 with the value in row 0 of id2, and so on.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
#making separate lists version
#this function works
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)

idlist1,idlist2=get_ids(df)

#this is the part that doesn't work
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join(str(idlist1[i]),str(idlist2[j]))
    new_id.append(row)
#------------------------------------------------------------------------
#AttributeError Traceback (most recent call last)
#<ipython-input-44-09983bd890a6> in <module>
# 1 newid_list=[]
# 2 for i in range(len(df)):
#----> 3 n1=df['id1'[i].values]
# 4 n2=df['id2'[i].values]
# 5 nid= str(n1)+"_"+str(n2)
#AttributeError: 'str' object has no attribute 'values'
#skipping making lists (also doesn't work)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'[i].values]
    n2=df['id2'[i].values]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
#---------------------------------------------------------------------------
#TypeError Traceback (most recent call last)
#<ipython-input-41-6b0c949a1ad5> in <module>
# 1 new_id=[]
# 2 for i,j in zip(idlist1,idlist2):
#----> 3 row='_'.join(str(idlist1[i]),str(idlist2[j]))
# 4 new_id.append(row)
# 5 #return ', '.join(new_id)
#TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
Your approaches (corrected):
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)

idlist1, idlist2=get_ids(df)

#this part now works
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join([str(i),str(j)])
    new_id.append(row)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'][i]
    n2=df['id2'][i]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
Points:
In the first approach, when you loop with zip, i and j are the data themselves, not indices, so use them directly and convert them to strings.
join takes a list, so simply build a list from the two values, [str(i), str(j)], and pass that to join.
In the second approach, you can get each element of a column with df['id1'][i]; you don't need values, which returns the whole column as a numpy array.
if you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
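For completeness, a plain list comprehension over zip gives the same result (a minimal sketch):
new_ids = [f"{a}_{b}" for a, b in zip(df['id1'], df['id2'])]
# ['a_1', 'b_2', 'c_3']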
Try this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index=0
newid_list=[]
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index+=1

How to parallelize row dataframe computations with dask

I have a dataframe like the following one:
index  paper_id                                  title                                               embedding
0      000a0fc8bbef80410199e690191dc3076a290117  PfSWIB, a potential chromatin regulator for va...  [-0.21326999, -0.39155999, 0.18850000, -0.0664...
1      000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a  Correlation between antimicrobial consumption ...  [-0.23322999, -0.27436000, -0.10449000, -0.536...
2      000b0174f992cb326a891f756d4ae5531f2845f7  Full Title: A systematic review of MERS-CoV (M...  [0.26385999, -0.07325000, 0.03762100, -0.12043...
Where the "embedding" column is a np.array() of some length, whose elements are floats. I need to compute the cosine similarity between every pair of paper_id, and my aim is trying to parallelize it since many of these computations are independent of each other. I thought dask delayed objects would be efficient for this purpose:
The code of my function is:
@dask.delayed
def cosine(vector1, vector2):
    # one can use only the very first elements of the embeddings, i.e. lengths of the embeddings must coincide
    num_elem = min(len(vector1), len(vector2))
    vec1_norm = np.linalg.norm(vector1[0:num_elem])
    vec2_norm = np.linalg.norm(vector2[0:num_elem])
    try:
        cosine = np.vdot(vector1[0:num_elem], vector2[0:num_elem])/(vec1_norm*vec2_norm)
    except:
        cosine = 0.
    return cosine

delayed_cosine_matrix = np.eye(len(cosine_df), len(cosine_df))
for x in range(1, len(cosine_df)):
    for y in range(x):
        delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
        delayed_cosine_matrix[y,x] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
This however returns an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Delayed'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-114-90cefc4986d5> in <module>
3 for x in range(1, len(cosine_df)):
4 for y in range(x):
----> 5 delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
ValueError: setting an array element with a sequence.
Moreover, I would stress that I have chosen np.eye() since the cosine of a vector with itself is one, and that I would like to exploit the symmetry of the operator, i.e.
cosine(x,y) == cosine(y,x)
Is there a way to do this efficiently and in parallel, or am I completely off track?
EDIT: I'm adding a small code snippet that reproduces the columns and layout needed for the dataframe (i.e. only "embeddings" and the index):
import numpy as np
import pandas as pd
emb_lengths = np.random.randint(100, 1000, size = 100)
elements = [np.random.random(size = (1,x)) for x in emb_lengths ]
my_df = pd.DataFrame(elements, columns = ['embeddings'])
my_df.embeddings = my_df.embeddings.apply(lambda x: x[0])
my_df
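For what it's worth, one common pattern (a sketch under these assumptions, not a definitive implementation, reusing the cosine function above and the my_df / 'embeddings' names from the EDIT snippet) is to keep the Delayed objects in plain Python lists and only write numbers into the NumPy matrix after dask.compute has run them; np.eye() returns a float array, which is why assigning a Delayed into it raises the errors shown above.
import numpy as np
import dask

n = len(my_df)
tasks = []     # Delayed objects, one per (x, y) pair
pairs = []     # matching indices, so results can be placed afterwards
for x in range(1, n):
    for y in range(x):
        tasks.append(cosine(my_df.embeddings[x], my_df.embeddings[y]))
        pairs.append((x, y))

values = dask.compute(*tasks)          # runs the delayed calls in parallel

cosine_matrix = np.eye(n)              # diagonal of ones, as in the question
for (x, y), v in zip(pairs, values):   # exploit cosine(x, y) == cosine(y, x)
    cosine_matrix[x, y] = v
    cosine_matrix[y, x] = v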

Python reading a text file into a 2D array and accessing the data

I am trying to read data from a text file into a 2D array and then access each element of the data. I have tried a number of different approaches, but I am unable to access each element of the data.
Here is an extract of the data:
GRID 16 7.5 5.961539 0.
GRID 17 7.5 11.92308 0.
GRID 18 7.5 17.88461 0.
GRID 19 7.5 23.84615 0.
GRID 20 7.5 29.80769 0.
GRID 21 7.5 35.76923 0.
GRID 22 7.5 41.73077 0.
GRID 23 7.5 47.69231 0.
GRID 24 7.5 53.65384 0.
Using the example here, Import nastran nodes deck in Python using numpy,
it imports OK but as a 1D array, and if I try ary[1,1], for example, I get the following response:
x[1,1]
Traceback (most recent call last):
File "<ipython-input-85-3e593ebbc211>", line 1, in <module>
x[1,1]
IndexError: too many indices for array
What I am hoping for is,
17
I have also tried the following code and again this reads into a 1D array:
ary = []
with open(os.path.join(dir, fn)) as fi:
    for line in fi:
        if line.startswith('GRID'):
            ary.append([line[i:i+8] for i in range(0, len(line), 8)])
and I get the following error,
ary[1,2]
Traceback (most recent call last):
File "<ipython-input-83-9ac21a0619e9>", line 1, in <module>
ary[1,2]
TypeError: list indices must be integers or slices, not tuple
I am new to Python, though I do have experience with VBA, where I have used arrays a lot; I am struggling to understand how to load an array and how to access specific data.
You can use the genfromtxt function.
import numpy as np
ary = np.genfromtxt(file_name, dtype=None)
This will automatically load your file and detect the field types. Now you can access ary by row or by column, for example:
In: ary['f1']
Out: array([16, 17, 18, 19, 20, 21, 22, 23, 24])
In: ary[2]
Out: (b'GRID', 18, 7.5, 17.88461, 0.)
or by single element:
In: ary[3]['f1']
Out: 19
In: ary['f1'][3]
Out: 19
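If you stick with your own string-slicing parse instead, converting the nested list with np.array also gives 2D indexing (a sketch, assuming every GRID line splits into the same number of 8-character fields; the values stay strings until you convert them):
import numpy as np
ary = np.array(ary)        # nested list from the loop above -> 2D array of strings
print(int(ary[1, 1]))      # 17, if the node id really sits in the second 8-character field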
You are importing it from a text file? Can you save the text file as a csv? If so, you can easily load the data using pandas.
import pandas as pd
data = pd.read_csv(path_to_file)
Also, it might be that you just need to reshape your numpy array using something like:
x = x.reshape(-1, 4)
EDIT:
Since your format is based on fixed-width fields, you would want to use the fixed-width reader in pandas instead of read_csv. The example below uses a width of 8 characters for each of the five fields (widths expects a list of field widths), with header=None since the file has no header row.
x = pd.read_fwf(path_to_file, widths=[8] * 5, header=None)
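As a usage note (a small sketch, assuming the 8-character layout above), the result is an ordinary DataFrame that can be indexed positionally:
x = pd.read_fwf(path_to_file, widths=[8] * 5, header=None)
print(x.iloc[1, 1])   # second row, second column -> 17 for the extract above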

Trying to add results to an array in Python

I have 2 matrices and I want to save the Euclidean distance of each row in an array so that afterwards I can work with the data (kNN, KNeighbours). I use a temporary counter k so that later I can create a matrix from that array (2 columns x n rows, where each row will contain the distance from position n of the array; in this case, k is that n).
import numpy as np
v1=np.matrix('1,2;3,4')
v2=np.matrix('5,6;7,8')
k=0
for i in v1:
    distancias.append(k)=np.linalg.norm(v2-v1[k,:])
    print(distancias[k])
    k=k+1
It gives me an error:
File "<ipython-input-44-4d3546d9ade5>", line 10
distancias.append(k)=np.linalg.norm(v2-v1[k,:])
^
SyntaxError: can't assign to function call
And I do not really understand what the syntax error means.
I also tried:
import numpy as np
v1=np.matrix('1,2;3,4')
v2=np.matrix('5,6;7,8')
k=0
for i in v1:
    valor=np.linalg.norm(v2-v1[k,:])
    distancias.append(valor)
    print(distancias[k])
    k=k+1
And in this case the error is:
AttributeError Traceback (most recent call last)
<ipython-input-51-8a48ca0267d5> in <module>()
9
10 valor=np.linalg.norm(v2-v1[k,:])
---> 11 distancias.append(valor)
12 print(distancias[k])
13 k=k+1
AttributeError: 'numpy.float64' object has no attribute 'append'
You are trying to assign data to a function call, which is not possible. If you want to add the data computed by linalg.norm() to the list distancias, you can do it as shown below (note that distancias must first be defined as a list; in your second attempt it was a numpy.float64, which is why append failed).
import numpy as np
v1=np.matrix('1,2;3,4')
v2=np.matrix('5,6;7,8')
k=0
distancias = []
for i in v1:
    distancias.append(np.linalg.norm(v2-v1[k,:]))
    print(distancias[k])
    k=k+1
print(distancias)
Output
10.1980390272
6.32455532034
[10.198039027185569, 6.324555320336759]
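The same result can also be had without the manual k counter by iterating the rows of v1 directly (a minimal sketch of that variant):
import numpy as np
v1 = np.matrix('1,2;3,4')
v2 = np.matrix('5,6;7,8')

distancias = []
for row in v1:                      # each row of v1 in turn
    distancias.append(np.linalg.norm(v2 - row))
print(distancias)
# [10.198039027185569, 6.324555320336759]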

How to slice Numpy datetime64 array

I have a Python zip with 2 arrays:
zipped = zip(array1, array2)
where array1 is of type numpy.datetime64 and array2 is a temperature.
I want to make a time window on the 1st array so I can have a fixed array length (because I have other zipped arrays, but they differ in array length).
This is what I have:
start = np.datetime64('2016-06-17T15:00')
stop = np.datetime64('2016-06-19T15:00')
index, temp = sensor_cal.get_arrays('ParsedData/parsed.csv')
print(index)
print(temp)
index2 = index[start:stop]  # This doesn't work
print(index2)
How can I define a time window like this?
My objective is to get same-length arrays in the same time window (because they were previously frequency normalized) and then make a graph where the x axis is time and the various series correspond to the multiple temperature sensor arrays.
My error:
['2016-06-17T13:23:59.000000000' '2016-06-17T13:24:59.000000000' '2016-06-17T13:25:59.000000000' ..., '2016-06-20T09:55:59.000000000' '2016-06-20T09:56:59.000000000' '2016-06-20T09:57:59.000000000']
[[ nan] [ nan] [ nan] ..., [ 25.54 ] [ 25.56333333] [ 25.59333333]]
Traceback (most recent call last):
  File "main_cal.py", line 10, in <module>
    index[start:stop]
IndexError: failed to coerce slice entry of type numpy.datetime64 to integer
You can use pandas indexing, which is designed for this. A Series is a 1D array with an index attached. With reference to Wes McKinney's Python for Data Analysis:
import numpy as np
import pandas as pd
temp = np.random.randn(366)
time_series = pd.Series(temp, index=np.arange(np.datetime64('2015-12-19'), np.datetime64('2016-12-19')))
start = np.datetime64('2016-01-17T15:00')
stop = np.datetime64('2016-06-19T15:00')
time_series[start:stop]
Output:
2016-01-18 -0.690170
2016-01-19 -0.638598
2016-01-20 0.231680
2016-01-21 -0.202787
2016-01-22 -1.333620
2016-01-23 1.525161
2016-01-24 -0.908140
2016-01-25 0.493663
2016-01-26 -1.768979
2016-01-27 0.147327
...
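If you would rather stay with the plain NumPy arrays from the question, a boolean mask over the datetime64 array gives the same kind of time window (a minimal sketch, reusing index, temp, start and stop from the question):
mask = (index >= start) & (index <= stop)
index2 = index[mask]
temp2 = temp[mask]   # the temperature rows in the same window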
