Concatenate x, y arrays keeping row index - python

I have one multiindex dataframe which contains the x and y coordinates of different body segments across time. It looks like this:
segment 0 1 ... 98 99
coords k x y k ... y k x y
0 0.008525 312.05 361.65 0.011500 ... 329.97 0.012414 621.83 327.77
1 0.004090 312.32 359.98 0.007290 ... 329.00 0.034572 623.31 327.13
2 0.006645 313.42 359.11 0.011194 ... 330.53 0.003275 621.18 327.55
3 0.008367 314.79 361.47 0.013591 ... 329.58 0.026624 624.32 327.76
4 0.005160 315.91 364.54 0.009056 ... 329.97 0.026840 624.54 327.97
... ... ... ... ... ... ... ... ... ...
40006 -0.081192 323.60 354.73 -0.070411 ... 431.78 0.088513 432.43 433.49
40007 -0.050125 319.29 357.99 -0.074568 ... 431.00 0.470994 436.47 432.65
The shape is 40008 rows and 300 columns. I do not need the k values.
For some plotting, however, I need my data to look like this:
[[index0, x_i0_s0, y_i0_s0],
 [index0, x_i0_s1, y_i0_s1],
 [index0, x_i0_s2, y_i0_s2],
 ...
 [index40007, x_i40007_s97, y_i40007_s97],
 [index40007, x_i40007_s98, y_i40007_s98],
 [index40007, x_i40007_s99, y_i40007_s99]]
Or with real data:
[[0, 312.05, 361.65],
...
[40007, 436.47, 432.65]]
So basically I can get rid of the segment ID, but I need to keep the index. The output array should have the dimensions (len(index)*segments, 3), in this case (4000800, 3).
Since I am not very good at manipulating multi-index dataframes I have tried to get the x and y coordinates separately by:
x = df.xs(('x',), level=('coords',), axis=1)
y = df.xs(('y',), level=('coords',), axis=1)
And after that I have tried different things like np.column_stack() and np.reshape() but without success. The furthest I have gone is with:
x = df.xs(('x',), level=('coords',), axis=1)
y = df.xs(('y',), level=('coords',), axis=1)
result = np.stack((x, y), axis=2)
Which gives me an array of shape (40008, 100, 2) instead of (4000800, 3).
Any help would be greatly appreciated, thank you!

Try this:
# A smaller input dataframe to see if I understand your problem correctly
index = pd.MultiIndex.from_product(
    [range(5), list("kxy")], names=["segment", "coords"]
)
df = pd.DataFrame(np.arange(10 * len(index)).reshape(-1, len(index)), columns=index)

# The manipulation
result = (
    df.rename_axis("index")
    .stack("segment")
    .reset_index()[["index", "x", "y"]]
    .to_numpy()
)
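Alternatively, if you prefer to stay in NumPy after extracting x and y with .xs, you can build the same (rows*segments, 3) layout by repeating the row index once per segment. A sketch, using small made-up arrays in place of the real coordinate data:

```python
import numpy as np

# hypothetical stand-ins for the df.xs(..., axis=1) results: shape (n_frames, n_segments)
x = np.arange(12.0).reshape(4, 3)
y = x + 100.0

# repeat each frame index once per segment, then flatten x and y row-wise
idx = np.repeat(np.arange(x.shape[0]), x.shape[1])
result = np.column_stack([idx, x.ravel(), y.ravel()])
print(result.shape)  # (12, 3)
```

With the real data, where x and y each have shape (40008, 100), this yields the desired (4000800, 3) array.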

Related

How to filter and append arrays

I have some code that I am using to try to filter out arrays that are missing values, as seen here:
from astropy.table import Table
import numpy as np
data = '/home/myname/data.fits'
data = Table.read(data, format="fits")
ID = np.array(data['id']).astype(str)
redshift = np.array(data['z']).astype(float)
radius = np.array(data['r']).astype(float)
mag = np.array(data['M']).astype(float)
def stack(array1, array2, array3, array4):
    # stacks multiple arrays to have corresponding values next to each other
    stacked_array = [(array1[i], array2[i], array3[i], array4[i]) for i in range(0, array1.size)]
    stacked_array = np.array(stacked_array)
    return stacked_array

stacked = stack(ID, redshift, radius, mag)
filtered_array = np.array([])
for i in stacked:
    if not i.any == 'nan':
        np.insert(filtered_array, i[0], axis=0)
The last for loop is where I'm having difficulty. I want to insert the rows from my stacked array into my filtered array only if they have all of the information (some rows are missing redshift, others are missing magnitude etc...). How would I be able to loop over my stacked array and keep only the rows that have all 4 values I want? I currently keep getting this error:
TypeError: _insert_dispatcher() missing 1 required positional argument: 'values'
So something like this?
a=[[1,2,3,4],[1,"nan",2,3]]
b=[i for i in a if not any(j=='nan' for j in i)]
which prints [[1, 2, 3, 4]].
You can switch:
for i in stacked:
    if not i.any == 'nan':
        np.insert(filtered_array, i[0], axis=0)
to:
def any_is_nan(col):
    return len(list(filter(lambda x: x == 'nan', col))) > 0

filtered_array = list(filter(lambda x: not any_is_nan(x), stacked))
Please refer to filter.
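If the numeric columns hold real floats rather than the string 'nan', a vectorized alternative is to build one boolean mask with np.isnan over the numeric part of the stacked array. A sketch with made-up rows standing in for the FITS data:

```python
import numpy as np

# hypothetical stacked rows: ID string plus redshift, radius, magnitude
stacked = np.array([
    ["obj1", 0.5, 1.2, 19.3],
    ["obj2", np.nan, 0.8, 20.1],
    ["obj3", 1.1, 0.4, 18.7],
], dtype=object)

# keep only rows whose numeric columns contain no NaN
numeric = stacked[:, 1:].astype(float)
filtered = stacked[~np.isnan(numeric).any(axis=1)]
print(filtered[:, 0])  # ['obj1' 'obj3']
```

This avoids the Python-level loop entirely, which matters for large catalogs.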

Iterate the code in a shortest way for the whole dataset

I have a very big df:
df.shape is (106, 3364)
I want to calculate the so-called Fréchet distance by using this Frechet Distance between 2 curves, and it works well. Example:
x = df['1']
x1 = df['1.1']
p = np.array([x, x1])
y = df['2']
y1 = df['2.1']
q = np.array([y, y1])
P_final = list(zip(p[0], p[1]))
Q_final = list(zip(q[0], q[1]))
from frechetdist import frdist
frdist(P_final,Q_final)
But I can not do it row by row, like:
`1 and 1.1` to `1 and 1.1` which is equal to 0
`1 and 1.1` to `2 and 2.1` which is equal to some number
...
`1 and 1.1` to `1682 and 1682.1` which is equal to some number
I want to create something (my first idea is a for loop, but maybe you have a better solution) to calculate this frdist(P_final, Q_final) between:
the first row and all rows (including itself)
the second row and all rows (including itself)
Finally, I am supposed to get a matrix of size (106, 106) with 0 on the diagonal (because the distance between a curve and itself is 0):
matrix =
0 1 2 3 4 5 ... 105
0 0
1 0
2 0
3 0
4 0
5 0
... 0
105 0
Not including my trial code because it is confusing everyone!
EDITED:
Sample data:
1 1.1 2 2.1 3 3.1 4 4.1 5 5.1
0 43.1024 6.7498 45.1027 5.7500 45.1072 3.7568 45.1076 8.7563 42.1076 8.7563
1 46.0595 1.6829 45.0595 9.6829 45.0564 4.6820 45.0533 8.6796 42.0501 3.6775
2 25.0695 5.5454 44.9727 8.6660 41.9726 2.6666 84.9566 3.8484 44.9566 1.8484
3 35.0281 7.7525 45.0322 3.7465 14.0369 3.7463 62.0386 7.7549 65.0422 7.7599
4 35.0292 7.5616 45.0292 4.5616 23.0292 3.5616 45.0292 7.5616 25.0293 7.5613
I just used my own sample data in your format (I hope):
import pandas as pd
from frechetdist import frdist
import numpy as np

# create sample data
df = pd.DataFrame([[1,2,3,4,5,6], [3,4,5,6,8,9], [2,3,4,5,2,2], [3,4,5,6,7,3]],
                  columns=['1', '1.1', '2', '2.1', '3', '3.1'])

# this matrix will hold the result
res = np.ndarray(shape=(df.shape[1] // 2, df.shape[1] // 2), dtype=np.float32)

for row in range(res.shape[0]):
    for col in range(row, res.shape[1]):
        # extract the two functions
        P = [*zip([df.loc[:, f'{row+1}'], df.loc[:, f'{row+1}.1']])]
        Q = [*zip([df.loc[:, f'{col+1}'], df.loc[:, f'{col+1}.1']])]
        # calculate distance
        dist = frdist(P, Q)
        # put result back (it's symmetric)
        res[row, col] = dist
        res[col, row] = dist

# output
print(res)
Output:
[[0. 4. 7.5498343]
[4. 0. 5.5677643]
[7.5498343 5.5677643 0. ]]
Hope that helps
EDIT: Some general tips:
If speed matters: check whether frdist also handles a numpy array of shape (n_values, 2); then you could save the rather expensive zip-and-unpack operation and use the arrays directly, or build the data directly in the format your library needs.
Generally, use better column names ('3' and '3.1' are not too obvious). Why not call them x3, y3, or x3 and f_x3?
I would actually put the data into two different matrices. If you look at the code, I had to do some not-so-obvious things, like iterating over the shape divided by two and building indices from string operations, because of the given table layout.
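If adding a dependency is a concern, the discrete Fréchet distance itself is a short dynamic program, so the pairwise matrix can also be built with a self-contained implementation. A sketch using two made-up curves rather than your real columns (the standard DP recurrence, not the frechetdist internals):

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two polylines given as (n, 2) arrays."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    n, m = len(P), len(Q)
    ca = np.zeros((n, m))
    ca[0, 0] = np.linalg.norm(P[0] - Q[0])
    for i in range(1, n):  # first column of the DP table
        ca[i, 0] = max(ca[i - 1, 0], np.linalg.norm(P[i] - Q[0]))
    for j in range(1, m):  # first row of the DP table
        ca[0, j] = max(ca[0, j - 1], np.linalg.norm(P[0] - Q[j]))
    for i in range(1, n):  # standard recurrence
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           np.linalg.norm(P[i] - Q[j]))
    return ca[-1, -1]

# two made-up curves, each a sequence of (x, y) points
curves = [np.array([[1, 2], [3, 4], [2, 3], [3, 4]]),
          np.array([[3, 4], [5, 6], [4, 5], [5, 6]])]
res = np.zeros((len(curves), len(curves)))
for a in range(len(curves)):
    for b in range(a + 1, len(curves)):
        res[a, b] = res[b, a] = discrete_frechet(curves[a], curves[b])
print(res)  # diagonal is 0; here the off-diagonal entry is sqrt(8)
```

Since the second curve is the first shifted by (2, 2), every matched pair of points is sqrt(8) apart, which is exactly what the DP returns.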

Generate many to many list of dictionary relation from python data frame

import pandas as pd
import os
list_of_dict2 = [[{'1580674': ['HA-567034786', 'AB-1018724']}], [{'1554970': ['AB-6348403', 'HA-7298656']}, {'1554970': ['AB-2060953', 'HA-990228']}, {'1554970': ['HA-7287204', 'AB-1092380','GR-33333']}]]
list_of_dict = []
for i in list_of_dict2:
    for j in i:
        list_of_dict.append(list(j.values())[0])
df = pd.DataFrame(list_of_dict)
print(df)
My Current dataframe result
0 1 2
0 HA-567034786 AB-1018724 None
1 AB-6348403 HA-7298656 None
2 AB-2060953 HA-990228 None
3 HA-7287204 AB-1092380 GR-33333
Using the list of dictionaries I can generate the data frame with my code above. But my problem is making it a many-to-many list of dictionaries. Let me explain what I want to achieve.
For every row of the data frame, I want to make a many-to-many dictionary with multiple values in each list. Say, for the last index (3), I want it like below.
Expected output (for the 2nd index):
{
"AB-2060953" : ['HA-990228'],
"HA-990228" : ['AB-2060953']
}
Expected output (for the 3rd index):
{
"HA-7287204" : ['AB-1092380','GR-33333'],
"AB-1092380" : ['HA-7287204','GR-33333'],
"GR-33333" : ['AB-1092380','HA-7287204']
}
One approach could be the following:
def make_dict(row):
    s = set(row[~row.isna()])
    return {x: list(s - {x}) for x in s}

df.apply(make_dict, axis=1)
# Output:
0 {'AB-1018724': ['HA-567034786'], 'HA-567034786': ['AB-1018724']}
1 {'AB-6348403': ['HA-7298656'], 'HA-7298656': ['AB-6348403']}
2 {'HA-990228': ['AB-2060953'], 'AB-2060953': ['HA-990228']}
3 {'GR-33333': ['AB-1092380', 'HA-7287204'], 'AB-1092380': ['GR-33333', 'HA-7287204'], 'HA-7287204': ['GR-33333', 'AB-1092380']}
dtype: object
Or, without assuming uniqueness and without using sets:
df.apply(lambda row: {x: [y for y in row if y and x != y] for x in row if x}, axis=1)
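The mapping is easy to see without pandas at all; a plain-Python sketch over two of the sample rows (None standing in for the missing cells):

```python
# plain-Python version of the same many-to-many mapping, on two sample rows
rows = [
    ["AB-2060953", "HA-990228", None],
    ["HA-7287204", "AB-1092380", "GR-33333"],
]

result = []
for row in rows:
    vals = [v for v in row if v is not None]
    # map every id to the list of all other ids in the same row
    result.append({x: [y for y in vals if y != x] for x in vals})

print(result[1]["HA-7287204"])  # ['AB-1092380', 'GR-33333']
```

The dict comprehension is exactly what make_dict does per row, minus the NaN handling that pandas requires.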

Extract values from a 2D matrix based on the values of one column

I have a 2D numpy array "X" with m rows and n columns. I am trying to extract a sub-array where the values of column r fall in a certain range. Right now I have implemented this by looping through each and every row, which, as expected, is really slow. What is a simpler way to do this in Python?
for j in range(m):
    if ((X[j,r]>=lower1) & (X[j,r]<=upper1)):
        count=count+1
        if count==1:
            X_subset=X[j,:]
        else:
            X_subset=np.vstack([X_subset,X[j,:]])
For example:
X=np.array([[10,3,20],
[1,1,25],
[15,4,30]])
I want to get the subset of this 2D array if the values of second column are in the range 3 to 4 (r=1, lower1=3, upper1=4). The result should be:
[[ 10 3 20]
[ 15 4 30]]
You can use boolean indexing:
>>> def select(X, r, lower1, upper1):
...     m = X.shape[0]
...     count = 0
...     for j in range(m):
...         if ((X[j,r]>=lower1) & (X[j,r]<=upper1)):
...             count=count+1
...             if count==1:
...                 X_subset=X[j,:]
...             else:
...                 X_subset=np.vstack([X_subset,X[j,:]])
...     return X_subset
...
# an example
>>> X = np.random.random((5, 5))
>>> r = 2
>>> l, u = 0.4, 0.8
# your method:
>>> select(X, r, l, u)
array([[0.35279849, 0.80630909, 0.67111171, 0.59768928, 0.71130907],
[0.3013973 , 0.15820738, 0.69827899, 0.69536766, 0.70500236],
[0.07456726, 0.51917318, 0.58905997, 0.93859414, 0.47375552],
[0.27942043, 0.62996422, 0.78499397, 0.52212271, 0.51194071]])
# boolean indexing:
>>> X[(X[:, r] >= l) & (X[:, r] <= u)]
array([[0.35279849, 0.80630909, 0.67111171, 0.59768928, 0.71130907],
[0.3013973 , 0.15820738, 0.69827899, 0.69536766, 0.70500236],
[0.07456726, 0.51917318, 0.58905997, 0.93859414, 0.47375552],
[0.27942043, 0.62996422, 0.78499397, 0.52212271, 0.51194071]])
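Applied to the concrete example from the question (with inclusive bounds, since both 3 and 4 should match):

```python
import numpy as np

X = np.array([[10, 3, 20],
              [1, 1, 25],
              [15, 4, 30]])
r, lower1, upper1 = 1, 3, 4

# one boolean mask over column r selects all matching rows at once
subset = X[(X[:, r] >= lower1) & (X[:, r] <= upper1)]
print(subset)
# [[10  3 20]
#  [15  4 30]]
```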

pandas lambda tuple mapping

Trying to compute a range (a confidence interval) and return two values from a lambda mapped over a column.
M=12.4; n=10; T=1.3
dt = pd.DataFrame( { 'vc' : np.random.randn(10) } )
ci = lambda c : M + np.asarray( -c*T/np.sqrt(n) , c*T/np.sqrt(n) )
dt['ci'] = dt['vc'].map( ci )
print '\n confidence interval ', dt['ci'][:,1]
..er , so how does this get done?
then, how to unpack the tuple in a lambda?
(I want to check if the range >0, ie contains the mean)
Neither of the following works:
appnd = lambda c2: c2[0]*c2[1] > 0 and 1 or 0
app2 = lambda x,y: x*y >0 and 1 or 0
dt[cnt] = dt['ci'].map(app2)
It's probably easier to see by defining a proper function for the CI, rather than a lambda.
As far as the unpacking goes, maybe you could let the function take an argument for whether to add or subtract, and then apply it twice.
You should also calculate the mean and size in the function, instead of assigning them ahead of time.
In [40]: def ci(arr, op, t=2.0):
   ....:     M = arr.mean()
   ....:     n = len(arr)
   ....:     rhs = arr * t / np.sqrt(n)
   ....:     return np.array(op(M, rhs))
You can import the add and sub functions from operator.
From there it's just a one-liner:
In [47]: pd.concat([dt.apply(ci, axis=1, op=x) for x in [sub, add]], axis=1)
Out[47]:
vc vc
0 -0.374189 1.122568
1 0.217528 -0.652584
2 -0.636278 1.908835
3 -1.132730 3.398191
4 0.945839 -2.837518
5 -0.053275 0.159826
6 -0.031626 0.094879
7 0.931007 -2.793022
8 -1.016031 3.048093
9 0.051007 -0.153022
[10 rows x 2 columns]
I'd recommend breaking that into a few steps for clarity.
Get the minus side with r1 = dt.apply(ci, axis=1, op=sub) and the plus side with r2 = dt.apply(ci, axis=1, op=add). Combine them with pd.concat([r1, r2], axis=1).
Basically, it's hard to tell from dt.apply alone what the output should look like, just seeing some tuples. By applying separately, we get two 10 x 1 arrays.
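Those steps can be put together as a runnable sketch (random data, so only the shape is checked; the t multiplier is the answer's default, not the question's T):

```python
import numpy as np
import pandas as pd
from operator import add, sub

def ci(arr, op, t=2.0):
    # lower or upper CI bound depending on op (sub or add)
    M = arr.mean()
    n = len(arr)
    rhs = arr * t / np.sqrt(n)
    return np.array(op(M, rhs))

dt = pd.DataFrame({"vc": np.random.randn(10)})
r1 = dt.apply(ci, axis=1, op=sub)   # the minus side
r2 = dt.apply(ci, axis=1, op=add)   # the plus side
bounds = pd.concat([r1, r2], axis=1)
print(bounds.shape)  # (10, 2)
```

Checking whether each interval contains 0 is then a single vectorized comparison over the two columns instead of a tuple-unpacking lambda.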
