I have data of the following form in a text file.
Text file entry
#x y z
1 1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64 512
9 81 729
10 100 1000
11 121
12 144 1728
13 169
14 196
15 225
16 256 4096
17 289
18 324
19 361 6859
20 400
21 441 9261
22 484
23 529 12167
24 576
25 625
Some of the entries in the third column are empty. I am trying to create an array of x (column 1) and z (column 3), ignoring the NaNs. Call the array B. The contents of B should be:
1 1
8 512
9 729
10 1000
12 1728
16 4096
19 6859
21 9261
23 12167
I tried doing this using the code:
import numpy as np
A = np.genfromtxt('data.dat', comments='#', delimiter='\t')
B = []
for i in range(len(A)):
    if ~np.isnan(A[i, 2]):
        B = np.append(B, np.column_stack((A[i, 0], A[i, 2])))
print(B.shape)
This does not work. It creates a column vector. How can this be done in Python?
Using pandas would make your life quite a bit easier (note the regular expression used as the delimiter):
import numpy as np
from pandas import read_csv

data = read_csv('data.dat', delimiter=r'\s+').values
print(data[~np.isnan(data[:, 2])][:, [0, 2]])
Which results in:
array([[ 1.00000000e+00, 1.00000000e+00],
[ 8.00000000e+00, 5.12000000e+02],
[ 9.00000000e+00, 7.29000000e+02],
[ 1.00000000e+01, 1.00000000e+03],
[ 1.20000000e+01, 1.72800000e+03],
[ 1.60000000e+01, 4.09600000e+03],
[ 1.90000000e+01, 6.85900000e+03],
[ 2.10000000e+01, 9.26100000e+03],
[ 2.30000000e+01, 1.21670000e+04]])
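A self-contained version of the same pipeline, as a sketch: the inline string stands in for data.dat, and sep=r'\s+' is the regex delimiter.

```python
import io

import numpy as np
import pandas as pd

# inline stand-in for data.dat; the short rows leave z empty
text = "#x y z\n1 1 1\n2 4\n8 64 512\n"

# a regex delimiter treats any run of whitespace as one separator;
# rows with only two fields get NaN in the z column
data = pd.read_csv(io.StringIO(text), sep=r"\s+").values
B = data[~np.isnan(data[:, 2])][:, [0, 2]]
```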
If you read your data.dat file and assign the content to a variable, say data:
You can iterate over the lines, split them, and process only the ones that have 3 elements:
B = []
for line in data.split('\n'):
    parts = line.split()
    # skip the commented '#x y z' header, keep only complete rows
    if len(parts) == 3 and not line.startswith('#'):
        x, y, z = parts
        B.append((x, z))  # or B.append(x + '\t' + z + '\n'),
                          # or any other format you need
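A quick self-contained run of that idea; the data string is a stand-in for the file contents, and the header guard keeps the '#x y z' line out of B.

```python
data = "#x y z\n1 1 1\n2 4\n8 64 512"

B = []
for line in data.split('\n'):
    parts = line.split()
    # keep only complete 3-field rows, skipping the commented header
    if len(parts) == 3 and not line.startswith('#'):
        x, y, z = parts
        B.append((x, z))
```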
The functions provided by libraries are not always easy to use, as you found out. The following program does it manually and creates an array with the values from the data file.
import numpy as np

def main():
    B = np.empty([0, 2], dtype=int)
    with open("data.dat") as inf:
        for line in inf:
            if line[0] == "#":
                continue
            l = line.split()
            if len(l) == 3:
                # keep columns x and z (l[1:] would keep y and z instead)
                B = np.vstack((B, [int(l[0]), int(l[2])]))
    print(B.shape)
    print(B)
    return 0

if __name__ == '__main__':
    main()
Note that:
1) np.append() flattens its arguments into a 1-D array, at least in the syntax you used, which is why you ended up with a column vector. The easiest way to extend arrays is 'piling' rows using vstack (or hstack for columns).
2) Specifying a delimiter in genfromtxt() can come back to bite you. By default the delimiter is any run of whitespace, which is normally what you want.
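A quick illustration of point 1 (a minimal sketch; the numbers are arbitrary):

```python
import numpy as np

# np.append flattens its inputs, which is why the question's loop
# produced a 1-D "column vector"
flat = np.append(np.array([]), np.column_stack(([1.0], [1.0])))

# piling up rows with vstack keeps the 2-D shape instead
B = np.empty((0, 2), dtype=int)
for row in ([1, 1], [8, 512]):
    B = np.vstack((B, row))
```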
From your input dataframe:
In [33]: df.head()
Out[33]:
x y z
0 1 1 1
1 2 4 NaN
2 3 9 NaN
3 4 16 NaN
4 5 25 NaN
.. you can get to the output dataframe B by doing this :
In [34]: df.dropna().head().drop('y', axis=1)
Out[34]:
x z
0 1 1
7 8 512
8 9 729
9 10 1000
11 12 1728
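Without the .head() calls the full pipeline is one line. A sketch, with a small frame standing in for the real df:

```python
import numpy as np
import pandas as pd

# stand-in for the dataframe read from data.dat
df = pd.DataFrame({'x': [1, 2, 8], 'y': [1, 4, 64], 'z': [1.0, np.nan, 512.0]})

# dropna removes the rows with an empty z, drop removes the y column
B = df.dropna().drop('y', axis=1).values
```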
It might sound trivial, but I am surprised by the output. Basically, I am calculating y = a*x + b for given a, b, and x. With the code below I am able to get the desired result for y, which is a list of 20 values.
But when I check the length of the list, I get 1 in return, and the range is (0, 1), which is weird as I was expecting it to be 20.
Am I making a mistake here?
a = 10
b = 0
x = df['x']
print(x)
0 0.000000
1 0.052632
2 0.105263
3 0.157895
4 0.210526
5 0.263158
6 0.315789
7 0.368421
8 0.421053
9 0.473684
10 0.526316
11 0.578947
12 0.631579
13 0.684211
14 0.736842
15 0.789474
16 0.842105
17 0.894737
18 0.947368
19 1.000000
y_new = []
for i in x:
    y = a*x + b
    y_new.append(y)
len(y_new)
Output: 1
print(y_new)
[0 0.000000
1 0.526316
2 1.052632
3 1.578947
4 2.105263
5 2.631579
6 3.157895
7 3.684211
8 4.210526
9 4.736842
10 5.263158
11 5.789474
12 6.315789
13 6.842105
14 7.368421
15 7.894737
16 8.421053
17 8.947368
18 9.473684
19 10.000000
Name: x, dtype: float64]
I would propose two solutions:
The first solution: convert your column df['x'] into a list with df['x'].tolist(), re-run your code, and replace a*x + b with a*i + b.
The second solution (which I would choose): convert df['x'] into an array with x = np.array(df['x']). That way you can use array broadcasting.
So, your code will simply be :
x = np.array(df['x'])
y = a*x + b
This should give you the desired output.
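For completeness, a tiny sketch of the broadcasting approach, with np.linspace standing in for df['x']:

```python
import numpy as np

a, b = 10, 0
x = np.linspace(0, 1, 20)  # stand-in for np.array(df['x'])
y = a * x + b              # broadcasting: one vectorized operation, no loop
```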
I hope this is helpful.
With the code below, I get a length of 20 for the list y_new. Are you sure you printed the right value? According to this post, df['x'] indexes the column named 'x' and returns a pandas Series, so df['x'] is equivalent to pd.Series(...).
import pandas as pd
a = 10
b = 0
x = pd.Series(data=[0.000000, 0.052632, 0.105263, 0.157895, 0.210526, 0.263158, 0.315789,
                    0.368421, 0.421053, 0.473684, 0.526316, 0.578947, 0.631579, 0.684211,
                    0.736842, 0.789474, 0.842105, 0.894737, 0.947368, 1.000000])
y_new = []
for i in x:
    y = a*x + b
    y_new.append(y)
print("y_new length: " + str(len(y_new)))
Output:
y_new length: 20
I have a DataFrame as
Locality money
1 3
1 4
1 10
1 12
1 15
2 16
2 18
I have to take combinations with replacement of the money column, grouped by Locality, with a filter on the money difference. The target must be like
Locality money1 money2
1 3 3
1 3 4
1 4 4
1 10 10
1 10 12
1 10 15
1 12 12
1 12 15
1 15 15
2 16 16
2 16 18
2 18 18
Note that combinations are taken only between values in the same Locality, and only pairs whose difference is less than 6 are kept.
My current code is
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd

def generate_graph(input_series, out_cols):
    return pd.DataFrame(list(combinations_with_replacement(input_series, r=2)), columns=out_cols)

df = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: generate_graph(x, out_cols=['money1', 'money2'])
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)

# Ensure the distance between money values is within the permissible limit
df = df.loc[(
    df['money2'] - df['money1'] < 6
)]
The issue is that my DataFrame has 100,000 rows, and my code takes almost 33 seconds to process it. I need to optimize the time taken, probably using numpy. In particular, I am looking to optimize the groupby and the post-filter, which take extra space and time. For sample data, you can use this code to generate the DataFrame.
# Generate dummy data
t1 = list(range(0, 100000))
b = np.random.randint(100, 10000, 100000)
a = (b/100).astype(int)
df = pd.DataFrame({'Locality': a, 'money': t1})
df = df.sort_values(by=['Locality', 'money'])
To gain a running-time speedup and reduce space consumption:
Instead of post-filtering, apply an extended function (say combine_values) that builds the dataframe from a generator expression yielding already-filtered (by the condition) combinations.
(factor below is a default argument that corresponds to the permissible limit mentioned above)
In [48]: def combine_values(values, out_cols, factor=6):
...: return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
...: if m2 - m1 < factor), columns=out_cols)
...:
In [49]: df_result = (
...: df.groupby(['Locality'])['money'].apply(
...: lambda x: combine_values(x, out_cols=['money1', 'money2'])
...: ).reset_index().drop(columns=['level_1'], errors='ignore')
...: )
Execution time performance:
In [50]: %time df.groupby(['Locality'])['money'].apply(lambda x: combine_values(x, out_cols=['money1', 'money2'])).reset_index().drop(columns=['level_1'], errors='ignore')
CPU times: user 2.42 s, sys: 1.64 ms, total: 2.42 s
Wall time: 2.42 s
Out[50]:
Locality money1 money2
0 1 34 34
1 1 106 106
2 1 123 123
3 1 483 483
4 1 822 822
... ... ... ...
105143 99 99732 99732
105144 99 99872 99872
105145 99 99889 99889
105146 99 99913 99913
105147 99 99981 99981
[105148 rows x 3 columns]
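As a sanity check, the generator-based combine_values reproduces the expected pairs on the small sample frame from the question (the function is redefined here so the snippet is self-contained):

```python
from itertools import combinations_with_replacement

import pandas as pd

def combine_values(values, out_cols, factor=6):
    # yield only the already-filtered pairs instead of post-filtering
    return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
                         if m2 - m1 < factor), columns=out_cols)

# the sample frame from the question
df = pd.DataFrame({'Locality': [1, 1, 1, 1, 1, 2, 2],
                   'money': [3, 4, 10, 12, 15, 16, 18]})

out = (df.groupby(['Locality'])['money']
         .apply(lambda s: combine_values(s, out_cols=['money1', 'money2']))
         .reset_index().drop(columns=['level_1'], errors='ignore'))
```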
I have a large set (thousands) of smooth lines (series of x,y pairs) with different sampling of x and y and different length for each line, i.e.
x_0 = {x_00, x_01, ..., } # length n_0
x_1 = {x_10, x_11, ..., } # length n_1
...
x_m = {x_m0, x_m1, ..., } # length n_m
y_0 = {y_00, y_01, ..., } # length n_0
y_1 = {y_10, y_11, ..., } # length n_1
...
y_m = {y_m0, y_m1, ..., } # length n_m
I want to find cumulative properties of each line interpolated to a regular set of x points, i.e. x = {x_0, x_1 ..., x_n-1}
Currently I'm for-looping over each line, creating an interpolant, resampling, and then taking the sum/median/whatever of the result. It works, but it's really slow. Is there any way to vectorize/matricize this operation?
I was thinking, since linear interpolation can be a matrix operation, perhaps it's possible. At the same time, since each row can have a different length... it might be complicated. Edit: but zero padding the shorter arrays would be easy...
What I'm doing now looks something like,
import numpy as np
import scipy as sp
import scipy.interpolate
...
# `xx` and `yy` are lists of lists with the x and y points respectively
# `xref` are the reference x values at which I want interpolants
yref = np.zeros([len(xx), len(xref)])
for ii, (xi, yi) in enumerate(zip(xx, yy)):
    yref[ii] = np.interp(xref, xi, yi)  # sp.interp is a deprecated alias of np.interp

y_med = np.median(yref, axis=-1)
y_sum = np.sum(yref, axis=-1)
...
Hopefully, you can adjust the following for your purposes.
I included pandas because it has an interpolation feature to fill in missing values.
Setup
import pandas as pd
import numpy as np
x = np.arange(19)
x_0 = x[::2]
x_1 = x[::3]
np.random.seed([3,1415])
y_0 = x_0 + np.random.randn(len(x_0)) * 2
y_1 = x_1 + np.random.randn(len(x_1)) * 2
xy_0 = pd.DataFrame(y_0, index=x_0)
xy_1 = pd.DataFrame(y_1, index=x_1)
Note:
x is length 19
x_0 is length 10
x_1 is length 7
xy_0 looks like:
0
0 -4.259448
2 -0.536932
4 0.059001
6 1.481890
8 7.301427
10 9.946090
12 12.632472
14 14.697564
16 17.430729
18 19.541526
xy_0 can be aligned with x via reindex
xy_0.reindex(x)
0
0 -4.259448
1 NaN
2 -0.536932
3 NaN
4 0.059001
5 NaN
6 1.481890
7 NaN
8 7.301427
9 NaN
10 9.946090
11 NaN
12 12.632472
13 NaN
14 14.697564
15 NaN
16 17.430729
17 NaN
18 19.541526
we can then fill in missing with interpolate
xy_0.reindex(x).interpolate()
0
0 -4.259448
1 -2.398190
2 -0.536932
3 -0.238966
4 0.059001
5 0.770445
6 1.481890
7 4.391659
8 7.301427
9 8.623759
10 9.946090
11 11.289281
12 12.632472
13 13.665018
14 14.697564
15 16.064147
16 17.430729
17 18.486128
18 19.541526
What about xy_1
xy_1.reindex(x)
0
0 -1.216416
1 NaN
2 NaN
3 3.704781
4 NaN
5 NaN
6 5.294958
7 NaN
8 NaN
9 8.168262
10 NaN
11 NaN
12 10.176849
13 NaN
14 NaN
15 14.714924
16 NaN
17 NaN
18 19.493678
Interpolated
xy_1.reindex(x).interpolate()
0
0 -1.216416
1 0.423983
2 2.064382
3 3.704781
4 4.234840
5 4.764899
6 5.294958
7 6.252726
8 7.210494
9 8.168262
10 8.837791
11 9.507320
12 10.176849
13 11.689541
14 13.202233
15 14.714924
16 16.307842
17 17.900760
18 19.493678
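The aligned frames can then be stacked side by side, after which the cumulative statistics are ordinary column-wise reductions. A minimal sketch, with simple linear data standing in for y_0/y_1 so the expected sums are easy to check:

```python
import numpy as np
import pandas as pd

x = np.arange(19)
xy_0 = pd.DataFrame(1.0 * x[::2], index=x[::2])   # y = x,  sampled every 2nd point
xy_1 = pd.DataFrame(2.0 * x[::3], index=x[::3])   # y = 2x, sampled every 3rd point

# align every line on the common grid, interpolate, then stack as columns
aligned = pd.concat([xy.reindex(x).interpolate()[0] for xy in (xy_0, xy_1)],
                    axis=1, ignore_index=True)

y_sum = aligned.sum(axis=0)     # one value per line
y_med = aligned.median(axis=0)  # ditto
```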
I'm very new to Python, but have been using it to calculate and filter through data. I'm trying to output my array so I can pass it to other programs, but the output is one solid piece of text, with brackets and commas separating the entries.
I understand there are ways of manipulating this, but I want to understand why my code output it in this format, and how to make it output nice columns instead.
The array was generated with:
#!/usr/bin/env python
import numpy as np
import networkx
import gridData
from scipy.spatial.distance import euclidean
INPUT1=open("test_area.xvg",'r')
INPUT2=open("test_atom.xvg",'r')
OUTPUT1= open("negdist.txt",'w')
area = []
pointneg = []
posneg = []
negdistance = []
negresarea = []

while True:
    line = INPUT1.readline()
    if not line:
        break
    col = line.split()
    if col:
        area.append(((col[0]), float(col[1])))

pointneg.append((-65.097000, 5.079000, -9.843000))

while True:
    line = INPUT2.readline()
    if not line:
        break
    col = line.split()
    if col:
        pointneg.append((float(col[5]), float(col[6]), float(col[7])))
        posneg.append((col[4]))

for col in posneg:
    negresarea.append(area[int(col) - 1][1])

a = len(pointneg)
for x in xrange(a - 1):
    negdistance.append((-1, (negresarea[x]), euclidean((pointneg[0]), (pointneg[x]))))

print >> OUTPUT1, negdistance
example output:
[(-1, 1.22333, 0.0), (-1, 1.24223, 153.4651968428021), (-1, 1.48462, 148.59335545709976), (-1, 1.39778, 86.143305392816202), (-1, 0.932278, 47.914688322058403), (-1, 1.04997, 28.622555546282022),
desired output:
[-1, 1.22333, 0.0
-1, 1.24223, 153.4651968428021
-1, 1.48462, 148.59335545709976
-1, 1.39778, 86.143305392816202
-1, 0.932278, 47.914688322058403
-1, 1.04997, 28.622555546282022...
Example inputs:
example input1
1 2.12371 0
2 1.05275 0
3 0.865794 0
4 0.933986 0
5 1.09092 0
6 1.22333 0
7 1.54639 0
8 1.24223 0
9 1.10928 0
10 1.16232 0
11 0.60942 0
12 1.40117 0
13 1.58521 0
14 1.00011 0
15 1.18881 0
16 1.68442 0
17 0.866275 0
18 1.79196 0
19 1.4375 0
20 1.198 0
21 1.01645 0
22 1.82221 0
23 1.99409 0
24 1.0728 0
25 0.679654 0
26 1.15578 0
27 1.28326 0
28 1.00451 0
29 1.48462 0
30 1.33399 0
31 1.13697 0
32 1.27483 0
33 1.18738 0
34 1.08141 0
35 1.15163 0
36 0.93699 0
37 0.940171 0
38 1.92887 0
39 1.35721 0
40 1.85447 0
41 1.39778 0
42 1.97309 0
Example Input2
ATOM 35 CA GLU 6 56.838 -5.202 -102.459 1.00273.53 C
ATOM 55 CA GLU 8 54.729 -6.650 -96.930 1.00262.73 C
ATOM 225 CA GLU 29 5.407 -2.199 -58.801 1.00238.62 C
ATOM 321 CA GLU 41 -24.633 -0.327 -34.928 1.00321.69 C
The problem is the multiple parentheses when you append: you are appending tuples.
What you want is to be adding lists, i.e. the ones with square brackets.
import numpy as np

area = []
with open('example2.txt') as filehandle:
    for line in filehandle:
        if line.strip() == '':
            continue
        line = line.strip().split(',')
        area.append([int(line[0]), float(line[1]), float(line[2])])

area = np.array(area)
print(area)
'example2.txt' is the data you provided, saved as a CSV.
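If the goal is columns that other programs can read, np.savetxt handles the formatting in one call. A sketch; the two rows are taken from the example output, and the StringIO buffer stands in for an output file:

```python
import io

import numpy as np

negdistance = [(-1, 1.22333, 0.0), (-1, 1.24223, 153.4651968428021)]

buf = io.StringIO()  # a file path works the same way
np.savetxt(buf, negdistance, fmt='%g', delimiter='\t')  # one row per line
lines = buf.getvalue().splitlines()
```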
I didn't really get an answer that enabled me to understand the problem; the one suggested above just stopped the whole code from working properly. I did find a workaround by including the print command in the loop that defines my final output.
for x in xrange(a - 1):
    negdistance.append((-1, (negresarea[x]), euclidean((pointneg[0]), (pointneg[x]))))
    print negdistance
    negdistance = []
Consider an input file with 6 columns (0-5):
1 0 937 306 97 3
2 164472 75 17 81 3
3 197154 35268 306 97 3
4 310448 29493 64 38 1
5 310541 29063 64 38 1
6 310684 33707 64 38 1
7 319091 47451 16 41 1
8 319101 49724 16 41 1
9 324746 61578 1 5 1
10 324939 54611 1 5 1
For the second column, i.e. column 1 (0, 164472, 197154, ...), I need to find the difference between consecutive numbers, so that column 1 becomes (0, 164472-0, 197154-164472, ...), i.e. (0, 164472, 32682, ...).
The output file must change only the column 1 values; all other values must remain the same as in the input file:
1 0 937 306 97 3
2 164472 75 17 81 3
3 32682 35268 306 97 3
4 113294 29493 64 38 1
5 93 29063 64 38 1
6 143 33707 64 38 1
7 8407 47451 16 41 1
8 10 49724 16 41 1
9 5645 61578 1 5 1
10 193 54611 1 5 1
If anyone could suggest Python code to do this, it would be helpful.
Actually, I tried appending all the columns to lists, taking the difference of column 1, and writing everything back to another file. But the input file I posted is just a sample; the real input file contains 50,000 lines, so my attempt failed.
The attempt code I tried is as follows:
import sys
import numpy
old_stdout = sys.stdout
log_file = open("newc","a")
sys.stdout = log_file
a1 = []; a2 = []; a2f = []; v = []; a3 = []; a4 = []; a5 = []; a6 = []
with open("newfileinput", 'r') as f:
    for line in f:
        job = map(int, line.split())
        a1.append(job[0])
        a3.append(job[2])
        a4.append(job[3])
        a5.append(job[4])
        a6.append(job[5])
        a2.append(job[1])

v = [a2[i+1] - a2[i] for i in range(len(a2)-1)]
print a1
print v
print a3
print a4
print a5
print a6
sys.stdout = old_stdout
log_file.close()
Then, from the code's output file "newc", which contained the 6 lists, I wrote them into a file one by one, which was time consuming and not very efficient.
So if anyone could suggest a simpler method, it would be helpful.
Try this. Let me know if there are any problems or if you want me to explain any of the code:
log_file = open("newc.txt", "a")
this_no, prev_no = 0, 0
with open("newfileinput.txt", 'r') as f:
    for line in f:
        row = line.split()
        this_no = int(row[1])
        # replace() swaps the first occurrence of the old value for the difference
        log_file.write(line.replace(str(this_no), str(this_no - prev_no)))
        prev_no = this_no
log_file.close()
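One caveat: str.replace substitutes the first matching substring, which can collide with the row number if the same digits appear earlier in the line. A slightly more defensive variant rebuilds each line from its split fields; a sketch, with a few sample rows inlined as a string in place of the input file:

```python
import io

# stand-in for newfileinput.txt
src = io.StringIO("1 0 937 306 97 3\n"
                  "2 164472 75 17 81 3\n"
                  "3 197154 35268 306 97 3\n")

out_lines = []
prev = 0
for line in src:
    cols = line.split()
    cur = int(cols[1])
    cols[1] = str(cur - prev)  # only column 1 is rewritten
    prev = cur
    out_lines.append(' '.join(cols))
```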
Don't downvote me; this is just for fun.
import re
from time import sleep
p = re.compile(r'\s+')
data = '''1 0 937 306 97 3
2 164472 75 17 81 3
3 197154 35268 306 97 3
4 310448 29493 64 38 1
5 310541 29063 64 38 1
6 310684 33707 64 38 1
7 319091 47451 16 41 1
8 319101 49724 16 41 1
9 324746 61578 1 5 1
10 324939 54611 1 5 1\n''' * 5000
data = data.split('\n')[0:-1]
data = [p.split(one) for one in data]
data = [list(map(int, one)) for one in data]

def list_diff(a, b):
    temp = a[:]
    temp[1] = a[1] - b[1]
    return temp

result = [
    data[0],
]
for i, _ in enumerate(data):
    if i < len(data) - 1:
        result.append(list_diff(data[i+1], data[i]))

for i, one in enumerate(result):
    one[0] = i + 1
    print(one)
    sleep(0.1)