outputting python/numpy arrays as columns - python

I'm very new to Python, but have been using it to calculate and filter through data. I'm trying to output my array so I can pass it to other programs, but the output is one solid block of text, with brackets and commas separating the values.
I understand there are ways of manipulating this, but I want to understand why my code produces output in this format, and how to make it write the data in neat columns instead.
The array was generated with:
#!/usr/bin/env python
import numpy as np
import networkx
import gridData
from scipy.spatial.distance import euclidean
INPUT1=open("test_area.xvg",'r')
INPUT2=open("test_atom.xvg",'r')
OUTPUT1= open("negdist.txt",'w')
area = []
pointneg = []
posneg = []
negdistance =[ ]
negresarea = []
while True:
    line = INPUT1.readline()
    if not line:
        break
    col = line.split()
    if col:
        area.append(((col[0]),float(col[1])))
pointneg.append((-65.097000,5.079000,-9.843000))
while True:
    line = INPUT2.readline()
    if not line:
        break
    col = line.split()
    if col:
        pointneg.append((float(col[5]),float(col[6]),float(col[7])))
        posneg.append((col[4]))
for col in posneg:
    negresarea.append(area[int(col)-1][1])
a=len(pointneg)
for x in xrange(a-1):
    negdistance.append((-1,(negresarea[x]),euclidean((pointneg[0]),(pointneg[x]))))
print >> OUTPUT1, negdistance
example output:
[(-1, 1.22333, 0.0), (-1, 1.24223, 153.4651968428021), (-1, 1.48462, 148.59335545709976), (-1, 1.39778, 86.143305392816202), (-1, 0.932278, 47.914688322058403), (-1, 1.04997, 28.622555546282022),
desired output:
[-1, 1.22333, 0.0
-1, 1.24223, 153.4651968428021
-1, 1.48462, 148.59335545709976
-1, 1.39778, 86.143305392816202
-1, 0.932278, 47.914688322058403
-1, 1.04997, 28.622555546282022...
Example inputs:
example input1
1 2.12371 0
2 1.05275 0
3 0.865794 0
4 0.933986 0
5 1.09092 0
6 1.22333 0
7 1.54639 0
8 1.24223 0
9 1.10928 0
10 1.16232 0
11 0.60942 0
12 1.40117 0
13 1.58521 0
14 1.00011 0
15 1.18881 0
16 1.68442 0
17 0.866275 0
18 1.79196 0
19 1.4375 0
20 1.198 0
21 1.01645 0
22 1.82221 0
23 1.99409 0
24 1.0728 0
25 0.679654 0
26 1.15578 0
27 1.28326 0
28 1.00451 0
29 1.48462 0
30 1.33399 0
31 1.13697 0
32 1.27483 0
33 1.18738 0
34 1.08141 0
35 1.15163 0
36 0.93699 0
37 0.940171 0
38 1.92887 0
39 1.35721 0
40 1.85447 0
41 1.39778 0
42 1.97309 0
Example Input2
ATOM 35 CA GLU 6 56.838 -5.202 -102.459 1.00273.53 C
ATOM 55 CA GLU 8 54.729 -6.650 -96.930 1.00262.73 C
ATOM 225 CA GLU 29 5.407 -2.199 -58.801 1.00238.62 C
ATOM 321 CA GLU 41 -24.633 -0.327 -34.928 1.00321.69 C

The problem is the multiple parentheses when you append: you are appending tuples.
What you want is to append lists, i.e. the ones with square brackets.
import numpy as np
area = []
with open('example2.txt') as filehandle:
    for line in filehandle:
        if line.strip() == '':
            continue
        line = line.strip().split(',')
        area.append([int(line[0]),float(line[1]),float(line[2])])
area = np.array(area)
print(area)
'example2.txt' is the data you provided, converted to CSV.

I didn't really get an answer that helped me understand the problem; the one suggested above just prevented the whole code from working properly. I did find a workaround by moving the print command into the loop that builds my final output.
for x in xrange(a-1):
    negdistance.append((-1,(negresarea[x]),euclidean((pointneg[0]),(pointneg[x]))))
    print negdistance
    negdistance =[]
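As for why the original output looked the way it did: printing a Python list writes its repr, which is a single line with brackets, parentheses and commas. A minimal sketch of writing one row per line instead is shown below. The helper name rows_to_text and the choice of separators are assumptions made for illustration, and the sample rows are copied from the example output in the question.

```python
import io

import numpy as np

# Sample rows taken from the example output in the question.
negdistance = [
    (-1, 1.22333, 0.0),
    (-1, 1.24223, 153.4651968428021),
    (-1, 1.48462, 148.59335545709976),
]


def rows_to_text(rows, sep=", "):
    """Join each row's values with `sep` and put one row on each line."""
    return "\n".join(sep.join(str(v) for v in row) for row in rows) + "\n"


# Plain-Python route: build the text, then write it to any open file.
text = rows_to_text(negdistance)

# NumPy route: savetxt writes a 2-D array as delimited columns.
# A StringIO buffer stands in for a real output file here.
buf = io.StringIO()
np.savetxt(buf, np.array(negdistance), fmt="%g", delimiter="\t")
```

With a real file, text could be passed to OUTPUT1.write(), or the file name could be given to np.savetxt directly.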

Related

Tensor indexing with matrix

I have a (3 x 15) matrix dummies with sequences of tokens as rows:
[[ 1 66 67 68 0 0 0 0 0 0 0 0 0 0 0]
[ 1 66 67 66 68 66 67 66 0 0 0 0 0 0 0]
[ 1 66 67 68 18 19 20 21 22 23 24 25 26 17 0]]
Also, there's a tensor probs of shape (3 x 15 x n_tokens) with token probabilities.
From probs I need to select only probabilities of tokens in dummies.
I think it may be possible to use the matrix as indices into the tensor, but I haven't found out how to do that.
You can do that like this:
import tensorflow as tf
dummies = ...
probs = ...
s = tf.shape(dummies)
i = tf.range(s[0])
j = tf.range(s[1])
ii, jj = tf.meshgrid(i, j, indexing='ij')
idx = tf.stack([ii, jj, dummies], axis=-1)
result = tf.gather_nd(probs, idx)
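The same index construction can be tried out in NumPy, which may help when checking what the gather returns. This is only a NumPy analogue of the TensorFlow answer above, not a replacement for it; the array sizes, n_tokens and the random probs values are assumptions chosen to keep the sketch small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 70  # assumed vocabulary size, large enough for the token ids below

# Shortened versions of the token rows from the question.
dummies = np.array([
    [1, 66, 67, 68, 0],
    [1, 66, 67, 66, 68],
])
probs = rng.random((dummies.shape[0], dummies.shape[1], n_tokens))

# Build row and column index grids with the same shape as dummies.
ii, jj = np.meshgrid(
    np.arange(dummies.shape[0]),
    np.arange(dummies.shape[1]),
    indexing="ij",
)

# result[i, j] == probs[i, j, dummies[i, j]]
result = probs[ii, jj, dummies]

# Equivalent formulation with take_along_axis on the last axis.
alt = np.take_along_axis(probs, dummies[..., None], axis=-1)[..., 0]
```

Both result and alt have the shape of dummies, with one selected probability per token position.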

Python SQL Data to Array (like csv)

Currently I am trying to read my SQL data into an array in Python.
I am new to Python, so please be kind ;)
My csv-export can be read easily:
data = pd.read_csv('esoc_data.csv', header = None)
x = data[[1,2,3,4,5,6,7,8,9,10,11,12]]
This picks columns 1 through 12 of my dataset (column numbering starts at 0, so the first column is skipped). I need the data in exactly this format!
Now I want to do the same with the data I get from my SQL fetch.
names = [i for i in cursor.fetchall()]
This gives me all columns (0-12) of my data, separated by commas.
Result:
[('name#mail.com', 13, 13, 0, 24, 2, 0, 20, 3, 0, 31, 12, 2), (...)]
Now, how do I get this into the "specific" format I mentioned before?
I just need the numbers, like this:
1 2 3 4 5 6 7 8 9 10 11 12
0 13 13 0 24 2 0 20 3 0 31 12 2
1 21 0 0 24 0 0 32 0 0 30 0 0
2 9 7 0 26 31 0 19 27 0 30 32 2
I'm sorry if this is peanuts for you.
You can use a nested loop for this, something like:
def our_method():
    parent_list = list()
    for name in names:
        child_list = list()
        # skip index 0 (the e-mail address) and keep the numbers
        for index, item in enumerate(name):
            if index != 0:
                child_list.append(item)
        parent_list.append(child_list)
    return parent_list
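The nested loop above can also be written as a slice over each fetched row. The sketch below uses an in-memory SQLite database so that it runs on its own; the table name scores, its column names and the second e-mail address are made up for illustration, and the real cursor from the question would take their place.

```python
import sqlite3

# Hypothetical table standing in for the real database in the question.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE scores (mail TEXT, a INTEGER, b INTEGER, c INTEGER)")
cursor.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?)",
    [("name#mail.com", 13, 13, 0), ("other#mail.com", 21, 0, 0)],
)

cursor.execute("SELECT * FROM scores ORDER BY rowid")
names = cursor.fetchall()  # list of tuples, one per row

# Drop column 0 (the e-mail address) and keep the numeric columns.
numbers = [list(row[1:]) for row in names]

conn.close()
```

The resulting list of lists should be usable as input to pandas.DataFrame if the labelled layout from read_csv is needed.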

How to column stack arrays ignoring nan in Python?

I have data of the following form in a text file:
#x y z
1 1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64 512
9 81 729
10 100 1000
11 121
12 144 1728
13 169
14 196
15 225
16 256 4096
17 289
18 324
19 361 6859
20 400
21 441 9261
22 484
23 529 12167
24 576
25 625
Some of the entries in the third column are empty. I am trying to create an array of x (column 1) and z (column 3) ignoring nan. Let the array be B. The contents of B should be:
1 1
8 512
9 729
10 1000
12 1728
16 4096
19 6859
21 9261
23 12167
I tried doing this using the code:
import numpy as np
A = np.genfromtxt('data.dat', comments='#', delimiter='\t')
B = []
for i in range(len(A)):
    if ~ np.isnan(A[i, 2]):
        B = np.append(B, np.column_stack((A[i, 0], A[i, 2])))
print B.shape
This does not work. It creates a column vector. How can this be done in Python?
Using pandas would make your life quite a bit easier (note the regular expression used as the delimiter):
import numpy as np
from pandas import read_csv
data = read_csv('data.dat', delimiter=r'\s+').values
print(data[~np.isnan(data[:, 2])][:, [0, 2]])
Which results in:
array([[ 8.00000000e+00, 5.12000000e+02],
[ 9.00000000e+00, 7.29000000e+02],
[ 1.00000000e+01, 1.00000000e+03],
[ 1.20000000e+01, 1.72800000e+03],
[ 1.60000000e+01, 4.09600000e+03],
[ 1.90000000e+01, 6.85900000e+03],
[ 2.10000000e+01, 9.26100000e+03],
[ 2.30000000e+01, 1.21670000e+04]])
If you read your data.dat file and assign its content to a variable, say data, you can iterate over the lines, split them, and process only the ones that have 3 elements:
B = []
for line in data.split('\n'):
    fields = line.split()
    # skip the '#x y z' header and any row without a third column
    if len(fields) == 3 and not line.startswith('#'):
        x, y, z = fields
        B.append((x, z))  # or B.append(str(x)+'\t'+str(z)+'\n')
                          # or any other format you need
The functions provided by libraries are not always easy to use, as you found out. The following program does it manually, and creates an array with the values from the data file.
import numpy as np
def main():
    B = np.empty([0, 2], dtype = int)
    with open("data.dat") as inf:
        for line in inf:
            if line[0] == "#": continue
            l = line.split()
            if len(l) == 3:
                # keep x (column 1) and z (column 3)
                l = [int(l[0]), int(l[2])]
                B = np.vstack((B, l))
    print(B.shape)
    print(B)
    return 0
if __name__ == '__main__':
    main()
Note that:
1) np.append() without an axis argument flattens its inputs, which is why you ended up with a flat vector instead of two columns. The easiest way to extend arrays is 'piling' rows, using vstack (or hstack for columns).
2) Specifying a delimiter in genfromtxt() can come back to bite you. By default the delimiter is any white space, which is normally what you want.
From your input dataframe:
In [33]: df.head()
Out[33]:
   x   y    z
0  1   1    1
1  2   4  NaN
2  3   9  NaN
3  4  16  NaN
4  5  25  NaN
... you can get to the output dataframe B by doing this:
In [34]: df.dropna().head().drop('y', axis=1)
Out[34]:
     x     z
0    1     1
7    8   512
8    9   729
9   10  1000
11  12  1728
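For completeness, the same selection can be done in plain NumPy with a boolean mask, assuming the file has already been loaded into a float array with NaN in the empty cells. The small array below is typed in by hand from the first rows of the question's data rather than read from data.dat.

```python
import numpy as np

# A few rows of the question's data; missing z values are NaN.
A = np.array([
    [1, 1, 1],
    [2, 4, np.nan],
    [3, 9, np.nan],
    [8, 64, 512],
    [9, 81, 729],
])

# True for rows whose third column holds a number.
keep = ~np.isnan(A[:, 2])

# Select those rows, then keep only columns 0 (x) and 2 (z).
B = A[keep][:, [0, 2]]
```

B stays two-dimensional, with one (x, z) pair per row, so no append or stacking is needed.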

How to find the difference between consecutive numbers in a column of a file?

Consider an input file with 6 columns (0-5):
1 0 937 306 97 3
2 164472 75 17 81 3
3 197154 35268 306 97 3
4 310448 29493 64 38 1
5 310541 29063 64 38 1
6 310684 33707 64 38 1
7 319091 47451 16 41 1
8 319101 49724 16 41 1
9 324746 61578 1 5 1
10 324939 54611 1 5 1
For the second column, i.e. column 1 (0, 164472, 197154, ...), I need the difference between consecutive numbers, so that column 1 becomes (0, 164472-0, 197154-164472, ...), i.e. (0, 164472, 32682, ...).
The output file must change only the column 1 values; all other values must remain the same as in the input file:
1 0 937 306 97 3
2 164472 75 17 81 3
3 32682 35268 306 97 3
4 113294 29493 64 38 1
5 93 29063 64 38 1
6 143 33707 64 38 1
7 8407 47451 16 41 1
8 10 49724 16 41 1
9 5645 61578 1 5 1
10 193 54611 1 5 1
If anyone could suggest Python code to do this, it would be helpful.
I tried appending all the columns to lists, computing the differences for column 1, and writing everything back to another file. But the input file I have posted is just a sample; the entire input file contains 50,000 lines, so my attempt failed.
The code I tried is as follows:
import sys
import numpy
old_stdout = sys.stdout
log_file = open("newc","a")
sys.stdout = log_file
a1 = []; a2 = []; a2f = []; v = []; a3 = []; a4 = []; a5 = []; a6 = []
with open("newfileinput",'r') as f:
    for line in f:
        job = map(int,line.split())
        a1.append(job[0])
        a3.append(job[2])
        a4.append(job[3])
        a5.append(job[4])
        a6.append(job[5])
        a2.append(job[1])
v = [a2[i+1]-a2[i] for i in range(len(a2)-1)]
print a1
print v
print a3
print a4
print a5
print a6
sys.stdout = old_stdout
log_file.close()
The output file "newc" produced by this code contained 6 lists, which I then wrote to a file one by one. That was time consuming and not very efficient.
So if anyone could suggest a simpler method, it would be helpful.
Try this. Let me know if there are any problems or if you want me to explain any of the code:
log_file = open("newc.txt","a")
prev_no = 0
with open("newfileinput.txt",'r') as f:
    for line in f:
        row = line.split()
        this_no = int(row[1])
        # change only column 1; str.replace on the whole line could also
        # alter matching digits in the other columns
        row[1] = str(this_no - prev_no)
        log_file.write(' '.join(row) + '\n')
        prev_no = this_no
log_file.close()
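Since the question already imports numpy, a vectorised variant may also be worth considering. This is a sketch, assuming the whole file fits in memory as an integer array (50,000 lines should); the array below is typed in from the first four rows of the sample rather than read from newfileinput.

```python
import numpy as np

# First four rows of the sample input from the question.
data = np.array([
    [1, 0, 937, 306, 97, 3],
    [2, 164472, 75, 17, 81, 3],
    [3, 197154, 35268, 306, 97, 3],
    [4, 310448, 29493, 64, 38, 1],
])

out = data.copy()
# Row 0 keeps its value; every later row gets (current - previous).
out[1:, 1] = np.diff(data[:, 1])

# One space-separated text line per row, ready to be written to a file.
lines = [" ".join(str(v) for v in row) for row in out]
```

For a real file, np.loadtxt with dtype=int and np.savetxt with fmt="%d" would be the natural way to read and write the array.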
don't downvote me, just for fun.
import re
from time import sleep
p = re.compile(r'\s+')
data = '''1 0 937 306 97 3
2 164472 75 17 81 3
3 197154 35268 306 97 3
4 310448 29493 64 38 1
5 310541 29063 64 38 1
6 310684 33707 64 38 1
7 319091 47451 16 41 1
8 319101 49724 16 41 1
9 324746 61578 1 5 1
10 324939 54611 1 5 1\n''' * 5000
data = data.split('\n')[0:-1]
data = [p.split(one) for one in data]
data = [map(int, one) for one in data]
def list_diff(a, b):
    temp = a[:]
    temp[1] = a[1] - b[1]
    return temp
result = [
    data[0],
]
for i, _ in enumerate(data):
    if i < len(data) - 1:
        result.append(list_diff(data[i+1], data[i]))
for i, one in enumerate(result):
    one[0] = i+1
    print one
    sleep(0.1)

replace zeroes in numpy array with the median value

I have a numpy array like this:
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
I want to replace all the zeros with the median value of the whole array (where the zero values are not to be included in the calculation of the median)
So far I have this going on:
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
foo = np.array(foo_array)
foo = np.sort(foo)
print "foo sorted:",foo
#foo sorted: [ 0 0 0 0 0 3 5 8 14 15 16 17 18 19 21 26 27 29 29 31 38 38 40 49 55]
nonzero_values = foo[0::] > 0
nz_values = foo[nonzero_values]
print "nonzero_values?:",nz_values
#nonzero_values?: [ 3 5 8 14 15 16 17 18 19 21 26 27 29 29 31 38 38 40 49 55]
size = np.size(nz_values)
middle = size / 2
print "median is:",nz_values[middle]
#median is: 26
Is there a clever way to achieve this with numpy syntax?
Thank you
This solution takes advantage of numpy.median:
import numpy as np
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
foo = np.array(foo_array)
# Compute the median of the non-zero elements
m = np.median(foo[foo > 0])
# Assign the median to the zero elements
foo[foo == 0] = m
Just a note of caution, the median for your array (with no zeroes) is 23.5 but as written this sticks in 23.
foo2 = foo.copy()  # foo[:] would only create a view that shares data with foo
foo2[foo2 == 0] = nz_values[middle]
Instead of foo2, you could just update foo if you want. NumPy's array indexing syntax can combine a few lines of the code you wrote. For example, instead of
nonzero_values = foo[0::] > 0
nz_values = foo[nonzero_values]
You can just do
nz_values = foo[foo > 0]
You can find out more about "fancy indexing" in the documentation.
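The truncation to 23 mentioned above happens because foo is an integer array, so assigning 23.5 into it drops the fractional part. A minimal sketch, assuming a float result is acceptable, is to create the array with dtype=float before assigning the median:

```python
import numpy as np

foo_array = [38, 26, 14, 55, 31, 0, 15, 8, 0, 0, 0, 18, 40, 27, 3, 19, 0,
             49, 29, 21, 5, 38, 29, 17, 16]

# A float array can hold the exact median of the non-zero values.
foo = np.array(foo_array, dtype=float)

# Median over the non-zero entries only.
m = np.median(foo[foo != 0])

# Replace every zero with that median.
foo[foo == 0] = m
```

Here m is 23.5, the mean of the two middle non-zero values 21 and 26.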
