Operations with columns from different files - python

I have many files .txt of this type:
name1.fits 0 0 4088.9 0. 1. 0. -0.909983 0.01386 0.91 0.01386 -0.286976 0.00379 2.979 0.03971 0. 0.
name2.fits 0 0 4088.9 0. 1. 0. -0.84702 0.01239 0.847 0.01239 -0.250671 0.00261 3.174 0.04749 0. 0.
#name3.fits 0 0 4088.9 0. 1. 0. -0.494718 0.01168 0.4947 0.01168 -0.185677 0.0042 2.503 0.04365 0. 0.
#name4.fits 0 1 4088.9 0. 1. 0. -0.751382 0.01342 0.7514 0.01342 -0.202141 0.00267 3.492 0.07224 0. 0.
name4.fits 0 1 4088.961 0.01147 1.000169 0. -0.813628 0.01035 0.8135 0.01035 -0.217434 0.00196 3.515 0.04045 0. 0.
I want to divide the values of one of these columns by the values of a column from another file of the same type. Here is what I have so far:
with open('4026.txt','r') as out1, open('4089.txt', 'r') as out2, \
     open('4116.txt', 'r') as out3, open('4121.txt', 'r') as out4, \
     open('4542.txt', 'r') as out5, open('4553.txt', 'r') as out6:
    for data1 in out1.readlines():
        col1 = data1.strip().split()
        x = col1[9]
        for data2 in out2.readlines():
            col2 = data2.strip().split()
            y = col2[9]
            f = float(y) / float(x)
            print f
However, I'm getting the same value of x for every division. For example, if the first set of data is 4089.txt (above), and the second (4026.txt) is:
name1.fits 0 0 4026.2 0. 1. 0. -0.617924 0.01749 0.6179 0.01749 -0.19384 0.00383 2.995 0.09205 0. 0.
name2.fits 0 0 4026.2 0. 1. 0. -0.644496 0.01218 0.6445 0.01218 -0.183373 0.00291 3.302 0.05261 0. 0.
#name3.fits 0 0 4026.2 0. 1. 0. -0.507311 0.01557 0.5073 0.01557 -0.176148 0.00472 2.706 0.07341 0. 0.
#name4.fits 0 1 4026.2 0. 1. 0. -0.523856 0.01086 0.5239 0.01086 -0.173477 0.00279 2.837 0.05016 0. 0.
name4.fits 0 1 4026.229 0.0144 1.014936 0. -0.619708 0.00868 0.6106 0.00855 -0.185527 0.00189 3.138 0.04441 0. 0.
and I want to divide the 9th column of one file by that of the other. Taking just the first row of each, I should get 0.91/0.6179 = 1.47, but I obtain 0.958241758242.

What's happening is that your code pairs every y against a single x: the inner loop consumes all of out2 during the first pass of the outer loop (readlines() exhausts the file), so x never advances past the first row. You need to pair the rows up one-to-one and do the division for each pair.
An easier approach is to place all the values in two lists,
e.g.
x = [0.0149, 0.01218, ...etc] and y = [...]
Then divide the two lists using numpy (or a for-loop over the lists). Remember that both lists need to be the same length for this to work.
Sample code:
with open('4026.txt', 'r') as out1, open('4089.txt', 'r') as out2:
    # Build two lists of floats from column 9 of each file
    x = []
    y = []
    for data1 in out1:
        col1 = data1.strip().split()
        x.append(float(col1[9]))
    for data2 in out2:
        col2 = data2.strip().split()
        y.append(float(col2[9]))

for i in range(len(x)):
    # Make sure the denominator is not zero
    if x[i] != 0:
        print(y[i] / x[i])
    else:
        print("Not possible")

You could do it like this:
with open('4026.txt','r') as out1, open('4089.txt', 'r') as out2:
    x_col9 = [data1.strip().split()[9] for data1 in out1.readlines()]
    y_col9 = [data2.strip().split()[9] for data2 in out2.readlines()]
    if len(x_col9) != len(y_col9):
        print('Error: files do not have same number of rows')
    else:
        f = [(float(y) / float(x)) for x, y in zip(x_col9, y_col9)]
        print(f)
It may be better to process the files as shown below because it doesn't require reading the entire contents of all of them into memory first, and instead processes each one a line at a time:
x_col9 = [data1.strip().split()[9] for data1 in out1]
y_col9 = [data2.strip().split()[9] for data2 in out2]

Related

adding values in a 2d array provided that the first value is greater than 5

This seems like an elementary question, but for some reason I can't get it to work. I have a 2D list, and I need to sum each row into the next one until the first number of the combined row is at least 5 (only adjacent rows may be summed). For example
array([[ 0. , 3.817549],
[ 3. , 21.275711],
[ 11. , 59.286198],
[ 47. , 110.136649],
[132. , 153.451585],
[263. , 171.041259],
[301. , 158.872652],
[198. , 126.488376],
[ 50. , 200.63002 ]])
and I need output like this:
array([[ 14. , 84.3794...],
[ 47. , 110.136649],
[132. , 153.451585],
[263. , 171.041259],
[301. , 158.872652],
[198. , 126.488376],
[ 50. , 200.63002 ]])
Try:
import numpy as np

arr = np.array([[  0.,   3.817549],
                [  3.,  21.275711],
                [ 11.,  59.286198],
                [ 47., 110.136649],
                [132., 153.451585],
                [263., 171.041259],
                [301., 158.872652],
                [198., 126.488376],
                [ 50., 200.63002 ]])
for i in range(len(arr)):
    if arr[i, 0] >= 5.0:
        arr = arr[i:, :]
        break
    else:
        arr[i + 1, :] += arr[i, :]
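Tracing this on the question's input, the first two rows get folded into the third before the slice happens, so arr ends up as the desired output:
print(arr)
# [[ 14.       84.379458]
#  [ 47.      110.136649]
#  [132.      153.451585]
#  [263.      171.041259]
#  [301.      158.872652]
#  [198.      126.488376]
#  [ 50.      200.63002 ]]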
I'm not entirely sure if I understand the question, but I will try to help.
I would approach this problem with the following steps:
Create a separate 2D list to store your final output and a two-value accumulator list to temporarily store values. Initialize the accumulator to the first row of your input array
Iterate over the remaining rows of the original 2D list
For each row:
a. if accumulator[0] >= 5, append the accumulated values to your output and reset the accumulator to the current row
b. otherwise, add the current row's values to the accumulator
After the loop, append whatever is still held in the accumulator to the output
The following code takes your input and reproduces the exact output you wanted:
# Assuming current_vals is the input list...
final_vals = []
accumulator = [current_vals[0][0], current_vals[0][1]]
for sublist_index in range(1, len(current_vals)):
    if accumulator[0] >= 5:
        final_vals.append([accumulator[0], accumulator[1]])
        accumulator[0] = current_vals[sublist_index][0]
        accumulator[1] = current_vals[sublist_index][1]
    else:
        accumulator[0] += current_vals[sublist_index][0]
        accumulator[1] += current_vals[sublist_index][1]
# Flush the values still held in the accumulator
final_vals.append([accumulator[0], accumulator[1]])
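Wrapped as a reusable function (a sketch; merge_rows is a hypothetical name, and the threshold is exposed as a parameter):
def merge_rows(rows, threshold=5):
    # Merge adjacent rows until each emitted row's first value
    # reaches the threshold; the tail is flushed as-is.
    merged = []
    acc = list(rows[0])
    for row in rows[1:]:
        if acc[0] >= threshold:
            merged.append(acc)
            acc = list(row)
        else:
            acc[0] += row[0]
            acc[1] += row[1]
    merged.append(acc)
    return merged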

Array formatting with argmax numpy Python

I am having trouble with Numbers[(Numbers<=0).argmax():] = 0. It is supposed to turn every element from the first non-positive value onward into zeroes; however, if the condition is not met anywhere it turns all the array elements into zeroes. How can I fix this issue? If the Numbers <= 0 condition is not met, the array should not change.
Array with satisfying condition at -35.15610151:
Numbers = np.array([123.6, 123.6, 123.6, 110.3748, 111.6992976,
                    102.3165566, 97.81462811, 89.50038472, 96.48141473, 90.49956702,
                    88.59907611, 77.96718698, 61.51611052, 56.84088612, 55.36302309,
                    54.69866681, 56.44902415, 59.49727145, 42.12406819, 27.42276839,
                    33.86711896, 32.10602877, -35.15610151, 32.34361339, 29.20628289])
Numbers[(Numbers<=0).argmax():] = 0
Output:
[123.6 123.6 123.6 110.3748 111.6992976
102.3165566 97.81462811 89.50038472 96.48141473 90.49956702
88.59907611 77.96718698 61.51611052 56.84088612 55.36302309
54.69866681 56.44902415 59.49727145 42.12406819 27.42276839
33.86711896 32.10602877 0. 0. 0. ]
Array with no satisfying condition, turned -35.15610151 into +35.15610151:
Numbers = np.array([123.6, 123.6, 123.6, 110.3748, 111.6992976,
                    102.3165566, 97.81462811, 89.50038472, 96.48141473, 90.49956702,
                    88.59907611, 77.96718698, 61.51611052, 56.84088612, 55.36302309,
                    54.69866681, 56.44902415, 59.49727145, 42.12406819, 27.42276839,
                    33.86711896, 32.10602877, 35.15610151, 32.34361339, 29.20628289])
Numbers[(Numbers<=0).argmax():] = 0
Output:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0.]
Try these two methods: one assigns in place to the NumPy view, and the other creates a fresh array to be assigned to another variable -
#Method 1 (Inplace assignment)
Numbers[(Numbers<=0).cumsum(dtype=bool)] = 0
Or,
#Method 2 (Not inplace)
np.where(~(Numbers<=0).cumsum(dtype=bool), Numbers, 0)
Or,
#As an excellent suggestion by Mad Physicist!
Numbers[np.logical_or.accumulate(Numbers <= 0)] = 0
Explanation -
The boolean array Numbers <= 0, say [F, F, F, T, F, F, F], can be viewed as an array of 1s and 0s. Taking a cumsum propagates the first T to all subsequent elements, turning the mask into [F, F, F, T, T, T, T].
That mask can then be used with plain boolean indexing to set the view to 0, or with np.where (reversing the mask with ~) to select either the original elements or 0.
The advantage here is that if the mask is all False, i.e. no element meets the condition, the original Numbers is returned untouched instead of being zeroed out.
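For instance, a minimal sketch of that propagation:
import numpy as np

mask = np.array([False, False, False, True, False, False, False])
print(mask.cumsum(dtype=bool))
# [False False False  True  True  True  True]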
Running tests -
With a value that meets condition
Numbers = np.array([123.6, 123.6, -123.6, 110.3748, 111.6992976, 102.3165566, 97.81462811])
Numbers[(Numbers<=0).cumsum(dtype=bool)] = 0
#array([123.6, 123.6, 0., 0., 0., 0., 0.])
With no values meeting the condition
Numbers = np.array([123.6, 123.6, 123.6, 110.3748, 111.6992976, 102.3165566, 97.81462811])
Numbers[(Numbers<=0).cumsum(dtype=bool)] = 0
#array([123.6, 123.6, 123.6, 110.3748, 111.6992976, 102.3165566, 97.81462811])
EDIT: New scenario as requested
Numbers1 = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
Numbers2 = np.array([1,2,-3,4,5])
Numbers2 = np.where(~(Numbers2<=0).cumsum().astype(bool), Numbers1, 0)
Numbers2
array([1.1, 2.2, 0. , 0. , 0. ])
Just use an if for this; it represents the intention quite well and is easy to understand:
smaller_equal_zero = Numbers <= 0
if smaller_equal_zero.any():
    Numbers[smaller_equal_zero.argmax():] = 0
With Python 3.8+ you can use an assignment expression in the if:
if (smaller_equal_zero := Numbers <= 0).any():
    Numbers[smaller_equal_zero.argmax():] = 0
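As a quick check (a sketch on a shortened, all-positive version of the question's second array), the guard skips the assignment and the array is left unchanged:
import numpy as np

Numbers = np.array([123.6, 110.3748, 97.81462811])
if (smaller_equal_zero := Numbers <= 0).any():
    Numbers[smaller_equal_zero.argmax():] = 0
print(Numbers)  # [123.6       110.3748     97.81462811]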

Numpy method to return the index of the occurrence of an array within an array of arrays

I have an array of arrays that represents a set of unique colour values:
[[0. 0. 0. ]
[0. 0. 1. ]
[0. 1. 1. ]
[0.5019608 0.5019608 0.5019608 ]
[0.64705884 0.16470589 0.16470589]
[0.9607843 0.9607843 0.8627451 ]
[1. 0. 0. ]
[1. 0.84313726 0. ]
[1. 1. 0. ]
[1. 1. 1. ]]
And another numpy array that represents one of the colours:
[0.9607843 0.9607843 0.8627451 ]
I need a function to find the index where the colour array occurs in the set of colours, i.e. the function should return 5 for the arrays above.
numpy.where() returns the exact positions in the array of the values that meet a given condition. Here it would be as follows (denoting the big array as arr1 and the sought vector as arr2):
np.where(np.all(arr1 == arr2, axis=1))
This returns an array of the row indexes of the matching rows.
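For example, a minimal sketch with a shortened version of the question's colour set (np.where returns a tuple of index arrays, so the scalar index is the first element of the first array):
import numpy as np

arr1 = np.array([[0., 0., 0.],
                 [0., 0., 1.],
                 [0.9607843, 0.9607843, 0.8627451],
                 [1., 1., 1.]])
arr2 = np.array([0.9607843, 0.9607843, 0.8627451])

idx = np.where(np.all(arr1 == arr2, axis=1))[0][0]
print(idx)  # 2
Exact == comparison works here because the colour was taken from the set itself; for colours produced by computation, np.isclose(arr1, arr2).all(axis=1) is the safer test.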
Assuming that this is a relatively short list of colors (<1000), the simplest thing to do is probably just iterate over the list and compare each element of the sub-array.
color_list = ...
color_index = -1
target_color = [0.9607843, 0.9607843, 0.8627451]
for i in range(len(color_list)):
    cur_color = color_list[i]
    if (cur_color[0] == target_color[0] and
            cur_color[1] == target_color[1] and
            cur_color[2] == target_color[2]):
        color_index = i
        break

Python Read Fortran Binary File

I'm trying to read a binary file written by the Fortran code below, but the results I get do not match the file's contents.
Fortran 77 code:
program test
implicit none
integer i,j,k,l
real*4 pcp(2,3,4)
open(10, file='pcp.bin', form='unformatted')
l = 0
do i=1,2
  do j=1,2
    do k=1,2
      print*,k+l*2
      pcp(i,j,k)=k+l*2
      l = l + 1
    enddo
  enddo
enddo
do k=1,4
  write(10) pcp(:,:,k)
enddo
close(10)
stop
end
I'm trying to use the Python code below:
from scipy.io import FortranFile
f = FortranFile('pcp.bin', 'r')
a = f.read_reals(dtype=float)
print(a)
Because you are writing real*4 data to a sequential file, simply try replacing dtype=float with dtype='float32' (or dtype=np.float32) in read_reals():
>>> from scipy.io import FortranFile
>>> f = FortranFile( 'pcp.bin', 'r' )
>>> print( f.read_reals( dtype='float32' ) )
[ 1. 9. 5. 13. 0. 0.]
>>> print( f.read_reals( dtype='float32' ) )
[ 4. 12. 8. 16. 0. 0.]
>>> print( f.read_reals( dtype='float32' ) )
[ 0. 0. 0. 0. 0. 0.]
>>> print( f.read_reals( dtype='float32' ) )
[ 0. 0. 0. 0. 0. 0.]
The obtained data correspond to each pcp(:,:,k) in Fortran, as verified by
do k=1,4
  print "(6f8.3)", pcp(:,:,k)
enddo
which gives (with pcp initialized to zero)
1.0 9.0 5.0 13.0 0.0 0.0
4.0 12.0 8.0 16.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
But because >>> help( FortranFile ) says
An example of an unformatted sequential file in Fortran would be written as::
OPEN(1, FILE=myfilename, FORM='unformatted')
WRITE(1) myvariable
Since this is a non-standard file format, whose contents depend on the
compiler and the endianness of the machine, caution is advised. Files from
gfortran 4.8.0 and gfortran 4.1.2 on x86_64 are known to work.
Consider using Fortran direct-access files or files from the newer Stream
I/O, which can be easily read by numpy.fromfile.
it may be simpler to use numpy.fromfile() depending on cases (as shown in StanleyR's answer).
Use numpy.fromfile (http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html).
I guess you have missed something in the Fortran code; to write the binary file, apply this code:
program test
implicit none
integer i,j,k,l, reclen
real*4 pcp(2,3,4)
inquire(iolength=reclen) pcp(:,:,1)
open(10, file='pcp.bin', form='unformatted', access='direct', recl=reclen)
pcp = 0
l = 0
do i=1,2
  do j=1,2
    do k=1,2
      print*,i,j,k,k+l*2
      pcp(i,j,k)=k+l*2
      l = l + 1
    enddo
  enddo
enddo
do k=1,4
  write(10, rec=k) pcp(:,:,k)
enddo
close(10)
end
To read the file with Python:
import numpy as np
with open('pcp.bin','rb') as f:
    for k in range(4):
        data = np.fromfile(f, dtype=np.float32, count=2*3)
        print(np.reshape(data, (2, 3)))
Output:
[[ 1. 9. 5.]
[ 13. 0. 0.]]
[[ 4. 12. 8.]
[ 16. 0. 0.]]
[[ 0. 0. 0.]
[ 0. 0. 0.]]
[[ 0. 0. 0.]
[ 0. 0. 0.]]
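Since the direct-access file contains no record markers, an alternative sketch is to read the whole thing in one call and reshape in Fortran order (assuming the pcp.bin written by the direct-access program above):
import numpy as np

raw = np.fromfile('pcp.bin', dtype=np.float32)   # 2*3*4 = 24 values
pcp = raw.reshape((2, 3, 4), order='F')          # pcp[i, j, k] matches Fortran pcp(i+1, j+1, k+1)
print(pcp[:, :, 0])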
The easiest way may be to use the data_py package. To install it, run pip install data-py.
Example use
from data_py import datafile
NoOfLines=0
lineNumber=2 # Line number to be read (Excluding lines starting with '#')
df1=datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator="," # No need to specify if separator is space(" "). For 'tab' separated values use '\t'
NoOfLines=df1.lines # Total number of lines in the data file (Excluding lines starting with '#')
Col=["Null"]*5 # This will create 5 column variables with an intial string 'Null'.
# Number of column variables (here 5) should not be greater than number of columns in data file.
df1.read(Col,lineNumber) # Will read first five columns from the data file at the line number given, and stores in Col.
print(Col)
For details visit: https://www.respt.in/p/python-package-datapy.html

Suggestions for faster for/if statements in my code?

My code takes about two hours to run. The bottleneck is the nested for loop and if statements (see the comment in the code).
I'm a beginner with Python :) Can anyone recommend an efficient Python way to replace the nested for and if statements?
I have tables of ~30 million rows, each row with (x,y,z) values:
20.0 11.3 7
21.0 11.3 0
22.0 11.3 3
...
My desired output is a table in the form x, y, min(z), count(min(z)). The last
column is a final count of the least z values at that (x,y). Eg:
20.0 11.3 7 7
21.0 11.3 0 10
22.0 11.3 3 1
...
There's only about 600 unique coordinates, so the output table will be 600x4.
My code:
import numpy as np
file = open('input.txt', 'r')
coordset = set()
data = np.zeros((600, 4)) * np.nan
irow = 0
ctr = 0
for row in file:
    item = row.split()
    x = float(item[0])
    y = float(item[1])
    z = float(item[2])
    # build unique grid of coords
    if (x, y) not in coordset:
        data[irow][0] = x
        data[irow][1] = y
        data[irow][2] = z
        irow = irow + 1  # grows up to 599
    # lookup table of unique coords
    coordset.add((x, y))
    # BOTTLENECK. replace ifs? for?
    for i in range(0, irow):
        if data[i][0] == x and data[i][1] == y:
            if z > data[i][2]:
                continue
            elif z == data[i][2]:
                ctr = ctr + 1
                data[i][3] = ctr
            if z < data[i][2]:
                data[i][2] = z
                ctr = 1
                data[i][3] = ctr
edit: For reference the approach by #Joowani computes in 1m26s. My original approach, same computer, same datafile, 106m23s.
edit2: #Ophion and #Sibster thanks for suggestions, I don't have enough credit to +1 useful answers.
Your solution seems slow because it iterates through the list (i.e. data) every time you make an update. A better approach would be using a dictionary, which takes O(1) as opposed to O(n) per update.
Here would be my solution using a dictionary:
file = open('input.txt', 'r')
# coordinates
c = {}
for line in file:
    # items
    (x, y, z) = (float(n) for n in line.split())
    if (x, y) not in c:
        c[(x, y)] = [z, 1]
    elif c[(x, y)][0] > z:
        c[(x, y)][0], c[(x, y)][1] = z, 1
    elif c[(x, y)][0] == z:
        c[(x, y)][1] += 1
for key in c:
    print("{} {} {} {}".format(key[0], key[1], c[key][0], c[key][1]))
Why not change the last if to an elif? As it is now, you evaluate z < data[i][2] on every iteration of the loop.
You could even just replace it with an else, since you have already checked z > data[i][2] and z == data[i][2]; the only remaining possibility is z < data[i][2].
So the following code will do the same and should be faster:
if z > data[i][2]:
    continue
elif z == data[i][2]:
    ctr = ctr + 1
    data[i][3] = ctr
else:
    data[i][2] = z
    ctr = 1
    data[i][3] = ctr
To do this in numpy, use np.unique on a row view of the array:
def count_unique(arr):
    # View each row as one opaque (void) scalar so np.unique can
    # find unique rows; requires a contiguous array.
    row_view = np.ascontiguousarray(arr).view(
        np.dtype((np.void, arr.dtype.itemsize * arr.shape[1])))
    ua, uind = np.unique(row_view, return_inverse=True)
    unique_rows = ua.view(arr.dtype).reshape(ua.shape + (-1,))
    count = np.bincount(uind)
    return np.hstack((unique_rows, count[:, None]))
First let's check with a small array:
a = np.random.rand(10, 3)
a = np.around(a, 0)
print(a)
[[ 0. 0. 0.]
[ 0. 1. 1.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 1.]
[ 1. 1. 0.]
[ 1. 0. 1.]
[ 1. 0. 1.]
[ 1. 0. 0.]
[ 0. 0. 0.]]
output = count_unique(a)
print(output)
[[ 0. 0. 0. 2.]
[ 0. 1. 0. 1.]
[ 0. 1. 1. 2.]
[ 1. 0. 0. 2.]
[ 1. 0. 1. 2.]
[ 1. 1. 0. 1.]]
print(np.sum(output[:, -1]))
10
Looks good! Now let's check with a large array:
a = np.random.rand(int(3E7), 3)
a = np.around(a, 1)
output = count_unique(a)
print(output.shape)
(1331, 4) #Close as I can get to 600 unique elements.
print(np.sum(output[:, -1]))
30000000.0
Takes about 33 seconds on my machine and 3 GB of memory; doing this all in memory for large arrays will likely be your bottleneck. For reference, @Joowani's solution took about 130 seconds, although this is a bit of an apples-and-oranges comparison as we start with a numpy array. Your mileage may vary.
To read in the data as a numpy array, have a look at the question here, but it should look something like the following:
arr=np.genfromtxt("./input.txt", delimiter=" ")
For loading that much data from a txt file, I would really recommend using the pandas example in that link.
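Since that link recommends pandas, here is a hedged sketch of the whole x, y, min(z), count(min(z)) reduction with groupby (assuming a whitespace-separated input.txt as in the question):
import pandas as pd

df = pd.read_csv('input.txt', sep=r'\s+', names=['x', 'y', 'z'])

# Per-(x, y) minimum of z, broadcast back onto every row.
zmin = df.groupby(['x', 'y'])['z'].transform('min')

# Keep only the rows sitting at their group's minimum, then count them.
out = (df[df['z'] == zmin]
       .groupby(['x', 'y'])['z']
       .agg(zmin='min', count='size')
       .reset_index())
print(out)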
