I would like to write a number of Python arrays into a txt file, with one array per line, and afterwards read the arrays back line by line.
My work-in-progress code is below. The problem I am working on involves about 100,000 arrays (the length of L).
from __future__ import division
from array import array
M = array('I',[1,2,3])
N = array('I',[10,20,30])
L = [M,N]
with open('manyArrays.txt','w') as file:
    for a in L:
        sA = a.tostring()
        file.write(sA + '\n')

with open('manyArrays.txt','r') as file:
    for line in file:
        lineRead = array('I', [])
        lineRead.fromstring(line)
        print lineRead
The error message I get is
lineRead.fromstring(line)
ValueError: string length not a multiple of item size
The error occurs because tostring() returns the raw bytes of the array: each example line is 3 x 4 = 12 bytes of data plus the newline, i.e. 13 bytes, which is not a multiple of the 4-byte item size, and any stored value containing the byte 0x0A ('\n') would additionally split a line in the middle. You can either use numpy functions for this or format the lines yourself.
You could concatenate your arrays into one 2D array and save it directly with np.savetxt, then load it with np.genfromtxt:
import numpy as np

M = np.array([1,2,3],dtype='I')
N = np.array([10,20,30],dtype='I')
data= np.array([M,N])
file='test.txt'
np.savetxt(file,data)
M2,N2 = np.genfromtxt(file)
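By default np.savetxt writes each value in scientific float notation; a small sketch (an addition, not part of the answer above) to keep the file as plain integers:

np.savetxt(file, data, fmt='%d')
M2, N2 = np.genfromtxt(file, dtype=int)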
Or do:
file2 = 'test2.txt'
form = "%i %i %i \n"
with open(file2,'w') as f:
    for i in range(len(data)):
        f.write(form % (data[i,0], data[i,1], data[i,2]))
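To read that manually formatted file back, one option (a sketch, not part of the answer above) is np.loadtxt, since the rows are plain whitespace-separated integers:

data2 = np.loadtxt(file2, dtype=int)
M2, N2 = data2          # one row per original array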
Related
I was tired of waiting while loading a simple distance matrix from a CSV file using numpy.genfromtxt. Following another SO question, I ran a perfplot test, including some additional methods (source code at the end).
The result for the largest input size shows that the best method is read_csv, which is this:
def load_read_csv(path: str):
    with open(path, 'r') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        first_row = True
        for row_index, row in enumerate(reader):
            if first_row:
                size = len(row)
                matrix = np.zeros((size, size), dtype=int)
                first_row = False
            matrix[row_index] = row
    return matrix
Now I doubt that reading the file line by line, converting it to a list of strings, and then calling int() on each item in the list and adding it to a NumPy matrix is the best possible way.
Can this function be optimized further, or is there some fast library for CSV loading (like Univocity parser in Java), or maybe just a dedicated NumPy function?
The source code of the test:
import perfplot
import csv
import numpy as np
import pandas as pd
def load_read_csv(path: str):
    with open(path, 'r') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        first_row = True
        for row_index, row in enumerate(reader):
            if first_row:
                size = len(row)
                matrix = np.zeros((size, size), dtype=int)
                first_row = False
            # matrix[row_index] = [int(item) for item in row]
            matrix[row_index] = row
    return matrix

def load_loadtxt(path: str):
    matrix = np.loadtxt(path, dtype=int, comments=None, delimiter=",", encoding="utf-8")
    return matrix

def load_genfromtxt(path: str):
    matrix = np.genfromtxt(path, dtype=int, comments=None, delimiter=",", deletechars=None, replace_space=None, encoding="utf-8")
    return matrix

def load_pandas(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32)
    return df.values

def load_pandas_engine_pyarrow(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32, engine='pyarrow')
    return df.values

def load_pandas_engine_python(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32, engine='python')
    return df.values

def setup(n):
    matrix = np.random.randint(0, 10000, size=(n, n), dtype=int)
    filename = f"square_matrix_of_size_{n}.csv"
    np.savetxt(filename, matrix, fmt="%d", delimiter=",")
    return filename

b = perfplot.bench(
    setup=setup,  # or setup=np.random.rand
    kernels=[
        load_read_csv,
        load_loadtxt,
        load_genfromtxt,
        load_pandas,
        load_pandas_engine_pyarrow,
        load_pandas_engine_python
    ],
    n_range=[2 ** k for k in range(15)]
)
b.save("out.png")
b.show()
Parsing CSV files correctly while supporting several data types (e.g. floating-point numbers, integers, strings) and possibly ill-formed input files is clearly not easy, and doing so efficiently is actually pretty hard. Moreover, decoding UTF-8 strings is much slower than reading ASCII strings directly. This is the reason why most CSV libraries are pretty slow. Not to mention that wrapping a library in Python can introduce pretty big overheads regarding the input types (especially strings).
Fortunately, if you need to read a CSV file containing a square matrix of integers that is assumed to be correctly formed, then you can write much faster code dedicated to your needs (which does not care about floating-point numbers, strings, UTF-8, header decoding, error handling, etc.).
That being said, any call to a basic CPython function tends to introduce a huge overhead. Even a simple call to open+read is relatively slow (binary mode is significantly faster than text mode but unfortunately still not that fast). The trick is to use NumPy to load the whole binary file into RAM with np.fromfile. This function is extremely fast: it just reads the whole file at once, puts its binary content in a raw memory buffer and returns a view on it. When the file is in the operating-system cache or on a high-throughput NVMe SSD storage device, it can load the file at several GiB/s.
Once the file is loaded, you can decode it with Numba (or Cython) so the decoding can be nearly as fast as native code. Note that Numba does not support strings/bytes well or efficiently. Fortunately, np.fromfile produces a contiguous byte array and Numba can process it very quickly. You can find the size of the matrix by just reading the first line and counting the number of commas. Then you can fill the matrix very efficiently by decoding integers on the fly, packing them into a flattened matrix, and treating end-of-line characters as regular separators. Note that \r and \n can both appear in the file since the file is read in binary mode.
Here is the resulting implementation:
import numba as nb
import numpy as np
@nb.njit('int32[:,:](uint8[::1],)', cache=True)
def decode_csv_buffer(rawData):
    COMMA = np.uint8(ord(','))
    CR = np.uint8(ord('\r'))
    LF = np.uint8(ord('\n'))
    ZERO = np.uint8(ord('0'))

    # Find the size of the matrix (`n`)
    n = 0
    lineSize = 0
    for i in range(rawData.size):
        c = rawData[i]
        if c == CR or c == LF:
            break
        n += rawData[i] == COMMA
        lineSize += 1
    n += 1

    # Empty matrix
    if lineSize == 0:
        return np.empty((0, 0), dtype=np.int32)

    # Initialization
    res = np.empty(n * n, dtype=np.int32)

    # Fill the matrix
    curInt = 0
    curPos = 0
    lastCharIsDigit = True
    for i in range(len(rawData)):
        c = rawData[i]
        if c == CR or c == LF or c == COMMA:
            if lastCharIsDigit:
                # Write the last int in the flattened matrix
                res[curPos] = curInt
                curPos += 1
                curInt = 0
                lastCharIsDigit = False
        else:
            curInt = curInt * 10 + (c - ZERO)
            lastCharIsDigit = True

    return res.reshape(n, n)

def load_numba(filename):
    # Load the whole file into a raw memory buffer
    rawData = np.fromfile(filename, dtype=np.uint8)

    # Decode the buffer using the Numba JIT.
    # This method only works for your specific needs and
    # can simply crash if the file content is invalid.
    return decode_csv_buffer(rawData)
Be aware that the code is not robust (any bad input results in undefined behaviour, including a crash), but it is very fast.
On my machine, the above Numba implementation is at least one order of magnitude faster than all the others. Note that you can write even faster code using multiple threads during the decoding, but this makes the code significantly more complex.
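As a quick usage check (a sketch only, reusing setup(), load_loadtxt() and load_numba() as defined above), the Numba loader should produce the same matrix as the slower loaders:

# Sanity check: the Numba decoder should match np.loadtxt on a generated file
filename = setup(100)                      # writes square_matrix_of_size_100.csv
assert np.array_equal(load_numba(filename), load_loadtxt(filename))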
I am trying to explore the topic of concurrency in Python. I saw a couple of posts about how to optimize processing by splitting the input data, processing the pieces separately, and joining the results afterwards. My task is to calculate the mean along the Z axis of a stack of rasters. I read the list of rasters from a text file and then create a stacked numpy array with the data.
Then I wrote a simple function that takes the stacked array as input and calculates the mean. This task takes me some minutes to complete, and I would like to process the numpy array in chunks to speed up the script. However, when I do so using numpy.split (maybe not a good idea for splitting my 3D array), I get the following error:
Traceback (most recent call last):
  File "C:\Users\~\AppData\Local\conda\conda\envs\geoprocessing\lib\site-packages\numpy\lib\shape_base.py", line 553, in split
    len(indices_or_sections)
TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tf_calculation_numpy.py", line 69, in <module>
    main()
  File "tf_calculation_numpy.py", line 60, in main
    subarrays = np.split(final_array, 4)
  File "C:\Users\~\AppData\Local\conda\conda\envs\geoprocessing\lib\site-packages\numpy\lib\shape_base.py", line 559, in split
    raise ValueError('array split does not result in an equal division')
ValueError: array split does not result in an equal division
Code is:
import rasterio
import os
import numpy as np
import time
import concurrent.futures
def mean_py(array):
    print("Calculating mean of array")
    start_time = time.time()
    x = array.shape[1]
    y = array.shape[2]
    values = np.empty((x, y), type(array[0][0][0]))
    for i in range(x):
        for j in range(y):
            # no more need for append operations
            values[i][j] = np.mean(array[:, i, j])
    end_time = time.time()
    hours, rem = divmod(end_time - start_time, 3600)
    minutes, seconds = divmod(rem, 60)
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))
    print(f"{'.'*80}")
    return values

def TF_mean(ndarray):
    sdir = r'G:\Mosaics\VH'
    final_array = np.asarray(ndarray)
    final_array = mean_py(final_array)
    out_name = (sdir + "/" + "MEAN_VH.tif")
    print(final_array.shape)
    with rasterio.open(out_name, "w", **profile) as dst:
        dst.write(final_array.astype('float32'), 1)
    print(out_name)
    print(f"\nDone!\n{'.'*80}")

def main():
    sdir = r'G:\Mosaics\VH'
    a = np.random.randn(250_000)
    b = np.random.randn(250_000)
    c = np.random.randn(250_000)
    e = np.random.randn(250_000)
    f = np.random.randn(250_000)
    g = np.random.randn(250_000)
    h = np.random.randn(250_000)
    arrays = [a, b, c, e, f, g, h]
    final_array = []
    for array in arrays:
        final_array.append(array)
        print(f"{array} added")
    print("Splitting nd-array!")
    final_array = np.asarray(final_array)
    subarrays = np.split(final_array, 4)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for subarray, mean in zip(subarrays, executor.map(TF_mean, subarrays)):
            print(f'Processing {subarray}')

if __name__ == '__main__':
    main()
I just expect to have four processes running in parallel, a way to obtain the 4 subarrays, and to write them out as a single GeoTIFF file.
The second exception is the important one here, in terms of describing the error: "array split does not result in an equal division"
final_array is a 2D array, with shape 7 by 250,000. numpy.split operates along an axis, defaulting to axis 0, so you just asked it to split a length seven axis into four equal parts. Obviously, this isn't possible, so it gives up.
To fix, you can:
1. Split more; you could just split in seven parts and process each separately. The executor is perfectly happy to run seven tasks, no matter how many workers you have; seven won't split evenly across the workers, so at the tail end of processing you'll likely have some workers idle while the rest finish up, but that's not the end of the world.
2. Split on a more fine-grained level. You could just flatten the array, e.g. final_array = final_array.reshape(final_array.size), which would make it a flat 1,750,000 element array that can be split into four equal parts.
3. Split unevenly; instead of subarrays = np.split(final_array, 4), which requires axis 0 to be evenly splittable, do subarrays = np.split(final_array, (2,4,6)), which splits into three groups of two rows plus one group with a single row.
There are many other options depending on your use case (e.g. split on axis=1 instead of the default axis=0), but those three are the least invasive (#1 and #3 shouldn't change behavior meaningfully; #2 might, depending on whether the separation between 250K-element blocks is meaningful). A minimal sketch of all three follows below.
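A minimal sketch of those options, using a random array of the question's 7 x 250,000 shape (option 4, np.array_split, is an addition not discussed above that simply tolerates uneven divisions):

import numpy as np

final_array = np.random.randn(7, 250_000)      # same shape as in the question

# 1. Split more: seven single-row tasks
subarrays = np.split(final_array, 7)

# 2. Split on a finer level: flatten, then four equal 437,500-element pieces
flat = final_array.reshape(final_array.size)
subarrays = np.split(flat, 4)

# 3. Split unevenly: three groups of two rows plus one group with a single row
subarrays = np.split(final_array, (2, 4, 6))

# 4. (Addition) np.array_split accepts divisions that are not exact
subarrays = np.array_split(final_array, 4)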
I am trying to write an entire array to a text or CSV file.
from array import array as pyarray
import csv
tmp1 = (x for x in range(10))
tmp2 = (x+10 for x in range(10))
arr1 = pyarray('l')
with open('fileoutput','wb') as fil1:
    for i in range(10):
        val = next(tmp1) - next(tmp2)
        arr1.append(val)
    arr1.tofile(fil1)
The problem with this code is that it writes a binary file. I want to write the values as strings so that the output is human-readable. It is possible to create a loop and write the file line by line; however, the real problem has millions of lines in arr1. What is an optimized way to write it in human-readable form?
Edit:
After changing the line above to with open('fileoutput','w') as fil1: (i.e. 'wb' to 'w'), there is an error:
write() argument must be str, not bytes. So this has not solved the problem. Any suggestions?
You opened the file in wb mode. This writes in binary. Write to the file in w mode to write it as a string.
with open ('fileoutput','w') as fil1:
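Note, however, that array.tofile always writes the raw machine bytes, so a text-mode handle alone still fails with the TypeError mentioned in the edit. A minimal sketch (not the answerer's code) of writing the values as readable text instead, one per line:

with open('fileoutput', 'w') as fil1:
    fil1.write('\n'.join(map(str, arr1)))   # arr1 as built in the question's loop
    fil1.write('\n')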
You can try appending the results to a string and then saving it to a file, as follows:
from array import array as pyarray
tmp1 = (x for x in range(10))
tmp2 = (x+10 for x in range(10))
arr1 = pyarray('l')
fileoutput_str = str(arr1)+'\n'
for i in range(10):
    val = next(tmp1) - next(tmp2)
    fileoutput_str += str(val)+'\n'
fileoutput_fn = 'fileoutput'
fileoutput_fo = open(fileoutput_fn, 'w')
fileoutput_fo.write(fileoutput_str)
fileoutput_fo.close()
You will have to remove the binary option b in order to write a string to the file.
I want to write something to a binary file using python.
I am simply doing:
import numpy as np
f = open('binary.file','wb')
i=4
j=5.55
f.write('i'+'j') #where do i specify that i is an integer and j is a double?
g = open('binary.file','rb')
first = np.fromfile(g,dtype=np.uint32,count = 1)
second = np.fromfile(g,dtype=np.float64,count = 1)
print first, second
The output is just:
[] []
I know it is very easy to do this in Matlab "fwrite(binary.file, i, 'int32');", but I want to do it in python.
You appear to have some confusion about types in Python.
The expression 'i' + 'j' is adding two strings together. This results in the string ij, which is most likely written to the file as two bytes.
The variable i is already an int. You can write it to a file as a 4-byte integer in a couple of different ways (which also apply to the float j):
Use the struct module as detailed in how to write integer number in particular no of bytes in python (file writing). Something like this:
import struct
with open('binary.file', 'wb') as f:
    f.write(struct.pack("i", i))
You would use the 'd' specifier to write j.
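For example, a short sketch (mirroring the struct example above) that writes both values so they line up with the 4-byte + 8-byte reads in the question:

import struct

i = 4
j = 5.55
with open('binary.file', 'wb') as f:
    f.write(struct.pack("i", i))   # 4-byte int, read back as np.uint32
    f.write(struct.pack("d", j))   # 8-byte double, read back as np.float64
    # or in a single call with standard sizes and no padding:
    # f.write(struct.pack("=id", i, j))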
Use the numpy module to do the writing for you, which is especially convenient since you are already using it to read the file. The method ndarray.tofile is made just for this purpose:
i = 4
j = 5.55
with open('binary.file', 'wb') as f:
    np.array(i, dtype=np.uint32).tofile(f)
    np.array(j, dtype=np.float64).tofile(f)
Note that in both cases I use open as a context manager when writing the file with a with block. This ensures that the file is closed, even if an error occurs during writing.
That's because you are trying to write a string (edited) into a binary file. You also don't close the file before trying to read it again.
If you want to write ints or strings to a binary file, try adding the code below:
import numpy as np
import struct
f = open('binary.file','wb')
i = 4
if isinstance(i, int):
    f.write(struct.pack('i', i)) # write an int
elif isinstance(i, str):
    f.write(i) # write a string
else:
    raise TypeError('Can only write str or int')
f.close()
g = open('binary.file','rb')
first = np.fromfile(g,dtype=np.uint32,count = 1)
second = np.fromfile(g,dtype=np.float64,count = 1)
print first, second
I'll leave it to you to figure out the floating-point number.
print first, second
[4] []
The more Pythonic way, using context managers to handle the files:
import numpy as np
import struct
with open('binary.file','wb') as f:
    i = 4
    if isinstance(i, int):
        f.write(struct.pack('i', i)) # write an int
    elif isinstance(i, str):
        f.write(i) # write a string
    else:
        raise TypeError('Can only write str or int')

with open('binary.file','rb') as g:
    first = np.fromfile(g,dtype=np.uint32,count = 1)
    second = np.fromfile(g,dtype=np.float64,count = 1)
print first, second
I'm still pretty new to python. I've been able to write a program that will read in a file from binary and stores the data that's there in a few arrays. Now that I've been able to complete a few other tasks with this program, I'm trying to go back through all my code and see where I can make it more efficient, learning Python better along the way. In particular, I'm trying to update the reading and storing of data from the file. Using numpy's fromfile is MUCH, MUCH faster at unpacking data than the struct.unpack method, and works wonderfully for a 1D array structure. However, I have some of the data stored in 2D arrays. I am seemingly stuck on how to implement the same type of storing in the 2D array. Does anyone have any ideas or hints as to how I may be able to perform this?
My basic program structure is as follows:
from numpy import fromfile
import numpy as np
file = open(theFilePath,'rb')
####### File Header #########
reservedParse = 4
fileHeaderBytes = 4 + int(131072/reservedParse) #Parsing out the bins in the file header
fileHeaderArray = np.zeros(fileHeaderBytes)
fileHeaderArray[0] = fromfile(file, dtype='<I', count=1) #File Index; 4 Bytes
fileHeaderArray[1] = fromfile(file, dtype='<I', count=1) #The Number of Packets; 4 bytes
fileHeaderArray[2] = fromfile(file, dtype='<Q', count=1) #Timestamp; 16 bytes; 2, 8-byte.
fileHeaderArray[3] = fromfile(file, dtype='<Q', count=1)
fileHeaderArray[4:] = fromfile(file, dtype='<I', count=int(131072/reservedParse)) #Empty header space
####### Data Packets #########
#Note: Each packet begins with a header containing information about the data stream followed by the data.
packets = int(fileHeaderArray[1]) #The number of packets in the data stream
dataLength = int(28672)
packHeader = np.zeros(14*packets).reshape((packets,14))
data = np.zeros(dataLength*packets).reshape((packets,dataLength))
for i in range(packets):
    packHeader[i][0] = fromfile(file, dtype='>H', count=1) #Model Num
    packHeader[i][1] = fromfile(file, dtype='>H', count=1) #Packet ID
    packHeader[i][2] = fromfile(file, dtype='>I', count=1) #Coarse Timestamp
    ....#Continuing on
    packHeader[i][13] = fromfile(file, dtype='>i', count=1) #4 bytes of reserved space
    data[i] = fromfile(file, dtype='<h', count=dataLength) #Actual data
Essentially this is what I have right now. Is there a way I can do this without doing the loop? Going through that loop does not seem particularly fast or numpy-ish.
For reference, the for-loop structure using unpack and not numpy is:
packHeader = [[0 for x in range(14)] for y in range(packets)]
data = [[0 for x in range(dataLength)] for y in range(packets)]
for i in range(packets):
    packHeader[i][0] = unpack('>H', file.read(2)) #Model Num
    packHeader[i][1] = unpack('>H', file.read(2)) #Packet ID
    packHeader[i][2] = unpack('>I', file.read(4)) #Coarse Timestamp
    ....#Continuing on
    packHeader[i][13] = unpack('>i', file.read(4)) #4 bytes of reserved space
    packHeader[i] = list(chain(*packHeader[i])) #Deals with the tuple issue ((x,),(y,),...) -> (x,y,...)
    data[i] = [unpack('<h', file.read(2)) for j in range(dataLength)] #Actual data
    data[i] = list(chain(*data[i])) #Deals with the tuple issue ((x,),(y,),...) -> (x,y,...)
Let me know if any clarification is needed.
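One possible direction for avoiding the per-packet loop (a sketch only: the elided header fields are represented by a placeholder comment and the field names are made up here, so the real layout would have to be filled in) is a NumPy structured dtype describing one whole packet, so a single np.fromfile call reads every packet at once and the 2D data block comes out as a field:

import numpy as np

dataLength = 28672

# Placeholder packet layout: only the fields shown in the question are listed;
# the remaining header entries must be added in their real order with real types.
packet_dtype = np.dtype([
    ('model_num', '>u2'),                  # >H  Model Num
    ('packet_id', '>u2'),                  # >H  Packet ID
    ('coarse_ts', '>u4'),                  # >I  Coarse Timestamp
    # ... the elided header fields go here ...
    ('reserved',  '>i4'),                  # >i  4 bytes of reserved space
    ('data',      '<i2', (dataLength,)),   # <h  actual data samples
])

# After the file-header reads in the question, `file` is positioned at the
# first packet, so (with `packets` taken from the header) everything reads at once:
packets_arr = np.fromfile(file, dtype=packet_dtype, count=packets)
data = packets_arr['data']                 # 2D array, shape (packets, dataLength)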