I want to write some random numbers into an ASCII output file.
I generate the numbers with numpy, so they are stored in numpy arrays:
import numpy as np
random1 = np.random.uniform(-1.2, 1.2, int(7e6))
random2 = ...
random3 = ...
All three arrays are of the same size.
I used standard file output, but this is really slow: only about 8000 lines per 30 minutes, though this may just be because I loop over three large arrays:
fout1 = open("output.dat","w")
for i in range(len(random1)):
    fout1.write(str(random1[i]) + "\t" + str(random2[i]) + "\t" + str(random3[i]) + "\n")
fout1.close()
I also just used print str(random1[i]) + "\t" + str(random2[i]) + "\t" + str(random3[i]) and dumped everything into a file from the shell using ./myprog.py > output.dat, which seems a bit faster, but I am still not satisfied with the output speed.
Any recommendations are really welcome.
Have you tried
random = np.vstack((random1, random2, random3)).T
np.savetxt("output.dat", random, delimiter="\t")
I'm guessing the disk I/O is the most expensive operation you are doing. You could try to create your own buffer to deal with this: instead of writing every line on every loop iteration, buffer up, say, 100 lines and write them in one big block. Then experiment with this and see what the most beneficial buffer size is.
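A rough sketch of that idea, reusing the three arrays from the question (the buffer size is just a starting point to tune):

buffer_size = 1000  # number of lines to collect before each write; worth tuning
lines = []
with open("output.dat", "w") as fout:
    for i in range(len(random1)):
        lines.append("%s\t%s\t%s\n" % (random1[i], random2[i], random3[i]))
        if len(lines) == buffer_size:
            fout.write("".join(lines))  # write one big block at once
            lines = []
    if lines:  # flush whatever is left over
        fout.write("".join(lines))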
I'm doing a convolutional neural network classification and all my training tiles (1000 of them) are in GeoTIFF format. I need to get all of them into a numpy array, but I have only found code that will do it for one TIFF file at a time.
Is there a way to convert a whole folder of tiff files at once?
Thanks!
Try using a for loop to go through your folder
Is your goal to get them into 1000 different numpy arrays, or to 1 numpy array? If you want the latter, it might be easiest to merge them all into a larger .tiff file, then use the code you have to process it.
If you want to get them into 1000 different arrays, this reads through a directory, uses your code to make a numpy array, and sticks them in a list.
import os

arrays_from_files = []
os.chdir("your-folder")
for name in os.listdir():
    if os.path.isfile(name):
        nparr = ...  # use your code here
        arrays_from_files.append(nparr)
It might be a good idea to use a dictionary and map filenames to arrays to make debugging easier down the road.
import os

arrays_by_filename = {}
os.chdir("your-folder")
for name in os.listdir():
    if os.path.isfile(name):
        nparr = ...  # use your code here
        arrays_by_filename[name] = nparr
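If the end goal is one big array rather than 1000 separate ones, and every tile has the same shape, the per-file arrays can then be stacked. A sketch, assuming the arrays_by_filename dict from above and equally sized tiles:

import numpy as np

# stacks along a new leading axis, e.g. (n_tiles, height, width, bands)
all_tiles = np.stack(list(arrays_by_filename.values()))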
I am experimenting with a 3-dimensional zarr-array, stored on disk:
Name: /data
Type: zarr.core.Array
Data type: int16
Shape: (102174, 1100, 900)
Chunk shape: (12, 220, 180)
Order: C
Read-only: True
Compressor: Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Store type: zarr.storage.DirectoryStore
No. bytes: 202304520000 (188.4G)
No. bytes stored: 12224487305 (11.4G)
Storage ratio: 16.5
Chunks initialized: 212875/212875
As I understand it, zarr arrays can also reside in memory - compressed, as if they were on disk. So I thought, why not try to load the entire thing into RAM on a machine with 32 GB of memory? Compressed, the dataset would require approximately 50% of the available RAM. Uncompressed, it would require about 6 times more RAM than is available.
Preparation:
import os
import zarr
from numcodecs import Blosc
import tqdm
zpath = '...' # path to zarr data folder
disk_array = zarr.open(zpath, mode = 'r')['data']
c = Blosc(cname = 'zstd', clevel=3, shuffle = Blosc.BITSHUFFLE)
memory_array = zarr.zeros(
    disk_array.shape, chunks = disk_array.chunks,
    dtype = disk_array.dtype, compressor = c
)
The following experiment fails almost immediately with an out of memory error:
memory_array[:, :, :] = disk_array[:, :, :]
As I understand it, disk_array[:, :, :] will try to create an uncompressed, full-size numpy array, which will obviously fail.
Second attempt, which works but is agonizingly slow:
chunk_lines = disk_array.chunks[0]
chunk_number = disk_array.shape[0] // disk_array.chunks[0]
chunk_remain = disk_array.shape[0] % disk_array.chunks[0] # unhandled ...
for chunk in tqdm.trange(chunk_number):
    chunk_slice = slice(chunk * chunk_lines, (chunk + 1) * chunk_lines)
    memory_array[chunk_slice, :, :] = disk_array[chunk_slice, :, :]
Here, I am trying to read a certain number of chunks at a time and put them into my in-memory array. It works, but it is about 6 to 7 times slower than what it took to write this thing to disk in the first place. EDIT: Yes, it's still slow, but the factor of 6 to 7 turned out to be due to a disk issue.
What's an intelligent and fast way of achieving this? I'd guess that, besides not using the right approach, my chunks might also be too small - but I am not sure.
EDIT: Shape, chunk size and compression are supposed to be identical for the on-disk array and the in-memory array. It should therefore be possible to eliminate the decompress-compress procedure in my example above.
I found zarr.convenience.copy but it is marked as an experimental feature, subject to further change.
Related issue on GitHub
You could conceivably try fsspec.implementations.memory.MemoryFileSystem, which has a .get_mapper() method with which you can make the kind of mapping object expected by zarr.
However, this is really just a dict of path:io.BytesIO, which you could make yourself, if you want.
There are a couple of ways one might solve this issue today.
Use LRUStoreCache to cache (some) compressed data in memory.
Coerce your underlying store into a dict and use that as your store.
The first option might be appropriate if you only want some frequently used data in memory. Of course, how much you load into memory is something you can configure, so this could be the whole array. Caching only happens as data is accessed on demand, which may be useful for you.
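A minimal sketch of the first option, assuming the zarr v2 API (zarr.DirectoryStore, zarr.LRUStoreCache) and the zpath / 'data' layout from the question:

import zarr

store = zarr.DirectoryStore(zpath)
# keep up to ~16 GB of compressed chunks in an in-memory LRU cache (adjust to taste)
cache = zarr.LRUStoreCache(store, max_size=16 * 2**30)
cached_array = zarr.open(cache, mode='r')['data']
block = cached_array[:12, :, :]  # chunks are read from disk once, then served from RAM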
The second option simply creates a new in-memory copy of the array by pulling all of the compressed data from disk. The one downside is that if you intend to write back to disk, you will need to do that manually, but it is not too difficult. The update method is pretty handy for facilitating this copying of data between different stores.
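A sketch of the second option under the same zarr v2 assumptions; dict.update() pulls the compressed chunk blobs straight out of the underlying store without decompressing them:

import zarr

disk_store = zarr.DirectoryStore(zpath)
mem_store = {}
mem_store.update(disk_store)  # copy compressed chunks (and metadata) into RAM
memory_array = zarr.open(mem_store, mode='r')['data']

Writing back to disk would then mean copying the updated entries from mem_store into a DirectoryStore again, by the same update mechanism.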
I'm trying to write an algorithm that will save the filename and the 3-channel np.array stored in each file to a CSV (or similar file type), and then be able to read the CSV back in and reproduce the color image.
The format of my csv should look like this:
Filename RGB
0 foo.png np.array # the shape is 100*100*3
1 bar.png np.array
2 ... ...
As it stands, I'm iterating through each file saved in a directory and appending to lists that later get stored in pandas DataFrames:
import os
import cv2
import pandas

df1 = pandas.DataFrame()
df2 = pandas.DataFrame()
directory = r'C:/my Directory'
fileList = os.listdir(directory)
filenameList = []
RGBList = []
for eachFile in fileList:
    filenameList.append(eachFile)
    RGBList.append(cv2.imread(os.path.join(directory, eachFile), 1).tostring())
df1["Filenames"] = filenameList
df2["RGB"] = RGBList
df1.to_csv('df1.csv')
df2.to_csv('df2.csv')
df1 functions as desired. I THINK df2 functions as intended: a print statement reveals the expected length of 30,000 for each row. However, when I read the CSV back in using pandas.read_csv('df2.csv') and use a print statement to view the length of the first row, I get 110541. I intend to use np.fromstring() and np.reshape() to rebuild the flattened np.array generated by tostring(), but I get the error:
ValueError: string size must be a multiple of element size
...because the number of elements is mismatched.
My question is:
Why is the length so much larger when I read the CSV back in?
Is there a more efficient way to write 3-channel color image pixel data to a CSV that can easily be read back in?
If you write a single byte for each 8-bit pixel you will get a line with 1 byte per pixel. So, if your image is 80 pixels wide, you will get 80 bytes per line.
If you write a CSV, in human-readable ASCII, you will need more space. Imagine the first pixel is 186. So, you will write a 1, an 8, a 6 and a comma - i.e. 4 bytes now for the first pixel instead of a single byte in binary, and so on.
That means your file will be around 3-4x bigger, i.e. 110k instead of 30k, which is what you are seeing.
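A quick way to see the difference on a single pixel (just an illustration):

import numpy as np

pixel = np.array([186, 7, 255], dtype=np.uint8)
print(len(pixel.tobytes()))            # 3 bytes in binary
print(len(','.join(map(str, pixel))))  # 9 characters as CSV text: '186,7,255'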
There is no "better way" to write a CSV - the problem is that it is a fundamentally inefficient format, designed for humans rather than computers. Why did you choose CSV? If it has to be legible to humans, you have no choice.
If it can be illegible to humans, but readily legible to computers, choose a different format, such as np.save() and np.load() - as you wisely have done already ;-)
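For example, a minimal binary round trip might look like this (the file name and the random stand-in image are placeholders):

import numpy as np

img = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)  # stand-in for a 100x100 RGB image
np.save('foo.npy', img)        # binary .npy file; dtype and shape are stored with the data
restored = np.load('foo.npy')
assert restored.shape == (100, 100, 3) and (restored == img).all()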
Let's say I have some big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:
A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000,162))
Now let's say I want to iterate over this matrix (not necessarily in an ordered fashion) such that each row is accessed exactly once.
p = some_permutation_of_0_to_2999999()
I would like to do something like this:
start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start + num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start + num_rows_to_load_at_once)
As this process goes on, my computer becomes slower and slower, and my RAM and virtual memory usage explode.
Is there some way to force np.memmap to use only up to a certain amount of memory? (I know I won't need more than the number of rows I'm planning to read at a time, and that caching won't really help me since I'm accessing each row exactly once.)
Alternatively, is there some other way to iterate (generator-like) over a np array in a custom order? I could write it manually using file.seek, but that happens to be much slower than the np.memmap implementation.
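For reference, the manual file.seek version I mean looks roughly like this (row_index is a placeholder for one entry of p):

import numpy as np

row_bytes = 162 * 4  # one row = 162 float32 values
with open(filename, 'rb') as f:
    f.seek(row_index * row_bytes)
    row = np.frombuffer(f.read(row_bytes), dtype='float32')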
do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.
thanks
This has been an issue that I've been trying to deal with for a while. I work with large image datasets and numpy.memmap offers a convenient solution for working with these large sets.
However, as you've pointed out, if I need to access each frame (or row in your case) to perform some operation, RAM usage will max out eventually.
Fortunately, I recently found a solution that will allow you to iterate through the entire memmap array while capping the RAM usage.
Solution:
import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000, 800, 800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000, 800, 800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # create an empty array to hold each chunk
    # the size of this array will determine the amount of RAM usage
    holder = np.zeros([chunk_size, 800, 800], dtype='uint16')

    # iterate through the input in chunks, apply the operation, and write to output
    for i in range(input.shape[0]):
        if i % chunk_size == 0:
            holder[:] = input[i:i+chunk_size]   # read in chunk from input
            holder += 5                         # perform some operation
            output[i:i+chunk_size] = holder     # write chunk to output

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5
Timing Results:
In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop
In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop
The size of the array on disk is ~12 GB. Using the iterate_efficiently function keeps memory usage at 1.28 GB, whereas the iterate_inefficiently function eventually reaches 12 GB of RAM.
This was tested on macOS.
I've been experimenting with this problem for a couple of days now, and it appears there are two ways to control memory consumption with np.memmap. The first is reliable, while the second requires some testing and will be OS-dependent.
Option 1 - reconstruct the memory map with each read / write:
def MoveMMapNPArray(data, output_filename):
    CHUNK_SIZE = 4096
    for idx in range(0, data.shape[1], CHUNK_SIZE):
        x = np.memmap(data.filename, dtype=data.dtype, mode='r', shape=data.shape, order='F')
        y = np.memmap(output_filename, dtype=data.dtype, mode='r+', shape=data.shape, order='F')
        end = min(idx + CHUNK_SIZE, data.shape[1])
        y[:, idx:end] = x[:, idx:end]
Here data is of type np.memmap. Discarding the memmap object with each read keeps the array from being collected into memory and keeps memory consumption very low as long as the chunk size is small. It likely introduces some CPU overhead, but this was found to be small on my setup (macOS).
Option 2 - construct the mmap buffer yourself and provide memory advice
If you look at the np.memmap source code here, you can see that it is relatively simple to create your own memmapped numpy array. Specifically, it boils down to this snippet:
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap_np_array = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm, offset=array_offset, order=order)
Note that this Python mmap instance is stored as the np.memmap's private _mmap attribute.
With access to the Python mmap object, and Python 3.8+, you can use its madvise method, described here.
This allows you to advise the OS to free memory where available. The various madvise constants are described here for Linux, with some generic cross-platform options specified.
The MADV_DONTDUMP constant looks promising, but I haven't tested memory consumption with it like I have for option 1.
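A sketch of what calling madvise through the private _mmap attribute could look like, reusing the array from the question (Python 3.8+; which constants exist and what they actually do is platform-dependent, and I use MADV_DONTNEED here purely as an example of asking the OS to drop pages, so treat it as untested):

import mmap
import numpy as np

A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000, 162))
chunk = np.array(A[0:4096, :])       # copy a block out into ordinary memory
# the underlying mmap object is the private _mmap attribute mentioned above
A._mmap.madvise(mmap.MADV_DONTNEED)  # advise the OS that the mapped pages can be dropped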
I am trying to use np.fromfile to read a binary file that I have written with Fortran using direct access. However, if I set count=-1 instead of count=max_items, np.fromfile returns a larger array than expected, appending zeros to the vector I wrote in binary.
Fortran test code:
program testing
  implicit none
  integer*4 :: i
  open(1, access='DIRECT', recl=20, file='mytest', form='unformatted', convert='big_endian')
  write(1, rec=1) (i, i=1,20)
  close(1)
end program
How I am using np.fromfile:
import numpy as np
f = open('mytest', 'rb')
f.seek(0)
x = np.fromfile(f, '>i4', count=20)
print len(x), x
So if I use it like this, it returns exactly my [1,...,20] array, but setting count=-1 returns [1,...,20,0,0,0,0,0] with a size of 1600.
I am using a little-endian machine (this shouldn't affect anything) and I am compiling the Fortran code with ifort.
I am just curious about the reason this happens, so I can avoid any surprises in the future.