This might be a stupid question but I'm unable to find a proper answer to it. I want to store (don't ask why) a binary representation of a (2000, 2000, 2000) array of zeros to disk, in binary format. The traditional approach would be:
with open('myfile', 'wb') as f:
    f.write(b'\0' * 4 * 2000 * 2000 * 2000)  # 4 bytes = float32
But that would imply creating a very large string which is not necessary at all. I know of two other options:
Iterate over the elements and store one byte at a time (extremely slow)
Create a numpy array and flush it to disk (as memory expensive as the string creation in the example above)
I was hoping to find something like write(char, ntimes) (as it exists in C and other languages) to write char to disk ntimes at C speed, not at Python-loop speed, without having to create such a big array in memory.
I don't know why you are making such a fuss about "Python's loop speed", but writing in the way of
for i in range(2000 * 2000):
    f.write(b'\0' * 4 * 2000)  # 4 bytes = float32
will tell the OS to write 8000 zero bytes on each call. After write() returns, it is called again in the next loop iteration.
It might be that the loop is executed slightly slower than it would be in C, but that definitely won't make a difference.
If it is OK to have a sparse file, you can also seek to the position of the desired file size and then truncate the file there, as sketched below.
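A minimal sketch of that approach (the filename is just a placeholder); on filesystems with sparse-file support this allocates almost no actual disk blocks:
import os

size = 4 * 2000 * 2000 * 2000  # bytes for a float32 array of shape (2000, 2000, 2000)
with open('myfile', 'wb') as f:
    f.seek(size)    # move the file position past the (currently empty) end
    f.truncate()    # resize the file to the current position; the unwritten bytes read back as zeros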
This would be a valid answer to fill a file from Python using numpy's memmap:
import numpy as np

shape = (2000, 2000, 2000)  # or just (2000 * 2000 * 2000,)
fp = np.memmap(filename, dtype='float32', mode='w+', shape=shape)
fp[...] = 0
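One follow-up note, assuming the fp from the snippet above: changes made through the memmap are only guaranteed to be on disk once it is flushed or released:
fp.flush()  # write any in-memory changes through to the file
del fp      # deleting the memmap object also flushes it before closing the underlying file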
To write zeroes, there's a nice trick: open the file for writing, seek to the desired size minus one, and write a single zero byte.
This pads everything before that byte with zeroes as well:
mega_size = 100000
with open("zeros.bin", "wb+") as f:
    f.seek(mega_size - 1)
    f.write(bytearray(1))
The original poster does not make clear why this really needs to be done in Python. So here is a little shell command which does the same thing, but probably slightly faster. Assuming the OP is on a Unix-like system (Linux/macOS):
Definitions: bs (blocksize) = 2000 * 2000, count = 4 * 2000
The if (input file) is the special zero-producing device /dev/zero. The of (output file) has to be specified as well.
dd bs=4000000 count=8000 if=/dev/zero of=/Users/me/thirtygig
On my computer (ssd) this takes about 110 seconds:
32000000000 bytes transferred in 108.435354 secs (295106705 bytes/sec)
You can always call this little shell command from Python (see the sketch below).
For comparison, glglgl's little Python script runs in 303 seconds on the same computer, so it is only about 3 times slower, which is probably fast enough.
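As mentioned above, the dd command can be invoked directly from Python; a minimal sketch using the standard subprocess module (the output path is just a placeholder):
import subprocess

subprocess.run(
    ['dd', 'bs=4000000', 'count=8000', 'if=/dev/zero', 'of=/tmp/thirtygig'],
    check=True,  # raise CalledProcessError if dd exits with a non-zero status
)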
I'm new to the HDF5 format, and I'm using h5py for these operations. I'm wondering about what exactly happens when I have a very large data set and perform some kind of operation on it. For instance:
>>> import numpy as np
>>> import h5py
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100000000,), dtype=np.float64)
>>> dset[...] = np.linspace(0, 100, 100000000)
>>> myResult = f["mydataset"][:] * 15
# Graph myResult, or something
Is the entirety of myResult in memory now? Was dset also in memory? Is there a way to only read from the disk into memory part by part, like 100,000 points at a time or so, to avoid overflow (let's assume I might end up with data even far larger than the example, say, 50GB worth)? Is there a way to do so efficiently, and in a modular fashion (such that simply saying data * 100 automatically does the operation in chunks, and stores the results again in a new place)? Thanks in advance.
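For reference, an h5py dataset only loads the slices you actually index into memory, so piecewise processing can be done with a loop like the following. This is a minimal sketch reusing the names from the question; the chunk size and the output dataset name "myresult" are arbitrary assumptions:
chunk = 100000  # number of elements to process per iteration
out = f.create_dataset("myresult", dset.shape, dtype=dset.dtype)  # hypothetical output dataset
for start in range(0, dset.shape[0], chunk):
    stop = min(start + chunk, dset.shape[0])
    out[start:stop] = dset[start:stop] * 15  # only this slice is read into (and written from) memory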
I'm dealing with huge numbers. I have to write them into a .txt file. Right now I have to write all the numbers between 1000000 and 1000000000 (1M-1B) into a .txt file. Since it throws a MemoryError if I do it in a single list, I sliced them into chunks (I don't like this solution but couldn't find any other).
The problem is, even with the first 50M numbers (1M-50M), I can't even open the .txt file. It's 458MB and took around 15 minutes, so I guess it'll be around a 9GB .txt file and more than 4 hours if I write all the numbers.
When I try to open the .txt file containing the numbers between 1M-50M, I get:
myfile.txt has stopped working
So right now the file contains the numbers between 1M-50M and I can't even open it; I guess if I write all the numbers it will be impossible to open.
I have to shuffle the numbers between 1M-1B and store them in a .txt file right now. Basically it's a freelance job and I'll have to deal with bigger numbers like 100B etc. Even the first 50M has this problem; I don't know how I'll finish when the numbers are bigger.
Here is the code for 1M-50M:
import random

x = 1000000
y = 10000000
while x < 50000001:
    nums = [a for a in range(x, x + y)]
    random.shuffle(nums)
    with open("nums.txt", "a+") as f:
        for z in nums:
            f.write(str(z) + "\n")
    x += 10000000
How can I speed up this process?
How can I open this .txt file, should I create a new file every time? If I choose this option I'll have to slice the numbers even more, since even 50M numbers is a problem.
Is there any module you can suggest that may be useful for this process?
Is there any module you can suggest that may be useful for this process?
Using Numpy is really helpful for working with large arrays.
How can I speed up this process?
Using Numpy's functions arange and tofile dramatically speeds up the process (see the code below). Generation of the initial array is about 50 times faster and writing the array to a file is about 7 times faster.
The code just performs each operation once (change number=1 to a higher value to get better accuracy) and only generates the numbers between 1M and 2M, but you can see the general picture.
import random
import timeit

import numpy

x = 10**6
y = 2 * 10**6

def list_rand():
    nums = [a for a in range(x, y)]
    random.shuffle(nums)
    return nums

def numpy_rand():
    nums = numpy.arange(x, y)
    numpy.random.shuffle(nums)
    return nums

def std_write(nums):
    with open('nums_std.txt', 'w') as f:
        for z in nums:
            f.write(str(z) + '\n')

def numpy_write(nums):
    with open('nums_numpy.txt', 'w') as f:
        nums.tofile(f, '\n')

print('list generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='list_rand()', setup='from __main__ import list_rand', number=1)))

print('numpy array generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_rand()', setup='from __main__ import numpy_rand', number=1)))

print('standard write [secs]')
nums = list_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='std_write(nums)', setup='from __main__ import std_write, nums', number=1)))

print('numpy write [secs]')
nums = numpy_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_write(nums)', setup='from __main__ import numpy_write, nums', number=1)))
list generation, random [secs]
1.3995
numpy array generation, random [secs]
0.0319
standard write [secs]
2.5745
numpy write [secs]
0.3622
How can I open this .txt file, should I create a new file every time? If I choose this option I'll have to slice the numbers even more, since even 50M numbers is a problem.
It really depends what you are trying to do with the numbers. Find their relative position? Delete one from the list? Restore the array?
I won't help you with the Python side, but if you need to shuffle a consecutive sequence, you can improve the shuffling algorithm. Make a bit array of 1E9 items; it would be about 125MB. Generate a random number; if it is not yet marked in the bit array, mark it there and write it to the file. Repeat until you have 99% of the numbers in the file.
Now convert the unused numbers in the bit array into an ordinary array; it would be about 80MB. Shuffle them and write them to the file.
You need about 200MB of memory for 1E9 items (and 8 minutes, written in C#). You should be able to shuffle 100E9 items in 20GB of RAM in less than a day.
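A minimal Python sketch of that idea, for a small range starting at 0 (the range size and the 99% cut-off are the assumptions described above; a bytearray serves as the bit array):
import random

n = 10**6                       # size of the consecutive range to shuffle (kept small for the sketch)
seen = bytearray((n + 7) // 8)  # bit array: one bit per number, all initially 0
written = 0

with open('shuffled.txt', 'w') as f:
    # Phase 1: rejection-sample random numbers until 99% of them are written.
    while written < n * 99 // 100:
        r = random.randrange(n)
        byte, bit = divmod(r, 8)
        if not seen[byte] & (1 << bit):
            seen[byte] |= 1 << bit
            f.write(str(r) + '\n')
            written += 1
    # Phase 2: collect the remaining ~1%, shuffle them in memory, and append.
    rest = [i for i in range(n) if not seen[i // 8] & (1 << (i % 8))]
    random.shuffle(rest)
    for r in rest:
        f.write(str(r) + '\n')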
I have 50 million rows of data like:
referring_id,referred_id
1000,1001
1000,1002
1001,1000
1001,1002
1002,1003
The goal is to find all the cases that share incoming connections, numerical examples should help:
If we want to calculate the measure for 1001, we can see that it has an incoming connection from 1000, so we look at who else has an incoming connection from 1000, and that is 1002.
So the result would be [1002].
For 1002 we can see that 1000 and 1001 refer to 1002, so we look at who else they refer to; the result is [1001, 1000] (1000 refers to 1001, 1001 refers to 1000).
If the data were smaller, I would just store, for every referring id, a set of outgoing connections, and then loop over the referred ids and take the union over all those that have incoming connections.
The problem is that this doesn't fit in memory.
I'm using csv to loop over the file and process the lines one at a time so as not to load it all into memory, even though I have 16GB of RAM.
Does anyone have an idea how to handle it?
You should give pandas a try. It uses NumPy arrays to store the data. This can help to save memory. For example, an integer has the size of 8 bytes instead of 24 in Python 2 or 28 in Python 3. If the numbers are small, you might be able to use np.int16 or np.int32 to reduce the size to 2 or 4 bytes per integer.
This solution seems to fit your description:
s = """referring_id,referred_id
1000,1001
1000,1002
1001,1000
1001,1002
1002,1003"""

import csv
import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(s), dtype=np.int16)
# use: df = pd.read_csv('data.csv', dtype=np.int16)

by_refered = df.groupby('referred_id')['referring_id'].apply(frozenset)
by_refering = df.groupby('referring_id')['referred_id'].apply(frozenset)

with open('connections.csv', 'w') as fobj:
    writer = csv.writer(fobj)
    writer.writerow(['id', 'connections'])
    for x in by_refered.index:
        tmp = set()
        for id_ in by_refered[x]:
            tmp.update(by_refering[id_])
        tmp.remove(x)
        writer.writerow([x] + list(tmp))
Content of connections.csv:
id,connections
1000,1002
1001,1002
1002,1000,1001
1003
Depending on your data you might get away with this. If there are many repeated connections, the number of sets and their size may be small enough. Otherwise, you would need to use some chunked approach.
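A minimal sketch of such a chunked approach for the grouping step, using pandas' chunksize option (the file name and chunk size are assumptions; merging the per-chunk groups like this still requires the final dict of sets to fit in memory):
import pandas as pd

by_refering = {}  # referring_id -> set of referred_id
for chunk in pd.read_csv('data.csv', chunksize=10**6):
    for referring, group in chunk.groupby('referring_id')['referred_id']:
        by_refering.setdefault(referring, set()).update(group)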
I'm collecting the system information of the current machine. Part of this information is the RAM and HDD capacity. Problem is that the capacity being gathered is measured in bytes rather than GB.
In a nutshell, how do I convert the display of the internal specifications to resemble what you would see from a consumer/commercial stand point?
1000GB HDD or 8GB RAM, as opposed to the exact number of bytes available, especially since manufacturers set aside different amounts of space for recovery sectors, RAM can be reserved for integrated graphics, there's the 1000 vs 1024 binary difference, etc. Here's an example of my current code:
import os
import wmi  # import native powershell functionality
import math
c = wmi.WMI()
SYSINFO = c.Win32_ComputerSystem()[0]  # Manufacturer/Model/Spec blob
RAMTOTAL = int(SYSINFO.TotalPhysicalMemory)  # gathers only the RAM capacity, in bytes
RAMROUNDED = math.ceil(RAMTOTAL / 2000000000.) * 2  # attempts to round bytes up to the nearest even GB
HDDTOTAL = int(HDDINFO.size)  # HDD capacity in bytes (HDDINFO comes from a WMI query not shown here, presumably something like c.Win32_DiskDrive()[0])
HDDROUNDED = math.ceil(HDDTOTAL / 2000000000.) * 2  # attempts to round bytes up to the nearest even GB
HDDPRNT = "HDD: " + str(HDDROUNDED) + "GB"
RAMPRNT = "RAM: " + str(RAMROUNDED) + "GB"
print(HDDPRNT)
print(RAMPRNT)
The area of interest is lines 8-11, where I'm rounding up to the nearest even number, since the internal size of RAM/HDD is always lower than advertised for the reasons mentioned previously. StackOverflow posts have gotten me to this method, which is the most accurate across the most machines, but it's still hard-coded: the HDD only rounds accurately for either hundreds of GB or thousands, not both. Also, the RAM isn't 100% accurate.
Here are a couple of workarounds that come to mind that would produce the results I'm looking for:
Adding additional commands to RAMTOTAL that may or may not be available, allowing for GB output instead of KB. However, I would prefer it to be a part of the WMI import instead of straight native Windows code.
Figure out a more static method of rounding, i.e.: if HDDTOTAL > 1TB, round up to decimal point X; else if HDDTOTAL < 1TB, use a different rounding method.
I think you could write a simple function that solves it. In case the number in kB is significantly smaller or greater, I added the possibility of different suffixes (it is inspired by a very similar example in the book Dive Into Python 3). It might look something like this:
def round(x):
    suffixes = ('kB', 'MB', 'GB', 'TB')
    a = 0
    while x > 1000:
        a += 1  # this will go up the suffixes tuple with each division
        x = x / 1000
    return math.ceil(x), suffixes[a]
Results of this function might look like this:
>>> print(round(19276246))
(20, 'GB')
>>> print(round(135565666656))
(136, 'TB')
>>> print(round(1355))
(2, 'MB')
and you could implement it in your code like this:
import os
import wmi  # import native powershell functionality
import math

def round(x):
    suffixes = ('kB', 'MB', 'GB', 'TB')
    a = 0
    while x > 1000:
        a += 1  # this will go up the suffixes tuple with each division
        x = x / 1000
    return math.ceil(x), suffixes[a]
.
.
.
RAMROUNDED = round(RAMTOTAL)  # rounds to the nearest whole unit and picks a suffix
HDDTOTAL = int(HDDINFO.size)  # gathers only the HDD capacity, in bytes
HDDROUNDED = round(HDDTOTAL)  # rounds to the nearest whole unit and picks a suffix
HDDPRNT = "HDD: " + str(HDDROUNDED[0]) + HDDROUNDED[1]
RAMPRNT = "RAM: " + str(RAMROUNDED[0]) + RAMROUNDED[1]
print(HDDPRNT)
print(RAMPRNT)
PowerShell has a lot of very powerful native math capabilities built in, allowing us to do things like divide by 1GB to get the whole number of gigabytes of a particular drive.
So, to see the total physical memory rounded to the nearest GB, this is how to do it:
get-wmiobject -Class Win32_ComputerSystem |
    select @{Name='Ram(GB)';Expression={[int]($_.TotalPhysicalMemory /1GB)}}
This method is called a calculated property. The way it differs from using a regular select statement (like Select TotalPhysicalMemory) is that I'm telling PowerShell to make a new property called Ram(GB) and use the following expression to determine its value.
[int]($_.TotalPhysicalMemory /1GB)
The expression I'm using begins in the parentheses, where I'm getting the TotalPhysicalMemory (which returns as 17080483840). I then divide by 1GB to give me 15.9074401855469. Finally, I apply [int] to cast the whole thing to an integer, that is to say, make it a whole number, rounding as appropriate.
Here is the output
Ram(GB)
-------
16
I used a combination of the two previous suggestions.
I used an if block rather than a while loop, but got the same results. I also mirrored the internal process of the suggested PowerShell commands to keep the script more native to Python and avoid adding modules/dependencies.
GBasMB = int(1000000000)  # bytes per GB, for bytes-to-GB conversion

global RAMSTRROUNDED
RAMTOTAL = int(SYSINFO.TotalPhysicalMemory) / GBasMB  # RAM capacity in GB
RAMROUNDED = math.ceil(RAMTOTAL / 2.) * 2  # rounds up to the nearest even whole number
RAMSTRROUNDED = int(RAMROUNDED)  # final converted variable

HDDTOTAL = int(HDDINFO.size) / GBasMB  # same process for the hard drive capacity
HDDROUNDED = math.ceil(HDDTOTAL / 2.) * 2  # round up to the nearest even whole number

def ROUNDHDDTBORGB():  # function for determining TB- or GB-sized HDD
    global HDDTBORGBOUTPUT
    global HDDPRNT
    if HDDROUNDED >= 1000:  # if equal to or greater than 1000GB, list as TB
        HDDTBORGB = HDDROUNDED * .001
        HDDTBORGBOUTPUT = str(HDDTBORGB) + "TB"
        HDDPRNT = "HDD: " + str(HDDTBORGBOUTPUT)
        print(HDDPRNT)
    elif HDDROUNDED < 1000:  # if less than 1000GB, list as GB
        HDDTBORGBOUTPUT = str(str(HDDROUNDED) + "GB")
        HDDPRNT = "HDD: " + str(HDDTBORGBOUTPUT)
I've run this script on several dozen computers and it seems to accurately gather the appropriate RAM and HDD capacities, regardless of how much RAM the integrated graphics decides to consume, reserved sectors on the HDD, etc.
I've got a file containing several channels of data. The file is sampled at a base rate, and each channel is sampled at that base rate divided by some number -- it seems to always be a power of 2, though I don't think that's important.
So, if I have channels a, b, and c, sampled at dividers of 1, 2, and 4, my stream will look like:
a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 a5 ...
For added fun, the channels can independently be floats or ints (though I know which for each one), and the data stream does not necessarily end on a power of 2: the example stream would be valid without further extension. The values are sometimes big- and sometimes little-endian, though I know what I'm dealing with up front.
I've got code that properly unpacks these and fills numpy arrays with the correct values, but it's slow: it looks something like (hope I'm not glossing over too much; just giving an idea of the algorithm):
for sample_num in range(total_samples):
    channels_to_sample = [ch for ch in all_channels if ch.samples_for(sample_num)]
    format_str = ...  # build format string from channels_to_sample
    data = struct.unpack(format_str, my_file.read(...))  # read and unpack the data
    # iterate over the data tuple and put values in channels_to_sample
    for val, ch in zip(data, channels_to_sample):
        ch.data[sample_num // ch.divider] = val
And it's slow -- a few seconds to read a 20MB file on my laptop. Profiler tells me I'm spending a bunch of time in Channel#samples_for() -- which makes sense; there's a bit of conditional logic there.
My brain feels like there's a way to do this in one fell swoop instead of nesting loops -- maybe using indexing tricks to read the bytes I want into each array? The idea of building one massive, insane format string also seems like a questionable road to go down.
Update
Thanks to those who responded. For what it's worth, the numpy indexing trick reduced the time required to read my test data from about 10 seconds to about 0.2 seconds, for a speedup of 50x.
The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.
First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then
N = lcm*sum(b_1/d_1+...+b_k/d_k)
will be the number of bytes after which the pattern of streams repeats. Once you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
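As a quick worked example with the channels from the question above (dividers 1, 2 and 4, all four-byte values): lcm = 4 and N = 4*(4/1 + 4/2 + 4/4) = 4*7 = 28, so the byte-to-stream assignment repeats every 28 bytes, covering exactly the samples a0..a3, b0, b1 and c0.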
You can now build the array of stream indices for the first N bytes by something similar to
stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)
Here, d is the sequence d_1, ..., d_k and b is the sequence b_1, ..., b_k.
Now you can do
data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:,stream_index == i].ravel() for i in range(k)]
You possibly need to pad the data a bit at the end to make the reshape() work.
Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be interpreted as big-endian integers, simply write
streams[0].dtype = ">i"
This won't change the data in the array in any way, just the way it is interpreted.
This may look a bit cryptic, but should be much better performance-wise.
Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:
for (chan, sample_data) in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)
To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.
for i in itertools.count():
    for chan in channels:
        if i % chan.period == 0:
            yield chan
The grouper() recipe along with itertools.izip() should be of some help here.
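For reference, the grouper() recipe from the itertools documentation looks roughly like this (shown here for Python 3, where itertools.izip is just the built-in zip and izip_longest is zip_longest):
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)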