I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
import struct

def readmyfile(filename, bytes=2, endian='>h'):
    totalBytes = os.path.getsize(filename)
    values = empty(totalBytes // bytes)
    with open(filename, 'rb') as f:
        for i in range(len(values)):
            values[i] = struct.unpack(endian, f.read(bytes))[0]
    return values

filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
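For the file in the question, that would be a single call along these lines (a minimal sketch; 'filename' stands in for your actual path):

import numpy as np

# '>i2' is a big-endian signed 16-bit integer; fromfile reads the whole
# file into one array in a single buffered operation.
values = np.fromfile('filename', dtype='>i2')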
I would read directly until EOF (which means checking for an empty string coming back), removing the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, not least because I am not sure a list is the most efficient way to store all that data.
What about using NumPy's array and its facilities for reading/writing binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'être is exactly the handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time:
values[i] = struct.unpack(endian, f.read(bytes))[0]
Why don't you read, say, 1024 bytes at a time?
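A chunked version of the loop from the question might look like this (a sketch, assuming the file length is an exact multiple of 2 bytes):

import struct

def read_chunked(filename, chunk_size=1024):
    values = []
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Unpack all the shorts in this chunk with one call, e.g. '>512h'.
            count = len(chunk) // 2
            values.extend(struct.unpack('>%dh' % count, chunk))
    return values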
I have had the same kind of problem, although in my particular case I had to convert a very strange binary format (a 500 MB file) with interlaced blocks of 166 elements that were 3-byte signed integers; so I also had the problem of converting from 24-bit to 32-bit signed integers, which slows things down a little.
I've resolved it using NumPy's memmap (which is just a handy way of using Python's memmap) and struct.unpack on large chunks of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
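That code wasn't posted; a minimal sketch of the general memmap idea, applied to the simpler big-endian 16-bit layout from the question rather than the 24-bit format described above, would be something like:

import numpy as np

# The file is mapped, not read: slices are pulled off disk lazily,
# so you can process a large file one big chunk at a time.
data = np.memmap('filename', dtype='>i2', mode='r')
chunk = data[:1000000]  # first million shorts; process, then take the next slice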
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2) with f being a biggish file are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (disc sectors are usually several KB) off disc into memory, because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory, ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the one-value-at-a-time conversions to a short and the single-element assignments into the numpy array. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call to struct.unpack_from to convert everything in a buffer, versus one short at a time.
Consider:
#!/usr/bin/python
import random
import os
import struct
import numpy

def read_wopper(filename, bytes=2, endian='>h'):
    buf_size = 1024 * 2
    new_buf = []
    with open(filename, 'rb') as f:
        while True:
            st = f.read(buf_size)
            l = len(st)
            if l == 0:
                break
            # Build a format like '>1024h' and unpack the whole buffer at once.
            fmt = endian[0] + str(l // bytes) + endian[1]
            new_buf += struct.unpack_from(fmt, st)
    na = numpy.array(new_buf)
    return na
fn = 'bigintfile'

def createmyfile(filename):
    bytes = 165924350
    endian = '>h'
    f = open(filename, "wb")
    count = 0
    try:
        for i in range(0, bytes // 2):
            # The first 32,767 values are [0, 1, 2, ..., 0x7FFF]
            # to allow testing the read values with new_buf[value < 0x7FFF]
            value = count if count < 0x7FFF else random.randint(-32767, 32767)
            count += 1
            f.write(struct.pack(endian, value & 0x7FFF))
    except IOError:
        print "file error"
    finally:
        f.close()

if not os.path.exists(fn):
    print "creating file, don't count this..."
    createmyfile(fn)
else:
    read_wopper(fn)
    print "Done!"
I created a file of random signed shorts, 165,924,350 bytes (158.24 MB), which corresponds to 82,962,175 signed 2-byte shorts. With this file, I ran the read_wopper function above, and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, Python 2.6.1 64-bit, 2.93 GHz Core i7, 8 GB RAM. If you change buf_size=1024*2 in read_wopper to buf_size=2**16, the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottleneck, I think, is the one-short-at-a-time calls to unpack -- not your 2-byte reads from disc. You might want to make sure that your data files are not fragmented and, if you are using OS X, that your free disc space is not fragmented.
Edit: I posted the full code to create and then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create, since the creation is one short at a time.
Caveat
This is NOT a duplicate of this. I'm not interested in finding out my memory consumption or the like, as I'm already doing that below. The question is WHY the memory consumption is like this.
Also, even if I did need a way to profile my memory, note that guppy (the suggested Python memory profiler in the aforementioned link) does not support Python 3, and the alternative guppy3 does not give accurate results whatsoever, yielding results such as (see actual sizes below):
Partition of a set of 45968 objects. Total size = 5579934 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 13378 29 1225991 22 1225991 22 str
1 11483 25 843360 15 2069351 37 tuple
2 2974 6 429896 8 2499247 45 types.CodeType
Background
Right, so I have this simple script which I'm using to do some RAM consumption tests, by reading a file in 2 different ways:
reading a file one line at a time, processing it, and discarding it (via generators), which is efficient and recommended for basically any file size, especially large files; this works as expected and is sketched just after this list.
reading a whole file into memory (I know this is advised against, however this was just for educational purposes).
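For reference, the line-at-a-time pattern from the first point is just this (a minimal sketch; process is a placeholder for whatever per-line work you do):

with open('errors.log') as file_handle:
    for line in file_handle:  # the file object is a lazy iterator
        process(line)         # placeholder: only one line is held in memory at a time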
Test script
import os
import psutil
import time

with open('errors.log') as file_handle:
    statistics = os.stat('errors.log')  # see below for the contents of this file
    file_size = statistics.st_size / 1024 ** 2
    process = psutil.Process(os.getpid())
    ram_usage_before = process.memory_info().rss / 1024 ** 2
    print(f'File size: {file_size} MB')
    print(f'RAM usage before opening the file: {ram_usage_before} MB')

    file_handle.read()  # loading the whole file into memory

    ram_usage_after = process.memory_info().rss / 1024 ** 2
    print(f'Expected RAM usage after loading the file: {file_size + ram_usage_before} MB')
    print(f'Actual RAM usage after loading the file: {ram_usage_after} MB')
    # time.sleep(30)
Output
File size: 111.75 MB
RAM usage before opening the file: 8.67578125 MB
Expected RAM usage after loading the file: 120.42578125 MB
Actual RAM usage after loading the file: 343.2109375 MB
I also added a 30-second sleep to check with awk at the OS level, where I've used the following command:
ps aux | awk '{print $6/1024 " MB\t\t" $11}' | sort -n
which yields:
...
343.176 MB python # my script
619.883 MB /Applications/PyCharm.app/Contents/MacOS/pycharm
2277.09 MB com.docker.hyperkit
The file contains about 800K copies of the following line:
[2019-09-22 16:50:17,236] ERROR in views, line 62: 404 Not Found: The
following URL: http://localhost:5000/favicon.ico was not found on the
server.
Is it because of block sizes or dynamic allocation, whereby the contents would be loaded in blocks and a lot of that memory would actually be unused?
When you're opening a file in Python, by default you're opening it in text mode. That means that the binary data is decoded based on operating-system defaults or explicitly given codecs.
Like all data, textual data is represented by bytes in your computer. Most of the English alphabet is representable in a single byte, e.g. the letter "A" is usually translated to the number 65, or in binary: 01000001. This encoding (ASCII) is good enough for many cases, but when you want to write text in languages like Romanian, it is already not enough, because the characters ă, ţ, etc. are not part of ASCII.
For a while, people used different encodings per language (group), e.g. the Latin-x group of encodings (ISO-8859-x) for languages based on the Latin alphabet, and other encodings for other (especially CJK) languages.
If you want to represent some Asian languages, or several different languages, you'll need encodings that encode one character to multiple bytes. That can either be a fixed number (e.g. in UTF-32 and UTF-16) or a variable number, like in the most common "popular" encoding today, UTF-8.
Back to Python: The Python string interface promises many properties, among them random access in O(1) complexity, meaning you can get the 1245th character even from a very long string very quickly. This clashes with the compact UTF-8 encoding: Because one "character" (really: one unicode codepoint) is sometimes one and sometimes several bytes long, Python couldn't just jump to the memory address start_of_string + length_of_one_character * offset, as the length_of_one_character varies in UTF-8. Python therefore needs to use a fixed-bytelength encoding.
For optimization reasons it doesn't always use UCS-4 (~UTF-32), because that will waste lots of space when the text is ASCII-only. Instead, Python dynamically chooses either Latin-1, UCS-2, or UCS-4 to store strings internally.
To bring everything together with an example:
Say you want to store the string "soluţie" in memory, from a file encoded as UTF-8. Since the letter ţ needs two bytes to be represented, Python chooses UCS-2:
characters | s | o | l | u | ţ | i | e
utf-8 |0x73 |0x6f |0x6c |0x75 |0xc5 0xa3|0x69 |0x65
ucs-2 |0x00 0x73|0x00 0x6f|0x00 0x6c|0x00 0x75|0x01 0x63|0x00 0x69|0x00 0x65
As you can see, UTF-8 (file on disk) needs 8 bytes, whereas UCS-2 needs 14.
Add to this the overhead of a Python string and the Python interpreter itself, and your calculations make sense again.
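You can watch this happen with sys.getsizeof (the exact byte counts vary between CPython versions, so treat these as illustrative):

import sys

print(sys.getsizeof('solutie'))  # ASCII only: stored at 1 byte per character
print(sys.getsizeof('soluţie'))  # contains ţ: stored as UCS-2, 2 bytes per character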
When you open a file in binary mode (open(..., 'rb')), you don't decode the bytes, but take them as-is. That is problematic if there is text in the file (because in order to process the data you'll sooner or later want to convert it to a string, where you then have to do the decoding), but if it's really binary data, such as an image, it's fine (and better).
This answer contains simplifications. Use with caution.
I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems; they must have quite a lot of binary data, such as icons, sounds, and graphics, embedded in the script. I am wondering how that can be done, since this can be potentially useful in simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string, x = b'\x23\xa3\xef' ...: terribly inefficient; it takes about half a MB of source text for a 100 KB wav file.
base64: better than option 1, but it still enlarges the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say it consists of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2
>>> bz2.compress(b'\x00' * 100 + b'\x01' * 200).encode('base64').replace('\n', '')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2

data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(data.decode('base64')))
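Note that .encode('base64') only exists in Python 2, where bytes and str are the same type; on Python 3 the same idea uses the base64 module (a sketch):

import bz2
import base64

# Encoding step (run once; paste the resulting string into your script):
blob = base64.b64encode(bz2.compress(b'\x00' * 100 + b'\x01' * 200)).decode('ascii')

# Decoding step inside the script; note the binary-mode 'wb'.
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(base64.b64decode(blob)))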
Here's a quick and dirty way. Create the following script called MyInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> MyInstaller
chmod +x MyInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.
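A marker-based variant of the same idea in Python saves you from counting header bytes by hand (a sketch; the marker string is arbitrary, as long as that exact byte sequence cannot occur in the code above it):

import sys

MARKER = b'\n__PAYLOAD_BELOW__\n'

# Read this script's own file and write out everything after the marker line.
with open(sys.argv[0], 'rb') as f:
    contents = f.read()
payload = contents[contents.index(MARKER) + len(MARKER):]
with open('payload', 'wb') as out:
    out.write(payload)

To build the installer, you would append the marker line and then the binary, e.g. printf '\n__PAYLOAD_BELOW__\n' >> installer.py followed by cat myBinary >> installer.py.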
What's the easiest efficient way to read from stdin and output every nth byte?
I'd like a command-line utility that works on OS X, and would prefer to avoid compiled languages.
This Python script is fairly slow (25s for a 3GB file when n=100000000):
#!/usr/bin/env python
import sys

n = int(sys.argv[1])
while True:
    chunk = sys.stdin.read(n)
    if not chunk:
        break
    sys.stdout.write(chunk[0])
Unfortunately we can't use sys.stdin.seek to avoid reading the entire file.
Edit: I'd like to optimize for the case when n is a significant fraction of the file size. For example, I often use this utility to sample 500 bytes at equally-spaced locations from a large file.
NOTE: The OP changed the example n from 100 to 100000000, which effectively renders my code slower than his. Normally I would just delete my answer, since it is no longer better than the original example, but my answer got a vote, so I will leave it as it is.
The only way I can think of to make it faster is to read everything at once and use slicing:
#!/usr/bin/env python
import sys
n = int(sys.argv[1])
data = sys.stdin.read()
print(data[::n])
Although trying to fit a 3 GB file into RAM might be a very bad idea.
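If the input is a regular file rather than a pipe, you can seek past the unwanted bytes instead of reading them at all (a sketch; it takes a filename argument instead of stdin, since pipes are not seekable):

#!/usr/bin/env python
import sys

n = int(sys.argv[1])
with open(sys.argv[2], 'rb') as f:
    while True:
        byte = f.read(1)   # read one sample byte
        if not byte:
            break
        sys.stdout.write(byte)
        f.seek(n - 1, 1)   # jump ahead relative to the current position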
I'm trying to run the following code but for some reason I get the following error: "TypeError: limit must be an integer".
Reading csv data file
import sys
import csv

maxInt = sys.maxsize
decrement = True

while decrement:
    decrement = False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt / 10)
        decrement = True

with open("Data.csv", 'rb') as textfile:
    text = csv.reader(textfile, delimiter=" ", quotechar='|')
    for line in text:
        print ' '.join(line)
The error occurs on the csv.field_size_limit(maxInt) line. I have only added the extra bit above the csv read statement because the file was too large to read normally. Alternatively, I could change the file from csv to a text file, but I'm not sure whether this would corrupt the data. Further, I can't actually see any of the data, as the file is >2 GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter too much, 2KB or 2GB, whatever.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to potentially debug are to run print(sys.maxsize), and to just open up a Python interpreter, import sys and csv, and then run csv.field_size_limit(sys.maxsize). If you get some terribly small number or an exception, you may have a bad install of Python. Otherwise, try a simpler version of your file: maybe the first line, or the first several lines and just 1 column. See if you can reproduce the smallest possible failing case and remove the variability of your system and the file size.
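Concretely, that interactive check is just (the values shown are typical for a 64-bit CPython; yours may differ):

import sys
import csv

print(sys.maxsize)                        # e.g. 9223372036854775807 on a 64-bit build
print(csv.field_size_limit(sys.maxsize))  # returns the previous limit, or raises as in the question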
On Windows 7 64-bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L, which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simply use csv.field_size_limit(2147483647), which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application, and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question
I'm trying to read a huge amount of lines from standard input with python.
more hugefile.txt | python readstdin.py
The problem is that the program freezes as soon as I've read just a single line.
print sys.stdin.read(8)
exit(1)
This prints the first 8 bytes, but then I expect it to terminate, and it never does. I think it's not really just reading the first bytes but trying to read the whole file into memory.
Same problem with sys.stdin.readline().
What I really want to do is, of course, to read all the lines, but with a buffer so I don't run out of memory.
I'm using Python 2.6.
This should work efficiently in a modern Python:
import sys

for line in sys.stdin:
    # do something...
    print line,
You can then run the script like this:
python readstdin.py < hugefile.txt
Back in the day, you had to use xreadlines to get efficient huge line-at-a-time IO -- the docs now recommend that you use for line in file.
Of course, this is of assistance only if you're actually working on the lines one at a time. If you're just reading big binary blobs to pass onto something else, then your other mechanism might be as efficient.
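For the binary-blob case, a fixed-size chunked copy loop keeps memory bounded regardless of input size (a minimal sketch):

import sys

CHUNK_SIZE = 64 * 1024  # 64 KB per read; tune to taste

while True:
    chunk = sys.stdin.read(CHUNK_SIZE)
    if not chunk:  # empty string means EOF
        break
    sys.stdout.write(chunk)  # or hand the chunk to whatever consumes it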