I have a "binary" file with variable-size records. Each record is made up of a number of little-endian 2-byte integers. I know the start position of each record and its size.
What's the fastest way to read this into a Python array of integers?
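I don't think you can do better than opening the file, seeking to the start of each record, reading its bytes, and then calling struct.unpack('<h', buff) on each 2-byte chunk. Note that file.read(2) returns 2 bytes, which is one integer, and that the format character for a 2-byte integer is 'h' (signed) or 'H' (unsigned); 'i' is a 4-byte integer.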
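As a rough sketch of reading a record this way, assuming the start offset and size (in bytes) of the record are already known and that the integers are signed, the whole record can be unpacked in one call rather than one integer at a time. The function name read_record and its parameters are made up for illustration:

```python
import io
import struct

def read_record(f, start, size):
    """Read one record of little-endian signed 16-bit integers.

    f is a file object opened in binary mode; start and size are in bytes.
    """
    f.seek(start)
    buff = f.read(size)
    count = len(buff) // 2          # two bytes per integer
    # '<' = little-endian, 'h' = signed 2-byte integer, repeated count times
    return list(struct.unpack('<%dh' % count, buff[:count * 2]))

# In-memory stand-in for a real file: three 2-byte integers.
demo = io.BytesIO(b'\x01\x00\xff\xff\x00\x01')
print(read_record(demo, 2, 4))  # [-1, 256]
```

If the values are unsigned, 'H' would replace 'h' in the format string.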
I have run into a weird problem and have not been able to solve it for days. I created a byte array that contains the values from 1 to 250 and wrote it to a binary file from C# using WriteAllBytes.
Later I read it from Python using np.fromfile(filename, dtype=np.ubyte). However, I noticed that this function seemed to add arbitrary commas (see the image). Interestingly, they are not visible in the array's properties, and if I call numpy.array2string, the comma turns into '\n'. One solution is to strip the commas, but I have very long sequences, and running a replace over 100 GB of data would take forever. I also rechecked the files by reading them with .NET Core, and I'm quite sure the commas are not there.
What could I be missing?
Edit:
I was trying to read all byte values into an array and convert each member, or the entire array, to a string. I found that the most reliable way to do this is:
list(map(str, ubyte_array))
The code above returns a list of strings whose elements contain no arbitrary commas or blank spaces.
I've encoded an 8x8 block of numbers using RLE and got this string of bits that I need to write to a file. The last coefficient of the sequence is padded accordingly so that the whole thing is divisible by 8.
In the image above, you can see the array of numbers, the RLE encoding (86 bits long), the padded version that is divisible by 8 (88 bits long), and the concatenated string that I need to write to a file.
0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011
Would the best way to write this concatenated string be to divide it into 8-bit substrings and write those individually as bytes, or is there a simpler, faster way to do this? When it comes to binary data, the only module I've worked with so far is struct, and this is the first time I've tackled something like this. Any and all advice would be appreciated.
I would convert it to a list using regex.
import re
file = open(r"insert file path here", "a")
bitstring = "0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011"
bitlist = re.findall('........', bitstring)  # separates bitstring into a list of 8-character items
for bits in bitlist:
    file.write(bits)
    file.write("\n")
file.close()
Let me know if this is what you mean.
I would also like to mention that pulling data from a text file is not the most efficient way to store it. Ideally, the fastest option is to keep it in an array such as [["10110100"], ["00101010"]] and pull the data like this:
>>> bitarray = [["10110100"], ["00101010"]]
>>> print(bitarray[0][0])
10110100
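If the goal is to write the bits as actual bytes rather than as text, one possible approach is to parse the whole bit string as a single integer and convert that to bytes. This is only a sketch: it assumes Python 3 (for int.to_bytes) and a bit string already padded to a multiple of 8, and the function name bits_to_bytes is made up for illustration.

```python
def bits_to_bytes(bitstring):
    """Pack a string of '0'/'1' characters into raw bytes (Python 3)."""
    if len(bitstring) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Parse the whole string as one big integer, then emit it big-endian
    # so that the first 8 characters become the first byte. The explicit
    # length keeps any leading zero bits.
    return int(bitstring, 2).to_bytes(len(bitstring) // 8, byteorder='big')

packed = bits_to_bytes("0000110010001100")
print(packed)  # b'\x0c\x8c'
```

The result can then be written to a file opened in binary mode ("wb") with a single write call.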
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii

path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
    print(binascii.hexlify(line))
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
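To illustrate why the item size matters, here is a small sketch that reverses the bytes within each fixed-width item. It assumes a homogeneous sequence of one datatype, which is exactly what is not known about the file in question; the function name swap_byte_order is made up for illustration.

```python
def swap_byte_order(data, width):
    """Reverse the bytes inside each width-byte item of data."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the item width")
    return b''.join(data[i:i + width][::-1]
                    for i in range(0, len(data), width))

raw = b'\x00\x01\x02\x03\x04\x05\x06\x07'
print(swap_byte_order(raw, 2))  # b'\x01\x00\x03\x02\x05\x04\x07\x06'
print(swap_byte_order(raw, 4))  # b'\x03\x02\x01\x00\x07\x06\x05\x04'
print(swap_byte_order(raw, 8))  # b'\x07\x06\x05\x04\x03\x02\x01\x00'
```

The same eight input bytes give three different results depending on whether they are treated as shorts, 4-byte values, or doubles, so guessing the wrong width silently corrupts the data.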
I am trying to read some negative values from a compressed file that has the hex values:
FFFFFFFF, which should be -1, but displays as 4294967295
FFFFFFFE, which should be -2, but displays as 4294967294
I know FF should be the marker for a negative value, but is there a method in Python that can read the values directly, or do I have to write my own?
Thank you!
Edit: This is for Python 2.6. My program reads from binary data and I am just displaying it in hex to make it simpler. The program simply reads 4 bytes at a time and grabs values from those 4 bytes. It is just some of those values are negative and display the above numbers. I am also hoping someone can explain how Python interprets the binary data into a value so I can write a reverse protocol. Thank you!
I read from hex and convert to values through this method.
def readtoint(read):
    keynumber = read[::-1]
    hexoffset=''
    for letter in keynumber:
        temp=hex(ord(letter))[2:]
        if len(temp)==1:
            temp="0"+temp
        hexoffset += temp
    value = int(hexoffset, 16)
    return value
It grabs 4 bytes, reverses their order, then converts the hex value into an int value. The values I posted above are already reversed.
Use the struct module:
import struct

def readtoint(read):
    return struct.unpack('<i', read)[0]
Example:
>>> readtoint('\xfe\xff\xff\xff')
-2
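On how the bytes are interpreted: signed 32-bit integers are normally stored in two's complement, where any unsigned value with the top bit set (2**31 or more) stands for that value minus 2**32. The sketch below shows that rule by hand and uses struct.pack for the reverse direction; the helper names to_signed32 and to_bytes_le are made up for illustration.

```python
import struct

def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    # Values with the top bit set (>= 2**31) represent negatives.
    return value - 2**32 if value >= 2**31 else value

def to_bytes_le(value):
    """Reverse direction: pack a signed int into 4 little-endian bytes."""
    return struct.pack('<i', value)

print(to_signed32(4294967295))  # -1
print(to_signed32(4294967294))  # -2
print(to_bytes_le(-2) == b'\xfe\xff\xff\xff')  # True
```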
Post your file-reading code to get a precise answer. But the answer to your question is almost certainly here:
Reading integers from binary file in Python
I need to read the numbers of a randomly chosen row, store each row in another array, and then read the numbers of one column.
I have a big file, more than 400 MB. It contains 13496*13496 numbers, that is, 13496 rows and 13496 columns. I want to read them into an array.
This is my code:
_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
i = 0
while (i<13496):
    print "i="+str(i)
    _strlf = _L1file.readline()
    _strlf = _strlf.split('\t')
    _strlf = _strlf[:-1]
    _L1[i] = _strlf
    i += 1
_L1file.close()
And this is my error message:
MemoryError:
File "D:\research\space-function\ART3.py", line 30, in <module>
_strlf = _strlf.split('\t')
You might want to approach your problem in another way: process the file line by line. I don't see a need to store the whole big file in an array. Otherwise, you might want to tell us what you are actually trying to do.
for line in open("400MB_file"):
    # do something with line.
Or
f=open("file")
for linenum,line in enumerate(f):
    if linenum+1 in [2,3,10]:
        print "there are ", len(line.split())," columns" #assuming you want to split on spaces
        print "100th column value is: ", line.split()[99]
    if linenum+1>10:
        break # break if you want to stop after the 10th line
f.close()
This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numerics, for example). I'm not even taking your particular runtime's array metadata into account, though this would typically be a tiny overhead on a simple array.
Assuming each array element is just a single byte, your computer needs around 180 MB of RAM to hold it in memory in its entirety. Trying to process it could be impractical.
You need to think about the problem a different way; as has already been mentioned, a line-by-line approach might be a better option. Or perhaps processing the grid in smaller units, perhaps 10x10 or 100x100, and aggregating the results. Or maybe the problem itself can be expressed in a different form, which avoids the need to process the entire dataset altogether...?
If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.
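As one possible shape for the line-by-line approach, the sketch below pulls a single column out of a tab-separated file while keeping only one line in memory at a time. It assumes every field parses as a float and that each row ends with a trailing tab, as the question's use of _strlf[:-1] suggests; the function name read_column is made up for illustration.

```python
import io

def read_column(lines, col, sep='\t'):
    """Collect one column as floats while holding only one line at a time."""
    values = []
    for line in lines:
        fields = line.rstrip('\n').split(sep)
        values.append(float(fields[col]))
    return values

# In-memory stand-in for open('distanceCMD.function.txt').
demo = io.StringIO(u"1.0\t2.0\t3.0\t\n4.0\t5.0\t6.0\t\n")
print(read_column(demo, 1))  # [2.0, 5.0]
```

For the file in the question, the result would be a list of 13496 floats, far smaller than the full grid.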
Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe.
On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large, even simple objects like ints, floats and strings. In your code you end up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that points to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB in total. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes (I don't know how long your strings are) per entry.
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each.
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container) then the returned size doesn't include the size of the contained list objects; only of the structure used to hold those objects. Here are some values on my machine.
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10)) # 72 + 8 bytes for each pointer
152
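As a small sketch of solution (2) above, a row can be stored in the standard library's array type instead of a list of strings. The sample row text is invented for illustration and assumes tab-separated values with a trailing tab, as in the question's code:

```python
from array import array

row_text = "0.5\t1.25\t2.0\t"
fields = row_text.split('\t')[:-1]   # drop the empty field after the last tab

# 'd' stores each value as a C double, with no per-element Python object.
row = array('d', [float(v) for v in fields])

print(row.itemsize)  # 8
print(list(row))     # [0.5, 1.25, 2.0]
```

At 8 bytes per value, the full 13496x13496 grid would need roughly 1.46 GB, which is still large but far below the per-string overhead described above.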
MemoryError exception:
Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.
It seems that, at least in your case, reading the entire file into memory is not a doable option.
Replace this:
_strlf = _strlf[:-1]
with this:
_strlf = [float(val) for val in _strlf[:-1]]
You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.
You might want to include some handling for null values.
You can also go to numpy's array type, but your problem may be too small to bother.