I'd appreciate help with a problem I have.
Goal: read a .h264 file (the raw bitstream, which I extracted to a file using ffmpeg) in Python and store it in some data structure (probably a list; suggestions are welcome).
I want to read the data as hex. For example, this is what the data looks like:
What I want is to put each byte (2 hex digits) into a list or some other data structure.
But any step forward will help me.
My Attempts:
First I tried to read the way I know:
with open(path, 'r') as fp:
    data = fp.read()
That didn't work; I got just an empty string ('').
After a lot of changes, I tried something else that I saw online:
with open(path, 'r') as fp:
    hex_list = ["{:02}".format(ord(c)) for c in fp.read()]
Still got an empty list.
I'd appreciate your help.
Thanks a lot.
EDIT:
Thanks to the comment below, I tried to open using 'rb', but still with no luck.
If you have an h264 mp4 file, you can open it and get a hexadecimal string representation like this using binascii.hexlify():
import binascii
with open('test.mp4', 'rb') as fin:
    hexa = binascii.hexlify(fin.read())
    print(hexa[0:1000])
hexa will be a python bytes object, and you can easily get back the binary representation by doing binascii.unhexlify(hexa). This will be much more efficient than storing the hex representation as strings in a list(), both in terms of space and time. You can access the bytes array with indices/slices, so whatever you were intending to do with the list will probably work fine with this (it will just be much faster and use a lot less memory).
One thing to keep in mind, though: to get the first hexadecimal digit from a bytes object, you don't use hexa[0], but rather hexa[0:1]. To get the first pair of hexadecimal digits (one byte), you use hexa[0:2]. The second byte is hexa[2:4], and so on. As explained in the docs for hex():
Since bytes objects are sequences of integers (akin to a tuple), for a
bytes object b, b[0] will be an integer, while b[0:1] will be a bytes
object of length 1. (This contrasts with text strings, where both
indexing and slicing will produce a string of length 1)
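To make the indexing-versus-slicing point concrete, here is a minimal sketch. The sample bytes are made up for illustration (they are not taken from a real .h264 file); with a real file, `data` would come from `open(path, 'rb').read()` instead. It also shows one possible way to build the list of two-digit hex strings the question asks for.

```python
import binascii

# Hypothetical sample bytes, standing in for the contents of a file opened with 'rb'
data = bytes([0x00, 0x00, 0x00, 0x01, 0x67, 0x42])

# Indexing a bytes object gives an int, slicing gives a bytes object
as_int = data[4]       # an integer (0x67)
as_bytes = data[4:5]   # a bytes object of length 1

# One list entry per byte, each formatted as two hex digits
hex_pairs = [f"{b:02x}" for b in data]

# The hexlify() result is itself a bytes object: two hex digits per input byte
hexa = binascii.hexlify(data)
first_byte_hex = hexa[0:2]
second_byte_hex = hexa[2:4]

print(hex_pairs)
print(hexa)
```

If the plain integer values are enough, `list(data)` gives them directly without any string formatting.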
Related
I've encoded an 8x8 block of numbers using RLE and got this string of bits that I need to write to a file. The last coefficient of the sequence is padded accordingly so that the whole thing is divisible by 8.
In the image above, you can see the array of numbers, the RLE encoding (86 bits long), the padded version so that it's divisible by 8 (88 bits long), and the concatenated string that I am to write to a file.
0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011
Would the best way to write this concatenated string be to divide it into 8-bit substrings and write those individually as bytes, or is there a simpler, faster way to do this? When it comes to binary data, the only module I've worked with so far is struct; this is the first time I've tackled something like this. Any and all advice would be appreciated.
I would convert it to a list using regex.
import re

bitstring = "0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011"
# Separate the bit string into a list whose items are each 8 characters long
bitlist = re.findall('........', bitstring)
# Append each 8-character item to the file as a line of text;
# the with block closes the file when done
with open(r"insert file path here", "a") as file:
    for item in bitlist:
        file.write(item)
        file.write("\n")
Let me know if this is what you mean.
I would also like to mention that pulling the data from a text file is not the most efficient way to store it. The fastest option would be to keep it in an array such as [["10110100"], ["00101010"]] and pull the data out like this:
bitarray = [["10110100"], ["00101010"]]
>>> print(bitarray[0][0])
10110100
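If the goal is to store the bits as actual binary data rather than as lines of "0" and "1" characters, one possible approach is to convert the whole string to an integer and then to bytes. This is only a sketch: it assumes the bit string length is a multiple of 8 (as the question states after padding) and that the first bit is the most significant one. The output file name is a placeholder.

```python
import os
import tempfile

bitstring = "0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011"

# Interpret the whole string as one big-endian integer, then serialize it
n_bytes = len(bitstring) // 8
packed = int(bitstring, 2).to_bytes(n_bytes, byteorder="big")

# Placeholder output path in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "rle_block.bin")
with open(out_path, "wb") as f:
    f.write(packed)

# Read the file back and rebuild the bit string, keeping leading zeros
with open(out_path, "rb") as f:
    value = int.from_bytes(f.read(), byteorder="big")
restored = format(value, "0{}b".format(len(bitstring)))
```

Passing the byte count to `to_bytes()` is what preserves the leading zero bits, which a plain integer would otherwise drop.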
I've got an I2S microphone connected to a microcontroller and have managed to dump 16-bit WAV audio to a Python bytearray object that looks like this (using a micropython library):
raw = bytearray(b"\xac\xffW\x00\xfc\xfe\xac\xffs\xfe\xfc\xfe+\xfes\xfe7\xfe+\xfe\x8c\xfe7\xfe\x1f\xff\x8c\xfe\xcf\xff\x1f\xfft\x00\xcf\xff\xfb\x00t\x00?\x01\xfb...")
I have successfully written these bytearray dumps to a file I created like this:
wav = open('16bitaudio.wav','wb')
#....some code to write the wav header
wav.write(raw)
wav.close()
When I open this on my PC it plays the samples I've recorded faithfully, sounds great.
My issue is this: I want to convert this data to an integer that represents the average intensity of sound in my samples. I first attempted to do this:
intensity = sum(raw)/len(raw)
However, this tends to result in a number around 128 almost all the time, which suggests to me that the data is being read as random bytes. Upon further investigation, these array functions seem to assume 8-bit values (here reading the value b'\xffW', which I believe is a little-endian 22527):
>>> int(raw[1])
255
which appears to be just the b'\xff' part.
I can get my expected value by passing just those bytes manually to int.from_bytes:
>>> int.from_bytes(b'\xffW','little')
22527
However I can't seem to iterate through the bytearray without it truncating to 8-bit.
Finally, I have read about the struct.unpack methods, which look OK, but I'm not sure a bytearray gets packed with bytes of consistent length, e.g.:
>>> len(bytearray(b'\xfdo\xfe\x7f\xfd\xd3\xf1d'))
8
Even though I see only 6 bytes represented. The ultimate problem I have with unpacking is that I'm not sure ahead of time whether each value is 8 or 16 bits, so I don't know what combination of letters to use in the format argument...
So, given the b-string representation, it seems that Python DOES have knowledge of the way the bytes are encoded, but the normal array functions I've got on hand don't seem to be getting this information from the bytearray. I'm sure there is a Pythonic way to parse this bytearray to integers, but I just can't find it...
Any help is greatly appreciated.
Thanks @juanpa.arrivillaga for the answer I was looking for. I used the array library, which seemed to solve all of my problems:
import array
result = array.array('h', raw)
Graphing the values output here is the same as the oscilloscope for my audio file.... Cheers!
[Figure: plot of the decoded integers]
[Figure: oscilloscope view of my working .wav]
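For the original goal of an average intensity, here is a rough sketch using struct instead of array. It assumes the samples are signed 16-bit little-endian values (the usual layout for 16-bit PCM WAV data) and uses the mean absolute sample value as the "intensity", which is just one possible definition. The function name is made up for this example, and the input is only the first 8 bytes of the dump shown in the question.

```python
import struct

def mean_abs_amplitude(raw):
    """Mean absolute value of signed 16-bit little-endian samples (assumed layout)."""
    count = len(raw) // 2          # two bytes per sample
    if count == 0:
        return 0.0
    # '<' = little-endian, 'h' = signed 16-bit; ignore a trailing odd byte
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    return sum(abs(s) for s in samples) / count

# First four samples of the bytearray from the question
raw = bytearray(b"\xac\xffW\x00\xfc\xfe\xac\xff")
samples = struct.unpack("<4h", raw)
print(samples)                    # (-84, 87, -260, -84)
print(mean_abs_amplitude(raw))    # 128.75
```

Samples start at even offsets, so the first sample is b'\xac\xff' (-84); the pair b'\xffW' from the question straddles two neighbouring samples. Taking the absolute value matters because signed audio samples swing around zero, so a plain mean tends toward zero regardless of loudness.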
I am new to python. Here is what I am struggling to implement.
I have a very long bit string "1010101010101011111010101010111100001101010101011 ... ".
I want to write this as bits and create a binary file using python.
(Later I want to disassemble this using IDA, but that is not important for this question.)
Is there any way I can write to a file at bit level (as binary)? Or do I have to convert it to bytes first and then write byte by byte? What is the best approach?
Yes, you have to convert it to bytes first and then write those bytes to a file. Working on a per byte basis is probably also the best idea to keep control over the ordering of your bytes (big vs. little endian), etc.
You can use int("10101110", 2) to easily convert a bit string to a numeric value. Then use a bytearray to create a sequence of all your byte values. The result will look something like this:
s = "1010101010101011111010101010111100001101010101011"
i = 0
buffer = bytearray()
while i < len(s):
    buffer.append(int(s[i:i+8], 2))
    i += 8

# now write your buffer to a file
with open(my_file, 'bw') as f:
    f.write(buffer)
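One detail worth noting: the example string is 49 bits long, so the last chunk in the loop above is the single bit "1", which int() turns into the byte 0x01 (in effect padding it with zeros on the left). If the trailing bits should instead stay at the front of the last byte, a possible variant pads the string on the right first. This is a sketch under that assumption; the helper name is made up for this example.

```python
def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into bytes, most significant bit first.

    A final partial byte is padded with zero bits on the right (an assumption;
    pick whatever padding rule the consumer of the file expects).
    """
    padded_len = (len(bits) + 7) // 8 * 8
    padded = bits.ljust(padded_len, "0")
    return bytes(int(padded[i:i + 8], 2) for i in range(0, padded_len, 8))

s = "1010101010101011111010101010111100001101010101011"
payload = bits_to_bytes(s)
print(payload.hex())
```

The resulting `payload` can be written with `open(path, 'wb')` exactly like the bytearray above.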
I want to write encoded text to a file using Python 3.6. The issue is that I want to write it as a string and not as bytes.
text = open(file, 'r').read()
enc = text.encode(encoding) # for example: "utf-32"
f = open(new_file, 'w')
f.write(str(enc)[2:-1])
f.close()
The problem is that I still get the file content as bytes (e.g. the '\n' stays as it is instead of becoming a new line).
I also tried to use:
enc.decode(encoding)
but that just gives me back the original text I had in the first place.
Any ideas how I can improve this piece of code?
Thanks.
The problem you have here is that you encode into a utf-32 bytes object and then convert it back with str() without specifying an encoding. Called that way, str() does not decode anything: it returns the printable representation of the bytes object (the b'...' form), which is why escape sequences such as \n show up literally in the file. If you pass the same encoding to str, then it decodes correctly.
Better yet, don't call str at all when writing out - if you already have a bytes object, it's not necessary.
This concept generally trips up a lot of people. I suggest reading the explanation here to help wrap your head around how and why we do the string/bytes conversions. A good rule of thumb: keep string types inside your Python code, decode from bytes to string as data comes in, and encode from string to bytes as it goes out.
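As a small illustration of that rule of thumb, the sketch below lets open() do the encoding on the way out and the decoding on the way in. The sample text, the temporary file name, and the choice of utf-32 are placeholders for this example.

```python
import os
import tempfile

text = "first line\nsecond line"
encoding = "utf-32"

# str() on a bytes object gives its printable representation, not decoded text
as_repr = str("a\n".encode("utf-8"))

# Placeholder output path in a fresh temp directory
path = os.path.join(tempfile.mkdtemp(), "encoded.txt")

# Encode on the way out: the file object converts str to bytes for us.
# newline="" switches off newline translation so the round trip is exact.
with open(path, "w", encoding=encoding, newline="") as f:
    f.write(text)

# Decode on the way in: we get a str back, with real line breaks
with open(path, "r", encoding=encoding, newline="") as f:
    round_trip = f.read()

print(as_repr)
print(round_trip)
```

Writing `text.encode(encoding)` to a file opened with 'wb' would be an equivalent way to produce an encoded file.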
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii
path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
    print(binascii.hexlify(line))
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
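To illustrate why the element size matters, the sketch below swaps the same 8 bytes (the first 8 bytes of the dump in the question) under three different assumptions about the data type width. The helper name is made up for this example, and none of the three results is claimed to be the right one for this file.

```python
def swap_byte_order(data, width):
    """Reverse the bytes inside each consecutive `width`-byte element."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the element width")
    return b"".join(data[i:i + width][::-1] for i in range(0, len(data), width))

data = bytes.fromhex("a0040000dd000000")

as_16bit = swap_byte_order(data, 2)   # e.g. short integers
as_32bit = swap_byte_order(data, 4)   # e.g. 32-bit integers or floats
as_64bit = swap_byte_order(data, 8)   # e.g. doubles

print(as_16bit.hex())  # 04a0000000dd0000
print(as_32bit.hex())  # 000004a0000000dd
print(as_64bit.hex())  # 000000dd000004a0
```

Each assumption produces a different byte sequence, and a file that mixes types would need a different width for each field, which is why the layout has to be known first.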