Python: write and read blocks of binary data to a file

I am working on a script that breaks another Python script into blocks and uses PyCrypto to encrypt them (all of this I have done successfully so far). Now I am storing the encrypted blocks in a file so that the decrypter can read it and execute each block. The final result of the encryption is a list of binary outputs (something like blocks=[b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3', etc...]).
When writing the output to a file, the blocks all end up on one giant line, so that when reading the file back, all the bytes come back as one giant line instead of as the items of the original list. I also tried converting the bytes to a string and adding a '\n' at the end of each one, but the problem there is that I still need the bytes, and I can't figure out how to undo the string to get the original bytes back.
To summarize, I am looking to either write each binary item to a separate line in a file so I can easily read the data and use it in the decryption, or translate the data to a string and, in the decryption, undo the string to get back the original binary data.
Here is the code for writing to the file:
new_file = open('C:/Python34/testfile.txt', 'wb')
for byte_item in byte_list:
    # This, or for the string method I just replaced 'wb' with 'w' and
    # byte_item with ascii(byte_item) + '\n'
    new_file.write(byte_item)
new_file.close()
and for reading the file:
# Or 'r' instead of 'rb' if using string method
byte_list = open('C:/Python34/testfile.txt','rb').readlines()

A file is a stream of bytes without any implied structure. If you want to load a list of binary blobs then you should store some additional metadata to restore the structure, e.g., you could use the netstring format:
#!/usr/bin/env python
blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3']
# save blocks
with open('blocks.netstring', 'wb') as output_file:
    for blob in blocks:
        # [len]":"[string]","
        output_file.write(str(len(blob)).encode())
        output_file.write(b":")
        output_file.write(blob)
        output_file.write(b",")
Read them back:
#!/usr/bin/env python3
import re
from mmap import ACCESS_READ, mmap
blocks = []
match_size = re.compile(br'(\d+):').match
with open('blocks.netstring', 'rb') as file, \
        mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
    position = 0
    for m in iter(lambda: match_size(mm, position), None):
        i, size = m.end(), int(m.group(1))
        blocks.append(mm[i:i + size])
        position = i + size + 1  # shift to the next netstring
print(blocks)
As an alternative, you could consider the BSON format for your data, or an ASCII armor format.
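Another simple option, not covered in the original answer, is a fixed-width length prefix written with the struct module. This is a minimal sketch under that assumption (the file name blocks.len32 is just illustrative):
#!/usr/bin/env python3
import struct

blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3']

# save blocks: each one prefixed with a 4-byte big-endian length
with open('blocks.len32', 'wb') as output_file:
    for blob in blocks:
        output_file.write(struct.pack('>I', len(blob)))
        output_file.write(blob)

# read them back: length first, then exactly that many bytes
restored = []
with open('blocks.len32', 'rb') as input_file:
    while True:
        header = input_file.read(4)
        if not header:
            break
        (size,) = struct.unpack('>I', header)
        restored.append(input_file.read(size))

print(restored == blocks)  # True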

I think what you're looking for is byte_list=open('C:/Python34/testfile.txt','rb').read()
If you know how many bytes each item is, you can use read(number_of_bytes) to process one item at a time.
read() will read the entire file, but then it is up to you to decode that entire list of bytes into their respective items.
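For example, if you knew every encrypted block was exactly 16 bytes (a hypothetical size here; use whatever your cipher actually produces), reading one item at a time might look like this sketch:
BLOCK_SIZE = 16  # hypothetical: the fixed size of each encrypted item in bytes

blocks = []
with open('C:/Python34/testfile.txt', 'rb') as f:
    while True:
        chunk = f.read(BLOCK_SIZE)
        if not chunk:  # read() returns b'' at end of file
            break
        blocks.append(chunk)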

In general, since you're using Python 3, you will be working with bytes objects (which are immutable) and/or bytearray objects (which are mutable).
Example:
b1 = bytearray('hello', 'utf-8')
print(b1)

b1 += bytearray(' goodbye', 'utf-8')
print(b1)

open('temp.bin', 'wb').write(b1)
#------
b2 = open('temp.bin', 'rb').read()
print(b2)
Output:
bytearray(b'hello')
bytearray(b'hello goodbye')
b'hello goodbye'

Related

Need to convert string to format usable by .hex() or other hex conversion method

I am reading hex data from a .csv file that has multiple rows (example format: FFFDF3FFFBF2FFFAF210FFF0) using the following code:
import csv

with open('c:\\temp\\results.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    line_count = 0
    file = open('c:\\temp\\sent.csv', 'w')
    for row in csv_reader:
        hex_string = f'{row[0]}'
        bytes_object = bytes.fromhex(hex_string)
        file.write(str(bytes_object) + '\n')
        line_count += 1
    file.close()
The output file contains multiple rows that are converted to this format (sorry, new to Python, so not sure if this is a bytearray or what it is actually called): b'\xff\xfd\xf3\xff\xfb\xf2\xff\xfa\xf2\x10\xff\xf0'
I am trying to convert back from this format to the original format by reading the rows of the newly created .csv file (I need to edit readable ASCII in the file and convert back for use in another program).
file = open('c:\\temp\\sent.csv', 'r')
for row in file:
    byte_string = row
    # hex_object = byte_string.hex()
    # THIS works if I enter the byte array in directly, but not if reading
    # from file: hex_object = byte_string.hex()
    hex_object = b'\xff\xfd\x03\xff\xfb\x03\xff\xfd\x01\xff\xfb\x17\xff\xfa\xff\xf0\xff\xfd\x00\xff\xfb\x00'.hex()
    print(hex_object)
    # print(byte_string)
    # writer.writerow(hex_object)
Is there a way to get this to work? I have tried several encoding methods, but since the data is already in the proper format I really just need to get it into a readable type for the .hex() method. I am using the latest version of Python, 3.8.1.
You are storing a textual representation of your bytes object and then trying to read it back without conversion to/from binary. Instead you are better off opening the output file in binary format like this:
file = open('c:\\temp\\sent.csv', 'wb')
and write the bytes to file:
bytes_object = bytes.fromhex(hex_string)
file.write(bytes_object)
(no need for newline character).
Then to do the opposite open in binary format:
with open('c:\\temp\\sent.csv', "rb") as f:
    data = f.read()
s = data.hex()
print(s)
Here data is a bytes object and it has the hex() function you are looking for.
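Putting both halves together, a minimal round trip (using the file name from the question and an illustrative hex string) might look like:
hex_string = 'FFFDF3FFFBF2FFFAF210FFF0'  # illustrative row from the question

# write the raw bytes
with open('c:\\temp\\sent.csv', 'wb') as f:
    f.write(bytes.fromhex(hex_string))

# read them back and recover the hex text
with open('c:\\temp\\sent.csv', 'rb') as f:
    recovered = f.read().hex().upper()

print(recovered == hex_string)  # True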

How to insert text at line and column position in a file?

I would like to insert a string at a specific column of a specific line in a file.
Suppose I have a file file.txt
How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?
I would like to change the last line to say How was the History test? by adding the string History at line 4 column 13.
Currently I read in every line of the file and add the string to the specified position.
with open("file.txt", "r+") as f:
# Read entire file
lines = f.readlines()
# Update line
lino = 4 - 1
colno = 13 -1
lines[lino] = lines[lino][:colno] + "History " + lines[lino][colno:]
# Rewrite file
f.seek(0)
for line in lines:
f.write(line)
f.truncate()
f.close()
But I feel like I should be able to simply add the line to the file without having to read and rewrite the entire file.
This is possibly a duplicate of the SO thread below:
Fastest Way to Delete a Line from Large File in Python
That thread talks about deleting a line, which is just one kind of manipulation, while yours is more of a modification. So the code gets updated as below:
def update(filename, lineno, column, text):
    fro = open(filename, "rb")
    current_line = 0
    while current_line < lineno - 1:
        fro.readline()
        current_line += 1
    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)
    # read the line we want to update
    line = fro.readline()
    # the file is open in binary mode, so encode the inserted text to bytes
    chars = line[0:column - 1] + text.encode() + line[column - 1:]
    while chars:
        frw.write(chars)
        chars = fro.readline()
    fro.close()
    frw.truncate()
    frw.close()

if __name__ == "__main__":
    update("file.txt", 4, 13, "History ")
In a large file it makes sense not to touch anything before the line where the update needs to happen. Imagine you have a file with 10K lines and the update needs to happen at line 9K; your code would load all 9K preceding lines of data into memory unnecessarily. The code you have would still work, but it is not the optimal way of doing it.
The function readlines() reads the entire file. But it doesn't have to. It actually reads from the current file cursor position to the end, which happens to be 0 right after opening. (To confirm this, try f.tell() right after the with statement.) What if we started closer to the end of the file?
The way your code is written implies some prior knowledge of your file contents and layouts. Can you place any constraints on each line? For example, given your sample data, we might say that lines are guaranteed to be 27 bytes or less. Let's round that to 32 for "power of 2-ness" and try seeking backwards from the end of the file.
# note the "rb+"; need to open in binary mode, else seeking is strictly
# a "forward from 0" operation. We need to be able to seek backwards
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
last = f.readlines()[-1].decode()
At which point the code has only read the last 32 bytes of the file.1 readlines() (at the byte level) will look for the line end byte (in Unix, \n or 0x0a or byte value 10), and return the before and after. Spelled out:
>>> last = f.readlines()
>>> print( last )
[b'hemistry test?\n', b'How was the test?']
>>> last = last[-1]
>>> print( last )
b'How was the test?'
Crucially, this works robustly under UTF-8 encoding by exploiting the UTF-8 property that byte values under 128 never occur when encoding non-ASCII characters. In other words, the exact byte \n (or 0x0a) only ever occurs as a newline and never as part of another character. If you are using a non-UTF-8 encoding, you will need to check whether the code's assumptions still hold.
Another note: 32 bytes is arbitrary given the example data. A more realistic and typical value might be 512, 1024, or 4096. Finally, to put it back to a working example for you:
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
# does *not* read while file, unless file is exactly 32 bytes.
last = f.readlines()[-1]
last_decoded = last.decode()
# Update line
colno = 13 -1
last_decoded = last_decoded[:colno] + "History " + last_decoded[colno:]
last_line_bytes = len( last )
f.seek(-last_line_bytes, 2)
f.write( last_decoded.encode() )
f.truncate()
Note that there is no need for f.close(). The with statement handles that automatically.
1 The pedantic will correctly note that the computer and OS will likely have read at least 512 bytes, if not 4096 bytes, relating to the on-disk or in-memory page size.
You can use this piece of code:
with open("test.txt", 'r+') as f:
    # Read the file
    lines = f.readlines()
    # Gets the column
    column = int(input("Column:")) - 1
    # Gets the line
    line = int(input("Line:")) - 1
    # Gets the word
    word = input("Word:")
    lines[line] = lines[line][0:column] + word + lines[line][column:]
    # Go back to the start of the file
    f.seek(0)
    for i in lines:
        # Write the lines back
        f.write(i)
This answer will only loop through the file once and only write everything after the insert. In cases where the insert is at the end there is almost no overhead, and where the insert is at the beginning it is no worse than a full read and write.
def insert(file, line, column, text):
    ln, cn = line - 1, column - 1  # offset from human index to Python index
    count = 0  # initial count of characters
    with open(file, 'r+') as f:  # open file for reading and writing
        for idx, line in enumerate(f):  # for all lines in the file
            if idx < ln:  # before the given line
                count += len(line)  # read and count characters
            elif idx == ln:  # once at the line
                f.seek(count + cn)  # place cursor at the correct character location
                remainder = f.read()  # store all characters afterwards
                f.seek(count + cn)  # move cursor back to the correct character location
                f.write(text + remainder)  # insert text and rewrite the remainder
                return  # You're finished!
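A call mirroring the example from the question might look like:
insert("file.txt", 4, 13, "History ")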
I'm not sure whether you were having problems changing your file to contain the word "History", or whether you wanted to know how to only rewrite certain parts of a file, without having to rewrite the whole thing.
If you were having problems in general, here is some simple code which should work, so long as you know the line within the file that you want to change. Just change the first and last lines of the program to read and write statements accordingly.
fileData="""How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?""" # So that I don't have to create the file, I'm writing the text directly into a variable.
fileData=fileData.split("\n")
fileData[3]=fileData[3][:11]+" History"+fileData[3][11:] # The 3 referes to the line to add "History" to. (The first line is line 0)
storeData=""
for i in fileData:storeData+=i+"\n"
storeData=storeData[:-1]
print(storeData) # You can change this to a write command.
If you wanted to know how to change specific "parts" to a file, without rewriting the whole thing, then (to my knowledge) that is not possible.
Say you had a file which said Ths is a TEST file., and you wanted to correct it to say This is a TEST file.; you would technically be changing 17 characters and adding one on the end. You are changing the "s" to an "i", the first space to an "s", the "i" (from "is") to a space, etc... as you shift the text forward.
A computer can't actually insert bytes between other bytes. It can only move the data, to make room.
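To make that concrete, here is a small sketch (with a hypothetical demo.txt) showing that a seek followed by a write overwrites bytes in place rather than shifting the rest of the file along:
# Create a small file and try to "fix" the typo in place.
with open('demo.txt', 'w') as f:
    f.write('Ths is a TEST file.')

with open('demo.txt', 'r+b') as f:
    f.seek(2)      # position where the 'i' should go
    f.write(b'i')  # overwrites the 's' at offset 2; nothing is shifted

print(open('demo.txt').read())  # 'Thi is a TEST file.' -- replaced, not inserted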

How to store first N strings from a txt file in Python?

I'm trying to figure out how to get the first N strings from a txt file and store them in an array. Right now, I have code that gets every string from a txt file, separated by a space delimiter, and stores it in an array. However, I want to grab only the first N strings from it, not every single one. Here is my code (I'm running it from a command prompt):
import sys
f = open(sys.argv[1], "r")
contents = f.read().split(' ')
f.close()
I'm sure that the only line I need to fix is:
contents = f.read().split(' ')
I'm just not sure how to limit it here to N number of strings.
If the file is really big, but not too big--that is, big enough that you don't want to read the whole file (especially in text mode or as a list of lines), but not so big that you can't page it into memory (which means under 2GB on a 32-bit OS, but a lot more on 64-bit), you can do this:
import itertools
import mmap
import re
import sys

n = 5

# Notice that we're opening in binary mode. We're going to do a
# bytes-based regex search. This is only valid if (a) the encoding
# is ASCII-compatible, and (b) the spaces are ASCII whitespace, not
# other Unicode whitespace.
with open(sys.argv[1], 'rb') as f:
    # map the whole file into memory--this won't actually read
    # more than a page or so beyond the last space
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # match and decode all space-separated words, but do it lazily...
    matches = re.finditer(br'(.*?)\s', m)
    bytestrings = (match.group(1) for match in matches)
    strings = (b.decode() for b in bytestrings)
    # ... so we can stop after 5 of them ...
    nstrings = itertools.islice(strings, n)
    # ... and turn that into a list of the first 5
    contents = list(nstrings)
Obviously you can combine steps together, even cramming the whole thing into a giant one-liner if you want. (An idiomatic version would be somewhere between that extreme and this one.)
If you're fine with reading the whole file (assuming it's not memory prohibitive to do so) you can just do this:
strings_wanted = 5
strings = open('myfile').read().split()[:strings_wanted]
That works like this:
>>> s = 'this is a test string with more than five words.'
>>> s.split()[:5]
['this', 'is', 'a', 'test', 'string']
If you actually want to stop reading exactly as soon as you've reached the nth word, you pretty much have to read a byte at a time. But that's going to be slow, and complicated. Plus, it's still not really going to stop reading after the nth word, unless you're reading in binary mode and decoding manually, and you disable buffering.
As long as the text file has line breaks (as opposed to being one giant 80MB line), and it's acceptable to read a few bytes past the nth word, a very simple solution will still be pretty efficient: just read and split line by line:
import sys

n = 5  # number of strings wanted

f = open(sys.argv[1], "r")
contents = []
for line in f:
    contents += line.split()
    if len(contents) >= n:
        del contents[n:]
        break
f.close()
What about just:
output = input[:3]
output will contain the first three strings in input.

mmap in python printing binary data instead of text

I am trying to read a big file of 30 MB character by character. I found an interesting article on how to read a big file. Fast Method to Stream Big files
Problem: the output prints binary data instead of actual human-readable text.
Code:
import gzip
import mmap
import random

def getRow(filepath):
    offsets = get_offsets(filepath)
    random.shuffle(offsets)
    with gzip.open(filepath, "r+b") as f:
        i = 0
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for position in offsets:
            mm.seek(position)
            record = mm.readline()
            x = record.split(",")
            yield x

def get_offsets(input_filename):
    offsets = []
    with open(input_filename, 'r+b') as f:
        i = 0
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for record in iter(mm.readline, ''):
            loc = mm.tell()
            offsets.append(loc)
            i += 1
    return offsets

for line in getRow("hello.dat.gz"):
    print line
Output: The output is producing some weird binary data.
['w\xc1\xd9S\xabP8xy\x8f\xd8\xae\xe3\xd8b&\xb6"\xbeZ\xf3P\xdc\x19&H\\#\x8e\x83\x0b\x81?R\xb0\xf2\xb5\xc1\x88rJ\
Am I doing something terribly stupid?
EDIT:
I found the problem. It is because of gzip.open. Not sure how to get rid of this. Any ideas?
As per the documentation of GzipFile:
fileno(self)
Invoke the underlying file object's `fileno()` method.
You are mapping a view of the compressed .gz file, not a view of the decompressed data.
mmap() can only operate on OS file handles, it cannot map arbitrary Python file objects.
So no, you cannot transparently map a decompressed view of a compressed file unless this is supported directly by the underlying operating system.
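One workaround, sketched below under the assumption that you can afford the disk or memory for the decompressed data, is to decompress first and then read or mmap the uncompressed result (file name taken from the question):
import gzip
import mmap
import shutil
import tempfile

# Option 1: read the decompressed data straight into memory
# (an ordinary bytes object is sliceable much like an mmap).
with gzip.open("hello.dat.gz", "rb") as f:
    data = f.read()

# Option 2: decompress to a temporary uncompressed file, then mmap that file.
with gzip.open("hello.dat.gz", "rb") as src, \
        tempfile.NamedTemporaryFile(delete=False) as dst:
    shutil.copyfileobj(src, dst)
    temp_name = dst.name

with open(temp_name, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm.readline())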

Python - Parse file into outputs based on magic number/length

I'm a complete beginner to coding - only started 3 weeks ago, and really only have codecademy's Python course under my belt - so simple explanations would be really appreciated!
I'm trying to write a python script that reads a file as a HEX string, and then parses the file into individual output files based on finding a "magic number" within the HEX string.
EG: if my HEX string were "0011AABB00BBAACC00223344", I might want to parse this string into new output files based on the magic number "00", and telling python that each output should be 8 characters long. The output for the example string above should be 3 files that contain the HEX values:
"0011AABB"
"00BBAACC"
"00223344"
Here's what I have so far (assuming in this case that the string above is contained within the "hextests" file):
import os
import binascii

filename = "hextests"
# read file as a binary string
with open(filename, 'rb') as f:
    content = f.read()
# convert binary string to hex string
hexString = binascii.hexlify(content)
# define magic number as "00"
magic_N = "00"
# attempting to create a new substring called newFile that is equal to each instance magic_N repeats throughout the file for a length of 8 characters
for chars in hexString:
    newFile = ""
    if chars == magic_N:
        newFile += chars.len(9)
    # attempting to create a series of new output files for each instance of newFile - while incrementing the output file name
    if newFile != "":
        i = 0
        while os.path.exists("output_file%s.xyz" % i):
            i += 1
        fh = with open("output_file%s.xyz" % i, "wb"):
            newFile
I'm sure I have a lot of errors to work through on this - and it's likely more complicated than I think .... but my main question has to do with the proper way to define the chars and newFile variables. I'm pretty sure python sees chars as only single characters in the string, so it's failing because I'm attempting to search for a magic_N that is longer than 1 character. Am I correct that that is part of the issue?
Also, if you understand the main goal of this script, any other thoughts about things I should be doing differently?
Thanks so much for the help!
You can try something like this:
filename = "hextests"
# read file as a binary string
with open(filename, "rb") as f:
content = f.read()
# You don't need this part if you want
# to parse the hex string as it is given in the file
# convert binary string to hex string
# hexString = binascii.hexlify(content)
# Remove the newline at the end of the string
hexString = content.strip()
# define magic number as "00"
magic_N = "00"
i = 0
j = 0
while i < len(hexString) - 1:
index = hexString.find(magic_N, i)
# This is the part which was incorrect in your code.
with open("output_file_%s.xyz" % j, "wb") as output:
output.write(hexString[i:i+8])
i += 8
j += 1
Note that you need to explicitly call write method to write the data to the output file.
Here it is assumed that the chunks of data are exactly 8 hex symbols long and they always start with 00. So it's not a flexible solution but it gives you an idea on how to tackle the problem.
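If the chunks are not all exactly 8 characters, a more flexible sketch (assuming every chunk starts with the magic number and that the magic number never appears elsewhere inside a chunk) is to split the hex text on the magic number itself:
hexString = "0011AABB00BBAACC00223344"  # illustrative input from the question
magic_N = "00"

# Split on the magic number and re-attach it to the front of each chunk.
# Caveat: this only works if the magic number never occurs inside a chunk.
chunks = [magic_N + piece for piece in hexString.split(magic_N) if piece]

for j, chunk in enumerate(chunks):
    with open("output_file_%s.xyz" % j, "w") as output:
        output.write(chunk)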
