segmenting and writing binary file using Python - python

I have two binary input files, firstfile and secondfile. secondfile is firstfile + additional material. I want to isolate this additional material in a separate file, newfile. This is what I have so far:
import os
import struct
origbytes = os.path.getsize(firstfile)
fullbytes = os.path.getsize(secondfile)
numbytes = fullbytes-origbytes
with open(secondfile,'rb') as f:
    first = f.read(origbytes)
    rest = f.read()
Naturally, my inclination is to do (which seems to work):
with open(newfile,'wb') as f:
    f.write(rest)
I can't find it but thought I read on SO that I should pack this first using struct.pack before writing to file. The following gives me an error:
with open(newfile,'wb') as f:
    f.write(struct.pack('%%%ds' % numbytes,rest))
-----> error: bad char in struct format
This works however:
with open(newfile,'wb') as f:
    f.write(struct.pack('c'*numbytes,*rest))
And for the ones that work, this gives me the right answer
with open(newfile,'rb') as f:
    test = f.read()
len(test)==numbytes
-----> True
Is this the correct way to write a binary file? I just want to make sure I'm doing this part correctly to diagnose if the second part of the file is corrupted as another reader program I am feeding newfile to is telling me, or I am doing this wrong. Thank you.

If you know that secondfile is the same as firstfile + appended data, why even read in the first part of secondfile?
with open(secondfile,'rb') as f:
    f.seek(origbytes)
    rest = f.read()
As for writing things out,
with open(newfile,'wb') as f:
    f.write(rest)
is just fine. The stuff with struct would just be a no-op anyway. The only thing you might consider is the size of rest. If it could be large, you may want to read and write the data in blocks.
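If rest could be too large to hold in memory, the block-wise approach mentioned above might look like the sketch below. The function name copy_tail and the 1 MiB default block size are my own choices for illustration, not anything from the question.

```python
def copy_tail(src_path, dst_path, offset, block_size=1024 * 1024):
    """Copy everything after the first `offset` bytes of src_path into dst_path."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)                  # skip the part shared with the first file
        while True:
            block = src.read(block_size)  # read at most block_size bytes
            if not block:                 # an empty result means end of file
                break
            dst.write(block)
            copied += len(block)
    return copied
```

With the names from the question, this would be called as copy_tail(secondfile, newfile, os.path.getsize(firstfile)).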

There is no reason to use the struct module, which is for converting between binary formats and Python objects. There's no conversion needed here.
Strings in Python 2.x are just an array of bytes and can be read and written to and from files. (In Python 3.x, the read function returns a bytes object, which is the same thing, if you open the file with open(filename, 'rb').)
So you can just read the file into a string, then write it again:
import os
origbytes = os.path.getsize(firstfile)
fullbytes = os.path.getsize(secondfile)
numbytes = fullbytes-origbytes
with open(secondfile,'rb') as f:
    f.seek(origbytes)
    rest = f.read()
with open(newfile,'wb') as f:
    f.write(rest)

You don't need to read the first origbytes bytes, just move the file pointer to the right position: f.seek(origbytes)
You don't need struct packing, write rest to the newfile.

This is not c, there is no % in the format string. What you want is:
f.write(struct.pack('%ds' % numbytes,rest))
It worked for me:
>>> struct.pack('%ds' % 5,'abcde')
'abcde'
Explanation: '%%%ds' % 15 is '%15s', while what you want is '%ds' % 15 which is '15s'
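To see both points at once (what the two format expressions expand to, and that packing raw bytes with 's' changes nothing), here is a small sketch. It assumes Python 3, where struct.pack needs a bytes value; the sample data is made up.

```python
import struct

numbytes = 5
rest = b"abcde"  # made-up sample data

printf_style = "%%%ds" % numbytes   # '%%' collapses to a literal '%'
struct_style = "%ds" % numbytes     # a count followed by 's'
print(printf_style)  # %5s
print(struct_style)  # 5s

# Packing a bytes object with the 's' format returns the same bytes.
packed = struct.pack(struct_style, rest)
print(packed == rest)  # True
```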

Related

Read() function erases text in file [duplicate]

Started Python a week ago and I have some questions to ask about reading and writing to the same files. I've gone through some tutorials online but I am still confused about it. I can understand simple read and write files.
openFile = open("filepath", "r")
readFile = openFile.read()
print readFile
openFile = open("filepath", "a")
appendFile = openFile.write("\nTest 123")
openFile.close()
But, if I try the following I get a bunch of unknown text in the text file I am writing to. Can anyone explain why I am getting such errors and why I cannot use the same openFile object the way shown below.
# I get an error when I use the codes below:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
readFile = openFile.read()
print readFile
openFile.close()
I will try to clarify my problems. In the example above, openFile is the object used to open file. I have no problems if I want write to it the first time. If I want to use the same openFile to read files or append something to it. It doesn't happen or an error is given. I have to declare the same/different open file object before I can perform another read/write action to the same file.
#I have no problems if I do this:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
openFile2 = open("filepath", "r+")
readFile = openFile2.read()
print readFile
openFile.close()
I will be grateful if anyone can tell me what I did wrong here, or whether it is just a Python thing. I am using Python 2.7. Thanks!
Updated Response:
This seems like a bug specific to Windows - http://bugs.python.org/issue1521491.
Quoting from the workaround explained at http://mail.python.org/pipermail/python-bugs-list/2005-August/029886.html
the effect of mixing reads with writes on a file open for update is
entirely undefined unless a file-positioning operation occurs between
them (for example, a seek()). I can't guess what
you expect to happen, but seems most likely that what you
intend could be obtained reliably by inserting
fp.seek(fp.tell())
between read() and your write().
My original response demonstrates how reading/writing on the same file opened for appending works. It is apparently not true if you are using Windows.
Original Response:
In 'r+' mode, the write method writes the string to the file at the current pointer position. In your case, it will overwrite the start of the file with the string "Test abc". See an example below:
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\n'
>>> f.write("foooooooooooooo")
>>> f.close()
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\nfoooooooooooooo'
The string "foooooooooooooo" got appended at the end of the file since the pointer was already at the end of the file.
Are you on a system that differentiates between binary and text files? You might want to use 'rb+' as a mode in that case.
Append 'b' to the mode to open the file in binary mode, on systems
that differentiate between binary and text files; on systems that
don’t have this distinction, adding the 'b' has no effect.
http://docs.python.org/2/library/functions.html#open
Every open file has an implicit pointer which indicates where data will be read and written. Normally this defaults to the start of the file, but if you use a mode of a (append) then it defaults to the end of the file. It's also worth noting that the w mode will truncate your file (i.e. delete all the contents) even if you add + to the mode.
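A quick way to check those starting positions yourself is sketched below. It assumes Python 3 and uses binary modes so that tell() reports plain byte offsets; the file name modes.bin is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.bin")
with open(path, "wb") as f:
    f.write(b"hello")

with open(path, "r+b") as f:
    pos_update = f.tell()      # update mode: pointer starts at the beginning

with open(path, "ab") as f:
    pos_append = f.tell()      # append mode: pointer starts at the end

with open(path, "w+b") as f:
    after_w_plus = f.read()    # 'w+' truncated the file on open

print(pos_update, pos_append, after_w_plus)
```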
Whenever you read or write N characters, the read/write pointer will move forward that amount within the file. I find it helps to think of this like an old cassette tape, if you remember those. So, if you executed the following code:
fd = open("testfile.txt", "w+")
fd.write("This is a test file.\n")
fd.close()
fd = open("testfile.txt", "r+")
print fd.read(4)
fd.write(" IS")
fd.close()
... It should end up printing This and then leaving the file content as This IS a test file.. This is because the initial read(4) returns the first 4 characters of the file, because the pointer is at the start of the file. It leaves the pointer at the space character just after This, so the following write(" IS") overwrites the next three characters with a space (the same as is already there) followed by IS, replacing the existing is.
You can use the seek() method of the file to jump to a specific point. After the example above, if you executed the following:
fd = open("testfile.txt", "r+")
fd.seek(10)
fd.write("TEST")
fd.close()
... Then you'll find that the file now contains This IS a TEST file..
All this applies on Unix systems, and you can test those examples to make sure. However, I've had problems mixing read() and write() on Windows systems. For example, when I execute that first example on my Windows machine then it correctly prints This, but when I check the file afterwards the write() has been completely ignored. However, the second example (using seek()) seems to work fine on Windows.
In summary, if you want to read/write from the middle of a file in Windows I'd suggest always using an explicit seek() instead of relying on the position of the read/write pointer. If you're doing only reads or only writes then it's pretty safe.
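As a sketch of that advice, here is the first example again with an explicit reposition between the read and the write. It is written for Python 3 and uses a temporary directory so it can be run anywhere; seek(tell()) is the workaround quoted from the bug report earlier in this thread.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
with open(path, "w") as fd:
    fd.write("This is a test file.\n")

with open(path, "r+") as fd:
    head = fd.read(4)        # reads the first four characters
    fd.seek(fd.tell())       # explicit file-positioning call before switching to writing
    fd.write(" IS")          # overwrites the next three characters

with open(path) as fd:
    content = fd.read()

print(repr(head), repr(content))
```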
One final point - if you're specifying paths on Windows as literal strings, remember to escape your backslashes:
fd = open("C:\\Users\\johndoe\\Desktop\\testfile.txt", "r+")
Or you can use raw strings by putting an r at the start:
fd = open(r"C:\Users\johndoe\Desktop\testfile.txt", "r+")
Or the most portable option is to use os.path.join():
fd = open(os.path.join("C:\\", "Users", "johndoe", "Desktop", "testfile.txt"), "r+")
You can find more information about file IO in the official Python docs.
Reading and writing happen at the current file pointer, which advances with each read/write.
In your particular case, writing to openFile causes the file pointer to point to the end of the file. Trying to read from the end results in EOF.
You need to reset the file pointer to the beginning of the file with seek(0) before reading from it.
You can read, modify and save to the same file in Python, but you actually have to replace the whole content of the file, and call this before updating the file content:
# set the pointer to the beginning of the file in order to rewrite the content
edit_file.seek(0)
I needed a function to go through all subdirectories of folder and edit content of the files based on some criteria, if it helps:
import io
import os

new_file_content = ""
for directories, subdirectories, files in os.walk(folder_path):
    for file_name in files:
        file_path = os.path.join(directories, file_name)
        # open file for reading and writing
        with io.open(file_path, "r+", encoding="utf-8") as edit_file:
            for current_line in edit_file:
                if condition in current_line:
                    # update current line
                    current_line = current_line.replace('john', 'jack')
                new_file_content += current_line
            # set the pointer to the beginning of the file in order to rewrite the content
            edit_file.seek(0)
            # delete actual file content
            edit_file.truncate()
            # rewrite updated file content
            edit_file.write(new_file_content)
        # empty the buffer for the next file; the with block has already closed this one
        new_file_content = ""

Python: CSV directly from the web, resulting in unusable data

I'm using Python 3.5 on Windows.
I have this little piece of code that downloads close to one hundred CSV files from different URLs stored in Links.txt:
from urllib import request
new_lines = 'None'
def download_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    global new_lines
    new_lines = csv_str.split("\\n")
with open('Links.txt') as file:
    for line in file:
        URL = line
        file_name = URL[54:].rsplit('.ST', 1)[0]
        download_data(URL)
        save_destination = 'C:\\Download data\\Data\\' + file_name + '.csv'
        fx = open(save_destination, "w")
        for lines in new_lines:
            fx.write(lines+"\n")
        fx.close()
The problem is that the CSV files generated always starts with b ' and after the last line of the data follows another ' and a couple of empty rows to wrap things up. I do not see these characters when I look at the files from the browser (before I download them).
This creates problems when I want to import and use the data in a database. Do you have any idea on why this happens and how I can get the code to write the CSV files correctly?
Tips that can make the code faster/better, or adjustments for other flaws in the code are obviously very welcome.
What's happening is that urllib treats its stream as bytes - any string that looks like b'...' means it's a byte-string.
Your immediate problem could be solved by decoding the stream with decode('utf-8') (as Chedy2149 shows), which converts the bytes to a string.
However, you can completely avoid this problem by downloading the file directly to disk. You go through the work of downloading it, splitting it, and writing it to disk, but all that seems unnecessary because your code ultimately just writes the file's contents to disk without doing any additional work on them.
You can use urllib.request.urlretrieve and download to a file directly.
Here's an example, modified from your code.
import urllib.request
def download_data(url, file_to_save):
    filename, rsp = urllib.request.urlretrieve(url, file_to_save)
    # Assuming everything worked, the file has been downloaded to file_to_save
with open('Links.txt') as file:
    for line in file:
        url = line.rstrip() # adding this here to remove extraneous '\n' from string
        file_name = url[54:].rsplit('.ST', 1)[0]
        save_destination = 'C:\\Download data\\Data\\' + file_name + '.csv'
        download_data(url, save_destination)
In the download_data function you need to convert the byte string csv response to a plain string.
Try replacing csv_str = str(csv) by csv_str = csv.decode('utf-8').
This should properly decode the byte string returned by response.read().
The problem is that your function returns a bytes object; str() doesn't convert it to a string the way you expect. Use csv_str = csv.decode() instead.
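The difference between str() and decode() is easy to see on a small bytes value. This is only an illustration in Python 3; the CSV content is made up and stands in for what response.read() returns.

```python
raw = b"date,close\n2016-01-04,101.5\n"   # made-up stand-in for response.read()

as_repr = str(raw)             # the printable representation, including b' and '
as_text = raw.decode("utf-8")  # the actual text

print(as_repr[:2])             # b'
rows = as_text.splitlines()
print(rows)
```

The first form is what ended up in the saved files: the leading b', escaped newlines that had to be split on "\\n", and a trailing quote.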

TypeError: expected a character buffer object - while trying to save integer to textfile

I'm trying to make a very simple 'counter' that is supposed to keep track of how many times my program has been executed.
First, I have a textfile that only includes one character: 0
Then I open the file, parse it as an int, add 1 to the value, and then try to return it to the textfile:
f = open('testfile.txt', 'r+')
x = f.read()
y = int(x) + 1
print(y)
f.write(y)
f.close()
I'd like to have y overwrite the value in the textfile, and then close it.
But all I get is TypeError: expected a character buffer object.
Edit:
Trying to parse y as a string:
f.write(str(y))
gives
IOError: [Errno 0] Error
Have you checked the docstring of write()? It says:
write(str) -> None. Write string str to file.
Note that due to buffering, flush() or close() may be needed before
the file on disk reflects the data written.
So you need to convert y to str first.
Also note that the string will be written at the current position, which will be at the end of the file, because you'll already have read the old value. Use f.seek(0) to get to the beginning of the file.
Edit: As for the IOError, this issue seems related. A quote from there:
For the modes where both read and writing (or appending) are allowed
(those which include a "+" sign), the stream should be flushed (fflush)
or repositioned (fseek, fsetpos, rewind) between either a reading
operation followed by a writing operation or a writing operation
followed by a reading operation.
So, I suggest you try f.seek(0) and maybe the problem goes away.
from __future__ import with_statement
with open('file.txt','r+') as f:
    counter = str(int(f.read().strip())+1)
    f.seek(0)
    f.write(counter)
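The same idea wrapped in a function, as a sketch for Python 3. The name bump_counter is made up, and the truncate() call is my addition: it is not strictly needed for a counter that only grows, but it guards against leftover characters if the new text is ever shorter than the old one.

```python
def bump_counter(path):
    """Read an integer from the file, add 1, and write it back in place."""
    with open(path, "r+") as f:
        value = int(f.read().strip()) + 1
        f.seek(0)             # reposition between the read and the write
        f.write(str(value))
        f.truncate()          # drop anything left over after the new value
    return value
```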
Just try the code below:
'r+' opens the file for both reading and writing without truncating it, so after read() the position is at the end of the file. A simpler approach is to read the value in 'r' mode and then reopen the file in write mode 'w' if you want to overwrite
the file contents and write new data; otherwise you can append data to the file by using 'a'.
I hope this will help ;)
f = open('testfile.txt', 'r') # open for reading first
x = f.read() # read the stored value
f.close()
y = int(x)+1
print y
z = str(y) # convert to a string, because write() expects a string
f = open('testfile.txt', 'w') # 'w' truncates the file, so the new value overwrites the old one
f.write(z)
f.close()

encoding string

"afile" is a previously existing file.
handle=open("afile",'r+b')
data=handle.readline()
handle.close()
# signgenerator is a hashlib.md5() object
signgenerator.update(data)
hex=signgenerator.hexdigest()
print(hex) # prints out 061e3f139c80d04f039b7753de5313ce
and write this to a file
f=open("syncDB.txt",'a')
#hex=hex.encode('utf-8')
pickle.dump(hex,f)
f.close()
But when i read back the file as
while True:
    data=f.readline()
    print(data)
This gives the output:
b'\x80\x03X \x00\x00\x00061e3f139c80d04f039b7753de5313ceq\x00.\x80\x03X \x00\x00\x00d9afd4bb6bc57679f6b10c0b9610d2e0q\x00.\x80\x03X \x00\x00\x008b70452c46285d825d3670d433151841q\x00.\x80\x03X \x00\x00\x00061e3f139c80d04f039b7753de5313ceq\x00.\x80\x03X \x00\x00\x00d9afd4bb6bc57679f6b10c0b9610d2e0q\x00.\x80\x03X \x00\x00\x008b70452c46285d825d3670d433151841q\x00.\x80\x03X \x00\x00\x00b857c3b319036d72cb85fe8a679531b0q\x00.\x80\x03X \x00\x00\x007532fb972cdb019630a2e5a1373fe1c5q\x00.\x80\x03X \x00\x00\x000126bb23767677d0a246d6be1d2e4d5cq\x00.'
How do i encode to get the same hexdigest back from these bytes??
Also I am getting some gibberish characters in syncDb.txt like "€X" after each line.How do I correctly write the data in a readable form??
You need to unpickle the data:
pickle.load(open('syncDB.txt', 'r+b'))
What you have there is pickled data. Proof:
>>> import pickle
>>> pickle.loads(b'\x80\x03X \x00\x00\x00061e3f139c80d04f039b7753de5313ceq\x00.\x80\x03X \x00\x00\x00d9afd4bb6bc57679f6b10c0b9610d2e0q\x00.\x80\x03X \x00\x00\x008b70452c46285d825d3670d433151841q\x00.\x80\x03X \x00\x00\x00061e3f139c80d04f039b7753de5313ceq\x00.\x80\x03X \x00\x00\x00d9afd4bb6bc57679f6b10c0b9610d2e0q\x00.\x80\x03X \x00\x00\x008b70452c46285d825d3670d433151841q\x00.\x80\x03X \x00\x00\x00b857c3b319036d72cb85fe8a679531b0q\x00.\x80\x03X \x00\x00\x007532fb972cdb019630a2e5a1373fe1c5q\x00.\x80\x03X \x00\x00\x000126bb23767677d0a246d6be1d2e4d5cq\x00.')
'061e3f139c80d04f039b7753de5313ce'
But there's no point in pickling a hex string. You can just put it in the file. The pickle module should be used with more complex structures, like arrays, dicts, or even classes.
Don't pickle the hexdigest, just write it out as text.
with open("afile",'rb') as handle:
    data=handle.readline()
signgenerator.update(data)
hex=signgenerator.hexdigest()
# hexdigest() returns a str, so open the text file in text mode
with open("syncDB.txt",'a') as f:
    f.write(hex + '\n')
with open("syncDB.txt",'r') as f:
    for data in f:
        print(data)
If you really want to use pickle, you need to use the pickle.load function to read the data back from the file.
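For completeness, a self-contained round trip of the plain-text approach, as a sketch for Python 3. The hashed input is a made-up byte string standing in for the first line of afile, and the file is created in a temporary directory.

```python
import hashlib
import os
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "syncDB.txt")

# Made-up input standing in for the first line of "afile".
digest = hashlib.md5(b"first line of afile\n").hexdigest()

with open(db_path, "a") as f:
    f.write(digest + "\n")        # one digest per line, human readable

with open(db_path) as f:
    stored = [line.strip() for line in f]

print(stored == [digest])
```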

How can I change a huge file into csv in python

I'm a beginner in python. I have a huge text file (hundreds of GB) and I want to convert the file into a csv file. In my text file, I know the row delimiter is the string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do it without having to read the old file and write a new file?
Normally I thought I need to do something like this:
fin = open("input", "r")
fout = open("output", "w")
line = fin.readline()
while line != "":
    if "<><><><><><><>" in line:
        fout.write("\"\n")
    else:
        fout.write(line)
    line = fin.readline()
but copying hundreds of GB is wasteful. Also I don't know if open will eat lots of memory (does it treat file handler as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.
@richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.
Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks ok at first glance. Almost every solution will require you to copy the file elsewhere. It's not exactly easy to modify the contents of a file in place without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.
It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handler is a stream.
Reading lines is simply done using a file iterator:
for line in fin:
    if "<><><><><><><>" in line:
        fout.write("\"\n")
Also consider the CSV writer object to write CSV files, e.g:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
With python you will have to create a new file for safety's sake; it will cause a lot fewer headaches than trying to write in place.
The code below reads your input one line at a time and buffers the columns (from what I understood, your test input file was one row). Once the end-of-row delimiter is hit, it writes that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well: instead of writing every segment, 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []
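for i, line in enumerate(fin):
    if not i % 1000:
        fout.flush()
    if row_delim in line:
        # skip the write for the very first delimiter, when nothing is buffered yet
        if write_buffer:
            fout.write('"%s"\r\n'%'","'.join(write_buffer))
            write_buffer = []
    else:
        write_buffer.append(line.strip())
fin.close()
fout.close()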
Hope that helps.
EDIT: Forgot to mention, while using .readline() is not a bad thing don't use .readlines() which will go and read the entire content of the file into a list containing each line which is incredibly inefficient. Using the built in iterator that comes with a file object is the best memory usage and speed.
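A Python 3 variant of the same streaming idea, using the csv module so that embedded double quotes get escaped for you. This is a sketch under the same assumption as the code above, namely that each line between two delimiters becomes one column of a single row; records_to_csv is a made-up name.

```python
import csv

ROW_DELIM = "<><><><><><><>"

def records_to_csv(in_path, out_path):
    """Stream in_path line by line and write one CSV row per delimited record."""
    rows = 0
    with open(in_path, "r") as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        fields = []
        for line in fin:                    # the file iterator reads one line at a time
            if ROW_DELIM in line:
                if fields:                  # a delimiter closes the current record
                    writer.writerow(fields)
                    rows += 1
                    fields = []
            else:
                fields.append(line.strip())
        if fields:                          # last record if there is no trailing delimiter
            writer.writerow(fields)
            rows += 1
    return rows
```

Only one record is held in memory at a time, so the size of the input file does not matter much as long as individual records are small.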
@Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by a double quote followed by enough spaces to keep the same length, then the replacement string is the same length as the original, and in that case you can craft a solution to in-place editing with mmap. You will need python 2.6. It's vital that the file is opened in the right mode!
import mmap, os
CHUNK = 2**20
oldStr = '<><><><><><><>'
newStr = '"' + ' ' * (len(oldStr) - 1) # pad with spaces to the same length
strLen = len(oldStr)
assert strLen==len(newStr)
f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size
for offset in range(0,size,CHUNK):
    map = mmap.mmap(f.fileno(),
                    length=min(CHUNK+strLen,size-offset), # not beyond EOF
                    offset=offset)
    index = 0 # start at beginning
    while 1:
        index = map.find(oldStr,index) # find next match
        if index == -1: # no more matches in this map
            break
        map[index:index+strLen] = newStr
    map.close()
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.
If you're delimiting fields with double quotes, it looks like you need to escape the double quotes you have occurring in your elements (for example 1231214 "" will need to be \n1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
    if "<><><><><><><>" in line:
        fout.write("\"\n")
    else:
        fout.write(line.replace('"',r'\"'))
fin.close()
fout.close()
[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)
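The same temporary-file pattern can be sketched in Python 3: write the modified copy next to the original, then swap it into place. The function name rewrite_file is made up, and this is only an illustration of the pattern, not a description of how sed is implemented.

```python
import os
import tempfile

def rewrite_file(path, old, new):
    """Replace `old` with `new` on every line, going through a temporary file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as tmp, open(path) as src:
        for line in src:                     # streams one line at a time
            tmp.write(line.replace(old, new))
    os.replace(tmp_path, path)               # swap the new file into place
```

This still copies every byte once, which is the point made above: the copy cannot be avoided when the replacement changes the length of the data.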
