I want to copy a tree of files/directories (recursively, of course) that have bad characters in the file names. So, I'm opening the file, reading its contents, and dumping them into a new file with a cleaned name.
If the contents of a file are text, and I read() them into write() while in binary mode, is there any chance of that damaging the contents?
import os
from os.path import isfile

for name in os.listdir(src_path):
    cleaned = clean_name(name)                   # sanitised name for the copy
    src_full = os.path.join(src_path, name)      # keep the original name to read from
    dst_full = os.path.join(dst_path, cleaned)
    ...
    if isfile(src_full):
        with open(dst_full, 'xb') as dst_file:
            with open(src_full, 'rb') as src_file:
                dst_file.write(src_file.read())
No, there is no chance of damaging the contents. You'll be reading the exact contents as they are, bit for bit, insofar as your hardware provides you with accurate contents.
What can happen is that you forget to copy across all the file metadata; access control information and modification and creation dates are lost, for example.
Rather than read the whole file into memory, use the shutil.copyfile() function to handle the file copy for you; it'll copy the data across in blocks. Better still, use the shutil.copy() or shutil.copy2() functions: they copy permissions too, and copy2() also copies the file access and modification times.
All three functions open the file in binary mode; the source is opened with 'rb', the destination with 'wb'. If you must have exclusive opening (mode 'xb'), you'll need to open the file objects yourself (as you already do) and use shutil.copyfileobj() to get the efficient file copy, followed either by a shutil.copymode() call (to replicate shutil.copy() and copy file permissions) or a shutil.copystat() call (to replicate what shutil.copy2() does).
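For instance, a minimal sketch of the exclusive-open variant, reusing the src_full and dst_full names from the question's snippet:

import shutil

with open(src_full, 'rb') as src_file, open(dst_full, 'xb') as dst_file:
    shutil.copyfileobj(src_file, dst_file)   # blockwise copy, no full read into memory
shutil.copystat(src_full, dst_full)          # permissions + access/modification times, like copy2()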
I have a Python script that takes two arguments, the names of the input and output files, i.e. it starts off like
inputFile = open(sys.argv[1], 'r')
outFile = open(sys.argv[2], 'w')
It then performs some operation, reading from inputFile and writing to outFile.
Now, a few times, through human error, I've accidentally given the same argument twice, the result being that my input file is replaced with a blank file. Is there a straightforward way to stop this happening?
I thought it might be as simple as adding
if sys.argv[1] == sys.argv[2]:
    inputFile.close()
    outFile.close()
immediately after the first lines above, but by then the file has already been blanked.
Simply do:
import os

if os.path.realpath(sys.argv[1]) != os.path.realpath(sys.argv[2]):
    inputFile = open(sys.argv[1], 'r')
    outFile = open(sys.argv[2], 'w')
else:
    raise ValueError('Input and output files are the same')
This will prevent human mistakes by raising a welcome error that won't destroy your input file.
os.path.realpath will transform any relative path into an absolute path, so that even if the strings differ, you can raise the error when the absolute paths are identical (thanks @Jean-François Fabre for reminding me of this).
Opening the file for writing immediately truncates it, so the damage is already done by the time you compare the strings.
That said:
On Windows filesystems the protection is "built-in": a file open in read mode cannot be opened in write mode at the same time, so you're safe (there's a "grey area" for networked filesystems, though).
On Linux/Unix the risk is real, and comparing the names isn't enough. What if two different paths point to the same file? (Consider foo/bar and /mydrive/foo/bar, or foo/../bar and bar.)
You could call os.path.realpath() on both names before comparing; it resolves relative segments and symbolic links (it won't catch hard links, for which os.path.samefile() exists, but it's better than nothing).
And for the Windows "grey area" I was mentioning, comparing the lowercase versions of the names would be a good idea.
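A minimal sketch combining both ideas, using os.path.normcase (which lowercases and normalises separators on Windows, and is a no-op on POSIX):

import os
import sys

src = os.path.normcase(os.path.realpath(sys.argv[1]))
dst = os.path.normcase(os.path.realpath(sys.argv[2]))
if src == dst:
    raise ValueError('Input and output files are the same')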
The input file is becoming blank because open(filename, 'w') truncates the file before writing to it. 'w' is useful for creating a file and then writing to it. If you want to add to a pre-existing file, try open(filename, 'a') for appending; 'a' also creates the file if it doesn't already exist. Since it sounds like you have two existing files, append may be what you need.
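For illustration, a tiny sketch of append mode (the file name is hypothetical):

# 'a' creates the file if it doesn't exist and always writes at the end,
# so the existing contents are preserved
with open('output.txt', 'a') as f:
    f.write('another line\n')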
If you decide to go the if sys.argv[1] == sys.argv[2] route, note that sys.argv entries are already strings, so they can be compared directly; wrapping them in str() is unnecessary.
I have a Python program which does the following:
It takes a list of files as input
It iterates through the list several times, each time opening the files and then closing them
What I would like is some way to open each file once at the beginning, and then, on each iteration, make a copy of each file handle. Essentially this would be a copy operation on file handles that lets a file be traversed independently by multiple handles. The reason for wanting this is that on Unix systems, if a program holds a file handle and the corresponding file is then deleted, the program can still read the file through that handle. If I reopened the files by name on each iteration, the files might have been renamed or deleted in the meantime, so that wouldn't work. If I used f.seek(0), that might affect another thread/generator/iterator.
I hope my question makes sense, and I would like to know if there is a way to do this.
If you really want a copy of a file handle, you need the POSIX dup system call. In Python, that is exposed as os.dup (see the docs). If you have a file object (e.g. from calling open()), call its fileno() method to get the underlying file descriptor.
So the entire code will look like this:
import os

with open("myfile") as f:
    fd = f.fileno()       # get the underlying file descriptor
    fd2 = os.dup(fd)      # duplicate the descriptor
    f2 = os.fdopen(fd2)   # wrap the duplicate in a new file object
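Note that a descriptor duplicated with dup shares its seek offset with the original (they refer to the same open file description), so the two file objects are not fully independent cursors; for example:

data1 = f.read(10)    # advances the shared offset
data2 = f2.read(10)   # continues where f left off, not from byte 0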
csv.reader() doesn't require a file object, nor does open(). Does pyPdf2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?
It's just a matter of how the library was written. csv.reader accepts any iterable that yields strings (which includes file objects). open is what opens the file, so of course it doesn't take an open file (although it can take an integer referring to an open file descriptor). Typically, it is better to handle the file separately, usually within a with block so that it is closed properly.
with open('input.pdf', 'rb') as f:
    # do something with the file
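For instance, csv.reader happily consumes a plain list of strings, no file involved:

import csv

rows = csv.reader(["a,b,c", "1,2,3"])
print(list(rows))   # [['a', 'b', 'c'], ['1', '2', '3']]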
pypdf can take a BytesIO stream or a file path as well. I actually recommend passing the file path in most cases as pypdf will then take care of closing the file for you.
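A short sketch of both options (assuming a recent pypdf with the PdfReader class):

from io import BytesIO
from pypdf import PdfReader

reader = PdfReader("input.pdf")              # pypdf opens and closes the file itself

with open("input.pdf", "rb") as f:
    reader2 = PdfReader(BytesIO(f.read()))   # or hand it an in-memory stream

print(len(reader.pages))                     # e.g. the number of pages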
cPickle.dump(object,file) always dumps at the end of the file. Is there a way to dump at specific position in the file? I expected the following snippet to work
file = open("test","ab")
file.seek(50,0)
cPickle.dump(object, file)
file.close()
However, the above snippet dumps the object at the end of the file (assume file already contains 1000 chars), no matter where I seek the file pointer to.
I think this may be more of a problem with how you open the file than with cPickle.
The ab mode is an append mode: besides creating the file if needed, it sets the low-level O_APPEND flag, which forces every write to the end of the file no matter where you seek beforehand. If you want to overwrite at an arbitrary position, open the file with r+b instead.
If this doesn't solve your problem and your objects are not very large, you can still use dumps:
f = open("test", "r+b")        # read/write without truncating, and no forced append
f.seek(50, 0)                  # position 50 bytes from the start
dumped = cPickle.dumps(obj)
f.write(dumped)
f.close()
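Reading the object back later is the mirror image (a sketch, using the same offset):

f = open("test", "rb")
f.seek(50, 0)               # jump to where the object was written
restored = cPickle.load(f)  # load reads exactly one pickled object
f.close()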
I want to copy the contents of a JSON file into another JSON file, with Python.
Any ideas?
Thank you :)
Given the lack of research effort, I normally wouldn't answer, but given the poor suggestions in comments, I'll bite and give a better option.
Now, this largely depends on what you mean: do you wish to overwrite the contents of one file with another, or to insert? The latter can be done like so:
with open("from.json", "r") as from, open("to.json", "r") as to:
to_insert = json.load(from)
destination = json.load(to)
destination.append(to_insert) #The exact nature of this line varies. See below.
with open("to.json", "w") as to:
json.dump(to, destination)
This uses Python's json module, which allows us to do this very easily.
We open the two files for reading, then open the destination file again in writing mode to truncate it and write to it.
The marked line depends on the JSON data structure; here I am appending to the root list element (which might not exist), but you may want to place the data at a particular dict key, or some such.
In the case of replacing the contents, it becomes easier:
with open("from.json", "r") as from, open("to.json", "w") as to:
to.write(from.read())
Here we literally just read the data out of one file and write it into the other file.
Of course, you may wish to check the data is JSON, in which case, you can use the JSON methods as in the first solution, which will throw exceptions on invalid data.
Another, arguably better, solution is shutil's copy functions, which avoid reading and writing the file contents manually altogether.
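For example, a minimal sketch; shutil.copyfile handles the blockwise copy for you:

import shutil

shutil.copyfile("from.json", "to.json")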
Using the with statement gives us the benefit of automatically closing our files - even if exceptions occur. It's best to always use them where we can.
Note that in versions of Python before 2.7, multiple context managers cannot be combined in a single with statement; instead, you will need to nest them:
with open("from.json", "r") as from:
with open("to.json", "r+") as to:
...