Why does PyPDF2.PdfFileReader() require a file object as an input?

csv.reader() doesn't require a file object, nor does open(). Does PyPDF2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?

It's just a matter of how the library was written. csv.reader accepts any iterable that yields strings (which includes file objects). open is what creates the file object in the first place, so of course it doesn't take one (although it can take an integer referring to an already-open file descriptor). Typically it is better to handle the file yourself, usually within a with block so that it is closed properly.
import PyPDF2

with open('input.pdf', 'rb') as f:
    reader = PyPDF2.PdfFileReader(f)  # do something with the reader; the file must stay open
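As an aside, the point about csv.reader is easy to demonstrate: any iterable of strings works, not just a file (a minimal sketch):

import csv

# A plain list of strings is a perfectly valid source:
for row in csv.reader(["a,b,c", "1,2,3"]):
    print(row)  # ['a', 'b', 'c'], then ['1', '2', '3']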

pypdf can take a BytesIO stream or a file path as well. I actually recommend passing the file path in most cases as pypdf will then take care of closing the file for you.
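For example, with the current pypdf API (a short sketch; input.pdf is a placeholder path):

from io import BytesIO
from pypdf import PdfReader

# Passing a path lets pypdf open and close the file itself:
reader = PdfReader("input.pdf")

# An in-memory stream works just as well:
with open("input.pdf", "rb") as f:
    reader = PdfReader(BytesIO(f.read()))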

Related

Fastest Method to read many files line by line in Python

I have a conceptual question. I am new to Python and I am looking to do a task that involves processing bigger log files. Some of these can get up to 5 or 6 GB.
I need to parse through many files in a location. These are text files.
I know of the with open() method, and recently just ran into pathlib. So I not only need to read the file line by line to extract values to upload into a DB, I also need to get the file properties that pathlib gives you and upload them as well.
Is it faster to use with open and, underneath it, create a Path object from which to read files... something like this:
from glob import glob
from pathlib import Path

for filename in glob('**/*.*', recursive=True):  # note: '**' only recurses when recursive=True
    fpath = Path(filename)
    with open(filename, 'rb', buffering=102400) as logfile:
        for line in logfile:
            # regex operation
            print(line)
Or would it be better to use Pathlib:
with Path("src/module.py") as f:
    contents = open(f, "r")
    for line in contents:
        # regex operation
        print(line)
Also, I've never used pathlib to open files for reading. When it comes to this: Path.open(mode='r', buffering=-1, encoding=None, errors=None, newline=None)
What do newline and errors mean? I assume buffering here is the same as buffering in the with open function?
I also saw this contraption that uses with open in conjunction with a Path object, though how it works I have no idea:
path = Path('.editorconfig')
with open(path, mode='wt') as config:
    config.write('# config goes here')
pathlib is intended to be a more elegant solution to interacting with the file system, but it's not necessary. It'll add a small amount of fixed overhead (since it wraps other lower level APIs), but shouldn't change how performance scales in any meaningful way.
Since, as noted, pathlib is largely a wrapper around lower-level APIs, you should know that Path.open is implemented in terms of the built-in open, and the arguments all mean the same thing for both; the docs for the built-in open describe them. In short: buffering works exactly as it does for open; errors says what to do when data can't be decoded or encoded in the chosen encoding (e.g. 'strict' raises an exception, 'replace' substitutes a marker character); and newline controls universal newline handling for text-mode files (None, the default, translates any of '\n', '\r', or '\r\n' to '\n' on reading).
As for the last bit of your question (passing a Path object to the built-in open), that works because most file-related APIs were updated to support any object that implements the os.PathLike ABC.
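A short sketch of both spellings (the file name is a placeholder):

from pathlib import Path

p = Path("src/module.py")

# Path.open forwards its arguments to the built-in open:
with p.open("r", encoding="utf-8", errors="replace", newline=None) as f:
    for line in f:
        pass  # regex operation goes here

# Because Path implements os.PathLike, the built-in open accepts it directly:
with open(p, "r") as f:
    contents = f.read()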

Will I damage a text file by binary copying it?

I want to copy a tree of files/directories (recursively, of course) that have bad characters in the file names. So, I'm opening the file, reading its contents, and dumping them into a new file with a cleaned name.
If the contents of a file are text, and I read() them into write() while in binary mode, is there any chance of that damaging the contents?
import os
from os.path import isfile

for name in os.listdir(src_path):
    cleaned = clean_name(name)
    src_full = os.sep.join((src_path, name))     # read from the original name
    dst_full = os.sep.join((dst_path, cleaned))  # write to the cleaned name
    ...
    if isfile(src_full):
        with open(dst_full, 'xb') as dst_file:
            with open(src_full, 'rb') as src_file:
                dst_file.write(src_file.read())
No, there is no chance of damaging the contents. You'll be reading the exact contents as they are, bit for bit, insofar as your hardware can provide you with the accurate contents.
What can happen is that you forget to copy across all the file metadata; access control information and modification and creation dates are lost, for example.
Rather than read the whole file into memory, use the shutil.copyfile() function to handle the file copy for you; it'll copy data across in blocks. Better still, use the shutil.copy() or shutil.copy2() functions and it'll copy across permissions too; copy2() also copies file access and creation times.
All three functions open the file in binary mode; the source is opened with 'rb', the destination with 'wb'. If you must have exclusive opening (mode 'xb'), you'll need to open the file objects yourself (as you already do) and use shutil.copyfileobj() to get the efficient file copy, followed either by a shutil.copymode() call (to replicate shutil.copy() and copy file permissions) or a shutil.copystat() call (to replicate what shutil.copy2() does).
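A sketch of that exclusive-create variant, reusing src_full and dst_full from the question:

import shutil

with open(src_full, 'rb') as src_file, open(dst_full, 'xb') as dst_file:
    shutil.copyfileobj(src_file, dst_file)  # block-wise copy, no full read into memory
shutil.copystat(src_full, dst_full)         # permissions and timestamps, like copy2()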

How can I read from an already opened file in universal newline mode?

I have a file-like object representing a potentially endless stream. I want to read from this stream and count the lines, among other things, and I want to use universal newlines.
I don't have access to the statement that opens the file, so I can't just add mode='rU' to the open call or an equivalent thereof.
Nor can I read the entire file into memory and use splitlines() or io.StringIO(unicode(mystream.read()), newline=None).
Does anyone know of a way to accomplish this?
A Python file-like object typically supports the .fileno() method, which returns the underlying file descriptor. Once you have that descriptor, you should be able to use os.fdopen(file_descriptor, "rU") to obtain a new file object with universal-newline semantics.
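A minimal sketch (mystream is the already-open stream from the question; duplicating the descriptor first is my own precaution, so that closing the new object doesn't close the original stream's descriptor):

import os

fd = os.dup(mystream.fileno())  # duplicate the underlying descriptor
ustream = os.fdopen(fd, "rU")   # "rU": universal-newline mode (Python 2)

line_count = 0
for line in ustream:
    line_count += 1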

cPickle.dump always dumping at end of file

cPickle.dump(object, file) always dumps at the end of the file. Is there a way to dump at a specific position in the file? I expected the following snippet to work:
file = open("test","ab")
file.seek(50,0)
cPickle.dump(object, file)
file.close()
However, the above snippet dumps the object at the end of the file (assume file already contains 1000 chars), no matter where I seek the file pointer to.
I think this may be more of a problem with how you open the file than with cPickle.
ab mode sets the O_APPEND flag on the low-level open syscall, which means every write goes to the end of the file no matter where you seek beforehand. If you want to write at an arbitrary position in an existing file, open it with r+b instead.
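A sketch of that fix (file name and offset taken from the question; obj stands in for the object being pickled):

import cPickle

f = open("test", "r+b")  # update mode: the seek position is honoured
f.seek(50, 0)
cPickle.dump(obj, f)
f.close()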
If this doesn't solve your problem and your objects are not very large, you can still use dumps:
import cPickle

f = open("test", "r+b")  # again r+b, not ab
f.seek(50, 0)
dumped = cPickle.dumps(obj)
f.write(dumped)
f.close()

Adding a file-like object to a Zip file in Python

The Python ZipFile API seems to allow the passing of a file path to ZipFile.write or a byte string to ZipFile.writestr, but nothing in between. I would like to be able to pass a file-like object, in this case a django.core.files.storage.DefaultStorage, but any file-like object in principle. At the moment I think I'm going to have to either save the file to disk or read it into memory. Neither of these is perfect.
You are correct, those are the only two choices. If your DefaultStorage object is large, you may want to go with saving it to disk first; otherwise, I would use:
zipped = ZipFile(...)
zipped.writestr('archive_name', default_storage_object.read())
If default_storage_object is a StringIO object, you can use default_storage_object.getvalue() instead.
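For instance (a sketch using io.BytesIO as a stand-in for the storage object; names are placeholders):

from io import BytesIO
from zipfile import ZipFile

buf = BytesIO(b"contents to archive")  # stand-in for default_storage_object

with ZipFile("out.zip", "w") as zipped:
    zipped.writestr("archive_name", buf.getvalue())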
While there's no option that takes a file-like object, there is an option to open a zip entry for writing: ZipFile.open (with mode 'w', available since Python 3.6).
import zipfile
import shutil

with zipfile.ZipFile('test.zip', 'w') as archive:
    with archive.open('test_entry.txt', 'w') as outfile:
        with open('test_file.txt', 'rb') as infile:
            shutil.copyfileobj(infile, outfile)
You can use your input stream as the source instead, and not have to copy the file to disk first. The downside is that if something goes wrong with your stream, the zip file will be unusable. In my application, we bypass files with errors, so we end up getting a local copy of the file anyway to ensure integrity and keep a usable zip file.
