Fastest Method to read many files line by line in Python - python

I have a conceptual question. I am new to Python and I am looking to do task that involves processing bigger log files. Some of these can get up to 5 and 6GB
I need to parse through many files in a location. These are text files.
I know of the with open() method, and recently just ran into pathlib. So I need to not only read the file line by line to extract values to upload into a DB, i also need to get file properties that Pathlib gives you and upload them as well.
Is it faster to use with open and underneath it, call a path object from which to read files... something like this:
for filename in glob('**/*.*', recursive=False):
fpath = Path(filename)
with open(filename, 'rb', buffering=102400) as logfile:
for line in logfile:
#regex operation
print(line)
Or would it be better to use Pathlib:
with Path("src/module.py") as f:
contents = open(f, "r")
for line in contents:
#regex operation
print(line)
Also since I've never used Pathlib to open files for reading. When it comes to this: Path.open(mode=’r’, buffering=-1, encoding=None, errors=None, newline=None)
What does newline and errors mean? I assume buffering here is the same as buffering in the with open function?
I also saw this contraption that uses with open in conjuction with Path object though how it works, I have no idea:
path = Path('.editorconfig')
with open(path, mode='wt') as config:
config.write('# config goes here')

pathlib is intended to be a more elegant solution to interacting with the file system, but it's not necessary. It'll add a small amount of fixed overhead (since it wraps other lower level APIs), but shouldn't change how performance scales in any meaningful way.
Since, as noted, pathlib is largely a wrapper around lower level APIs, you should know Path.open is implemented in terms of open, and the arguments all mean the same thing for both; reading the docs for the built-in open will describe the arguments.
As for the last bit of your question (passing a Path object to the built-in open), that works because most file-related APIs were updated to support any object that implements the os.PathLike ABC.

Related

What is the most pythonic way to open a file?

I'm trying to clean up my code a little bit, and I have trouble figuring which of these 2 ways is considered the most pythonic one
import os
dir = os.path.dirname(__file__)
str1 = 'filename.txt'
f = open(os.path.join(dir,str1),'r')
Although the second seems to be cleanest one, I find the declaration of fullPath a bit too much, since it will only be used once.
import os
dir = os.path.dirname(__file__)
str1 = 'filename.txt'
fullPath = os.path.join(dir,str1)
f = open(fullPath,'r')
In general, is it a better thing to avoid calling functions inside of another call, even if it adds a line of code ?
with open('file path', 'a') as f:
data = f.read()
#do something with data
or
f = open(os.path.join(dir,str1),'r')
f.close()
file = open('newfile.txt', 'r')
for line in file:
print line
OR
lines = [line for line in open('filename')]
If file is huge, read() is definitively bad idea, as it loads (without size parameter), whole file into memory.
If your file is huge this will cause latency !
So, i don't recommend read() or readlines()
There are many ways to open files in python which goes to say that there really isn't really a pythonic way of doing it. It all just boils down to which method you see are most connivence, especially in regards to what you're actually trying to do with the file once its open.
Most users use the IDLE GUI "click" to open files because it allows them to view the current file and also make some alterations if there's a need for such.
Others might just rely on the command lines to perform the task, at the cost of not being able to do anything other than opening the file.
Using Command Lines:
% python myfile.py
note that in order for this to work you need to make sure the system is "looking" into the directory where your file is storied. Using the 'cd' is useful to finding you route there.
% python import myfile myfile.title
This method is known as the object.attribute method of opening files. This method is useful when the file you're opening has an operation that you would like to implement.
There are more ways than what's been stated above, be sure to consult the pyDocs for further details.

Python basics - request data from API and write to a file

I am trying to use "requests" package and retrieve info from Github, like the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size):
fd.write(chunk)
I have to say I don't understand the second code block.
filename - in what form do I provide the path to the file if created? where will it be saved if not?
'wb' - what is this variable? (shouldn't second parameter be 'mode'?)
following two lines probably iterate over data retrieved with request and write to the file
Python docs explanation also not helping much.
EDIT: What I am trying to do:
use Requests to connect to an API (Github and later Facebook GraphAPI)
retrieve data into a variable
write this into a file (later, as I get more familiar with Python, into my local MySQL database)
Filename
When using open the path is relative to your current directory. So if you said open('file.txt','w') it would create a new file named file.txt in whatever folder your python script is in. You can also specify an absolute path, for example /home/user/file.txt in linux. If a file by the name 'file.txt' already exists, the contents will be completely overwritten.
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means bytes. You use 'w' when you want to write (rather than read) froma file, and you use 'b' for binary files (rather than text files). It is actually a little odd to use 'b' in this case, as the content you are writing is a text file. Specifying 'w' would work just as well here. Read more on the modes in the docs for open.
The Loop
This part is using the iter_content method from requests, which is intended for use with large files that you may not want in memory all at once. This is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to be in memory. However, we can make your code more readable and easy to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt','w') as fd:
fd.write(r.text)
filename is a string of the path you want to save it at. It accepts either local or absolute path, so you can just have filename = 'example.html'
wb stands for WRITE & BYTES, learn more here
The for loop goes over the entire returned content (in chunks incase it is too large for proper memory handling), and then writes them until there are no more. Useful for large files, but for a single webpage you could just do:
# just W becase we are not writing as bytes anymore, just text.
with open(filename, 'w') as fd:
fd.write(r.content)

Why does pyPdf2.PdfFileReader() require a file object as an input?

csv.reader() doesn't require a file object, nor does open(). Does pyPdf2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?
It's just a matter of how the library was written. csv.reader allows any iterable that returns strings (which includes files). open is opening the file, so of course it doesn't take an open file (although it can take an integer pointing at an open file descriptor). Typically, it is better to handle the file separately, usually within a with block so that it is closed properly.
with open('input.pdf', 'rb') as f:
# do something with the file
pypdf can take a BytesIO stream or a file path as well. I actually recommend passing the file path in most cases as pypdf will then take care of closing the file for you.

python: open two files as one fileobject

I have two files: a header and the body. I am using a library to read the whole thing. I can use "fileinput.input" to create one FileInput object and hand this to the library that reads the data. Problem is FileInput objects do not have a '.read' attribute which the library seems to expect.
I need a FileObject with a .read that is like reading both files as one.
Any ideas existing workarounds? Yes, I know I can build my own little class or cat files together. Just wondering if there is some magic FileObject joiner I've never heard of.
If your library reads from a file with .read(), there isn't much point in some abstraction of merging multiple file-objects as one. it is quite trivial to read everything and throw it into StringIO.
if you just want to call readline() on the files, try this:
def cat(*args):
for arg in args:
with open(arg,'r') as f:
for line in f:
yield line
for line in cat('/tmp/x1','/tmp/x2'):
processLine(line)
Your file objects are already iterable via generators, so just use itertools to chain them into one big iterable.
import itertools
all_the_things = itertools.chain(HeaderFile, BodyFile)
for line in all_the_things:
# your code here

How to know when to manage resources in Python

I hope I framed the question right. I am trying to force myself to be a better programmer. By better I mean efficient. I want to write a program to identify the files in a directory and read each file for further processing. After some shuffling I got to this:
for file in os.listdir(dir):
y=open(dir+'\\'+file,'r').readlines()
for line in y:
pass
y.close()
It should be no surprise that I get an AttributeError since y is a list. I didn't think about that when I wrote the snippet.
I am thinking about this and am afraid that I have five open files (there are five files in the directory specified by dir.
I can fix the code so it runs and I explicitly close the files after opening them. I am curious if I need to or if Python handles closing the file in the next iteration of the loop. If so then I only need to write:
for file in os.listdir(dir):
y=open(dir+'\\'+file,'r').readlines()
for line in y:
pass
I am guessing that it(python) does handle this effortlessly. The reason I think that this might be handled is that I have changed the object/thing that y is referencing. When I start the second iteration there are no more memory references to the file that was opened and read using the readlines method.
Python will close open files when they get garbage-collected, so generally you can forget about it -- particularly when reading.
That said, if you want to close explicitely, you could do this:
for file in os.listdir(dir):
f = open(dir+'\\'+file,'r')
y = f.readlines()
for line in y:
pass
f.close()
However, we can immediately improve this, because in python you can iterate over file-like objects directly:
for file in os.listdir(dir):
y = open(dir+'\\'+file,'r')
for line in y:
pass
y.close()
Finally, in recent python, there is the 'with' statement:
for file in os.listdir(dir):
with open(dir+'\\'+file,'r') as y:
for line in y:
pass
When the with block ends, python will close the file for you and clean it up.
(you also might want to look into os.path for more pythonic tools for manipulating file names and directories)
Don't worry about it. Python's garbage collector is good, and I've never had a problem with not closing file-pointers (for read operations at least)
If you did want to explicitly close the file, just store the open() in one variable, then call readlines() on that, for example..
f = open("thefile.txt")
all_lines = f.readlines()
f.close()
Or, you can use the with statement, which was added in Python 2.5 as a from __future__ import, and "properly" added in Python 2.6:
from __future__ import with_statement # for python 2.5, not required for >2.6
with open("thefile.txt") as f:
print f.readlines()
# or
the_file = open("thefile.txt")
with the_file as f:
print f.readlines()
The file will automatically be closed at the end of the block.
..but, there are other more important things to worry about in the snippets you posted, mostly stylistic things.
Firstly, try to avoid manually constructing paths using string-concatenation. The os.path module contains lots of methods to do this, in a more reliable, cross-platform manner.
import os
y = open(os.path.join(dir, file), 'r')
Also, you are using two variable names, dir and file - both of which are built-in functions. Pylint is a good tool to spot things like this, in this case it would give the warning:
[W0622] Redefining built-in 'file'

Categories