I have two files: a header and the body. I am using a library to read the whole thing. I can use "fileinput.input" to create one FileInput object and hand this to the library that reads the data. Problem is FileInput objects do not have a '.read' attribute which the library seems to expect.
I need a FileObject with a .read that is like reading both files as one.
Any ideas or existing workarounds? Yes, I know I can build my own little class or cat the files together. I'm just wondering if there is some magic FileObject joiner I've never heard of.
If your library reads from a file with .read(), there isn't much point in an abstraction that merges multiple file objects into one. It is trivial to read everything and put it into a StringIO.
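A minimal sketch of the StringIO approach, assuming both files are text and small enough to fit in memory. The name `joined_file` is a hypothetical helper, not something a library provides:

```python
import io


def joined_file(*paths):
    """Read each file in order and return one in-memory file object."""
    parts = []
    for path in paths:
        with open(path, "r") as f:
            parts.append(f.read())
    # StringIO supports .read(), .readline() and iteration like a text file.
    return io.StringIO("".join(parts))
```

The returned object can be handed to any library that only calls .read() on what it is given.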
If you just want to call readline() on the files, try this:
def cat(*args):
    for arg in args:
        with open(arg, 'r') as f:
            for line in f:
                yield line

for line in cat('/tmp/x1', '/tmp/x2'):
    processLine(line)
Your file objects are already iterable via generators, so just use itertools to chain them into one big iterable.
import itertools
all_the_things = itertools.chain(HeaderFile, BodyFile)
for line in all_the_things:
    # your code here
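If the library insists on calling .read() rather than iterating, one possible workaround is a small raw stream that drains each file in turn. This is only a sketch: `ChainedReader` is a hypothetical class name, and it assumes the underlying files are opened in binary mode so that they provide readinto().

```python
import io


class ChainedReader(io.RawIOBase):
    """Hypothetical helper: expose several binary file objects as one stream."""

    def __init__(self, *files):
        super().__init__()
        self._files = list(files)

    def readable(self):
        return True

    def readinto(self, buffer):
        if len(buffer) == 0:
            return 0  # nothing was requested, so do not advance
        # Fill the buffer from the current file; move on when it is exhausted.
        while self._files:
            count = self._files[0].readinto(buffer)
            if count:
                return count
            self._files.pop(0).close()
        return 0  # every file is exhausted: end of stream
```

Wrapping it as `io.BufferedReader(ChainedReader(open(header_path, "rb"), open(body_path, "rb")))` adds readline() and buffering, and `io.TextIOWrapper` on top of that gives text; `header_path` and `body_path` are placeholders for your two files.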
I have a conceptual question. I am new to Python and I am looking to do a task that involves processing large log files. Some of these can get up to 5 or 6 GB.
I need to parse through many files in a location. These are text files.
I know of the with open() method, and recently ran into pathlib. I need not only to read each file line by line to extract values to upload into a DB, but also to get the file properties that pathlib gives you and upload them as well.
Is it faster to use with open and underneath it, call a path object from which to read files... something like this:
for filename in glob('**/*.*', recursive=False):
    fpath = Path(filename)
    with open(filename, 'rb', buffering=102400) as logfile:
        for line in logfile:
            #regex operation
            print(line)
Or would it be better to use Pathlib:
with Path("src/module.py") as f:
    contents = open(f, "r")
    for line in contents:
        #regex operation
        print(line)
Also, I've never used pathlib to open files for reading. When it comes to this: Path.open(mode='r', buffering=-1, encoding=None, errors=None, newline=None)
What do newline and errors mean? I assume buffering here is the same as buffering in the built-in open function?
I also saw this contraption that uses with open in conjunction with a Path object, though I have no idea how it works:
path = Path('.editorconfig')
with open(path, mode='wt') as config:
    config.write('# config goes here')
pathlib is intended to be a more elegant solution to interacting with the file system, but it's not necessary. It'll add a small amount of fixed overhead (since it wraps other lower level APIs), but shouldn't change how performance scales in any meaningful way.
Since, as noted, pathlib is largely a wrapper around lower level APIs, you should know Path.open is implemented in terms of open, and the arguments all mean the same thing for both; reading the docs for the built-in open will describe the arguments.
As for the last bit of your question (passing a Path object to the built-in open), that works because most file-related APIs were updated to support any object that implements the os.PathLike ABC.
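To make this concrete, here is a rough sketch that reads each log line by line through Path.open while also collecting file properties from Path.stat(). The function name `summarize_logs`, the "*.log" pattern and the UTF-8 encoding are assumptions for illustration only:

```python
from pathlib import Path


def summarize_logs(root, pattern="*.log"):
    """Yield (path, size_in_bytes, modified_time, line_count) for each match."""
    for path in sorted(Path(root).glob(pattern)):
        info = path.stat()  # file properties: size, timestamps, mode, ...
        line_count = 0
        # errors="replace" substitutes undecodable bytes instead of raising;
        # newline=None (the default) translates \r\n and \r to \n on read.
        with path.open("r", encoding="utf-8", errors="replace") as logfile:
            for line in logfile:  # one line at a time, never the whole file
                line_count += 1
        yield path, info.st_size, info.st_mtime, line_count
```

Because the file is iterated rather than read in one call, memory use stays flat regardless of file size; the regex work would go inside the inner loop.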
I'm trying to clean up my code a little bit, and I have trouble figuring which of these 2 ways is considered the most pythonic one
import os
dir = os.path.dirname(__file__)
str1 = 'filename.txt'
f = open(os.path.join(dir,str1),'r')
Although the second seems to be the cleanest one, I find the declaration of fullPath a bit too much, since it will only be used once.
import os
dir = os.path.dirname(__file__)
str1 = 'filename.txt'
fullPath = os.path.join(dir,str1)
f = open(fullPath,'r')
In general, is it better to avoid calling functions inside another call, even if it adds a line of code?
with open('file path', 'r') as f:
    data = f.read()
    # do something with data
or
f = open(os.path.join(dir,str1),'r')
f.close()
file = open('newfile.txt', 'r')
for line in file:
    print(line)
OR
lines = [line for line in open('filename')]
If the file is huge, read() is definitely a bad idea, because without a size argument it loads the whole file into memory.
For a huge file this will cause latency!
So I don't recommend read() or readlines().
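When line-by-line iteration does not fit (binary data, for example), read() can still be used safely by passing a size. A small sketch, where `iter_chunks` and the 64 KiB default are arbitrary choices made for illustration:

```python
from functools import partial


def iter_chunks(path, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from a file."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size) until b"".
        for chunk in iter(partial(f.read, chunk_size), b""):
            yield chunk
```

Only one chunk is held in memory at a time, so the file size no longer matters.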
There are many ways to open files in Python, which goes to show that there isn't really one Pythonic way of doing it. It all boils down to which method you find most convenient, especially with regard to what you're actually trying to do with the file once it's open.
Most users use the IDLE GUI "click" to open files because it allows them to view the current file and also make some alterations if there's a need for such.
Others might just rely on the command lines to perform the task, at the cost of not being able to do anything other than opening the file.
Using Command Lines:
% python myfile.py
Note that for this to work you need to make sure the system is "looking" in the directory where your file is stored. Using 'cd' is useful for finding your route there.
% python
>>> import myfile
>>> myfile.title
This method is known as the object.attribute method of opening files. This method is useful when the file you're opening has an operation that you would like to implement.
There are more ways than what's been stated above, be sure to consult the pyDocs for further details.
with open(file, 'rb') as readerfile:
    reader = csv.reader(readerfile)
In the above syntax, can I perform the first and second line together? It seems unnecessary to use 2 variables ('readerfile' and 'reader' above) if I only need to use the latter.
Is the former variable ('readerfile') ever used?
Can I use the same variable name for both, or is that bad form?
You can do:
reader = csv.reader(open(file, 'rb'))
but that would mean you are not closing your file explicitly.
with open(file, 'rb') as readerfile:
The first line opens the file and stores the file object in readerfile. The with statement ensures that the file is closed when you exit the block by any means, including exceptions.
reader = csv.reader(readerfile)
The second line creates a CSV reader object using the file object. It needs the file object (otherwise where would it read the data from?). Of course you could conceivably store it in the same variable
readerfile = csv.reader(readerfile)
if you wanted to (and don't plan on using the file object again), but this will likely lead to confusion for readers of your code.
Note that you haven't read anything yet! You still need to iterate over the reader object in order to get the data that you're interested in, and if you close the file before that happens then the reader object won't work. The file object is used behind the scenes by the reader object, even if you "hide" it by overwriting the readerfile variable.
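The laziness is easy to demonstrate. In this sketch an io.StringIO stands in for a real file (an assumption made only to keep the example self-contained), and it is closed before the reader is consumed:

```python
import csv
import io

# io.StringIO stands in for a real file here so the sketch is self-contained.
source = io.StringIO("a,b\n1,2\n")
reader = csv.reader(source)  # nothing has been read yet
source.close()               # same effect as leaving the with block

try:
    rows = list(reader)      # the reader now tries to pull lines from source
except ValueError as exc:
    rows = None
    error = str(exc)         # the file object reports that it is closed
```

Consuming the reader inside the with block (for example with list(reader)) avoids the problem.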
Lastly, if you really want to do everything on one line, you could conceivably define a function that abstracts the with statement:
def with1(context, func):
    with context as x:
        return func(x)
Now you can write this as one line:
data = with1(open(file, 'rb'), lambda readerfile: list(csv.reader(readerfile)))
It's by no means clearer, however.
This is not recommended at all
Why is it important to use one line?
Most Python programmers know well the benefits of using the with statement. Keep in mind that readers may be lazy (that is, read line by line) in some cases. You want to handle the file with the correct statement, ensuring it is closed correctly even if errors arise.
Nevertheless, you can use a one liner for this, as stated in other answers:
reader = csv.reader(open(file, 'rb'))
So basically you want a one-liner?
reader = csv.reader(open(file, 'rb'))
As said before, the problem with that is that with open() lets you do the following steps in one go:
Open the file
Do what you want with the file (inside your open block)
Close the file (that is implicit and you don't have to specify it)
If you call open directly instead of using with open, your file stays open until the object is garbage collected, and that could lead to unpredictable behaviour in some cases.
Plus, your original code (two lines) is much more readable than a one-liner.
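The open/use/close steps can be checked directly. This sketch creates a throwaway temporary file (purely for illustration) and raises inside the with block to show that the file is still closed afterwards:

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    with open(path, "r") as f:
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

closed_after_error = f.closed  # True: the with block closed the file
os.remove(path)
```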
If you put them together, then the file won't be closed automatically -- but that often doesn't really matter, since it will be closed automatically when the script terminates.
It's not common to need to reference the raw file once a csv.reader instance has been created from it (except possibly to explicitly close it if you're not using a with statement).
If you use the same variable name for both, it will probably work because the csv.reader instance will still hold a reference to the file object, so it won't be garbage collected until the program ends. It's not a common idiom, however.
Since csv files are often processed sequentially, the following can be a fairly concise way to do it, since the csv.reader instance frequently doesn't need to be given a variable name, and it will close the file properly even if an exception occurs:
with open(file, 'rb') as readerfile:
    for row in csv.reader(readerfile):
        pass  # process the data...
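The snippets in this thread use 'rb', which is what Python 2's csv module expects. Assuming Python 3 instead, the csv documentation asks for a text-mode file opened with newline=''. A sketch of the same pattern, with `read_rows` as a hypothetical helper name:

```python
import csv


def read_rows(path):
    """Return all rows of a CSV file as lists of strings (Python 3)."""
    # Python 3's csv module wants a text-mode file opened with newline="".
    with open(path, "r", newline="") as readerfile:
        # list() consumes the lazy reader before the with block closes the file.
        return list(csv.reader(readerfile))
```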
I want to copy the contents of a JSON file in another JSON file, with Python
Any ideas ?
Thank you :)
Given the lack of research effort, I normally wouldn't answer, but given the poor suggestions in comments, I'll bite and give a better option.
Now, this largely depends on what you mean, do you wish to overwrite the contents of one file with another, or insert? The latter can be done like so:
import json
with open("from.json", "r") as src, open("to.json", "r") as dst:
    to_insert = json.load(src)
    destination = json.load(dst)
destination.append(to_insert) #The exact nature of this line varies. See below.
with open("to.json", "w") as dst:
    json.dump(destination, dst)
This uses python's json module, which allows us to do this very easily.
We open the two files for reading, then open the destination file again in writing mode to truncate it and write to it.
The marked line depends on the JSON data structure. Here I am appending to the root list element (which might not exist), but you may want to place it at a particular dict key, or some such.
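For the dict-key case, a sketch along the same lines might look like this. The function name `insert_under_key` and its parameters are hypothetical, and it assumes the destination file holds a JSON object at its root:

```python
import json


def insert_under_key(src_path, dst_path, key):
    """Load src_path and store its data under `key` in dst_path's root object."""
    with open(src_path, "r") as src, open(dst_path, "r") as dst:
        to_insert = json.load(src)
        destination = json.load(dst)
    if not isinstance(destination, dict):
        raise TypeError(f"expected a JSON object at the root of {dst_path}")
    destination[key] = to_insert
    # Reopen in write mode only after both loads succeeded, so a parse
    # error cannot leave the destination truncated.
    with open(dst_path, "w") as dst:
        json.dump(destination, dst)
```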
In the case of replacing the contents, it becomes easier:
with open("from.json", "r") as src, open("to.json", "w") as dst:
    dst.write(src.read())
Here we literally just read the data out of one file and write it into the other file.
Of course, you may wish to check the data is JSON, in which case, you can use the JSON methods as in the first solution, which will throw exceptions on invalid data.
Another, arguably better, solution to this could also be shutil's copy methods, which would avoid actually reading or writing the file contents manually.
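A sketch of the shutil route, wrapped in a hypothetical `replace_json` helper only so that it is easy to call and test:

```python
import shutil


def replace_json(src_path, dst_path):
    """Overwrite dst_path with a byte-for-byte copy of src_path."""
    # copyfile copies contents only; shutil.copy2 also tries to keep metadata.
    return shutil.copyfile(src_path, dst_path)
```

Note that this does no validation, so an invalid JSON source is copied as-is.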
Using the with statement gives us the benefit of automatically closing our files - even if exceptions occur. It's best to always use them where we can.
Note that in versions of Python before 2.7, multiple context managers are not handled by the with statement, instead you will need to nest them:
with open("from.json", "r") as src:
    with open("to.json", "r+") as dst:
        ...
I hope I framed the question right. I am trying to force myself to be a better programmer. By better I mean efficient. I want to write a program to identify the files in a directory and read each file for further processing. After some shuffling I got to this:
for file in os.listdir(dir):
    y=open(dir+'\\'+file,'r').readlines()
    for line in y:
        pass
    y.close()
It should be no surprise that I get an AttributeError since y is a list. I didn't think about that when I wrote the snippet.
I am thinking about this and am afraid that I have five open files (there are five files in the directory specified by dir).
I can fix the code so it runs and I explicitly close the files after opening them. I am curious if I need to or if Python handles closing the file in the next iteration of the loop. If so then I only need to write:
for file in os.listdir(dir):
    y=open(dir+'\\'+file,'r').readlines()
    for line in y:
        pass
I am guessing that it (Python) does handle this effortlessly. The reason I think this might be handled is that I have changed the object that y is referencing. When I start the second iteration there are no more references in memory to the file that was opened and read using the readlines method.
Python will close open files when they get garbage-collected, so generally you can forget about it -- particularly when reading.
That said, if you want to close explicitly, you could do this:
for file in os.listdir(dir):
    f = open(dir+'\\'+file,'r')
    y = f.readlines()
    for line in y:
        pass
    f.close()
However, we can immediately improve this, because in python you can iterate over file-like objects directly:
for file in os.listdir(dir):
    y = open(dir+'\\'+file,'r')
    for line in y:
        pass
    y.close()
Finally, in recent python, there is the 'with' statement:
for file in os.listdir(dir):
    with open(dir+'\\'+file,'r') as y:
        for line in y:
            pass
When the with block ends, python will close the file for you and clean it up.
(you also might want to look into os.path for more pythonic tools for manipulating file names and directories)
Don't worry about it. Python's garbage collector is good, and I've never had a problem with not closing file-pointers (for read operations at least)
If you did want to explicitly close the file, just store the open() in one variable, then call readlines() on that, for example..
f = open("thefile.txt")
all_lines = f.readlines()
f.close()
Or, you can use the with statement, which was added in Python 2.5 as a from __future__ import, and "properly" added in Python 2.6:
from __future__ import with_statement # for python 2.5, not required for >= 2.6
with open("thefile.txt") as f:
    print(f.readlines())
# or
the_file = open("thefile.txt")
with the_file as f:
    print(f.readlines())
The file will automatically be closed at the end of the block.
..but, there are other more important things to worry about in the snippets you posted, mostly stylistic things.
Firstly, try to avoid manually constructing paths using string-concatenation. The os.path module contains lots of methods to do this, in a more reliable, cross-platform manner.
import os
y = open(os.path.join(dir, file), 'r')
Also, you are using two variable names, dir and file - both of which are built-in functions. Pylint is a good tool to spot things like this, in this case it would give the warning:
[W0622] Redefining built-in 'file'
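Putting these suggestions together, the original loop could be sketched as follows. The function name `count_lines_per_file` and the line-counting body are placeholders for whatever processing is actually needed:

```python
import os


def count_lines_per_file(directory):
    """Return {filename: line_count} for the regular files in directory."""
    counts = {}
    for name in os.listdir(directory):  # 'name', not 'file'
        path = os.path.join(directory, name)  # no manual '\\' concatenation
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, "r") as handle:  # closed at the end of each iteration
            counts[name] = sum(1 for line in handle)
    return counts
```

At most one file is open at any point, and none of the names shadow built-ins.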