I am reading a CSV file into a csv.DictReader and want to print its contents twice on the terminal, but it only prints once. Is the DictReader empty after the first pass?
dictreader = csv.DictReader(reader)

for k in dictreader:
    print(k)  # Prints all keys/values

for i in dictreader:
    print(i)  # Doesn't print anything
Yes. If you look at the source for DictReader you'll see it's an iterator (it implements __next__, and __iter__ returns self).
After going through it once it is exhausted; subsequent iterations simply won't produce anything. You could build a list from it if you need to iterate over the rows more than once.
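You can see the exhaustion for yourself with a minimal sketch (assuming a small, hypothetical data.csv with a header row):

import csv

with open("data.csv", newline="") as f:       # hypothetical file
    dictreader = csv.DictReader(f)
    print(iter(dictreader) is dictreader)     # True: the reader is its own iterator
    rows = list(dictreader)                   # the first full pass consumes it...
    print(len(rows))                          # number of data rows
    print(list(dictreader))                   # ...so a second pass yields [] -- it's exhausted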
All csv wrappers wrap file-like objects. File-like objects have state, specifically the seek position (and pipes can't seek at all), so the wrappers let the underlying object manage its own position and just parse whatever comes next.
Making iteration work twice in a row would mean the csv wrappers have to cache the file contents (consuming unbounded amounts of memory) or require them to seek back to the beginning of the underlying file (not possible for streaming file-like objects).
Thinking of the csv wrappers as semi-file-like makes this easier to grasp. You can't do for line in myfile: twice in a row without seeking, and similarly, you can't do for row in mycsv: twice in a row without seeking the underlying file-like object.
Assuming your reader is seekable, you could iterate it twice (without consuming unbounded memory) by doing:
dictreader = csv.DictReader(reader)

for k in dictreader:
    print(k)  # Prints all keys/values

reader.seek(0)   # Restart from the beginning of the underlying file
next(reader)     # Skip the header line; DictReader already cached the field names on the first pass

for i in dictreader:
    print(i)  # Prints all keys/values
Or if the files are known to be small, you could cache:
# Cache reusable values
dictlines = tuple(csv.DictReader(reader))

for k in dictlines:
    print(k)  # Prints all keys/values

for i in dictlines:
    print(i)  # Prints all keys/values
You could also use itertools.tee for the same purpose, but that only helps if all iterators are going to be advanced (somewhat) in tandem; if you're running one to completion before starting the next, it's usually faster to just cache to list or tuple.
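For completeness, here is what the tee route looks like (a sketch assuming a hypothetical data.csv; because the first loop runs to completion, tee ends up buffering every row anyway, which is why caching is usually the better choice here):

import csv
from itertools import tee

with open("data.csv", newline="") as f:          # hypothetical file
    first_pass, second_pass = tee(csv.DictReader(f), 2)
    for row in first_pass:
        print(row)
    for row in second_pass:                      # works, but tee has buffered all the rows
        print(row)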
Related
I've been learning Python and playing around with dictionaries, .csv files, and the csv module. It seems like the csv.DictReader() function can help turn .csv files into dictionary objects, but there's a bit of a quirk with the reader objects that I'm confused about.
I've read a little bit of the documentation (and then tried to find answers by looking up the csv.reader() function), but I'm still a little unsure.
Why does this code run as expected:
with open("cool_csv.csv") as cool_csv_file:
cool_csv_text = cool_csv_file.read()
print(cool_csv_text)
and yet the following code returns a ValueError: I/O operation on closed file.
with open("cool_csv.csv") as cool_csv_file:
cool_csv_dict = csv.DictReader(cool_csv_file)
for row in cool_csv_dict:
print(row["Cool Fact"])
Since we saved the DictReader object to a Python variable, shouldn't we be able to use that variable after we close the file, like I could if I had assigned cool_csv_file.read() to a variable?
I know the proper way to code this would be:
with open("cool_csv.csv") as cool_csv_file:
cool_csv_dict = csv.DictReader(cool_csv_file)
for row in cool_csv_dict:
print(row["Cool Fact"])
But why does the for row in cool_csv_dict: section have to be nested in the open() section?
My only guess would be that because the csv.DictReader() object is not quite an actual dictionary (or something like that), there's some shenanigans because it still needs to point somewhere (maybe that's the "reader" part?).
Can anyone shed any light?
csv.DictReader doesn't read the entire file into memory when you create the cool_csv_dict object. Each time you ask it for the next record from the CSV, it reads the next line from cool_csv_file. Therefore, it needs the file to be kept open so it can read from it as needed.
The argument to csv.DictReader can be any iterator that yields lines. So if you don't want to keep the file open, you can call readlines() to get all the lines into a list, then pass that list to csv.DictReader:
with open("cool_csv.csv") as cool_csv_file:
cool_csv_lines = cool_csv_file.readlines()
cool_csv_dict = csv.DictReader(cool_csv_lines)
for row in cool_csv_dict:
print(row("Cool Fact")
Good afternoon. I have a nested list of IPs and MACs, of arbitrary length:
A = [['10.0.0.1','00:4C:3S:**:**:**', 0], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
I want to check if this MAC is in the oui file:
E043DB (base 16) Shenzhen
2405f5 (base 16) Integrated
3CD92B (base 16) Hewlett Packard
...
If a MAC from the list is in the file, I want to write the manufacturer's name into the third list item. My attempt below only checks the first element; the remaining ones are never checked. How can I fix this?
f = open('oui.txt', 'r')

for values in A:
    for line in f.readlines():
        if values[1][0:8].replace(':','') in line:
            values[2] = line.split('(base 16)')[1].strip()

f.close()
print(A)
And I get this result:
A = [['10.0.0.1','00:4C:3S:**:**:**', 'Firm Name'], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
The Problem
Consider the "shape" of your code:
f = open('a file')
for values in [ 'some list' ]:
    for line in f.readlines():
Your two loops are doing this:
1. Start with the first value in the list
2. Read all lines remaining in file object f
3. Move to the next value in the list
4. Read all lines remaining in file object f
Except that the first time you tell it to "read all lines remaining", it does exactly that, leaving nothing behind.
So, unless you have some way to put more lines into f (which can happen with stream-like inputs such as stdin!), you get one "good" pass through the file; on every subsequent pass the file object points at the end of the file, so you get nothing.
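You can see this exhaustion directly (a small sketch, assuming the oui.txt file from the question):

f = open('oui.txt', 'r')
first = f.readlines()     # all of the lines
second = f.readlines()    # [] -- the position is already at the end of the file
f.close()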
A Solution
When you are dealing with a file, you want to only process it one time. File I/O is expensive compared to other operations. So you can choose to either (a) read the entire file into memory, and do whatever you want since it's not a file any more; or (b) scan through it only one time.
If you choose to scan through it only once, the easy solution is just to invert the two for loops. Instead of doing this:
for item in list:
    for line in file:

Do this instead:

for line in file:
    for item in list:
And presto! You are now only reading the file one time.
Other Considerations
If I look at your code, and your examples, it seems like you are trying for an exact match on a particular key. You trim down the MAC addresses in your list to check them against the manufacturer ids.
This suggests to me that you may well have many, many more list values (source MAC addresses) than you have manufacturers. So perhaps you should consider reading the contents of the file into memory, rather than processing it one line at a time.
Once you have the file in memory, consider building a proper dictionary. You have a key (MAC prefix) and a value (manufacturer). So build something like:
mac_to_mfg = {}
for line in f:
    mac = line.split('(base 16)')[0].strip()
    mfg = line.split('(base 16)')[1].strip()
    mac_to_mfg[mac] = mfg
Then you can make one pass through the source addresses and use the dict's O(1) lookup to your advantage:
for src in A:
    prefix = src[1][:8].replace(':', '')
    if prefix in mac_to_mfg:
        # etc...
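Putting the two pieces together, a sketch that assumes the list A from the question, that the OUI file lines look like the sample above, and that the manufacturer name should land in the third slot of each entry:

mac_to_mfg = {}
with open('oui.txt', 'r') as f:
    for line in f:
        if '(base 16)' not in line:
            continue                                   # skip lines that aren't prefix entries
        mac, mfg = (part.strip() for part in line.split('(base 16)', 1))
        mac_to_mfg[mac.upper()] = mfg                  # normalize case on both sides

for src in A:
    prefix = src[1][:8].replace(':', '').upper()
    if prefix in mac_to_mfg:
        src[2] = mac_to_mfg[prefix]

print(A)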
The problem is that you have the order of the loops reversed. Usually this isn't a big problem, but when working with objects that are consumed (like a file object), the contents won't be produced again once they have been iterated over.
You'll need to iterate over the lines first, and then within each line iterate through A to check the values:
with open('oui.txt', 'r') as f:
    for line in f.readlines():
        for values in A:
            if values[1][0:8].replace(':','') in line:
                values[2] = line.split('(base 16)')[1].strip()

print(A)
Notice I changed your file handling to use the with context manager instead: once your code exits the with block, it automatically close()s the file for you. This is recommended over opening the file manually, since you might forget to close() it afterwards.
The use case for this would be creating multiple generators based on some file-object without any of them trampling each other's read state.
Originally I (thought I) had a working implementation using seek() and tell(), where each generator was decorated by a meta-generator which maintained the file-handle position. This worked fine on things like StringIO, but failed on real files due to the read-ahead buffer mutilating the offset.
Using readline() or otherwise mocking the real file-object isn't viable as the reason for doing this was the excessively large files prompting a generator expression in the first place. So losing the read-ahead buffer isn't really a good option (as an aside, why was Python implemented this way in the first place? Shouldn't the buffer be like a cache and not actually exposed to the user? Proper encapsulation should have prevented this tell() issue in the first place...)
I then tried to use copy.copy, but that results in something like <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f722ffda810>, which appears unusable.
Does there exist an alternative way to copy? Is there a way to initialize a file-object? Or should I give up on this use case entirely because it is not possible in Python?
You are looking for itertools.tee.
from itertools import tee

with open("somefile.txt", "r") as fh:
    fh1, fh2, fh3 = tee(fh, 3)
Once you call tee, do not use the parent iterator again. The iterators returned from tee may be used freely and independently, however.
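For example (a small sketch along the same lines), each tee'd iterator advances on its own:

from itertools import tee

with open("somefile.txt", "r") as fh:
    fh1, fh2 = tee(fh, 2)
    print(next(fh1))     # first line of the file
    print(next(fh2))     # also the first line; fh2 advances independently of fh1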
For file objects specifically (to keep file-specific methods like read), you can just open a file multiple times; each file object will maintain its own file pointer as it reads the file.
fh1, fh2, fh3 = [open("somefile.txt") for i in range(3)]
or, if you already have a file object fh:
fh1, fh2, fh3 = [open(fh.name) for i in range(3)]
This doesn't preserve an already advanced file pointer, but it's easy enough to jump ahead:
for x in fh1, fh2, fh3:
    x.seek(fh.tell())
I'm wondering about the memory benefits of Python generators in this use case (if any). I wish to read in a large text file that must be shared between all objects. Because it only needs to be used once, and the program finishes once the list is exhausted, I was planning on using generators.
The "saved state" of a generator, I believe, lets it keep track of which value is next to be passed to whatever object is calling it. I've read that generators also save memory by not returning all their values at once, but rather calculating them on the fly. I'm a little confused about whether I'd get any benefit in this use case, though.
Example Code:
def bufferedFetch():
    while True:
        buffer = open("bigfile.txt", "r").read().split('\n')
        for i in buffer:
            yield i
Considering that the buffer is going to be reading in the entire "bigfile.txt" anyway, wouldn't this be stored within the generator, for no memory benefit? Is there a better way to return the next value of a list that can be shared between all objects?
Thanks.
In this case no. You are reading the entire file into memory by doing .read().
What you ideally want to do instead is:
def bufferedFetch():
    with open("bigfile.txt", "r") as f:
        for line in f:
            yield line
The Python file object takes care of line endings for you (system dependent), and its built-in iterator will yield lines one at a time as you iterate over it (without reading the entire file into memory).
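A caller would then consume it one line at a time, for example:

for line in bufferedFetch():
    print(line.rstrip('\n'))    # each line arrives as it is yielded, trailing newline included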
I have a function that reads lines from a file and processes them. However, I want to delete every line that I have read, but without using readlines(), which reads all of the lines at once and stores them in a list.
If the problem is that you run out of memory, then I suggest you use the for line in file syntax, as this will only load the lines one at a time:
bigFile = open('path/to/file.dat', 'r')

for line in bigFile:
    processLine(line)
If you can construct your program so that it processes the file line by line, it won't run out of memory trying to read the whole file: the program discards the copy it has made of each line's contents when it moves on to the next line.
Why does this work when readlines doesn't?
In Python there are iterators, which provide an interface to supply one item of a collection at a time, iterating over the whole collection if .next() is called repeatedly. Because you rarely need the whole collection at once, this can allow the program to work with a single item in memory instead, and thus allow large files to be processed.
By contrast, the readlines function has to return a whole list, rather than an iterator object, so it cannot delay the processing of later lines the way an iterator can. Since Python 2.3, the old xreadlines read iterator has been deprecated in favour of using for line in file, because the file object returned by open had been changed to be an iterator itself rather than returning a list.
This follows the functional paradigm called 'lazy evaluation', where you avoid doing any actual processing unless and until the result is needed.
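A small sketch of that iterator interface, using the same (hypothetical) file as above:

bigFile = open('path/to/file.dat', 'r')
line_iter = iter(bigFile)      # file objects are their own iterators, so this is bigFile itself
print(next(line_iter))         # reads and returns only the first line
print(next(line_iter))         # only now is the second line read
bigFile.close()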
More iterators
Iterators can be chained together (process the lines of this file, then that one), or otherwise combined using the excellent itertools module (included in Python). These are very powerful, and can allow you to separate out the way you combine files or inputs from the code that processes them.
First of all, deleting the first line of a file is a costly process. Actually, you are unlikely to be able to do it without rewriting most of the file.
You have multiple approaches that could solve your issue:
1. In Python, file objects are iterators over their lines; maybe you can use this to solve your memory issues:
document_count = 0
with open(filename) as handler:
    for line in handler:
        if line.strip() == '.':    # lines keep their trailing newline, so strip before comparing
            document_count += 1
2. Use an index. Reserve a certain part of your file for the index (fixed size; make sure to reserve enough space, say the first 100 KB of the file, which is about 100K entries), or even use a separate index file. Every time you add a document, put its starting position in the index. Once you know a document's position, just use the seek function to get there and start reading.
3. Read the file once and store every document position. This is very similar to the previous idea, except the index lives in memory, which is better performance-wise, but you will have to repeat the process every time you run the application (no persistence).
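A sketch of ideas 2 and 3 combined (assuming, as in the snippet above, that a line containing only '.' separates documents): record each document's offset with tell(), then jump back with seek() instead of re-reading the file from the top.

# Build an in-memory index of document start offsets (idea 3).
offsets = [0]                                # the first document starts at the top of the file
with open(filename) as handler:
    while True:
        line = handler.readline()            # readline() keeps tell() usable, unlike for-loop iteration
        if not line:
            break
        if line.strip() == '.':
            offsets.append(handler.tell())   # the next document starts right after the separator

# Later: jump straight to a recorded document (here the third one, hypothetically) without rescanning.
with open(filename) as handler:
    handler.seek(offsets[2])
    print(handler.readline())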