I want to read all the lines of a file into a list and then process that list two elements at a time until it is exhausted:
- Read the first 2 lines from the list, and do stuff
- Read the next 2 lines, and do stuff
- Repeat until the whole list has been processed
Contents of test.txt:
aa
bb
cc
dd
ee
ff
gg
hh
accounts_list = []
with open('test.txt', 'r') as f:
    accounts_list = [line.strip() for line in f]
for acc in accounts_list:
    # do stuff with 2
    # continue reading 2 more
    # do other stuff with the next 2
    # read until the whole list is finished
How can I do that? I can't get it to work.
I think iterating over the indices in steps of 2, with range(0, len(accounts_list), 2), instead of over the items of accounts_list should work:
accounts_list = []
with open('test.txt', 'r') as f:
    accounts_list = [line.strip() for line in f]
for i in range(0, len(accounts_list), 2):
    acc1 = accounts_list[i]
    acc2 = accounts_list[i + 1]  # raises IndexError if the list has an odd length
    # do stuff with 2
    # continue reading 2 more
    # do other stuff with the next 2
    # read until the whole list is finished
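If the file might have an odd number of lines, one option is to slice instead of indexing, since a slice past the end of a list is simply shorter rather than an error. A minimal sketch, with a hard-coded list standing in for the contents of test.txt (the sample data is an assumption):

```python
# Stand-in for the stripped lines of test.txt; odd length on purpose.
accounts_list = ["aa", "bb", "cc", "dd", "ee"]

for i in range(0, len(accounts_list), 2):
    pair = accounts_list[i:i + 2]  # slicing never raises IndexError
    print(pair)                    # the last pair has only one element here

# The same idea as a list of chunks, if you want to keep them.
pairs = [accounts_list[i:i + 2] for i in range(0, len(accounts_list), 2)]
```

With this approach the last chunk can hold a single item, so the "do stuff" code has to cope with that case.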
If you use an explicit iterator (instead of letting the for loop create its own), you can read two lines per iteration: one by the loop itself, the second in the body.
with open('test.txt', 'r') as f:
    itr = iter(f)
    for acc1 in itr:
        acc2 = next(itr)
        # Do stuff with acc1 and acc2
        # If the file has an odd number of lines,
        # you should wrap the assignment to acc2
        # in a try statement to catch the StopIteration it will raise
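As an alternative to the try statement mentioned in the comments, next() accepts a default value that is returned instead of raising StopIteration. A small sketch of that variant; the helper name pairs_with_default and the sample list are made up for illustration:

```python
def pairs_with_default(lines, missing=None):
    """Yield (first, second) pairs; pad the last pair with `missing`."""
    itr = iter(lines)
    for first in itr:
        # The two-argument form of next() returns the default
        # instead of raising StopIteration when itr is exhausted.
        second = next(itr, missing)
        yield first, second

result = list(pairs_with_default(["aa", "bb", "cc"]))
print(result)
```

Pick a default that cannot be confused with real data (None works for text lines).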
I don't really like the asymmetry of reading the two lines in two different ways, so I would use an explicit while loop and use next to get both lines.
with open('test.txt', 'r') as f:
    itr = iter(f)
    while True:
        try:
            acc1 = next(itr)
            acc2 = next(itr)
        except StopIteration:
            break
        # Do stuff with acc1 and acc2
Also, be sure to check the recipe section of the itertools documentation for the grouper recipe.
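For reference, here is a sketch along the lines of the grouper recipe as it appeared in older versions of the itertools documentation (the version in the current docs may differ in detail, so treat this as an approximation rather than the official text). The sample list is an assumption.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    # n references to the SAME iterator, so each tuple takes n consecutive items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

lines = ["aa", "bb", "cc", "dd", "ee"]
pairs = list(grouper(lines, 2))
print(pairs)
```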
To store the file in a nested list, "2 by 2 elements":
code:
with open('test.txt', 'r') as f:
    accounts_list = [[acc0.strip(), acc1.strip()] for acc0, acc1 in zip(f, f)]
output:
[['aa', 'bb'], ['cc', 'dd'], ['ee', 'ff'], ['gg', 'hh']]
You can then iterate over this list to work with each pair.
Here's one reasonably idiomatic way of doing it. The key is the use of zip(it, it) (passing the same iterator twice), which causes zip to create a generator of tuples consisting of pairs of values from it. (A more general form of this idiom is documented under "tips and tricks" in the Python documentation for zip):
with open('test.txt', 'r') as f:
    accounts_iter = (line.strip() for line in f)
    for first, second in zip(accounts_iter, accounts_iter):
        # Just an example
        print(f" First is '{first}'")
        print(f"Second is '{second}'")
That snippet does not create a list; it simply iterates over the input. Unless you need to keep the data for later use, there is no point in creating the list; it's simply a waste of memory (which could be a lot of memory if the file is large.) However, if you really wanted a list, for some reason, you could do the following:
with open('test.txt', 'r') as f:
    accounts_iter = (line.strip() for line in f)
    accounts = [*zip(accounts_iter, accounts_iter)]
That will create a list of tuples, each tuple containing two consecutive lines.
Two other notes:
This only works with iterators (which includes generators); not with iterable containers. You need to turn an iterable container into an iterator using the iter built-in function (as is done in the example in the Python docs).
zip stops when the shortest iterator finishes. So if your file has an odd number of lines, the above will ignore the last line. If you would prefer to see an error if that happens, you can use the strict=True keyword argument to zip (available since Python 3.10; again, as shown in the documentation). If, on the other hand, you'd prefer to have the loop run anyway, even with some missing lines, you could use itertools.zip_longest instead of zip.
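Both notes can be seen on a small in-memory example. This is only an illustration; the list below stands in for the stripped lines of a file with an odd number of lines:

```python
from itertools import zip_longest

lines = ["aa", "bb", "cc", "dd", "ee"]

# A list is an iterable container, not an iterator: each zip argument
# gets its own fresh iterator, so every item is paired with itself.
same = list(zip(lines, lines))

# Passing the same iterator twice pairs consecutive items;
# the unpaired last item ("ee") is silently dropped.
it = iter(lines)
paired = list(zip(it, it))

# zip_longest keeps the last item and pads the missing partner.
it = iter(lines)
padded = list(zip_longest(it, it, fillvalue=None))

print(same[:2], paired, padded, sep="\n")
```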
Context: I have a file with ~44 million rows. Each is an individual with US address, so there's a "ZIP Code" field. File is txt, pipe-delimited.
Due to size, I cannot (at least on my machine) use Pandas to analyze. So a basic question I have is: How many records (rows) are there for each distinct ZIP code? I took the following steps, but I wonder if there's a faster, more Pythonic way to do this (seems like there is, I just don't know).
Step 1: Create a set of ZIP values from the file:
output = set()
with open(filename) as f:
    for line in f:
        output.add(line.split('|')[8])  # 9th item in the split string is "ZIP" value
zip_list = list(output)  # List is length of 45,292
Step 2: Created a "0" list, same length as first list:
zero_zip = [0]*len(zip_list)
Step 3: Created a dictionary (with all zeroes) from those two lists:
zip_dict = dict(zip(zip_list, zero_zip))
Step 4: Lastly, I ran through the file again, this time updating the dict I just created:
with open(filename) as f:
    next(f)  # skip first line, which contains headers
    for line in f:
        zip_dict[line.split('|')[8]] += 1
I got the end result, but I'm wondering if there's a simpler way. Thanks all.
Creating the zip_dict can be replaced with a defaultdict. If you can run through every line in the file, you don't need to do it twice, you can just keep a running count.
from collections import defaultdict
d = defaultdict(int)
with open(filename) as f:
    for line in f:
        parts = line.split('|')
        d[parts[8]] += 1
This is simple using the built-in Counter class.
from collections import Counter
with open(filename) as f:
    c = Counter(line.split('|')[8] for line in f)
print(c)
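Two details are easy to miss with the one-liner: the header row gets counted as if it were a ZIP code, and if the ZIP field happens to be the last column its value will include the trailing newline. A hedged sketch that handles both; the function name count_zip_codes, its parameters, and the sample rows are all invented for illustration, with io.StringIO standing in for the real file:

```python
import io
from collections import Counter

def count_zip_codes(lines, column=8, has_header=True):
    """Count values in one pipe-delimited column, streaming line by line."""
    it = iter(lines)
    if has_header:
        next(it, None)  # skip the header row (no error on an empty file)
    return Counter(line.rstrip("\n").split("|")[column] for line in it)

# Tiny in-memory stand-in for the 44-million-row file.
sample = io.StringIO(
    "a|b|c|d|e|f|g|h|ZIP\n"
    "1|2|3|4|5|6|7|8|10001\n"
    "1|2|3|4|5|6|7|8|94105\n"
    "1|2|3|4|5|6|7|8|10001\n"
)
counts = count_zip_codes(sample)
print(counts.most_common(1))  # most frequent ZIP code(s) first
```

With a real file you would pass the open file object instead of the StringIO.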
I'm trying to start reading some file from line 3, but I can't.
I've tried to use readlines() plus the index number of the line, as seen below:
x = 2
f = open('urls.txt', "r+").readlines( )[x]
line = next(f)
print(line)
but I get this result:
Traceback (most recent call last):
  File "test.py", line 441, in <module>
    line = next(f)
TypeError: 'str' object is not an iterator
I would like to be able to set any line, as a variable, and from there, all the time that I use next() it goes to the next line.
IMPORTANT: as this is a new feature and all my code already uses next(f), the solution needs to be able to work with it.
Try this (uses itertools.islice):
from itertools import islice
f = open('urls.txt', 'r+')
start_at = 3
file_iterator = islice(f, start_at - 1, None)
# to demonstrate
while True:
    try:
        print(next(file_iterator), end='')
    except StopIteration:
        print('End of file!')
        break
f.close()
urls.txt:
1
2
3
4
5
Output:
3
4
5
End of file!
This solution is better than readlines because it doesn't load the entire file into memory; lines are read only as they are needed. islice still has to read past the skipped lines, but it does so internally, so no hand-written loop is needed as in @MadPhysicist's answer.
Also, consider using the with syntax to guarantee the file gets closed:
with open('urls.txt', 'r+') as f:
    # do whatever
The readlines method returns a list of strings for the lines. So when you take readlines()[2] you're getting the third line, as a string. Calling next on that string then makes no sense, so you get an error.
The easiest way to do this is to slice the list: readlines()[x:] gives a list of everything from line x onwards. Then you can use that list however you like.
If you have your heart set on an iterator, you can turn a list (or pretty much anything) into an iterator with the iter builtin function. Then you can next it to your heart's content.
The following code will allow you to use an iterator to print the first line:
In [1]: path = '<path to text file>'
In [2]: f = open(path, "r+")
In [3]: line = next(f)
In [4]: print(line)
This code will allow you to print the lines starting from index x (that is, from line x + 1):
In [1]: path = '<path to text file>'
In [2]: x = 2
In [3]: f = iter(open(path, "r+").readlines()[x:])
In [4]: line = next(f)
In [5]: print(line)
Edit: Edited the solution based on @Tomothy32's observation.
The expression you wrote evaluates to a string:
open('urls.txt', "r+").readlines()[x]
open returns a file object. Its readlines method returns a list of strings. Indexing with [x] returns a single line as a string: the third line in the file when x = 2.
The first problem is that you open the file without closing it. The second is that your index doesn't specify a range of lines until the end. Here's an incremental improvement:
with open('urls.txt', 'r+') as f:
    lines = f.readlines()[x:]
Now lines is a list of all the lines you want. But you first read the whole file into memory, then discarded the first two lines. Also, a list is an iterable, not an iterator, so to use next on it effectively, you'd need to take an extra step:
lines = iter(lines)
If you want to harness the fact that the file is already a rather efficient iterator, apply next to it as many times as you need to discard unwanted lines:
with open('urls.txt', 'r+') as f:
    for _ in range(x):
        next(f)
    # now use the file
    print(next(f))
After the for loop, any read operation you do on the file will start from the third line, whether it be next(f), f.readline(), etc.
There are a few other ways to strip the first lines. In all cases, including the example above, next(f) can be replaced with f.readline():
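for n, _ in enumerate(f, start=1):
    if n == x:  # stop after consuming exactly x lines (assumes x >= 1)
        break
or
# range(x) must come first: zip pulls from its arguments in order,
# so zip(f, range(x)) would consume one extra line from f
for _ in zip(range(x), f): pass
After you run either of these loops, next(f) will return the line at index x (the third line when x = 2).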
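Off-by-one mistakes are easy to make when skipping lines, so it can help to check any skipping helper against a tiny in-memory file. The sketch below uses the "consume" idiom from the itertools recipes; the helper name skip_lines and the sample contents are assumptions for illustration:

```python
import io
from itertools import islice

def skip_lines(iterator, n):
    """Advance `iterator` by exactly n items (fewer if it runs out)."""
    # An empty slice starting at position n forces islice to read
    # and discard the first n items, then stop.
    next(islice(iterator, n, n), None)

f = io.StringIO("url1\nurl2\nurl3\nurl4\nurl5\n")  # stand-in for urls.txt
skip_lines(f, 2)
first = next(f)       # existing code that calls next(f) keeps working
print(first, end="")
```

Because the helper advances the file object itself, later calls to next(f) continue from the line after the skipped ones.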
Just call next(f) as many times as you need to. (There's no need to overcomplicate this with itertools, nor to slurp the entire file with readlines.)
lines_to_skip = 3
with open('urls.txt') as f:
    for _ in range(lines_to_skip):
        next(f)
    for line in f:
        print(line.strip())
Output:
% cat urls.txt
url1
url2
url3
url4
url5
% python3 test.py
url4
url5
I have a medium-size file (25MB, 1000000 rows), and I want to read every row except every third row.
FIRST QUESTION: Is it faster to load the whole file into memory and then read the rows (method .read()), or to load and read one row at a time (method .readline())?
Since I'm not an experienced coder, I tried the second option with the islice function from the itertools module.
import itertools
with open(input_file) as inp:
    inp_atomtype = itertools.islice(inp, 0, 40, 3)
    inp_atomdata = itertools.islice(inp, 1, 40, 3)
    for atomtype, atomdata in itertools.zip_longest(inp_atomtype, inp_atomdata):
        print(atomtype + atomdata)
Although looping through a single generator (inp_atomtype or inp_atomdata) prints the correct data, looping through both of them simultaneously (as in this code) prints the wrong data.
SECOND QUESTION: How can I reach desired rows using generators?
You don't need to slice the iterator, a simple line counter should be enough:
with open(input_file) as f:
    current_line = 0
    for line in f:
        current_line += 1
        if current_line % 3:  # ignore every third line
            print(line)  # NOTE: print() will add an additional new line by default
As for turning it into a generator, just yield the line instead of printing.
When it comes to speed, you'll be reading every line anyway, so the I/O part will probably take about the same time. You might still gain a bit in total processing time from fast list slicing instead of counting lines, provided you have enough working memory to hold the file contents and loading the whole file upfront instead of streaming is acceptable.
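A minimal sketch of the slicing idea, assuming the whole file fits in memory (for a real file you would pass f.readlines()). The helper name drop_every_third is made up, and a list of numbers stands in for the lines:

```python
def drop_every_third(lines):
    """Return a copy of `lines` without items 3, 6, 9, ... (1-based)."""
    kept = list(lines)
    del kept[2::3]  # extended-slice deletion: indices 2, 5, 8, ...
    return kept

result = drop_every_third([1, 2, 3, 4, 5, 6, 7])
print(result)
```

Whether this is actually faster than the counting loop depends on the file and the machine, so it is worth timing both on real data.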
yield is perfect for this.
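This function yields pairs from an iterable and skips every third item:
def two_thirds(seq):
    _iter = iter(seq)
    while True:
        try:
            pair = (next(_iter), next(_iter))
        except StopIteration:
            return  # since Python 3.7, a StopIteration leaking out of a generator becomes a RuntimeError
        yield pair
        next(_iter, None)  # skip the third item, if there is one
You will lose incomplete pairs, which means that two_thirds(range(1)) will stop iterating immediately.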
https://repl.it/repls/DullNecessaryCron
You can also use the grouper recipe from itertools doc and ignore the third item in each tuple generated:
for atomtype, atomdata, _ in grouper(lines, 3):
    pass
FIRST QUESTION: I am pretty sure that .readline() is faster than .read(). Plus, based on my tests, the fastest way is to loop like this:
with open(file, 'r') as f:
    for line in f:
        ...
SECOND QUESTION: I am not quite sure about this. You may consider using yield.
Here is a code snippet you can refer to:
def myreadlines(f, newline):
    buf = ""
    while True:
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # the end of file
            yield buf
            break
        buf += chunk

with open("input.txt") as f:
    for line in myreadlines(f, "{|}"):
        print(line)
Q2: here's my generator:
def yield_from_file(input_file):
    with open(input_file) as file:
        yield from file

def read_two_skip_one(gen):
    while True:
        try:
            val1 = next(gen)
            val2 = next(gen)
            yield val1, val2
            _ = next(gen)
        except StopIteration:
            break

if __name__ == '__main__':
    for atomtype, atomdata in read_two_skip_one(yield_from_file('sample.txt')):
        print(atomtype + atomdata)
sample.txt was generated with a bash shell (it's just lines counting to 100):
for i in {001..100}; do echo $i; done > sample.txt
Regarding Q1: if you're reading the file multiple times, you'd be better off having it in memory. Otherwise you're fine reading it line by line.
Regarding the problem you're having with the wrong results:
Both islice objects, itertools.islice(inp, 0, 40, 3) and itertools.islice(inp, 1, 40, 3), use the same inp as their underlying iterator. Both will call next(inp) to provide you with a value.
Each time you call next() on an iterator, it will change its state, so that's where your problems come from.
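The interference is easy to reproduce without a file. In this sketch range(9) stands in for nine lines of input; the variable names are invented for illustration:

```python
from itertools import islice

# Two islice objects sharing ONE underlying iterator: every item that one
# of them reads or skips is gone for the other.
data = iter(range(9))
a = islice(data, 0, None, 3)
b = islice(data, 1, None, 3)
mixed = list(zip(a, b))

# One way around it: walk a single iterator three items at a time
# and simply ignore the third item of each group.
it = iter(range(9))
wanted = [(x, y) for x, y, _ in zip(it, it, it)]

print(mixed)   # not the pairs you were after
print(wanted)
```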
You can use a generator expression:
with open(input_file, 'r') as f:
    generator = (line for e, line in enumerate(f, start=1) if e % 3)
enumerate adds line numbers to each line, and the if clause ignores line numbers divisible by 3 (default numbering starts at 0, so you have to specify start=1 to get the desired pattern).
Keep in mind that you can only use the generator while the file is still open.
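One way to sidestep the "file must still be open" issue is to open the file inside a generator function, so it stays open for as long as the generator is being consumed. A sketch under that assumption; skip_every_third is a hypothetical name, and the temporary file only exists to make the example self-contained:

```python
import os
import tempfile

def skip_every_third(path):
    """Yield lines from `path` without their newline, skipping every third line."""
    with open(path) as f:
        for number, line in enumerate(f, start=1):
            if number % 3:  # truthy unless number is divisible by 3
                yield line.rstrip("\n")

# Demo on a throwaway file holding the numbers 1..7, one per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(n) for n in range(1, 8)) + "\n")
kept = list(skip_every_third(tmp.name))
os.remove(tmp.name)
print(kept)
```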
There are several dictionaries in the variable highscores. I need to sort it by its key values, and sorted() isn't working.
global highscores
f = open('RPS.txt', 'r')
highscores = [line.strip() for line in f]
sorted(highscores)
highscores = reverse=True[:5]
for line in f:
    x = line.strip()
    print(x)
f.close()
This is the error:
TypeError: 'bool' object is not subscriptable
sorted(elements) returns a new list containing the elements in sorted order; it does not modify the original in place, so you have to use or assign its result. You can use the result directly in a for loop to process the elements one at a time:
for k in sorted(elements): ...
You can transform each element and store the result in a list:
v = [f(k) for k in sorted(elements)]
Or you can just capture the sorted elements in a list.
v = sorted(elements)
Note that in the code above, elements are strings from a file, not a dictionary.
The following should do what (I think) you want:
with open('RPS.txt', 'r') as f:  # will automatically close f
    highscores = [line.strip() for line in f]
highscores = sorted(highscores, reverse=True)[:5]
for line in highscores:
    print(line)
The primary problem was the way you were using sorted(). The code above sorts the lines read from the file and then takes the first 5 items of the result, which are saved in highscores, and then prints them. It does this rather than trying to iterate through the lines of the file again at the end, which won't work because the file has already been read to the end (a file is not a list and can't be iterated over repeatedly without rewinding). There's no need to strip the lines again; that was taken care of when the file was first read.
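One caveat: the lines are strings, so sorted() compares them as text, and "9" sorts after "100". The format of RPS.txt isn't shown, so the following is only a sketch under the assumption that each line holds a single integer score; the sample list is made up:

```python
# Assumed file contents: one integer score per line, already stripped.
lines = ["9", "100", "25", "7", "42", "3"]

# Text comparison: characters are compared left to right.
top_as_text = sorted(lines, reverse=True)[:5]

# Numeric comparison: key=int converts each line only for the comparison,
# the resulting list still contains the original strings.
top_as_numbers = sorted(lines, key=int, reverse=True)[:5]

print(top_as_text)
print(top_as_numbers)
```

If each line has more structure (a name and a score, say), the key function would need to pull out the score first.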
I am looking for a way in Python to read multiple lines from a file (10 lines at a time). I have already looked into readlines(sizehint) and tried passing 10, but it doesn't read only 10 lines; it actually reads to the end of the file (I tried it on a small file). Each line is 11 bytes long, and each read should fetch 10 lines. If fewer than 10 lines are left, it should return only those lines. My actual file contains more than 150K lines.
Any idea how I can achieve this?
You're looking for itertools.islice():
from itertools import islice
with open('data.txt') as f:
    lines = []
    while True:
        line = list(islice(f, 10))  # islice returns an iterator, so you convert it to a list here
        if line:
            # do something with the current set of <=10 lines here
            lines.append(line)  # maybe store it
        else:
            break
    print(lines)
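The same islice loop can be wrapped in a generator so the calling code is a plain for loop. A sketch only: the name read_in_batches is invented, and io.StringIO stands in for the real file:

```python
import io
from itertools import islice

def read_in_batches(f, size=10):
    """Yield lists of up to `size` lines until the file is exhausted."""
    while True:
        batch = list(islice(f, size))
        if not batch:  # an empty list means there is nothing left to read
            return
        yield batch

# 25 short lines, so the last batch holds only 5 of them.
fake_file = io.StringIO("".join(f"line {n}\n" for n in range(25)))
sizes = [len(batch) for batch in read_in_batches(fake_file)]
print(sizes)
```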
This should do it
def read10Lines(fp):
    answer = []
    for i in range(10):
        answer.append(fp.readline())  # readline() returns '' once the end of the file is reached
    return answer
Or, the list comprehension:
ten_lines = [fp.readline() for _ in range(10)]
In both cases, fp = open('path/to/file')
Another solution, which gets rid of the silly infinite loop in favor of a more familiar for loop, relies on itertools.zip_longest (izip_longest in Python 2) and a small trick with iterators. The trick is that zip(*[iter(iterator)]*n) breaks iterator up into chunks of size n. Since a file is already a generator-like iterator (as opposed to being sequence-like), we can write:
from itertools import zip_longest
with open('data.txt') as f:
    for ten_lines in zip_longest(*[f]*10, fillvalue=None):
        if ten_lines[-1] is None:
            ten_lines = [line for line in ten_lines if line is not None]  # drop the None padding at the end
        process(ten_lines)
from itertools import groupby, count
with open("data.txt") as f:
    groups = groupby(f, key=lambda x, c=count(): next(c) // 10)
    for k, v in groups:
        bunch_of_lines = list(v)
        print(bunch_of_lines)