Multiline file read in Python

I am looking for a method in Python which can read multiple lines from a file (10 lines at a time). I have already looked into readlines(sizehint); I tried passing a value of 10, but it doesn't read only 10 lines. It actually reads to the end of the file (I tried it on a small file). Each line is 11 bytes long, and each read should fetch me 10 lines; if fewer than 10 lines remain, it should return only those lines. My actual file contains more than 150K lines.
Any idea how I can achieve this?

You're looking for itertools.islice():
from itertools import islice

with open('data.txt') as f:
    lines = []
    while True:
        chunk = list(islice(f, 10))  # islice returns an iterator, so convert it to a list here
        if chunk:
            # do something with the current set of <= 10 lines here
            lines.append(chunk)  # e.g. store it
        else:
            break
print(lines)

This should do it:
def read10Lines(fp):
    answer = []
    for i in range(10):
        line = fp.readline()
        if not line:  # end of file reached; return only the lines found
            break
        answer.append(line)
    return answer
Or, as a list comprehension (note that this one will include empty strings once the end of the file is reached):
ten_lines = [fp.readline() for _ in range(10)]
In both cases, fp = open('path/to/file').
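A minimal sketch of how the helper might be driven over a whole file, assuming a file named data.txt and a hypothetical process() callback:

def process(chunk):
    print(chunk)  # placeholder for whatever you do with up to 10 lines

with open('data.txt') as fp:
    while True:
        chunk = read10Lines(fp)
        if not chunk:  # read10Lines returns an empty list at end of file
            break
        process(chunk)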

Another solution, which can get rid of the silly infinite loop in favor of a more familiar for loop, relies on itertools.izip_longest and a small trick with iterators. The trick is that zip(*[iter(iterator)]*n) breaks the iterator up into chunks of size n. Since a file object is already an iterator (as opposed to a sequence), we can write:
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest

with open('data.txt') as f:
    for ten_lines in izip_longest(*[f]*10, fillvalue=None):
        if ten_lines[-1] is None:
            ten_lines = filter(None, ten_lines)  # filter removes the None padding at the end
        process(ten_lines)

from itertools import groupby, count

with open("data.txt") as f:
    # The key function shares one counter across calls: integer-dividing the running
    # line index by 10 gives the same key for ten consecutive lines, so groupby
    # yields the file in chunks of 10.
    groups = groupby(f, key=lambda x, c=count(): next(c) // 10)
    for k, v in groups:
        bunch_of_lines = list(v)
        print(bunch_of_lines)

Related

Read lines 2 by 2 in Python

I want to read all lines from a file and store them in a list, then work through that list two elements at a time until it is exhausted:
- Read the first 2 lines from the list, and do stuff
- Continue reading the next 2 lines, and do stuff
- Do so until the whole list is finished
Inside test.txt
aa
bb
cc
dd
ee
ff
gg
hh
accounts_list = []
with open('test.txt', 'r') as f:
    accounts_list = [line.strip() for line in f]

for acc in accounts_list:
    # do stuff with 2
    # continue reading 2 more
    # do other stuff with the next 2
    # read until the whole list is finished
How can I do that? I can't get it working.
Iterating over the indices in steps of 2, with range(0, len(accounts_list), 2), instead of over the items of accounts_list directly, should work:
accounts_list = []
with open('test.txt', 'r') as f:
    accounts_list = [line.strip() for line in f]

for i in range(0, len(accounts_list), 2):
    acc1 = accounts_list[i]
    acc2 = accounts_list[i + 1]  # note: raises IndexError if the list has an odd number of lines
    # do stuff with the 2 lines,
    # then the next iteration picks up the following 2,
    # until the whole list is finished
If you use an explicit iterator (instead of letting the for loop create its own), you can read two lines per iteration: one by the loop itself, the second in the body.
with open('test.txt', 'r') as f:
    itr = iter(f)
    for acc1 in itr:
        acc2 = next(itr)
        # Do stuff with acc1 and acc2.
        # If the file has an odd number of lines, wrap the assignment to acc2
        # in a try statement to catch the StopIteration it will raise.
I don't really like the asymmetry of reading the two lines in two different ways, so I would use an explicit while loop and use next to get both lines.
with open('test.txt', 'r') as f:
    itr = iter(f)
    while True:
        try:
            acc1 = next(itr)
            acc2 = next(itr)
        except StopIteration:
            break
        # Do stuff with acc1 and acc2
Also, be sure to check the recipe section of the itertools documentation for the grouper recipe.
To store the file in a nested list, two elements at a time:
code:
with open('test.txt', 'r') as f:
    accounts_list = [[acc0.strip(), acc1.strip()] for acc0, acc1 in zip(f, f)]
output:
[['aa', 'bb'], ['cc', 'dd'], ['ee', 'ff'], ['gg', 'hh']]
Then you can iterate over this list to work with each pair.
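For illustration (not from the original answer), looping over that nested list might look like this:

for acc0, acc1 in accounts_list:
    print(acc0, acc1)  # 'aa bb', then 'cc dd', and so on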
Here's one reasonably idiomatic way of doing it. The key is the use of zip(it, it) (passing the same iterator twice), which causes zip to create a generator of tuples consisting of pairs of values from it. (A more general form of this idiom is documented under "tips and tricks" in the Python documentation for zip):
with open('test.txt', 'r') as f:
    accounts_iter = (line.strip() for line in f)
    for first, second in zip(accounts_iter, accounts_iter):
        # Just an example
        print(f" First is '{first}'")
        print(f"Second is '{second}'")
That snippet does not create a list; it simply iterates over the input. Unless you need to keep the data for later use, there is no point in creating the list; it's simply a waste of memory (which could be a lot of memory if the file is large). However, if you really wanted a list for some reason, you could do the following:
with open('test.txt', 'r') as f:
    accounts_iter = (line.strip() for line in f)
    accounts = [*zip(accounts_iter, accounts_iter)]
That will create a list of tuples, each tuple containing two consecutive lines.
Two other notes:
This only works with iterators (which include generators), not with iterable containers such as lists. You need to turn an iterable container into an iterator using the iter built-in function (as is done in the example in the Python docs).
zip stops when the shortest iterator finishes. So if your file has an odd number of lines, the above will ignore the last line. If you would prefer to see an error if that happens, you can use the strict=True keyword argument to zip (again, as shown in the documentation). If, on the other hand, you'd prefer to have the loop run anyway, even with some missing lines, you could use itertools.zip_longest instead of zip.
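A small sketch, not part of the original answer, illustrating both notes with a plain list:

from itertools import zip_longest

accounts = ['aa', 'bb', 'cc', 'dd', 'ee']  # odd number of items
it = iter(accounts)                        # a container must first be turned into an iterator

for first, second in zip_longest(it, it, fillvalue=None):
    print(first, second)                   # the last pair comes out as ('ee', None)

# With plain zip(it, it) the trailing 'ee' would be silently dropped, and
# zip(it, it, strict=True) (Python 3.10+) would raise ValueError instead.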

A simpler way to create a dictionary with counts from a 43 million row text file?

Context: I have a file with ~44 million rows. Each is an individual with US address, so there's a "ZIP Code" field. File is txt, pipe-delimited.
Due to size, I cannot (at least on my machine) use Pandas to analyze. So a basic question I have is: How many records (rows) are there for each distinct ZIP code? I took the following steps, but I wonder if there's a faster, more Pythonic way to do this (seems like there is, I just don't know).
Step 1: Create a set for ZIP values from file:
output = set()
with open(filename) as f:
    for line in f:
        output.add(line.split('|')[8])  # 9th item in the split string is the "ZIP" value

zip_list = list(output)  # the list has length 45,292
Step 2: Created a "0" list, same length as first list:
zero_zip = [0]*len(zip_list)
Step 3: Created a dictionary (with all zeroes) from those two lists:
zip_dict = dict(zip(zip_list, zero_zip))
Step 4: Lastly I ran through the file again, this time updating the dict I just created:
with open(filename) as f:
    next(f)  # skip first line, which contains headers
    for line in f:
        zip_dict[line.split('|')[8]] += 1
I got the end result but wondering if there's a simpler way. Thanks all.
Creating zip_dict by hand can be replaced with a collections.defaultdict. And since you have to run through every line in the file anyway, you don't need to do it twice; you can just keep a running count.
from collections import defaultdict

d = defaultdict(int)
with open(filename) as f:
    for line in f:
        parts = line.split('|')
        d[parts[8]] += 1
This is simple using the built-in Counter class.
from collections import Counter

with open(filename) as f:
    c = Counter(line.split('|')[8] for line in f)
print(c)
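As a follow-up (not part of the original answer), Counter also makes it easy to pull out the busiest ZIP codes:

print(c.most_common(10))  # the 10 ZIP codes with the highest record counts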

Splitting a large list of strings and creating a list of the results

I have a large list of strings. Each string has a number of segments separated by a ";":
'1,2,23,17,-1006,0.20;1,3,3,2258,-1308,0.72;'
I want to split each string by the ";" and save the resulting list.
I am currently using:
player_parts = []
for line in playerinf:
    parts = line.split(";")
    player_parts = player_parts + parts
Is there a faster way to do this?
If I understand you correctly, you can try itertools.chain and unpacking a list comprehension:
from itertools import chain
lines = ['1,2,23,17,-1006,0.20;1,3,3,2258,-1308,0.72;', '2,3,34,56,-2134,0.50;2,4,7,2125,-3408,0.56;']
parts = list(chain(*[line.split(';')[:-1] for line in lines]))
parts
# ['1,2,23,17,-1006,0.20',
# '1,3,3,2258,-1308,0.72',
# '2,3,34,56,-2134,0.50',
# '2,4,7,2125,-3408,0.56']
I added a [:-1] to drop the last empty element of the split(';'). If however you need that empty element, just remove [:-1].
Since chain runs in compiled code, it should be much faster than doing the work in the Python interpreter.
The run times for 10,000 lines are:
using chain: 0.34399986267089844s
using your method: > 240.234s # (I didn't want to wait any more)
Every time you do player_parts = player_parts + parts, you combine the two lists into a brand-new list and rebind player_parts to it. That's very inefficient. Doing player_parts.extend(parts) would greatly improve performance, since it appends the contents to the end of the existing player_parts list, as shown in the sketch below.
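A minimal sketch of that one-line change, keeping the rest of the original loop as-is:

player_parts = []
for line in playerinf:
    parts = line.split(";")
    player_parts.extend(parts)  # appends in place instead of building a new list each time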
However, it looks like you may be adding some empty strings to the player_parts list. So let's see if there's a better way.
It sounds like you have a file like this:
1,2,23,17,-1006,0.20;1,3,3,2258,-1308,0.72;
1,2,23,17,-1006,0.20;1,3,3,2258,-1308,0.72
1,2,23,17,-1006,0.20;1,3,3,2258,-1308,0.72;
And you want this result:
['1,2,23,17,-1006,0.20', '1,3,3,2258,-1308,0.72', '1,2,23,17,-1006,0.20',
'1,3,3,2258,-1308,0.72', '1,2,23,17,-1006,0.20', '1,3,3,2258,-1308,0.72']
So this should work:
f = open('infile', 'r')
player_parts = []
for line in f:                            # for each line in the file
    for segment in line.split(';'):       # for each segment in the line
        if segment.strip():               # if the segment holds anything besides whitespace
            player_parts.append(segment)  # add it to the end of the list
If you're comfortable with comprehensions, you can do this:
f = open('infile', 'r')
player_parts = []
for line in f:
    player_parts.extend(segment for segment in line.split(';') if segment.strip())
As far as I know, list comprehensions are usually a good approach when speed matters. A nested comprehension flattens the segments into a single list while skipping the empty trailing pieces:
player_parts = [part for line in playerinf for part in line.split(';') if part.strip()]

How to loop through two generators of the same opened file

I have a medium-size file (25MB, 1000000 rows), and I want to read every row except every third row.
FIRST QUESTION: Is it faster to load the whole file into memory and then read the rows (the .read() method), or to load and read one row at a time (the .readline() method)?
Since I'm not an experienced coder, I tried the second option with the islice method from the itertools module.
import itertools

with open(input_file) as inp:
    inp_atomtype = itertools.islice(inp, 0, 40, 3)
    inp_atomdata = itertools.islice(inp, 1, 40, 3)

    for atomtype, atomdata in itertools.zip_longest(inp_atomtype, inp_atomdata):
        print(atomtype + atomdata)
Although looping through a single generator (inp_atomtype or inp_atomdata) prints correct data, looping through both of them simultaneously (as in this code) prints wrong data.
SECOND QUESTION: How can I reach desired rows using generators?
You don't need to slice the iterator; a simple line counter should be enough:
with open(input_file) as f:
    current_line = 0
    for line in f:
        current_line += 1
        if current_line % 3:  # ignore every third line
            print(line)  # NOTE: print() will add an additional newline by default
As for turning it into a generator, just yield the line instead of printing.
As for speed: given that you'll be reading the lines anyway, the I/O part will probably take the same time either way, but you might shave a bit off the total processing time by loading the whole file up front and using fast list slicing instead of counting lines, provided you have enough working memory to hold the file contents and loading everything instead of streaming is acceptable (see the sketch below).
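A rough sketch of that whole-file variant (an illustration, not code from the answer); deleting with an extended slice drops every third line in place:

with open(input_file) as f:
    lines = f.readlines()  # the whole file must fit in memory

del lines[2::3]  # removes lines 3, 6, 9, ... in a single slicing operation

for line in lines:
    print(line)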
yield is perfect for this.
This function yields pairs from an iterable and skips every third item:
def two_thirds(seq):
    _iter = iter(seq)
    while True:
        try:
            yield (next(_iter), next(_iter))
            next(_iter)  # skip every third item
        except StopIteration:
            # on Python 3.7+ a StopIteration leaking out of a generator becomes RuntimeError (PEP 479)
            return
Incomplete pairs are dropped: if the sequence ends in the middle of a pair, that half pair is lost, so two_thirds(range(1)) stops iterating immediately without yielding anything.
https://repl.it/repls/DullNecessaryCron
You can also use the grouper recipe from the itertools documentation and ignore the third item in each generated tuple (a version of the recipe is sketched after the snippet):
for atomtype, atomdata, _ in grouper(lines, 3):
    pass  # do something with atomtype and atomdata; the third line of each group is ignored
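For reference, the classic grouper recipe from the itertools documentation looks roughly like this (recent versions of the docs phrase it a little differently, so treat this as a sketch):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)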
FIRST QUESTION: I am pretty sure that .readline() is faster than .read(). Plus, based on my tests, the fastest way is to loop directly over the file:
with open(file, 'r') as f:
    for line in f:
        ...
SECOND QUESTION: I am not quite sure about this. You may consider using yield.
Here is a code snippet you may refer to:
def myreadlines(f, newline):
    buf = ""
    while True:
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # the end of file
            yield buf
            break
        buf += chunk

with open("input.txt") as f:
    for line in myreadlines(f, "{|}"):
        print(line)
Regarding Q2, here's my generator:
def yield_from_file(input_file):
    with open(input_file) as file:
        yield from file

def read_two_skip_one(gen):
    while True:
        try:
            val1 = next(gen)
            val2 = next(gen)
            yield val1, val2
            _ = next(gen)
        except StopIteration:
            break

if __name__ == '__main__':
    for atomtype, atomdata in read_two_skip_one(yield_from_file('sample.txt')):
        print(atomtype + atomdata)
sample.txt was generated with a bash shell (it's just lines counting to 100)
for i in {001..100}; do echo $i; done > sample.txt
Regarding Q1: if you're reading the file multiple times, you'd be better off having it in memory; otherwise you're fine reading it line by line.
Regarding the problem you're having with the wrong results:
both itertools.islice(inp, 0, 40, 3) statements use inp, the same file object, as their underlying iterator, and both call next(inp) to produce their values.
Each time you call next() on an iterator it advances its state, so the two slices interleave their reads; that's where your problems come from (see the small demonstration below).
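A tiny illustration of that shared state, using a plain list instead of a file:

it = iter([1, 2, 3, 4])
a = it
b = it           # a and b are the very same iterator object
print(next(a))   # 1
print(next(b))   # 2 -- b sees the state change made through a, not a fresh 1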
You can use a generator expression:
with open(input_file, 'r') as f:
    generator = (line for e, line in enumerate(f, start=1) if e % 3)
enumerate adds line numbers to each line, and the if clause ignores line numbers divisible by 3 (default numbering starts at 0, so you have to specify start=1 to get the desired pattern).
Keep in mind that you can only use the generator while the file is still open.
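For instance (an illustration, not part of the original answer), consume it before the with block ends:

with open(input_file, 'r') as f:
    generator = (line for e, line in enumerate(f, start=1) if e % 3)
    for line in generator:  # the file is still open here
        print(line)
# once the with block exits the file is closed, and advancing the generator would raise ValueError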

Preferred way of counting lines, characters and words from a file as a whole in Python

I have found 2 ways of counting the lines of a file, as seen below. (Note: I need to read the file as a whole and not line by line.)
I'm trying to get a feel for which approach is better in terms of efficiency and/or good coding style.
names = {}
for each_file in glob.glob('*.cpp'):
    with open(each_file) as f:
        names[each_file] = sum(1 for line in f if line.strip())
(as seen here)
data = open('test.cpp', 'r').read()
print(len(data.splitlines()), len(data.split()), len(data))
(as seen here)
And on the same topic, regarding counting the number of characters and the number of words in a file: is there a better way than the one suggested above?
Use a generator expression for memory efficiency (this approach will avoid reading the whole file into memory). Here's a demonstration.
def count(filename, what):
    strategy = {'lines': lambda x: bool(x.strip()),
                'words': lambda x: len(x.split()),
                'chars': len}
    strat = strategy[what]
    with open(filename) as f:
        return sum(strat(line) for line in f)
input.txt:
this is
a test file
i just typed
output:
>>> count('input.txt', 'lines')
3
>>> count('input.txt', 'words')
8
>>> count('input.txt', 'chars')
33
Note that when counting characters, the newline characters are counted as well. Also note that this uses a pretty crude definition of "word" (you did not provide one), it just splits a line by whitespace and counts the elements of the returned list.
Create a few test files and test them in a big loop to see the average times.
Make sure the test files fit your scenarios.
I used this code:
# Python 2 timing harness (print statements and time.clock)
import glob
import time

times1 = []
for i in range(0, 1000):
    names = {}
    t0 = time.clock()
    with open("lines.txt") as f:
        names["lines.txt"] = sum(1 for line in f if line.strip())
    print names
    times1.append(time.clock() - t0)

times2 = []
for i in range(0, 1000):
    names = {}
    t0 = time.clock()
    data = open("lines.txt", 'r').read()
    print("lines.txt", len(data.splitlines()), len(data.split()), len(data))
    times2.append(time.clock() - t0)

print sum(times1) / len(times1)
print sum(times2) / len(times2)
and came out with the average timings:
0.0104755582104 and
0.0180650466201 seconds
This was on a text file with 23000 lines. E.g:
print("lines.txt",len(data.splitlines()), len(data.split()), len(data))
outputs: ('lines.txt', 23056, 161392, 1095160)
Test this on your actual file set to get more accurate timing data.
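Note that the harness above is Python 2 code (time.clock was removed in Python 3.8). A roughly equivalent sketch for Python 3, using time.perf_counter, might look like this:

import time

times1 = []
for _ in range(1000):
    t0 = time.perf_counter()
    with open("lines.txt") as f:
        count = sum(1 for line in f if line.strip())
    times1.append(time.perf_counter() - t0)

print(sum(times1) / len(times1))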
