Python: Read huge number of lines from stdin

I'm trying to read a huge number of lines from standard input with Python.
more hugefile.txt | python readstdin.py
The problem is that the program freezes as soon as I've read just a single line.
import sys
print sys.stdin.read(8)
exit(1)
This prints the first 8 bytes, but then I expect it to terminate and it never does. I think it's not really reading just the first bytes, but trying to read the whole file into memory.
Same problem with sys.stdin.readline().
What I really want to do is, of course, read all the lines, but with a buffer so I don't run out of memory.
I'm using Python 2.6.

This should work efficiently in a modern Python:
import sys
for line in sys.stdin:
    # do something with each line...
    print line,
You can then run the script like this:
python readstdin.py < hugefile.txt

Back in the day, you had to use xreadlines to get efficient line-at-a-time I/O on huge files; the docs now recommend using for line in file instead.
Of course, this helps only if you're actually working on the lines one at a time. If you're just reading big binary blobs to pass on to something else, then your other mechanism might be just as efficient.
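If the chunked binary case is what you have, a minimal sketch of reading stdin in fixed-size chunks could look like this (the 64 KB chunk size is just an illustrative choice, not anything from the original answer):
import sys

CHUNK_SIZE = 64 * 1024  # illustrative buffer size; tune to taste
while True:
    chunk = sys.stdin.read(CHUNK_SIZE)
    if not chunk:  # an empty read means end of input
        break
    sys.stdout.write(chunk)  # pass the blob on unchanged
This never holds more than one chunk in memory at a time, so memory use stays flat no matter how large the input is.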

Related

Python text file read/write optimization

I have been working on this file I/O and have made some progress reading through the site, and I am wondering what other ways this can be optimized. I am parsing a test infile of 10 GB / 30MM lines and writing to an outfile only the fields I need, which results in a clean file of roughly 1.4 GB. Initially it took 40 minutes to run this process, and I have reduced it to around 30 minutes. Does anyone have any other ideas to reduce this further in Python? Long term I will be looking to write this in C++; I just have to learn the language first. Thanks in advance.
# fdir, wbun, wsize and fsplit are defined earlier in the OP's script
with open(fdir + "input.txt", 'rb', 50*(1024*1024)) as r:
    w = open(fdir + "output0.txt", 'wb', 50*(1024*1024))
    for i, l in enumerate(r):
        if l[42:44] == '25':
            # turns the fixed-width line into a csv line of only a few cols
            wbun.append(','.join([
                l[7:15],
                l[26:35],
                l[44:52],
                l[53:57],
                format(int(l[76:89])/100.0, '.02f'),
                l[89:90],
                format(int(l[90:103])/100.0, '.02f'),
                l[193:201],
                l[271:278] + '\n'
            ]))
        # write about every 5MM lines
        if len(wbun) == wsize:
            w.writelines(wbun)
            wbun = []
            print "i_count:", i
        # splits about every 4GB
        if (i+1) % fsplit == 0:
            w.close()
            w = open(fdir + "output%d.txt" % (i/fsplit + 1), 'wb', 50*(1024*1024))
    w.writelines(wbun)
    w.close()
Try running it in PyPy (https://pypy.org); it will run without changes to your code, and probably faster.
Also, C++ might be overkill, especially if you don't know it yet. Consider learning Go or D instead.
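For example, assuming the script above is saved as parse.py (a hypothetical filename), the only change is the interpreter you invoke:
pypy parse.py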

Output every nth byte of stdin

What's the easiest efficient way to read from stdin and output every nth byte?
I'd like a command-line utility that works on OS X, and would prefer to avoid compiled languages.
This Python script is fairly slow (25s for a 3GB file when n=100000000):
#!/usr/bin/env python
import sys
n = int(sys.argv[1])
while True:
    chunk = sys.stdin.read(n)
    if not chunk:
        break
    sys.stdout.write(chunk[0])
Unfortunately we can't use sys.stdin.seek to avoid reading the entire file.
Edit: I'd like to optimize for the case when n is a significant fraction of the file size. For example, I often use this utility to sample 500 bytes at equally-spaced locations from a large file.
NOTE: The OP changed the example n from 100 to 100000000, which effectively renders my code slower than the original. Normally I would just delete my answer since it is no longer better than the original example, but my answer got a vote, so I will just leave it as it is.
The only way that I can think of to make it faster is to read everything at once and use a slice:
#!/usr/bin/env python
import sys
n = int(sys.argv[1])
data = sys.stdin.read()
print(data[::n])
Although trying to fit a 3GB file into RAM might be a very bad idea.
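If the data actually lives in a regular file rather than a true pipe (the question's edit suggests equally-spaced sampling is the goal), a seek-based sketch avoids reading the file at all between samples. This goes beyond the original answers; the script name sample.py and its arguments are hypothetical:
#!/usr/bin/env python
import sys

# usage: python sample.py <file> <n>  -- print one byte every n bytes
path = sys.argv[1]
n = int(sys.argv[2])
with open(path, 'rb') as f:
    f.seek(0, 2)            # jump to the end to learn the file size
    size = f.tell()
    offset = 0
    while offset < size:
        f.seek(offset)      # skip straight to the next sample position
        sys.stdout.write(f.read(1))
        offset += n
For 500 samples from a multi-gigabyte file this does 500 tiny reads instead of streaming the whole file through the process.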

Need to read from long running os command with Python 2.4

Firstly, I'm stuck with Python 2.4. This is a large enterprise environment and I'm unable to update to python 2.7 which would be my preference.
I need to read the output of some dtrace scripts that spit out data in intervals similar to iostat. (ie: iostat 5 100 # every 5 seconds, 100 count)
I'm playing around with Popen and Popen.communicate but it seems to slurp all the data at once and then print out in one large string.
I need to enter into a while loop and read the output 1 line at a time.
Can someone point me in the right direction for doing this?
Much thx.
import subprocess
p = subprocess.Popen("some_long_command",stdout=subprocess.PIPE)
for line in iter(p.stdout.readline, ""):
    print line
I think at least ...
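As a usage sketch with the iostat-style command from the question (the exact command and its arguments here are just an assumption to make the example concrete):
import subprocess

# run an interval-based command and handle each line as soon as it arrives
p = subprocess.Popen(["iostat", "5", "100"], stdout=subprocess.PIPE)
for line in iter(p.stdout.readline, ""):
    print line.rstrip()     # process one interval's worth of output here
p.stdout.close()
p.wait()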

How do I print the output onto a txt file: Mac

This is my first time asking a question. I am just starting to get into programming, so I am beginning with Python. I've basically got a random number generator inside of a while loop, and that's inside of my r() function. What I want to do is take all of the numbers (basically an infinite amount, until I shut down IDLE) and put them into a text file. Now I have looked for this on the web and have found solutions, but for a Windows computer. I have a Mac with Python 2.7. Any help is very much appreciated! My current code is below.
from random import randrange
def r():
    while True:
        print randrange(1,10)
The general idea is to open the file, write to it (as many times as you need to), and close it. This is explained in the tutorial under Reading and Writing Files.
The with statement (described toward the end of that section) is a great way to make sure the file always gets closed. (Otherwise, when you stopped your script with ^C, the file might end up missing the last few hundred bytes, and you'd have to use try/finally to handle that properly.)
The write method on files isn't quite as "friendly" as the print statement: it doesn't automatically convert things to strings, add a newline at the end, accept multiple comma-separated values, etc. So usually, you'll want to use string formatting to do that stuff for you.
For example:
from random import randrange

def r():
    with open('textfile.txt', 'w') as f:
        while True:
            f.write('{}\n'.format(randrange(1, 10)))
You'll need to call the function and then redirect the output to a file, or use the Python API to write to a file. Your whole script could be:
from random import randrange
def r():
    while True:
        print randrange(1,10)
r()
Then you can run python script_name.py > output.txt
If you'd like to use the Python API to write to a file, your script should be modified to something like the following:
from random import randrange
def r():
    with open('somefile.txt', 'w') as f:
        while True:
            f.write('{}\n'.format(randrange(1,10)))
r()
The with statement will take care of closing the file instance appropriately.

python limit must be integer

I'm trying to run the following code but for some reason I get the following error: "TypeError: limit must be an integer".
Reading csv data file
import sys
import csv

maxInt = sys.maxsize
decrement = True

while decrement:
    decrement = False
    try:
        csv.field_size_limit(maxInt)    # <-- this is the line that raises the error
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with open("Data.csv", 'rb') as textfile:
    text = csv.reader(textfile, delimiter=" ", quotechar='|')
    for line in text:
        print ' '.join(line)
The error occurs on the line marked above. I have only added the extra bit above the csv.reader call because the file was too large to read normally. Alternatively, I could change the file from csv to a text file, but I'm not sure whether this would corrupt the data further. I can't actually see any of the data because the file is >2GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter too much, 2KB or 2GB, whatever.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to potentially debug are to run print sys.maxsize, and to just open up a Python interpreter, import sys and csv, and then run csv.field_size_limit(sys.maxsize). If you are getting some terribly small number or an exception, you may have a bad install of Python. Otherwise, try a simpler version of your file: maybe just the first line, or the first several lines and just one column. See if you can reproduce the smallest possible case and remove the variability of your system and the file size.
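For instance, a quick interpreter session along those lines (the printed value is what a 64-bit build would typically show, as in the answer below):
>>> import sys, csv
>>> print sys.maxsize
9223372036854775807
>>> csv.field_size_limit(sys.maxsize)
If the last call raises TypeError or OverflowError instead of succeeding, you've reproduced the problem in isolation.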
On Windows 7 64-bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L, which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simply use csv.field_size_limit(2147483647), which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application, and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question
