As an example, I have two scripts, say script1.py:
f = open("output1.txt", "w")
count = 1
for i in range(100):
    f.write(str(count) + "\n")
    print(str(count))
    count += 1
f.close()
This script prints numbers from 1 to 100 to a file and to standard output.
Then I have a second script, say script2.py:
import sys
import time

stdin = sys.stdin
f1 = open("output2.txt", "w")
for line in stdin:
    if len(line) > 0:
        print(line.strip())
        time.sleep(0.05)
        f1.write(line.strip() + "\n")
which reads data from standard input, prints it, and writes it to a file. I added a time.sleep call to make sure the second script consumes data at a far lower rate than the first one produces it.
I run the scripts from the command line as
python3 script1.py | python3 script2.py
thereby redirecting the standard output of the first script (i.e. the print() calls) to the standard input of the second one.
It works roughly as expected: two files are generated, each containing the numbers from 1 to 100.
I am nevertheless wondering how the data transfer from the first script to the second works.
The first script generates data at a faster rate. Where is this data stored while it waits for the second script to read it?
Is there some sort of buffer in place between the two processes? Or something else?
Is Python responsible for this, or the OS?
Is the buffer limited in size? Can it be programmed (e.g. accessed to direct data to another target as well)?
Thanks a bunch
It is because of the pipe "|"; more info here: https://ss64.com/nt/syntax-redirection.html
commandA | commandB Pipe the output from commandA into commandB
so the prints from your script1 are sent to your script2.
My guess at how it works: each print goes into a buffer that the OS maintains for the pipe (so the OS, not Python, is responsible for it); sys.stdin in the second script reads from that same buffer, and when it fills up the first script's writes block until the reader has drained some of it. That's why sys.stdin just works here.
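If you want to inspect that buffer, here is a minimal sketch, assuming Linux and a recent Python (the pipe-size constants were added to fcntl around 3.10); the sizes shown are illustrative, not from the original question:

# Query and change the kernel pipe capacity (Linux-only).
import fcntl
import os

r, w = os.pipe()
print("default pipe capacity:", fcntl.fcntl(w, fcntl.F_GETPIPE_SZ))  # typically 65536 bytes

# The capacity can be raised, up to the limit in /proc/sys/fs/pipe-max-size.
fcntl.fcntl(w, fcntl.F_SETPIPE_SZ, 1 << 20)
print("new pipe capacity:", fcntl.fcntl(w, fcntl.F_GETPIPE_SZ))

os.close(r)
os.close(w)

The same mechanism backs the anonymous pipe the shell creates for script1.py | script2.py, so once that capacity is reached the writer simply waits for the reader.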
I'm writing a script that reads all lines from multiple files, reads a number at the beginning of each block, and puts that number in front of every line of the block until the next number, and so on. Afterwards it writes all the read lines into a single .csv file.
The files I am reading look like this:
13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25
And the output file should look like this:
13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
Currently my script is this:
from asyncio import Semaphore, ensure_future, gather, run
import time

limit = 8

async def read(file_list):
    tasks = list()
    result = None
    sem = Semaphore(limit)
    for file in file_list:
        task = ensure_future(read_bounded(file, sem))
        tasks.append(task)
    result = await gather(*tasks)
    return result

async def read_bounded(file, sem):
    async with sem:
        return await read_one(file)

async def read_one(filename):
    result = list()
    with open(filename) as file:
        dataList = []
        content = file.read().split(":")
        file.close()
    j = 1
    filmid = content[0]
    append = result.append
    while j < len(content):
        for entry in content[j].split("\n"):
            if len(entry) > 10:
                append("%s%s%s%s" % (filmid, ",", entry, "\n"))
            else:
                if len(entry) > 0:
                    filmid = entry
        j += 1
    return result

if __name__ == '__main__':
    start = time.time()
    write_append = "w"
    files = ['combined_data_1.txt', 'combined_data_2.txt', 'combined_data_3.txt', 'combined_data_4.txt']
    res = run(read(files))
    with open("output.csv", write_append) as outputFile:
        for result in res:
            outputFile.write(''.join(result))
            outputFile.flush()
        outputFile.close()
    end = time.time()
    print(end - start)
It has a runtime of about 135 seconds (the 4 input files are each about 500 MB, and the output file is 2.3 GB). Running the script takes about 10 GB of RAM; I think this might be a problem.
I think most of the time is spent creating the list of all lines.
I would like to reduce the runtime of this program, but I am new to Python and not sure how to do this. Can you give me some advice?
Thanks
Edit:
I measured the times for the following commands in cmd (I only have Windows installed on my computer, so I used what I hope are the equivalent cmd commands):
sequential writing to NUL
timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt > NUL"
combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt
command took 0:1:25.87 (85.87s total)
sequential writing to file
timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt > test.csv"
combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt
command took 0:2:42.93 (162.93s total)
parallel
timecmd "type combined_data_1.txt > NUL & type combined_data_2.txt > NUL & type combined_data_3.txt >NUL & type combined_data_4.txt > NUL"
command took 0:1:25.51 (85.51s total)
In this case you're not gaining anything by using asyncio for two reasons:
asyncio is single-threaded and doesn't parallelize processing (and, in Python, neither can threads)
the IO calls access the file system, which asyncio doesn't cover - it is primarily about network IO
The giveaway that you're not using asyncio correctly is the fact that your read_one coroutine doesn't contain a single await. That means that it never suspends execution, and that it will run to completion before ever yielding to another coroutine. Making it an ordinary function (and dropping asyncio altogether) would have the exact same result.
Here is a rewritten version of the script with the following changes:
byte IO throughout, for efficiency
iterates through the file rather than loading all at once
sequential code
import sys

def process(in_filename, outfile):
    with open(in_filename, 'rb') as r:
        for line in r:
            # a line like b"13368:\n" starts a new block; remember it as the prefix
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            # data line: emit "<prefix>,<original line>"
            outfile.write(b'%s,%s' % (prefix, line))

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    with open(out_file, 'wb') as out:
        for fn in in_files:
            process(fn, out)

if __name__ == '__main__':
    main()
On my machine with Python 3.7, this version performs at approximately 22 s/GiB, tested on four randomly generated files of 550 MiB each. It has a negligible memory footprint because it never loads the whole file into memory.
The script runs on Python 2.7 unchanged, where it clocks in at 27 s/GiB. PyPy (6.0.0) runs it much faster, taking only 11 s/GiB.
Using concurrent.futures in theory ought to allow processing in one thread while another is waiting for IO, but the result ends up being significantly slower than the simplest sequential approach.
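For reference, here is a rough sketch of what such a concurrent.futures attempt might look like (a hypothetical variant, not the code that was actually measured): each worker thread transforms one input file into a bytes blob and the main thread writes the blobs out, which also reintroduces the cost of holding whole files in memory.

# Hypothetical thread-pool variant; slower in practice than the sequential version above.
import sys
from concurrent.futures import ThreadPoolExecutor

def transform(in_filename):
    rows = []
    prefix = b''
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
            else:
                rows.append(b'%s,%s' % (prefix, line))
    return b''.join(rows)

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    with ThreadPoolExecutor(max_workers=4) as pool, open(out_file, 'wb') as out:
        # map() yields results in input order, so the output stays ordered
        for blob in pool.map(transform, in_files):
            out.write(blob)

if __name__ == '__main__':
    main()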
You want to read 2 GiB and write 2 GiB with low elapsed time and low memory consumption.
Parallelism, for core and for spindle, matters.
Ideally you would tend to keep all of them busy.
I assume you have at least four cores available.
Chunking your I/O matters, to avoid excessive malloc'ing.
Start with the simplest possible thing.
Please make some measurements and update your question to include them.
sequential
Please make sequential timing measurements of
$ cat combined_data_[1234].csv > /dev/null
and
$ cat combined_data_[1234].csv > big.csv
I assume you will have low CPU utilization, and thus will be measuring read & write I/O rates.
parallel
Please make parallel I/O measurements:
cat combined_data_1.csv > /dev/null &
cat combined_data_2.csv > /dev/null &
cat combined_data_3.csv > /dev/null &
cat combined_data_4.csv > /dev/null &
wait
This will let you know if overlapping reads offers a possibility for speedup.
For example, putting the files on four different physical filesystems might allow this -- you'd be keeping four spindles busy.
async
Based on these timings, you may choose to ditch async I/O, and instead fork off four separate python interpreters.
logic
content = file.read().split(":")
This is where much of your large memory footprint comes from.
Rather than slurping in the whole file at once, consider reading by lines, or in chunks.
A generator might offer you a convenient API for that.
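For example, here is a minimal sketch of such a generator (hypothetical names, assuming the block format shown in the question): it yields one finished output row at a time, so the whole file never sits in memory.

def prefixed_rows(filename):
    prefix = None
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line.endswith(':'):
                prefix = line[:-1]          # e.g. "13368:" becomes the prefix "13368"
            elif line:
                yield '%s,%s\n' % (prefix, line)

# usage sketch:
# with open('output.csv', 'w') as out:
#     for name in ['combined_data_1.txt', 'combined_data_2.txt']:
#         out.writelines(prefixed_rows(name))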
EDIT:
compression
It appears that you are I/O bound -- you have idle cycles while waiting on the disk.
If the final consumer of your output file is willing to do decompression, then
consider using gzip, xz/lzma, or snappy.
The idea is that most of the elapsed time is spent on I/O, so you want to manipulate smaller files to do less I/O.
This benefits your script when writing 2 GiB of output,
and may also benefit the code that consumes that output.
As a separate item, you might possibly arrange for the code that produces the four input files to produce compressed versions of them.
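As a rough illustration using only the standard library (assuming the downstream consumer can read gzip; the rows shown are just the sample data from the question):

# Sketch: write the CSV rows gzip-compressed instead of as plain text.
import gzip

rows = [
    '13368,2385003,4,2004-07-08\n',
    '13368,659432,3,2005-03-16\n',
]
with gzip.open('output.csv.gz', 'wt', encoding='utf-8') as out:
    out.writelines(rows)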
I have tried to solve your problem. I think this is a very easy and simple way if you don't have prior knowledge of any special library.
I just took 2 input files named input.txt and input2.txt with the following contents.
Note: all files are in the same directory.
input.txt
13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25
input2.txt
13364:
2385001,5,2004-06-08
659435,1,2005-03-16
13370:
751811,2,2023-12-16
2625220,2,2015-05-26
I have written the code in a modular way so that you can easily import and use it in your project. Once you run the code below from a terminal using python3 csv_writer.py, it will read all the files listed in file_names and generate output.csv with the result you're looking for.
csv_writer.py
# https://stackoverflow.com/questions/55226823/reduce-runtime-file-reading-string-manipulation-of-every-line-and-file-writing
import re

def read_file_and_get_output_lines(file_names):
    output_lines = []
    for file_name in file_names:
        with open(file_name) as f:
            lines = f.readlines()
            for new_line in lines:
                new_line = new_line.strip()
                if not re.match(r'^\d+:$', new_line):
                    output_line = [old_line]
                    output_line.extend(new_line.split(","))
                    output_lines.append(output_line)
                else:
                    old_line = new_line.rstrip(":")
    return output_lines

def write_lines_to_csv(output_lines, file_name):
    with open(file_name, "w+") as f:
        for arr in output_lines:
            line = ",".join(arr)
            f.write(line + '\n')

if __name__ == "__main__":
    file_names = [
        "input.txt",
        "input2.txt"
    ]
    output_lines = read_file_and_get_output_lines(file_names)
    print(output_lines)
    # [['13368', '2385003', '4', '2004-07-08'], ['13368', '659432', '3', '2005-03-16'], ['13369', '751812', '2', '2002-12-16'], ['13369', '2625420', '2', '2004-05-25'], ['13364', '2385001', '5', '2004-06-08'], ['13364', '659435', '1', '2005-03-16'], ['13370', '751811', '2', '2023-12-16'], ['13370', '2625220', '2', '2015-05-26']]
    write_lines_to_csv(output_lines, "output.csv")
output.csv
13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
13364,2385001,5,2004-06-08
13364,659435,1,2005-03-16
13370,751811,2,2023-12-16
13370,2625220,2,2015-05-26
I have a simple script that reads values from a device and outputs them via print, and another script which listens on stdin and interprets each number. The device outputs one number per second. Surprisingly, piping the scripts on my Ubuntu box does not work. However, if the first script is changed to generate random numbers as fast as it can instead of reading from the device, the second script successfully receives the data.
Below is a simplified example of my situation.
print.py:
#!/usr/bin/env python2
import time
import sys

while True:
    time.sleep(1)  # without this everything works
    print "42"
    sys.stdout.flush()
read.py:
#!/usr/bin/env python2
import sys

while True:
    for str in sys.stdin:
        print str
Command line invocation:
vorac#laptop:~/test$ ./print.py | ./read.py
Here is the end result. The first script reads from the device and the second graphs the data in two separate time frames (what is shown are random numbers).
Ah, now that is a tricky problem. It happens because the iterator method for sys.stdin (which is xreadlines()) is buffered. In other words, when your loop implicitly calls next(sys.stdin) to get the next line of input, Python tries to read from the real under-the-hood standard input stream until its internal buffer is full, and only once the buffer is full does it proceed through the body of the loop. The buffer size is 8 kilobytes, so this takes a while.
You can see this by decreasing the time delay in the sleep() call to 0.001 or some such value, depending on the capabilities of your system. If you hit the time just right, you'll see nothing for a few seconds, and then a whole block of 42s come out all at once.
To fix it, use sys.stdin.readline(), which does not go through that read-ahead buffer and returns as soon as a full line is available.
while True:
    line = sys.stdin.readline()
    print line
You might also want to strip the trailing newline before printing, otherwise you'll get doubled line breaks: use line.rstrip('\n'), or end the print statement with a trailing comma (print line,) to suppress the extra newline.
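An equivalent compact form of the same fix (written for Python 2, since that's what the question uses) is to iterate with readline() until it returns an empty string at EOF:

# read.py, compact variant: readline() returns each line as soon as it arrives,
# so nothing waits for the 8 KB read-ahead buffer to fill.
import sys

for line in iter(sys.stdin.readline, ''):
    print line.rstrip('\n')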
I changed your read.py and it worked for me :). You forgot to call .readline() on stdin.
import sys

while True:
    line = sys.stdin.readline()
    if line:
        print line.strip()
    else:
        break  # readline() returns '' at EOF, i.e. when the writer has exited
Output is :
$ python write.py | python read.py
42
42
42
42
42
42
42
Firstly, I'm stuck with Python 2.4. This is a large enterprise environment and I'm unable to update to Python 2.7, which would be my preference.
I need to read the output of some dtrace scripts that emit data at intervals, similar to iostat (e.g. iostat 5 100: every 5 seconds, 100 times).
I'm playing around with Popen and Popen.communicate, but it seems to slurp all the data at once and then print it out as one large string.
I need to enter a while loop and read the output one line at a time.
Can someone point me into the right direction for doing this?
Much thx.
import subprocess

p = subprocess.Popen("some_long_command", stdout=subprocess.PIPE)
# readline() returns "" only at EOF, so this loops once per line of output
for line in iter(p.stdout.readline, ""):
    print line
I think at least ...
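Applied to the iostat-style example from the question, a Python 2.4-compatible sketch might look like this (the command and its arguments are just the example from the question):

# Read a long-running command's output one line at a time.
import subprocess

p = subprocess.Popen(["iostat", "5", "100"], stdout=subprocess.PIPE)
for line in iter(p.stdout.readline, ""):
    print line.rstrip()
p.stdout.close()
p.wait()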
I have a Python script that imports some log data into a StringIO object and then reads that data line by line and inserts it into a DB table. The script gets considerably slower after some iterations. To illustrate: it takes ~1.6 seconds to run through 1500 logs, ~1m16s to run through 3500 logs, and then 20 seconds for 1100 logs!
My script is laid out as follows:
for dir in dirlist:
    file = StringIO.StringIO(...output from some system command to get logs...)
    for line in file:
        ctr += 1
        ...
        do some regex matches and replacements
        ...
        cursor.insert(..."insert query"...)
        if ctr >= 1000:
            conn.commit()  # commit once every 1000 transactions
Use cProfile to profile your script and find out where the time is actually spent. It is not usually helpful to just guess where the time is spent without any details. Profiling will tell you whether the performance issue is with some regex matching stuff or the insert query.
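For example, a minimal way to profile the whole run (process_logs is a placeholder for your own entry point, not a name from your script):

# Sketch: profile the existing code and print the 20 most expensive calls.
import cProfile
import pstats

def process_logs():
    pass  # placeholder for the loop shown in the question

cProfile.run('process_logs()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)

The output will show per-function call counts and cumulative times, which makes it obvious whether the regex work, the inserts, or the commits dominate.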