Reduce runtime, file reading, string manipulation of every line and file writing - python

I'm writing a script that reads all lines from multiple files, reads the number at the beginning of each block, and puts that number in front of every line of the block until the next number, and so on. Afterwards it writes all processed lines into a single .csv file.
The files I am reading look like this:
13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25
And the output file should look like this:
13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
Currently my script is this:
from asyncio import Semaphore, ensure_future, gather, run
import time

limit = 8

async def read(file_list):
    tasks = list()
    result = None
    sem = Semaphore(limit)
    for file in file_list:
        task = ensure_future(read_bounded(file, sem))
        tasks.append(task)
    result = await gather(*tasks)
    return result

async def read_bounded(file, sem):
    async with sem:
        return await read_one(file)

async def read_one(filename):
    result = list()
    with open(filename) as file:
        dataList = []
        content = file.read().split(":")
        file.close()
        j = 1
        filmid = content[0]
        append = result.append
        while j < len(content):
            for entry in content[j].split("\n"):
                if len(entry) > 10:
                    append("%s%s%s%s" % (filmid, ",", entry, "\n"))
                else:
                    if len(entry) > 0:
                        filmid = entry
            j += 1
    return result

if __name__ == '__main__':
    start = time.time()
    write_append = "w"
    files = ['combined_data_1.txt', 'combined_data_2.txt', 'combined_data_3.txt', 'combined_data_4.txt']
    res = run(read(files))
    with open("output.csv", write_append) as outputFile:
        for result in res:
            outputFile.write(''.join(result))
        outputFile.flush()
        outputFile.close()
    end = time.time()
    print(end - start)
It has a runtime of about 135 seconds (the four input files are each about 500 MB, and the output file is about 2.3 GB). Running the script takes about 10 GB of RAM, which I think might be a problem.
I think most of the time is spent creating the list of all lines.
I would like to reduce the runtime of this program, but I am new to python and not sure how to do this. Can you give me some advice?
Thanks
Edit:
I measured the times for the following commands in cmd (I only have Windows installed on my computer, so I used what I hope are equivalent cmd commands):
sequential writing to NUL
timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt > NUL"
combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt
command took 0:1:25.87 (85.87s total)
sequential writing to file
timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt > test.csv"
combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt
command took 0:2:42.93 (162.93s total)
parallel
timecmd "type combined_data_1.txt > NUL & type combined_data_2.txt > NUL & type combined_data_3.txt >NUL & type combined_data_4.txt > NUL"
command took 0:1:25.51 (85.51s total)

In this case you're not gaining anything by using asyncio for two reasons:
asyncio is single-threaded and doesn't parallelize processing (and, in Python, neither can threads)
the IO calls access the file system, which asyncio doesn't cover - it is primarily about network IO
The giveaway that you're not using asyncio correctly is the fact that your read_one coroutine doesn't contain a single await. That means that it never suspends execution, and that it will run to completion before ever yielding to another coroutine. Making it an ordinary function (and dropping asyncio altogether) would have the exact same result.
Here is a rewritten version of the script with the following changes:
byte IO throughout, for efficiency
iterates through the file rather than loading all at once
sequential code
import sys

def process(in_filename, outfile):
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            outfile.write(b'%s,%s' % (prefix, line))

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    with open(out_file, 'wb') as out:
        for fn in in_files:
            process(fn, out)

if __name__ == '__main__':
    main()
On my machine and Python 3.7, this version performs at approximately 22 s/GiB, tested on four randomly generated files, of 550 MiB each. It has a negligible memory footprint because it never loads the whole file into memory.
The script runs on Python 2.7 unchanged, where it clocks at 27 s/GiB. Pypy (6.0.0) runs it much faster, taking only 11 s/GiB.
Using concurrent.futures in theory ought to allow processing in one thread while another is waiting for IO, but the result ends up being significantly slower than the simplest sequential approach.
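For reference, here is a minimal sketch of what such a concurrent.futures attempt might look like: a single worker thread handles the writes while the main thread parses the next chunk. The chunk size and helper names are illustrative assumptions, not the exact code behind the timing remark above.

# Hypothetical sketch: overlap writing one chunk with parsing the next using a
# single worker thread. Chunk size and helper names are illustrative assumptions.
import sys
from concurrent.futures import ThreadPoolExecutor

def process_lines(lines, prefix):
    # Prefix each data line; return the updated prefix and the transformed chunk.
    out = []
    for line in lines:
        if line.endswith(b':\n'):
            prefix = line[:-2]
        else:
            out.append(b'%s,%s' % (prefix, line))
    return prefix, b''.join(out)

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    prefix = b''
    with open(out_file, 'wb') as out, ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for fn in in_files:
            with open(fn, 'rb') as r:
                while True:
                    lines = r.readlines(1 << 20)  # roughly 1 MiB of lines per batch
                    if not lines:
                        break
                    prefix, chunk = process_lines(lines, prefix)
                    if pending is not None:
                        pending.result()          # wait for the previous write
                    pending = pool.submit(out.write, chunk)
        if pending is not None:
            pending.result()

if __name__ == '__main__':
    main()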

You want to read 2 GiB and write 2 GiB with low elapsed time and low memory consumption.
Parallelism, for core and for spindle, matters.
Ideally you would tend to keep all of them busy.
I assume you have at least four cores available.
Chunking your I/O matters, to avoid excessive malloc'ing.
Start with the simplest possible thing.
Please make some measurements and update your question to include them.
sequential
Please make sequential timing measurements of
$ cat combined_data_[1234].csv > /dev/null
and
$ cat combined_data_[1234].csv > big.csv
I assume you will have low CPU utilization, and thus will be measuring read & write I/O rates.
parallel
Please make parallel I/O measurements:
cat combined_data_1.csv > /dev/null &
cat combined_data_2.csv > /dev/null &
cat combined_data_3.csv > /dev/null &
cat combined_data_4.csv > /dev/null &
wait
This will let you know if overlapping reads offers a possibility for speedup.
For example, putting the files on four different physical filesystems might allow this -- you'd be keeping four spindles busy.
async
Based on these timings, you may choose to ditch async I/O, and instead fork off four separate python interpreters.
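A rough sketch of that idea using multiprocessing, with one worker process per input file writing its own partial CSV that gets concatenated at the end; the process count, file names, and merge step are assumptions, not something the measurements above cover.

# Hypothetical sketch: one worker process per input file. Each worker writes its
# own partial CSV, and the parts are concatenated at the end. File names, the
# process count, and the merge step are illustrative assumptions.
import shutil
from multiprocessing import Pool

def convert(in_filename):
    out_filename = in_filename + '.part.csv'
    prefix = b''
    with open(in_filename, 'rb') as r, open(out_filename, 'wb') as w:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
            else:
                w.write(b'%s,%s' % (prefix, line))
    return out_filename

if __name__ == '__main__':
    inputs = ['combined_data_1.txt', 'combined_data_2.txt',
              'combined_data_3.txt', 'combined_data_4.txt']
    with Pool(processes=4) as pool:
        parts = pool.map(convert, inputs)
    with open('output.csv', 'wb') as out:
        for part in parts:
            with open(part, 'rb') as p:
                shutil.copyfileobj(p, out)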
logic
content = file.read().split(":")
This is where much of your large memory footprint comes from.
Rather than slurping in the whole file at once, consider reading by lines, or in chunks.
A generator might offer you a convenient API for that.
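For instance, a minimal generator sketch along those lines (the names are illustrative, not tested against your data) might yield already-prefixed CSV lines while reading the input lazily:

# Minimal sketch of a generator that yields already-prefixed CSV lines while
# reading the input lazily, one line at a time. Names are illustrative.
def prefixed_lines(filename):
    prefix = ''
    with open(filename) as f:
        for line in f:
            stripped = line.rstrip('\n')
            if stripped.endswith(':'):
                prefix = stripped[:-1]
            else:
                yield '%s,%s' % (prefix, line)

# Usage: stream straight into the output file, no big list held in memory.
# with open('output.csv', 'w') as out:
#     for fn in ['combined_data_1.txt', 'combined_data_2.txt']:
#         out.writelines(prefixed_lines(fn))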
EDIT:
compression
It appears that you are I/O bound -- you have idle cycles while waiting on the disk.
If the final consumer of your output file is willing to do decompression, then
consider using gzip, xz/lzma, or snappy.
The idea is that most of the elapsed time is spent on I/O, so you want to manipulate smaller files to do less I/O.
This benefits your script when writing 2 GiB of output,
and may also benefit the code that consumes that output.
As a separate item, you might possibly arrange for the code that produces the four input files to produce compressed versions of them.
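As a rough illustration of the output side, assuming gzip and a speed-oriented compression level (the output name and level are arbitrary choices, not a recommendation for your exact setup):

# Rough illustration: write the output gzip-compressed, trading a little CPU
# for much less write I/O. The output name and compresslevel are assumptions.
import gzip

def write_compressed(lines, out_filename='output.csv.gz'):
    # compresslevel=1 favours speed over compression ratio
    with gzip.open(out_filename, 'wt', compresslevel=1) as out:
        out.writelines(lines)

# write_compressed(line_iterable) where line_iterable yields the prefixed CSV lines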

I have tried to solve your problem. I think this is a very easy and simple way if you don't have any prior knowledge of any special library.
I just took 2 input files named input.txt & input2.txt with the following contents.
Note: All files are in the same directory.
input.txt
13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25
input2.txt
13364:
2385001,5,2004-06-08
659435,1,2005-03-16
13370:
751811,2,2023-12-16
2625220,2,2015-05-26
I have written the code in a modular way so that you can easily import and use it in your project. Once you run the code below from the terminal using python3 csv_writer.py, it will read all the files provided in the list file_names and generate output.csv with the result that you're looking for.
csv_writer.py
# https://stackoverflow.com/questions/55226823/reduce-runtime-file-reading-string-manipulation-of-every-line-and-file-writing
import re

def read_file_and_get_output_lines(file_names):
    output_lines = []

    for file_name in file_names:
        with open(file_name) as f:
            lines = f.readlines()
            for new_line in lines:
                new_line = new_line.strip()
                if not re.match(r'^\d+:$', new_line):
                    output_line = [old_line]
                    output_line.extend(new_line.split(","))
                    output_lines.append(output_line)
                else:
                    old_line = new_line.rstrip(":")

    return output_lines

def write_lines_to_csv(output_lines, file_name):
    with open(file_name, "w+") as f:
        for arr in output_lines:
            line = ",".join(arr)
            f.write(line + '\n')

if __name__ == "__main__":
    file_names = [
        "input.txt",
        "input2.txt"
    ]

    output_lines = read_file_and_get_output_lines(file_names)
    print(output_lines)
    # [['13368', '2385003', '4', '2004-07-08'], ['13368', '659432', '3', '2005-03-16'], ['13369', '751812', '2', '2002-12-16'], ['13369', '2625420', '2', '2004-05-25'], ['13364', '2385001', '5', '2004-06-08'], ['13364', '659435', '1', '2005-03-16'], ['13370', '751811', '2', '2023-12-16'], ['13370', '2625220', '2', '2015-05-26']]

    write_lines_to_csv(output_lines, "output.csv")
output.csv
13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
13364,2385001,5,2004-06-08
13364,659435,1,2005-03-16
13370,751811,2,2023-12-16
13370,2625220,2,2015-05-26

Related

How is data transferred between two Python scripts

As an example, I have two scripts, say script1.py
f = open("output1.txt", "w")
count = 1
for i in range(100):
f.write(str(count) + "\n")
print(str(count))
count +=1
f.close
This script prints numbers from 1 to 100 to a file and to standard output.
Then I have a second script, say script2.py
import sys
import time

stdin = sys.stdin
f1 = open("output2.txt", "w")

for line in stdin:
    if len(line) > 0:
        print(line.strip())
        time.sleep(0.05)
        f1.write(line.strip() + "\n")
which reads data from standard input and writes it to a file. I added a time.sleep call to ensure the second script consumes data at a far lower rate than the first one produces it.
I run the scripts from the command line as
python3 script1.py | python3 script2.py
so redirecting the standard output of the first (so the print() command) to the standard input of the second one.
It works as somehow expected, two files are generated containing numbers from 1 to 100.
I am nevertheless wondering how the data transfer part works, from the first script to the second.
The first script generates data at a faster rate. Where is this data stored while it waits for the second script to access it?
Is there some sort of buffer that is put in place between the two processes? Or what else?
Is Python responsible for this, or the OS?
Is the buffer limited in size? Can it be programmed (e.g. accessed to direct data to another target as well)?
Thanks a bunch
It is because of the pipe "|"; more info here: https://ss64.com/nt/syntax-redirection.html
commandA | commandB Pipe the output from commandA into commandB
So the prints from your script1 are sent to your script2.
As for how it works: the operating system puts a fixed-size buffer between the two processes when it sets up the pipe. script1's print output goes into that buffer as text; when the buffer is full, script1 blocks until script2 reads from sys.stdin and drains it. So the OS, not Python, manages the buffering, and the buffer's size is limited. That's why reading sys.stdin in script2 works.

How can I speed up this python script to read and process a csv file?

I am trying to process a relatively large (about 100k lines) csv file in python. This is what my code looks like:
#!/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os

csvFileName = sys.argv[1]
with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source=source, target=target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)
I'm not sure how to optimize this script. Should I use something besides DictReader?
I have to use Python 2.7, and cannot upgrade to Python 3.
If you want to avoid multiprocessing, it is possible to split your long csv file into a few smaller csvs and run them simultaneously, like:
$ python your_script.py 1.csv &
$ python your_script.py 2.csv &
The ampersand stands for background execution in Linux environments. More details here. I don't know enough about anything similar on Windows, but it's possible to open a few cmd windows, lol.
Anyway, it's much better to stick with multiprocessing, of course.
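For instance, a minimal multiprocessing sketch (the pool size and the input file name are assumptions; 'some_curl_command' is the placeholder from the question):

# Hypothetical sketch: let a multiprocessing.Pool run the downloads instead of
# splitting the CSV by hand. Pool size and the input file name are assumptions;
# 'some_curl_command' is the placeholder from the question.
import csv
import os
from multiprocessing import Pool

def fetch(row):
    systemLine = "some_curl_command {source}, {target}".format(source=row['old'], target=row['new'])
    os.system(systemLine)

if __name__ == '__main__':
    with open('input.csv') as f:
        rows = list(csv.DictReader(f))
    pool = Pool(processes=8)   # 8 downloads in flight at a time
    pool.map(fetch, rows)
    pool.close()
    pool.join()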
What about using requests instead of curl?
import requests

response = requests.get(source_url)
html = response.content

with open(target, "w") as file:
    file.write(html)
Here's the doc.
Avoid print statements; in the long run they're slow as hell. For development and debugging that's OK, but when you decide to start the final execution of your script you can remove them and check the count of processed files directly in the target folder.
running
subprocess.Popen(systemLine)
instead of
os.system(systemLine)
should speed things up. Please note that systemLine has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. If you want to limit the number of concurrent commands, have a look at that.
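A rough sketch of both points together, with a simple cap on concurrency (the input file name is an assumption and 'some_curl_command' is the placeholder from the question):

# Hypothetical sketch: Popen with a list argument and a simple cap on how many
# commands run at once. 'some_curl_command' and the file name are placeholders.
import csv
import subprocess

MAX_CONCURRENT = 8

with open('input.csv') as f:
    running = []
    for row in csv.DictReader(f):
        running.append(subprocess.Popen(['some_curl_command', row['old'], row['new']]))
        if len(running) >= MAX_CONCURRENT:
            running.pop(0).wait()   # wait for the oldest command before launching more
    for proc in running:
        proc.wait()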

Python - Checking concordance between two huge text files

So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100Gb+. Specifically, they are in the fastq format. This format is used for DNA sequencing data, and consists of records of four lines, something like this:
#REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
#REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '#'.
So, for QA purposes, I need to compare two such files. These files should have matching headers, so the first record in the other file should also have the header '#REC1', the next should be '#REC2' and so on. I want to make sure that this is the case, before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration with string comparisons would take very long, but this QA step will be run numerous times, and I can't afford to wait that long. So I thought a better way would be to sample records from a few points in the files, for example every 10% of the records. If the order of the records is messed up, I'd be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and then using Python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
    f.seek(start_point)
    # look for the next beginning of record, never mind how
But now the problem is more complex, since I don't know how to coordinate between the two files, since the bytes location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th lines in both files to make sure they are the same, without going over the whole file?
Would appreciate any ideas/hints. Maybe iterating in parallel? But how exactly?
Thanks!
Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
    BEGIN { \
        n = 0; \
    } \
    { \
        a[n % 4] = $0; \
        if ((n+1) % 4 == 0) { \
            print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
        } \
        n++; \
    }' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ Gb files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ Gb files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
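For completeness, a rough Python sketch of the same linearization idea (the function name and usage are illustrative):

# Rough Python equivalent of the awk linearization above: join each group of
# four FASTQ lines with tabs so whole records can be compared one per line.
def linearize(fastq_path):
    with open(fastq_path) as f:
        record = []
        for line in f:
            record.append(line.rstrip('\n'))
            if len(record) == 4:
                yield '\t'.join(record)
                record = []

# for rec1, rec2 in zip(linearize('example_1.fq'), linearize('example_2.fq')):
#     if rec1 != rec2:
#         print('First mismatching record:', rec1.split('\t')[0])
#         break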
Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
    for l1, l2 in zip(f1, f2):
        if l1.startswith("#REC"):
            if l1 != l2:
                print("Difference at record", l1)
                break
    else:
        print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.
Have you looked into using the rdiff command?
The upsides of rdiff are:
with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program
The downsides of rdiff are:
it's not part of the standard Linux/UNIX distribution – you have to install the librsync package.
delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The first one produces a special signature file. In the second step, a delta is created using another rdiff call (all shown below). While the 2-step process may seem annoying, it has the benefit of providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
import sys
import re

"""Find records that differ between two HUGE files. This is expected to
use minimal memory."""

def get_rec_num(fd):
    """Look for the record number. If not found, return -1."""
    while True:
        line = fd.readline()
        if len(line) == 0:
            break
        match = re.search(r'^#REC(\d+)', line)
        if match:
            num = int(match.group(1))
            return num
    return -1

f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')

hf1 = dict()
hf2 = dict()

while f1 or f2:
    if f1:
        r = get_rec_num(f1)
        if r < 0:
            f1.close()
            f1 = None
        else:
            # if r is found in the f2 hash, no need to store it in the f1 hash
            if r not in hf2:
                hf1[r] = 1
            else:
                del hf2[r]
    if f2:
        r = get_rec_num(f2)
        if r < 0:
            f2.close()
            f2 = None
        else:
            # if r is found in the f1 hash, no need to store it in the f2 hash
            if r not in hf1:
                hf2[r] = 1
            else:
                del hf1[r]

print('Records found only in f1:')
for r in hf1:
    print('{}, '.format(r))
print('Records found only in f2:')
for r in hf2:
    print('{}, '.format(r))
Both answers from @AlexReynolds and @TimPietzcker are excellent from my point of view, but I would like to put my two cents in. You also might want to speed up your hardware:
Replace the HDD with an SSD.
Take n SSDs and create a RAID 0. In a perfect world you will get an n-times speedup for your disk IO.
Adjust the size of the chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to be executed faster than sixteen 1 MB reads. (This applies to a single SSD; for RAID 0 optimization one has to take a look at the RAID controller options and capabilities.)
The last option is especially relevant to NOR SSDs. Don't pursue minimal RAM utilization; read as much as needed to keep your disk reads fast (a sketch of chunked reading follows below). For instance, parallel reads of single rows from two files can probably slow down reading - imagine an HDD where two rows of the two files are always on the same side of the same magnetic disk(s).
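A minimal sketch of such chunked reading, using the 16 MB figure from above (names and chunk size are illustrative, not a tuned recommendation):

# Minimal sketch of chunked reading: pull 16 MB at a time (the example figure
# above) and split into complete lines in memory. Names are illustrative.
CHUNK_SIZE = 16 * 1024 * 1024

def read_in_chunks(filename, chunk_size=CHUNK_SIZE):
    with open(filename, 'rb') as f:
        leftover = b''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split(b'\n')
            leftover = lines.pop()   # the last piece may be an incomplete line
            for line in lines:
                yield line
        if leftover:
            yield leftover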

python: read lines from compressed text files

Is it possible to read a line from a gzip-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it, it becomes 7.4 GB. And this is not the only file I have to read. For the total process, I have to read 10 files. Although this will be a sequential job, I think it would be a smart thing to do it without extracting the whole file. How can this be done using Python? I need to read the text file line by line.
Using gzip.GzipFile:
import gzip

with gzip.open('input.gz', 'rt') as f:
    for line in f:
        print('got line', line)
Note: gzip.open(filename, mode) is an alias for gzip.GzipFile(filename, mode).
I prefer the former, as it looks similar to with open(...) as f: used for opening uncompressed files.
You could use the standard gzip module in python. Just use:
gzip.open('myfile.gz')
to open the file as any other file and read its lines.
More information here: Python gzip module
Have you tried using gzip.GzipFile? Arguments are similar to open.
The gzip library (obviously) uses gzip, which can be a bit slow. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are you have to install pigz and it will take more cores during the run, but it is much faster and not more memory intensive. The call to the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename,'rt'). The pigz flags are -d for decompress and -c for stdout output which can then be grabbed by os.popen.
The following code takes in a file and a number (1 or 2) and counts the number of lines in the file with the two different calls, while measuring the time the code takes. Define the following code in unzip-file.py:
#!/usr/bin/python
import os
import sys
import time
import gzip

def local_unzip(obj):
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    print(time.time() - t0, count)

r = sys.argv[1]

if sys.argv[2] == "1":
    local_unzip(gzip.open(r, 'rt'))
else:
    local_unzip(os.popen('pigz -dc ' + r))
Calling these using /usr/bin/time -f %M, which measures the maximum memory usage of the process, on a 28G file we get:
$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116
$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996
This shows that the system call is about five times faster (10 minutes compared to 50 minutes) while using basically the same maximum memory. It is also worth noting that, depending on what you do per line, reading the file might not be the limiting factor, in which case the option you take does not matter.

Python / IDLE CPU usage for no reason

Here is a strange problem I have with IDLE (version 2.6.5, with the same Python version) on Windows.
I try to run the following three commands:
fid= open('file.txt', 'r')
lines=fid.readlines()
print lines
When the print lines command is executed, the pythonw.exe process goes CPU crazy, consuming 100% of the CPU, and IDLE seems to stop responding. The file.txt is around 130 KB - I don't consider that file very large!
When the lines finally print (after some minutes), if I try to scroll up to see them, I once again experience the same very high CPU usage.
The memory usage of pythonw.exe is around 15-16 MB all the time.
Can anybody explain this behaviour to me - obviously this can't be a bug in IDLE since it would have been discovered... Also, what can I do to suppress that behaviour? I like using IDLE for script-like tasks involving data transformations from files.
Try reading it line by line:
fid = open('file.txt', 'r')
for line in fid:
    print line
From the documentation on Input Output, there seem to be two ways to read files:
print f.read() # This reads the *whole* file. Might be bad to do this for large files.
for l in f:  # This reads it line by line
    print l  # and prints it. Might be better for big files.
