Extracting data from a file performance-wise (subprocess vs file read) in Python

Wondering what is the most efficient method to read data from a locally hosted file using Python.
Either using a subprocess and just cat-ing the contents of the file:
ssh = subprocess.Popen(['cat', dir_to_file],
                       stdout=subprocess.PIPE)
for line in ssh.stdout:
    print line
OR simply read the contents of the file:
f = open(dir_to_file)
data = f.readlines()
f.close()
for line in data:
    print line
I am creating a script that has to read the contents of many files and I was wondering which method is most efficient in terms of CPU usage and also which is the fastest in terms of runtime.
This is my first post here at stackoverflow, apologies on formatting.
Thanks

@chrisd1100 is correct that printing line by line is the bottleneck. After a quick experiment, here is what I found.
I ran and timed the two methods above repeatedly (A - subprocess, B - readline) on two different file sizes (~100KB and ~10MB).
Trial 1: ~100KB
subprocess: 0.05 - 0.1 seconds
readline: 0.02 - 0.026 seconds
Trial 2: ~10MB
subprocess: ~7 seconds
readline: ~7 seconds
At the larger file size, printing line by line becomes by far the most expensive operation. At smaller file sizes, readline seems to be about 2x faster. Tentatively, I'd say that readline is faster.
These were all run on Python 2.7.10, OS X 10.11.13, on a 2.8 GHz i7.
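For reference, here is a minimal harness along the lines of that experiment (the file path, repeat count, and helper names are made up); it times both approaches without the per-line print, since that was the bottleneck:
from __future__ import print_function
import subprocess
import time

def read_with_cat(path):
    # Method A: spawn `cat` and consume its stdout.
    proc = subprocess.Popen(['cat', path], stdout=subprocess.PIPE)
    lines = proc.stdout.readlines()
    proc.wait()
    return lines

def read_directly(path):
    # Method B: plain file read.
    with open(path) as f:
        return f.readlines()

def average_seconds(func, path, repeats=10):
    # Average wall-clock time over several runs.
    start = time.time()
    for _ in range(repeats):
        func(path)
    return (time.time() - start) / repeats

path = 'some_file.txt'  # hypothetical test file
print('subprocess:', average_seconds(read_with_cat, path))
print('readlines: ', average_seconds(read_directly, path))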

Related

Writing a big file in Python faster in a memory-efficient way

I am trying to create a big file with the same text, but my system hangs some time after executing the script.
the_text = "This is the text I want to copy 100's of time"
count = 0
while True:
the_text += the_text
count += 1
if count > (int)1e10:
break
NOTE: Above is an oversimplified version of my code. I want to create a file containing the same text many times and the size of the file is around 27GB.
I know it's because RAM is being overloaded. What I want to know is how I can do this in a fast and effective way in Python.
Don't accumulate the string in memory; instead, write it directly to the file:
the_text = "This is the text I want to copy 100's of time"
with open("largefile.txt", "wt") as output_file:
    for n in range(10000000):
        output_file.write(the_text)
This took ~14s on my laptop using SSD to create a file of ~440MiB.
The above code writes one string at a time. I'm sure it could be sped up by batching the lines together, but there doesn't seem to be much point speculating on that without any info about what your application can do.
Ultimately this will be limited by the disk speed; if your disk can manage 50MiB/s sustained writes then writing 450MiB will take about 9s. This sounds like what my laptop is doing with the line-by-line writes.
If I write 100 strings at once with write(the_text*100) and loop 1/100th as many times, i.e. range(100000), this takes ~6s, a speedup of 2.5x, writing at ~70MiB/s.
If I write 1000 strings at once using range(10000) this takes ~4s - my laptop is starting to top out at ~100MiB/s.
I get ~125MiB/s with write(the_text*100000).
Increasing further to write(the_text*1000000) slows things down, presumably Python memory handling for the string starts to take appreciable time.
Doing text I/O will slow things down a bit; I know that with Python I can do about 300MiB/s combined read+write of binary files.
SUMMARY: for a 27GiB file, my laptop running Python 3.9.5 on Windows 10 maxes out at about 125MiB/s or 8s/GiB, so it would take roughly 220s to create the file when writing strings in chunks of about 4.5MiB (45 chars*100,000). YMMV
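For illustration, a minimal sketch of the batched version described above, writing the same 10 million copies in chunks of 100,000 copies each (roughly the 4.5MiB chunk size from the summary); the file name is the same placeholder as before:
the_text = "This is the text I want to copy 100's of time"
copies_per_write = 100000               # 45 chars * 100,000 ~= 4.5MiB per write
chunk = the_text * copies_per_write
total_copies = 10000000                 # same total as the one-at-a-time loop above

with open("largefile.txt", "wt") as output_file:
    for _ in range(total_copies // copies_per_write):
        output_file.write(chunk)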

Python, running into memory error when parsing a 30 MB file (already downloaded to my local computer)

Here is my download address, the file name is 'kosarak'
http://fimi.uantwerpen.be/data/
My parsing code is:
parsedDat = [line.split() for line in open('kosarak.dat').readlines()]
I need this data as a whole to run some method on it, so reading it one line at a time and doing the operation on each line does not fit my case here.
The file is only 30 MB and my computer has at least 10 GB of memory free and 30+ GB of hard drive space, so I guess there shouldn't be any resource problem.
FYI: My Python version is 2.7 and I am running my Python inside Spyder. My OS is Windows 10.
PS: You don't need to use my parsing code/method to do the job; as long as you can get the data from the file into my Python environment, that would be perfect.
Perhaps this may help.
with open('kosarak.dat', 'r') as f:  # Or 'rb' for binary data.
    parsed_data = [line.split() for line in f]
The difference is that your approach reads all of the lines in the file at once and then processes each one (effectively requiring 2x the memory: once for the file data and once again for the parsed data, all of which must be stored in memory at the same time), whereas this approach just reads the file line by line and only needs the memory for the resulting parsed_data.
In addition, your method did not explicitly close the file (although you may just not have shown that portion of your code). This method uses a context manager (with expression [as variable]:) which will close the object automatically once the with block terminates, even following an error. See PEP 343.

Does the size of a file affect the performance of the writes in Python

I was trying to write around 5 billion lines to a file using Python. I have noticed that the performance of the writes gets worse as the file gets bigger.
For example at the beginning I was writing 10 million lines per second, after 3 billion lines, it writes 10 times slower than before.
I was wondering if this is actually related to the size of the file?
That is, do you think the performance would get better if I broke this big file into smaller ones, or does the size of the file not affect the performance of the writes?
If you think it affects the performance can you please explain why?
--Some more info --
The memory consumption is the same (1.3%) all the time. The length of the lines is the same. So the logic is that I read one line from a file (let's call it file A). Each line of file A contains two tab-separated values; if one of the values has some specific characteristics, I add the same line to file B. This operation is O(1): I just convert the value to int and check if that value % someNumber is any of the 7 flags that I want.
Every time I read 10M lines from file A I output the line number. (That's how I know the performance dropped.) File B is the one which gets bigger and bigger, and the writes to it get slower.
The OS is Ubuntu.
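For reference, a minimal sketch of the loop described above (the file names, someNumber, and the flag values are made up, since the question does not give them):
FLAGS = {0, 1, 2, 3, 4, 5, 6}     # the 7 residues being matched (hypothetical)
SOME_NUMBER = 97                  # hypothetical divisor

with open('file_a.txt') as file_a, open('file_b.txt', 'w') as file_b:
    for line_number, line in enumerate(file_a, 1):
        value = int(line.split('\t')[0])
        if value % SOME_NUMBER in FLAGS:
            file_b.write(line)
        if line_number % 10000000 == 0:
            print(line_number)    # progress marker every 10M lines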
With this Python script (which writes and times 5 billion lines):
from __future__ import print_function
import time
import sys
import platform

if sys.version_info[0] == 2:
    range = xrange

times = []
results = []
t1 = time.time()
t0 = t1
tgt = 5000000000
bucket = tgt / 10
width = len('{:,} '.format(tgt))

with open('/tmp/disk_test.txt', 'w') as fout:
    for line in range(1, tgt + 1):
        fout.write('Line {:{w},}\n'.format(line, w=width))
        if line % bucket == 0:
            s = '{:15,} {:10.4f} secs'.format(line, time.time() - t1)
            results.append(s)
            print(s)
            t1 = time.time()
    else:
        info = [platform.system(), platform.release(), sys.version, tgt, time.time() - t0]
        s = '\n\nDone!\n{} {}\n{} \n\n{:,} lines written in {:10.3f} secs'.format(*info)
        fout.write('{}\n{}'.format(s, '\n'.join(results)))
        print(s)
Under Python 2 on OS X, this prints:
500,000,000 475.9865 secs
1,000,000,000 484.6921 secs
1,500,000,000 463.2881 secs
2,000,000,000 460.7206 secs
2,500,000,000 456.8965 secs
3,000,000,000 455.3824 secs
3,500,000,000 453.9447 secs
4,000,000,000 454.0475 secs
4,500,000,000 454.1346 secs
5,000,000,000 454.9854 secs
Done!
Darwin 13.3.0
2.7.8 (default, Jul 2 2014, 10:14:46)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
5,000,000,000 lines written in 4614.091 secs
Under Python 3.4 and OS X:
500,000,000 632.9973 secs
1,000,000,000 633.0552 secs
1,500,000,000 682.8792 secs
2,000,000,000 743.6858 secs
2,500,000,000 654.4257 secs
3,000,000,000 653.4609 secs
3,500,000,000 654.4969 secs
4,000,000,000 652.9719 secs
4,500,000,000 657.9033 secs
5,000,000,000 667.0891 secs
Done!
Darwin 13.3.0
3.4.1 (default, May 19 2014, 13:10:29)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
5,000,000,000 lines written in 6632.965 secs
The resulting file is 139 GB. You can see that on a relatively empty disk (my /tmp path is a 3 TB volume) the times are linear.
My suspicion is that under Ubuntu, you are running into the OS trying to keep that growing file contiguous on an EXT4 disk.
Recall that both OS X's HFS+ and Linux's EXT4 file systems use allocate-on-flush disc allocation schemes. The Linux OS will also attempt to actively move files to allow the allocation to be contiguous (not fragmented).
For Linux EXT4 -- you can preallocate larger files to reduce this effect. Use fallocate as shown in this SO post. Then rewind the file pointer in Python and overwrite in place.
You may be able to use the Python truncate method to create the file, but the results are platform dependent.
Something similar to (pseudo code):
def preallocate_file(path, size):
    ''' Preallocate a file at "path" of "size" '''
    # Use truncate or fallocate on Linux.
    # Depending on your platform, you *may* be able to use just the following;
    # it works on BSD and OS X -- probably most *nix:
    with open(path, 'w') as f:
        f.truncate(size)

preallocate_file(fn, size)
with open(fn, 'r+') as f:
    f.seek(0)        # start at the beginning
    # write whatever
    f.truncate()     # erases the unused portion...
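On Linux, a hedged alternative sketch uses os.posix_fallocate (available in Python 3.3+ on POSIX systems), which reserves real blocks up front rather than just creating a sparse file; the path and size here are made up:
import os

def preallocate(path, size):
    # Reserve "size" bytes on disk up front; unlike truncate, this
    # allocates actual blocks on filesystems such as EXT4.
    with open(path, 'wb') as f:
        os.posix_fallocate(f.fileno(), 0, size)

preallocate('/tmp/bigfile.dat', 1 * 1024**3)   # e.g. 1 GiB; use your target size
with open('/tmp/bigfile.dat', 'r+b') as f:
    f.seek(0)         # start at the beginning
    # ... write the real data ...
    f.truncate()      # trim whatever was not overwritten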
The code which can cause this is not part of Python. If you are writing to a file system type which has issues with large files, the code you need to examine is the file system driver.
For workarounds, experiment with different file systems for your platform (but then this is no longer a programming question, and hence doesn't belong on StackOverflow).
As you say, after 3 billion lines you see a drop in performance, while your memory usage stays the same (1.3%) all the time. And as others have mentioned, there is nothing in the Python I/O code that will affect performance based on file size, so it may be happening because of a software (OS) or hardware issue. To troubleshoot this, I suggest the approaches below:
Use the $ time python yourprogram.py command to analyze your timing; it shows you the following:
real - refers to the actual elapsed time
user - refers to the amount of CPU time spent outside of the kernel
sys - refers to the amount of CPU time spent inside kernel-specific functions
Read more about real, user, and sys in THIS Stack Overflow answer by
ConcernedOfTunbridgeWells.
Use line-by-line timing and execution-frequency profiling: line_profiler is an easy and unobtrusive way to profile your code and see how fast and how often each line of code runs in your scripts.
line_profiler was written by Robert Kern; you can install the Python package via pip:
$ pip install line_profiler
Read the documentation HERE. You can also install memory_profiler to find how much memory your lines use; install it with these commands:
$ pip install -U memory_profiler
$ pip install psutil
The documentation is HERE.
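As a rough sketch of how either profiler is typically used (the function and file names below are hypothetical): with line_profiler you decorate the function with @profile and run the script via kernprof -l -v yourprogram.py; with memory_profiler you import the decorator and just run the script.
from memory_profiler import profile   # for line_profiler, drop this import
                                       # and run via: kernprof -l -v yourprogram.py

@profile
def copy_lines(in_path, out_path):
    # Hypothetical stand-in for the write loop being investigated.
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            fout.write(line)

if __name__ == '__main__':
    copy_lines('file_a.txt', 'file_b.txt')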
The last and most important step is to find where the memory leak is.
The cPython interpreter uses reference counting as its main method of keeping track of memory. This means that every object contains a counter, which is incremented when a reference to the object is stored somewhere, and decremented when a reference to it is deleted. When the counter reaches zero, the cPython interpreter knows that the object is no longer in use, so it deletes the object and deallocates the occupied memory.
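As a quick illustration of that reference counting, here is a small sketch using sys.getrefcount (note that the call itself adds a temporary reference, so the numbers are one higher than you might expect):
import sys

big = [0] * 1000000
print(sys.getrefcount(big))   # baseline count for the list

holder = {'still_used': big}  # storing another reference increments the count
print(sys.getrefcount(big))

del holder                    # dropping that reference decrements it again
print(sys.getrefcount(big))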
A memory leak can often occur in your program if references to objects are held even though the object is no longer in use.
The quickest way to find these “memory leaks” is to use an awesome tool called objgraph written by Marius Gedminas. This tool allows you to see the number of objects in memory and also locate all the different places in your code that hold references to these objects.
Install objgraph using pip:
pip install objgraph
Once you have this tool installed, insert into your code a statement to invoke the debugger:
import pdb; pdb.set_trace()
Which objects are the most common?
At run time, you can inspect the top 20 most prevalent objects in your program by running the following, which gives a result like this:
(pdb) import objgraph
(pdb) objgraph.show_most_common_types()
MyBigFatObject 20000
tuple 16938
function 4310
dict 2790
wrapper_descriptor 1181
builtin_function_or_method 934
weakref 764
list 634
method_descriptor 507
getset_descriptor 451
type 439
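To see what is still holding references to those objects, objgraph can also draw a back-reference graph (writing the image needs graphviz installed); a small sketch, reusing the MyBigFatObject type from the output above:
import objgraph

suspects = objgraph.by_type('MyBigFatObject')[:3]   # a few instances to inspect
objgraph.show_backrefs(suspects, max_depth=3, filename='backrefs.png')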
Read the documentation HERE.
Sources:
http://mg.pov.lt/objgraph/#python-object-graphs
https://pypi.python.org/pypi/objgraph
http://www.appneta.com/blog/line-profiler-python/
https://sublime.wbond.net/packages/LineProfiler
http://www.huyng.com/posts/python-performance-analysis/
What do 'real', 'user' and 'sys' mean in the output of time(1)?

Python: read lines from compressed text files

Is it possible to read a line from a gzip-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it, it becomes 7.4 GB. And this is not the only file I have to read; for the whole process, I have to read 10 files. Although this will be a sequential job, I think it would be smart to do it without extracting everything. How can this be done using Python? I need to read the text file line by line.
Using gzip.GzipFile:
import gzip
with gzip.open('input.gz','rt') as f:
    for line in f:
        print('got line', line)
Note: gzip.open(filename, mode) is an alias for gzip.GzipFile(filename, mode).
I prefer the former, as it looks similar to with open(...) as f: used for opening uncompressed files.
You could use the standard gzip module in Python. Just use:
gzip.open('myfile.gz')
to open the file as any other file and read its lines.
More information here: Python gzip module
Have you tried using gzip.GzipFile? Arguments are similar to open.
The gzip library (obviously) uses gzip, which can be a bit slow. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are you have to install pigz and it will take more cores during the run, but it is much faster and not more memory intensive. The call to the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename,'rt'). The pigz flags are -d for decompress and -c for stdout output which can then be grabbed by os.popen.
The following code takes in a file and a number (1 or 2) and counts the number of lines in the file with the different calls, while measuring the time the code takes. Define the following code in unzip-file.py:
#!/usr/bin/python
import os
import sys
import time
import gzip

def local_unzip(obj):
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    print(time.time() - t0, count)

r = sys.argv[1]
if sys.argv[2] == "1":
    local_unzip(gzip.open(r, 'rt'))
else:
    local_unzip(os.popen('pigz -dc ' + r))
Calling these using /usr/bin/time -f %M, which measures the maximum memory usage of the process, on a 28G file we get:
$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116
$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996
This shows that the system call is about five times faster (10 minutes compared to 50 minutes) using basically the same maximum memory. It is also worth noting that, depending on what you are doing per line, reading the file might not be the limiting factor, in which case the option you take does not matter.

Python / IDLE CPU usage for no reason

Here is a strange problem I have with IDLE (version 2.6.5, with the same Python version) on Windows.
I try to run the following three commands:
fid = open('file.txt', 'r')
lines = fid.readlines()
print lines
When the print lines command is executed, the pythonw.exe process goes CPU crazy, consuming 100% of the CPU, and IDLE seems not to be responding. The file.txt is around 130 KB; I don't consider that file very large!
When the lines finally print (after some minutes), if I try to scroll up to see them, I once again experience the same very large CPU usage.
The memory usage of pythonw.exe is around 15-16 MB all the time.
Can anybody explain this behaviour to me? Obviously this can't be a bug in IDLE since it would have been discovered ... Also, what can I do to suppress that behavior? I like using IDLE for script-like tasks involving data transformations from files.
Try reading it line by line:
fid = open('file.txt', 'r')
for line in fid:
    print line
From the documentation on Input Output, there seem to be two ways to read files:
print f.read()  # This reads the *whole* file. Might be bad to do this for large files.

for l in f:  # This reads it line by line
    print l  # and prints it. Might be better for big files.
