I have been working on this file I/O and have made some progress reading through the site, and I am wondering what other ways this can be optimized. I am parsing a test input file of 10GB/30MM lines and writing only the fields I need to an output file, which results in a roughly 1.4GB clean file. Initially this process took 40 minutes, and I have reduced it to around 30 minutes. Does anyone have any other ideas to reduce this in Python? Long term I will be looking to rewrite this in C++ - I just have to learn the language first. Thanks in advance.
with open(fdir+"input.txt",'rb',(50*(1024*1024))) as r:
w=open(fdir+"output0.txt",'wb',50*(1024*1024)))
for i,l in enumerate(r):
if l[42:44]=='25':
# takes fixed width line into csv line of only a few cols
wbun.append(','.join([
l[7:15],
l[26:35],
l[44:52],
l[53:57],
format(int(l[76:89])/100.0,'.02f'),
l[89:90],
format(int(l[90:103])/100.0,'.02f'),
l[193:201],
l[271:278]+'\n'
]))
# write about every 5MM lines
if len(wbun)==wsize:
w.writelines(wbun)
wbun=[]
print "i_count:",i
# splits about every 4GB
if (i+1)%fsplit==0:
w.close()
w=open(fdir+"output%d.txt"%(i/fsplit+1),'wb',50*(1024*1024)))
w.writelines(wbun)
w.close()
Try running it in PyPy (https://pypy.org); it will run without changes to your code, and probably faster.
Also, C++ might be overkill, especially if you don't know it yet. Consider learning Go or D instead.
I am trying to create a big file containing the same text repeated, but my system hangs some time after executing the script.
the_text = "This is the text I want to copy 100's of time"
count = 0
while True:
    the_text += the_text
    count += 1
    if count > int(1e10):
        break
NOTE: The above is an oversimplified version of my code. I want to create a file containing the same text many times, with a final size of around 27GB.
I know this is because RAM is being overloaded. What I want to know is how I can do this in a fast and effective way in Python.
Don't accumulate the string in memory; instead, write it directly to the file:
the_text = "This is the text I want to copy 100's of time"
with open( "largefile.txt","wt" ) as output_file
for n in range(10000000):
output_file.write(the_text)
This took ~14s on my laptop (using an SSD) to create a file of ~440MiB.
The above code writes one string at a time - I'm sure it could be sped up by batching the lines together, but there doesn't seem much point speculating about that without any info on what your application can do.
Ultimately this will be limited by the disk speed; if your disk can manage 50MiB/s sustained writes, then writing 450MiB will take about 9s - which sounds like what my laptop is doing with the line-by-line writes.
If I write 100 strings at once with write(the_text*100), looping 1/100th as many times, i.e. range(100000), this takes ~6s - a 2.5x speedup, writing at ~70MiB/s.
If I write 1000 strings at once using range(10000), this takes ~4s - my laptop is starting to top out at ~100MiB/s.
I get ~125MiB/s with write(the_text*100000).
Increasing further to write(the_text*1000000) slows things down, presumably because Python's memory handling for the string starts to take appreciable time.
Doing text I/O will be slowing things down a bit - I know that with Python I can do about 300MiB/s combined read+write of binary files.
SUMMARY: for a 27GiB file, my laptop running Python 3.9.5 on Windows 10 maxes out at about 125MiB/s, or 8s/GiB, so it would take ~202s to create the file when writing strings in chunks of about 4.5MiB (45 chars * 100,000). YMMV.
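For reference, a minimal sketch of the batched approach described above (keeping the 10,000,000 total repeats from the earlier example; the chunk factor of 100,000 is just the value that happened to be fastest on my laptop - tune it for your own disk):

the_text = "This is the text I want to copy 100's of time"
chunk = the_text * 100000                  # roughly 4.5MiB per write call
with open("largefile.txt", "wt") as output_file:
    for n in range(10000000 // 100000):    # same total text, far fewer write calls
        output_file.write(chunk)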
I read line 1 (not line 0) as a string from 2 files (the 1st ~30MB and the 2nd ~50MB), where line 0 contains just some information I don't need at the moment. Line 1 is a string representation of an array which contains around 1.3E6 smaller arrays such as ['I1000009', 'A', '4024', 'A']:
[[['I1000009', 'A', '4024', 'A'], ['I1000009', 'A', '6734', 'G'],...],[['H1000004', 'B', '4024', 'A'], ['L1000009', 'B', '6734', 'C'],...],[and so on],...]
Both files are filled in the same way; that's why they are between 30 and 50MB in size. I read those files with my .py script to get access to the individual pieces of information I need:
import sys

myID = sys.argv[1]
otherID = sys.argv[2]

samePath = '/home/srv/Dokumente/srv/'
FolderName = 'checkArrays/'
finishedFolder = samePath+'finishedAnalysis/'
myNewFile = samePath+FolderName+myID[0]+'/'+myID+'.txt'
otherFile = samePath+FolderName+otherID[0]+'/'+otherID+'.txt'
nameFileOKarray = '_array_goodData.txt'

import csv
import os
import re  # for regular expressions
# Text 2 - Start
import operator  # for sorting the csv files
# Text 2 - End

whereIsMyArray = 1

text_file = open(finishedFolder+myID+nameFileOKarray, "r")
line = text_file.readlines()[whereIsMyArray:]
myGoodFile = eval(line[0])
text_file.close()

text_file = open(finishedFolder+otherID+nameFileOKarray, "r")
line = text_file.readlines()[whereIsMyArray:]
otherGoodFile = eval(line[0])
text_file.close()

print(str(myGoodFile[0][0][0]))
print(str(otherGoodFile[0][0][0]))
The problem I have is that if I start my .py script from the shell:
python3 checkarr_v1.py 44 39
the RAM usage on my 4GB Pi server increases to the limit of RAM and swap, and the process dies. I then tried to start the .py script on a 32GB RAM server, and it worked, but the RAM usage was really huge. See pics:
[Screenshot: (slack mode) overview of normal RAM and CPU usage]
[Screenshot: (start sequence) overview of the highest RAM usage, ~6GB, and CPU]
Then it goes up and down for ~1 min: 1.2GB to 3.6GB, then to 1.7GB, then to 1GB, and after about a minute the script finishes and the correct output is shown.
Can you help me understand whether there is a better way to solve this on a 4GB Raspberry Pi? Is there a better way to write the 2 files, since the [, " and ] symbols also take up space in them? Is there a better solution than the eval function for turning that string into an array? Sorry for all the questions, but I can't understand why ~80MB of files increases the RAM usage to around 6GB. It sounds like I am doing something wrong. Best regards and thanks.
1.3E9 arrays is going to be lots and lots of bytes if you read that into your application, no matter what you do.
I don't know if your code does what you actually want to do, but you're only ever using the first data item. If that's what you want to do, then don't read the whole file, just read that first part.
But also: I would advise against using "eval" for deserializing data.
The built-in json module will give you data in almost the same format (if you control the input format).
Still, in the end: if you want to hold that much data in your program, you're looking at many GB of memory usage.
If you just want to process it, I'd take a more iterative approach and do a little at a time rather than swallow the whole files. Especially with limited resources.
Update: I see now that it's 1.3e6, not 1.3e9 entries. Big difference. :-) Then json data should be okay. On my machine a list of 1.3M ['RlFKUCUz', 'A', '4024', 'A'] takes about 250MB.
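For example, if the data line were stored as JSON instead of a Python literal (which you can do if you control how the files are written), the eval call could be replaced with something like this - a rough sketch reusing the path variables from the question's script:

import json

whereIsMyArray = 1  # the data sits on line 1, as described in the question

with open(finishedFolder + myID + nameFileOKarray, "r") as text_file:
    line = text_file.readlines()[whereIsMyArray:]

# json.loads parses the text into nested lists without executing it as Python code
myGoodFile = json.loads(line[0])
print(myGoodFile[0][0][0])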
I am trying to run a rolling-horizon optimisation where I have multiple optimisation scripts, each generating its own results. Instead of printing results to screen at every interval, I want to write each set of results using model.write("results.sol") - and then read them back into a results-processing script (a separate Python script).
I have tried read("results.sol") in Python, but the file format is not recognised. Is there any way to read/process the .sol file format that Gurobi outputs? It would seem bizarre if you could not read the .sol file at some later point and generate plots etc.
Maybe I have missed something blindingly obvious.
Hard to answer without seeing your code, as we have to guess what you are doing.
But well...
When you use
model.write("out.sol")
Gurobi will use its own format to write it (and what is written is inferred from the file suffix).
This can easily be read by:
model.read("out.sol")
If you used
x = read("out.sol")
you are using Python's basic IO tools, and of course Python won't interpret that file with respect to the format. Furthermore, reading like that is text mode (and maybe binary is required; not sure).
General rule: if you wrote the solution using a method of the model class, then read it using a method of the model class too.
The usage above is normally used to reinstate some state of your model (e.g. a MIP start). If you want to plot it, you will have to do further work. In this case, using Python's IO tools might be a good idea, and you should respect the format described here. This could be read as CSV or manually (and, as opposed to my remark earlier: it is text mode, not binary).
So assuming the example from the link is in file gur.sol:
import csv

with open('gur.sol', newline='\n') as csvfile:
    # the format uses two spaces between name and value, but csv.reader only
    # accepts a single-character delimiter, so collapse the double spaces first
    reader = csv.reader((line.replace('  ', ' ') for line in csvfile), delimiter=' ')
    next(reader)  # skip header
    sol = {}
    for var, value in reader:
        sol[var] = float(value)

print(sol)
Output:
{'z': 0.2, 'x': 1.0, 'y': 0.5}
Remarks:
The code is ugly because Python's csv module has some limitations.
The delimiter is two spaces in this format, and we need to hack the code to read it (as only one character is allowed for the delimiter).
The code might be tailored to Python 3 (which is what I'm using; the next() call will probably differ in py2).
pandas would be much, much better for this purpose (a huge tool with a very good CSV reader) - see the sketch below.
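A rough sketch of the pandas route, assuming the same gur.sol file as above (sep=r"\s+" splits on any run of whitespace, so the two-space delimiter needs no special handling):

import pandas as pd

# skiprows=1 skips the header line, mirroring next(reader) in the csv version
df = pd.read_csv('gur.sol', sep=r"\s+", skiprows=1, names=['var', 'value'])
sol = dict(zip(df['var'], df['value']))
print(sol)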
I'm trying to read a huge number of lines from standard input with Python.
more hugefile.txt | python readstdin.py
The problem is that the program freezes as soon as I've read just a single line.
import sys

print sys.stdin.read(8)
exit(1)
This prints the first 8 bytes, but then I expect it to terminate and it never does. I think it's not really reading just the first bytes, but trying to read the whole file into memory.
Same problem with sys.stdin.readline()
What I really want to do, of course, is read all the lines, but with a buffer so I don't run out of memory.
I'm using Python 2.6
This should work efficiently in a modern Python:
import sys
for line in sys.stdin:
    # do something...
    print line,
You can then run the script like this:
python readstdin.py < hugefile.txt
Back in the day, you had to use xreadlines to get efficient huge line-at-a-time IO - and the docs now ask that you use for line in file.
Of course, this is of assistance only if you're actually working on the lines one at a time. If you're just reading big binary blobs to pass on to something else, then your other mechanism might be just as efficient.
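For that blob case, a rough sketch of reading stdin in fixed-size chunks (the 64KiB chunk size is an arbitrary example) might look like:

import sys

CHUNK_SIZE = 64 * 1024          # arbitrary example chunk size

while True:
    chunk = sys.stdin.read(CHUNK_SIZE)
    if not chunk:               # empty result means end of input
        break
    # hand the chunk to whatever consumes it; here we just echo it
    sys.stdout.write(chunk)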
Here is a strange problem I have with IDLE (version 2.6.5, with the same Python version) on Windows.
I try to run the following three commands:
fid= open('file.txt', 'r')
lines=fid.readlines()
print lines
When the print lines command is executed, the pythonw.exe process goes CPU crazy, consuming 100% of the CPU, and IDLE seems to stop responding. The file.txt is around 130 KB - I don't consider that file very large!
When the lines finally print (after some minutes), if I try to scroll up to see them, I once again experience the same very high CPU usage.
The memory usage of pythonw.exe is around 15-16 MB the whole time.
Can anybody explain this behaviour to me? Obviously this can't be a bug in IDLE, since it would have been discovered... Also, what can I do to suppress that behaviour? I like using IDLE for script-like tasks involving data transformations from files.
Try reading it line by line:
fid = open('file.txt', 'r')
for line in fid:
    print line
From the documentation on Input and Output, there seem to be two ways to read files:
print f.read()   # This reads the *whole* file. Might be bad to do this for large files.

for l in f:      # This reads it line by line
    print l      # and prints it. Might be better for big files.