Python script takes longer as it runs

I have a Python script that loads some log data into a StringIO object, reads it line by line, and inserts each line into a DB table. The script takes considerably longer after some number of iterations. To give an idea: it takes ~1.6 seconds to run through 1500 logs, ~1m16s for 3500 logs, and then 20 seconds for 1100 logs!
My script is laid out as follows:
for dir in dirlist:
    file = StringIO.StringIO(...output from some system command to get logs...)
    for line in file:
        ctr += 1
        ...
        do some regex matches and replacements
        ...
        cursor.insert(..."insert query"...)
        if ctr >= 1000:
            conn.commit() # commit once every 1000 transactions

Use cProfile to profile your script and find out where the time is actually spent; guessing without measurements is rarely helpful. Profiling will tell you whether the performance issue is in the regex matching or in the insert query.
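For example, a minimal sketch (run_import() is a hypothetical wrapper around the loop above; you can also run python -m cProfile -s cumulative yourscript.py from the command line):

import cProfile, pstats

def run_import():
    # hypothetical wrapper around the whole dirlist/insert loop shown above
    pass

cProfile.run('run_import()', 'profile.out')                             # write raw stats to a file
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)    # show the 20 most expensive calls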

Related

Writing to a text file does not occur in real-time. How to fix this

I have a python script that takes a long time to run.
I placed print-outs throughout the script to observe its progress.
As this script runs different programs, some of which print many messages, it is not feasible to print directly to the screen.
Therefore, I am using a report file
f_report = open(os.path.join("//shared_directory/projects/work_area/", 'report.txt'), 'w')
To which I print my messages:
f_report.write(" "+current_image+"\n")
However, when I look at the file while the script is running, I do not see the messages. They appear only when the program finishes and closes the file, making my approach useless for monitoring on-going progress.
What should I do in order to make python output the messages to the report file in real time?
Many thanks.
You should use the flush() method to write to the file immediately:
f_report.write(" "+current_image+"\n")
f_report.flush()
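If you also want the data pushed from the OS buffers out to disk, os.fsync() can be added after the flush; a small helper sketch (the helper name is illustrative):

import os

def report(f, message):
    # write one progress line and make it visible immediately
    f.write(message + "\n")
    f.flush()              # flush Python's internal buffer to the OS
    os.fsync(f.fileno())   # ask the OS to write it out as well (optional)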
try this:
newbuffer = 0
f_report = open(os.path.join("//shared_directory/projects/work_area/", 'report.txt'), 'w', newbuffer)
It sets up a 0 buffer, which pushes the OS to write content to the file "immediately". Different operating systems may behave differently, but in general the content will be flushed out right away.
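A minimal sketch of that approach, assuming Python 2, where the third argument to open() is the buffer size (0 means unbuffered, 1 means line-buffered, which is usually enough for a line-oriented report):

import os

current_image = 'image_001.png'   # placeholder value for illustration
# 0 = unbuffered; use 1 instead for line buffering (flushed at each newline)
f_report = open(os.path.join("//shared_directory/projects/work_area/", 'report.txt'), 'w', 0)
f_report.write(" " + current_image + "\n")   # appears in the file right away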

python: unable to find files in recently changed directory (OSx)

I'm automating some tedious shell tasks, mostly file conversions, in a kind of blunt force way with os.system calls (Python 2.7). For some bizarre reason, however, my running interpreter doesn't seem to be able to find the files that I just created.
Example code:
import os, time, glob
# call a node script to template a word document
os.system('node wordcv.js')
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# move to the directory that pdfwriter prints to
os.chdir('/users/shared/PDFwriter/pauliglot')
print glob.glob('*.pdf')
I expect to get a length-1 list with the resulting filename; instead I get an empty list.
The same occurs with
pdfs = [file for file in os.listdir('/users/shared/PDFwriter/pauliglot') if file.endswith(".pdf")]
print pdfs
I've checked by hand, and the expected files are actually where they're supposed to be.
Also, I was under the impression that os.system blocked, but just in case it doesn't, I also stuck a time.sleep(1) in there before looking for the files. (That's more than enough time for the other tasks to finish.) Still nothing.
Hmm. Help? Thanks!
You should add a wait after the call to launch. launch will spawn the task in the background and return before the document has finished printing. You can either put in an arbitrary sleep, or check for the file's existence if you know what the expected filename will be.
import time
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# give word about 30 seconds to finish printing the document
time.sleep(30)
Alternative:
import time
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# wait for a maximum of 90 seconds
for x in xrange(0, 90):
    time.sleep(1)
    if os.path.exists('/path/to/expected/filename'):
        break
Reference for potentially needing a longer than 1 second wait here
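The polling loop can also be wrapped in a small helper so it is reusable for every converted file (a sketch; the function name, placeholder path, and timeout are illustrative):

import os
import time

def wait_for_file(path, timeout=90, poll_interval=1):
    # return True as soon as path exists, False if timeout (in seconds) expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    return False

os.system('launch -p gowdercv.docx')
if not wait_for_file('/path/to/expected/filename'):
    raise RuntimeError('PDF did not appear within 90 seconds')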

python will not let another program write test report to .txt file while it's running

I have an automated RAM tester that writes a test report for each module it tests. The RAM tester keeps adding to the test report indefinitely. What I want to do is have Python read the report and look for the word "PASS" and the speed of the RAM.
Once the two words are found, I need Python to write to the serial port and clear the report so there is nothing in the .txt file. That way it is ready to loop around and read the next report from the next module tested.
The code is all written; the problem is that while Python is running, the RAM tester will not write its report to the .txt file. I have created a small program that takes a test report I captured from the RAM tester and writes it to the .txt file every 3 seconds, and that works perfectly.
The program I am working on opens the .txt file, finds the text my other program wrote to it, finds the two key words, deletes them, and loops around and does it again until I close the program, just as I want it to. I have done some troubleshooting by commenting out chunks of code, and everything works until it runs the
file = open("yup.txt", "r+")
txt = file.read()
part; then the RAM tester fails to write the report. I think that loop is screwing it up by constantly accessing/reading the .txt file... not too sure though. Also, Python does not crash at all; it just sits there in the loop, so I have no problems as far as that goes.
Here is the code I'm having troubles with:
cache_size = os.lstat("yup.txt").st_size
print '\nsearching for number of characters in cache\n'
time.sleep(2)
if cache_size == 0:
    print ('0 characters found in cache!\n')
    time.sleep(1.5)
    print ('there is no data to process!\n')
    time.sleep(1.5)
    print ('waiting for RAMBot\n')
if cache_size > 0:
    print '%d characters found in cache!' % (cache_size)
    time.sleep(1.5)
    print ('\ndata analysis will now begin\n')
    print('________________________________________________________________________________')
x = 1
while x == 1:
    file = open("yup.txt" , "r+")
    txt = file.read()
    if "PASS" and '#2x400MHZ' in txt:
        ser.write('4')
        print('DDR2 PC-6400 (800MHz) module detected')
        open('yup.txt' , 'w')
        file.close()
    if "PASS" and '#2x333MHZ' in txt:
        ser.write('3')
        print('DDR2 PC-5300 (667MHz) module detected')
        open('yup.txt' , 'w')
        file.close()
    if "PASS" and '#2x266MHZ' in txt:
        ser.write('2')
        print('DDR2 PC-4200 (533MHz) module detected')
        open('yup.txt' , 'w')
        file.close()
    if "PASS" and '#2x200MHZ' in txt:
        ser.write('1')
        print('DDR2 PC-3200 (400MHz) module detected')
        open('yup.txt' , 'w')
        file.close()
Here is a one of the test reports from the RAM tester:
Test No.: 1
Module : DDR2 256Mx72 2GB 2R(8)#2x333MHZ 1.8V
(Tested at 2x400MHz)
Addr.(rowxcol.) : 14 x 10
Data (rankxbit) : 2 x 72
Internal Banks : 8
Burst : Mode=Sequential, Length=8
AC parameters : CL=5, AL=0, Trcd=5, Trp=5
S/N from SPD : a128f4f3
Test Loop # : 1
Test Pattern : wA, wD, mt, mX, mC, mY, S.O.E
## PASS: Loop 1 ##
Elapsed Time : 00:00:53.448
Date : 09/26/2014, 16:07:40
I am not sure if this helps or not, but here is the small program I wrote to simulate the RAM tester writing its test reports to the .txt file. I am still confused about why this works while the RAM tester writing the test report has problems...
import os
import time
Q = '''Test No.: 1
Module : DDR2 256Mx72 2GB 2R(8)#2x333MHZ 1.8V
(Tested at 2x400MHz)
Addr.(rowxcol.) : 14 x 10
Data (rankxbit) : 2 x 72
Internal Banks : 8
Burst : Mode=Sequential, Length=8
AC parameters : CL=5, AL=0, Trcd=5, Trp=5
S/N from SPD : a128f4f3
Test Loop # : 1
Test Pattern : wA, wD, mt, mX, mC, mY, S.O.E
## PASS: Loop 1 ##
Elapsed Time : 00:00:53.448
Date : 09/26/2014, 16:07:40'''
x = 1
while x == 1:
    host = open('yup.txt' , 'w')
    host.write(Q)
    host.close()
    time.sleep(3)
Thank you very much in advance, I really need to get this to work so it is much appreciated.
The problem is that on Windows, two programs generally can't have the same file open at the same time. When you try to open the file in w or r+ mode, you're asking it to open the file for exclusive access, meaning it will fail if someone else already has the file open, and it will block anyone else from opening the file.
If you want the specifics on sharing and locks in Windows, see the dwShareMode explanation in the CreateFile function on MSDN. (Of course you're not calling CreateFile, you're just using Python's open, which calls CreateFile for you—or, in older versions, calls fopen, which itself calls CreateFile.)
So, how do you work around this?
The simplest thing to do is just not keep the file open. Open the file, write it, and close it again. (Also, since you never write to file, why open it in r+ mode in the first place?)
You will also have to add some code that handles an OSError caused by the race condition of both programs trying to open and write the file at the exact same time, but that's just a simple try:/except: with a loop around it.
Could you just open the file with more permissive sharing?
Sure. You could, for example, use pywin32 to call CreateFile and WriteFile instead of using Python's open and write wrappers, and then you can pass any parameters you want for dwShareMode.
But think about what this means. What happens if both programs try to write the file at the same time? Who wins? If you're lucky, you lose one test output. If you're unlucky, script A blanks the file halfway through script B writing its test output, and you end up with a garbage file that you can't parse and an indecipherable, hard-to-reproduce exception. So, is that really what you want?
Meanwhile, you've got some other weird stuff in your code.
Why are you opening another handle to the same path just to truncate it? Why not just, say, file.truncate(0)? Doing another open while you still have file open in r+ mode means you end up conflicting with yourself, even if no other program was trying to use the same file.
You're also relying on some pretty odd behavior of the file pointer. You've read everything in file. You haven't seeked back to the start, or reopened the file. You've truncated the file and overwritten it with about the same amount of data. So when you read() again, you should get nothing, or maybe a few lines if the test reports aren't always the exact same length. The fact that you're actually getting the whole file is an unexpected consequence of some weird things Windows does in its C stdio library.
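Putting those suggestions together, a minimal sketch of the open-read-truncate-close approach with a retry loop (the retry count and delay are arbitrary; ser is the serial object from the original script):

import time

def read_and_clear(path, retries=10, delay=0.5):
    # open, read, truncate and close in one short critical section,
    # retrying while the RAM tester still has the file open
    for _ in range(retries):
        try:
            f = open(path, 'r+')
        except (IOError, OSError):
            time.sleep(delay)          # the tester is writing; try again shortly
            continue
        try:
            txt = f.read()
            f.seek(0)
            f.truncate()               # clear the report using the same handle
        finally:
            f.close()
        return txt
    return ''

txt = read_and_clear('yup.txt')
if 'PASS' in txt and '#2x400MHZ' in txt:
    ser.write('4')                     # ser is the serial port opened elsewhere in the script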

passing information from one script to another

I have two Python scripts, scriptA and scriptB, which run on Unix systems. scriptA takes 20s to run and generates a number X. scriptB needs X when it is run and takes around 500ms. I need to run scriptB every day but scriptA only once a month, so I don't want to run scriptA from scriptB. I also don't want to manually edit scriptB each time I run scriptA. I thought of updating a file through scriptA, but I'm not sure where such a file should ideally be placed so that scriptB can read it later, independent of the location of the two scripts. What is the best way of storing this value X on a Unix system so that it can be used later by scriptB?
Many programs on Linux/Unix keep their config in /etc/ and use a subfolder in /var/ for other files.
But you would probably need root privileges.
If you run the scripts from your home folder, you could create a file ~/.scriptB.rc or a folder ~/.scriptB/ or ~/.config/scriptB/.
See also the Filesystem Hierarchy Standard on Wikipedia.
It sounds like you want to serialize ScriptA's results, save them in a file or database somewhere, and then have ScriptB read those results (possibly also modifying the file or updating the database entry to indicate that those results have now been processed).
To make that work you need for ScriptA and ScriptB to agree on the location and format of the data ... and you might want to implement some sort of locking to ensure that ScriptB doesn't end up with corrupted inputs if it happens to be run at the same time that ScriptA is writing or updating the data (and, conversely, that ScriptA doesn't corrupt the data store by writing thereto while ScriptB is accessing it).
Of course ScriptA and ScriptB could each have a filename or other data location hard-coded into their sources. However, that would violate the DRY Principle. So you might want them to share a configuration file. (Of course the configuration filename is also repeated in these sources ... or at least the import of the common bit of configuration code ... but the latter still ensures that an installation/configuration detail (the location and, possibly, format of the data store) is decoupled from the source code. Thus it can be changed (in the shared config) without affecting the rest of the code for either script.)
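For example, a hypothetical shared module (the module name and path are illustrative) that both scripts import, so the location lives in exactly one place:

# storeconfig.py -- the single shared piece of configuration
DATA_STORE = '/var/lib/myapp/results.db'   # change it here, not in ScriptA or ScriptB

# in both ScriptA and ScriptB
from storeconfig import DATA_STORE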
As for precisely which type of file and serialization to use ... that's a different question.
These days, as strange as it may sound, I would suggest using SQLite3. It may seem like overkill to use an SQL "database" for simply storing a single value. However, SQLite3 is included in the Python standard library, and it only needs a filename for configuration.
You could also use a pickle or JSON or even YAML (which would require a third party module) ... or even just text or some binary representation using something like struct. However, any of those will require that you parse your results and deal with any parsing or formatting errors. JSON would be the simplest option among these alternatives. Additionally you'd have to do your own file locking and handling if you wanted ScriptA and ScriptB (and, potentially, any other scripts you ever write for manipulating this particular data) to be robust against any chance of concurrent operations.
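For instance, a minimal JSON version (the path and key are illustrative, and this does no locking):

import json

# ScriptA: save X
X = 42   # value produced by ScriptA (placeholder)
with open('/tmp/scriptA_result.json', 'w') as f:
    json.dump({'X': X}, f)

# ScriptB: load X
with open('/tmp/scriptA_result.json') as f:
    X = json.load(f)['X']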
The advantage of SQLite3 is that it handles the parsing and decoding and the locking and concurrency for you. You create the table once (perhaps embedded in ScriptA as a rarely used "--initdb" option for occasions when you need to recreate the data store). Your code to read it might look as simple as:
#!/usr/bin/python
import sqlite3
db = sqlite3.connect('./foo.db')
cur = db.cursor()
results = cur.execute('SELECT value, MAX(date) FROM results').fetchone()[0]
... and writing a new value would look a bit like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('INSERT INTO results (value) VALUES (?)', (myvalue,))
All of this assuming you had, at some time, initialized the data store (foo.db in this example) with something like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('CREATE TABLE IF NOT EXISTS results (value INTEGER NOT NULL, date TIMESTAMP DEFAULT current_timestamp)')
(Actually you could just execute that command every time if you wanted your scripts to recover silently from someone cleaning out the old data.)
This might seem like more code than a JSON file-based approach. However, SQLite3 provides ACID (transactional) semantics as well as abstracting away the serialization and deserialization.
Also note that I'm glossing over a few details. My examples above actually create a whole table of results, with timestamps for when they were written to your datastore. These would accumulate over time and, if you were using this approach, you'd periodically want to clean up your "results" table with a command like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('DELETE FROM results where date < ?', cur.execute('SELECT MAX(date) from results').fetchone())
Alternatively, if you really never want access to your prior results, you can change the INSERT into an UPDATE like so:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('UPDATE results SET value=(?)', (mynewvalue,))
(Also note that the (mynewvalue,) is a single element tuple. The DBAPI requires that our parameters be wrapped in tuples which is easy to forget when you first start using it with single parameters such as this).
Obviously if you took this UPDATE-only approach you could drop the 'date' column from the 'results' table and all those references to MAX(date) from the queries.
I chose to use the slightly more complex schema in my earlier examples because it allows your scripts to be a bit more robust with very little additional complexity. You could then do other error checking (detecting missing values where ScriptB finds that ScriptA hasn't been run as intended, for example).
Edit/run crontab -e:
# this will run every month on the 25th at 2am
0 2 25 * * python /path/to/scriptA.py > /dev/null
# this will run every day at 2:10 am
10 2 * * * python /path/to/scriptB.py > /dev/null
Create an external file for both scripts:
In scriptA:
>>> with open('/path/to/test_doc','w+') as f:
...     f.write('1')
...
In scriptB:
>>> with open('/path/to/test_doc','r') as f:
...     v = f.read()
...
>>> v
'1'
You can take a look at PyPubSub
It's a Python package that provides a publish-subscribe API to facilitate event-based programming.
It'll give you an OS-independent solution to your problem and only requires a few additional lines of code in both A and B.
Also, you don't need to handle messy files!
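A minimal sketch of the PyPubSub API, assuming the pypubsub package and both sides living in the same process (the topic name is illustrative):

from pubsub import pub

def on_result(value):
    # called whenever something publishes to the 'scriptA.result' topic
    print('received X =', value)

pub.subscribe(on_result, 'scriptA.result')
pub.sendMessage('scriptA.result', value=42)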
Assuming you are not running the two scripts at the same time, you can pickle the go-between object and save it anywhere, so long as you load and save the file from the same system path. For example:
import pickle # or import cPickle as pickle
# Create a python object like a dictionary, list, etc.
favorite_color = { "lion": "yellow", "kitty": "red" }
# Write to file ScriptA
f_myfile = open('C:\\My Documents\\My Favorite Folder\\myfile.pickle', 'wb')
pickle.dump(favorite_color, f_myfile)
f_myfile.close()
# Read from file ScriptB
f_myfile = open('C:\\My Documents\\My Favorite Folder\\myfile.pickle', 'rb')
favorite_color = pickle.load(f_myfile) # variables come out in the order you put them in
f_myfile.close()

Performance effect of using print statements in Python script

I have a Python script that processes a huge text file (around 4 million lines) and writes the data into two separate files.
I have added a print statement, which outputs a string for every line, for debugging. How bad could this be from a performance perspective?
If it is going to be very bad, I can remove the debugging line.
Edit
It turns out that having a print statement for every line of a 4-million-line file increases the run time far too much.
I tried it in a very simple script just for fun; the difference is quite staggering:
In large.py:
target = open('target.txt', 'w')
for item in xrange(4000000):
    target.write(str(item)+'\n')
    print item
Timing it:
[gp#imdev1 /tmp]$ time python large.py
real 1m51.690s
user 0m10.531s
sys 0m6.129s
[gp#imdev1 /tmp]$ ls -lah target.txt
-rw-rw-r--. 1 gp gp 30M Nov 8 16:06 target.txt
Now running the same with "print" commented out:
[gp#imdev1 /tmp]$ time python large.py
real 0m2.584s
user 0m2.536s
sys 0m0.040s
Yes, it affects performance.
I wrote a small program to demonstrate:
import time
start_time = time.time()
for i in range(100):
    for j in range(100):
        for k in range(100):
            print(i,j,k)
print(time.time()-start_time)
input()
The time measured was 160.2812204496765 seconds.
Then I replaced the print statement with pass. The results were shocking: the measured time without print was 0.26517701148986816 seconds.
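If some progress output is still needed, one compromise is to print only every N lines; a minimal sketch based on the large.py example above (the interval is arbitrary):

target = open('target.txt', 'w')
for item in xrange(4000000):
    target.write(str(item) + '\n')
    if item % 100000 == 0:   # report progress once every 100,000 lines instead of every line
        print item
target.close()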
