I have a Django app that uses two external scripts. One script moves a file from A to B, stores the value for B in a database, and exits afterwards, which should commit any possibly open transactions. The second script reacts to the movement of the file (using inotify), calculates the md5sum (which apparently takes time) and then looks for an entry in the database like
x = Queue.get(filename=location).
Looking at the timestamps in my logs, I am 100% sure that the first script is long done before the second script (actually a daemon) runs the query. Interestingly enough, everything works perfectly after a restart of the daemon.
This leads me to believe that the QuerySet (I actually run the code shown above every time a new file is detected with inotify) is somehow cached during the runtime of the daemon. I would rather not restart the daemon all the time, but instead force the query to actually hit the DB instead of that cache.
The Django documentation doesn't say much about this - but then, Django isn't usually used from an external daemon like this :)
Thank you in advance for any hints!
Ben
PS: as requested, here is the relevant part of the daemon's source:
def _get_info(self, path):
    try:
        obj = Queue.objects.get(filename=path)
        x = obj.x
        return x
    except Exception as e:
        self.logger.error("Error in lookup: %s" % e)
        return None
This is called by a thread every time a new file is moved to the watched directory.
The code in the first script looks like this:
for f in Queue.objects.all():
    if (matching_stuff_here):
        f.filename = B
        f.save()
sys.exit(0)
You haven't shown any actual code, so we have to guess. My guess would be that even though the transaction in the first script is done and committed, you're still inside an open transaction in the second script (the daemon): because of transaction isolation, you won't see the changes there until that transaction is finished.
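One common way to deal with that in a long-running daemon is to end its current transaction (or simply drop the connection) before each lookup, so the next query starts with a fresh snapshot. A minimal sketch of that idea (not from the original answer, just an illustration using Django's standard connection handling):

from django.db import connection

def _get_info(self, path):
    # Close the daemon's possibly idle-in-transaction connection; Django will
    # open a fresh one for the next query, and its snapshot will include rows
    # committed by the first script in the meantime.
    connection.close()
    try:
        obj = Queue.objects.get(filename=path)
        return obj.x
    except Queue.DoesNotExist:
        self.logger.error("No queue entry found for %s" % path)
        return None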
I am fetching data from a SQL Server database and saving it to files for subsequent processing in Python.
I am using Make to automate fetching and refetching of the data (in case some settings change, only the affected part of the queries is run anew, not all of them). So I have a simple Makefile as follows:
rawdata: datafile1.h5 datafile2.h5 # ... more files like this

datafile1.h5: data1_specific_config.py common_config.py
	python fetch_data1.py

datafile2.h5: data2_specific_config.py common_config.py
	python fetch_data2.py

# ... similar rules for other files
and when needed I just run make rawdata.
Now all the SQL queries executed by the scripts fetch_dataN.py have a significant common part. Schematically the queryN which is run by fetch_dataN.py looks like this:
select ... into ##common_tmp_table ... /*this is identical for all queries*/
select ... from (... ##common_tmp_table ...) /*this is queryN specific; but the same ##common_tmp_table is used*/
Here is the problem: when I now run make rawdata in a situation where say five different datafiles need to be rebuilt, then the same query select ... into ##common_tmp_table ... is run five times with the identical output into ##common_tmp_table. The query takes quite a long time to run so re-executing it five times slows everything down significantly.
But the temporary table is always deleted when one script fetch_dataN.py finishes because the db connection which created it is terminated.
Question:
Is there a way to force the table ##common_tmp_table to be created only once and persisted across all the scripts fetch_dataN.py that are executed by make rawdata?
In particular, is there a way to use the same db connection in all the scripts run by make rawdata? Or perhaps to open one extra connection that persists while all the scripts are running and prevents the global temporary table from being dropped?
Work-around that I know of:
I am able to work around this by manually creating the ##common_tmp_table (e.g. in MS SQL Server Management Studio) before running make rawdata and keeping the connection used for this open until all the scripts finish. But this is obviously ugly and annoying.
If make rawdata could open a separate process that would open a connection, create the tmp table and keep waiting until everything else finishes, that would be a solution. But I don't know if this is possible.
Limitations:
I can't make changes in the database (such as creating a permanent table instead of a temporary one)
I need the scripts to stay separate so that they can be executed by make independently (having everything in one script with the same db connection and thus the same tmp table wouldn't help - rebuilding all the datafiles whenever one or two of them need to be re-fetched would be even slower)
Notes:
MS SQL Server 2008 R2
pyodbc 4.0.28 (for connecting to the database)
python 3.7.6
make 4.3
conda 4.7.12
Thanks.
So I found a solution which works very nicely: the idea is to let make rawdata execute a Python script which
opens a db connection and keeps it open
creates the ##common_tmp_table
runs make rawdata_, which takes care of rebuilding the datafiles (just as make rawdata did in the code posted in the question, but now without the select ... into ##common_tmp_table ... part in the queries)
closes the connection
In code:
Makefile:
# THIS IS NEW
.PHONY: rawdata  # always rebuild the rawdata target
rawdata:
	python fetch_all_non_uptodate.py  # just call a script that (among other stuff) runs `make rawdata_`

# THE REST IS AS BEFORE (just added an underscore)
rawdata_: datafile1.h5 datafile2.h5 # ... more files like this

datafile1.h5: data1_specific_config.py common_config.py
	python fetch_data1.py

datafile2.h5: data2_specific_config.py common_config.py
	python fetch_data2.py

# ... similar rules for other files
fetch_all_non_uptodate.py:
import subprocess
import pyodbc

conn = pyodbc.connect(...)  # open db connection

# simulate the run of make with the -q flag to find out whether all the datafiles
# are up-to-date (return code 0) or not (return code 1); nothing is re-fetched yet
uptodate = (subprocess.run(['make', '-q', 'rawdata_']).returncode == 0)

# if the raw datafiles are not up-to-date
if not uptodate:
    create_common_tmp_table(conn)  # create the ##common_tmp_table in the db and keep it while conn is open (helper sketched below)
    conn.commit()  # commit the creation of the tmp table (Important! - otherwise the other connections won't see it!)
    subprocess.run(['make', 'rawdata_'])  # run make to re-fetch whatever datafiles need to be re-fetched;
                                          # the queries can make use of the existing tmp table
# otherwise we just simulate the make output telling that all is up-to-date
else:
    print("make: Nothing to be done for 'rawdata'.")

conn.close()
queryN:
/*keep just the specific part - the ##common_tmp_table already exists*/
select ... from (... ##common_tmp_table ...)
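The create_common_tmp_table(conn) helper is referenced above but not shown; a minimal pyodbc sketch of what it could look like (COMMON_TMP_SQL is an assumed name standing for the shared select ... into ##common_tmp_table ... statement, whose real text is not part of the question):

# COMMON_TMP_SQL stands for the shared "select ... into ##common_tmp_table ..."
# statement from the question; its real text is not shown there.
COMMON_TMP_SQL = "select ... into ##common_tmp_table ..."

def create_common_tmp_table(conn):
    cursor = conn.cursor()
    cursor.execute(COMMON_TMP_SQL)  # runs on the long-lived connection, so the
                                    # global temp table survives until conn.close()
    cursor.close()

The conn.commit() in fetch_all_non_uptodate.py is what makes the table visible to the separate connections opened by the fetch_dataN.py scripts.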
Environment
Flask 0.10.1
SqlAlchemy 1.0.10
Python 3.4.3
Using unittest
I have created two separate tests whose goal is to look through 700k database records and do some string matching. When the tests are executed one at a time they work fine, but when the whole script is executed with:
python name_of_script.py
it exits with "KILLED" at random places.
The main code in both tests goes something like this:
def test_redundant_categories_exist(self):
    self.assertTrue(self.get_redundant_categories() > 0,
                    'There are 0 redundant categories to remove. Cannot test removing them if there are none to remove.')

def get_redundant_categories(self):
    total = 0
    with get_db_session_scope(db.session) as db_session:
        records = db_session.query(Category)
        for row in records:
            if len(row.c) > 1:
                c = row.c
                # TODO: threads, each thread handles a bulk of rows
                redundant_categories = [cat_x.id
                                        for cat_x in c
                                        for cat_y in c
                                        if cat_x != cat_y and re.search(r'(^|/)' + cat_x.path + r'($|/)', cat_y.path)
                                        ]
                total += len(redundant_categories)
    records = None
    db_session.close()
    return total
The other test calls a function located in the manager.py file that does something similar, but with an added bulk delete in the database.
def test_remove_redundant_mappings(self):
    import os
    os.system("python ../../manager.py remove_redundant_mappings")
    self.assertEqual(self.get_redundant_categories(), 0,
                     "There are redundant categories left after running manager.py remove_redundant_mappings()")
Is it possible for data to be kept in memory between tests? I don't quite understand why executing the tests individually works fine, yet when they run back to back the process ends with Killed.
Any ideas?
Edit (things I've tried to no avail):
import the function from manager.py and call it without os.system(..)
import gc and run a gc.collect() after get_redundant_categories() and after calling remove_redundant_mappings()
While searching high and low, I serendipitously came upon the following comment in a Stack Overflow question/answer:
What is happening, I think, is that people are instantiating sessions and not closing them. The objects are then being garbage collected without closing the sessions. Why sqlalchemy sessions don't close themselves when the session object goes out of scope has always and will always be beyond me. #melchoir55
So I added the following to the method that was being tested:
db_session.close()
Now the unittest executes without getting killed.
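For reference, a common way to guarantee SQLAlchemy sessions get closed (a generic sketch, not the poster's get_db_session_scope) is to wrap each unit of work in a context manager that always closes the session instead of relying on garbage collection:

from contextlib import contextmanager

@contextmanager
def session_scope(session):
    # Commit on success, roll back on error, and always close the session.
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

With something like this in place, get_redundant_categories() would no longer need the explicit db_session.close() call at the end.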
I'm really new to programming in general and very inexperienced, and I'm learning Python as I think it's simpler than other languages. Anyway, I'm trying to use Flask-Ask with ngrok to program an Alexa skill that checks data online (which changes a couple of times per hour). The script takes four different numbers (from a different URL), organizes them into a dictionary, and uses Selenium and PhantomJS to access the data.
Obviously, this exceeds the 8-10 second maximum runtime for an intent before Alexa decides it has taken too long and returns an error message (I know it's timing out because ngrok and the Python log would show if an actual error occurred, and it invariably happens after 8-10 seconds, even though at that point it should still be in the middle of the script). I've read that I could just reprompt it, but I don't know how, and that would only buy me 8-10 more seconds; the script usually takes about 25 seconds just to get the data from the internet (and then maybe a second to turn it into a dictionary).
I tried putting the getData function right after the intent that runs when the Alexa skill is first invoked, but it only runs when I initialize my local server and just holds the data for every new Alexa session. Because the data changes frequently, I want it to perform the function every time I start a new session for the skill with Alexa.
So, I decided just to outsource the function that actually gets the data to another script, and make that other script run constantly in a loop. Here's the code I used.
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def getData():
    username = ''  # username hidden for anonymity
    password = ''  # password hidden for anonymity
    browser = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')

    # log in
    browser.get("https://gradebook.com")  # actual website name changed
    browser.find_element_by_name("username").clear()
    browser.find_element_by_name("username").send_keys(username)
    browser.find_element_by_name("password").clear()
    browser.find_element_by_name("password").send_keys(password)
    browser.find_element_by_name("password").send_keys(Keys.RETURN)

    global currentgrades
    currentgrades = []
    gradeids = ['2018202', '2018185', '2018223', '2018626', '2018473', '2018871', '2018886']
    for x in range(0, len(gradeids)):
        try:
            gradeurl = "https://www.gradebook.com/grades/"
            browser.get(gradeurl)
            grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:3]
            if grade[2] != "%":
                grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:4]
            if grade[1] == "%":
                grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:1]
            currentgrades.append(grade)
        except Exception:
            currentgrades.append('No assignments found')
            continue

    dictionary = {"class1": currentgrades[0], "class2": currentgrades[1], "class3": currentgrades[2],
                  "class4": currentgrades[3], "class5": currentgrades[4], "class6": currentgrades[5],
                  "class7": currentgrades[6]}
    return dictionary

def run():
    dictionary = getData()
    time.sleep(60)
That script runs constantly and does what I want, but in my other script I don't know how to access the dictionary variable. When I use
from getdata import dictionary
in the Flask-Ask script, it just runs the loop and constantly gets the data. I just want the Flask-Ask script to take the variable defined in the run function and use it, without running any of the code in the getdata script, which has already run and fetched the correct data. If it matters, both scripts are running in Terminal on a MacBook.
Is there any way to do what I'm asking about, or are there any easier workarounds? Any and all help is appreciated!
It sounds like you want to import the function so you can run it, rather than importing the dictionary.
Try deleting the run function, and then in your other script use:
from getdata import getData
Then each time you write getData() it will run your code and get a new up-to-date dictionary.
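For example, a short usage sketch of that suggestion (the intent name and the Flask/Flask-Ask setup lines are assumed, not from the original):

from flask import Flask
from flask_ask import Ask, statement

from getdata import getData

app = Flask(__name__)
ask = Ask(app, '/')

@ask.intent("read_grades")          # hypothetical intent name
def read_grades():
    grades = getData()              # runs the scrape and returns a fresh dictionary
    return statement("Your grade in class one is {}.".format(grades["class1"]))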
Is this what you were asking about?
This issue has been resolved.
As for the original question, I never figured out how to import just the dictionary without first running the function that generates it. Furthermore, I realized there had to be a more practical solution than constantly running a script like that, which even then wouldn't give brand-new data.
My solution was to make the script that gets the data start running at the same time as the launch function. Here was the final script for the first intent (the rest of it remained the same):
#ask.intent("start_skill")
def start_skill():
welcome_message = 'What is the password?'
thread = threading.Thread(target=getData, args=())
thread.daemon = True
thread.start()
return question(welcome_message)
def getData():
#script to get data here
#other intents and rest of script here
By design, the skill requests a numeric passcode to make sure I am the one using it before it is willing to read the data (which is probably pointless, but this skill is at least as much for my own education as for practical use, so, for the extra practice, I wanted it to have as many features as I could justify). So, by the time you can actually ask for the data, the script that fetches it will have finished running (I have tested this and it seems to work without fail).
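A hedged sketch of how a later intent could read what the background thread produced (the latest_data store, the intent name, and the app/ask objects from the rest of the script are assumptions, not the poster's exact code; getData() would need to update latest_data instead of just returning the dictionary):

from flask_ask import question, statement

latest_data = {}   # hypothetical module-level store; getData() would call
                   # latest_data.update(dictionary) once the scrape finishes

@ask.intent("read_grades")          # hypothetical intent name
def read_grades():
    if not latest_data:
        # the thread started in start_skill() has not finished scraping yet
        return question("I'm still fetching the data. Please ask again in a moment.")
    return statement("Your grade in class one is {}.".format(latest_data["class1"]))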
I have a Django app saving objects to the database and a Celery task that periodically does some processing on some of those objects. The problem is that the user can delete an object after it has been selected by the Celery task for processing, but before the Celery task has actually finished processing and saving it. So when the Celery task does call .save(), the object reappears in the database even though the user deleted it. Which is really spooky for users, of course.
So here's some code showing the problem:
def my_delete_view(request, pk):
    thing = Thing.objects.get(pk=pk)
    thing.delete()
    return HttpResponseRedirect('yay')
@app.task
def my_periodic_task():
    things = get_things_for_processing()
    # if the delete happens anywhere between here and the .save(), we're hosed
    for thing in things:
        process_thing(thing)  # could take a LONG time
        thing.save()
I thought about trying to fix it by adding an atomic block and a transaction to test if the object actually exists before saving it:
@app.task
def my_periodic_task():
    things = Thing.objects.filter(...some criteria...)
    for thing in things:
        process_thing(thing)  # could take a LONG time
        try:
            with transaction.atomic():
                # just see if it still exists:
                unused = Thing.objects.select_for_update().get(pk=thing.pk)
                # no exception means it exists. go ahead and save the
                # processed version that has all of our updates.
                thing.save()
        except Thing.DoesNotExist:
            logger.warning("Processed thing vanished")
Is this the correct pattern for this sort of thing? I mean, I'll find out whether it works within a few days of running it in production, but it would be nice to know if there are other well-accepted patterns for accomplishing this.
What I really want is to be able to update an object only if it still exists in the database. I'm OK with the race between user edits and edits from process_thing, and I can always throw in a refresh_from_db just before process_thing to minimize the window during which user edits would be lost. But I definitely can't have objects reappearing after the user has deleted them.
If you open a transaction for the duration of the Celery task's processing, you should avoid such problems:
@app.task
@transaction.atomic
def my_periodic_task():
    things = get_things_for_processing()
    # if the delete happens anywhere between here and the .save(), we're hosed
    for thing in things:
        process_thing(thing)  # could take a LONG time
        thing.save()
Sometimes you would like to report to the frontend that you are working on the data, so you can add select_for_update() to your queryset (most probably in get_things_for_processing); then, in the code responsible for deletion, you need to handle the error the database raises when it reports that a specific record is locked.
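A rough sketch of that suggestion (the filter criteria are a placeholder; Thing, process_thing and app come from the question):

from django.db import transaction

@app.task
def my_periodic_task():
    with transaction.atomic():
        # select_for_update() locks the matching rows until this transaction
        # ends, so a delete issued from the view blocks (or errors) instead of
        # silently racing with the save below.
        things = Thing.objects.select_for_update().filter(processed=False)  # placeholder criteria
        for thing in things:
            process_thing(thing)  # could take a LONG time; the rows stay locked
            thing.save()

The trade-off is that the row locks are held for as long as process_thing runs; the accepted pattern below keeps the locked section much shorter.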
For now, it seems like the pattern of "select again atomically, then save" is sufficient:
@app.task
def my_periodic_task():
    things = Thing.objects.filter(...some criteria...)
    for thing in things:
        process_thing(thing)  # could take a LONG time
        try:
            with transaction.atomic():
                # just see if it still exists:
                unused = Thing.objects.select_for_update().get(pk=thing.pk)
                # no exception means it exists. go ahead and save the
                # processed version that has all of our updates.
                thing.save()
        except Thing.DoesNotExist:
            logger.warning("Processed thing vanished")
(this is the same code as in my original question).
I have a function like this in Django:
def uploaded_files(request):
    global source
    global password
    global destination
    username = request.user.username
    log_id = request.user.id
    b = File.objects.filter(users_id=log_id, flag='F')  # Get the user id from session .delete() to use delete
    source = 'sachet.adhikari@69.43.202.97:/home/sachet/my_files'
    password = 'password'
    destination = '/home/zurelsoft/my_files/'
    a = Host.objects.all()  # Lists hosts
    command = subprocess.Popen(['sshpass', '-p', password, 'rsync', '--recursive', source],
                               stdout=subprocess.PIPE)
    command = command.communicate()[0]
    lines = (x.strip() for x in command.split('\n'))
    remote = [x.split(None, 4)[-1] for x in lines if x]
    base_name = [os.path.basename(ok) for ok in remote]
    files_in_server = base_name[1:]
    total_files = len(files_in_server)
    info = subprocess.Popen(['sshpass', '-p', password, 'rsync', source, '--dry-run'],
                            stdout=subprocess.PIPE)
    information = info.communicate()[0]
    command = information.split()
    filesize = command[1]
    #st = int(os.path.getsize(filesize))
    #filesize = size(filesize, system=alternative)
    date = command[2]
    users_b = User.objects.all()
    return render_to_response('uploaded_files.html',
                              {'files': b, 'username': username, 'host': a,
                               'files_server': files_in_server, 'file_size': filesize,
                               'date': date, 'total_files': total_files, 'list_users': users_b},
                              context_instance=RequestContext(request))
The main purpose of the function is to transfer files from the server to the local machine and write the data into the database. What I want is this: there is a single file of about 10GB, which will take a long time to copy. Since the copying happens with rsync on the command line, I want to let the user play with other menus while the file is being transferred. How can I achieve that? For example, if the user presses OK, the transfer starts on the command line, and I want to show the user a "The file is being transferred" message and stop the spinning cursor, or something like that. Is multiprocessing or threading appropriate in this case? Thanks
Assuming that function runs inside a view, your browser will time out before the 10GB file has finished transferring over. Maybe you should re-think your architecture for this?
There are probably several ways to do this, but here are some that come to my mind right now:
One solution is to have an intermediary storing the status of the file transfer. Before you begin the process that transfers the file, set a flag somewhere like a database saying the process has begun. Then if you make your subprocess call blocking, wait for it to complete, check the output of the command if possible and update the flag you set earlier.
Then have whatever front end you have poll the status of the file transfer.
Another solution: if you make the subprocess call non-blocking, as in your example, you should use a thread that sits there reading the stdout and updating an intermediary store, which your front end can query to get a more 'real time' view of the transfer process.
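A rough sketch of that idea (the Transfer model and its fields are assumptions, not from the original code): a non-blocking Popen plus a watcher thread that records whatever rsync prints, and the final status, somewhere the front end can poll.

import subprocess
import threading

def start_transfer(source, destination, password, transfer_id):
    proc = subprocess.Popen(
        ['sshpass', '-p', password, 'rsync', '--recursive', source, destination],
        stdout=subprocess.PIPE)

    def watch():
        for line in proc.stdout:  # whatever rsync writes to stdout
            Transfer.objects.filter(pk=transfer_id).update(  # hypothetical model
                last_output=line.decode(errors='replace').strip())
        proc.wait()
        Transfer.objects.filter(pk=transfer_id).update(
            status='done' if proc.returncode == 0 else 'failed')

    threading.Thread(target=watch, daemon=True).start()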
What you need is Celery.
It lets you spawn the job as a parallel task and return the HTTP response right away.
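A minimal sketch of that suggestion (task and argument names are assumptions): the view queues the transfer and returns immediately, while a Celery worker runs rsync.

import subprocess
from celery import shared_task

@shared_task
def transfer_files(password, source, destination):
    # runs in a worker process, so the web request doesn't wait for rsync
    subprocess.check_call(
        ['sshpass', '-p', password, 'rsync', '--recursive', source, destination])

The view would then just call transfer_files.delay(password, source, destination) and render the "The file is being transferred" page.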
RaviU's solutions would certainly work.
Another option is to call a blocking subprocess in its own Thread. This thread could be responsible for setting a flag or some information (in memcache, the db, or just a file on the hard drive) as well as clearing it when the transfer is complete. Personally, there is no love lost between rsync's stdout and me, so I usually just ask the OS for the filesize.
Also, if you don't need the file absolutely ASAP, adding "-c" to do a checksum can be good for those giant files. Source: personal experience trying to transfer giant video files over a spotty campus network.
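A tiny sketch of the "ask the OS for the filesize" idea (the paths and expected size are assumptions; with rsync you may need --inplace so the destination file grows in place rather than in a temporary file):

import os

def transfer_progress(local_path, expected_bytes):
    # fraction of the file copied so far, based purely on the OS's view of the file
    try:
        return min(os.path.getsize(local_path) / float(expected_bytes), 1.0)
    except OSError:  # destination file not created yet
        return 0.0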
I will say the one problem with all of the solutions so far is that they don't work for "N" files. Eventually, even if you make sure each file can only be transferred once at a time, having a lot of different files will bog down the system. You might be better off using some sort of task queue unless you know it will only ever be one file at a time. I haven't used one recently, but a quick google search yielded Celery, which doesn't look too bad.
Every web server has a facility for uploading files, and what it does for large files is divide the file into chunks and merge them after every chunk is received. What you can do here is have a hidden tag in your HTML page with a value attribute; whenever your upload web service returns an OK message, you can change the hidden value to something relevant, and also write a function that keeps reading the value of that hidden element to check whether the file transfer has finished or not.
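Whichever approach ends up recording the transfer status, the page needs something to poll; a minimal Django sketch of such an endpoint (the Transfer model and its status field are assumptions):

import json
from django.http import HttpResponse

def transfer_status(request, transfer_id):
    # The page's JavaScript (or the hidden element described above) can poll
    # this URL and update the UI once the status changes.
    transfer = Transfer.objects.get(pk=transfer_id)  # hypothetical model
    return HttpResponse(json.dumps({'status': transfer.status}),
                        content_type='application/json')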