Locking a row with SQLite (read lock?) - Python

I have developed a basic proxy tester in Python. Proxy IPs and ports, as well as their date_of_last_test (e.g. 31/12/2011 10:10:10) and result_of_last_test (OK or KO), are stored in a single SQLite table. (I realize I could store a lot more detail about the test results and keep a history / stats, but this simple model suits my needs.)
Here is the simplified code of the tester main loop, where I loop over the proxies and update their status:
while True:
    # STEP 1: select
    myCursor.execute("SELECT * from proxy ORDER BY date_of_last_test ASC;")
    row = myCursor.fetchone()
    # STEP 2: update
    if isProxyWorking(row['ip'], row['port']):  # this test can last a few seconds
        updateRow(row['ip'], row['port'], 'OK')
    else:
        updateRow(row['ip'], row['port'], 'KO')
My code works fine when run as a single process. Now, I would like to be able to run many processes of the program, using the same SQLite database file.
The problem with the current code is the lack of a locking mechanism that would prevent several processes from testing the same proxy.
What would be the cleanest way to put a lock on the row at STEP 1 / SELECT time, so that the next process doing a SELECT gets the next row?
In other words, I'd like to avoid the following situation:
Let's say it's 10PM, and the DB contains 2 proxies:
Proxy A tested for the last time at 8PM and proxy B tested at 9PM.
I start two processes of the tester to update their statuses:
10:00 - Process 1 gets the "oldest" proxy to test: A
10:01 - Process 2 gets the "oldest" proxy to test: !!! A !!! (here I'd like Process 2 to get proxy B, because A is already being tested - though not yet updated in the DB)
10:10 - Testing of A by Process 1 is over; its status is updated in the DB
10:11 - Testing of A by Process 2 is over; its status is updated (!!! AGAIN !!!) in the DB
There is no actual error/exception in this case, but a waste of time I want to avoid.

SQLite only allows one process to make changes to the database at a time. From the FAQ:
Multiple processes can have the same database open at the same time. Multiple processes can be doing a SELECT at the same time. But only one process can be making changes to the database at any moment in time,
and
When SQLite tries to access a file that is locked by another process, the default behavior is to return SQLITE_BUSY. You can adjust this behavior from C code using the sqlite3_busy_handler() or sqlite3_busy_timeout() API functions.
So if there are only a few updates then this will work; otherwise you need to change to a more capable database. Note that there is only one lock, and it is on the whole database - SQLite cannot lock an individual row.
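If you do stay on SQLite, one common workaround (a sketch only, not something the answer above prescribes) is to "claim" a row before testing it, inside a transaction that takes the write lock up front. It assumes a hypothetical extra integer column in_progress (default 0) that is not part of the original schema:

import sqlite3

# timeout maps to SQLite's busy handler, so concurrent claimers wait instead of failing
conn = sqlite3.connect('proxies.db', timeout=30, isolation_level=None)
conn.row_factory = sqlite3.Row
cur = conn.cursor()

def claim_next_proxy():
    # BEGIN IMMEDIATE acquires the write lock before reading, so two
    # processes cannot both read and claim the same row.
    cur.execute("BEGIN IMMEDIATE;")
    try:
        row = cur.execute(
            "SELECT ip, port FROM proxy WHERE in_progress = 0 "
            "ORDER BY date_of_last_test ASC LIMIT 1;").fetchone()
        if row is not None:
            cur.execute(
                "UPDATE proxy SET in_progress = 1 WHERE ip = ? AND port = ?;",
                (row['ip'], row['port']))
        cur.execute("COMMIT;")
        return row
    except Exception:
        cur.execute("ROLLBACK;")
        raise

The tester would call claim_next_proxy(), run its test, then write the result and reset in_progress in a second, short transaction.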

Related

How to know whether a nifi process group has completed the data transfer using nipyapi?

I need to know the status of a data transfer job (a flow inside a Process Group): whether it has completed, failed, or is still running. I want to do this using nipyapi for a web application.
I have a process group in NiFi, inside which I have the NiFi flow. I am scheduling the process group using nipyapi:
nipyapi.canvas.schedule_process_group(id, True)
Now I want to monitor the status of the process group using nipyapi. By status I specifically want to know whether it's still running, failed or completed.
NiFi does not really have a concept of a job that can be checked for completion. Once you start all the components in a process group, they are then running indefinitely until someone stops them.
The concept of being "done" or "complete" is really dependent on what your data flow is doing. For example, if your first processor is, say, GetFile, once that processor is running it is going to monitor the directory for files until someone stops the processor. While the processor is running it has no way of knowing whether there will ever be more files, or whether it has already seen all the files that will ever be dropped in the directory. That knowledge is only known by whoever/whatever is putting the files there.
To determine failure you need to do something in your data flow to capture the failures. Most processors have a failure relationship, so you would need to route these somewhere and take some action to track the failures.
I think I found a good solution for this problem. This is how I solved it.
I have a MySQL database which basically keeps track of all the files that are to be transferred. The table has two columns: one for the filename (let's say it is unique) and one flag for whether the file has been transferred (True or False).
For Nifi Screenshot click here
We have three sections of processors.
First: listSFTP and putMySQL
Second: getSFTP and putHDFS
Third: listHDFS and putMySQL
The first section is responsible for listing the files on the SFTP server. It gets all the files and adds a row to MySQL saying that the filename is 'X' and 'False' (not transferred yet).
insert into NifiTest.Jobs values('${filename}', 0);
The third section does the same thing for HDFS. It will either insert a row with Transferred = True, or update the row if one already exists with the same filename.
insert into NifiTest.Jobs values('${filename}', 1) on duplicate key update TRANSFERRED = 1;
The second section does nothing but send the file to HDFS.
Now, to check when the data transfer job is finished:
You start the entire process group together. When you query the database and every row has Transferred = 1, the job is finished.
It might feel like there are cases where this could fail, but when you carefully think through all the cases you will see that it handles all the situations.
Let me know if I am wrong or if some improvement can be made to this solution.
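For illustration, polling that flag from Python could be sketched like this (the pymysql driver and the connection details are assumptions; the table and column names come from the answer above):

import time
import pymysql

def job_finished(host='localhost', user='nifi', password='secret'):
    # True once every row in NifiTest.Jobs has TRANSFERRED = 1
    conn = pymysql.connect(host=host, user=user, password=password, database='NifiTest')
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM Jobs WHERE TRANSFERRED = 0;")
            (pending,) = cur.fetchone()
            return pending == 0
    finally:
        conn.close()

while not job_finished():
    time.sleep(30)      # poll every 30 seconds
print("Data transfer complete")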
You can achieve this without a database by using the NiFi process group variable registry APIs. Create a custom processor which, as the last processor in the flow, sets a variable in the process group, let's say is_complete = true. Then you can monitor that variable with nipyapi.

SQL Stored Procedures not finishing when called from Python

I'm trying to call a stored procedure in my MSSQL database from a python script, but it does not run completely when called via python. This procedure consolidates transaction data into hour/daily blocks in a single table which is later grabbed by the python script. If I run the procedure in SQL studio, it completes just fine.
When I run it via my script, it gets cut short about two-thirds of the way through. As a workaround I currently make the program sleep for 10 seconds before moving on to the next SQL statement, but this is not time-efficient and is unreliable, as some procedures may not finish in that time. I'm looking for a more elegant way to implement this.
Current Code:
cursor.execute("execute mySP")
time.sleep(10)
cursor.commit()
The most related article I can find to my issue is here:
make python wait for stored procedure to finish executing
I tried the solution using Tornado and I/O generators, but ran into the same issue listed in the article, which was never resolved. I also tried the accepted solution of having my stored procedures set a running-status field in the database: at the beginning of my SP, Status is updated to 1 in RunningStatus, and when the SP finishes, Status is updated to 0 in RunningStatus. Then I implemented the following python code:
conn = pyodbc_connect(conn_str)
cursor = conn.cursor()
sconn = pyodbc_connect(conn_str)
scursor = sconn.cursor()

cursor.execute("execute mySP")
cursor.commit()
while 1:
    q = scursor.execute("SELECT Status FROM RunningStatus").fetchone()
    if q[0] == 0:
        break
When I implement this, the same problem happens as before: the loop reports the stored procedure as finished before it is actually complete. If I eliminate cursor.commit(), as follows, I end up with the connection just hanging indefinitely until I kill the python process.
conn = pyodbc_connect(conn_str)
cursor = conn.cursor()
sconn = pyodbc_connect(conn_str)
scursor = sconn.cursor()

cursor.execute("execute mySP")
while 1:
    q = scursor.execute("SELECT Status FROM RunningStatus").fetchone()
    if q[0] == 0:
        break
Any assistance in finding a more efficient and reliable way to implement this, as opposed to time.sleep(10) would be appreciated.
As OP found out, inconsistent or incomplete processing of stored procedures from an application layer like Python may be due to straying from best practices of T-SQL scripting.
As @AaronBertrand highlights in his Stored Procedures Best Practices Checklist blog, consider the following among other items:
Explicitly and liberally use BEGIN ... END blocks;
Use SET NOCOUNT ON to avoid messages sent to client for every row affected action, possibly interrupting workflow;
Use semicolons for statement terminators.
Example
CREATE PROCEDURE dbo.myStoredProc
AS
BEGIN
    SET NOCOUNT ON;

    SELECT * FROM foo;
    SELECT * FROM bar;
END
GO
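On the Python side, a minimal sketch (assuming pyodbc and a procedure written as above) is to drain every result set the batch produces before committing, so the work cannot be cut short; conn_str is a placeholder:

import pyodbc

conn = pyodbc.connect(conn_str)            # conn_str assumed to be defined elsewhere
cursor = conn.cursor()
cursor.execute("{CALL dbo.myStoredProc}")  # ODBC call syntax

# Consume every result set / row count the procedure returns; nextset()
# returns False only when nothing is left, i.e. the batch has finished.
while cursor.nextset():
    pass

conn.commit()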

Python script stops, no errors given

I have a python script that needs to be running all the time. Sometimes it can run for a whole day, sometimes it only runs for something like an hour.
import RPi.GPIO as GPIO
import fdb
import time

con = fdb.connect(dsn='10.100.2.213/3050:/home/trainee2/Desktop/sms', user='sysdba', password='trainee') # connect to the database
cur = con.cursor() # initialize the cursor

pinnen = [21,20,25,24,23,18,26,19,13,6,27,17] # these are the GPIO pins we use, they are the same on all PIs! We need them in this sequence.
status = [0] * 12 # in this array we'll save the status of each pin
ids = []
controlepin = [2] * 12 # this array will be the same as the status array, only one step behind; we keep it so we know where a difference occurred and needs to be sent

def getPersonIDs(): # here we get the IDs
    cur.execute("SELECT first 12 A.F_US_ID FROM T_RACK_SLOTS a order by F_RS_ID;") # here is where the code changes for each pi
    for (ID) in cur:
        ids.append(ID) # append all the ids to the array

GPIO.setmode(GPIO.BCM) # initialize GPIO
getPersonIDs() # get the ids we need
for p in range(0,12):
    GPIO.setup(pinnen[p],GPIO.IN) # set up all the pins to read out data

while True: # this will repeat endlessly
    for e in range(0,12):
        if ids[e]: # if there is a value in ids (this is only necessary for PI 3 when there are not enough users)
            status[e] = GPIO.input(pinnen[e]) # get the status of the GPIO. 0 is dark, 1 is light
            if status[e] != controlepin[e]: # if there are changes
                id = ids[e]
                if id != '': # if the id is not empty
                    if status[e] == 1: # if there is no cell phone present
                        cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 0)", (id,)) # SEND 0; careful! status 0 sends 1, status 1 sends 0 to let it make sense in the database
                    else:
                        cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 1)", (id,))
                    con.commit() # commit the query
                controlepin[e] = status[e] # save the changes so we don't spam the database
    time.sleep(1) # sleep for one second, otherwise the script will crash because of the while True
The script is used for a cellphone rack: through LDRs I can see whether a cellphone is present, and then I send that data to a Firebird database. The scripts run on my Raspberry PIs.
Can it be that the script just stops if the connection is lost for a few seconds? Is there a way to make sure the queries are always sent?
Can it be that the script just stops if the connection is lost for a few seconds?
More than that: the script IS stopping on every Firebird command, including con.commit(), and it only continues when Firebird has processed the command/query.
So, not knowing much about the Python libraries, I would still give you some advice.
1) use parameters and prepared queries as much as you can.
if status[e] == 1: # if there is no cell phone present
    cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 0)", (id,)) # SEND 0; careful! status 0 sends 1, status 1 sends 0 to let it make sense in the database
else:
    cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) values (?, 1)", (id,))
That is not the best idea: you force the Firebird engine to parse the query text and build the query again and again. A waste of time and resources.
The correct approach is to make an INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) VALUES (?, ?) query, prepare it once before the loop, and then run the already prepared query many times, changing only the parameters.
Granted, I do not know how to prepare queries in the Python library, but I think you'd find the examples.
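For what it's worth, a rough sketch of how this might look with the fdb driver (the prep() call is my reading of the library's API; check the fdb documentation for your version):

import fdb

con = fdb.connect(dsn='10.100.2.213/3050:/home/trainee2/Desktop/sms',
                  user='sysdba', password='trainee')
cur = con.cursor()

# Parse and prepare the statement once, before the loop.
insert_entry = cur.prep("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) VALUES (?, ?)")

some_id = '1'  # placeholder for ids[e] from the question's loop
cur.execute(insert_entry, (some_id, 0))  # reuse the prepared statement with new parameters
con.commit()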
2) do not use an SQL server for saving every single data element you get. It is a known malpractice that was argued against a decade ago already, especially with a lazy, versioned engine like InterBase/Firebird.
The thing is, with every statement of yours Firebird checks some internal statistics and sometimes decides the time has come to do housekeeping.
For example, your select statement can trigger garbage collection: Firebird might stop to scan the whole table, find the orphaned obsolete versions of rows and clear them away. Your insert statement can trigger index recreation: if Firebird thinks the B-tree of the index has become too one-sided, it drops it and builds a new balanced tree, reading out the whole table (and yes, reading the table may provoke GC on top of the tree recreation).
More so, let us steer away from Firebird specifics: what would you do if Firebird crashes? It is a program, and like every program it may have bugs. Or, for example, you run out of disk space and Firebird can no longer insert anything into the database - where would your hardware sensor data end up then? Wouldn't it just be lost?
http://www.translate.ru - this usually works better than Google or Microsoft translation, especially if you set the vocabulary to computers.
See #7 at http://www.ibase.ru/dontdoit/ - "Do not issue commit after every single row". #24 at https://www.ibase.ru/45-ways-to-improve-firebird-performance-russian/ suggests committing packets of about a thousand rows as a sweet spot between too many transactions and too much uncommitted data. Also check #9, #10, #16, #17 and #44 at the last link.
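A sketch of what committing in packets could look like in the question's loop (the batch size of 1000 is just the ballpark figure mentioned above; the helper name and the global counter are illustrative):

BATCH_SIZE = 1000
pending = 0  # rows inserted since the last commit

def record_change(cur, con, person_id, state):
    global pending
    cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) VALUES (?, ?)",
                (person_id, state))
    pending += 1
    if pending >= BATCH_SIZE:
        con.commit()  # one commit per packet of rows, not one per row
        pending = 0

In a low-volume sensor loop you would probably also commit on a timer, so a quiet period does not leave rows uncommitted for hours.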
The overall structure of your software complex, I believe, has to be split into two services.
One service queries data from the hardware sensors and saves it to a plain, stupid binary flat file. Since this file is the most simplistic format there can be, performance and reliability are maximized.
The other service takes the finished binary files and inserts them into the SQL database in bulk-insert mode.
So, for example, you set a threshold of 10240 rows.
Service #1 creates the file "Phones_1_Processing" with a well-defined BINARY format. It also creates and opens the "Phones_2_Processing" file, but keeps it at 0 length. Then it keeps adding rows to "Phones_1_Processing" for a while. It might also flush the OS file buffers after every row, or every 100 rows, or whatever gives the best balance between reliability and performance.
When the threshold is met, Service #1 switches to recording the incoming data into the already created and opened "Phones_2_Processing" file. This can be done instantly, by changing one file-handle variable in your program.
Then Service #1 closes and renames "Phones_1_Processing" to "Phones_1_Complete".
Then Service #1 creates a new empty file, "Phones_3_Processing", and keeps it open with zero length. Now it is back at state 1 - ready to instantly switch its recording into a new file when the current file fills up.
The key point here is that this service should only do the most simple and fast operations, since any random delay means your realtime-generated data is lost and can never be recovered. By the way, how can you disable garbage collection in Python, so it would not "stop the world" suddenly? Okay, half-joking, but the point stands: GC is a random, non-deterministic slowdown of your system, and it is a poor fit for regular, non-buffered hardware data acquisition. That primary acquisition of non-buffered data is better done by the most simple and predictable service; GC is a good global optimization, but the price is that it tends to generate sudden local no-service spikes.
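A minimal sketch of Service #1's file rotation, under the assumptions above (fixed-size binary records via struct, the 10240-row threshold, and the _Processing/_Complete naming; the record layout itself is purely illustrative):

import os
import struct
import time

RECORD = struct.Struct('<8sBd')   # person id (8 bytes), state, unix timestamp - made-up layout
THRESHOLD = 10240                 # rows per file, as in the example above

class Recorder:
    def __init__(self, folder):
        self.folder = folder
        self.index = 1
        self.rows = 0
        self.f = open(self._name(self.index), 'wb')
        open(self._name(self.index + 1), 'wb').close()   # pre-create the next file so switching is instant

    def _name(self, i, suffix='Processing'):
        return os.path.join(self.folder, 'Phones_%d_%s' % (i, suffix))

    def add(self, person_id, state):
        self.f.write(RECORD.pack(person_id.encode(), state, time.time()))
        self.f.flush()            # or flush every N rows, whatever balance suits you
        self.rows += 1
        if self.rows >= THRESHOLD:
            self._rotate()

    def _rotate(self):
        self.f.close()
        os.rename(self._name(self.index), self._name(self.index, 'Complete'))
        self.index += 1
        self.f = open(self._name(self.index), 'ab')      # the pre-created file
        open(self._name(self.index + 1), 'wb').close()   # pre-create the one after it
        self.rows = 0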
While all this happens in Service #1, we have another one.
Service #2 keeps monitoring changes in the folder you use to save the primary data. It subscribes to "some file was renamed" events and ignores the others. Which facility to use? Ask a Linux guru: inotify, dnotify, FAM, Gamin - anything of that kind that suits you.
When Service #2 is woken up with "a file was renamed and xxxx is the new name", it checks whether the new file name ends with "_Complete". If it does not, it was a bogus event.
When the event is for a new "Phone_...._Complete" file, then it is time to "bulk insert" it into Firebird. Google for "Firebird bulk insert", for example http://www.firebirdfaq.org/faq209/
Service #2 renames "Phone_1_Complete" to "Phone_1_Inserting", so the state of the data packet is persisted (as the file name).
Service #2 attaches this file to the Firebird database as an EXTERNAL TABLE.
Service #2 proceeds with the bulk insert as described above. With indexes deactivated, it opens a no-auto-undo transaction and keeps pumping the rows from the external table into the destination table. If the service or the server crashes here, you have a consistent state: the transaction gets rolled back and the file name shows it is still pending insertion.
When all the rows are pumped - frankly, if Python can work with binary files, it would be a single INSERT-FROM-SELECT - you commit the transaction, drop the external table (detaching Firebird from the file), then rename the "Phone_1_Inserting" file to "Phone_1_Done" (persisting its changed state), and then delete it.
Then Service #2 checks whether there are new "_Complete" files already waiting in the folder; if not, it goes back to step 1 and sleeps until a file-rename event wakes it up.
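A rough sketch of Service #2's insert step, assuming the fdb driver and an external-table definition whose fixed-length record layout matches what Service #1 writes (the DDL, column types and connection details are illustrative only, and the Firebird server must be configured to allow external file access):

import glob
import os
import fdb

def process_complete_files(folder, dsn, user, password):
    for path in sorted(glob.glob(os.path.join(folder, '*_Complete'))):
        inserting = path.replace('_Complete', '_Inserting')
        os.rename(path, inserting)                 # persist the packet's state in its file name

        con = fdb.connect(dsn=dsn, user=user, password=password)
        cur = con.cursor()
        # Attach the file as an external table; the columns must match the
        # binary record layout byte for byte (hence CHARACTER SET OCTETS).
        cur.execute("CREATE TABLE EXT_PHONES EXTERNAL FILE '%s' "
                    "(F_US_ID CHAR(8) CHARACTER SET OCTETS, "
                    " F_EN_STATE CHAR(1) CHARACTER SET OCTETS, "
                    " F_TS CHAR(8) CHARACTER SET OCTETS)" % inserting)
        con.commit()

        # One transaction, one commit for the whole packet; casting the octet
        # columns into the destination column types is left out for brevity.
        cur.execute("INSERT INTO T_ENTRIES (F_US_ID, F_EN_STATE) "
                    "SELECT F_US_ID, F_EN_STATE FROM EXT_PHONES")
        con.commit()

        cur.execute("DROP TABLE EXT_PHONES")       # detach Firebird from the file
        con.commit()
        con.close()

        done = inserting.replace('_Inserting', '_Done')
        os.rename(inserting, done)                 # persist the final state...
        os.remove(done)                            # ...then clean up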
All in all, you should DECOUPLE your services.
https://en.wikipedia.org/wiki/Coupling_%28computer_programming%29
One service's main responsibility is to be ready, without even a tiny pause, to receive and save the data flow; the other service's responsibility is to transfer the saved data into the SQL database for ease of processing, and it is not a big issue if it sometimes lags for a few seconds, as long as it does not lose data in the end.

Run python job every x minutes

I have a small python script that basically connects to a SQL Server (Microsoft) database and gets users from there, and then syncs them to another MySQL database. Basically I'm just running queries to check if a user exists and, if not, adding that user to the MySQL database.
The script would usually take around 1 min to sync. I require the script to do its work every 5 mins (for example), exactly once per interval (one sync per 5 mins).
What would be the best way to go about building this?
I have some test data for the users, but on the real site there's a lot more users, so I can't guarantee the script takes 1 min to execute; it might even take 20 mins. However, having an interval of, say, 15 mins every time the script executes would be ideal for the problem...
Update:
I have the connection params for the SQL Server Windows DB, so I'm using a small Ubuntu server to sync between the two databases located on different servers. So let's say db1 (Windows) and db2 (Linux) are the database servers; I'm using s1 (a Python server) with the pymssql and MySQL Python modules to sync.
Regards
I am not sure cron is right for the job. It seems to me that if you have it run every 15 minutes, but sometimes a sync takes 20 minutes, you could have multiple processes running at once and possibly colliding.
If the driving force is a constant wait time between the variable execution times, then you might need a continuously running process with a wait.
def main():
    loopInt = 0
    while loopInt < 10000:
        synchDatabase()
        loopInt += 1
        print("call #" + str(loopInt))
        time.sleep(300)  # sleep 5 minutes

main()
(Obviously not continuous, but long-running.) You can change the while condition to True and it will run continuously (comment out loopInt += 1).
Edited to add: please see the note in the comments about monitoring the process, as you don't want the script to hang or crash without you being aware of it.
You might want to use a system that handles queues, for example RabbitMQ, and use Celery as the python interface to implement it. With Celery, you can add tasks (like execution of a script) to a queue or run a schedule that'll perform a task after a given interval (just like cron).
Get started http://celery.readthedocs.org/en/latest/
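For illustration, a periodic task with a recent Celery might be sketched roughly like this (the broker URL, the tasks.py module name, and the empty sync_users body are assumptions, not something from the question):

# tasks.py - start with:  celery -A tasks worker -B --concurrency=1 --loglevel=info
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')  # RabbitMQ broker assumed

@app.task
def sync_users():
    # the existing pymssql -> MySQL sync logic would go here
    pass

app.conf.beat_schedule = {
    'sync-every-5-minutes': {
        'task': 'tasks.sync_users',
        'schedule': 300.0,   # seconds
    },
}

With --concurrency=1, queued runs execute one after another, so two syncs never overlap even if a run takes longer than the interval.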

Why doesn't this loop display an updated object count every five seconds?

I use this python code to output the number of Things every 5 seconds:
def my_count():
    while True:
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)

my_count()
If another process generates a new Thing while my_count() is running, my_count() will keep printing the same number, even though it now has changed in the database. (But if I kill my_count() and restart it, it will display the new Thing count.)
Things are stored in a MySQL InnoDB database, and this code runs on Ubuntu.
Why won't my_count() display the new Thing.objects.count() without being restarted?
Because the Python DB API is by default in AUTOCOMMIT=OFF mode, and (at least for MySQLdb) on the REPEATABLE READ isolation level. This means that behind the scenes you have an ongoing database transaction (InnoDB is a transactional engine) in which the first access to a given row (or maybe even table, I'm not sure) fixes the "view" of this resource for the remaining part of the transaction.
To prevent this behaviour, you have to 'refresh' the current transaction:
from django.db import transaction

@transaction.autocommit
def my_count():
    while True:
        transaction.commit()
        print "Number of Things: %d" % Thing.objects.count()
        time.sleep(5)
-- note that the transaction.autocommit decorator is only for entering transaction management mode (this could also be done manually using the transaction.enter_transaction_management / leave_transaction_management functions).
One more thing to be aware of: Django's autocommit is not the same autocommit you have in the database - it's completely independent. But this is out of scope for this question.
Edited on 22/01/2012
Here is a "twin answer" to a similar question.
