web2py Scheduled task to recreate (reset) database - python

I am dealing with a CRON job that places a text file with 9000 lines device names.
The job recreates the file every day with an updated list from a network crawler in our domain.
What I was running into is when I have the following worker running my import into my database the db.[name].id kept growing with this method below
scheduler.py
# -*- coding: utf-8 -*-
from gluon.scheduler import Scheduler
def demo1():
db(db.asdf.id>0).delete()
db.commit()
with open('c:\(project)\devices.list') as f:
content = f.readlines()
for line in content:
db.asdf.insert(asdf = line)
db.commit()
mysched = Scheduler(db, tasks = dict(demo1 = demo1) )
default.py (initial kickoff)
#auth.requires_membership('!Group-IS_MASTER')
def rgroup():
mysched.queue_task('demo1',start_time=request.now,stop_time = None,prevent_drift=True,repeats=0,period=86400)
return 'you are member of a group!'
So the next time the job kicked off it would start at db.[name].id = 9001. So every day the ID number would grow by 9000 or so depending on the crawler's return. It just looked sloppy and I didn't want to run into issues years down the road with database limitations that I don't know about.
(I'm a DB newb (I know, I don't know stuff))
SOOOOOOO.....
This is what I came up with and I don't know if this is the best practice or not. And an issue that I ran into when using db.[name].drop() in the same function that is creating entries is the db tables didn't exist and my job status went to 'FAILED'. So I defined the table in the job. see below:
scheduler.py
from gluon.scheduler import Scheduler
def demo1():
db.asdf.drop() #<=====Kill db.asdf
db.commit() #<=====Commit Kill
db.define_table('asdf',Field('asdf'),auth.signature ) #<==== Phoenix Rebirth!!!
with open('c:\(project)\devices.list') as f:
content = f.readlines()
for line in content:
db.asdf.insert(asdf = line)
db.commit() #<=========== Magic
mysched = Scheduler(db, tasks = dict(demo1 = demo1) )
In the line of Phoenix Rebirth in the comments of code above. Is that the best way to achieve my goal?
It starts my ID back at 1 and that's what I want but is that how I should be going about it?
Thanks!
P.S. Forgive my example with windows dir structure as my current non-prod sandbox is my windows workstation. :(

Why wouldn't you check if the line is present prior to inserting its corresponding record ?
...
with open('c:\(project)\devices.list') as f:
content = f.readlines()
for line in content:
# distinguishing t_ for tables and f_ for fields
db_matching_entries = db(db.t_asdf.f_asdf==line).select()
if len(db_matching_entries) == 0:
db.t_asdf.insert(f_asdf = line)
else:
# here you could update your record, just in case ;-)
pass
db.commit() #<=========== Magic
Got a similar process that takes few seconds to complete with 2k-3k entries. Yours should not take longer than half a minute.

Related

Post beanstalk queue data to Mssql using python script

Guys I'm currently developing an offline ALPR solution.
So far I've used OpenAlpr software running on Ubuntu. By using a python script I found on StackOverlFlow I'm able to read the beanstalk queue data (plate number & meta data) of the ALPR but I need to send this data from the beanstalk queue to a mssql database. Does anyone know how to export beanstalk queue data or JSON data to the database? The code below is for local-host, how do i modify it to automatically post data to the mssql database? The data in the beanstalk queue is in JSON format [key=value].
The read & write csv was my addition to see if it can save the json data as csv on localdisk
import beanstalkc
import json
from pprint import pprint
beanstalk = beanstalkc.Connection(host='localhost', port=11300)
TUBE_NAME='alprd'
text_file = open('output.csv', 'w')
# For diagnostics, print out a list of all the tubes available in Beanstalk.
print beanstalk.tubes()
# For diagnostics, print the number of items on the current alprd queue.
try:
pprint(beanstalk.stats_tube(TUBE_NAME))
except beanstalkc.CommandFailed:
print "Tube doesn't exist"
# Watch the "alprd" tube; this is where the plate data is.
beanstalk.watch(TUBE_NAME)
while True:
# Wait for a second to get a job. If there is a job, process it and delete it from the queue.
# If not, return to sleep.
job = beanstalk.reserve(timeout=5000)
if job is None:
print "No plates yet"
else:
plates_info = json.loads(job.body)
# Do something with this data (e.g., match a list, open a gate, etc.).
# if 'data_type' not in plates_info:
# print "This shouldn't be here... all OpenALPR data should have a data_type"
# if plates_info['data_type'] == 'alpr_results':
# print "Found an individual plate result!"
if plates_info['data_type'] == 'alpr_group':
print "Found a group result!"
print '\tBest plate: {} ({:.2f}% confidence)'.format(
plates_info['candidates'][0]['plate'],
plates_info['candidates'][0]['confidence'])
make_model = plates_info['vehicle']['make_model'][0]['name']
print '\tVehicle information: {} {} {}'.format(
plates_info['vehicle']['year'][0]['name'],
plates_info['vehicle']['color'][0]['name'],
' '.join([word.capitalize() for word in make_model.split('_')]))
elif plates_info['data_type'] == 'heartbeat':
print "Received a heartbeat"
text_file.write('Best plate')
# Delete the job from the queue when it is processed.
job.delete()
text_file.close()
AFAIK there is no way to directly export data from beanstalkd.
What you have makes sense, that is streaming data out of a tube into a flat file or performing a insert into the DB directly
Given the IOPS beanstalkd can be produced, it might still be a reasonable solution (depends on what performance you are expecting)
Try asking https://groups.google.com/forum/#!forum/beanstalk-talk as well

peewee savepoint does not exist

I'm using peewee to interface a MySQL database. I have a list of entries which must be inserted into database and updated in case they're already present there. I'm using create_or_get function for this. I also use threading to speed up the process; code looks like this:
# pool is just a map wrapper around standard threading module
pool = utils.ThreadPool()
for page in xrange(0, pages):
pool.add_task(self.update_page, page)
pool.wait_completion()
def update_page(self, num):
for entry in self.get_entries_from_page(num):
self.push_entry(entry)
def push_entry(self, entry):
with _db.execution_context():
result, new = EntryModel.create_or_get(**entry) # << error here
if not new :
if entry['date'] > result.date:
result.hits += 1
result.date = track['date']
result.save()
Database initialization:
_db.initialize(playhouse.pool.PooledMySQLDatabase(n, user = u, passwd = w, host = h, port = p))
Everything was running smoothly, but all of sudden I began to receive a lot of errors on the mentioned line:
(1305, 'SAVEPOINT s449cd5a8d165440aaf47b205e2932362 does not exist')
Savepoint number changes every time and data is not being written to database. Recreating database did not help. What can lead to this error?
Try removing autocommit=True during database connection create.

Python- performance issue with mysql and google play subscription

I am trying to extract the purchase_state from google play, following the steps bellow:
import base64
import requests
import smtplib
from collections import OrderedDict
import mysql.connector
from mysql.connector import errorcode
......
Query db ,returning thousand of lines with purchase_idfield from my table
Check for each row from db, and extract purchase_id, then query google play for all of them. for example if the results of my previous
step is 1000, 1000 times is querying google (refresh token + query).
Add a new field purchase status from the google play to a new dictionary apart from some other fields whcih are grabbed from mysql query.
The last step is doing a loop over my dic as a follow and prepare the desirable report
AFTER EDITED:
def build_dic_from_db(data,access_token):
dic = {}
for row in data:
product_id = row['product_id']
purchase_id = row['purchase_id']
status = check_purchase_status(access_token, product_id,purchase_id)
cnt = 1
if row['user'] not in dic:
dic[row['user']] = {'id':row['user_id'],'country': row['country_name'],'reg_ts': row['user_registration_timestamp'],'last_active_ts': row['user_last_active_action_timestamp'],
'total_credits': row['user_credits'],'total_call_sec_this_month': row['outgoing_call_seconds_this_month'],'user_status':row['user_status'],'mobile':row['user_mobile_phone_number_num'],'plus':row['user_assigned_msisdn_num'],
row['product_id']:{'tAttemp': cnt,'tCancel': status}}
else:
if row['product_id'] not in dic[row['user']]:
dic[row['user']][row['product_id']] = {'tAttemp': cnt,'tCancel':status}
else:
dic[row['user']][row['product_id']]['tCancel'] += status
dic[row['user']][row['product_id']]['tAttemp'] += cnt
return dic
The problem is that my code is working slowly ~ Total execution time: 448.7483880519867 and I am wondering if there is away to improve my script. Is there any suggestion?
I hope I'm right about this, but the bottleneck seems to be the connection to the playstore. Doing it sequentially will take a long time, whereas the server is able to take a million requests at a time. So here's a way to process your jobs with executors (you need the "concurrent" package installed)
In that example, you'll be able to send 100 requests at the same time.
from concurrent import futures
EXECUTORS = futures.ThreadPoolExecutor(max_workers=100)
jobs = dict()
for row in data:
product_id = row['product_id']
purchase_id = row['purchase_id']
job = EXECUTORS.submit(check_purchase_status,
access_token, product_id,purchase_id)
jobs[job] = row
for job in futures.as_completed(jobs.keys()):
# here collect your results and do something useful with them :)
status = job.result()
# make the connection with current row
row = jobs[job]
# now you have your status and the row
And BTW try to use temp variables or you're constantly accessing your dictionary with the same keys, which is not good for performance AND readability of your code.

MySQL connection/query make file not work

I've got some test code I'm working on. In a separate HTML file, a button onclick event gets the URL of the page and passes it as a variable (jquery_input) to this python script. Python then scrapes the URL and identifies two pieces of data, which it then formats and concatenates together (resulting in the variable lowerCaseJoined). This concatenated variable has a corresponding entry in a MySQL database. With each entry in the db, there is an associated .gif file.
From here, what I'm trying to do is open a connection to the MySQL server and query the concatenated variable against the db to get the associated .gif file.
Once this has been accomplished, I want to print the .gif file as an alert on the webpage.
If I take out the db section of the code (connection, querying), the code runs just fine. Also, I am successfully able to execute the db part of the code independently through the Python shell. However, when the entire code resides in one file, nothing happens when I click the button. I've systematically removed the lines of code related to the db connection, and my code begins stalling out at the first line (db = MySQLdb.connection...). So it looks like as soon as I start trying to connect to the db, the program goes kaput.
Here is the code:
#!/usr/bin/python
from bs4 import BeautifulSoup as Soup
import urllib
import re
import cgi, cgitb
import MySQLdb
cgitb.enable() # for troubleshooting
# the cgi library gets the var from the .html file
form = cgi.FieldStorage()
jquery_input = form.getvalue("stuff_for_python", "nothing sent")
# the next section scrapes the URL,
# finds the call no and location,
# formats them, and concatenates them
content = urllib.urlopen(jquery_input).read()
soup = Soup(content)
extracted = soup.find_all("tr", {"class": "bibItemsEntry"})
cleaned = str(extracted)
start = cleaned.find('browse') +8
end = cleaned.find('</a>', start)
callNo = cleaned[start:end]
noSpacesCallNo = callNo.replace(' ', '')
noSpacesCallNo2 = noSpacesCallNo.replace('.', '')
startLoc = cleaned.find('field 1') + 13
endLoc = cleaned.find('</td>', startLoc)
location = cleaned[startLoc:endLoc]
noSpacesLoc = location.replace(' ', '')
joined = (noSpacesCallNo2+noSpacesLoc)
lowerCaseJoined = joined.lower()
# the next section establishes a connection
# with the mySQL db and queries it
# using the call/loc code (lowerCaseJoined)
db = MySQLdb.connect(host="localhost", user="...", "passwd="...",
db="locations")
cur = db.cursor()
queryDb = """
SELECT URL FROM locations WHERE location = %s
"""
cur.execute(queryDb, lowerCaseJoined)
result = cur.fetchall()
cur.close()
db.close()
# the next 2 'print' statements are important for web
print "Content-type: text/html"
print
print result
Any ideas what I'm doing wrong?
I'm new at programming, so I'm sure there's a lot that can be improved upon here. But prior to refining it I just want to get the thing to work!
I figured out the problem. Seems that I had quotation mark before the password portion of the db connection line. Things are all good now.

Python MySQL queue: run code/queries in parallel, but commit separately

Program 1 inserts some jobs into a table job_table.
Program 2 needs to :
1. get the job from the table
2. handle the job
-> this needs to be multi-threaded (because each job involves urllib waiting time, which should run in parallel)
3. insert the results into my_other_table, commiting the result
Any good (standard?) ways to implement this? The issue is that commiting inside one thread, also commits the other threads.
I was able to pick the records from the mysql table and put them in queue later get them from queue but not able to insert into a new mysql table.
Here i am able to pick up only the new records when ever they fall into the table.
Hope this may help you.
Any mistakes please assist me.
from threading import Thread
import time
import Queue
import csv
import random
import pandas as pd
import pymysql.cursors
from sqlalchemy import create_engine
import logging
queue = Queue.Queue(1000)
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-9s) %(message)s', )
conn = pymysql.connect(conn-details)
cursor = conn.cursor()
class ProducerThread(Thread):
def run(self):
global queue
cursor.execute("SELECT ID FROM multi ORDER BY ID LIMIT 1")
min_id = cursor.fetchall()
min_id1 = list(min_id[0])
while True:
cursor.execute("SELECT ID FROM multi ORDER BY ID desc LIMIT 1")
max_id = cursor.fetchall()
max_id1 = list(max_id[0])
sql = "select * from multi where ID between '{}' and '{}'".format(min_id1[0], max_id1[0])
cursor.execute(sql)
data = cursor.fetchall()
min_id1[0] = max_id1[0] + 1
for row in data:
num = row
queue.put(num) # acquire();wait()
logging.debug('Putting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')
class ConsumerThread(Thread):
def run(self):
global queue
while True:
num = queue.get()
print num
logging.debug('Getting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')
**sql1 = """insert into multi_out(ID,clientname) values ('%s','%s')""",num[0],num[1]
print sql1
# cursor.execute(sql1, num)
cursor.execute("""insert into multi_out(ID,clientname) values ('%s','%s')""",(num[0],num[1]))**
# conn.commit()
# conn.close()
def main():
ProducerThread().start()
num_of_consumers = 20
for i in range(num_of_consumers):
ConsumerThread().start()
main()
What probably happens is you share the MySQL connection between the two threads. Try creating a new MySQL connection inside each thread.
For program 2, look at http://www.celeryproject.org/ :)
This is a common task when doing some sort of web crawling. I have implemented a single thread which grabs a job, waits for the http response, then writes the response to a database table.
The problems I have come across with my method, is you need to lock the table where you are grabbing jobs from, and mark them as in progress or complete, in order for multiple threads to not try and grab the same task.
Just used threading.Thread in python and override the run method.
Use 1 database connection per thread. (some db libraries in python are not thread safe)
If you have X number of threads running, periodically reading from the jobs table then MySQL will do the concurrency for you.
Or if you need even more assurance, you can always lock the jobs table yourself before reading the next available entry. This way you can be 100% sure that a single job will only be processed once.
As #Martin said, keep connections separate for all threads. They can use the same credentials.
So in short:
Program one -> Insert into jobs
Program two -> Create a write lock on the jobs table, so no one else can read from it
Program two -> read next available job
Program two -> Unlock the table
Do everything else as usual, MySQL will handle concurrency

Categories