Post beanstalk queue data to MSSQL using a Python script - python

Guys, I'm currently developing an offline ALPR solution.
So far I've used the OpenALPR software running on Ubuntu. Using a Python script I found on Stack Overflow I'm able to read the beanstalk queue data (plate number and metadata) produced by the ALPR, but I need to send this data from the beanstalk queue to an MSSQL database. Does anyone know how to export beanstalk queue data, or JSON data, to the database? The code below connects to localhost; how do I modify it to automatically post the data to the MSSQL database? The data in the beanstalk queue is in JSON format [key=value].
The CSV read & write was my addition, to see if it could save the JSON data as CSV on the local disk.
import beanstalkc
import json
from pprint import pprint

beanstalk = beanstalkc.Connection(host='localhost', port=11300)

TUBE_NAME = 'alprd'

text_file = open('output.csv', 'w')

# For diagnostics, print out a list of all the tubes available in Beanstalk.
print beanstalk.tubes()

# For diagnostics, print the number of items on the current alprd queue.
try:
    pprint(beanstalk.stats_tube(TUBE_NAME))
except beanstalkc.CommandFailed:
    print "Tube doesn't exist"

# Watch the "alprd" tube; this is where the plate data is.
beanstalk.watch(TUBE_NAME)

while True:
    # Wait for a job. If there is one, process it and delete it from the queue.
    # If the reserve times out, loop and wait again.
    job = beanstalk.reserve(timeout=5000)
    if job is None:
        print "No plates yet"
    else:
        plates_info = json.loads(job.body)

        # Do something with this data (e.g., match a list, open a gate, etc.).
        # if 'data_type' not in plates_info:
        #     print "This shouldn't be here... all OpenALPR data should have a data_type"
        # if plates_info['data_type'] == 'alpr_results':
        #     print "Found an individual plate result!"
        if plates_info['data_type'] == 'alpr_group':
            print "Found a group result!"
            print '\tBest plate: {} ({:.2f}% confidence)'.format(
                plates_info['candidates'][0]['plate'],
                plates_info['candidates'][0]['confidence'])
            make_model = plates_info['vehicle']['make_model'][0]['name']
            print '\tVehicle information: {} {} {}'.format(
                plates_info['vehicle']['year'][0]['name'],
                plates_info['vehicle']['color'][0]['name'],
                ' '.join([word.capitalize() for word in make_model.split('_')]))
        elif plates_info['data_type'] == 'heartbeat':
            print "Received a heartbeat"

        text_file.write('Best plate')

        # Delete the job from the queue once it has been processed.
        job.delete()

text_file.close()

AFAIK there is no way to directly export data from beanstalkd.
What you have makes sense: stream data out of the tube and either write it to a flat file or perform an insert into the DB directly.
Given the IOPS beanstalkd can produce, it might still be a reasonable solution (depending on what performance you are expecting).
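If you go the direct-insert route, a minimal sketch using pyodbc might look like the following; the driver string, connection details, table and columns are assumptions, so adjust them to your environment and schema:
from datetime import datetime
import pyodbc

# Assumed connection details -- replace with your own server, database and credentials.
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=your_sql_server;DATABASE=alpr;UID=alpr_user;PWD=your_password')
cursor = conn.cursor()

def store_plate(plates_info):
    # Assumes a table like: CREATE TABLE plates (plate VARCHAR(16), confidence FLOAT, captured_at DATETIME)
    best = plates_info['candidates'][0]
    cursor.execute(
        'INSERT INTO plates (plate, confidence, captured_at) VALUES (?, ?, ?)',
        (best['plate'], best['confidence'], datetime.now()))
    conn.commit()
Calling store_plate(plates_info) inside the alpr_group branch of your loop (instead of, or alongside, the CSV write) would push each result into the database as it arrives.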
Try asking https://groups.google.com/forum/#!forum/beanstalk-talk as well

Related

How to save progress from scraping reddit data as database file using python

I wrote a script for getting some posts from reddit.
import praw
import pandas as pd

reddit = praw.Reddit(client_id='*******',
                     client_secret='*******',
                     user_agent='**********',
                     username='********',
                     password='*******8')

subreddit1 = reddit.subreddit("Tea")
subreddit2 = reddit.subreddit("Biophysics")

top_subreddit1 = subreddit1.top(limit=500)
top_subreddit2 = subreddit2.top(limit=500)

topics_dict = {"title": [],
               "score": [],
               "id": [],
               "url": [],
               "comms_num": [],
               "created": [],
               "body": []}

for submission1 in top_subreddit1:
    topics_dict["title"].append(submission1.title)
    topics_dict["score"].append(submission1.score)
    topics_dict["id"].append(submission1.id)
    topics_dict["url"].append(submission1.url)
    topics_dict["comms_num"].append(submission1.num_comments)
    topics_dict["created"].append(submission1.created)
    topics_dict["body"].append(submission1.selftext)

for submission2 in top_subreddit2:
    topics_dict["title"].append(submission2.title)
    topics_dict["score"].append(submission2.score)
    topics_dict["id"].append(submission2.id)
    topics_dict["url"].append(submission2.url)
    topics_dict["comms_num"].append(submission2.num_comments)
    topics_dict["created"].append(submission2.created)
    topics_dict["body"].append(submission2.selftext)

topics_data = pd.DataFrame(topics_dict)
topics_data
But it only displays in my Jupyter notebook.
Now I want to save my progress as a database file. All advice is appreciated.
You have a couple of options. I'll present two, each with their pros and cons:
1. CSV
Simply save your file to a .csv using DataFrame.to_csv:
topics_data.to_csv('path_to_file.csv')
You can then proceed to parse this csv file in your client application, i.e., whatever application is going to use your scraped data.
Pros
Simple to save
Very barebones; further processing would be simple
Cons
Very barebones; if you need any sort of structure, you won't have any flexibility
2. SQLite
You can also opt to store the dataframe in sqlite using DataFrame.to_sql:
import sqlite3
db_file = 'my.db'
# This creates a new database file if it doesn't exist
db_conn = sqlite3.connect(db_file)
# This creates a new table 'topics_data' if it doesn't exist
topics_data.to_sql('topics_data', con=db_conn)
Pros
Might be easier to parse for your client
SQL is a very strong query language. You can take advantage of this
Cons
Could be overkill if all you need is basic data transfer
Find out more about sqlite here: sqlite tutorial
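As a quick usage sketch (assuming the 'topics_data' table and my.db file from above), the client side could read everything back into a DataFrame with pandas.read_sql_query:
import sqlite3
import pandas as pd

db_conn = sqlite3.connect('my.db')
# Read the whole table back into a DataFrame
topics_data = pd.read_sql_query('SELECT * FROM topics_data', db_conn)
db_conn.close()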
To save data locally for later internal Python use, you can use the built-in pickle module:
import pickle

def save_obj(obj, name):
    with open(f'{name}.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    try:
        with open(f'{name}.pkl', 'rb') as f:
            obj = pickle.load(f)
        print("")
        print(f"loaded {name}")
        print("")
        return obj
    except Exception as e:
        print("")
        print(f"Error loading object '{name}': {e}")
        print("")

Django - Heroku - Serving dynamic large file timing out

I have one function-based view that creates a report using xlsxwriter. The workbook is built on the fly using a StringIO buffer and finally sent back through an HttpResponse. It works well on the local development server.
The problem is that on Heroku the request times out after about 30 seconds (the documentation says this limit is not configurable), the web process is restarted, and an error is returned as the response.
What is the best way to:
create an xlsx file on the fly (dynamically) in memory,
serve the entire file to the client,
and prevent the server from timing out because of the long-running process?
This is a piece of the code I am using:
def reporte_usuarios(request):
    from xlsxwriter.workbook import Workbook
    try:
        import cStringIO as StringIO
    except ImportError:
        import StringIO

    # create a workbook in memory
    output = StringIO.StringIO()
    workbook = Workbook(output)
    bold = workbook.add_format({'bold': True})

    # get the data
    from django.db.models import Count
    usuarios = User.objects.filter(....... # all filter stuff
    for usr in usuarios:
        if usr.activos > 0:
            # create a workbook sheet for every registered user
            ws = workbook.add_worksheet(u'%s' % usr.username)
            # some relevant user data
            ws.write(1, 1, u'USUARIO: %s' % usr.username)
            ...
            # get rows for the user
            log = LogActivos.objects.filter(usuario=usr).select_related('activo__unidad__id', 'activo__unidad__nombre', 'activo__nombre')
            # write headers
            ws.write(3, 0, u'FECHA', bold)
            ...
            sig_fila = 4  # starting row for data (after headers)
            for l in log:
                # write all data
                ws.write(sig_fila, 0, u'%s' % l.fecha)
                ...
                sig_fila += 1

    # close the workbook
    workbook.close()
    # go to the beginning of the buffer
    output.seek(0)
    # response using the buffer
    response = HttpResponse(output.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    response['Content-Disposition'] = 'attachment; filename="ACTIVOS_USUARIOS__%s.xlsx"' % datetime.now().strftime("%Y%m%d_%H%M")
    return response
Notes: I am using Gunicorn on Heroku, django 1.9.13 and python 2.7.11
IMHO you should follow a totally different approach in this case.
As you are generating a rather large file, it's normal for the request to hit the timeout.
What you could do instead is deploy a background task queue, like Celery or django-rq. With that, a background task creates the file from your user's data, and you can then let your user know it's ready by any means you like, such as a notification or an email.
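As a rough sketch of that approach (assuming Celery is already configured for the project; the task name, storage location and notification step are illustrative, not a drop-in implementation):
# tasks.py -- minimal background-task sketch (names and storage are assumptions)
import io
from datetime import datetime

from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from xlsxwriter.workbook import Workbook


@shared_task
def build_usuarios_report():
    output = io.BytesIO()
    workbook = Workbook(output)
    # ... build the worksheets exactly as in the view above ...
    workbook.close()

    filename = 'ACTIVOS_USUARIOS__%s.xlsx' % datetime.now().strftime("%Y%m%d_%H%M")
    # Save the finished file somewhere the web process can serve it from later
    # (default_storage is local disk or S3, depending on your settings).
    path = default_storage.save(filename, ContentFile(output.getvalue()))

    # Notify the user here (email, in-app notification, ...) that `path` is ready.
    return path
The view then just enqueues build_usuarios_report.delay() and returns immediately, so the request stays well under Heroku's 30-second limit.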
If you need more details regarding how you can do something like this, let me know and I can help :)

Spark streaming not reading from local directory

I am trying to write a Spark Streaming application using the Spark Python API.
The application should read text files from a local directory and send them to a Kafka cluster.
When I submit the Python script to the Spark engine, nothing is sent to Kafka at all.
I tried to print the events instead of sending them to Kafka and found that nothing is read.
Here is the code of the script.
#!/usr/lib/python
# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from kafka import KafkaProducer
import sys
import time

reload(sys)
sys.setdefaultencoding('utf8')

producer = KafkaProducer(bootstrap_servers="kafka-b01.css.org:9092,kafka-b02.css.org:9092,kafka-b03.css.org:9092,kafka-b04.css.org:9092,kafka-b05.css.org:9092")

def send_to_kafka(rdd):
    tweets = rdd.collect()
    print ("--------------------------")
    print (tweets)
    print "--------------------------"
    #for tweet in tweets:
    #    producer.send('test_historical_job', value=bytes(tweet))

if __name__ == "__main__":
    conf = SparkConf().setAppName("TestSparkFromPython")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)
    tweetsDstream = ssc.textFileStream("/tmp/historical/")
    tweetsDstream.foreachRDD(lambda rdd: send_to_kafka(rdd))
    ssc.start()
    ssc.awaitTermination()
I am submitting the script using this command
./spark-submit --master spark://spark-master:7077 /apps/historical_streamer.py
The output of the print statement is an empty list.
--------------------------
[]
--------------------------
EDIT
Based on this question, I changed the path of the data directory from "/tmp/historical/" to "file:///tmp/historical/".
I tried running the job first and then moving files into the directory, but unfortunately that did not work either.
File-stream-based sources like fileStream or textFileStream expect data files to be created in the dataDirectory by atomically moving or renaming them into the data directory.
If there are no new files in a given window there is nothing to process, so pre-existing files (which seems to be the case here) won't be read and won't show up in the output.
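For example, one way to test this (a sketch; the staging path is just a placeholder) is to start the streaming job first and only then move complete files into the watched directory with an atomic rename on the same filesystem:
import os

STAGING_DIR = "/tmp/staging/"       # write or copy the finished file here first
WATCHED_DIR = "/tmp/historical/"    # the directory passed to textFileStream

def publish(filename):
    src = os.path.join(STAGING_DIR, filename)
    dst = os.path.join(WATCHED_DIR, filename)
    # os.rename is atomic on the same filesystem, so Spark only ever sees
    # complete files appearing in the watched directory.
    os.rename(src, dst)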
Your function:
def send_to_kafka(rdd):
    tweets = rdd.collect()
    print ("--------------------------")
    print (tweets)
    print "--------------------------"
    #for tweet in tweets:
    #    producer.send('test_historical_job', value=bytes(tweet))
will collect the whole RDD into a list and print that list in one go, rather than printing each element on its own line. To print element by element, the Spark documentation shows the Scala idiom rdd.foreach(println); in PySpark you would instead loop over the collected list (or use rdd.foreach with a Python function).
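In PySpark, a rough equivalent of printing element by element (a sketch; collect() still brings the data back to the driver first) would be:
def print_rdd(rdd):
    # Print each element on its own line, on the driver
    for tweet in rdd.collect():
        print(tweet)

tweetsDstream.foreachRDD(print_rdd)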
Hope this will help

web2py Scheduled task to recreate (reset) database

I am dealing with a cron job that produces a text file with about 9000 lines of device names.
The job recreates the file every day with an updated list from a network crawler in our domain.
What I was running into is that, with the following worker doing my import into the database, db.[name].id kept growing with the method below:
scheduler.py
# -*- coding: utf-8 -*-
from gluon.scheduler import Scheduler

def demo1():
    db(db.asdf.id > 0).delete()
    db.commit()
    with open('c:\(project)\devices.list') as f:
        content = f.readlines()
        for line in content:
            db.asdf.insert(asdf=line)
        db.commit()

mysched = Scheduler(db, tasks=dict(demo1=demo1))
default.py (initial kickoff)
#auth.requires_membership('!Group-IS_MASTER')
def rgroup():
    mysched.queue_task('demo1', start_time=request.now, stop_time=None, prevent_drift=True, repeats=0, period=86400)
    return 'you are member of a group!'
So the next time the job kicked off it would start at db.[name].id = 9001, and every day the ID would grow by another 9000 or so depending on what the crawler returned. It just looked sloppy and I didn't want to run into issues years down the road with database limitations that I don't know about.
(I'm a DB newb (I know, I don't know stuff))
SOOOOOOO.....
This is what I came up with, and I don't know if it is best practice or not. One issue I ran into when using db.[name].drop() in the same function that creates the entries is that the db table didn't exist yet and my job status went to 'FAILED'. So I define the table inside the job; see below:
scheduler.py
from gluon.scheduler import Scheduler

def demo1():
    db.asdf.drop()    #<===== Kill db.asdf
    db.commit()       #<===== Commit Kill
    db.define_table('asdf', Field('asdf'), auth.signature)  #<==== Phoenix Rebirth!!!
    with open('c:\(project)\devices.list') as f:
        content = f.readlines()
        for line in content:
            db.asdf.insert(asdf=line)
        db.commit()   #<=========== Magic

mysched = Scheduler(db, tasks=dict(demo1=demo1))
Regarding the "Phoenix Rebirth" line in the comments above: is that the best way to achieve my goal?
It starts my IDs back at 1, which is what I want, but is that how I should be going about it?
Thanks!
P.S. Forgive my example with windows dir structure as my current non-prod sandbox is my windows workstation. :(
Why wouldn't you check whether the line is already present before inserting its corresponding record?
...
with open('c:\(project)\devices.list') as f:
    content = f.readlines()
    for line in content:
        # distinguishing t_ for tables and f_ for fields
        db_matching_entries = db(db.t_asdf.f_asdf == line).select()
        if len(db_matching_entries) == 0:
            db.t_asdf.insert(f_asdf=line)
        else:
            # here you could update your record, just in case ;-)
            pass
    db.commit()  #<=========== Magic
I have a similar process that takes a few seconds to complete with 2k-3k entries. Yours should not take longer than half a minute.
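A more compact variant of the same check-then-insert idea (a sketch; it assumes the web2py DAL's update_or_insert helper and the t_asdf/f_asdf names used above):
with open('c:\(project)\devices.list') as f:
    for line in f:
        # Insert the row only if no record with this value exists yet,
        # otherwise update the existing record in place.
        db.t_asdf.update_or_insert(db.t_asdf.f_asdf == line, f_asdf=line)
    db.commit()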

couchdb-python change notifications

I'm trying to use couchdb.py to create and update databases. I'd like to implement change notifications, preferably in continuous mode. Running the test code posted below, I don't see how the changes scheme works within Python.
import time

import couchdb
from couchdb.mapping import Document, IntegerField, TextField

class SomeDocument(Document):
    #############################################################################
    # def __init__ (self):
    intField = IntegerField()  # for now - this should be an integer
    textField = TextField()

couch = couchdb.Server('http://127.0.0.1:5984')

databasename = 'testnotifications'
if databasename in couch:
    print 'Deleting then creating database ' + databasename + ' from server'
    del couch[databasename]
    db = couch.create(databasename)
else:
    print 'Creating database ' + databasename + ' on server'
    db = couch.create(databasename)

for iii in range(5):
    doc = SomeDocument(intField=iii, textField='somestring' + str(iii))
    doc.store(db)
    print doc.id + '\t' + doc.rev

something = db.changes(feed='continuous', since=4, heartbeat=1000)

for iii in range(5, 10):
    doc = SomeDocument(intField=iii, textField='somestring' + str(iii))
    doc.store(db)
    time.sleep(1)
    print something
    print db.changes(since=iii-1)
The value
db.changes(since=iii-1)
returns information that is of interest, but in a format from which I haven't worked out how to extract the sequence or revision numbers, or the document information:
{u'last_seq': 6, u'results': [{u'changes': [{u'rev': u'1-9c1e4df5ceacada059512a8180ead70e'}], u'id': u'7d0cb1ccbfd9675b4b6c1076f40049a8', u'seq': 5}, {u'changes': [{u'rev': u'1-bbe2953a5ef9835a0f8d548fa4c33b42'}], u'id': u'7d0cb1ccbfd9675b4b6c1076f400560d', u'seq': 6}]}
Meanwhile, the code I'm really interested in using:
db.changes(feed='continuous',since=4,heartbeat=1000)
returns a generator object and doesn't appear to provide notifications as they come in, as the CouchDB guide suggests.
Has anyone used changes in couchdb-python successfully?
I use long polling rather than continuous, and that works OK for me. In long-polling mode db.changes blocks until at least one change has happened, and then returns all the changes in a generator object.
Here is the code I use to handle changes. settings.db is my CouchDB Database object.
since = 1
while True:
    changes = settings.db.changes(since=since)
    since = changes["last_seq"]
    for changeset in changes["results"]:
        try:
            doc = settings.db[changeset["id"]]
        except couchdb.http.ResourceNotFound:
            continue
        else:
            # process doc here
            pass
As you can see it's an infinite loop where we call changes on each iteration. The call to changes returns a dictionary with two elements, the sequence number of the most recent update and the objects that were modified. I then loop through each result loading the appropriate object and processing it.
For a continuous feed, instead of the while True: line use for changes in settings.db.changes(feed="continuous", since=since).
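Putting that together, a continuous-feed version of the same loop might look roughly like this (a sketch; settings.db is the same Database object as above, and entries without an "id" are skipped):
since = 1
# Each item yielded by the continuous feed is one change notification.
for change in settings.db.changes(feed="continuous", since=since):
    if "id" not in change:
        continue  # skip heartbeat/last_seq entries that carry no document id
    try:
        doc = settings.db[change["id"]]
    except couchdb.http.ResourceNotFound:
        continue
    # process doc here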
I set up a mail spooler using something similar to this. You'll also need to load couchdb.Session(). I also use a filter so that only unsent emails reach the spooler's changes feed.
from couchdb import Server

s = Server('http://localhost:5984/')
db = s['testnotifications']

# the since parameter defaults to 'last_seq' when using a continuous feed
ch = db.changes(feed='continuous', heartbeat='1000', include_docs=True)
for line in ch:
    doc = line['doc']
    # process doc here
    doc['priority'] = 'high'
    doc['recipient'] = 'Joe User'
    # doc['state'] + 'sent'
    db.save(doc)
This will allow you access your doc directly from the changes feed, manipulate your data as you see fit, and finally update you document. I use a try/except block on the actual 'db.save(doc)' so I can catch when a document has been updated while I was editing and reload the doc before saving.
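The try/except mentioned above could look roughly like this (a sketch of the idea; couchdb.http.ResourceConflict is what couchdb-python raises when the document was updated underneath you):
from couchdb.http import ResourceConflict

try:
    db.save(doc)
except ResourceConflict:
    # Someone else updated the document while we were editing it:
    # reload the latest revision, reapply our changes, and save again.
    doc = db[doc['_id']]
    doc['priority'] = 'high'
    doc['recipient'] = 'Joe User'
    db.save(doc)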
