Query hive from flask - python

I am new to Flask and am using the following Flask cookiecutter to start a quick prototype. The main idea of the project is to collect data from a Hive cluster and push it to the end user using Flask.
Although I was able to connect Flask to the Hive server using the PyHive connector, I am hitting a strange issue related to the SELECT limit when I try to query more than 50 items.
In my case I built a Hive class modeled on the Flask extension development pattern, adapted for PyHive:
from pyhive import hive
from flask import current_app
# Find the stack on which we want to store the database connection.
# Starting with Flask 0.9, the _app_ctx_stack is the correct one,
# before that we need to use the _request_ctx_stack.
try:
    from flask import _app_ctx_stack as stack
except ImportError:
    from flask import _request_ctx_stack as stack
class Hive(object):
    def __init__(self, app=None):
        self.app = app
        if app is not None:
            self.init_app(app)
    def init_app(self, app):
        # Use the newstyle teardown_appcontext if it's available,
        # otherwise fall back to the request context
        if hasattr(app, 'teardown_appcontext'):
            app.teardown_appcontext(self.teardown)
        else:
            app.teardown_request(self.teardown)
    def connect(self):
        return hive.connect(current_app.config['HIVE_DATABASE_URI'], database="orc")
    def teardown(self, exception):
        # Close the per-context connection, if one was opened.
        ctx = stack.top
        if hasattr(ctx, 'hive_db'):
            ctx.hive_db.close()
        return None
    @property
    def connection(self):
        # Lazily open one connection per application context.
        ctx = stack.top
        if ctx is not None:
            if not hasattr(ctx, 'hive_db'):
                ctx.hive_db = self.connect()
            return ctx.hive_db
I then created an endpoint to load data from Hive:
@blueprint.route('/hive/<limit>')
def connect_to_hive(limit):
    cur = hive.connection.cursor()
    query = "select * from part_raw where year=2018 LIMIT {0}".format(limit)
    cur.execute(query)
    res = cur.fetchall()
    return jsonify(data=res)
On the first run everything works fine if I limit the query to 50 items, but as soon as I increase the limit the request hangs and nothing loads. However, loading the same data from a Jupyter notebook works fine, which is why I suspect I missed something in my Flask code.

The problem turned out to be a library version mismatch. I solved it by pinning the following in my requirements:
# Hive with needed dependencies
sasl==0.2.1
thrift==0.11.0
thrift-sasl==0.3.0
PyHive==0.6.1
The old version was as follows:
sasl>=0.2.1
thrift>=0.10.0
#thrift_sasl>=0.1.0
git+https://github.com/cloudera/thrift_sasl # Using master branch in order to get Python 3 SASL patches
PyHive==0.6.1
That older set is what the development requirements file in the PyHive project specifies.
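Because the same query worked from a notebook, it may be worth confirming which versions are actually installed in the environment that runs Flask. The following is a minimal sketch using only the standard library (Python 3.8+). The helper name `find_pin_mismatches` is hypothetical, and the pins are the ones listed in this answer.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the requirements listed in this answer.
PINS = {
    "sasl": "0.2.1",
    "thrift": "0.11.0",
    "thrift-sasl": "0.3.0",
    "PyHive": "0.6.1",
}


def find_pin_mismatches(pins):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from its pin.
    for name, (expected, installed) in find_pin_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Running this once inside the Flask virtualenv and once inside the notebook kernel should show whether the two environments differ.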

Related

Checking database connection with flask_mongoengine

I am practicing Python with Flask, flask_mongoengine and MongoDB.
I have followed various tutorials and for the moment am using an application factory to set up my app (a web service for querying a database).
Code looks something like this:
db = MongoEngine()
def initialise_db(app):
    db.init_app(app)
def create_app():
    app = Flask(__name__)
    api = Api(app)
    initialise_db(app)
    initialise_routes(api)
    return app
I'd like to know if there is a way of checking that the database connection is OK before creating the app. Here initialise_db(app) runs with no error even if there is no running instance of MongoDB (on port 27017, or another port if specified in app.config).
I only get an error at the point where I try to do a db.Document.save()...
EDIT:
I have implemented a simple check_connection() function where I try to save a generic document in my DB, and raise an exception if this fails:
from db_models import checkElement
def check_connection():
    try:
        checkElement(checkID=1).save()
        saved = True
    except Exception as e:
        saved = False
        raise e
    finally:
        if saved:
            checkElement.objects.get(checkID=1).delete()
I'm pretty sure this is not a very clean way to check for a DB connection. Is there a better way to do the same with mongoengine/pymongo?
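A lighter check than writing and deleting a document is to send the server a `ping` command through a client created with a short server-selection timeout. The sketch below assumes a pymongo `MongoClient`. The helper name `is_server_reachable` is hypothetical, and how the client is obtained from flask_mongoengine depends on the setup, so the wiring is shown only as a comment.

```python
def is_server_reachable(client):
    """Return True if the MongoDB server behind `client` answers a ping.

    `client` is expected to be a pymongo.MongoClient (or anything exposing
    the same `client.admin.command(...)` call). Any exception raised by the
    ping is treated as "not reachable".
    """
    try:
        client.admin.command("ping")
    except Exception:
        return False
    return True


# Example wiring (assumes pymongo is installed; the URI is illustrative):
#
#     from pymongo import MongoClient
#     client = MongoClient("mongodb://localhost:27017",
#                          serverSelectionTimeoutMS=2000)
#     if not is_server_reachable(client):
#         raise RuntimeError("MongoDB is not reachable")
```

The short `serverSelectionTimeoutMS` matters here: without it, the ping waits for the driver's default server-selection timeout before failing.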

Avoid Failed to read from defunct connection when using neo4j in Flask Application

I am using the following version of neo4j libraries:
neo4j==1.7.2
neobolt==1.7.9
neotime==1.7.4
I have a Flask app, and in development I am using the internal Flask application server. (In prod I will use a Docker container with uwsgi, but this question is about my dev setup.)
I have encapsulated neo4j in a class, and my application maintains a single instance of this class:
class ChartStoreConnectionClass():
    driver = None
    def __init__(self, configDict):
        self.driver = neo4j.GraphDatabase.driver(
            configDict["boltUri"],
            auth=(configDict["basicAuthUsername"], configDict["basicAuthPassword"]),
            encrypted=True,
            trust=neo4j.TRUST_SYSTEM_CA_SIGNED_CERTIFICATES,
            # trust=neo4j.TRUST_ALL_CERTIFICATES,
            # trust=neo4j.TRUST_CUSTOM_CA_SIGNED_CERTIFICATES, Custom CA support is not implemented
        )
    def readonlyQuery(self, queryFN):
        res = None
        with self.driver.session() as session:
            tx = session.begin_transaction()
            res = queryFN(tx)
            tx.rollback()
        return res
    def execute(self, queryFN):
        res = None
        with self.driver.session() as session:
            tx = session.begin_transaction()
            res = queryFN(tx)
            tx.commit()
        return res
This setup works for a while, but sometimes I get the following error:
neobolt.exceptions.ServiceUnavailable: Failed to read from defunct connection Address(host='127.0.0.1', port=7687)
When I simply retry the request, it works the second time. I have read around the error message and found multiple posts talking about neo4j in multi-threaded vs multi-process environments, but I do not believe they are relevant to me here.
These errors occur on the commit in the execute function. The queryFN I pass to it is a very simple one-liner which takes next to no time to execute.
Is it wrong to have a single driver instance for my application? (I thought this was the way to do it because the driver creates a connection pool and it makes sense that my application has a connection pool.)
What is the recommended way to use neo4j with Flask?
I have seen this example https://github.com/neo4j-examples/movies-python-bolt/blob/master/movies.py but they simply have a single driver object in the same way as I do. (Except that is global not inside a class, but functionally mine is the same.)
Where should I be looking to debug this issue?
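Since a plain retry succeeds, one stopgap while debugging is to retry the unit of work a bounded number of times when the driver reports a defunct connection. This is a minimal, driver-agnostic sketch. `run_with_retry` and its parameters are hypothetical names, and in this application the exception class passed in would presumably be `neobolt.exceptions.ServiceUnavailable`. It does not explain why the connection goes stale.

```python
import time


def run_with_retry(operation, retry_on, attempts=3, delay=0.0):
    """Call `operation()` and retry it when it raises `retry_on`.

    `retry_on` is an exception class (or tuple of classes). The last
    failure is re-raised once `attempts` calls have been made.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Pause before the next attempt (no-op when delay is 0).
            time.sleep(delay)
```

With the class above this could be called as `run_with_retry(lambda: store.execute(queryFN), ServiceUnavailable)`, where `store` is the single `ChartStoreConnectionClass` instance. Retrying a write is only safe if the query is idempotent or the failed attempt is known not to have committed.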

How to use connection pooling with psycopg2 (postgresql) with Flask

How should I use psycopg2 with Flask? I suspect it wouldn't be good to open a new connection every request so how can I open just one and make it globally available to the application?
from flask import Flask
app = Flask(__name__)
app.config.from_object('config') # Now we can access the configuration variables via app.config["VAR_NAME"].
import psycopg2
import myapp.views
Connection pooling is needed with Flask or any other web server. As you rightly mentioned, it is not wise to open and close a connection for every request.
psycopg2 offers connection pooling out of the box: an AbstractConnectionPool class that you can extend and implement, or a SimpleConnectionPool class that can be used as is. Depending on how you run the Flask app, you may want to use ThreadedConnectionPool, which is described in the docs as
A connection pool that works with the threading module.
Creating a simple Flask app and adding a ConnectionPool to it
import psycopg2
from psycopg2 import pool
from flask import Flask
app = Flask(__name__)
postgreSQL_pool = psycopg2.pool.SimpleConnectionPool(1, 20, user="postgres",
                                                     password="pass##29",
                                                     host="127.0.0.1",
                                                     port="5432",
                                                     database="postgres_db")
@app.route('/')
def hello_world():
    # Use getconn() to get a connection from the connection pool
    ps_connection = postgreSQL_pool.getconn()
    # use cursor() to get a cursor as normal
    ps_cursor = ps_connection.cursor()
    #
    # use ps_cursor to interact with DB
    #
    # close cursor
    ps_cursor.close()
    # release the connection back to connection pool
    postgreSQL_pool.putconn(ps_connection)
    return 'Hello, World!'
The Flask app itself is not complete or production-ready. Please follow the instructions in the Flask docs to manage DB credentials and to share the pool object across the Flask app within the Flask context.
I would strongly recommend using a library such as SQLAlchemy along with Flask (available as a wrapper), which will maintain connections and manage the pooling for you, allowing you to focus on your logic.
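One detail worth handling in the handler above: if the database work raises, `putconn()` is never reached and the connection is not returned to the pool. A small context manager can guarantee the return. This is a minimal sketch, and `pooled_connection` is a hypothetical helper name. It relies only on the pool exposing `getconn()` and `putconn()`, as the psycopg2 pool classes do.

```python
from contextlib import contextmanager


@contextmanager
def pooled_connection(pool):
    """Borrow a connection from `pool` and always hand it back.

    `pool` is expected to expose getconn() and putconn(conn), as the
    psycopg2 pool classes do.
    """
    conn = pool.getconn()
    try:
        yield conn
    finally:
        # Runs on success and on error, so the pool is never drained.
        pool.putconn(conn)
```

Inside a view this would read `with pooled_connection(postgreSQL_pool) as ps_connection:` followed by the usual cursor work.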

Python - Flask - open a webpage in default browser

I am working on a small project in Python. It is divided into two parts.
The first part is responsible for crawling the web, extracting some information and inserting it into a database.
The second part is responsible for presenting that information from the database.
Both parts share the database. In the second part I am using the Flask framework to display the information as HTML with some formatting and styling to make it look cleaner.
The source files of both parts are in the same package, but to run the program properly the user has to run the crawler and the results presenter separately, like this:
python crawler.py
and then
python presenter.py
Everything is all right except for one thing. What I want the presenter to do is create the results in HTML format and open the results page in the user's default browser, but the page is always opened twice, probably due to the run() method, which starts Flask in a new thread, and things get cloudy for me. I don't know what I should do to make presenter.py open only one tab/window when it runs.
Here is the snippet of my code :
from flask import Flask, render_template
import os
import sqlite3
import webbrowser
# configuration
DEBUG = True
DATABASE = os.getcwd() + '/database/database.db'
app = Flask(__name__)
app.config.from_object(__name__)
app.config.from_envvar('CRAWLER_SETTINGS', silent=True)
def connect_db():
    """Returns a new connection to the database."""
    try:
        conn = sqlite3.connect(app.config['DATABASE'])
        return conn
    except sqlite3.Error:
        print('Unable to connect to the database')
        return False
@app.route('/')
def show_entries():
    u"""Loads pages information and emails from the database and
    inserts results into show_entries template. If there is a database
    problem returns error page.
    """
    conn = connect_db()
    if conn:
        try:
            # Reuse the connection opened above instead of opening a second one.
            cur = conn.cursor()
            results = cur.execute('SELECT url, title, doctype, pagesize FROM pages')
            pages = [dict(url=row[0], title=row[1].encode('utf-8'), pageType=row[2], pageSize=row[3]) for row in results.fetchall()]
            results = cur.execute('SELECT url, email from emails')
            emails = {}
            for row in results.fetchall():
                emails.setdefault(row[0], []).append(row[1])
            return render_template('show_entries.html', pages=pages, emails=emails)
        except sqlite3.Error as e:
            print(' Exception message %s ' % e)
            print('Could not load data from the database!')
            return render_template('show_error_page.html')
    else:
        return render_template('show_error_page.html')
if __name__ == '__main__':
    url = 'http://127.0.0.1:5000'
    webbrowser.open_new(url)
    app.run()
I use similar code on Mac OS X (with Safari, Firefox, and Chrome) all the time, and it runs fine. My guess is that you are running into Flask's auto-reload feature, which re-executes the script. Set debug=False and it will not try to auto-reload.
Other suggestions, based on my experience:
Consider randomizing the port you use, as quick edit-run-test loops sometimes find the OS thinking port 5000 is still in use. (Or, if you run the code several times simultaneously, say by accident, the port truly is still in use.)
Give the app a short while to spin up before you start the browser request. I do that through invoking threading.Timer.
Here's my code:
import random, threading, webbrowser
port = 5000 + random.randint(0, 999)
url = "http://127.0.0.1:{0}".format(port)
threading.Timer(1.25, lambda: webbrowser.open(url) ).start()
app.run(port=port, debug=False)
(This is all under the if __name__ == '__main__':, or in a separate "start app" function if you like.)
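If you would rather keep debug mode on, another option is to open the browser only in the process that is not the reloader's child. With the reloader active, Werkzeug runs the script a second time in a child process and sets the `WERKZEUG_RUN_MAIN` environment variable there. A minimal sketch, where `should_open_browser` is a hypothetical helper name:

```python
import os


def should_open_browser(environ=None):
    """Return True only in the process that was started by the user.

    Werkzeug's reloader re-executes the script in a child process and sets
    WERKZEUG_RUN_MAIN=true there, so checking for it lets the browser be
    opened just once, from the parent.
    """
    if environ is None:
        environ = os.environ
    return environ.get("WERKZEUG_RUN_MAIN") != "true"
```

Under `if __name__ == '__main__':` the `threading.Timer(...)` call would then be wrapped in `if should_open_browser():`. This assumes a fixed port: with the random port shown above, the child process that actually serves requests would draw a different port from the one the parent opened in the browser.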
This may or may not help, but my issue was that Flask opened in Microsoft Edge when I executed my app.py script. A beginner's solution: go to Settings, then Default apps, and change the default browser from Microsoft Edge to Chrome. Now Flask opens in Chrome every time. I still have the same issue where things just load, though.

How do I make one instance in Python that I can access from different modules?

I'm writing a web application that connects to a database. I'm currently using a variable in a module that I import from other modules, but this feels nasty.
# server.py
from hexapoda.application import application
if __name__ == '__main__':
    from paste import httpserver
    httpserver.serve(application, host='127.0.0.1', port='1337')
# hexapoda/application.py
from mongoalchemy.session import Session
db = Session.connect('hexapoda')
import hexapoda.tickets.controllers
# hexapoda/tickets/controllers.py
from hexapoda.application import db
def index(request, params):
    tickets = db.query(Ticket)
The problem is that I get multiple connections to the database (I guess that because I import application.py in two different modules, the Session.connect() function gets executed twice).
How can I access db from multiple modules without creating multiple connections (i.e. only call Session.connect() once in the entire application)?
Try the Twisted framework with something like:
from twisted.enterprise import adbapi
class db(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            db='database',
                                            user='username',
                                            passwd='password')
    def query(self, sql):
        # Run _query in a pool thread inside a transaction.
        self.dbpool.runInteraction(self._query, sql)
    def _query(self, tx, sql):
        tx.execute(sql)
        print(tx.fetchone())
That's probably not what you want to do - a single connection per app means that your app can't scale.
The usual solution is to connect to the database when a request comes in and store that connection in a variable with "request" scope (i.e. it lives as long as the request).
A simple way to achieve that is to put it in the request:
request.db = ...connect...
Your web framework probably offers a way to annotate methods or something like a filter which sees all requests. Put the code to open/close the connection there.
If opening connections is expensive, use connection pooling.
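If a single shared object is still what you want (for example a pool or session factory rather than one raw connection), a lazily initialised accessor avoids running the connect call at import time and makes "only once" explicit. A minimal sketch: `get_db` and `connect` are hypothetical names, and `connect` merely stands in for the real call such as `Session.connect('hexapoda')`.

```python
from functools import lru_cache


def connect():
    """Placeholder for the real connect call, e.g. Session.connect(...)."""
    return {"database": "hexapoda"}  # stands in for a session or pool object


@lru_cache(maxsize=None)
def get_db():
    """Create the shared object on first use and return the same one after."""
    return connect()
```

Other modules would then call `get_db()` instead of importing a module-level `db` variable. Note that if several threads make the very first call at the same time, `lru_cache` may invoke `connect()` more than once, so a lock would be needed where that matters.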
