Should I be re-initializing the connection on every insert?
from pymongo import MongoClient

class TwitterStream:
    def __init__(self, timeout=False):
        while True:
            self.dump_data()

    def dump_data(self):
        ## dump my data into mongodb
        ## should I be doing this every time??:
        client = MongoClient()
        mongo = MongoClient('localhost', 27017)
        db = mongo.test
        db.my_collection.insert_one({'some stuff': 'other stuff'})  # collection name is illustrative
        ## dump data and close connection
Do I need to open the connection every time I write a record? Or can I leave a connection open assuming I'll be writing to the database 5 times per second with about 10kb each time?
If just one connection is enough, where should I define the variables which hold the connection (client, mongo, db)?
Open one MongoClient that lives for the duration of your program:
from pymongo import MongoClient

client = MongoClient()

class TwitterStream:
    def dump_data(self):
        while True:
            db = client.test
            db.my_collection.insert_one({'some stuff': 'other stuff'})
Opening a single MongoClient means you only pay its startup cost once, and its connection-pooling will minimize the cost of opening new connections.
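For illustration, here is a minimal sketch of that setup (the pool size and collection name are illustrative assumptions, not something your code requires):
import pymongo
from pymongo import MongoClient

# One client per process; PyMongo keeps an internal pool of sockets behind it,
# so concurrent inserts reuse pooled connections instead of reconnecting.
client = MongoClient('localhost', 27017, maxPoolSize=50)  # pool size is illustrative

def dump_data(document):
    # Reuse the long-lived client for every write.
    client.test.my_collection.insert_one(document)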
If you're concerned about surviving occasional network issues, wrap your operations in an exception block:
import pymongo

try:
    db.my_collection.insert_one(...)
except pymongo.errors.ConnectionFailure:
    # Handle error.
    ...
Opening connections is, in general, an expensive operation, so I recommend reusing them as much as possible.
In the case of MongoClient, you should be able to leave the connection open and keep reusing it. However, as the connection lives on for a longer time, you'll eventually start hitting connectivity issues. The recommended solution for this is to configure MongoClient to use auto-reconnect and catch the AutoReconnect exception as part of your retry mechanism.
Here's an example of said approach, taken from http://python.dzone.com/articles/save-monkey-reliably-writing:
import datetime
import random
import time

import pymongo

# mabel_db is a pymongo Database obtained from a long-lived MongoClient.
while True:
    time.sleep(1)
    data = {
        'time': datetime.datetime.utcnow(),
        'oxygen': random.random()
    }
    # Try for five minutes to recover from a failed primary
    for i in range(60):
        try:
            # safe=True is the old write-concern flag; modern PyMongo would use insert_one(data)
            mabel_db.breaths.insert(data, safe=True)
            print('wrote')
            break  # Exit the retry loop
        except pymongo.errors.AutoReconnect as e:
            print('Warning', e)
            time.sleep(5)
Preface
I want to process tasks listed in a database table in parallel. Not looking for working code.
The Setup
1 PostgreSQL database server D
1 processing server P
1 User terminal T
using Python 3.6, psycopg2 2.7.6, PostgreSQL 11
D holds tables with data to be processed and a tasks table. A user at T ssh's into P, where the following command can be issued:
python -m core.utils.task
This task.py script is essentially a while loop that gets a task t with status 'new' from the tasks table on D until there are no new tasks left. A task t is basically a set of arguments for another function called do_something(t). do_something(t) itself makes many connections to D to get the data that needs to be processed and sets the task's status to 'done' once it has finished; then the while loop starts over and gets a new task.
In order to run python -m core.utils.task multiple times I open multiple ssh connections. Not so good, I know; threading or multiprocessing would be better. But this is just for testing whether I can run the mentioned command twice.
There is a script that manages all the database interactions, called pgsql.py, which is needed to get a task and is then used by do_something(t). I adapted a singleton pattern from this SE post.
Pseudo-Code (mostly)
task.py
import mymodule
import pgsql

def main():
    while True:
        r, c = pgsql.SQL.select_task()  # rows and columns
        task = dotdict(dict(zip(c, r[0])))  # dotdict: a dict that allows attribute-style access
        mymodule.do_something(task)

if __name__ == "__main__":
    main()
mymodule.py
import pgsql

def do_something(t):
    input = pgsql.SQL.get_images(t.table, t.schema, t.image_id, t.image_directory)
    some_other_function(input)
    pgsql.SQL.task_status(t.task_id, 'done')
pgsql.py
import psycopg2 as pg

class Postgres(object):
    """Adapted from https://softwareengineering.stackexchange.com/a/358061/348371"""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            db_config = {'dbname': 'dev01', 'host': 'XXXXXXXX',
                         'password': 'YYYYY', 'port': 5432, 'user': 'admin'}
            try:
                print('connecting to PostgreSQL database...')
                connection = Postgres._instance.connection = pg.connect(**db_config)
                connection.set_session(isolation_level='READ COMMITTED', autocommit=True)
            except Exception as error:
                print('Error: connection not established {}'.format(error))
                Postgres._instance = None
            else:
                print('connection established')
        return cls._instance

    def __init__(self):
        self.connection = self._instance.connection

    def query(self, query):
        try:
            with self.connection.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()
                cols = [desc[0] for desc in cur.description]
        except Exception as error:
            print('error executing query "{}", error: {}'.format(query, error))
            return None
        else:
            return rows, cols

    def __del__(self):
        self.connection.close()
db = Postgres()

class SQL():
    def select_task():
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1)
            RETURNING *
            ;
            """
        return Postgres.query(db, s)

    def task_status(id, status):
        s = """
            UPDATE schema.tasks
            SET status = '{s}'
            WHERE tasks.task_id = '{id}'
            ;
            """.format(s=status, id=id)
        return Postgres.query(db, s)
Problem
This works with one ssh connection: tasks are retrieved from the database and processed, and once a task is finished it is set to 'done'. But as soon as I open a second ssh connection in another terminal to run python -m core.utils.task (in parallel, so to speak), the exact same rows of the tasks table are processed in both sessions, ignoring that they have already been updated.
Question
What are your suggestions to get this to work? There are millions of tasks and I need to run them in parallel. Before implementing threading or multiprocessing I wanted to test it with multiple ssh connections first. Bad idea? I have fiddled around with the isolation levels and autocommit settings in psycopg2's set_session(), but without luck. I checked the sessions on the database server and can see that each process of python -m core.utils.task has its own PID and connects only once, exactly as this singleton pattern should work. Any ideas or pointers on how to deal with this are much appreciated!
The main problem is that performing one task is not an atomic operation. Therefore, in different ssh sessions, the same task can be processed several times.
In this implementation, you can try to use an "INPROGRESS" status for tasks so as not to retrieve tasks that are already being processed (those with the "INPROGRESS" status). But be sure to use autocommit.
But I would implement this using threads and a database connection pool, and would extract tasks in batches using OFFSET and LIMIT. The do_something, select_task and task_status functions would then be implemented for a batch of tasks.
Also, there is no need to implement the Postgres class as a singleton.
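For example, a minimal non-singleton sketch of pgsql.py (reusing the connection settings from the question; the query helper is simplified):
import psycopg2 as pg

# A plain module-level connection: the module is imported once per process,
# so this runs once and every caller shares the same connection object.
connection = pg.connect(dbname='dev01', host='XXXXXXXX',
                        password='YYYYY', port=5432, user='admin')
connection.set_session(isolation_level='READ COMMITTED', autocommit=True)

def query(sql):
    with connection.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall(), [desc[0] for desc in cur.description]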
Amended (see the comments below)
You can add FOR UPDATE SKIP LOCKED to the SQL query in the current implementation (see url).
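As a sketch based on the question's select_task (table and status names are taken from there, so adjust as needed), the sub-select then locks the chosen row and skips rows already locked by another session:
def select_task():
    s = """
        UPDATE schema.tasks
        SET status = 'ready'
        WHERE task_id = ( SELECT task_id
                          FROM schema.tasks
                          WHERE tasks.status = 'new'
                          LIMIT 1
                          FOR UPDATE SKIP LOCKED )  -- rows claimed by another session are skipped
        RETURNING *
        ;
        """
    return Postgres.query(db, s)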
If you want to work with batches, then separate the data by some serial column (well, or just sort the data in a table).
My implementation would use batches. This can be implemented using ThreadPoolExecutor and PersistentConnectionPool.
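A rough sketch of that approach (process_batch, batches and the pool sizes are placeholders, and the connection parameters are copied from the question):
from concurrent.futures import ThreadPoolExecutor
from psycopg2.pool import PersistentConnectionPool

pool = PersistentConnectionPool(minconn=1, maxconn=8,
                                dbname='dev01', host='XXXXXXXX',
                                password='YYYYY', port=5432, user='admin')

def process_batch(batch):
    conn = pool.getconn()   # each worker thread gets its own persistent connection
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            for task in batch:
                ...         # do_something(task); then mark it 'done'
    finally:
        pool.putconn(conn)

# `batches` would be built with OFFSET/LIMIT (or by ranges of a serial column).
with ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(process_batch, batches)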
In my Python script, I've subscribed to a web socket. Whenever data is received, I insert it into a MySQL db. There are about 100-200 queries every second. The problem is that it works for some time and then raises the error "error 2006: MySQL server has gone away".
I've increased max_allowed_packet up to 512M, but it didn't help.
Here's my code.
def db_entry(threadName, _data):
    _time = time.strftime('%Y-%m-%d %H:%M:%S')
    #print ("starting new thread...")
    for data in _data:
        #print (data)
        sql = ("INSERT INTO %s (Script_Name, Lot_Size, Date, Time, Last_Price, Price_Change, "
               "Open, High, Low, Close, Volume, Buy_Quantity, Sell_Quantity) "
               "VALUES('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')" % (
                   "_" + str(data['instrument_token']), data['instrument_token'], 1,
                   datetime.datetime.today().strftime("%Y-%m-%d"), _time,
                   data['last_price'], data['change'], data['ohlc']['open'],
                   data['ohlc']['high'], data['ohlc']['low'], data['ohlc']['close'],
                   data['volume'], data['buy_quantity'], data['sell_quantity']))
        cursor.execute(sql)
        # Commit your changes in the database
        db.commit()
def on_tick(tick, ws):
    thread_name = "Thread" + str(thread_count + 1)
    try:
        _thread.start_new_thread(db_entry, (thread_name, tick,))
    except Exception as e:
        print(e)
        raise
def on_connect(ws):
    # Subscribe to a list of instrument_tokens (RELIANCE and ACC here).
    ws.subscribe(instrument_token)
    # Set RELIANCE to tick in `full` mode.
    ws.set_mode(ws.MODE_FULL, instrument_token)

# Assign the callbacks.
kws.on_tick = on_tick
kws.on_connect = on_connect
kws.enable_reconnect(reconnect_interval=5, reconnect_tries=50)

# Infinite loop on the main thread. Nothing after this will run.
# You have to use the pre-defined callbacks to manage subscriptions.
kws.connect()
Thanks in advance. :)
The MySQL developer documentation is very clear on this point. Odds are, some of those MySQL queries are running slower than others because they're waiting for their turn to insert data. If they wait too long, MySQL will just close their connection. By default, MySQL's wait_timeout is eight hours (28800 s). Has the MySQL configuration been tweaked? How much hardware is allocated to MySQL?
Generally, look at all the timeout configurations. Read them and understand them. Do not simply copy and paste all the performance tweaks bloggers like blogging about.
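As a starting point, here is a quick way to see what the server is actually configured with (a sketch only; it assumes you can reach the server with PyMySQL, which another answer below also uses, and the connection parameters are placeholders):
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='password', database='demo_db')
with conn.cursor() as cur:
    # List every timeout-related server variable (wait_timeout, net_read_timeout, ...).
    cur.execute("SHOW VARIABLES LIKE '%timeout%'")
    for name, value in cur.fetchall():
        print(name, value)
    # And the packet limit that is often blamed for "server has gone away".
    cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
    print(cur.fetchone())
conn.close()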
Finally, it's solved.
I kept the db connection open, which was causing the problem.
I now close the db connection after the query is fired, and open it again when I want to insert something.
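In other words, something along these lines (a sketch only; the connect arguments are placeholders and PyMySQL is assumed as the driver):
import pymysql

def insert_rows(sql_statements):
    # Open a fresh connection for this batch of inserts, then close it again,
    # so there is no long-lived connection for the server to time out.
    conn = pymysql.connect(host='localhost', user='root', password='password', database='demo_db')
    try:
        with conn.cursor() as cur:
            for sql in sql_statements:
                cur.execute(sql)
        conn.commit()
    finally:
        conn.close()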
You need to create an object with its own connection handling methods. I use this and it works well.
import MySQLdb

class DB():
    def __init__(self, **kwargs):
        self.conn = MySQLdb.connect('host', 'user', 'pass', 'db')
        try:
            if (self.conn):
                status = "DB init success"
            else:
                status = "DB init failed"
            self.conn.autocommit(True)
            # self.conn.select_db(DB_NAME)
            self.cursor = self.conn.cursor()
        except Exception as e:
            status = "DB init fail %s " % str(e)

    def execute(self, query):
        try:
            if self.conn is None:
                self.__init__()
            else:
                self.conn.ping(True)  # reconnect if the connection has gone away
            self.cursor.execute(query)
            return self.cursor.fetchall()
        except Exception as e:
            import traceback
            traceback.print_exc()
            # error occurs, rollback
            self.conn.rollback()
            return False
Usage
data = DB().execute("SELECT * FROM Users")
print(data)
I am getting the error InterfaceError (0, ''). Is there a way in the PyMySQL library to check whether the connection or the cursor is closed? For the cursor I am already using a context manager, like this:
with db_connection.cursor() as cursor:
....
You can use the Connection.open attribute.
The Connection.open field will be 1 if the connection is open and 0 otherwise. So you can say
if conn.open:
# do something
The conn.open attribute will tell you whether the connection has been explicitly closed or whether a remote close has been detected. However, it's always possible that you will try to issue a query and suddenly the connection is found to have given out; there is no way to detect this ahead of time (indeed, it might happen during the process of issuing the query), so the only truly safe thing is to wrap your calls in a try/except block.
Use conn.connection in an if statement.
import pymysql

def conn():
    mydb = pymysql.connect(host='localhost', user='root', password='password',
                           database='demo_db', autocommit=True)
    return mydb.cursor()

def db_exe(query, c):
    try:
        if c.connection:
            print("connection exists")
            c.execute(query)
            return c.fetchall()
        else:
            print("trying to reconnect")
            c = conn()
    except Exception as e:
        return str(e)

dbc = conn()
print(db_exe("select * from users", dbc))
This is how I did it, because I want to still run the query even if the connection goes down:
def reconnect():
    mydb = pymysql.Connect(host='localhost', user='root', password='password',
                           database='demo_db', ssl={"fake_flag_to_enable_tls": True},
                           autocommit=True)
    return mydb.cursor()

try:
    if (c.connection.open != True):
        c = reconnect()  # reconnect
    if c.connection.open:
        c.execute(query)
        return c.fetchall()
except Exception as e:
    return str(e)
I think a try/except might do the trick instead of only checking the cursor.
from pymysql.err import OperationalError

try:
    c = db_connection.cursor()
except OperationalError:
    connected = False
else:
    connected = True
    # code here
I initially went with the solution from AKHIL MATHEW to call conn.open but later during testing found that sometimes conn.open was returning positive results even though the connection was lost. To be certain, I found I could call conn.ping() which actually tests the connection. The function also accepts an optional parameter (reconnect=True) which will cause it to automatically reconnect if the ping fails.
Of course there is a cost to this - as implied by the name, ping actually goes out to the server and tests the connection. You don't want to do this before every query, but in my case I have an AWS lambda spinning up on warm start and trying to reuse the connection, so I think I can justify testing my connection once on each warm start and reconnecting if it's been lost.
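For instance (a minimal sketch, assuming a PyMySQL connection object named conn):
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='password', database='demo_db')

# ... later, e.g. at the start of a warm invocation ...
conn.ping(reconnect=True)  # round-trips to the server and reconnects if the link has died
with conn.cursor() as cur:
    cur.execute("SELECT 1")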
There are many questions about whether the Django DB connection is thread safe, but they all seem to be asking about the default request threads.
What if I am writing custom script that uses database connection in threads:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        cursor = self.connection.cursor()
        cursor.execute('SELECT ... WHERE id = %s', listing.id)
        data = cursor.fetchall()
        ...

    def run(self):
        connection = self.connections['legacy']
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing])
How safe is the data inside the transform_data thread, in terms of the results from the cursor not getting mixed up with those of other threads?
Ideally, each thread should use its own connection. If you do that, then when you execute the select query inside transform_data you essentially get a snapshot of the data at that point in time. You can retrieve the rows without having to worry about them being updated or deleted by other threads, provided the other threads have their own connections.
If all threads share the same connection, what exactly happens depends heavily on which database you are using and on the transaction isolation level.
Each item in the connections object returns a thread-local connection to that database. By default, these connections cannot be shared between threads; attempting to do so will result in a DatabaseError.
Always use connections[alias] within the thread that executes your queries. Never access connections[alias] in the parent thread and pass the object to the child thread. This will ensure that every connection object you use is local to the current thread, avoiding any threading issues.
To fix your code and make it thread-safe, you would change it like this:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        # Access the database connection on the global `connections` object
        # from within the child thread.
        cursor = connections['legacy'].cursor()
        cursor.execute('SELECT ... WHERE id = %s', listing.id)
        data = cursor.fetchall()
        ...

    def run(self):
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing]).start()
I'm creating a RESTful API which needs to access the database. I'm using Restish, Oracle, and SQLAlchemy. However, I'll try to frame my question as generically as possible, without taking Restish or other web APIs into account.
I would like to be able to set a timeout for a connection executing a query. This is to ensure that long running queries are abandoned, and the connection discarded (or recycled). This query timeout can be a global value, meaning, I don't need to change it per query or connection creation.
Given the following code:
import cx_Oracle
import sqlalchemy.pool as pool

conn_pool = pool.manage(cx_Oracle)
conn = conn_pool.connect("username/p4ss@dbname")
conn.ping()

try:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM really_slow_query")
    print(cursor.fetchone())
finally:
    cursor.close()
How can I modify the above code to set a query timeout on it?
Will this timeout also apply to connection creation?
This is similar to what java.sql.Statement's setQueryTimeout(int seconds) method does in Java.
Thanks
For the query, you can look at a timer combined with a conn.cancel() call.
Something along these lines:
import threading

t = threading.Timer(timeout, conn.cancel)
t.start()
cursor = conn.cursor()
cursor.execute(query)
res = cursor.fetchall()
t.cancel()
On Linux, see /etc/oracle/sqlnet.ora:
sqlnet.outbound_connect_timeout = value
There are also the options tcp.connect_timeout and sqlnet.expire_time. Good luck!
You could look at setting up PROFILEs in Oracle to terminate the queries after a certain number of logical_reads_per_call and/or cpu_per_call
Timing Out with the System Alarm
Here's how to use the operating system timeout to do this. It's generic and works for things other than Oracle.
import signal

class TimeoutExc(Exception):
    """this exception is raised when there's a timeout"""
    def __init__(self):
        Exception.__init__(self)

def alarmhandler(signame, frame):
    """sigalarm handler. raises a Timeout exception"""
    raise TimeoutExc()

nsecs = 5
signal.signal(signal.SIGALRM, alarmhandler)  # set the signal handler function
signal.alarm(nsecs)                          # in 5s, the process receives a SIGALRM
try:
    cx_Oracle.connect(...)                   # do your thing, connect, query, etc
    signal.alarm(0)                          # if successful, turn off the alarm
except TimeoutExc:
    print("timed out!")                      # timed out!!