Parallelize data import from MongoDB - python

How can I import data from MongoDB in parallel? One approach would be to count the documents in the collection, say there are 1000 rows, split them into batches of 100, fetch the batches in parallel, and then combine them again so that all 1000 are loaded.
Below is the code I currently use to import data from MongoDB into Python.
import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """A utility for making a connection to MongoDB."""
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """Read from Mongo and store the result in a DataFrame."""
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    # Make a query to the specific DB and collection
    cursor = db[collection].find(query)
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    # Delete the _id column
    if no_id:
        del df['_id']
    return df

As I said in my comment, have you tried optimizing your database by using indexes? If the database itself is slow, I don't think parallelizing will improve it. If you still want to go parallel, call read_mongo from multiple threads.
For indexes you should check https://docs.mongodb.com/manual/indexes/
There's nothing code-related to add there; you just need to understand your database better.
As for the code, Python offers concurrency (threads) and parallelism (the multiprocessing package). You'd need to make your program call read_mongo with your already defined/split queries.
There are many examples out there. I'd try the indexes first, because they will also help the parallel approach afterwards.
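For illustration, here is a minimal sketch of the threaded approach, assuming the read_mongo function above; the database name, collection name, and the batch_id field used to split the query are placeholders:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Split the collection into range queries; 'batch_id' is a placeholder field,
# use whatever indexed field suits your data.
queries = [{'batch_id': {'$gte': i, '$lt': i + 100}} for i in range(0, 1000, 100)]

def fetch(query):
    # Each call opens its own connection through read_mongo defined above.
    return read_mongo('mydb', 'mycollection', query=query)

with ThreadPoolExecutor(max_workers=4) as executor:
    frames = list(executor.map(fetch, queries))

# Recombine the 100-row batches into one DataFrame covering all rows.
df = pd.concat(frames, ignore_index=True)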

Related

Fetch from one database, Insert/Update into another using SQLAlchemy

We have data in a Snowflake cloud database that we would like to move into an Oracle database. As we would like to work toward refreshing the Oracle database regularly, I am trying to use SQLAlchemy to automate this.
I would like to do this using Core because my team is all experienced with SQL, but I am the only one with Python experience. I think it would be easier to tweak the data pulls if we just pass SQL strings. Plus the Snowflake db has some columns with JSON that seems easier to parse using direct SQL since I do not see JSON in the SnowflakeDialect.
I have established connections to both databases and am able to do select queries from both. I have also manually created the tables in our Oracle db so that the keys and datatypes match what I am pulling from Snowflake. When I try to insert, though, my Jupyter notebook just continuously says "Executing Cell" and hangs. Any thoughts on how to proceed or how to get the notebook to tell me where the hangup is?
from sqlalchemy import create_engine, pool, MetaData, text
from snowflake.sqlalchemy import URL
import pandas as pd

eng_sf = create_engine(URL(  # engine for Snowflake
    account='account',
    user='user',
    password='password',
    database='database',
    schema='schema',
    warehouse='warehouse',
    role='role',
    timezone='timezone'
))
eng_o = create_engine("oracle+cx_oracle://{}[{}]:{}@{}".format('user', 'proxy', 'password', 'database'),
                      poolclass=pool.NullPool)  # engine for Oracle

meta_o = MetaData()
meta_o.reflect(bind=eng_o)
person_o = meta_o.tables['bb_lms_person']  # other Oracle tables follow this example

meta_sf = MetaData()
meta_sf.reflect(bind=eng_sf, only=['person'])  # other Snowflake tables as well, but for simplicity, let's look at one
person_sf = meta_sf.tables['person']

person_query = """
SELECT ID
    ,EMAIL
    ,STAGE:student_id::STRING as STUDENT_ID
    ,ROW_INSERTED_TIME
    ,ROW_UPDATED_TIME
    ,ROW_DELETED_TIME
FROM cdm_lms.PERSON
"""

with eng_sf.begin() as connection:
    result = connection.execute(text(person_query)).fetchall()  # this snippet runs and returns result as expected

with eng_o.begin() as connection:
    connection.execute(person_o.insert(), result)  # this is a coin flip: sometimes it runs, sometimes it just hangs forever

eng_sf.dispose()
eng_o.dispose()
I've checked the typical offenders. The keys for both person_o and the result are all lowercase and match. Any guidance would be appreciated.
Use the metadata for the table. The fTable_Stage update or insert can be written as fluent calls, with the values assigned through keyword arguments. This is quite safe, because only columns present in the table metadata can be used in values(). In this example I am updating three fields: LateProbabilityDNN, Sentiment_Polarity, Sentiment_Subjectivity.
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import sessionmaker

# params, keyID, late_proba and my_valance are defined elsewhere in the application
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
connection = engine.connect()
metadata = MetaData()
Session = sessionmaker(bind=engine)
session = Session()

fTable_Stage = Table('fTable_Stage', metadata, autoload=True, autoload_with=engine)
stmt = fTable_Stage.update().where(fTable_Stage.c.KeyID == keyID).values(
    LateProbabilityDNN=round(float(late_proba), 2),
    Sentiment_Polarity=round(my_valance.sentiment.polarity, 2),
    Sentiment_Subjectivity=round(my_valance.sentiment.subjectivity, 2)
)
connection.execute(stmt)

write scraped binary file to blob without first writing it to disk

I use the requests library to retrieve a binary file from a website. I now want to store it in MySQL as a BLOB. I don't want to take the intermediate step of writing the file to disk. What is the best way to do this?
At present, I am using base64 to encode the binary file so that MySQL will accept it, as in this suggestion. Is this the best strategy, or is there a way that permits me to skip the encoding (and the subsequent decoding when I retrieve the file)?
Minimal example:
import base64
import pymysql
import requests

myPDF = requests.get("https://arxiv.org/pdf/2004.00627.pdf")
myPDF_encoded = base64.b64encode(myPDF.content)

conn = pymysql.connect(
    host="127.0.0.1",
    user=user,
    passwd=password,
    db="myDB")
cur = conn.cursor()

insertLine = "INSERT INTO myDB (PDF) VALUES (%s)"
cur.execute(insertLine, myPDF_encoded)
conn.commit()
Many posts speak to the general problem of writing a binary file to a BLOB, but as best I can tell, all start from the assumption that the file is to be read from disk.
A much better solution for modern versions of MySQL: skip the base64 encoding and use _binary %s to send binary data, or simply add the binary_prefix=True option when setting up the pymysql connection. For example,
import pymysql
import requests

myPDF = requests.get("https://arxiv.org/pdf/2004.00627.pdf")

conn = pymysql.connect(
    host="127.0.0.1",
    user=user,
    passwd=password,
    db="myDB",
    binary_prefix=True)
cur = conn.cursor()

insertLine = "INSERT INTO myDB (PDF) VALUES (%s)"
cur.execute(insertLine, (myPDF.content,))
conn.commit()
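To confirm that nothing needs decoding on the way back out, a retrieval sketch (assuming the same connection and a BLOB/LONGBLOB column named PDF) could look like:
cur = conn.cursor()
cur.execute("SELECT PDF FROM myDB LIMIT 1")
(pdf_bytes,) = cur.fetchone()
# pdf_bytes is already raw bytes; no base64 decoding step is needed.
print(type(pdf_bytes), len(pdf_bytes))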

Multiple database connections using UPDATE ... RETURNING seem not to update rows in the tasks table

Preface
I want to process tasks listed in a database table in parallel. Not looking for working code.
The Setup
1 PostgreSQL database server D
1 processing server P
1 User terminal T
using Python 3.6, psycopg2.7.6, PostgreSQL 11
D holds tables with data to be processed and a tasks table. A user at T ssh's into P, where the following command can be issued:
python -m core.utils.task
This task.py script is essentially a while loop that gets a task t with status 'new' from the tasks table on D, until there are no new tasks left. A task t is basically a set of arguments for another function called do_something(t). do_something(t) itself makes many connections to D to get data that needs to be processed and sets the task's status to 'done' once it has finished; then the while loop starts all over and gets a new task.
In order to run python -m core.utils.task multiple times, I open multiple ssh connections. Not so good, I know; threading or multiprocessing would be better. But this is just for testing whether I can run the mentioned command twice.
There is a script that manages all the database interactions, called pgsql.py, which is needed to get a task and is then used by do_something(t). I adapted a singleton pattern from this SE post.
Pseudo-Code (mostly)
task.py
import mymodule
import pgsql


def main():
    while True:
        r, c = pgsql.SQL.select_task()  # rows and columns
        task = dotdict(dict(zip(c, r[0])))
        mymodule.do_something(task)


if __name__ == "__main__":
    main()
mymodule.py
import pgsql


def do_something(t):
    input = pgsql.SQL.get_images(t.table, t.schema, t.image_id, t.image_directory)
    some_other_function(input)
    pgsql.SQL.task_status(t.task_id, 'done')
pgsql.py
import psycopg2 as pg


class Postgres(object):
    """Adapted from https://softwareengineering.stackexchange.com/a/358061/348371"""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            db_config = {'dbname': 'dev01', 'host': 'XXXXXXXX',
                         'password': 'YYYYY', 'port': 5432, 'user': 'admin'}
            try:
                print('connecting to PostgreSQL database...')
                connection = Postgres._instance.connection = pg.connect(**db_config)
                connection.set_session(isolation_level='READ COMMITTED', autocommit=True)
            except Exception as error:
                print('Error: connection not established {}'.format(error))
                Postgres._instance = None
            else:
                print('connection established')
        return cls._instance

    def __init__(self):
        self.connection = self._instance.connection

    def query(self, query):
        try:
            with self.connection.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()
                cols = [desc[0] for desc in cur.description]
        except Exception as error:
            print('error executing query "{}", error: {}'.format(query, error))
            return None
        else:
            return rows, cols

    def __del__(self):
        self.connection.close()


db = Postgres()


class SQL():
    def select_task():
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1)
            RETURNING *
            ;
            """
        return Postgres.query(db, s)

    def task_status(id, status):
        s = """
            UPDATE
                schema.tasks
            SET
                status = '{s}'
            WHERE
                tasks.task_id = '{id}'
            ;
            """.format(s=status,
                       id=id)
        return Postgres.query(db, s)
Problem
This works with one ssh connection. Tasks are retrieved from the database and processed, and once finished the task is set to 'done'. But as soon as I open a second ssh connection in a second terminal to run python -m core.utils.task (in parallel, so to speak), the exact same rows of the tasks table are processed in both sessions, ignoring that they have already been updated.
Question
What are your suggestions to get this to work? There are millions of tasks and I need to run them in parallel. Before implementing threading or multiprocessing I wanted to test it with multiple ssh connections first; bad idea? I have fiddled around with the isolation levels and autocommit settings in psycopg2's set_session(), but without luck. I checked the sessions on the database server and can see that each python -m core.utils.task process has its own PID and connects only once, exactly as this singleton pattern should work. Any ideas or pointers on how to deal with this are much appreciated!
The main problem is that performing one task is not an atomic operation. Therefore, in different ssh sessions, the same task can be processed several times.
In this implementation, you can try to use an "INPROGRESS" status for tasks so as not to retrieve tasks that are already being processed (those with "INPROGRESS" status). But be sure to use autocommit.
I would, however, implement this using threads and a database connection pool, and would extract tasks in batches using OFFSET and LIMIT. The do_something, select_task and task_status functions would then operate on a batch of tasks.
Also, there is no need to implement the Postgres class as a singleton.
Amended (see the comments below)
You can add FOR UPDATE SKIP LOCKED to the SQL query in the current implementation (see url).
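Roughly, select_task in pgsql.py would then change along these lines (a sketch only, keeping the table and status names from the question):
    def select_task():
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1
                              FOR UPDATE SKIP LOCKED )
            RETURNING *
            ;
            """
        return Postgres.query(db, s)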
If you want to work with batches, then partition the data by some serial column (or just sort the data in the table).
My implementation uses batches. This can be implemented using ThreadPoolExecutor and PersistentConnectionPool.
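A rough outline of that combination, using psycopg2's ThreadedConnectionPool (its thread-safe pool) in place of PersistentConnectionPool; table and status names come from the question, everything else is illustrative:
from concurrent.futures import ThreadPoolExecutor
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=8,
                              dbname='dev01', host='XXXXXXXX',
                              user='admin', password='YYYYY', port=5432)

def worker(_):
    conn = pool.getconn()
    conn.autocommit = True
    try:
        while True:
            with conn.cursor() as cur:
                # Claim one task atomically; SKIP LOCKED keeps workers from colliding.
                cur.execute("""
                    UPDATE schema.tasks
                    SET status = 'ready'
                    WHERE task_id = (SELECT task_id FROM schema.tasks
                                     WHERE status = 'new'
                                     LIMIT 1 FOR UPDATE SKIP LOCKED)
                    RETURNING task_id;
                """)
                row = cur.fetchone()
                if row is None:
                    break  # no new tasks left
                # ... do_something(task) would run here ...
                cur.execute("UPDATE schema.tasks SET status = 'done' WHERE task_id = %s",
                            (row[0],))
    finally:
        pool.putconn(conn)

with ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(worker, range(8)))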

Share DB connection in a process pool

I have a Python 3 program that updates a large list of rows based on their ids (in a table in a Postgres 9.5 database).
I use multiprocessing to speed up the process. As Psycopg's connections can’t be shared across processes, I create a connection for each row, then close it.
Overall, multiprocessing is faster than single processing (5 times faster with 8 CPUs). However, creating a connection is slow: I'd like to create just a few connections and keep them open as long as required.
Since .map() chops ids_list into a number of chunks which it submits to the process pool, would it be possible to share a database connection for all ids in the same chunk/process?
Sample code:
from multiprocessing import Pool
import psycopg2


def create_db_connection():
    conn = psycopg2.connect(database=database,
                            user=user,
                            password=password,
                            host=host)
    return conn


def my_function(item_id):
    conn = create_db_connection()
    # Other CPU-intensive operations are done here
    cur = conn.cursor()
    cur.execute("""
        UPDATE table
        SET
            my_column = 1
        WHERE id = %s;
        """,
        (item_id, ))
    cur.close()
    conn.commit()


if __name__ == '__main__':
    ids_list = []  # Long list of ids
    pool = Pool()  # os.cpu_count() processes
    pool.map(my_function, ids_list)
Thanks for any help you can provide.
You can use the initializer parameter of the Pool constructor.
Set up the DB connection in the initializer function. Maybe pass the connection credentials as parameters.
Have a look at the docs: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
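A minimal sketch of that approach, reusing the names from the question (database, user, password and host are assumed to be defined as in the original script; each worker process gets one connection created by the initializer):
from multiprocessing import Pool
import psycopg2

conn = None  # one connection per worker process, set by the initializer


def init_worker(database, user, password, host):
    global conn
    conn = psycopg2.connect(database=database, user=user,
                            password=password, host=host)


def my_function(item_id):
    # Reuse the worker's connection instead of opening a new one per row.
    with conn.cursor() as cur:
        cur.execute("UPDATE table SET my_column = 1 WHERE id = %s;", (item_id,))
    conn.commit()


if __name__ == '__main__':
    ids_list = []  # Long list of ids
    with Pool(initializer=init_worker,
              initargs=(database, user, password, host)) as pool:
        pool.map(my_function, ids_list)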

How do I insert asynchronously to a log in MongoDB?

I want to use PyMongo as a logger for a Django app.
I don't mind if some inserts into the log table are lost, so I want to send the log entry to MongoDB on another server and continue execution without waiting for confirmation.
I am reading pymongo docs, but it's not clear to me if the inserts in a collection are blocking or not.
I'm thinking of doing this inside a Django model method:
from pymongo import MongoClient
conn = MongoClient('mongoserver', 27017)
db = conn.main
col = db.log
col.insert({"user": "Pedro", "action": "search", "Origin": "Katmandu"}, w=0)
conn.close()
I don't know if the insert is async like that, or whether the connection should be closed or not.
Because you're passing w=0 to insert, the write is unacknowledged: the driver sends the document to the server without waiting for a response, so the call returns almost immediately instead of blocking.
And leave the connection open for best performance.
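With current PyMongo versions the same behaviour can be requested per collection through a write concern; a sketch using insert_one instead of the legacy insert call:
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

conn = MongoClient('mongoserver', 27017)
# w=0 makes writes unacknowledged: the driver does not wait for a server response.
col = conn.main.get_collection('log', write_concern=WriteConcern(w=0))
col.insert_one({"user": "Pedro", "action": "search", "Origin": "Katmandu"})
# Keep conn around (e.g. at module level) rather than closing it per request.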
