I would like to have two processes: one that collects data and stores it in an SQLite database, and one that uses the data.
I'm trying to pass a handle to SQLite to the collecting process, but I'm getting an error. Below is my simplified code:
import sqlite3
import multiprocessing
import time

def do(db_handler):
    print("start inserting data")
    for i in range(20):
        time.sleep(1)
        db_handler.execute("insert into temp_table (f1, f2) Values (?,?)", (i, i*2))

if __name__ == "__main__":
    connection = sqlite3.connect("temp_db.db", check_same_thread=False)
    dBhandler = connection.cursor()
    p = multiprocessing.Process(target=do, args=(dBhandler,))
    p.start()

    for i in range(20):
        time.sleep(1)
        result = dBhandler.execute("SELECT * FROM tasks")
        rows = result.fetchall()
        print("i =", i)
        for row in rows:
            print(row)
        print("============")
I'm receiving an error: TypeError: Can't pickle sqlite3.cursor objects
Any idea how to solve it?
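sqlite3 connection and cursor objects cannot be pickled, so they cannot be passed to another process. The usual workaround is to pass the database path instead and let the child process open its own connection. A minimal sketch based on the code above (not tested against your schema):

import sqlite3
import multiprocessing
import time

def do(db_path):
    # each process opens its own connection; sqlite3 connections/cursors can't be pickled
    connection = sqlite3.connect(db_path)
    db_handler = connection.cursor()
    print("start inserting data")
    for i in range(20):
        time.sleep(1)
        db_handler.execute("insert into temp_table (f1, f2) values (?, ?)", (i, i * 2))
        connection.commit()  # make the new rows visible to the other process
    connection.close()

if __name__ == "__main__":
    p = multiprocessing.Process(target=do, args=("temp_db.db",))
    p.start()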
I have a script that migrates data from SQLite to Postgres. I just use a for loop to transfer the tables one by one. Now I want to experiment with transferring multiple tables concurrently, using threads, multiprocessing, or asyncio, to speed the program up and compare the runtimes of those approaches.
How would I do that with one of those approaches?
Here is my script:
import psycopg2, sqlite3, sys
import time
import multiprocessing

sqdb = "C://Users//duongnb//Desktop//Python//SqliteToPostgreFull//testmydb6.db"
sqlike = "table"
pgdb = "testmydb11"
pguser = "postgres"
pgpswd = "1234"
pghost = "127.0.0.1"
pgport = "5432"

consq = sqlite3.connect(sqdb)
cursq = consq.cursor()

tabnames = []
print()
cursq.execute('SELECT name FROM sqlite_master WHERE type="table" AND name LIKE "%table%";')
tabgrab = cursq.fetchall()
for item in tabgrab:
    tabnames.append(item[0])
print(tabgrab)

def copyTable(table):
    print(table)
    cursq.execute("SELECT sql FROM sqlite_master WHERE type='table' AND name = ?;", (table,))
    create = cursq.fetchone()[0]
    cursq.execute("SELECT * FROM %s;" % table)
    rows = cursq.fetchall()
    colcount = len(rows[0])
    pholder = '%s,' * colcount
    newholder = pholder[:-1]

    try:
        conpg = psycopg2.connect(database=pgdb, user=pguser, password=pgpswd,
                                 host=pghost, port=pgport)
        curpg = conpg.cursor()
        curpg.execute("DROP TABLE IF EXISTS %s;" % table)
        create = create.replace("AUTOINCREMENT", "")
        curpg.execute(create)
        curpg.executemany("INSERT INTO %s VALUES (%s);" % (table, newholder), rows)
        conpg.commit()
        if conpg:
            conpg.close()
    except psycopg2.DatabaseError as e:
        print('Error %s' % e)
        sys.exit(1)
    finally:
        print("Complete")

consq.close()
if __name__ == "__main__":
    start_time = time.time()

    processes = []
    for table in tabnames:
        p = multiprocessing.Process(target=copyTable, args=(table,))  # args must be a tuple
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    print("All processes finished.")

    duration = time.time() - start_time
    print(f"Duration {duration} seconds")
You should put the body of the for table in tabnames loop into a function, say copyTable. Then you can use the multiprocessing package to parallelize your code. It should look something like this:
processes = []
for table in tabnames:
    p = multiprocessing.Process(target=copyTable, args=(table,))  # note the trailing comma: args must be a tuple
    processes.append(p)
    p.start()
for p in processes:
    p.join()
print("All processes finished.")
But you can speed up your code even more if you use COPY (https://www.postgresql.org/docs/current/sql-copy.html) instead of the many INSERT commands.
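For illustration, a rough sketch of what a COPY-based load could look like with psycopg2's copy_expert and an in-memory CSV buffer (a sketch, not tested against your data):

import csv, io

def copy_rows(curpg, table, rows):
    # serialize the fetched rows into an in-memory CSV buffer and load it in one COPY round trip
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    # empty CSV fields arrive as empty strings, not NULLs; add a NULL option if you need that
    curpg.copy_expert("COPY %s FROM STDIN WITH (FORMAT csv)" % table, buf)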
Instead of the multiprocessing module, you can also use the threading module, which works very similarly; you then get threads instead of processes. Because of the global interpreter lock (GIL) I would expect worse performance with this.
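A minimal threading variant of the same loop (same copyTable function, threads instead of processes) could look like this:

import threading

threads = [threading.Thread(target=copyTable, args=(table,)) for table in tabnames]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("All threads finished.")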
The code below works on Windows, but on Linux it hangs:
from impala.dbapi import connect
from multiprocessing import Pool

conn = connect(host='172.16.12.12', port=10000, user='hive', password='hive', database='test', auth_mechanism='PLAIN')
cur = conn.cursor()

def test_hive(a):
    cur.execute('select {}'.format(a))
    tab_cc = cur.fetchall()
    tab_cc = tab_cc[0][0]
    print(a, tab_cc)

if __name__ == '__main__':
    pool = Pool(processes=8)
    alist = [1, 2, 3]
    for i in range(len(alist)):
        pool.apply_async(test_hive, str(i))
    pool.close()
    pool.join()
When I change alist=[1,2,3] to alist=[1], it works on Linux.
I see two possible causes for this behavior:
an exception raised in test_hive in the context of a forked subprocess
a deadlock caused by the fact that fork does not copy threads from the parent and/or the fact that mutexes are copied in the state they have when the fork call is executed
To check for exceptions, add return tab_cc to the end of your test_hive function and gather the results returned by the pool:
if __name__ == '__main__':
    pool = Pool(processes=8)
    alist = [1, 2, 3]
    results = []
    for i in range(len(alist)):
        results.append(pool.apply_async(test_hive, str(i)))
    pool.close()
    pool.join()
    for result in results:
        try:
            print(result.get())
        except Exception as e:
            print("{}: {}".format(type(e).__name__, e))
As for the threads: I did a quick search through the impala repo, and it seems they somehow play a role around the usage of thrift. I'm not sure whether Python's threading module can actually see threads that originate from that library. You might try print(multiprocessing.current_process(), threading.enumerate()), both at module level (e.g. right after cur = conn.cursor()) and at the beginning of the test_hive function, and see whether the _MainProcess(MainProcess, started) shows a longer list of active threads than the ForkProcess(ForkPoolWorker-<worker#>, started daemon) workers do.
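Dropped into the code from the question, that diagnostic would look roughly like this (a sketch; it only adds the two print calls):

import multiprocessing, threading
from impala.dbapi import connect

conn = connect(host='172.16.12.12', port=10000, user='hive', password='hive', database='test', auth_mechanism='PLAIN')
cur = conn.cursor()
print(multiprocessing.current_process(), threading.enumerate())  # threads alive in the parent

def test_hive(a):
    print(multiprocessing.current_process(), threading.enumerate())  # threads alive in each forked worker
    cur.execute('select {}'.format(a))
    return cur.fetchall()[0][0]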
As for a potential solution: I somewhat suspect that creating conn and cur at the module level is the culprit; all children work with a copy of those two.
Try moving these two lines to the beginning of test_hive, so that each process creates a connection and a cursor of its own:
conn = connect(host='172.16.12.12', port=10000, user='hive', password='hive', database='test',auth_mechanism='PLAIN')
cur = conn.cursor()
from impala.dbapi import connect
import time, datetime, sys, re
import psycopg2 as pg
today = datetime.date.today()
from multiprocessing import Pool

def test_hive(a):
    conn = connect(host='172.16.12.12', port=10000, user='hive', password='hive', database='test', auth_mechanism='PLAIN')
    cur = conn.cursor()
    #print(a)
    cur.execute('select {}'.format(a))
    tab_cc = cur.fetchall()
    tab_cc = tab_cc[0][0]
    return tab_cc

if __name__ == '__main__':
    pool = Pool(processes=8)
    alist = [1, 2, 4, 4, 4, 4, 5, 3]
    results = []
    for i in range(len(alist)):
        results.append(pool.apply_async(test_hive, str(i)))
    pool.close()
    pool.join()
    for result in results:
        try:
            print(result.get())
        except Exception as e:
            print("{}: {}".format(type(e).__name__, e))
Moving these two lines into test_hive worked:
conn = connect(host='172.16.12.12', port=10000, user='hive', password='hive', database='test',auth_mechanism='PLAIN')
cur = conn.cursor()
I am curious to know whether this is the correct way of using cx_Oracle with context managers and connection pooling via DRCP.
import cx_Oracle
import threading
import time

def get_connection():
    connection = cx_Oracle.connect(user='username', password='password', dsn='mydsn_name/service_name:pooled')
    return connection

def myfunc():
    with get_connection() as conn:
        cursor = conn.cursor()
        for _ in range(10):
            cursor.execute("select * from mytable")
            val = cursor.fetchone()
            time.sleep(60)
            print("Thread", threading.current_thread().name, "fetched sequence =", val)

results = []
for thread in range(0, 10):
    current_thread = threading.Thread(name=f'Thread {thread}', target=myfunc)
    results.append(current_thread)
    current_thread.start()
print('Started All Threads')

for thread in results:
    thread.join()
print("All done!")
I am not sure if I am doing the right thing here.
I have no idea how to confirm that connections are being returned to the connection pool, and that each thread is not opening a brand-new connection to the database.
The docs on cx_Oracle seem to indicate I am on the right path, though.
You'll get the most benefit if you also use a cx_Oracle connection pool at the same time as DRCP. You need to set cclass with DRCP, otherwise you lose its benefits. You can then decide what level of session reuse (the 'purity') to use. Check the cx_Oracle tutorial. From solutions/connect_pool2.py:
pool = cx_Oracle.SessionPool("pythonhol", "welcome", "localhost/orclpdb:pooled",
                             min=2, max=5, increment=1, threaded=True)

def Query():
    con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_SELF)
    #con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_NEW)
    cur = con.cursor()
    for i in range(4):
        cur.execute("select myseq.nextval from dual")
        seqval, = cur.fetchone()
There are V$ views like V$CPOOL_STATS you can query to check whether DRCP is being used. Links to some resources are in https://oracle.github.io/node-oracledb/doc/api.html#drcp
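For example, something along these lines from the same pool as above (a sketch; it assumes the connecting user is allowed to read the V$ views):

con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_SELF)
cur = con.cursor()
cur.execute("select * from v$cpool_stats")  # growing hit counts indicate pooled servers are being reused
for row in cur.fetchall():
    print(row)
pool.release(con)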
I am extremely new to this and have never used any sort of parallel processing before. I want to read a huge amount of data (at least 2 million rows) from SQL Server and want to use parallel processing to speed up the reading. Below is my attempt, using a concurrent.futures process pool.
import time
import pyodbc
import concurrent.futures

class DatabaseWorker(object):
    def __init__(self, connection_string, n, result_queue=[]):
        self.connection_string = connection_string
        stmt = "select distinct top %s * from dbo.KrishAnalyticsAllCalls" % (n)
        self.query = stmt
        self.result_queue = result_queue

    def reading(self, x):
        return(x)

    def pooling(self):
        t1 = time.time()
        con = pyodbc.connect(self.connection_string)
        curs = con.cursor()
        curs.execute(self.query)
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            print("Test1")
            future_to_read = {executor.submit(self.reading, row): row for row in curs.fetchall()}
            print("Test2")
            for future in concurrent.futures.as_completed(future_to_read):
                print("Test3")
                read = future_to_read[future]
                try:
                    print("Test4")
                    self.result_queue.append(future.result())
                except:
                    print("Not working")
        print("\nTime take to grab this data is %s" % (time.time() - t1))

df = DatabaseWorker(r'driver={SQL Server}; server=SPROD_RPT01; database=Reporting;', 2*10**7)
df.pooling()
I am not getting any output with my current implementation: "Test1" prints and that's it; nothing else happens. I understand the various examples in the concurrent.futures documentation, but I am unable to apply them here. I would highly appreciate your help. Thank you.
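One common pattern, in the spirit of the answers above, is to let each worker open its own connection and fetch one slice of the table, instead of fetching everything in the parent and shipping every row through the pool. A rough sketch (the id key column, the chunk size, and the row count are hypothetical placeholders; OFFSET/FETCH paging needs SQL Server 2012 or later):

import pyodbc
import concurrent.futures

CONN_STR = r'driver={SQL Server}; server=SPROD_RPT01; database=Reporting;'
CHUNK = 250000  # rows per worker task (placeholder)

def read_chunk(offset):
    # each process opens its own connection; pyodbc objects cannot be shared across processes
    con = pyodbc.connect(CONN_STR)
    curs = con.cursor()
    curs.execute(
        "select * from dbo.KrishAnalyticsAllCalls order by id "
        "offset ? rows fetch next ? rows only", offset, CHUNK)
    rows = [tuple(r) for r in curs.fetchall()]  # plain tuples pickle cleanly
    con.close()
    return rows

if __name__ == "__main__":
    offsets = range(0, 2 * 10**6, CHUNK)
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        results = []
        for chunk in executor.map(read_chunk, offsets):
            results.extend(chunk)
    print(len(results), "rows read")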
This Python script does some data processing on a product list coming from an sqlite table. The commented-out for loop works as expected, but the multiprocessing loop does not work at all: I can see the processes being fired, but then the script just halts. Any help?
import sqlite3 as lite
import sys
import pandas as pd
import datetime
from datetime import date
import time
from Levenshtein import *
import multiprocessing as mp
import copy

def getProducts():
    con = None
    try:
        con = lite.connect('pm.db', check_same_thread=False)
        con.row_factory = lite.Row
        cur = con.cursor()
        cur.execute("SELECT * FROM products")
        rows = cur.fetchall()
    except lite.Error, e:
        print "Error %s:" % e.args[0]
        sys.exit(1)
    finally:
        if con:
            con.close()
    return rows

def test_mp(row):
    print row

dictArray = []
counter = 0
rows = getProducts()

#for row in rows:
#    counter += 1
#    print 'product {count} from {max}'.format(count=counter, max=len(rows))
#    dictArray.extend(test_mp(row))

pool = mp.Pool(10)
for ret in pool.imap(test_mp, rows):
    print 'Done processing product'
    dictArray.extend(ret)
pool.terminate()
This is how I fixed it. Apparently the array of sqlite3.Row objects does not play well inside the pool.imap function, so I used another row factory that creates a plain dict.
def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d
in getProducts:
con.row_factory = dict_factory
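With that factory each fetched row is a plain dict, which pickles cleanly for pool.imap, and test_mp can access columns by name and return something for dictArray.extend to collect; for example (the name column is a hypothetical placeholder):

def test_mp(row):
    # row is now a plain dict, e.g. {'id': 1, 'name': 'foo', ...}
    print row['name']
    return [row]  # return a list so dictArray.extend(ret) adds one dict per product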
credits go to How can I get dict from sqlite query?