I am completely new to this and have never used any sort of parallel processing before. I want to read a huge amount of data (at least 2 million rows) from SQL Server and use parallel processing to speed up the read. Below is my attempt at parallel processing using a concurrent.futures process pool.
class DatabaseWorker(object):
    def __init__(self, connection_string, n, result_queue = []):
        self.connection_string = connection_string
        stmt = "select distinct top %s * from dbo.KrishAnalyticsAllCalls" %(n)
        self.query = stmt
        self.result_queue = result_queue

    def reading(self, x):
        return(x)

    def pooling(self):
        t1 = time.time()
        con = pyodbc.connect(self.connection_string)
        curs = con.cursor()
        curs.execute(self.query)
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            print("Test1")
            future_to_read = {executor.submit(self.reading, row): row for row in curs.fetchall()}
            print("Test2")
            for future in concurrent.futures.as_completed(future_to_read):
                print("Test3")
                read = future_to_read[future]
                try:
                    print("Test4")
                    self.result_queue.append(future.result())
                except:
                    print("Not working")
        print("\nTime take to grab this data is %s" %(time.time() - t1))

df = DatabaseWorker(r'driver={SQL Server}; server=SPROD_RPT01; database=Reporting;', 2*10**7)
df.pooling()
I am not getting any output with my current implementation: "Test1" prints and then nothing else happens. I understand the various examples in the concurrent.futures documentation, but I am unable to apply them here. I would highly appreciate your help. Thank you.
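For reference, a rough sketch of one pattern that is sometimes used instead: give each worker its own connection and have it fetch a slice of the table, rather than submitting one future per row. The key column name (Id), the chunk size, and returning only a row count are assumptions for illustration; the connection string is reused from the question.

import concurrent.futures
import pyodbc

CONN_STR = r'driver={SQL Server}; server=SPROD_RPT01; database=Reporting;'
CHUNK = 250_000  # rows per worker; an arbitrary choice, tune to taste

def read_chunk(offset):
    # Each process opens its own connection; connections and cursors
    # cannot be shared across processes.
    con = pyodbc.connect(CONN_STR)
    curs = con.cursor()
    # OFFSET/FETCH requires an ORDER BY; "Id" is an assumed key column.
    curs.execute(
        "SELECT * FROM dbo.KrishAnalyticsAllCalls "
        "ORDER BY Id OFFSET ? ROWS FETCH NEXT ? ROWS ONLY",
        offset, CHUNK)
    rows = curs.fetchall()
    con.close()
    # Return something small; shipping 2 million full rows back to the
    # parent through pickling is itself expensive.
    return len(rows)

if __name__ == "__main__":
    offsets = range(0, 2_000_000, CHUNK)
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        counts = list(executor.map(read_chunk, offsets))
    print(sum(counts), "rows read")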
I've a function in my code which is as follows:
async def register():
    db = connector.connect(host='localhost', user='root', password='root', database='testing')
    cursor = db.cursor()
    cursor.execute('LOCK TABLES Data WRITE;')
    cursor.execute('SELECT Total_Reg FROM Data;')
    data = cursor.fetchall()
    reg = data[0][0]
    print(reg)
    if reg >= 30:
        print("CLOSED!")
        return
    await asyncio.sleep(1)
    cursor.execute('UPDATE Data SET Total_Reg = Total_Reg + 1 WHERE Id = 1')
    cursor.execute('COMMIT;')
    print("REGISTERED!")
    db.close()
When multiple instances of this register function run at the same time, an unexpected infinite loop occurs and blocks my entire code. Why is that? Also, if it is a deadlock [I assume], why does my program not raise any error? Please tell me why this is happening, and what can be done to prevent it?
A much simpler construct:
db = connector.connect(...)
cursor = db.cursor()
cursor.execute('UPDATE Data SET Total_Reg = Total_Reg + 1 WHERE Id = 1 AND Total_Reg < 30')
db.commit()
if cursor.rowcount:
    print("REGISTERED!")
else:
    print("CLOSED!")
db.close()
So:
- Don't use LOCK TABLES, not until you understand transactions, and then rarely
- Use the SQL itself to enforce the constraints you want
- Check the affected-row count (cursor.rowcount with mysql.connector) to see if any rows were changed
- Don't use sleep statements
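For illustration, the register() coroutine from the question might then collapse to something like this (a sketch; it assumes connector refers to mysql.connector and keeps the same table and column names):

from mysql import connector

async def register():
    db = connector.connect(host='localhost', user='root', password='root', database='testing')
    cursor = db.cursor()
    # The WHERE clause enforces the limit atomically, so there is no
    # LOCK TABLES and no read-then-write race between concurrent callers.
    cursor.execute('UPDATE Data SET Total_Reg = Total_Reg + 1 '
                   'WHERE Id = 1 AND Total_Reg < 30')
    db.commit()
    print("REGISTERED!" if cursor.rowcount else "CLOSED!")
    db.close()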
I am querying SQL Server for a list of fields, both with threading and without threading.
import pyodbc
import datetime
import concurrent.futures

server = 'xx.xxx.xxx.xxx,1433'
database = 'db'
username = 'user'
password = 'password'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database + ';UID='+username+';PWD='+password + ';MARS_Connection=yes' + ';Max_Pool_Size=100000')

filter_list = ["department_id", "employee_id", "city", "country", "state", "zip_cope", "department_name", "employee_name", "employee_experience"]

t1 = datetime.datetime.now()
result_list = []

def query_executor(field):
    try:
        ft1 = datetime.datetime.now()
        cursor = cnxn.cursor()
        result = cursor.execute("""SELECT DISTINCT TOP 1000 [{}] from EMPLOYEE_DETAILS""".format(field))
        print(field)
        result_list1 = [filter_item[0] for filter_item in result if filter_item[0]]
        # print("#############################################")
        return {"name": field, "filter_data": result_list1}
    except Exception as e:
        print(e)
    finally:
        print("#############################################")
        print(datetime.datetime.now()-ft1)
        print("#############################################")
        cursor.close()

# with threading
with concurrent.futures.ThreadPoolExecutor() as executor:
    result = [executor.submit(query_executor, field) for field in filter_list]
    for f in concurrent.futures.as_completed(result):
        result_list.append(f.result())
print(result_list)

t2 = datetime.datetime.now()
print("#############################################")
print('with threading time taken')
print(t2-t1)
print("#############################################")

# without threading
for f in filter_list:
    result_list.append(query_executor(f))
print(result_list)

t2 = datetime.datetime.now()
print("#############################################")
print('without threading time taken')
print(t2-t1)
print("#############################################")
While running, I comment out one or the other to see the time taken with and without threading, but I don't see much difference. In fact, the threaded version is sometimes slower.
Am I doing something wrong? How can I get a performance boost? The filter_list can grow even bigger at times, which may lead to slow responses.
Thanks in advance!
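One variation that may be worth timing (a sketch only; it reuses the server, database, username, password, and filter_list names from the question, and max_workers=8 is an arbitrary choice): all of the threads above share the single cnxn, and queries issued on one pyodbc connection are generally not executed in parallel, so opening a connection per task is a common thing to compare against.

import concurrent.futures
import pyodbc

# Same connection string that was passed to pyodbc.connect() above.
conn_str = ('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server + ';DATABASE=' + database +
            ';UID=' + username + ';PWD=' + password)

def query_one_field(field):
    # Each task opens and closes its own connection instead of sharing cnxn,
    # so the queries are not funnelled through a single connection.
    conn = pyodbc.connect(conn_str)
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT DISTINCT TOP 1000 [{}] FROM EMPLOYEE_DETAILS".format(field))
        return {"name": field, "filter_data": [row[0] for row in cursor.fetchall() if row[0]]}
    finally:
        conn.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(query_one_field, filter_list))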
I have a script used to migrate data from SQLite to Postgres. I just use a for loop to transfer tables one by one. Now I want to experiment with transferring multiple tables concurrently using threads, multiprocessing, or asyncio to speed up the program and compare the runtimes of those approaches.
How would you do it in one of those ways?
Here is my script:
import psycopg2, sqlite3, sys
import time
import multiprocessing

sqdb = "C://Users//duongnb//Desktop//Python//SqliteToPostgreFull//testmydb6.db"
sqlike = "table"
pgdb = "testmydb11"
pguser = "postgres"
pgpswd = "1234"
pghost = "127.0.0.1"
pgport = "5432"

consq = sqlite3.connect(sqdb)
cursq = consq.cursor()

tabnames = []
print()
cursq.execute('SELECT name FROM sqlite_master WHERE type="table" AND name LIKE "%table%";')
tabgrab = cursq.fetchall()
for item in tabgrab:
    tabnames.append(item[0])
print(tabgrab)

def copyTable(table):
    print(table)
    cursq.execute("SELECT sql FROM sqlite_master WHERE type='table' AND name = ?;", (table,))
    create = cursq.fetchone()[0]
    cursq.execute("SELECT * FROM %s;" %table)
    rows = cursq.fetchall()
    colcount = len(rows[0])
    pholder = '%s,'*colcount
    newholder = pholder[:-1]
    try:
        conpg = psycopg2.connect(database=pgdb, user=pguser, password=pgpswd,
                                 host=pghost, port=pgport)
        curpg = conpg.cursor()
        curpg.execute("DROP TABLE IF EXISTS %s;" %table)
        create = create.replace("AUTOINCREMENT", "")
        curpg.execute(create)
        curpg.executemany("INSERT INTO %s VALUES (%s);" % (table, newholder), rows)
        conpg.commit()
        if conpg:
            conpg.close()
    except psycopg2.DatabaseError as e:
        print('Error %s' % e)
        sys.exit(1)
    finally:
        print("Complete")

consq.close()

if __name__ == "__main__":
    start_time = time.time()

    processes = []
    for table in tabnames:
        p = multiprocessing.Process(target = copyTable, args = (table,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    print("All processes finished.")

    duration = time.time() - start_time
    print(f"Duration {duration} seconds")
You should put the body of your for table in tabnames loop into a function, say copyTable. Then you can use the multiprocessing package to parallelize your code. It should look something like this:
processes = []
for table in tabnames:
    p = multiprocessing.Process(target = copyTable, args = (table,))
    processes.append(p)
    p.start()
for p in processes:
    p.join()
print("All processes finished.")
But you can speed up your code even more if you use a COPY (https://www.postgresql.org/docs/current/sql-copy.html) instead of the many INSERT commands.
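As a rough illustration of the COPY route (a sketch only; it assumes the rows fetched from SQLite can be serialised as CSV, and reuses the curpg cursor from the script above):

import csv
import io

def copy_rows(curpg, table, rows):
    # Serialise the fetched rows to CSV in memory, then stream them to
    # Postgres with COPY, which is typically much faster than many INSERTs.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    # Caveat: None values are written as empty strings by the csv module,
    # so columns that must stay NULL need explicit handling.
    curpg.copy_expert("COPY %s FROM STDIN WITH (FORMAT csv)" % table, buf)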
Instead of the multiprocessing module, you can also use the threading module, which works quite similarly; then you have threads instead of processes. Because of the global interpreter lock, I would expect worse performance with this.
I am using a multiprocessing Pool to hit an API and store the responses in MySQL, for around 5k queries. The program never exits, even after all the work is done. I found that this happens because the worker processes are still open after execution.
I tried using join() and terminate(). Nothing works.
import json
import pymysql
import urllib2
from multiprocessing import Pool

# host, user, passwd, db, redirection_service, getFullUrl and db_connection
# are defined elsewhere in the project.
failure = 0

def fetchAndStore(query):
    conn = pymysql.connect(host=host,
                           user=user,
                           passwd=passwd,
                           db=db)
    x = conn.cursor()
    full_url = getFullUrl(redirection_service, query)
    response = urllib2.urlopen(full_url)
    html = response.read()
    data = json.loads(html)
    if data is not None:
        store = json.dumps(data["RESPONSE"]["redirectionStore"])
    else:
        store = 'search.flipkart.com'
    try:
        stmt = "INSERT INTO golden_response(`store`) VALUES ('%s')" % (store)
        x.execute(stmt)
        conn.commit()
        conn.close()
    except Exception, e:
        global failure
        failure += 1
        print e
    return "Done"

#main
queries = db_connection()
pool = Pool(processes=20, maxtasksperchild=5)
results = []
for query in queries:
    results.append(pool.apply_async(fetchAndStore, (query,)))
pool.close()
pool.join()
print "completed"
In the ideal situation, the process should exit and print "completed".
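One way to narrow down where it hangs (a debugging sketch in the same Python 2 style as the snippet above; the 120-second timeout is an arbitrary choice): call get() on each AsyncResult before join(), so an exception raised inside a worker, or a single call that never returns, shows up in the parent instead of leaving the pool apparently stuck.

pool.close()
for res in results:
    try:
        # get() re-raises any exception from the worker and raises
        # multiprocessing.TimeoutError if the call never finishes.
        print res.get(timeout=120)
    except Exception, e:
        print "worker failed or timed out:", e
pool.join()
print "completed"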
I am curious to know if this is the correct way of using cx_Oracle with a context manager and connection pooling using DRCP.
import cx_Oracle
import threading
import time

def get_connection():
    connection = cx_Oracle.connect(user='username', password='password', dsn='mydsn_name/service_name:pooled')
    return connection

def myfunc():
    with get_connection() as conn:
        cursor = conn.cursor()
        for _ in range(10):
            cursor.execute("select * from mytable")
            val = cursor.fetchone()
            time.sleep(60)
            print("Thread", threading.current_thread().name, "fetched sequence =", val)

results = []
for thread in range(0, 10):
    current_thread = threading.Thread(name = f'Thread {thread}', target = myfunc)
    results.append(current_thread)
    current_thread.start()
print('Started All Threads')

for thread in results:
    thread.join()
print("All done!")
I am not sure if I am doing the right thing here. I have no idea how to confirm that the connection is being returned to the connection pool, and that each thread is not opening a brand-new connection to the database, although the docs on cx_Oracle seem to indicate I am on the right path.
You'll get the most benefit if you also use a cx_Oracle connection pool at the same time as DRCP. You need to set cclass with DRCP, otherwise you will lose its benefits. You can then decide what level of session reuse (the 'purity') to use. Check the cx_Oracle tutorial. From solutions/connect_pool2.py:
pool = cx_Oracle.SessionPool("pythonhol", "welcome", "localhost/orclpdb:pooled",
                             min = 2, max = 5, increment = 1, threaded = True)

def Query():
    con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_SELF)
    #con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_NEW)
    cur = con.cursor()
    for i in range(4):
        cur.execute("select myseq.nextval from dual")
        seqval, = cur.fetchone()
There are V$ views like V$CPOOL_STATS you can query to check whether DRCP is being used. Links to some resources are in https://oracle.github.io/node-oracledb/doc/api.html#drcp
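For example, querying that view through the same pool might look like this (a sketch; it assumes the pool object from the snippet above and that the connecting user is allowed to select from V$CPOOL_STATS):

con = pool.acquire()
cur = con.cursor()
cur.execute("select * from v$cpool_stats")
cols = [d[0] for d in cur.description]
for row in cur:
    # NUM_HITS climbing faster than NUM_MISSES suggests pooled sessions
    # are actually being reused.
    print(dict(zip(cols, row)))
pool.release(con)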