My Python code extracts information from XML files and loads it into a database.
The files are named with numbers (11582.xml, 5300593.xml, etc.) and there are around 1 million of them.
I have built the code and it works fine, but it does not use the full processor/memory/disk capacity; the processor stays at about 20% at most.
I asked here and elsewhere and was told that I have to use multithreading to use the full capacity, so I changed my script to adopt it.
I did that, but it still does not run at full capacity.
What did I do wrong, and how do I fix it?
My code:
import pymssql
import pyodbc
import pandas as pd
import thread
import glob
import xml.etree.ElementTree as ET

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=Server123;'
                      'Database=NLP;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()

def ExtractData(ThreadName):
    for file in glob.glob("H:\\datas_Output\\xmldata\\" + ThreadName + "*.xmi"):
        filename = file[24:-8]
        tree = ET.parse(file)
        root = tree.getroot()
        for Tag in ['Kitkat', 'Snickers', 'Bounty']:
            iTag = '{textsem.ecore}' + Tag
            for country in root.findall(iTag):
                XMIID = country.get('{XMI}id')
                sofa = country.get('sofa')
                cursor.execute("INSERT INTO Tags (filename,tag,xmiid,sofa) VALUES (?,?,?,?)", filename, Tag, XMIID, sofa)

try:
    thread.start_new_thread(ExtractData, ("1",))
    thread.start_new_thread(ExtractData, ("2",))
except:
    print("Error: unable to start thread")

conn.commit()
Why not generate the list of files and then process them with a pool?
import multiprocessing

def ExtractData(file):
    filename = file[24:-8]
    tree = ET.parse(file)
    root = tree.getroot()
    for Tag in ['Kitkat', 'Snickers', 'Bounty']:
        iTag = '{textsem.ecore}' + Tag
        for country in root.findall(iTag):
            XMIID = country.get('{XMI}id')
            sofa = country.get('sofa')
            # note: each worker process needs its own connection/cursor;
            # the parent's pyodbc connection is not shared across processes
            cursor.execute("INSERT INTO Tags (filename,tag,xmiid,sofa) VALUES (?,?,?,?)", filename, Tag, XMIID, sofa)

# if there are millions of files, you might want an iterator
filename_iterator = glob.iglob("H:/datas_Output/xmldata/**/*.xmi", recursive=True)

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
    results = pool.map(ExtractData, filename_iterator)
Python multithreading isn't "real" parallelism. CPython has a "Global Interpreter Lock" (GIL) that only lets one thread execute Python bytecode at a time, so threads do not spread work across separate processor cores. The main benefit is that while one thread is blocked on external I/O, the other threads can keep doing something.
However, other runtimes such as the JVM (Java, Kotlin, etc.) do support truly parallel threads.
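To make the distinction concrete, here is a minimal sketch (not from the original post) using the standard concurrent.futures module; cpu_bound and io_bound are illustrative stand-ins for XML parsing and for network/database waits:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # pure Python computation: the GIL serialises this across threads,
    # so only a process pool actually uses multiple cores
    return sum(i * i for i in range(n))

def io_bound(seconds):
    # waiting on I/O (network, disk, database) releases the GIL,
    # so a thread pool already gives real concurrency here
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    # CPU-bound work: use processes
    with ProcessPoolExecutor() as procs:
        print(list(procs.map(cpu_bound, [10**6] * 4)))
    # I/O-bound work: threads are enough
    with ThreadPoolExecutor(max_workers=4) as threads:
        print(list(threads.map(io_bound, [1, 1, 1, 1])))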
Related
Issue in Brief
I have recently started using an Azure server running Ubuntu 20.04. My workflow involves running around 50 Python scripts 24/7, and they are operationally very important to my team. When I first start the scripts, RAM usage is nominal: about 12 of 16 GB remains free after all of them are running.
But the scripts' RAM usage slowly increases to the point where the system starts killing them to free up main memory.
I have no idea what the issue is. My scripts are pretty simple, and I don't know where or how to resolve this. Can anyone give me some guidelines on how to approach it?
Comments
I am using Python 3.10. The scripts download data from a server and upload it to my MySQL database. I can provide the code if anyone asks for it.
Let me know if I can provide anything else to make this easier for you.
Code files
I am uploading the code which is taking up the maximum memory according to htop.
dcx_trades.py
import json
import time
import datetime
from mysql_connector import SQLConnector
import pandas as pd
import sys
import os
import signal
from contextlib import contextmanager

def raise_timeout(signum, frame):
    print("timeout")
    raise Exception("timouttt")

@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        # exit()
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)

from_db = {'user': 'db_user_name', 'password': 'password', 'host': 'host_url', 'database': 'crypto'}
s = SQLConnector('crypto', from_db)
dict_ = {'timestamp': '', "exchange": "coindcx", "symbol": "", 'error_msg': ''}
df = pd.DataFrame(columns=["exchange_id", "timestamp", "symbol", "price", "quantity", "exchange", "turnover"])
df.set_index('symbol')

while True:
    try:
        data = pd.read_csv('dcx_trades.csv')
        trades = data.to_dict(orient='records')
        data = data.iloc[0:0]
        if len(trades):
            for trade in trades:
                utc_time = datetime.datetime.fromtimestamp(trade['T']/1000, datetime.timezone.utc)
                local_time = utc_time.astimezone()
                datetime_formatted = local_time.strftime("%Y-%m-%d %H:%M:%S")
                dict_['timestamp'] = datetime_formatted
                dict_["exchange_id"] = 12345
                dict_["symbol"] = trade['s']
                dict_['price'] = trade['p']
                dict_['quantity'] = trade['q']
                dict_['turnover'] = float(trade['p'])*float(trade['q'])
                dict_['error'] = '0'
                df = df.append(dict_, ignore_index=True)
                print(df)
            df_new = df
            df_new = df_new.to_dict(orient='records')
            df = df.iloc[0:0]
            data.to_csv('dcx_trades.csv', mode='w', index=False)
            if len(df_new):
                with timeout(60):
                    try:
                        print(datetime.datetime.now())
                        s.add_multipletrades(df_new)
                        print(datetime.datetime.now())
                    except Exception as e:
                        print(e)
                        os.execv(sys.executable, ['python'] + sys.argv)
                        print("error_time:", datetime.datetime.now())
    except Exception as e:
        data = pd.read_csv('dcx_trades.csv')
        data = data.loc[1:]
        data.to_csv('dcx_trades.csv', index=False)
        pass
Objective of the file:
First, s = SQLConnector('crypto', from_db) makes the connection to the DB. All the database-related functions are defined in another file named mysql_connector.py, which is imported at the beginning.
The code then reads from the CSV file dcx_trades.csv and preprocesses the data to match the database table. Before uploading the data to the DB it clears the CSV file so as to remove duplicates. timeout(60) is used because the script sometimes gets stuck while writing to the DB and then needs to be restarted, which is what the timeout() context manager triggers.
All of those transforms can easily be done in SQL --
LOAD DATA into a temp table with whatever columns match the columns and datatypes in the file
Run a single INSERT .. SELECT .. to copy the values over, doing whatever expressions are needed (such as p * q).
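As a rough sketch of that approach (the staging and target table names, column types, and credentials below are illustrative; only the CSV columns T, s, p, q and the target column list come from the question):

import pymysql

# local_infile must also be enabled on the MySQL server for LOAD DATA LOCAL
conn = pymysql.connect(host='host_url', user='db_user_name', password='password',
                       database='crypto', local_infile=True)
with conn.cursor() as cur:
    # 1. LOAD DATA into a temp table whose columns match the file
    cur.execute("""
        CREATE TEMPORARY TABLE trades_staging (
            T BIGINT, s VARCHAR(32), p DECIMAL(18, 8), q DECIMAL(18, 8)
        )""")
    cur.execute("""
        LOAD DATA LOCAL INFILE 'dcx_trades.csv'
        INTO TABLE trades_staging
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
        (T, s, p, q)""")
    # 2. one INSERT .. SELECT does the transforms (timestamp conversion, p * q)
    cur.execute("""
        INSERT INTO trades (exchange_id, `timestamp`, symbol, price, quantity, exchange, turnover)
        SELECT 12345, FROM_UNIXTIME(T / 1000), s, p, q, 'coindcx', p * q
        FROM trades_staging""")
conn.commit()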
I have a function that runs over a number of tables in an SQLite database. It reads the data, does some stuff, and then saves the result to a CSV file.
from __future__ import division
import sqlalchemy as sql
import pandas as pd
import os
import multiprocessing as mp

dst = r'H:\Results'
eng = sql.create_engine('sqlite:///Y:/Database/some.db')  # database on external drive
con = eng.connect()

def get_res(tab_name, lock):
    query_tr = """SELECT t, p, size, event, direction \
                  FROM {tb} WHERE event IN (4, 5)""".format(tb=tab_name)
    df_tr = pd.read_sql_query(query_tr, con)
    # do some stuff with df_tr ...
    with lock:
        df_tr.to_csv(os.path.join(dst, 'my_res.csv'), mode='a')
    return 1
I do this in parallel like so
if __name__ == '__main__':
    workers = mp.cpu_count()
    tables = sql.inspect(eng).get_table_names()
    man = mp.Manager()
    pool = mp.Pool(workers)
    lock = man.Lock()
    res = {tab_name: pool.apply_async(get_res, (tab_name, lock)) for tab_name in tables}
    pool.close()
    pool.join()
    man.shutdown()
The strange thing is that the call man.shutdown() returns a Windows Error 5: Access Denied when the function reads the data from a database that is on an external hard disk drive, but works absolutely fine when the database is on the computer's hard drive. The function get_res goes through correctly without any error and does what it should do.
I know that this is not much to go on, but are there any suggestions why that could be the case?
My question basically is: is there a best-practice approach to DB interaction, and am I doing something silly/wrong below that is costing processing time?
My program pulls data from a website and writes to a SQL database. Speed is very important and I want to be able to refresh the data as quickly as possible. I've tried a number of ways and I feel it's still way too slow, i.e. it could be much better with a better approach/design for interacting with the db, and I'm sure I'm making all sorts of mistakes. I can download the data to memory very quickly, but the writes to the db take much, much longer.
The 3 main approaches I've tried are:
1. Threads that pull the data and populate a list of SQL commands; when the threads complete, run the SQL in the main thread.
2. Threads that pull data and push to SQL (as per the code below).
3. Threads that pull data and populate a queue, with separate thread(s) polling the queue and pushing to the DB.
Code as below:
import re
import MySQLdb as mydb

class DatabaseUtility():
    def __init__(self):
        """set db parameters"""

    def updateCommand(self, cmd):
        """run SQL commands and return number of matched rows"""
        try:
            self.cur.execute(cmd)
            return int(re.search('Rows matched: (\d+)', self.cur._info).group(1))
        except Exception, e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0

    def addCommand(self, cmd):
        """write SQL command to db"""
        try:
            self.cur.execute(cmd)
            return self.cur.rowcount
        except Exception, e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0
I've created a class that instantiates a db connection and is called as below:
from Queue import Queue
from threading import Thread
import urllib2
import json
from databasemanager import DatabaseUtility as dbU
from datalinks import getDataLink, allDataLinks

numThreads = 3
q = Queue()
dbu = dbU()

class OddScrape():
    def __init__(self, name, q):
        self.name = name
        self.getOddsData(self.name, q)

    def getOddsData(self, i, q):
        """Worker thread - parse each datalink and update / insert to db"""
        while True:
            # get datalink, create db connection
            self.dbu = dbU()
            matchData = q.get()
            # load data link using urllib2 and do a bunch of stuff
            # to parse the data to the required format
            # try to update in db and insert if not found
            sql = "sql to update %s" % (params)
            update = self.dbu.updateCommand(sql)
            if update < 1:
                sql = "sql to insert" % (params)
                self.dbu.addCommand(sql)
            q.task_done()
            self.dbu.dbConClose()
            print eventlink

def threadQ():
    # set up some threads
    for i in range(numThreads):
        worker = Thread(target=OddScrape, args=(i, q,))
        worker.start()

    # get urldata for all matches required and add to q
    matchids = dbu.runCommand("sql code to determine scope of urls")
    for match in matchids:
        sql = "sql code to get url data %s" % match
        q.put(dbu.runCommand(sql))

    q.join()
I've also added an index to the table I'm writing to, which seemed to help a tiny bit but not noticeably:
CREATE INDEX `idx_oddsdata_bookid_datalinkid`
ON `dbname`.`oddsdata` (bookid, datalinkid) COMMENT '' ALGORITHM DEFAULT LOCK DEFAULT;
Multiple threads imply multiple connections. Although getting a connection is "fast" in MySQL, it is not instantaneous. I do not know the relative speed of getting a connection versus running a query, but I doubt your multi-threaded idea will win.
Could you show us examples of the actual queries (SQL, not Python code) you need to run? We may have suggestions on combining queries, improved indexes, etc. Please provide SHOW CREATE TABLE, too. (You mentioned a CREATE INDEX, but it is useless out of context.)
It looks like you are doing a multi-step process that could be collapsed into INSERT ... ON DUPLICATE KEY UPDATE ....
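As a rough sketch of that collapse (only the oddsdata table and the bookid/datalinkid columns come from the question; the odds column, the sample values, and the credentials are made up, and the upsert needs a UNIQUE index on (bookid, datalinkid) rather than a plain one):

import MySQLdb as mydb

conn = mydb.connect(host='localhost', user='user', passwd='password', db='dbname')
cur = conn.cursor()

sql = """INSERT INTO oddsdata (bookid, datalinkid, odds)
         VALUES (%s, %s, %s)
         ON DUPLICATE KEY UPDATE odds = VALUES(odds)"""

# one statement per row replaces the update-then-insert pair: it inserts when
# (bookid, datalinkid) is new, otherwise it updates the existing row
cur.execute(sql, (1, 42, 2.5))

# many rows can also be sent as a single batch
rows = [(1, 43, 1.9), (2, 42, 3.1)]
cur.executemany(sql, rows)
conn.commit()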
I have a Python script that I'd like to run everyday and I'd prefer that it only takes 1-2 hours to run. It's currently setup to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through and the script ends up taking a really long time -- I ran some preliminary tests and it would take over 36 hours to go through each URL in its current format. So, my question boils down to: should I optimize my script to run multiple threads at the same time? Or should I scale out the number of servers I'm using? Obviously the second approach will be more costly so I'd prefer to have multiple threads running on the same instance.
I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:
import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep

conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()

# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()

for url in urls:

    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = { 'pin_count': 0 }

    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }

    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = { 'share_count': 0 }

    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = { 'plus_count': 0 }

    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()
You can see that each call to the API is using the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure how exactly (and if) it'll help me achieve my goal.
Any guidance is much appreciated!
First, you should measure the time your script spends on each step. Maybe you will discover something interesting :)
Second, you can split your urls into chunks:
chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
After these steps you can use multiprocessing to process every chunk in parallel. Here is an example:
import multiprocessing as mp

p = mp.Pool(5)

# first solution
for urls_chunk in urls:  # urls = [(url1...url6),(url7...url12)...]
    res = p.map(get_social_stat, urls_chunk)
    for record in res:
        save_to_db(record)

# or, simply
res = p.map(get_social_stat, urls)
for record in res:
    save_to_db(record)
Also, gevent can help you, because it can reduce the time spent on processing a sequence of synchronous blocking requests.
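A minimal gevent sketch of that idea (get_social_stat here is a trivial stand-in for the real API calls, and the URL list is a placeholder):

from gevent import monkey
monkey.patch_all()  # make the blocking sockets used by requests cooperative

from gevent.pool import Pool
import requests

def get_social_stat(url):
    # stand-in for the real API calls; each greenlet yields while it waits on the network
    return url, requests.get(url).status_code

urls = ['http://example.com'] * 10   # placeholder URL list

pool = Pool(20)                      # cap the number of concurrent requests
for record in pool.map(get_social_stat, urls):
    print(record)                    # stand-in for saving the record to the database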
Program 1 inserts some jobs into a table job_table.
Program 2 needs to:
1. get the job from the table
2. handle the job
-> this needs to be multi-threaded (because each job involves urllib waiting time, which should run in parallel)
3. insert the results into my_other_table, commiting the result
Are there any good (standard?) ways to implement this? The issue is that committing inside one thread also commits the other threads.
I was able to pick the records from the MySQL table and put them in a queue, and later get them from the queue, but I was not able to insert them into a new MySQL table.
Here I am able to pick up only the new records whenever they land in the table.
Hope this may help you.
If there are any mistakes, please point them out.
from threading import Thread
import time
import Queue
import csv
import random
import pandas as pd
import pymysql.cursors
from sqlalchemy import create_engine
import logging

queue = Queue.Queue(1000)
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-9s) %(message)s', )
conn = pymysql.connect(conn-details)
cursor = conn.cursor()

class ProducerThread(Thread):
    def run(self):
        global queue
        cursor.execute("SELECT ID FROM multi ORDER BY ID LIMIT 1")
        min_id = cursor.fetchall()
        min_id1 = list(min_id[0])
        while True:
            cursor.execute("SELECT ID FROM multi ORDER BY ID desc LIMIT 1")
            max_id = cursor.fetchall()
            max_id1 = list(max_id[0])
            sql = "select * from multi where ID between '{}' and '{}'".format(min_id1[0], max_id1[0])
            cursor.execute(sql)
            data = cursor.fetchall()
            min_id1[0] = max_id1[0] + 1
            for row in data:
                num = row
                queue.put(num)  # acquire();wait()
                logging.debug('Putting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')

class ConsumerThread(Thread):
    def run(self):
        global queue
        while True:
            num = queue.get()
            print num
            logging.debug('Getting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')
            # this is the insert that is not working:
            sql1 = """insert into multi_out(ID,clientname) values ('%s','%s')""", num[0], num[1]
            print sql1
            # cursor.execute(sql1, num)
            cursor.execute("""insert into multi_out(ID,clientname) values ('%s','%s')""", (num[0], num[1]))
            # conn.commit()
            # conn.close()

def main():
    ProducerThread().start()
    num_of_consumers = 20
    for i in range(num_of_consumers):
        ConsumerThread().start()

main()
What probably happens is you share the MySQL connection between the two threads. Try creating a new MySQL connection inside each thread.
For program 2, look at http://www.celeryproject.org/ :)
This is a common task when doing some sort of web crawling. I have implemented a single thread which grabs a job, waits for the http response, then writes the response to a database table.
The problem I have come across with my method is that you need to lock the table you are grabbing jobs from and mark jobs as in progress or complete, so that multiple threads do not try to grab the same task.
Just use threading.Thread in Python and override the run method.
Use one database connection per thread (some DB libraries in Python are not thread-safe).
If you have X threads running, periodically reading from the jobs table, MySQL will handle the concurrency for you.
Or if you need even more assurance, you can always lock the jobs table yourself before reading the next available entry. This way you can be 100% sure that a single job will only be processed once.
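A rough sketch of that worker pattern (the job_table/my_other_table names come from the question above; the status, id, url and result columns plus the credentials are assumptions):

import threading
import pymysql

def do_work(url):
    # placeholder for the real urllib / parsing work
    return 'result for ' + url

class JobWorker(threading.Thread):
    """One worker = one thread = one DB connection."""
    def run(self):
        # a separate connection per thread; sharing one connection is what
        # makes a commit in one thread commit the other threads' work too
        conn = pymysql.connect(host='host_url', user='user',
                               password='password', database='jobs_db')
        cur = conn.cursor()
        while True:
            cur.execute("SELECT id, url FROM job_table WHERE status = 'pending' LIMIT 1")
            row = cur.fetchone()
            if row is None:
                break                      # no work left
            job_id, url = row
            # claim the job: only the thread whose UPDATE matches gets it
            claimed = cur.execute(
                "UPDATE job_table SET status = 'in_progress' "
                "WHERE id = %s AND status = 'pending'", (job_id,))
            conn.commit()
            if not claimed:
                continue                   # another thread grabbed it first
            result = do_work(url)
            cur.execute("INSERT INTO my_other_table (job_id, result) VALUES (%s, %s)",
                        (job_id, result))
            cur.execute("UPDATE job_table SET status = 'done' WHERE id = %s", (job_id,))
            conn.commit()                  # commits only this thread's connection
        conn.close()

for _ in range(4):
    JobWorker().start()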
As @Martin said, keep connections separate for all threads. They can use the same credentials.
So in short:
Program one -> Insert into jobs
Program two -> Create a write lock on the jobs table, so no one else can read from it
Program two -> read next available job
Program two -> Unlock the table
Do everything else as usual, MySQL will handle concurrency
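A minimal sketch of that lock-then-read sequence for program two (the status/id/url columns and the credentials are assumptions; MySQLdb is just one possible driver):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='password', db='jobs_db')
cur = conn.cursor()

# Program two: take a write lock so no other worker can read the same row
cur.execute("LOCK TABLES job_table WRITE")
try:
    cur.execute("SELECT id, url FROM job_table WHERE status = 'pending' LIMIT 1")
    row = cur.fetchone()
    if row:
        cur.execute("UPDATE job_table SET status = 'in_progress' WHERE id = %s", (row[0],))
finally:
    # unlock so the other workers can read again, then do the job as usual
    cur.execute("UNLOCK TABLES")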