Improve Speed of Python Script: Multithreading or Multiple Instances? - python

I have a Python script that I'd like to run every day, and I'd prefer that it only takes 1-2 hours to run. It's currently set up to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through and the script ends up taking a really long time -- I ran some preliminary tests and it would take over 36 hours to go through each URL in its current format. So, my question boils down to: should I optimize my script to run multiple threads at the same time? Or should I scale out the number of servers I'm using? Obviously the second approach will be more costly, so I'd prefer to have multiple threads running on the same instance.
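As a rough back-of-the-envelope estimate (assuming the 36-hour figure holds): 36 hours over 160,000 URLs is about 129,600 s / 160,000 ≈ 0.8 s per URL, and hitting a 2-hour window means 160,000 / 7,200 s ≈ 22 URLs per second, so roughly 18-20 URLs would need to be in flight at once on a single machine.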
I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:
import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep

conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()

# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()

for url in urls:

    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = {'pin_count': 0}

    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}

    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = {'share_count': 0}

    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = {'plus_count': 0}

    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (%s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()
You can see that each call to the API is using the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure how exactly (and if) it'll help me achieve my goal.
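For reference, here is roughly what I imagine a treq-based fetch loop would look like (an untested sketch: plain GETs stand in for the SocialAnalytics calls, and the URL list and the concurrency limit of 20 are placeholders):

import treq
from twisted.internet import defer, task

@defer.inlineCallbacks
def fetch(url):
    # plain GET as a stand-in; the real version would call the four APIs
    response = yield treq.get(url)
    body = yield treq.content(response)
    defer.returnValue((url, body))

def main(reactor):
    urls = ['http://example.com/1', 'http://example.com/2']  # placeholder list
    sem = defer.DeferredSemaphore(20)  # keep at most 20 requests in flight
    work = [sem.run(fetch, u) for u in urls]
    return defer.gatherResults(work, consumeErrors=True)

task.react(main)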
Any guidance is much appreciated!

First, you should measure the time your script spends on each step. You may discover something interesting :)
Second, you can split your urls into chunks:
chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
After these steps you can use multiprocessing to process every chunk in parallel. Here is an example for you:
import multiprocessing as mp

p = mp.Pool(5)

# first solution
for urls_chunk in urls:  # urls = [(url1...url6), (url7...url12), ...]
    res = p.map(get_social_stat, urls_chunk)
    for record in res:
        save_to_db(record)

# or, simple
res = p.map(get_social_stat, urls)
for record in res:
    save_to_db(record)
Also, gevent can help you, because it can reduce the time spent processing a sequence of synchronous blocking requests.
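For the Pool example above to run, get_social_stat and save_to_db have to exist; they are this answer's placeholders, not an existing API. A minimal sketch, built from the question's own loop and assuming its imports and database connection are in scope:

def get_social_stat(url_row):
    # url_row is one row from the urls table, as in the question's loop
    link = url_row[2]
    try:
        p = pinterest.getPins(link)
    except:
        p = {'pin_count': 0}
    try:
        f = facebook.getObject(link)
    except:
        f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}
    try:
        t = twitter.getShares(link)
    except:
        t = {'share_count': 0}
    try:
        g = google_plus.getPlusOnes(link)
    except:
        g = {'plus_count': 0}
    return (strftime("%Y-%m-%d %H:%M:%S"), p['pin_count'], f['like_count'],
            f['share_count'], f['comment_count'], t['share_count'], g['plus_count'])

def save_to_db(record):
    # reuses the question's cursor/connection; runs in the main process only
    try:
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (%s, %s, %s, %s, %s, %s, %s);", record)
        conn.commit()
    except:
        conn.rollback()

Note that the worker processes only do the HTTP work and return plain tuples; the psycopg2 connection stays in the main process, which is why save_to_db is called from the loop over res rather than inside the workers.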

Related

Python- performance issue with mysql and google play subscription

I am trying to extract the purchase_state from Google Play, following the steps below:
import base64
import requests
import smtplib
from collections import OrderedDict
import mysql.connector
from mysql.connector import errorcode
......
1. Query the db, returning thousands of rows with a purchase_id field from my table.
2. For each row from the db, extract the purchase_id and query Google Play for it. For example, if the previous step returns 1000 rows, Google is queried 1000 times (refresh token + query).
3. Add a new field, the purchase status from Google Play, to a new dictionary, along with some other fields grabbed from the MySQL query.
4. The last step is to loop over my dict as follows and prepare the desired report.
AFTER EDITED:
def build_dic_from_db(data, access_token):
    dic = {}
    for row in data:
        product_id = row['product_id']
        purchase_id = row['purchase_id']
        status = check_purchase_status(access_token, product_id, purchase_id)
        cnt = 1
        if row['user'] not in dic:
            dic[row['user']] = {'id': row['user_id'], 'country': row['country_name'],
                                'reg_ts': row['user_registration_timestamp'],
                                'last_active_ts': row['user_last_active_action_timestamp'],
                                'total_credits': row['user_credits'],
                                'total_call_sec_this_month': row['outgoing_call_seconds_this_month'],
                                'user_status': row['user_status'],
                                'mobile': row['user_mobile_phone_number_num'],
                                'plus': row['user_assigned_msisdn_num'],
                                row['product_id']: {'tAttemp': cnt, 'tCancel': status}}
        else:
            if row['product_id'] not in dic[row['user']]:
                dic[row['user']][row['product_id']] = {'tAttemp': cnt, 'tCancel': status}
            else:
                dic[row['user']][row['product_id']]['tCancel'] += status
                dic[row['user']][row['product_id']]['tAttemp'] += cnt
    return dic
The problem is that my code runs slowly (Total execution time: 448.7483880519867) and I am wondering if there is a way to improve my script. Any suggestions?
I hope I'm right about this, but the bottleneck seems to be the connection to the Play Store. Doing it sequentially will take a long time, whereas the server can handle many requests at a time. So here's a way to process your jobs with executors (concurrent.futures is in the standard library on Python 3; on Python 2 you need the futures backport).
In this example, you'll be able to send 100 requests at the same time.
from concurrent import futures

EXECUTORS = futures.ThreadPoolExecutor(max_workers=100)

jobs = dict()
for row in data:
    product_id = row['product_id']
    purchase_id = row['purchase_id']
    job = EXECUTORS.submit(check_purchase_status,
                           access_token, product_id, purchase_id)
    jobs[job] = row

for job in futures.as_completed(jobs.keys()):
    # here collect your results and do something useful with them :)
    status = job.result()
    # make the connection with the current row
    row = jobs[job]
    # now you have your status and the row
And by the way, try to use temporary variables; otherwise you're constantly accessing your dictionary with the same keys, which is bad for both the performance and the readability of your code. For example:
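In build_dic_from_db above, something along these lines (build_user_record is a hypothetical helper wrapping the big dict literal) touches each key once instead of repeating the dic[row['user']][row['product_id']] lookups:

user_entry = dic.get(row['user'])
if user_entry is None:
    # first time we see this user: build the full record, including this product
    dic[row['user']] = build_user_record(row, cnt, status)
else:
    product_entry = user_entry.get(row['product_id'])
    if product_entry is None:
        user_entry[row['product_id']] = {'tAttemp': cnt, 'tCancel': status}
    else:
        product_entry['tCancel'] += status
        product_entry['tAttemp'] += cnt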

Python threading and SQL

My question basically is: is there a best-practice approach to db interaction, and am I doing something silly / wrong in the below that is costing processing time?
My program pulls data from a website and writes to a SQL database. Speed is very important and I want to be able to refresh the data as quickly as possible. I've tried a number of ways and I feel it's still way too slow - i.e. it could be much better with a better approach / design to interaction with the db, and I'm sure I'm making all sorts of mistakes. I can download the data to memory very quickly, but the writes to the db take much, much longer.
The 3 main approaches I've tried are:
1. Threads that pull the data and populate a list of SQL commands; when the threads complete, run the SQL in the main thread.
2. Threads that pull data and push to SQL (as per the code below).
3. Threads that pull data and populate a queue, with separate thread(s) polling the queue and pushing to the db.
Code as below:
import re
import MySQLdb as mydb

class DatabaseUtility():
    def __init__(self):
        """set db parameters"""

    def updateCommand(self, cmd):
        """run SQL commands and return number of matched rows"""
        try:
            self.cur.execute(cmd)
            return int(re.search('Rows matched: (\d+)', self.cur._info).group(1))
        except Exception, e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0

    def addCommand(self, cmd):
        """write SQL command to db"""
        try:
            self.cur.execute(cmd)
            return self.cur.rowcount
        except Exception, e:
            print ('runCmd error: ' + str(e))
            print ('With SQL: ' + cmd)
            return 0
I've created a class that instantiates a db connection and is called as below:
from Queue import Queue
from threading import Thread
import urllib2
import json

from databasemanager import DatabaseUtility as dbU
from datalinks import getDataLink, allDataLinks

numThreads = 3
q = Queue()
dbu = dbU()

class OddScrape():
    def __init__(self, name, q):
        self.name = name
        self.getOddsData(self.name, q)

    def getOddsData(self, i, q):
        """Worker thread - parse each datalink and update / insert to db"""
        while True:
            # get datalink, create db connection
            self.dbu = dbU()
            matchData = q.get()

            # load data link using urllib2 and do a bunch of stuff
            # to parse the data to the required format

            # try to update in db and insert if not found
            sql = "sql to update %s" % (params)
            update = self.dbu.updateCommand(sql)
            if update < 1:
                sql = "sql to insert" % (params)
                self.dbu.addCommand(sql)

            q.task_done()
            self.dbu.dbConClose()
            print eventlink

def threadQ():
    # set up some threads
    for i in range(numThreads):
        worker = Thread(target=OddScrape, args=(i, q,))
        worker.start()

    # get urldata for all matches required and add to q
    matchids = dbu.runCommand("sql code to determine scope of urls")
    for match in matchids:
        sql = "sql code to get url data %s" % match
        q.put(dbu.runCommand(sql))

    q.join()
I've also added an index to the table I'm writing to, which seemed to help a tiny bit but not noticeably:
CREATE INDEX `idx_oddsdata_bookid_datalinkid`
ON `dbname`.`oddsdata` (bookid, datalinkid) COMMENT '' ALGORITHM DEFAULT LOCK DEFAULT;
Multiple threads implies multiple connections. Although getting a connection is "fast" in MySQL, it is not instantaneous. I do not know the relative speed of getting a connection versus running a query, but I doubt your multi-threaded idea will win.
Could you show us examples of the actual queries (SQL, not Python code) you need to run? We may have suggestions on combining queries, improved indexes, etc. Please provide SHOW CREATE TABLE, too. (You mentioned a CREATE INDEX, but it is useless out of context.)
It looks like you are doing a multi-step process that could be collapsed into INSERT ... ON DUPLICATE KEY UPDATE ....
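A hedged sketch of that collapse (column names are guesses based on the index from the question; it also requires a UNIQUE or PRIMARY KEY on (bookid, datalinkid), not just the plain index shown):

sql = ("INSERT INTO oddsdata (bookid, datalinkid, odds) "
       "VALUES (%s, %s, %s) "
       "ON DUPLICATE KEY UPDATE odds = VALUES(odds)")
self.cur.execute(sql, (bookid, datalinkid, odds))

One such statement replaces the updateCommand-then-addCommand round trip per row, which is usually a bigger win than adding threads.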

How to speed up web scraping in python

I'm working on a project for school and I am trying to get data about movies. I've managed to write a script to get the data I need from IMDbPY and Open Movie DB API (omdbapi.com). The challenge I'm experiencing is that I'm trying to get data for 22,305 movies and each request takes about 0.7 seconds. Essentially my current script will take about 8 hours to complete. Looking for any way to maybe use multiple requests at the same time or any other suggestions to significantly speed up the process of getting this data.
import urllib2
import json
import pandas as pd
import time
import imdb

start_time = time.time()  # record time at beginning of script

# used to make imdb.com think we are getting this data from a browser
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Open Movie Database query url for IMDb IDs
url = 'http://www.omdbapi.com/?tomatoes=true&i='

# read the ids from the imdb_id csv file
imdb_ids = pd.read_csv('ids.csv')

cols = [u'Plot', u'Rated', u'tomatoImage', u'Title', u'DVD', u'tomatoMeter',
        u'Writer', u'tomatoUserRating', u'Production', u'Actors', u'tomatoFresh',
        u'Type', u'imdbVotes', u'Website', u'tomatoConsensus', u'Poster', u'tomatoRotten',
        u'Director', u'Released', u'tomatoUserReviews', u'Awards', u'Genre', u'tomatoUserMeter',
        u'imdbRating', u'Language', u'Country', u'imdbpy_budget', u'BoxOffice', u'Runtime',
        u'tomatoReviews', u'imdbID', u'Metascore', u'Response', u'tomatoRating', u'Year',
        u'imdbpy_gross']

# create movies dataframe
movies = pd.DataFrame(columns=cols)

i = 0
for i in range(len(imdb_ids) - 1):
    start = time.time()
    req = urllib2.Request(url + str(imdb_ids.ix[i, 0]), None, headers)  # request page
    response = urllib2.urlopen(req)  # actually call the html request
    the_page = response.read()  # read the json from the omdbapi query
    movie_json = json.loads(the_page)  # convert the json to a dict

    # get the gross revenue and budget from IMDbPy
    data = imdb.IMDb()
    movie_id = imdb_ids.ix[i, ['imdb_id']]
    movie_id = movie_id.to_string()
    movie_id = int(movie_id[-7:])
    data = data.get_movie_business(movie_id)
    data = data['data']
    data = data['business']

    # get the budget $ amount out of the budget IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0])
    except:
        None

    # get the gross $ amount out of the gross IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0])

        # get the gross $ amount out of the gross IMDbPy string
        gross = data['gross']
        gross = gross[0]
        gross = gross.replace('$', '')
        gross = gross.replace(',', '')
        gross = gross.split(' ')
        gross = str(gross[0])
    except:
        None

    # add gross to the movies dict
    try:
        movie_json[u'imdbpy_gross'] = gross
    except:
        movie_json[u'imdbpy_gross'] = 0

    # add budget to the movies dict
    try:
        movie_json[u'imdbpy_budget'] = budget
    except:
        movie_json[u'imdbpy_budget'] = 0

    # create new dataframe that can be merged to movies DF
    tempDF = pd.DataFrame.from_dict(movie_json, orient='index')
    tempDF = tempDF.T

    # add the new movie to the movies dataframe
    movies = movies.append(tempDF, ignore_index=True)

    end = time.time()
    time_took = round(end - start, 2)
    percentage = round(((i + 1) / float(len(imdb_ids))) * 100, 1)
    print i + 1, "of", len(imdb_ids), "(" + str(percentage) + '%)', 'completed', time_took, 'sec'

    # increment counter
    i += 1

# save the dataframe to a csv file
movies.to_csv('movie_data.csv', index=False)
end_time = time.time()
print round((end_time - start_time) / 60, 1), "min"
Use the Eventlet library to fetch concurrently
As advised in the comments, you should fetch your feeds concurrently. This can be done using threading, multiprocessing, or eventlet.
Install eventlet
$ pip install eventlet
Try web crawler sample from eventlet
See: http://eventlet.net/doc/examples.html#web-crawler
Understanding concurrency with eventlet
With threading, the system takes care of switching between your threads. This brings a big problem if you have to access common data structures, as you never know which other thread is currently accessing your data. You then start playing with synchronized blocks, locks, and semaphores, just to synchronize access to your shared data structures.
With eventlet it is much simpler: you always run only one green thread, and you jump between them only at I/O instructions or at other eventlet calls. The rest of your code runs uninterrupted, without the risk that another thread will mess with your data.
You only have to take care of the following:
- all I/O operations must be non-blocking (this is mostly easy; eventlet provides non-blocking versions of most of the I/O you need).
- your remaining code must not be CPU-expensive, as that would block switching between "green" threads for a long time and the benefit of "green" multithreading would be gone.
The great advantage of eventlet is that it allows writing code in a straightforward way without spoiling it (too) much with locks, semaphores, etc.
Apply eventlet to your code
If I understand it correctly, the list of urls to fetch is known in advance and the order of their processing in your analysis is not important. This should allow an almost direct copy of the example from eventlet. I see that the index i has some significance, so you might consider combining the url and the index into a tuple and processing them as independent jobs.
There are definitely other methods, but personally I have found eventlet really easy to use compared to other techniques, while getting really good results (especially with fetching feeds). You just have to grasp the main concepts and be a bit careful to follow eventlet's requirements (keep being non-blocking). A minimal sketch follows.
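This sketch is modelled on eventlet's crawler example; fetch and process are hypothetical helpers, and urls is assumed to be built from the imdb_ids in the question:

import eventlet
from eventlet.green import urllib2  # green (non-blocking) version of urllib2

def fetch(job):
    i, url = job
    body = urllib2.urlopen(url).read()
    return i, body  # keep the index so the result can be matched to its row

pool = eventlet.GreenPool(50)  # 50 concurrent green threads
jobs = list(enumerate(urls))
for i, body in pool.imap(fetch, jobs):
    process(i, body)  # hypothetical per-row processing, e.g. json.loads + DataFrame append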
Fetching urls using requests and eventlet - erequests
There are various packages for asynchronous processing with requests; one of them uses eventlet and is named erequests, see https://github.com/saghul/erequests
Simple sample: fetching a set of urls
import erequests

# have list of urls to fetch
urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

# erequests.async.get(url) creates asynchronous request
async_reqs = [erequests.async.get(url) for url in urls]
# each async request is ready to go, but not yet performed

# erequests.map will call each async request to the action
# what returns processed request `req`
for req in erequests.map(async_reqs):
    if req.ok:
        content = req.content
        # process it here
        print "processing data from:", req.url
Problems for processing this specific question
We are able to fetch and somehow process all the urls we need. But in this question, processing is bound to a particular record in the source data, so we will need to match each processed request with the index of the record it belongs to in order to get further details for final processing.
As we will see later, asynchronous processing does not honour the order of requests: some are processed sooner and some later, and map yields whatever is completed.
One option is to attach the index of the given url to the request and use it later when processing the returned data.
Complex sample of fetching and processing urls with preserving url indices
Note: the following sample is rather complex; if you can live with the solution provided above, skip this. But make sure you are not running into the problems detected and resolved below (urls being modified, requests following redirects).
import erequests
from itertools import count, izip
from functools import partial

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

def print_url_index(index, req, *args, **kwargs):
    content_length = req.headers.get("content-length", None)
    todo = "PROCESS" if req.status_code == 200 else "WAIT, NOT YET READY"
    print "{todo}: index: {index}: status: {req.status_code}: length: {content_length}, {req.url}".format(**locals())

async_reqs = (erequests.async.get(url, hooks={"response": partial(print_url_index, i)})
              for i, url in izip(count(), urls))

for req in erequests.map(async_reqs):
    pass
Attaching hooks to request
requests (and erequests too) allows defining hooks for the event called response. Each time the request gets a response, this hook function is called and can do something or even modify the response.
The following line defines a hook on the response:
erequests.async.get(url, hooks={"response": partial(print_url_index, i)})
Passing url index to the hook function
The signature of any hook shall be func(req, *args, **kwargs).
But we also need to pass the index of the url we are processing into the hook function.
For this purpose we use functools.partial, which allows creating simplified functions by fixing some parameters to a specific value. This is exactly what we need: looking at the print_url_index signature, we just need to fix the value of index; the rest will fit the requirements for a hook function.
In our call we use partial with the name of the simplified function, print_url_index, providing a unique index for each url.
The index could be provided in the loop by enumerate; with a larger number of parameters we may work in a more memory-efficient way and use count, which generates an incremented number each time, starting by default from 0.
Let us run it:
$ python ereq.py
WAIT, NOT YET READY: index: 3: status: 301: length: 66, http://python-requests.org/
WAIT, NOT YET READY: index: 4: status: 301: length: 58, http://kennethreitz.com/
WAIT, NOT YET READY: index: 0: status: 301: length: None, http://www.heroku.com/
PROCESS: index: 2: status: 200: length: 7700, http://httpbin.org/
WAIT, NOT YET READY: index: 1: status: 301: length: 64, http://python-tablib.org/
WAIT, NOT YET READY: index: 4: status: 301: length: None, http://kennethreitz.org
WAIT, NOT YET READY: index: 3: status: 302: length: 0, http://docs.python-requests.org
WAIT, NOT YET READY: index: 1: status: 302: length: 0, http://docs.python-tablib.org
PROCESS: index: 3: status: 200: length: None, http://docs.python-requests.org/en/latest/
PROCESS: index: 1: status: 200: length: None, http://docs.python-tablib.org/en/latest/
PROCESS: index: 0: status: 200: length: 12064, https://www.heroku.com/
PROCESS: index: 4: status: 200: length: 10478, http://www.kennethreitz.org/
This shows that:
- requests are not processed in the order they were generated
- some requests follow a redirection, so the hook function is called multiple times
- carefully inspecting the url values, we can see that no url from the original urls list is reported by a response; even for index 2 we got an extra / appended. That is why a simple lookup of the response url in the original list of urls would not help us.
When web-scraping we generally have two types of bottlenecks:
- IO blocks - whenever we make a request, we need to wait for the server to respond, which can block our entire program.
- CPU blocks - when parsing web-scraped content, our code might be limited by CPU processing power.
CPU Speed
CPU blocks are an easy fix - we can spawn more processes. Generally, 1 CPU core can efficiently handle 1 process. So if our scraper is running on a machine that has 12 CPU cores we can spawn 12 processes for 12x speed boost:
from concurrent.futures import ProcessPoolExecutor

def parse(html):
    ...  # CPU intensive parsing

htmls = [...]
with ProcessPoolExecutor() as executor:
    for result in executor.map(parse, htmls):
        print(result)
Python's ProcessPoolExecutor spawns an optimal number of processes (equal to the number of CPU cores) and distributes tasks across them.
IO Speed
For IO blocking we have more options, as our goal is to get rid of useless waiting, which can be done through threads, processes, and asyncio loops.
If we're making thousands of requests we can't spawn hundreds of processes. Threads will be less expensive but still, there's a better option - asyncio loops.
Asyncio loops can execute tasks in no specific order. In other words, while task A is being blocked task B can take over the program. This is perfect for web scraping as there's very little overhead computing going on. We can scale to thousands requests in a single program.
Unfortunately, for asyncio to work, we need to use Python packages that support asyncio. For example, by using httpx and asyncio we can speed up our scraping significantly:
# comparing synchronous `requests`:
import requests
from time import time

_start = time()
for i in range(50):
    requests.get("http://httpbin.org/delay/1")
print(f"finished in: {time() - _start:.2f} seconds")
# finished in: 52.21 seconds

# versus asynchronous `httpx`
import httpx
import asyncio
from time import time

_start = time()

async def main():
    async with httpx.AsyncClient() as client:
        tasks = [client.get("http://httpbin.org/delay/1") for i in range(50)]
        for response_future in asyncio.as_completed(tasks):
            response = await response_future
    print(f"finished in: {time() - _start:.2f} seconds")

asyncio.run(main())
# finished in: 3.55 seconds
Combining Both
With async code we can avoid IO-blocks and with processes we can scale up CPU intensive parsing - a perfect combo to optimize web scraping:
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from time import sleep, time

import httpx

async def scrape(urls):
    """this is our async scraper that scrapes"""
    results = []
    async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
        scrape_tasks = [client.get(url) for url in urls]
        for response_f in asyncio.as_completed(scrape_tasks):
            response = await response_f
            # emulate data parsing/calculation
            sleep(0.5)
            ...
            results.append("done")
    return results

def scrape_wrapper(args):
    i, urls = args
    print(f"subprocess {i} started")
    result = asyncio.run(scrape(urls))
    print(f"subprocess {i} ended")
    return result

def multi_process(urls):
    _start = time()

    batches = []
    batch_size = multiprocessing.cpu_count() - 1  # let's keep 1 core for ourselves
    print(f"scraping {len(urls)} urls through {batch_size} processes")
    for i in range(0, len(urls), batch_size):
        batches.append(urls[i : i + batch_size])

    with ProcessPoolExecutor() as executor:
        for result in executor.map(scrape_wrapper, enumerate(batches)):
            print(result)
    print("done")
    print(f"multi-process finished in {time() - _start:.2f}")

def single_process(urls):
    _start = time()
    results = asyncio.run(scrape(urls))
    print(f"single-process finished in {time() - _start:.2f}")

if __name__ == "__main__":
    urls = ["http://httpbin.org/delay/1" for i in range(100)]
    multi_process(urls)
    # multi-process finished in 7.22
    single_process(urls)
    # single-process finished in 51.28
These foundational concepts sound complex, but once you narrow it down to the root of the issue, the fixes are very straightforward and already present in Python!
For more details on this subject see my blog Web Scraping Speed: Processes, Threads and Async

Python MySQL queue: run code/queries in parallel, but commit separately

Program 1 inserts some jobs into a table job_table.
Program 2 needs to:
1. Get the job from the table.
2. Handle the job -> this needs to be multi-threaded (because each job involves urllib waiting time, which should run in parallel).
3. Insert the results into my_other_table, committing the result.
Any good (standard?) ways to implement this? The issue is that committing inside one thread also commits the other threads.
I was able to pick the records from the MySQL table and put them in a queue, and later get them from the queue, but I was not able to insert them into a new MySQL table.
Here I am able to pick up only the new records whenever they land in the table.
Hope this helps you. If there are any mistakes, please point them out.
from threading import Thread
import time
import Queue
import csv
import random
import pandas as pd
import pymysql.cursors
from sqlalchemy import create_engine
import logging

queue = Queue.Queue(1000)
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-9s) %(message)s', )
conn = pymysql.connect(conn-details)
cursor = conn.cursor()

class ProducerThread(Thread):
    def run(self):
        global queue
        cursor.execute("SELECT ID FROM multi ORDER BY ID LIMIT 1")
        min_id = cursor.fetchall()
        min_id1 = list(min_id[0])
        while True:
            cursor.execute("SELECT ID FROM multi ORDER BY ID desc LIMIT 1")
            max_id = cursor.fetchall()
            max_id1 = list(max_id[0])
            sql = "select * from multi where ID between '{}' and '{}'".format(min_id1[0], max_id1[0])
            cursor.execute(sql)
            data = cursor.fetchall()
            min_id1[0] = max_id1[0] + 1
            for row in data:
                num = row
                queue.put(num)  # acquire();wait()
                logging.debug('Putting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')

class ConsumerThread(Thread):
    def run(self):
        global queue
        while True:
            num = queue.get()
            print num
            logging.debug('Getting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')
            sql1 = """insert into multi_out(ID,clientname) values ('%s','%s')""", num[0], num[1]
            print sql1
            # cursor.execute(sql1, num)
            cursor.execute("""insert into multi_out(ID,clientname) values ('%s','%s')""", (num[0], num[1]))
            # conn.commit()
            # conn.close()

def main():
    ProducerThread().start()
    num_of_consumers = 20
    for i in range(num_of_consumers):
        ConsumerThread().start()

main()
What probably happens is you share the MySQL connection between the two threads. Try creating a new MySQL connection inside each thread.
For program 2, look at http://www.celeryproject.org/ :)
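A minimal sketch of the one-connection-per-thread idea for program 2 (pymysql shown because the question already imports it; handle, job_chunks, and the table/column names are hypothetical):

import threading
import pymysql

def worker(jobs):
    # each thread opens its own connection, so its commits are independent
    conn = pymysql.connect(host='***', user='***', password='***', db='***')
    cur = conn.cursor()
    for job in jobs:
        result = handle(job)  # hypothetical: the urllib work for one job
        cur.execute("INSERT INTO my_other_table (job_id, result) VALUES (%s, %s)",
                    (job[0], result))
        conn.commit()  # commits only this connection's work
    conn.close()

threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in job_chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()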
This is a common task when doing some sort of web crawling. I have implemented a single thread which grabs a job, waits for the http response, then writes the response to a database table.
The problem I have come across with my method is that you need to lock the table you are grabbing jobs from, and mark the jobs as in progress or complete, so that multiple threads do not try to grab the same task.
Just use threading.Thread in Python and override the run method.
Use one database connection per thread (some db libraries in Python are not thread-safe).
If you have X threads running, periodically reading from the jobs table, then MySQL will do the concurrency for you.
Or if you need even more assurance, you can always lock the jobs table yourself before reading the next available entry. This way you can be 100% sure that a single job will only be processed once.
As @Martin said, keep connections separate for all threads. They can use the same credentials.
So in short (a rough sketch of steps 2-4 follows the list):
1. Program one -> insert into jobs.
2. Program two -> create a write lock on the jobs table, so no one else can read from it.
3. Program two -> read the next available job.
4. Program two -> unlock the table.
5. Do everything else as usual; MySQL will handle the concurrency.
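A rough sketch of steps 2-4, using a MySQLdb/pymysql-style cursor (table and column names are hypothetical; note that LOCK TABLES implicitly commits, so keep the locked section short):

cur.execute("LOCK TABLES jobs WRITE")
cur.execute("SELECT id, payload FROM jobs WHERE status = 'new' ORDER BY id LIMIT 1")
job = cur.fetchone()
if job:
    cur.execute("UPDATE jobs SET status = 'in_progress' WHERE id = %s", (job[0],))
cur.execute("UNLOCK TABLES")
# ...handle the job, then insert into my_other_table and commit as usual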

Python, SQLite and threading

I'm working on an application that will gather data through HTTP from several places, cache the data locally and then serve it through HTTP.
So I was looking at the following. My application will first create several threads that will gather data at a specified interval and cache that data locally into a SQLite database.
Then in the main thread start a CherryPy application that will query that SQLite database and serve the data.
My problem is: how do I handle connections to the SQLite database from my threads and from the CherryPy application?
If I do a connection per thread to the database, will I also be able to create/use an in-memory database?
Short answer: Don't use Sqlite3 in a threaded application.
Sqlite3 databases scale well for size, but rather terribly for concurrency. You will be plagued with "Database is locked" errors.
If you do, you will need a connection per thread, and you have to ensure that these connections clean up after themselves. This is traditionally handled using thread-local sessions, and is performed rather well (for example) using SQLAlchemy's ScopedSession. I would use this if I were you, even if you aren't using the SQLAlchemy ORM features.
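A minimal sketch of the thread-local-session idea with SQLAlchemy's scoped_session (the file name and table are just examples):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///cache.db")          # one engine for the whole process
Session = scoped_session(sessionmaker(bind=engine))   # hands each thread its own session

def worker():
    session = Session()  # this thread's private session/connection
    try:
        session.execute(text("INSERT INTO cache (key, value) VALUES (:k, :v)"),
                        {"k": "some-key", "v": "some-value"})
        session.commit()
    finally:
        Session.remove()  # clean up the thread-local session when the thread is done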
This test was done to determine the best way to write to and read from a SQLite database. We follow the 3 approaches below:
1. Read and write without any threads (the methods with the word "normal" in their names)
2. Read and write with threads
3. Read and write with processes
Our sample dataset is a dummy-generated OHLC dataset with a symbol, a timestamp, and 6 fake values for OHLC and volumefrom/volumeto.
Reads
Normal method takes about 0.25 seconds to read
Threaded method takes 10 seconds
Processing takes 0.25 seconds to read
Winner: Processing and Normal
Writes
Normal method takes about 1.5 seconds to write
Threaded method takes about 30 seconds
Processing takes about 30 seconds
Winner: Normal
Note: not all records get written by the threaded and process-based write methods; they obviously run into database-locked errors as the writes queue up.
SQLite only queues up writes to a certain threshold and then throws sqlite3.OperationalError indicating the database is locked.
The ideal approach would be to retry inserting the same chunk, but there is no point: parallel insertion takes more time than a sequential write even without retrying the locked/failed inserts.
Without retrying, 97% of the rows were written and it still took 10x more time than a sequential write.
Strategies to take away:
1. Prefer reading from and writing to SQLite in the same thread.
2. If you must parallelize, use multiprocessing for reads, which gives more or less the same performance, and defer to single-threaded write operations.
3. DO NOT USE THREADING for reads and writes: it is 10x slower on both, and you can thank the GIL for that.
Here is the code for the complete test
import sqlite3
import time
import random
import string
import os
import timeit
from functools import wraps
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import threading

database_file = os.path.realpath('../files/ohlc.db')

create_statement = 'CREATE TABLE IF NOT EXISTS database_threading_test (symbol TEXT, ts INTEGER, o REAL, h REAL, l REAL, c REAL, vf REAL, vt REAL, PRIMARY KEY(symbol, ts))'
insert_statement = 'INSERT INTO database_threading_test VALUES(?,?,?,?,?,?,?,?)'
select = 'SELECT * from database_threading_test'

def time_stuff(some_function):
    def wrapper(*args, **kwargs):
        t0 = timeit.default_timer()
        value = some_function(*args, **kwargs)
        print(timeit.default_timer() - t0, 'seconds')
        return value
    return wrapper

def generate_values(count=100):
    end = int(time.time()) - int(time.time()) % 900
    symbol = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
    ts = list(range(end - count * 900, end, 900))
    for i in range(count):
        yield (symbol, ts[i], random.random() * 1000, random.random() * 1000, random.random() * 1000, random.random() * 1000, random.random() * 1e9, random.random() * 1e5)

def generate_values_list(symbols=1000, count=100):
    values = []
    for _ in range(symbols):
        values.extend(generate_values(count))
    return values

@time_stuff
def sqlite_normal_read():
    """
    100k records in the database, 1000 symbols, 100 rows
    First run
    0.25139795300037804 seconds
    Second run
    Third run
    """
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    try:
        with conn:
            conn.execute(create_statement)
            results = conn.execute(select).fetchall()
            print(len(results))
    except sqlite3.OperationalError as e:
        print(e)

@time_stuff
def sqlite_normal_write():
    """
    1000 symbols, 100 rows
    First run
    2.279409104000024 seconds
    Second run
    2.3364172020001206 seconds
    Third run
    """
    l = generate_values_list()
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    try:
        with conn:
            conn.execute(create_statement)
            conn.executemany(insert_statement, l)
    except sqlite3.OperationalError as e:
        print(e)

@time_stuff
def sequential_batch_read():
    """
    We read all the rows for each symbol one after the other in sequence
    First run
    3.661222331999852 seconds
    Second run
    2.2836898810001003 seconds
    Third run
    0.24514851899994028 seconds
    Fourth run
    0.24082150699996419 seconds
    """
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    try:
        with conn:
            conn.execute(create_statement)
            symbols = conn.execute("SELECT DISTINCT symbol FROM database_threading_test").fetchall()
            for symbol in symbols:
                results = conn.execute("SELECT * FROM database_threading_test WHERE symbol=?", symbol).fetchall()
    except sqlite3.OperationalError as e:
        print(e)

def sqlite_threaded_read_task(symbol):
    results = []
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    try:
        with conn:
            results = conn.execute("SELECT * FROM database_threading_test WHERE symbol=?", symbol).fetchall()
    except sqlite3.OperationalError as e:
        print(e)
    finally:
        return results

def sqlite_multiprocessed_read_task(symbol):
    results = []
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    try:
        with conn:
            results = conn.execute("SELECT * FROM database_threading_test WHERE symbol=?", symbol).fetchall()
    except sqlite3.OperationalError as e:
        print(e)
    finally:
        return results

@time_stuff
def sqlite_threaded_read():
    """
    1000 symbols, 100 rows per symbol
    First run
    9.429676861000189 seconds
    Second run
    10.18928106400017 seconds
    Third run
    10.382290903000467 seconds
    """
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    symbols = conn.execute("SELECT DISTINCT SYMBOL from database_threading_test").fetchall()
    with ThreadPoolExecutor(max_workers=8) as e:
        results = e.map(sqlite_threaded_read_task, symbols, chunksize=50)
        for result in results:
            pass

@time_stuff
def sqlite_multiprocessed_read():
    """
    1000 symbols, 100 rows
    First run
    0.2484774920012569 seconds!!!
    Second run
    0.24322178500005975 seconds
    Third run
    0.2863524549993599 seconds
    """
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    symbols = conn.execute("SELECT DISTINCT SYMBOL from database_threading_test").fetchall()
    with ProcessPoolExecutor(max_workers=8) as e:
        results = e.map(sqlite_multiprocessed_read_task, symbols, chunksize=50)
        for result in results:
            pass

def sqlite_threaded_write_task(n):
    """
    We ignore the database locked errors here. The ideal case would be to retry, but there is no point
    writing code for that if it takes longer than a sequential write even without database locked errors.
    """
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    data = list(generate_values())
    try:
        with conn:
            conn.executemany("INSERT INTO database_threading_test VALUES(?,?,?,?,?,?,?,?)", data)
    except sqlite3.OperationalError as e:
        print("Database locked", e)
    finally:
        conn.close()
        return len(data)

def sqlite_multiprocessed_write_task(n):
    """
    We ignore the database locked errors here. The ideal case would be to retry, but there is no point
    writing code for that if it takes longer than a sequential write even without database locked errors.
    """
    conn = sqlite3.connect(os.path.realpath('../files/ohlc.db'))
    data = list(generate_values())
    try:
        with conn:
            conn.executemany("INSERT INTO database_threading_test VALUES(?,?,?,?,?,?,?,?)", data)
    except sqlite3.OperationalError as e:
        print("Database locked", e)
    finally:
        conn.close()
        return len(data)

@time_stuff
def sqlite_threaded_write():
    """
    Did not write all the results but the outcome with 97400 rows written is still this...
    Takes 20x the amount of time as a normal write
    1000 symbols, 100 rows
    First run
    28.17819765000013 seconds
    Second run
    25.557972323000058 seconds
    Third run
    """
    symbols = [i for i in range(1000)]
    with ThreadPoolExecutor(max_workers=8) as e:
        results = e.map(sqlite_threaded_write_task, symbols, chunksize=50)
        for result in results:
            pass

@time_stuff
def sqlite_multiprocessed_write():
    """
    1000 symbols, 100 rows
    First run
    30.09209805699993 seconds
    Second run
    27.502465319000066 seconds
    Third run
    """
    symbols = [i for i in range(1000)]
    with ProcessPoolExecutor(max_workers=8) as e:
        results = e.map(sqlite_multiprocessed_write_task, symbols, chunksize=50)
        for result in results:
            pass

sqlite_normal_write()
You can use something like that.
"...create several threads that will gather data at a specified interval and cache that data locally into a sqlite database.
Then in the main thread start a CherryPy app that will query that sqlite db and serve the data."
Don't waste a lot of time on threads. The things you're describing are simply OS processes. Just start ordinary processes to do the gathering and run CherryPy.
You have no real use for concurrent threads in a single process for this. Gathering data at a specified interval -- when done with simple OS processes -- can be scheduled by the OS very simply. Cron, for example, does a great job of this.
A CherryPy App, also, is an OS process, not a single thread of some larger process.
Just use processes -- threads won't help you.
Depending on the application the DB could be a real overhead. If we are talking about volatile data, maybe you could skip the communication via DB completely and share the data between the data gathering process and the data serving process(es) via IPC. This is not an option if the data has to be persisted, of course.
Depending on the data rate sqlite could be exactly the correct way to do this. The entire database is locked for each write so you aren't going to scale to 1000s of simultaneous writes per second. But if you only have a few it is the safest way of assuring you don't overwrite each other.
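If you do stay with SQLite, one way to live with that write lock is to funnel all writes through a single writer thread fed by a queue, so the gatherer threads never touch the database directly. A minimal sketch (Python 3 names; the file and table are just examples):

import queue
import sqlite3
import threading

write_q = queue.Queue()

def writer():
    # the only thread that writes, so writes never collide with each other
    conn = sqlite3.connect("cache.db")
    while True:
        item = write_q.get()
        if item is None:  # sentinel to shut the writer down
            break
        conn.execute("INSERT INTO cache (key, value) VALUES (?, ?)", item)
        conn.commit()
    conn.close()

threading.Thread(target=writer, daemon=True).start()
# gatherer threads just do: write_q.put((key, value))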
