I have a table called "unprocessed" where I want to read 2000 rows, send them over HTTP to another server and then insert the rows into a "processed" table and remove them from the "unprocessed" table.
My python code roughly looks like this:
import sys
import MySQLdb

db = MySQLdb.connect("localhost", "username", "password", "database")
# prepare a cursor object using cursor() method
cursor = db.cursor()
# Select all the records not yet sent
sql = "SELECT * from unprocessed where SupplierIDToUse = 'supplier1' limit 0, 2000"
cursor.execute(sql)
results = cursor.fetchall()
for row in results:
    id = row[0]
    # <code is here for sending to the other server - it takes about 1/2 a second>
    if sentcorrectly == "1":
        sql = "INSERT into processed (id, dateprocessed) VALUES ('%s', NOW())" % (id)
        try:
            inserted = cursor.execute(sql)
        except:
            print "Failed to insert"
        if inserted:
            print "Inserted"
            sql = "DELETE from unprocessed where id = '%s'" % (id)
            try:
                deleted = cursor.execute(sql)
            except:
                print "Failed to delete id from the unprocessed table, even though it was saved in the processed table."
db.close()
sys.exit(0)
I want to be able to run this code concurrently so that I can increase the speed of sending these records to the other server over HTTP.
At the moment, if I try to run the code concurrently, I get multiple copies of the same data sent to the other server and saved into the "processed" table, because the select query returns the same IDs in multiple instances of the code.
How can I lock the records when I select them and then process each record as a row before moving them to the "processed" table?
The table was MyISAM, but I've converted it to InnoDB today as I realise there's probably a better way of locking the records with InnoDB.
Based on your comment reply:
One of the two solutions would be a client-side Python master process that collects the record IDs for all 2000 records and then splits them into chunks to be processed by sub-workers.
Short version: your choices are to delegate the work or to rely on a possibly tricky record-locking mechanism. I would recommend the former approach, as it can scale up with the aid of a message queue.
The delegate logic would use multiprocessing:
import multiprocessing
records = get_all_unprocessed_ids()
pool = multiprocessing.Pool(5) #create 5 workers
pool.map(process_records, records)
That would create 2000 tasks and run 5 at a time. Alternatively, you can split records into chunks, using a solution outlined here:
How do you split a list into evenly sized chunks?
pool.map(process_records, chunks(records, 100))
This would create 20 lists of 100 records each, processed in batches of 5.
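For reference, a minimal sketch of a chunks helper along the lines of the linked answer (the name chunks and the generator form are assumptions, not part of the original code):
def chunks(lst, n):
    # Yield successive n-sized slices of lst; the last slice may be shorter.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
Note that with chunking, process_records receives a list of 100 record IDs per call rather than a single ID, so it needs to loop over its argument.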
Edit:
Fixed a syntax error - the signature is map(func, iterable[, chunksize]) and I had left out the argument for func.
I have a table with 4 million rows and I use psycopg2 to execute a:
SELECT * FROM .. WHERE query
I hadn't heard of server-side cursors before, and I am reading that they are good practice when you expect lots of results.
I find the documentation a bit limited and I have some basic questions.
First I declare the server-side cursor as:
cur = conn.cursor('cursor-name')
then I execute the query as:
cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)
My question is: What do I do now? How do I get the results?
Do I iterate through the rows as:
row = cur.fetchone()
while row:
    row = cur.fetchone()
or do I use fetchmany() and do this:
row = cur.fetchmany(10)
But in the second case how can I "scroll" the results?
Also what is the point of itersize?
Psycopg2 has a nice interface for working with server side cursors. This is a possible template to use:
with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            ...  # process row
The code above creates the connection and automatically places the query result into a server side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server side cursor. The value you use should balance number of network calls versus memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default value) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.
If you want to use fetchmany for some reason, you could do something like this:
while True:
    rows = cursor.fetchmany(100)
    if len(rows) > 0:
        for row in rows:
            ...  # process row
    else:
        break
This usage of fetchmany will not trigger a network call to the server for more rows until the prefetched batch has been exhausted. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)
I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory, especially if you're making Python domain objects out of those rows or something like that. I'm not sure if the uuid4 in the name is necessary, but my thought is that I want individual server-side cursors that don't overlap if two processes make the same query.
from typing import Iterable
from uuid import uuid4

import psycopg2

def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000
            query = "SELECT * FROM ..."
            cursor.execute(query)
            for row in cursor:
                yield MyDomainObject(row)
I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.
In addition to cur.fetchmany(n), you can use PostgreSQL cursors:
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
I have a very simple MySQL query like the following:
db = getDB()
cursor = db.cursor()
cursor.execute('select * from users')
results = cursor.fetchall()
for row in results:
    process(row)
Suppose the users table has 1 billion records and the process method takes 10ms per record.
The above code fetches all of the data to the client side before the process method even starts, which really wastes time. Should I run the query and the processing in parallel?
So I'd like to change fetchall() to fetchmany() and start a new thread that processes the retrieved results while the cursor is fetching the next batch.
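A rough sketch of that idea, reusing the getDB() and process() helpers from above (the batch size, the queue bound, and the thread setup are assumptions): one worker thread drains batches from a queue while the main thread keeps calling fetchmany().
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def worker(batches):
    # Process batches until the None sentinel arrives.
    while True:
        rows = batches.get()
        if rows is None:
            break
        for row in rows:
            process(row)

db = getDB()
cursor = db.cursor()
cursor.execute('select * from users')

batches = queue.Queue(maxsize=4)   # bounded, so memory stays flat
t = threading.Thread(target=worker, args=(batches,))
t.start()

while True:
    rows = cursor.fetchmany(1000)  # fetch the next batch while the worker processes the previous one
    if not rows:
        break
    batches.put(rows)

batches.put(None)                  # signal the worker that we're done
t.join()
db.close()
Two caveats: if process() is CPU-bound, the GIL means threads won't help much and multiprocessing is the better fit; and with MySQLdb the default cursor still buffers the whole result set on the client, so a server-side cursor (MySQLdb.cursors.SSCursor) is needed for fetchmany() to actually stream.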
My Python code produces a table with weeks as columns and URLs accessed as rows. To get the data for each cell, a query on a MySQL database is executed. The code runs very slowly. I've added indexes to the MySQL tables and this has not really helped. I thought it was because I was building the HTML table code with concatenation, but even using a list and join has not fixed the speed. The code runs slowly in both Django (using an additional database connection) and standalone Python. Any help speeding this up would be appreciated.
Example query that is called from a loop:
def get_postcounts(week):
    pageviews = 0
    cursor = connections['olap'].cursor()
    sql = "SELECT SUM(F.pageview) AS pageviews FROM fact_coursevisits F INNER JOIN dim_dates D ON F.Date_Id = D.Id WHERE D.date_week=%d;" % (week)
    row_count = cursor.execute(sql)
    result = cursor.fetchall()
    for row in result:
        if row[0] is not None:
            pageviews = int(row[0])
    cursor.close()
    return pageviews
It could be because of the number of queries that you are executing (if you are having to call this method a lot).
I would suggest querying the view count and the week over a certain period in one single query and reading off the results, as sketched below.
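For illustration, a sketch of that single-query approach against the same fact_coursevisits / dim_dates schema (the function name, the week-range parameters, and the dict return shape are assumptions):
def get_postcounts_by_week(start_week, end_week):
    # One aggregated query for the whole period instead of one query per week.
    cursor = connections['olap'].cursor()
    sql = ("SELECT D.date_week, SUM(F.pageview) AS pageviews "
           "FROM fact_coursevisits F "
           "INNER JOIN dim_dates D ON F.Date_Id = D.Id "
           "WHERE D.date_week BETWEEN %s AND %s "
           "GROUP BY D.date_week")
    cursor.execute(sql, [start_week, end_week])
    counts = {}
    for week, pageviews in cursor.fetchall():
        if pageviews is not None:
            counts[week] = int(pageviews)
    cursor.close()
    return counts
Building the HTML table then becomes a dictionary lookup per cell instead of a database round trip per cell.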
I have a table called tbltest which has 4 columns: Id, Fname, Lname, Iscategorized. I have to copy the data from its first three columns to another table, tblcopy, which has 4 columns: Id, Fname, Lname, Service_number. I copy the data only when Iscategorized is 0, and after copying I update it to 1. The Service_number column records which Python service copied the data. Following is the code I use for copying with service 1.
#!/usr/bin/python
import time
import MySQLdb

var = True
while var == True:
    # Open database connection
    db = MySQLdb.connect("localhost", "root", "amanbaweja", "test")
    # prepare a cursor object using cursor() method
    cursor = db.cursor()
    sql = "SELECT * FROM tbltester WHERE iscategorized = '%d'" % (0) + " limit 0,1 "
    # Execute the SQL command
    cursor.execute(sql)
    # Fetch all the rows in a list of lists.
    results = cursor.fetchall()
    for row in results:
        id = row[0]
        fname = row[1]
        lname = row[2]
        iscategorized = row[3]
        # Now print fetched result
        print "id=%s,fname=%s,lname=%s,iscategorized=%s" % \
            (id, fname, lname, iscategorized)
        cursor.execute('''INSERT into tblcopy (Id, Fname, Lname, Service_number)
                          values(%s, %s, %s, %s)''', (id, fname, lname, "service1"))
        sql1 = "UPDATE tbltester SET iscategorized = 1 WHERE Id = '%s'" % id
        cursor.execute(sql1)
    db.commit()
    db.close()
Now, as my database is dynamically getting bigger and bigger, I am using multiple machines to run my Python services. The Python services run together under supervisor. If I run 10 services with the above-mentioned code, approximately 5 duplicate entries get created in tblcopy, because 5 Python services pick up the same id at once. Is there any SQL method to solve my problem? Can we do this using a stored procedure?
Thanks for your help in advance.
There is no point in parallelising this operation, as it is I/O bound: all SELECTs and INSERTs need to go through the same bottleneck, namely the database engine and the hard disk. This approach will actually be slower, because you are now introducing concurrency issues.
Rewrite your (single-threaded) process like this:
START TRANSACTION;
SELECT id FROM tbltester WHERE iscategorized = 0 FOR UPDATE;
INSERT into tblcopy
SELECT id, fname, lname, "service1"
FROM tbltester WHERE iscategorized = 0;
UPDATE tbltester SET iscategorized = 1 WHERE iscategorized = 0;
COMMIT;
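For reference, a rough sketch of how that transaction could be driven from Python with MySQLdb (connection settings copied from the question; the endless loop and the sleep are assumptions, and the tables need to be InnoDB for the FOR UPDATE row locks to work):
import time
import MySQLdb

db = MySQLdb.connect("localhost", "root", "amanbaweja", "test")
cursor = db.cursor()

while True:
    cursor.execute("START TRANSACTION")
    cursor.execute("SELECT id FROM tbltester WHERE iscategorized = 0 FOR UPDATE")
    cursor.execute("""INSERT INTO tblcopy (Id, Fname, Lname, Service_number)
                      SELECT id, fname, lname, 'service1'
                      FROM tbltester WHERE iscategorized = 0""")
    cursor.execute("UPDATE tbltester SET iscategorized = 1 WHERE iscategorized = 0")
    db.commit()
    time.sleep(1)  # avoid hammering the database when there is nothing to copy
Because SELECT ... FOR UPDATE locks the matching rows for the duration of the transaction, two services running this block at the same time cannot copy the same rows twice.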
It would be a different story if there was some significant (long-lasting) processing between your initial SELECT and your final UPDATE.
This code was inefficient for several other reasons:
the connection to MySQL is needlessly opened and closed at each iteration (instead, open and close the connection outside of the loop)
only one record is processed at a time (instead, process as many records at a time as possible)
a transaction is started and committed at each iteration (instead, it may be acceptable to commit only every now and then, say every 10 iterations)
Also, it looks like there is an infinite while loop. If you want to have a "service" constantly copying data between tables, you might want to add a short delay inside your loop so as to avoid constantly hitting your database when there is nothing to process. Alternatively, and probably preferably, you may want to look into triggers.
I have written a Python script that takes a 1.5 G XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
import cStringIO
import psycopg2

def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=(cols))
    conn.commit()
I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.
There are several things that can slow inserts as tables grow:
Triggers that have to do more work as the DB grows
Indexes, which get more expensive to update as they grow
Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs, you'll need to keep them and wear the cost.
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
You'll also save time by not re-connecting for each block of work. Maintain a connection.
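A minimal sketch of that last point, keeping the question's db_update body but taking an already-open connection as an argument instead of reconnecting per batch (the reshuffled signature is an assumption):
import cStringIO
import psycopg2

def db_update(conn, val_str, tbl, cols):
    # Reuse the long-lived connection; only a cursor is created per batch.
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()
    cur.close()

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
# ... call db_update(conn, val_str, tbl, cols) once per batch of parsed nodes ...
conn.close()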
From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do it later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>"); cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).
Also, it may seem stupid, but it has happened to me many times: make sure you reset your val_str after you call db_update. For me, when the copy_from/inserts start to go slower it's because I'm inserting the same rows plus more rows.
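A hypothetical calling loop illustrating that point (parse_nodes, format_node, and xml_file are made-up names standing in for the question's XML parsing code):
val_str = ""
for i, node in enumerate(parse_nodes(xml_file), start=1):
    val_str += format_node(node)   # accumulate tab-separated rows
    if i % 1000 == 0:
        db_update(val_str, tbl, cols)
        val_str = ""               # reset the buffer, otherwise old rows get copied again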
I'm using the following and I don't get any hit on performance as far as I have seen:
import psycopg2
import psycopg2.extras
local_conn_string = """
host='localhost'
port='5432'
dbname='backupdata'
user='postgres'
password='123'"""
local_conn = psycopg2.connect(local_conn_string)
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
I have made the following outputs in my code to test run-time (and I am parsing a LOT of rows. More than 30.000.000).
Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11
I have to mention I don't "copy" anything. But I am moving my rows from a remote PostgreSQL to a local one, and in the process I create a few more tables to index my data better than before, as 30.000.000+ rows is a bit too much to handle with regular queries.
NB: The time is counting upwards and is not for each query.
I believe it has to do with the way my cursor is created.
EDIT1:
I am using the following to run my query:
local_cursor.execute("""SELECT * FROM data;""")
row_count = 0
for row in local_cursor:
    if row_count % 100000 == 0 and row_count != 0:
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1
print("Finished running script!")
print("Parsed %s rows" % row_count)
my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB, and eventually deletes it from the remote DB once the data is verified as having been moved to my local DB.
EDIT2:
It takes roughly 1 minute to parse every 100.000 rows in my DB, even after parsing around 4.000.000 rows:
Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50