Using DECLARE CURSOR with cx_Oracle - Python

I am trying to parse an SQL script file on the delimiter ";" and then call cx_Oracle to connect to the DB server and execute the statements. I have run into an issue with a cursor-related block of code. My call structure is this:
ScriptHandle = open(filepath)
SqlScript = ScriptHandle.read()
SqlCommands = SqlScript.split(';')

for sqlcommand in SqlCommands:
    print sqlcommand, '\n' * 3
    if sqlcommand:
        ODBCCon.ExecuteWithCx_Oracle(cursor, sqlcommand)
The problem I have comes with the following sql block:
DECLARE CURSOR date_cur IS (select calendar_date
                            from cg_calendar dates
                            where dates.calendar_date between '30-Jun-2014' and '31-Jul-2015'
                            and global_business_or_holiday = 'B');
BEGIN
    FOR date_rec in date_cur LOOP
        insert into fc_pos
        SELECT PP.acctid,
               PP.mgrid,
               PP.activitydt,
               PP.secid,
               PP.shrparamt,
               PP.lclmktval,
               PP.usdmktval,
               SM.asset_name_1,
               SM.asset_name_2,
               SM.cg_sym,
               SM.fc_local_crncy_cd,
               SM.fc_local_crncy_id,
               SM.fc_trade_cd,
               substr(SM.asset_name_1, 17, 3) as against_crncy_cd
        FROM FC_acct_mgr AM, asset SM, ma_mktval PP
        WHERE PP.dw_asset_id = SM.dw_asset_id
          AND PP.secid = SM.asset_id
          AND PP.activitydt = date_rec.calendar_date
          AND AM.acctid = PP.acctid
          AND AM.mgrid = PP.mgrid
          AND SM.asset_categ_cd = 'FC';
    END LOOP;
END;
The Python parse step above breaks this code apart on the ";" delimiter, whereas I need to treat it as one block starting at the DECLARE and ending at the final END;.
How can I accomplish this from the Python end? I have been unable to make any headway on this, and it is a legacy process flow that I am automating.
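A minimal sketch of the kind of block-aware splitter I have in mind is below. It assumes every PL/SQL block starts with DECLARE or BEGIN and ends with a line containing only END;, with no nested blocks; split_sql_script is just an illustrative name, not existing code.
import re

def split_sql_script(script):
    """Split an SQL script into statements, keeping anonymous PL/SQL blocks
    (DECLARE/BEGIN ... END;) together instead of splitting on every ';'."""
    statements = []
    buffer = []
    in_plsql = False
    for line in script.splitlines():
        stripped = line.strip()
        if not in_plsql and re.match(r'^(DECLARE|BEGIN)\b', stripped, re.IGNORECASE):
            in_plsql = True
        buffer.append(line)
        if in_plsql:
            # the block ends at a line that is just END; (keep the ';' for PL/SQL)
            if re.match(r'^END\s*;\s*$', stripped, re.IGNORECASE):
                statements.append('\n'.join(buffer).strip())
                buffer, in_plsql = [], False
        elif stripped.endswith(';'):
            # plain SQL statement: strip the trailing ';' before handing to cx_Oracle
            statements.append('\n'.join(buffer).strip().rstrip(';'))
            buffer = []
    if ''.join(buffer).strip():
        statements.append('\n'.join(buffer).strip())
    return statements

# usage, replacing SqlScript.split(';') above:
# for sqlcommand in split_sql_script(SqlScript):
#     if sqlcommand:
#         ODBCCon.ExecuteWithCx_Oracle(cursor, sqlcommand)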
Thanks in advance.

Related

Accessing *.mdb file with Pandas [duplicate]

I have 7 tables that I want to read from an Access file (.mdb). I then need to change the values using pandas DataFrames and save them again in a new Access file. Do you have any suggestions on how to do that?
I am relatively new to Python, and any support is highly appreciated.
This may be of some help: https://pypi.python.org/pypi/pandas_access
Everything should be straightforward once you're able to load the tables into pandas DataFrames. Then do the data manipulations you need and send the results back to Access.
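A rough sketch of that workflow with pyodbc and pandas.read_sql might look like the following. The file paths, table, and column names are placeholders, the target table is assumed to already exist, and the write-back uses plain INSERTs because pandas.to_sql would need a SQLAlchemy dialect for Access.
import pandas as pd
import pyodbc

# read a table from the source .mdb into a DataFrame
src = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"Dbq=C:\data\source.mdb;")
df = pd.read_sql("SELECT * FROM Table1", src)

# ... do your pandas transformations here ...
df["SomeColumn"] = df["SomeColumn"] * 2

# write the result into a matching table in the new Access file
dst = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"Dbq=C:\data\target.accdb;")
cur = dst.cursor()
placeholders = ", ".join("?" * len(df.columns))
insert_sql = "INSERT INTO Table1 ({}) VALUES ({})".format(
    ", ".join(df.columns), placeholders)
cur.executemany(insert_sql, list(df.itertuples(index=False, name=None)))
dst.commit()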
I think you should check this:
https://pypi.python.org/pypi/pyodbc/
Also, to read data from an Access table, try something like this.
# -*- coding: utf-8 -*-
import pypyodbc

pypyodbc.lowercase = False
conn = pypyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};" +
    r"Dbq=C:\Users\Public\Database1.accdb;")
cur = conn.cursor()
cur.execute("SELECT CreatureID, Name_EN, Name_JP FROM Creatures")
while True:
    row = cur.fetchone()
    if row is None:
        break
    print(u"Creature with ID {0} is {1} ({2})".format(
        row.get("CreatureID"), row.get("Name_EN"), row.get("Name_JP")))
cur.close()
conn.close()
Or ... just use VBA, if you are already using Access.
Dim outputFileName As String
outputFileName = CurrentProject.Path & "\Export_" & Format(Date, "yyyyMMdd") & ".xls"
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel9, "Table1", outputFileName, True
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel9, "Table2", outputFileName, True
This could be an option too ...
strPath = "V:\Reports\Worklist_Summary.xlsx"
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel12, "qryEscByDate", strPath
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel12, "qryCreatedByDate", strPath
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel12, "qryClosedByDate", strPath
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel12, "qryCreatedByUsers", strPath
DoCmd.TransferSpreadsheet acExport, acSpreadsheetTypeExcel12, "qrySummaries", strPath
Or ... run some VBA scripts ...
Option Compare Database
Option Explicit

Private Sub Command2_Click()
    Dim strFile As String
    Dim varItem As Variant

    strFile = InputBox("Designate the path and file name to export to...", "Export")
    If (strFile = vbNullString) Then Exit Sub
    For Each varItem In Me.List0.ItemsSelected
        DoCmd.TransferSpreadsheet transferType:=acExport, _
            spreadsheetType:=acSpreadsheetTypeExcel9, _
            tableName:=Me.List0.ItemData(varItem), _
            fileName:=strFile
    Next
    MsgBox "Process complete.", vbOKOnly, "Export"
End Sub

Private Sub Form_Open(Cancel As Integer)
    Dim strTables As String
    Dim tdf As TableDef

    For Each tdf In CurrentDb.TableDefs
        If (Left(tdf.Name, 4) <> "MSys") Then
            strTables = strTables & tdf.Name & ","
        End If
    Next
    strTables = Left(strTables, Len(strTables) - 1)
    Me.List0.RowSource = strTables
End Sub
When all the data is exported, do your transformations, and load the results (back to Access or another destination).
I'll bet you don't even need the export step. You can probably do everything you need to do in Access, all by itself.

Writing ADODB Recordset to Pivot Cache With Python

I am working on a project where I am converting some VBA code to Python, in order to have Python interact with Excel in much the same way VBA would. In this particular case, I am utilizing the win32com library to have Python extract data from an Oracle database via an ADODB connection and write the resulting recordset directly to a pivot cache, i.e. creating a pivot table with data from an external source.
import win32com.client
Excel = win32com.client.gencache.EnsureDispatch('Excel.Application')
win32c = win32com.client.constants
# Create and Open Connection
conn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'Provider=OraOLEDB.Oracle; Data Source=localhost:1521/XEPDB1; User Id=system; Password=password;'
conn.Open(DSN)
# Create Excel File
wb = Excel.Workbooks.Add()
Sheet1 = wb.Worksheets("Sheet1")
# Create Recordset
RS = win32com.client.Dispatch(r'ADODB.Recordset')
RS.Open('SELECT * FROM employees', conn, 1, 3)
# Create Pivot Cache
PivotCache = wb.PivotCaches().Create(SourceType=win32c.xlExternal, Version=win32c.xlPivotTableVersion15)
# Write Recordset to Pivot Cache
PivotCache.Recordset = RS # <~~ This is where it breaks!
# Create Pivot Table
Pivot = PivotCache.CreatePivotTable(TableDestination=Sheet1.Cells(2, 2), TableName='Python Test Pivot', DefaultVersion=win32c.xlPivotTableVersion15)
# Close Connection
RS.Close()
conn.Close()
# View Excel
Excel.Visible = 1
I am successful in extracting the data via ADODB and creating an Excel file, but when I try to write the resulting recordset to the pivot cache by setting PivotCache.Recordset = RS, I get the following error.
[Running] venv\Scripts\python.exe "c:\Project\Test\debug_file_test.py"
Traceback (most recent call last):
File "c:\Project\Test\debug_file_test.py", line 29, in <module>
PivotCache.Recordset = RS # <~~ This is where it breaks!
File "c:\Project\venv\lib\site-packages\win32com\client\__init__.py", line 482, in __setattr__
self._oleobj_.Invoke(*(args + (value,) + defArgs))
pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, 'No such interface supported\r\n', None, 0, -2146827284), None)
[Done] exited with code=1 in 0.674 seconds
Can anybody shed some light on what I am doing wrong?
I ended up finding a solution to the issue and want to post an answer for anyone who may come across this question at some point.
Instead of creating the recordset with Recordset.Open(), I tried using the command object and creating the recordset with cmd.Execute(). As it turns out, Execute() returns a tuple, so I had to pass cmd.Execute()[0] to the recordset in order to make it work.
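In isolation, the relevant change is just unpacking that return value (treating the second element as the records-affected out-parameter surfaced by win32com is my assumption here):
RS, rows_affected = cmd.Execute()  # assumed layout: (Recordset, records affected)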
This doesn't answer why my initial code doesn't work, but it does provide an answer for how to write an ADODB recordset to a PivotCache with Python.
import win32com.client
#Initiate Excel Application
Excel = win32com.client.gencache.EnsureDispatch('Excel.Application')
win32c = win32com.client.constants
# Create and Open Connection
conn = win32com.client.Dispatch('ADODB.Connection')
cmd = win32com.client.Dispatch('ADODB.Command')
DSN = 'Provider=OraOLEDB.Oracle; Data Source=localhost:1521/XEPDB1; User Id=system; Password=password;'
conn.Open(DSN)
# Define Command Properties
cmd.ActiveConnection = conn
cmd.ActiveConnection.CursorLocation = win32c.adUseClient
cmd.CommandType = win32c.adCmdText
cmd.CommandText = 'SELECT * FROM employees'
# Create Excel File
wb = Excel.Workbooks.Add()
Sheet1 = wb.Worksheets("Sheet1")
# Create Recordset
RS = win32com.client.Dispatch('ADODB.Recordset')
RS = cmd.Execute()[0]
# Create Pivot Cache
PivotCache = wb.PivotCaches().Create(SourceType=win32c.xlExternal, Version=win32c.xlPivotTableVersion15)
PivotCache.Recordset = RS
# Create Pivot Table
Pivot = PivotCache.CreatePivotTable(TableDestination=Sheet1.Cells(2, 2), TableName='Python Test Pivot', DefaultVersion=win32c.xlPivotTableVersion15)
# Close Connection
RS.Close()
conn.Close()
# View Excel
Excel.Visible = 1
Update
As hinted by @Parfait, the code above also works if RS = cmd.Execute()[0] is replaced by
RS.Open(cmd)
which I actually prefer, because it keeps the Python syntax aligned with the VBA syntax.

Keep getting django.db.utils.ProgrammingError: no results to fetch when running more than 4 processes

I have a script that processes data. When I call it with 1-3 processes it seems to work fine. However, once I try to run it with 6 or more processes, Django errors out and I get django.db.utils.ProgrammingError: no results to fetch.
I've read a few threads where people said this was fixed in a later version of psycopg2, so I upgraded my package and am currently on version 2.8, but I still get the error.
EDIT: One thing I've noticed is that the error happens when 2 processes try to bulk_create at the same time. I am currently thinking about locking the database connection somehow, but I am not sure how.
The code that creates the processes:
for profile_name in profiles_to_run:
    t1 = Process(target=self.run_profile_for_inspection,
                 args=(odometer_start, odometer_end, inspection, profile_name, robot_inspection_id))
    t1.start()
    list_of_threads.append(t1)
    thread_count = thread_count + 1
return list_of_threads

def run_profile_for_inspection(self, odometer_start, odometer_end, portal_inspection, profile_name, robot_inspection_id):
    self.out("Running inspection {} from {}m to {}m for {}".format(portal_inspection.pk, odometer_start, odometer_end, profile_name))
    rerun_profile = ReRunProfile(self._options['db_name'], self.out, profile_name, portal_inspection.pk, robot_inspection_id)
    # rerun_profile.set_odometer_start_and_end(odometer_start, odometer_end)
    rerun_profile.handle_already_created_inspections()
In ReRunProfile, the get statement is what errors out when there are too many processes.
This is pseudocode:
def handle_already_created_inspections(self):
    con = psycopg2.connect(dbname=self.database)
    time.sleep(10)
    robot_inspection_id = self.get_robot_inspection_id(self.portal_inspection_id)
    portal_inspection = Inspection.objects.get(pk=self.portal_inspection_id)
    with con.cursor() as cur:
        cur.execute('Select * from data where start < odometer_start and end > odometer_end and profile = profile')
        count = 0
        processed_data = []
        for row in cur:
            processed_data.append(row.process())
            count = count + 1
            if count == 20:
                Profile.objects.bulk_create(processed_data)  # errors out here too
                count = 0
                processed_data = []
EDIT: Someone asked what the bulk_create does.
Basically, this program splits the data up by distance and by profile, then processes it and adds it to the database.
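For reference, a minimal sketch of the locking idea mentioned in the EDIT: create one multiprocessing.Lock in the parent, pass it to each Process along with the other args, and hold it around bulk_create so only one process inserts at a time. The extra lock argument and the self.db_write_lock attribute are hypothetical additions, not part of the code above.
import multiprocessing

# in the parent, before starting the processes:
db_write_lock = multiprocessing.Lock()

# then pass it through, e.g.
#   Process(target=self.run_profile_for_inspection,
#           args=(odometer_start, odometer_end, inspection,
#                 profile_name, robot_inspection_id, db_write_lock))
# and store it on ReRunProfile as self.db_write_lock.

# inside handle_already_created_inspections, serialize the bulk insert:
#   with self.db_write_lock:
#       Profile.objects.bulk_create(processed_data)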

Import text files to pig through python UDF

I'm trying to load files into Pig while using a Python UDF. I've tried two ways:
• (myudf1, sample1.pig): read the file from Python; the file is located on my client server.
• (myudf2, sample2.pig): load the file from HDFS into the Grunt shell first, then pass it as a parameter to the Python UDF.
myudf1.py
from __future__ import with_statement

def get_words(dir):
    stopwords = set()
    with open(dir) as f1:
        for line1 in f1:
            stopwords.update([line1.decode('ascii', 'ignore').split("\n")[0]])
    return stopwords

stopwords = get_words("/home/zhge/uwc/mappings/english_stop.txt")

@outputSchema("findit: int")
def findit(stp):
    stp = str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0
sample1.pig:
REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S;
I get: IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')
For solution 2:
myudf2:
def get_wordlists(wordbag):
    stopwords = set()
    for t in wordbag:
        stopwords.update(t.decode('ascii', 'ignore'))
    return stopwords

@outputSchema("findit: int")
def findit(stopwordbag, stp):
    stopwords = get_wordlists(stopwordbag)
    stp = str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0
Sample2.pig
REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;
stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and I can see the "stops" object is loaded into Pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;
Then I got:
ERROR org.apache.pig.tools.grunt.Grunt -ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar has more than one row in the output. 1st : (a), 2nd :(as
Your second example should work, though you LIMITed the wrong expression -- it should be on the stops relation. It should therefore be:
stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);
However, since it looks like you need to process all of the stop words first, this will not be enough. You'll need to do a GROUP ALL and then pass the results to your get_wordlists function instead:
stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = FOREACH (GROUP stops ALL) GENERATE pyudf.get_wordlists(stops) AS ready;
S = FOREACH item_title GENERATE pyudf.findit(T.ready, title);
You'll have to update your UDF to accept a list of dicts, though, for this method to work.

in python: child processes going defunct while others are not, unsure why

edit: the answer was that the OS was killing processes because I was consuming all the memory
I am spawning enough subprocesses to keep the load average 1:1 with cores. However, at some point within the hour (this script could run for days), 3 of the processes go:
tipu 14804 0.0 0.0 328776 428 pts/1 Sl 00:20 0:00 python run.py
tipu 14808 64.4 24.1 2163796 1848156 pts/1 Rl 00:20 44:41 python run.py
tipu 14809 8.2 0.0 0 0 pts/1 Z 00:20 5:43 [python] <defunct>
tipu 14810 60.3 24.3 2180308 1864664 pts/1 Rl 00:20 41:49 python run.py
tipu 14811 20.2 0.0 0 0 pts/1 Z 00:20 14:04 [python] <defunct>
tipu 14812 22.0 0.0 0 0 pts/1 Z 00:20 15:18 [python] <defunct>
tipu 15358 0.0 0.0 103292 872 pts/1 S+ 01:30 0:00 grep python
I have no idea why this is happening. Attached are the master and the slave. I can attach the mysql/pg wrappers as well if needed. Any suggestions?
slave.py:
from boto.s3.key import Key
import multiprocessing
import gzip
import os
from mysql_wrapper import MySQLWrap
from pgsql_wrapper import PGSQLWrap
import boto
import re


class Slave:
    CHUNKS = 250000
    BUCKET_NAME = "bucket"
    AWS_ACCESS_KEY = ""
    AWS_ACCESS_SECRET = ""
    KEY = Key(boto.connect_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET).get_bucket(BUCKET_NAME))
    S3_ROOT = "redshift_data_imports"
    COLUMN_CACHE = {}
    DEFAULT_COLUMN_VALUES = {}

    def __init__(self, job_queue):
        self.log_handler = open("logs/%s" % str(multiprocessing.current_process().name), "a")
        self.mysql = MySQLWrap(self.log_handler)
        self.pg = PGSQLWrap(self.log_handler)
        self.job_queue = job_queue

    def do_work(self):
        self.log(str(os.getpid()))
        while True:
            #sample job in the abstract: mysql_db.table_with_date-iteration
            job = self.job_queue.get()

            #queue is empty
            if job is None:
                self.log_handler.close()
                self.pg.close()
                self.mysql.close()
                print("good bye and good day from %d" % (os.getpid()))
                self.job_queue.task_done()
                break

            #curtail iteration
            table = job.split('-')[0]

            #strip redshift table from job name
            redshift_table = re.sub(r"(_[1-9].*)", "", table.split(".")[1])

            iteration = int(job.split("-")[1])
            offset = (iteration - 1) * self.CHUNKS

            #columns redshift is expecting
            #bad tables will slip through and error out, so we catch it
            try:
                colnames = self.COLUMN_CACHE[redshift_table]
            except KeyError:
                self.job_queue.task_done()
                continue

            #mysql fields to use in SELECT statement
            fields = self.get_fields(table)

            #list subtraction determining which columns redshift has that mysql does not
            delta = (list(set(colnames) - set(fields.keys())))

            #subtract columns that have a default value and so do not need padding
            if delta:
                delta = list(set(delta) - set(self.DEFAULT_COLUMN_VALUES[redshift_table]))

            #concatinate columns with padded \N
            select_fields = ",".join(fields.values()) + (",\\N" * len(delta))

            query = "SELECT %s FROM %s LIMIT %d, %d" % (select_fields, table,
                    offset, self.CHUNKS)
            rows = self.mysql.execute(query)

            self.log("%s: %s\n" % (table, len(rows)))

            if not rows:
                self.job_queue.task_done()
                continue

            #if there is more data potentially, add it to the queue
            if len(rows) == self.CHUNKS:
                self.log("putting %s-%s" % (table, (iteration + 1)))
                self.job_queue.put("%s-%s" % (table, (iteration + 1)))

            #various characters need escaping
            clean_rows = []
            redshift_escape_chars = set(["\\", "|", "\t", "\r", "\n"])
            in_chars = ""

            for row in rows:
                new_row = []
                for value in row:
                    if value is not None:
                        in_chars = str(value)
                    else:
                        in_chars = ""

                    #escape any naughty characters
                    new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
                new_row = "\t".join(new_row)
                clean_rows.append(new_row)

            rows = ",".join(fields.keys() + delta)
            rows += "\n" + "\n".join(clean_rows)

            offset = offset + self.CHUNKS

            filename = "%s-%s.gz" % (table, iteration)
            self.move_file_to_s3(filename, rows)

            self.begin_data_import(job, redshift_table, ",".join(fields.keys() +
                delta))

            self.job_queue.task_done()

    def move_file_to_s3(self, uri, contents):
        tmp_file = "/dev/shm/%s" % str(os.getpid())
        self.KEY.key = "%s/%s" % (self.S3_ROOT, uri)
        self.log("key is %s" % self.KEY.key)

        f = gzip.open(tmp_file, "wb")
        f.write(contents)
        f.close()

        #local saving allows for debugging when copy commands fail
        #text_file = open("tsv/%s" % uri, "w")
        #text_file.write(contents)
        #text_file.close()

        self.KEY.set_contents_from_filename(tmp_file, replace=True)

    def get_fields(self, table):
        """
        Returns a dict used as:
            {"column_name": "altered_column_name"}
        Currently only the debug column gets altered
        """
        exclude_fields = ["_qproc_id", "_mob_id", "_gw_id", "_batch_id", "Field"]
        query = "show columns from %s" % (table)
        fields = self.mysql.execute(query)

        #key raw field, value mysql formatted field
        new_fields = {}

        #for field in fields:
        for field in [val[0] for val in fields]:
            if field in exclude_fields:
                continue
            old_field = field

            if "debug_mode" == field.strip():
                field = "IFNULL(debug_mode, 0)"

            new_fields[old_field] = field

        return new_fields

    def log(self, text):
        self.log_handler.write("\n%s" % text)

    def begin_data_import(self, table, redshift_table, fields):
        query = "copy %s (%s) from 's3://bucket/redshift_data_imports/%s' \
            credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '\\t' \
            gzip NULL AS '' COMPUPDATE ON ESCAPE IGNOREHEADER 1;" \
            % (redshift_table, fields, table, self.AWS_ACCESS_KEY, self.AWS_ACCESS_SECRET)

        self.pg.execute(query)
master.py:
from slave import Slave as Slave
import multiprocessing
from mysql_wrapper import MySQLWrap as MySQLWrap
from pgsql_wrapper import PGSQLWrap as PGSQLWrap


class Master:
    SLAVE_COUNT = 5

    def __init__(self):
        self.mysql = MySQLWrap()
        self.pg = PGSQLWrap()

    def do_work(table):
        pass

    def get_table_listings(self):
        """Gathers a list of MySQL log tables needed to be imported"""
        query = 'show databases'
        result = self.mysql.execute(query)

        #turns list[tuple] into a flat list
        databases = list(sum(result, ()))

        #overriding during development
        databases = ['db1', 'db2', 'db3']

        exclude = ('mysql', 'Database', 'information_schema')
        scannable_tables = []

        for database in databases:
            if database in exclude:
                continue

            query = "show tables from %s" % database
            result = self.mysql.execute(query)

            #turns list[tuple] into a flat list
            tables = list(sum(result, ()))

            for table in tables:
                exclude = ("Tables_in_%s" % database, "(", "201303", "detailed", "ltv")

                #exclude any of the unfavorables
                if any(s in table for s in exclude):
                    continue

                scannable_tables.append("%s.%s-1" % (database, table))

        return scannable_tables

    def init(self):
        #fetch redshift columns once and cache
        #get columns from redshift so we can pad the mysql column delta with nulls
        tables = ('table1', 'table2', 'table3')

        for table in tables:
            #cache columns
            query = "SELECT column_name FROM information_schema.columns WHERE \
                table_name = '%s'" % (table)
            result = self.pg.execute(query, async=False, ret=True)
            Slave.COLUMN_CACHE[table] = list(sum(result, ()))

            #cache default values
            query = "SELECT column_name FROM information_schema.columns WHERE \
                table_name = '%s' and column_default is not \
                null" % (table)
            result = self.pg.execute(query, async=False, ret=True)

            #turns list[tuple] into a flat list
            result = list(sum(result, ()))

            Slave.DEFAULT_COLUMN_VALUES[table] = result

    def run(self):
        self.init()

        job_queue = multiprocessing.JoinableQueue()

        tables = self.get_table_listings()
        for table in tables:
            job_queue.put(table)

        processes = []
        for i in range(Master.SLAVE_COUNT):
            process = multiprocessing.Process(target=slave_runner, args=(job_queue,))
            process.daemon = True
            process.start()
            processes.append(process)

        #blocks this process until queue reaches 0
        job_queue.join()

        #signal each child process to GTFO
        for i in range(Master.SLAVE_COUNT):
            job_queue.put(None)

        #blocks this process until queue reaches 0
        job_queue.join()

        job_queue.close()

        #do not end this process until child processes close out
        for process in processes:
            process.join()

        #toodles !
        print("this is master saying goodbye")


def slave_runner(queue):
    slave = Slave(queue)
    slave.do_work()
There's not enough information to be sure, but the problem is very likely to be that Slave.do_work is raising an unhandled exception. (There are many lines of your code that could do that in various different conditions.)
When you do that, the child process will just exit.
On POSIX systems… well, the full details are a bit complicated, but in the simple case (what you have here), a child process that exits will stick around as a <defunct> process until it gets reaped (because the parent either waits on it, or exits). Since your parent code doesn't wait on the children until the queue is finished, that's exactly what happens.
So, there's a simple duct-tape fix:
def do_work(self):
    self.log(str(os.getpid()))
    while True:
        try:
            # the rest of your code
        except Exception as e:
            self.log("something appropriate {}".format(e))
            # you may also want to post a reply back to the parent
You might also want to break the massive try up into different ones, so you can distinguish between all the different stages where things could go wrong (especially if some of them mean you need a reply, and some mean you don't).
However, it looks like what you're attempting to do is duplicate exactly the behavior of multiprocessing.Pool, but you've missed the mark in a couple of places. Which raises the question: why not just use Pool in the first place? You could then simplify/optimize things even further by using one of the map family of methods. For example, your entire Master.run could be reduced to:
self.init()
tables = self.get_table_listings()
pool = multiprocessing.Pool(Master.SLAVE_COUNT, initializer=slave_setup)
pool.map(slave_job, tables)
pool.close()
pool.join()
And this will handle exceptions for you, allow you to return values/exceptions if you later need them, let you use the built-in logging module instead of trying to build your own, and so on. It should only take about a dozen lines of minor code changes to Slave, and then you're done.
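As a rough sketch of what those changes might look like (slave_setup, slave_job, and Slave.process_job are hypothetical names for splitting the body of the current do_work loop out of the queue handling; they are not functions from the original code):
from slave import Slave

_slave = None  # per-process Slave instance, created once per worker

def slave_setup():
    # runs once in each worker process: opens the log handle and the
    # MySQL/Postgres connections a single time, as Slave.__init__ does now
    global _slave
    _slave = Slave(job_queue=None)  # no queue needed when Pool.map hands out jobs

def slave_job(table):
    # handles one "db.table-iteration" job; essentially the body of the current
    # do_work loop, refactored into a method that takes the job directly
    try:
        return _slave.process_job(table)
    except Exception as e:
        _slave.log("job %s failed: %s" % (table, e))
        return e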
If you want to submit new jobs from within jobs, the easiest way to do this is probably with a Future-based API (which turns things around, making the future result the focus and the pool/executor the dumb thing that provides it, instead of making the pool the focus and the result the dumb thing it gives back), but there are multiple ways to do it with Pool as well. For example, right now you're not returning anything from each job, so you can just return a list of tables to execute. Here's a simple example that shows how to do it:
import multiprocessing

def foo(x):
    print(x, x**2)
    return list(range(x))

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [5]
    while jobs:
        jobs, oldjobs = [], jobs
        for job in oldjobs:
            jobs.extend(pool.apply(foo, [job]))
    pool.close()
    pool.join()
Obviously you can condense this a bit by replacing the whole loop with, e.g., a list comprehension fed into itertools.chain, and you can make it a lot cleaner-looking by passing a "submitter" object to each job and adding to that instead of returning a list of new jobs, and so on. But I wanted to make it as explicit as possible to show how little there is to it.
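For illustration, the condensed version of the loop above might look like this (same behavior, just a list comprehension fed into itertools.chain):
import itertools
import multiprocessing

def foo(x):
    print(x, x**2)
    return list(range(x))

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [5]
    while jobs:
        # each finished job may return new jobs; chain flattens them into the next batch
        jobs = list(itertools.chain(*[pool.apply(foo, [job]) for job in jobs]))
    pool.close()
    pool.join()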
At any rate, if you think the explicit queue is easier to understand and manage, go for it. Just look at the source for multiprocessing.worker and/or concurrent.futures.ProcessPoolExecutor to see what you need to do yourself. It's not that hard, but there are enough things you could get wrong (personally, I always forget at least one edge case when I try to do something like this myself) that it's worth looking at code that gets it right.
Alternatively, it seems like the only reason you can't use concurrent.futures.ProcessPoolExecutor here is that you need to initialize some per-process state (the boto.s3.key.Key, MySQLWrap, etc.), for what are probably very good caching reasons. (If this involves a web-service query, a database connection, etc., you certainly don't want to do that once per task!) But there are a few different ways around that.
But you can subclass ProcessPoolExecutor and override the undocumented function _adjust_process_count (see the source for how simple it is) to pass your setup function, and… that's all you have to do.
Or you can mix and match. Wrap the Future from concurrent.futures around the AsyncResult from multiprocessing.
