Inserting into a Cassandra DB is slow even with execute_concurrent() - python

I am trying to insert a pandas dataframe into Cassandra. I am using execute_concurrent, but I don't see any improvement: it is taking almost 5 s per row to insert. There are 14k rows, so at this rate it will take more than 15 hours. I have 12 GB RAM with 2 CPU cores. How fast can I run this operation? I've tried different concurrency numbers but without any success. Following is my code:
from flask import session
import yaml
import pandas as pd
import argparse
from get_data import read_params
import cassandra
from cassandra.concurrent import execute_concurrent_with_args, execute_concurrent
from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.auth import PlainTextAuthProvider
import sys
import time
def progressbar(it, prefix="", size=60, out=sys.stdout): # Python3.3+
    count = len(it)
    def show(j):
        x = int(size*j/count)
        print("{}[{}{}] {}/{}".format(prefix, u"█"*x, "."*(size-x), j, count),
              end='\r', file=out, flush=True)
    show(0)
    for i, item in enumerate(it):
        yield item
        show(i+1)
    print("\n", flush=True, file=out)
def cassandraDBLoad(config_path):
    try:
        config = read_params(config_path)
        execution_profile = ExecutionProfile(request_timeout=10)
        cassandra_config = {'secure_connect_bundle': "path"}
        auth_provider = PlainTextAuthProvider(
            "client_id",
            "client_secret"
        )
        cluster = Cluster(cloud=cassandra_config, auth_provider=auth_provider)
        session = cluster.connect()
        session.default_timeout = None
        connect_db = session.execute("select release_version from system.local")
        set_keyspace = session.set_keyspace("Keyspace Name")
        table_ = "big_mart"
        define_columns = "Item_Identifier varchar PRIMARY KEY, Item_Weight varchar, Item_Fat_Content varchar, Item_Visibility varchar, Item_Type varchar, Item_MRP varchar, Outlet_Identifier varchar, Outlet_Establishment_Year varchar, Outlet_Size varchar, Outlet_Location_type varchar, Outlet_Type varchar, Item_Outlet_Sales varchar, source varchar"
        drop_table = f"DROP TABLE IF EXISTS {table_}"
        drop_result = session.execute(drop_table)
        create_table = f"CREATE TABLE {table_}({define_columns});"
        table_result = session.execute(create_table)
        train = pd.read_csv("train_source")
        test = pd.read_csv("test_source")
        #Combine test and train into one file
        train['source']='train'
        test['source']='test'
        df = pd.concat([train, test],ignore_index=True)
        df = df.fillna('NA')
        columns = "Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type, Item_Outlet_Sales, source"
        insert_qry = f"INSERT INTO {table_}({columns}) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)"
        statement = session.prepare(insert_qry)
        parameters = [
            (str(df.iat[i,0]), str(df.iat[i,1]), str(df.iat[i,2]), str(df.iat[i,3]),
             str(df.iat[i,4]), str(df.iat[i,5]), str(df.iat[i,6]), str(df.iat[i,7]),
             str(df.iat[i,8]), str(df.iat[i,9]), str(df.iat[i,10]), str(df.iat[i,11]),
             str(df.iat[i,12]))
            for i in range(len(df))]
        for i in progressbar(range(len(df)), "Computing: ", 40):
            time.sleep(0.1)
            execute_concurrent_with_args(
                session,
                statement,
                parameters,
                concurrency=500
            )
        session.execute(batch)
    except Exception as e:
        raise Exception("(cassandraDBLoad): Something went wrong in the CassandraDB Load operations\n" + str(e))
csv files link - https://drive.google.com/drive/folders/1O03lNTMfSwhUKG61zOs7fNxXIRe44GRp?usp=sharing

Even with concurrent asynchronous requests (execute_concurrent()), it will still be bottlenecked on the client side because there is only so much a single client process can do even when it's multi-threaded.
If you want to maximise the throughput of your cluster, we recommend scaling your app horizontally and running multiple instances (processes). This can be achieved easily with the Python driver using the multiprocessing module. For details, see the Python driver Performance Notes.
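As a rough sketch of that approach (not the driver documentation's exact example), each worker process can open its own session in a Pool initializer and push one chunk of rows per call to execute_concurrent_with_args. The bundle path, keyspace name, and the trimmed three-column insert below are placeholders standing in for the question's setup:

import multiprocessing
import pandas as pd
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.concurrent import execute_concurrent_with_args

# per-process globals, created once in each worker by init_worker()
session = None
statement = None

def init_worker():
    global session, statement
    # placeholder connection details, same shape as in the question
    auth_provider = PlainTextAuthProvider("client_id", "client_secret")
    cluster = Cluster(cloud={'secure_connect_bundle': "path"}, auth_provider=auth_provider)
    session = cluster.connect("your_keyspace")  # placeholder keyspace
    # trimmed to three columns for brevity; use the full column list in practice
    statement = session.prepare(
        "INSERT INTO big_mart (Item_Identifier, Item_Weight, Item_Fat_Content) VALUES (?, ?, ?)"
    )

def load_chunk(rows):
    # one call sends the whole chunk, with up to `concurrency` requests in flight
    execute_concurrent_with_args(session, statement, rows, concurrency=100)
    return len(rows)

if __name__ == '__main__':
    df = pd.read_csv("train_source").fillna('NA')
    params = [(str(r.Item_Identifier), str(r.Item_Weight), str(r.Item_Fat_Content))
              for r in df.itertuples(index=False)]
    chunks = [params[i:i + 1000] for i in range(0, len(params), 1000)]
    with multiprocessing.Pool(processes=2, initializer=init_worker) as pool:
        total = sum(pool.map(load_chunk, chunks))
    print(f"inserted {total} rows")

With 2 cores, two worker processes and a per-process concurrency of around 100 is a reasonable starting point; both numbers are knobs to tune against your cluster.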
Finally, if your goal is to simply bulk-load data to your Cassandra DB, it makes no sense to re-invent the wheel by writing your own application when there are free, open-source tools that exist specifically for this use case.
You can use the DataStax Bulk Loader tool (DSBulk) to bulk load data in CSV format to a Cassandra table. Here are some references with examples to help you get started quickly:
Blog - DSBulk Intro + Loading data
Blog - More DSBulk Loading examples
Blog - Counting records with DSBulk
Docs - Loading data examples
DSBulk is open-source so it's free to use. Cheers!

Related

How can I refresh the data in the background of a running flask app?

I have a simple Flask app that queries a database to write a CSV, then uses pyplot to create a chart out of that.
I would like to refresh the data in the background every 10 minutes while the app is running. The page doesn't need to refresh the HTML automatically; it just needs to have fresh data when someone opens the page.
Can I do that in a single script? Or do I need to run a different script outside in crontab or something?
I would just kick over the container every 10 minutes, but the query takes about 5 minutes, so that's a 5-minute outage. Not a great idea; I'd prefer it to fetch in the background.
Here is what I'm working with:
import os
from datetime import date
import teradatasql
import pandas as pd
import matplotlib.pyplot as plt
from flask import Flask, render_template
import time
import multitasking
### variables
ausername = os.environ.get('dbuser')
apassword = os.environ.get('dbpassword')
ahost = os.environ.get('dbserver')
systems = ["prd1", "prd2", "frz1", "frz2", "devl"]
qgsystems = ["", "#Tera_Prd2_v2", "#Tera_Frz1_v2", "#Tera_Frz2_v2", "#Tera_Devl_v2"]
weeks = ["0", "7", "30"]
query = """{{fn teradata_write_csv({system}_{week}_output.csv)}}select (bdi.infodata) as sysname,
to_char (thedate, 'MM/DD' ) || ' ' || Cast (thetime as varchar(11)) as Logtime,
sum(drc.cpuuexec)/sum(drc.secs) (decimal(7,2)) as "User CPU",
sum(drc.cpuuserv)/sum(drc.secs) (decimal(7,2)) as "System CPU",
sum(drc.cpuiowait)/sum(drc.secs) (decimal(7,2)) as "CPU IO Wait"
from dbc.resusagescpu{qgsystem} as drc
left outer join boeing_tables.dbcinfotbl{qgsystem} as bdi
on bdi.infokey = 'sysname'
where drc.thedate >= (current_date - {week})
order by logtime asc
Group by sysname,logtime
;
"""
### functions
@multitasking.task
def fetch(system,qgsystem,week):
    with teradatasql.connect (host=ahost, user=ausername, password=apassword) as con:
        with con.cursor () as cur:
            cur.execute (query.format(system=system, qgsystem=qgsystem, week=week))
            [ print (row) for row in cur.fetchall () ]

@multitasking.task
def plot(system,week):
    for week in weeks:
        for system in systems:
            df = pd.read_csv(system + "_" + week + "_output.csv")
            df.pop('sysname')
            df.plot.area(x="Logtime")
            figure = plt.gcf()
            figure.set_size_inches(12, 6)
            plt.savefig( "/app/static/" + system + "_" + week + "_webchart.png", bbox_inches='tight', dpi=100)

### main
for week in weeks:
    for system, qgsystem in zip(systems, qgsystems):
        fetch(system,qgsystem,week)

for week in weeks:
    for system in systems:
        plot(system,week)

app = Flask(__name__,template_folder='templates')

@app.route('/')
def index():
    return render_template("index.html")

Multithread or multiprocess

So, currently, I am using multiprocessing to run these 3 functions together.
Since only the tokens value changes, is it recommended to switch to multi-threading? (If yes, will it really help performance, e.g. a speed-up? I assume memory usage would certainly be lower.)
This is my code:
from database_function import *
from kiteconnect import KiteTicker
import pandas as pd
from datetime import datetime, timedelta
import schedule
import time
from multiprocessing import Process
def tick_A():
    #credentials code here
    tokens = [x[0] for x in db_fetchquery("SELECT zerodha FROM script ORDER BY id ASC LIMIT 50")] #FETCHING FIRST 50 SCRIPTS TOKEN
    #print(tokens)

    ##### TO MAKE SURE THE TASK STARTS AFTER 8:59 ONLY ###########
    t = datetime.today()
    future = datetime(t.year,t.month,t.day,8,59)
    if ((future-t).total_seconds()) < 0:
        future = datetime(t.year,t.month,t.day,t.hour,t.minute,(t.second+2))
    time.sleep((future-t).total_seconds())
    ##### TO MAKE SURE THE TASK STARTS AFTER 8:59 ONLY ###########

    def on_ticks(ws, ticks):
        global ltp
        ltp = ticks[0]["last_price"]
        for tick in ticks:
            print(f"{tick['instrument_token']}A")
            db_runquery(f'UPDATE SCRIPT SET ltp = {tick["last_price"]} WHERE zerodha = {tick["instrument_token"]}') #UPDATING LTP IN DATABASE
            #print(f"{tick['last_price']}")

    def on_connect(ws, response):
        #print(f"response from connect :: {response}")
        # Subscribe to a list of instrument_tokens (TOKENS FETCHED ABOVE WILL BE SUBSCRIBED HERE).
        # logging.debug("on connect: {}".format(response))
        ws.subscribe(tokens)
        ws.set_mode(ws.MODE_LTP,tokens) # SETTING TOKEN TO TICK MODE (LTP / FULL / QUOTE)

    kws.on_ticks = on_ticks
    kws.on_connect = on_connect
    kws.connect(threaded=True)

    #####TO STOP THE TASK AFTER 15:32 #######
    end_time = datetime(t.year,t.month,t.day,15,32)
    while True:
        schedule.run_pending()
        #time.sleep(1)
        if datetime.now() > end_time:
            break
    #####TO STOP THE TASK AFTER 15:32 #######

def tick_B():
    # everything remains the same, only the tokens value changes
    tokens = [x[0] for x in db_fetchquery("SELECT zerodha FROM script ORDER BY id ASC OFFSET (50) ROWS FETCH NEXT (50) ROWS ONLY")]

def tick_C():
    # everything remains the same, only the tokens value changes
    tokens = [x[0] for x in db_fetchquery("SELECT zerodha FROM script ORDER BY id ASC OFFSET (100) ROWS FETCH NEXT (50) ROWS ONLY")]

if __name__ == '__main__':
    def runInParallel(*fns):
        proc = []
        for fn in fns:
            p = Process(target=fn)
            p.start()
            proc.append(p)
        for p in proc:
            p.join()

    runInParallel(tick_A , tick_B , tick_C)
Most Python implementations do not have true multi-threading, because they use a global interpreter lock (GIL), so only one thread executes Python code at a time.
For I/O-heavy applications it should not make a difference. But if you need CPU-heavy operations done in parallel (and since you use pandas, the answer is probably yes), you will be better off staying with a multi-process app.
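To make that trade-off concrete, here is a small self-contained sketch (separate from the trading code) that runs the same work with threads and with processes. The CPU-bound loop only benefits from processes because of the GIL, while the sleep-based (I/O-like) work overlaps fine in threads:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # pure-Python arithmetic: threads cannot run this in parallel because of the GIL
    return sum(i * i for i in range(n))

def io_bound(seconds):
    # sleeping releases the GIL, so threads overlap this kind of waiting just fine
    time.sleep(seconds)
    return seconds

def timed(executor_cls, fn, args):
    start = time.perf_counter()
    with executor_cls(max_workers=3) as ex:
        list(ex.map(fn, args))
    return time.perf_counter() - start

if __name__ == '__main__':
    print("CPU-bound, threads:   %.2fs" % timed(ThreadPoolExecutor, cpu_bound, [3_000_000] * 3))
    print("CPU-bound, processes: %.2fs" % timed(ProcessPoolExecutor, cpu_bound, [3_000_000] * 3))
    print("I/O-bound, threads:   %.2fs" % timed(ThreadPoolExecutor, io_bound, [1] * 3))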

SQLAlchemy performance difference between engine and connection execution

I'm experiencing strange behaviour with SQLAlchemy from a performance point of view.
I've created this simple test script:
from sqlalchemy import *
from time import clock, time
from datetime import datetime
import os
s = "Start: "+str(datetime.now())+"\n"
engine = create_engine("sqlite:///C:/mydb.db?check_same_thread=False")
print engine
ids = ('ecab910b-261b-11e9-9479-5800e3d291a6',
'ecab90f0-261b-11e9-8801-5800e3d291a6',
'ecab90f1-261b-11e9-bcf7-5800e3d291a6',
'ecab90f2-261b-11e9-a7c1-5800e3d291a6',
'ecab90f3-261b-11e9-b8d0-5800e3d291a6',
'ecab90f4-261b-11e9-8a61-5800e3d291a6',
'ecab90f5-261b-11e9-a11a-5800e3d291a6',
'ecab90f6-261b-11e9-9a55-5800e3d291a6',
'ecab90f7-261b-11e9-b16e-5800e3d291a6',
'ecab90f8-261b-11e9-82a6-5800e3d291a6',
'ecab90f9-261b-11e9-ad9d-5800e3d291a6',
'ecab90fa-261b-11e9-ae23-5800e3d291a6',
'ecab90fb-261b-11e9-93a5-5800e3d291a6',
'ecab90fc-261b-11e9-b963-5800e3d291a6',
'ecab90fd-261b-11e9-b875-5800e3d291a6',
'ecab90fe-261b-11e9-ae90-5800e3d291a6',
'ecab90ff-261b-11e9-8e07-5800e3d291a6',
'ecab9101-261b-11e9-8808-5800e3d291a6',
'ecab9102-261b-11e9-82e5-5800e3d291a6',
'ecab9103-261b-11e9-80e7-5800e3d291a6',
'ecab9108-261b-11e9-ad8c-5800e3d291a6',
'ecab9109-261b-11e9-80d0-5800e3d291a6',
'ecab910a-261b-11e9-9c5d-5800e3d291a6',
'ecab910d-261b-11e9-86f7-5800e3d291a6',
'ecab910e-261b-11e9-9322-5800e3d291a6',
'ecab910f-261b-11e9-b454-5800e3d291a6',
'ecab9114-261b-11e9-b1ec-5800e3d291a6',
'ecab911b-261b-11e9-b40c-5800e3d291a6')
metadata = MetaData(engine)
tab = Table("mytab", metadata, autoload=True)
q = update(tab, tab.c.id.in_(ids), {"status":5})
conn = engine.connect()
el = clock()
conn.execute(q)
s += "Elapsed (from connection) = %f\n" %(clock()-el)
el = clock()
q.execute()
s += "Elapsed (from engine) = %f\n" %(clock()-el)
print s
In the script I'm simply comparing the execution time when the query is executed from the engine and from the connection. Here is the output:
Engine(sqlite:///C:/mydb.db?check_same_thread=False)
Start: 2019-02-01 13:41:28.771000
Elapsed (from connection) = 0.007447
Elapsed (from engine) = 0.576755
Can anyone help me understand why there is such a huge difference in execution time, and what the right approach is for using SQLAlchemy with non-ORM queries?
Thanks in advance.

python cassandra driver same insert performance as copy

I'm trying to use Python async with Cassandra to see if I can write records to Cassandra faster than the CQL COPY command.
My python code looks like this:
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
cluster = Cluster(['1.2.1.4'])
session = cluster.connect('test')
with open('dataImport.txt') as f:
for line in f:
query = SimpleStatement (
"INSERT INTO tstTable (id, accts, info) VALUES (%s) " %(line),
consistency_level=ConsistencyLevel.ONE)
session.execute_async (query)
but it's giving me the same performance as the COPY command, around 2,700 rows/sec. Should it be faster with async?
Do I need to use multithreading in Python? I've just been reading about it but am not sure how it fits into this.
EDIT:
So I found something online that I'm trying to modify, but I can't quite get it to work. I have this so far; I also split the file into 3 files in the /Data/toImport/ dir:
import multiprocessing
import time
import os
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
cluster = Cluster(['1.2.1.4'])
session = cluster.connect('test')
def mp_worker(inputArg):
    with open(inputArg[0]) as f:
        for line in f:
            query = SimpleStatement (
                "INSERT INTO CustInfo (cust_id, accts, offers) values (%s)" %(line),
                consistency_level=ConsistencyLevel.ONE)
            session.execute_async (query)

def mp_handler(inputData, nThreads = 8):
    p = multiprocessing.Pool(nThreads)
    p.map(mp_worker, inputData, chunksize=1)
    p.close()
    p.join()

if __name__ == '__main__':
    temp_in_data = file_list
    start = time.time()
    in_dir = '/Data/toImport/'
    N_Proc = 8
    file_data = [(in_dir) for i in temp_in_data]
    print '----------------------------------Start Working!!!!-----------------------------'
    print 'Number of Processes using: %d' %N_Proc
    mp_handler(file_data, N_Proc)
    end = time.time()
    time_elapsed = end - start
    print '----------------------------------All Done!!!!-----------------------------'
    print "Time elapsed: {} seconds".format(time_elapsed)
but get this error:
Traceback (most recent call last):
File "multiCass.py", line 27, in <module>
temp_in_data = file_list
NameError: name 'file_list' is not defined
This post, A Multiprocessing Example for Improved Bulk Data Throughput, provides all the details needed to improve the performance of bulk data ingestion. Basically there are 3 mechanisms, and additional tuning can be done based on your use case and hardware:
single process (that's the case in your example)
multi-processing single queries
multi-processing concurrent queries
Size of batches and concurrency are the variables you'll have to play with yourself.
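As one way to experiment with the concurrency knob inside a single process, here is a hedged sketch that keeps a bounded number of execute_async requests in flight using a semaphore and the driver's response-future callbacks. It assumes the same three-column CustInfo table and dataImport.txt file as above, with comma-separated lines:

import threading
from cassandra.cluster import Cluster

cluster = Cluster(['1.2.1.4'])
session = cluster.connect('test')

# a prepared statement is cheaper per request than building a SimpleStatement per line
insert = session.prepare("INSERT INTO CustInfo (cust_id, accts, offers) VALUES (?, ?, ?)")

MAX_IN_FLIGHT = 200                      # the "concurrency" knob to tune
slots = threading.Semaphore(MAX_IN_FLIGHT)

def release(_):
    slots.release()

with open('dataImport.txt') as f:
    for line in f:
        params = line.rstrip('\n').split(',')    # assumes exactly three comma-separated fields
        slots.acquire()                           # block while too many requests are in flight
        future = session.execute_async(insert, params)
        future.add_callbacks(callback=release, errback=release)

# crude drain: wait until every outstanding request has released its slot
for _ in range(MAX_IN_FLIGHT):
    slots.acquire()

The same loop can then be run once per worker process to combine it with the multiprocessing version below.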
Got it working like this:
import multiprocessing
import time
import os
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
def mp_worker(inputArg):
    cluster = Cluster(['1.2.1.4'])
    session = cluster.connect('poc')
    with open(inputArg[0]) as f:
        for line in f:
            query = SimpleStatement (
                "INSERT INTO testTable (cust_id, accts, offers) values (%s)" %(line),
                consistency_level=ConsistencyLevel.ONE)
            session.execute_async (query)

def mp_handler(inputData, nThreads = 8):
    p = multiprocessing.Pool(nThreads)
    p.map(mp_worker, inputData, chunksize=1)
    p.close()
    p.join()

if __name__ == '__main__':
    temp_in_data = ['/toImport/part-00000', '/toImport/part-00001', '/toImport/part-00002']
    start = time.time()
    N_Proc = 3
    file_data = [(i,) for i in temp_in_data]
    print '----------------------------------Start Working!!!!-----------------------------'
    print 'Number of Processes using: %d' %N_Proc
    mp_handler(file_data, N_Proc)
    end = time.time()
    time_elapsed = end - start
    print '----------------------------------All Done!!!!-----------------------------'
    print "Time elapsed: {} seconds".format(time_elapsed)

in python: child processes going defunct while others are not, unsure why

Edit: the answer was that the OS was killing processes because I was consuming all the memory.
I am spawning enough subprocesses to keep the load average 1:1 with cores. However, at some point within the hour (this script could run for days), 3 of the processes go:
tipu 14804 0.0 0.0 328776 428 pts/1 Sl 00:20 0:00 python run.py
tipu 14808 64.4 24.1 2163796 1848156 pts/1 Rl 00:20 44:41 python run.py
tipu 14809 8.2 0.0 0 0 pts/1 Z 00:20 5:43 [python] <defunct>
tipu 14810 60.3 24.3 2180308 1864664 pts/1 Rl 00:20 41:49 python run.py
tipu 14811 20.2 0.0 0 0 pts/1 Z 00:20 14:04 [python] <defunct>
tipu 14812 22.0 0.0 0 0 pts/1 Z 00:20 15:18 [python] <defunct>
tipu 15358 0.0 0.0 103292 872 pts/1 S+ 01:30 0:00 grep python
I have no idea why this is happening. Attached are the master and slave; I can attach the mysql/pg wrappers as well if needed. Any suggestions?
slave.py:
from boto.s3.key import Key
import multiprocessing
import gzip
import os
from mysql_wrapper import MySQLWrap
from pgsql_wrapper import PGSQLWrap
import boto
import re

class Slave:
    CHUNKS = 250000
    BUCKET_NAME = "bucket"
    AWS_ACCESS_KEY = ""
    AWS_ACCESS_SECRET = ""
    KEY = Key(boto.connect_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET).get_bucket(BUCKET_NAME))
    S3_ROOT = "redshift_data_imports"
    COLUMN_CACHE = {}
    DEFAULT_COLUMN_VALUES = {}

    def __init__(self, job_queue):
        self.log_handler = open("logs/%s" % str(multiprocessing.current_process().name), "a");
        self.mysql = MySQLWrap(self.log_handler)
        self.pg = PGSQLWrap(self.log_handler)
        self.job_queue = job_queue

    def do_work(self):
        self.log(str(os.getpid()))
        while True:
            #sample job in the abstract: mysql_db.table_with_date-iteration
            job = self.job_queue.get()

            #queue is empty
            if job is None:
                self.log_handler.close()
                self.pg.close()
                self.mysql.close()
                print("good bye and good day from %d" % (os.getpid()))
                self.job_queue.task_done()
                break

            #curtail iteration
            table = job.split('-')[0]

            #strip redshift table from job name
            redshift_table = re.sub(r"(_[1-9].*)", "", table.split(".")[1])

            iteration = int(job.split("-")[1])
            offset = (iteration - 1) * self.CHUNKS

            #columns redshift is expecting
            #bad tables will slip through and error out, so we catch it
            try:
                colnames = self.COLUMN_CACHE[redshift_table]
            except KeyError:
                self.job_queue.task_done()
                continue

            #mysql fields to use in SELECT statement
            fields = self.get_fields(table)

            #list subtraction determining which columns redshift has that mysql does not
            delta = (list(set(colnames) - set(fields.keys())))

            #subtract columns that have a default value and so do not need padding
            if delta:
                delta = list(set(delta) - set(self.DEFAULT_COLUMN_VALUES[redshift_table]))

            #concatinate columns with padded \N
            select_fields = ",".join(fields.values()) + (",\\N" * len(delta))
            query = "SELECT %s FROM %s LIMIT %d, %d" % (select_fields, table,
                    offset, self.CHUNKS)
            rows = self.mysql.execute(query)
            self.log("%s: %s\n" % (table, len(rows)))

            if not rows:
                self.job_queue.task_done()
                continue

            #if there is more data potentially, add it to the queue
            if len(rows) == self.CHUNKS:
                self.log("putting %s-%s" % (table, (iteration+1)))
                self.job_queue.put("%s-%s" % (table, (iteration+1)))

            #various characters need escaping
            clean_rows = []
            redshift_escape_chars = set( ["\\", "|", "\t", "\r", "\n"] )
            in_chars = ""

            for row in rows:
                new_row = []
                for value in row:
                    if value is not None:
                        in_chars = str(value)
                    else:
                        in_chars = ""

                    #escape any naughty characters
                    new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
                new_row = "\t".join(new_row)
                clean_rows.append(new_row)

            rows = ",".join(fields.keys() + delta)
            rows += "\n" + "\n".join(clean_rows)
            offset = offset + self.CHUNKS

            filename = "%s-%s.gz" % (table, iteration)
            self.move_file_to_s3(filename, rows)

            self.begin_data_import(job, redshift_table, ",".join(fields.keys() +
                    delta))

            self.job_queue.task_done()

    def move_file_to_s3(self, uri, contents):
        tmp_file = "/dev/shm/%s" % str(os.getpid())
        self.KEY.key = "%s/%s" % (self.S3_ROOT, uri)
        self.log("key is %s" % self.KEY.key )

        f = gzip.open(tmp_file, "wb")
        f.write(contents)
        f.close()

        #local saving allows for debugging when copy commands fail
        #text_file = open("tsv/%s" % uri, "w")
        #text_file.write(contents)
        #text_file.close()

        self.KEY.set_contents_from_filename(tmp_file, replace=True)

    def get_fields(self, table):
        """
        Returns a dict used as:
            {"column_name": "altered_column_name"}
        Currently only the debug column gets altered
        """
        exclude_fields = ["_qproc_id", "_mob_id", "_gw_id", "_batch_id", "Field"]
        query = "show columns from %s" % (table)
        fields = self.mysql.execute(query)

        #key raw field, value mysql formatted field
        new_fields = {}

        #for field in fields:
        for field in [val[0] for val in fields]:
            if field in exclude_fields:
                continue
            old_field = field

            if "debug_mode" == field.strip():
                field = "IFNULL(debug_mode, 0)"

            new_fields[old_field] = field

        return new_fields

    def log(self, text):
        self.log_handler.write("\n%s" % text)

    def begin_data_import(self, table, redshift_table, fields):
        query = "copy %s (%s) from 's3://bucket/redshift_data_imports/%s' \
            credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '\\t' \
            gzip NULL AS '' COMPUPDATE ON ESCAPE IGNOREHEADER 1;" \
            % (redshift_table, fields, table, self.AWS_ACCESS_KEY, self.AWS_ACCESS_SECRET)
        self.pg.execute(query)
master.py:
from slave import Slave as Slave
import multiprocessing
from mysql_wrapper import MySQLWrap as MySQLWrap
from pgsql_wrapper import PGSQLWrap as PGSQLWrap

class Master:
    SLAVE_COUNT = 5

    def __init__(self):
        self.mysql = MySQLWrap()
        self.pg = PGSQLWrap()

    def do_work(table):
        pass

    def get_table_listings(self):
        """Gathers a list of MySQL log tables needed to be imported"""
        query = 'show databases'
        result = self.mysql.execute(query)

        #turns list[tuple] into a flat list
        databases = list(sum(result, ()))

        #overriding during development
        databases = ['db1', 'db2', 'db3']

        exclude = ('mysql', 'Database', 'information_schema')
        scannable_tables = []

        for database in databases:
            if database in exclude:
                continue

            query = "show tables from %s" % database
            result = self.mysql.execute(query)

            #turns list[tuple] into a flat list
            tables = list(sum(result, ()))

            for table in tables:
                exclude = ("Tables_in_%s" % database, "(", "201303", "detailed", "ltv")

                #exclude any of the unfavorables
                if any(s in table for s in exclude):
                    continue

                scannable_tables.append("%s.%s-1" % (database, table))

        return scannable_tables

    def init(self):
        #fetch redshift columns once and cache
        #get columns from redshift so we can pad the mysql column delta with nulls
        tables = ('table1', 'table2', 'table3')
        for table in tables:
            #cache columns
            query = "SELECT column_name FROM information_schema.columns WHERE \
                table_name = '%s'" % (table)
            result = self.pg.execute(query, async=False, ret=True)
            Slave.COLUMN_CACHE[table] = list(sum(result, ()))

            #cache default values
            query = "SELECT column_name FROM information_schema.columns WHERE \
                table_name = '%s' and column_default is not \
                null" % (table)
            result = self.pg.execute(query, async=False, ret=True)

            #turns list[tuple] into a flat list
            result = list(sum(result, ()))

            Slave.DEFAULT_COLUMN_VALUES[table] = result

    def run(self):
        self.init()

        job_queue = multiprocessing.JoinableQueue()
        tables = self.get_table_listings()
        for table in tables:
            job_queue.put(table)

        processes = []
        for i in range(Master.SLAVE_COUNT):
            process = multiprocessing.Process(target=slave_runner, args=(job_queue,))
            process.daemon = True
            process.start()
            processes.append(process)

        #blocks this process until queue reaches 0
        job_queue.join()

        #signal each child process to GTFO
        for i in range(Master.SLAVE_COUNT):
            job_queue.put(None)

        #blocks this process until queue reaches 0
        job_queue.join()

        job_queue.close()

        #do not end this process until child processes close out
        for process in processes:
            process.join()

        #toodles !
        print("this is master saying goodbye")

def slave_runner(queue):
    slave = Slave(queue)
    slave.do_work()
There's not enough information to be sure, but the problem is very likely to be that Slave.do_work is raising an unhandled exception. (There are many lines of your code that could do that in various different conditions.)
When you do that, the child process will just exit.
On POSIX systems… well, the full details are a bit complicated, but in the simple case (what you have here), a child process that exits will stick around as a <defunct> process until it gets reaped (because the parent either waits on it, or exits). Since your parent code doesn't wait on the children until the queue is finished, that's exactly what happens.
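Here is a tiny standalone demonstration of that reaping behaviour (Linux only, since it shells out to ps; crash() is just a stand-in for an unhandled exception in a worker):

import multiprocessing
import os
import subprocess
import time

def crash():
    # stand-in for an unhandled exception somewhere in the child's work loop
    raise RuntimeError("boom")

if __name__ == '__main__':
    p = multiprocessing.Process(target=crash)
    p.start()
    time.sleep(1)
    # the child has exited but has not been reaped, so it shows up with state Z / <defunct>
    subprocess.call(["ps", "-o", "pid,stat,cmd", "--ppid", str(os.getpid())])
    p.join()  # waiting on the child reaps it and the zombie entry disappears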
So, there's a simple duct-tape fix:
def do_work(self):
    self.log(str(os.getpid()))
    while True:
        try:
            # the rest of your code
        except Exception as e:
            self.log("something appropriate {}".format(e))
            # you may also want to post a reply back to the parent
You might also want to break the massive try up into different ones, so you can distinguish between all the different stages where things could go wrong (especially if some of them mean you need a reply, and some mean you don't).
However, it looks like what you're attempting to do is duplicate exactly the behavior of multiprocessing.Pool, but you've fallen short in a couple of places. Which raises the question: why not just use Pool in the first place? You could then simplify/optimize things even further by using one of the map-family methods. For example, your entire Master.run could be reduced to:
self.init()
pool = multiprocessing.Pool(Master.SLAVE_COUNT, initializer=slave_setup)
pool.map(slave_job, tables)
pool.close()
pool.join()
And this will handle exceptions for you, and allow you to return values/exceptions if you later need that, and let you use the built-in logging library instead of trying to build your own, and so on. And it should only take about a dozen lines of minor code changes to Slave, and then you're done.
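A minimal sketch of that shape, where slave_setup and slave_job are the hypothetical names used above and the per-process state is faked with plain objects rather than the real MySQL/Postgres/S3 wrappers:

import multiprocessing

# stand-ins for per-process state (in the real code: boto Key, MySQLWrap, PGSQLWrap, log handle)
worker_state = {}

def slave_setup():
    # runs once in each worker process, playing the role of Slave.__init__
    worker_state['mysql'] = object()   # e.g. MySQLWrap(...)
    worker_state['pg'] = object()      # e.g. PGSQLWrap(...)

def slave_job(table):
    # one unit of work; an exception raised here propagates back to pool.map in the parent
    return "processed %s using %s" % (table, sorted(worker_state))

if __name__ == '__main__':
    tables = ["db1.logs-1", "db2.logs-1", "db3.logs-1"]
    pool = multiprocessing.Pool(5, initializer=slave_setup)
    try:
        results = pool.map(slave_job, tables)
    finally:
        pool.close()
        pool.join()
    print(results)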
If you want to submit new jobs from within jobs, the easiest way to do this is probably with a Future-based API (which turns things around, making the future result the focus and the pool/executor the dumb thing that provides them, instead of making the pool the focus and the result the dumb thing it gives back), but there are multiple ways to do it with Pool as well. For example, right now, you're not returning anything from each job, so, you can just return a list of tables to execute. Here's a simple example that shows how to do it:
import multiprocessing

def foo(x):
    print(x, x**2)
    return list(range(x))

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [5]
    while jobs:
        jobs, oldjobs = [], jobs
        for job in oldjobs:
            jobs.extend(pool.apply(foo, [job]))
    pool.close()
    pool.join()
Obviously you can condense this a bit by replacing the whole loop with, e.g., a list comprehension fed into itertools.chain, and you can make it a lot cleaner-looking by passing "a submitter" object to each job and adding to that instead of returning a list of new jobs, and so on. But I wanted to make it as explicit as possible to show how little there is to it.
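For what it's worth, a sketch of that condensation (same toy foo as above, with the inner loop replaced by a comprehension fed into itertools.chain):

import multiprocessing
from itertools import chain

def foo(x):
    print(x, x**2)
    return list(range(x))

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [5]
    # condensed form of the job-expansion loop above
    while jobs:
        jobs = list(chain.from_iterable(pool.apply(foo, [job]) for job in jobs))
    pool.close()
    pool.join()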
At any rate, if you think the explicit queue is easier to understand and manage, go for it. Just look at the source for multiprocessing.worker and/or concurrent.futures.ProcessPoolExecutor to see what you need to do yourself. It's not that hard, but there are enough things you could get wrong (personally, I always forget at least one edge case when I try to do something like this myself) that it's worth looking at code that gets it right.
Alternatively, it seems like the only reason you can't use concurrent.futures.ProcessPoolExecutor here is that you need to initialize some per-process state (the boto.s3.key.Key, MySqlWrap, etc.), for what are probably very good caching reasons. (If this involves a web-service query, a database connect, etc., you certainly don't want to do that once per task!) But there are a few different ways around that.
For example, you can subclass ProcessPoolExecutor and override the undocumented method _adjust_process_count (see the source for how simple it is) to pass your setup function, and… that's all you have to do.
Or you can mix and match. Wrap the Future from concurrent.futures around the AsyncResult from multiprocessing.
