import sqlite3 as sql

v = (161.5, 164.5, 157.975, 158.5375, 159.3125, 160.325, 74052, 8)

try:
    connection = sql.connect("data.db")
    sql_update_query = """UPDATE RECORDS SET OPEN = ?, HIGH = ?, LOW = ?, CLOSE = ?, LAST = ?, PREVCLOSE = ?, TOTTRDQTY = ? WHERE ROWID = ?"""
    cursor = connection.cursor()
    cursor.execute(sql_update_query, v)
    connection.commit()
    print("Total", cursor.rowcount, "Records updated successfully")
    connection.close()
except Exception as e:
    print(e)
Here is the code that I am using to update the data on my table named "RECORDS".
I tried to check whether my SQL statement was wrong by running it in DB Browser:
UPDATE RECORDS SET OPEN = 161.5,HIGH = 164.5,LOW = 157.975,CLOSE = 158.5375,LAST = 159.3125,PREVCLOSE = 160.325,TOTTRDQTY = 74052 WHERE ROWID = 8
Output was:
Execution finished without errors.
Result: query executed successfully. Took 2ms, 1 rows affected
At line 1:
UPDATE RECORDS SET OPEN = 161.5,HIGH = 164.5,LOW = 157.975,CLOSE = 158.5375,LAST = 159.3125,PREVCLOSE = 160.325,TOTTRDQTY = 74052 WHERE ROWID = 8
But when I run my code in Python, it just doesn't update.
I get:
Total 0 Records updated successfully
My Python code runs, but nothing changes in the database. Please help.
Edit: 29-04-2022:
Since my code is fine, maybe the way my database is created is causing this issue.
So I am adding the code that I use to create the DB file.
import os

import pandas as pd
import sqlite3 as sql

connection = sql.connect("data.db")

d = os.listdir("Bhavcopy/")
for f in d:
    fn = "Bhavcopy/" + f
    df = pd.read_excel(fn)
    df["TIMESTAMP"] = pd.to_datetime(df.TIMESTAMP)
    df["TIMESTAMP"] = df['TIMESTAMP'].dt.strftime("%d-%m-%Y")
    df.rename(columns={"TIMESTAMP": "DATE"}, inplace=True)
    df.set_index("DATE", drop=True, inplace=True)
    df['CHANGE'] = df.CLOSE - df.PREVCLOSE
    df['PERCENT'] = round((df.CHANGE / df.PREVCLOSE) * 100, 2)
    df.to_sql('RECORDS', con=connection, if_exists='append')

connection.close()
Sample of data that is being added to the database:
SYMBOL SERIES OPEN ... TIMESTAMP TOTALTRADES ISIN
0 20MICRONS EQ 58.95 ... 01-JAN-2018 1527 INE144J01027
1 3IINFOTECH EQ 8.40 ... 01-JAN-2018 7133 INE748C01020
2 3MINDIA EQ 18901.00 ... 01-JAN-2018 728 INE470A01017
3 5PAISA EQ 383.00 ... 01-JAN-2018 975 INE618L01018
4 63MOONS EQ 119.55 ... 01-JAN-2018 6628 INE111B01023
[5 rows x 13 columns]
SYMBOL SERIES OPEN ... TIMESTAMP TOTALTRADES ISIN
1412 ZODJRDMKJ EQ 43.50 ... 01-JAN-2018 10 INE077B01018
1413 ZUARI EQ 555.00 ... 01-JAN-2018 2097 INE840M01016
1414 ZUARIGLOB EQ 254.15 ... 01-JAN-2018 1670 INE217A01012
1415 ZYDUSWELL EQ 1051.00 ... 01-JAN-2018 688 INE768C01010
1416 ZYLOG EQ 4.80 ... 01-JAN-2018 635 INE225I01026
[5 rows x 13 columns]
Shape of the Excel files:
(1417, 13)
Also someone asked how I am creating the table:
import sqlite3 as sql
connection = sql.connect("data.db")
cursor = connection.cursor()
#create our table:
command1 = """
CREATE TABLE IF NOT EXISTS
RECORDS(
DATE TEXT NOT NULL,
SYMBOL TEXT NOT NULL,
SERIES TEXT NOT NULL,
OPEN REAL,
HIGH REAL,
LOW REAL,
CLOSE REAL,
LAST REAL,
PREVCLOSE REAL,
TOTTRDQTY INT,
TOTTRDVAL REAL,
TOTALTRADES INT,
ISIN TEXT,
CHANGE REAL,
PERCENT REAL
)
"""
cursor.execute(command1)
connection.commit()
connection.close()
I created your table with only the numeric fields that needed to be updated, and ran your code - it worked. So in the end it had to be a datatype mismatch; I'm glad you found it :)
Your code works fine on both Windows and Linux; the only reason to see that kind of behavior is that you are modifying two files with the same name in different locations. Check which file is being referenced in your DB Browser.
And when in doubt, prefer absolute paths, as in your comment above:
connection = sql.connect("C:/Users/Abinash/Desktop/data.db")
So I found out why the code, even though correct, was not working. Thanks to @gimix.
I was creating the variable v:
v = (161.5, 164.5, 157.975, 158.5375, 159.3125, 160.325, 74052, 8)
by reading it from a dataframe. When everyone said that my code was correct, and gimix asked how I created the table, I realized that it could be a datatype mismatch. On checking, I found that one of the values was a string.
So this change:
i = 0
o = float(adjdf['OPEN'].iloc[i])
h = float(adjdf['HIGH'].iloc[i])
l = float(adjdf['LOW'].iloc[i])
c = float(adjdf['CLOSE'].iloc[i])
last = float(adjdf['LAST'].iloc[i])
pc = float(adjdf['PREVCLOSE'].iloc[i])
tq = int(adjdf['TOTTRDQTY'].iloc[i])
did = int(adjdf['ID'].iloc[i])
v = (o,h,l,c,last,pc,tq,did)
This fixed the issue. Thank you very much for the help everyone.
I finally got:
Total 1 Records updated successfully
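For anyone hitting the same thing: a quick way to spot this kind of mismatch before building the tuple is to look at the dataframe's dtypes. A minimal sketch, assuming the same adjdf as above:

import pandas as pd

# columns read from Excel/CSV sometimes come back as 'object' (i.e. strings)
print(adjdf.dtypes)

# optionally force the numeric columns once, instead of casting value by value
numeric_cols = ["OPEN", "HIGH", "LOW", "CLOSE", "LAST", "PREVCLOSE", "TOTTRDQTY"]
adjdf[numeric_cols] = adjdf[numeric_cols].apply(pd.to_numeric, errors="raise")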
Can anyone please advise how I may retrieve FX forward NDF outright bid/outright ask prices, or indeed any price data, for USD/KRW for the 1W, 1M, 3M, etc. tenors?
I have attempted to follow the DAPI instructions, as well as attempting to find answers via Stack Overflow, to no avail. I can, however, successfully retrieve live bids/asks for spot USD/KRW, or even equities such as AAPL, with no problem.
I have tried using different combinations of tickers; although I see no error codes, no actual live prices come back. Does anyone have any ideas on how to get live ticking NDF outright prices for tickers such as:
['USD/KRW N 2M Curncy'], ['USD/KRW N 3M Curncy'], ['USD/KRW N 3M ICAP Curncy']
Any and all help is greatly appreciated :) as Bloomberg seemingly don't provide any assistance.
P.S. The Excel Bloomberg formula =BFxForward("USDKRW","3M","BidOutright") is essentially what I'm trying to replicate via Python; attempting to follow the DAPI instructions seems not to work.
I have used the C++ BLPAPI PDF examples to attempt to get this working; however, no NDF examples seem to exist.
def main_subscribe():
    tickers = ['USD/KRW N 2M Curncy', 'USD/KRW N 6M Curncy', 'USD/KRW N 9M Curncy']
    fields = ['BID', 'LAST_BID_TIME_TODAY_REALTIME', 'ASK', 'MID']
    interval = 2
    options = parseCmdLine()

    # Fill SessionOptions
    sessionOptions = blpapi.SessionOptions()
    sessionOptions.setServerHost(options.host)
    sessionOptions.setServerPort(options.port)
    print("Connecting to %s:%s" % (options.host, options.port))

    # Create a Session
    session = blpapi.Session(sessionOptions)

    # Start a Session
    if not session.start():
        print("Failed to start session.")
        return

    try:
        # Open service to get subscription data from
        if not session.openService('//blp/mktdata'):
            print("Failed to open '//blp/mktdata'")
            return

        # init subscriptions
        subs = blpapi.SubscriptionList()
        flds = ','.join(fields)
        istr = interval and 'interval=%.1f' % interval or ''
        for ticker in tickers:
            subs.add(ticker, flds, istr, blpapi.CorrelationId(ticker))
        session.subscribe(subs)

        # Process received events
        while True:
            # We provide timeout to give the chance for Ctrl+C handling:
            ev = session.nextEvent(900)
            for msg in ev:
                print(msg)
            # if ev.eventType() == blpapi.Event.SUBSCRIPTION_DATA:
            #     try:
            #         for msg in ev:
            #             #print(msg)
            #             print(f"{fields[0]}:{msg.getElementAsString(fields[0])} , {fields[3]}:{msg.getElementAsString(fields[3])} , {fields[2]}:{msg.getElementAsString(fields[2])} , {fields[1]}:{msg.getElementAsString(fields[1])}")
            #     except Exception as e:
            #         print(e)
            #         #print(msg)
            #         None
    finally:
        # Stop the session
        session.stop()
This is the output when main_subscribe is run:
CID: {[ valueType=POINTER classId=0 value=0000024DBF510CB0 ]}
RequestId: -----------------------------
MarketDataEvents = {
MKTDATA_EVENT_TYPE = SUMMARY
MKTDATA_EVENT_SUBTYPE = INITPAINT
API_RULES_VERSION = 201411210
SIMP_LAST_PX_ALL_SESS_DIR_RT = 1
SMART_FIELDS_METADATA_VERSION_RT = "21.10.08.02 "
IS_DELAYED_STREAM = false
MID = 1.000000
RT_API_MACHINE = "apipubx0#----------"
RT_YLD_CHG_NET_1D = 0.000000
IND_BID_FLAG = false
IND_ASK_FLAG = false
BASE_PRICE_ENABLED_RT = false
EVT_DELTA_TIMES_RT = 0
ALL_PRICE_COND_CODE = ""}
This is the KRW <Curncy> FRD <Go> screen in the Bloomberg Terminal:
If you hover the mouse over the 3M outright Bid (in the circle), the pop-up shows the underlying ticker to be KWN+3M BGN Curncy.
When I put this ticker in Excel as =BDP("KWN+3M BGN Curncy","BID","UpdateFrequency",500), I get updating bid-side pricing which matches the Terminal screen.
Since the underlying DAPI for Excel and Python is the same, I would guess that this ticker will work with the blpapi too. I usually find it is quicker to test tickers and fields in Excel.
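If that guess is right, the only change needed in main_subscribe above is the ticker list. A minimal sketch of the subscription setup, assuming the same blpapi session handling as in the original code; the 1M/6M tickers are extrapolated from the 3M one shown on the FRD <GO> screen:

import blpapi

def subscribe_krw_ndf(session):
    # tickers taken from the FRD <GO> pop-up; adjust tenors as needed
    tickers = ['KWN+1M BGN Curncy', 'KWN+3M BGN Curncy', 'KWN+6M BGN Curncy']
    fields = 'BID,ASK,MID'
    subs = blpapi.SubscriptionList()
    for ticker in tickers:
        # one correlation id per ticker so incoming messages can be told apart
        subs.add(ticker, fields, 'interval=2.0', blpapi.CorrelationId(ticker))
    session.subscribe(subs)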
Every 4 seconds, I have to store 32,000 rows of data. Each of these rows consists of one timestamp value and 464 double precision values. The column name for the timestamp is time, and the column names for the double precision values increase sequentially as channel1, channel2, ..., channel464.
I establish a connection as follows:
CONNECTION = f"postgres://{username}:{password}#{host}:{port}/{dbname}"#?sslmode=require"
self.TimescaleDB_Client = psycopg2.connect(CONNECTION)
I then verify the TimescaleDB extension with the following:
def verifyTimeScaleInstall(self):
    try:
        sql_query = "CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;"
        cur = self.TimescaleDB_Client.cursor()
        cur.execute(sql_query)
        cur.close()
        self.TimescaleDB_Client.commit()
    except:
        self.timescaleLogger.error("An error occurred in verifyTimeScaleInstall")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
I then create a hypertable for my data with the following:
def createRAWDataTable(self):
    try:
        cur = self.TimescaleDB_Client.cursor()
        self.query_create_raw_data_table = None
        for channel in range(self.num_channel):
            channel = channel + 1
            if self.query_create_raw_data_table is None:
                self.query_create_raw_data_table = f"CREATE TABLE IF NOT EXISTS raw_data (time TIMESTAMPTZ NOT NULL, channel{channel} REAL"
            else:
                self.query_create_raw_data_table = self.query_create_raw_data_table + f", channel{channel} REAL"
        self.query_create_raw_data_table = self.query_create_raw_data_table + ");"
        self.query_create_raw_data_hypertable = "SELECT create_hypertable('raw_data', 'time');"
        cur.execute(self.query_create_raw_data_table)
        cur.execute(self.query_create_raw_data_hypertable)
        self.TimescaleDB_Client.commit()
        cur.close()
    except:
        self.timescaleLogger.error("An error occurred in createRAWDataTable")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
I then insert the data into the hypertable using the following:
def insertRAWData(self, seconds):
    try:
        insert_start_time = datetime.now(pytz.timezone("MST"))
        current_time = insert_start_time
        num_iterations = seconds * self.fs
        time_increment = timedelta(seconds=1/self.fs)
        raw_data_query = self.query_insert_raw_data
        dtype = "float32"
        matrix = np.random.rand(self.fs*seconds, self.num_channel).astype(dtype)
        cur = self.TimescaleDB_Client.cursor()
        data = list()
        for iteration in range(num_iterations):
            raw_data_row = matrix[iteration, :].tolist()  # Select a particular row and all columns
            time_string = current_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")
            raw_data_values = (time_string,) + tuple(raw_data_row)
            data.append(raw_data_values)
            current_time = current_time + time_increment
        start_time = time.perf_counter()
        psycopg2.extras.execute_values(
            cur, raw_data_query, data, template=None, page_size=100
        )
        print(time.perf_counter() - start_time)
        self.TimescaleDB_Client.commit()
        cur.close()
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
The SQL Query String that I am referencing in the above code is obtained from the following:
def getRAWData_Query(self):
    try:
        self.query_insert_raw_data = None
        for channel in range(self.num_channel):
            channel = channel + 1
            if self.query_insert_raw_data is None:
                self.query_insert_raw_data = f"INSERT INTO raw_data (time, channel{channel}"
            else:
                self.query_insert_raw_data = self.query_insert_raw_data + f", channel{channel}"
        self.query_insert_raw_data = self.query_insert_raw_data + ") VALUES %s;"
        return self.query_insert_raw_data
    except:
        self.timescaleLogger.error("An error occurred in getRAWData_Query")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
As you can see, I am using psycopg2.extras.execute_values() to insert the values. To my understanding, this is one of the fastest ways to insert data. However, it takes about 80 seconds for me to insert this data, on quite a beefy system with 12 cores/24 threads, SSDs, and 256 GB of RAM. Can this be done faster? It just seems quite slow.
I would like to use TimescaleDB and am evaluating its performance, but I need the write to complete within 2 seconds or so for it to be acceptable.
Edit: I have tried to use pandas to perform the insert, but it took longer, at about 117 seconds. The following is the function that I used.
def insertRAWData_Pandas(self, seconds):
    try:
        insert_start_time = datetime.now(pytz.timezone("MST"))
        current_time = insert_start_time
        num_iterations = seconds * self.fs
        time_increment = timedelta(seconds=1/self.fs)
        raw_data_query = self.query_insert_raw_data
        dtype = "float32"
        matrix = np.random.rand(self.fs*seconds, self.num_channel).astype(dtype)
        pd_df_dict = {}
        pd_df_dict["time"] = list()
        for iteration in range(num_iterations):
            time_string = current_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")
            pd_df_dict["time"].append(time_string)
            current_time = current_time + time_increment
        for channel in range(self.num_channel):
            pd_df_dict[f"channel{channel}"] = matrix[:, channel].tolist()
        start_time = time.perf_counter()
        pd_df = pd.DataFrame(pd_df_dict)
        pd_df.to_sql('raw_data', self.engine, if_exists='append')
        print(time.perf_counter() - start_time)
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData_Pandas")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
Edit: I have tried to use CopyManager, and it appears to produce the best results, at around 74 seconds. Still not what I was after, however.
def insertRAWData_PGCOPY(self, seconds):
    try:
        insert_start_time = datetime.now(pytz.timezone("MST"))
        current_time = insert_start_time
        num_iterations = seconds * self.fs
        time_increment = timedelta(seconds=1/self.fs)
        dtype = "float32"
        matrix = np.random.rand(num_iterations, self.num_channel).astype(dtype)
        data = list()
        for iteration in range(num_iterations):
            raw_data_row = matrix[iteration, :].tolist()  # Select a particular row and all columns
            #time_string = current_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")
            raw_data_values = (current_time,) + tuple(raw_data_row)
            data.append(raw_data_values)
            current_time = current_time + time_increment
        channelList = list()
        for channel in range(self.num_channel):
            channel = channel + 1
            channelString = f"channel{channel}"
            channelList.append(channelString)
        channelList.insert(0, "time")
        cols = tuple(channelList)
        start_time = time.perf_counter()
        mgr = CopyManager(self.TimescaleDB_Client, 'raw_data', cols)
        mgr.copy(data)
        self.TimescaleDB_Client.commit()
        print(time.perf_counter() - start_time)
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData_PGCOPY")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
I tried to modify the following values in postgresql.conf. There wasn't a noticeable performance improvement.
wal_level = minimal
fsync = off
synchronous_commit = off
wal_writer_delay = 2000ms
commit_delay = 100000
I have tried to modify the chunk size according to one of the comments below, using the following in my createRAWDataTable() function. However, there wasn't an improvement in the insert times. Perhaps this was to be expected, given that I haven't been accumulating data: the database has only held a few samples, perhaps at most a minute's worth, over the course of my testing.
self.query_create_raw_data_hypertable = "SELECT create_hypertable('raw_data', 'time', chunk_time_interval => INTERVAL '3 day',if_not_exists => TRUE);"
Edit: For anyone reading this, I was able to pickle and insert a 32000x464 float32 NumPy matrix in about 0.5 seconds with MongoDB, which is what my final solution is. Perhaps MongoDB just handles this workload better in this case.
I have two initial suggestions that may help with overall performance.
The default hypertable you are creating will "chunk" your data into 7-day periods (this means each chunk will hold around 4,838,400,000 rows of data given your parameters). Since your data is so granular, you may want to use a different chunk size. Check out the docs here for info on the optional chunk_time_interval argument. Changing the chunk size should help with insert and query speed, and it will also give you better compression performance if needed later on.
As the individuals above stated, playing around with batch inserts should also help. If you haven't checked out this stock data tutorial, I would highly recommend it. Using pgcopy and its CopyManager function could help with inserting df objects more quickly.
Hopefully, some of this information can be helpful to your situation!
disclosure: I am part of the Timescale team 😊
You can use the sqlalchemy library to do it, and also calibrate the chunksize while you are at it.
Appending the data should take possibly less than 74 seconds, since I perform a similar kind of insertion and it takes me about 40-odd seconds.
Another possibility is to use pandas.DataFrame.to_sql with method set to a callable. It will increase performance drastically.
In comparison to plain to_sql (150s) or to_sql with method='multi' (196s), the callable method did the job in just 14s.
A comparative summary of the different methods was shown as an image in the original answer (not reproduced here).
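For reference, the callable typically used here is the COPY-based insert method from the pandas documentation. A minimal sketch, assuming a SQLAlchemy engine pointing at the same database (engine and the raw_data table are from the question; the helper name is just the conventional one):

import csv
import io

def psql_insert_copy(table, conn, keys, data_iter):
    # to_sql 'method' callable that loads the rows via PostgreSQL COPY
    with conn.connection.cursor() as cur:  # underlying psycopg2 cursor
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join('"{}"'.format(k) for k in keys)
        table_name = "{}.{}".format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert("COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns), buf)

# usage sketch:
# pd_df.to_sql('raw_data', engine, if_exists='append', index=False, method=psql_insert_copy)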
One of the fastest ways is to
first create a pandas data frame of your data that you want to insert into the DB
then use the data frame to bulk-insert your data into the DB
here is a way you can do it: How to write data frame to postgres?
I am getting two different results with the same query.
I am extending Diamond PostgresqlCollector https://github.com/python-diamond/Diamond/blob/master/src/collectors/postgres/postgres.py in order to track a new metric.
Specifically, I am trying to implement the bloat estimate queries specified here: https://github.com/ioguix/pgsql-bloat-estimation/blob/master/table/table_bloat.sql
Where I am having trouble is that when I run the query from the psql command prompt, I get results that include the 'public' schemaname. But when the query is run by Diamond, there are no results that include 'public'. Instead, entries are only available for pg_catalog and information_schema. I see this by checking the logs in /var/log/upstart/diamond.log.
The only cause I can imagine is a permissions error for the 'diamond' user, but I can see at the psql command line that the user diamond exists and has the Superuser privilege. And I do get results from pg_catalog, so I can get some stats, just not from the public schema of the database I'm most interested in.
Has anyone extended the PostgreSQL collector and seen this behavior, or does anyone have a suggestion of what to try next?
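One way to narrow this down is to run the same check from a short psycopg2 script that connects exactly the way the collector does (same user, same host, and, importantly, the same database), since pg_stat_all_tables and pg_stats only expose tables that exist in the connected database and that the role can read. A minimal sketch; the connection parameters below are placeholders:

import psycopg2

# connect the way the diamond collector does; adjust parameters to match
conn = psycopg2.connect(host="localhost", user="diamond", dbname="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT current_database(), current_user")
    print(cur.fetchone())
    # count how many 'public' tables are visible to this connection
    cur.execute("SELECT count(*) FROM pg_stat_all_tables WHERE schemaname = 'public'")
    print(cur.fetchone())
conn.close()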
Adding the relevant files here. The system I am testing on is a Vagrant machine, but I am using a Puppet file to replicate the production environment as closely as possible.
/etc/diamond/diamond.conf
[server]
pid_file = /var/run/diamond.pid
collectors_path = /usr/share/diamond/collectors/, /usr/local/share/diamond/collectors/
collectors_config_path = /etc/diamond/collectors/
handlers_path = /usr/share/diamond/handlers/
handlers_config_path = /etc/diamond/handlers/
handlers = diamond.handler.archive.ArchiveHandler
[handlers]
# logging handlers
keys = console
[[default]]
[[GraphitePickleHandler]]
host = graphite-01.local
port = 2014
timeout = 15
batch = 10
# ArchiveHandler writes stats to a local logfile.
# Much easier for testing and debugging.
[[ArchiveHandler]]
keys = watched_file
# File to write archive log files
log_file = /var/log/diamond/archive.log
[collectors]
[[default]]
hostname_method = fqdn_rev
interval = 60
[[CPUCollector]]
enabled = True
percore = True
[[DiskSpaceCollector]]
enabled = False
[[DiskUsageCollector]]
enabled = False
[[LoadAverageCollector]]
enabled = True
[[MemoryCollector]]
enabled = True
[[VMStatCollector]]
enabled = False
[[UserScriptsCollector]]
enabled = True
[loggers]
keys = root
[formatters]
keys = default
[logger_root]
level = INFO
handlers = console
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = default
[handler_watched_file]
class = handlers.WatchedFileHandler
level = DEBUG
formatter = default
[formatter_default]
format = [%(asctime)s] [%(levelname)s] [%(threadName)s] %(message)s
[configs]
path = "/etc/diamond/configs/"
extension = ".conf"
/etc/diamond/configs/postgres-service.conf
[collectors]
# Custom internal Postgresql collector. See diamond-service/files/collectors/custompg/custompg.py
[[CustomPostgresqlCollector]]
enabled = True
interval = 10
extended = True
metrics_blacklist = [^.]+\.indexes.*
pg_version = 9.3
user = diamond
# has_admin currently only controls if diamond should report how many WAL
# files exist on disk (although the query has a bug in it). However, as an
# unprivileged user, diamond can only see queries that are running as the same
# user. So in order to get the full picture of running queries on a multi-user
# system, diamond should have superuser privileges.
has_admin = False
/usr/local/share/diamond/collectors/custompg/custompg.py
import os
import sys

# Make sure we can import the existing postgres collector
try:
    import postgres
    from postgres import QueryStats, metrics_registry, registry
except ImportError:
    # It's likely that this is being imported in a test or script
    # outside of the normal diamond runpath.
    # In these instances, try to add COLLECTOR_PATH to path and import again.
    # i.e. export PYTHONPATH=$PYTHONPATH:/usr/share/diamond/collectors/postgres
    raise ImportError("Unable to import built-in postgres collector. "
                      "Make sure the collector path is added to PYTHONPATH.")
class CustomPostgresqlCollector(postgres.PostgresqlCollector):
    """
    Collector subclass to differentiate enabling/disabling
    company-specific Postgres metric collection.
    """
    # Even though nothing is being extended, this class is
    # still needed for the additional queries to get picked up
    # by Diamond.
    pass
class NonVacuumLongestRunningQueries(QueryStats):
    """
    Differentiate between vacuum and non-vacuum queries.
    The built-in longest running queries metric collection
    doesn't account for/filter vacuum operations.
    """
    path = "%(datname)s.non_vacuum.longest_running.%(metric)s"
    multi_db = True
    # This query is a modified version of
    # https://github.com/python-diamond/Diamond/blob/0fda1835308255e3ac4b287724340baf16b27bb1/src/collectors/postgres/postgres.py#L506-L519
    base_query = """
        SELECT 'query',
               COALESCE(max(extract(epoch FROM CURRENT_TIMESTAMP-query_start)),0)
        FROM pg_stat_activity
        WHERE %s
          AND %s
        UNION ALL
        SELECT 'transaction',
               COALESCE(max(extract(epoch FROM CURRENT_TIMESTAMP-xact_start)),0)
        FROM pg_stat_activity
        WHERE 1=1
          AND %s
    """
    exclude_vacuum_queries = "query NOT LIKE '%VACUUM%'"
    # Two query versions in case collector needs to run on Postgres < 9.2
    query = base_query % ("current_query NOT LIKE '<IDLE%'",
                          exclude_vacuum_queries,
                          exclude_vacuum_queries)
    post_92_query = base_query % ("state NOT LIKE 'idle%'",
                                  exclude_vacuum_queries,
                                  exclude_vacuum_queries)
class UserTableVacuumStats(QueryStats):
    """Additional per-table vacuuming stats."""
    path = "%(datname)s.tables.%(schemaname)s.%(relname)s.vacuum.%(metric)s"
    multi_db = True
    # http://www.postgresql.org/docs/9.3/static/monitoring-stats.html#PG-STAT-ALL-TABLES-VIEW
    # Also filter out generally non-volatile system tables.
    base_query = """
        SELECT relname, schemaname, vacuum_count, autovacuum_count
        FROM pg_stat_all_tables
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
    """
    query = base_query
class TableBloatSize(QueryStats):
    """Track estimated table bloat size using modified query written by ioguix:
    https://github.com/ioguix/pgsql-bloat-estimation/blob/master/table/table_bloat.sql
    WARNING: executed with a non-superuser role, the query inspects only
    tables you are granted to read.
    """
    path = "%(datname)s.tables.%(schemaname)s.%(relname)s.%(metric)s"
    multi_db = True
    query = """
SELECT schemaname, relname, (tblpages-est_tblpages_ff)*bs AS bloat_size
FROM (
SELECT ceil( reltuples / ( (bs-page_hdr)/tpl_size ) ) + ceil( toasttuples / 4 ) AS est_tblpages,
ceil( reltuples / ( (bs-page_hdr)*fillfactor/(tpl_size*100) ) ) + ceil( toasttuples / 4 ) AS est_tblpages_ff,
tblpages, fillfactor, bs, tblid, schemaname, relname, heappages, toastpages
FROM (
SELECT
( 4 + tpl_hdr_size + tpl_data_size + (2*ma)
- CASE WHEN tpl_hdr_size%ma = 0 THEN ma ELSE tpl_hdr_size%ma END
- CASE WHEN ceil(tpl_data_size)::int%ma = 0 THEN ma ELSE ceil(tpl_data_size)::int%ma END
) AS tpl_size, (heappages + toastpages) AS tblpages, heappages,
toastpages, reltuples, toasttuples, bs, page_hdr, tblid, schemaname, relname, fillfactor
FROM (
SELECT
tbl.oid AS tblid, ns.nspname AS schemaname, tbl.relname AS relname, tbl.reltuples,
tbl.relpages AS heappages, coalesce(toast.relpages, 0) AS toastpages,
coalesce(toast.reltuples, 0) AS toasttuples,
coalesce(substring(
array_to_string(tbl.reloptions, ' ')
FROM '%fillfactor=#"__#"%' FOR '#')::smallint, 100) AS fillfactor,
current_setting('block_size')::numeric AS bs,
CASE WHEN version()~'mingw32' OR version()~'64-bit|x86_64|ppc64|ia64|amd64' THEN 8 ELSE 4 END AS ma,
24 AS page_hdr,
23 + CASE WHEN MAX(coalesce(null_frac,0)) > 0 THEN ( 7 + count(*) ) / 8 ELSE 0::int END
+ CASE WHEN tbl.relhasoids THEN 4 ELSE 0 END AS tpl_hdr_size,
sum( (1-coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 1024) ) AS tpl_data_size
FROM pg_attribute AS att
JOIN pg_class AS tbl ON att.attrelid = tbl.oid
JOIN pg_namespace AS ns ON ns.oid = tbl.relnamespace
JOIN pg_stats AS s ON s.schemaname=ns.nspname
AND s.tablename = tbl.relname AND s.inherited=false AND s.attname=att.attname
LEFT JOIN pg_class AS toast ON tbl.reltoastrelid = toast.oid
WHERE att.attnum > 0 AND NOT att.attisdropped
AND tbl.relkind = 'r'
GROUP BY 1,2,3,4,5,6,7,8,9,10, tbl.relhasoids
ORDER BY 2,3
) AS s
) AS s2
) AS s3
WHERE schemaname='public';
"""
class BtreeBloatSize(QueryStats):
    """Track estimated index bloat size using modified query written by ioguix:
    https://github.com/ioguix/pgsql-bloat-estimation/blob/master/btree/btree_bloat.sql
    WARNING: executed with a non-superuser role, the query inspects only indexes on tables you are granted to read.
    WARNING: rows with is_na = 't' are known to have bad statistics ("name" type is not supported). Not relevant to the public schema.
    """
    path = "%(datname)s.tables.%(schemaname)s.%(relname)s.%(indexrelname)s.%(metric)s"
    multi_db = True
    query = """
SELECT nspname AS schemaname, relname, indexrelname,
bs*(relpages-est_pages_ff) AS bloat_size
FROM (
SELECT coalesce(1 +
ceil(reltuples/floor((bs-pageopqdata-pagehdr)*fillfactor/(100*(4+nulldatahdrwidth)::float))), 0
) AS est_pages_ff,
bs, nspname, relname, indexrelname, relpages, fillfactor
FROM (
SELECT maxalign, bs, nspname, relname, indexrelname, reltuples, relpages, relam, fillfactor,
( index_tuple_hdr_bm +
maxalign - CASE -- Add padding to the index tuple header to align on MAXALIGN
WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
ELSE index_tuple_hdr_bm%maxalign
END
+ nulldatawidth + maxalign - CASE -- Add padding to the data to align on MAXALIGN
WHEN nulldatawidth = 0 THEN 0
WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
ELSE nulldatawidth::integer%maxalign
END
)::numeric AS nulldatahdrwidth, pagehdr, pageopqdata
FROM (
SELECT
i.nspname, i.relname, i.indexrelname, i.reltuples, i.relpages, i.relam,
current_setting('block_size')::numeric AS bs, fillfactor,
CASE
-- MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?)
WHEN version() ~ 'mingw32' OR version() ~ '64-bit|x86_64|ppc64|ia64|amd64' THEN 8
ELSE 4
END AS maxalign,
/* per page header, fixed size: 20 for 7.X, 24 for others */
24 AS pagehdr,
/* per page btree opaque data */
16 AS pageopqdata,
/* per tuple header: add IndexAttributeBitMapData if some cols are null-able */
CASE WHEN max(coalesce(s.null_frac,0)) = 0
-- IndexTupleData size
THEN 2
/* IndexTupleData size + IndexAttributeBitMapData size ( max num filed per index + 8 - 1 /8) */
ELSE 2 + (( 32 + 8 - 1 ) / 8)
END AS index_tuple_hdr_bm,
/* data len: we remove null values save space using it fractionnal part from stats */
sum( (1-coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 1024)) AS nulldatawidth
FROM pg_attribute AS a
JOIN (
SELECT nspname, tbl.relname AS relname, idx.relname AS indexrelname, idx.reltuples, idx.relpages, idx.relam,
indrelid, indexrelid, indkey::smallint[] AS attnum,
coalesce(substring(
array_to_string(idx.reloptions, ' ')
from 'fillfactor=([0-9]+)')::smallint, 90) AS fillfactor
FROM pg_index
JOIN pg_class idx ON idx.oid=pg_index.indexrelid
JOIN pg_class tbl ON tbl.oid=pg_index.indrelid
JOIN pg_namespace ON pg_namespace.oid = idx.relnamespace
WHERE pg_index.indisvalid AND tbl.relkind = 'r' AND idx.relpages > 0
) AS i ON a.attrelid = i.indexrelid
JOIN pg_stats AS s ON s.schemaname = i.nspname
AND ((s.tablename = i.relname AND s.attname = pg_catalog.pg_get_indexdef(a.attrelid, a.attnum, TRUE)) -- stats from tbl
OR (s.tablename = i.indexrelname AND s.attname = a.attname))-- stats from functionnal cols
JOIN pg_type AS t ON a.atttypid = t.oid
WHERE a.attnum > 0
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
) AS s1
) AS s2
JOIN pg_am am ON s2.relam = am.oid WHERE am.amname = 'btree'
) AS sub
WHERE nspname='public'
ORDER BY 1,2,3;
"""
# Add the new metric queries to the
# registered set used by the collecting method.
metrics_registry.update({
    'NonVacuumLongestRunningQueries': NonVacuumLongestRunningQueries,
    'UserTableVacuumStats': UserTableVacuumStats,
    'TableBloatSize': TableBloatSize,
    'BtreeBloatSize': BtreeBloatSize,
})

registry['extended'] += ['NonVacuumLongestRunningQueries',
                         'UserTableVacuumStats',
                         'TableBloatSize',
                         'BtreeBloatSize']
So my issue is that I'm receiving values from a serial port.
The values could be any of the following.
BASE_RAD: NEW,ATC001#T1412010472R-77,ATC005:T1412010460R-70,SU0003;Q6V8.9S0C11.5*xx
BASE_RAD: NEW,ATC001#T1413824282R-102,ATC003:T1413824274R-98,SU001G;Q0V14.0D00*x
There are minor changes in the output, but the biggest difference is that the second line has the value D00 instead of S0.
This serial output updates me with changes to the sensors; D00 is for a digital output, while S0 is for the fan speed.
My question is this: I have written a regular expression that works if I receive the first serial output (the one with the S0 value), but if I then receive the one with D00, the regular expression will break.
I want to be able to write it so that if the line doesn't have the S0 value, it then looks for the D00 value instead.
Thank you in advance for any help or advice. I'm not sure where I should be looking or what direction I should be taking.
The Python code below reads the serial output and then runs the regular expression; if it finds a match, it inserts the values into the database.
import serial, string, MySQLdb, re

db = MySQLdb.connect(host="localhost", user="root", passwd="", db="walnut_farm")
cur = db.cursor()

serialPort = 'COM4'  # BAUD Rate is 9600 as default
ser = serial.Serial()
ser.setPort(serialPort)
#ser.setBaudrate(115200) Enable if BAUD is not deault value

try:
    ser.open()
except:
    print('Port Error!')
else:
    while True:
        try:
            ardString = ser.readline()
            Serial_Output = ardString
            p = re.compile(ur'^BASE_RAD: NEW,(.*)#T(\d*)R-(\d*),(.*):T(\d*)R-(\d*),(.*);Q(\d*)V(\d*\.?\d*)S(\d*)C(\d*\.?\d*)(.*)') # here is the regular expressions i created from this link http://regex101.com/r/dP6fE1/1
            Serial_Results = re.match(p, Serial_Output)

            # Assigning variables to the array values
            Base_ID = Serial_Results.group(1)
            Base_Time_Stamp = Serial_Results.group(2)
            Base_Signal = Serial_Results.group(3)
            Repeater_ID = Serial_Results.group(4)
            Repeater_Time_Stamp = Serial_Results.group(5)
            Repeater_Signal = Serial_Results.group(6)
            Sensor_ID = Serial_Results.group(7)
            Sensor_Sequence = Serial_Results.group(8)
            Sensor_Input_Voltage = Serial_Results.group(9)
            Sensor_Wind_Speed = Serial_Results.group(10)
            Sensor_Temperature = Serial_Results.group(11)
            Checksum = Serial_Results.group(12)

            # Execute the SQL query to INSERT the above variables into the Database
            cur.execute('INSERT INTO serial_values (Base_ID, Base_Time_Stamp, Base_Signal, Repeater_ID, Repeater_Time_Stamp, Repeater_Signal, Sensor_ID, Sensor_Sequence, Sensor_Input_Voltage, Sensor_Wind_Speed, Sensor_Temperature, Checksum) VALUES ("'+Base_ID+'", "'+Base_Time_Stamp+'", "'+Base_Signal+'", "'+Repeater_ID+'", "'+Repeater_Time_Stamp+'", "'+Repeater_Signal+'", "'+Sensor_ID+'", "'+Sensor_Sequence+'", "'+Sensor_Input_Voltage+'", "'+Sensor_Wind_Speed+'", "'+Sensor_Temperature+'", "'+Checksum+'")')
            db.commit()
            #ser.close()
        except Exception:
            pass
Take a look at this, if my interpretation is right. It's a starting point; you then have to add your MySQL insert into the database.
import re

def get_output_parameters(serial_output):
    p = re.compile(ur'^BASE_RAD: NEW,(.*)#T(\d*)R-(\d*),(.*):T(\d*)R-(\d*),(.*);Q(\d*)V(\d*\.?\d*)S(\d*)C(\d*\.?\d*)(.*)') # here is the regular expressions i created from this link http://regex101.com/r/dP6fE1/1
    p2 = re.compile(ur'^BASE_RAD: NEW,(.*)#T(\d*)R-(\d*),(.*):T(\d*)R-(\d*),(.*);Q(\d*)V(\d*\.?\d*)D(\d*)(.*)')
    Serial_Results = re.match(p, serial_output)
    digital_out = False
    if not Serial_Results:
        Serial_Results = re.match(p2, serial_output)
        digital_out = True

    # Assigning variables to the array values
    Base_ID = Serial_Results.group(1)
    Base_Time_Stamp = Serial_Results.group(2)
    Base_Signal = Serial_Results.group(3)
    Repeater_ID = Serial_Results.group(4)
    Repeater_Time_Stamp = Serial_Results.group(5)
    Repeater_Signal = Serial_Results.group(6)
    Sensor_ID = Serial_Results.group(7)
    Sensor_Sequence = Serial_Results.group(8)
    Sensor_Input_Voltage = Serial_Results.group(9)
    Sensor_Wind_Speed = Serial_Results.group(10)
    Sensor_Temperature = Serial_Results.group(11)
    if not digital_out:
        Checksum = Serial_Results.group(12)
    print Sensor_Temperature

Serial_Output = "BASE_RAD: NEW,ATC001#T1412010472R-77,ATC005:T1412010460R-70,SU0003;Q6V8.9S0C11.5*xx"
Serial_Output2 = "BASE_RAD: NEW,ATC001#T1413824282R-102,ATC003:T1413824274R-98,SU001G;Q0V14.0D00*x"

get_output_parameters(Serial_Output)
get_output_parameters(Serial_Output2)
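An alternative is to fold both line formats into a single pattern with an alternation, and to pass the values to MySQL as query parameters instead of string concatenation (which also avoids quoting problems). A rough sketch, assuming the same two line formats and the serial_values table from the question; the group names are just illustrative:

import re

# the tail is either S<speed>C<temperature> (fan speed form) or D<digits>
# (digital output form), followed by the checksum
pattern = re.compile(
    r'^BASE_RAD: NEW,(?P<base_id>.*)#T(?P<base_ts>\d*)R-(?P<base_sig>\d*),'
    r'(?P<rep_id>.*):T(?P<rep_ts>\d*)R-(?P<rep_sig>\d*),'
    r'(?P<sensor_id>.*);Q(?P<seq>\d*)V(?P<volts>\d*\.?\d*)'
    r'(?:S(?P<speed>\d*)C(?P<temp>\d*\.?\d*)|D(?P<digital>\d+))'
    r'(?P<checksum>.*)')

line = "BASE_RAD: NEW,ATC001#T1413824282R-102,ATC003:T1413824274R-98,SU001G;Q0V14.0D00*x"
m = pattern.match(line)
if m:
    values = m.groupdict()  # speed/temp are None when the D00 form arrives
    print(values['digital'], values['speed'], values['temp'])
    # parameterized insert: let the MySQL driver handle the quoting
    # cur.execute("INSERT INTO serial_values (Sensor_ID, Sensor_Sequence,"
    #             " Sensor_Input_Voltage, Sensor_Wind_Speed, Sensor_Temperature,"
    #             " Checksum) VALUES (%s, %s, %s, %s, %s, %s)",
    #             (values['sensor_id'], values['seq'], values['volts'],
    #              values['speed'], values['temp'], values['checksum']))
    # db.commit()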
Edit: the answer was that the OS was killing processes because I was consuming all the memory.
I am spawning enough subprocesses to keep the load average 1:1 with cores. However, at some point within the hour (this script could run for days), 3 of the processes go:
tipu 14804 0.0 0.0 328776 428 pts/1 Sl 00:20 0:00 python run.py
tipu 14808 64.4 24.1 2163796 1848156 pts/1 Rl 00:20 44:41 python run.py
tipu 14809 8.2 0.0 0 0 pts/1 Z 00:20 5:43 [python] <defunct>
tipu 14810 60.3 24.3 2180308 1864664 pts/1 Rl 00:20 41:49 python run.py
tipu 14811 20.2 0.0 0 0 pts/1 Z 00:20 14:04 [python] <defunct>
tipu 14812 22.0 0.0 0 0 pts/1 Z 00:20 15:18 [python] <defunct>
tipu 15358 0.0 0.0 103292 872 pts/1 S+ 01:30 0:00 grep python
I have no idea why this is happening. Attached are the master and slave; I can attach the MySQL/PG wrappers as well if needed. Any suggestions?
slave.py:
from boto.s3.key import Key
import multiprocessing
import gzip
import os
from mysql_wrapper import MySQLWrap
from pgsql_wrapper import PGSQLWrap
import boto
import re


class Slave:
    CHUNKS = 250000
    BUCKET_NAME = "bucket"
    AWS_ACCESS_KEY = ""
    AWS_ACCESS_SECRET = ""
    KEY = Key(boto.connect_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET).get_bucket(BUCKET_NAME))
    S3_ROOT = "redshift_data_imports"
    COLUMN_CACHE = {}
    DEFAULT_COLUMN_VALUES = {}

    def __init__(self, job_queue):
        self.log_handler = open("logs/%s" % str(multiprocessing.current_process().name), "a")
        self.mysql = MySQLWrap(self.log_handler)
        self.pg = PGSQLWrap(self.log_handler)
        self.job_queue = job_queue

    def do_work(self):
        self.log(str(os.getpid()))
        while True:
            #sample job in the abstract: mysql_db.table_with_date-iteration
            job = self.job_queue.get()

            #queue is empty
            if job is None:
                self.log_handler.close()
                self.pg.close()
                self.mysql.close()
                print("good bye and good day from %d" % (os.getpid()))
                self.job_queue.task_done()
                break

            #curtail iteration
            table = job.split('-')[0]

            #strip redshift table from job name
            redshift_table = re.sub(r"(_[1-9].*)", "", table.split(".")[1])

            iteration = int(job.split("-")[1])
            offset = (iteration - 1) * self.CHUNKS

            #columns redshift is expecting
            #bad tables will slip through and error out, so we catch it
            try:
                colnames = self.COLUMN_CACHE[redshift_table]
            except KeyError:
                self.job_queue.task_done()
                continue

            #mysql fields to use in SELECT statement
            fields = self.get_fields(table)

            #list subtraction determining which columns redshift has that mysql does not
            delta = (list(set(colnames) - set(fields.keys())))

            #subtract columns that have a default value and so do not need padding
            if delta:
                delta = list(set(delta) - set(self.DEFAULT_COLUMN_VALUES[redshift_table]))

            #concatenate columns with padded \N
            select_fields = ",".join(fields.values()) + (",\\N" * len(delta))

            query = "SELECT %s FROM %s LIMIT %d, %d" % (select_fields, table,
                                                        offset, self.CHUNKS)
            rows = self.mysql.execute(query)

            self.log("%s: %s\n" % (table, len(rows)))

            if not rows:
                self.job_queue.task_done()
                continue

            #if there is more data potentially, add it to the queue
            if len(rows) == self.CHUNKS:
                self.log("putting %s-%s" % (table, (iteration+1)))
                self.job_queue.put("%s-%s" % (table, (iteration+1)))

            #various characters need escaping
            clean_rows = []
            redshift_escape_chars = set(["\\", "|", "\t", "\r", "\n"])
            in_chars = ""

            for row in rows:
                new_row = []
                for value in row:
                    if value is not None:
                        in_chars = str(value)
                    else:
                        in_chars = ""

                    #escape any naughty characters
                    new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
                new_row = "\t".join(new_row)
                clean_rows.append(new_row)

            rows = ",".join(fields.keys() + delta)
            rows += "\n" + "\n".join(clean_rows)

            offset = offset + self.CHUNKS

            filename = "%s-%s.gz" % (table, iteration)
            self.move_file_to_s3(filename, rows)

            self.begin_data_import(job, redshift_table, ",".join(fields.keys() +
                                                                 delta))

            self.job_queue.task_done()

    def move_file_to_s3(self, uri, contents):
        tmp_file = "/dev/shm/%s" % str(os.getpid())
        self.KEY.key = "%s/%s" % (self.S3_ROOT, uri)
        self.log("key is %s" % self.KEY.key)

        f = gzip.open(tmp_file, "wb")
        f.write(contents)
        f.close()

        #local saving allows for debugging when copy commands fail
        #text_file = open("tsv/%s" % uri, "w")
        #text_file.write(contents)
        #text_file.close()

        self.KEY.set_contents_from_filename(tmp_file, replace=True)

    def get_fields(self, table):
        """
        Returns a dict used as:
            {"column_name": "altered_column_name"}
        Currently only the debug column gets altered
        """
        exclude_fields = ["_qproc_id", "_mob_id", "_gw_id", "_batch_id", "Field"]

        query = "show columns from %s" % (table)
        fields = self.mysql.execute(query)

        #key raw field, value mysql formatted field
        new_fields = {}

        #for field in fields:
        for field in [val[0] for val in fields]:
            if field in exclude_fields:
                continue

            old_field = field

            if "debug_mode" == field.strip():
                field = "IFNULL(debug_mode, 0)"

            new_fields[old_field] = field

        return new_fields

    def log(self, text):
        self.log_handler.write("\n%s" % text)

    def begin_data_import(self, table, redshift_table, fields):
        query = "copy %s (%s) from 's3://bucket/redshift_data_imports/%s' \
credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '\\t' \
gzip NULL AS '' COMPUPDATE ON ESCAPE IGNOREHEADER 1;" \
            % (redshift_table, fields, table, self.AWS_ACCESS_KEY, self.AWS_ACCESS_SECRET)

        self.pg.execute(query)
master.py:
from slave import Slave as Slave
import multiprocessing
from mysql_wrapper import MySQLWrap as MySQLWrap
from pgsql_wrapper import PGSQLWrap as PGSQLWrap


class Master:
    SLAVE_COUNT = 5

    def __init__(self):
        self.mysql = MySQLWrap()
        self.pg = PGSQLWrap()

    def do_work(table):
        pass

    def get_table_listings(self):
        """Gathers a list of MySQL log tables needed to be imported"""
        query = 'show databases'
        result = self.mysql.execute(query)

        #turns list[tuple] into a flat list
        databases = list(sum(result, ()))

        #overriding during development
        databases = ['db1', 'db2', 'db3']

        exclude = ('mysql', 'Database', 'information_schema')

        scannable_tables = []

        for database in databases:
            if database in exclude:
                continue

            query = "show tables from %s" % database
            result = self.mysql.execute(query)

            #turns list[tuple] into a flat list
            tables = list(sum(result, ()))

            for table in tables:
                exclude = ("Tables_in_%s" % database, "(", "201303", "detailed", "ltv")

                #exclude any of the unfavorables
                if any(s in table for s in exclude):
                    continue

                scannable_tables.append("%s.%s-1" % (database, table))

        return scannable_tables

    def init(self):
        #fetch redshift columns once and cache
        #get columns from redshift so we can pad the mysql column delta with nulls
        tables = ('table1', 'table2', 'table3')

        for table in tables:
            #cache columns
            query = "SELECT column_name FROM information_schema.columns WHERE \
                table_name = '%s'" % (table)
            result = self.pg.execute(query, async=False, ret=True)
            Slave.COLUMN_CACHE[table] = list(sum(result, ()))

            #cache default values
            query = "SELECT column_name FROM information_schema.columns WHERE \
                table_name = '%s' and column_default is not \
                null" % (table)
            result = self.pg.execute(query, async=False, ret=True)

            #turns list[tuple] into a flat list
            result = list(sum(result, ()))

            Slave.DEFAULT_COLUMN_VALUES[table] = result

    def run(self):
        self.init()

        job_queue = multiprocessing.JoinableQueue()

        tables = self.get_table_listings()
        for table in tables:
            job_queue.put(table)

        processes = []
        for i in range(Master.SLAVE_COUNT):
            process = multiprocessing.Process(target=slave_runner, args=(job_queue,))
            process.daemon = True
            process.start()
            processes.append(process)

        #blocks this process until queue reaches 0
        job_queue.join()

        #signal each child process to GTFO
        for i in range(Master.SLAVE_COUNT):
            job_queue.put(None)

        #blocks this process until queue reaches 0
        job_queue.join()

        job_queue.close()

        #do not end this process until child processes close out
        for process in processes:
            process.join()

        #toodles !
        print("this is master saying goodbye")


def slave_runner(queue):
    slave = Slave(queue)
    slave.do_work()
There's not enough information to be sure, but the problem is very likely to be that Slave.do_work is raising an unhandled exception. (There are many lines of your code that could do that in various different conditions.)
When you do that, the child process will just exit.
On POSIX systems… well, the full details are a bit complicated, but in the simple case (what you have here), a child process that exits will stick around as a <defunct> process until it gets reaped (because the parent either waits on it, or exits). Since your parent code doesn't wait on the children until the queue is finished, that's exactly what happens.
So, there's a simple duct-tape fix:
def do_work(self):
    self.log(str(os.getpid()))
    while True:
        try:
            # the rest of your code
        except Exception as e:
            self.log("something appropriate {}".format(e))
            # you may also want to post a reply back to the parent
You might also want to break the massive try up into different ones, so you can distinguish between all the different stages where things could go wrong (especially if some of them mean you need a reply, and some mean you don't).
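For instance, a rough sketch of what that split could look like; fetch_rows and upload_and_import here are hypothetical groupings of the code that currently sits inside the one big try:

def do_work(self):
    while True:
        job = self.job_queue.get()
        if job is None:
            break
        try:
            rows = self.fetch_rows(job)        # MySQL read stage (hypothetical helper)
        except Exception as e:
            self.log("fetch failed for %s: %s" % (job, e))
            self.job_queue.task_done()
            continue
        try:
            self.upload_and_import(job, rows)  # S3 upload + Redshift COPY stage (hypothetical helper)
        except Exception as e:
            self.log("import failed for %s: %s" % (job, e))
        self.job_queue.task_done()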
However, it looks like what you're attempting to do is duplicate exactly the behavior of multiprocessing.Pool, but have gotten it wrong in a couple of places. Which raises the question: why not just use Pool in the first place? You could then simplify/optimize things even further by using one of the map family of methods. For example, your entire Master.run could be reduced to:
self.init()
pool = multiprocessing.Pool(Master.SLAVE_COUNT, initializer=slave_setup)
pool.map(slave_job, tables)
pool.join()
And this will handle exceptions for you, allow you to return values/exceptions if you later need that, let you use the built-in logging library instead of trying to build your own, and so on. And it should only take about a dozen lines of minor code changes to Slave, and then you're done.
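A rough sketch of what slave_setup and slave_job could look like, assuming Slave stays roughly as it is; process_job is a hypothetical refactoring of do_work that handles exactly one job:

from slave import Slave

_slave = None  # per-process instance, created once by the Pool initializer

def slave_setup():
    # runs once in each worker process and builds the per-process connections
    global _slave
    _slave = Slave(job_queue=None)

def slave_job(table):
    # handle a single "db.table-iteration" job with the per-process Slave
    return _slave.process_job(table)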
If you want to submit new jobs from within jobs, the easiest way to do this is probably with a Future-based API (which turns things around, making the future result the focus and the pool/executor the dumb thing that provides them, instead of making the pool the focus and the result the dumb thing it gives back), but there are multiple ways to do it with Pool as well. For example, right now, you're not returning anything from each job, so, you can just return a list of tables to execute. Here's a simple example that shows how to do it:
import multiprocessing

def foo(x):
    print(x, x**2)
    return list(range(x))

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [5]
    while jobs:
        jobs, oldjobs = [], jobs
        for job in oldjobs:
            jobs.extend(pool.apply(foo, [job]))
    pool.close()
    pool.join()
Obviously you can condense this a bit by replacing the whole loop with, e.g., a list comprehension fed into itertools.chain, and you can make it a lot cleaner-looking by passing "a submitter" object to each job and adding to that instead of returning a list of new jobs, and so on. But I wanted to make it as explicit as possible to show how little there is to it.
At any rate, if you think the explicit queue is easier to understand and manage, go for it. Just look at the source for multiprocessing.worker and/or concurrent.futures.ProcessPoolExecutor to see what you need to do yourself. It's not that hard, but there are enough things you could get wrong (personally, I always forget at least one edge case when I try to do something like this myself) that it's worth looking at code that gets it right.
Alternatively, it seems like the only reason you can't use concurrent.futures.ProcessPoolExecutor here is that you need to initialize some per-process state (the boto.s3.key.Key, MySqlWrap, etc.), for what are probably very good caching reasons. (If this involves a web-service query, a database connect, etc., you certainly don't want to do that once per task!) But there are a few different ways around that.
For example, you can subclass ProcessPoolExecutor and override the undocumented method _adjust_process_count (see the source for how simple it is) to pass your setup function, and… that's all you have to do.
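On recent Python versions even that isn't necessary: since 3.7, ProcessPoolExecutor accepts an initializer directly. A minimal sketch, reusing the hypothetical slave_setup/slave_job from the Pool sketch above and the tables list from Master.run:

from concurrent.futures import ProcessPoolExecutor

# Python 3.7+: per-process setup without subclassing the executor
with ProcessPoolExecutor(max_workers=5, initializer=slave_setup) as executor:
    for result in executor.map(slave_job, tables):
        pass  # collect results / re-submit follow-up jobs here if needed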
Or you can mix and match. Wrap the Future from concurrent.futures around the AsyncResult from multiprocessing.