Running parallel API calls in Python without editing an external library

I have been trying to make my code faster by running parallel processes, with no luck. I am fetching weather data with an external library (https://github.com/pnuu/fmiopendata). Under the hood the library simply uses requests.get() to fetch data from the API. Any tips on how to proceed? I could surely edit the code of fmiopendata, but I would prefer a workaround over refactoring someone else's code.
Here is some working code, which I would like to edit:
from fmiopendata.wfs import download_stored_query


def parseStartTime(ts, year):
    return str(year) + "-" + ts[0][0] + "-" + ts[0][1] + "T00:00:00Z"


def parseEndTime(ts, year):
    return str(year) + "-" + ts[1][0] + "-" + ts[1][1] + "T23:59:59Z"


def weatherWFS(lat, lon, start_time, end_time):
    # Downloading the observations from the WFS server. Using bbox and timestamps for querying
    while True:
        try:
            obs = download_stored_query(
                "fmi::observations::weather::daily::multipointcoverage",
                args=["bbox=" + str(lon - 1e-2) + "," + str(lat - 1e-2) + "," + str(lon + 1e-2) + "," + str(lat + 1e-2),
                      "starttime=" + start_time,
                      "endtime=" + end_time])
            if obs.data == {}:
                return False
            else:
                return obs
        except:
            pass


def getWeatherData(lat, lon):
    StartYear, EndYear = 2011, 2021
    # Handling the data in suitable chunks. Array pairs represent the starting and ending
    # dates of the intervals in ["MM", "dd"] format
    intervals = [
        [["01", "01"], ["03", "31"]],
        [["04", "01"], ["06", "30"]],
        [["07", "01"], ["09", "30"]],
        [["10", "01"], ["12", "31"]]
    ]
    # Start and end timestamps are saved in an array
    queries = [[parseStartTime(intervals[i], year),
                parseEndTime(intervals[i], year)]
               for year in range(StartYear, EndYear + 1)
               for i in range(len(intervals))]
    for query in queries:
        # This is the request we need to run in parallel processing to save time
        # the obs-objects need to be saved somehow and merged afterwards
        obs = weatherWFS(lat, lon, query[0], query[1])
        """ INSERT MAGIC CODE HERE """
lat, lon = 62.6, 29.72
getWeatherData(lat, lon)

Answering my own question:
The best solution I found so far is to use concurrent.futures with either the map() or submit() functions.
The solution suggested by Trambi does not improve the execution time, as the requests are not CPU intensive. The bottleneck here is the waiting time, during which the CPU sits idle, so using separate processes is not going to solve the problem. Multithreading, however, can improve the speed, as threads are quicker to create and shut down.
Using ThreadPoolExecutor in combination with as_completed(), I was able to reduce the execution time by ~15%.
from concurrent.futures import ThreadPoolExecutor, as_completed
from fmiopendata.wfs import download_stored_query


def parseStartTime(ts, year):
    return str(year) + "-" + ts[0][0] + "-" + ts[0][1] + "T00:00:00Z"


def parseEndTime(ts, year):
    return str(year) + "-" + ts[1][0] + "-" + ts[1][1] + "T23:59:59Z"


def weatherWFS(lat, lon, start_time, end_time):
    # Downloading the observations from the WFS server. Using bbox and timestamps for querying
    while True:
        try:
            obs = download_stored_query(
                "fmi::observations::weather::daily::multipointcoverage",
                args=["bbox=" + str(lon - 1e-2) + "," + str(lat - 1e-2) + "," + str(lon + 1e-2) + "," + str(lat + 1e-2),
                      "starttime=" + start_time,
                      "endtime=" + end_time])
            if obs.data == {}:
                return False
            else:
                return obs
        except:
            pass


def getWeatherData(lat, lon):
    StartYear, EndYear = 2011, 2021
    # Handling the data in suitable chunks. Array pairs represent the starting and ending
    # dates of the intervals in ["MM", "dd"] format
    intervals = [
        [["01", "01"], ["03", "31"]],
        [["04", "01"], ["06", "30"]],
        [["07", "01"], ["09", "30"]],
        [["10", "01"], ["12", "31"]]
    ]
    # Start and end timestamps, together with the coordinates, are saved in an array
    queries = [
        [lat, lon,
         parseStartTime(intervals[i], year),
         parseEndTime(intervals[i], year)]
        for year in range(StartYear, EndYear)
        for i in range(len(intervals))]
    with ThreadPoolExecutor() as executor:
        # Each query list is unpacked into the four weatherWFS() arguments
        observations = [executor.submit(weatherWFS, *query) for query in queries]
        for obs in as_completed(observations):
            obs = obs.result()
            """do stuff with the observations"""

lat, lon = 62.6, 29.72
getWeatherData(lat, lon)
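
Since either map() or submit() works, here is a minimal sketch of the map() variant as well (my own sketch, reusing the weatherWFS() and queries from the code above; unlike as_completed(), map() yields results in submission order):

from concurrent.futures import ThreadPoolExecutor

def getWeatherDataWithMap(queries):
    # queries holds [lat, lon, start_time, end_time] lists, as built above
    with ThreadPoolExecutor(max_workers=8) as executor:
        for obs in executor.map(lambda q: weatherWFS(*q), queries):
            """do stuff with the observations"""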

You could try using multiprocessing.Pool.
Replace your for query in queries: loop with something like:
import multiprocessing

iterable = zip([lat] * len(queries), [lon] * len(queries), queries)

pool = multiprocessing.Pool(len(queries))
obs_list = pool.map(func=weatherWFS, iterable=iterable)
pool.close()
pool.join()
Note that this will pass whole query elements as arguments to weatherWFS, so you should change the function signature accordingly:
def weatherWFS(lat, lon, query):
    start_time = query[0]
    end_time = query[1]
Depending on the length of queries and its elements, you might also choose to unpack queries in your iterable...
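
If you would rather keep the original four-argument weatherWFS(lat, lon, start_time, end_time) signature, a minimal sketch of that unpacking with Pool.starmap (my own variant, not part of the answer above; it assumes lat, lon and the [start_time, end_time] queries list from the question):

import multiprocessing

# Build full argument tuples for weatherWFS(lat, lon, start_time, end_time).
# weatherWFS must be defined at module level so the worker processes can import it.
arg_tuples = [(lat, lon, start, end) for start, end in queries]

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        # starmap unpacks each tuple into separate arguments
        obs_list = pool.starmap(weatherWFS, arg_tuples)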

Related

What is the best way to parallel process in DataBricks to minimize query time?

I am working on a project where I need to take a list of ids and run said ids through an API pull that can only retrieve one record detail at a time per submitted id. Essentially what I have is a dataframe called df_ids that consists of over 12M ids that need to go through the below function in order to obtain the information requested by the end user for the entire population:
def ELOQUA_CONTACT(id):
    API = EloquaAPI(f'1.0/data/contact/{id}')
    try:
        contactid = API['id'].lower()
    except:
        contactid = ''
    try:
        company = API['accountName']
    except:
        company = ''
    df = pd.DataFrame([contactid, company]).T.rename(columns={0: 'contactid', 1: 'company'})
    return df
If I run something like ELOQUA_CONTACT(df_ids['Eloqua_Contact_IDs'][2]) it will give me the API record for the id = 2 in the form of a dataframe. The issue is, now I need to scale this to the entire 12M id population and build it in a way that it can be run and processed on a daily basis.
I have tried two techniques for parallel processing in DataBricks (python based; AWS backed). The first is based off a template that my manager developed for threading and when sampling it for just 1000 records it takes just shy of 2 minutes to query.
def DealEntries(df_input, n_sets):
    n_rows = df_input.shape[0]
    entry_per_set = n_rows // n_sets
    extra = n_rows % n_sets
    outlist = []
    for i in range(n_sets):
        if i != n_sets - 1:
            idx = range(0 + entry_per_set * i, entry_per_set * (i + 1))
        else:
            idx = range(0 + entry_per_set * i, entry_per_set * (i + 1) + extra)
        outlist.append(idx)
    return outlist


class ThreadWithReturnValue(Thread):
    def __init__(self, group=None, target=None, name=None, args=(), kwargs=None, *, daemon=None):
        Thread.__init__(self, group, target, name, args, kwargs, daemon=daemon)
        self._return = None

    def run(self):
        if self._target is not None:
            self._return = self._target(*self._args, **self._kwargs)

    def join(self):
        Thread.join(self)
        return self._return


data_input = pd.DataFrame(df_ids['Eloqua_Contact_IDs'][:1000])
rows_per_thread = 300
n_rows = data_input.shape[0]
threads = ceil(n_rows / rows_per_thread)
completed = 0
global df_results
outlist = DealEntries(data_input, threads)
df_results = []
for i in range(threads):
    rng = [x for x in outlist[i]]
    curr_input = data_input['Eloqua_Contact_IDs'][rng]
    jobs = []
    for id in curr_input.astype(str):
        thread = ThreadWithReturnValue(target=ELOQUA_CONTACT, kwargs={'id': id})
        jobs.append(thread)
    for j in jobs:
        j.start()
    for j in jobs:
        df_results.append(j.join())

df_out = pd.concat(df_results)
df_out
The second method is something that I just put together and runs in about 20 seconds.
from multiprocessing.pool import ThreadPool
parallels = ThreadPool(1000)
df_results = parallels.map(ELOQUA_CONTACT, [i for i in df_ids['Eloqua_Contact_IDs'][:1000]])
df_out = pd.concat(df_results)
df_out
The issue with both of these is that when scaling the time per record up from 1k to 12M, the first method would take around 916 days to run and the second would take something like 167 days. This needs to be scaled and parallel processed to a level that can run the 12M records in less than a day. Are there any other methodologies or features associated with DataBricks/AWS/Python/Spark/etc. that I can leverage to meet this objective? Once built, this would be put into a scheduled workflow (formerly job) in DataBricks and run on its own spin-up cluster whose backend resources I can alter (CPU + RAM size).
Any insight or advice is very much welcomed. Thank you.

How to Speed Up This Python Loop

downloadStart = datetime.now()

while (True):
    requestURL = transactionAPI.format(page=tempPage, limit=5000)
    response = requests.get(requestURL, headers=headers)

    json_data = json.loads(response.content)
    tempMomosTransactionHistory.extend(json_data["list"])

    if (datetime.fromtimestamp(json_data["list"][-1]["crtime"]) < datetime(datetime.today().year, datetime.today().month, datetime.today().day - dateRange)):
        break

    tempPage += 1

downloadEnd = datetime.now()
Any suggestions, please? Threading or something like that?
Outputs here
downloadtime 0:00:02.056010
downloadtime 0:00:05.680806
downloadtime 0:00:05.447945
You need to improve it in two ways.
Optimise the code within the loop
Parallelize the code execution
#1
Looking at your code, I can see one improvement: create the datetime.today() object once instead of three times. Check other methods, like transactionAPI, to optimise further.
#2
If you have a multi-core CPU machine, you can take advantage of it by spawning a thread per page. Refer to the modified version of your code below.
import threading

def processRequest(tempPage):
    requestURL = transactionAPI.format(page=tempPage, limit=5000)
    response = requests.get(requestURL, headers=headers)
    json_data = json.loads(response.content)
    tempMomosTransactionHistory.extend(json_data["list"])

downloadStart = datetime.now()

while (True):
    # create a thread per page
    t1 = threading.Thread(target=processRequest, args=(tempPage, ))
    t1.start()

    # Fetch the datetime.today() object once instead of 3 times
    datetimetoday = datetime.today()
    if (datetime.fromtimestamp(json_data["list"][-1]["crtime"]) < datetime(datetimetoday.year, datetimetoday.month, datetimetoday.day - dateRange)):
        break

    tempPage += 1

downloadEnd = datetime.now()
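
An alternative sketch of the same parallelization idea with concurrent.futures (my own variant, not part of the answer above): fetch a small batch of pages concurrently, then check the oldest timestamp before requesting the next batch. It assumes transactionAPI, headers, tempPage, tempMomosTransactionHistory and dateRange exist as in the question, and it may fetch slightly past the cutoff.

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
import json

import requests

def fetch_page(page):
    requestURL = transactionAPI.format(page=page, limit=5000)
    response = requests.get(requestURL, headers=headers)
    return json.loads(response.content)["list"]

cutoff = datetime.today() - timedelta(days=dateRange)
batch_size = 4  # number of pages requested concurrently
page = tempPage

downloadStart = datetime.now()
with ThreadPoolExecutor(max_workers=batch_size) as executor:
    while True:
        for rows in executor.map(fetch_page, range(page, page + batch_size)):
            tempMomosTransactionHistory.extend(rows)
        # stop once the oldest record fetched so far is older than the cutoff
        if datetime.fromtimestamp(tempMomosTransactionHistory[-1]["crtime"]) < cutoff:
            break
        page += batch_size
downloadEnd = datetime.now()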

Fastest way to insert many rows of data?

Every 4 seconds, I have to store 32,000 rows of data. Each of these rows consists of one time stamp value and 464 double precision values. The column name for the time stamp is time and the column names for the double precision values increase sequentially as channel1, channel2, ..., channel464.
I establish a connection as follows:
CONNECTION = f"postgres://{username}:{password}#{host}:{port}/{dbname}"#?sslmode=require"
self.TimescaleDB_Client = psycopg2.connect(CONNECTION)
I then verify the TimescaleDB extension with the following:
def verifyTimeScaleInstall(self):
    try:
        sql_query = "CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;"
        cur = self.TimescaleDB_Client.cursor()
        cur.execute(sql_query)
        cur.close()
        self.TimescaleDB_Client.commit()
    except:
        self.timescaleLogger.error("An error occurred in verifyTimeScaleInstall")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
I then create a hypertable for my data with the following:
def createRAWDataTable(self):
    try:
        cur = self.TimescaleDB_Client.cursor()
        self.query_create_raw_data_table = None
        for channel in range(self.num_channel):
            channel = channel + 1
            if self.query_create_raw_data_table is None:
                self.query_create_raw_data_table = f"CREATE TABLE IF NOT EXISTS raw_data (time TIMESTAMPTZ NOT NULL, channel{channel} REAL"
            else:
                self.query_create_raw_data_table = self.query_create_raw_data_table + f", channel{channel} REAL"
        self.query_create_raw_data_table = self.query_create_raw_data_table + ");"
        self.query_create_raw_data_hypertable = "SELECT create_hypertable('raw_data', 'time');"
        cur.execute(self.query_create_raw_data_table)
        cur.execute(self.query_create_raw_data_hypertable)
        self.TimescaleDB_Client.commit()
        cur.close()
    except:
        self.timescaleLogger.error("An error occurred in createRAWDataTable")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
I then insert the data into the hypertable using the following:
def insertRAWData(self, seconds):
    try:
        insert_start_time = datetime.now(pytz.timezone("MST"))
        current_time = insert_start_time
        num_iterations = seconds * self.fs
        time_increment = timedelta(seconds=1/self.fs)
        raw_data_query = self.query_insert_raw_data
        dtype = "float32"
        matrix = np.random.rand(self.fs*seconds, self.num_channel).astype(dtype)
        cur = self.TimescaleDB_Client.cursor()
        data = list()
        for iteration in range(num_iterations):
            raw_data_row = matrix[iteration, :].tolist()  # Select a particular row and all columns
            time_string = current_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")
            raw_data_values = (time_string,) + tuple(raw_data_row)
            data.append(raw_data_values)
            current_time = current_time + time_increment
        start_time = time.perf_counter()
        psycopg2.extras.execute_values(
            cur, raw_data_query, data, template=None, page_size=100
        )
        print(time.perf_counter() - start_time)
        self.TimescaleDB_Client.commit()
        cur.close()
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
The SQL Query String that I am referencing in the above code is obtained from the following:
def getRAWData_Query(self):
    try:
        self.query_insert_raw_data = None
        for channel in range(self.num_channel):
            channel = channel + 1
            if self.query_insert_raw_data is None:
                self.query_insert_raw_data = f"INSERT INTO raw_data (time, channel{channel}"
            else:
                self.query_insert_raw_data = self.query_insert_raw_data + f", channel{channel}"
        self.query_insert_raw_data = self.query_insert_raw_data + ") VALUES %s;"
        return self.query_insert_raw_data
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData_Query")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
As you can see, I am using psycopg2.extras.execute_values() to insert the values. To my understanding, this is one of the fastest ways to insert data. However, it takes about 80 seconds for me to insert this data. It is on quite a beefy system with 12 cores/24 threads, SSDs, and 256GB of RAM. Can this be done faster? It just seems quite slow.
I would like to use TimescaleDB and am evaluating its performance. But I am looking to write within 2 seconds or so for it to be acceptable.
Edit: I have tried to use pandas to perform the insert, but it took longer, at about 117 seconds. The following is the function that I used.
def insertRAWData_Pandas(self, seconds):
    try:
        insert_start_time = datetime.now(pytz.timezone("MST"))
        current_time = insert_start_time
        num_iterations = seconds * self.fs
        time_increment = timedelta(seconds=1/self.fs)
        raw_data_query = self.query_insert_raw_data
        dtype = "float32"
        matrix = np.random.rand(self.fs*seconds, self.num_channel).astype(dtype)
        pd_df_dict = {}
        pd_df_dict["time"] = list()
        for iteration in range(num_iterations):
            time_string = current_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")
            pd_df_dict["time"].append(time_string)
            current_time = current_time + time_increment
        for channel in range(self.num_channel):
            pd_df_dict[f"channel{channel}"] = matrix[:, channel].tolist()
        start_time = time.perf_counter()
        pd_df = pd.DataFrame(pd_df_dict)
        pd_df.to_sql('raw_data', self.engine, if_exists='append')
        print(time.perf_counter() - start_time)
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData_Pandas")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
Edit: I have tried to use CopyManager and it appears to produce the best results, at around 74 seconds. Still not what I was after, however.
def insertRAWData_PGCOPY(self, seconds):
    try:
        insert_start_time = datetime.now(pytz.timezone("MST"))
        current_time = insert_start_time
        num_iterations = seconds * self.fs
        time_increment = timedelta(seconds=1/self.fs)
        dtype = "float32"
        matrix = np.random.rand(num_iterations, self.num_channel).astype(dtype)
        data = list()
        for iteration in range(num_iterations):
            raw_data_row = matrix[iteration, :].tolist()  # Select a particular row and all columns
            #time_string = current_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")
            raw_data_values = (current_time,) + tuple(raw_data_row)
            data.append(raw_data_values)
            current_time = current_time + time_increment
        channelList = list()
        for channel in range(self.num_channel):
            channel = channel + 1
            channelString = f"channel{channel}"
            channelList.append(channelString)
        channelList.insert(0, "time")
        cols = tuple(channelList)
        start_time = time.perf_counter()
        mgr = CopyManager(self.TimescaleDB_Client, 'raw_data', cols)
        mgr.copy(data)
        self.TimescaleDB_Client.commit()
        print(time.perf_counter() - start_time)
    except:
        self.timescaleLogger.error("An error occurred in insertRAWData_PGCOPY")
        tb = traceback.format_exc()
        self.timescaleLogger.exception(tb)
        return False
I tried to modify the following values in postgresql.conf. There wasn't a noticeable performance improvement.
wal_level = minimal
fsync = off
synchronous_commit = off
wal_writer_delay = 2000ms
commit_delay = 100000
I have tried to modify the chunk size according to one of the comments below, using the following in my createRAWDataTable() function. However, there wasn't an improvement in the insert times. Perhaps this was also to be expected, given that I haven't been accumulating data: over the course of my testing the database has only held a few samples, at most about one minute's worth.
self.query_create_raw_data_hypertable = "SELECT create_hypertable('raw_data', 'time', chunk_time_interval => INTERVAL '3 day',if_not_exists => TRUE);"
Edit: For anyone reading this, I was able to pickle and insert a 32000x464 float32 numpy matrix in about 0.5 seconds with MongoDB, which is my final solution. Perhaps MongoDB just does better with this workload in this case.
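For what it is worth, a minimal sketch of that kind of MongoDB insert with pymongo (my own reconstruction, not the poster's code; the connection string and the database/collection names are placeholders):

import pickle
from datetime import datetime, timezone

import numpy as np
from bson.binary import Binary
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["acquisition"]["raw_data"]     # placeholder database/collection names

matrix = np.random.rand(32000, 464).astype("float32")

# Store the whole 4-second block as a single document: a timestamp plus the pickled matrix
doc = {
    "time": datetime.now(timezone.utc),
    "payload": Binary(pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)),
}
collection.insert_one(doc)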
I have two initial suggestions that may help with overall performance.
The default hypertable you are creating will "chunk" your data by 7-day periods (this means each chunk will hold around 4,838,400,000 rows of data given your parameters). Since your data is so granular, you may want to use a different chunk size. Check out the docs here for info on the optional chunk_time_interval argument. Changing the chunk size should help with insert and query speed, and it will also give you better performance in compression if needed later on.
As the individuals above stated, playing around with batch inserts should also help. If you haven't checked out this stock data tutorial, I would highly recommend it. Using pgcopy and its CopyManager could help with inserting df objects more quickly.
Hopefully, some of this information can be helpful to your situation!
disclosure: I am part of the Timescale team 😊
You can use the sqlalchemy library to do it, and also calibrate the chunksize while you are at it.
Appending the data should take possibly less than 74 seconds, since I perform a similar kind of insertion and it takes me about 40-odd seconds.
Another possibility is to use pandas.DataFrame.to_sql with method=callable. It will increase the performance drastically: in comparison to plain to_sql (150 s) or to_sql with method='multi' (196 s), the callable method did the job in just 14 s.
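For concreteness, the callable can stream the rows through PostgreSQL's COPY, along the lines of the psql_insert_copy example in the pandas documentation (a sketch; pd_df and engine are assumed to be the DataFrame and SQLAlchemy engine already used in the question):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # conn is a SQLAlchemy connection; grab the underlying DBAPI (psycopg2) connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

pd_df.to_sql('raw_data', engine, if_exists='append', index=False, method=psql_insert_copy)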
One of the fastest ways is to
first create a pandas data frame of your data that you want to insert into the DB
then use the data frame to bulk-insert your data into the DB
here is a way you can do it: How to write data frame to postgres?

Python Multi-threading in a recordset

I have a database record set (approx. 1000 rows) and I am currently iterating through them, to integrate more data using extra db query for each record.
Doing that raises the overall processing time to maybe 100 seconds.
What I want to do is split the work across 2-4 processes.
I am using Python 2.7 to have AWS Lambda compatibility.
def handler(event, context):
    try:
        records = connection.get_users()
        mandrill_client = open_mandrill_connection()
        mandrill_messages = get_mandrill_messages()
        mandrill_template = 'POINTS weekly-report-to-user'
        start_time = time.time()

        messages = build_messages(mandrill_messages, records)

        print("OVERALL: %s seconds ---" % (time.time() - start_time))

        send_mandrill_message(mandrill_client, mandrill_template, messages)
        connection.close_database_connection()
        return "Process Completed"
    except Exception as e:
        print(e)
Following is the function which I want to put into threads:
def build_messages(messages, records):
    for record in records:
        record = dict(record)
        stream = get_user_stream(record)
        data = compile_loyalty_stream(stream)
        messages['to'].append({
            'email': record['email'],
            'type': 'to'
        })
        messages['merge_vars'].append({
            'rcpt': record['email'],
            'vars': [
                {
                    'name': 'total_points',
                    'content': record['total_points']
                },
                {
                    'name': 'total_week',
                    'content': record['week_points']
                },
                {
                    'name': 'stream_greek',
                    'content': data['el']
                },
                {
                    'name': 'stream_english',
                    'content': data['en']
                }
            ]
        })
    return messages
What I have tried is importing the multiprocessing library:
from multiprocessing.pool import ThreadPool
Created a pool inside the try block and mapped the function inside this pool:
pool = ThreadPool(4)
messages = pool.map(build_messages_in, itertools.izip(itertools.repeat(mandrill_messages), records))
def build_messages_in(a_b):
    build_msg(*a_b)

def build_msg(a, b):
    return build_messages(a, b)
def get_user_stream(record):
    response = []
    i = 0
    for mod, mod_id, act, p, act_created in izip(record['models'], record['model_ids'], record['actions'],
                                                 record['points'], record['action_creation']):
        information = get_reference(mod, mod_id)
        if information:
            response.append({
                'action': act,
                'points': p,
                'created': act_created,
                'info': information
            })
            if (act == 'invite_friend') \
                    or (act == 'donate') \
                    or (act == 'bonus_500_general') \
                    or (act == 'bonus_1000_general') \
                    or (act == 'bonus_500_cancel') \
                    or (act == 'bonus_1000_cancel'):
                response[i]['info']['date_ref'] = act_created
                response[i]['info']['slug'] = 'attiki'
            if (act == 'bonus_500_general') \
                    or (act == 'bonus_1000_general') \
                    or (act == 'bonus_500_cancel') \
                    or (act == 'bonus_1000_cancel'):
                response[i]['info']['title'] = ''
            i += 1
    return response
Finally I removed the for loop from the build_message function.
What I get as a result is a 'NoneType' object is not iterable error.
Is this the correct way of doing this?
Your code seems pretty in-depth and so you cannot be sure that multithreading will lead to any performance gains when applied on a high level. Therefore, it's worth digging down to the point that gives you the largest latency and considering how to approach the specific bottleneck. See here for greater discussion on threading limitations.
If, for example as we discussed in the comments, you can pinpoint a single task that is taking a long time, then you could try to parallelize it using multiprocessing instead, to leverage more of your CPU power. Here is a generic example that is hopefully simple enough to understand; it mirrors your Postgres queries without going into your own code base, which I think would be an unfeasible amount of effort, tbh.
import multiprocessing as mp
import time
import random
import datetime as dt

MAILCHIMP_RESPONSE = [x for x in range(1000)]

def chunks(l, n):
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

def db_query():
    ''' Delayed response from database '''
    time.sleep(0.01)
    return random.random()

def do_queries(query_list):
    ''' The function that takes all your query ids and executes them
    sequentially for each id '''
    results = []
    for item in query_list:
        query = db_query()
        # Your super-quick processing of the Postgres response
        processing_result = query * 2
        results.append([item, processing_result])
    return results

def single_processing():
    ''' As you do now - equivalent to get_reference '''
    result_of_process = do_queries(MAILCHIMP_RESPONSE)
    return result_of_process

def multi_process(chunked_data, queue):
    ''' Same as single_processing, except we put our results in queue rather
    than returning them '''
    result_of_process = do_queries(chunked_data)
    queue.put(result_of_process)

def multiprocess_handler():
    ''' Divide and conquer on our db requests. We split the mailchimp response
    into a series of chunks and fire our queries simultaneously. Thus, each
    concurrent process has a smaller number of queries to make '''
    num_processes = 4  # depending on cores/resources
    size_chunk = len(MAILCHIMP_RESPONSE) / num_processes
    chunked_queries = chunks(MAILCHIMP_RESPONSE, size_chunk)
    queue = mp.Queue()  # This is going to combine all the results
    processes = [mp.Process(target=multi_process,
                            args=(chunked_queries[x], queue)) for x in range(num_processes)]
    for p in processes: p.start()
    divide_and_conquer_result = []
    for p in processes:
        divide_and_conquer_result.extend(queue.get())
    return divide_and_conquer_result

if __name__ == '__main__':
    start_single = dt.datetime.now()
    single_process = single_processing()
    print "Single process took {}".format(dt.datetime.now() - start_single)
    print "Number of records processed = {}".format(len(single_process))
    start_multi = dt.datetime.now()
    multi = multiprocess_handler()
    print "Multi process took {}".format(dt.datetime.now() - start_multi)
    print "Number of records processed = {}".format(len(multi))

How can I make this code more pythonic?

I am reading a bunch of daily files and using glob to concatenate them all together into separate dataframes. I eventually join them together and basically create a single large file which I use to connect to a dashboard. I am not too familiar with Python, but I use pandas and sklearn often.
As you can see, I am basically just reading the last 60 (or more) days' worth of data (the last 60 files) and creating a dataframe for each. This works, but I am wondering if there is a more pythonic/better way? I watched a video on PyData (about not being restricted by PEP 8 and making sure your code is pythonic) which was interesting.
(FYI - the reason why I need to read 60 days' worth of data is that customers can fill out a survey for a call which happened a long time ago. The customer fills out a survey today about a call that happened in July. I need to know about that call: how long it lasted, what the topic was, etc.)
os.chdir(r'C:\\Users\Documents\FTP\\')

loc = r'C:\\Users\Documents\\'
rosterloc = r'\\mand\\'
splitsname = r'Splits.csv'
fcrname = r'global_disp_'
npsname = r'survey_'
ahtname = r'callbycall_'
rostername = 'Daily_Roster.csv'
vasname = r'vas_report_'
ext = '.csv'

startdate = dt.date.today() - Timedelta('60 day')
enddate = dt.date.today()
daterange = Timestamp(enddate) - Timestamp(startdate)
daterange = (daterange / np.timedelta64(1, 'D')).astype(int)

data = []
frames = []
calls = []
bracket = []

try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        aht = pd.read_csv(ahtname + date_range.strftime('%Y_%m_%d') + ext)
        calls.append(aht)
except IOError:
    print('File does not exist:', ahtname + date_range.strftime('%Y_%m_%d') + ext)

aht = pd.concat(calls)
print('AHT Done')

try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        fcr = pd.read_csv(fcrname + date_range.strftime('%m_%d_%Y') + ext, parse_dates=['call_time'])
        data.append(fcr)
except IOError:
    print('File does not exist:', fcrname + date_range.strftime('%m_%d_%Y') + ext)

fcr = pd.concat(data)
print('FCR Done')

try:
    for date_range in (Timestamp(enddate) - dt.timedelta(n) for n in range(3)):
        nps = pd.read_csv(npsname + date_range.strftime('%m_%d_%Y') + ext, parse_dates=['call_date', 'date_completed'])
        frames.append(nps)
except IOError:
    print('File does not exist:', npsname + date_range.strftime('%m_%d_%Y') + ext)

nps = pd.concat(frames)
print('NPS Done')

try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        vas = pd.read_csv(vasname + date_range.strftime('%m_%d_%Y') + ext, parse_dates=['Call_date'])
        bracket.append(vas)
except IOError:
    print('File does not exist:', vasname + date_range.strftime('%m_%d_%Y') + ext)

vas = pd.concat(bracket)
print('VAS Done')

roster = pd.read_csv(loc + rostername)
print('Roster Done')

splits = pd.read_csv(loc + splitsname)
print('Splits Done')
I didn't change the names, but IMHO they should be more verbose, e.g. pd == pandas? Not sure. Here is a more pythonic way to write it:
from functools import partial
import logging
from operator import add, sub
import os
import datetime as dt
import contextlib

os.chdir(r'C:\\Users\Documents\FTP\\')

location = r'C:\\Users\Documents\\'
roster_location = r'\\mand\\'

splits_name = r'Splits.csv'
fcr_name = r'global_disp_'
nps_name = r'survey_'
aht_name = r'callbycall_'
roster_name = 'Daily_Roster.csv'
vas_name = r'vas_report_'
ext = '.csv'

start_date = dt.date.today() - Timedelta('60 day')
end_date = dt.date.today()
daterange = Timestamp(end_date) - Timestamp(start_date)
daterange = (daterange / np.timedelta64(1, 'D')).astype(int)

logger = logging.getLogger()  # a logger is better than "print" when you have multiple tiers to log. In this case: regular debug and exceptions


def timestamps_in_range(daterange, method=add):  # injected operation method instead of an "if" statement in case of subtracting
    for n in xrange(daterange):
        yield method(Timestamp(start_date), dt.timedelta(n))  # use generators for creating series of data in place


def read_csv(name, date_range, **kwargs):  # use functions/methods to shorten (make more readable) long, repetitive method invocations
    return pd.read_csv(name + date_range.strftime('%Y_%m_%d') + ext, **kwargs)


def log_done(module):  # use functions/methods to shorten (make more readable) long, repetitive method invocations
    logger.debug("%s Done" % module)


@contextlib.contextmanager  # contextmanager is great to separate business logic from exception handling
def mapper(function, iterable):
    try:
        yield map(function, iterable)  # map instead of executing the function in a "for" loop
    except IOError, err:
        logger.error('File does not exist: ', err.filename)


# The following code is visually tight and cleaner.
# It shows only what's needed, hiding the most insignificant details and repetitive code.

read_csv_aht = partial(read_csv, aht_name)  # partial pre-fills the function (first argument) with arguments of this function (remaining arguments). In this case it is useful for feeding the "map" function - it takes a one-argument function to execute on each element of a list
with mapper(read_csv_aht, timestamps_in_range(daterange)) as calls:  # the contextmanager beautifully hides the "dangerous" content, sharing only the "safe" result to be used
    aht = pd.concat(calls)
    log_done('AHT')

read_csv_fcr = partial(read_csv, fcr_name)
with mapper(read_csv_fcr, timestamps_in_range(daterange)) as data:
    fcr = pd.concat(data)
    log_done('FCR')

read_csv_nps = partial(read_csv, nps_name, parse_dates=['call_date', 'date_completed'])
with mapper(read_csv_nps, timestamps_in_range(3, sub)) as frames:
    nps = pd.concat(frames)
    log_done('NPS')

read_csv_vas = partial(read_csv, vas_name, parse_dates=['Call_date'])
with mapper(read_csv_vas, timestamps_in_range(daterange)) as bracket:
    vas = pd.concat(bracket)
    log_done('VAS')

roster = pd.read_csv(location + roster_name)
log_done('Roster')

splits = pd.read_csv(location + splits_name)
log_done('Splits')
