reduce for loop time in dataframe operation - python

To see a sample response, open this in a browser: https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=BTC-ETH&tickInterval=thirtyMin. I have 275 markets in market_list and 330 intervals in my time_series list. The GetTicks API returns thousands of dicts per market. I am only interested in the records whose 'T' value matches an interval in time_series; if an interval in time_series has no matching 'T' value in the GetTicks response, I set the corresponding 'BV'/'L' values in the master df to 0. Each loop iteration takes about 3 seconds, which adds up to roughly 20-25 minutes of execution time. Is there a better, more Pythonic way to construct this master df in less time? I'd appreciate any help/suggestions.
my code --->
for mkt, market_pair in enumerate(market_list):
    getTicks = requests.get("https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName="
                            + str(market_pair) + "&tickInterval=thirtyMin")
    getTicks_result = (getTicks.json())["result"]
    print(mkt + 1, '/', len_market_list, market_pair, "API called", datetime.utcnow().strftime('%H:%M:%S.%f'))
    first_df = pd.DataFrame(getTicks_result)
    first_df.set_index('T', inplace=True)
    for tk, interval in enumerate(time_series):
        if interval in first_df.index:
            master_bv_df.loc[market_pair, interval] = first_df.loc[interval, 'BV']
            bv_sh.cell(row=mkt + 2, column=tk + 3).value = first_df.loc[interval, 'BV']
            master_lp_df.loc[market_pair, interval] = first_df.loc[interval, 'L']
            lp_sh.cell(row=mkt + 2, column=tk + 3).value = first_df.loc[interval, 'L']
        else:
            master_bv_df.loc[market_pair, interval] = master_lp_df.loc[market_pair, interval] = 0
            bv_sh.cell(row=mkt + 2, column=tk + 3).value = lp_sh.cell(row=mkt + 2, column=tk + 3).value = 0
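A possible vectorized rewrite of the inner loop, sketched under the assumption that the values in time_series use the same string format as the API's 'T' field and that the master dataframes' columns are the time_series intervals in order (master_bv_df, master_lp_df, bv_sh and lp_sh are the question's own objects):
import requests
import pandas as pd

for mkt, market_pair in enumerate(market_list):
    url = ("https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName="
           + str(market_pair) + "&tickInterval=thirtyMin")
    ticks = requests.get(url).json()["result"]
    first_df = pd.DataFrame(ticks).set_index('T')

    # Align the API data with the full time_series in one step;
    # intervals with no matching 'T' value are filled with 0.
    aligned = first_df.reindex(time_series, fill_value=0)

    # Assign whole rows at once instead of one .loc call per interval.
    master_bv_df.loc[market_pair] = aligned['BV'].values
    master_lp_df.loc[market_pair] = aligned['L'].values

    # openpyxl cells still need a loop, but the per-cell pandas lookups are gone.
    for tk, (bv, lp) in enumerate(zip(aligned['BV'], aligned['L'])):
        bv_sh.cell(row=mkt + 2, column=tk + 3).value = bv
        lp_sh.cell(row=mkt + 2, column=tk + 3).value = lp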

Related

Selecting columns using [[]] is very inefficient especially as the size of the dataset increases in python using pandas

I created sample data using the function below:
def create_sample(num_of_rows=1000):
    # num_of_rows: number of records to generate.
    data = {
        'var1': [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
        'other': [random.uniform(0.0, 1.0) for x in range(num_of_rows)]
    }
    df = pd.DataFrame(data)
    print("Shape : {}".format(df.shape))
    print("Type : \n{}".format(df.dtypes))
    return df

df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using double square brackets
    ####################################################
    temp = df[['var' + str(i + 1), 'var' + str(i)]]
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
The graph is linear
I then used pd.concat to select the columns; that graph shows peaks at roughly every 100 iterations. Why is this so?
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select the same two columns from the dataframe using pd.concat
    ####################################################
    temp = pd.concat([df['var' + str(i + 1)], df['var' + str(i)]], axis=1)
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
From the above we can see that the time taken to select columns using [[]] increases linearly with the size of the dataframe. Using pd.concat, however, the time does not increase materially; it only spikes at roughly every 100 iterations. Why it spikes only every 100 is not obvious to me.
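Not an answer to the "why", but a sketch (variable names mine) that may help narrow it down: timing the column insertion and the [[]] selection separately shows which of the two operations is actually growing with the width of the frame.
import random
import time
import pandas as pd

df = pd.DataFrame({'var1': [random.uniform(0.0, 1.0) for _ in range(1000)],
                   'other': [random.uniform(0.0, 1.0) for _ in range(1000)]})

add_times, select_times = [], []
for i in range(1, 300):
    # Time the column insertion on its own.
    t0 = time.time()
    df['var' + str(i + 1)] = df['var' + str(i)]
    t1 = time.time()
    # Time the [[...]] selection on its own.
    temp = df[['var' + str(i + 1), 'var' + str(i)]]
    t2 = time.time()
    add_times.append(t1 - t0)
    select_times.append(t2 - t1)

print('total insertion time :', sum(add_times))
print('total selection time :', sum(select_times))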

Speeding Up Datafile Reading Program For School Project

I am in a lower-level coding class (Python) and have a major project due in three days. One of our grading criteria is program speed. My program runs in about 30 seconds; ideally it would execute in 15 or less. Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import time

start_time = time.time()  # for printing execution time

# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns={y: y})
        data = data.append(dpath)
        x += 1
    return data

data = read_data_files("Data_", 5, 163)  # start, end, prefix...

# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
# had issues here for some reason, so I created another array
# to replace the time column in the dataframe
human_timen = np.array(human_time)
data[' Time'] = human_timen

hours = []     # for use as x-axis in plot
stdlist = []   # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour

# separate function to calculate the magnitude of each row in each dataframe
def magfind(row):
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5

# two different intros to deal with the issue of '00:' versus '10:' timestamps
def filterfunction(intro1, intro2, first, last):
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            # creates magnitude column using prior function,
            # column has magnitudes for every row of every file
            acc['magnitude'] = acc.apply(magfind, axis=1)
            # finds std dev for the column and appends to a list for graphing
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis=1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            # need this so that the hours with no data still get graphed
            stdlist.append(p)
        # for determining which hour was the most active,
        # and appending those magnitudes to a list for histogramming
        if m > meanmax:
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist

filterfunction(' 0', ' ', 0, 23)
The slow speed stems from the filterfunction function. The program reads data from over 100 files, and this function sorts the data into a dataframe and analyzes it by time (each individual hour) in order to compute statistics over all rows for that hour. I believe it could be sped up by changing the way the data is filtered by hour, but I am not sure how. The reason I have statements to exclude certain k values is that there are hours with no data, which would otherwise throw off the list of standard deviation calculations as well as the plot this data will feed. Any tips or ideas for speeding this up would be greatly appreciated!
One suggestion to speed it up a bit is to remove this line, since it is not used anywhere in the program:
import matplotlib.pyplot as plt
matplotlib is a big library, so removing the import should shave a little off the start-up time.
I also think you could get rid of numpy, since it is only used once; consider using a plain tuple instead.
I was not able to test this because I am on mobile right now. My main idea is not to make the code leaner, though; I changed how the work itself is run. I integrated the multiprocessing library into your code, detected the number of CPU cores on the system, and divided the work between them.
Detailed multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import psutil
from datetime import datetime
from multiprocessing import Pool

cores = psutil.cpu_count()
start_time = time.time()  # for printing execution time

# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns={y: y})
        data = data.append(dpath)
        x += 1
    return data

data = read_data_files("Data_", 5, 163)  # start, end, prefix...

# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
# had issues here for some reason, so I created another array
# to replace the time column in the dataframe
human_timen = np.array(human_time)
data[' Time'] = human_timen

hours = []     # for use as x-axis in plot
stdlist = []   # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour

# separate function to calculate the magnitude of each row in each dataframe
def magfind(row):
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5

# two different intros to deal with the issue of '00:' versus '10:' timestamps
def filterfunction(intro1, intro2, first, last):
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            # creates magnitude column using prior function,
            # column has magnitudes for every row of every file
            acc['magnitude'] = acc.apply(magfind, axis=1)
            # finds std dev for the column and appends to a list for graphing
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis=1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            # need this so that the hours with no data still get graphed
            stdlist.append(p)
        # for determining which hour was the most active,
        # and appending those magnitudes to a list for histogramming
        if m > meanmax:
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist
# Run the filtering in a pool with one worker per CPU core
agents = cores
chunksize = (len(data) / cores)
with Pool(processes=agents) as pool:
    # pool.map would pass each tuple element as a separate single-argument
    # call, so starmap is used to forward all four arguments together
    pool.starmap(filterfunction, [(' 0', ' ', 0, 23)])
Don't use apply; it's not vectorized. Instead, use vectorized operations whenever you can. In this case, instead of doing df.apply(magfind, 1), do:
def add_magnitude(df):
    df['magnitude'] = (df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2) ** .5
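For what it's worth, here is a sketch (mine, not from any of the answers) of how the whole hour-by-hour pass could be replaced by a single vectorized groupby, assuming data[' Time'] still holds the raw Unix timestamps rather than the formatted strings (column names as in the question):
import numpy as np
import pandas as pd

def hourly_stats(data):
    # Convert the epoch-seconds column to datetimes once, instead of
    # building hour strings and matching them with str.contains.
    times = pd.to_datetime(data[' Time'].astype(int), unit='s')

    # Vectorized magnitude for every row in one shot (no apply).
    magnitude = np.sqrt(data[' Acc X'] ** 2 +
                        data[' Acc Y'] ** 2 +
                        data[' Acc Z'] ** 2)

    # Group by hour of day and compute std/mean per hour in one pass;
    # hours with no data are filled with 0 so they still get plotted.
    by_hour = magnitude.groupby(times.dt.hour)
    stds = by_hour.std().reindex(range(24), fill_value=0)
    means = by_hour.mean().reindex(range(24), fill_value=0)

    active = means.idxmax()                        # most active hour
    histvals = magnitude[times.dt.hour == active]  # magnitudes for the histogram

    return list(range(24)), stds.tolist(), histvals.tolist()
Grouping on the hour avoids both the repeated string matching and the per-row apply, which is where most of the time in filterfunction is being spent.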

Pandas - Add List to multiple columns (for multiple rows)

I have a list of values that I want to spread across multiple columns. This works fine for a single row; however, when I try to update multiple rows, it simply overwrites each whole column with the last value.
The list for each row looks like the below (note: the list length varies):
['2016-03-16T09:53:05',
'2016-03-16T16:13:33',
'2016-03-17T13:30:31',
'2016-03-17T13:39:09',
'2016-03-17T16:59:01',
'2016-03-23T12:20:47',
'2016-03-23T13:22:58',
'2016-03-29T17:26:26',
'2016-03-30T09:08:17']
I can store this in empty columns by using:
for i in range(len(trans_dates)):
    df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
However, this updates each whole column with the single trans_dates[i] value.
I thought looping over each row with the code above would work, but it still overwrites.
for issues in all_issues:
    for i in range(len(trans_dates)):
        df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
How do I only update my current row in the loop?
Am I even going about this the right way? Or is there a faster vectorised way of doing it?
Full code snippet below:
for issues in all_issues:
    print(issues)
    changelog = issues.changelog
    trans_dates = []
    from_status = []
    to_status = []
    for history in changelog.histories:
        for item in history.items:
            if item.field == 'status':
                trans_dates.append(history.created[:19])
                from_status.append(item.fromString)
                to_status.append(item.toString)
    trans_dates = list(reversed(trans_dates))
    from_status = list(reversed(from_status))
    to_status = list(reversed(to_status))
    print(trans_dates)
    # Store raw data in created columns and convert dates to pd.to_datetime
    for i in range(len(trans_dates)):
        df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
    for i in range(len(to_status)):
        df[('T' + str(i + 1) + ' - To')] = to_status[i]
    for i in range(len(from_status)):
        df[('T' + str(i + 1) + ' - From')] = from_status[i]
    for i in range(len(trans_dates)):
        df['T' + str(i + 1) + ' - Date'] = pd.to_datetime(df['T' + str(i + 1) + ' - Date'])
EDIT: Sample input and output added.
input:
issue/row #1 list (note year changes):
['2016-03-16T09:53:05',
'2016-03-16T16:13:33',
'2016-03-17T13:30:31',
'2016-03-17T13:39:09']
issue #2
['2017-03-16T09:53:05',
'2017-03-16T16:13:33',
'2017-03-17T13:30:31']
issue #3
['2018-03-16T09:53:05',
'2018-03-16T16:13:33',
'2018-03-17T13:30:31']
issue #4
['2015-03-16T09:53:05',
'2015-03-16T16:13:33']
output:
col  T1                     T2                     T3                     T4
17   '2016-03-16T09:53:05'  '2016-03-16T16:13:33'  '2016-03-17T13:30:31'  '2016-03-17T13:39:09'
18   '2017-03-16T09:53:05'  '2017-03-16T16:13:33'  '2017-03-17T13:30:31'  np.nan
19   '2018-03-16T09:53:05'  '2018-03-16T16:13:33'  '2018-03-17T13:30:31'  np.nan
20   '2015-03-16T09:53:05'  '2015-03-16T16:13:33'  np.nan                 np.nan
Instead of this:
for i in range(len(trans_dates)):
    df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
Try this:
for i in range(len(trans_dates)):
    df.loc[i, ('T' + str(i + 1) + ' - Date')] = trans_dates[i]
There are probably better ways to do this... df.merge or df.replace come to mind... it would be helpful if you posted what the input dataframe looked like and what the expected result is.
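One possible vectorized alternative, sketched under the assumption that you first collect one trans_dates list per issue (all_dates below is a hypothetical container, built the same way the question builds trans_dates inside its loop): pandas pads rows of unequal length with NaN when you build a frame from a list of lists.
import pandas as pd

# One list of transition dates per issue, collected up front
# (all_dates is my own name, not from the question).
all_dates = [
    ['2016-03-16T09:53:05', '2016-03-16T16:13:33', '2016-03-17T13:30:31', '2016-03-17T13:39:09'],
    ['2017-03-16T09:53:05', '2017-03-16T16:13:33', '2017-03-17T13:30:31'],
    ['2018-03-16T09:53:05', '2018-03-16T16:13:33', '2018-03-17T13:30:31'],
    ['2015-03-16T09:53:05', '2015-03-16T16:13:33'],
]

dates_df = pd.DataFrame(all_dates)  # shorter rows are padded with NaN
dates_df.columns = ['T' + str(i + 1) + ' - Date' for i in range(dates_df.shape[1])]
dates_df = dates_df.apply(pd.to_datetime)  # convert every column at once

print(dates_df)
The same pattern works for the to_status and from_status lists, and the resulting frames can be concatenated column-wise.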

GroupBY frequency counts JSON response - nested field

I'm trying to aggregate the response from an API call that returns a JSON object and get some frequency counts.
I've managed to do it for one of the fields in the JSON response, but a second field that I want to do the same thing for isn't working.
Both fields are called "category" but the one that isn't working is nested within "outcome_status".
The error I get is KeyError: 'category'
The below code uses a public API that does not require authentication, so can be tested easily.
import simplejson
import requests
#make a polygon for use in the API call
lat_coord = 51.767538
long_coord = -1.497488
lat_upper = str(lat_coord + 0.02)
lat_lower = str(lat_coord - 0.02)
long_upper = str(long_coord + 0.02)
long_lower = str(long_coord - 0.02)
#call from the API - no authentication required
api_call="https://data.police.uk/api/crimes-street/all-crime?poly=" + lat_lower + "," + long_upper + ":" + lat_lower + "," + long_lower + ":" + lat_upper + "," + long_lower + ":" + lat_upper + "," + long_upper + "&date=2017-01"
print (api_call)
request_resp=requests.get(api_call).json()
import pandas as pd
import numpy as np
df_resp = pd.DataFrame(request_resp)
#frequency counts for non-nested field (this works)
df_resp.groupby('category').context.count()
#next bit tries to do the nested (this doesn't work)
#tried dropping nulls
df_outcome = df_resp['outcome_status'].dropna()
print(df_outcome)
#tried index reset
df_outcome.reset_index()
#just errors
df_outcome.groupby('category').date.count()
I think you will have the easiest time of it, if you expand the dict in the "outcome_status" column like:
Code:
outcome_status = [
    {'outcome_status_' + k: v for k, v in z.items()} for z in (
        dict(category=None, date=None) if x is None else x
        for x in (y['outcome_status'] for y in request_resp)
    )
]
df = pd.concat([df_resp.drop('outcome_status', axis=1),
                pd.DataFrame(outcome_status)], axis=1)
This uses some comprehensions to rename the fields in outcome_status by prepending "outcome_status_" to the key names and turning them into columns. It also expands None values.
Test Code:
import requests
import pandas as pd
# make a polygon for use in the API call
lat_coord = 51.767538
long_coord = -1.497488
lat_upper = str(lat_coord + 0.02)
lat_lower = str(lat_coord - 0.02)
long_upper = str(long_coord + 0.02)
long_lower = str(long_coord - 0.02)
# call from the API - no authentication required
api_call = ("https://data.police.uk/api/crimes-street/all-crime?poly=" +
lat_lower + "," + long_upper + ":" +
lat_lower + "," + long_lower + ":" +
lat_upper + "," + long_lower + ":" +
lat_upper + "," + long_upper + "&date=2017-01")
request_resp = requests.get(api_call).json()
df_resp = pd.DataFrame(request_resp)
outcome_status = [
{'outcome_status_' + k: v for k, v in z.items()} for z in (
dict(category=None, date=None) if x is None else x
for x in (y['outcome_status'] for y in request_resp)
)
]
df = pd.concat([df_resp.drop('outcome_status', axis=1),
pd.DataFrame(outcome_status)], axis=1)
# frequency counts for the nested field
print(df.groupby('outcome_status_category').category.count())
Results:
outcome_status_category
Court result unavailable 4
Investigation complete; no suspect identified 38
Local resolution 1
Offender given a caution 2
Offender given community sentence 3
Offender given conditional discharge 1
Offender given penalty notice 2
Status update unavailable 6
Suspect charged as part of another case 1
Unable to prosecute suspect 9
Name: category, dtype: int64
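As an aside, depending on your pandas version, json_normalize can do much of this flattening for you. A minimal sketch, assuming request_resp is the list of dicts from the question:
import pandas as pd

# Flatten the nested JSON; nested keys such as outcome_status.category
# become dotted column names.  In older pandas versions this function
# lives at pd.io.json.json_normalize instead of pd.json_normalize.
df = pd.json_normalize(request_resp)

# Frequency counts for the nested category field; rows where
# outcome_status is None end up as NaN and are dropped by groupby.
print(df.groupby('outcome_status.category')['category'].count())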

Conditionally setting values with df.loc inside a loop

I'm querying an MS Access db to retrieve a set of leases. My task is to calculate monthly totals for base rent for the next 60 months. The leases have start and end dates so that the correct number of periods can be calculated when a lease terminates before 60 periods. My current challenge comes when I attempt to increase the base rent by a certain amount whenever it's time to apply an increase for that specific lease. I'm at a beginner level with Python/pandas, so my approach is likely not optimal and the code is rough looking. A vectorized approach is probably better suited, but I'm not quite able to write such code yet.
Data:
Lease input & output
Code:
try:
    sql = 'SELECT * FROM [tbl_Leases]'
    #sql = 'SELECT * FROM [Copy Of tbl_Leases]'
    df = pd.read_sql(sql, conn)
    #print df
    #df.to_csv('lease_output.csv', index_label='IndexNo')
    df_fcst_periods = pd.DataFrame()
    # init increments
    periods = 0
    i = 0
    # create empty lists to store looped info from original df
    fcst_months = []
    fcst_lease_num = []
    fcst_base_rent = []
    fcst_method = []
    fcst_payment_int = []
    fcst_rate_inc_amt = []
    fcst_rate_inc_int = []
    fcst_rent_start = []
    # create array for period deltas, rent interval calc, pmt interval calc
    fcst_period_delta = []
    fcst_rate_int_bool = []
    fcst_pmt_int_bool = []
    for row in df.itertuples():
        # get min of forecast period or lease ending date
        min_period = min(fcst_periods, df.Lease_End_Date[i])
        # count periods to loop for future periods in new df_fcst
        periods = (min_period.year - currentMonth.year) * 12 + (min_period.month - currentMonth.month)
        for period in range(periods):
            nextMonth = (currentMonth + monthdelta(period))
            period_delta = (nextMonth.year - df.Rent_Start_Date[i].year) * 12 + (nextMonth.month - df.Rent_Start_Date[i].month)
            period_delta = float(period_delta)
            # period delta values allow us to divide by the payment & rent intervals looking for integers
            rate_int_calc = period_delta/df['Rate_Increase_Interval'][i]
            pmt_int_calc = period_delta/df['Payment_Interval'][i]
            # float.is_integer() method - returns bool
            rate_int_bool = rate_int_calc.is_integer()
            pmt_int_bool = pmt_int_calc.is_integer()
            # conditional logic to handle base rent increases
            if df['Forecast_Method'][i] == "Percentage" and rate_int_bool:
                rate_increase = df['Base_Rent'][i] * (1 + df['Rate_Increase_Amt'][i]/100)
                df.loc[df.index, "Base_Rent"] = rate_increase
                fcst_base_rent.append(df['Base_Rent'][i])
                print "Both True"
            else:
                fcst_base_rent.append(df['Base_Rent'][i])
                print rate_int_bool
            fcst_rate_int_bool.append(rate_int_bool)
            fcst_pmt_int_bool.append(pmt_int_bool)
            fcst_months.append(nextMonth)
            fcst_period_delta.append(period_delta)
            fcst_rent_start.append(df['Rent_Start_Date'][i])
            fcst_lease_num.append(df['Lease_Number'][i])
            #fcst_base_rent.append(df['Base_Rent'][i])
            fcst_method.append(df['Forecast_Method'][i])
            fcst_payment_int.append(df['Payment_Interval'][i])
            fcst_rate_inc_amt.append(df['Rate_Increase_Amt'][i])
            fcst_rate_inc_int.append(df['Rate_Increase_Interval'][i])
        i += 1
    df_fcst_periods['Month'] = fcst_months
    df_fcst_periods['Rent_Start_Date'] = fcst_rent_start
    df_fcst_periods['Lease_Number'] = fcst_lease_num
    df_fcst_periods['Base_Rent'] = fcst_base_rent
    df_fcst_periods['Forecast_Method'] = fcst_method
    df_fcst_periods['Payment_Interval'] = fcst_payment_int
    df_fcst_periods['Rate_Increase_Amt'] = fcst_rate_inc_amt
    df_fcst_periods['Rate_Increase_Interval'] = fcst_rate_inc_int
    df_fcst_periods['Period_Delta'] = fcst_period_delta
    df_fcst_periods['Rate_Increase_Interval_bool'] = fcst_rate_int_bool
    df_fcst_periods['Payment_Interval_bool'] = fcst_pmt_int_bool
except Exception, e:
    print str(e)
conn.close()
I ended up initializing a variable before the periods loop which allowed me to perform a calculation when looping to obtain the correct base rents for subsequent periods.
# init base rent, rate increase amount, new rate for leases
base_rent = df['Base_Rent'][i]
rate_inc_amt = float(df['Rate_Increase_Amt'][i])
new_rate = 0
for period in range(periods):
    nextMonth = (currentMonth + monthdelta(period))
    period_delta = (nextMonth.year - df.Rent_Start_Date[i].year) * 12 + (nextMonth.month - df.Rent_Start_Date[i].month)
    period_delta = float(period_delta)
    # period delta values allow us to divide by the payment & rent intervals looking for integers
    rate_int_calc = period_delta/df['Rate_Increase_Interval'][i]
    pmt_int_calc = period_delta/df['Payment_Interval'][i]
    # float.is_integer() method - returns bool
    rate_int_bool = rate_int_calc.is_integer()
    pmt_int_bool = pmt_int_calc.is_integer()
    # conditional logic to handle base rent increases
    if df['Forecast_Method'][i] == "Percentage" and rate_int_bool:
        new_rate = base_rent * (1 + rate_inc_amt/100)
        base_rent = new_rate
        fcst_base_rent.append(new_rate)
    elif df['Forecast_Method'][i] == "Manual" and rate_int_bool:
        new_rate = base_rent + rate_inc_amt
        base_rent = new_rate
        fcst_base_rent.append(new_rate)
    else:
        fcst_base_rent.append(base_rent)
Still open for any alternative approaches though!
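As one such alternative, here is a sketch (mine, using the question's column meanings) of a vectorized percentage escalation for a single lease, where NumPy computes every forecast period at once instead of looping:
import numpy as np

def forecast_base_rent(base_rent, rate_inc_pct, increase_interval, period_deltas):
    # period_deltas: months elapsed since Rent_Start_Date for each forecast
    # month (the same period_delta values computed in the question's loop).
    deltas = np.asarray(period_deltas)
    # Number of escalations applied by each period.  Note: the question's
    # loop also treats a delta of 0 as an increase month; shift this count
    # by one if that convention is needed.
    n_increases = deltas // increase_interval
    return base_rent * (1 + rate_inc_pct / 100.0) ** n_increases

# Example: $1,000 base rent, 3% increase every 12 months, 24 forecast months.
print(forecast_base_rent(1000.0, 3.0, 12, np.arange(1, 25)))
For the "Manual" forecast method the last line of the function would become base_rent + rate_inc_amt * n_increases instead of the compounding power.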
