Trouble with pandas iterrows and loop counter

I have a dataset containing the US treasury curve for each day over a few years. Rows = Dates, Columns = tenor of specific treasury bond (3 mo, 1 yr, 10yr, etc)
I have python code that loops through each day and calibrates parameters for an interest rate model. I am having trouble looping through each row via iterrows and with my loop counter. The goal is to go row by row and calibrate the model to that daily curve, store the calibrated parameters in a dataframe, and then move onto the next row and repeat.
def do_calibration_model1():
    global i
    for index, row in curves.iterrows():
        day = np.array(row) #the subsequent error_fxn uses this daily curve
        calibration()
        i += 1
def calibration():
    i = 0
    param = scipy.brute(error_fxn, bounds...., etc.)
    opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
    calibration.loc[i] = np.array(opt) # store result of minimization (parameters for that day)
The code works correctly for the first iteration but then keeps repeating the calibration for the first row in the dataframe (curves). Further, it does not store the parameters in the next row of the calibration dataframe. I view the first issue as relating to the iterrows while the second is an issue of the loop counter.
Any thoughts on what is going wrong? I have a Matlab background and find the pandas setup to be very frustrating.
For reference I have consulted the links below to no avail.
https://www.python.org/dev/peps/pep-0212/
http://nipunbatra.github.io/2015/06/pandas-iteration/
Per Jason's comment below I have updated the code to:
def do_calibration_model1():
    global i
    for index, row in curves.iterrows():
        for i in range(0,len(curves)):
            day = np.array(row) #the subsequent error_fxn uses this daily curve
            param = scipy.brute(error_fxn, bounds...., etc.)
            opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
            calibration.loc[i] = np.array(opt) # store result of minimization (parameters for that day)
            i += 1
The revised code now places the appropriate parameters in each row of the calibration dataframe based on the loop counter.
However, it still does not move on to the second (or any subsequent) row of the curves dataframe in the iterrows loop.

Each time calibration is called, you set i = 0. As a result, when you call calibration.loc[i] = np.array(opt), what is being written is item 0 of calibration. The variable i is never actually anything except 0 in this function.
In function do_calibration_model1(), you declare global i and then increment i by one at the end of each loop iteration. I'm not sure what this i counter is meant to accomplish. Perhaps you think that the i in do_calibration_model1() updates the value of the i variable in the calibration() function, but this is not the case. Because there is no global i statement in calibration(), the assignment i = 0 makes i a local variable of that function, created afresh on every call.
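The scoping behaviour described above can be reproduced in a few lines. This is only an illustrative sketch: the function names and the plain dict standing in for the calibration dataframe are hypothetical, not part of the original code. It also shows one way to avoid a global counter entirely, by passing the position in from enumerate.

```python
def store_without_index(results, value):
    # `i` is a local variable, re-created as 0 on every call,
    # so every value is written to slot 0.
    i = 0
    results[i] = value


def store_with_index(results, i, value):
    # The caller supplies the position, so no global counter is needed.
    results[i] = value


overwritten = {}
for value in ["day0", "day1", "day2"]:
    store_without_index(overwritten, value)

kept = {}
for i, value in enumerate(["day0", "day1", "day2"]):
    store_with_index(kept, i, value)

print(overwritten)  # {0: 'day2'}
print(kept)         # {0: 'day0', 1: 'day1', 2: 'day2'}
```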
Regarding iterrows, I don't think you need the embedded for loop that cycles through the length of curves. Here's a quick example to show you how iterrows works:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
new = pd.DataFrame({'sum': [],
                    'mean': []})
for index, row in df.iterrows():
    temp = {'sum': sum(row), 'mean': np.mean(row)}
    # note: DataFrame.append was removed in pandas 2.0, so this line needs pandas < 2.0
    new = new.append(temp, ignore_index=True)
In the above, df looks like this:
A B C D
0 -2.197018 1.905543 0.773851 -0.006683
1 0.675442 0.818040 -0.561957 0.002737
2 -0.833482 0.248135 -1.159698 -0.302912
3 0.784216 -0.156225 -0.043505 -2.539486
4 -0.637248 0.034303 -1.405159 -1.590045
5 0.289257 -0.085030 -0.619899 -0.211158
6 0.804702 -0.838365 0.199911 0.210378
7 -0.031306 0.166793 -0.200867 1.343865
And the new dataframe populated through the iterrows loop looks like this:
mean sum
0 0.118923 0.475693
1 0.233566 0.934262
2 -0.511989 -2.047958
3 -0.488750 -1.954999
4 -0.899537 -3.598148
5 -0.156707 -0.626830
6 0.094157 0.376626
7 0.319621 1.278485
Note that using append here makes unnecessary the use of an i counter and simplifies the code.
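On recent pandas versions DataFrame.append is no longer available (it was removed in pandas 2.0). A minimal sketch of the same idea, collecting one dict per row in a plain Python list and building the dataframe once at the end, might look like this. The small deterministic input frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 3 rows, 4 columns, values 0..11
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=["A", "B", "C", "D"])

rows = []
for index, row in df.iterrows():
    # One dict per input row; list.append needs no counter either
    rows.append({"sum": row.sum(), "mean": row.mean()})

# Build the result dataframe in one step
new = pd.DataFrame(rows)
print(new)
```

Building the frame once is also usually faster than growing a dataframe row by row, since each dataframe-level append copies the data.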
Returning to your code, I suggest something like the following:
def do_calibration_model1():
    calibration = pd.DataFrame({'a': [],
                                'b': []})
    for index, row in curves.iterrows():
        day = np.array(row)
        param = scipy.brute(error_fxn, bounds...., etc.)
        opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
        temp = {'a': ..., 'b': ...} # put opt values into dict
        calibration = calibration.append(temp, ignore_index=True)
    return calibration
In the step calibration = pd.DataFrame({'a': [], 'b': []}), you will need to set up the dataframe's columns so that it can hold opt. Previously, you transformed opt to a numpy array, but you will need to arrange the values of opt so they fit your calibration dataframe, in the same way that I did for temp here: temp = {'sum': sum(row), 'mean': np.mean(row)}.

Related

trying to figure out a pythonic way of code that is taking time even after using list comprehension and pandas

I have two dataframes: one comprising a large data set, allprice_df, with time price series for all stocks; and the other, init_df, comprising selective stocks and trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
# Initial call
init_df = init_df.assign(HighestHigh = lambda x:
    highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))
# HighestHigh function in lambda call
def highestHigh(date1,ticker,allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns = ['DateIdentified','Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x,y,allp_df, next(counter)) for x, y in zip(temp_df['DateIdentified'],temp_df['Ticker'])]
    return high_list
# Getting high for each ticker
def getHigh(dateidentified,ticker,allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
    hh = allpm_df.iloc[:,0].max()
    hd = allpm_df.loc[(allpm_df['high']==hh),'date']
    hh = round(hh,2)
    h_list = [hh,hd]
    return h_list
# Split the list in to 2 columns one with price and the other with the corresponding date
init_df = split_columns(init_df,"HighestHigh")
# The function to split the list elements in to different columns
def split_columns(orig_df,col):
    split_df = pd.DataFrame(orig_df[col].tolist(),columns=[col+"Mod", col+"Date"])
    split_df[col+"Date"] = split_df[col+"Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df,split_df], axis=1)
    orig_df = orig_df.drop(col,axis=1)
    orig_df = orig_df.rename(columns={col+"Mod": col})
    return orig_df

There are a couple of obvious solutions that would help reduce your runtime.
First, in your getHigh function, instead of using loc with a comparison to get the date associated with the maximum value of high, use idxmax to get the index label of the row holding the maximum and then access that row directly:
hh, hd = allpm_df.loc[allpm_df['high'].idxmax(), ['high', 'date']]
This replaces two O(N) operations (finding the maximum, then scanning the column for the rows equal to it) with one O(N) operation and one index lookup, which is roughly O(1) when the index is unique.
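A minimal sketch of the idxmax lookup on a tiny made-up frame (the values and the index labels 7, 8, 9 are hypothetical). Here .loc is used to fetch the row by the label that idxmax returns:

```python
import pandas as pd

# Hypothetical filtered price window for one ticker
window = pd.DataFrame(
    {
        "high": [10.0, 12.5, 11.0],
        "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    },
    index=[7, 8, 9],
)

# idxmax returns the index label of the row with the largest 'high'
peak_label = window["high"].idxmax()

# Look the row up by label and unpack the two columns of interest
hh, hd = window.loc[peak_label, ["high", "date"]]
print(hh, hd)
```

Note that idxmax raises an error on an empty selection, so a ticker with no rows after the entry date would need to be handled separately.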
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
In order to make this faster, I would set up your data frame with a multi-index when you first create it, and then sort it by that index:
index = pd.MultiIndex.from_arrays(arrays = [ticker_symbols, dates], names = ['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index = index)
allp_df = allp_df.sort_index(level = 0, sort_remaining = True)
This should create a dataframe with a sorted, multi-level index associated with your ticker symbol and date. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date-range by doing this:
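allp_df.loc[(ticker, slice(dateidentified, currDate)), :]
which should return your data much more quickly. Note that label-based slices include both endpoints, so a row dated exactly dateidentified is included here, unlike in your original filter. For more information on multi-indexing, check out this helpful Pandas tutorial.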
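As a rough sketch of the multi-index approach, assuming a small made-up price table (the tickers, dates, prices and the helper name highest_high are all hypothetical):

```python
import pandas as pd

# Hypothetical price history for two tickers
prices = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"]
    ),
    "high": [10.0, 12.5, 11.0, 20.0, 19.0],
})

# Sorted (Ticker, date) index so that label slicing is valid and fast
indexed = prices.set_index(["Ticker", "date"]).sort_index()


def highest_high(ticker, start, end):
    """Return (max high, date of max) for one ticker between start and end, inclusive."""
    window = indexed.loc[(ticker, slice(start, end)), "high"]
    peak_label = window.idxmax()
    # With a MultiIndex the label is a (ticker, date) tuple; keep only the date
    peak_date = peak_label[-1] if isinstance(peak_label, tuple) else peak_label
    return window.max(), peak_date


hh, hd = highest_high("AAA", pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-03"))
print(hh, hd)
```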

Creating a Dictionary and then a dataframe from different Types

I'm having a problem manipulating different types. Here is what I'm doing:
row_list=[]
for x in a:
    df=selectinfo(x)
    analysis=df[['Sales']].copy()
    decompose_result_mult = seasonal_decompose(analysis, model="additive")
    date = decompose_result_mult.trend.to_frame()
    date.reset_index(inplace=True)
    date=date.iloc[:,0]
    trend = decompose_result_mult.trend
    trend = decompose_result_mult.trend.to_frame()
    trend.reset_index(inplace=True)
    trend=trend.iloc[:,1]
    seasonal = decompose_result_mult.seasonal
    seasonal = decompose_result_mult.seasonal.to_frame()
    seasonal.reset_index(inplace=True)
    seasonal=seasonal.iloc[:,1]
    residual = decompose_result_mult.resid
    residual = decompose_result_mult.resid.to_frame()
    residual.reset_index(inplace=True)
    residual=residual.iloc[:,1]
    observed=decompose_result_mult.observed
    observed=decompose_result_mult.observed.to_frame()
    observed.reset_index(inplace=True)
    observed=observed.iloc[:,1]
    dict= {'Date':date,'region':x,'Trend':trend,'Seasonal':seasonal,'Residual':residual,'Sales':observed}
    row_list.append(dict)
pandasTable=pd.DataFrame(row_list)
The problem is that when I run my code, the result is a dataframe whose cells each contain an entire Series. Any help? I would like to have one value per row, not one list per row.
Thank you!

Calculating averaged data in and writing to csv from a pandas dataframe

I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations.
The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation.
Data looks like this
x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
For each x and y pair I want to iterate over two series of text values and do three things in the z direction:
1. Calculate the average of one numeric value for all the rows with a specific third text value.
2. Sum another numeric value for all the rows with the same text value.
3. Write the resulting table of 'x, y, average, sum' to a csv.
My code does part three (albeit very slowly) but doesn't calculate 1 or 2 or at least I don't appear to get the average and sum calculations in my output.
What have I done wrong and how can I speed it up?
for text1 in text_list1:
    for text2 in text_list2:
        # Get the data into smaller dataframe
        df = data.loc[ (data["textfield1"] == text1) & (data["textfield2"] == text2 ) ]
        #Get the minimum and maximum x and y
        minXw = df['x'].min()
        maxXw = df['x'].max()
        minYw = df['y'].min()
        maxYw = df['y'].max()
        # dictionary for quicker printing
        dict_out = {}
        rows_list = []
        # Make output filename
        filenameOut = text1+"_"+text2+"_Values.csv"
        # Start looping through x values
        for x in np.arange(minXw, maxXw, x_inc):
            xcount += 1
            # Start looping through y values
            for y in np.arange(minYw, maxYw, y_inc):
                ycount += 1
                # calculate average and sum
                ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()
                # Make Dictionary of output values
                dict_out = dict([('text1', text1),
                                 ('text2', text2),
                                 ('text3', df['text3']),
                                 ('x' , x-x_inc),
                                 ('y' , y-y_inc),
                                 ('ave' , ave_val),
                                 ('sum' , sum_val)])
                rows_list_c.append(dict_out)
        # Write csv
        columns = ['text1','text2','text3','x','y','ave','sum']
        with open(filenameOut, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=columns)
            writer.writeheader()
            for data in dict_out:
                writer.writerow(data)
My resultant csv gives me:
text1,text2,text3,x,y,ave,sum
text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0

It's not entirely clear what you're trying to do, but here is a starting point.
If you only need to process rows with a specific text3 value, start by filtering out the other rows:
df = df[df.text3=="my_value"]
If you do not need text3 anymore at this point, you can also drop it:
df = df.drop(columns="text3")
Then you process several sub dataframes, and write each of them to their own csv file. groupby is the perfect tool for that:
for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
    filenameOut = text1+"_"+text2+"_Values.csv"
    # Process sub df
    output_df = process(sub_df)
    # Write sub df
    output_df.to_csv(filenameOut)
Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.
Now let's have a look at the process function. (Note that you don't really need to make it a separate function; you could just as well put the function body in the for loop.)
At this point, if I understand correctly, you want to compute the sum and the average over all rows that have the same x and y coordinates. Here again you can use groupby, together with the agg function, to compute the mean and the sum of each group.
def process(sub_df):
    # drop the text1 and text2 columns since they are in the filename anyway
    out = sub_df.drop(columns=["text1","text2"])
    # Compute mean and sum
    return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
And that's pretty much it.
Bonus: 2-liner version (but don't do that...)
for (text1, text2), sub_df in df[df.text3=="my_value"].drop(columns="text3").groupby(["text1", "text2"]):
    sub_df.drop(columns=["text1","text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1+"_"+text2+"_Values.csv")
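As a concrete check of the groupby/agg step, here is a small sketch that runs it on the sample rows from the question. For brevity it skips the text filtering and the per-file split and aggregates over all rows, which is a simplification for illustration only.

```python
import io

import pandas as pd

# Sample rows copied from the question
csv_text = """x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
"""
data = pd.read_csv(io.StringIO(csv_text))

# One output row per (x, y) pair: mean of float1 and sum of float2 over z
summary = (
    data.groupby(["x", "y"])
    .agg(ave=("float1", "mean"), sum=("float2", "sum"))
    .reset_index()
)
print(summary)
```

Calling summary.to_csv(...) on the result would then write the 'x, y, ave, sum' table directly.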

To do this in an efficient way in pandas you will need to use groupby, agg and the in-built to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])
for (text1, text2), group in groups:
    group.groupby("text3") \
        .agg({"float1": "mean", "float2": "sum"}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.
data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])
for (text1, text2), group in groups:
    group.groupby(["text3", "x_bin", "y_bin"]) \
        .agg({"x": "first", "y": "first", "float1": "mean", "float2": "sum"}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
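To make the binning step concrete, here is a minimal one-dimensional sketch. The coordinates, the values and the bin width x_inc = 25 are made up for illustration.

```python
import pandas as pd

# Hypothetical points along x with one numeric column
points = pd.DataFrame({
    "x": [75000, 75010, 75030, 75060],
    "float1": [1.0, 3.0, 5.0, 7.0],
})
x_inc = 25  # hypothetical bin width

# Integer bin number: how many whole increments each x lies above the minimum
points["x_bin"] = (points["x"] - points["x"].min()) // x_inc

# Aggregate per bin, keeping the first x seen in each bin as its label
binned = points.groupby("x_bin").agg(x=("x", "first"), ave=("float1", "mean"))
print(binned)
```

The first two points fall into bin 0 and are averaged together; the other two each get their own bin.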

Grouped Time Series forecasting with scikit-hts

I am trying to forecast sales for multiple time series that I took from Kaggle's Store Item Demand Forecasting Challenge. The data is a long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and annual seasonalities.
In total there are: 365.2 days * 5 years * 10 stores * 50 items = 913000 records.
From my understanding based on what I've read so far on Hierarchical and Grouped time series, the whole dataframe could be structured as a Grouped Time Series and not simply as a strict Hierarchical Time Series as aggregation could be done at the store or item levels interchangeably.
I want to find a way to forecast all 500 time series (for store1_item1, store1_item2,..., store10_item50) for the next year (from 01-jan-2015 to 31-dec-2015) using the scikit-hts library and its AutoArimaModel function which is a wrapper function of pmdarima's AutoArima function.
To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the annual seasonality while auto_arima deals with the weekly seasonality.
My problem is that I get an error during the prediction step.
Here's the error message :
ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).
I assume something is wrong with the exogenous dictionary but I do not know how to solve the issue as I'm using scikit-hts for the first time. To do this, I followed the official documentation of scikit-hts here.
EDIT:
I had not seen that a similar bug had already been reported on GitHub. After implementing the proposed fix locally, I could get some results. However, even though the code now runs without error, some of the forecasts are negative, as pointed out in the comments below this post, and some of the positive ones are disproportionately large.
Here are the plots for all the combinations of store and item. You can see that this seems to work for only one combination.
df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()
df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()
df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()
df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor
# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']
# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]
# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]
# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)
# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
store_item_ts.columns = store_item_ts.columns.droplevel(0)
store_item_ts.columns = col_names
# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(1)
# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)
# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)
# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns= [f'store_{i}_item_{j}_'+ x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)
# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis= 1)
# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns= [f'store_{i}_item_{j}_'+ x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)
# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names
# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}
# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}
# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}
# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)
# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)
# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)
# Make predictions
# Set the exogenous_df param
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
Other approaches I thought of and that I already implemented successfully for one series (for store 1 and item 1 for example) :
TBATS applied to each series independently inside a loop across all 500 time series
auto_arima (SARIMAX) with exogenous features (=Fourier terms to deal with the weekly and annual seasonalities) for each series independently + a loop across all 500 time series
What do you think of these approaches? Do you have other suggestions on how to scale ARIMA to multiple time series?
I also want to try LSTM but I'm new to data science and deep learning and do not know how to prepare the data. Should I keep the data in their original form (long format) and apply one hot encoding to train_data['store'] and train_data['item'] columns or should I start with the df I ended up with here?

I hope this helped you fix the issue with the exogenous regressors. To handle the negative forecasts, I would suggest trying a square-root transformation.
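One way to read the square-root suggestion is: fit and forecast on the square root of the sales, then square the forecasts to return to the original scale, which cannot produce negative values. The sketch below only illustrates the transform and its inverse; the sales numbers are made up and the hard-coded forecast_sqrt array is a stand-in for whatever the fitted model returns.

```python
import numpy as np

# Hypothetical non-negative daily sales
sales = np.array([4.0, 9.0, 1.0, 0.0, 16.0])

# Train the model on the transformed series instead of the raw one
transformed = np.sqrt(sales)

# Stand-in for model output on the transformed scale (assumed values);
# a model may still dip slightly below zero here
forecast_sqrt = np.array([2.5, -0.3, 1.0])

# Squaring maps the forecasts back to the sales scale and is never negative
forecast = np.square(forecast_sqrt)
print(forecast)
```

Squaring a negative value on the transformed scale maps it to a small positive one, so such forecasts are still worth inspecting rather than trusting blindly.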

Applying a function to every observation in a dataframe

I have a large df of coordinates that I'm putting through a function (a reverse geocoder).
How can I run through the whole df without iterating (which takes very long)?
Example df:
Latitude Longitude
0 -25.66026 28.0914
1 -25.67923 28.10525
2 -30.68456 19.21694
3 -30.12345 22.34256
4 -15.12546 17.12365
After running through the function I want (without a for loop...) a df:
City
0 HappyPlace
1 SadPlace
2 AveragePlace
3 CoolPlace
4 BadPlace
Note: I don't need to know how to do reverse geocoding; this is a question about applying a function to a whole df without iteration.
EDIT:
using df.apply() might not work as my code looks like this:
for i in range(len(df)):
    results = g.reverse_geocode(df['LATITUDE'][i], df['LONGITUDE'][i])
    city.append(results.city)

Slower approach: iterating through the list of geo points and fetching the city for each one in turn.
import pandas as pd
import time
d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)
# example method of g.reverse_geocode() -> geo_reverse
def geo_reverse(lat, long):
    time.sleep(2)
    #assuming that your reverse_geocode will take 2 second
    print(lat, long)
for i in range(len(df)):
    results = geo_reverse(df['Latitude'][i], df['Longitude'][i])
Because of the time.sleep(2), the above program will take at least 20 seconds to process all ten geo points.
Better approach than above:
import pandas as pd
import time
d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)
import threading
def runnable_method(f, args):
    result_info = [threading.Event(), None]
    def runit():
        result_info[1] = f(args)
        result_info[0].set()
    threading.Thread(target=runit).start()
    return result_info
def gather_results(result_infos):
    results = []
    for i in range(len(result_infos)):
        result_infos[i][0].wait()
        results.append(result_infos[i][1])
    return results
def geo_reverse(args):
    time.sleep(2)
    return "City Name of ("+str(args[0])+","+str(args[1])+")"
geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)
result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
cities_result = gather_results(result_info)
print(cities_result)
Notice that the method geo_reverse takes 2 seconds to fetch the data for a geo point. In this second example the calls run concurrently in separate threads, so the total time stays close to 2 seconds instead of growing by 2 seconds per point.
Note: Try both approaches, assuming that your geo_reverse takes approx. 2 seconds to fetch data. The first approach will take 20+1 seconds, and its processing time increases with the number of inputs, while the second approach has an almost constant processing time (approx. 2+1 seconds). In practice, the number of threads you can start and any rate limit on the geocoding service will put a bound on this.
Assume the g.reverse_geocode() method is geo_reverse() in the above code. Run both approaches separately and see the difference for yourself.
Explanation:
Take a look at the main part of the code above: it builds a list of tuples and then, in a list comprehension, passes each tuple to a dynamically created thread:
#Converting df of geo points into list of tuples
geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)
#List comprehension with custom methods and create run-able threads
result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
#gather result from each thread.
cities_result = gather_results(result_info)
print(cities_result)
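The same pattern can also be sketched with the standard library's concurrent.futures.ThreadPoolExecutor, which manages the threads and collects results in input order. This is an alternative to the hand-rolled Event-based helpers above, not what the answer itself uses. The geo_reverse body is a stand-in with a short sleep, and max_workers=8 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def geo_reverse(point):
    # Stand-in for the real reverse geocoder call (assumed to be I/O-bound)
    lat, lon = point
    time.sleep(0.1)
    return f"City Name of ({lat},{lon})"


# With a dataframe this list could be built as
# list(zip(df['Latitude'], df['Longitude']))
geo_points = [(-25.66026, 28.0914), (-25.67923, 28.10525), (-30.68456, 19.21694)]

# A bounded pool caps the number of simultaneous requests
with ThreadPoolExecutor(max_workers=8) as pool:
    # map returns results in the same order as the inputs
    cities_result = list(pool.map(geo_reverse, geo_points))

print(cities_result)
```

Capping the pool size keeps the number of simultaneous requests under control, which matters if the geocoding service throttles clients.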
