I have a Pandas data frame (called "ud_flex" below) that looks like the one below:
The data frame has over 27 million observations in it that I'm trying to iterate through to do a calculation for each row. Below is the calculation that I'm using:
def set_fpts(pos, rank, curr_fpts):
    if pos == "RB" and rank >= 3.0:
        return 0
    elif pos == "WR" and rank >= 4.0:
        return 0
    elif (pos == "TE" or pos == "QB") and rank >= 2.0:
        return 0
    else:
        return curr_fpts
Here is the for loop that I've created:
players = ud_flex.shape[0]
for i in range(0, players):
    new_fpts = set_fpts(ud_flex.iloc[i]['position_name'], ud_flex.iloc[i]['wk_rank_orig'], ud_flex.iloc[i]['fpts'])
    ud_flex.at[i, 'fpts_orig'] = new_fpts
Does anyone have any suggestions for how to speed up this loop? It's currently taking nearly an hour! Thanks!
You could start by restructuring the function so that it exits faster:
def set_fpts(pos, rank, curr_fpts):
    if rank >= 4:
        return 0
    if rank < 2:
        return curr_fpts
    if pos in ["TE", "QB"]:
        return 0
    if rank >= 3:
        if pos == "RB":
            return 0
    return curr_fpts
In general, iterating through pandas data frames is slow, so it's not surprising that your for-loop-based approach is taking a while.
I suspect that the following alternative should work more quickly for a data frame of your size.
mask = (((ud_flex['position_name']=="RB") & (ud_flex['wk_rank_orig']>=3))
        |((ud_flex['position_name']=="WR") & (ud_flex['wk_rank_orig']>=4))
        |((ud_flex['position_name'].isin(["TE","QB"])) & (ud_flex['wk_rank_orig']>=2)))
ud_flex.loc[mask, 'fpts_orig'] = 0
ud_flex.loc[~mask, 'fpts_orig'] = ud_flex['fpts']
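Equivalently, once the mask is built, NumPy's where collapses the two assignments into a single vectorized expression; a minimal sketch, assuming the same columns as above:

import numpy as np

# 0 wherever the mask is True, the existing fpts otherwise, in one pass.
ud_flex['fpts_orig'] = np.where(mask, 0, ud_flex['fpts'])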
I currently use something like the following bit of code to determine a comparison count:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
counter = 1
for number in list_of_numbers:
    if (high >= number) and (counter < lookback):
        counter += 1
    else:
        break
The resulting counter value will be 7. However, it is very taxing on large data arrays. So I looked for a solution and came up with np.argmax(), but there seems to be an issue. For example, the following:
import numpy as np

list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
np_list = np.array(list_of_numbers)
high = 29980.0
print(np.argmax(np_list > high) + 1)
This outputs 1, just as argmax is supposed to, but I want it to output 7. Is there another method that will give me output matching the if statement?
You can get a boolean array for where high >= number using NumPy:
import numpy as np

list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
boolean_arr = np.less_equal(np.array(list_of_numbers), high)
Then find the first False entry in that array; it corresponds to the break condition in your code. To account for the counter, apply np.cumsum to the boolean array and find the first index that reaches the specified lookback magnitude. The result will then be the smaller of break_arr and lookback_lim:
break_arr = np.where(boolean_arr == False)[0][0] + 1
lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
result = min(break_arr, lookback_lim)
If your list_of_numbers has no value bigger than your specified high limit (for break_arr), or the specified lookback exceeds every value in np.cumsum(boolean_arr) (for lookback_lim), the aforementioned code will fail with an error like the following, relating to np.where:
IndexError: index 0 is out of bounds for axis 0 with size 0
This can be handled with try-except or if statements, e.g.:
try:
    break_arr = np.where(boolean_arr == False)[0][0] + 1
except IndexError:
    break_arr = len(boolean_arr) + 1
try:
    lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
except IndexError:
    lookback_lim = len(boolean_arr) + 1
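Putting the pieces together, a minimal end-to-end sketch on the sample data, which should print 7, matching the loop:

import numpy as np

list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10

boolean_arr = np.less_equal(np.array(list_of_numbers), high)

try:
    break_arr = np.where(boolean_arr == False)[0][0] + 1
except IndexError:
    # No value exceeded high, so the break condition never fires.
    break_arr = len(boolean_arr) + 1
try:
    lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
except IndexError:
    # The running count never reached lookback.
    lookback_lim = len(boolean_arr) + 1

print(min(break_arr, lookback_lim))  # 7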
You have your less-than sign backwards, no? The following should work like the for loop:
print(np.min([np.sum(np.array(list_of_numbers) < high) + 1, lookback]))
A look back can be accomplished using shift, a cumcount can be used to get a running total, and a query can be used as a filter.
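A minimal sketch of those pieces on a toy frame (the column names are illustrative, not from the question; note that outside of a groupby, cumsum on a boolean column is the usual way to get a running total):

import pandas as pd

df = pd.DataFrame({'price': [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]})

# shift: look back one row (or n rows) to compare against an earlier value.
df['prev'] = df['price'].shift(1)

# cumsum over a boolean column: a running total of how often a condition held.
df['drops_so_far'] = (df['price'] < df['prev']).cumsum()

# query: filter to the rows where the condition holds.
print(df.query('price < prev'))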
I have a database of New York apartments which has thousands of rented apartments. What I'm trying to do is create another column called "pet_level". There are two other columns, 'dog_allowed' and 'cat_allowed', that have a 0 or 1 depending on whether the pet is allowed.
I'm looking to create the 'pet_level' column based on this:
0 if no pets are allowed
1 if cat_allowed
2 if dog_allowed
3 if both are allowed
My initial approach at solving this was as follows:
df['pet_level'] = df.apply(lambda x: plev(0 = x[x['dog_allowed'] == 0 & x['cat_allowed'] == 0] ,1 = x[x['cat_allowed'] == 1], 2 = x[x['dog_allowed'] == 1], 3 = x[x['dog_allowed'] == 1 & x['cat_allowed'] == 1]))
This was just because I've done smaller test datasets in a similar manner. I tried out a lambda function using the apply method, but that doesn't seem to allow for this.
The approach that is currently working is to define a function with the conditional statements needed:
def plvl(db):
    if db['cat_allowed'] == 0 and db['dog_allowed'] == 0:
        val = 0
    elif db['cat_allowed'] == 1 and db['dog_allowed'] == 0:
        val = 1
    elif db['cat_allowed'] == 0 and db['dog_allowed'] == 1:
        val = 2
    else:
        val = 3
    return val
Then pass that function to apply along the columns (axis=1) to create the desired column:
df['pet_level'] = df.apply(plvl, axis=1)
I'm not sure if this is the most performance-efficient way, but for testing purposes it currently works. I'm sure there are more Pythonic approaches that would be less demanding and equally helpful to know.
Instead of mapping, you can vectorize the operation like this:
df['pet_level'] = df['cat_allowed'] * 1 + df['dog_allowed'] * 2
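A quick check of that expression against the four possible flag combinations (a toy frame, assuming the 0/1 flags described in the question):

import pandas as pd

# The four possible flag combinations, to verify the mapping.
df = pd.DataFrame({'cat_allowed': [0, 1, 0, 1],
                   'dog_allowed': [0, 0, 1, 1]})
df['pet_level'] = df['cat_allowed'] * 1 + df['dog_allowed'] * 2
print(df['pet_level'].tolist())  # [0, 1, 2, 3]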
What is the most efficient way of selecting a value from a pandas data frame using a column name and row index (by that I mean row number)?
I have a case where I have to iterate through rows:
I have a working solution:
i = 0
while i < len(dataset) - 1:
    if dataset.target[i] == 1:
        dataset.sum_lost[i] = dataset['to_be_repaid_principal'][i] + dataset['to_be_repaid_interest'][i]
        dataset.ratio_lost[i] = dataset.sum_lost[i] / dataset['expected_returned_sum'][i]
    else:
        dataset.sum_lost[i] = 0
        dataset.ratio_lost[i] = 0
    i += 1
But this solution is very RAM-hungry. I am also getting the following warning:
"A value is trying to be set on a copy of a slice from a DataFrame."
So I am trying to come up with another one:
i = 0
while i < len(dataset) - 1:
    if dataset.iloc[i, :].loc['target'] == 1:
        dataset.iloc[i, :].loc['sum_lost'] = dataset.iloc[i, :].loc['to_be_repaid_principal'] + dataset.iloc[i, :].loc['to_be_repaid_interest']
        dataset.iloc[i, :].loc['ratio_lost'] = dataset.iloc[i, :].loc['sum_lost'] / dataset.iloc[i, :].loc['expected_returned_sum']
    else:
        dataset.iloc[i, :].loc['sum_lost'] = 0
        dataset.iloc[i, :].loc['ratio_lost'] = 0
    i += 1
But it does not work.
I would like to come up with a faster/less RAM-hungry solution, because this will actually be a web app that a few users could use simultaneously.
Thanks a lot.
If you are thinking about "looping through rows", you are not using pandas right. You should think in terms of columns instead.
Use np.where which is vectorized (read: fast):
cond = dataset['target'] == 1
dataset['sum_lost'] = np.where(cond, dataset['to_be_repaid_principal'] + dataset['to_be_repaid_interest'], 0)
dataset['ratio_lost'] = np.where(cond, dataset['sum_lost'] / dataset['expected_returned_sum'], 0)
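A minimal end-to-end sketch on toy data (the values are illustrative, assuming the column names from the question):

import numpy as np
import pandas as pd

# Toy data; the values are illustrative only.
dataset = pd.DataFrame({'target': [1, 0, 1],
                        'to_be_repaid_principal': [100.0, 50.0, 80.0],
                        'to_be_repaid_interest': [10.0, 5.0, 8.0],
                        'expected_returned_sum': [200.0, 100.0, 160.0]})

cond = dataset['target'] == 1
dataset['sum_lost'] = np.where(cond, dataset['to_be_repaid_principal'] + dataset['to_be_repaid_interest'], 0)
dataset['ratio_lost'] = np.where(cond, dataset['sum_lost'] / dataset['expected_returned_sum'], 0)
print(dataset[['sum_lost', 'ratio_lost']])  # sum_lost: 110, 0, 88; ratio_lost: 0.55, 0, 0.55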
I am currently using Python and NumPy for calculations of correlations between two lists, data_0 and data_1. Each list contains sorted times, t0 and t1 respectively.
I want to find all the events where 0 <= t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Doing so, as the lists are sorted, I am selecting a subarray of data_1 of the form data_1[index_min: index_max].
So in fact I need to find two indexes to get what I want.
And what's interesting is that when I go to the next time_0, as data_0 is also sorted, I just need to find the new index_min / index_max such that new_index_min >= index_min and new_index_max >= index_max, meaning that I don't need to scan all of data_1 again from scratch.
I have implemented such a solution without the NumPy methods (just with while loops), and it gives me the same results as before, but it is not as fast as before (15 times longer!).
I think there should be a way to make it faster using NumPy methods, as it normally requires less computation, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am super clear, so if you have any questions, do not hesitate.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np

def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
    split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1

def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
    return index_min, index_max

ref = np.sort(np.random.randint(0, 20000, (10000,)))
data = np.sort(np.random.randint(0, 20000, (10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))

print('checking')
for d, mn, mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d+span
    assert mx == 0 or ref[mx-1] <= d+span
print('ok')
It works by:
indirectly sorting both sets together,
finding, for each time in one set, the preceding time in the other (this is done using maximum.accumulate),
applying the preceding steps twice; the second time, the times in one set are shifted by span.
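For comparison, since both arrays are sorted, np.searchsorted can find the same index_min / index_max boundaries directly; a simpler sketch of the original task (variable names assumed from the question):

import numpy as np

data_0 = np.sort(np.random.randint(0, 20000, 10000)).astype(float)
data_1 = np.sort(np.random.randint(0, 20000, 10000)).astype(float)
time_max = 2.0

# For each t0 in data_0: the first index in data_1 with t1 >= t0 ...
index_min = np.searchsorted(data_1, data_0, side='left')
# ... and the first index with t1 >= t0 + time_max.
index_max = np.searchsorted(data_1, data_0 + time_max, side='left')

# Number of events with 0 <= t1 - t0 < time_max, per t0.
counts = index_max - index_min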
I'm trading daily on Cryptocurrencies and would like to find which are the most desirable Cryptos for trading.
I have a CSV file for every Crypto with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
How many quotes will end with profit:
for Buy positions - if one of the following Sells > current Buy;
for Sell positions - if one of the following Buys < current Sell.
How much time it would take for a theoretical position to become profitable.
What the profit potential could be.
I'm using the following code:
import datetime as dt
import pandas as pd

#converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)
def ole(oledt):
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))
#variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")
#comparing to max
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if df_slice["Sell"].max() - row["Buy"] > 0:
        max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
        max_profit_buy_counter += 1
    for index1, row1 in df_slice.iterrows():
        if row["Buy"] < row1["Sell"]:
            buy_time += ole(row1["Date"]) - ole(row["Date"])
            profit_buy_counter += 1
            break
    else:
        no_profit_buy_counter += 1
#comparing to sell
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if row["Sell"] - df_slice["Buy"].min() > 0:
        max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
        max_profit_sell_counter += 1
    for index2, row2 in df_slice.iterrows():
        if row["Sell"] > row2["Buy"]:
            sell_time += ole(row2["Date"]) - ole(row["Date"])
            profit_sell_counter += 1
            break
    else:
        no_profit_sell_counter += 1
num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows
if max_profit_buy_counter == 0:
    avg_max_profit_buy = "There are no profitable buy positions"
else:
    avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
    avg_max_profit_sell = "There are no profitable sell positions"
else:
    avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines, but for a larger amount (276K) it takes a long time (more than 10 hrs).
What can I do in order to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
Note - the dates in the CSV are in OLE format, so I need to convert them to datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
After comparing your function and mine, there is a slight difference, as your a is offset one from the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly shorter time:
for index, row in df.iterrows():
    a = index
    df_slice = df[a:]
    assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
    print("All assertions passed!")
Note this will still take the very long time required by your function. The offset itself could be fixed with shift, but I don't want to run your function for long enough to figure out which way to shift it.
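With the helper columns in place, the counting and averaging parts of the loop also collapse into vectorized expressions; a hedged sketch, inheriting the same one-row offset discussed above:

# Count positions that would end with profit, and average the potential profit.
profitable_buys = (df['Buy Profit'] > 0).sum()
profitable_sells = (df['Sell Profit'] > 0).sum()
avg_max_profit_buy = df.loc[df['Buy Profit'] > 0, 'Buy Profit'].mean()
avg_max_profit_sell = df.loc[df['Sell Profit'] > 0, 'Sell Profit'].mean()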