Apply() function based on column condition restarting instead of changing - python

I have the following apply function:
week_to_be_predicted = 15
df['raw_data'] = df.apply(lambda x: get_raw_data(x['boxscore']) if x['week_int']<week_to_be_predicted else 0,axis=1)
df['raw_data'] = df.apply(lambda x: get_initials(x['boxscore']) if x['week_int']==week_to_be_predicted else x['raw_data'],axis=1)
Where df['week_int'] is a column of integers starting from 0 and increasing to 18. If the row value for df['week_int'] < week_to_be_predicted (in this case 15) I want the function get_raw_data to be applied, otherwise I want the function get_initials to be applied.
My question is about troubleshooting the apply() function. After successfully applying get_raw_data to all rows where week_int < 14, instead of putting 0s in the remaining rows of df['raw_data'] (the else 0 branch), the "loop" restarts: it begins again from the first row of the dataframe and starts applying get_raw_data all over again, seemingly stuck in an infinite loop.
What's more confounding is that it does not always do this. The functions as written initially solved this same problem and have been working as intended for the past ~10 weeks, but now, all of a sudden, when I set week_to_be_predicted to 15, it is reverting to its old ways.
I'm wondering if this has something to do with the apply() function, the conditions inside the apply function, or both. It's difficult for me to troubleshoot, as the logic has worked in the past. I'm wondering if there is something about apply() that makes this a less than optimal approach, and if anybody knows what aspect might be causing the problem.
Thank you in advance.

Use a boolean mask:
import numpy as np
import pandas as pd

def get_raw_data(sr):
    return -sr

def get_initials(sr):
    return sr

week_to_be_predicted = 15
df = pd.DataFrame({'week_int': np.arange(0, 19),
                   'boxscore': np.random.random(19)})

m = df['week_int'] < week_to_be_predicted
df.loc[m, 'raw_data'] = get_raw_data(df.loc[m, 'boxscore'])
df.loc[~m, 'raw_data'] = get_initials(df.loc[~m, 'boxscore'])
Output:
>>> df
week_int boxscore raw_data
0 0 0.232081 -0.232081
1 1 0.890318 -0.890318
2 2 0.372760 -0.372760
3 3 0.697202 -0.697202
4 4 0.400200 -0.400200
5 5 0.793784 -0.793784
6 6 0.783359 -0.783359
7 7 0.898331 -0.898331
8 8 0.440433 -0.440433
9 9 0.415760 -0.415760
10 10 0.599502 -0.599502
11 11 0.941613 -0.941613
12 12 0.039865 -0.039865
13 13 0.820617 -0.820617
14 14 0.471396 -0.471396
15 15 0.794547 0.794547
16 16 0.682332 0.682332
17 17 0.638694 0.638694
18 18 0.761995 0.761995
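As a side note (a sketch, not part of the original answer): if both helper functions can operate on the whole boxscore column at once, np.where builds the column in a single pass. Keep in mind that np.where evaluates both branches for every row, so this only makes sense when both calls are cheap or vectorized.
import numpy as np
# Using the mask and the stand-in functions defined above;
# np.where computes both branches for the full column and picks per row.
m = df['week_int'] < week_to_be_predicted
df['raw_data'] = np.where(m,
                          get_raw_data(df['boxscore']),
                          get_initials(df['boxscore']))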

Related

Pandas: How to round the values which are closest to the whole number for the values more than 1?

Following is the dataframe. I would like to round the values in 'period' which are closest to whole numbers. For example: 1.005479452 rounded to 1.0000, 2.002739726 rounded to 2.0000, 3.002739726 rounded to 3.0000, 5.005479452 rounded to 5.0000, 12.01369863 rounded to 12.0000, and so on. I have a big list. I am doing this because in a later program I have to concatenate this dataframe with other dataframes based on the 'period' column.
df =
period          rate
0.931506849 -0.001469
0.994520548 0.008677
1.005479452 0.11741125
1.008219178 0.073975
1.010958904 0.147474833
1.994520548 -0.007189219
2.002739726 0.1160815
2.005479452 0.06995
2.008219178 0.026808
2.010958904 0.1200695
2.980821918 -0.007745727
3.002739726 0.192208333
3.010958904 0.119895833
3.019178082 0.151857267
3.021917808 0.016165
3.863013699 0.005405321
4 0.06815
4.002739726 0.1240695
4.016438356 0.2410323
4.019178082 0.0459375
4.021917808 0.03161
4.997260274 0.0682
5.005479452 0.1249955
5.01369863 0.03260875
5.016438356 0.238069083
5.019178082 0.04590625
5.021917808 0.0120625
12.01369863 0.136991
12.01643836 0.053327917
12.01917808 0.2309365
I am trying to do something like below but couldn't move further.
df['period'] = np.where(df.period>1, df.period.round(), df.period.round(decimals = 4))
You can apply a lambda function. This one checks if the value is greater than one before rounding to a whole number, otherwise rounding to 4 decimal places for values of one or less. I think that's what you seem to want?
df['period'] = df['period'].apply(lambda x: round(x, 0) if x > 1 else round(x, 4))
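If you only want to snap the values that sit close to a whole number (as the examples in the question suggest) and leave everything else as-is, a tolerance check is one option. This is a sketch, not part of the original answer, and the 0.05 tolerance is an assumed threshold you would tune to your data:
import numpy as np
tol = 0.05  # assumed tolerance for "close to a whole number"
near_whole = np.isclose(df['period'], df['period'].round(), atol=tol)
df['period'] = np.where(near_whole & (df['period'] > 1),
                        df['period'].round(),
                        df['period'].round(4))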
I built a function that basically iterates from 1 to whatever the max whole value should be in the dataframe. This should be faster than a solution that just iterates row-by-row, though it does assume that the dataframe is sorted (like in your example).
import pandas as pd
df = pd.DataFrame(
{
"period": [0.931506849, 0.994520548, 1.005479452, 1.008219178, 1.010958904, 1.994520548, 2.002739726, 2.005479452, 2.008219178, 2.010958904, 2.980821918, 3.002739726, 3.010958904, 3.019178082, 3.021917808, 3.863013699, 4, 4.002739726, 4.016438356, 4.019178082, 4.021917808, 4.997260274, 5.005479452, 5.01369863, 5.016438356, 5.019178082, 5.021917808, 12.01369863, 12.01643836, 12.01917808]
}
)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.005479
3 1.008219
4 1.010959
"""
def process_df(df: pd.DataFrame) -> pd.DataFrame:
    df_range_vals = [round(period) for period in df['period'].tolist()]
    out_df = df.loc[df['period'] < 1]
    for base in range(1, max(df_range_vals) + 1):
        # only keep the rows in the [base, base + 1) range
        temp_df = df.loc[(df['period'] >= base) & (df['period'] < base + 1)].copy()
        # if there's nothing to change, then just skip
        if temp_df.empty:
            continue
        # snap the first value of the bucket to the whole number
        temp_df.loc[temp_df.first_valid_index(), 'period'] = temp_df.loc[temp_df.first_valid_index(), 'period'].round(0)
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        out_df = pd.concat([out_df, temp_df], ignore_index=True)
    return out_df
df = process_df(df)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.000000
3 1.008219
4 1.010959
"""
Try:
# Sort so that we know what is closest to a whole number
df = df.sort_values(by=['period'])
# Create a new column and round everything. This is done to
# partition effectively
df['round_period'] = df['period'].round()
df_of_values_close_to_whole_number = list(df.groupby('round_period').tail(1)['period'])
def round_func(x, df_of_val_close_to_whole_number):
    return '{:.5f}'.format(round(x)) if x in df_of_val_close_to_whole_number and x > 1 else x
# Apply round only to values closer to a whole number.
df['period'].apply(round_func, args=(df_of_values_close_to_whole_number,))
Output
0 0.931507
1 0.994521
2 1.00548
3 1.00822
4 1.00000
5 1.99452
6 2.00274
7 2.00548
8 2.00822
9 2.00000
10 2.98082
11 3.00274
12 3.01096
13 3.01918
14 3.00000
15 3.86301
16 4
17 4.00274
18 4.01644
19 4.01918
20 4.00000
21 4.99726
22 5.00548
23 5.0137
24 5.01644
25 5.01918
26 5.00000
27 12.0137
28 12.0164
29 12.00000
Name: period, dtype: object

how to construct an index from percentage change time series?

consider the values below
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective through an iterative (loop) solution, but that may not be practical if the data depth and breadth is large. Secondly, is there a way in which I can get this done in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through the relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
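For the second part of the question (doing this on several columns in one step), the same idea carries over to a whole DataFrame, because dividing by the first row broadcasts column-wise. A small sketch, using array1 and the imports from the question; the two columns are made up for illustration:
prices = pd.DataFrame({'a': array1, 'b': array1[::-1]})  # hypothetical example columns
index_df = prices.div(prices.iloc[0]).mul(100)            # each column rebased to 100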
As far as I know, there is still no off-the-shelf expanding-window version of pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64

Python Pandas Running Totals with Resets

I would like to perform the following task. Given 2 columns (good and bad), I would like to replace certain rows of the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned (in this case 20 bins) variable using a continuous variable as the input. I know the pandas cut and qcut functions are available, however the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or denominator will not allow the mathematical calculations to work.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, any time I encounter a zero in either column, I need to keep a running total of both columns until I reach the next row that has a non-zero value in the column that contained the zero.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to create a column from a cumsum over non-zero values in order to get the zero values and the next non-zero value into one group. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats the edge case of trailing zeros differently from your desired output: it simply cuts them off. You'd have to add some extra code to catch that one with a different logic.
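One minimal way to handle that tail (a sketch, not from the original answer; it assumes the only rows lost are the trailing ones containing a zero) is to fold the missing totals back into the last bin of new_df:
# Add whatever the trailing zero rows contributed back onto the last bin.
for col in ['good', 'bad']:
    new_df.loc[new_df.index[-1], col] += df[col].sum() - new_df[col].sum()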
P.Tillmann, I appreciate your assistance with this. More advanced readers will, I assume, find this code appalling, as I do. I would be more than happy to take any recommendation which makes this more streamlined.
import pandas as pd

d={'AAA':range(0,20),
   'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
   'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)

row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()
for index, row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    if ((row['good']==0 or row['bad']==0)
            and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
            and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
        crappy_fix=pd.concat([crappy_fix, pd.DataFrame([temp_dict])], ignore_index=True)
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
print(crappy_fix)

how to create new column based on multiple columns with a function

This question is a follow-up to my question about linear interpolation between two data points.
I built following function from it:
def inter(colA, colB):
    s = pd.Series([colA, np.nan, colB], index=[95, 100, 102.5])
    s = s.interpolate(method='index')
    return s.iloc[1]
Now I have a data frame that looks like this:
on95 on102.5 on105
Index
1 5 17 20
2 7 15 25
3 6 16 23
I would like to create a new column df['new'] that uses the function inter with inputs of on95 and on102.5
I tried like this:
df['new'] = inter(df['on95'],df['on102.5'])
but this resulted in NaNs.
I also tried with apply(inter) but did not find a way to make it work without an error message.
Any hints how I can solve this?
You need to vectorize your self-defined function with np.vectorize, since otherwise the function parameters are passed in as whole pandas Series:
inter = np.vectorize(inter)
df['new'] = inter(df['on95'],df['on102.5'])
df
# on95 on102.5 on105 new
#Index
# 1 5 17 20 13.000000
# 2 7 15 25 12.333333
# 3 6 16 23 12.666667
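For completeness (a sketch, not from the original answer): the function can also be called row by row with apply along axis=1, which avoids np.vectorize but is typically no faster, since both loop in Python:
# Each row arrives as a Series, so inter() receives plain scalars.
df['new'] = df.apply(lambda r: inter(r['on95'], r['on102.5']), axis=1)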

scale numerical values for different groups in python

I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: when I talk about scaling, I am referring to this metric:
(x-group_mean)/group_std
Dataset (for demonstration the ideas) for example:
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desirable results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python? I used the scale function defined there and want to apply it, in this fashion:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
In my original dataset the number of rows is 15770, but I don't think that in my case the scale function should map a single value to more than one result.
I would appreciate it if you could give me some sample code or suggestions on how to modify it, thanks!
First, np.std behaves differently than most other languages in that its delta degrees of freedom (ddof) defaults to 0. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its standard deviation is not meaningful (0 with ddof=0, nan with ddof=1), so the scaled value comes out as nan. Check if you get nan for this reason. R would return nan in this case as well.
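A quick way to check whether that is what's happening (a sketch, not part of the original answer) is to look at the group sizes before scaling:
# Rows belonging to single-member groups will scale to nan.
group_sizes = df.groupby('advertiser_id')['value'].transform('size')
print(df[group_sizes == 1])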
