Pandas Apply and Loc - efficiency and indexing

Pandas Apply and Loc - efficiency and indexing - python

I want to find the first value after each row that meets a certain criteria. So for example I want to find the first rate/value (not necessarily the first row after) after the current row that increased 5%. The added column would be the last 'first5percentIncrease' and would be the index (and/or value) of the first row (after current row) that had a 5% increase. Notice how each could not be lower than the current row's index.
amount date rate total type first5percentIncreaseValue first5percentIncreaseIndex
9248 0.05745868 2018-01-22 06:11:36 10 0.00099984 buy 10.5 9341
9249 1.14869147 2018-01-22 06:08:38 20 0.01998989 buy 21 9421
9250 0.16498080 2018-01-22 06:02:59 15 0.00286241 sell 15.75 9266
9251 0.02881844 2018-01-22 06:01:54 2 0.00049999 sell 2.1 10911
I tried using loc to apply() this to each row. The output takes at least 10 seconds for only about 9k rows. This does the job (I get a list of all values 5% higher than the given row) but is there a more efficient way to do this? Also I'd like to get only the first value but when I take do this I think it's starting from the first row. Is there a way to start .locs search from the current row so then I could just take the first value?
coin_trade_history_df['rate'].apply(
lambda y: coin_trade_history_df['rate'].loc[coin_trade_history_df['rate'].apply(
lambda x: y >= x + (x*.005))])
0 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
1 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
2 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
3 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
4 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
Name: rate, dtype: object
Further clarification Peter Leimbigler said it better than me:
Oh, I think I get it now! "For each row, scan downward and get the first row you encounter that shows an increase of at least 5%," right? I'll edit my answer :) – Peter Leimbigler

Here's an approach to the specific example of labeling each row with the index of the next available row that shows an increase of at least 5%.
# Example data
df = pd.DataFrame({'rate': [100, 105, 99, 110, 130, 120, 98]})
# Series.shift(n) moves elements n places forward = down. We use
# it here in the denominator in order to compare each change with
# the initial value, rather than the final value.
mask = df.rate.diff()/df.rate.shift() >= 0.05
df.loc[mask, 'next_big_change_idx'] = df[mask].index
df.next_big_change_idx = df.next_big_change_idx.bfill().shift(-1)
# output
df
rate next_big_change_idx
0 100 1.0
1 105 3.0
2 99 3.0
3 110 4.0
4 130 NaN
5 120 NaN
6 98 NaN

Peter's answer was much faster but it only looked at the immediate next row. I wanted it to perform this on every row. Below is what I ended up with - not very fast but it goes through each row and returns the first value (or last value in my case since my time series was descending) that satisfied my criteria (increasing 5%).
def test_rows(x):
return trade_history_df['rate'].loc[
trade_history_df['rate'] >= x['rate'] + (x['rate'] * .05)].loc[
trade_history_df['date'] > x['date']].last_valid_index()
test1 = trade_history_df[['rate','date']].apply(test_rows,axis = 1)

Related

Pandas: Optimal subtract every nth row

I'm writing a function for a special case of row-wise subtraction in pandas.
First the user should be able to specify rows either by regex (i.e. "_BL[0-9]+") or by regular index i.e every 6th row
Then we must subtract every matching row from rows preceding it, but not past another match
[Optionally] Drop selected rows
Column to match on should be user-defined by either index or label
For example if:
Samples
var1
var1
something
10
20
something
20
30
something
40
30
some_BL20_thing
100
100
something
50
70
something
90
100
some_BL10_thing
100
10
Expected output should be:
Samples
var1
var1
something
-90
-80
something
-80
-70
something
-60
-70
something
-50
60
something
-10
90
My current (incomplete) implementation relies heavily on looping:
def subtract_blanks(data:pd.DataFrame, num_samples:int)->pd.DataFrame:
'''
Accepts a data dataframe and a mod int and
subtracts each blank from all mod preceding samples
'''
expr = compile(r'(_BL[0-9]{1})')
output = data.copy(deep = True)
for idx,row in output.iterrows():
if search(expr,row['Sample']):
for i in range(1,num_samples+1):
output.iloc[idx-i,data_start:] = output.iloc[idx-i,6:]-row.iloc[6:]
return output
Is there a better way of doing this? This implementation seems pretty ugly. I've also considered maybe splitting the DataFrame to chucks and operating on them instead.

Code
# Create boolean mask for matching rows
# m = np.arange(len(df)) % 6 == 5 # for index match
m = df['Samples'].str.contains(r'_BL\d+') # for regex match
# mask the values and backfill to propagate the row
# values corresponding to match in backward direction
df['var1'] = df['var1'] - df['var1'].mask(~m).bfill()
# Delete the matching rows
df = df[~m].copy()
Samples var1 var1
0 something -90.0 -80.0
1 something -80.0 -70.0
2 something -60.0 -70.0
4 something -50.0 60.0
5 something -10.0 90.0
Note: The core logic is specified in the code so I'll leave the function implementation upto the OP.

Apply multiple condition groupby + sort + sum to pandas dataframe rows

I have a dataframe that has the following columns:
Acct Num, Correspondence Date, Open Date
For each opened account, I am being asked to look back at all the correspondences that happened within
30 days of opendate of that account, then assigning points as following to the correspondences:
Forty-twenty-forty: Attribute 40% (0.4 points) of the attribution to the first touch,
40% to the last touch, and divide the remaining 20% between all touches in between
So I know apply and group by functions, but this is beyond my paygrade.
I have to group by account, with conditional based on comparison of 2 columns against eachother,
I have to do that to get a total number of correspondences, and I guess they have to be sorted as well, as the following step of assigning points to correspondences depends on the order in which they occurred.
I would like to do this efficiently, as I have a ton of rows, I know apply() can go fast, but I am pretty bad at applying it when the row-level operation I am trying to do gets even a little complex.
I appreciate any help, as I am not good at pandas.
EDIT
as per request
Acct, ContactDate, OpenDate, Points (what I need to calculate)
123, 1/1/2018, 1/1/2021, 0 (because correspondance not within 30 days of open)
123, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
123, 12/11/2020, 1/1/2021, 0.2 (other 'touches' get 0.2/(num of touches-2) 'points')
123, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
456, 1/1/2018, 1/1/2021, 0 (again, because correspondance not within 30 days of open)
456, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)

This returns a reduced dataframe in that it excludes timeframes exceeding 30 days and then merges the original df into it get all the data in one df. This assumes your date sorting is correct, otherwise, you may have to do that upfront before applying the function below.
df['Points'] = 0 #add column to dataframe before analysis
#df.columns
#Index(['Acct', 'ContactDate', 'OpenDate', 'Points'], dtype='object')
def points(x):
newx = x.loc[(x['OpenDate'] - x['ContactDate']) <= timedelta(days=30)] # reduce for wide > 30 days
# print(newx.Acct)
if newx.Acct.count() > 2: # check more than two dates exist
newx['Points'].iloc[0] = .4 # first row
newx['Points'].iloc[-1] = .4 # last row
newx['Points'].iloc[1:-1] = .2 / newx['Points'].iloc[1:-1].count() # middle rows / by count of those rows
return newx
elif newx.Acct.count() == 2: # placeholder for later
#edge case logic here for two occurences
return newx
elif newx.Acct.count() == 1: # placeholder for later
#edge case logic here one onccurence
return newx
# groupby Acct then clean up the indices so it can be merged back into original df
dft = df.groupby('Acct', as_index=False).apply(points).reset_index().set_index('level_1').drop('level_0', axis=1)
# merge on index
df_points = df[['Acct', 'ContactDate', 'OpenDate']].merge(dft['Points'], how='left', left_index=True, right_index=True).fillna(0)
Output:
Acct ContactDate OpenDate Points
0 123 2018-01-01 2021-01-01 0.0
1 123 2020-12-10 2021-01-01 0.4
2 123 2020-12-11 2021-01-01 0.2
3 123 2020-12-12 2021-01-01 0.4
4 456 2018-01-01 2021-01-01 0.0
5 456 2020-12-10 2021-01-01 0.4
6 456 2020-12-11 2021-01-01 0.1
7 456 2020-12-11 2021-01-01 0.1
8 456 2020-12-12 2021-01-01 0.4

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results where I have a columns to represent the date a test was completed and number of hours it took to complete the test i.e.
df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]})
I have a months worth of test data and the tests can take anywhere from a couple of hours to a couple of days. I want to try and work out, for each day, what percentage of tests fall within the ranges of 24hrs/48hrs/72hrs ect. to complete, up to the percentage of tests that took longer than a week.
I've been able to work it out generally without taking the dates into account like so:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] >168
one = Lab_tests['1-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]
print(one/10407 * 100)
print(two/10407 * 100)
print(eight/10407 * 100)
Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the data ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?
This question, Counting qualitative values based on the date range in Pandas is quite similar but the fact that I'm counting the occurrences in specified ranges is throwing me off and I haven't been able to get a solution out of it.
Bonus Question
I'm sure you've noticed my current code is not the most elegant thing in the world, is the a cleaner way to do what I've done above, as I'm doing that for every data range that I want?
Edit:
So the Output for the sample data given would look like so:
df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])

You're almost there. You just need to do a few final steps:
First, cast your bools to ints, so that you can sum them.
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Completed hours one-day two-day GreaterThanWeek
0 21/03/2020 23 1 0 0
1 22/03/2020 32 0 1 0
2 21/03/2020 8 1 0 0
3 24/03/2020 73 0 0 0
4 24/03/2020 41 0 1 0
Then, drop the hours column and roll the rest up to the level of Completed:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
one-day two-day GreaterThanWeek
Completed
21/03/2020 2 0 0
22/03/2020 0 1 0
24/03/2020 0 1 0
EDIT: To get to percent, you just need to divide each column by the sum of all three. You can sum columns by defining the axis of the sum:
...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)
Completed
21/03/2020 2
22/03/2020 1
24/03/2020 1
dtype: int64
Then divide the daily totals dataframe by the column-wise sum of the daily totals (again, we use axis to define whether each value of the series will be the divisor for a row or a column.):
daily_totals.div(daily_totals.sum(axis=1), axis=0)
one-day two-day GreaterThanWeek
Completed
21/03/2020 1.0 0.0 0.0
22/03/2020 0.0 1.0 0.0
24/03/2020 0.0 1.0 0.0

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes, it skips a second or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the amount of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the amount of data points becomes less. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?

Use merge with complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487

First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time

Python Pandas Running Totals with Resets

I would like to perform the following task. Given a 2 columns (good and bad) I would like to replace any rows for the two columns with a running total. Here is an example of the current dataframe along with the desired data frame.
EDIT: I should have added what my intentions are. I am trying to create equally binned (in this case 20) variable using a continuous variable as the input. I know the pandas cut and qcut functions are available, however the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or denominator will not allow the mathematical calculations to work.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, anytime I encounter a zero for either column, I need to use a running total for the column which is not zero to the next row which has a non-zero value for the column that contained zeros.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)

The basic idea of my solution is to create a column from a cumsum over non-zero values in order to get the zero values with the next non zero value into one group. Then you can use groupby + sum to get your the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your etch case of trailing zeros different from your desired output, it simple cuts it off. You'd have to add some extra code to catch that one with a different logic.

P.Tillmann. I appreciate your assistance with this. For the more advanced readers I would assume you to find this code appalling, as I do. I would be more than happy to take any recommendation which makes this more streamlined.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()
for index,row in df.iterrows():
if row['good']==0 or row['bad']==0:
row_bad += row['bad']
row_good += row['good']
row_bad_zero_count += 1
row_good_zero_count += 1
output_ind='1'
row_out='NO'
elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
row_bad=row['bad']
row_good=row['good']
output_ind='2'
row_out='NO'
elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
row_bad += row['bad']
row_good += row['good']
row_bad_zero_count=0
row_good_zero_count=0
row_out='YES'
output_ind='3'
else:
row_bad=row['bad']
row_good=row['good']
row_bad_zero_count=0
row_good_zero_count=0
row_out='YES'
output_ind='4'
if ((row['good']==0 or row['bad']==0)
and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
and row_good != 0 and row_bad != 0):
row_out='YES'
if row_out=='YES':
temp_dict={'AAA':row['AAA'],
'good':row_good,
'bad':row_bad}
crappy_fix=crappy_fix.append([temp_dict],ignore_index=True)
print(str(row['AAA']),'-',
str(row['good']),'-',
str(row['bad']),'-',
str(row_good),'-',
str(row_bad),'-',
str(row_good_zero_count),'-',
str(row_bad_zero_count),'-',
row_out,'-',
output_ind)
print(crappy_fix)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas Apply and Loc - efficiency and indexing - python

Related

Pandas: Optimal subtract every nth row

Apply multiple condition groupby + sort + sum to pandas dataframe rows

Get the daily percentages of values that fall within certain ranges

Inserting missing numbers in dataframe

Python Pandas Running Totals with Resets

Categories

Resources