Pandas: Optimal subtract every nth row - python

I'm writing a function for a special case of row-wise subtraction in pandas.
First, the user should be able to specify rows either by regex (e.g. "_BL[0-9]+") or by regular interval, i.e. every 6th row
Then we must subtract each matching row from the rows preceding it, but not past another match
[Optionally] Drop the selected rows
The column to match on should be user-defined, by either index or label
For example if:
Samples          var1  var1
something          10    20
something          20    30
something          40    30
some_BL20_thing   100   100
something          50    70
something          90   100
some_BL10_thing   100    10
Expected output should be:
Samples     var1  var1
something    -90   -80
something    -80   -70
something    -60   -70
something    -50    60
something    -10    90
My current (incomplete) implementation relies heavily on looping:
from re import compile, search

def subtract_blanks(data: pd.DataFrame, num_samples: int) -> pd.DataFrame:
    '''
    Accepts a data DataFrame and a mod int, and subtracts
    each blank from all mod preceding samples
    '''
    expr = compile(r'(_BL[0-9]{1})')
    output = data.copy(deep=True)
    for idx, row in output.iterrows():
        if search(expr, row['Samples']):
            for i in range(1, num_samples + 1):
                # data_start: index of the first numeric column (defined elsewhere)
                output.iloc[idx - i, data_start:] = output.iloc[idx - i, data_start:] - row.iloc[data_start:]
    return output
Is there a better way of doing this? This implementation seems pretty ugly. I've also considered splitting the DataFrame into chunks and operating on them instead.

Code
# Create boolean mask for matching rows
# m = np.arange(len(df)) % 6 == 5 # for index match
m = df['Samples'].str.contains(r'_BL\d+') # for regex match
# mask the values and backfill to propagate the row
# values corresponding to match in backward direction
df['var1'] = df['var1'] - df['var1'].mask(~m).bfill()
# Delete the matching rows
df = df[~m].copy()
Samples var1 var1
0 something -90.0 -80.0
1 something -80.0 -70.0
2 something -60.0 -70.0
4 something -50.0 60.0
5 something -10.0 90.0
Note: The core logic is specified in the code so I'll leave the function implementation up to the OP.
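For completeness, a sketch of how the masked-bfill logic above could be wrapped into the requested function. The name subtract_matches and the var2 column label are my own (the question's frame has two columns both labelled var1, which works the same way):

```python
import numpy as np
import pandas as pd

def subtract_matches(df, match=r'_BL\d+', col='Samples', drop=True):
    """Subtract each matching row from the rows above it, back to
    (but not past) the previous match.  `match` is either a regex
    tested against `col`, or an int n meaning every nth row."""
    if isinstance(match, int):
        m = (np.arange(len(df)) % match == match - 1)
    else:
        m = df[col].astype(str).str.contains(match).to_numpy()
    vals = df.drop(columns=col).astype(float)
    blanks = vals.copy()
    blanks[~m] = np.nan          # keep only the matching rows...
    blanks = blanks.bfill()      # ...and propagate each one upward
    out = vals - blanks
    out.insert(0, col, df[col])
    return out[~m] if drop else out
```

On the sample frame from the question this reproduces the expected output; passing e.g. match=6 switches to the every-6th-row mode.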

Related

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results with a column for the date a test was completed and one for the number of hours it took to complete, i.e.
df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]})
I have a month's worth of test data and the tests can take anywhere from a couple of hours to a couple of days. I want to work out, for each day, what percentage of tests fall within the ranges of 24hrs/48hrs/72hrs etc. to complete, up to the percentage of tests that took longer than a week.
I've been able to work it out generally without taking the dates into account like so:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] >168
one = Lab_tests['one-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]
print(one/10407 * 100)
print(two/10407 * 100)
print(eight/10407 * 100)
Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the data ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?
This question, Counting qualitative values based on the date range in Pandas is quite similar but the fact that I'm counting the occurrences in specified ranges is throwing me off and I haven't been able to get a solution out of it.
Bonus Question
I'm sure you've noticed my current code is not the most elegant thing in the world. Is there a cleaner way to do what I've done above, as I'm doing that for every data range that I want?
Edit:
So the Output for the sample data given would look like so:
df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])
You're almost there. You just need to do a few final steps:
First, cast your bools to ints, so that you can sum them.
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Completed hours one-day two-day GreaterThanWeek
0 21/03/2020 23 1 0 0
1 22/03/2020 32 0 1 0
2 21/03/2020 8 1 0 0
3 24/03/2020 73 0 0 0
4 24/03/2020 41 0 1 0
Then, drop the hours column and roll the rest up to the level of Completed:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
one-day two-day GreaterThanWeek
Completed
21/03/2020 2 0 0
22/03/2020 0 1 0
24/03/2020 0 1 0
EDIT: To get to percent, you just need to divide each column by the sum of all three. You can sum columns by defining the axis of the sum:
...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)
Completed
21/03/2020 2
22/03/2020 1
24/03/2020 1
dtype: int64
Then divide the daily totals dataframe by the per-row sum of the daily totals (again, we use axis to define whether each value of the series will be the divisor for a row or a column):
daily_totals.div(daily_totals.sum(axis=1), axis=0)
one-day two-day GreaterThanWeek
Completed
21/03/2020 1.0 0.0 0.0
22/03/2020 0.0 1.0 0.0
24/03/2020 0.0 1.0 0.0
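For the bonus question, one way to collapse the whole pipeline is pd.cut plus pd.crosstab; the bin edges and labels below are my own choices. A side benefit: cut's half-open bins mean a test of exactly 24 hours lands in only one bucket, whereas chained between(0,24)/between(24,48) counts it twice.

```python
import pandas as pd

df = pd.DataFrame({'Completed': ['21/03/2020', '22/03/2020', '21/03/2020', '24/03/2020', '24/03/2020'],
                   'Hours_taken': [23, 32, 8, 73, 41]})

# Bucket the hours into day-wide, mutually exclusive ranges
bins = [0, 24, 48, 72, 96, 120, 144, 168, float('inf')]
labels = ['1-day', '2-day', '3-day', '4-day', '5-day', '6-day', '7-day', 'GreaterThanWeek']
ranges = pd.cut(df['Hours_taken'], bins=bins, labels=labels)

# Cross-tabulate date vs. range; normalize='index' turns counts into row-wise fractions
pct = pd.crosstab(df['Completed'], ranges, normalize='index') * 100
```

For the sample data, 21/03/2020 comes out as 100% 1-day and 24/03/2020 as 50% 2-day / 50% 4-day, matching the expected output in the edit.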

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the number of data points decreases. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
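Applied to the integer-seconds training set from the question, the reindex-and-interpolate recipe from the first answer reduces to a few lines (column names taken from the question's sample):

```python
import numpy as np
import pandas as pd

temps = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                      'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})

# Reindex onto the complete range of seconds, then fill the gaps linearly
full = temps.set_index('time').reindex(np.arange(0, 101)).reset_index()
full['temp'] = full['temp'].interpolate()
```

The missing second at time = 3 becomes the average of its neighbours (20.15), and the 96 missing seconds between time = 4 and time = 100 rise steadily from 20.10 to 22.30, exactly the behaviour described in the question.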

Pandas Apply and Loc - efficiency and indexing

I want to find, for each row, the first value after it that meets a certain criterion. For example, I want to find the first rate/value after the current row (not necessarily the immediately following row) that increased 5%. The added columns 'first5percentIncreaseValue' and 'first5percentIncreaseIndex' would hold the value and index of the first such row. Notice how each index can never be lower than the current row's index.
amount date rate total type first5percentIncreaseValue first5percentIncreaseIndex
9248 0.05745868 2018-01-22 06:11:36 10 0.00099984 buy 10.5 9341
9249 1.14869147 2018-01-22 06:08:38 20 0.01998989 buy 21 9421
9250 0.16498080 2018-01-22 06:02:59 15 0.00286241 sell 15.75 9266
9251 0.02881844 2018-01-22 06:01:54 2 0.00049999 sell 2.1 10911
I tried using loc to apply() this to each row. The output takes at least 10 seconds for only about 9k rows. This does the job (I get a list of all values 5% higher than the given row) but is there a more efficient way to do this? Also I'd like to get only the first value, but when I do this I think it's starting from the first row. Is there a way to start .loc's search from the current row, so I could just take the first value?
coin_trade_history_df['rate'].apply(
    lambda y: coin_trade_history_df['rate'].loc[coin_trade_history_df['rate'].apply(
        lambda x: y >= x + (x * .005))])
0 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
1 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
2 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
3 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
4 [0.01387146, 0.01387146, 0.01387148, 0.0138714...
Name: rate, dtype: object
Further clarification: Peter Leimbigler said it better than I did:
Oh, I think I get it now! "For each row, scan downward and get the first row you encounter that shows an increase of at least 5%," right? I'll edit my answer :) – Peter Leimbigler
Here's an approach to the specific example of labeling each row with the index of the next available row that shows an increase of at least 5%.
# Example data
df = pd.DataFrame({'rate': [100, 105, 99, 110, 130, 120, 98]})
# Series.shift(n) moves elements n places forward = down. We use
# it here in the denominator in order to compare each change with
# the initial value, rather than the final value.
mask = df.rate.diff()/df.rate.shift() >= 0.05
df.loc[mask, 'next_big_change_idx'] = df[mask].index
df.next_big_change_idx = df.next_big_change_idx.bfill().shift(-1)
# output
df
rate next_big_change_idx
0 100 1.0
1 105 3.0
2 99 3.0
3 110 4.0
4 130 NaN
5 120 NaN
6 98 NaN
Peter's answer was much faster but it only looked at the immediate next row. I wanted it to perform this on every row. Below is what I ended up with - not very fast but it goes through each row and returns the first value (or last value in my case since my time series was descending) that satisfied my criteria (increasing 5%).
def test_rows(x):
    return trade_history_df['rate'].loc[
        trade_history_df['rate'] >= x['rate'] + (x['rate'] * .05)].loc[
        trade_history_df['date'] > x['date']].last_valid_index()

test1 = trade_history_df[['rate', 'date']].apply(test_rows, axis=1)
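For the record, a fully vectorised alternative is possible with NumPy broadcasting, at the cost of an n-by-n boolean matrix (a few hundred MB at the ~9k rows mentioned, so workable here but not for much larger frames). A sketch on Peter's example data, scanning forward for the first row at least 5% above the current one:

```python
import numpy as np
import pandas as pd

rates = pd.Series([100, 105, 99, 110, 130, 120, 98])
r = rates.to_numpy()
n = len(r)

# later[i, j] is True where j comes after i
later = np.triu(np.ones((n, n), dtype=bool), k=1)
# hits[i, j] is True where r[j] is at least a 5% increase over r[i]
hits = later & (r[None, :] >= 1.05 * r[:, None])
# first True per row, or -1 if no later row ever increases 5%
first = np.where(hits.any(axis=1), hits.argmax(axis=1), -1)
```

Unlike the consecutive-diff answer, this finds the first qualifying row anywhere after the current one, which is what the follow-up asked for.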

Replace value in a pandas dataframe column by the previous one

My code detects outliers in a time series. What I want to do is replace the outliers in the first dataframe column with the previous value which is not an outlier.
This code just detect outliers, creating a boolean array where:
True means that a value in the dataframe is an outlier
False means that a value in the dataframe is not an outlier
series = read_csv('horario_completo.csv', header=None, squeeze=True)
df = pd.DataFrame(series)
consumos = df.iloc[:, 0]
# pandas.rolling_median was removed in newer pandas; use the .rolling accessor
df['rolling_median'] = consumos.rolling(window=48, center=True).median().bfill().ffill()
threshold =50
difference = np.abs(consumos - df['rolling_median'])
outlier = difference > threshold
Up to this point, everything works.
The next step I have thought of is to create a mask to replace the True values with the previous value of the same column (if this were possible, it would be much faster than a loop).
I'll try to explain it with a little example:
This is what I have:
index consumo
0 54
1 67
2 98
index outlier
0 False
1 False
2 True
And this is what I want to do:
index consumo
0 54
1 67
2 67
I think I should create a mask like this:
df.mask(outlier, df.columns=[[0]][i-1],axis=1)
obviously this IS NOT the way to write it. It's just an explanation of how I think it could be done (I'm talking about the [i-1]).
It seems you need shift:
consumo = consumo.mask(outlier, consumo.shift())
print (consumo)
0 54.0
1 67.0
2 67.0
Name: consumo, dtype: float64
Lastly, if all values are ints, add astype:
consumo = consumo.mask(outlier, consumo.shift()).astype(int)
print (consumo)
0 54
1 67
2 67
Name: consumo, dtype: int32
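One caveat: if two outliers can occur back to back, shift() would copy a value that is itself an outlier. A variant that handles runs of outliers masks them to NaN and forward-fills the last good value instead (the sample data here is my own, extended with consecutive outliers):

```python
import pandas as pd

consumo = pd.Series([54, 67, 98, 99, 70])
outlier = pd.Series([False, False, True, True, False])

# mask() with no replacement turns outliers into NaN;
# ffill() then carries the last non-outlier value forward
cleaned = consumo.mask(outlier).ffill().astype(int)
```

Both the 98 and the 99 become 67, the last value that was not flagged.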

python pandas: How to drop items in a dataframe

I have a huge number of points in my dataframe, so I want to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some amount of items, based on sampling?
To give more details: my problem is the number of values at very close intervals, e.g. 1491928756421062 and 1491928756421187, which makes the chart cluttered.
Instead I wanted to somehow have a mean value for those close intervals. But maybe grouped by a second...
I would use sample(), but as you said it selects randomly. If you want to sample according to some logic, for instance keeping only rows whose value satisfies mean * .9 < value < mean * 1.1, you can try the following code. It all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
df = pd.read_csv("test.csv", ",", header=None)
mean = df[1].mean()
my_sample = df[(mean *.90 < df[1]) & (df[1] < mean * 1.10)]
You're looking for resample. Since the date values are integer microseconds since the epoch, pass unit='us':
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean()
This is a more complete example
tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)
fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()
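On the question's own timestamps, the resample recipe looks like this. Note the date values appear to be microseconds since the epoch (they decode to April 2017), hence the unit='us' assumption; the three rows below, all within the same second, collapse into one per-second mean:

```python
import pandas as pd

df = pd.DataFrame({'date': [1491928756414930, 1491928756421062, 1491928756421187],
                   'calltime': [4643, 112, 41]})

# Convert the integer microsecond timestamps to a DatetimeIndex,
# then average calltime per second
idx = pd.to_datetime(df['date'], unit='us')
per_second = df.set_index(idx)['calltime'].resample('s').mean()
```

This keeps one averaged point per second, which is exactly the "grouped by a second" behaviour asked for.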
