I have a dataframe with ID, date and number columns and would like to create a new column that takes the mean of all numbers for that specific ID, but only includes numbers from rows whose date is earlier than the date of the current row. How would I do this?
df = (pd.DataFrame({'ID': ['1', '1', '1', '1', '2', '2'],
                    'number': ['1', '4', '1', '4', '2', '5'],
                    'date': ['2021-10-19', '2021-10-16', '2021-10-16', '2021-10-15', '2021-10-19', '2021-10-10']})
        .assign(date=lambda x: pd.to_datetime(x.date))
        .assign(mean_no_from_previous_dts=lambda x: x[x.date < ??].groupby('ID').number.transform('mean'))
      )
This is what I would like to get as output:
ID number date mean_no_from_previous_dts
0 1 1 2021-10-19 3.0 = mean(4+1+4)
1 1 4 2021-10-16 2.5 = mean(4+1)
2 1 1 2021-10-16 4.0 = mean(4+4)
3 1 4 2021-10-15 0.0 = 0 (as it's the first entry for this date and ID - this number doesn't matter, it can be something else)
4 2 2 2021-10-19 5.0 = mean(5)
5 2 5 2021-10-10 0.0 = 0 (as it's the first entry for this date and ID)
So, for example, the first entry of the column mean_no_from_previous_dts is the mean of (4+1+4): the first 4 comes from the number column in the 2nd row, because 2021-10-16 (the date in the 2nd row) is smaller than 2021-10-19 (the date in the 1st row). The 1 comes from the 3rd row because 2021-10-16 is smaller than 2021-10-19. The second 4 comes from the 4th row because 2021-10-15 is smaller than 2021-10-19. This is for ID = 1, and the same applies for ID = 2.
Here is a solution with numpy broadcasting per group:
import numpy as np
import pandas as pd

df = (pd.DataFrame({'ID': ['1', '1', '1', '1', '2', '2'],
                    'number': ['1', '4', '1', '4', '2', '5'],
                    'date': ['2021-10-19', '2021-10-16', '2021-10-16', '2021-10-15', '2021-10-19', '2021-10-10']})
        .assign(date=lambda x: pd.to_datetime(x.date),
                number=lambda x: x['number'].astype(int))
      )
def f(x):
    arr = x['date'].to_numpy()
    # pairwise mask: entry (i, j) is True if date j is <= date i
    m = arr <= arr[:, None]
    # exclude each row from its own mean - set the diagonal to False
    np.fill_diagonal(m, False)
    # set excluded values to NaN and take the mean ignoring NaNs
    means = np.nanmean(np.where(m, x['number'].to_numpy(), np.nan).astype(float), axis=1)
    # assign to new column
    x['no_of_previous_dts'] = means
    return x
# the earliest row per group has no previous dates, so its NaN is set to 0
df = df.groupby('ID').apply(f).fillna({'no_of_previous_dts': 0})
print(df)
ID number date no_of_previous_dts
0 1 1 2021-10-19 3.0
1 1 4 2021-10-16 2.5
2 1 1 2021-10-16 4.0
3 1 4 2021-10-15 0.0
4 2 2 2021-10-19 5.0
5 2 5 2021-10-10 0.0
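As a side note, a simpler sketch (my own variant, not the broadcasting answer above) is a sorted, shifted expanding mean per ID. It is only equivalent when dates within an ID are unique, because it looks at rows that come earlier in the sorted order rather than at all rows sharing the same date:
import pandas as pd

df2 = (pd.DataFrame({'ID': ['1', '1', '1', '1', '2', '2'],
                     'number': [1, 4, 1, 4, 2, 5],
                     'date': pd.to_datetime(['2021-10-19', '2021-10-16', '2021-10-16',
                                             '2021-10-15', '2021-10-19', '2021-10-10'])})
         .sort_values(['ID', 'date']))

df2['mean_no_from_previous_dts'] = (
    df2.groupby('ID')['number']
       .transform(lambda s: s.expanding().mean().shift())  # mean of rows earlier in the sorted order
       .fillna(0)                                          # first row per ID has no earlier rows -> 0
)
print(df2.sort_index())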
Related
I need to count, for each row, the number of immediately preceding consecutive rows whose value is less than the current one.
Below is a sample input and the expected result.
df = pd.DataFrame([10,9,8,11,10,11,13], columns=['value'])
df_result = pd.DataFrame({'value': [10, 9, 8, 11, 10, 11, 13],
                          'number of last consecutive rows less than current': [0, 0, 0, 3, 0, 1, 6]})
Is it possible to achieve this without a loop?
Otherwise, a solution with a loop would be fine.
A further question:
Could I do it with a groupby operation, for the following input?
df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]], columns=['value','group'])
The following raised an error:
df.groupby('group')['value'].expanding()
Assuming this input:
value
0 10
1 9
2 8
3 11
4 10
5 13
You can use cummax and a custom expanding function:
df['out'] = (df['value'].cummax().expanding()
               .apply(lambda s: s.lt(df.loc[s.index[-1], 'value']).sum())
             )
For the particular case of the < comparison, you can use a much faster trick with numpy: if a value is greater than all previous values, then it is greater than n values, where n is its position (the number of preceding rows):
import numpy as np

m = df['value'].lt(df['value'].cummax())
df['out'] = np.where(m, 0, np.arange(len(df)))
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 13 5.0
Update: consecutive values
df['out'] = (
    df['value'].expanding()
      .apply(lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum())
)
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 11 1.0
6 13 6.0
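To address the follow-up about groups, a sketch (my assumption of what was intended, not part of the answer above): run the same consecutive-count expanding apply per group via transform, using the two-column frame from the question:
import pandas as pd

df = pd.DataFrame([[10, 0], [9, 0], [7, 0], [8, 0], [11, 1], [10, 1], [11, 1], [13, 1]],
                  columns=['value', 'group'])

# per group: count how many immediately preceding rows are smaller than the current one
df['out'] = (
    df.groupby('group')['value']
      .transform(lambda g: g.expanding()
                            .apply(lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum()))
)
print(df)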
I need to add a new column based on the values of two existing columns.
My data set looks like this:
Date Bid Ask Last Volume
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0
... ... ... ... ... ...
80356 2021.02.01 23:58:54.332 1.20603 1.20605 0.0 0
I need to generate a new column named "New" whose values are random numbers between the "Bid" and "Ask" columns. Each value of "New" has to lie in the range from Bid to Ask (it may equal Bid or Ask).
I have tried this:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Ask,x.Bid), axis=1)
But I got this
Exception has occurred: ValueError
low >= high
I am new to Python.
Use np.random.uniform to get a random float drawn with equal probability between your bounds, on the half-open interval [low_bound, high_bound).
Also ditch the apply; np.random.uniform can generate the numbers directly from arrays of bounds. (I added a row at the bottom of the output to make this obvious.)
import numpy as np
df['New'] = np.random.uniform(df.Bid, df.Ask, len(df))
Date Bid Ask Last Volume New
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0 1.213114
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0 1.212969
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0 1.213342
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0 1.212933
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0 1.212948
5 2021.02.01 00:02:08.920 100.00000 115.00000 0.0 0 100.552836
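If you prefer NumPy's newer Generator API, an equivalent, seedable sketch (my variant, assuming the same df with Bid and Ask columns):
import numpy as np

rng = np.random.default_rng(42)                   # seeded only for reproducibility
df['New'] = rng.uniform(df.Bid, df.Ask, len(df))  # one draw per row, Bid <= value < Ask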
All you need to do is switch the order of x.Ask and x.Bid in your code. In your dataframe the ask prices are always higher than the bid, which is why you are getting the error. (Note that np.random.randint draws integers, so for float prices like these you would use np.random.uniform with the same argument order.)
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask), axis=1)
If your ask value is sometimes greater and sometimes less than the bid, use:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask) if x.Ask > x.Bid else np.random.randint(x.Ask,x.Bid), axis=1)
Finally, if it is possible for ask to be greater than, less than, or equal to bid, use:
def helper(x):
    if x.Ask > x.Bid:
        return np.random.randint(x.Bid, x.Ask)
    elif x.Bid > x.Ask:
        return np.random.randint(x.Ask, x.Bid)
    else:
        return None

df['rand_between'] = df.apply(helper, axis=1)
You can loop through the rows using apply and then use your randint function (for floats you might want to use random.uniform). For example:
In [1]: import pandas as pd
...: from random import randint
...: df = pd.DataFrame({'bid':range(10),'ask':range(0,20,2)})
...:
...: df['new'] = df.apply(lambda x: randint(x['bid'],x['ask']), axis=1)
...: df
Out[1]:
bid ask new
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 6
4 4 8 6
5 5 10 9
6 6 12 9
7 7 14 12
8 8 16 13
9 9 18 9
The axis=1 is telling the apply function to loop over rows, not columns.
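For float prices like those in the question, a sketch of the random.uniform variant hinted at above (the toy Bid/Ask values here are made up for illustration):
import pandas as pd
from random import uniform

ticks = pd.DataFrame({'Bid': [1.21291, 1.21290, 1.21287],
                      'Ask': [1.21336, 1.21336, 1.21336]})

# row-wise draw; random.uniform(a, b) returns a float N with a <= N <= b
ticks['New'] = ticks.apply(lambda x: uniform(x['Bid'], x['Ask']), axis=1)
print(ticks)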
In a pandas DataFrame, I want to find the nth largest value row-wise (not column-wise) and also whether there are ties. I am interested in the top 3 largest values and their ties. Note that the actual problem has more than 3 columns.
e.g. if the data frame looks like this:
A B C
0 1 2 2
1 3 4 3
2 1 2 3
I want to know the nth largest values and whether there were ties, so:
row 0 - 1st largest is 2 (with a tie), 2nd largest is 1, no 3rd largest,
row 1 - 1st largest is 4, 2nd largest is 3 (with a tie), no 3rd largest,
row 2 - 1st largest is 3, 2nd largest is 2, 3rd largest is 1.
Expected output as requested:
A B C max1 max2 max3 tie1 tie2 tie3
0 1 2 2 2 1 NaN 1 0 NaN
1 3 4 3 4 3 NaN 0 1 NaN
2 1 2 3 3 2 1.0 0 0 0.0
Use:
#top N values
N = 3
#reshape to MultiIndex Series and counts values per index, sorting
df1 = (df.stack()
         .groupby(level=0).value_counts()
         .sort_index(ascending=[True, False])
         .reset_index(level=1))
#add a counter level to the MultiIndex, used for the new column names
s = df1.groupby(level=0).cumcount().add(1)
df1 = df1.set_index(s, append=True)
#filter topN rows
df1 = df1[s.le(N).values]
df1.columns=['max','tie']
#subtract 1 for correct tie
df1['tie'] -= 1
#reshape to df with MultiIndex
df1 = df1.unstack()
#flatten columns names
df1.columns = df1.columns.map(lambda x: f'{x[0]}{x[1]}')
#add to original
df2 = df.join(df1)
print (df2)
A B C max1 max2 max3 tie1 tie2 tie3
0 1 2 2 2.0 1.0 NaN 1.0 0.0 NaN
1 3 4 3 4.0 3.0 NaN 0.0 1.0 NaN
2 1 2 3 3.0 2.0 1.0 0.0 0.0 0.0
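A more explicit (but slower) per-row alternative using np.unique and apply - my own sketch, not part of the answer above - which reproduces the max/tie columns for the small example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 1], 'B': [2, 4, 2], 'C': [2, 3, 3]})
N = 3

def top_with_ties(row):
    vals, counts = np.unique(row.to_numpy(), return_counts=True)   # unique values, ascending
    vals, counts = vals[::-1][:N], counts[::-1][:N]                # keep the N largest
    maxes = {f'max{i+1}': (vals[i] if i < len(vals) else np.nan) for i in range(N)}
    ties = {f'tie{i+1}': (counts[i] - 1 if i < len(vals) else np.nan) for i in range(N)}
    return pd.Series({**maxes, **ties})

print(df.join(df.apply(top_with_ties, axis=1)))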
for i in range(len(df)):
    l = df.iloc[i].tolist()               # values of row i
    d = {key: l.count(key) for key in l}  # value -> number of occurrences in the row
The dictionary now maps each value to how many times it is repeated in the row. You can then extract the top-n values of the dictionary easily, as sketched below.
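A sketch of that last step, reusing the dictionary d built in the loop above (the cut-off n = 3 is my assumption):
n = 3
# n largest distinct values in the row with their occurrence counts; count - 1 is the number of ties
top_n = sorted(d.items(), key=lambda kv: kv[0], reverse=True)[:n]
print(top_n)   # for the last example row this gives [(3, 1), (2, 1), (1, 1)]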
Following my previous question:
I have a dataframe:
load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
1,1576147339.641,124219
9,1576147339.662,124219
8,1576147339.663,124219
7,1576147339.663,124219
6,1576147339.663,124219
5,1576147339.663,124219
4,1576147339.663,124219
3,1576147339.663,124219
2,1576147339.663,124219
1,1576147339.663,124219
0,1576147339.663,124219
0,1576147339.673,124219
3,1576147341.567,124221
2,1576147341.568,124221
1,1576147341.569,124221
0,1576147341.57,124221
4,1576147341.581,124221
3,1576147341.581,124221
I want to remove all rows whose 'timestamp' values are within some tolerance of one another, keeping only the row with the largest 'load'.
In the above example, a tolerance of 0.01 would leave us with:
load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
9,1576147339.662,124219
0,1576147339.673,124219
3,1576147341.567,124221
4,1576147341.581,124221
The maximal value of 'load' doesn't have to be the 1st one!
The idea is to scale the timestamp values by 1/tolerance (a factor greater than 1), round them, and pass the rounded values to groupby, aggregating with max:
tolerance=0.01
df = df.groupby(df['timestamp'].mul(1/tolerance).round()).max().reset_index(drop=True)
print (df)
load timestamp timestr
0 0 1.576147e+09 124219
1 0 1.576147e+09 124219
2 2 1.576147e+09 124219
3 9 1.576147e+09 124219
4 0 1.576147e+09 124219
5 3 1.576147e+09 124221
6 4 1.576147e+09 124221
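A variant of the same idea, in case whole rows should be kept together (groupby().max() takes the maximum of each column independently, so timestamp and load can come from different rows): pick the index of the largest load per rounded bucket with idxmax. This is my own sketch, starting again from the original df from the question:
tolerance = 0.01
buckets = df['timestamp'].mul(1 / tolerance).round()   # same rounding-based grouping key
out = df.loc[df.groupby(buckets)['load'].idxmax()].sort_index().reset_index(drop=True)
print(out)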
Rounding is susceptible to the following problem: there can be two rows with fractional parts of e.g. 0.494 and 0.502. The first rounds to 0.49 and the second to 0.50, so they end up in different groups even though they are less than 0.01 apart.
So my proposal is to compute the result DataFrame by iteration:
rows = []
wrk = df.sort_values('timestamp')
threshold = 0.01

while wrk.index.size > 0:
    tMin = wrk.iloc[0, 1]                          # min remaining timestamp (column 1)
    grp = wrk[wrk.timestamp <= tMin + threshold]   # rows within the threshold of it
    rows.append(grp.nlargest(1, 'load'))           # keep the row with the max load
    wrk = wrk.drop(grp.index)

result = pd.concat(rows)
To confirm my initial remark, change the fractional part of the timestamp in the first row to 0.494. For readability, I also "shortened" the integer part.
My solution returns:
load timestamp timestr
0 0 7339.494 124219
2 2 7339.637 124219
4 9 7339.662 124219
14 0 7339.673 124219
15 3 7341.567 124221
19 4 7341.581 124221
whereas the other solution returns:
load timestamp timestr
0 0 7339.494 124219
1 0 7339.502 124219
2 2 7339.641 124219
3 9 7339.663 124219
4 0 7339.673 124219
5 3 7341.570 124221
6 4 7341.581 124221
I had data which I pivoted using the pivot_table method; now the data looks like this:
rule_id a b c
50211 8 0 0
50249 16 0 3
50378 0 2 0
50402 12 9 6
I have set 'rule_id' as the index. Now I compare each column to the next one and create another column with the result. The idea is: if the first column has a value other than 0 and the second column (the one it is compared to) has 0, then 100 should go into the newly created column; if the situation is the other way round, 'Null' should be used. If both columns are 0, 'Null' should also be used. For the last column, 'Null' should be used if its value is 0, and 100 otherwise. But if both columns have values other than 0 (as in the last row of my data), then the comparison should be, for columns a and b:
value_of_b / value_of_a * 50 + 50
and for columns b and c:
value_of_c / value_of_b * 25 + 25
and similarly, if there are more columns, the multiplication and addition value should be 12.5, and so on.
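For example, taking the last row (a=12, b=9, c=6): comp1 = 9/12 * 50 + 50 = 87.5, comp2 = 6/9 * 25 + 25 ≈ 41.67, and comp3 = 100 because the last column is non-zero; this matches the desired output shown below.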
I was able to achieve everything above apart from the last part, the division and multiplication. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df ==0, m], [np.nan, df], 1*100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe that stores the pivoted data I mentioned at the start. After using this code my data looks like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 100 100 100
But I want the data to look like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 87.5 41.67 100
If you can help me get the desired data, I would greatly appreciate it.
The problem is that the coefficient used to build each new compX column does not depend only on the column's position. In fact, within each row it is reset to its maximum of 50 after every 0 value and is halved after every non-zero value. Such resettable series are hard to vectorize in pandas, especially across rows. Here I would build a companion dataframe holding only those coefficients, and then use numpy to build the comp columns from consecutive pairs of columns. The code could be:
# transpose the dataframe to process columns instead of rows
coeff = df.T

# compute the coefficients
for name, s in coeff.items():
    top = 100                 # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:            # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2     # else halve the previous value
        r.append(top)
    coeff.loc[:, name] = r    # set the whole column in one operation

# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2] / df[col1] * coeff[col1]
                                               + coeff[col1]))

old = df.columns[0]           # store name of first column

# enumerate all the columns (except the first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col                 # keep current column name for next iteration

# special processing for the last comp column
df['comp{}'.format(i + 1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32
rule_id
50402 0 0 9 0
51121 0 1 0 0
51147 0 1 0 0
51183 2 0 0 0
51283 0 12 9 6
51684 0 1 0 0
52035 0 4 3 2
it gives as expected:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32 comp1 comp2 comp3 comp4
rule_id
50402 0 0 9 0 NaN NaN 100.000000 NaN
51121 0 1 0 0 NaN 100.0 NaN NaN
51147 0 1 0 0 NaN 100.0 NaN NaN
51183 2 0 0 0 100.0 NaN NaN NaN
51283 0 12 9 6 NaN 87.5 41.666667 100.0
51684 0 1 0 0 NaN 100.0 NaN NaN
52035 0 4 3 2 NaN 87.5 41.666667 100.0
I think you can iterate over your dataframe df and use some if/else logic to get the desired output:
for i in range(len(df.index)):
    if df.iloc[i, 1] != 0 and df.iloc[i, 2] == 0:    # columns start from index 0,
        df.loc[i, 'colname'] = 'whatever you want'   # so rule_id is column 0
    elif ...:
        ...
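A filled-in illustration of that loop for just the first comparison column, using the rules and the small a/b/c example from the question (my own sketch; the remaining comp columns would follow the same pattern):
import numpy as np
import pandas as pd

df = pd.DataFrame({'rule_id': [50211, 50249, 50378, 50402],
                   'a': [8, 16, 0, 12], 'b': [0, 0, 2, 9], 'c': [0, 3, 0, 6]}).set_index('rule_id')

comp1 = []
for _, row in df.iterrows():
    if row['a'] == 0:
        comp1.append(np.nan)                          # first column is 0 -> Null
    elif row['b'] == 0:
        comp1.append(100.0)                           # non-zero vs 0 -> 100
    else:
        comp1.append(row['b'] / row['a'] * 50 + 50)   # both non-zero -> ratio rule
df['comp1'] = comp1
print(df)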