Pandas keep highest value in every n consecutive rows - python

I have a pandas dataframe called df_initial with two columns 'a' and 'b' and N rows.
I would like to halve the number of rows: from each pair of consecutive rows, delete the one where the value of 'b' is lower.
Thus between row 0 and row 1 I will keep row 1, between row 2 and row 3 I will keep row 3, etc.
This is the result that I would like to obtain:
print(df_initial)
a b
0 0.04 0.01
1 0.05 0.22
2 0.06 0.34
3 0.07 0.49
4 0.08 0.71
5 0.09 0.09
6 0.10 0.98
7 0.11 0.42
8 0.12 1.32
9 0.13 0.39
10 0.14 0.97
11 0.15 0.05
12 0.16 0.36
13 0.17 1.72
....
print(df_reduced)
a b
0 0.05 0.22
1 0.07 0.49
2 0.08 0.71
3 0.10 0.98
4 0.12 1.32
5 0.14 0.97
6 0.17 1.72
....
Is there a pandas function to do this?
I saw that there is a resample function, DataFrame.resample(), but it only works with a DatetimeIndex, TimedeltaIndex or PeriodIndex, so it does not apply here.
Thanks to anyone who can help.

You can group every two rows (a simple way of doing so is floor-dividing the index by 2) and use the idxmax of column b to index the dataframe:
df.loc[df.groupby(df.index//2).b.idxmax(), :]
       a     b
1   0.05  0.22
3   0.07  0.49
4   0.08  0.71
6   0.10  0.98
8   0.12  1.32
10  0.14  0.97
13  0.17  1.72
The selected rows keep their original index labels; append .reset_index(drop=True) to renumber them from 0.
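Or, if you only need the maximum of b in each pair rather than the whole row, using Series.rolling:
df.b.rolling(2).max()[1::2]
Note that passing the index of this result to df.loc would simply return every second row (1, 3, 5, ...), not necessarily the row that holds the maximum.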
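The title asks about every n consecutive rows, so the grouping idea can be generalised. The following is a minimal sketch and not part of the original answer: keep_max_per_block is a hypothetical helper name, and it assumes the index labels are unique. It numbers the blocks from row positions (np.arange(len(df)) // n) rather than from the index, so it does not rely on the index being 0, 1, 2, ...

```python
import numpy as np
import pandas as pd


def keep_max_per_block(df, col, n=2):
    """Keep, from every block of n consecutive rows, the row where `col` is largest.

    Hypothetical helper; assumes df has unique index labels.
    """
    # Block number of each row, based on position rather than on the index
    blocks = np.arange(len(df)) // n
    # Index label of the row with the largest `col` inside each block
    winners = df.groupby(blocks)[col].idxmax()
    return df.loc[winners].reset_index(drop=True)


# First six rows of the question's example
df_initial = pd.DataFrame({
    "a": [0.04, 0.05, 0.06, 0.07, 0.08, 0.09],
    "b": [0.01, 0.22, 0.34, 0.49, 0.71, 0.09],
})
df_reduced = keep_max_per_block(df_initial, "b", n=2)
print(df_reduced)
```

If the last block has fewer than n rows, it is still treated as a block and its best row is kept.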

Here is a simple example with an explicit loop; you can adapt it to your own data.
import numpy as np
import pandas as pd
ar = np.array([[1.1, 1.0], [3.3, 0.2], [2.7, 10], [5.4, 7], [5.3, 9], [1.5, 15]])
df = pd.DataFrame(ar, columns=['a', 'b'])
# Walk over consecutive pairs of rows and drop the row with the lower 'b'
for i in range(0, len(df) - 1, 2):
    if df.loc[i, 'b'] < df.loc[i + 1, 'b']:
        df = df.drop(index=i)
    else:
        df = df.drop(index=i + 1)
print(df)

Related

Find number of datapoints in each range

I have a data frame that looks like this
data = [['A', 0.20], ['B',0.25], ['C',0.11], ['D',0.30], ['E',0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
col1 is a primary key (each row has a unique value).
The max of col2 is 1 and the min is 0. I want to find the number of data points in the ranges 0-0.30 (both 0 and 0.30 included), 0-0.29, 0-0.28, and so on down to 0-0.01. I can use pd.cut, but then the lower limit is not fixed, and my lower limit is always 0.
Can someone help?
One option using numpy broadcasting:
step = 0.01
up = np.arange(0, 0.3+step, step)
out = pd.Series((df['col2'].to_numpy()[:,None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
With pandas.cut and cumsum:
step = 0.01
up = np.arange(0, 0.3+step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2))
   .value_counts(sort=False).cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64
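For many rows or many thresholds, the comparison matrix built by the broadcasting approach can become large. A possible alternative, sketched here and not taken from the answers above, sorts the values once and uses np.searchsorted with side='right' to count how many values are less than or equal to each upper limit. The thresholds are built as integers divided by 100; that is my own choice, made on the assumption that the data are two-decimal literals, so that each threshold compares equal to the matching value.

```python
import numpy as np
import pandas as pd

data = [['A', 0.20], ['B', 0.25], ['C', 0.11], ['D', 0.30], ['E', 0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])

# Upper limits 0.00, 0.01, ..., 0.30 (integers / 100 is an assumption of this sketch)
up = np.arange(0, 31) / 100

# Sort once; searchsorted(..., side='right') returns, for each limit,
# the number of sorted values that are <= that limit
sorted_vals = np.sort(df['col2'].to_numpy())
counts = pd.Series(np.searchsorted(sorted_vals, up, side='right'), index=up)
print(counts)
```

Because the lower limit is always 0 and no value is negative, counting values up to each upper limit is the same as counting values inside the range 0 to that limit.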

Select top-N from two pandas DataFrames

Assume there are two pandas DataFrames, df1 and df2. df1 is a square data frame such as the following:
import numpy as np
import pandas as pd
item_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size=(9,9)),2)
df1 = pd.DataFrame(nums, index=item_names, columns=item_names)
df1 output:
2 7 9 10 11 13 14 21 24
2 0.70 0.29 0.23 0.55 0.72 0.42 0.98 0.68 0.48
7 0.39 0.34 0.73 0.44 0.06 0.40 0.74 0.18 0.18
9 0.53 0.53 0.63 0.85 0.72 0.61 0.72 0.32 0.36
10 0.23 0.29 0.63 0.09 0.43 0.43 0.49 0.43 0.31
11 0.43 0.89 0.94 0.50 0.62 0.12 0.32 0.41 0.87
13 0.25 0.48 0.99 0.52 0.61 0.12 0.83 0.60 0.55
14 0.34 0.30 0.42 0.68 0.88 0.51 0.67 0.59 0.62
21 0.67 0.84 0.08 0.76 0.24 0.19 0.57 0.10 0.89
24 0.63 0.72 0.02 0.59 0.56 0.16 0.15 0.70 0.32
df2 stores each item and its corresponding group:
df2 = pd.DataFrame({'item': item_names,
                    'group': ['a1', 'a1', 'a1', 'a2',
                              'a2', 'a2', 'a2', 'a3', 'a3']})
df2 output:
item group
0 2 a1
1 7 a1
2 9 a1
3 10 a2
4 11 a2
5 13 a2
6 14 a2
7 21 a3
8 24 a3
The goal is to write a function that selects the top N items in a specific row (item name) of df1, ranked by the largest values, using the information in both DataFrames. However, the returned top N items and the query item must ALL be from different groups. For example:
A query item (item = 10) is in the 4th row of df1. The top 2 returned items should be [9, 21], not [9, 14], since item 10 is in group a2 and none of the returned items should be from group a2. I have checked Scott Boston's solution for a similar problem, but it does not prevent the top N items from sharing a group with the query item. Any suggestions? Many thanks.
IIUC, you want to select the N largest values, excluding values from the same group as the query item.
Here is a function that does this:
def get_top_N(idx, N=2):
    # map each item to its group
    group = df2.set_index('item')['group']
    # items whose group differs from the query item's group
    incl = group[group.ne(group[idx])].index
    return df1.loc[idx, incl].nlargest(N).index.to_list()
get_top_N(10)
# [9, 21]
If you additionally want to ensure that the returned items are all from different groups (it was unclear whether this is a requirement, although it holds in your example), you can do:
def get_top_N_diff(idx, N=2):
    group = df2.set_index('item')['group']
    incl = group[group.ne(group[idx])].index
    s = df1.loc[idx, incl]
    # best item of each remaining group, then the N largest among those
    best = s.groupby(group.loc[incl]).idxmax()
    return s.loc[best.to_list()].nlargest(N).index.to_list()
get_top_N(11) # may return items from the same group
# [9, 7]
get_top_N_diff(11) # different groups
# [9, 24]
Not sure exactly what you wanted... but this might point you in a direction:
import pandas as pd
import numpy as np
s2 = df2.set_index('item').group
mask = np.equal.outer(df1.index.map(s2.get), df1.columns.map(s2.get))
stacked = df1.mask(mask).stack().rename_axis(['x', 'y']).to_frame(name='v')
stacked.sort_values(['x', 'v'], ascending=[True, False]).groupby('x').head(2)
          v
x  y
2  14  0.98
   11  0.72
7  14  0.74
   10  0.44
9  10  0.85
   11  0.72
10 9   0.63
   21  0.43
11 9   0.94
   7   0.89
13 9   0.99
   21  0.60
14 24  0.62
   21  0.59
21 7   0.84
   10  0.76
24 7   0.72
   2   0.63
A modification of the answer you mentioned:
def get_top(df1, df2, item_name, number_items):
    val = df1.loc[[item_name]].T.merge(df2, left_index=True, right_on='item')
    val = val[val['group'] != val.loc[val['item'] == item_name, 'group'].values[0]]
    return (val.sort_values(item_name, ascending=False)
               .groupby('group')
               .head(1)
               .head(number_items)['item']
               .to_numpy())
>>> get_top(df1, df2, 10, 2)
array([ 9, 21])
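The answers above rely on df1 and df2 being defined globally and differ on whether the results must also come from distinct groups. As a rough, self-contained sketch of both readings (top_n_other_groups, its parameters and the one_per_group flag are hypothetical names, not from any answer), the two rows below are copied from the df1 output shown in the question:

```python
import pandas as pd

item_names = [2, 7, 9, 10, 11, 13, 14, 21, 24]
# Rows 10 and 11 of df1, copied from the output shown in the question
scores = pd.DataFrame(
    [[0.23, 0.29, 0.63, 0.09, 0.43, 0.43, 0.49, 0.43, 0.31],
     [0.43, 0.89, 0.94, 0.50, 0.62, 0.12, 0.32, 0.41, 0.87]],
    index=[10, 11], columns=item_names)
# Same item -> group mapping as df2, stored as a Series indexed by item
groups = pd.Series(['a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2', 'a3', 'a3'],
                   index=item_names)


def top_n_other_groups(scores, groups, item, n=2, one_per_group=False):
    """Top-n items for `item`, never taken from the query item's own group."""
    row = scores.loc[item]
    # Drop every candidate that shares the query item's group
    other = groups.loc[row.index] != groups.loc[item]
    candidates = row[other.to_numpy()]
    if one_per_group:
        # Keep only the best candidate of each remaining group
        best = candidates.groupby(groups.loc[candidates.index]).idxmax()
        candidates = candidates.loc[best.to_list()]
    return candidates.nlargest(n).index.to_list()


print(top_n_other_groups(scores, groups, 10))
print(top_n_other_groups(scores, groups, 11))
print(top_n_other_groups(scores, groups, 11, one_per_group=True))
```

Passing the two frames as arguments keeps the function testable; the final nlargest call ranks by value in both modes, so the result order does not depend on group names.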

How do you give weights to dataframe columns iteratively for weighted mean average?

I have a dataframe with multiple columns holding numerical float values. I want to give a fractional weight to each column, compute the weighted average, and store it as a new column in the same df.
Let's say we have the columns: s1, s2, s3
I want to give the weights: w1, w2, w3 to them respectively
I was able to do this manually by hard-coding every column and weight. But when I switch to lists of features and weights and iterate over them, I get an error.
Below are the line from my iterative attempt and the manual code that works but needs every column and weight written out by hand.
Code which didn't work:
score_df["weighted_avg"] += weight * score_df[feature]
Manual Code which worked but not with lists:
df["weighted_scores"] = 0.5*df["s1"] + 0.25*df["s2"] + 0.25*df["s3"]
We can use numpy broadcasting for this, since weights has the same length as your column axis:
# given the following example df
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
print(df)
s1 s2 s3
0 0.49 1.00 0.50
1 0.65 0.87 0.75
2 0.45 0.85 0.87
3 0.91 0.53 0.30
4 0.96 0.44 0.50
5 0.67 0.87 0.24
6 0.87 0.41 0.29
7 0.06 0.15 0.73
8 0.76 0.92 0.69
9 0.92 0.28 0.29
weights = [0.5, 0.25, 0.25]
df["weighted_scores"] = df.mul(weights).sum(axis=1)
print(df)
s1 s2 s3 weighted_scores
0 0.49 1.00 0.50 0.62
1 0.65 0.87 0.75 0.73
2 0.45 0.85 0.87 0.66
3 0.91 0.53 0.30 0.66
4 0.96 0.44 0.50 0.71
5 0.67 0.87 0.24 0.61
6 0.87 0.41 0.29 0.61
7 0.06 0.15 0.73 0.25
8 0.76 0.92 0.69 0.78
9 0.92 0.28 0.29 0.60
You can use dot:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
df['weighted_scores'] = df.dot([.5,.25,.25])
df
Out
s1 s2 s3 weighted_scores
0 0.053543 0.659316 0.033540 0.199985
1 0.631627 0.257241 0.494959 0.503863
2 0.220939 0.870247 0.875165 0.546822
3 0.890487 0.519320 0.944459 0.811188
4 0.029416 0.016780 0.987503 0.265779
5 0.843882 0.784933 0.677096 0.787448
6 0.396092 0.297580 0.965454 0.513805
7 0.109894 0.011217 0.443796 0.168700
8 0.202096 0.637105 0.959876 0.500293
9 0.847020 0.949703 0.668615 0.828090
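For completeness, here is a sketch of the iterative form the question was attempting. The question does not show the error, so this rests on an assumption: a common cause is that the target column does not exist yet when += first runs, which raises a KeyError. The weights dictionary and the two sample rows (taken from the first answer's example) are illustrative only.

```python
import pandas as pd

score_df = pd.DataFrame({"s1": [0.49, 0.65],
                         "s2": [1.00, 0.87],
                         "s3": [0.50, 0.75]})
# Hypothetical feature -> weight mapping
weights = {"s1": 0.5, "s2": 0.25, "s3": 0.25}

# Create the accumulator column first; += on a missing column would fail
score_df["weighted_avg"] = 0.0
for feature, weight in weights.items():
    score_df["weighted_avg"] += weight * score_df[feature]

# Vectorised equivalent for comparison: dot aligns the Series index with the columns
vectorised = score_df[list(weights)].dot(pd.Series(weights))
print(score_df)
```

The loop and the dot product should agree up to floating-point rounding; the vectorised form is usually preferable once it is clear what the loop computes.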

Python: for loop iterations when adding dataframes

I have a dataframe with different returns looking something like:
0.2 -0.1 0.03 0.01
0.02 0.1 -0.1 -0.2
0.05 0.06 0.07 -0.07
0.03 -0.04 -0.04 -0.03
And I have a separate dataframe with the index returns in only one column:
0.01
0.015
-0.01
-0.02
What I want to do is add (+) each row's value from the index return dataframe to the value in the same row of every column of the stock return dataframe.
The desired outcome looks like:
0.21 -0.09
0.035 0.115
0.04 0.05
0.01 -0.06 etc etc
Normally, in Matlab for example, the for loop would be quite simple, but in Python the indexing is what gets me stuck.
I have tried a simple for loop:
for i, j in df_stock_returns.iterrows():
    df_new = df_stock_returns[i, j] + df_index_reuturns[j]
But that doesn't really work; any help is appreciated!
Assuming you have
In [27]: df
Out[27]:
0 1 2 3
0 0.20 -0.10 0.03 0.01
1 0.02 0.10 -0.10 -0.20
2 0.05 0.06 0.07 -0.07
3 0.03 -0.04 -0.04 -0.03
and
In [28]: dfi
Out[28]:
0
0 0.010
1 0.015
2 -0.010
3 -0.020
you can just write
In [26]: pd.concat([df[c] + dfi[0] for c in df], axis=1)
Out[26]:
0 0 1 2
0 0.210 -0.090 0.040 0.020
1 0.035 0.115 -0.085 -0.185
2 0.040 0.050 0.060 -0.080
3 0.010 -0.060 -0.060 -0.050
In pandas you almost never need to iterate over individual cells. Here I just iterated over the columns, and df[c] + dfi[0] adds the two columns element-wise. Then concat with axis=1 (0=rows, 1=columns) just concatenates everything into one dataframe.
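The column loop can also be avoided altogether. As a small sketch using the same data (this goes beyond what the answer wrote), DataFrame.add with axis=0 aligns a Series with the row index and adds it to every column, and it keeps the original column labels:

```python
import pandas as pd

df = pd.DataFrame([[0.20, -0.10, 0.03, 0.01],
                   [0.02, 0.10, -0.10, -0.20],
                   [0.05, 0.06, 0.07, -0.07],
                   [0.03, -0.04, -0.04, -0.03]])
dfi = pd.DataFrame([0.010, 0.015, -0.010, -0.020])

# axis=0: match the Series against the row index, broadcast across columns
result = df.add(dfi[0], axis=0)

# Same values as the concat-based version from the answer
looped = pd.concat([df[c] + dfi[0] for c in df], axis=1)
print(result)
```

Writing df + dfi[0] without axis=0 would instead try to align the Series index with the column labels, which is not what is wanted here.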
I suppose the most straightforward way will work:
for c in a.columns:
    a[c] = a[c] + b
>>> a
0 1 2 3
0 0.210 -0.090 0.040 0.020
1 0.215 -0.085 0.045 0.025
2 0.190 -0.110 0.020 0.000
3 0.180 -0.120 0.010 -0.010
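You can simply add the two dataframes as below:
col1=[0.2,0.02]
col2=[-0.1,0.2]
col3=[0.01,0.015]
df1=pd.DataFrame(data=list(zip(col1, col2)),columns=['list1','list2'])
df2=pd.DataFrame({'list3':col3})
output = df1 + df2['list3'].values[:, None]
Here df2['list3'].values[:, None] reshapes the reference column into a column vector, so it is added row by row to every column of df1. A plain 1-D array would instead be matched against the columns of df1.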

Pandas: Pairwise concatenation of column vectors

I'm working with a frame like
df = pd.DataFrame({
'G1':[1.00,0.69,0.23,0.22,0.62],
'G2':[0.03,0.41,0.74,0.35,0.62],
'G3':[0.05,0.40,0.15,0.32,0.19],
'G4':[0.30,0.20,0.51,0.70,0.67],
'G5':[0.40,0.36,0.88,0.10,0.19]
})
and I want to manipulate it so that the columns are pairwise permutations of the current columns e.g. all columns are now 10 elements long and for example column 'G1:G2' would have column 'G2' appended to column 'G1'. I have attached a mock-up pic. Note that the pic has named indices unlike the above example code. I can work with or without the indices.
How could I approach this? I can make a function to act on each column, but I think the function would have to return a data frame made by concatenation with all other columns. Not sure what that would look like.
I'd do it like this:
from itertools import permutations
import numpy as np
import pandas as pd
# l1, l2: positions of the first and second column of every ordered pair
l1, l2 = map(list, zip(*permutations(range(len(df.columns)), 2)))
v = df.values
pd.DataFrame(
    np.vstack([v[:, l1], v[:, l2]]),
    list(map('S{}'.format, range(1, len(df) + 1))) * 2,
    df.columns.values[l1] + ':' + df.columns.values[l2]
)
Here is one way, although I suspect there might also be a way to do this directly in pandas:
from itertools import permutations
# Get all the column permutations
lst = [x for x in permutations(df.columns, 2)]
# Create a list of column names
names = [x[0]+'_'+x[1] for x in lst]
# Create the new arrays by vertically stacking pairs of column values
cols = [np.vstack((df[x[0]].values,df[x[1]].values)).ravel() for x in lst]
# Create a dictionary with column names as keys and the arrays as values
d = dict(zip(names, cols))
# Create new dataframe from dict
df2 = pd.DataFrame(d)
df2
   G1_G2  G1_G3  G1_G4  G1_G5  G2_G1  G2_G3  G2_G4  G2_G5  G3_G1  G3_G2
0   1.00   1.00   1.00   1.00   0.03   0.03   0.03   0.03   0.05   0.05
1   0.69   0.69   0.69   0.69   0.41   0.41   0.41   0.41   0.40   0.40
2   0.23   0.23   0.23   0.23   0.74   0.74   0.74   0.74   0.15   0.15
3   0.22   0.22   0.22   0.22   0.35   0.35   0.35   0.35   0.32   0.32
4   0.62   0.62   0.62   0.62   0.62   0.62   0.62   0.62   0.19   0.19
5   0.03   0.05   0.30   0.40   1.00   0.05   0.30   0.40   1.00   0.03
6   0.41   0.40   0.20   0.36   0.69   0.40   0.20   0.36   0.69   0.41
7   0.74   0.15   0.51   0.88   0.23   0.15   0.51   0.88   0.23   0.74
8   0.35   0.32   0.70   0.10   0.22   0.32   0.70   0.10   0.22   0.35
9   0.62   0.19   0.67   0.19   0.62   0.19   0.67   0.19   0.62   0.62
This is only part of the output (the first 10 of the 20 columns).
To avoid creating the intermediate lists, you can exploit the fact that itertools.permutations returns a lazy iterator:
d = dict((x[0]+'_'+x[1], np.vstack((df[x[0]].values,df[x[1]].values)).ravel())
         for x in permutations(df.columns, 2))
df2 = pd.DataFrame(d)
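The second answer wonders whether this can be done more directly in pandas. One possible sketch, offered as an alternative rather than as the answerers' method, appends each pair of columns with pd.concat inside a dict comprehension; the 'G1:G2' naming follows the question's wording:

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'G1': [1.00, 0.69, 0.23, 0.22, 0.62],
    'G2': [0.03, 0.41, 0.74, 0.35, 0.62],
    'G3': [0.05, 0.40, 0.15, 0.32, 0.19],
    'G4': [0.30, 0.20, 0.51, 0.70, 0.67],
    'G5': [0.40, 0.36, 0.88, 0.10, 0.19],
})

# For every ordered pair of columns, append the second column to the first;
# ignore_index=True renumbers the combined rows 0..9
pairs = {
    f"{left}:{right}": pd.concat([df[left], df[right]], ignore_index=True)
    for left, right in permutations(df.columns, 2)
}
result = pd.DataFrame(pairs)
print(result.shape)
```

This is likely slower than the NumPy stacking approaches for wide frames, since it builds one Series per pair, but it stays entirely within pandas.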
