Extract indices of grouped elements in Pandas - python

The objective is to extract the index numbers of randomly selected rows within each group in Pandas.
Specifically, given the df
nval
0 4
1 4
2 0
...
23 0
24 4
...
29 4
30 4
31 0
I would like to extract 5 random indices each for the elements 0 and 4.
For example, the 5 randomly selected indices for
0
can be
3, 11, 15, 16, 22
and for
4
can be
6, 9, 7, 29, 27
Currently, the code below answers the above objective:
import numpy as np
import pandas as pd

np.random.seed(0)
dval = [4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,
        4,0,0,0,0,4,0,4,4,4,4,4,0]
df = pd.DataFrame(dict(nval=dval))
cgroup = 5
df = df.reset_index()
all_df = []
for idx in [0, 4]:
    x = df[df['nval'] == idx].reset_index(drop=True)
    ids = np.random.choice(len(x), size=cgroup, replace=False).tolist()
    all_df.append(x.iloc[ids].reset_index(drop=True))
df = pd.concat(all_df).reset_index(drop=True).sort_values(by=['index'])
sel_index = df[['index']]
This produces:
index
0 3
1 6
2 7
3 9
4 11
5 15
6 16
7 22
8 27
9 29
However, I wonder if there is a more compact way of doing this using pandas or numpy?

How about this:
import numpy as np
import pandas as pd

np.random.seed(0)
dval = [4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,4,0,0,0,0,4,0,4,4,4,4,4,0]
df = pd.DataFrame(dict(nval=dval))
df2 = df.groupby('nval').sample(5).reset_index()
print(df2)
output:
index nval
0 31 0
1 22 0
2 14 0
3 8 0
4 17 0
5 29 4
6 13 4
7 1 4
8 19 4
9 12 4
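Note: GroupBy.sample was added in pandas 1.1.0, so this approach requires a reasonably recent pandas version.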

IIUC, you can use
pd.DataFrame({'index': df.groupby('nval').sample(5).index.sort_values()})
I'd just keep the result as an index, so it simplifies to
df.groupby('nval').sample(5).index.sort_values()
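If reproducible picks are wanted, sample also accepts a random_state argument, so the seed can be passed directly instead of relying on np.random.seed. A minimal sketch (random_state=0 is an arbitrary choice here):
import pandas as pd

df = pd.DataFrame(dict(nval=[4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,
                             4,0,0,0,0,4,0,4,4,4,4,4,0]))
# 5 rows per nval group, seeded for reproducibility; keep only the original index
sel_index = df.groupby('nval').sample(5, random_state=0).index.sort_values()
print(sel_index)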

Related

How to exclude some string patterns when using filter on pandas?

dataframe
df.columns=['ipo_date','l2y_gg_date','l1k_kk_date']
Goal
Return a dataframe with the column names containing _date, except for ipo_date.
Try
df.filter(regex='_date&^ipo_date')
Try a negative lookbehind:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 21).reshape((5, 4)),
                  columns=['ipo_date', 'l2y_gg_date', 'l1k_kk_date', 'other'])
filtered = df.filter(regex=r'(?<!ipo)_date')
print(filtered)
Sample df:
ipo_date l2y_gg_date l1k_kk_date other
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
filtered:
l2y_gg_date l1k_kk_date
0 2 3
1 6 7
2 10 11
3 14 15
4 18 19
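An equivalent pattern anchors the whole name and uses a negative lookahead instead of a lookbehind; a sketch under the assumption that every wanted column name ends in _date:
# Keep names ending in _date that do not start with 'ipo'
filtered = df.filter(regex=r'^(?!ipo).*_date$')
This rejects any column whose name starts with 'ipo' and keeps the remaining *_date columns.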

Applying Pandas iterrows logic across many groups in a dataframe

I am having trouble applying some logic across my entire dataset. I am able to apply the logic to a small "group" but not to all of the groups (note: the groups are made by primaryFilter and secondaryFilter). Would you all mind pointing me in the right direction?
Entire Data
import pandas as pd
import numpy as np

myInput = {
    'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
    'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
    'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
    'someValue': [3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan
for index, row in df_test.iterrows():
    if index == 0:
        print("start")
        df_test.loc[0, 'newColumn'] = 0
    elif index == df_test.shape[0] - 1:
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
        print("end")
    else:
        print("inter")
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Here is the output of the test
I would now like to apply the above logic to the remaining groups: (100, 2), (100, 3), (200, 1), and so forth.
No need to use iterrows here. You can group the dataframe on the primaryFilter and secondaryFilter columns, then for each group take the cumulative sum of someValue and shift the result one position downwards (filling with 0) to obtain newColumn. Finally, subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
primaryFilter secondaryFilter constantValuePerGroup someValue newColumn delta
0 100 1 15 3 0 15
1 100 1 15 1 3 12
2 100 1 15 4 4 11
3 100 1 15 7 8 7
4 100 2 20 9 0 20
5 100 2 20 9 9 11
6 100 2 20 2 18 2
7 100 3 17 7 0 17
8 100 3 17 3 7 10
9 100 3 17 7 10 7
10 200 1 10 6 0 10
11 200 1 10 4 6 4
12 200 2 30 7 0 30
13 200 2 30 10 7 23
14 200 2 30 10 17 13
15 200 2 30 3 27 3
16 200 3 22 4 0 22
17 200 3 22 6 4 18
18 200 3 22 7 10 12
19 200 3 22 5 17 5
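As an aside, the apply/shift can be avoided entirely: a cumulative sum shifted down by one (filled with 0) equals the cumulative sum minus the current value, so the same result comes from purely vectorized operations. A sketch of that equivalence:
grouped = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue']
# cumsum shifted by one within each group == cumsum minus the current value
df_input['newColumn'] = grouped.cumsum() - df_input['someValue']
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']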

pandas add value to new column based on condition

I've been searching around but couldn't find the answer I was looking for, so I apologize for asking what I would imagine is a repetitive question.
I have two dataframes - df1 is a list of transaction data and df2 is a sort of key. df1['code'] references a column in df2.
If the code for the transaction found in df1 is in df2, I'd like to append a value to that df1 entry in a new column identifying that the transaction was valid. If the code is not in df2, I'd like to note the opposite in that same new column.
I understand how I might do this with a 'for' loop, but my understanding is I should learn how to use pandas without relying on that.
Thanks in advance for the help!
Use numpy.where():
df1['new_col'] = numpy.where(df1['df1_code'].isin(df2['df2_code']), 'VALID', 'INVALID')
Sample DF
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'code':range(5,15), 'transaction':range(10)})
>>> df2 = pd.DataFrame({'code':range(12,22), 'transaction':range(7,17)})
>>> df1
code transaction
0 5 0
1 6 1
2 7 2
3 8 3
4 9 4
5 10 5
6 11 6
7 12 7
8 13 8
9 14 9
>>> df2
code transaction
0 12 7
1 13 8
2 14 9
3 15 10
4 16 11
5 17 12
6 18 13
7 19 14
8 20 15
9 21 16
>>> df1['new_col'] = np.where(df1['code'].isin(df2['code']), 'VALID', 'INVALID')
>>> df1
code transaction new_col
0 5 0 INVALID
1 6 1 INVALID
2 7 2 INVALID
3 8 3 INVALID
4 9 4 INVALID
5 10 5 INVALID
6 11 6 INVALID
7 12 7 VALID
8 13 8 VALID
9 14 9 VALID
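For completeness, the same labelling can be done without numpy by mapping the boolean mask directly; np.where is arguably the cleaner idiom, but this avoids the extra dependency:
>>> df1['new_col'] = df1['code'].isin(df2['code']).map({True: 'VALID', False: 'INVALID'})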

How to subset pandas dataframe columns with idxmax output?

I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 40, size=(10, 4)), columns=range(4), index=range(10))
df.head()
0 1 2 3
0 27 10 13 21
1 25 12 23 8
2 2 24 24 34
3 10 11 11 10
4 0 15 0 27
I'm using the idxmax function to get, for each row, the column that contains the maximum value.
df_max = df.idxmax(1)
df_max.head()
0 0
1 0
2 3
3 1
4 3
How can I use df_max along with df to create a time series of values corresponding to the maximum value in each row of df? This is the output I want:
0 27
1 25
2 34
3 11
4 27
5 37
6 35
7 32
8 20
9 38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).
You may try df.lookup
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want a Series/DataFrame, pass the output to the Series/DataFrame constructor:
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0 27
1 25
2 34
3 11
4 27
dtype: int64
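Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions, the same row-wise pick can be done with plain positional indexing; a sketch of the equivalent:
import numpy as np
import pandas as pd

# Positional equivalent of df.lookup(df_max.index, df_max)
rows = df.index.get_indexer(df_max.index)
cols = df.columns.get_indexer(df_max)
values = pd.Series(df.to_numpy()[rows, cols], index=df_max.index)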

Compute histogram values with a groupby pandas dataframe - python

I want to group data from a dataframe using groupby, and then compute the histogram of the grouped data.
This is my dataframe:
indicator
key
14 1
14 2
14 3
15 1
16 2
16 5
16 6
17 1
18 3
And I want to get this result using groupby:
indicator
key
14 1,2,3
15 1
16 2,5,6
17 1
18 3
and then compute the histogram of every key
numpy.histogram cannot deal with an array of arrays. You need to format your data like this:
import numpy as np
import pandas as pd

dataf = pd.DataFrame()
dataf['key'] = range(14, 25)
dataf['indicator'] = [1,1,2,1,3,4,7,15,23,43,67]
# Append the extra rows for the duplicate keys
dataf.loc[11] = [14, 2]
dataf.loc[12] = [14, 3]
dataf.loc[13] = [16, 5]
dataf.loc[14] = [16, 6]
Because no raw data is provided, I can only assume the data can be reformatted like this.
In [30]: dataf
Out[30]:
key indicator
0 14 1
1 15 1
2 16 2
3 17 1
4 18 3
5 19 4
6 20 7
7 21 15
8 22 23
9 23 43
10 24 67
11 14 2
12 14 3
13 16 5
14 16 6
numpy.histogram already handles the grouping (binning) concept, so you don't need to use the groupby function on the DataFrame.
You just need to do np.histogram(dataf['indicator'])
FYI, if you want to plot a histogram, you can also use DataFrame.hist():
import matplotlib.pyplot as plt

dataf.indicator.hist()
plt.savefig('test.png')
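If a separate histogram per key is really needed, groupby can still drive numpy.histogram group by group; a minimal sketch (bins=3 is an arbitrary choice for illustration):
# One (counts, bin_edges) pair per key
hists = dataf.groupby('key')['indicator'].apply(lambda s: np.histogram(s, bins=3))
for key, (counts, edges) in hists.items():
    print(key, counts, edges)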
