Get index of rows after groupby and nlargest - python

I have a large dataframe where I want to use groupby and nlargest to look for the second, third, fourth and fifth largest value of each group. I have over 500 groups, each with over 1000 values. I also have other columns in the dataframe that I want to keep after applying groupby and nlargest. My dataframe looks like this:
df = pd.DataFrame({
'group': [1,2,3,3,4, 5,6,7,7,8],
'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]
})
To look for the second, third, fourth largest and so on of each group for column a I use
secondlargest = df.groupby(['group'], as_index=False)['a'].apply(lambda grp: grp.nlargest(2).min())
which returns
   group   a
0      1   4
1      2   5
2      3   1
3      4   2
4      5  20
5      6  10
6      7  40
7      8  30
I need columns b and c present in this resulting dataframe. I use the following to subset the original dataframe but it returns an empty dataframe. How should I modify the code?
secondsubset = df[df.groupby(['group'])['a'].apply(lambda grp: grp.nlargest(2).min())]

If I understand your goal correctly, you should be able to just drop as_index=False, use idxmin instead of min, and pass the result to df.loc:
df.loc[df.groupby('group')['a'].apply(lambda grp: grp.nlargest(2).idxmin())]
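Since the question also asks for the third, fourth and fifth largest, here is a small parameterized sketch of the same idea (the helper name nth_largest_rows is just for illustration):
def nth_largest_rows(df, col, n):
    # index label of the n-th largest value of `col` within each group;
    # groups with fewer than n rows fall back to their smallest value,
    # which is how nlargest(n).idxmin() already behaves above
    idx = df.groupby('group')[col].apply(lambda grp: grp.nlargest(n).idxmin())
    return df.loc[idx]

secondsubset = nth_largest_rows(df, 'a', 2)
fifthsubset = nth_largest_rows(df, 'a', 5)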

You can use agg with a lambda; it is neater:
df.groupby('group').agg(lambda grp: grp.nlargest(2).min())

Related

Determine if Age is between two values

I am trying to determine which ages in a dataframe fall between 0 and 10. I have written the following, but it only returns 'Yes' even though not all values fall between 0 and 10:
x = df['Age']
for i in x:
    if df['Age'].between(0, 10, inclusive=True).any():
        print('Yes')
    else:
        print('No')
I am doing this with the intention of creating a new column in the dataframe that will categorize people based on whether they fall into an age group, i.e., 0-10, 11-20, etc...
Thanks for any help!
If you want to create a new column, assign to the column:
df['Child'] = df['Age'].between(0, 10, inclusive=True)
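One caveat about the between call above: on pandas 1.3 and later the boolean form of inclusive is deprecated (and later removed), so on a recent version you would instead write:
# string form of `inclusive` required on newer pandas versions
df['Child'] = df['Age'].between(0, 10, inclusive='both')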
"with the intention of creating a new column in the dataframe that will categorize people based on whether they fall into an age group, i.e., 0-10, 11-20, etc..."
Then pd.cut is what you are looking for:
pd.cut(df['Age'], list(range(0, df['Age'].max() + 10, 10)))
For example:
df['Age'] = pd.Series([10, 7, 15, 24, 66, 43])
then the above gives you:
0     (0, 10]
1     (0, 10]
2    (10, 20]
3    (20, 30]
4    (60, 70]
5    (40, 50]
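If the goal is a labeled age-group column, here is a minimal sketch (labels like '0-10', '11-20' and the column name AgeGroup are assumptions about the desired format):
import pandas as pd

df = pd.DataFrame({'Age': [10, 7, 15, 24, 66, 43]})

# bin edges 0, 10, 20, ... up to just past the maximum age
edges = list(range(0, df['Age'].max() + 10, 10))

# first bin covers 0-10 (include_lowest keeps 0 in it), later bins are 11-20, 21-30, ...
labels = ['0-10'] + [f'{lo + 1}-{hi}' for lo, hi in zip(edges[1:-1], edges[2:])]

df['AgeGroup'] = pd.cut(df['Age'], bins=edges, labels=labels, include_lowest=True)
print(df)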

Column wise multiplication of a set of Dataframe with elements of other Dataframe

I have a few pandas DataFrames (say a, b, c), and another DataFrame (call it x) whose number of rows equals the number of DataFrames above. Sample data for all of them is given below.
How can I multiply the entire first column of the first dataframe (a) with x[0][0]?
Then the second column of the dataframe a (bb_11) with x[0][1].
Then the third column of the dataframe a (cc_12) with x[0][2], and so on.
For dataframe b, we should use the second row of x, i.e. x[1][j] (where j varies from 0-3). That is, I need to multiply the entire first column of the second dataframe (b) with x[1][0].
Then the second column of the dataframe b (bb_11) with x[1][1].
Then the third column of the dataframe b (cc_12) with x[1][2], and so on.
N.B. All dataframe column names will be the same. Our dataframes will be read from a file so looping will be an easier option.
Sample Data:
import pandas as pd
import numpy as np
d = {'aa_10' : pd.Series([np.nan, 2, 3, 4]),
'bb_11' : pd.Series([6, np.nan, 8, 9]),
'cc_12' : pd.Series([1, 2, np.nan, 4]),
'dd_13' : pd.Series([6, 7, 8, np.nan])}
# creates Dataframe.
a = pd.DataFrame(d)
# print the data.
print (a)
# Initialize data to Dicts of series.
d = {'aa_10' : pd.Series([np.nan, 12, 13, 14]),
'bb_11' : pd.Series([16, np.nan, 18, 19]),
'cc_12' : pd.Series([11, 12, np.nan, 14]),
'dd_13' : pd.Series([16, 17, 18, np.nan])}
# creates Dataframe.
b = pd.DataFrame(d)
# print the data.
print(b)
# Initialize data to Dicts of series.
d = {'aa_10' : pd.Series([np.nan, 21, 31, 41]),
'bb_11' : pd.Series([61, np.nan, 81, 91]),
'cc_12' : pd.Series([11, 21, np.nan, 41]),
'dd_13' : pd.Series([61, 71, 81, np.nan])}
# creates Dataframe.
c = pd.DataFrame(d)
# print the data.
print(c)
# Initialize data to Dicts of series.
d = {'aa_10' : pd.Series([1, 2, 3]),
'bb_11' : pd.Series([6, 7, 8]),
'cc_12' : pd.Series([10, 11, 12]),
'dd_13' : pd.Series([13, 14, 15])}
# creates Dataframe.
x = pd.DataFrame(d)
# print the data.
print(x)
You can do that with zip and DataFrame.mul:
# multiply each dataframe column-wise by the matching row of x
a, b, c = [frame.mul(row) for frame, row in zip([a, b, c], x.values.tolist())]
a
   aa_10  bb_11  cc_12  dd_13
0    NaN   36.0   10.0   78.0
1    2.0    NaN   20.0   91.0
2    3.0   48.0    NaN  104.0
3    4.0   54.0   40.0    NaN
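Since the question says the dataframes are read from files, here is a loop-based sketch (the file names are hypothetical, and x is assumed to have one row per file in matching order):
import pandas as pd

files = ['a.csv', 'b.csv', 'c.csv']   # hypothetical file names
x = pd.read_csv('x.csv')              # one row per dataframe, same column order

results = []
for i, fname in enumerate(files):
    frame = pd.read_csv(fname)
    # multiply each column of the i-th dataframe by the matching value in row i of x
    results.append(frame.mul(x.iloc[i].values))

a, b, c = results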

groupby column if value is less than some value

I have a dataframe like
df = pd.DataFrame({'time': [1, 5, 100, 250, 253, 260, 700], 'qty': [3, 6, 2, 5, 64, 2, 5]})
df['time_delta'] = df.time.diff()
and I would like to group by time_delta such that all consecutive rows where time_delta is less than 10 are grouped together, the time_delta column can be dropped, and qty is summed.
The expected result is
pd.DataFrame({'time': [1, 100, 250, 700], 'qty': [9, 2, 71, 5]})
Basically I am hoping there is something like a df.groupby(time_delta_func(10)).agg({'time': 'min', 'qty': 'sum'}) func. I read up on pd.Grouper but it seems like the grouping based on time is very strict and interval based.
You can do it with gt (greater than) and cumsum to create a new group each time the time delta is greater than 10:
res = (
    df.groupby(df['time_delta'].gt(10).cumsum(), as_index=False)
      .agg({'time': 'first', 'qty': 'sum'})
)
print(res)
   time  qty
0     1    9
1   100    2
2   250   71
3   700    5
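To see why this works, you can inspect the intermediate grouping key, a quick sketch using the sample data above:
# each True (time_delta > 10) starts a new group; cumsum turns that into group ids
key = df['time_delta'].gt(10).cumsum()
print(key.tolist())  # expected: [0, 0, 1, 2, 2, 2, 3]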

checking if values in a pandas df are inside lists in columns of a second df

I have 2 dataframes
df1: has 4 columns; each cell holds a list of values
df2: has one column (col) with a single value per row
I want to check whether any of the values in df2(col) appear inside any of the lists in df1(col1) or df1(col2), and if so keep that row of df1 (all 4 columns).
Here is some random data to make an example:
df1 = pd.DataFrame({'col1': [[32, 24, 5, 6], [4, 8, 14],
[12, 32, 234, 15, 6], [45]],
'col2': [[13, 333 ,5], [32, 28, 5, 9],
[4], [12, 45, 21]],
'col3': [['AS', 'EWE', 'SADF', 'EW'],
['EW', 'HHT', 'IYT'], ['C', 'KJG', 'TF', 'VC', 'D'], ['BX']],
'col4': [['HG', 'FDGD' ,'F'], ['FDG', 'Y', 'FS', 'RT'],
['T'], ['XC', 'WE', 'TR']]
})
df2 = pd.DataFrame({'col': [1, 333, 8, 11, 45]})
df1:
col1 col2 col3 col4
0 [32, 24, 5, 6] [13, 333, 5] [AS, EWE, SADF, EW] [HG, FDGD, F]
1 [4, 8, 14] [32, 28, 5, 9] [EW, HHT, IYT] [FDG, Y, FS, RT]
2 [12, 32, 234, 15, 6] [4] [C, KJG, TF, VC, D] [T]
3 [45] [12, 45, 21] [BX] [XC, WE, TR]
df2:
col
0 1
1 333
2 8
3 11
4 45
This code works fine, but I am working with big data, so it takes a long time to finish.
So I am wondering if there is any way to optimize it.
df3 = pd.DataFrame(columns=df1.columns)  # accumulator for matching rows
for index, row in df1.iterrows():
    if any(itm in row['col1'] for itm in df2['col']):
        df3 = df3.append(row)
    elif any(itm in row['col2'] for itm in df2['col']):
        df3 = df3.append(row)
And this is what the output would look like:
col1 col2 col3 col4
0 [32, 24, 5, 6] [13, 333, 5] [AS, EWE, SADF, EW] [HG, FDGD, F]
1 [4, 8, 14] [32, 28, 5, 9] [EW, HHT, IYT] [FDG, Y, FS, RT]
3 [45] [12, 45, 21] [BX] [XC, WE, TR]
The output can be either a new df or a column in df1 with '1' or '0' indicating whether or not the value is in either of the two columns.
UPDATE:
Following cs95's approach, I was able to improve the performance of the code.
My previous code took 55 s; with his approach it takes only 8 ms, a speedup of roughly 6,900x.
Sure, we can use set lookups to speed this up:
lookup = {*df2['col']}
df1[~df1[['col1', 'col2']].applymap(lookup.isdisjoint).all(axis=1)]
col1 col2 col3 col4
0 [32, 24, 5, 6] [13, 333, 5] [AS, EWE, SADF, EW] [HG, FDGD, F]
1 [4, 8, 14] [32, 28, 5, 9] [EW, HHT, IYT] [FDG, Y, FS, RT]
3 [45] [12, 45, 21] [BX] [XC, WE, TR]
Dealing with columns of lists is hard. We can make things easier by recognizing we can use applymap, since every cell in df1['col1'] and df1['col2'] has to undergo the same check (a lookup against df2['col']). Then a little boolean logic determines which rows to drop, and you have the final result.
Your code contains a double whammy with the use of iterrows and append. Never iterate over a DataFrame because it is slow and wastes memory, and never grow a DataFrame for the same reasons.
lookup
# {1, 8, 11, 45, 333}
# get cells that have no elements in common
df1[['col1', 'col2']].applymap(lookup.isdisjoint)
col1 col2
0 True False
1 False True
2 True True
3 False False
# get rows where neither column shares an element with the lookup set
df1[['col1', 'col2']].applymap(lookup.isdisjoint).all(axis=1)
0 False
1 False
2 True
3 False
dtype: bool
# invert the condition to get rows to keep
~df1[['col1', 'col2']].applymap(lookup.isdisjoint).all(axis=1)
0 True
1 True
2 False
3 True
dtype: bool
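If you prefer the 0/1 indicator column mentioned at the end of the question rather than a filtered frame, a minimal sketch building on the same mask (the column name match is an assumption):
# 1 if any value of df2['col'] appears in col1 or col2 of that row, else 0
mask = ~df1[['col1', 'col2']].applymap(lookup.isdisjoint).all(axis=1)
df1['match'] = mask.astype(int)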

Pandas compare items in list in one column with single value in another column

Consider this two-column df. I would like to create an apply function that compares each item in the "other_yrs" list with the single integer in the "cur" column and counts how many items in the list are less than or equal to the value in "cur". I cannot figure out how to get pandas to do this with apply. I am using apply functions for other purposes and they are working well. Any ideas would be very appreciated.
cur other_yrs
1 11 [11, 11]
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0]
4 16 [15, 85]
5 17 [17, 17, 16]
6 13 [8, 8]
Below is the function I used to extract the values into the "other_yrs" column. I am thinking I can just insert into this function some way of comparing each successive value in the list with the "cur" column value and keep count. I really only need to store the count of how many of the list items are <= the value in the "cur" column.
def col_check(col_string):
    cs_yr_lst = []
    count = 0
    if len(col_string) < 1:  # avoids col values of 0, meaning no other cases
        pass
    else:
        case_lst = col_string.split(", ")  # splits the string of cases into a list
        for i in case_lst:
            cs_yr = int(i[3:5])  # gets the case year from each individual case number
            cs_yr_lst.append(cs_yr)  # stores those integers in a list, assigned to a new column via apply
    return cs_yr_lst
The expected output would be this:
cur other_yrs count
1 11 [11, 11] 2
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0] 11
4 16 [15, 85] 1
5 17 [17, 17, 16] 3
6 13 [8, 8] 2
Use zip inside a list comprehension to pair up the cur and other_yrs columns, then use np.sum on the boolean mask:
import numpy as np

df['count'] = [np.sum(np.array(b) <= a) for a, b in zip(df['cur'], df['other_yrs'])]
Another idea:
df['count'] = pd.DataFrame(df['other_yrs'].tolist(), index=df.index).le(df['cur'], axis=0).sum(1)
Result:
cur other_yrs count
1 11 [11, 11] 2
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0] 11
4 16 [15, 85] 1
5 17 [17, 17, 16] 3
6 13 [8, 8] 2
You can consider explode and compare, then group on level=0 and sum:
u = df.explode('other_yrs')
df['Count'] = u['cur'].ge(u['other_yrs']).sum(level=0).astype(int)
print(df)
cur other_yrs Count
1 11 [11, 11] 2
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0] 11
4 16 [15, 85] 1
5 17 [17, 17, 16] 3
6 13 [8, 8] 2
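Since the question specifically asks how to do this with apply, here is a row-wise apply sketch (slower than the vectorized answers above, but it follows the pattern the asker is already using):
# count how many values in other_yrs are <= the row's cur value
df['count'] = df.apply(lambda row: sum(v <= row['cur'] for v in row['other_yrs']), axis=1)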
If both dataframes contain millions of records and you have to compare each element in the first column with all the elements in the second column, the following code might be helpful.
for element in Dataframe1.Column1:
    Dataframe2[Dataframe2.Column2.isin([element])]
The snippet above returns, one at a time, the rows of Dataframe2 where the element from Dataframe1 is found in Dataframe2.Column2.
