There is a dataframe as follows:
id year number
1 2016 3
1 2017 5
2 2016 1
2 2017 5
...
I want to group by id and extract the rows where the value of the number column is at least 3 in both 2016 and 2017.
For example, for the first 4 rows above, the result is:
id year number
1 2016 3
1 2017 5
Thanks!
Compare by >= 3 and use GroupBy.transform to get a Series with the same size as the original, so it is possible to filter by boolean indexing:
df1 = df[(df["number"] >= 3).groupby(df["id"]).transform('all')]
#alternative for reassign mask to column
#df = df[df.assign(number= df["number"] >= 3).groupby("id")['number'].transform('all')]
print (df1)
id year number
0 1 2016 3
1 1 2017 5
Or use GroupBy.filter, but it can be slow for a large DataFrame or with many groups:
df1 = df.groupby("id").filter(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years = df.groupby("id").apply(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years
id
1 True
2 False
dtype: bool
>>> df.loc[lambda x: x["id"].map(great_in_both_years)]
id year number
0 1 2016 3
1 1 2017 5
Related
I have a table that looks similar to this:
user_id  date  count
1        2020      5
         2021      7
2        2017      1
3        2020      2
         2019      1
         2021      3
I'm trying to keep only the row for each user_id that has the greatest count, so it should look something like this:
user_id  date  count
1        2021      7
2        2017      1
3        2021      3
I've tried using df.groupby(level=0).apply(max), but it removes the date column from the final table, and I'm not sure how to modify it to keep all three original columns.
You can select only the count column after .groupby() and then use .apply() to generate a boolean Series indicating whether each entry in a group equals the group's maximum count. Then use .loc with that boolean Series to display the whole dataframe.
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that if there are multiple entries in one user_id that share the same greatest count, all of these entries will be kept.
In case you want to keep only one entry per user_id among such multiple entries with the greatest count, you can use the following logic instead:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()] because user_id is the row index. That code just returns all rows, exactly like the original unprocessed dataframe, because the index that idxmax() returns here is the user_id itself (instead of a simple RangeIndex 0, 1, 2, ...). When .loc then looks up those user_id labels, it simply returns every entry under each user_id.
Demo
Let's add more entries to the sample data and see the differences between the 2 solutions:
Our base df (user_id is the row index):
date count
user_id
1 2018 7 <=== max1
1 2020 5
1 2021 7 <=== max2
2 2017 1
3 2020 3 <=== max1
3 2019 1
3 2021 3 <=== max2
1st Solution result:
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
date count
user_id
1 2018 7
1 2021 7
2 2017 1
3 2020 3
3 2021 3
2nd Solution result:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
date count
user_id
1 2018 7
2 2017 1
3 2020 3
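As a further hedged sketch (not part of the original answer), the same one-row-per-user_id result can also be reached without resetting the index, assuming it is acceptable to keep an arbitrary row among ties:
# sort by count, keep the last (largest) row of each user_id group,
# then restore the original index order
df.sort_values('count').groupby(level=0).tail(1).sort_index()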
I have a data frame with over 1500 rows. A sample of the table looks like this:
Site 2019 2020 2021 ....
ABC 0 1 2
DEF 1 1 2
GHI 2 0 1
JKL 0 0 0
MNO 2 1 1
I want to create a new dataframe which only selects sites and years if they have:
a value in 2019
a 2019 value greater than or equal to the values in the following years
if there is a greater value in the next year, then use the value of the previous year
if the next year has a value less than the previous year, keep it as is
So the output for the example would be:
Site 2019 2020 2021 ....
DEF 1 1 1
GHI 2
MNO 2 1 1
DEF has a 1 in 2021 because there is a 1 in 2020.
I tried to use the following to find the rows with values in the 2019 column:
for i.j in df.iterrows():
if when j=2
if i >0
return value
but I get syntax errors
Without looping over the rows you can do:
df1 = df[(df[2019] > 0) & (df.loc[:, 2020:].min(axis=1) <= df.loc[:, 2019])].copy()  # .copy() avoids SettingWithCopyWarning below
cols = df1.columns.tolist()
for i in range(2, len(cols)):
    df1[cols[i]] = df1.loc[:, cols[i - 1: i + 1]].min(axis=1)
df1
Output:
2019 2020 2021
DEF 1 1 1
GHI 2 0 0
MNO 2 1 1
This should work as long as you don't have too many columns: add another comparison for each pair of years that needs to be compared. The result will reference the original df unless you use .copy() to make a deep copy.
new_df = df[(df['2019'] > 0) & (df['2019'] <= df['2020']) & (df['2020'] <= df['2021']) & (df['2021'] <= df['2022'])]
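If the chain of comparisons gets long, a hedged sketch of building the same mask in a loop (assuming the year columns are string-named and listed in chronological order) could look like this:
# build the chained year-over-year comparison programmatically
year_cols = ['2019', '2020', '2021', '2022']   # adjust to your actual columns
mask = df[year_cols[0]] > 0                    # must have a value in 2019
for prev, nxt in zip(year_cols, year_cols[1:]):
    mask &= df[prev] <= df[nxt]                # each year >= the previous one
new_df = df[mask].copy()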
I have DataFrame with column Sales.
How can I split it into 2 based on Sales value?
First DataFrame will have data with 'Sales' < s and second with 'Sales' >= s
You can use boolean indexing:
df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
A Sales
0 3 10
1 4 20
2 7 30
3 6 40
4 1 50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
df2 = df[df['Sales'] < s]
print (df2)
A Sales
0 3 10
1 4 20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
print (df2)
A Sales
0 3 10
1 4 20
print (mask)
0 False
1 False
2 True
3 True
4 True
Name: Sales, dtype: bool
print (~mask)
0 True
1 True
2 False
3 False
4 False
Name: Sales, dtype: bool
Using groupby you could split into two dataframes like
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]
In [1048]: df1
Out[1048]:
A Sales
2 7 30
3 6 40
4 1 50
In [1049]: df2
Out[1049]:
A Sales
0 3 10
1 4 20
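One caveat worth adding (my note, not part of the original answer): if every row falls on the same side of the condition, groupby yields only one group and the two-variable unpacking raises a ValueError. A slightly more defensive sketch:
# collect the groups into a dict keyed by the boolean condition,
# falling back to an empty frame when one side is missing
groups = {key: grp for key, grp in df.groupby(df['Sales'] < 30)}
df1 = groups.get(False, df.iloc[0:0])   # Sales >= 30, empty frame if absent
df2 = groups.get(True, df.iloc[0:0])    # Sales < 30, empty frame if absent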
Using groupby and a list comprehension:
Store all the split dataframes in a list variable and access each of the separated dataframes by its index.
DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]
accessing the separated DF like this:
ans[0]
ans[1]
ans[len(ans)-1] # this is the last separated DF
accessing a column of a separated DF like this:
ans_i_chr = ans[i].chr
One-liner using the walrus operator (Python 3.8+):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the method query:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
I like to use this for speeding up searches or rolling-average .apply(lambda x: ...) type functions, so I split big files into dictionaries of dataframes:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
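A short usage sketch (my addition), assuming df_dict was built as above:
# look up the sub-DataFrame for a single Sales value; None if that value is absent
df_30 = df_dict.get(30)

# or iterate over all groups
for sale_v, sub_df in df_dict.items():
    print(sale_v, len(sub_df))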
I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need Series.between with boolean indexing to filter first, and then groupby with aggregating size.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
# convert to datetime (the first number is the day, so pass dayfirst=True)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()   # pd.datetime was removed in pandas 1.0
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
# filter each one-year window, then count rows per ID
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If you need to compare each date against the last date of its group, shifted by a year offset in each direction, you need a custom function with the condition and a sum of the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years. See year 2015 in the output below.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the columns:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
I have a dataframe of "sentences", from which I wish to search for a keyword. Let's say that my keyword is just the letter 'A'. Sample data:
year | sentence | index
-----------------------
2015 | AAX | 0
2015 | BAX | 1
2015 | XXY | -1
2016 | AWY | 0
2017 | BWY | -1
That is, the "index" column shows the index of the first occurrence of "A" in each sentence (-1 if not found). I want to group the rows by their respective years, with a column showing the percentage of occurrences of 'A' in the records of each year. That is:
year | index
-------------
2015 | 0.667
2016 | 1.0
2017 | 0
I have a feeling that this involves agg or groupby in some fashion, but I'm not clear how to string these together. I've gotten as far as:
df.groupby("index").count()
But the issue here is that some kind of conditional count() is needed first, where we count the number of rows in year 201X containing 'A', then divide that by the total number of rows in year 201X.
You can use value_counts or GroupBy.size with boolean indexing:
(See also: What is the difference between size and count in pandas?)
df2 = df['year'].value_counts()
print (df2)
2015 3
2017 1
2016 1
Name: year, dtype: int64
df1 = df.loc[df['index'] != -1, 'year'].value_counts()
print (df1)
2015 2
2016 1
Name: year, dtype: int64
Or:
df2 = df.groupby('year').size()
print (df2)
year
2015 3
2016 1
2017 1
dtype: int64
df1 = df.loc[df['index'] != -1, ['year']].groupby('year').size()
print (df1)
year
2015 2
2016 1
dtype: int64
And lastly divide using div:
print (df1.div(df2, fill_value=0))
2015 0.666667
2016 1.000000
2017 0.000000
Name: year, dtype: float64
from __future__ import division
import pandas as pd
x_df = # your dataframe
y = x_df.groupby('year')['sentence'].apply(lambda x: sum(True if i.count('A') > 0 else False for i in x) / len(x))
# or
y = x_df.groupby('year')['index'].apply(lambda x: sum(True if i >= 0 else False for i in x) / len(x))
Using sentence to check
df.sentence.str.contains('A').groupby(df.year).mean()
year
2015 0.666667
2016 1.000000
2017 0.000000
Name: sentence, dtype: float64
Using the index column, which has already been checked:
df['index'].ne(-1).groupby(df.year).mean()
year
2015 0.666667
2016 1.000000
2017 0.000000
Name: index, dtype: float64
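The trick in both versions (my note) is that the mean of a boolean Series equals the fraction of True values:
import pandas as pd

s = pd.Series([True, True, False])
print(s.mean())   # 0.6666..., i.e. the share of True values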
There are different ways to do it, but no 'native' way as far as I know.
Here's one example, with only one groupby:
g = df.groupby('year')['index'].agg([lambda x: x[x>=0].count(), 'count'])
g['<lambda>'] / g['count']
Check also:
pandas dataframe groupby: sum/count of only positive numbers
Pandas Very Simple Percent of total size from Group by