There is a dataframe as follows:
id year number
1 2016 3
1 2017 5
2 2016 1
2 2017 5
...
I want to group by id and extract the rows where the value of the number column is at least 3 in both 2016 and 2017.
For example, for the first 4 rows above, the result is:
id year number
1 2016 3
1 2017 5
Thanks!
Compare by >= 3 and use GroupBy.transform to get a Series with the same size as the original, so it is possible to filter by boolean indexing:
df1 = df[(df["number"] >= 3).groupby(df["id"]).transform('all')]
#alternative for reassign mask to column
#df = df[df.assign(number= df["number"] >= 3).groupby("id")['number'].transform('all')]
print (df1)
id year number
0 1 2016 3
1 1 2017 5
Or use GroupBy.filter, but it can be slow for a large DataFrame or with many groups:
df1 = df.groupby("id").filter(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years = df.groupby("id").apply(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years
id
1 True
2 False
dtype: bool
>>> df.loc[lambda x: x["id"].map(great_in_both_years)]
id year number
0 1 2016 3
1 1 2017 5
Related
I have a table that looks similar to this:
user_id  date  count
1        2020      5
         2021      7
2        2017      1
3        2020      2
         2019      1
         2021      3
I'm trying to keep only the row for each user_id that has the greatest count, so it should look something like this:
user_id  date  count
1        2021      7
2        2017      1
3        2021      3
I've tried using df.groupby(level=0).apply(max), but it removes the date column from the final table, and I'm not sure how to modify it to keep all three original columns.
You can select only the count column after .groupby() and then use .apply() to generate a boolean Series indicating whether each entry in a group equals the group's maximum count. Then use .loc with that boolean Series to display the whole dataframe.
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that if there are multiple entries in one user_id that share the same greatest count, all of these entries will be kept.
In case you want to keep only one entry per user_id among such multiple entries with the greatest count, you can use the following logic instead:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()] because user_id is the row index. That code just returns all rows, exactly like the original unprocessed dataframe, because the index that idxmax() returns here is the user_id itself (instead of a simple RangeIndex 0, 1, 2, ...). When .loc then looks up those user_id labels, it simply returns every entry under each user_id.
Demo
Let's add more entries to the sample data and see the differences between the 2 solutions:
Our base df (user_id is the row index):
date count
user_id
1 2018 7 <=== max1
1 2020 5
1 2021 7 <=== max2
2 2017 1
3 2020 3 <=== max1
3 2019 1
3 2021 3 <=== max2
1st Solution result:
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
date count
user_id
1 2018 7
1 2021 7
2 2017 1
3 2020 3
3 2021 3
2nd Solution result:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
date count
user_id
1 2018 7
2 2017 1
3 2020 3
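As a further hedged sketch (not part of the original answer), the same one-row-per-user_id result can also be reached without resetting the index, assuming it is acceptable to keep an arbitrary row among ties:
# sort by count, keep the last (largest) row of each user_id group,
# then restore the original index order
df.sort_values('count').groupby(level=0).tail(1).sort_index()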
I have a data frame with over 1500 rows. A sample of the table looks like this:
Site 2019 2020 2021 ....
ABC 0 1 2
DEF 1 1 2
GHI 2 0 1
JKL 0 0 0
MNO 2 1 1
I want to create a new dataframe which only selects sites and years if they have:
a value in 2019
a 2019 value greater than or equal to the values in the following years
if there is a greater value in the next year, then use the value of the previous year
if the next year has a value less than the previous year, keep it as is
So the output for the example would be:
Site 2019 2020 2021 ....
DEF 1 1 1
GHI 2
MNO 2 1 1
DEF has a 1 in 2021 because there is a 1 in 2020.
I tried to use the following to find the rows with values in the 2019 column:
for i.j in df.iterrows():
if when j=2
if i >0
return value
but I get syntax errors
Without looping over the rows you can do:
df1 = df[(df[2019] > 0) & (df.loc[:, 2020:].min(axis=1) <= df.loc[:, 2019])].copy()  # .copy() avoids SettingWithCopyWarning below
cols = df1.columns.tolist()
for i in range(2, len(cols)):
    df1[cols[i]] = df1.loc[:, cols[i - 1: i + 1]].min(axis=1)
df1
Output:
2019 2020 2021
DEF 1 1 1
GHI 2 0 0
MNO 2 1 1
This should work as long as you don't have too many columns: add another comparison for each pair of years that needs to be compared. The result will reference the original df unless you use .copy() to make a deep copy.
new_df = df[(df['2019'] > 0) & (df['2019'] <= df['2020']) & (df['2020'] <= df['2021']) & (df['2021'] <= df['2022'])]
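If the chain of comparisons gets long, a hedged sketch of building the same mask in a loop (assuming the year columns are string-named and listed in chronological order) could look like this:
# build the chained year-over-year comparison programmatically
year_cols = ['2019', '2020', '2021', '2022']   # adjust to your actual columns
mask = df[year_cols[0]] > 0                    # must have a value in 2019
for prev, nxt in zip(year_cols, year_cols[1:]):
    mask &= df[prev] <= df[nxt]                # each year >= the previous one
new_df = df[mask].copy()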
I have DataFrame with column Sales.
How can I split it into 2 based on Sales value?
First DataFrame will have data with 'Sales' < s and second with 'Sales' >= s
You can use boolean indexing:
df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
A Sales
0 3 10
1 4 20
2 7 30
3 6 40
4 1 50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
df2 = df[df['Sales'] < s]
print (df2)
A Sales
0 3 10
1 4 20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
print (df2)
A Sales
0 3 10
1 4 20
print (mask)
0 False
1 False
2 True
3 True
4 True
Name: Sales, dtype: bool
print (~mask)
0 True
1 True
2 False
3 False
4 False
Name: Sales, dtype: bool
Using groupby you could split into two dataframes like
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]
In [1048]: df1
Out[1048]:
A Sales
2 7 30
3 6 40
4 1 50
In [1049]: df2
Out[1049]:
A Sales
0 3 10
1 4 20
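One caveat worth adding (my note, not part of the original answer): if every row falls on the same side of the condition, groupby yields only one group and the two-variable unpacking raises a ValueError. A slightly more defensive sketch:
# collect the groups into a dict keyed by the boolean condition,
# falling back to an empty frame when one side is missing
groups = {key: grp for key, grp in df.groupby(df['Sales'] < 30)}
df1 = groups.get(False, df.iloc[0:0])   # Sales >= 30, empty frame if absent
df2 = groups.get(True, df.iloc[0:0])    # Sales < 30, empty frame if absent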
Using groupby and a list comprehension:
Store all the split dataframes in a list variable and access each of the separated dataframes by its index.
DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]
accessing the separated DF like this:
ans[0]
ans[1]
ans[len(ans)-1] # this is the last separated DF
accessing a column of a separated DF like this:
ans_i_chr = ans[i].chr
One-liner using the walrus operator (Python 3.8+):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the method query:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
I like to use this for speeding up searches or rolling-average .apply(lambda x: ...) type functions, so I split big files into dictionaries of dataframes:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
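A short usage sketch (my addition), assuming df_dict was built as above:
# look up the sub-DataFrame for a single Sales value; None if that value is absent
df_30 = df_dict.get(30)

# or iterate over all groups
for sale_v, sub_df in df_dict.items():
    print(sale_v, len(sub_df))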
I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need Series.between with boolean indexing to filter first, and then groupby with aggregating size.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
# convert to datetime (the first number is the day, so pass dayfirst=True)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()   # pd.datetime was removed in pandas 1.0
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
# filter each one-year window, then count rows per ID
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If you need to compare each date against the last date of its group, shifted by a year offset in each direction, you need a custom function with the condition and a sum of the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years. See year 2015 in the output below.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the columns:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
I have a dataframe of "sentences", from which I wish to search for a keyword. Let's say that my keyword is just the letter 'A'. Sample data:
year | sentence | index
-----------------------
2015 | AAX | 0
2015 | BAX | 1
2015 | XXY | -1
2016 | AWY | 0
2017 | BWY | -1
That is, the "index" column shows the index of the first occurrence of "A" in each sentence (-1 if not found). I want to group the rows by their respective years, with a column showing the percentage of occurrences of 'A' in the records of each year. That is:
year | index
-------------
2015 | 0.667
2016 | 1.0
2017 | 0
I have a feeling that this involves agg or groupby in some fashion, but I'm not clear how to string these together. I've gotten as far as:
df.groupby("index").count()
But the issue here is that some kind of conditional count() is needed first, where we count the number of rows in year 201X containing 'A', then divide that by the total number of rows in year 201X.
You can use value_counts or GroupBy.size with boolean indexing:
(See also: What is the difference between size and count in pandas?)
df2 = df['year'].value_counts()
print (df2)
2015 3
2017 1
2016 1
Name: year, dtype: int64
df1 = df.loc[df['index'] != -1, 'year'].value_counts()
print (df1)
2015 2
2016 1
Name: year, dtype: int64
Or:
df2 = df.groupby('year').size()
print (df2)
year
2015 3
2016 1
2017 1
dtype: int64
df1 = df.loc[df['index'] != -1, ['year']].groupby('year').size()
print (df1)
year
2015 2
2016 1
dtype: int64
And lastly divide using div:
print (df1.div(df2, fill_value=0))
2015 0.666667
2016 1.000000
2017 0.000000
Name: year, dtype: float64
from __future__ import division
import pandas as pd
x_df = # your dataframe
y = x_df.groupby('year')['sentence'].apply(lambda x: sum(True if i.count('A') > 0 else False for i in x) / len(x))
# or
y = x_df.groupby('year')['index'].apply(lambda x: sum(True if i >= 0 else False for i in x) / len(x))
Using sentence to check
df.sentence.str.contains('A').groupby(df.year).mean()
year
2015 0.666667
2016 1.000000
2017 0.000000
Name: sentence, dtype: float64
Using the index column, which has already been checked:
df['index'].ne(-1).groupby(df.year).mean()
year
2015 0.666667
2016 1.000000
2017 0.000000
Name: index, dtype: float64
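The trick in both versions (my note) is that the mean of a boolean Series equals the fraction of True values:
import pandas as pd

s = pd.Series([True, True, False])
print(s.mean())   # 0.6666..., i.e. the share of True values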
There are different ways to do it, but no 'native' way as far as I know.
Here's one example, with only one groupby:
g = df.groupby('year')['index'].agg([lambda x: x[x>=0].count(), 'count'])
g['<lambda>'] / g['count']
Check also:
pandas dataframe groupby: sum/count of only positive numbers
Pandas Very Simple Percent of total size from Group by