I have a table that looks similar to this:
user_id  date  count
1        2020      5
         2021      7
2        2017      1
3        2020      2
         2019      1
         2021      3
I'm trying to keep only the row for each user_id that has the greatest count, so it should look something like this:
user_id  date  count
1        2021      7
2        2017      1
3        2021      3
I've tried using df.groupby(level=0).apply(max), but it removes the date column from the final table, and I'm not sure how to modify it to keep all three original columns.
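For anyone who wants to reproduce this, the sample dataframe can be rebuilt with user_id as the row index:

```python
import pandas as pd

# Rebuild the question's sample table; user_id is the row index
df = pd.DataFrame(
    {"date": [2020, 2021, 2017, 2020, 2019, 2021],
     "count": [5, 7, 1, 2, 1, 3]},
    index=pd.Index([1, 1, 2, 3, 3, 3], name="user_id"),
)
print(df)
```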
You can select only the count column after .groupby(), then use .apply() to build a boolean Series marking whether each entry equals the maximum count of its group. Finally, pass that boolean Series to .loc to filter the whole dataframe.
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that if multiple entries under one user_id share the same greatest count, all of them are kept.
In case of such ties, if you want to keep only one entry per user_id, you can use the following logic instead:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()], because user_id is the row index. That code returns all rows, unfiltered, just like the original dataframe. The reason is that the index idxmax() returns here is the user_id itself (rather than a simple RangeIndex 0, 1, 2, ...), so when .loc looks up those user_id labels, it returns every entry sharing the same user_id.
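A quick sketch of that pitfall, with the sample data rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"date": [2020, 2021, 2017, 2020, 2019, 2021],
     "count": [5, 7, 1, 2, 1, 3]},
    index=pd.Index([1, 1, 2, 3, 3, 3], name="user_id"),
)

idx = df.groupby(level=0)["count"].idxmax()
print(idx.tolist())      # the labels are the user_ids themselves: [1, 2, 3]
print(len(df.loc[idx]))  # .loc matches every row sharing those labels: 6
```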
Demo
Let's add more entries to the sample data and see the differences between the 2 solutions:
Our base df (user_id is the row index):
date count
user_id
1 2018 7 <=== max1
1 2020 5
1 2021 7 <=== max2
2 2017 1
3 2020 3 <=== max1
3 2019 1
3 2021 3 <=== max2
1st Solution result:
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
date count
user_id
1 2018 7
1 2021 7
2 2017 1
3 2020 3
3 2021 3
2nd Solution result:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
date count
user_id
1 2018 7
2 2017 1
3 2020 3
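For completeness, a third common pattern (not from the answers above) that also keeps exactly one row per group: sort by count, then take the last row of each group.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": [2020, 2021, 2017, 2020, 2019, 2021],
     "count": [5, 7, 1, 2, 1, 3]},
    index=pd.Index([1, 1, 2, 3, 3, 3], name="user_id"),
)

# After sorting by count, tail(1) keeps the highest-count row per user_id;
# on ties it keeps the last of the tied rows
result = df.sort_values("count").groupby(level=0).tail(1)
print(result.sort_index())
```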
Related
I am trying to remove the rows with NaN values from a pandas dataframe. When I do so, I want the row identifiers in the new dataframe to start from 0 and be one number apart from each other. By identifiers I mean the numbers at the left of the following example. Note that this is not an actual column of my df; it is placed there by default in every dataframe.
If my Df is like:
name toy born
0 a 1 2020
1 na 2 2020
2 c 5 2020
3 na 1 2020
4 asf 1 2020
What I want after dropna():
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
This is what I get instead, which I don't want:
name toy born
0 a 1 2020
2 c 5 2020
4 asf 1 2020
You can simply add df.reset_index(drop=True)
By default, df.dropna and df.reset_index do not operate in place, so the complete answer is as follows.
df = df.dropna().reset_index(drop=True)
Results and Explanations
The above code yields the following result.
>>> df = df.dropna().reset_index(drop=True)
>>> df
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
We pass drop=True to discard the old index instead of inserting it as a new column. Otherwise, the result would look like this.
>>> df = df.dropna().reset_index()
>>> df
index name toy born
0 0 a 1 2020
1 2 c 5 2020
2 4 asf 1 2020
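A self-contained run of the chain above, with the question's sample rebuilt:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", np.nan, "c", np.nan, "asf"],
                   "toy": [1, 2, 5, 1, 1],
                   "born": [2020] * 5})

# Drop NaN rows, then renumber the index from 0
clean = df.dropna().reset_index(drop=True)
print(clean)
#   name  toy  born
# 0    a    1  2020
# 1    c    5  2020
# 2  asf    1  2020
```

On pandas 2.0+ the same can be written in one call as df.dropna(ignore_index=True).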
I have a pandas dataframe with approx 60,000 records that looks like this:
ID P1 YEAR
0 20184045 MK 2020
1 20184045 GF 2020
2 20184011 EC 2020
3 20184011 MK 2020
4 20184011 EC 2020
5 20180673 GF 2020
Where ID is the ID of the record (8-digit integer), which has a P1 property that can take 10 distinct values (all are 2-char strings) and year is between 1995 and 2020. Each ID can have records that have between 1 and 5 different year values.
I want to obtain 2 additional dataframes:
one that gives me information about the number of distinct values of P1 for each year and each ID that would look like this:
ID YEAR NUMBER OF DISTINCT VALUES OF P1 FOR EACH YEAR
0 20184045 2020 n
1 20184045 2019
2 20184045 2018
3 20184045 2017
4 20184011 2020
5 20180673 2020
My second dataframe would count the total number of distinct values of P1 for each ID.
ID NUMBER OF DISTINCT VALUES OF P1 OVERALL
0 123 n1
1 456 n2
2 789 n3
3 987 n4
4 654 n1
5 321 n2
I tried looking up how to iterate over a dataframe with iterrows() and iteritems(), but I have been unable to find how to iterate over three columns at the same time while grouping by ID.
I've also looked into itertuples() which yields namedtuples and seemed more promising but I've been unable to find a satisfactory solution.
You can do this with two groupby operations (note the second one groups by ID, per your second requirement):
df1 = (df.groupby(['ID', 'YEAR'])['P1']
         .nunique()
         .reset_index(name='Number of Unique P1')
      )
df2 = (df.groupby('ID')['P1']
         .nunique()
         .reset_index(name='Number of Unique P1')
      )
I have a dataframe with over 1,500 rows. A sample of the table looks like this:
Site 2019 2020 2021 ....
ABC 0 1 2
DEF 1 1 2
GHI 2 0 1
JKL 0 0 0
MNO 2 1 1
I want to create a new dataframe which only selects sites and years if they have:
- a value in 2019
- a 2019 value greater than or equal to the values in the next years
- if there is a greater value in a next year, the value of the previous year is used instead
- if the next year has a value less than the previous year, it is kept
So the output for the example would be:
Site 2019 2020 2021 ....
DEF 1 1 1
GHI 2
MNO 2 1 1
DEF has got a 1 in 2021 because there is a 1 in 2020.
I tried to use the following to find the rows with values in the 2019 column:
for i.j in df.iterrows():
if when j=2
if i >0
return value
but I get syntax errors.
Without looping over the rows you can do:
df1 = df[(df[2019] > 0) & (df.loc[:, 2020:].min(axis=1) <= df.loc[:, 2019])]
cols = df1.columns.tolist()
for i in range(2, len(cols)):
    df1[cols[i]] = df1.loc[:, cols[i - 1: i + 1]].min(axis=1)
df1
Output:
2019 2020 2021
DEF 1 1 1
GHI 2 0 0
MNO 2 1 1
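The column loop is effectively a running minimum, so it can also be written with cummin. A sketch, assuming integer column labels as above; note this caps every year including 2020, which matches the stated rules but differs from the loop if a 2020 value exceeds its 2019 value:

```python
import pandas as pd

df = pd.DataFrame({2019: [0, 1, 2, 0, 2],
                   2020: [1, 1, 0, 0, 1],
                   2021: [2, 2, 1, 0, 1]},
                  index=pd.Index(["ABC", "DEF", "GHI", "JKL", "MNO"], name="Site"))

# Keep sites with a 2019 value and a later value not exceeding all of 2019
kept = df[(df[2019] > 0) & (df.loc[:, 2020:].min(axis=1) <= df[2019])]

# Running minimum left-to-right across the year columns
result = kept.cummin(axis=1)
print(result)
```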
This should work as long as you don't have too many columns; add another comparison for each pair of years that needs to be compared. Note that new_df will be a reference to the original df unless you use .copy() to make a deep copy.
new_df = df[(df['2019'] > 0) & (df['2019'] <= df['2020']) & (df['2020'] <= df['2021']) & (df['2021'] <= df['2022'])]
There is a dataframe as follows:
id year number
1 2016 3
1 2017 5
2 2016 1
2 2017 5
...
I want to extract the rows, grouped by id, where the value of the number column is at least 3 in both 2016 and 2017.
For example, for the first 4 rows above, the result is:
id year number
1 2016 3
1 2017 5
Thanks!
Compare by >=3 and use GroupBy.transform to get a Series the same size as the original, which makes filtering by boolean indexing possible:
df1 = df[(df["number"] >= 3).groupby(df["id"]).transform('all')]
#alternative for reassign mask to column
#df = df[df.assign(number= df["number"] >= 3).groupby("id")['number'].transform('all')]
print (df1)
id year number
0 1 2016 3
1 1 2017 5
Or use filter, though it can be slow on a large DataFrame or with many groups:
df1 = df.groupby("id").filter(lambda x: (x["number"] >= 3).all())
Another option: build a per-id boolean with apply, then map it back onto the rows:
>>> great_in_both_years = df.groupby("id").apply(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years
id
1 True
2 False
dtype: bool
>>> df.loc[lambda x: x["id"].map(great_in_both_years)]
id year number
0 1 2016 3
1 1 2017 5
I am new to Python.
I have a dataframe with two columns: one is an ID column and the other holds the year and count information related to the ID.
I want to convert this format into multiple rows with the same ID.
The current dataframe looks like:
ID information
1 2014:Total:0, 2015:Total:1, 2016:Total:2
2 2017:Total:3, 2018:Total:1, 2019:Total:2
I expect the converted dataframe should like this:
ID Year Value
1 2014 0
1 2015 1
1 2016 2
2 2017 3
2 2018 1
2 2019 2
I tried to use the str.split method of the pandas dataframe, but had no luck.
Any suggestions would be appreciated.
Let us use explode :-) (new in pandas 0.25.0):
df.information = df.information.str.split(', ')
Yourdf = df[['ID']].join(df.information.explode().str.split(':', expand=True).drop(1, axis=1))
Yourdf
ID 0 2
0 1 2014 0
0 1 2015 1
0 1 2016 2
1 2 2017 3
1 2 2018 1
1 2 2019 2
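To get the exact column names from the expected output, the same explode pipeline can be extended with a rename. A sketch; note the Year and Value columns stay strings unless cast:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2],
                   "information": ["2014:Total:0, 2015:Total:1, 2016:Total:2",
                                   "2017:Total:3, 2018:Total:1, 2019:Total:2"]})

# Split the comma-separated entries, explode to one row each,
# split on ':', drop the 'Total' piece, and rename the columns
out = (df[["ID"]]
       .join(df["information"].str.split(", ").explode()
             .str.split(":", expand=True)
             .drop(columns=1)
             .rename(columns={0: "Year", 2: "Value"}))
       .reset_index(drop=True))
print(out)
```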
Try using the code below; unlike @WenYoBen's answer, this works on much older pandas versions as well (the original .T before .values.flatten() interleaved the two rows and mismatched IDs and years, so it is removed here):
df2 = pd.DataFrame(df['information'].str.split(', ', expand=True).apply(lambda x: x.str.split(':')).values.flatten().tolist(), columns=['Year', '', 'Value']).iloc[:, [0, 2]]
print(pd.DataFrame(sorted(df['ID'].tolist() * (len(df2) // 2)), columns=['ID']).join(df2))
Output:
   ID  Year Value
0   1  2014     0
1   1  2015     1
2   1  2016     2
3   2  2017     3
4   2  2018     1
5   2  2019     2