Pandas: how to find duplicate rows within a group

Hi everyone, I am trying to find duplicate rows in a DataFrame grouped by two columns, and I don't understand how to do it. This attempt does not work:
df_part[df_part.income_flag==1].groupby(['app_id', 'month_num'])['amnt'].duplicate()
For example df:
So I want to see something like this:
With the code below I can see that the same 'amnt' value 0.387677 occurs in two different months; that is the information I need:
df_part[(df_part.income_flag==2) & df_part.duplicated(['app_id','amnt'], keep=False)].groupby(['app_id', 'amnt', 'month_num'])['month_num'].count().head(10)
app_id amnt month_num
0 0.348838 3 1
0.387677 6 1
10 2
0.426544 2 2
0.475654 2 1
0.488173 1 1
1 0.297589 1 1
4 1
0.348838 2 1
0.426544 8 3
Name: month_num, dtype: int64
Thanks all.

I think you need to chain another mask with & (bitwise AND) using DataFrame.duplicated, and then use GroupBy.size:
df = (df_part[(df_part.income_flag==1) & df_part.duplicated(['app_id','amnt'], keep=False)]
.groupby('app_id')['amnt']
.size()
.reset_index(name='duplicate_count'))
print (df)
app_id duplicate_count
0 12 2
1 13 3
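As a minimal, self-contained sketch of the approach above: the sample data below is invented for illustration (only the column names come from the question), and it was chosen so that the counts match the output shown.

```python
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
df_part = pd.DataFrame({
    "app_id":      [12, 12, 12, 13, 13, 13, 13],
    "month_num":   [1, 2, 3, 1, 2, 3, 4],
    "amnt":        [0.5, 0.5, 0.7, 0.2, 0.2, 0.2, 0.9],
    "income_flag": [1, 1, 1, 1, 1, 1, 1],
})

# keep=False marks every member of a duplicate set, not just the repeats.
mask = (df_part["income_flag"] == 1) & df_part.duplicated(["app_id", "amnt"], keep=False)

dup_counts = (
    df_part[mask]
    .groupby("app_id")["amnt"]
    .size()
    .reset_index(name="duplicate_count")
)
print(dup_counts)
```

Note that duplicated is evaluated on the whole frame here, so a row can be flagged because its twin has a different income_flag; filter first if that matters for your data.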

Related

Pivot table based on the first value of the group in Pandas

Have the following DataFrame:
I'm trying to pivot it in pandas and achieve the following format:
Actually I tried the classical approach with pd.pivot_table() but it does not work out:
pd.pivot_table(df,values='col2', index=[df.index], columns = 'col1')
I would appreciate any suggestions :) Thanks!

You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1 a b c
0 1 2 9
1 4 5 0
2 6 8 7

Another option is to create a Series of lists using groupby.agg; then construct a DataFrame:
out = df.groupby('col1')['col2'].agg(list).pipe(lambda x: pd.DataFrame(zip(*x), columns=x.index.tolist()))
Output:
A B C
0 1 2 9
1 4 5 0
2 6 8 7
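Since the original DataFrame is not shown, the following sketch assumes a long-format frame with columns col1 and col2, with values chosen to reproduce the output above. It numbers the rows within each col1 group using cumcount and pivots on that position; pos is a hypothetical helper column name.

```python
import pandas as pd

# Assumed input: the question's data is not shown, so these values are illustrative.
df = pd.DataFrame({
    "col1": list("abcabcabc"),
    "col2": [1, 2, 9, 4, 5, 0, 6, 8, 7],
})

# cumcount gives each row its 0-based position within its col1 group.
out = (
    df.assign(pos=df.groupby("col1").cumcount())
      .pivot(index="pos", columns="col1", values="col2")
)
print(out)
```

This variant only yields a fully populated table when every group has the same number of rows; shorter groups would leave NaN in the trailing positions.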

How to remove rows with multiple occurrences in a row with pandas

I have this data:
A
1 1
2 1
3 1
4 2
5 2
6 1
I expect to get:
A
1 1
- - -> (drop)
3 1
4 2
5 2
6 1
I want to drop all rows in column 'A' that belong to a run of consecutive equal values,
keeping only the first and the last row of each run.
So far I have used:
df = df.loc[df[col].shift() != df[col]]
but that also removes the last occurrence.
Sorry for my bad English, thanks in advance.

It looks like you have the same problem as this question: "Pandas drop_duplicates. Keep first AND last. Is it possible?"
The suggested solution is:
pd.concat([
df['A'].drop_duplicates(keep='first'),
df['A'].drop_duplicates(keep='last'),
])
Update after clarification:
First get the boolean masks for your described criteria:
is_last = df['A'] != df['A'].shift(-1)
is_duplicate = df['A'] == df['A'].shift()
And drop the rows based on these:
df.drop(df.index[~is_last & is_duplicate]) # note the ~ to negate is_last
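Put together on the data from the question, the two masks can be checked as follows. This is only a runnable sketch of the steps above; the index 1..6 mirrors the question's listing.

```python
import pandas as pd

# Data from the question, with the same 1-based index.
df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 1]}, index=range(1, 7))

# True where the next row holds a different value (end of a run).
is_last = df["A"] != df["A"].shift(-1)
# True where the previous row holds the same value (not the start of a run).
is_duplicate = df["A"] == df["A"].shift()

# Drop rows that are neither the first nor the last row of their run.
result = df.drop(df.index[~is_last & is_duplicate])
print(result)
```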

Basically you need to group consecutive numbers, which can be achieved by diff and cumsum:
print (df.groupby(df["A"].diff().ne(0).cumsum(), as_index=False).nth([0, -1]))
A
1 1
3 1
4 2
5 2
6 1
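To see why diff and cumsum group consecutive values, it can help to look at the run labels directly. The sketch below swaps nth for a plain boolean mask built from duplicated, which is an alternative I am suggesting rather than part of the answer; run_id is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 1]}, index=range(1, 7))

# A new run starts wherever the value changes; cumsum turns the starts into labels.
run_id = df["A"].diff().ne(0).cumsum()
print(run_id.tolist())

# Keep the first and the last row of every run.
first_of_run = ~run_id.duplicated(keep="first")
last_of_run = ~run_id.duplicated(keep="last")
result = df[first_of_run | last_of_run]
print(result)
```

Note that diff only works for numeric columns; for strings, comparing against shift() (df["A"].ne(df["A"].shift())) gives the same run starts.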

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?

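Have you tried
df.groupby('id').head(2)
Output generated (this MultiIndex layout comes from an older pandas version; recent versions keep the original flat index):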
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
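A runnable version using the DataFrame from the question is sketched below. As far as I know, recent pandas versions return the selected rows with their original index, so reset_index(drop=True) is only needed if you want a fresh 0..n-1 index.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# First two rows of each id, in the original row order.
top2 = df.groupby("id").head(2)
print(top2)

# Optional: renumber the rows.
flat = top2.reset_index(drop=True)
print(flat)
```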

Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
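For reference, here is a small sketch of the SeriesGroupBy.nlargest call combined with the reset_index step mentioned above, using the DataFrame from the question.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Two largest values per id; level 1 of the result is the original row index.
top2 = df.groupby("id")["value"].nlargest(2)

# Drop the original row index and keep only the id level.
top2_by_id = top2.reset_index(level=1, drop=True)
print(top2_by_id)
```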

Sorting the whole DataFrame up front can be very time-consuming.
Instead, we can group first and take the top k rows of each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)

df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False gives a result similar to nlargest, and ascending=True gives a result similar to nsmallest.
The argument passed to head plays the same role as the argument passed to nlargest: the number of rows to keep for each group.
The reset_index call is optional.

This also works when there are duplicated values.
If the top-n values contain duplicates and you want only unique values, you can do this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
To get distinct salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
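The same idea can be tried without downloading the file. The sketch below uses a small invented table (the Audit salaries match the rows shown above; the Sales rows are made up) and leaves out sort_index, since groupby already returns the groups in sorted order.

```python
import pandas as pd

# Illustrative data; the Sales rows are hypothetical.
df = pd.DataFrame({
    "department": ["Audit"] * 4 + ["Sales"] * 5,
    "salary": [110000, 100000, 100000, 70000,
               220000, 200000, 200000, 150000, 90000],
})

top3 = (
    df.groupby("department")["salary"]
      # De-duplicate within the group before taking the three largest values.
      .apply(lambda ser: ser.drop_duplicates().nlargest(3))
      .droplevel(level=1)   # drop the original row index
      .reset_index()
)
print(top3)
```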

To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
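The question also asks for something like SQL's row_number(). One option, sketched here as a suggestion rather than as part of the answer above, is GroupBy.cumcount, which numbers rows within each group starting from 0; row_number is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# 0-based position of each row within its id group (row_number() - 1 in SQL terms).
df["row_number"] = df.groupby("id").cumcount()

# First two rows of each group, selected through the row number.
N = 2
first_n = df[df["row_number"] < N]
print(first_n)
```

Like head and nth, this follows the current row order, so sort first if "top" should mean largest or smallest.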

Is there a way in pandas to create an integer in a new column if a row contains a specific string

For example, I have the following dataframe:
I want to transform the dataframe from above to something like this:
Thanks for any kind of help!

Run:
df['Number'] = df.svn_changes.str.match(r'r\d+').cumsum()

Yes, use str.contains with a regex and cumsum:
df = pd.DataFrame({'svn_changes':['r123456','RowValueRow','ValueRowValue',
'some_string_string','r234566','ValueRowValue',
'some_string_string','r123789','something_here',
'ValueRowValue','String_2','String_4']})
df['Number'] = df['svn_changes'].str.contains(r'r\d+').cumsum()
print(df)
Output:
svn_changes Number
0 r123456 1
1 RowValueRow 1
2 ValueRowValue 1
3 some_string_string 1
4 r234566 2
5 ValueRowValue 2
6 some_string_string 2
7 r123789 3
8 something_here 3
9 ValueRowValue 3
10 String_2 3
11 String_4 3
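Note that str.contains matches the pattern anywhere in the string. If the revision markers are always whole cell values such as r123456 (an assumption about the data, not something the question states), a stricter variant is str.fullmatch, sketched below on the same sample frame.

```python
import pandas as pd

df = pd.DataFrame({"svn_changes": ["r123456", "RowValueRow", "ValueRowValue",
                                   "some_string_string", "r234566", "ValueRowValue",
                                   "some_string_string", "r123789", "something_here",
                                   "ValueRowValue", "String_2", "String_4"]})

# True only when the whole cell is "r" followed by digits.
is_revision = df["svn_changes"].str.fullmatch(r"r\d+")

# Each revision row starts a new block; cumsum numbers the blocks.
df["Number"] = is_revision.cumsum()
print(df)
```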

Here's a simple reusable line you can use to do that:
df['new_col'] = df['old_col'].str.contains('string_to_match')*1
The new column will have value 1 if the string is present in this column, and 0 otherwise.
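A slightly more explicit variant of that one-liner is sketched below. The column names and sample values are placeholders; regex=False treats the pattern as a literal string, and na=False makes missing values count as "not present" instead of propagating NaN.

```python
import pandas as pd

# Placeholder data, including a missing value.
df = pd.DataFrame({"old_col": ["has string_to_match here", "nothing", None,
                               "string_to_match"]})

df["new_col"] = (
    df["old_col"]
    .str.contains("string_to_match", regex=False, na=False)
    .astype(int)   # True/False -> 1/0
)
print(df)
```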

How to change index and transposing in pandas

I'm new to pandas and am trying to reshape a DataFrame, but I've hit a dead end.
My DataFrame is:
entity_name request_status dcount
0 entity1 0 1
1 entity1 1 6
2 entity1 2 13
3 entity2 1 4
4 entity2 2 7
I need this dataframe to be like the following:
index 0 1 2
entity1 1 6 13
entity2 0 4 7
As shown, I take the entity_name column as the index (without duplicates), the column names from the request_status column, and the values from dcount.
Can anyone help me do that?
Many thanks

Regular pivot works as well:
df.pivot(values='dcount', index='entity_name', columns='request_status').fillna(0).astype(int)

You can use pivot_table:
a = pd.pivot_table(df, values = 'dcount', index='entity_name', columns='request_status').fillna(0)
a = a.astype(int)
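For completeness, here is a runnable sketch of the pivot approach on the data from the question. The missing (entity2, 0) combination becomes NaN after pivoting, which is why fillna(0) and the cast back to int are needed.

```python
import pandas as pd

df = pd.DataFrame({
    "entity_name": ["entity1", "entity1", "entity1", "entity2", "entity2"],
    "request_status": [0, 1, 2, 1, 2],
    "dcount": [1, 6, 13, 4, 7],
})

out = (
    df.pivot(index="entity_name", columns="request_status", values="dcount")
      .fillna(0)      # entity2 has no row for request_status 0
      .astype(int)
)
print(out)
```

pivot raises an error if an (entity_name, request_status) pair occurs more than once; pivot_table aggregates such duplicates instead (by mean, unless aggfunc is set).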
