How to label duplicated groups in a pandas dataframe - python

Based on this problem: find duplicated groups in dataframe and this dataframe
df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
})
How can i mark in this dataframe in the additional column duplicated the different duplicate groups (in the value columns) by unique label, like "1" for one duplicated group, "2" for the next and so on? I found examples here on slack to identify them as false and true, but one only with "ngroup", but did not work.
My real example has 20+ columns and also NaNs in between. I have created the wide format by pivot_table from original long format, since i thought getting duplicated entries is the better from wide. Duplicates should be found in N-1 columns, which names I summarize by using subset on a list comprehension excluding this identifier column
That is what i had so far:
df = df_long.pivot_table(index="Y",columns="Z",values="value").reset_index()
subset = [c for c in df.columns if not c=="id"]
df = df.loc[df.duplicated(subset=subset,keep=False)].copy()
We use pandas 0.22, if that does matter.
The problem is, that when I use
for i, group in df.groupby(subset):
print(group)
I basically don't get back any group.

Use groupby_ngroup as suggested by #Chris:
df['duplicated'] = df.groupby(df.filter(like='value').columns.tolist()).ngroup()
print(df)
# Output:
id value1 value2 value3 duplicated
0 A 1 1 1 0 # Group 0 (all 1)
1 A 2 2 2 1
2 A 3 3 3 2
3 A 4 4 4 3
4 B 1 1 1 0 # Group 0 (all 1)
5 B 2 2 2 1
6 C 1 1 1 0 # Group 0 (all 1)
7 C 2 2 2 1
8 C 3 3 3 2
9 C 4 4 4 3
10 D 1 1 1 0 # Group 0 (all 1)
11 D 2 2 2 1
12 D 3 3 3 2

Ok the last comment above was the correct hint: The NaNs in my real data are the problems, which also groupby does not allow for identifying groups. By using fillna() before using groupby, the groups can be identified and ngroup does add me the group numbers.
df['duplicated'] = df.fillna(-1).groupby(df.filter(like='value').columns.tolist()).ngroup()

Related

Compare two dataframes for missing rows based on multiple columns python

I want to compare two dataframes that have similar columns(not all) and print a new dataframe that shows the missing rows of df1 compare to df2 and a second dataframe that shows this time the missing values of df2 compare to df1 based on given columns.
Here the "key_columns" are named key_column1 and key_column2
import pandas as pd
data1 = {'first_column': ['4', '2', '7', '2', '2'],
'second_column': ['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '2', '2', '1', '1'],
'fourth_column':['1', '2', '2', '2', '2'],
'other':['1', '2', '3', '2', '2'],
}
df1 = pd.DataFrame(data1)
data2 = {'first': ['1', '2', '2', '2', '2'],
'second_column': ['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '5', '2', '2', '2'],
'fourth_column':['1', '2', '2', '2', '2'],
'other2':['1', '4', '3', '2', '2'],
'other3':['6', '8', '1', '4', '2'],
}
df2 = pd.DataFrame(data2)
I have modified the data1 and data2 dictionaries so that the resulting dataframes have only same columns to demonstrate that the solution provided in the answer by Emi OB relies on existence of columns in one dataframe which are not in the other one ( in case a common column is used the code fails with KeyError on the column chosen to collect NaNs). Below an improved version which does not suffer from that limitation creating own columns for the purpose of collecting NaNs:
df1['df1_NaNs'] = '' # create additional column to collect NaNs
df2['df2_NaNs'] = '' # create additional column to collect NaNs
df1_s = df1.merge(df2[['key_column1', 'key_column2', 'df2_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df2 = df2.drop(columns=["df2_NaNs"]) # clean up df2
df1_s = df1_s.loc[df1_s['df2_NaNs'].isna(), df1.columns]
df1_s = df1_s.drop(columns=["df1_NaNs"]) # clean up df1_s
print(df1_s)
print('--------------------------------------------')
df2_s = df2.merge(df1[['key_column1', 'key_column2', 'df1_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df1 = df1.drop(columns=["df1_NaNs"]) # clean up df1
df2_s = df2_s.loc[df2_s['df1_NaNs'].isna(), df2.columns]
df2_s = df2_s.drop(columns=["df2_NaNs"]) # clean up df2_s
print(df2_s)
gives:
first second_column key_column1 key_column2 fourth_column
1 2 2 3 2 2
3 2 2 6 1 2
4 2 2 4 1 2
--------------------------------------------
first second_column key_column1 key_column2 fourth_column
1 2 2 3 5 3
3 2 2 6 2 5
4 2 2 4 2 6
Also the code below works in case the columns of both dataframes are the same and in addition saves memory and computation time by not creating temporary full-sized dataframes required to achieve the final result:
""" I want to compare two dataframes that have similar columns(not all)
and print a new dataframe that shows the missing rows of df1 compare to
df2 and a second dataframe that shows this time the missing values of
df2 compare to df1 based on given columns. Here the "key_columns"
"""
import pandas as pd
#data1 ={ 'first_column':['4', '2', '7', '2', '2'],
data1 = { 'first':['4', '2', '7', '2', '2'],
'second_column':['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '2', '2', '1', '1'],
'fourth_column':['1', '2', '2', '2', '2'],
# 'other':['1', '2', '3', '2', '2'],
}
df1 = pd.DataFrame(data1)
#print(df1)
data2 = { 'first':['1', '2', '2', '2', '2'],
'second_column':['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '5', '2', '2', '2'],
# 'fourth_column':['1', '2', '2', '2', '2'],
'fourth_column':['2', '3', '4', '5', '6'],
# 'other2':['1', '4', '3', '2', '2'],
# 'other3':['6', '8', '1', '4', '2'],
}
df2 = pd.DataFrame(data2)
#print(df2)
data1_key_cols = dict.fromkeys( zip(data1['key_column1'], data1['key_column2']) )
data2_key_cols = dict.fromkeys( zip(data2['key_column1'], data2['key_column2']) )
# for Python versions < 3.7 (dictionaries are not ordered):
#data1_key_cols = list(zip(data1['key_column1'], data1['key_column2']))
#data2_key_cols = list(zip(data2['key_column1'], data2['key_column2']))
from collections import defaultdict
missing_data2_in_data1 = defaultdict(list)
missing_data1_in_data2 = defaultdict(list)
for indx, val in enumerate(data1_key_cols.keys()):
#for indx, val in enumerate(data1_key_cols): # for Python version < 3.7
if val not in data2_key_cols:
for key, val in data1.items():
missing_data1_in_data2[key].append(data1[key][indx])
for indx, val in enumerate(data2_key_cols.keys()):
#for indx, val in enumerate(data2_key_cols): # for Python version < 3.7
if val not in data1_key_cols:
for key, val in data2.items():
missing_data2_in_data1[key].append(data2[key][indx])
df1_s = pd.DataFrame(missing_data1_in_data2)
df2_s = pd.DataFrame(missing_data2_in_data1)
print(df1_s)
print('--------------------------------------------')
print(df2_s)
prints
first second_column key_column1 key_column2 fourth_column
0 2 2 3 2 2
1 2 2 6 1 2
2 2 2 4 1 2
--------------------------------------------
first second_column key_column1 key_column2 fourth_column
0 2 2 3 5 3
1 2 2 6 2 5
2 2 2 4 2 6
If you outer merge on the 2 key columns, with an additional unique column in the second dataframe, that unique column will show Nan where the row is in the first dataframe but not the second. For example:
df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
gives:
first second_column key_column1 ... other2 other3 first_column
0 1 1 1 ... 1 6 4
1 2 2 3 ... 4 8 NaN
2 2 2 2 ... 3 1 7
3 2 2 6 ... 2 4 NaN
4 2 2 4 ... 2 2 NaN
5 NaN NaN 3 ... NaN NaN 2
6 NaN NaN 6 ... NaN NaN 2
7 NaN NaN 4 ... NaN NaN 2
Here the Nans in 'first_column' correspond to the rows in df2 that are not in df1. You can then use this fact with .loc[] to filter on those Nan rows, and only the columns in df2 like so:
df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
Output:
first second_column key_column1 key_column2 fourth_column other2 other3
1 2 2 3 5 2 4 8
3 2 2 6 2 2 2 4
4 2 2 4 2 2 2 2
Full code for both tables is:
df2_outer = df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df1 compare df2')
df2_output = df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
print(df2_output)
df1_outer = df1.merge(df2[['key_column1', 'key_column2', 'first']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df2 compare df1')
df1_output = df1_outer.loc[df1_outer['first'].isna(), df1.columns]
print(df1_output)
Which outputs:
missing values of df1 compare df2
first second_column key_column1 key_column2 fourth_column other2 other3
1 2 2 3 5 2 4 8
3 2 2 6 2 2 2 4
4 2 2 4 2 2 2 2
missing values of df2 compare df1
first_column second_column key_column1 key_column2 fourth_column other
1 2 2 3 2 2 2
3 2 2 6 1 2 2
4 2 2 4 1 2 2

Drop duplicate IDs keeping if value = certain value , otherwise keep first duplicate

>>> df = pd.DataFrame({'id': ['1', '1', '2', '2', '3', '4', '4', '5', '5'],
... 'value': ['keep', 'y', 'x', 'keep', 'x', 'Keep', 'x', 'y', 'x']})
>>> print(df)
id value
0 1 keep
1 1 y
2 2 x
3 2 keep
4 3 x
5 4 Keep
6 4 x
7 5 y
8 5 x
In this example, the idea would be to keep index values 0, 3, 4, 5 since they are asscoiated with a duplicate id with a particular value == 'Keep' and 7 (since it is the first of the duplicates for id 5).
In your case try with idxmax
out = df.loc[df['value'].eq('keep').groupby(df.id).idxmax()]
Out[24]:
id value
0 1 keep
3 2 keep
4 3 x
5 4 Keep
7 5 y

is there a way to make columns without pandas?

I currently have a list of lists and I need to put them into columns but I can't use pandas to make the columns. I currently have a list of lists that looks like this:
list_a = [['face1', 'face2', 'object', 'scene'], ['1', '7', '6', '5'], ['4', '3', '2', '8'], ['1', '3', '2', '4'], ['1', '2', '3', '4']]
and I want it to come out in columns like this
face1 face2 object scene
1 4 1 1
7 3 3 2
6 2 2 3
5 8 4 4
You can use string formatting to print columns.
for row in list_a:
print(''.join(f'{x:^8}' for x in row)) # 8-character wide centered columns
# output
face1 face2 object scene
1 7 6 5
4 3 2 8
1 3 2 4
1 2 3 4
try using center:
list_a = [['face1', 'face2', 'object', 'scene'], ['1', '7', '6', '5'], ['4', '3', '2', '8'], ['1', '3', '2', '4'],
['1', '2', '3', '4']]
for item in list_a:
for subitem in item:
print(subitem.center(10), end='')
print()
output :
face1 face2 object scene
1 7 6 5
4 3 2 8
1 3 2 4
1 2 3 4
Note : If your list, contains a value other than string, don't forget to convert it to string before calling .center on it:
print(str(subitem).center(10), end='')

Counting list entries in specific columns in Pandas Dataframe

I have a Dataframe like this
1 2 ... 9
0 ['1'] [] ['9']
1 ['1'] ['2', '2', '2'] ['9', '9']
2 ['1', '1'] ['2', '2', '2'] []
3 ['1', '1', '1'] ['2'] []
I want to count the occurences in each column so that the output would be like this
1 2 ... 9
0 1 0 1
1 1 3 2
2 2 3 0
3 3 1 0
This seems to work with the following code
df.1.apply(lambda x: x.count('1'))
but how can I automate it for all my columns so I don't have to run the above code for each individual column?
In addition I used
df.1.apply(lambda x: x.count('1')).sum()
to count the total for the rows which seems to be giving the right answer. Is there a better way though?

Concatenate strings based on inner join

I have two DataFrames containing the same columns; an id, a date and a str:
df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '10'],
'date': ['4', '5', '6', '7', '8'],
'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': ['1', '2', '3', '4', '12'],
'date': ['4', '5', '6', '7', '8'],
'str': ['A', 'B', 'C', 'D', 'Q']})
I would like to join these two datasets on the id and date columns, and create a resulting column that is the concatenation of str:
df3 = pd.DataFrame({'id': ['1', '2', '3', '4', '10', '12'],
'date': ['4', '5', '6', '7', '8', '8'],
'str': ['aA', 'bB', 'cC', 'dD', 'e', 'Q']})
I guess I can make an inner join and then concatenate the strings, but is there an easier way to achieve this?
IIUC concat+groupby
pd.concat([df1,df2]).groupby(['date','id']).str.sum().reset_index()
Out[9]:
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
And if we consider the efficiency using sum() base on level
pd.concat([df1,df2]).set_index(['date','id']).sum(level=[0,1]).reset_index()
Out[12]:
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
Using radd:
i = df1.set_index(['date', 'id'])
j = df2.set_index(['date', 'id'])
j['str'].radd(i['str'], fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 10 e
5 8 12 Q
This should be pretty fast.

Categories