I have a dataframe with information about people. However, sometimes the same person appears in several rows, and some rows have more info about that person than others. Is there a way to drop the duplicates using column 'Name' as reference, but only keep the most-filled rows?
If you have a dataframe like
df = pd.DataFrame([['a',np.nan,np.nan,'M'],['a',12,np.nan,'M'],['c',np.nan,np.nan,'M'],['d',np.nan,np.nan,'M']],columns=['Name','Age','Region','Gender'])
Sorting the rows by their NaN count and then dropping duplicates on the 'Name' subset, keeping the first occurrence, might help, i.e.
df['count'] = df.isnull().sum(axis=1)   # number of missing values per row
df = (df.sort_values('count')           # least-missing rows come first
        .drop_duplicates(subset=['Name'], keep='first')
        .drop(columns='count'))
Output:
Before:
Name Age Region Gender
0 a NaN NaN M
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
After:
Name Age Region Gender
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
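If you would rather merge the information spread across the duplicate rows instead of keeping the single most complete one, a groupby-based sketch along these lines should also work on the example above (groupby().first() takes the first non-null value per column within each Name group):
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', np.nan, np.nan, 'M'],
                   ['a', 12, np.nan, 'M'],
                   ['c', np.nan, np.nan, 'M'],
                   ['d', np.nan, np.nan, 'M']],
                  columns=['Name', 'Age', 'Region', 'Gender'])

# first() returns the first non-null value per column within each group,
# so information spread across duplicate rows is combined rather than discarded
merged = df.groupby('Name', as_index=False).first()
print(merged)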
I'm cleaning some data and I've been struggling with one thing.
I have a dataframe with 7740 rows and 68 columns.
Most of the columns contain NaN values.
What I'm interested in is removing rows where the value is NaN in both of these two columns: SERIAL_ID and NUMBER_ID.
Example :
SERIAL_ID    NUMBER_ID
8RY68U4R     NaN
8756ERT5     8759321
NaN          NaN
NaN          7896521
7EY68U4R     NaN
95856ERT5    988888
NaN          NaN
NaN          4555555
Results
SERIAL_ID    NUMBER_ID
8RY68U4R     NaN
8756ERT5     8759321
NaN          7896521
7EY68U4R     NaN
95856ERT5    988888
NaN          4555555
That is, a row is removed only when both of those columns are NaN.
I've used the following to do so:
df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all', inplace=True)
When I use this on my dataframe with 68 columns, this is the result I get:
SERIAL_ID    NUMBER_ID
NaN          NaN
NaN          NaN
NaN          NaN
NaN          7896521
NaN          NaN
95856ERT5    NaN
NaN          NaN
NaN          4555555
I tried with a copy of the dataframe with only 3 columns, and it works fine.
It is somehow working (I can tell because I have an identical ID in another column) but it removes some of the values, and I have no idea why.
Please help, I've been struggling with this the whole day.
Thanks again.
I don't know why it only works for 3 columns and not for the original 68.
However, we can obtain the desired output in another way.
Use boolean indexing:
df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]
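For reference, a minimal runnable sketch of that boolean indexing on data shaped like the example above (the extra OTHER_COL column is made up here, just to mimic a wider frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'SERIAL_ID': ['8RY68U4R', np.nan, np.nan],
                   'NUMBER_ID': [np.nan, np.nan, 7896521],
                   'OTHER_COL': ['x', 'y', 'z']})   # made-up extra column

# keep rows where at least one of the two ID columns is non-null
mask = df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)
print(df[mask])   # the middle row, with both IDs missing, is dropped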
You can use boolean logic, or simply do something like this for any given pair of columns:
import numpy as np
import pandas as pd
# sample dataframe
d = {'SERIAL_ID': ['8RY68U4R', '8756ERT5', np.nan, np.nan],
     'NUMBER_ID': [np.nan, 8759321, np.nan, 7896521]}
df = pd.DataFrame(d)
# flag rows where both ID columns are missing
df['nans'] = df['NUMBER_ID'].isnull() & df['SERIAL_ID'].isnull()
# keep only the rows where at least one ID is present
df_filtered = df[df['nans'] == False]
print(df_filtered)
which returns this:
SERIAL_ID NUMBER_ID nans
0 8RY68U4R NaN False
1 8756ERT5 8759321.0 False
3 NaN 7896521.0 False
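If you don't want the helper column in the final result, you can drop it afterwards, e.g.:
df_filtered = df_filtered.drop(columns='nans')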
I have a column in a dataframe as follows:
Data
[special_request=nowhiterice, waiter=Janice]
[allegic=no, waiter=Janice, tip=20]
[allergic=no, tip=20]
[special_request=nogreens]
May I know how I could reshape this so that each key becomes its own column?
special_request allegic waiter tip
You can build a dictionary for each row by splitting the elements of your series, then construct your DataFrame from those dictionaries (s being your column here):
import pandas as pd
s = pd.Series([['special_request=nowhiterice', 'waiter=Janice'],
               ['allegic=no', 'waiter=Janice', 'tip=20'],
               ['allergic=no', 'tip=20'],
               ['special_request=nogreens']])
df = pd.DataFrame([dict(e.split('=') for e in row) for row in s])
print(df)
Output:
special_request waiter allegic tip allergic
0 nowhiterice Janice NaN NaN NaN
1 NaN Janice no 20 NaN
2 NaN NaN NaN 20 no
3 nogreens NaN NaN NaN NaN
Edit: if the column values are actual strings, you should first split each string (also stripping the brackets [, ] and whitespace):
s = pd.Series(['[special_request=nowhiterice, waiter=Janice]',
               '[allegic=no, waiter=Janice, tip=20]',
               '[allergic=no, tip=20]',
               '[special_request=nogreens]'])
df = pd.DataFrame([dict(map(str.strip, e.split('=')) for e in row.strip('[]').split(',')) for row in s])
print(df)
You can split each string value of the column into a dict, then use pd.json_normalize to convert the dicts to columns.
df_ = pd.json_normalize(df['Data'].apply(lambda x: dict([map(str.strip, i.split('=')) for i in x.strip("[]").split(',')])))
print(df_)
special_request waiter allegic tip allergic
0 nowhiterice Janice NaN NaN NaN
1 NaN Janice no 20 NaN
2 NaN NaN NaN 20 no
3 nogreens NaN NaN NaN NaN
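If you then want these new columns back alongside the rest of the original dataframe, one option (assuming df still has its default RangeIndex, so the rows of df_ line up with it) is:
df = df.drop(columns='Data').join(df_)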
I want to use re.match() to clean a pandas dataframe such that if an entry in any column is 1 or 2 it remains unchanged, but if it is any other value it is set to NaN.
The problem is that my function sets everything to NaN. I'm new to regular expressions, so I think I've made a mistake.
Thanks!
# DATA
data = [['Bob',10,1],['Bob',2,2],['Clarke',13,1]]
my_df = pd.DataFrame(data,columns=['Name','Age','Sex'])
print(my_df)
Name Age Sex
0 Bob 10 1
1 Bob 2 2
2 Clarke 13 1
# CLEANING FUNCTION
def my_fun(df):
    for col in df.columns:
        for row in df.index:
            if re.match('^\d{1}(\.)\d{2}$', str(df[col][row])):
                df[col][row] = df[col][row]
            else:
                df[col][row] = np.nan
    return(df)
# OUTPUT
my_fun(my_df)
Name Age Sex
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
# EXPECTED/DESIRED OUTPUT
Name Age Sex
0 NaN NaN 1
1 NaN 2 2
2 NaN NaN 1
You can go with where combined with isin here for a full match:
my_df.where(my_df.isin([1,2]))
Name Age Sex
0 NaN NaN 1
1 NaN 2.0 2
2 NaN NaN 1
Some observations:
df[col][row] is not a recommended way to index a dataframe in pandas. Use .loc or .iloc instead, see Indexing and selecting data.
Also, looping over a dataframe is generally not recommended at all; you might end up with a solution that performs very poorly. I'd suggest you read How to iterate over rows in a DataFrame in Pandas.
You don't need a regex for what you want to do. You want to match either 1 or 2, and there are more straightforward ways of doing this, both with plain Python lists and with pandas. When matching something with built-in methods gets complicated, then maybe start looking into regex.
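For completeness: the original function blanks everything because the pattern ^\d{1}(\.)\d{2}$ only matches strings like 1.23 (one digit, a literal dot, two digits), which never occurs in the sample data. If you did want to stay with re.match, a sketch like the following (using applymap instead of the nested loops) should reproduce the isin result:
import re
import numpy as np
import pandas as pd

data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

# keep a cell only if its string form is exactly "1" or "2"
cleaned = my_df.applymap(lambda v: v if re.match(r'^[12]$', str(v)) else np.nan)
print(cleaned)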
I'm trying to do the following:
Given a row in df1, if str(row['code']) is contained in any row of df2['code'], then I would like those rows of df2['lamer_url_1'] and df2['shopee_url_1'] to take the corresponding values from df1.
Then carry on with the next row of df1['code']...
'''
==============
Initial Tables:
df1
code lamer_url_1 shopee_url_1
0 L61B18H089 b a
1 L61S19H014 e d
2 L61S19H015 z y
df2
code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0 L61B18H089-F1424 NaN NaN NaN NaN
1 L61S19H014-S1500 NaN NaN NaN NaN
2 L61B18H089-F1424 NaN NaN NaN NaN
==============
Expected output:
df2
code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0 L61B18H089-F1424 b a NaN NaN
1 L61S19H014-S1500 e d NaN NaN
2 L61B18H089-F1424 b a NaN NaN
'''
I assumed that the common part of "code" in "df2" is the characters before "-". I also assumed that from "df1" we want 'lamer_url_1' and 'shopee_url_1', and from "df2" we want 'lamer_url_2' and 'shopee_url_2' (correct me in a comment if I am wrong so I can polish the code):
df1.set_index(df1['code'], inplace=True)
df2.set_index(df2['code'].apply(lambda x: x.split('-')[0]), inplace=True)
df2.index.names = ['code_join']
df3 = pd.merge(df2[['code', 'lamer_url_2', 'shopee_url_2']],
               df1[['lamer_url_1', 'shopee_url_1']],
               left_index=True, right_index=True)
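If you also need df3 to match the expected output exactly (default integer index, and the _1 columns before the _2 ones), a possible follow-up on the merge above is:
df3 = df3.reset_index(drop=True)
df3 = df3[['code', 'lamer_url_1', 'shopee_url_1', 'lamer_url_2', 'shopee_url_2']]
print(df3)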
I have a 5x500k pandas dataframe and want to locate outlier indexes where the content is an abnormally long string of characters.
for col in df.columns:
    print(df[col].apply(str).map(len).max())  # finds the max string length in column col
    print(df[col].apply(str).map(len))        # gives the length of every string in column col
What I would like to do is find the longest string in each column and set it to NaN if no other string in that column has the same length (i.e. there is no tie for longest), and also save the index of this value. I want to repeat this for each column until no column has any "uniquely long" strings.
Example input:
a b c d e
0 NaN 54674054 6613722414 2330536 NaN
1 NaN 1234 asdf 2339933 NaN
2 14242 423124 gsdgsgdfgaadfg sdaasda NaN NaN
3 342543 214124 NaN 1231 978ad6f7d8yv 6767969
4 4123 512353 SDFAGdssd 12 87612378y8q7ssdy
5 4473 32325 as asfsda NaN NaN
Should Output:
a b c d e
0 NaN NaN 6613722414 2330536 NaN
1 NaN 1234 asdf 2339933 NaN
2 NaN 423124 NaN NaN NaN
3 NaN 214124 NaN 1231 NaN
4 4123 512353 2SDFAGdssd 12 NaN
5 4473 32325 as asfsda NaN NaN
This is because I would like to clear my big dataset of obvious long-string anomalies. Is it possible to easily do such an operation with pandas?
Maybe a more general version of the question would be, how can I find the index and the value of all the longest strings in a pandas dataframe column? And not just the first occurrence of the longest string.
Thank you very much,
Karl
I've not tried it but I think you are looking for something like this:
for col in df.columns:
    # find indices; keep="all" keeps every row that ties for the longest length
    idxs = df[col].astype(str).str.len().nlargest(1, keep="all").index
    # get the corresponding values
    values = df.loc[idxs, col]
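Building on that, a fuller sketch of the whole "blank the uniquely longest string per column" idea could look like the following (the function and variable names are mine, and it does a single pass; you could call it repeatedly until it reports no more outliers if you want the iterative behaviour described in the question). Note that astype(str) turns NaN into the 3-character string 'nan', which is normally short enough not to interfere:
import numpy as np
import pandas as pd

def blank_unique_longest(df):
    """Set the longest string in each column to NaN, but only if it is
    strictly longer than every other entry in that column."""
    df = df.copy()
    outlier_index = {}                         # column -> index of the removed value
    for col in df.columns:
        lengths = df[col].astype(str).str.len()
        # keep="all" returns every row that ties for the maximum length
        idxs = lengths.nlargest(1, keep="all").index
        if len(idxs) == 1:                     # uniquely longest -> treat as an outlier
            outlier_index[col] = idxs[0]
            df.loc[idxs, col] = np.nan
    return df, outlier_index

cleaned, outliers = blank_unique_longest(df)
print(outliers)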