Remove duplicates taking into account two columns, lower case and accents - python

I have the following DataFrame in pandas:
code  town                    district                suburb
02    Benalmádena             Málaga                  Arroyo de la Miel
03    Alicante                Jacarilla               Jacarilla, Correntias Bajas (Jacarilla)
04    Cabrera d'Anoia         Barcelona               Cabrera D'Anoia
07    Lanjarón                Granada                 Lanjaron
08    Santa Cruz de Tenerife  Santa Cruz de Tenerife  Centro-Ifara
09    Córdoba                 Córdoba                 Cordoba
For each row, if the value in the suburb column is equal (compared in lower case and without accents) to the value in the district or town column, it should become NaN.
# Function to remove accents and shift to lower case.
import unidecode

def rm_accents_lowcase(a):
    return unidecode.unidecode(a).lower()
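For instance, the helper can be applied to a whole column with map (a quick sketch; the frame name df is assumed):
normalized = df['suburb'].map(rm_accents_lowcase)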
Expected output:
code  town                    district                suburb
02    Benalmádena             Málaga                  Arroyo de la Miel
03    Alicante                Jacarilla               Jacarilla, Correntias Bajas (Jacarilla)
04    Cabrera d'Anoia         Barcelona               NaN
07    Lanjarón                Granada                 NaN
08    Santa Cruz de Tenerife  Santa Cruz de Tenerife  Centro-Ifara
09    Córdoba                 Córdoba                 NaN

You can write a function that checks each row and apply it with apply(..., axis=1).
# !pip install unidecode
import numpy as np
import unidecode

def check_unidecode(row):
    lst = [unidecode.unidecode(r).lower() for r in row]
    # Here we assume the value of the last column is checked against
    # the values of all the other columns in each row.
    if lst[-1] in lst[:-1]:
        return np.nan
    return row['suburb']

df['suburb'] = df.apply(check_unidecode, axis=1)
print(df)
town district \
0 Benalmádena Málaga
1 Alicante Jacarilla
2 Cabrera d'Anoia Barcelona
3 Lanjarón Granada
4 Santa Cruz de Tenerife Santa Cruz de Tenerife
5 Córdoba Córdoba
suburb
0 Arroyo de la Miel
1 Jacarilla, Correntias Bajas (Jacarilla)
2 NaN
3 NaN
4 Centro-Ifara
5 NaN
Update: if you want to check a specific column, wherever it appears among the other columns, you can try the following:
col_chk = 'suburb'

def check_unidecode(row):
    lst = []
    for col_name, val in zip(row.index, row):
        tmp = unidecode.unidecode(val).lower()
        if col_name != col_chk:
            lst.append(tmp)
        else:
            val_chk = tmp
    if val_chk in lst:
        return np.nan
    return row[col_chk]

df[col_chk] = df.apply(check_unidecode, axis=1)
print(df)
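For example, setting col_chk = 'town' before the apply would instead blank out town values that duplicate (after normalization) any of the other columns.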

You can remove accents and convert to lower case with this code; factoring the normalization chain into a helper keeps the condition readable:
import numpy as np

# Normalize a string Series: strip accents via NFKD + ASCII, then lower-case.
def norm(s):
    return (s.str.normalize('NFKD')
             .str.encode('ascii', errors='ignore')
             .str.decode('utf-8')
             .str.lower())

df['check'] = np.where(
    (norm(df['suburb']) == norm(df['town'])) | (norm(df['suburb']) == norm(df['district'])),
    np.nan, df['suburb'])
df
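If you prefer to avoid the unidecode dependency, the same per-string normalization can be written with the standard library's unicodedata module (a minimal sketch; strip_accents_lower is a made-up helper name):
import unicodedata

def strip_accents_lower(s):
    # Decompose accented characters (NFKD), drop the combining marks,
    # and lower-case the remainder.
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if not unicodedata.combining(c)).lower()

strip_accents_lower('Lanjarón')  # -> 'lanjaron'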

Related

Find top n elements in pandas dataframe column by keeping the grouping

I am trying to find the top 5 elements of the column total_petitions, while keeping the grouped ordering I created.
df = df[['fy', 'EmployerState', 'total_petitions']]
table = df.groupby(['fy','EmployerState']).mean()
table.nlargest(5, 'total_petitions')
sample output:
fy EmployerState total_petitions
2020 WA 7039.333333
2016 MD 2647.400000
2017 MD 2313.142857
... TX 2305.541667
2020 TX 2081.952381
desired output:
fy EmployerState total_petitions
2016 AL 3.875000
AR 225.333333
AZ 26.666667
CA 326.056604
CO 21.333333
... ... ...
2020 VA 36.714286
WA 7039.333333
WI 43.750000
WV 8986086.08
WY 1.000000
That is, for each year, keep the 5 states with the highest mean total_petitions.
What you are looking for is a pivot table:
df = df.pivot_table(values='total_petitions', index=['fy','EmployerState'])
df = df.groupby(level='fy')['total_petitions'].nlargest(5).reset_index(level=0, drop=True).reset_index()
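The same result can also be reached without the pivot, e.g. with a sort followed by groupby().head() (a sketch under the same column names, not the answerer's code):
top5 = (df.groupby(['fy', 'EmployerState'], as_index=False)['total_petitions']
          .mean()                                            # mean per (fy, state)
          .sort_values(['fy', 'total_petitions'], ascending=[True, False])
          .groupby('fy')
          .head(5))                                          # top 5 states per year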

Extract the YYYY year from two string columns and put it in a new column, keeping NaN values

In a dataframe I have two columns with information about when some football players made their debut. The columns are called 'Debut' and 'Debut Deportivo'. I have to create a function that builds a new column with the YYYY year information from both columns, keeping the NaN values from both. Let me show an example.
With the code I have written so far, I am able to get the value from one column and put it in the new one, but I have not found a way to combine both.
The result should be something like this:
Debut                 Debut Deportivo                   fecha_debut
27 de mayo de 2006    2006(UD Vecindario)               2006
21 de agosto de 2010  11 de agosto de 2010(Portuguesa)  2010
21 de agosto de 2010  NaN                               2010
NaN                   NaN                               NaN
Can you help me get this code right, please?
df_4['Debut deportivo'].fillna('0000', inplace=True)
df_4['Debut'].fillna('0000', inplace=True)

def find_year(x):
    año = re.search(r'\d{4}', x)
    return int(año.group(0)) if año else 0

df_4['fecha_debut'] = df_4['Debut'].map(find_year)
df_4['fecha_debut'] = df_4['Debut deportivo'].apply(lambda x: np.nan if x.find('2') == -1 else x[x.find('0')-1:x.find('(')])
df_4['club_debut'] = df_4['Debut deportivo'].apply(lambda x: np.nan if x.find('(') == -1 else x[x.find('(')+1:x.find(')')])
df_4['fecha_debut'] = df_4['fecha_debut'].replace(0, np.nan)
# Do not modify the following lines
assert isinstance(df, pd.DataFrame)
return df
I suggest you use str.extract + combine_first:
df['fecha_debut'] = df['Debut'].str.extract(r'(\d{4})').combine_first(df['Debut Deportivo'].str.extract(r'(\d{4})'))
print(df)
Output
Debut Debut Deportivo fecha_debut
0 27 de mayo de 2006 2006(UD Vecindario) 2006
1 21 de agosto de 2010 11 de agosto de 2010(Portuguesa) 2010
2 21 de agosto de 2010 NaN 2010
3 NaN NaN NaN
For more on how to work with strings in pandas see this.
UPDATE
If you need the column to be numeric you could do:
df['fecha_debut'] = pd.to_numeric(df['fecha_debut']).astype(pd.Int32Dtype())
Note that because you have missing values in the column it cannot be of type int32. It can be either nullable integer or float. For more on working with missing data see this.
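For reference, here is a minimal end-to-end sketch reproducing the example above (the sample frame is reconstructed by hand; not the asker's grading code):
import pandas as pd

df = pd.DataFrame({
    'Debut': ['27 de mayo de 2006', '21 de agosto de 2010',
              '21 de agosto de 2010', None],
    'Debut Deportivo': ['2006(UD Vecindario)',
                        '11 de agosto de 2010(Portuguesa)', None, None],
})

# Extract the first 4-digit run from each column; prefer 'Debut' and
# fall back to 'Debut Deportivo' where 'Debut' has no year.
df['fecha_debut'] = (df['Debut'].str.extract(r'(\d{4})')[0]
                     .combine_first(df['Debut Deportivo'].str.extract(r'(\d{4})')[0]))

# Optional: nullable integer dtype so the all-NaN row survives.
df['fecha_debut'] = pd.to_numeric(df['fecha_debut']).astype('Int32')
print(df)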

What is the fastest way to modify my pandas dataframe?

The dataframe has 122,145 rows.
Following is a snippet of the data:
country_name,subdivision_1_name,subdivision_2_name,city_name
Spain,Madrid,Madrid,Sevilla La Nueva
Spain,Principality of Asturias,Asturias,Sevares
Spain,Catalonia,Barcelona,Seva
Spain,Cantabria,Cantabria,Setien
Spain,Basque Country,Biscay,Sestao
Spain,Navarre,Navarre,Sesma
Spain,Catalonia,Barcelona,Barcelona
I want to substitute city_name with subdivision_2_name whenever both the following conditions are satisfied:
subdivision_2_name and city_name have the same country_name and the same subdivision_1_name, and
subdivision_2_name is present in city_name.
For example: for the city_name "Seva", the subdivision_2_name "Barcelona" is also present as a city_name in the dataframe, with the same country_name "Spain" and the same subdivision_1_name "Catalonia", so I will replace "Seva" with "Barcelona".
I am not able to create a proper apply function, so I have prepared a loop:
for i in range(df.shape[0]):
    if df.subdivision_2_name[i] in set(df.city_name[(df.country_name == df.country_name[i]) & (df.subdivision_1_name == df.subdivision_1_name[i])]):
        df.city_name[i] = df.subdivision_2_name[i]
Edit: this loop took 1,637 seconds (~28 minutes) to run.
Can you suggest a better method?
Use:
def f(x):
    if x['subdivision_2_name'].isin(x['city_name']).any():
        x['city_name'] = x['subdivision_2_name']
    return x

df1 = df.groupby(['country_name','subdivision_1_name','subdivision_2_name']).apply(f)
print(df1)
country_name subdivision_1_name subdivision_2_name city_name
0 Spain Madrid Madrid Sevilla La Nueva
1 Spain Principality of Asturias Asturias Sevares
2 Spain Catalonia Barcelona Barcelona
3 Spain Cantabria Cantabria Setien
4 Spain Basque Country Biscay Sestao
5 Spain Navarre Navarre Sesma
6 Spain Catalonia Barcelona Barcelona
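At 122,145 rows, a fully vectorized variant may also be worth trying (a sketch, not the answerer's code; it implements the same membership test as the question's loop via MultiIndex lookups):
import pandas as pd

# Keys for the candidate replacement values and for the existing city
# names, both scoped to (country_name, subdivision_1_name).
sub2_keys = pd.MultiIndex.from_frame(
    df[['country_name', 'subdivision_1_name', 'subdivision_2_name']])
city_keys = pd.MultiIndex.from_frame(
    df[['country_name', 'subdivision_1_name', 'city_name']])

# Rows whose subdivision_2_name also occurs as a city_name within the
# same country and subdivision_1.
mask = sub2_keys.isin(city_keys)
df.loc[mask, 'city_name'] = df.loc[mask, 'subdivision_2_name']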

Iterating through a dataframe with info from another dataframe

I have a question that I think is more about logic than about coding. My goal is to calculate how many kilometers a truck travels while loaded with cargo.
I have two Dataframes
Let's call the first one trips:
Date Licence City State KM
01/05/2019 AAA-1111 Sao Paulo SP 10
02/05/2019 AAA-1111 Santos SP 10
03/05/2019 AAA-1111 Rio de Janeiro RJ 20
04/05/2019 AAA-1111 Sao Paulo SP 15
01/05/2019 AAA-2222 Curitiba PR 20
02/05/2019 AAA-2222 Sao Paulo SP 25
Let's call the second one invoice:
Code Date License Origin State Destiny UF Value
A1 01/05/2019 AAA-1111 Sao Paulo SP Rio de Janeiro RJ 10.000,00
A2 01/05/2019 AAA-2222 Curitiba PR Sao Paulo SP 15.000,00
What I need to get is:
Date Licence City State KM Code
01/05/2019 AAA-1111 Sao Paulo SP 10 A1
02/05/2019 AAA-1111 Santos SP 10 A1
03/05/2019 AAA-1111 Rio de Janeiro RJ 20 A1
04/05/2019 AAA-1111 Sao Paulo SP 15 NaN
01/05/2019 AAA-2222 Curitiba PR 20 A2
02/05/2019 AAA-2222 Sao Paulo SP 25 A2
As I said, it is more a question of logic. The truck got its cargo at the initial point, which is São Paulo. How can I iterate through the rows, knowing that it passed through Santos loaded and then went to Rio de Janeiro, if I don't have the date when the cargo was delivered?
Thanks.
Assuming the rows in the first dataframe (df1) are sorted, here is what I would do.
Note: below I am using df1 for trips and df2 for invoice.
Left join df1 (left) and df2 (right) using as much information as is valid for matching the two dataframes, so that we can find the rows in df1 which are the Origin of the trips. In my test I am using the fields ['Date', 'License', 'City', 'State'] and saving the result in a new dataframe df3:
df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin': 'City'}),
                on=['Date', 'License', 'City', 'State'],
                how='left')
Fill the NULL values in df3.Destiny with ffill():
df3['Destiny'] = df3.Destiny.ffill()
Set up the group labels with the following flag:
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
Note: when testing, you can save these labels as a column (df3['g']) to inspect them.
Update df3.Code using ffill() based on the above group labels:
df3['Code'] = df3.groupby(g).Code.ffill()
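Putting the steps together (a sketch; note the question's trips table spells the column 'Licence' while invoice uses 'License', so one of them is assumed renamed before merging):
# Assumes df1 (trips) and df2 (invoice) share the spelling 'License'.
df1 = df1.rename(columns={'Licence': 'License'})

df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin': 'City'}),
                on=['Date', 'License', 'City', 'State'],
                how='left')

df3['Destiny'] = df3['Destiny'].ffill()                    # carry Destiny forward
g = (~df3['Code'].isnull()                                 # a new invoice row...
     | (df3.shift()['City'] == df3['Destiny'])).cumsum()   # ...or the previous stop was the Destiny
df3['Code'] = df3.groupby(g)['Code'].ffill()               # spread Code within each group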

Remove four last digits from string – Convert Zip+4 to Zip code

The following piece of code...
import numpy as np
import pandas as pd

data = np.array([['', 'state', 'zip_code', 'collection_status'],
                 ['42394', 'CA', '92637-2854', 'NaN'],
                 ['58955', 'IL', '60654', 'NaN'],
                 ['108365', 'MI', '48021-1319', 'NaN'],
                 ['109116', 'MI', '48228', 'NaN'],
                 ['110833', 'IL', '60008-4227', 'NaN']])
print(pd.DataFrame(data=data[1:, 1:],
                   index=data[1:, 0],
                   columns=data[0, 1:]))
... gives the following data frame:
state zip_code collection_status
42394 CA 92637-2854 NaN
58955 IL 60654 NaN
108365 MI 48021-1319 NaN
109116 MI 48228 NaN
110833 IL 60008-4227 NaN
The goal is to homogenise the "zip_code" column into a 5-digit format, i.e. I want to drop the final four digits from zip_code when a particular data point has 9 digits instead of 5. BTW, zip_code is of "object" dtype.
Any ideas?
Use indexing with str only (thanks, John Galt):
df['collection_status'] = df['zip_code'].str[:5]
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
If you need to add conditions, use where or numpy.where:
df['collection_status'] = df['zip_code'].where(df['zip_code'].str.len() == 5,
                                               df['zip_code'].str[:5])
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
df['collection_status'] = np.where(df['zip_code'].str.len() == 5,
                                   df['zip_code'],
                                   df['zip_code'].str[:5])
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
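A regex-based variant gives the same result and additionally yields NaN for values that do not start with five digits (a sketch, not from the answer):
df['collection_status'] = df['zip_code'].str.extract(r'^(\d{5})')[0]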
