Remove last four digits from string – Convert Zip+4 to Zip code - python

The following piece of code...
import numpy as np
import pandas as pd

data = np.array([['','state','zip_code','collection_status'],
                 ['42394','CA','92637-2854', 'NaN'],
                 ['58955','IL','60654', 'NaN'],
                 ['108365','MI','48021-1319', 'NaN'],
                 ['109116','MI','48228', 'NaN'],
                 ['110833','IL','60008-4227', 'NaN']])
print(pd.DataFrame(data=data[1:,1:],
                   index=data[1:,0],
                   columns=data[0,1:]))
... gives the following data frame:
state zip_code collection_status
42394 CA 92637-2854 NaN
58955 IL 60654 NaN
108365 MI 48021-1319 NaN
109116 MI 48228 NaN
110833 IL 60008-4227 NaN
The goal is to homogenise the "zip_code" column into a 5-digit format, i.e. I want to remove the last four digits from zip_code when that particular data point has 9 digits instead of 5. BTW, zip_code's dtype is "object".
Any ideas?

Use indexing with str only (thanks, John Galt):
df['collection_status'] = df['zip_code'].str[:5]
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
If you need to add a condition, use where or numpy.where:
df['collection_status'] = df['zip_code'].where(df['zip_code'].str.len() == 5,
                                               df['zip_code'].str[:5])
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
df['collection_status'] = np.where(df['zip_code'].str.len() == 5,
                                   df['zip_code'],
                                   df['zip_code'].str[:5])
print (df)
state zip_code collection_status
42394 CA 92637-2854 92637
58955 IL 60654 60654
108365 MI 48021-1319 48021
109116 MI 48228 48228
110833 IL 60008-4227 60008
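As an alternative sketch (not from the answer above): a regex replace strips the "+4" suffix explicitly, removing only a trailing hyphen followed by exactly four digits, so plain 5-digit codes pass through unchanged.
# Remove '-NNNN' at the end of the string, if present.
df['collection_status'] = df['zip_code'].str.replace(r'-\d{4}$', '', regex=True)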

Related

Transpose rows using groupby with multiple values

I have the DataFrame below:
df = pd.DataFrame({'DC':['Alexandre', 'Alexandre', 'Afonso de Sousa', 'Afonso de Sousa','Afonso de Sousa'],
                   'PN':['Farmacia Ekolelo', 'Farmacia Havemos De Voltar', 'Farmacia Gloria', 'Farmacia Mambofar','Farmacia Metamorfose'],
                   'PC':['C-HO-002815', 'C-HO-005192', 'C-HO-002719', 'C-HO-003030','C-SCC-012430'],
                   'KCP':['NA', 'DOMINGAS PAULO', 'HELDER', 'Mambueno','RITA'],
                   'MBN':['NA', 'NA', 29295486345, 9.40407E+11,2.92955E+11]})
I am trying to convert the data into the format below: grouping by the DC column, the other columns need to be transposed into numbered columns, one set per row of each group.
You can group by DC then aggregate with list. From there you can concat the dataframes created from the aggregated lists:
import pandas as pd

df = pd.DataFrame({'DC':['Alexandre', 'Alexandre', 'Afonso de Sousa', 'Afonso de Sousa','Afonso de Sousa'],
                   'PN':['Farmacia Ekolelo', 'Farmacia Havemos De Voltar', 'Farmacia Gloria', 'Farmacia Mambofar','Farmacia Metamorfose'],
                   'PC':['C-HO-002815', 'C-HO-005192', 'C-HO-002719', 'C-HO-003030','C-SCC-012430'],
                   'KCP':['NA', 'DOMINGAS PAULO', 'HELDER', 'Mambueno','RITA'],
                   'MBN':['NA', 'NA', 29295486345, 9.40407E+11,2.92955E+11]})

df = df.groupby('DC', as_index=False).agg(list)
#print(df)

df_out = pd.concat([df[['DC']]] +
                   [pd.DataFrame(l := df[col].to_list(),
                                 columns=[f'{col}_{i}' for i in range(1, max(len(s) for s in l) + 1)])
                    for col in df.columns[1:]],
                   axis=1)
Note: the assignment in the comprehension l := df[col].to_list() only works for Python versions >= 3.8.
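For older Pythons, here is a minimal sketch of the same concat without the walrus operator (the helper name col_frame is mine):
def col_frame(col):
    # One DataFrame per original column: each aggregated list becomes numbered columns.
    lst = df[col].to_list()
    width = max(len(s) for s in lst)
    return pd.DataFrame(lst, columns=[f'{col}_{i}' for i in range(1, width + 1)])

df_out = pd.concat([df[['DC']]] + [col_frame(col) for col in df.columns[1:]], axis=1)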
This will give you:
DC PN_1 PN_2 PN_3 ... KCP_3 MBN_1 MBN_2 MBN_3
0 Afonso de Sousa Farmacia Gloria Farmacia Mambofar Farmacia Metamorfose ... RITA 29295486345 940407000000.0 2.929550e+11
1 Alexandre Farmacia Ekolelo Farmacia Havemos De Voltar None ... None NA NA NaN
You can then reorder the columns with your own sort function (note that assigning the sorted labels to df_out.columns would only rename the columns, not move the data, so reindex instead):
def sort_columns(col_lbl):
    col, ind = col_lbl.split('_')
    return (int(ind), df.columns.to_list().index(col))

df_out = df_out[['DC'] + sorted(df_out.columns[1:].to_list(), key=sort_columns)]
Output:
                DC              PN_1         PC_1   KCP_1  ...                  PN_3          PC_3 KCP_3         MBN_3
0  Afonso de Sousa   Farmacia Gloria  C-HO-002719  HELDER  ...  Farmacia Metamorfose  C-SCC-012430  RITA  2.929550e+11
1        Alexandre  Farmacia Ekolelo  C-HO-002815      NA  ...                  None          None  None           NaN

Remove duplicates taking into account two columns, lower case and accents

I have the following DataFrame in pandas:
  code  town                    district                suburb
  02    Benalmádena             Málaga                  Arroyo de la Miel
  03    Alicante                Jacarilla               Jacarilla, Correntias Bajas (Jacarilla)
  04    Cabrera d'Anoia         Barcelona               Cabrera D'Anoia
  07    Lanjarón                Granada                 Lanjaron
  08    Santa Cruz de Tenerife  Santa Cruz de Tenerife  Centro-Ifara
  09    Córdoba                 Córdoba                 Cordoba
For each row, if the suburb value is equal (compared in lower case and without accents) to the town or district value, it should become NaN.
# Function to remove accents and shift to lower case.
def rm_accents_lowcase(a):
    return unidecode.unidecode(a).lower()
Example:
  code  town                    district                suburb
  02    Benalmádena             Málaga                  Arroyo de la Miel
  03    Alicante                Jacarilla               Jacarilla, Correntias Bajas (Jacarilla)
  04    Cabrera d'Anoia         Barcelona               NaN
  07    Lanjarón                Granada                 NaN
  08    Santa Cruz de Tenerife  Santa Cruz de Tenerife  Centro-Ifara
  09    Córdoba                 Córdoba                 NaN
You can write a function that checks each row and apply it with apply(..., axis=1):
# !pip install unidecode
import numpy as np
import unidecode

def check_unidecode(row):
    lst = [unidecode.unidecode(r).lower() for r in row]
    # Here we check the value of the last column against the
    # values of the other columns in each row.
    if lst[-1] in lst[:-1]:
        return np.nan
    return row['suburb']

df['suburb'] = df.apply(check_unidecode, axis=1)
print(df)
town district \
0 Benalmádena Málaga
1 Alicante Jacarilla
2 Cabrera d'Anoia Barcelona
3 Lanjarón Granada
4 Santa Cruz de Tenerife Santa Cruz de Tenerife
5 Córdoba Córdoba
suburb
0 Arroyo de la Miel
1 Jacarilla, Correntias Bajas (Jacarilla)
2 NaN
3 NaN
4 Centro-Ifara
5 NaN
Update: if you want to check a specific column, regardless of its position, against all the other columns, you can try the following:
col_chk = 'suburb'

def check_unidecode(row):
    lst = []
    for col_name, val in zip(row.index, row):
        tmp = unidecode.unidecode(val).lower()
        if col_name != col_chk:
            lst.append(tmp)
        else:
            val_chk = tmp
    if val_chk in lst:
        return np.nan
    return row[col_chk]

df[col_chk] = df.apply(check_unidecode, axis=1)
print(df)
You can remove accents and lower-case a column with this code:
df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower()
df['check'] = np.where(
    ((df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower() == df['town'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower())
     | (df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower() == df['district'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower())
    ),
    np.nan, df['suburb'])
df
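Since that normalization chain is repeated four times, a hedged refactor keeps it readable (the helper name fold is mine; same behavior as the np.where above):
def fold(s):
    # Strip accents via NFKD decomposition, drop the combining marks, lower-case.
    return s.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower()

sub = fold(df['suburb'])
df['check'] = np.where((sub == fold(df['town'])) | (sub == fold(df['district'])),
                       np.nan, df['suburb'])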

Pandas fillna with string values from 2 other columns

I have a df with 3 columns: City, State, and MSA. Some of the MSA values are NaN. I would like to fill the MSA NaN values with a concatenation of City and State. I can fill MSA with City using df.MSA.fillna(df.City, inplace=True), but some cities in different states have the same name.
  City        State  MSA
  Chicago     IL     Chicago MSA
  Belleville  IL     NaN
  Belleville  KS     NaN
Desired output:
  City        State  MSA
  Chicago     IL     Chicago MSA
  Belleville  IL     Belleville IL
  Belleville  KS     Belleville KS
Keep using the vectorized operation you suggested. Note that the argument to fillna can be a Series combined from the other columns:
df.MSA.fillna(df.City + " " + df.State, inplace=True)
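A minimal, self-contained check, a sketch using just the sample rows from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['Chicago', 'Belleville', 'Belleville'],
                   'State': ['IL', 'IL', 'KS'],
                   'MSA': ['Chicago MSA', np.nan, np.nan]})
# On newer pandas, prefer: df['MSA'] = df['MSA'].fillna(df.City + " " + df.State)
df.MSA.fillna(df.City + " " + df.State, inplace=True)
print(df)
#          City State            MSA
# 0     Chicago    IL    Chicago MSA
# 1  Belleville    IL  Belleville IL
# 2  Belleville    KS  Belleville KS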

Extract the YYYY year from two string columns and put it in a new column, keeping NaN values

In a dataframe I have two columns with the information of when some football players made their debut. The columns are called 'Debut' and 'Debut Deportivo'. I have to create a function that builds a new column with the YYYY year information from both columns, keeping the NaN values from both when applicable. Let me show an example.
With the code I have written so far, I am able to get the value from one column and put it in the new one, but I have not found a way to combine both.
The result should be something like this:
  Debut                 Debut Deportivo                   fecha_debut
  27 de mayo de 2006    2006(UD Vecindario)               2006
  21 de agosto de 2010  11 de agosto de 2010(Portuguesa)  2010
  21 de agosto de 2010  NaN                               2010
  NaN                   NaN                               NaN
Can you help me get this code right, please?
df_4['Debut deportivo'].fillna('0000', inplace=True)
df_4['Debut'].fillna('0000', inplace=True)

def find_year(x):
    año = re.search(r'\d{4}', x)
    return int(año.group(0)) if año else 0

df_4['fecha_debut'] = df_4['Debut'].map(find_year)
df_4['fecha_debut'] = df_4['Debut deportivo'].apply(lambda x: np.nan if x.find('2') == -1 else x[x.find('0')-1:x.find('(')])
df_4['club_debut'] = df_4['Debut deportivo'].apply(lambda x: np.nan if x.find('(') == -1 else x[x.find('(')+1:x.find(')')])
df_4['fecha_debut'] = df_4['fecha_debut'].replace(0, np.nan)
# Do not modify the following lines
assert(isinstance(df, pd.DataFrame))
return df
I suggest you use str.extract + combine_first:
df['fecha_debut'] = df['Debut'].str.extract(r'(\d{4})').combine_first(df['Debut Deportivo'].str.extract(r'(\d{4})'))
print(df)
Output
Debut Debut Deportivo fecha_debut
0 27 de mayo de 2006 2006(UD Vecindario) 2006
1 21 de agosto de 2010 11 de agosto de 2010(Portuguesa) 2010
2 21 de agosto de 2010 NaN 2010
3 NaN NaN NaN
For more on how to work with strings in pandas see this.
UPDATE
If you need the column to be numeric you could do:
df['fecha_debut'] = pd.to_numeric(df['fecha_debut']).astype(pd.Int32Dtype())
Note that because you have missing values in the column it cannot be of type int32. It can be either nullable integer or float. For more on working with missing data see this.
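A quick illustration of the nullable dtype, a sketch using the sample years from above:
import pandas as pd

s = pd.Series(['2006', '2010', None])
# to_numeric gives float64 with NaN; Int32Dtype keeps integers plus <NA>.
print(pd.to_numeric(s).astype(pd.Int32Dtype()))
# 0    2006
# 1    2010
# 2    <NA>
# dtype: Int32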

Iterating through a dataframe with info from another dataframe

I have a question that I think is more about logic than about coding. My goal is to calculate for how many kilometers a truck runs loaded and charging (i.e. carrying billed cargo).
I have two Dataframes
Let's call the first one trips:
Date Licence City State KM
01/05/2019 AAA-1111 Sao Paulo SP 10
02/05/2019 AAA-1111 Santos SP 10
03/05/2019 AAA-1111 Rio de Janeiro RJ 20
04/05/2019 AAA-1111 Sao Paulo SP 15
01/05/2019 AAA-2222 Curitiba PR 20
02/05/2019 AAA-2222 Sao Paulo SP 25
Let's call the second one invoice:
Code Date License Origin State Destiny UF Value
A1 01/05/2019 AAA-1111 Sao Paulo SP Rio de Janeiro RJ 10.000,00
A2 01/05/2019 AAA-2222 Curitiba PR Sao Paulo SP 15.000,00
What I need to get is:
Date Licence City State KM Code
01/05/2019 AAA-1111 Sao Paulo SP 10 A1
02/05/2019 AAA-1111 Santos SP 10 A1
03/05/2019 AAA-1111 Rio de Janeiro RJ 20 A1
04/05/2019 AAA-1111 Sao Paulo SP 15 NaN
01/05/2019 AAA-2222 Curitiba PR 20 A2
02/05/2019 AAA-2222 Sao Paulo SP 25 A2
As I said, it is more a question of logic. The truck got its cargo at the initial point, which is São Paulo. How can I iterate through the rows knowing that it passed through Santos loaded and then went to Rio de Janeiro, if I don't have the date when the cargo was delivered?
Thanks
Assuming the rows in the first dataframe (df1) are sorted, here is what I would do.
Note: below I am using df1 for trips and df2 for invoice. Note also that trips uses the header "Licence" while invoice uses "License"; harmonize them before merging.
1. Left join df1 (left) and df2 (right) using as much information as is valid for matching the two dataframes, so that we can find the rows in df1 which are the Origin of a trip. In my test I am using the fields ['Date', 'License', 'City', 'State']; save the result in a new dataframe df3:
df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin':'City'}),
                on=['Date', 'License', 'City', 'State'],
                how='left')
2. Fill the NULL values in df3.Destiny with ffill():
df3['Destiny'] = df3.Destiny.ffill()
3. Set up the group labels with the following flag (a new group starts at each row that has a Code, and at each row where the previous row's City equals the forward-filled Destiny, i.e. the cargo has just been delivered):
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
Note: df3['g'] = g can be added as a column for reference.
4. Update df3.Code using ffill() within the above groups:
df3['Code'] = df3.groupby(g).Code.ffill()
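Putting the four steps together as a self-contained sketch (the frames are rebuilt from the question's samples, with the UF and Value columns omitted for brevity; treat it as illustrative rather than a drop-in):
import pandas as pd

df1 = pd.DataFrame({'Date': ['01/05/2019', '02/05/2019', '03/05/2019', '04/05/2019', '01/05/2019', '02/05/2019'],
                    'License': ['AAA-1111'] * 4 + ['AAA-2222'] * 2,
                    'City': ['Sao Paulo', 'Santos', 'Rio de Janeiro', 'Sao Paulo', 'Curitiba', 'Sao Paulo'],
                    'State': ['SP', 'SP', 'RJ', 'SP', 'PR', 'SP'],
                    'KM': [10, 10, 20, 15, 20, 25]})
df2 = pd.DataFrame({'Code': ['A1', 'A2'],
                    'Date': ['01/05/2019', '01/05/2019'],
                    'License': ['AAA-1111', 'AAA-2222'],
                    'Origin': ['Sao Paulo', 'Curitiba'],
                    'State': ['SP', 'PR'],
                    'Destiny': ['Rio de Janeiro', 'Sao Paulo']})

# Step 1: match trip rows that are the origin of an invoice.
df3 = df1.merge(df2.rename(columns={'Origin': 'City'}),
                on=['Date', 'License', 'City', 'State'], how='left')
# Step 2: carry each invoice's destination forward.
df3['Destiny'] = df3.Destiny.ffill()
# Step 3: new group at each invoiced origin, or once the destination is reached.
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
# Step 4: fill the invoice code within each group.
df3['Code'] = df3.groupby(g).Code.ffill()
print(df3[['Date', 'License', 'City', 'State', 'KM', 'Code']])
# The Code column comes out as A1, A1, A1, NaN, A2, A2 — matching the desired result.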
