Remove characters using re.sub - python

I'm trying to remove special characters with the re.sub() function, but when I add the re.sub() call, the result of my earlier replace() calls is lost.
My code:
import re
import pandas as pd
from IPython.display import display

tabela = pd.read_excel("tst.xlsx")
display(tabela[['nome', 'mensagem', 'arquivo', 'telefone']])
for linha in tabela.index:
    nome = tabela.loc[linha, "nome"]
    mensagem = tabela.loc[linha, "mensagem"]
    acordo = tabela.loc[linha, "acordo"]
    telefone = tabela.loc[linha, "telefone"]
    texto = mensagem.replace("fulano", nome)
    texto = texto.replace("value", acordo)
    texto = texto.replace("phone", telefone)
    texto = re.sub(r"[!!##$%¨&*()_?',;.]", '', telefone)
    print(texto)
Print result:
11
Expected output:
thyago
R$200
11

Try this: the last re.sub call is applied to telefone, which discards the earlier replace results. Change

texto = re.sub(r"[!!##$%¨&*()_?',;.]", '', telefone)

to

texto = re.sub(r"[!!##$%¨&*()_?',;.]", '', texto)
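With that one-word fix applied, a minimal self-contained version of the loop runs end to end. The inline DataFrame below stands in for tst.xlsx, so its column values are made-up assumptions:

```python
import re
import pandas as pd

# Stand-in for pd.read_excel("tst.xlsx"); the values are invented for illustration.
tabela = pd.DataFrame({
    "nome": ["thyago"],
    "mensagem": ["fulano\nvalue\nphone"],
    "acordo": ["R$200"],
    "telefone": ["(11) 91234-5678"],
})

resultados = []
for linha in tabela.index:
    nome = tabela.loc[linha, "nome"]
    mensagem = tabela.loc[linha, "mensagem"]
    acordo = tabela.loc[linha, "acordo"]
    telefone = tabela.loc[linha, "telefone"]
    texto = mensagem.replace("fulano", nome)
    texto = texto.replace("value", acordo)
    texto = texto.replace("phone", telefone)
    # Apply re.sub to texto, not telefone, so the earlier replacements survive.
    # Note the character class also strips the $ from "R$200".
    texto = re.sub(r"[!!##$%¨&*()_?',;.]", '', texto)
    resultados.append(texto)

print(resultados[0])  # thyago / R200 / 11 91234-5678 (on three lines)
```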

Explicit loops are rarely needed in pandas. Try this:

tabela['telefone'] = tabela['telefone'].str.replace(r'[!!##$%¨&*()_?\',;.]', '', regex=True)
tabela['mensagem'] = tabela.apply(lambda x: x['mensagem'].replace('fulano', str(x['nome'])), axis=1)
tabela['mensagem'] = tabela.apply(lambda x: x['mensagem'].replace('value', str(x['acordo'])), axis=1)
tabela['mensagem'] = tabela.apply(lambda x: x['mensagem'].replace('phone', str(x['telefone'])), axis=1)
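As a quick sanity check, the vectorized version can be exercised on a small inline DataFrame (the sample values below are assumptions, not from the original spreadsheet):

```python
import pandas as pd

tabela = pd.DataFrame({
    "nome": ["thyago"],
    "mensagem": ["fulano value phone"],
    "acordo": ["R$200"],
    "telefone": ["(11) 91234.5678"],
})

# Clean the phone column once, then substitute the placeholders column-wise.
tabela['telefone'] = tabela['telefone'].str.replace(r'[!!##$%¨&*()_?\',;.]', '', regex=True)
tabela['mensagem'] = tabela.apply(lambda x: x['mensagem'].replace('fulano', str(x['nome'])), axis=1)
tabela['mensagem'] = tabela.apply(lambda x: x['mensagem'].replace('value', str(x['acordo'])), axis=1)
tabela['mensagem'] = tabela.apply(lambda x: x['mensagem'].replace('phone', str(x['telefone'])), axis=1)

print(tabela.loc[0, 'mensagem'])  # thyago R$200 11 912345678
```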

Related

Iteration skipping lines in a pandas dataframe

I'm trying to iterate through a whole dataframe that is already organized.
The idea is to find out when a common user has a main_user as well: when the keys I use in the code below match, that user has a main_user.
The problem is that some lines are being skipped during the iteration, and I can't find the error in the code.
Here's the code I'm using:
dataframe = gf.read_excel_base(path, sheet_name)
organized_dataframe = gf.organize_dataframe(dataframe)

main_user_data = {
    'Nome titular': '',
    'Nome beneficiário': '',
    'Id Plano de Benefícios': '',
    'Id Contratado': ''
}
user_data = {
    'Nome titular': '',
    'Nome beneficiário': '',
    'Id Plano de Benefícios': '',
    'Id Contratado': ''
}
main_user_list = []
user_list = []
for i, a in enumerate(organized_dataframe['Id Contratado']):
    if gf.is_main_user(organized_dataframe, i):
        main_user_data = gf.user_to_dict(organized_dataframe, i)
    else:
        user_data = gf.user_to_dict(organized_dataframe, i)
        print(user_data['Nome beneficiário'])
        if (main_user_data['Nome titular'] and main_user_data['Id Plano de Benefícios'] and main_user_data['Id Contratado']) == (user_data['Nome titular'] and user_data['Id Plano de Benefícios'] and user_data['Id Contratado']):
            print('deu match')
            main_user_list.append(main_user_data['Nome beneficiário'])
            user_list.append(user_data['Nome beneficiário'])
print(user_list)
The resulting list always stops somewhere in the middle of the dataframe. There are a lot of lines that should match the conditions in the code, but somehow the code does not reach them.
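No answer is attached here, but one suspicious spot is the comparison in the if statement. In Python, (a and b and c) == (d and e and f) does not compare the three fields pairwise: each and chain evaluates to its last truthy operand (or its first falsy one), so only two single values get compared. A small sketch of the pitfall, with invented field values:

```python
# Hypothetical field values standing in for the dict entries in the question.
main_user = ('Alice', 'PlanA', 'C-1')
user = ('Bob', 'PlanB', 'C-1')

# Each `and` chain collapses to its last truthy operand, here 'C-1' on both sides.
broken = (main_user[0] and main_user[1] and main_user[2]) == (user[0] and user[1] and user[2])
print(broken)  # True, even though the first two fields differ

# Comparing tuples (or each field with its counterpart) checks all fields.
correct = (main_user[0], main_user[1], main_user[2]) == (user[0], user[1], user[2])
print(correct)  # False
```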

Pandas: create a function to delete links

I need a function to delete links from my oldText column (more than 1000 rows) in a pandas DataFrame.
I created it using regex, but it doesn't work. This is my code:

def remove_links(text):
    text = re.sub(r'http\S+', '', text)
    text = text.strip('[link]')
    return text

df['newText'] = df['oldText'].apply(remove_links)

I get no error; the code just does nothing.
Your code is working for me:
CSV:
oldText
https://abc.xy/oldText asd
https://abc.xy/oldTe asd
https://abc.xy/oldT
https://abc.xy/old
https://abc.xy/ol
Code:
import pandas as pd
import re

def remove_links(text):
    text = re.sub(r'http\S+', '', text)
    text = text.strip('[link]')
    return text

df = pd.read_csv('test2.csv')
df['newText'] = df['oldText'].apply(remove_links)
print(df)
Result:
oldText newText
0 https://abc.xy/oldText asd asd
1 https://abc.xy/oldTe asd asd
2 https://abc.xy/oldT
3 https://abc.xy/old
4 https://abc.xy/ol
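One caveat about the helper above: str.strip('[link]') does not remove the substring '[link]'. It treats its argument as a set of characters ([, l, i, n, k, ]) and strips any of them from both ends of the string, which can eat legitimate letters. A short illustration:

```python
# strip() removes characters from the set, not the literal substring.
print('nickel'.strip('[link]'))            # cke
print('[link] some text'.strip('[link]'))  # ' some text' (leading marker gone, space kept)

# To remove the literal marker itself, use replace() instead.
print('[link] some text'.replace('[link]', '').strip())  # some text
```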

How to remove special characters from Pandas DF?

I have a Python bot that queries a database, saves the output to a pandas DataFrame, and writes the data to an Excel template.
Yesterday the data was not saved to the Excel template because one of the fields in a record contained the following characters:
", *, /, (, ), :, \n
Pandas failed to save the data to the file.
This is the code that creates the dataframe:
upload_df = sql_df.copy()
This code prepares the template file with time/date stamp
src = file_name.format(val="")
date_str = " " + str(datetime.today().strftime("%d%m%Y%H%M%S"))
dst_file = file_name.format(val=date_str)
copyfile(src, os.path.join(save_path, dst_file))
work_book = load_workbook(os.path.join(save_path, dst_file))
and this code saves the dataframe to the excel file
writer = pd.ExcelWriter(os.path.join(save_path, dst_file), engine='openpyxl')
writer.book = work_book
writer.sheets = {ws.title: ws for ws in work_book.worksheets}
upload_df.to_excel(writer, sheet_name=sheet_name, startrow = 1, index=False, header = False)
writer.save()
My question is: how can I clean the special characters from a specific column [description] in my dataframe before I write it to the Excel template?
I have tried:
upload_df['Name'] = upload_df['Name'].replace(to_replace=r'\W', value=' ', regex=True)
But this removes everything, not just a certain set of special characters.
I guess we could put the characters in a list, iterate through it and run replace for each, but is there a more Pythonic solution?
Adding the data that corrupted the Excel file and prevented pandas from writing the information. This is an example of the text that created the problem; I changed a few normal characters to keep privacy, but it is the same data that corrupted the file:
"""*** CRQ.: N/A *** DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL
PUWERTDTO EL DIA 08-09-2021 A LAS 11:00 HRS. PERA REALIZAR TRWEROS
DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
RWERE DE WERDDFF EN SITIO : ING. JWER ERR3WRR ERRSDFF DFFF :RERFD DDDDF : 33 315678905. 1) ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y
CXCVDDÓN DE DFFFD EN DFDFFDD 2) EN SDFF DE REQUERIRSE: SDFFDF Y SDFDFF
DE EEERRW HJGHJ (ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
RETIRAR JJGHJGHGH
CONSIDERACIONES FGFFDGFG: SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.""S: SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO
PARA DFGFGFGFG Y SOLDAR."""
As some of the special characters to remove are regex metacharacters, we have to escape them before we can replace them with empty strings via regex.
You can automate escaping these special characters with re.escape, as follows:
import re
# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']
special_char_escaped = list(map(re.escape, special_char))
The resultant list of escaped special characters is as follows:
print(special_char_escaped)
['"', '\\*', '/', '\\(', '\\)', ':', '\\\n']
Then, we can remove the special characters with .replace() as follows:
upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)
Demo
Data Setup
upload_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr']})
Name
0 "abc*/(xyz):\npqr
Run the code:
import re
# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']
special_char_escaped = list(map(re.escape, special_char))
upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)
Output:
print(upload_df)
Name
0 abcxyzpqr
Edit
With your edited text sample, here is the result after removing the special characters:
print(upload_df)
Name
0 CRQ. NA DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL PUWERTDTO EL DIA 08-09-2021 A LAS 1100 HRS. PERA REALIZAR TRWEROS DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
1 RWERE DE WERDDFF EN SITIO ING. JWER ERR3WRR ERRSDFF DFFF RERFD DDDDF 33 315678905. 1 ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y CXCVDDÓN DE DFFFD EN DFDFFDD 2 EN SDFF DE REQUERIRSE SDFFDF Y SDFDFF DE EEERRW HJGHJ ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
2 3. RETIRAR JJGHJGHGH
3 CONSIDERACIONES FGFFDGFG SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.S SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO PARA DFGFGFGFG Y SOLDAR.
The special characters listed in your question have all been removed. Please check whether it is ok now.
You could pass the characters as a list to the to_replace parameter. Note that without regex=True, Series.replace only matches entire cell values, so escape the regex metacharacters and enable regex matching to replace them inside strings:
upload_df['Name'] = upload_df['Name'].replace(
    to_replace=['"', r'\*', '/', r'\(', r'\)', ':', '\n'],
    value=' ',
    regex=True
)
Use str.replace:
>>> df
Name
0 (**Hello\nWorld:)
>>> df['Name'] = df['Name'].str.replace(r'''["*/():\\n]''', '', regex=True)
>>> df
Name
0 HelloWorld
Maybe you want to replace line breaks by whitespaces:
>>> df = df.replace({'Name': {r'["*/():]': '',
r'\\n': ' '}}, regex=True)
>>> df
Name
0 Hello World

Remove punctuation from every value in Python dictionary

I have a long dictionary which looks like this:
name = 'Barack.'
name_last = 'Obama!'
street_name = "President Streeet?"
list_of_slot_names = {'name':name, 'name_last':name_last, 'street_name':street_name}
I want to remove the punctuation for every slot (name, name_last, ...).
I could do it this way:
name = name.translate(str.maketrans('', '', string.punctuation))
name_last = name_last.translate(str.maketrans('', '', string.punctuation))
street_name = street_name.translate(str.maketrans('', '', string.punctuation))
Do you know a shorter (more compact) way to write this?
Result:
>>> print(name, name_last, street_name)
Barack Obama President Streeet
Use a loop / dictionary comprehension
{k: v.translate(str.maketrans('', '', string.punctuation)) for k, v in list_of_slot_names.items()}
You can either assign this back to list_of_slot_names if you want to overwrite existing values or assign to a new variable
You can also then print via
print(*list_of_slot_names.values())
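Since str.maketrans builds the same table on every call, you can also construct it once and reuse it across the comprehension (same dictionary as above):

```python
import string

name = 'Barack.'
name_last = 'Obama!'
street_name = "President Streeet?"
list_of_slot_names = {'name': name, 'name_last': name_last, 'street_name': street_name}

# Build the translation table once, then apply it to every value.
table = str.maketrans('', '', string.punctuation)
list_of_slot_names = {k: v.translate(table) for k, v in list_of_slot_names.items()}

print(*list_of_slot_names.values())  # Barack Obama President Streeet
```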
name = 'Barack.'
name_last = 'Obama!'
empty_slot = None
street_name = "President Streeet?"
print([str_.strip('.?!') for str_ in (name, name_last, empty_slot, street_name) if str_ is not None])
-> ['Barack', 'Obama', 'President Streeet']
Unless you also want to remove them from the middle. Then do this
import re
name = 'Barack.'
name_last = 'Obama!'
empty_slot = None
street_name = "President Streeet?"
print([re.sub('[.?!]+',"",str_) for str_ in (name, name_last, empty_slot, street_name) if str_ is not None])
import re, string
s = 'hell:o? wor!d.'
clean = re.sub(rf"[{string.punctuation}]", "", s)
print(clean)
output
hello world

Dealing with special characters in a pandas DataFrame's column name

I am importing an Excel worksheet that has the following column name:
N° Pedido
1234
6424
4563
The column name has a special character (°). Because of that, I can't merge this with another DataFrame or rename the column. I don't get any error message; the name just stays the same. What should I do?
This is the code I am using, and the resulting DataFrames:

import pandas as pd
import numpy as np

# Importing spreadsheets
CRM = pd.ExcelFile(r'C:\Users\Michel\Desktop\Relatorio de Vendas\relatorio_vendas_CRM.xlsx', encoding='utf-8')
protheus = pd.ExcelFile(r'C:\Users\Michel\Desktop\Relatorio de Vendas\relatorio_vendas_protheus.xlsx', encoding='utf-8')

# Turning them into DataFrames
df_crm = CRM.parse('190_pedido_export (33)')
df_protheus = protheus.parse('Relatorio de Pedido de Venda')

# Converting Protheus fields to float
def turn_to_float(x):
    return np.float(x)

df_protheus["TES"] = df_protheus["TES"].apply(turn_to_float)
df_protheus["Qtde"] = df_protheus["Qtde"].apply(turn_to_float)
df_protheus["Valor"] = df_protheus["Valor"].apply(turn_to_float)

# Removing non-sale TES codes from Protheus
# removing values with the wrong code
df_protheus_1 = df_protheus[df_protheus.TES != 513.0]
df_protheus_2 = df_protheus_1[df_protheus_1.TES != 576.0]

df_crm.columns = df_crm.columns.str.replace('N° Pedido', 'teste')
df_crm.columns
Orçamento Origem N° Pedido Nº Pedido ERP Estabelecimento Tipo de
Pedido Classificação(Tipo) Aplicação Conta CNPJ/CPF Contato ...
Aprovação Parcial Antecipa Entrega Desconto da Tabela de Preço
Desconto do Cliente Desconto Informado Observações Observações NF Vl
Total Bruto Vl Total Completo
0 20619.0 23125 NaN Optitex 1 - Venda NaN Industrialização/Revenda
XAVIER E ARAUJO LTDA ME 7970626000170 NaN ... N N 0 0 0
Note that I also tried other code for that last part, with the same result:
#renomeando tabela para dar Merge
#df_crm['proc'] = df_crm['N\xc2\xb0 Pedido']
#df_crm['N Pedido'] = df_crm['N° Pedido']
#df_crm.drop('N° Pedido',inplace=True,axis=1)
#df_crm
#df_crm['N Pedido'] = df_crm['N° Pedido']
#df.drop('N° Pedido',inplace=True,axis=1)
#df_crm
#df_crm_1 = df_crm.rename(columns={"N°Pedido": "teste"})
#df_crm_1
Thanks for posting the link to the Google Sheet. I downloaded it and loaded it via pandas:
df = pd.read_excel(r'~\relatorio_vendas_CRM.xlsx', encoding = 'utf-8')
df.columns = df.columns.str.replace('°', '')
df.columns = df.columns.str.replace('º', '')
Note that the two replace statements are replacing different characters, although they look very similar.
Help from: Why do I get a SyntaxError for a Unicode escape in my file path?
I was able to copy the values into another column. You could try that
df['N Pedido'] = df['N° Pedido']
df.drop('N° Pedido',inplace=True,axis=1)
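If several column names may carry non-ASCII characters, a more general option (a sketch, not from the original answers) is to NFKD-normalize each name and drop whatever does not survive ASCII encoding; the inline DataFrame stands in for the real spreadsheet:

```python
import unicodedata
import pandas as pd

# Stand-in for the imported Excel data.
df = pd.DataFrame({'N° Pedido': [1234, 6424, 4563]})

def to_ascii(name):
    # Decompose accented characters, then drop anything that is not ASCII.
    return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode()

df.columns = [to_ascii(c) for c in df.columns]
print(df.columns.tolist())  # ['N Pedido']
```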
