import re
#example 1
input_text = "((PERSON)María Rosa) ((VERB)pasará) unos dias aqui, hay que ((VERB)mover) sus cosas viejas de aqui, ya que sus cosméticos ((VERB)estorban) si ((VERB)estan) tirados por aquí. ((PERSON)Cyntia) es una buena modelo, su cabello es muy bello, hay que ((VERB)lavar) su cabello"
#example 2
input_text = "Sus útiles escolares ((VERB)estan) aqui, me sorprende que ((PERSON)Juan Carlos) los haya olvidado siendo que suele ((VERB)ser) tan cuidadoso con sus útiles."
#I need replace "sus" or "su" but under certain conditions
subject_capture_pattern = r"\(\(PERSON\)((?:\w\s*)+)\)" #underlined in red in the image
associated_info_capture_pattern = r"(?:sus|su)\s+((?:\w\s*)+)(?:\s+(?:del|de )|\s*(?:\(\(VERB\)|[.,;]))" #underlined in green in the image
identification_pattern =
replacement_sequence =
input_text = re.sub(identification_pattern, replacement_sequence, input_text, flags = re.IGNORECASE)
this is the correct output:
#for example 1
"((PERSON)María Rosa) ((VERB)pasará) unos dias aqui, hay que ((VERB)mover) cosas viejas ((CONTEXT) de María Rosa) de aqui, ya que cosméticos ((CONTEXT) de María Rosa) ((VERB)estorban) si ((VERB)estan) tirados por aquí. ((PERSON)Cyntia) es una buena modelo, cabello ((CONTEXT) de Cyntia) ((VERB)es) muy bello, hay que ((VERB)lavar) cabello ((CONTEXT) de Cyntia)"
#for example 2
"útiles escolares ((CONTEXT) NO DATA) ((VERB)estan) aqui, me sorprende que ((PERSON)Juan Carlos) los haya olvidado siendo que suele ((VERB)ser) tan cuidadoso con útiles ((CONTEXT) Juan Carlos)."
Details:
Replace the possessive pronouns "sus" or "su" with "de " + the content inside the last ((PERSON) "THIS SUBSTRING"), and if there is no ((PERSON) "THIS SUBSTRING") before then replace sus or su with ((PERSON) NO DATA)
Sentences are read from left to right, so the replacement will be the substring inside the parentheses ((PERSON)the substring) before that "sus" or "su", as shown in the example.
In the end, the replaced substrings should end up with this structure:
associated_info_capture_pattern + "((CONTEXT)" + subject_capture_pattern + ")"
This shows a way to do the replacement of su/sus like you asked for (albeit not with just a single re.sub). I didn't move the additional info, but you could modify it to handle that as well.
import re
subject_capture_pattern = r"\(\(PERSON\)((?:\w\s*)+)\)"
def replace_su_and_sus(input_text):
start = 0
replacement = "((PERSON) NO DATA)"
output_text = ""
for m in re.finditer(subject_capture_pattern, input_text):
output_text += re.sub(r"\b[Ss]us?\b", replacement, input_text[start:m.end()])
start = m.end()
replacement = m.group(0).replace("(PERSON)", "(CONTEXT) de ")
output_text += re.sub(r"\b[Ss]us?\b", replacement, input_text[start:])
return output_text
My strategy was:
Up until the first subject capture, replace su/sus with "NO DATA"
Up until the second subject capture, replace su/sus with the name from the first capture
Proceed similarly for each subsequent subject capture
Finally, replace any su/sus between the last subject capture and the end of the string
I have the following output:
datos=['Venta Casas CARRETERA NACIONAL, Nuevo León', 'Publicado el 29 de Abr', 'ABEDUL DE LADERAS', '3 Recámaras', '4.5 Baños', '300m² de Construcción', '300m² de Terreno', '2 Plantas', ' 81-1255-3166', ' Preguntale al vendedor', 'http://zuhausebienesraices.nocnok.com/', "INFOSTATS_ADOAnuncios('5', '30559440');"]
And I would like to assign a different variable to each item if it is in the list otherwise it will be 0. For example:
recamara= the string from the list that has the word "Recámara"
bano= the string from the list that has the string "Baño"
and so on. And if the word "Baño" is not in the list then bano= 0
If you are using Python you can use list comprehension to do this.
datos = ['Venta Casas CARRETERA NACIONAL, Nuevo León', 'Publicado el 29 de Abr', 'ABEDUL DE LADERAS', '3 Recámaras', '4.5 Baños', '300m² de Construcción', '300m² de Terreno', '2 Plantas', ' 81-1255-3166', ' Preguntale al vendedor', 'http://zuhausebienesraices.nocnok.com/', "INFOSTATS_ADOAnuncios('5', '30559440');"]
# list of strings which has "Casas" in it
casas_list = [string for string in datos if "Casas" in string]
print(casas_list)
print(len(casas_list))
recamara = [s for s in datos if "Recámara" in s]
bano = [s for s in datos if "Baño" in s]
if len(recamara)==0:
recamara = 0
else:
print(recamara[0]) #print the entire list if there will be more than 1 string
if len(bano)==0:
bano = 0
else:
print(bano[0])
# coding=utf-8
import re
m = "Hola esto es un ejemplo Objeto: esta es una de, las palabras."
keywords = ['Objeto:', 'OBJETO', 'Objeto social:', 'Objetos']
obj = re.compile(r'\b(?:{})\b\s*(.*?),'.format('|'.join(map(re.escape, keywords))))
print obj.findall(m)
I want to print text between one of words of keywords and the next point. Output that I want in these case: "esta es una de, las palabras."
the trailing \b prevents the match because your keyword ends with :
simplify your regex by removing it. Plus the greedy / comma (.*?), is only extracting the first part before comma, I suppose you meant "to the next point": (.*?)\.
obj = re.compile(r'\b(?:{})\s*(.*?)\.'.format('|'.join(map(re.escape, keywords))))
result:
['esta es una de, las palabras']
Removing the word boundary can match part of keywords in sentences though. You could force a non-word char with \W afterwards and it would work (acting like word boundary):
obj = re.compile(r'\b(?:{})\W\s*(.*?)\.'.format('|'.join(map(re.escape, keywords))))
Use \b(?:{0})\s*(.*?)(?=\b(?:{0})|$) with lookahead instead:
import re
m = "Hola esto es un ejemplo Objeto: esta es una de, las palabras."
keywords = ['Objeto:', 'OBJETO', 'Objeto social:', 'Objetos']
obj = re.compile(r'\b(?:{0})\s*(.*?)(?=\b(?:{0})|$)'.format('|'.join(map(re.escape, keywords))))
print(obj.findall(m))
This outputs:
['esta es una de, las palabras.']
I have a problem with me file csv. It's saving with spaces in middle of each row. I don't know why. How do I solve this problem? I'm asking because I don't find any answer and solutions to this.
Here is the code:
import csv
import random
def dict_ID_aeropuertos():
with open('AeropuertosArg.csv') as archivo_csv:
leer = csv.reader(archivo_csv)
dic_ID = {}
for linea in leer:
dic_ID.setdefault(linea[0],linea[1])
archivo_csv.close()
return dic_ID
def ruteoAleatorio():
dic_ID = dict_ID_aeropuertos()
lista_ID = list(dic_ID.keys())
cont = 0
lista_rutas = []
while (cont < 50):
r1 = random.choice(lista_ID)
r2 = random.choice(lista_ID)
if (r1 != r2):
t = (r1,r2)
if (t not in lista_rutas):
lista_rutas.append(t)
cont += 1
with open('rutasAeropuertos.csv', 'w') as archivo_rutas:
escribir = csv.writer(archivo_rutas)
escribir.writerows(lista_rutas)
archivo_rutas.close()
ruteoAleatorio()
Here is the file csv AeropuertosArg.cvs:
1,Aeroparque Jorge Newbery,Ciudad Autonoma de Buenos Aires,Ciudad Autonoma de Buenos Aires,-34.55803,-58.417009
2,Aeropuerto Internacional Ministro Pistarini,Ezeiza,Buenos Aires,-34.815004,-58.5348284
3,Aeropuerto Internacional Ingeniero Ambrosio Taravella,Cordoba,Cordoba,-31.315437,-64.21232
4,Aeropuerto Internacional Gobernador Francisco Gabrielli,Ciudad de Mendoza,Mendoza,-32.827864,-68.79849
5,Aeropuerto Internacional Teniente Luis Candelaria,San Carlos de Bariloche,Rio Negro,-41.146714,-71.16203
6,Aeropuerto Internacional de Salta Martin Miguel de Guemes,Ciudad de Salta,Salta,-24.84423,-65.478412
7,Aeropuerto Internacional de Puerto Iguazu,Puerto Iguazu,Misiones,-25.731778,-54.476181
8,Aeropuerto Internacional Presidente Peron,Ciudad de Neuquen,Neuquen,-38.952137,-68.140484
9,Aeropuerto Internacional Malvinas Argentinas,Ushuaia,Tierra del Fuego,-54.842237,-68.309701
10,Aeropuerto Internacional Rosario Islas Malvinas,Rosario,Santa Fe,-32.916887,-60.780391
11,Aeropuerto Internacional Comandante Armando Tola,El Calafate,Santa Cruz,-50.283977,-72.053641
12,Aeropuerto Internacional General Enrique Mosconi,Comodoro Rivadavia,Chubut,-45.789435,-67.467498
13,Aeropuerto Internacional Teniente General Benjamin Matienzo,San Miguel de Tucuman,Tucuman,-26.835888,-65.108361
14,Aeropuerto Comandante Espora,Bahia Blanca,Buenos Aires,-38.716152,-62.164955
15,Aeropuerto Almirante Marcos A. Zar,Trelew,Chubut,-43.209957,-65.273405
16,Aeropuerto Internacional de Resistencia,Resistencia,Chaco,-27.444926,-59.048739
17,Aeropuerto Internacional Astor Piazolla,Mar del Plata,Buenos Aires,-37.933205,-57.581518
18,Aeropuerto Internacional Gobernador Horacio Guzman,San Salvador de Jujuy,Jujuy,-24.385987,-65.093755
19,Aeropuerto Internacional Piloto Civil Norberto Fernandez,Rio Gallegos,Santa Cruz,-51.611788,-69.306315
20,Aeropuerto Domingo Faustino Sarmiento,San Juan,San Juan,-31.571814,-68.422568
Your problem is, that the csv-module writerows has its own "newline"-logic. It interferes with the default newline behaviour of open():
Fix like this:
with open('rutasAeropuertos.csv', 'w', newline='' ) as archivo_rutas:
# ^^^^^^^^^^
This is also documented in the example in the documentation: csv.writer(csvfile, dialect='excel', **fmtparams):
If csvfile is a file object, it should be opened with newline='' [1]
with a link to a footnote telling you:
[1] If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
You are using windows which does use \r\n which adds another \r which leads to your "wrong" output.
Full code with some optimizations:
import csv
import random
def dict_ID_aeropuertos():
with open('AeropuertosArg.csv') as archivo_csv:
leer = csv.reader(archivo_csv)
dic_ID = {}
for linea in leer:
dic_ID.setdefault(linea[0],linea[1])
return dic_ID
def ruteoAleatorio():
dic_ID = dict_ID_aeropuertos()
lista_ID = list(dic_ID.keys())
lista_rutas = set() # a set only holds unique values
while (len(lista_rutas) < 50): # simply check the length of the set
r1,r2 = random.sample(lista_ID, k=2) # draw 2 different ones
lista_rutas.add( (r1,r2) ) # you can not add duplicates, no need to check
with open('rutasAeropuertos.csv', 'w', newline='' ) as archivo_rutas:
escribir = csv.writer(archivo_rutas)
escribir.writerows(lista_rutas)
ruteoAleatorio()
Output:
9,3
16,10
15,6
[snipp lots of values]
13,14
13,7
20,4
I am importing an excel worksheet that has the following columns name:
N° Pedido
1234
6424
4563
The column name ha a special character (°). Because of that, I can´t merge this with another Data Frame or rename the column. I don´t get any error message just the name stays the same. What should I do?
This is the code I am using and the result of the Dataframes:
import pandas as pd
import numpy as np
# Importando Planilhas
CRM = pd.ExcelFile(r'C:\Users\Michel\Desktop\Relatorio de
Vendas\relatorio_vendas_CRM.xlsx', encoding= 'utf-8')
protheus = pd.ExcelFile(r'C:\Users\Michel\Desktop\Relatorio de
Vendas\relatorio_vendas_protheus.xlsx', encoding= 'utf-8')
#transformando em Data Frame
df_crm = CRM.parse('190_pedido_export (33)')
df_protheus = protheus.parse('Relatorio de Pedido de Venda')]
# Transformando Campos em float o protheus
def turn_to_float(x):
return np.float(x)
df_protheus["TES"] = df_protheus["TES"].apply(turn_to_float)
df_protheus["Qtde"] = df_protheus["Qtde"].apply(turn_to_float)
df_protheus["Valor"] = df_protheus["Valor"].apply(turn_to_float)
#Tirando Tes de não venda do protheus
# tirando valores com código errado 6
df_protheus_1 = df_protheus[df_protheus.TES != 513.0]
df_protheus_2 = df_protheus_1[df_protheus_1.TES != 576.0]
**df_crm.columns = df_crm.columns.str.replace('N° Pedido', 'teste')
df_crm.columns**
Orçamento Origem N° Pedido Nº Pedido ERP Estabelecimento Tipo de
Pedido Classificação(Tipo) Aplicação Conta CNPJ/CPF Contato ...
Aprovação Parcial Antecipa Entrega Desconto da Tabela de Preço
Desconto do Cliente Desconto Informado Observações Observações NF Vl
Total Bruto Vl Total Completo
0 20619.0 23125 NaN Optitex 1 - Venda NaN Industrialização/Revenda
XAVIER E ARAUJO LTDA ME 7970626000170 NaN ... N N 0 0 0
Note that I used other codes for the bold part with the same result:
#renomeando tabela para dar Merge
#df_crm['proc'] = df_crm['N\xc2\xb0 Pedido']
#df_crm['N Pedido'] = df_crm['N° Pedido']
#df_crm.drop('N° Pedido',inplace=True,axis=1)
#df_crm
#df_crm['N Pedido'] = df_crm['N° Pedido']
#df.drop('N° Pedido',inplace=True,axis=1)
#df_crm
#df_crm_1 = df_crm.rename(columns={"N°Pedido": "teste"})
#df_crm_1
Thanks for posting the link to the Google Sheet. I downloaded it and loaded it via pandas:
df = pd.read_excel(r'~\relatorio_vendas_CRM.xlsx', encoding = 'utf-8')
df.columns = df.columns.str.replace('°', '')
df.columns = df.columns.str.replace('º', '')
Note that the two replace statements are replacing different characters, although they look very similar.
Help from: Why do I get a SyntaxError for a Unicode escape in my file path?
I was able to copy the values into another column. You could try that
df['N Pedido'] = df['N° Pedido']
df.drop('N° Pedido',inplace=True,axis=1)