I'm trying to remove rows whose text in the tweets column is blank or whitespace-only. I have tried different approaches, such as counting the rows that contain only whitespace and counting the leading and trailing spaces, but I can't find a criterion to eliminate them.
ID  tweet           WhiteSpaceCount  HaveWhiteSpace
0   this is a text  0                False
1                   0                False
2   Hello im fine   0                False
I want to delete all the rows that don't have any information in the tweet column.
Code here:
from datetime import date, timedelta

import numpy as np
import pandas as pd

def extractAndSave(api, name):
    # Build the list of tweets:
    previous_date = date.today() - timedelta(days=1)
    query_date = date.today()
    tweets = api.search(q=name + "-filter:retweets", result_type='recent', timeout=999999, count=200,
                        end_time=previous_date, tweet_mode='extended')
    # Build a dataframe from the tweet texts:
    tweet_list = []
    for tweet in tweets:
        tweet_list.append(tweet.full_text)
    datos = pd.DataFrame(data=tweet_list, columns=['TWEETS'])
    # Create the ID column
    id_list = []
    for id in tweets:
        id_list.append(id.id)
    id = pd.DataFrame(data=id_list, columns=['ID'])
    # Create the creation-date column
    creado_list = []
    for creado in tweets:
        creado_list.append(creado.created_at)
    creado = pd.DataFrame(data=creado_list, columns=['FECHA_CREACION'])
    # Create the user-name column
    user_list = []
    for usuario in tweets:
        user_list.append(usuario.user.screen_name)
    usuario = pd.DataFrame(data=user_list, columns=['USUARIO'])
    # Create the source column
    fuente_list = []
    for fuente in tweets:
        fuente_list.append(fuente.source)
    fuente = pd.DataFrame(data=fuente_list, columns=['FUENTE'])
    # Create the likes column
    like_list = []
    for like in tweets:
        like_list.append(like.favorite_count)
    like = pd.DataFrame(data=like_list, columns=['ME_GUSTA'])
    # Create the retweets column
    rt_list = []
    for rt in tweets:
        rt_list.append(rt.retweet_count)
    retweet = pd.DataFrame(data=rt_list, columns=['RT'])
    # Create the language column
    idioma_list = []
    for idioma in tweets:
        idioma_list.append(idioma.lang)
    idioma = pd.DataFrame(data=idioma_list, columns=['IDIOMA'])
    # Create the quoted-status column
    quote_list = []
    for quote in tweets:
        quote_list.append(quote.is_quote_status)
    quote = pd.DataFrame(data=quote_list, columns=['CITADO'])
    # Create the location column
    location_list = []
    for location in tweets:
        location_list.append(location.user.location)
    location = pd.DataFrame(data=location_list, columns=['LOCACION'])
    # Concatenate the dataframes
    datos = pd.concat([datos, id, creado, usuario, fuente, like, retweet, quote, idioma, location], axis=1)
    # Drop the whole row if the TWEETS column is empty.
    datos['pass/fail'] = np.where(datos['TWEETS'].astype(str).str.fullmatch(r"\s*"), 'FAIL', 'PASS')
    datos['CONTEO_ESPACIOS'] = (datos['TWEETS'].str.startswith(" ") | datos['TWEETS'].str.endswith(" ")).sum()
    # Publication time
    datos['HORA_PUBLICACION'] = datos['FECHA_CREACION'].dt.hour
    datos['DIA_SEMANA'] = datos['FECHA_CREACION'].dt.day_name()
    # Keep only the previous day's tweets
    datos['FECHA_CREACION'] = pd.to_datetime(datos['FECHA_CREACION']).dt.date
    datos = datos[datos['FECHA_CREACION'] == previous_date]
    print(datos)
    # Return the dataframe for saving.
    return datos
Instead of removing rows that you don't need, keep only the ones you do need:
df = df[df["tweet"].str.strip().str.len()>0]
>>> df
   ID           tweet  WhiteSpaceCount  HaveWhiteSpace
0   0  this is a text                0           False
2   2   Hello im fine                0           False
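Applied to the TWEETS column from the question's function, a minimal sketch (assuming the column may also contain NaN, which str.strip() would otherwise propagate):
import pandas as pd

datos = pd.DataFrame({'TWEETS': ['this is a text', '   ', None, 'Hello im fine']})
# Keep only rows whose tweet text has at least one non-whitespace character;
# fillna('') makes NaN rows fail the length test instead of propagating NaN.
datos = datos[datos['TWEETS'].fillna('').str.strip().str.len() > 0]
print(datos)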
I have a Python bot that queries a database, saves the output to a Pandas DataFrame and writes the data to an Excel template.
Yesterday the data was not saved to the Excel template because one of the fields in a record contained the following characters:
", *, /, (, ), :, \n
Pandas failed to save the data to the file.
This is the code that creates the dataframe:
upload_df = sql_df.copy()
This code prepares the template file with a time/date stamp:
from datetime import datetime
from shutil import copyfile
import os

from openpyxl import load_workbook

src = file_name.format(val="")
date_str = " " + str(datetime.today().strftime("%d%m%Y%H%M%S"))
dst_file = file_name.format(val=date_str)
copyfile(src, os.path.join(save_path, dst_file))
work_book = load_workbook(os.path.join(save_path, dst_file))
and this code saves the dataframe to the Excel file:
writer = pd.ExcelWriter(os.path.join(save_path, dst_file), engine='openpyxl')
writer.book = work_book
writer.sheets = {ws.title: ws for ws in work_book.worksheets}
upload_df.to_excel(writer, sheet_name=sheet_name, startrow = 1, index=False, header = False)
writer.save()
My question is: how can I clean the special characters from a specific column ([description]) in my dataframe before I write it to the Excel template?
I have tried:
upload_df['Name'] = upload_df['Name'].replace(to_replace= r'\W',value=' ',regex=True)
But this removes every non-word character, not just the particular special characters above.
I guess we could put the characters in a list, iterate through it and run the replace for each one, but is there a more Pythonic solution?
Adding the data that corrupted the Excel file and prevented pandas from writing the information.
This is an example of the text that created the problem; I changed a few normal characters to keep privacy, but it is the same data that corrupted the file:
"""*** CRQ.: N/A *** DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL
PUWERTDTO EL DIA 08-09-2021 A LAS 11:00 HRS. PERA REALIZAR TRWEROS
DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
RWERE DE WERDDFF EN SITIO : ING. JWER ERR3WRR ERRSDFF DFFF :RERFD DDDDF : 33 315678905. 1) ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y
CXCVDDÓN DE DFFFD EN DFDFFDD 2) EN SDFF DE REQUERIRSE: SDFFDF Y SDFDFF
DE EEERRW HJGHJ (ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
RETIRAR JJGHJGHGH
CONSIDERACIONES FGFFDGFG: SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.""S: SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO
PARA DFGFGFGFG Y SOLDAR."""
As some of the special characters to remove are regex metacharacters, we have to escape them before we can replace them with empty strings using regex.
You can automate escaping these special characters with re.escape, as follows:
import re
# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']
special_char_escaped = list(map(re.escape, special_char))
The resultant list of escaped special characters is as follows:
print(special_char_escaped)
['"', '\\*', '/', '\\(', '\\)', ':', '\\\n']
Then, we can remove the special characters with .replace() as follows:
upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)
Demo
Data Setup
upload_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr']})
Name
0 "abc*/(xyz):\npqr
Run the code:
import re
# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']
special_char_escaped = list(map(re.escape, special_char))
upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)
Output:
print(upload_df)
Name
0 abcxyzpqr
Edit
With your edited text sample, here is the result after removing the special characters:
print(upload_df)
Name
0 CRQ. NA DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL PUWERTDTO EL DIA 08-09-2021 A LAS 1100 HRS. PERA REALIZAR TRWEROS DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
1 RWERE DE WERDDFF EN SITIO ING. JWER ERR3WRR ERRSDFF DFFF RERFD DDDDF 33 315678905. 1 ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y CXCVDDÓN DE DFFFD EN DFDFFDD 2 EN SDFF DE REQUERIRSE SDFFDF Y SDFDFF DE EEERRW HJGHJ ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
2 3. RETIRAR JJGHJGHGH
3 CONSIDERACIONES FGFFDGFG SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.S SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO PARA DFGFGFGFG Y SOLDAR.
The special characters listed in your question have all been removed. Please check whether it is ok now.
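An equivalent single-pass variant (a sketch along the same lines, not from the answer above) folds the escaped characters into one character class, so the series is scanned once instead of once per character:
import re
import pandas as pd

special_char = ['"', '*', '/', '(', ')', ':', '\n']
# Build one character class that matches any of the special characters.
pattern = '[' + ''.join(map(re.escape, special_char)) + ']'

upload_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr']})
upload_df['Name'] = upload_df['Name'].str.replace(pattern, '', regex=True)
print(upload_df)  # 0  abcxyzpqr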
You could use the following (pass the characters as a list to the to_replace parameter; the regex metacharacters must be escaped and regex=True passed, since otherwise replace() only matches whole cell values):
upload_df['Name'] = upload_df['Name'].replace(
    to_replace=['"', r'\*', '/', r'\(', r'\)', ':', '\n'],
    value=' ',
    regex=True
)
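For reference, this is how the snippet behaves on the demo value used above (a minimal self-contained sketch; note the specials become spaces rather than being deleted):
import pandas as pd

upload_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr']})
# Each listed character is replaced with a space, not removed.
upload_df['Name'] = upload_df['Name'].replace(
    to_replace=['"', r'\*', '/', r'\(', r'\)', ':', '\n'],
    value=' ',
    regex=True
)
print(upload_df)  # 0   abc   xyz   pqr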
Use str.replace:
>>> df
Name
0 (**Hello\nWorld:)
>>> df['Name'] = df['Name'].str.replace(r'["*/():\n]', '', regex=True)
>>> df
Name
0 HelloWorld
Maybe you want to replace line breaks with whitespace:
>>> df = df.replace({'Name': {r'["*/():]': '',
                              r'\n': ' '}}, regex=True)
>>> df
Name
0 Hello World
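Separately, if the failed write was actually openpyxl raising IllegalCharacterError on control characters (an assumption; the question does not show the traceback), openpyxl ships a regex for exactly the characters it refuses to write, which you can apply before to_excel():
import pandas as pd
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

# Hypothetical value containing ASCII control characters.
upload_df = pd.DataFrame({'description': ['bad\x01value\x02here']})
# Strip everything openpyxl would reject before writing the workbook.
upload_df['description'] = upload_df['description'].astype(str).str.replace(
    ILLEGAL_CHARACTERS_RE, '', regex=True)
print(upload_df)  # 0  badvaluehere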
This is my code:
import pandas as pd
import os
import glob as g

archivos = g.glob(r'C:\Users\Desktop\*.csv')
for archiv in archivos:
    nombre = os.path.splitext(archiv)[0]
    df = pd.read_csv(archiv, sep=",")
    d = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
    df['FECHA_LECTURA'] = d.dt.date
    del df['DATA_LEITURA']
    df['CONSUMO'] = ""
    df['DIAS'] = ""
    df["SUMDIAS"] = ""
    df["SUMCONS"] = ""
    df["CONSANUAL"] = ""
    ordenado = df.sort_values(['NR_CPE', 'FECHA_LECTURA', 'HORA_LEITURA'], ascending=True)
    # Group by CPE
    agrupado = ordenado.groupby('NR_CPE')
    for name, group in agrupado:  # iterate over each group
        indice = group.index.values
        inicio = indice[0]
        fin = indice[-1]
        # Initialize the first reading of each CPE (there is no previous reading to subtract)
        ordenado.CONSUMO.loc[inicio] = 0
        ordenado.DIAS.loc[inicio] = 0
        cont = 0
        for i in indice:  # iterate over the readings inside each CPE group
            if i > inicio and i <= fin:
                cont = cont + 1
                consumo = ordenado.VALOR_LEITURA[indice[cont]] - ordenado.VALOR_LEITURA[indice[cont - 1]]
                dias = (ordenado.FECHA_LECTURA[indice[cont]] - ordenado.FECHA_LECTURA[indice[cont - 1]]).days
                ordenado.CONSUMO.loc[i] = consumo
                ordenado.DIAS.loc[i] = dias
    # Compute the sums; each result is a pandas object
    dias = agrupado['DIAS'].sum()
    consu = agrupado['CONSUMO'].sum()
    canu = (consu / dias) * 365
    # Counters for the number of occurrences of groups A, B and C
    conta = 0
    contb = 0
    contc = 0
    # Iterate over the summed days to classify each CPE
    print("Groups:")
    for ind, sumdias in dias.items():
        if sumdias <= 180:
            grupo = "A"
            conta = conta + 1
        elif sumdias > 180 and sumdias <= 365:
            grupo = "B"
            contb = contb + 1
        elif sumdias > 365:
            grupo = "C"
            contc = contc + 1
    print("group A: ", conta)
    print("group B: ", contb)
    print("group C: ", contc)
    # Format the fields so we do not show all the decimals
    Fdias = dias.map('{:.0f}'.format)
    Fcanu = canu.map('{:.2f}'.format)
    frames = [Fdias, consu, Fcanu]
    concat = pd.concat(frames, axis=1).replace(['inf', 'nan'], [0, 0])
    with open(r'C:\Users\Documents\RPE_PORTUGAL\Datos.csv', 'a') as f:
        concat.to_csv(f, header=False, columns=['CPE', 'DIAS', 'CONSUMO', 'CONSUMO_ANUAL'])
    try:
        ordenado.to_excel(nombre + '.xls', columns=["NOME_DISTRITO",
                          "NR_CPE", "MARCA_EQUIPAMENTO", "NR_EQUIPAMENTO", "VALOR_LEITURA", "REGISTADOR", "TIPO_REGISTADOR",
                          "TIPO_DADOS_RECOLHIDOS", "FACTOR_MULTIPLICATIVO_FINAL", "NR_DIGITOS_INTEIRO", "UNIDADE_MEDIDA",
                          "TIPO_LEITURA", "MOTIVO_LEITURA", "ESTADO_LEITURA", "HORA_LEITURA", "FECHA_LECTURA", "CONSUMO", "DIAS"],
                          index=False)
        print(archiv)
        print("===============================================")
        print("*****The file was created successfully*****")
        print("===============================================")
    except IOError:
        print("===================================================")
        print("!!!!!There was an error writing the file!!!!!")
        print("===================================================")
This takes a file where I have energy-consumption readings from different dates for every light meter ('NR_CPE') and does some calculations:
Calculate the energy consumption for every 'NR_CPE' by subtracting each reading from the next one, and put the result in a new column named 'CONSUMO'.
Calculate the number of days for which I have readings and sum up the number of days.
Add up the consumption for every 'NR_CPE' and calculate the annual consumption.
Finally, classify every light meter ('NR_CPE') by the number of days it has readings for: A if it has less than 180 days, B between 180 days and 1 year, and C more than a year.
And finally write these results to two different files.
Any idea of how I should re-code this to produce the same output faster?
Thank you all.
BTW this is my dataset:
,NOME_DISTRITO,NR_CPE,MARCA_EQUIPAMENTO,NR_EQUIPAMENTO,VALOR_LEITURA,REGISTADOR,TIPO_REGISTADOR,TIPO_DADOS_RECOLHIDOS,FACTOR_MULTIPLICATIVO_FINAL,NR_DIGITOS_INTEIRO,UNIDADE_MEDIDA,TIPO_LEITURA,MOTIVO_LEITURA,ESTADO_LEITURA,DATA_LEITURA,HORA_LEITURA
0,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,1,1,A,20150629,205600
1,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,2,2,A,20160218,123300
2,GUARDA,A002000642VJ,122,204534,25083,001,S,1,1,5,kWh,1,1,A,20150629,205700
3,GUARDA,A002000642VJ,122,204534,27536,001,S,1,1,5,kWh,2,2,A,20160218,123200
4,GUARDA,A002000642HR,101,1383899,11734,001,S,1,1,5,kWh,1,1,A,20150629,205600
5,GUARDA,A002000642HR,101,1383899,11800,001,S,1,1,5,kWh,2,2,A,20160218,123000
6,GUARDA,A002000995VM,101,97706436,12158,001,S,1,1,5,kWh,1,3,A,20150713,155300
7,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160129,162300
8,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160202,195800
9,GUARDA,A2000995VM,101,97706436,12163,001,S,1,1,5,kWh,1,3,A,20160404,145200
10,GUARDA,A002000996LV,168,5011703276,3567,001,V,1,1,6,kWh,1,1,A,20150528,205900
11,GUARDA,A02000996LV,168,5011703276,3697,001,V,1,1,6,kWh,2,2,A,20150929,163500
12,GUARDA,A02000996LV,168,5011703276,1287,002,P,1,1,6,kWh,1,1,A,20150528,205900
Generally you want to avoid for loops in pandas.
For example, the first loop, where you calculate total consumption and days, could be rewritten as a groupby-apply, something like:
def last_minus_first(df):
    columns_of_interest = df[['VALOR_LEITURA', 'days']]
    diff = columns_of_interest.iloc[-1] - columns_of_interest.iloc[0]
    return diff

df['date'] = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
df['days'] = (df['date'] - pd.Timestamp(1970, 1, 1)).dt.days  # create days column
df.groupby('NR_CPE').apply(last_minus_first)
(BTW, I don't understand why you are subtracting each entry from the previous; surely for meter readings this is the same as last minus first?)
Then given the result of the above as consumption, you can replace your second for loop (for ind, sumdias in dias.iteritems()) with something like:
import numpy as np  # np.inf serves as the open-ended upper bound

pd.cut(consumption.days, [-1, 180, 365, np.inf], labels=['a', 'b', 'c']).value_counts()
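Putting the two pieces together on a couple of rows shaped like the sample dataset (a sketch; the meter IDs and reading values are illustrative only):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'NR_CPE': ['A002000642VW', 'A002000642VW', 'A002000642VJ', 'A002000642VJ'],
    'VALOR_LEITURA': [4834, 4900, 25083, 27536],
    'DATA_LEITURA': [20150629, 20160218, 20150629, 20160218],
})
df['date'] = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
df['days'] = (df['date'] - pd.Timestamp(1970, 1, 1)).dt.days

# Consumption and elapsed days per meter: last reading minus first.
def last_minus_first(group):
    cols = group[['VALOR_LEITURA', 'days']]
    return cols.iloc[-1] - cols.iloc[0]

consumption = df.sort_values('date').groupby('NR_CPE').apply(last_minus_first)
consumption['CONSANUAL'] = consumption['VALOR_LEITURA'] / consumption['days'] * 365

# Classify meters by the span of days covered and count each class.
print(pd.cut(consumption.days, [-1, 180, 365, np.inf],
             labels=['a', 'b', 'c']).value_counts())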