Extracting the first mention from retweets in a python dataframe

My df looks like this:
import pandas as pd
record = {
    'text_con_RT_t': ['RT #Blanc_: ramdon text #hashtag quiere ', '#GonM ramdon text', 'RT #IvEc: #GonzM ramdon text', 'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', '']
}
# create a dataframe
dataframe2 = pd.DataFrame(record, columns=['text_con_RT_t', 'rt'])
I would like to get something like this:
text_con_RT_t                             rt   usr_rt
RT #Blanc_: ramdon text #hashtag quiere   RT   blanc
#GonM ramdon text
RT #IvEc: #GonzM ramdon text              RT   ivec
hOLA ramdon text
But I haven't succeeded. In cases where the text starts with a mention but is not a retweet, my result looks like this:
text_con_RT_t                             rt   usr_rt
RT #Blanc_: ramdon text #hashtag quiere   RT   blanc
#GonM ramdon text                              gonm ramdon text
RT #IvEc: #GonzM ramdon text              RT   ivec
hOLA ramdon text                               NaN
I have tried this:
try:
    dataframe2["usr_rt"] = dataframe2.text_con_RT_t.str.lower().str.split(':').str[0].str.split('#').str[1]
except dataframe2["rt"]==None: # complicated failed
    dataframe2["usr_rt"] = ""
I also tried this:
if dataframe2["rt"] == "RT":
    return (dataframe2["usr_rt"] == dataframe2.text_con_RT_t.str.split(':').str[0].str.split('#').str[1])
What am I missing? Thanks.

You can use numpy.where to keep the extracted value only in rows where rt is 'RT':
import numpy as np
dataframe2['usr_rt'] = np.where(
    dataframe2.rt == 'RT',
    dataframe2.text_con_RT_t.str.extract(r'#(\w+)', expand=False).str.lower(),
    ''
)
dataframe2
   text_con_RT_t                             rt   usr_rt
0  RT #Blanc_: ramdon text #hashtag quiere   RT   blanc_
1  #GonM ramdon text
2  RT #IvEc: #GonzM ramdon text              RT   ivec
3  hOLA ramdon text
Or, if retweets always start with RT, you can use the regex RT.*?#(\w+):
dataframe2['usr_rt'] = dataframe2.text_con_RT_t.str.extract(r'RT.*?#(\w+)', expand=False).str.lower()
dataframe2
   text_con_RT_t                             rt   usr_rt
0  RT #Blanc_: ramdon text #hashtag quiere   RT   blanc_
1  #GonM ramdon text                              NaN
2  RT #IvEc: #GonzM ramdon text              RT   ivec
3  hOLA ramdon text                               NaN
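As a self-contained variation on the regex approach, the sketch below anchors the pattern to the start of the text and fills non-matches with an empty string instead of NaN. The pattern ^RT\s+#(\w+) is an assumption about how retweets are formatted in this data (a leading "RT", whitespace, then the mention), not something the question confirms.

```python
import pandas as pd

dataframe2 = pd.DataFrame({
    "text_con_RT_t": [
        "RT #Blanc_: ramdon text #hashtag quiere ",
        "#GonM ramdon text",
        "RT #IvEc: #GonzM ramdon text",
        "hOLA ramdon text ",
    ],
    "rt": ["RT", "", "RT", ""],
})

# ^RT\s+ ties the match to a leading "RT", so a text that merely starts
# with a mention (row 1) is not treated as a retweet.
dataframe2["usr_rt"] = (
    dataframe2["text_con_RT_t"]
    .str.extract(r"^RT\s+#(\w+)", expand=False)  # NaN where nothing matches
    .str.lower()
    .fillna("")                                  # empty string instead of NaN
)

usr_rt = dataframe2["usr_rt"].tolist()
```

Because \w also matches the underscore, the first row yields blanc_ rather than blanc. If the trailing underscore should go, as in the desired output of the question, chaining .str.rstrip('_') before fillna is one option.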

I would [personally] find it easier to create the values for the new column from record. If you add the column to record first, you don't need to change the DataFrame afterwards (which I prefer since I'm not great with numpy, so I would just end up extracting the column as a list and doing what I've done below anyway).
# allowedChars = ''  # '+-.'  # add allowed characters
record['usr_rt'] = ['' if not rt == 'RT' else ''.join(
    c for c in txt.split('#', 1)[-1].split(':')[0].lower()
    if c.isalpha() or c.isdigit()  # or c in allowedChars
) for txt, rt in zip(record['text_con_RT_t'], record['rt'])]
if c.isalpha() keeps only alphabetic characters; remove or c.isdigit() if you also want to drop digits from the username, and uncomment allowedChars and or c in allowedChars if you want to allow some special characters (that includes spaces, though I don't think usernames have any).
Now pd.DataFrame(record) returns a DataFrame that looks like this:
text_con_RT_t                             rt   usr_rt
RT #Blanc_: ramdon text #hashtag quiere   RT   blanc
#GonM ramdon text
RT #IvEc: #GonzM ramdon text              RT   ivec
hOLA ramdon text

Related

Python to excel using XLWT

I have this big output. Every time a URL appears, the number below it is the ID (which is important), and each URL in the output corresponds to one request to the API.
A preview of the output:
https://xxxxxx/api/v1/implantacao/projeto/254/tarefa?start=0&limit=50
254
INFORME DE FRANQUIA VENDIDA
T01
1
PONTO COMERCIAL
CONTRATO DE LOCAÇÃO ASSINADO
T04
1
PONTO COMERCIAL
INTEGRAÇÃO: TIMELINE | ACESSOS
T05
1
TIMELINE
E-MAIL: CRIAR E-MAILS
T144
1
TIMELINE
ACESSOS SULTS: CRIAR ACESSOS
T145
1
TIMELINE
PRECIFICAÇÃO: ANÁLISE DE MARGEM E MARK-UP
T142
2
MIX INICIAL
VETERINÁRIO: PERÍODO DE SELEÇÃO DE CLÍNICO GERAL
T86
2
VETERINÁRIO
ESPECIALISTAS: PESQUISA POR PROFISSIONAIS
T157
2
DRA. MEI
TREINAMENTO: VESTINDO A CAMISA (PRESENCIAL)
T91
2
That part of the code:
def unidades(data):
    for i in data['data']:
        name = (i['nome'])
        codigo = (i['codigo'])
        situation = (i['situacao'])
        subName = (i['fase']['nome'])
How can I write this to Excel using xlwt?
I tried:
line = 0
for i in data['data']:
    name = (i['nome'])
    sheet.write(1, line, name)
    line += 1
But the code above didn't work; it only wrote 7 items to my Excel file.

How to remove special characters from Pandas DF?

I have a Python bot that queries a database, saves the output to a pandas DataFrame and writes the data to an Excel template.
Yesterday the data was not saved to the Excel template because one of the fields in a record contained the following characters:
", *, /, (, ), :,\n
Pandas failed to save the data to the file.
This is the code that creates the dataframe:
upload_df = sql_df.copy()
This code prepares the template file with time/date stamp
src = file_name.format(val="")
date_str = " " + str(datetime.today().strftime("%d%m%Y%H%M%S"))
dst_file = file_name.format(val=date_str)
copyfile(src, os.path.join(save_path, dst_file))
work_book = load_workbook(os.path.join(save_path, dst_file))
and this code saves the dataframe to the excel file
writer = pd.ExcelWriter(os.path.join(save_path, dst_file), engine='openpyxl')
writer.book = work_book
writer.sheets = {ws.title: ws for ws in work_book.worksheets}
upload_df.to_excel(writer, sheet_name=sheet_name, startrow = 1, index=False, header = False)
writer.save()
My question is: how can I remove the special characters from a specific column [description] in my dataframe before I write it to the Excel template?
I have tried:
upload_df['Name'] = upload_df['Name'].replace(to_replace= r'\W',value=' ',regex=True)
But this removes every non-word character rather than only specific special characters.
I guess we could iterate through a list of characters and run a replace for each one, but is there a more Pythonic solution?
Here is the data that corrupted the Excel file and prevented pandas from writing the information.
This is an example of the text that caused the problem. I changed a few ordinary characters for privacy, but it is otherwise the same data that corrupted the file:
"""*** CRQ.: N/A *** DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL
PUWERTDTO EL DIA 08-09-2021 A LAS 11:00 HRS. PERA REALIZAR TRWEROS
DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
RWERE DE WERDDFF EN SITIO : ING. JWER ERR3WRR ERRSDFF DFFF :RERFD DDDDF : 33 315678905. 1) ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y
CXCVDDÓN DE DFFFD EN DFDFFDD 2) EN SDFF DE REQUERIRSE: SDFFDF Y SDFDFF
DE EEERRW HJGHJ (ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
RETIRAR JJGHJGHGH
CONSIDERACIONES FGFFDGFG: SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.""S: SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO
PARA DFGFGFGFG Y SOLDAR."""

Some of the special characters to remove are regex metacharacters, so they must be escaped before they can be replaced with empty strings using a regex.
You can automate the escaping with re.escape, as follows:
import re
# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']
special_char_escaped = list(map(re.escape, special_char))
The resultant list of escaped special characters is as follows:
print(special_char_escaped)
['"', '\\*', '/', '\\(', '\\)', ':', '\\\n']
Then, we can remove the special characters with .replace() as follows:
upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)
Demo
Data Setup
upload_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr']})
Name
0 "abc*/(xyz):\npqr
Run the code:
import re
# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']
special_char_escaped = list(map(re.escape, special_char))
upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)
Output:
print(upload_df)
Name
0 abcxyzpqr
Edit
With your edited text sample, here is the result after removing the special characters:
print(upload_df)
Name
0 CRQ. NA DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL PUWERTDTO EL DIA 08-09-2021 A LAS 1100 HRS. PERA REALIZAR TRWEROS DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
1 RWERE DE WERDDFF EN SITIO ING. JWER ERR3WRR ERRSDFF DFFF RERFD DDDDF 33 315678905. 1 ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y CXCVDDÓN DE DFFFD EN DFDFFDD 2 EN SDFF DE REQUERIRSE SDFFDF Y SDFDFF DE EEERRW HJGHJ ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
2 3. RETIRAR JJGHJGHGH
3 CONSIDERACIONES FGFFDGFG SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.S SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO PARA DFGFGFGFG Y SOLDAR.
The special characters listed in your question have all been removed. Please check whether it is ok now.
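The escaping step above can also be folded into a single character class, so the whole clean-up is one substitution. The following is a minimal sketch using only the standard re module; build_pattern and strip_chars are hypothetical helper names introduced here for illustration.

```python
import re

# Characters listed in the question.
SPECIAL_CHARS = ['"', '*', '/', '(', ')', ':', '\n']

def build_pattern(chars):
    """Compile a character class that matches any one of the given characters."""
    return re.compile("[" + "".join(re.escape(c) for c in chars) + "]")

def strip_chars(text, pattern, repl=""):
    """Replace every character matched by pattern with repl."""
    return pattern.sub(repl, text)

pattern = build_pattern(SPECIAL_CHARS)
cleaned = strip_chars('"abc*/(xyz):\npqr', pattern)
spaced = strip_chars("a:b\nc", pattern, repl=" ")
```

To apply it to a column, something like upload_df['Name'].map(lambda s: strip_chars(s, pattern)) should do, assuming the column holds only strings.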

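You could use the following (pass the characters as a list to the to_replace parameter). Note that regex=True is needed to replace substrings, since without it replace only matches whole cell values, and that the regex metacharacters must be escaped:
upload_df['Name'] = upload_df['Name'].replace(
    to_replace=['"', r'\*', '/', r'\(', r'\)', ':', '\n'],
    value=' ',
    regex=True
)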

Use str.replace:
>>> df
                Name
0  (**Hello\nWorld:)
>>> df['Name'] = df['Name'].str.replace(r'["*/():\n]', '', regex=True)
>>> df
         Name
0  HelloWorld
Maybe you want to replace line breaks with spaces instead:
>>> df = df.replace({'Name': {r'["*/():]': '',
...                           r'\n': ' '}}, regex=True)
>>> df
          Name
0  Hello World

How to extract specific lemma or pos/tag using spacy?

I am using spaCy to lemmatize and parse a list of sentences. The data are contained in an Excel file.
I would like to write a function that returns different lemmas from my sentences.
For example, returning only the lemmas with a specific tag ("VERB", or "VERB" + "ADJ").
This is my code:
import spacy
from spacy.lang.fr import French
from spacy_lefff import LefffLemmatizer, POSTagger
nlp = spacy.load("fr_core_news_sm")
nlp=spacy.load('fr')
parser = French()
path = 'Gold.xlsx'
my_sheet ="Gold"
df = read_excel(path, sheet_name= my_sheet)
def tokenizeTexte(sample):
    tokens = parser(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in stopwords]
    return tokens
df['Preprocess_verbatim'] = df.apply(lambda row:tokenizeTexte(row['verbatim']), axis=1)
print(df)
df.to_excel('output.xlsx')
I would like to be able to return all lemmas with, for example, the "verb", "adj" or "adv" tag, and then modify it to return all lemmas.
I also wish to return different combinations of lemmas ("PRON" + "VERB" + "ADJ").
How can I do that with spaCy?
This is what I obtain with my code:
id ... Preprocess_verbatim
0 463 ... [(ce, , ), (concept, , ), (résoudre, , ), (que...
1 2647 ... [(alors, , ), (ça, , ), (vouloir, , ), (dire, ...
2 5391 ... [(ça, , ), (ne, , ), (changer, , ), (rien, , )...
3 1120 ... [(sur, , ), (le, , ), (station, , ), (de, , ),
tok.tag_ and tok.pos_ come back empty. Do you know why?
Example of my data:
id verbatim
14 L'économe originellement est donc celui qui a la responsabilité, pour des personnes d'une maison, d'une unité d'organisation donnée .
25 De leur donner des rations de ressources au temps opportun.
56 Contrairement à l'idée qu'on se fait l'économe n'est pas axé sur le capital, c'est-à-dire sur l'action de capitaliser, mais sur les individus d'une unité organisation, c'est-à-dire sur l'action de partager, de redistribuer d'une façon juste et opportune des ressources aux différents membre

First, I think your model isn't working correctly because you're defining the nlp object twice. I believe you only need it once. I am also not sure what parser is doing and I'm not sure you need it. For this code, I would use something like the following:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(sample)
tokens = [tok for tok in doc]
Then, doc is a spacy Doc object, and tokens is a list of spaCy Token objects. From here, the loop that iterates over your tokens would work.
If you want to do the POS selection in your existing preprocessing function, I think you only need to change one line in your loop:
for tok in tokens:
    if tok.pos_ in ("VERB", "ADJ", "ADV"):
        lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
This will only add tokens with those specific parts of speech to your lemmas list.
I also noticed another issue in your code on this line further down:
tokens = [tok for tok in tokens if tok not in stopwords]
At this point tok is your tuple of (lemma, tag, pos), so unless your list of stopwords is tuples of the same format, and not only lemmas or tokens you want to exclude, this step will not exclude anything.
Putting it all together, you'd have something like this, which would return a list of tuples of (lemma, tag, pos) if the POS is correct:
import spacy
nlp = spacy.load("fr_core_news_sm")
stopwords = ["here", "are", "some", "stopwords"]
def tokenizeTexte(sample):
    doc = nlp(sample)
    lemmas = []
    for tok in doc:  # iterate over the Doc itself
        if tok.pos_ in ("VERB", "ADJ", "ADV"):
            lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
    tokens = [(lemma, tag, pos) for (lemma, tag, pos) in lemmas if lemma not in stopwords]
    return tokens
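To switch between different POS selections without editing the function each time, the filtering step can be separated from the spaCy call. The sketch below works on plain (lemma, tag, pos) tuples of the shape built above; filter_lemmas is a hypothetical helper, and the sample tuples are hand-written for illustration rather than real spaCy output.

```python
def filter_lemmas(tokens, keep_pos, stopwords=()):
    """Return the lemmas whose POS is in keep_pos and that are not stopwords."""
    stop = set(stopwords)
    return [lemma for lemma, tag, pos in tokens
            if pos in keep_pos and lemma not in stop]

# Hand-written sample in the (lemma, tag, pos) shape; values are illustrative.
sample = [
    ("ce", "PRON", "PRON"),
    ("concept", "NOUN", "NOUN"),
    ("résoudre", "VERB", "VERB"),
    ("vraiment", "ADV", "ADV"),
    ("être", "AUX", "AUX"),
    ("utile", "ADJ", "ADJ"),
]

verbs_only = filter_lemmas(sample, {"VERB"})
verbs_adj_adv = filter_lemmas(sample, {"VERB", "ADJ", "ADV"},
                              stopwords=["vraiment"])
```

Passing a different set, for example {"PRON", "VERB", "ADJ"}, gives another combination without touching the tokenizing code.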

Text to word per line + named entity tag in Python

I’m making a Named Entity Recognizer and I’m struggling with putting data into the right format, using Python. What I have is a certain string and a list of the named entities in that text with belonging tags. For example:
text = "Hidden Figures is a 2016 American biographical drama film directed by Theodore Melfi and written by Melfi and Allison Schroeder."
This string can also be "[[Hidden Figures]] is a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]." if that makes it easier.
listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER', 'Melfi PER', 'Allison Schroeder PER']
What I want as output is:
Hidden PRO
Figures PRO
is O
a O
2016 O
American LOC
biographical O
drama O
film O
directed O
by O
Theodore PER
Melfi PER
and O
written O
by O
Melfi PER
and O
Allison PER
Schroeder PER
. O
So far I’ve only gotten as far as the following function:
import re
def wordPerLine(text, neplustags):
    text = re.sub(r"([?!,.]+)", r" \1 ", text)
    wpl = text.split()
    output = []
    for line in wpl:
        output.append(line + " O")
    return output
Which gives every line the default tag O (which is the tag for non-named entities). How can I make it so that the named entities in the text get the right tag?

This could work. The print calls should be replaced with something else and the regex needs refinement, but it's a good start.
import re
text = "[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]."
tags = {"Hidden test Figures": "PRO", "American": "LOC", 'Theodore Melfi': "PER", 'Melfi': "PER", 'Allison Schroeder': "PER"}
text = re.sub(r"([?!,.]+)", r" \1", text)
search = ""
inTag = False
for w in text.split(" "):
    outTag = False
    rest = w
    if rest[:2] == "[[":
        rest = rest[2:]
        inTag = True
    if rest[-2:] == "]]":
        rest = rest[:-2]
        outTag = True
    if inTag:
        search += rest
        if outTag:
            val = tags[search]
            for word in search.split():
                print(word + ": " + val)
            inTag = False
            search = ""
        else:
            search += " "
    else:
        print(rest + ": O")
Input:
[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]].
Output:
Hidden: PRO
test: PRO
Figures: PRO
is: O
,: O
a: O
2016: O
American: LOC
biographical: O
drama: O
film: O
directed: O
by: O
Theodore: PER
Melfi: PER
and: O
written: O
by: O
Melfi: PER
and: O
Allison: PER
Schroeder: PER
.: O
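A variant that returns the lines instead of printing them, in the "word TAG" format the question asks for, could look like the sketch below. It assumes the entities are marked with [[...]] in the text and that every marked entity has an entry in the tags dictionary; tag_words is a hypothetical name.

```python
import re

def tag_words(text, tags):
    """Return 'word TAG' lines; entities are marked as [[...]] in text."""
    lines = []
    # Split into bracketed entities and the plain text between them.
    for chunk in re.split(r"(\[\[.*?\]\])", text):
        if chunk.startswith("[[") and chunk.endswith("]]"):
            entity = chunk[2:-2]
            label = tags[entity]
            lines.extend(f"{word} {label}" for word in entity.split())
        else:
            # Separate words from punctuation and tag each piece as O.
            for word in re.findall(r"\w+|[^\w\s]", chunk):
                lines.append(f"{word} O")
    return lines

tags = {"Hidden Figures": "PRO", "American": "LOC", "Theodore Melfi": "PER"}
text = "[[Hidden Figures]] is a 2016 [[American]] film by [[Theodore Melfi]]."
result = tag_words(text, tags)
```

The \w+|[^\w\s] pattern also splits words at apostrophes and hyphens, so it may need refinement for other texts.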

How to read a particular line of interest from a text file?

Here I have a text file. I want to read Adress, Beneficiary, Beneficiary Bank, Acc Nbr, Total US$, the Date at the top, RUT and BOX. I tried writing some code myself, but I cannot reliably get the required information, and if the length of a field changes I will not get the correct output. How should I do this so that each required value ends up in its own string?
The main problem arises when my slices go wrong. For example, I am using line[31:] for Acc Nbr, but if the address changes then my slice will also be wrong.
My Text.txt
2014-11-09 BOX 1531 20140908123456 RUT 21 654321 0123
Girry S.A. CONTADO
G 5 Y Serie A
NO 098765
11 al Rayo 321 - Oqwerty 108 Monteaudio - Gruguay
Pharm Cosco, Inc - Britania PO Box 43215
Dirección Hot Springs AR 71903 - Estados Unidos
Oescripción Importe
US$
DO 7640183 - 50% of the Production Degree 246,123
Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM
Adress: Tencon 108 Monteaudio, Gruguay.
Beneficiary: Girry SA Acc Nbr: 1234567
Servicios prestados en el exterior, exentos de IVA o IRAE
Subtotal US$ 102,500
Iva US$ ---------------
Total US$ 102,500
I.V.A AL DIA Fecha de Vencimiento
IMPRENTA IRIS LTDA. - RUT 210161234015 - 0/40987 17/11/2015
CONSTANCIA N9 1234559842 -04/2013
CONTADO A 000.001/ A 000.050 x 2 VIAS
QWERTYAS ZXCVBIZADA
R. U.T. Bamprador Asdfumldor Final
Fecha 12/12/2014
1º ORIGINAL CLLLTE (Blanco) 2º CASIA AQWERVO (Rosasd)
My Code:
txt = 'Text.txt'
lines = [line.rstrip('\n') for line in open(txt)]
for line in lines:
    if 'BOX' in line:
        Date = line.split("BOX")[0]
        BOX = line.split('BOX ', 1)[-1].split("RUT")[0]
        RUT = line.split('RUT ',1)[-1]
        print('Date : ' + Date)
        print('BOX : ' + BOX)
        print('RUT : ' + RUT)
    if 'Adress' in line:
        Adress = line[8:]
        print('Adress : ' + Adress)
    if 'NO ' in line:
        Invoice_No = line.split('NO ',1)[-1]
        print('Invoice_No : ' + Invoice_No)
    if 'Swift:' in line:
        Swift = line.split('Swift: ',1)[-1]
        print('Swift : ' + Swift)
    if 'Fecha' in line and '/' in line:
        Invoice_Date = line.split('Fecha ',1)[-1]
        print('Invoice_Date : ' + Invoice_Date)
    if 'Beneficiary Bank' in line:
        Beneficiary_Bank = line[18:]
        Ben_Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print('Beneficiary_Bank : ' + Beneficiary_Bank.split("Acc")[0])
        print('Ben_Acc_Nbr : ' + Ben_Acc_Nbr.split("Swift")[0])
    if 'Beneficiary' in line and 'Beneficiary Bank' not in line:
        Beneficiary = line[13:]
        print('Beneficiary : ' + Beneficiary.split("Acc")[0])
    if 'Acc Nbr' in line:
        Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print('Acc_Nbr : ' + Acc_Nbr)
    if 'Total US$' in line:
        Total_US = line.split('US$ ', 1)[-1]
        print('Total_US : ' + Total_US)
Output:
Date : 2014-11-09
BOX : 1531 20140908123456
RUT : 21 654321 0123
Invoice_No : 098765
Swift : MANIUYMM
Beneficiary_Bank : Bankue Heritage (Gruguay) S.A
Ben_Acc_Nbr : 1234563
Adress : Tencon 108 Monteaudio, Gruguay.
Beneficiary : Girry SA
Acc_Nbr : 1234567
Total_US : 102,500
Invoice_Date : 12/12/2014
Some Code Changes
I have made some changes but still I am not convinced as I need to provide spaces also in split.

I would recommend using regular expressions to extract the information you need. That way you don't have to count character offsets.
import re
with open(r'C:\Quad.txt') as f:
    for line in f:
        # \S+ captures the value; a trailing lazy (.*?) would always match an empty string
        match = re.search(r"Acc Nbr: (\S+)", line)
        if match is not None:
            Acc_Nbr = match.group(1)
            print(Acc_Nbr)
        # etc...
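Extending the same idea to several fields, one pattern per field can be kept in a dictionary so that no character offsets are needed. This is a minimal sketch: the patterns are assumptions based on the sample file in the question, and FIELD_PATTERNS and extract_fields are hypothetical names.

```python
import re

# Hypothetical patterns; each one captures the value that follows its label.
FIELD_PATTERNS = {
    "Beneficiary_Bank": r"Beneficiary Bank:\s*(.*?)\s+Account Nbr:",
    "Ben_Acc_Nbr": r"Account Nbr:\s*(\d+)",
    "Swift": r"Swift:\s*(\S+)",
    "Adress": r"Adress:\s*(.+)",
    "Beneficiary": r"^Beneficiary:\s*(.*?)\s+Acc Nbr:",
    "Acc_Nbr": r"Acc Nbr:\s*(\d+)",
    "Total_US": r"^Total US\$\s*([\d.,]+)",
}

def extract_fields(lines):
    """Return {field: value}, keeping the first match found for each field."""
    found = {}
    for line in lines:
        line = line.strip()
        for name, pattern in FIELD_PATTERNS.items():
            if name in found:
                continue
            match = re.search(pattern, line)
            if match:
                found[name] = match.group(1).strip()
    return found

# Lines copied from the sample file in the question.
sample_lines = [
    "Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM",
    "Adress: Tencon 108 Monteaudio, Gruguay.",
    "Beneficiary: Girry SA Acc Nbr: 1234567",
    "Subtotal US$ 102,500",
    "Total US$ 102,500",
]
fields = extract_fields(sample_lines)
```

Anchoring with ^ keeps "Subtotal US$" from being read as the total, and the \s* after each label tolerates a varying number of spaces.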

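You can also search for the label to obtain its index. For example:
label = "Acc Nbr: "
if label in line:
    # skip past the label itself to reach the value
    Acc_Nbr = line[line.find(label) + len(label):]
    print(Acc_Nbr)
Note that find gives you the index of the first character of the string you searched for.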
