Extract words between the 2nd and the 3rd comma - python

I am a total newbie to regex, so this question might seem trivial to many of you.
I would like to extract the words between the second and the third comma, like in the sentence:
Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012
I have tried: (?<=,\s)[^,]+(?=,) but this doesn't return what I want...

data = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"
import re
print re.match(".*?,.*?,\s*(.*?),.*", data).group(1)
Output
Cuvee Celine
But for this simple task, you can simply split the string on commas, like this:
data.split(",")[2].strip()
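Either way generalizes; a minimal sketch (using a hypothetical helper name, not part of the original answer) that returns the field between the nth and (n+1)th commas, stripped of surrounding whitespace:
def field_between_commas(text, n):
    # split on commas and strip whitespace; n=2 is the field between the 2nd and 3rd comma
    return text.split(",")[n].strip()

print(field_between_commas(data, 2))  # Cuvee Celine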

In this case I find it easier to use a simple split by comma.
>>> s = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"
>>> s.split(',')[2]
' Cuvee Celine'

Why not just split the string by commas using str.split()?
data.split(",")[2]

Related

Python String split by specific pattern with Indices

I'm trying to split sentences spoken by different characters, where each sentence has its own speaker tag, and store them together with their indices. The names can be, for example, Mike or Steve, with different lengths, and the content can be in multiple languages such as Chinese or Japanese.
content = "A:Hello.B:How are you?A:I'm fine."
which I want to turn into:
[0]A:Hello. , 0:7
[1]B:How are you? , 8:21
[2]A:I'm fine. , 22:33
You can use re.split as follows:
import re
s = "A:Hello.B:How are you?A:I'm fine."
t = re.split(r'[.?]', s)
print(t)
that gives
['A:Hello', 'B:How are you', "A:I'm fine", '']
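The trailing empty string appears because the content ends with a delimiter; if it is unwanted, a small filter (a sketch, not part of the original answer) drops the empty pieces:
t = [piece for piece in re.split(r'[.?]', s) if piece]
print(t)  # ['A:Hello', 'B:How are you', "A:I'm fine"]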
You can use re.finditer for the task:
import re
content = "A:Hello.B:How are you?A:I'm fine."
for idx, i in enumerate(re.finditer(r'(.*?[.?])(?=[A-Z]|\Z)', content)):
    print('[{}]{:<20}, {}:{}'.format(idx, i.group(1), i.start(), i.end()-1))
Prints:
[0]A:Hello. , 0:7
[1]B:How are you? , 8:21
[2]A:I'm fine. , 22:32
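If the segments and their positions are needed as data rather than printed output, the same pattern can be collected into tuples (a follow-up sketch):
segments = [(m.group(1), m.start(), m.end() - 1)
            for m in re.finditer(r'(.*?[.?])(?=[A-Z]|\Z)', content)]
# [('A:Hello.', 0, 7), ('B:How are you?', 8, 21), ("A:I'm fine.", 22, 32)]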

Cleaning a dataset and removing special characters in python

I am fairly new to all of this, so apologies in advance.
I've got a dataset (CSV). One column contains strings with whole sentences. These sentences contain misinterpreted UTF-8 characters like ’ and emojis like 🥳.
So the dataframe (df) looks kind of like this:
date text
0 Jul 31 2020 it’s crazy. i hope post-covid we can get it done🥳
1 Jul 31 2020 just sayin’ ...
2 Jul 31 2020 nba to hold first games in 'bubble' amid pandemic
The goal is to do a sentiment analysis on the texts.
Would it be best to remove ALL special characters like , . ( ) [ ] + | - to do the sentiment analysis?
How do I do that, and how do I also remove the misinterpreted UTF-8 characters like ’?
I've tried it myself by using some code I found and adapting it to my problem.
This resulted in the piece of code below, which seems to do absolutely nothing: the characters like ’ are still in the text.
spec_chars = ["…", "🥳"]
for char in spec_chars:
    df['text'] = df['text'].str.replace(char, ' ')
I'm a bit lost here.
I appreciate any help!
You can change the character encoding like this. x is one of the sentences in the original post.
x = 'it’s crazy. i hope post-covid we can get it done🥳'
x.encode('windows-1252').decode('utf8')
The result is "it's crazy. i hope post-covid we can get it done🥳"
As jsmart stated, use .encode / .decode. Since the column is a Series, you'd use .str to access the values of the Series as strings and apply the methods.
As far as the text sentiment goes, look at NLTK, and take a look at its examples of sentiment analysis.
import pandas as pd

df = pd.DataFrame([['Jul 31 2020', 'it’s crazy. i hope post-covid we can get it done🥳'],
                   ['Jul 31 2020', 'just sayin’ ...'],
                   ['Jul 31 2020', "nba to hold first games in 'bubble' amid pandemic"]],
                  columns=['date', 'text'])
df['text'] = df['text'].str.encode('windows-1252').str.decode('utf8')
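If only some rows are garbled, a blanket encode/decode can raise errors on rows that do not round-trip; a defensive sketch (my own assumption, not part of the original answer) falls back to the original text in that case:
def fix_mojibake(text):
    # re-interpret text that was decoded with the wrong codec; keep it unchanged if the round trip fails
    try:
        return text.encode('windows-1252').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

df['text'] = df['text'].apply(fix_mojibake)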
Try this. It's quite helpful for me.
df['clean_text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word.isalnum()]))
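Note that word.isalnum() drops any token containing punctuation, a hyphen, or an emoji entirely, so this filter is fairly aggressive; a worked illustration (not from the original answer) on the first sample sentence:
sample = "it's crazy. i hope post-covid we can get it done🥳"
print(' '.join(word for word in sample.split() if word.isalnum()))  # i hope we can get it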

How to extract a substring that begins with a specific substring and ends with another specific substring?

I have something like this:
"1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)"
And I want to extract
Salzburg (SZG), Sarajevo (SJJ), Bourgas (BOJ), Varna (VAR)
import re
sentence = "1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)"
regs = re.findall(r'[A-Za-z]+\s\([A-Z]{3}\)', sentence)
print(regs)
This follows from my logic in the comments.
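For reference (the original answer does not show it), running this on the sample string prints:
['Salzburg (SZG)', 'Vienna (VIE)', 'Sarajevo (SJJ)', 'Bourgas (BOJ)', 'Varna (VAR)']
Note that this also picks up Vienna (VIE), which the desired output omits.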

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> and </temp>); however, other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything lying between the two <temp> tags (because of the .*?):
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, on may appear later in the string, not between <temp></temp>, and I wouldn't want to replace that.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
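A related variant, sketched here as my own assumption rather than the exact suggestion from the comments, runs the replacement over each whole <temp>...</temp> match, so every ' on ' inside the tags is replaced while any 'on' outside them is left untouched:
import re

string = "<temp>The sale happened on February 22nd</temp>"
result = re.sub(r'<temp>.*?</temp>',
                lambda m: m.group(0).replace(' on ', ' {replace} '),
                string, flags=re.DOTALL)
print(result)  # <temp>The sale happened {replace} February 22nd</temp>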
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True).decode())
Output:
<temp>The sale happened {replace} February 22nd</temp>

Python Regex - Sentence not including strings

I have a series of sentences I am trying to decipher. Here are two examples:
Valid for brunch on Saturdays and Sundays
and
Valid for brunch
I want to compose a regex that identifies the word brunch, but only in the case where the sentence does not include the word saturday or sunday. How can I modify the following regex to do this?
re.compile(r'\bbrunch\b',re.I)
^(?!.*saturday)(?!.*sunday).*(brunch)
You can try it this way and grab the capture. See the demo:
https://regex101.com/r/nL5yL3/18
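Plugged into Python with the original re.I flag (a usage sketch; the flag keeps the lookaheads case-insensitive, so "Saturdays" is still excluded):
import re

pattern = re.compile(r'^(?!.*saturday)(?!.*sunday).*(brunch)', re.I)
print(pattern.search("Valid for brunch on Saturdays and Sundays"))  # None
print(pattern.search("Valid for brunch").group(1))                  # brunch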
Use a list comprehension: if you have all the sentences in a list called sentences, you can use the following comprehension:
import re
[re.search(r'\bbrunch\b', s) for s in sentences if 'saturday' not in s.lower() and 'sunday' not in s.lower()]
I would do it like this:
>>> sent = ["Valid for brunch on Saturdays and Sundays", "Valid for brunch"]
>>> sent
['Valid for brunch on Saturdays and Sundays', 'Valid for brunch']
>>> import re
>>> for i in sent:
...     if not re.search(r'(?i)(?:saturday|sunday)', i) and re.search(r'brunch', i):
...         print(i)
...
Valid for brunch
