My problem is that this simple regex with the alternation operator | only gives me the result I want when the first alternative, on the left side of the |, is present in the sentence. Could someone tell me why it isn't working for the second alternative as well?
import re
b = 'this is a good day to die hard'
jeff = re.search('good night (.+)hard|good day (.+)hard', b)
print(jeff.group(1))
You have two sets of capturing parentheses, and therefore two numbered capturing groups. If the second branch matches, group(1) will be None and group(2) will contain what the second group matched.
There are several ways to fix this. One would be to write so that there is just one group, for example
jeff = re.search('good (?:day|night) (.+)hard', b)
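To see the difference, here is a small sketch that runs both patterns on the sentence from the question. It is written for Python 3 (print as a function), and the variable names two_groups and one_group are only illustrative.

```python
import re

b = 'this is a good day to die hard'

# Original pattern: each branch of the alternation has its own capturing group.
two_groups = re.search('good night (.+)hard|good day (.+)hard', b)
# The first branch did not match, so group 1 is None and group 2 holds the text.
print(two_groups.groups())

# Rewritten pattern: the alternation is non-capturing, so only one group exists.
one_group = re.search('good (?:day|night) (.+)hard', b)
print(one_group.group(1))
```

With the single-group version, group(1) is filled in no matter which alternative matched.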
The second (...) creates the second capturing group that you need to access with .group(2).
You may write a regex whose first group captures day or night, and whose second group captures everything after it up to the end of the string.
import re
b = 'this is a good day to die hard'
jeff = re.search('good (day|night) (.+)', b)
if jeff:
    print(jeff.group(1))
    print(jeff.group(2))
Output of the demo:
day
to die hard
So far I have done this, but it returns the movie name, whereas I want the year (e.g. 1995) in a separate list.
moviename=[]
for i in names:
    moviename.append(i.split(' (',1)[0])
One issue with your code is that you're taking the first element of the list returned by split, which is the movie title. You want the second element, split()[1].
That being said, this solution won't work very well for a couple of reasons.
You will still have the second parenthesis in the year "1995)"
It won't work if the title itself contains parentheses (e.g. for Shanghai Triad)
If the year is always at the end of each string, you could do something like this.
movie_years = []
for movie in names:
    movie_years.append(movie[-5:-1])
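The two pitfalls and the slicing alternative can be checked on a small sample. The names list below is an assumption made up for illustration (the question does not show its data); the second title is chosen because it contains parentheses of its own.

```python
# Hypothetical sample data; the real `names` list is not shown in the question.
names = [
    'Toy Story (1995)',
    'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)',
]

# Pitfall 1: the second element of split() still carries the closing parenthesis.
after_split_simple = names[0].split(' (', 1)[1]
print(after_split_simple)

# Pitfall 2: a title with its own parentheses is split at the wrong place.
after_split_nested = names[1].split(' (', 1)[1]
print(after_split_nested)

# Slicing works as long as every string ends with a four-digit year in parentheses.
movie_years = [movie[-5:-1] for movie in names]
print(movie_years)
```

The slice relies entirely on the "(YYYY)" suffix being the last six characters, so any trailing whitespace in the data would have to be stripped first.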
You could use a regular expression.
\(\d+\) will match an opening bracket, followed by one or more digit characters (0-9), followed by a closing bracket.
Put only the \d+ part inside a capturing group to get only that part.
year_regex = r'\((\d+)\)'
moviename=[]
for i in names:
    if re.search(year_regex, i):
        moviename.append(re.search(year_regex, i).group(1))
By the way, you can make this all more concise using a list comprehension:
year_regex = r'\((\d+)\)'
moviename = [re.search(year_regex, name_and_year).group(1)
             for name_and_year in names
             if re.search(year_regex, name_and_year)]
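Both versions above run the search twice per title. A possible variant, sketched here with a precompiled pattern and a made-up names list (an assumption, since the question's data is not shown), searches once and keeps the match object:

```python
import re

# Same pattern as above, compiled once; only the digits are captured.
year_regex = re.compile(r'\((\d+)\)')

# Hypothetical sample data, including one entry without a year.
names = [
    'Toy Story (1995)',
    'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)',
    'Untitled',
]

years = []
for name in names:
    match = year_regex.search(name)   # search once, reuse the result
    if match:                         # entries without "(digits)" are skipped
        years.append(match.group(1))

print(years)
```

Because the first pair of parentheses in the Shanghai Triad title contains no digits, the pattern skips it and finds the year.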
I have a string column in a DataFrame from which I want to extract a rating: the last occurrence of a number/number pattern (two numbers separated by a forward slash).
After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP*
I want to get 14/10
Try this code to get a list of all the number/number matches in the sentence:
import re
re.findall(r"\d+/\d+","After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP* ")
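To get the second one, select the second item of this list (index 1); since you want the last occurrence, index -1 works regardless of how many matches there are.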
This regex works:
>>> s = "After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP*"
>>> re.findall(r"\d+/(?=[^/]*$)\d+", s)
['14/10']
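The lookahead (?=[^/]*$) only lets a match through when no further slash appears between it and the end of the string, which is why 9/11 is rejected. A minimal sketch comparing this with simply taking the last findall result follows; the variable names are illustrative, and it uses plain re on a single string rather than a DataFrame column.

```python
import re

s = ("After so many requests, this is Bretagne. She was the last surviving "
     "9/11 search dog, and our second ever 14/10. RIP*")

# Approach 1: collect every number/number pair and keep the last one.
all_pairs = re.findall(r"\d+/\d+", s)
last_by_index = all_pairs[-1] if all_pairs else None   # guard against no match

# Approach 2: the lookahead requires that no "/" follows, so only the
# final pair in the string can match.
lookahead_match = re.search(r"\d+/(?=[^/]*$)\d+", s)
last_by_lookahead = lookahead_match.group(0) if lookahead_match else None

print(all_pairs)
print(last_by_index, last_by_lookahead)
```

Note that the lookahead version finds nothing if any other slash (for example in a URL) follows the rating, while the findall version would still return the last number/number pair.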
I have a DataFrame and in one cell I have a long text, e.g.:
-student- Kathrin A -/student- received abc and -student- Mike B -/student-
received def.
My question is: how can I extract the text between -student- and -/student- and create two new columns, with "Kathrin A" in the first one and "Mike B" in the second? The pattern can occur two or more times in the text.
What I have tried so far: str.extract('-student-\s*([^.]*)\s*-/student-', expand=False), but this only extracts the first match, i.e. Kathrin A.
Many thanks!
You could use str.split with a regex and define your delimiters as follows:
splittxt = ['-student-','-/student-']
df.text.str.split('|'.join(splittxt), expand=True)
Output:
  0           1                   2        3               4
0     Kathrin A    received abc and   Mike B   received def.
Another approach would be to try extractall. The only caveat is the result is put into multiple rows instead of multiple columns. With some rearranging this should not be an issue, and please update this response if you end up working it out.
That being said I also have a slight modification to your regular expression which will help you with capturing both.
'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)'
The only capturing group is [\w\s]+ so you'll be sure to not end up capturing the whole string.
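As a quick check outside pandas, the pattern can be tried with re.findall on the sample text. This is only a sketch: it assumes the line break in the example is a plain space, and it returns a Python list rather than DataFrame columns.

```python
import re

# Sample text from the question, joined into a single string.
text = ('-student- Kathrin A -/student- received abc and '
        '-student- Mike B -/student- received def.')

# Lookbehind and lookahead anchor the match between the two tags;
# the single capturing group holds just the name.
pattern = r'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)'

students = re.findall(pattern, text)
print(students)
```

Since there is exactly one capturing group, findall returns the names themselves rather than tuples.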
I am a newbie struggling to separate a database into columns with re.findall().
I want to separate these Dutch street names into name and number.
Roemer Visscherstraat 15
Vondelstraat 102-huis
For the number I use
\S*$
Which works just fine. For the street name I use
^\S.+[^\S$]
Or: use everything but the last element, which may be a number or a combination of a number and something else.
Problem is: Python then also keeps the last whitespace after the last name, so I get:
'Roemer Visscherstraat '
Any way I can stop this from happening?
Also, findall returns a list consisting of the bit of the database I wanted and an empty string. How does this happen, and can I prevent it somehow?
Thanks so much in advance for your help.
You can rstrip() the name to remove any spaces at the end of it:
>>> 'Roemer Visscherstraat '.rstrip()
'Roemer Visscherstraat'
But if the input is similar to the one you posted, you can simply use split() instead of regex, for example:
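st = 'Roemer Visscherstraat 15'
data = st.split()               # split on whitespace into a list of words
num = data[-1]                  # the last word is the house number
name = ' '.join(data[:-1])      # everything before it is the street name
print('Name: {}, Number: {}'.format(name, num))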
output:
Name: Roemer Visscherstraat, Number: 15
For the number you should use the following:
\S+$
Using a + instead of a * will ensure that you have at least one character in the match.
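This is also where the extra empty string in the findall result comes from: with *, the pattern can match zero characters at the very end of the string, right after the real match. A short sketch on one of the sample addresses:

```python
import re

street = 'Roemer Visscherstraat 15'

# With *, a second, empty match is found at the end of the string.
with_star = re.findall(r'\S*$', street)
print(with_star)

# With +, at least one non-space character is required, so only the
# real match is returned.
with_plus = re.findall(r'\S+$', street)
print(with_plus)
```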
For the street name you can use the following:
^.+(?=\s\S+$)
This selects the text up to, but not including, the whitespace before the number.
However, what you may consider doing is using one regex match with capture groups instead. The following would work:
^(.+(?=\s\S+$))\s(\S+$)
In this case, the first capture group gives you the street name, and the second gives you the number.
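A minimal sketch of the combined pattern applied to the two addresses from the question; the names address_regex and parsed are only illustrative.

```python
import re

# Group 1: street name; group 2: the last whitespace-free token (the number).
address_regex = re.compile(r'^(.+(?=\s\S+$))\s(\S+$)')

addresses = ['Roemer Visscherstraat 15', 'Vondelstraat 102-huis']

parsed = []
for address in addresses:
    match = address_regex.match(address)
    if match:                       # skip lines that do not fit the pattern
        parsed.append(match.groups())

for name, number in parsed:
    print('Name: {}, Number: {}'.format(name, number))
```

Because the name group stops before the separating whitespace, no rstrip() is needed afterwards.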
([^\d]*)\s+(\d.*)
In this regex, the first group captures everything before a space followed by a digit, and the second group gives the desired number.
My assumption is that the number begins with a digit and that the name has no digits in it.
Take a look at https://regex101.com/r/eW0UP2/1
Roemer Visscherstraat 15
Full match 0-24 `Roemer Visscherstraat 15`
Group 1. 0-21 `Roemer Visscherstraat`
Group 2. 22-24 `15`
Vondelstraat 102-huis
Full match 24-46 `Vondelstraat 102-huis`
Group 1. 24-37 `Vondelstraat`
Group 2. 38-46 `102-huis`
I have a regex something like
(\d\d\d)(\d\d\d)(\.\d\d){0,1}
When it matches I can easily get the first two groups, but how do I check whether the third occurred 0 or 1 times?
Also another minor question: in the (\.\d\d) I only care about the \d\d part. Is there a way to tell the regex that \.\d\d needs to appear 0 or 1 times, but that I want to capture only the \d\d part?
This was based on a problem of parsing an hhmmss string that has an optional decimal part for the seconds (so it becomes hhmmss.ss)... I put \d\d\d in the question so it is clear which \d\d I'm talking about.
import re
value = "123456.33"
regex = re.search(r"^(\d\d\d)(\d\d\d)(?:\.(\d\d)){0,1}$", value)
if regex:
    print(regex.group(1))
    print(regex.group(2))
    # An optional group that did not take part in the match is None.
    if regex.group(3) is not None:
        print(regex.group(3))
    else:
        print("3rd group not found")
else:
    print("value doesn't match regex")
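The code above answers both parts of the question: the dot sits inside a non-capturing group (?:...), so only the two digits are captured, and an optional group that did not participate in the match comes back as None. Applied to the hhmmss case that motivated the question, a sketch could look like this; parse_hhmmss is a hypothetical helper, the two-digit grouping is an assumption based on the hhmmss layout, and ? is used as shorthand for {0,1}.

```python
import re

# hh, mm, ss, plus an optional ".ss" fraction; only the fraction digits are captured.
HHMMSS = re.compile(r"^(\d\d)(\d\d)(\d\d)(?:\.(\d\d))?$")

def parse_hhmmss(value):
    """Return (hours, minutes, seconds, fraction) or None if value does not match.

    fraction is the two-digit string after the dot, or None when it is absent.
    """
    match = HHMMSS.match(value)
    if match is None:
        return None
    hh, mm, ss, fraction = match.groups()
    return int(hh), int(mm), int(ss), fraction

print(parse_hhmmss("123456.33"))
print(parse_hhmmss("123456"))
print(parse_hhmmss("12345"))
```

Testing the fraction with "is None" rather than truthiness keeps a value such as "00" from being mistaken for a missing group.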