regex lookback get the first word - python

I want to use a regex to get the word that immediately precedes a date value. The surrounding words follow no fixed pattern; I only want the first word before the date, not the entire string.
What I have so far:
regexDate = r'.*(?=)[w]*\d{2}\/\d{2}\/\d{4}|.*(?=)[a-zA-Z]+\s[0-9][0-9],?\s[0-9][0-9][0-9][0-9]|.*(?=)\d{2}-\d{2}-\d{4}|\d{2}/\w*/\d{4}|.*(?=)\d{2}-\w*-\d{4}|.*(?=)\d{2}.\w*.\d{4}|.*(?=)\d{4}/\d{2}/\d{2}|.*(?=)\d{2}-\w*-\d{2}|.*(?=)\d{2}/\d{2}/\d{2}'
text = 'blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019'
find = re.findall(regexDate, text)
result:
['blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019']
In this case, I'm just looking to get 'Start: 01-Oct-2018' and 'End: 30-Sep-2019'.
Note: Start and End may be a different set of words in other circumstances. Date formats also differ.
I did try wrapping .*(?=) around the alternatives, but that just returned 30-Sep-2019. Maybe someone can suggest a cleaner version as well?

This seems to do it:
re.findall(r'\b\S+\s+\d\d?-[A-Za-z]{3}-\d{4}\b', text)
returns:
['Start: 01-Oct-2018', 'End: 30-Sep-2019']
The \b at the start forces a word boundary, followed by a run of non-space characters and one or more spaces, then followed by your date expression.
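Since the question says the date formats differ, the same idea can be extended with an alternation of date shapes. This is only a sketch: the `DATE_PATTERNS` list and the name `word_then_date` are illustrative assumptions, not part of the answer above, and the list would need to be adapted to the formats that actually occur in your data.

```python
import re

# Hypothetical list of date shapes; extend it to cover the formats in your data.
DATE_PATTERNS = [
    r'\d{1,2}-[A-Za-z]{3}-\d{4}',      # 01-Oct-2018
    r'\d{1,2}[/-]\d{1,2}[/-]\d{4}',    # 01/10/2018 or 01-10-2018
    r'\d{4}[/-]\d{1,2}[/-]\d{1,2}',    # 2018/10/01
    r'[A-Za-z]+\s\d{1,2},?\s\d{4}',    # October 01, 2018
]

# Group 1: the word before the date. Group 2: the date itself.
word_then_date = re.compile(r'(\S+)\s+(' + '|'.join(DATE_PATTERNS) + r')\b')

text = 'blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019'
pairs = word_then_date.findall(text)
print(pairs)
```

Because the pattern has two capturing groups, `findall` returns a list of `(word, date)` tuples rather than the full matched strings.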

I suppose you're looking for something like this? (Python's re module spells named groups (?P<name>...) rather than (?<name>...).)
(?i)start:\s*(?P<start>\d\d-\w+-\d{4}).*end:\s*(?P<end>\d\d-\w+-\d{4})
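A minimal sketch of how the named-group approach might look in Python, assuming the labels really are "Start:" and "End:" (the question notes they can vary). It uses the `(?P<name>...)` spelling that Python's `re` module requires for named groups.

```python
import re

# Named groups "start" and "end" capture the two dates.
pattern = re.compile(
    r'start:\s*(?P<start>\d\d-\w+-\d{4}).*end:\s*(?P<end>\d\d-\w+-\d{4})',
    re.IGNORECASE,
)

text = 'blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019'
match = pattern.search(text)

# groupdict() maps each group name to the text it captured.
dates = match.groupdict() if match else {}
print(dates)
```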

Late answer, but here's a way to do it:
[^ ]+: [^ ]+
import re
text = 'blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019'
result = re.findall("[^ ]+: [^ ]+", text)
# ['Start: 01-Oct-2018', 'End: 30-Sep-2019']

Use this regex to match either kind of month (number or abbreviation):
import re
regex = r"(\S+\s*\d{2}-\w+-\d{4})"
test_str = "blah blah Start: 01-Oct-2018 blah blah End: 30-Sep-2019"
# test_str = "test test Start:01-06-2018 blah blah End: 30-08-2019"
match = re.findall(regex,test_str)
print(match)
# ['Start: 01-Oct-2018', 'End: 30-Sep-2019']

Related

How can I remove brackets next to the link in Python?

I have a string
Some sentance startx here blah blah [Example](https://someSite.com/another/blah/blah)
and I want this string to become this one:
Some sentance startx here blah blah Example
I have tried this regex:
"[\[\]]\(\S*(https|http)*\.(ru|com)\S*"
but I get this:
Some sentance startx here blah blah [Example
The code:
pattern = r"[\[\]]\(\S*(https|http)*\.(ru)\S*"
text = re.sub(pattern, '', text)
Maybe like this:
string = 'Some sentance startx here blah blah [Example](https://someSite.com/another/blah/blah)'
string = string.split("]")[0].replace("[","")
print(string)
Use
\[([^][]*)]\(http[^\s()]*\)
Replace with \1.
Python code snippet:
text = re.sub(r'\[([^][]*)]\(http[^\s()]*\)', r'\1', text)
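As a self-contained sketch of the substitution above: the pattern captures the link label in group 1 and replaces the whole `[label](url)` construct with just that label. The name `MD_LINK` is an assumption made for illustration.

```python
import re

# [label](url) -> label; the URL must start with http (which also covers https).
MD_LINK = re.compile(r'\[([^][]*)]\(http[^\s()]*\)')

text = 'Some sentance startx here blah blah [Example](https://someSite.com/another/blah/blah)'
cleaned = MD_LINK.sub(r'\1', text)
print(cleaned)

# The substitution applies to every link in the string, not only the first.
several = MD_LINK.sub(r'\1', 'see [a](http://x.com) and [b](https://y.ru/z)')
print(several)
```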

Python Regex to match a string with single whitespace between two words with last word is fixed

I'm new to Python regex and struggling to come up with a regex that matches strings consisting of two words separated by a single whitespace, where the last word is fixed.
For Example:
"MPMIDEMA Index"
"cpmindx Index"
The first 'word' in the string can be in either lowercase or uppercase letters but the last word in the string will always be fixed which is 'Index'.
I've got the code below but it doesn't seem to work:
txt = "jjj Index"
x = re.search("\s.Index", txt)
You need a lookahead assertion to extract a word followed by 'Index'.
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
text = 'blah blah blah cpmindx Index blah blah blah'
out = re.findall(r'(\w+)(?= Index)', text)
>>> out
['cpmindx']
import re
pattern = "^(?:[a-z]+|[A-Z]+) Index$"
text = ["MPMIDEMA Index", "cpmindx Index"]
for txt in text:
    if re.match(pattern, txt):
        print(txt, "is matched")
output:
MPMIDEMA Index is matched
cpmindx Index is matched
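If the goal is both to validate the whole string and to pull out the first word, `re.fullmatch` with a capturing group is one option. This is a sketch under an assumption: unlike the pattern above, it also accepts mixed-case first words. The names `INDEX_RE` and `candidates` are illustrative.

```python
import re

# One run of letters, exactly one space, then the fixed word "Index".
INDEX_RE = re.compile(r'([A-Za-z]+) Index')

candidates = ["MPMIDEMA Index", "cpmindx Index", "two  spaces Index", "jjj index"]

first_words = {}
for candidate in candidates:
    # fullmatch succeeds only if the entire string fits the pattern.
    m = INDEX_RE.fullmatch(candidate)
    if m:
        first_words[candidate] = m.group(1)

print(first_words)
```

The last two candidates are rejected: one has two spaces before "Index", the other has a lowercase "index".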

Match a pattern only when previous pattern matches

I need to match a pattern only when a previous regex pattern matches. The two patterns are different and match on different lines. For example,
Text:
blah blah blah MyHost="xxxx"
again blah blah blah MyIp= "x.x.x.x"
I am only interested in what comes after MyHost and MyIp. I also require that MyIp match only when the line above contains a match (MyHost="xxxx").
I am able to match the MyHost value and the MyIp value separately, but I am having a hard time finding logic that matches both as required. Please note I am fairly new to Python; I searched a lot before ending up here.
MyIp should match only when there is a match(MyHost="xxxx") in the above line.
Get the matched group at index 1, using a lazy quantifier. You already know what comes next after MyHost:
\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)
Sample code:
import re
# The ur'' prefix is Python 2 only; a plain raw string works in Python 3.
p = re.compile(r'\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)', re.IGNORECASE)
test_str = "blah blah blah MyHost=\"xxxx\"\nagain blah blah blah MyIp= \"x.x.x.x\""
re.findall(p, test_str)
# ['x.x.x.x']
You could do this through regex module.
>>> import regex
>>> s = '''blah blah blah MyHost="xxxx"
... foo bar
... again blah blah blah MyIp= "x.x.x.x"
...
... blah blah blah MyHost="xxxx"
... again blah blah blah MyIp= "x.x.x.x"'''
>>> m = regex.search(r'(?<=MyHost="xxxx"[^\n]*\n.*?MyIp=\s*")[^"]*', s)
>>> m.group()
'x.x.x.x'
This matches the value of MyIp only if the string MyHost="xxxx" is present on the previous line.
If you want to list both, try the code below.
>>> m = regex.findall(r'(?<=(MyHost="[^"]*")[^\n]*\n.*?)(MyIp=\s*"[^"]*")', s)
>>> m
[('MyHost="xxxx"', 'MyIp= "x.x.x.x"')]
Generally, if you want to use a regex, you need to match "MyHost" and everything that follows it, and "MyIp" and everything that follows it to the end of the line.
So basically you want to write a regex similar to this one:
MyHost="\w+"
This matches MyHost= followed by a quoted run of word characters (\w+); put parentheses around \w+ to capture the value.
Afterwards you can retrieve the captured value and do the computation you need.
To require that the host is matched first, a simple if condition is enough: check for the host name before looking for the IP.
(?=.*? MyHost=\"xxx\" .*) .*? MyIp=\"(\S+)\" .*
The xxx can be changed as required. The value of MyIp is captured in group 1.
This uses a lookahead: the regex only goes on to fetch the IP when the MyHost="xxx" lookahead matches.
(?=regex)regex1
matches regex1 only when regex also matches at the same position.
You should take advantage of short-circuiting, which Python supports. With short-circuit evaluation, the second condition is only evaluated if the first one is true (for and operations). So your code will look like the following:
patternMatch1(MyHost) and patternMatch2(MyIp)
Here both pattern-match functions return True when their pattern matches.
Please let me know if you have any questions!
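The short-circuit idea can be sketched line by line as follows. The names `HOST_RE`, `IP_RE` and `host_ip_pairs` are hypothetical stand-ins for the `patternMatch1`/`patternMatch2` functions mentioned above, and the sketch assumes "the above line" means the immediately preceding line.

```python
import re

HOST_RE = re.compile(r'MyHost="([^"]*)"')
IP_RE = re.compile(r'MyIp=\s*"([^"]*)"')

def host_ip_pairs(text):
    """Yield (host, ip) when a MyHost line is immediately followed by a MyIp line."""
    lines = text.splitlines()
    # Walk over each pair of consecutive lines.
    for current, following in zip(lines, lines[1:]):
        host = HOST_RE.search(current)
        # `and` short-circuits: IP_RE is only tried when the host matched.
        ip = host and IP_RE.search(following)
        if ip:
            yield host.group(1), ip.group(1)

sample = (
    'blah blah blah MyHost="xxxx"\n'
    'again blah blah blah MyIp= "x.x.x.x"\n'
    'stray MyIp= "9.9.9.9"'
)
pairs = list(host_ip_pairs(sample))
print(pairs)
```

The stray MyIp on the third line is ignored because the line before it contains no MyHost.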

Python and Regex. Or statement

I will explain my problem with an example. Here are two different versions of my text:
Version 1:
Blah: 1 2345 $ blah blah blah
Version 2:
Blah: 1 2345 $ (9 8546 $) blah blah blah
I am trying to write a regex in Python that returns the number in parentheses if the text is in Version 2, and the number outside otherwise.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ /$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")
pat.findall(text)
The problem is that it returns ('1 2345', '') or ('', '9 8546') in each case.
How can I change the regex to return only the number?
If you are comfortable with the regex you wrote, I would suggest leaving it unchanged and getting the value like this:
print("".join(pat.findall(text)[0]))
This just concatenates the matching results. Since the other group captures nothing, you get a single string.
Note: You also need to escape $ in your regex as \$; an unescaped $ is treated as the end-of-line anchor.
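Putting the two points of this answer together, a sketch with the `$` escaped as `\$` and the join applied to the first `findall` result might look like this. The helper name `extract_number` is an assumption for illustration.

```python
import re

# The question's pattern with the dollar sign escaped as \$.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ \$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")

def extract_number(text):
    """Return the parenthesized number if present, else the number outside."""
    groups = pat.findall(text)
    # Only one of the two groups is non-empty, so joining yields that one.
    return "".join(groups[0]) if groups else None

version_1 = "Blah: 1 2345 $ blah blah blah"
version_2 = "Blah: 1 2345 $ (9 8546 $) blah blah blah"

print(extract_number(version_1))
print(extract_number(version_2))
```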
Don't use findall. It is only really useful when you have a simple regex and want all of its matches; once you add capturing groups it quickly becomes awkward.
The finditer method yields the actual match objects created during matching instead of tuples of the matched groups. You can slightly modify your regex to use capturing groups:
pat = re.compile(r'Blah: (\d+\s\d+) \$ (\((\d+\s\d+)\s*\$\))?')
Afterwards, use match.group(3) or match.group(1) to get the matched number, depending on whether there was a parenthesized match:
text = 'Blah: 1 2345 $ (9 8546 $) blah blah blah\nBlah: 1 2345 $ blah blah blah'
[m.group(3) or m.group(1) for m in pat.finditer(text)]
Outputs:
Out[12]: ['9 8546', '1 2345']

Full expression for findall

I have a regular expression that looks for a url in some text like:
my_urlfinder = re.compile(r'\shttp:\/\/(\S+.|)blah.com/users/(\d+)(\/|)')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
for match in my_urlfinder.findall(text):
    print(match)  # prints a tuple with all the individual groups of the regex
How do I get the entire url? Currently match just prints out the matched parts (which I need for other things)...but I also want the full url.
You should make your groups non-capturing:
my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
findall() changes behaviour when there are capturing groups. With capturing groups, it returns only the groups; without them, it returns the whole matched text instead.
Demo:
>>> text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
>>> my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
>>> for match in my_urlfinder.findall(text):
...     print(match)
...
 http://blah.com/users/123
 http://blah.com/users/353
An alternative to removing the capturing groups is to add another one around everything:
my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
This will allow you to keep the inner capturing groups while still having the whole result.
For the demo text it would yield these results:
('http://blah.com/users/123', '', '123', '')
('http://blah.com/users/353', '', '353', '')
As a side note, beware that the current expression requires a whitespace character in front of the URL, so a URL at the very start of the text would not be matched.
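Another option, sketched here under some assumptions, is `finditer`: each match object gives the full match via `group(0)` and the individual groups by number. The pattern below is a modified variant, not the question's original: the dots are escaped, and the leading `\s` is replaced by the lookbehind `(?<!\S)` so that a URL at the very start of the text also matches.

```python
import re

# Assumed variant of the question's pattern (escaped dots, lookbehind
# instead of a leading \s, optional subdomain and trailing slash).
url_finder = re.compile(r'(?<!\S)http://(?:\S+\.)?blah\.com/users/(\d+)/?')

text = "http://blah.com/users/123 blah blah http://blah.com/users/353"

# group(0) is the whole URL, group(1) the captured user id.
results = [(m.group(0), m.group(1)) for m in url_finder.finditer(text)]
print(results)
```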
