Python and Regex. Or statement - python

I will explain my problem with an example. Here are two different versions of my text:
Version 1:
Blah: 1 2345 $ blah blah blah
Version 2:
Blah: 1 2345 $ (9 8546 $) blah blah blah
I am trying to write a regex in Python that returns the number in the parentheses if the text is in Version 2, and otherwise returns the number outside.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ /$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")
pat.findall(text)
The problem is that it returns ('1 2345', '') or ('', '9 8546') in each case.
How can I change the regex to return only the number?

If you are comfortable with the regex you wrote, I would suggest leaving it unchanged and getting the value like this:
print "".join(pat.findall(text)[0])
This just concatenates the groups of the first match. Since the other group captures nothing (an empty string), you get a single string.
Note: you also need to escape $ in your regex, as \$; otherwise it is treated as the end-of-line anchor.
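For what it's worth, here is a minimal sketch of that approach in Python 3 syntax, with the $ escaped. The helper name extract_number is made up for illustration, and it assumes only the first match in the text matters:

```python
import re

# Same two alternatives as in the question, with $ escaped as \$.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ \$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")

def extract_number(text):
    """Hypothetical helper: join the groups of the first match, or return None."""
    found = pat.findall(text)
    # Exactly one of the two groups is non-empty, so joining yields that group.
    return "".join(found[0]) if found else None

version_1 = "Blah: 1 2345 $ blah blah blah"
version_2 = "Blah: 1 2345 $ (9 8546 $) blah blah blah"
outside = extract_number(version_1)
inside = extract_number(version_2)
```

The join only works because the two alternatives can never both capture something in the same match.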

Don't use findall. The only situation in which it is useful is when you have a simple regex and you want to get all its matches. When you start having capturing groups, it easily becomes quite useless.
The finditer method returns the actual match objects created during matching instead of returning the tuples of the matched groups. You can slightly modify your regex to use capturing groups:
pat = re.compile(r'Blah: (\d+\s\d+) \$ (\((\d+\s\d+)\s*\$\))?')
Afterwards, to get the matched number you can use match.group(3) or match.group(1) to select one or the other, depending on whether there was a parenthesized match:
text = 'Blah: 1 2345 $ (9 8546 $) blah blah blah\nBlah: 1 2345 $ blah blah blah'
[m.group(3) or m.group(1) for m in pat.finditer(text)]
Outputs:
Out[12]: ['9 8546', '1 2345']

Related

Get sentence after pattern with regex python

In my string (example adapted from this tutorial) I want to get everything until the first following . after the generic (year). pattern:
str = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'
I think I'm almost there with my code but not quite yet:
test = re.findall(r'[\(\d\d\d\d\).-]+([^.]*)', str)
... which returns: ['com, (2002)', 'blah monkey', ' (1991)', '#abc', 'com blah dishwasher']
The desired output is:
['blah monkey', '#abc']
In other words, I want to find everything that is between the year pattern and the next dot.
If you want to get everything between (year). and the first . you can use this:
\(\d{4}\)\.([^.]*)
And explanation here:
"\(\d{4}\)\.([^.]*)"g
\( matches the character ( literally
\d{4} matches a digit [0-9]
    Quantifier: {4} Exactly 4 times
\) matches the character ) literally
\. matches the character . literally
1st Capturing group ([^.]*)
    [^.]* matches a single character not present in the list below
        Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
        . the literal character .
g modifier: global. All matches (don't return on first match)
This should do the trick
print re.findall(r'\(\d{4}\)\.([^\.]+)', str)
$ ['blah monkey', '#abc']
You are using [...] in the wrong way. Try with \(\d{4}\)\.([^.]*)\.:
>>> s = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'
>>> re.findall(r'\(\d{4}\)\.([^.]*)\.', s)
['blah monkey', '#abc']
For reference, [...] specifies a character class. By writing [\(\d\d\d\d\).-] you were saying: any one of the characters 0123456789().-
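A small sketch (Python 3 syntax, variable names are mine) that shows the difference between the tokens inside a class and the same tokens written as a sequence:

```python
import re

s = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'

# Inside [...], the repeated \d and the escaped parentheses collapse into
# one set of allowed characters, so these two patterns are equivalent.
as_written = re.findall(r'[\(\d\d\d\d\).-]+', s)
as_a_set = re.findall(r'[()\d.-]+', s)

# Outside a class the tokens form a sequence: "(", four digits, ")", ".".
fixed = re.findall(r'\(\d{4}\)\.([^.]*)', s)
```

Here as_written and as_a_set give the same list, while fixed contains only the text after each (year). up to the next dot.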

Match a pattern only when previous pattern matches

I have a situation where I have to match a pattern only when a previous regex pattern matches. The two patterns are different and match on different lines. For example,
Text:
blah blah blah MyHost="xxxx"
again blah blah blah MyIp= "x.x.x.x"
I am only interested in what comes after MyHost and MyIp. I also have a requirement that MyIp should match only when there is a match (MyHost="xxxx") in the line above.
I am able to match the MyHost value and the MyIp value separately, but I am having a hard time finding logic to match both as per the requirement. Please note I am fairly new to Python; I searched a lot and ended up here.
MyIp should match only when there is a match(MyHost="xxxx") in the above line.
Match both lines with one pattern, using a lazy quantifier, and take the captured group at index 1. You already know what comes after MyHost:
\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)
sample code:
import re
p = re.compile(r'\bMyHost="xxxx"\r?\n.*?MyIp=\s*\"([^"]*)', re.IGNORECASE)
test_str = u"blah blah blah MyHost=\"xxxx\"\nagain blah blah blah MyIp= \"x.x.x.x\""
re.findall(p, test_str)
You could do this with the third-party regex module, which supports variable-length lookbehind.
>>> import regex
>>> s = '''blah blah blah MyHost="xxxx"
... foo bar
... again blah blah blah MyIp= "x.x.x.x"
...
... blah blah blah MyHost="xxxx"
... again blah blah blah MyIp= "x.x.x.x"'''
>>> m = regex.search(r'(?<=MyHost="xxxx"[^\n]*\n.*?MyIp=\s*")[^"]*', s)
>>> m.group()
'x.x.x.x'
This would match the value of MyIp only if the string MyHost="xxxx" is present on the previous line.
If you want to list both, then try the code below.
>>> m = regex.findall(r'(?<=(MyHost="[^"]*")[^\n]*\n.*?)(MyIp=\s*"[^"]*")', s)
>>> m
[('MyHost="xxxx"', 'MyIp= "x.x.x.x"')]
Generally, if you want to use a regex, you'll need to match "MyHost" and what follows it, and "MyIp" and what follows it to the end of the line.
So basically you want to write a regex similar to this one:
MyHost="\w+"
This matches MyHost="..." where the quoted value is made of word characters (\w).
Afterwards you can retrieve that value and do the computation you need.
To handle the requirement that the host has to match first,
a simple if condition is enough: check for the host name before looking for the IP.
(?=.*? MyHost=\"xxx\" .*) .*? MyIp=\"(\S+)\" .*
The xxx can be changed as required. MyIp will get captured.
You can use a lookahead in Python. Only when the lookahead regex matches will the engine go ahead and fetch the IP:
(?=regex)regex1
This matches regex1 only when regex has matched.
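A hedged sketch of the lookahead idea with the standard re module (Python 3 syntax). The exact pattern is my own assumption, not taken from the answers above. Note that the lookahead only checks that MyHost="xxxx" occurs somewhere after the point where the match starts; it does not enforce "on the previous line":

```python
import re

# (?s) lets . cross newlines. The lookahead asserts that MyHost="xxxx"
# appears somewhere ahead, without consuming any text.
pattern = re.compile(r'(?s)(?=.*?MyHost="xxxx").*?MyIp=\s*"([^"]*)"')

with_host = 'blah blah blah MyHost="xxxx"\nagain blah blah blah MyIp= "x.x.x.x"'
without_host = 'blah blah blah\nagain blah blah blah MyIp= "x.x.x.x"'

m = pattern.search(with_host)
ip = m.group(1) if m else None          # the captured MyIp value
missing = pattern.search(without_host)  # no MyHost="xxxx", so no match
```

If the two values must sit on adjacent lines, a pattern that spells out the line break between them is the stricter choice.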
You should take advantage of short-circuiting, which Python supports. With short-circuiting, the second condition is only evaluated if the first one is true (for AND operations). So your code will look like the following:
patternMatch1(MyHost) and patternMatch2(MyIp)
Here both pattern-match functions would return a true value when their pattern matches.
Please let me know if you have any questions!
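One way this could look when going through the text line by line (Python 3 syntax). The function name host_ip_pairs and the two patterns are assumptions made for the sketch:

```python
import re

HOST_RE = re.compile(r'MyHost="([^"]*)"')
IP_RE = re.compile(r'MyIp=\s*"([^"]*)"')

def host_ip_pairs(text):
    """Hypothetical helper: pair each MyIp with a MyHost found on the line just above."""
    pairs = []
    prev_host = None
    for line in text.splitlines():
        # Short-circuit: IP_RE.search only runs when the previous line had a host.
        ip = prev_host and IP_RE.search(line)
        if ip:
            pairs.append((prev_host.group(1), ip.group(1)))
        prev_host = HOST_RE.search(line)
    return pairs

s = '''blah blah blah MyHost="xxxx"
foo bar
again blah blah blah MyIp= "x.x.x.x"

blah blah blah MyHost="xxxx"
again blah blah blah MyIp= "x.x.x.x"'''
pairs = host_ip_pairs(s)
```

Only the second MyIp is reported, because the first one has "foo bar" between it and its MyHost line.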

Full expression for findall

I have a regular expression that looks for a url in some text like:
my_urlfinder = re.compile(r'\shttp:\/\/(\S+.|)blah.com/users/(\d+)(\/|)')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
for match in my_urlfinder.findall(text):
    print match  # prints an array with all the individual parts of the regex
How do I get the entire url? Currently match just prints out the matched parts (which I need for other things)...but I also want the full url.
You should make your groups non-capturing:
my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
findall() changes behaviour when there are capturing groups. With groups, it returns only the groups; without capturing groups, the whole matched text is returned instead.
Demo:
>>> text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
>>> my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
>>> for match in my_urlfinder.findall(text):
...     print match
...
http://blah.com/users/123
http://blah.com/users/353
An alternative to not using any capturing groups would be to add another one around everything:
my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
This will allow you to keep the inner capturing groups while still having the whole result.
For the demo text it would yield these results:
('http://blah.com/users/123', '', '123', '')
('http://blah.com/users/353', '', '353', '')
As a side note, beware that the current expression requires a whitespace character in front of the URL, so a URL at the very start of the text would not be matched.
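If you need both the full URL and the parts, another option is finditer, since each match object gives you group(0) as well as the groups. The sketch below (Python 3 syntax) is an assumption on my part: it replaces the leading \s with a (?<!\S) lookbehind so a URL at the start of the text also matches, and it escapes the dots, which I take to be what the original pattern intended:

```python
import re

# (?<!\S) means "not preceded by a non-space character": true at the start
# of the text and after whitespace, and it is not part of the match itself.
url_re = re.compile(r'(?<!\S)http://(\S+\.|)blah\.com/users/(\d+)(/|)')

text = "http://blah.com/users/123 blah blah http://blah.com/users/353"

# group(0) is the whole match; group(2) is the user id.
results = [(m.group(0), m.group(2)) for m in url_re.finditer(text)]
```

This keeps the inner capturing groups for the other uses mentioned in the question.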

Use regex to cut the string begin with specific character?

I am processing a flat file, with data in line by line format, like this
... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah
I want to extract the sku field, which is a number 8 characters long. However, I am not sure if I should use split or regex; I am not very good at using regex in Python.
Assuming your sku values are always 8 characters long and are always preceded by 'sku' and possibly some ':' (with or without spaces in between), I would use the regex r'sku[\s:]*(\d{8})':
>>> import re
>>> string = '... | sku: 01234567 | price: 150 | ... '
>>> re.findall(r'sku[\s:]*(\d{8})', string)[0]
'01234567'
If the length of your sku values may vary, just use: r'sku[\s:]*(\d*)':
>>> import re
>>> string = '... | sku: 01234 | price: 150 | sku: 99872453 | blah blah ... '
>>> re.findall(r'sku[\s:]*(\d*)', string)[0]
'01234'
>>> re.findall(r'sku[\s:]*(\d*)', string)[1]
'99872453'
edit
If your 'sku' is followed by some other characters, like sku1, sku2, sku-sp, sku-18 or sku_anything, you could try that:
>>> re.findall(r'sku\D*(\d*)', string)[0]
This is the exact equivalent of:
>>> re.findall(r'sku[^0-9]*([0-9]*)', string)[0]
It's very general. It will match anything that begins with sku, followed by any number of non-decimal characters (\D*, or [^0-9]*), and then by some decimal characters (\d*, or [0-9]*). It will return the latter (a string of decimal characters of undetermined length).
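A quick sketch of what that general pattern does in practice (Python 3 syntax; the sample lines are invented). It also shows a caveat: because \D* is greedy and matches anything that is not a digit, it can run across the | separator into the next field:

```python
import re

line = 'sku-sp: 01234 | price: 150 | sku_main: 99872453'
skus = re.findall(r'sku\D*(\d*)', line)

# Caveat: with an empty sku field, \D* skips ahead to the next digits it finds,
# which here belong to the price field.
empty_field = 'sku: | price: 150'
surprising = re.findall(r'sku\D*(\d*)', empty_field)
```

So skus holds the two sku numbers, but surprising holds the price, which is probably not what you want for malformed lines.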
Now, here is what the things I used to build these expressions mean:
quantifiers
*: when following a single character or a class of characters, this symbol means that the expression will match any number of repetitions of the character or class it follows (* means "0 or more", + means "at least one", ? means "0 or 1").
the {} are used in the same way as *, + and ?, i.e. they follow a character or a class of characters. They are also quantifiers. c{4} matches exactly 4 consecutive 'c's. c{1,6} matches between 1 and 6 consecutive 'c's.
classes
[]: define a class of characters. [abc] means any of the characters 'a', 'b', or 'c'. [a-z] means any of the lower case letters, [A-Z] any of the upper case letters, [a-zA-Z] any of the lower and upper case letters, [0-9] any of the decimal characters. If you want to match decimals with dots, or commas, with plus, minus and 'e' (for exponentials, for example), just say [0-9,.+e-] (keep the - last; written as +-e it would be read as a range from '+' to 'e').
the ^ at the start of a class defined with [] means 'inverted class', everything but the class. So [^0-9] means anything but decimal characters, [^a-z] anything but lower case letters, and so on.
predefined classes
These are classes that are predefined in python, for making the regexes syntax more friendly:
\s: will match any spacing character (space, tabulation, etc.)
\d: will match any decimal character (0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ... This is equivalent to [0-9], which is another way to express a characters class in regexes)
\D: will match any non-decimal character ... This is equivalent to [^0-9], which is another way to express an excluded class of characters in regexes.
\S: will match any non-spacing character ...
\w: will match any 'word character'
\W: will match any non-word character
...
groups
() define groups. They have many uses. Here, in findall, the group marks what you want the expression to return, i.e. (\d{8}) or ([0-9]{8}) means you want only the string of 8 decimal characters from the full match to be returned.
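A few of these building blocks can be tried out directly. This is just an illustrative sketch in Python 3 syntax; the sample strings and variable names are made up:

```python
import re

# Quantifiers: {4} is exactly four, {1,6} is between one and six.
exactly_four = re.findall(r'c{4}', 'cc cccc ccccc')
one_to_six = re.findall(r'c{1,6}', 'cc cccc')

# Classes: [^0-9] is the inverted class, \d is the predefined digit class.
non_digits = re.findall(r'[^0-9]+', 'ab12cd')
digits = re.findall(r'\d+', 'ab12cd345')

# A hyphen placed last in a class is a literal minus sign...
number_like = re.findall(r'[0-9,.+e-]+', 'x=-1.5e+3')
# ...whereas between two characters it denotes a range ('+' through 'e' here).
in_range = re.findall(r'[+-e]', 'A')

# Groups: findall returns only what the group captured.
sku = re.findall(r'sku[\s:]*(\d{8})', 'sku: 01234567')
```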
Regular expressions are really easy to use, and very useful. You just have to understand well what they can do and what they can't (they are limited to regular languages; if you need to deal with levels of nested things, for example, or other languages defined by context-free grammars, regexes won't be enough). You will probably want to have a look at the following pages:
http://docs.python.org/library/re.html
http://www.regular-expressions.info/tutorial.html
Something like the following should achieve what you need, without being dependent on exact spacing and positioning:
>>> s = '... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah'
>>> match_obj = re.search(r'sku\s*:\s*(\d+)', s)
>>> match_obj.group(1)
'01234567'
Before you attempt to access the match object with the .group() method, you should check that a match actually occurred, i.e.: if match_obj: # do something with match.
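Put together, that check could be wrapped up like this (Python 3 syntax; extract_sku is a hypothetical helper name, not something from the re module):

```python
import re

SKU_RE = re.compile(r'sku\s*:\s*(\d+)')

def extract_sku(line):
    """Return the sku digits from a line, or None when there is no sku field."""
    match_obj = SKU_RE.search(line)
    if match_obj:
        # Only touch .group() once we know a match occurred.
        return match_obj.group(1)
    return None

found = extract_sku('... blah blah blah | sku: 01234567 | price: 150 | ...')
missing = extract_sku('... blah blah blah | price: 150 | ...')
```

Without the check, calling .group(1) on a line with no sku field would raise an AttributeError, because search returns None.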
If all the 8-digit numbers in your string are SKU numbers, you can use
re.findall(r"\b\d{8}\b", mystring)
The \b word boundary anchors ensure that 8-digit substrings within longer numbers/words will not be matched.
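To see what the anchors buy you, here is a small comparison (Python 3 syntax; the sample string with a 9-digit order number is invented):

```python
import re

mystring = "sku: 01234567 | order: 123456789 | ref: 87654321"

# With \b, the 9-digit order number is skipped entirely.
eight_digit = re.findall(r"\b\d{8}\b", mystring)

# Without \b, the first 8 digits of the 9-digit number are picked up too.
without_anchors = re.findall(r"\d{8}", mystring)
```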
If all of the pipe-delimited fields are also (key: value) pairs, then you might as well keep the rest of the data in case you need it; you're already having to parse the string...
s = "sku: 01234567 | price: 150"
dict( k.split(':') for k in s.split('|') )
# {'sku': ' 01234567 ', ' price': ' 150'}
Might want to trim some excess leading space though
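A possible way to do that trimming, sketched in Python 3 syntax. It assumes every field really has the key: value shape, and splits on the first colon only so that values containing a colon survive:

```python
s = "sku: 01234567 | price: 150"

# Split each field on the first ':' only, then strip spaces from key and value.
fields = dict(
    (key.strip(), value.strip())
    for key, value in (item.split(':', 1) for item in s.split('|'))
)
sku = fields.get('sku')
```

A field without a colon would make the unpacking fail with a ValueError, so filter those out first if your data has them.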
In my opinion you should use split; "sku:" and "|" act as separators:
s = "blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah"
s.split("sku:")[1].split("|")[0]
Here is with checking:
s = "blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah"
s1 = s.split("sku:")
if len(s1) == 2:
    print s1[1].split("|")[0]

Python regular expressions - re.search() vs re.findall()

For school I'm supposed to write a Python RE script that extracts IP addresses. The regular expression I'm using seems to work with re.search() but not with re.findall().
exp = "(\d{1,3}\.){3}\d{1,3}"
ip = "blah blah 192.168.0.185 blah blah"
match = re.search(exp, ip)
print match.group()
The match for that is always 192.168.0.185, but it's different when I do re.findall()
exp = "(\d{1,3}\.){3}\d{1,3}"
ip = "blah blah 192.168.0.185 blah blah"
matches = re.findall(exp, ip)
print matches[0]
0.
I'm wondering why re.findall() yields 0. when re.search() yields 192.168.0.185, since I'm using the same expression for both functions.
And what can I do to make it so re.findall() will actually follow the expression correctly? Or am I making some kind of mistake?
findall returns a list of matches, and from the documentation:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
So, your previous expression had one group that matched 3 times in the string where the last match was 0.
To fix your problem use: exp = "(?:\d{1,3}\.){3}\d{1,3}"; by using the non-capturing version, there are no returned groups, so the whole match is returned in both cases.
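Side by side, the difference looks like this (a small sketch in Python 3 syntax; the variable names are mine):

```python
import re

ip = "blah blah 192.168.0.185 blah blah"
capturing = r"(\d{1,3}\.){3}\d{1,3}"
non_capturing = r"(?:\d{1,3}\.){3}\d{1,3}"

m = re.search(capturing, ip)
whole = m.group(0)       # the entire match, which is what print match.group() showed
last_group = m.group(1)  # a repeated group keeps only its last repetition

with_group = re.findall(capturing, ip)         # findall returns the group
without_group = re.findall(non_capturing, ip)  # findall returns the whole match
```

So search and findall do follow the same expression; findall just reports group 1 instead of the whole match when a capturing group is present.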
You're only capturing the 0 in that regex, as it'll be the last one that's caught.
Change the expression to capture the entire IP, and the repeated part to be a non-capturing group:
In [2]: ip = "blah blah 192.168.0.185 blah blah"
In [3]: exp = "((?:\d{1,3}\.){3}\d{1,3})"
In [4]: m = re.findall(exp, ip)
In [5]: m
Out[5]: ['192.168.0.185']
In [6]:
And if it helps to explain the regex:
In [6]: re.compile(exp, re.DEBUG)
subpattern 1
  max_repeat 3 3
    subpattern None
      max_repeat 1 3
        in
          category category_digit
      literal 46
  max_repeat 1 3
    in
      category category_digit
This explains the subpatterns. Subpattern 1 is what gets captured by findall.
