I have a regular expression that looks for a url in some text like:
import re
my_urlfinder = re.compile(r'\shttp:\/\/(\S+.|)blah.com/users/(\d+)(\/|)')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
for match in my_urlfinder.findall(text):
    print(match)  # prints a tuple of the captured groups, not the full URL
How do I get the entire url? Currently match just prints out the matched parts (which I need for other things)...but I also want the full url.
You should make your groups non-capturing:
my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
findall() changes behaviour when the pattern contains capturing groups: with groups, it returns only the groups; without any, it returns the whole matched text instead. The whole match includes the whitespace character matched by the leading \s.
Demo:
>>> text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
>>> my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
>>> for match in my_urlfinder.findall(text):
...     print(match)
...
 http://blah.com/users/123
 http://blah.com/users/353
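To make the rule concrete, here is a small sketch of the three shapes findall() can return. The simplified patterns below are illustrative assumptions (they are not the asker's exact regex), and the code is written for Python 3.

```python
import re

text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"

# No capturing groups: each result is the whole match (leading space included).
no_groups = re.findall(r'\shttp://blah\.com/users/\d+', text)

# One capturing group: each result is just that group's text.
one_group = re.findall(r'\shttp://blah\.com/users/(\d+)', text)

# Several capturing groups: each result is a tuple with one entry per group.
two_groups = re.findall(r'\shttp://(blah)\.com/users/(\d+)', text)

print(no_groups)
print(one_group)
print(two_groups)
```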
An alternative to removing the capturing groups is to add another one around the whole URL:
my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
This will allow you to keep the inner capturing groups while still having the whole result.
For the demo text it would yield these results:
('http://blah.com/users/123', '', '123', '')
('http://blah.com/users/353', '', '353', '')
As a side note, beware that the current expression requires a whitespace character in front of the URL, so a URL at the very start of the text would not be matched.
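One possible way around both points is sketched below, under some assumptions: it swaps findall() for finditer() so that group(0) gives the full URL while the inner groups stay available, and it replaces the leading \s with the lookbehind (?<!\S) so a URL at the start of the text also matches. The escaped dots and the sample text are my own changes, not part of the original pattern.

```python
import re

# (?<!\S) asserts "not preceded by a non-whitespace character", so it
# succeeds at the start of the string as well as after whitespace,
# and it does not become part of the match.
url_finder = re.compile(r'(?<!\S)http://(\S+\.|)blah\.com/users/(\d+)(/|)')

text = "http://blah.com/users/123 blah blah http://www.blah.com/users/353/"

# group(0) is the entire match; group(2) is the user id.
results = [(m.group(0), m.group(2)) for m in url_finder.finditer(text)]
print(results)
```

Because the lookbehind is zero-width, the full match no longer carries a leading space.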
I will explain my problem with an example. Here are two different versions of my text:
Version 1:
Blah: 1 2345 $ blah blah blah
Version 2:
Blah: 1 2345 $ (9 8546 $) blah blah blah
I am trying to write a regex in Python that returns the number in the parentheses if the text is in Version 2, and the number outside otherwise.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ /$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")
pat.findall(text)
The problem is that it returns ('1 2345', '') or ('', '9 8546') in each case.
How can I change the regex to return only the number?
If you are comfortable with the regex you wrote, I would suggest leaving it unchanged and getting the value like this:
print("".join(pat.findall(text)[0]))
This simply concatenates the groups of the first match. Since the other group captures nothing, you get a single string.
Note: you also need to escape the $ in your regex as \$ (the pattern has /$, which does not escape it); otherwise it is treated as an end-of-line anchor.
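Putting the two remarks together, a minimal sketch might look like this. It keeps the asker's alternation, changes only the /$ to \$, and wraps the join in a helper named extract_number, which is a hypothetical name used for illustration.

```python
import re

# Same alternation as in the question, with the dollar sign escaped as \$.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ \$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")

version_1 = "Blah: 1 2345 $ blah blah blah"
version_2 = "Blah: 1 2345 $ (9 8546 $) blah blah blah"

def extract_number(text):
    # findall yields one tuple per match; only one entry is non-empty,
    # so joining the tuple leaves just the captured number.
    return "".join(pat.findall(text)[0])

print(extract_number(version_1))
print(extract_number(version_2))
```

Note that the helper assumes the text contains at least one match; an empty findall() result would raise an IndexError.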
Don't use findall. The only situation in which it is useful is when you have a simple regex and want all of its matches. Once you start adding capturing groups, it quickly becomes awkward.
The finditer method returns the actual match objects created during matching instead of tuples of the matched groups. You can slightly modify your regex to use capturing groups:
pat = re.compile(r'Blah: (\d+\s\d+) \$ (\((\d+\s\d+)\s*\$\))?')
Afterwards, to get the matched number, you can use match.group(3) or match.group(1), depending on whether there was a parenthesized match:
text = 'Blah: 1 2345 $ (9 8546 $) blah blah blah\nBlah: 1 2345 $ blah blah blah'
[m.group(3) or m.group(1) for m in pat.finditer(text)]
Outputs:
Out[12]: ['9 8546', '1 2345']
So, I've been cooking up some regex, and it seems the regex library is capturing an extra newline when I use ((.|\s)*) to capture multi-line text. [\S\s]* works for some reason.
As you can see below, the first regex produces an additional \n group. Why?
>>> s = """
... #pragma whatever
... #pr
... asdfsadf
... #pragma START-SomeThing-USERCODE
... this is the code
... this is more
... #pragma END-SomeThing-USERCODE
... asd
... asdf
... sadf
... sdaf
... """
>>> r = r"(#pragma START-(.*)-USERCODE\s*\n)((.|\s)*)(#pragma END-(.*)-USERCODE)"
>>> re.findall(r, s)
[('#pragma START-SomeThing-USERCODE\n', 'SomeThing', 'this is the code\nthis is more\n', '\n', '#pragma END-SomeThing-USERCODE', 'SomeThing')]
>>> r = r"(#pragma START-(.*)-USERCODE\s*\n)([\S\s]*)(#pragma END-(.*)-USERCODE)"
>>> re.findall(r, s)
[('#pragma START-SomeThing-USERCODE\n', 'SomeThing', 'this is the code\nthis is more\n', '#pragma END-SomeThing-USERCODE', 'SomeThing')]
The subregex
((.|\s)*)
matches "this is the code\nthis is more\n". The outer parentheses capture this entire string.
The inner parentheses capture one character at a time (either any character besides a newline, or a whitespace character, which includes newline). Since that group is repeated, its contents are overwritten with each repetition. At the end of the match, only the last character that was matched (\n) is kept in that group.
So, if you want to avoid that, either make the inner group non-capturing:
((?:.|\s)*)
or use the ([\s\S]*) idiom for matching truly any character. It might be a good idea to use ([\s\S]*?), though, to make sure that the smallest possible number of characters are matched.
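Both effects can be seen in a short sketch. The sample strings here are made up for illustration and are much smaller than the #pragma text in the question.

```python
import re

# A repeated capturing group keeps only its final iteration.
m = re.match(r'((.|\s)*)', "ab\ncd\n")
whole = m.group(1)   # everything the outer group matched
last = m.group(2)    # only the last character the inner group matched

# Greedy vs. lazy when the text contains two END markers.
s = "START one END two END"
greedy = re.search(r'START([\s\S]*)END', s).group(1)
lazy = re.search(r'START([\s\S]*?)END', s).group(1)

print(repr(whole), repr(last))
print(repr(greedy), repr(lazy))
```

The greedy version runs to the last END, while the lazy one stops at the first, which is usually what you want when several START/END blocks share one file.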
This expression produces a nested group
((.|\s)*)
because you use nested parentheses. For a single-character OR, square brackets (a character class) are the proper choice; the alternation syntax is suited to choosing between two words:
(treat|trick)
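A quick sketch of that distinction, with one caveat worth knowing: inside square brackets a dot is a literal dot, so [.\s] is not a drop-in replacement for (.|\s). The sample strings are assumptions chosen for illustration.

```python
import re

# Alternation between whole words needs a group (non-capturing here).
words = re.findall(r'(?:treat|trick)', "trick or treat")

# A choice between single characters is better written as a character class.
vowels = re.findall(r'[aeiou]', "trick or treat")

# Inside [...] the dot loses its special meaning and matches only '.'.
literal_dot = re.findall(r'[.\s]', "a.b c")

print(words, vowels, literal_dot)
```

That is why the usual "any character at all" class is spelled [\s\S] rather than [.\s].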
I am given a string which is of this pattern:
[blah blah blah] [more blah] some text
I want to split the string into three parts: blah blah blah, more blah and some text.
A crude way to do it is to use mystr.split('] ') and then remove the leading [ from the first two elements. Is there a better, more performant way? (I need to do this for thousands of strings very quickly.)
You can use a regular expression to extract the text, if you know that it will be in that form. For efficiency, you can precompile the regex and then repeatedly use it when matching.
prog = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')
for mystr in string_list:
    result = prog.match(mystr)
    groups = result.groups()  # (first bracket, second bracket, rest)
If you'd like an explanation on the regex itself, you can get one using this tool.
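If some of the strings might not follow the expected form, prog.match() returns None and calling .groups() on it would fail. A minimal sketch of a guard follows; split_parts is a hypothetical helper name, and returning None for non-matching lines is just one possible choice.

```python
import re

# Compiled once and reused for every string.
prog = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')

def split_parts(line):
    # match() anchors at the start of the string and returns None on failure.
    m = prog.match(line)
    return m.groups() if m else None

print(split_parts('[blah blah blah] [more blah] some text'))
print(split_parts('no brackets here'))
```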
You can use a regular expression to split where you want to leave out characters:
>>> import re
>>> s = '[...] [...] ...'
>>> re.split(r'\[|\] *\[?', s)[1:]
['...', '...', '...']
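Applied to the string from the question, the same split looks like the sketch below. For comparison it also includes a version that uses only string methods; split_with_str_methods is a hypothetical helper that assumes exactly one space between the two bracketed parts. Which variant is faster would have to be measured on real data.

```python
import re

s = '[blah blah blah] [more blah] some text'

# Regex split: the first element is the empty string before the leading '['.
parts = re.split(r'\[|\] *\[?', s)[1:]

def split_with_str_methods(line):
    # Drop the leading '[' and cut at the two known separators.
    first, _, rest = line[1:].partition('] [')
    second, _, third = rest.partition('] ')
    return [first, second, third]

print(parts)
print(split_with_str_methods(s))
```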
I'm creating a django filter for inserting 'a' tags into a given string from a list.
This is what I have so far:
def tag_me(text):
    tags = ['abc', 'def', ...]
    tag_join = "|".join(tags)
    regex = re.compile(r'(?=(.))(?:'+ tag_join + ')', flags=re.IGNORECASE)
    return regex.sub(r'\1', text)
Example:
tag_me('some text def')
Returns:
'some text d'
Expected:
'some text def'
The issue lies in the regex.sub as it matches but returns only the first character. Is there a problem with the way I'm capturing/using \1 on the last line ?
Note that the sequence (?: ...) in the question specifically turns off capture. See the re documentation (about a fifth of the way down the page), which says (emphasis added):
(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
As noted in the previous answer, '('+ tag_join + ')' works; use the suggested "|".join(re.escape(tag) for tag in tags) version if the tags may contain characters that are special in a regex.
You're capturing the (.) part, which is only one character.
I'm not sure I follow your regular expression - the simplified version r'('+ tag_join + ')' works fine for your example.
Note that if there's a chance of anything other than alphanumeric characters in your tag names, you'll want to do this:
tag_join = "|".join(re.escape(tag) for tag in tags)
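To show why the escaping matters, here is a small sketch. The tag 'a.b', the sample sentence, and the square-bracket wrapper in the replacement are all assumptions for illustration; they stand in for whatever tags and markup the real filter uses.

```python
import re

tags = ['a.b', 'def']  # hypothetical tags; 'a.b' contains a regex metacharacter

# Unescaped, the dot matches any character, so 'axb' is matched too.
unescaped = re.findall('|'.join(tags), 'a.b axb')

# Escaped, every tag is matched literally.
pattern = re.compile('(' + '|'.join(re.escape(t) for t in tags) + ')',
                     flags=re.IGNORECASE)
result = pattern.sub(r'[\1]', 'some text DEF, a.b and axb')

print(unescaped)
print(result)
```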
Simply do
import re
def tag_me(text):
    tags = ['abc', 'def']
    # 'abc|def'.join('()') wraps the alternation in parentheses: '(abc|def)'
    reg = re.compile("|".join(tags).join('()'),
                     flags=re.IGNORECASE)
    return reg.sub(r'\1', text)
print(' %s' % tag_me('some text def'))
print('wanted: some text def')
The problem arises because you wrote a non-capturing group (?:....), which then forced you to put that awkward (?=(.)) in front of it.
This should do it
def tag_me(text):
    tags = ['abc', 'def', ]
    tag_join = "|".join(tags)
    pattern = r'('+tag_join+')'
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return regex.sub(r'\1', text)
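Another option, sketched here as an assumption rather than as what any of the answers proposed, is to skip the capturing group entirely and refer to the whole match with \g<0> in the replacement. The bare <a> wrapper is a placeholder, since the replacement template actually used by the filter is not shown above, and the tags parameter is added only to keep the example self-contained.

```python
import re

def tag_me(text, tags=('abc', 'def')):
    # No capturing group is needed: \g<0> stands for the entire match.
    regex = re.compile('|'.join(re.escape(t) for t in tags),
                       flags=re.IGNORECASE)
    return regex.sub(r'<a>\g<0></a>', text)

print(tag_me('some text def'))
```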
I want to parse all of the functions inside of a .txt file. It looks like this:
def
test
end
def
hello
end
def
world
end
So, I would get the following returned: [test, hello, world]
Here is what I have tried, but I do not get anything back:
r = re.findall('def(.*?)end', doc)
print r
You have to use the re.DOTALL flag which will allow . to match newlines too (since your doc is multi-line).
You could additionally use '^def' and '^end' in the regex if you only wanted the outer def/end blocks (i.e. ignore indented ones), in which case you would also need the re.MULTILINE flag, which allows '^' and '$' to match at the start/end of each line (as opposed to the start/end of the string).
re.findall('^def(.*?)^end',doc,re.DOTALL|re.MULTILINE)
r = re.findall('def(.*?)end', doc, re.S)
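The effect of the flag can be checked with a short sketch. The doc string below is an assumed reconstruction of the file contents (with the function names indented by four spaces), and the strip() call is an extra step added here to drop the surrounding newlines and spaces.

```python
import re

# Assumed file contents: three def/end blocks with indented names.
doc = "def\n    test\nend\ndef\n    hello\nend\ndef\n    world\nend\n"

# Without DOTALL, '.' cannot cross the newline after 'def', so nothing matches.
without_flag = re.findall('def(.*?)end', doc)

# With DOTALL (alias re.S), '.' matches newlines as well.
with_flag = re.findall('def(.*?)end', doc, re.DOTALL)
names = [body.strip() for body in with_flag]

print(without_flag)
print(with_flag)
print(names)
```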
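You need to enable the re.MULTILINE flag so that ^ and $ match at the start and end of every line rather than only at the start and end of the whole string.
Also, ^ and $ do NOT match the linefeeds (\n) themselves, so those have to be matched explicitly: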
>>> re.findall(r"^def$\n(.*)\n^end$", doc, re.MULTILINE)
[' test', ' hello', ' world']
If you don't want to capture the whitespace at the beginning of the blocks, add \W* before the group:
>>> re.findall(r"^def$\n\W*(.*)\n^end$", doc, re.MULTILINE)
['test', 'hello', 'world']
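A small sketch of what re.MULTILINE changes, using an assumed two-block doc string. It uses \s* rather than \W* to skip the indentation, which is my own substitution: \s matches only whitespace, whereas \W would also skip punctuation.

```python
import re

# Assumed file contents: 'def' starts at offsets 0 and 17.
doc = "def\n    test\nend\ndef\n    hello\nend\n"

# Without MULTILINE, ^ matches only at the very start of the string.
starts_default = [m.start() for m in re.finditer(r'^def', doc)]

# With MULTILINE, ^ also matches right after every newline.
starts_multiline = [m.start() for m in re.finditer(r'^def', doc, re.MULTILINE)]

# ^ and $ are zero-width, so the \n characters are matched explicitly.
names = re.findall(r'^def$\n\s*(.*)\n^end$', doc, re.MULTILINE)

print(starts_default, starts_multiline, names)
```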