I need a Python regular expression to check if a word is present in a string. The string may be separated by commas.
So for example,
line = 'This,is,a,sample,string'
I want to search for "sample", which should return True. I am not good with regex, so when I looked at the Python docs, I saw something like
import re
re.match(r'sample', line)
But I don't know why there was an 'r' before the text to be matched. Can someone help me with the regular expression?
Are you sure you need a regex? It seems that you only need to know if a word is present in a string, so you can do:
>>> line = 'This,is,a,sample,string'
>>> "sample" in line
True
The r makes the string a raw string, which doesn't process escape characters (however, since there are none in the string, it is actually not needed here).
Also, re.match only matches at the beginning of the string. In other words, the pattern has to match starting at the first character; it does not scan ahead, and it does not require the whole string to match. To match something that could be anywhere in the string, use re.search. See a demonstration below:
>>> import re
>>> line = 'This,is,a,sample,string'
>>> re.match("sample", line)
>>> re.search("sample", line)
<_sre.SRE_Match object at 0x021D32C0>
>>>
r marks a raw string literal, in which Python leaves backslashes as they are instead of treating them as the start of an escape sequence.
Normally, if you wanted your pattern to include a backslash, you'd need to escape it with another backslash. Raw strings eliminate this problem.
short explanation
In your case it does not matter much, but it's a good habit to get into early. Otherwise something like \b will bite you in the behind if you are not careful (in a normal string it is interpreted as a backspace character instead of a word boundary).
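A minimal sketch of that difference, using only the standard re module. The variable names are illustrative and not from the original post:

```python
import re

line = 'This,is,a,sample,string'

# In a normal string literal, '\b' is a single backspace character.
plain = '\bsample\b'
# In a raw string literal, r'\b' stays as backslash + 'b', which the
# regex engine reads as a word boundary.
raw = r'\bsample\b'

plain_hit = re.search(plain, line)  # looks for literal backspace characters
raw_hit = re.search(raw, line)      # looks for 'sample' as a whole word

print(len(plain), len(raw))
print(plain_hit, raw_hit)
```

The two patterns differ in length because the raw version keeps each backslash as its own character; only the raw version finds the word.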
As for re.match vs re.search, here's an example that will clarify it for you:
>>> import re
>>> testString = 'hello world'
>>> re.match('hello', testString)
<_sre.SRE_Match object at 0x015920C8>
>>> re.search('hello', testString)
<_sre.SRE_Match object at 0x02405560>
>>> re.match('world', testString)
>>> re.search('world', testString)
<_sre.SRE_Match object at 0x015920C8>
So search will find a match anywhere, while match will only find one that starts at the beginning.
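The same comparison as a short script, with re.fullmatch added for contrast. This is only a sketch: re.fullmatch is not mentioned in the answer above and requires Python 3.4 or later, and the variable names are made up for illustration:

```python
import re

text = 'hello world'

# match: the pattern must match starting at position 0.
starts_with_hello = re.match(r'hello', text) is not None
starts_with_world = re.match(r'world', text) is not None

# search: the pattern may match anywhere in the string.
contains_world = re.search(r'world', text) is not None

# fullmatch: the pattern must match the entire string.
is_exactly_hello = re.fullmatch(r'hello', text) is not None

print(starts_with_hello, starts_with_world, contains_world, is_exactly_hello)
```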
You do not need regular expressions to check if a substring exists in a string.
line = 'This,is,a,sample,string'
result = 'sample' in line  # True; "in" already returns a bool
If you want to know if a string contains a pattern then you should use re.search
line = 'This,is,a,sample,string'
result = re.search(r'sample', line) # finds 'sample'
This is best used with pattern matching, for example:
line = 'my name is bob'
result = re.search(r'my name is (\S+)', line) # finds 'bob'
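Note that re.search returns a match object (or None), not the captured text itself. A small sketch of pulling out the captured name, assuming you want None when nothing matches:

```python
import re

line = 'my name is bob'
result = re.search(r'my name is (\S+)', line)

# re.search returns None when nothing matches, so check before using it.
# group(1) is the text captured by the first pair of parentheses.
name = result.group(1) if result else None
print(name)
```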
As everyone else has mentioned, it is better to use the "in" operator. You can also use it to check every word from a list:
line = "This,is,a,sample,string"
lst = ['This', 'sample']
for i in lst:
    print(i in line)
>> True
>> True
One-liner implementation:
a=[1,3]
b=[1,2,3,4]
all(i in b for i in a)
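One caveat with the substring checks above: "in" on a string also matches parts of words. If you need a whole-word check, a possible sketch is to split the line first. This assumes the fields are separated by plain commas with no surrounding spaces, which the question only hints at:

```python
line = 'This,is,a,sample,string'

# Split the comma-separated line into its individual words.
words = line.split(',')

substring_hit = 'sam' in line     # substring check: matches inside 'sample'
word_hit = 'sam' in words         # list membership: whole words only
sample_hit = 'sample' in words

print(substring_hit, word_hit, sample_hit)
```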
Related
I'm a novice at Python and I am trying to extract a string with a specific format from another string, for example:
I have original string: -
--#$_ABC1234-XX12X
I need to extract exactly the string ABC1234 (it must consist of three letters followed by four digits).
You can use the curly brace repetition quantifiers {} to match exactly three alphabetic characters and exactly four numeric characters:
>>> from re import search
>>>
>>> string = '---#$_ABC1234-XX12X'
>>> match = search(r'[a-zA-Z]{3}\d{4}', string)
>>> match
<_sre.SRE_Match object; span=(6, 13), match='ABC1234'>
>>> match.group(0) # Use this to get the string that was matched.
'ABC1234'
Explanation of regex:
[a-zA-Z]: Match any letter, upper case or lower case...
{3}: Exactly three times. And...
\d: Any digit character...
{4}: Exactly four times.
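If you need this in more than one place, one possible way to wrap it is a small helper. The names CODE_PATTERN and extract_code are hypothetical, and returning None on no match is an assumption about what you want:

```python
import re

# Three ASCII letters followed by exactly four digits.
CODE_PATTERN = re.compile(r'[a-zA-Z]{3}\d{4}')

def extract_code(text):
    """Return the first letters-plus-digits code in text, or None."""
    match = CODE_PATTERN.search(text)
    return match.group(0) if match else None

print(extract_code('---#$_ABC1234-XX12X'))
print(extract_code('no code here'))
```

Because search scans from left to right, an input such as 'ABCD1234' would yield 'BCD1234'; add boundaries to the pattern if that is not what you want.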
You can use a named group with the re module (here string holds the input text):
import re
matcher = re.search(r'(?P<matched_string>[a-zA-Z]{3}\d{4})', string)
needed_string = matcher.groupdict()['matched_string']
needed_string will be your desired output.
For the re module refer to: https://docs.python.org/3.4/library/re.html
If you know the exact position of the substring, you can use something like this:
>>> var = "--#$_ABC1234-XX12X"
>>> newstring = var[5:12]
>>> newstring
'ABC1234'
Python strings support slicing.
I'm trying to match strings in the lines of a file and write out each match minus its first and last characters.
import os, re
infile=open("~/infile", "r")
out=open("~/out", "w")
pattern=re.compile("=[A-Z0-9]*>")
for line in infile:
out.write( pattern.search(line)[1:-1] + '\n' )
The problem is that it says the Match object is not subscriptable. When I try to add .group(), it says that 'NoneType' has no attribute 'group', and with groups() it complains that .write was given a tuple, etc.
Any idea how to get .search to return a string?
The re.search function returns a Match object.
If the match fails, the re.search function will return None. To extract the matching text, use the Match.group method.
>>> import re
>>> match = re.search("a.", "abc")
>>> if match is not None:
...     print(match.group(0))
...
ab
>>> print(re.search("a.", "a"))
None
That said, it's probably a better idea to use groups to find the required section of the match:
>>> match = re.search("=([A-Z0-9]*)>", "=ABC>")  # Notice the parentheses
>>> match.group(0)
'=ABC>'
>>> match.group(1)
'ABC'
This regex can then be used with findall as @WiktorStribiżew suggests.
You seem to need only the part of strings between = and >. In this case, it is much easier to use a capturing group around the alphanumeric pattern and use it with re.findall that will never return None, but just an empty list upon no match, or a list of captured texts if found. Also, I doubt you need empty matches, so use + instead of *:
pattern=re.compile(r"=([A-Z0-9]+)>")
                      ^         ^
and then
"\n".join(pattern.findall(line))
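Put together, the loop over the lines could look roughly like this. The function name extract_matches and the sample lines are made up for illustration; a plain list stands in for the file object, since iterating over a file also yields lines:

```python
import re

# Capture only the part between '=' and '>'.
PATTERN = re.compile(r"=([A-Z0-9]+)>")

def extract_matches(lines):
    """Collect every captured group from every line, in order."""
    results = []
    for line in lines:
        # findall returns an empty list when nothing matches, never None.
        results.extend(PATTERN.findall(line))
    return results

sample_lines = ["x=ABC123>y", "no match here", "=A1> =B2>"]
print("\n".join(extract_matches(sample_lines)))
```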
Suppose I want to find "PATTERN" in a string, where "PATTERN" could be anywhere in the string. My first try was *PATTERN*, but this generates an error saying that there is "nothing to repeat", which I can accept, so I tried .*PATTERN*. However, this regex does not give the expected result, see below:
import re
p = re.compile(".*PATTERN*")
s = "XXPATTERXX"
if p.match(s):
    print(s + " match with '.*PATTERN*'")
The result is
XXPATTERXX match with '.*PATTERN*'
Why does "PATTER" match?
Note: I know that I could use .*PATTERN.* to get the expected result, but I am curious to find out why the asterisk on its own fails to give the expected result.
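The * applies only to the N directly before it, so your pattern matches PATTER followed by 0 or more N characters, and it doesn't say anything about what comes after those N characters. Since re.match only anchors at the start, the trailing XX is simply left unmatched.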
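You can see exactly which part of the input was consumed by inspecting group(0). A quick sketch; the second input string is made up to show the N repetition:

```python
import re

pattern = re.compile(".*PATTERN*")

# The final 'N*' matches zero N characters here, so 'PATTER' is enough.
short_match = pattern.match("XXPATTERXX").group(0)

# With several N characters, 'N*' consumes all of them.
long_match = pattern.match("XXPATTERNNNXX").group(0)

print(short_match)
print(long_match)
```

In both cases the trailing XX is not part of the match; match() only requires the pattern to fit at the start of the string.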
You could add $ to the pattern to anchor to the end of the input string to disallow the XX:
>>> import re
>>> re.compile(".*PATTERN*$")
<_sre.SRE_Pattern object at 0x10029fb90>
>>> p = re.compile(".*PATTERN*$")
>>> p.match("XXPATTERXX") is None
True
>>> p.match("XXPATTER") is None
False
>>> p.match("XXPATTER")
<_sre.SRE_Match object at 0x1004627e8>
You may want to look into the different types of anchor. \b may also fit your needs; it matches word boundaries (so between a \w and \W class character, or between \W and \w), or you could use negative look-ahead and look-behinds to disallow other characters around your PATTERN string.
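As a rough illustration of those options: re.search already looks anywhere in the string, so no leading or trailing .* is needed, and \b restricts the match to a standalone word. The test strings other than XXPATTERXX are invented for this sketch:

```python
import re

# Plain search: the literal text must appear somewhere in the string.
typo_found = re.search("PATTERN", "XXPATTERXX") is not None
exact_found = re.search("PATTERN", "XXPATTERNXX") is not None

# Word boundaries: PATTERN must not be glued to other word characters.
glued_found = re.search(r"\bPATTERN\b", "XXPATTERNXX") is not None
word_found = re.search(r"\bPATTERN\b", "a PATTERN here") is not None

print(typo_found, exact_found, glued_found, word_found)
```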
I have a bunch of questions of the form:
<iiiihhiii? (end of line )
I want to make sure HTML tags are avoided.
I tried a regex:
<[^>]+\?
using http://www.pythonregex.com/.
here is the output:
>>> regex = re.compile("<[^>]+/?",re.UNICODE|re.DOTALL|re.VERBOSE)
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0x87e2915436c23d50>
>>> regex.match(string)
<_sre.SRE_Match object at 0x87e2915436c23da8>
# List the groups found
>>> r.groups()
()
# List the named dictionary objects found
>>> r.groupdict()
{}
# Run findall
>>> regex.findall(string)
[u'<jghjhgjhgjh?']
No match is occurring. How can I fix this?
Your regex is starting with < correctly, but the [^>] match doesn't stop at the ? mark, nor the newline, so it will continue matching until it hits a > character. Maybe try updating it to <[^>\n?]+\? so it will match anything except a >, a newline, or that ? question mark, then when it hits that trailing question mark, you match it with \? explicitly.
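A small sketch comparing the two patterns. The sample text is invented; only the first line follows the form given in the question:

```python
import re

text = "<iiiihhiii?\nmore text?"

# Original pattern: [^>] also accepts '?' and newlines, so it runs on.
loose = re.findall(r"<[^>]+\?", text)

# Suggested pattern: stop at '>', a newline, or the first '?'.
strict = re.findall(r"<[^>\n?]+\?", text)

# A real HTML tag contains no '?', so the strict pattern skips it.
html = re.findall(r"<[^>\n?]+\?", "<b>bold</b>")

print(loose)
print(strict)
print(html)
```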
Does this work for you?
<[^>]+[?]
I have a string:
This is #lame
Here I want to extract lame. But here is the issue, the above string can be
This is lame
Here I don't extract anything. And then this string can be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters were preceded by an # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
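For comparison, a capturing group gives the same results as the lookbehind here, because findall returns the group contents when the pattern has exactly one group. A sketch with hypothetical names (LOOKBEHIND, CAPTURE, extract_tags):

```python
import re

# Lookbehind version: '#' must precede the word but is not part of the match.
LOOKBEHIND = re.compile(r'(?<=#)\w+')
# Capturing-group version: match '#word' but return only the group.
CAPTURE = re.compile(r'#(\w+)')

def extract_tags(text, pattern=LOOKBEHIND):
    """Return every word that directly follows a '#' in text."""
    return pattern.findall(text)

for sample in ('This is #lame', 'This is lame', 'This is #lame but that is #not'):
    print(extract_tags(sample), extract_tags(sample, CAPTURE))
```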
You should use the re library. Here is an example (note that these matches include the leading #):
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#[\w]*")
lst = regular.findall(test_case)
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))