Python If substring exists in string, get its context between delimiters - python

I have a list of strings that follow a pattern such that in some position in the string there may be a substring RAM.
ex:
sdfjhsk_sdkjfhs_RAM_lkfdgjls
Sometimes this string may have another character after it.
ex:
aaaa_RAMA_sfsffgd
I'd need to have the whole context between the nearest underscores, so RAM in the first case, RAMA in the second.
And it may not even exist at all in the string
ex:
sfdks_sdfsdf_sdfsdf_sdfsdfsdf
Matches at the start or end of the string are allowed:
RAMsdoa_saeorfioa_noutd -> RAMsdoa
aetu_eaei_sdsdf_RAMSdoa -> RAMSdoa
as are matches in strings without underscores:
sdasids -> nothing
sdfRAMso -> sdfRAMso
What is the best way to check whether a string contains the pattern RAM and, if it does, grab everything between the nearest delimiters _ (or the start or end of the string, if nearer)?

You can use a regular expression here. You need to match RAM, plus any non-_ characters before and after:
import re

def find_ram_context(inputtext):
    match = re.search(r'[^_]*RAM[^_]*', inputtext)
    if match:
        return match.group(0)
[^...] is a negative character-set match; anything not in that set would match. Here that's _, and * means that there should be zero or more such characters. So any character before or after RAM that's not an underscore would be pulled into the matched text.
The function above returns the matched context, or None if the word RAM is not present:
>>> find_ram_context('sdfjhsk_sdkjfhs_RAM_lkfdgjls')
'RAM'
>>> find_ram_context('aaaa_RAMA_sfsffgd')
'RAMA'
>>> find_ram_context('sfdks_sdfsdf_sdfsdf_sdfsdfsdf') is None
True
Online demo of the regex with your test cases at https://regex101.com/r/6VcLrC/1
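The question also lists matches at the start or end of the string and strings without underscores. The following is a minimal sketch that runs the same pattern over those cases. The helper name find_all_ram_contexts is made up for illustration, and using re.findall to return every context (rather than only the first) is an assumption that goes beyond what was asked.

```python
import re

# Same pattern as in the answer: RAM plus any non-underscore neighbours.
RAM_CONTEXT = re.compile(r'[^_]*RAM[^_]*')

def find_all_ram_contexts(text):
    """Return every underscore-delimited chunk of text that contains RAM."""
    return RAM_CONTEXT.findall(text)

for sample in ('RAMsdoa_saeorfioa_noutd',
               'aetu_eaei_sdsdf_RAMSdoa',
               'sdfRAMso',
               'sdasids'):
    print(sample, '->', find_all_ram_contexts(sample))
```

Because [^_]* may match zero characters, the start and end of the string act as delimiters without any extra anchors.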

Related

Regex to check if it is exactly one single word

I am basically trying to match a string against a pattern (wildcard match).
Please look carefully at this -
*(star) - means exactly one word.
This is not a regex pattern; it is a convention.
So, if there are patterns like -
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' precedes exactly one word.
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in a pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse

def convert_to_regexp(pattern):
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern])
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend to use the output pattern with re.search or re.findall, you might want to wrap it between \b word-boundary assertions.
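As an alternative sketch of the same conversion, the literal parts can be escaped with re.escape instead of sre_parse.SPECIAL_CHARS (sre_parse is deprecated in recent Python releases, since 3.11 as far as I know). The function names wildcard_to_regex and wildcard_match are hypothetical, and using re.fullmatch assumes the whole string must match, as the examples in the question suggest.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a '*' wildcard pattern into a regex string.

    Each piece between stars is escaped so it is matched literally,
    and every '*' becomes \\w+ (exactly one word).
    """
    return r'\w+'.join(re.escape(part) for part in pattern.split('*'))

def wildcard_match(pattern, text):
    """True if the whole of text matches the wildcard pattern."""
    return re.fullmatch(wildcard_to_regex(pattern), text) is not None

print(wildcard_match('*.key', 'door.key'))
print(wildcard_match('*.key', 'brown.door.key'))
print(wildcard_match('*.key.*', 'brown.key.door'))
print(wildcard_match('*.key.*', 'brown.iron.key.door'))
```

Splitting on '*' first means the star never has to be protected from escaping, and re.fullmatch removes the need for explicit ^ and $ anchors.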
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s means a whitespace character.
\S means a non-whitespace character.
+ means one or more repetitions of the preceding element.
Now, you want to match just a single word, meaning a string made up entirely of non-whitespace characters. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "one or more 'word characters'", like the other answers is better.

How to find superfluous escapes in regex patterns

How can I find and remove all the unneeded backslash escapes in Python regular expressions?
For example, in r'\{\"*' all the escapes are unnecessary, and the pattern has the same meaning as r'{"*'. But in r'\[a-b]\{2}\Z\'\+' removing any of the escapes would change how the regex is interpreted by the regex engine (or cause a syntax error).
Given a pattern, is there an easy way to remove the superfluous escapes programmatically in Python, i.e. other than perhaps parsing the whole regex string looking for escapes on non-special characters?
Here is the code that I came up with:
from contextlib import redirect_stdout
from io import StringIO
from re import compile, DEBUG, error, MULTILINE, VERBOSE

def unescape(pattern: str, flags: int):
    """Remove any escape that does not change the regex meaning"""
    # Capture the parse tree that re prints when compiling with DEBUG.
    strio = StringIO()
    with redirect_stdout(strio):
        compile(pattern, DEBUG | flags)
    original_debug = strio.getvalue()
    index = len(pattern)
    # Walk backwards; stop at 0 so that index never wraps around to -1.
    while index > 0:
        index -= 1
        character = pattern[index]
        if character != '\\':
            continue
        removed_escape = pattern[:index] + pattern[index+1:]
        strio = StringIO()
        with redirect_stdout(strio):
            try:
                compile(removed_escape, DEBUG | flags)
            except error:
                continue
        # Keep the shorter pattern only if the parse tree is unchanged.
        if original_debug == strio.getvalue():
            pattern = removed_escape
    return pattern

def print_unescaped_raw(regex: str, flags: int = 0):
    """Print an unescaped raw-string representation for regex."""
    print(
        ("r'%s'" % unescape(regex, flags)
         .replace("'", r'\'')
         .replace('\n', r'\n'))
    )

print_unescaped_raw(r'\{\"*')  # r'{"*'
One can also use sre_parse.parse directly, but the SubPatterns and tuples in the result may contain nested SubPatterns. And SubPattern instances don't have __eq__ method defined for them, so a recursive comparison subroutine might be required.
P.S.
Unfortunately, this method does not work with the regex module because in regex you get different debug output for escaped characters:
regex.compile(r'{', regex.DEBUG)
LITERAL MATCH '{'
regex.compile(r'\{', regex.DEBUG)
CHARACTER MATCH '{'
Unlike re that gives:
re.compile(r'{', re.DEBUG)
LITERAL 123
re.compile(r'\{', re.DEBUG)
LITERAL 123
I will not do the whole implementation, but I can give you some hints for a viable heuristic/algorithm:
Initial hypothesis: for each regex that you are going to modify, you have a list of input strings and expected outputs to validate its behavior.
1. Use http://www.rexegg.com/regex-quickstart.html to build a list of the characters that must stay escaped with a backslash \, i.e. the elements that should not be replaced.
2. Parse your regex and replace every \X by X, where X is a character that is not in the list built in step 1.
3. Run your initial regex and your new regex on the same input strings and compare their outputs for every input.
4. If all of the results are the same, you can use your new, simplified regex.
5. If at least one output differs, throw away the new regex and proceed with local replacements: pick one of the \X in your initial regex (randomly, or round robin) whose X is not in the list from step 1, replace it by X, and compare the output with the initial regex's output for each input string. If they match, keep that regex and repeat step 5 until no further progress is possible. If the output differs, remove that \X from the list of candidates and repeat step 5 with your previous regex. Once the list of candidate replacements is empty, you can use the new regex instead of the old one.
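The local-replacement loop in step 5 can be sketched roughly as follows. This is only an illustration under stated assumptions: the name simplify_by_samples is hypothetical, "output" is taken to mean the spans returned by finditer on each sample string, and the result is only as trustworthy as the samples are representative.

```python
import re

def simplify_by_samples(pattern, samples):
    """Drop each backslash whose removal leaves the match spans on
    all sample strings unchanged (a heuristic, not a proof)."""
    def behaviour(candidate):
        compiled = re.compile(candidate)
        return [[m.span() for m in compiled.finditer(s)] for s in samples]

    expected = behaviour(pattern)
    index = len(pattern) - 1
    while index >= 0:
        if pattern[index] == '\\':
            candidate = pattern[:index] + pattern[index + 1:]
            try:
                same = behaviour(candidate) == expected
            except re.error:
                # Removing this backslash made the pattern invalid.
                same = False
            if same:
                pattern = candidate
        index -= 1
    return pattern

print(simplify_by_samples(r'\{\"*', ['{"', '{', 'x{""']))
print(simplify_by_samples(r'\d+', ['a1', 'dd']))
```

With too few samples this approach will happily remove an escape that matters, which is why the answer insists on having validated inputs first.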

Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions

NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
    cleanAttributeName = ''.join(dirtyAttributeName)
    # Deal with beginning of string...
    cleanAttributeName = re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
    # Deal with end of string...
    if "BEGINNING_" in cleanAttributeName:
        print ' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName)
PROBLEM DESCRIPTION: The method not only replaces non-alphanumeric characters, it also incorrectly inserts the "BEGINNING_" string at the beginning of every string that is passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2, it incorrectly (and always) inserts the replacement string (BEGINNING_), even when there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so that it only replaces non-alphanumeric characters at the beginning of the string, and does not always insert the replacement string at the beginning of the original input string?
You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means it'll always be picked up by your pattern. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.
replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
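re.sub('^[^a-zA-Z]+',"BEGINNING_",cleanAttributeName)
(The + requires at least one character. Writing A-Z instead of A-z also fixes the range, which would otherwise include [, \, ], ^, _ and `.)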
There is a more elegant solution. You can use this
re.sub(r'^\W+', 'BEGINNING_', cleanAttributeName)
\W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
>>> re.sub(r'^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub(r'^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'
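To see why the quantifier matters, here is a small sketch comparing * and + on the strings from the question. The helper name web_safe_prefix and the choice to treat digits as alphanumeric (the class [^a-zA-Z0-9]) are assumptions made for illustration, not part of the original method.

```python
import re

# One or more leading characters that are not ASCII letters or digits.
LEADING_JUNK = re.compile(r'^[^a-zA-Z0-9]+')

def web_safe_prefix(name, replacement='BEGINNING_'):
    """Replace a leading run of non-alphanumeric characters, if any."""
    return LEADING_JUNK.sub(replacement, name)

print(web_safe_prefix('*##$ThisIsMyString1'))
print(web_safe_prefix('ThisIsMyString2'))

# With '*', the empty string at position 0 is a valid match,
# so the replacement is inserted even when nothing needs cleaning.
print(re.sub(r'^[^a-zA-Z]*', 'BEGINNING_', 'ThisIsMyString2'))
```

The last call reproduces the unwanted result from the question: a zero-length match at the start of the string is still a match, and re.sub replaces it.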

Python regex - matching character sequences using prior matched characters

I wish to match strings such as "zxxz" and "vbbv" where a character is followed by a pair of identical characters that do not match the first, then followed by the first. Therefore I do not wish to match strings such as "zzzz" and "vvvv".
I started with the following Python regex that matches all of those examples:
(.)(.)\2\1
In an attempt to exclude the second set ("zzzz", "vvvv"), I tried this modification:
(.)([^\1])\2\1
My reasoning is that the second group can contain any single character provided it is not the same as that matched in the first group.
Unfortunately this does not seem to work as it still matches "zzzz" and "vvvv".
According to the Python 2.7.12 documentation:
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
(My emphasis added).
I find this sentence ambiguous, or at least unclear, because it suggests to me that the numeric escape should resolve as a single excluded character in the set, but this does not seem to happen.
Additionally, the following regex does not seem to work as I would expect either:
(.)[^\1][^\1][\1]
This doesn't seem to match "zzzz" or "zxxz".
You want to do a negative lookahead assertion (?!...) on \1 in the second capture group, then it will work:
r'(.)((?!\1).)\2\1'
Testing your examples:
>>> import re
>>> re.match(r'(.)((?!\1).)\2\1', 'zxxz')
<_sre.SRE_Match object at 0x109b661c8>
>>> re.match(r'(.)((?!\1).)\2\1', 'vbbv')
<_sre.SRE_Match object at 0x109b663e8>
>>> re.match(r'(.)((?!\1).)\2\1', 'zzzz') is None
True
>>> re.match(r'(.)((?!\1).)\2\1', 'vvvv') is None
True
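The documentation sentence quoted in the question explains why [^\1] did not work: inside a character class, \1 is read as the character with octal value 1, not as a reference to group 1. The sketch below illustrates that, alongside the lookahead version. The names ABBA and is_abba are hypothetical, and using fullmatch assumes the whole string should have the four-character shape.

```python
import re

# Lookahead version from the answer: second char must differ from the first.
ABBA = re.compile(r'(.)((?!\1).)\2\1')

def is_abba(text):
    """True for strings like 'zxxz', False for 'zzzz'."""
    return ABBA.fullmatch(text) is not None

for sample in ('zxxz', 'vbbv', 'zzzz', 'vvvv'):
    print(sample, is_abba(sample))

# Inside [...], \1 is the control character chr(1), not group 1:
print(re.fullmatch(r'[\1]', '\x01') is not None)   # the class matches chr(1)
print(re.fullmatch(r'[^\1]', 'z') is not None)     # anything but chr(1)
```

This also accounts for (.)[^\1][^\1][\1] matching neither "zzzz" nor "zxxz": its last element only matches a literal chr(1).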

Check String for / against Characters in Python

I need to be able to tell the difference between a string that can contain letters and numbers, and a string that can contain numbers, colons and hyphens.
>>> def checkString(s):
...     pattern = r'[-:0-9]'
...     if re.search(pattern,s):
...         print "Matches pattern."
...     else:
...         print "Does not match pattern."
# 3 numbers separated by colons: 12, 24 and minus 14
>>> s1 = "12:24:-14"
# String containing letters and string containing letters/numbers.
>>> s2 = "hello"
>>> s3 = "hello2"
When I run the checkString method on each of the above strings:
>>> checkString(s1)
Matches pattern.
>>> checkString(s2)
Does not match pattern.
>>> checkString(s3)
Matches pattern.
s3 is the only one that doesn't do what I want. I'd like to be able to create a regex that allows numbers, colons and hyphens, but excludes EVERYTHING else (or just alphabetical characters). Can anyone point me in the right direction?
EDIT:
Therefore, I need a regex that would accept:
229 // number
187:657 //two numbers
187:678:-765 // two pos and 1 neg numbers
and decline:
Car //characters
Car2 //characters and numbers
You need to match the whole string, not a single character as you do at the moment:
>>> re.search('^[-:0-9]+$', "12:24:-14")
<_sre.SRE_Match object at 0x01013758>
>>> re.search('^[-:0-9]+$', "hello")
>>> re.search('^[-:0-9]+$', "hello2")
To explain regex:
within square brackets (character class): match a single digit 0 to 9, hyphen or colon.
+ is a quantifier that indicates that the preceding expression should be matched as many times as possible, but at least once.
^ and $ match start and end of the string. For one-line strings they're equivalent to \A and \Z.
This way you restrict the content of the whole string to be at least one character long and to consist only of characters from the character class, in any combination. What you were doing beforehand was searching for a single character from the character class anywhere within the subject string. This is why s3, which contains a digit, matched.
SilentGhost's answer is pretty good, but take note that it would also match strings like "---::::" with no digits at all.
I think you're looking for something like this:
'^(-?\d+:)*-?\d+$'
^ Matches the beginning of the line.
(-?\d+:)* Possible - sign, at least one digit, a colon. That whole pattern 0 or many times.
-?\d+ Then the same pattern once more, without the trailing colon
$ The end of the line
This will better match the strings you describe.
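As a rough sketch of how that stricter pattern could be used, the function below validates the string and, if it is well formed, converts it to a list of integers. The name parse_numbers and the decision to return None for rejected input are assumptions for illustration; re.fullmatch stands in for the ^ and $ anchors.

```python
import re

# Optional minus sign and digits, repeated with ':' between the numbers.
SIGNED_INTS = re.compile(r'(-?\d+:)*-?\d+')

def parse_numbers(text):
    """Return the integers in text, or None if text is not well formed."""
    if SIGNED_INTS.fullmatch(text) is None:
        return None
    return [int(part) for part in text.split(':')]

for sample in ('229', '187:657', '187:678:-765', 'Car', 'Car2', '---::::'):
    print(repr(sample), '->', parse_numbers(sample))
```

Validating first means int() is only ever called on pieces that are known to be an optional minus sign followed by digits.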
pattern = r'\A([^-:0-9]+|[A-Za-z0-9])\Z'
Your regular expression is almost fine; you just need to make it match the whole string. Also, as a commenter pointed out, you don't really need a raw string (the r prefix on the string) in this case. Voila:
def checkString(s):
    if re.match('[-:0-9]+$', s):
        print "Matches pattern."
    else:
        print "Does not match pattern."
The '+' means "match one or more of the previous expression". (This makes checkString report no match on an empty string. If you want an empty string to match, change the '+' to a '*'.) The '$' means "match the end of the string".
re.match means "the string must match the regular expression starting at the first character"; re.search means "the regular expression can match a sequence anywhere inside the string".
Also, if you like premature optimization--and who doesn't!--note that re.match has to look the pattern up on every call (the re module caches recently compiled patterns, so it is not recompiled each time, but the lookup still costs a little). This version compiles the regular expression once and reuses it:
__checkString_re = re.compile('[-:0-9]+$')
def checkString(s):
    global __checkString_re
    if __checkString_re.match(s):
        print "Matches pattern."
    else:
        print "Does not match pattern."
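For readers on Python 3, a minimal sketch of the same check is shown below. It assumes a boolean return value is more convenient than printing, and it swaps in re.fullmatch for the match-plus-$ combination; the name check_string is hypothetical.

```python
import re

# Only digits, colons and hyphens, at least one character.
NUMERIC_CHARS = re.compile(r'[-:0-9]+')

def check_string(text):
    """True if text consists solely of digits, colons and hyphens."""
    return NUMERIC_CHARS.fullmatch(text) is not None

for sample in ('12:24:-14', 'hello', 'hello2', ''):
    print(repr(sample), check_string(sample))

# '$' also matches just before a trailing newline, fullmatch does not:
print(re.match('[-:0-9]+$', '12\n') is not None)
print(check_string('12\n'))
```

The last two lines show one practical difference between the approaches: a string ending in a newline passes the $-anchored match but is rejected by fullmatch.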
