I want a regEx to match some text that contains both alpha and numeric chars. But I do NOT want it to match only alpha or numbers.
E.g. in python:
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
import re
rr = re.compile('([0-9a-z]{8})')
print 'sub=', rr.sub('########', s)
print 'findall=', rr.findall(s)
generates following output:
sub= [########: ########]: STARTED at ########ng job number ########
findall= ['mytaskid', '3fee46d2', 'processi', '10022001']
I want it to be:
sub= [mytaskid: ########]: STARTED at processing job number 10022001
findall= ['3fee46d2']
Any ideas... ??
In this case it's exactly 8 chars always, it would be even more wonderful to have a regEx that doesn't have {8} in it, i.e. it can match even if there are more or less than 8 chars.
-- edit --
Question is more to understand if there is a way to write a regEx such that I can combine 2 patterns (in this case [0-9] and [a-z]) and ensure the matched string matches both patterns, but number of chars matched from each set is variable. E.g. s could also be
s = 'mytaskid 3fee46d2 STARTED processing job number 10022001'
-- answer --
Thanks to all for the answers. All of them give me what I want, so everyone gets a +1 and the first one to answer gets the accepted answer, although jerry explains it the best. :)
If anyone is a stickler for performance, there is nothing to choose between them; they all perform about the same.
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
def testIt(regEx):
    from timeit import timeit
    s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
    # Check the regex under test (not a hard-coded one) before timing it.
    assert (re.sub(regEx, '########', s) ==
            '[mytaskid: ########]: STARTED at processing job number 10022001'), '"%s" does not work.' % regEx
    print 'sub() with \'', regEx, '\': ', timeit('rr.sub(\'########\', s)', number=500000, setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
    )
    print 'findall() with \'', regEx, '\': ', timeit('rr.findall(s)', setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
    )
testIt('\\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\\b')
testIt('\\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\\b')
testIt('\\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\\b')
testIt('\\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\\b')
produced:
sub() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.328042736387
findall() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.350668751542
sub() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.314759661193
findall() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.35618526928
sub() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.322802906619
findall() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.35330467656
sub() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.320779061371
findall() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.347522144274
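For readers on Python 3, a comparable check could be sketched as below. This is not the code that produced the timings above: PATTERNS and check_and_time are names made up for this illustration, and the compiled pattern is handed to timeit as a callable, so nothing has to be escaped inside a setup string. The iteration count is an arbitrary small default.

```python
import re
from timeit import timeit

S = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
EXPECTED = '[mytaskid: ########]: STARTED at processing job number 10022001'

# The four candidate patterns from the answers, written as raw strings.
PATTERNS = [
    r'\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b',
    r'\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b',
    r'\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b',
    r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b',
]

def check_and_time(pattern, number=10000):
    """Verify that `pattern` masks only the task id, then time sub() and findall()."""
    rr = re.compile(pattern)
    assert rr.sub('########', S) == EXPECTED, '"%s" does not work.' % pattern
    sub_time = timeit(lambda: rr.sub('########', S), number=number)
    findall_time = timeit(lambda: rr.findall(S), number=number)
    return sub_time, findall_time

for p in PATTERNS:
    print(p, check_and_time(p))
```

The absolute numbers depend on the machine and the Python version, so only the relative differences are worth reading.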
Try following regex:
\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b
This will match a word containing a digit followed by a letter, or vice versa.
Hence it covers every word that contains at least one digit and one letter.
Note: Although it is not the case with Python, I have observed that not all tools support lookahead and lookbehind, so I prefer to avoid them where possible.
You need to use the look ahead (?=...).
This one matches all words with at least one out of [123] and [abc].
>>> re.findall('\\b(?=[abc321]*[321])[abc321]*[abc][abc321]*\\b', ' 123abc 123 abc')
['123abc']
This way you can do AND for constraints to the same string.
>>> help(re)
(?=...) Matches if ... matches next, but doesn't consume the string.
Another way is to spell it out: having one of [abc] and one of [123] means there is at least a [123][abc] or an [abc][123] somewhere in the string, resulting in
>>> re.findall('\\b[abc321]*(?:[abc][123]|[123][abc])[abc321]*\\b', ' 123abc 123 abc')
['123abc']
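The lookahead idea extends to more than two constraints: one lookahead per required character set, all anchored at the start of the word. A minimal sketch, assuming each set is given as a plain string of literal characters (no ranges); all_of is a hypothetical helper name, not something from the re module:

```python
import re

def all_of(classes):
    """Build a regex matching words made only of characters from `classes`
    that contain at least one character from every class."""
    # Union of all classes: the only characters a word may contain.
    allowed = '[' + ''.join(re.escape(c) for c in classes) + ']'
    # One lookahead per class: "somewhere ahead in this word there is one of these".
    lookaheads = ''.join('(?=%s*[%s])' % (allowed, re.escape(c)) for c in classes)
    return re.compile(r'\b' + lookaheads + allowed + r'+\b')

print(all_of(['abc', '123']).findall(' 123abc 123 abc'))
print(all_of(['abc', '123', 'xyz']).findall('a1x a1 1x'))
```

Each lookahead is checked at the same position without consuming anything, which is what makes the constraints combine as AND.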
Not the most beautiful regular expression, but it works:
\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b
If the format is the same each time, that is:
[########: ########]: STARTED at ########ng job number ########
You can use:
([^\]\s]+)\]
With re.findall, or re.search and getting .group(1) if you use re.search.
[^\]\s]+ is a negated class and will match any character except space (and family) or closing square bracket.
The regex basically looks for characters (except ] or spaces) up until a closing square bracket.
If you want to match any string containing both alpha and numeric characters, you will need a lookahead:
\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b
Used like so:
result = re.search(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', text, re.I)
re.I is for ignorecase.
\b is a word boundary and will match only between a 'word' character and a 'non-word' character (or start/end of string).
(?=[0-9]*[a-z]) is a positive lookahead and makes sure there's at least 1 alpha in the part to be matched.
(?=[a-z]*[0-9]) is a similar lookahead but checks for digits.
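Put together, the pattern can be tried on the bracket-free sample from the question. This is only a sketch for Python 3; MIXED and find_mixed_tokens are illustrative names:

```python
import re

# Both lookaheads are tested at the word boundary; re.I lets [a-z] match upper case too.
MIXED = re.compile(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', re.I)

def find_mixed_tokens(text):
    """Return the words that contain at least one letter and at least one digit."""
    return MIXED.findall(text)

print(find_mixed_tokens('mytaskid 3fee46d2 STARTED processing job number 10022001'))
```

Note that no {8} appears anywhere, so the token may be of any length.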
You can use a more specific regular expression and skip the findall.
import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
mo = re.search(r':\s+(\w+)', s)
print mo.group(1)
Related
Using re in python3, I want to match appearances of percentages in text, and substitute them with a special token (e.g. substitute "A 30% increase" by "A #percent# increase").
I only want to match if the percent expression is a standalone item. For example, it should not match "The product's code is A322%n43%". However, it should match when a line contains only one percentage expression like "89%".
I've tried using delimiters in my regex like \b, but because % is itself a non-alphanumeric character, it doesn't catch the end of the expression. Using \s makes it impossible to catch expression standing by themselves in a line.
At the moment, I have the code:
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "1,211.21%")
' #percent# '
which still matches if the expression is followed by letters or other text (like the product code example above).
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "EEE1,211.21%asd")
'EEE #percent# asd'
What would you recommend?
Looks like a perfect job for Negative Lookbehind and Negative Lookahead:
re.sub(r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
'#percent#', string, flags=re.VERBOSE)
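(?<![^\s]) means "no non-whitespace character immediately before the current position is allowed", i.e. the match must start at the beginning of the string or right after whitespace (add more characters to the class if you need).
(?![^\s.,;!?'"]) means "immediately after the current position, only whitespace, one of . , ; ! ? ' " or the end of the string is allowed".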
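As a quick check of the two lookarounds, here is a small sketch that runs the same pattern over the cases from the question. PERCENT and mask_percentages are names chosen for this example only:

```python
import re

# Same pattern as above; with re.VERBOSE the spaces between the parts are ignored.
PERCENT = re.compile(
    r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
    flags=re.VERBOSE)

def mask_percentages(text):
    """Replace standalone percentage expressions with the #percent# token."""
    return PERCENT.sub('#percent#', text)

for sample in ["A 30% increase",
               "The product's code is A322%n43%",
               "89%",
               "EEE1,211.21%asd"]:
    print(repr(sample), '->', repr(mask_percentages(sample)))
```

The first and third samples are rewritten; the product code and the "EEE...asd" string are left alone because a non-space character touches the expression.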
Demo: https://regex101.com/r/khV7MZ/1.
Try wrapping the first capture group, together with the %, in a second group:
original: r"[+-]?[.,;]?(\d+[.,;']?)+%"
suggested: r"[+-]?[.,;]?((\d+[.,;']?)+%)\b"
I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" to every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back before every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
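The same strip-and-rechunk idea can be wrapped in a small function. A minimal sketch, assuming the input holds nothing but hex digit pairs and literal \x markers; prefix_hex_bytes is a made-up name, and the odd-length check is an addition for safety, not part of the answer above:

```python
def prefix_hex_bytes(text):
    """Return `text` with a literal \\x in front of every two-character hex byte."""
    digits = text.replace('\\x', '')          # drop the markers that are already there
    if len(digits) % 2:
        raise ValueError('odd number of hex digits: %r' % digits)
    # Step through the digits two at a time and put the marker back before each pair.
    return ''.join('\\x' + digits[i:i + 2] for i in range(0, len(digits), 2))

print(prefix_hex_bytes(r'3033\x90\x0A6F'))
```

Stepping the range by 2 avoids the division altogether, so the code behaves the same on Python 2 and Python 3.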
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python's re refers to group 1 as \1, not $1
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.
I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
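For comparison, here is an alternative sketch that swaps the callback for re.split with a capturing group: the kept words come back at the odd positions of the result and everything else is masked. It is written for Python 3, mask_except is a hypothetical name, and keep_words is assumed to be a non-empty list of literal words:

```python
import re

def mask_except(text, keep_words, mask='_'):
    """Replace every character of `text` with `mask`, except the words in `keep_words`."""
    # The capturing group makes re.split return the separators (the kept words) too.
    pattern = '(' + '|'.join(re.escape(w) for w in keep_words) + ')'
    parts = re.split(pattern, text)
    # Even indexes hold the text between kept words, odd indexes the kept words themselves.
    return ''.join(part if i % 2 else mask * len(part)
                   for i, part in enumerate(parts))

print(mask_except('I am going home now, thank you.', ['going', 'you']))
```

Unlike the (.+?) pattern, this also copes with two kept words that sit directly next to each other.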
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that most regex engines (most of them use a traditional NFA) analyze strings character by character, so when you want to exclude whole words you have to define the allowed strings as words yourself. Here, since you want to exclude two words with a negative lookahead, the match is anchored on a word boundary. And since sub replaces each match with the replacement string, you can't just pass '_': that would replace a part like 'I am' with only 3 underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
I am having trouble splitting continuous strings into more reasonable parts:
E.g. 'MarieMüller' should become 'Marie Müller'
So far I've used this, which works if no special characters occur:
' '.join([a for a in re.split(ur'([A-Z][a-z]+)', ''.join(entity)) if a])
This outputs for e.g. 'TinaTurner' -> 'Tina Turner', but doesn't work
for 'MarieMüller', which outputs: 'MarieMüller' -> 'Marie M \utf8 ller'
Now I came across using regex \p{L}:
' '.join([a for a in re.split(ur'([\p{Lu}][\p{Ll}]+)', ''.join(entity)) if a])
But this produces weird things like:
'JenniferLawrence' -> 'Jennifer L awrence'
Could anyone give me a hand?
If you work with Unicode and need to use Unicode categories, you should consider using the PyPI regex module. There, you have support for all the Unicode categories:
>>> import regex
>>> p = regex.compile(ur'(?<=\p{Ll})(?=\p{Lu})')
>>> test_str = u"Tina Turner\nMarieM\u00FCller\nJacek\u0104cki"
>>> result = p.sub(u" ", test_str)
>>> result
u'Tina Turner\nMarie M\xfcller\nJacek \u0104cki'
^ ^ ^
Here, the (?<=\p{Ll})(?=\p{Lu}) regex finds all locations between the lower- (\p{Ll}) and uppercase (\p{Lu}) letters, and then the regex.sub inserts a space there. Note that regex module automatically compiles the regex with regex.UNICODE flag if the pattern is a Unicode string (u-prefixed).
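If installing a third-party module is not an option, roughly the same effect can be had on Python 3 with the standard library only, because str.islower() and str.isupper() are Unicode-aware there. This is a sketch of that substitute technique, not the regex-module solution itself; split_camel is a made-up name, and the two string methods are only an approximation of the \p{Ll} and \p{Lu} categories:

```python
def split_camel(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    pieces = []
    for i, ch in enumerate(text):
        # Same positions the lookbehind/lookahead pair finds: lower on the left, upper on the right.
        if i > 0 and ch.isupper() and text[i - 1].islower():
            pieces.append(' ')
        pieces.append(ch)
    return ''.join(pieces)

for name in ('TinaTurner', 'MarieMüller', 'JacekĄcki'):
    print(split_camel(name))
```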
It won't work for extended characters.
You can use re.sub() for this. It will be much simpler
(?=(?!^)[A-Z])
For handling spaces
print re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', ' Tina Turner'.strip())
For handling cases of consecutive capital letters
print re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', ' TinaTXYurner'.strip())
Regex Breakdown
(?= #Lookahead to find all the position of capital letters
(?!^) #Ignore the first capital letter for substitution
[A-Z]
)
Using a function constructed of Python's string operations instead of regular expressions, this should work:
def split_combined_words(combined):
    separated = [combined[0]]  # start with the first character, not the second
    for letter in combined[1:]:
        if (letter.islower() or (letter.isupper() and separated[-1].isupper())):
            separated.append(letter)
        else:
            separated.extend((" ", letter))
    return "".join(separated)
I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to find only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as little as necessary (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of matching digits. Note that the latter also matches digits in other writing systems, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions (at least one digit at the end).
$ matches only the end of the input.
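To see the greedy/non-greedy difference from the list above in action, here is a short Python 3 sketch. last_digits is a name made up for this example, and returning None when there are no trailing digits is an assumption about what the caller wants:

```python
import re

def last_digits(text):
    """Return the run of digits at the end of `text`, or None if there is none."""
    m = re.search(r'[0-9]+$', text)
    return m.group(0) if m else None

s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
print(last_digits(s))                           # the whole trailing number
print(re.match(r'.*?([0-9]+)$', s).group(1))    # non-greedy prefix: same result
print(re.match(r'.*([0-9]+)$', s).group(1))     # greedy prefix: only the last digit, '6'
```

With re.search and a pattern anchored by $, the leading .*? is not needed at all.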
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print re.findall('^.*-([0-9]+)$',s)
>>> ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or, more simply, just match the digits at the end of the string: '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ matches one or more digits.
$ matches at the end of the string.
So, your code should be:
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the below regex
\d+$
$ denotes the end of the string.
\d is a digit
+ matches the preceding character 1 to many times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
m = re.match('.*?([0-9]+)$', W)  # match once and reuse the result
if m is None:
    last_digits = "None"
else:
    last_digits = m.group(1)
print("Last digits of "+W+" are "+last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.