I can't seem to find an example of this, but I doubt the regex is that sophisticated. Is there a simple way of getting the immediately preceding digits of a certain character in Python?
For the character "A" and the string:
"îA"
It should return 238A
As long as you intend to include the trailing character in the resulting match, the regex pattern to do that is very simple. For instance, if you want to capture any series of digits followed by a letter A, the pattern would be \d+A
If you are on python 3, try this.
Please refer to this link for more information.
import re
char = "A" # the character you're searching for.
string = "BA îA 123A" # test string.
regex = "[0-9]+%s" %char # capturing digits([0-9]) which appear more than once(+) followed by a desired character "%s"%char
compiled_regex = re.compile(regex) # compile the regex
result = compiled_regex.findall(string)
print (result)
>>['238A', '123A']
Related
I want to use python in order to manipulate a string I have.
Basically, I want to prepend"\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)/2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
print('\n\nNew string:')
print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x$1"
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE)
if result:
print (result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexidecimal digit.
([a-f\d]{2}) Capture any two hexidecimal digits into capture group 1.
for string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find "#..'...'" like "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use regex ((#.+)\~(\'.*?\')), it find "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". So how to modify the regex to find the string successfully?
Use non-capturing, non greedy, modifiers on the inner brackets and search for not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything which begins by '#' and neglect ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
I have web URLs that look like this:
http://example.com/php?id=2/*
http://example.com/php?id=2'
http://example.com/php?id=2*/"
What I need to do is grab the last characters of the string, I've tried:
for urls in html_page:
syntax = list(url)[-1]
# <= *
# <= '
# etc...
However this will only grab the last character of the string, is there a way I could grab the last characters as long as they are special characters?
Use a regex. Assuming that by "special character" you mean "anything besides A-Za-z0-9":
>>> import re
>>> re.search(r"\W+$", "http://example.com/php?id=2*/'").group()
"*/'"
\W+ matches one or more "non-word" characters, and $ anchors the search to the end of the string.
Use a regular expression?
import re
addr = "http://example.com/php?id=2*/"
chars = re.search(addr, "[\*\./_]{0,4}$").group()
Characters you want to match are the ones between the [] brackets. You may want to add or remove characters depending on what you expect to encounter.
For example, you would (probably) not want to match the '=' character in your example URLs, which the other answer would match.
{0,4} means to match 0-4 characters (defaults to being greedy)
I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to find only the characters:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as much as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of capturing digits. Note that the latter also matches digits in other writing schemes, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means multiple entries (at least one number at the end).
$ matches only the end of the input.
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print re.findall('^.*-([0-9]+)$',s)
>>> ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anthing
- # Upto the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Upto the end of the string
Or more simply just match the digits followed at the end of the string '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ is used to match digit (one or more)
$ is used to match at the end of string.
So, your code should be: -
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the below regex
\d+$
$ depicts the end of string..
\d is a digit
+ matches the preceding character 1 to many times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
if re.match('.*?([0-9]+)$', W)== None:
last_digits = "None"
else:
last_digits = re.match('.*?([0-9]+)$', W).group(1)
print("Last digits of "+W+" are "+last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.
I have a string with a mix of uppercase and lowercase letters. I need want to find every lowercase letter that is surronded by 3 uppercase letters and extract it from the string.
For instance ZZZaZZZ I want to extract the a in the previous string.
I have written a script that is able to extract ZZZaZZZ but not the a alone. I know I need to use nested regex expressions to do this but I can not wrap my mind on how to implement this. The following is what I have:
import string, re
if __name__ == "__main__":
#open the file
eqfile = open("string.txt")
gibberish = eqfile.read()
eqfile.close()
r = re.compile("[A-Z]{3}[a-z][A-Z]{3}")
print r.findall(gibberish)
EDIT:
Thanks for the answers guys! I guess I should have been more specific. I need to find the lowercase letter that is surrounded by three uppercase letters that are exactly the same, such as in my example ZZZaZZZ.
You are so close! Read about the .group* methods of MatchObjects. For example, if your script ended with
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
print r.match(gibberish).group(1)
then you'd capture the desired character inside the first group.
To address the new constraint of matching repeated letters, you can use backreferences:
r = re.compile(r'([A-Z])\1{2}(?P<middle>[a-z])\1{3}')
m = r.match(gibberish)
if m is not None:
print m.group('middle')
That reads like:
Match a letter A-Z and remember it.
Match two occurrences of the first letter found.
Match your lowercase letter and store it in the group named middle.
Match three more consecutive instances of the first letter found.
If a match was found, print the value of the middle group.
r = re.compile("(?<=[A-Z]{3})[a-z](?=[A-Z]{3})")
(?<=...) indicates a positive lookbehind and (?=...) is a positive lookahead.
module re
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
You need to capture the part of the string you are interested in with parentheses, and then access it with re.MatchObject#group:
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
m = r.match(gibberish)
if m:
print "Match! Middle letter was " + m.group(1)
else:
print "No match."