Python regex, pattern-match multiple backslash characters

I have a Python raw string that has five backslash characters followed by a double quote. I am trying to pattern-match it using Python's re module.
The output must print the matching pattern, plus the two characters before and after it.
import re
command = r'abc\\\\\"abc'
search_string = '.{2}\\\\\\\\\\".{2}'
pattern = re.compile(search_string)
ts_name = pattern.findall(command)
print(ts_name)
The output shows,
['\\\\\\\\"ab']
I expected
['bc\\\\\"ab']
Anomalies:
1) The two extra characters at the front (bc) are missing
2) Magically, it prints eight backslashes when the input string contains just five backslashes

You can simplify (shorten) your regex and use the search function to get your output:
import re
command = r'abc\\\\\"abc'
search_string = r'.{2}(?:\\){5}".{2}'
print(re.compile(search_string).search(command).group())
Output:
bc\\\\\"ab
Your regex should also use the r prefix (a raw string), so that every backslash reaches the regex engine unchanged.
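Both anomalies from the question can be reproduced side by side. The following is a minimal sketch, assuming Python 3; the names loose and strict are illustrative and not from the original post.

```python
import re

command = r'abc\\\\\"abc'        # raw string: 5 backslashes, then a double quote
print(command.count('\\'))        # 5

# Non-raw pattern from the question: the 10 source backslashes become 5 in the
# pattern text, which the regex engine reads as \\ \\ \" (two literal
# backslashes followed by a quote), so only part of the run is required.
loose = '.{2}\\\\\\\\\\".{2}'

# Raw pattern from the answer: (?:\\) is one literal backslash, repeated 5 times.
strict = r'.{2}(?:\\){5}".{2}'

loose_match = re.search(loose, command).group()
strict_match = re.search(strict, command).group()

print(loose_match)      # \\\\"ab      (4 real backslashes)
print([loose_match])    # ['\\\\\\\\"ab']  (list repr doubles each backslash)
print(strict_match)     # bc\\\\\"ab
```

Printing a list shows the repr of each element, which is why four matched backslashes appear as eight in the question's output.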

Just add a capturing group around the part you want:
command = r'a(bc\\\\\"ab)c'
and access it with:
match.group(1)
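A hedged sketch of what this answer appears to intend, assuming the capturing group belongs in the pattern rather than in command. The pattern below is an illustration, not the answerer's exact code.

```python
import re

command = r'abc\\\\\"abc'

# Parentheses mark the part to extract: two characters, five literal
# backslashes, a double quote, and two more characters.
pattern = re.compile(r'a(.{2}\\{5}".{2})c')

match = pattern.search(command)
if match:
    print(match.group(1))   # bc\\\\\"ab
```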

Related

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach for such a case.

Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1 (written \\x\1 in Python)
Details:
(?:) Non-capturing group
? Matches zero or one time, so the string \x is consumed if it exists
() Capturing group
[] Matches a single character present in the list: 0-9 and A-Z
{n} Matches exactly n times
\\x The literal string \x
$1 Group 1 (\1 in Python's re module)
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
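The character class [0-9A-Z] works for this input but also accepts G-Z and rejects lowercase hex digits. The following is a small variant, assuming the input may contain hex digits in either case; the function name prefix_hex_bytes and the short sample are hypothetical.

```python
import re

def prefix_hex_bytes(text):
    # Optional existing \x prefix, then exactly two hex digits (either case).
    # The replacement re-emits \x followed by the captured digits.
    return re.sub(r'(?:\\x)?([0-9A-Fa-f]{2})', r'\\x\1', text)

sample = r'3033\x90\x0a6f'
print(prefix_hex_bytes(sample))   # \x30\x33\x90\x0a\x6f
```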

You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back before every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00

You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
    print('\n\nOriginal string:')
    print(mystr)

This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python's re uses \1, not $1, to refer to group 1
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring that neither of the following matches at the current position.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.
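The lookaround pattern can be checked on a short input. This is a sketch under the assumption that Python's \1 backreference is used in the replacement; the constant name UNPREFIXED_HEX and the sample string are hypothetical.

```python
import re

# Two hex digits that are not already the payload of a \xNN escape.
UNPREFIXED_HEX = re.compile(
    r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})",
    re.IGNORECASE,
)

sample = r"3033\x90\x0A6F"
result = UNPREFIXED_HEX.sub(r"\\x\1", sample)
print(result)   # \x30\x33\x90\x0A\x6F
```

Existing escapes such as \x90 are left untouched because both of their digits are rejected by one of the two lookbehinds.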

Get the last 4 characters of a string as long as they are special characters

I have web URLs that look like this:
http://example.com/php?id=2/*
http://example.com/php?id=2'
http://example.com/php?id=2*/"
What I need to do is grab the last characters of the string. I've tried:
for url in html_page:
    syntax = list(url)[-1]
    # <= *
    # <= '
    # etc...
However, this only grabs the last character of the string. Is there a way I could grab the last characters as long as they are special characters?

Use a regex. Assuming that by "special character" you mean "anything besides A-Za-z0-9_" (\W treats the underscore as a word character too):
>>> import re
>>> re.search(r"\W+$", "http://example.com/php?id=2*/'").group()
"*/'"
\W+ matches one or more "non-word" characters, and $ anchors the search to the end of the string.
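Applied to several URLs, the same pattern needs a guard for the case where nothing special trails the URL, since re.search then returns None. A minimal sketch; the helper name trailing_specials and the fourth URL are assumptions added for illustration.

```python
import re

TRAILING = re.compile(r"\W+$")

def trailing_specials(url):
    # Return the run of non-word characters at the end, or "" if there is none.
    match = TRAILING.search(url)
    return match.group() if match else ""

urls = [
    "http://example.com/php?id=2/*",     # trailing run: /*
    "http://example.com/php?id=2'",      # trailing run: '
    "http://example.com/php?id=2*/\"",   # trailing run: */"
    "http://example.com/php?id=2",       # no trailing run
]

for url in urls:
    print(repr(trailing_specials(url)))
```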

Use a regular expression?
import re
addr = "http://example.com/php?id=2*/"
chars = re.search(r"[*./_]{0,4}$", addr).group()  # pattern first, then the string to search
Characters you want to match are the ones between the [] brackets. You may want to add or remove characters depending on what you expect to encounter.
For example, you would (probably) not want to match the '=' character in your example URLs, which the other answer would match.
{0,4} matches between 0 and 4 characters (greedy by default).
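The difference between the two answers shows up when a character such as '=' sits directly before the trailing run. A small sketch with a hypothetical URL that ends in =*/:

```python
import re

addr = "http://example.com/php?id=*/"

any_non_word = re.search(r"\W+$", addr).group()
listed_only = re.search(r"[*./_]{0,4}$", addr).group()

print(any_non_word)   # =*/
print(listed_only)    # */
```

Because {0,4} also allows zero characters, the second pattern always matches and returns an empty string when no listed character ends the URL.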

Extract path from lines in a file using python

I am trying to extract paths from a given file that meet some criteria.
Example:
I have a small file with contents something like:
contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording
I want to extract the paths from the file that contain file-*.txt.
In the above example, I need the paths below as output:
/net/super/file-1.txt
/sample/random/folder/folder2/file-2.txt
Any suggestions with Python code?
I am trying regex, but I am facing issues with multiple folders, etc. Something like:
FileRegEx = re.compile('.*(file-\\d.txt).*', re.IGNORECASE|re.DOTALL)

You don't need .*; just use character classes properly:
r'[\/\w]+file-[^.]+\.txt'
[\/\w]+ will match any combination of word characters and /, and [^.]+ will match any run of characters other than a dot.
Demo:
https://regex101.com/r/ytsZ0D/1
Note that this regex is fairly general. If you want to exclude some cases, you can use ^ within a character class or another suitable pattern, based on your needs.
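Run against the sample text from the question, the pattern can be used with re.findall. This is a sketch; the slash is written unescaped because Python's re does not require \/.

```python
import re

text = """contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording"""

# Path characters (word characters and /), then file-, then anything up to .txt
paths = re.findall(r'[/\w]+file-[^.]+\.txt', text)
print(paths)
# ['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt']
```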

Assuming your filenames are white-space separated ...
\\s(\\S+/file-\\d+\\.txt)\\s
\\s - match a white-space character
\\S+ - matches one or more non-whitespace characters
\\d+ - matches one or more digits
\\. - matches a literal period instead of any character
You can avoid the double backslashes using r'' strings:
r'\s(\S+/file-\d+\.txt)\s'
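One caveat: because \s is required on both sides, a path at the very start or end of the text is not matched. The sketch below compares the pattern with a lookaround variant, which is an alternative added here and not part of the answer; the sample text and the names strict and relaxed are hypothetical.

```python
import re

text = "first /net/super/file-1.txt middle, last /a/b/file-7.txt"

# Pattern from the answer: whitespace must surround the path.
strict = re.compile(r"\s(\S+/file-\d+\.txt)\s")

# Variant: assert "no non-whitespace neighbour" instead of consuming whitespace.
relaxed = re.compile(r"(?<!\S)\S+/file-\d+\.txt(?!\S)")

strict_paths = strict.findall(text)
relaxed_paths = relaxed.findall(text)

print(strict_paths)    # ['/net/super/file-1.txt']
print(relaxed_paths)   # ['/net/super/file-1.txt', '/a/b/file-7.txt']
```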

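Try this (note that it returns every .txt path, including /otherfile/other-3.txt, so filter for file- afterwards if needed):
import re
re.findall(r'/.+\.txt', s)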
# Output: ['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']
Output:
>>> import re
>>>
>>> s = """contentsaasdf /net/super/file-1.txt othercontents...
... data is in /sample/random/folder/folder2/file-2.txt otherdata...
... filename /otherfile/other-3.txt somewording"""
>>>
>>> re.findall(r'/.+\.txt', s)
['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']

match any decimals appearing immediately before a character in python

I can't seem to find an example of this, but I doubt the regex is that sophisticated. Is there a simple way of getting the immediately preceding digits of a certain character in Python?
For the character "A" and the string:
"&#238A"
It should return 238A

As long as you intend to include the trailing character in the resulting match, the regex pattern to do that is very simple. For instance, if you want to capture any series of digits followed by a letter A, the pattern would be \d+A
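For the string in the question, that pattern can be tried directly. A minimal sketch:

```python
import re

# One or more digits immediately followed by the letter A.
match = re.search(r"\d+A", "&#238A")
print(match.group() if match else None)   # 238A
```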

If you are on Python 3, try this.
import re
char = "A"  # the character you're searching for
string = "BA &#238A 123A"  # test string
regex = "[0-9]+%s" % char  # one or more digits ([0-9]+) followed by the desired character
compiled_regex = re.compile(regex)  # compile the regex
result = compiled_regex.findall(string)
print(result)
Output:
['238A', '123A']
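If the character being searched for can be a regex metacharacter such as + or ., interpolating it directly changes the meaning of the pattern. The following sketch, which is not part of the answer, assumes re.escape is acceptable; the helper name digits_before is hypothetical.

```python
import re

def digits_before(char, text):
    # re.escape makes the character literal even if it is special in a regex.
    pattern = re.compile(r"[0-9]+" + re.escape(char))
    return pattern.findall(text)

print(digits_before("A", "BA &#238A 123A"))   # ['238A', '123A']
print(digits_before("+", "12+3 4+"))          # ['12+', '4+']
```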

Handling backreferences to capturing groups in re.sub replacement pattern

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
import re
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print(coord_re)
But this gives me 0.7133,2.25378. What am I doing wrong?

You should be using raw strings for regex; try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing all matches with the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print('\1,\2')
,
>>> print(r'\1,\2') # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
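The difference between the two literals is easy to confirm by length. A minimal sketch, assuming Python 3; the names plain and raw are illustrative.

```python
plain = "\1,\2"    # octal escapes: chr(1), a comma, chr(2)
raw = r"\1,\2"     # backslash, 1, comma, backslash, 2

print(len(plain), len(raw))               # 3 5
print(plain == chr(1) + "," + chr(2))     # True
print(raw == "\\1,\\2")                   # True
```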

Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.
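Two other spellings give the same result and may be easier to read. This sketch is an addition, not part of either answer: \g<1> is the explicit group-reference syntax in replacement templates, and the lookaround version avoids backreferences entirely.

```python
import re

coords = '0.71331, 52.25378'

# Explicit group references in the replacement template.
with_groups = re.sub(r"(\d), (\d)", r"\g<1>,\g<2>", coords)

# Match only the ", " that sits between two digits and replace it with ",".
with_lookaround = re.sub(r"(?<=\d), (?=\d)", ",", coords)

print(with_groups)       # 0.71331,52.25378
print(with_lookaround)   # 0.71331,52.25378
```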
