Regular Expression Not matching the value - python

I have a file that maps IP addresses to names, in the following format:
<<%#$192.168.8.40$#% %##Name_of_person##% >>
I read this file and now want to extract the list using Python's regular expressions:
list=re.findall("<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>",ace)
print list
But the result is always an empty list.
Can anyone tell me where the mistake in the regular expression is?
Edit: ace is the variable holding the contents read from the file.

$ is a special character in regular expressions, meaning "end of line" (or "end of string", depending on the flavour). Your regex has other characters following the $, and as such only matches strings that have those characters after the end, which is impossible.
You will need to escape the $, like so: \$
I would suggest the following regular expression (formatted as a raw string since you are using Python):
r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>"
That is, <<%#$, then one or more non-$ characters, $#%, a whitespace character, %##, one or more non-# characters, ##%, whitespace, >>.
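As a quick check of the explanation above, here is a minimal sketch (written for Python 3; `ace` stands in for the file contents, and the names `broken` and `matches` are only illustrative) comparing the unescaped and escaped patterns:

```python
import re

# Sample line in the format from the question; stands in for the file contents.
ace = "<<%#$192.168.8.40$#% %##Name_of_person##% >>"

# Unescaped: each $ is an end-of-string anchor, so nothing can match.
broken = re.findall(r"<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>", ace)

# Escaped: \$ matches a literal dollar sign.
matches = re.findall(r"<<%#\$([^$]+)\$#%\s%##([^#]+)##%\s>>", ace)

print(broken)   # []
print(matches)  # [('192.168.8.40', 'Name_of_person')]
```

Because the pattern has two groups, findall returns a list of (ip, name) tuples, one per matching record.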

Something like:
text = '<<%#$192.168.8.40$#% %##Name_of_person##% >>'
ip, name = [el[1] for el in re.findall(r'%#(.)(.+?)\1#%', text)]
If you can get away with just splitting on runs of '#' and '$', then:
from operator import itemgetter
ip, name = itemgetter(1, 3)(re.split(r'[#$]+', text))
You could also just use built-in string functions:
tmp = text.split('$')
ip, name = tmp[1], tmp[2].split('#')[3]

Your regex pattern cannot match as written, because the $ characters are not escaped.
You may use
r"<<\%#\$(\S+)\$#\%\s\%##(\w+\s*\w*)##\%\s>>"
in place of
"<<%#$(\S+)$#%\s%##(\w+\s*\w*)##%\s>>"
in the findall call.
Good luck!

Related

how can I substitute a matched string in python

I have a string ="/One/Two/Three/Four"
I want to convert it to ="Four"
I can do this in one line in perl
string =~ s/.*+\///g
How Can I do this in python?
str_name="/One/Two/Three/Four"
str_name.split('/')[-1]
In general, str.split is a simple way to break a string into a list on a fixed separator (it does not take a regex; re.split does). We can then take the last element of that list, which happens to be "Four" in this case.
Hope this helps.
Python's re module can handle regular expressions. For this case, you'd do
import re
my_str = "/One/Two/Three/Four"
new_str = re.sub(".*/", "", my_str)
# 'Four'
re.sub() is the regex replacement function. Like your Perl regex, we simply look for any number of characters followed by a slash, and replace that with the empty string. Because .* is greedy, the match runs up to the last slash, so what's left is whatever follows it, which is "Four".
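To make the role of greediness concrete, here is a small sketch (Python 3; the variable names are illustrative) that contrasts the greedy pattern with a non-greedy one and with a regex-free alternative:

```python
import re

my_str = "/One/Two/Three/Four"

# Greedy .* runs to the end of the string, then backtracks to the LAST slash.
greedy = re.sub(r".*/", "", my_str)

# Non-greedy .*? stops at the FIRST slash; count=1 limits it to one replacement.
lazy_once = re.sub(r".*?/", "", my_str, count=1)

# Same result as the greedy regex, without using re at all.
no_regex = my_str.rsplit("/", 1)[-1]

print(greedy)     # Four
print(lazy_once)  # One/Two/Three/Four
print(no_regex)   # Four
```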
There are a lot of ways to solve this. One way would be to index into the string; other string methods are listed in the Python documentation.
string ="/One/Two/Three/Four"
string[string.index('Four'):]
Additionally you could split the string by the slash with .split('/')
print(string.split('/')[-1])
Another option would be regular expressions, using Python's re module.

using OR operator (|) in variable for regular expression in python

I need to match against a list of string values. I'm using '|'.join() to build a string that is passed into re.match:
import re
line='GigabitEthernet0/1 is up, line protocol is up'
interfacenames=[
'Loopback',
'GigabitEthernet'
]
rex="r'" + '|'.join(interfacenames) + "'"
print rex
interface=re.match(rex,line)
print interface
The code result is:
r'Loopback|GigabitEthernet'
None
However, if I copy and paste the string directly into match:
interface=re.match(r'Loopback|GigabitEthernet',line)
It works:
r'Loopback|GigabitEthernet'
<_sre.SRE_Match object at 0x7fcdaf2f4718>
I did try to replace .join with actual "Loopback|GigabitEthernet" in rex and it didn't work either. It looks like the pipe symbol is not treated as operator when passed from string.
Any thoughts how to fix it?
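You have put the r' prefix and the closing quote into the string value itself, so they become part of the pattern. The r prefix belongs only to the string literal syntax. This is how it could be used: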
rex=r'|'.join(interfacenames)
See the Python demo
If the interfacenames may contain special regex metacharacters, escape the values like this:
rex=r'|'.join([re.escape(x) for x in interfacenames])
Also, if you plan to match the strings not only at the start of the string, use re.search rather than re.match. See What is the difference between Python's re.search and re.match?
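The points above can be sketched together as follows (Python 3; the second sample line and the variable names are made up for illustration):

```python
import re

interface_names = ["Loopback", "GigabitEthernet"]

# Escape each name, then join with the alternation operator.
pattern = "|".join(re.escape(name) for name in interface_names)

line = "GigabitEthernet0/1 is up, line protocol is up"

# re.match only succeeds if the pattern matches at the start of the string.
at_start = re.match(pattern, line)

# re.search scans the whole string (hypothetical second sample line).
anywhere = re.search(pattern, "interface Loopback0 is up")

# Wrapping the pattern in "r'" and "'" makes those characters part of the regex.
wrapped = re.match("r'" + pattern + "'", line)

print(pattern)           # Loopback|GigabitEthernet
print(at_start.group())  # GigabitEthernet
print(anywhere.group())  # Loopback
print(wrapped)           # None
```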
You don't need to put "r'" at the beginning and "'" at the end. That's part of the syntax for raw string literals; it's not part of the string itself.
rex = '|'.join(interfacenames)

Replace string between tags if string begins with "1"

I have a huge XML file (about 100MB) and each line contains something along the lines of <tag>10005991</tag>. So for example:
textextextextext<tag>10005991<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>10005993</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I want to replace any string that sits between the tags and begins with "1" with a string of my choice, and then write the result back to the file. I've tried using the line.replace function, which works, but only if I specify the exact string.
line=line.replace("<tag>10005991</tag>","<tag>YYYYYY</tag>")
Ideal output:
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I've thought about using an array to pass each string in and then replace but I'm sure there's a much simpler solution.
You can use the re module
>>> text = 'textextextextext<tag>10005991</tag>textextextextext'
>>> re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',text)
'textextextextext<tag>YYYYY</tag>textextextextext'
re.sub will replace the matched text with the second argument.
Quote from the doc
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
Usage may be like:
with open("file") as f, open("output", "w") as f2:
    for line in f:
        # Write each line with the matching tag values replaced.
        f2.write(re.sub(r'<tag>1(\d+)</tag>', '<tag>YYYYY</tag>', line))
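You can also use a regex with positive look-arounds to match only the string between the tags. Note that the fourth positional argument of re.sub is count, so flags would have to be passed as flags=...; since this pattern contains no '.', '^' or '$', neither re.DOTALL nor re.MULTILINE is needed here, even for a multi-line string:
>>> print re.sub(r'(?<=<tag>)1\d+(?=</?tag>)', r'YYYYYY', s)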
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Also, as @Bhargav Rao did in his answer, you can use grouping instead of look-arounds. Capture the closing tag so that it can be reused in the replacement:
>>> print re.sub(r'<tag>1\d+(</?tag>)', r'<tag>YYYYYY\1', s)
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I think your best bet is to use ElementTree
The main idea:
1) Parse the file
2) Find the element's value
3) Test your condition
4) Replace value if condition met
Here is a good place to start parsing : How do I parse XML in Python?
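The four steps above can be sketched roughly like this (Python 3). The sample document, the `root` and `row` element names, and the replacement string are assumptions made for illustration, and the approach only works if the real file is well-formed XML:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; the real file's layout may differ.
xml_text = """<root>
  <row><tag>10005991</tag></row>
  <row><tag>20005992</tag></row>
  <row><tag>10005993</tag></row>
</root>"""

# 1) Parse the document.
root = ET.fromstring(xml_text)

# 2) Find each <tag> element, 3) test its value, 4) replace it if it starts with "1".
for elem in root.iter("tag"):
    if elem.text and elem.text.startswith("1"):
        elem.text = "YYYYYY"

values = [elem.text for elem in root.iter("tag")]
print(values)  # ['YYYYYY', '20005992', 'YYYYYY']
```

For a file on disk, ET.parse(path) returns a tree whose write(path) method saves the modified document; with a 100MB file, memory use is worth checking before settling on this approach.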

can't use variable inside regex

So, I have a long sequence of Unicode characters that I want to match using regular expressions:
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
(These are all the uppercase characters in the Unicode range 0-382. Most of them are accented. PEP8 discourages the use of non-ASCII characters in Python scripts, so I'm using the Unicode escape codes instead of the literal characters.)
If I simply compile that long string directly, it works. For instance, this matches all the words that begin with one of those characters:
regex = re.compile(u'\A[\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D]')
But I want to re-use that same sequence of characters in several other regular expressions. I could simply copy and paste it every time, but that's ugly. So based on previous answers to similar questions I've tried this:
regex = re.compile(u'\A[%s]' % char_set)
No good. Somehow the above expression seems to match ANY character, not just the ones hardcoded under the variable 'char_set'.
I've also tried this:
regex = re.compile(u'\A[' + char_set + ']')
And this:
regex = re.compile(u'\A[' + re.escape(char_set) + ']')
And this too:
regex = re.compile(u'\A[{ }]'.format(char_set))
None of which works as expected.
Any thoughts? What am I doing wrong?
(I'm using Python 2.7 and Mac OS X 10.6)
When you're using a pattern with a set of characters in square brackets, you don't want to put any vertical bar (|) characters in the set: inside brackets, | is a literal character, not an alternation operator. Instead, just string the characters together and it should work. Here's a session where I tried out your characters with no problems after stripping the | chars:
>>> import re
>>> char_set = u'\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
>>> fixed_char_set = char_set.replace("|", "") # remove the unneeded vertical bars
>>> pattern = ur"\A[{}]".format(fixed_char_set) # create a pattern string
>>> regex = re.compile(pattern) # compile the pattern into a regex object
>>> print regex.match("%foo") # "%" is not in the character set, so match returns None
None
edit: Actually, it seems like there must be some other issue going on, since I don't match "%foo" even if I use your original char_set without stripping out anything. Please give examples of text that is matching when it shouldn't!
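As a rough illustration of building a character class from a variable, here is a Python 3 sketch. It uses a shortened, hypothetical subset of the question's character set, and in Python 3 the \u escapes are interpreted in ordinary string literals, which is not the case for a Python 2 byte string without the u prefix:

```python
import re

# Shortened subset of the question's set, with ranges and single
# characters simply concatenated (no "|" separators).
char_set = "\u0041-\u005A\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102"

starts_upper = re.compile(r"\A[{}]".format(char_set))

# Inside [...], "|" is just another member of the set.
with_bar = re.compile(r"\A[{}]".format("A-Z|"))

print(bool(starts_upper.match("\u00C9tude")))  # True: U+00C9 is in C0-D6
print(bool(starts_upper.match("%foo")))        # False
print(bool(starts_upper.match("|foo")))        # False
print(bool(with_bar.match("|foo")))            # True
```

If the set could ever contain `]`, `\` or `^`, those characters would need escaping before being placed inside the brackets.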

Python regular expression with string in it

I would like to match a string with something like:
re.match(r'<some_match_symbols><my_match><some_other_match_symbols>', mystring)
where my_match is a string I would like it to find. The problem is that it may differ from time to time, and it is stored in a variable. Would it be possible to add a variable to a regex?
Nothing prevents you from simply doing this:
re.match('<some_match_symbols>' + '<my_match>' + '<some_other_match_symbols>', mystring)
Regular expressions are nothing more than strings containing some special characters specific to the regular expression syntax. They are still strings, so you can do whatever you are used to doing with strings.
The r'…' syntax is, by the way, raw string syntax, which simply prevents escape sequences inside the string from being evaluated. So r'\n' is the same as '\\n', a string containing a backslash and an n, while '\n' contains a line break.
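The difference between the two spellings is easy to verify directly; a tiny Python 3 sketch (variable names are illustrative):

```python
raw = r"\n"       # backslash followed by the letter n
escaped = "\\n"   # the same two characters, written without the r prefix
newline = "\n"    # a single line-break character

print(raw == escaped)          # True
print(len(raw), len(newline))  # 2 1
```

Either spelling produces an ordinary string object; the r prefix only changes how the literal is read from the source code.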
import re
url = "www.dupe.com"
expression = re.compile('<p>%s</p>'%url)
result = expression.match("<p>www.dupe.com</p>BBB")
if result:
    print result.start(), result.end()
The r'' notation applies only to string literals. To build a pattern from a variable, format the variable into the pattern string and pass the result to re.compile.
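One caveat when inserting a variable into a pattern: any regex metacharacters in its value, such as the dots in the URL, are interpreted as regex syntax. A minimal Python 3 sketch, using re.escape to treat the value literally (the second test string is made up for illustration):

```python
import re

url = "www.dupe.com"

# re.escape backslash-escapes the dots so they match only literal dots.
expression = re.compile("<p>{}</p>".format(re.escape(url)))
result = expression.match("<p>www.dupe.com</p>BBB")

# Without escaping, each "." matches any character.
unescaped = re.compile("<p>{}</p>".format(url))
false_hit = unescaped.match("<p>wwwXdupeYcom</p>")

print(result.start(), result.end())  # 0 19
print(false_hit is not None)         # True
```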
