Regex match multiple results between the same delimiters

Very poor title - feel free to update it if you feel you can help
I'm trying to return a list
[<str1>, <str2>,...,<strX>]
from the following string:
'%%<str1>%%_Anything_Can_Be_Here_%%<str2>%%'
The following code works, but if the number of '%%'s in the line is greater than 2, it takes everything between the first and last set of '%%'.
>>> import re
>>> str = '%%nas_ip_address%%'
>>> re.match('%%(.*)%%', str, re.DOTALL).group(1)
'nas_ip_address'
>>> str = '%%nas_ip_address%%:/vx/%%sfs_storage_pool%%'
>>> re.match('%%(.*)%%', str, re.DOTALL).group(1)
'nas_ip_address%%:/vx/%%sfs_storage_pool'
>>> re.match('%%(.*)%%', str, re.DOTALL).groups()
('nas_ip_address%%:/vx/%%sfs_storage_pool',)
Is there a way to extract ['nas_ip_address', 'sfs_storage_pool'] from the string using regex? I'm looking to parse a very large file, but performance is not an issue as it's not for production.

You can use re.findall() if you want to match multiple results in the same string.
Try this:
import re
str = '%%nas_ip_address%%:/vx/%%sfs_storage_pool%%'
re.findall('%%(.*?)%%', str, re.DOTALL)

This happens because * is greedy by default: it consumes everything up to the end of the string, then backtracks one character at a time until it finds a %%, which is the last one in the string rather than the nearest one.
Two options to prevent it:
use the lazy quantifier *?
even better, if there is no risk of a % occurring between the delimiters, use a negated character class [^%]*.
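To make the difference concrete, here is a minimal sketch that runs the greedy, lazy and negated-class patterns against the string from the question. The variable names are only illustrative, and the negated class assumes a single % never appears inside a placeholder.

```python
import re

text = '%%nas_ip_address%%:/vx/%%sfs_storage_pool%%'

# Greedy: .* runs to the end and backtracks to the LAST %%
greedy = re.findall(r'%%(.*)%%', text)

# Lazy: .*? stops at the first %% it can reach
lazy = re.findall(r'%%(.*?)%%', text)

# Negated class: [^%]* cannot cross a % at all (assumes no % inside a name)
negated = re.findall(r'%%([^%]*)%%', text)

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions return the two names separately, while the greedy one returns a single long match.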

Related

Python: check if string meets specific format

Programming in Python3.
I am having difficulty checking whether a string meets a specific format.
I know that Python does not have a .contains() method like Java, but that we can use regex.
My code will probably look something like this, where lowpan_headers is a dictionary with a field that is a string that should meet a specific format:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?

As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for whether there is an easier way than using the re module, I would say there isn't really one for what you're trying to achieve. The in keyword in Python works the way you would expect a contains method to work for a string, but you actually want to know whether a string is in a correct format. For that, the best and relatively simple solution is a regular expression, and thus the re module.
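As a rough sketch of how that pattern could be wrapped up, assuming lowercase hex only: the names ADDRESS_RE and is_valid_address are made up for illustration. It uses re.fullmatch instead of ^ and $, because $ also matches just before a trailing newline.

```python
import re

# 'bbbb' prefix followed by exactly 28 lowercase hex digits (32 characters total)
ADDRESS_RE = re.compile(r'bbbb[0-9a-f]{28}')

def is_valid_address(value):
    """Return True only if the whole string matches the expected format."""
    return ADDRESS_RE.fullmatch(value) is not None

print(is_valid_address('bbbb00000000000000170d0000306fb6'))
print(is_valid_address('bbbb00000000000000170d0000306fx6'))
```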

Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
    try:
        int(lowpan_headers[4:], 16) # tries interpreting the last 28 characters as hexadecimal
        print('Input is valid!')
    except ValueError:
        print('Invalid Input') # hex test failed!
else:
    print('Invalid Input') # either length test or 'bbbb' prefix test failed!
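One caveat with the int() test: int(text, 16) is fairly lenient and also accepts a leading sign, surrounding whitespace, a 0x prefix and underscores, so a few malformed inputs of the right length can slip through. A stricter regex-free variant might look like this; the helper name and the lowercase-only assumption are mine, not the original poster's.

```python
HEX_DIGITS = set('0123456789abcdef')

def is_valid_address(value):
    """Check length, the 'bbbb' prefix, and that every remaining character is a hex digit."""
    return (
        len(value) == 32
        and value.startswith('bbbb')
        and all(ch in HEX_DIGITS for ch in value[4:])
    )

print(is_valid_address('bbbb00000000000000170d0000306fb6'))
# int() would accept the '+' below, the per-character check does not
print(is_valid_address('bbbb+' + '0' * 27))
```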

In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
    return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex, as you're indeed trying to validate a certain string format. To ensure that your string only has hexadecimal values, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" is the four 'b' characters at the start of your string. I think an easy fix would be to include them as follows: ^(bbbb)?[a-fA-F0-9]+$. Note that the ? makes the prefix optional, so drop it if bbbb must always be present.
>>> import re
>>> pattern = re.compile('^(bbbb)?[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)

As I mentioned in the comment Python does have contains() equivalent.
if "blah" not in somestring:
    continue
(source) (PythonDocs)
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$ - Regex101 Demo with explanation

how to remove or translate multiple strings from strings?

I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. Somehow I can do it one by one, but always failed with '\n'. Is there any efficient way I can remove or translate all these characters or blank lines symbols?
Since my sentences are not long, I do not want to use dictionary methods like the ones in earlier questions.

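Maybe you could use a regex to find all the characters that you want to remove:
import re
s = s.strip()
# remove the literal backslash-n pairs first, while the backslashes are still present
s = s.replace("\\n", "")
r = re.compile(r"""[\[\]()\\"',]""")
s = re.sub(r, '', s)
print(s)
The "\n" sequences are the tricky part: replace them before the regex strips the backslashes, otherwise a stray n is left behind.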
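Since the question also mentions translating, here is a small sketch of the same idea with str.translate in Python 3, which deletes a set of single characters in one pass. The sample string is a shortened stand-in for the real data, and the two-character \n sequence still needs its own replace because translate only works on single characters.

```python
# Shortened stand-in for the string in the question; \\n is a literal backslash + n
raw = '[("Great instructor\\n",), (\'Likes art\\n\',)]'

# Remove the two-character backslash-n sequences first
without_newlines = raw.replace('\\n', '')

# Third argument of str.maketrans lists the characters to delete
table = str.maketrans('', '', '[]()"\'\\,')
cleaned = without_newlines.translate(table)

print(cleaned)
```

Note that deleting commas this way also removes the commas inside the sentences, so leave the comma out of the table if those should survive.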

If the string is a valid Python expression, you can use literal_eval from the ast module to turn it into a list of tuples and then process each tuple:
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
If not, you can use this:
import re
def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        # strip literal backslash-n pairs, quotes and backslashes (a bare n inside the class would delete every letter n)
        yield re.sub(r'\\n|[\"\'\\]', '', part).strip(', ')
''.join(get_part_string(your_string))
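For reference, a self-contained run of the literal_eval route on a shortened, well-formed sample. The sample data is invented to mirror the question's format; the string as posted in the question is cut off before its closing bracket, so it would need completing before literal_eval could parse it.

```python
from ast import literal_eval

# Text representation of a list of 1-tuples, as in the question (shortened)
raw = '[("He knew what he was doing.\\n",), (\'Likes art\\n\',)]'

# literal_eval safely parses Python literals without executing code
rows = literal_eval(raw)

# Each row is a 1-tuple; strip the trailing newline and join the sentences
cleaned = ' '.join(row[0].strip() for row in rows)

print(cleaned)
```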

Matching regex to set

I am looking for a way to match the beginning of a line against a regex and to have the line returned afterwards. The set is quite extensive, which is why I cannot simply use the method given in Python regular expressions matching within set. I was also wondering whether regex is the best solution. I have read http://docs.python.org/3.3/library/re.html but, alas, it does not seem to hold the answer. Here is what I have tried so far...
import re
import os
import itertools
f2 = open(file_path)
unilist = []
bases=['A','G','C','N','U']
patterns= set(''.join(per) for per in itertools.product(bases, repeat=5))
#stuff
if re.match(r'.*?(?:patterns)', line):
    print(line)
    unilist.append(next(f2).strip())
print (unilist)
You see, the problem is that I do not know how to refer to my set...
The file I am trying to match it to looks like:
#SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50 TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+
hhhhhhhhhhghhghhhhhfhhhhhfffffeee[X]b[d[ed`[Y[^Y

You are going about it the wrong way.
You simply leave the set of characters to the regular expression:
re.search('[AGCNU]{5}', line)
matches any 5-character pattern built from those 5 characters; that matches the same 3125 different combinations you generated with your set line, but doesn't need to build all possible combinations up front.
Otherwise, your regular expression attempt had no correlation to your patterns variable, the pattern r'.*?(?:patterns)' would match 0 or more arbitrary characters, followed by the literal text 'patterns'.
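If it helps, here is a short sketch that checks the character class against the generated set and also shows how a set could be referenced from a pattern, by joining its members into an alternation. The sample lines are made up, and re.match is used here because the question asks about the beginning of a line; re.search would find the pattern anywhere in the line.

```python
import itertools
import re

bases = ['A', 'G', 'C', 'N', 'U']
patterns = set(''.join(p) for p in itertools.product(bases, repeat=5))

# Compact form: any 5 characters drawn from the class
char_class = re.compile(r'[AGCNU]{5}')

# Explicit form: one big alternation built from the set itself
alternation = re.compile('|'.join(re.escape(p) for p in sorted(patterns)))

lines = ['AUGNA321354354', 'TTGCCTGCCT', 'xxGAUNAyy']

# re.match anchors at the start of the string, so only the first line qualifies
starts_with_pattern = [line for line in lines if char_class.match(line)]
same_result = [line for line in lines if alternation.match(line)]

print(len(patterns))
print(starts_with_pattern)
```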

According to what I've understood from your question, it seems to me that this could fit your need:
import re
sss = '''dfgsdfAUGNA321354354
!=**$=)"nNNUUG54788
=AkjhhUUNGffdffAAGjhff1245GGAUjkjdUU
.....cv GAUNAANNUGGA'''
print(re.findall(r'^(.+?[AGCNU]{5})', sss, re.MULTILINE))

Python replace with re-using unknown strings

I have an XML in which I'd like to rename one of the tag groups like this:
<string>ABC</string>
<string>unknown string</string>
should be
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
ABC is always the same, so that's no issue. However, "unknown string" is always different, but since I need this information extracted, I also want to keep the same string in the replacement.
Here's what I got so far:
import re
#open the xml file for reading:
file = open('path/file','r+')
#convert to string:
data = file.read()
file.write(re.sub("<string>ABC</string>(\s+)<string>(.*)</string>","<xyz>ABC</xyz>[\1]<xyz>[\2]</xyz>",data))
print (data)
file.close()
I tried to use capture groups, but didn't do it correctly. The string is replaced with weird symbols in my XML. Plus, it's printed twice. I have both the unchanged and the changed version in my XML, which I don't want.

The problem you're experiencing is not due to your regex pattern. The backslashes (\) in your strings are escaping the characters that follow them, which results in the weird symbols that you see.
>>> print("hello\1world")
helloworld
>>> print(r"hello\1world")
hello\1world
Always use the raw string notation to define your re patterns.
>>> data = """
... <string>ABC</string>
... <string>unknown string</string>
... """
>>> print(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
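The duplicated content in the file is a separate issue: after file.read() on an 'r+' handle the position is at the end, so write() appends the new text after the old. A possible sketch of the full round trip, with hypothetical function names, rewinds and truncates before writing:

```python
import re

def rename_tags(xml_text):
    """Rename the ABC tag pair and the tag that follows it, keeping the captured text."""
    pattern = r"<string>ABC</string>(\s+)<string>(.*)</string>"
    replacement = r"<xyz>ABC</xyz>\1<xyz>\2</xyz>"
    return re.sub(pattern, replacement, xml_text)

def rewrite_file(path):
    """Replace the file's content in place instead of appending to it."""
    with open(path, 'r+') as handle:
        data = handle.read()
        handle.seek(0)                    # go back to the start before writing
        handle.write(rename_tags(data))
        handle.truncate()                 # drop any leftover bytes of the old content

sample = "<string>ABC</string>\n<string>unknown string</string>\n"
print(rename_tags(sample))
```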

Why are you including the content in your replacement operation? All you need to do is:
Replace <string> by <xyz>.
Replace </string> by </xyz>.
It would take two operations but the intent of your code would be clear and you don't need to know what unknown string is.
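In code, the two-step version could be as small as the sketch below. The function name is illustrative, and note that this renames every string tag in the text, not only the pair that follows ABC.

```python
def rename_string_tags(xml_text):
    """Rename every <string> element to <xyz> with two plain replacements."""
    return xml_text.replace('<string>', '<xyz>').replace('</string>', '</xyz>')

sample = "<string>ABC</string>\n<string>unknown string</string>\n"
print(rename_string_tags(sample))
```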

Converting html entities into their values in python

I use this regex on some input,
[^a-zA-Z0-9##]
However this ends up removing lots of html special characters within the input, such as
#227;, #1606;, #1588; (I had to remove the & prefix so that it wouldn't show up as the actual value).
Is there a way that I can convert them to their values so that they will satisfy the regex? I also have no idea why the text decided to be so big.

Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary unicode glyphs, print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).

You can adapt the following script:
import htmlentitydefs
import re
def substitute_entity (match):
    name = match.group (1)
    if name in htmlentitydefs.name2codepoint:
        return unichr (htmlentitydefs.name2codepoint[name])
    elif name.startswith ('#'):
        try:
            return unichr (int (name[1:]))
        except:
            pass
    return '?'
print re.sub ('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
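The scripts above are Python 2 (unichr, htmlentitydefs). On Python 3.4 or later, a shorter route is probably html.unescape from the standard library, which decodes both numeric and named entities. A minimal sketch, with a wrapper name of my own choosing:

```python
import html

def decode_entities(text):
    """Decode numeric and named HTML entities; unknown names are left untouched."""
    return html.unescape(text)

decoded_numeric = decode_entities('&#227;, &#1606;, &#1588;')
decoded_mixed = decode_entities('x &laquo; y &wat; z &#123;')

print(decoded_numeric)
print(decoded_mixed)
```

Unlike the script above, which substitutes '?' for unknown entities, html.unescape leaves something like &wat; as it is.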

Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, #, and #:
[^a-zA-Z0-9##]*|#[0-9A-Za-z]+;
