Regex to match GSM character set - python

This is a GSM character set (below). I need to make sure only text containing these
characters will match. If the text contains anything outside this set, it should not match.
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567889#?£_!1$"¥#è
?¤é%ù&ì\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ-=ÑñÅß.>ÜüåÉ/§à¡¿'
This is what I have tried...
#£$¥èéùìòÇ\fØø\nÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./[0-9]:;<=>\?¡[A-Z]ÄÖÑܧ¿[a-z]äöñüà\^\{\}\[~\]\|€
I need a regex that only matches the following
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567889#?£_!1$"¥#è
?¤é%ù&ì\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ-=ÑñÅß.>ÜüåÉ/§à¡¿'
how? Thanks.
UPDATED:
rule = re.compile(r'^[\w#?£!1$"¥#è?¤é%ù&ì\\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ\-=ÑñÅß.>ÜüåÉ/§à¡¿\']+$')
if not rule.search(value):
    msg = u"Invalid characters."
    raise ValidationError(msg)

Try
r'^[\w#?£!1$"¥#è?¤é%ù&ì\\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ\-=ÑñÅß.>ÜüåÉ/§à¡¿\']+$'
If you want to match the above characters within a string which also contains other characters then remove the leading ^ and trailing $.
Note that the above will not allow space characters. If you want to include them just add a space (or add \s if you want to include newlines also) to the set.
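One caveat with \w: in Python 3, \w in a str pattern also matches letters outside the set above (for example Ϡ) unless re.ASCII is used. A minimal sketch that lists every allowed character explicitly instead is shown below. The names GSM_CHARS, GSM_RE and is_gsm are made up for illustration, and the character list is copied from the question rather than from the GSM specification.

```python
import re

# Allowed characters, copied (de-duplicated) from the question; this is an
# assumption about what the asker wants, not the official GSM 03.38 table.
GSM_CHARS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "#?£_!$\"¥è¤é%ù&ì\\ò(Ç)*:Ø+;ÄäøÆ,<Ööæ-=ÑñÅß.>ÜüåÉ/§à¡¿'"
)

# re.escape takes care of characters that are special inside [...],
# such as '-', ']' and the backslash.
GSM_RE = re.compile("[" + re.escape(GSM_CHARS) + "]+")

def is_gsm(text):
    """Return True if text is non-empty and uses only the characters above."""
    return GSM_RE.fullmatch(text) is not None

print([is_gsm(t) for t in ["hello", "£_!", "Ϡ"]])
```

As in the answer above, the space character is not in the set; add it to GSM_CHARS if spaces should be accepted.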

An alternative approach without using regular expressions:
>>> valid_chars = set(u'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567889#?£_!1$"¥#è?¤é%ù&ì\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ-=ÑñÅß.>ÜüåÉ/§à¡¿\'')
>>> tests = ['hello', u'£_!', u'Ϡ']
>>> [len(set(t).difference(valid_chars)) == 0 for t in tests]
[True, True, False]

Related

Regex matching: Case insensitive German words with spaces (Python)

I have a problem where I want to match any number of German words inside [] brackets, ignoring case. The expression should only match spaces and words, nothing else, i.e. no punctuation marks or parentheses.
E.g :
The expression ['über das thema schreibt'] should be matched with ['Über', 'das', 'Thema', 'schreibt']
I have one list with items of the former order and another with the latter order, as long as the words are same, they both should match.
The code I tried with is -
regex = re.findall('[(a-zA-Z_äöüÄÖÜß\s+)]', str(term))
or
re.findall('[(\S\s+)]', str(term))
But they are not working. Kindly help me find a solution
In the simplest form, \w+ works for finding words (it needs the Unicode flag for non-ASCII characters in Python 2), but since you want them to be within the square brackets (and quotes, I assume), you need something a bit more complex:
\[(['\"])((\w+\s?)+)\1\]
\[ and \] are used to match the square brackets
['\"] matches either quote and the \1 makes sure the same quote is on the other end
\w+ captures 1 word. The \s? is for an optional space.
The whole string is in the second group which you can split to get the list
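import re
text = "['über das thema schreibt']"
regex = re.compile(r"\[(['\"])((\w+\s?)+)['\"]\]", flags=re.U)
match = regex.match(text)
if match:
    print(match.group(2).split())
(slight edit as \1 did not seem to work in the terminal for me; in a non-raw string literal, "\1" is an octal escape for the character chr(1), so the backreference only works in a raw string or when written as "\\1")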
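For completeness, here is a sketch of the \1 variant written as a raw string, so that the backreference reaches the regex engine. The names BRACKETED and bracketed_words are made up for illustration, and Python 3 is assumed (str patterns are Unicode-aware by default there).

```python
import re

# Raw string: \1 is a backreference to group 1 (the opening quote),
# not the control character chr(1).
BRACKETED = re.compile(r"\[(['\"])((?:\w+\s?)+)\1\]")

def bracketed_words(text):
    """Return the words inside ['...'] or ["..."], or None if nothing matches."""
    match = BRACKETED.match(text)
    return match.group(2).split() if match else None

print(bracketed_words("['über das thema schreibt']"))
# ['über', 'das', 'thema', 'schreibt']
```

With the backreference in place, a string whose opening and closing quotes differ is rejected.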
I found the easiest solution to it:
for a, b in zip(list1, list2):
    reg_a = re.findall(r'[(\w\s+)]', str(a).lower())
    reg_b = re.findall(r'[(\w\s+)]', str(b).lower())
    if reg_a == reg_b:
        return True
    else:
        return False
Updated based on comments to match each word. This simply ignores spaces, single quotes and square braces
import re
text = "['über das thema schreibt']"
re.findall("([a-zA-Z_äöüÄÖÜß]+)", str(text))
# ['über', 'das', 'thema', 'schreibt']
If you are trying to solve the case-sensitivity issue, add the regex flag re.IGNORECASE,
like
re.findall('[(\S\s+)]', str(term),re.IGNORECASE)
If that does not help, you might need to consider converting the strings to Unicode.
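Putting the ideas from these answers together, a case-insensitive comparison of the two list shapes from the question could be sketched as follows. The helper names normalized_words and same_words are hypothetical, Python 3 is assumed, and \w+ also accepts digits and underscores, which may be more permissive than wanted.

```python
import re

def normalized_words(term):
    # str(term) turns a list into its printed form; \w+ then picks out the
    # runs of word characters and casefold() removes case differences.
    return [w.casefold() for w in re.findall(r"\w+", str(term))]

def same_words(a, b):
    """True if both terms contain the same words in the same order."""
    return normalized_words(a) == normalized_words(b)

print(same_words(['über das thema schreibt'],
                 ['Über', 'das', 'Thema', 'schreibt']))
# True
```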

how to use python regex find matched string?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find "#..'...'" parts like "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex to find these strings successfully?
Use non-capturing groups for the inner parentheses and, instead of a greedy .+, match anything except the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
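The difference between the greedy pattern from the question and a bounded character class can be seen in a small sketch (the variable names are illustrative only):

```python
import re

s = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Greedy '.+' runs to the end of the string and backtracks only as far as
# the LAST "~'", so both attributes end up in a single match.
greedy = re.findall(r"#.+~'.*?'", s)

# '[^~]+' cannot cross a '~' and "[^']*" cannot cross a quote, so each
# match stays inside one attribute.
bounded = re.findall(r"#[^~]+~'[^']*'", s)

print(len(greedy))   # 1
print(bounded)
# ["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
```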
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# match everything that begins with '#' up to, but not including, the next ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

Get the last 4 characters of a string as long as they are special characters

I have web URLs that look like this:
http://example.com/php?id=2/*
http://example.com/php?id=2'
http://example.com/php?id=2*/"
What I need to do is grab the last characters of the string, I've tried:
for url in html_page:
    syntax = list(url)[-1]
    # <= *
    # <= '
    # etc...
However, this only grabs the last character of the string. Is there a way to grab the last few characters as long as they are special characters?
Use a regex. Assuming that by "special character" you mean "anything besides A-Za-z0-9_" (\W treats the underscore as a word character too):
>>> import re
>>> re.search(r"\W+$", "http://example.com/php?id=2*/'").group()
"*/'"
\W+ matches one or more "non-word" characters, and $ anchors the search to the end of the string.
Use a regular expression?
import re
addr = "http://example.com/php?id=2*/"
chars = re.search(r"[*./_]{0,4}$", addr).group()  # pattern first, then the string to search
Characters you want to match are the ones between the [] brackets. You may want to add or remove characters depending on what you expect to encounter.
For example, you would (probably) not want to match the '=' character in your example URLs, which the other answer would match.
{0,4} means to match 0-4 characters (defaults to being greedy)
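Combining the two answers, a small helper that returns at most the last four special characters might look like this. It is only a sketch: the name trailing_specials and the choice to treat everything outside A-Za-z0-9 as "special" are assumptions.

```python
import re

def trailing_specials(url, max_len=4):
    """Return up to max_len trailing characters that are not ASCII letters or digits."""
    # {0,N} lets the match be empty, so re.search always finds something
    # and .group() is safe to call.
    pattern = r"[^A-Za-z0-9]{0,%d}$" % max_len
    return re.search(pattern, url).group()

for url in ["http://example.com/php?id=2/*",
            "http://example.com/php?id=2'",
            "http://example.com/php?id=2*/\""]:
    print(trailing_specials(url))
```

If a URL ends in more than max_len special characters, only the last max_len are returned; a URL that ends in a letter or digit gives an empty string.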

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(Tt)
'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'
As you can see, some words are repeated inside the <\" \"> parts.
My question is: how do I delete those repeated parts (delimited by the characters named above)?
The result should be like:
'This is a string, It should be changed to a nummber.'
Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it matches as few characters as possible between <\" and \">.
James's solution works only when each delimiter is a single character (< and >). In that case it is possible to use a negated class like [^>]. If you want to remove a substring delimited by character sequences (e.g. begin and end), you should use non-greedy regular expressions (i.e. .*?).
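A short sketch of the greedy versus non-greedy behaviour, using the string from the question plus a made-up begin/end example:

```python
import re

Tt = 'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'

# Non-greedy: each <"..."> is removed on its own.
non_greedy = re.sub(r'<".*?">', '', Tt)

# Greedy: one match runs from the first <" to the last ">.
greedy = re.sub(r'<".*">', '', Tt)

# Multi-character delimiters (a hypothetical example) work the same way.
multi = re.sub(r'begin.*?end', '', 'keep begin drop end keep')

print(non_greedy)  # This is a string, It should be changed to a nummber.
print(greedy)      # This is a a nummber.
```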
I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(re.sub("<[^<]+>", "", Tt))
#Out: This is a string, It should be changed to a nummber.
Ah - similar to Igor's post, he beat me to it by a bit. Rather than making the expression non-greedy, I don't match an expression if it contains another start tag "<" in it, so it will only match a start tag that's followed by an end tag ">".

Remove non-letter characters from beginning and end of a string

I need to remove all non-letter characters from the beginning and from the end of a word, but keep them if they appear between two letters.
For example:
'123foo456' --> 'foo'
'2foo1c#BAR' --> 'foo1c#BAR'
I tried using re.sub(), but I couldn't write the regex.
like this?
re.sub('^[^a-zA-Z]*|[^a-zA-Z]*$','',s)
s is the input string.
You could use str.strip for this:
In [1]: import string
In [4]: '123foo456'.strip(string.digits)
Out[4]: 'foo'
In [5]: '2foo1c#BAR'.strip(string.digits)
Out[5]: 'foo1c#BAR'
As Matt points out in the comments (thanks, Matt), this removes digits only. To remove any non-letter character, first define what you mean by a non-letter (this is Python 2 code; string.letters and string.maketrans are not available in Python 3):
In [22]: allchars = string.maketrans('', '')
In [23]: nonletter = allchars.translate(allchars, string.letters)
and then strip:
In [18]: '2foo1c#BAR'.strip(nonletter)
Out[18]: 'foo1c#BAR'
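In Python 3 the same str.strip idea can be sketched without maketrans. This assumes ASCII-only input, as in the examples, and the name nonletter is reused from above purely for illustration.

```python
import string

# Every ASCII character that is not a letter; str.strip removes any of
# these from both ends and stops at the first letter it meets.
nonletter = ''.join(
    chr(i) for i in range(128) if chr(i) not in string.ascii_letters
)

print('123foo456'.strip(nonletter))   # foo
print('2foo1c#BAR'.strip(nonletter))  # foo1c#BAR
```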
With your two examples, I was able to create a regex using Python's non-greedy syntax as described here. I broke up the input into three parts: non-letters, exclusively letters, then non-letters until the end. Here's a test run:
1:[123] 2:[foo] 3:[456]
1:[2] 2:[foo1c#BAR] 3:[]
Here's the regular expression:
^([^A-Za-z]*)(.*?)([^A-Za-z]*)$
mo.group(2) is what you want, where mo is the MatchObject.
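To be Unicode compatible (note that Python's built-in re module does not support \PL; the third-party regex module supports Unicode property classes such as \P{L}):
^\PL+|\PL+$
\PL stands for "not a letter"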
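With only the standard library, one way to approximate "not a letter" in Python 3 is the class [\W\d_]: anything that is not a word character, plus digits and the underscore. This is a sketch rather than an exact equivalent of \PL (a few non-letter characters that \w accepts, such as superscript digits, are still treated as letters), and the names below are made up.

```python
import re

# Runs of non-letters anchored at the start or at the end of the string.
NON_LETTER_EDGES = re.compile(r'^[\W\d_]+|[\W\d_]+$')

def strip_non_letters(word):
    """Remove non-letter characters from both ends, keeping inner ones."""
    return NON_LETTER_EDGES.sub('', word)

print(strip_non_letters('123foo456'))   # foo
print(strip_non_letters('2foo1c#BAR'))  # foo1c#BAR
print(strip_non_letters('12über!?'))    # über
```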
Try this:
re.sub(r'^[^a-zA-Z]*(.*?)[^a-zA-Z]*$', r'\1', string)
The round brackets capture everything between non-letter strings at the beginning and end of the string. The ? makes sure that the . does not capture any non-letter strings at the end, too. The replacement then simply prints the captured group.
result = re.sub('(.*?)([a-z].*[a-z])(.*)', '\\2', '23WERT#3T67', flags=re.IGNORECASE)
