I need to remove all non-letter characters from the beginning and from the end of a word, but keep them if they appear between two letters.
For example:
'123foo456' --> 'foo'
'2foo1c#BAR' --> 'foo1c#BAR'
I tried using re.sub(), but I couldn't write the regex.
like this?
re.sub('^[^a-zA-Z]*|[^a-zA-Z]*$','',s)
s is the input string.
You could use str.strip for this:
In [1]: import string
In [4]: '123foo456'.strip(string.digits)
Out[4]: 'foo'
In [5]: '2foo1c#BAR'.strip(string.digits)
Out[5]: 'foo1c#BAR'
As Matt points out in the comments (thanks, Matt), this removes digits only. To remove any non-letter character,
Define what you mean by a non-letter:
In [22]: allchars = string.maketrans('', '')
In [23]: nonletter = allchars.translate(allchars, string.letters)
and then strip:
In [18]: '2foo1c#BAR'.strip(nonletter)
Out[18]: 'foo1c#BAR'
With your two examples, I was able to create a regex using Python's non-greedy syntax as described here. I broke up the input into three parts: non-letters, exclusively letters, then non-letters until the end. Here's a test run:
1:[123] 2:[foo] 3:[456]
1:[2] 2:[foo1c#BAR] 3:[]
Here's the regular expression:
^([^A-Za-z]*)(.*?)([^A-Za-z]*)$
And mo.group(2) what you want, where mo is the MatchObject.
To be unicode compatible:
^\PL+|\PL+$
\PL stands for for not a letter
Try this:
re.sub(r'^[^a-zA-Z]*(.*?)[^a-zA-Z]*$', '\1', string);
The round brackets capture everything between non-letter strings at the beginning and end of the string. The ? makes sure that the . does not capture any non-letter strings at the end, too. The replacement then simply prints the captured group.
result = re.sub('(.*?)([a-z].*[a-z])(.*)', '\\2', '23WERT#3T67', flags=re.IGNORECASE)
Related
for string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find "#..'...'" like "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use regex ((#.+)\~(\'.*?\')), it find "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". So how to modify the regex to find the string successfully?
Use non-capturing, non greedy, modifiers on the inner brackets and search for not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything which begins by '#' and neglect ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Here us what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advc
I tried this. Assuming the newline is only one character.
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchks
yttsutcuan99\nSPAM /198975/'
>>> m = re.match('.*bst\?\s(.+)\s', s)
>>> print m.group(1)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
Your regex will match everything between the bst? and the first newline which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although it may not work for some other more specific cases
This is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The $ and ^ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re
BST_RE = re.compile(
r"bst\?.*$\s+^(.*)$",
re.MULTILINE
)
INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
print occurrence
This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'
This is a GSM character set (below). I need to make sure only text containing these
characters will match. If the text contains anything outside this scope if will not match...
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567889#?£_!1$"¥#è
?¤é%ù&ì\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ-=ÑñÅß.>ÜüåÉ/§à¡¿'
This is what I have tried...
#£$¥èéùìòÇ\fØø\nÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./[0-9]:;<=>\?¡[A-Z]ÄÖÑܧ¿[a-z]äöñüà\^\{\}\[~\]\|€
I need a regex that only matches the following
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567889#?£_!1$"¥#è
?¤é%ù&ì\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ-=ÑñÅß.>ÜüåÉ/§à¡¿'
how? Thanks.
UPDATED:
rule = re.compile(r'^[\w#?£!1$"¥#è?¤é%ù&ì\\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ\-=ÑñÅß.>ÜüåÉ/§à¡¿\']+$')
if not rule.search(value):
msg = u"Invalid characters."
raise ValidationError(msg)
Try
r'^[\w#?£!1$"¥#è?¤é%ù&ì\\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ\-=ÑñÅß.>ÜüåÉ/§à¡¿\']+$'
If you want to match the above characters within a string which also contains other characters then remove the leading ^ and trailing $.
Note that the above will not allow space characters. If you want to include them just add a space (or add \s if you want to include newlines also) to the set.
An alternative approach without using regular expressions:
>>> valid_chars = set(u'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567889#?£_!1$"¥#è?¤é%ù&ì\ò(Ç)*:Ø+;ÄäøÆ,<LÖlöæ-=ÑñÅß.>ÜüåÉ/§à¡¿\'')
>>> tests = ['hello', u'£_!', u'Ϡ']
>>> [len(set(t).difference(valid_chars)) == 0 for t in tests]
[True, True, False]
How to match the following i want all the names with in the single quotes
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How to extract the name within the single quotes only
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve, this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
I have a cell array in matlab
columns = {'MagX', 'MagY', 'MagZ', ...
'AccelerationX', 'AccelerationX', 'AccelerationX', ...
'AngularRateX', 'AngularRateX', 'AngularRateX', ...
'Temperature'}
I use these scripts which make use of matlab's hdf5write function to save the array in the hdf5 format.
I then read in the the hdf5 file into python using pytables. The cell array comes in as a numpy array of strings. I convert to a list and this is the output:
>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
'MagZ\x00\x00\x00\x00\x001',
'AccelerationX',
'AccelerationY',
'AccelerationZ',
'AngularRateX',
'AngularRateY',
'AngularRateZ',
'Temperature']
These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear on the first three items of the list and I need a nice way to deal with them or to find out why they are there in the first place.
>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
I've tried using a regular expression to remove the hex values but have little luck.
>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'
Any suggestions on how to deal with this?
You can remove all non-word characters in the following way:
>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
The regex [^\w] will match any character that is not a letter, digit, or underscore. By providing that regex in re.sub with an empty string as a replacement you will delete all other characters in the string.
Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters that you want to keep that excludes control characters. For example:
>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems more clear to you.
To exclude all characters after this first control character just add a .*, like this:
>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
They're not actually in the string: you have unescaped control characters, which Python displays using the hexadecimal notation - that's why you see a unusual symbol when you print the value.
You should simply be able to remove the extra levels of quoting in your regular expression but you might also simply rely on something like the regexp module's generic whitespace class, which will match whitespace characters other than tabs and spaces:
>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar
I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:
>>> re.sub(r'[\xa0\s]+', ' ', input_str)
You can also do this without importing re. E.g. if you're content to keep only ascii characters:
good_string = ''.join(c if ord(c) < 129 else '?' for c in bad_string)