Substitute all non letters/numbers with regex

I have the string 'hello (new)' and I would like to remove everything that is not a letter or a number. One way to do this is by finding all letters and digits and joining them:
>>> ''.join(re.findall(r'[a-zA-Z0-9]', 'hello (new)'))
'hellonew'
How would I do the reverse, that is, substitute every character that is not a letter or a number with ''? So far I have:
>>> re.sub(r'^[a-zA-Z0-9]+', '', 'hello (new)')
' (new)'
But it's off a bit.

You should use a negated character class instead and remove the anchor at the front (note that the flag must be passed by keyword):
re.sub(r'[^a-z0-9]+', '', 'hello (new)', flags=re.IGNORECASE)
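A short sketch of why the flag belongs in a keyword argument: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag given in fourth position is read as count. The sample text and variable names below are illustrative assumptions, not part of the question.

```python
import re

text = 'Hello (New) World!'   # illustrative input, not from the question
pattern = r'[^a-z0-9]+'

# re.IGNORECASE is an integer flag with value 2, so passing it in the
# fourth position is equivalent to count=2 with case-sensitive matching.
as_count = re.sub(pattern, '', text, count=int(re.IGNORECASE))
print(as_count)   # elloew) World!

# Passed by keyword, the flag makes a-z cover uppercase letters too.
as_flags = re.sub(pattern, '', text, flags=re.IGNORECASE)
print(as_flags)   # HelloNewWorld
```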

You could match any non-word character 1+ times using \W+.
Note that \w also matches the underscore, so \W+ leaves underscores in place. If you want to remove them as well, you could use the character class [\W_]+.
Python demo
import re
print(re.sub(r'\W+', '', 'hello (new)'))
Output
hellonew
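For comparison, here is a small sketch of how the three patterns mentioned in the answers behave on input that contains an underscore and a non-ASCII letter. The sample strings are assumptions chosen for illustration.

```python
import re

samples = ['hello_world (new) 42', 'café (new)']   # illustrative inputs

for s in samples:
    ascii_only = re.sub(r'[^a-zA-Z0-9]+', '', s)       # keeps ASCII letters and digits only
    non_word = re.sub(r'\W+', '', s)                   # keeps Unicode letters, digits and _
    non_word_or_underscore = re.sub(r'[\W_]+', '', s)  # like \W+, but drops _ too
    print(ascii_only, non_word, non_word_or_underscore)
# helloworldnew42 hello_worldnew42 helloworldnew42
# cafnew cafénew cafénew
```

Which one fits depends on whether underscores and accented letters count as "letters/numbers" for your data.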

Related

Wrap each word with a tag inside a sentence with Python regex and `re.sub`

I want to split a sentence into words, wrap each word in tags and join the string back together.
Example: "Test, abc; text." should become "<span>Test</span>, <span>abc</span>; <span>text</span>."
I've tried to use regex and \b, but I don't understand how \b works.

You can use
import re
text = "Test, abc; text."
print( re.sub(r'\w+', r'<span>\g<0></span>', text) )
# => <span>Test</span>, <span>abc</span>; <span>text</span>.
See the Python demo.
With \w+, you match any chunk of one or more letters, digits, underscores, and some diacritical marks or other connector punctuation chars. The <span>\g<0></span> replacement pattern wraps each match with span tags (\g<0> is a backreference to the whole match).
Note that, in Python 3, \w matches any Unicode letters and digits. In Python 2.x, you'd need to add flags=re.U:
re.sub(r'\w+', r'<span>\g<0></span>', text, flags=re.U)
Or use an inline modifier:
re.sub(r'(?u)\w+', r'<span>\g<0></span>', text)
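Since the question asks how \b works, here is a minimal sketch that makes the zero-width word boundaries visible by replacing each one with a marker. The "|" marker is an arbitrary choice for illustration.

```python
import re

text = "Test, abc; text."

# \b matches the empty position between a word char and a non-word char
# (or the string edge); replace each such position with a visible marker.
print(re.sub(r'\b', '|', text))
# |Test|, |abc|; |text|.

# Because \w+ is greedy, it already spans boundary to boundary,
# so adding \b on both sides does not change the result here.
with_b = re.sub(r'\b\w+\b', r'<span>\g<0></span>', text)
without_b = re.sub(r'\w+', r'<span>\g<0></span>', text)
print(with_b == without_b)   # True
```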

Padding ascii characters with spaces in a mix unicode-ascii string

Given a mixed string of unicode and ascii chars, e.g.:
它看灵魂塑Nike造得和学问同等重要。
The goal is to pad the ascii substrings with spaces, i.e.:
它看灵魂塑 Nike 造得和学问同等重要。
I've tried using the ([^[:ascii:]]) regex, it looks fine in matching the substrings, e.g. https://regex101.com/r/FVHhU1/1
But in code, the substitution with ' \1 ' is not achieving the desired output.
>>> import re
>>> patt = re.compile('([^[:ascii:]])')
>>> s = u'它看灵魂塑Nike造得和学问同等重要。'
>>> print (patt.sub(' \1 ', s))
它看灵魂塑Nike造得和学问同等重要。
How to pad ascii characters with spaces in a mix unicode-ascii string?

The pattern should be:
([\x00-\x7f]+)
So you can use:
patt = re.compile(r'([\x00-\x7f]+)')
patt.sub(r' \1 ', s)
This generates:
>>> print(patt.sub(r' \1 ',s))
它看灵魂塑 Nike 造得和学问同等重要。
ASCII is defined as the range of characters with hex codes between 00 and 7f. So we define such a range as [\x00-\x7f], use + to denote one or more, and replace each match with r' \1 ' to put a space on each side of it.
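Two details are worth spelling out with a small sketch. Python's re module does not support POSIX bracket expressions such as [:ascii:], and a replacement written without the r prefix does not contain a group reference at all. The variable names below are illustrative.

```python
import re

s = '它看灵魂塑Nike造得和学问同等重要。'

# Without the r prefix, '\1' is an octal escape for the character U+0001,
# not a reference to group 1.
print(' \1 ' == ' \x01 ')   # True

ascii_run = re.compile(r'[\x00-\x7f]+')
# \g<0> refers to the whole match, so no capturing group is needed.
padded = ascii_run.sub(r' \g<0> ', s)
print(padded)   # 它看灵魂塑 Nike 造得和学问同等重要。
```

Keep in mind that [\x00-\x7f] also covers ASCII spaces and punctuation, so input that already contains spaces will end up with doubled ones.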

regex. Match all words starting with two letter

I wish to find all words that start with "Am" and this is what I tried so far with python
import re
my_string = "America's mom, American"
re.findall(r'\b[Am][a-zA-Z]+\b', my_string)
but this is the output that I get
['America', 'mom', 'American']
Instead of what I want
['America', 'American']
I know that in regex [Am] means match either A or m, but is it possible to match A and m as well?

The [Am], a positive character class, matches either A or m. To match a sequence of chars, you need to use them one after another.
Remove the brackets:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))
# => ['America', 'American']
See the Python demo
This pattern details:
\b - a word boundary
Am - a string of chars matched as a sequence Am
[a-zA-Z]+ - 1 or more ASCII letters
\b - a word boundary.
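If the prefix is not fixed, the same pattern can be built at run time. This is a sketch under the assumption that the prefix is plain text; the helper name words_starting_with is hypothetical.

```python
import re

def words_starting_with(prefix, text):
    """Return words that begin with `prefix` followed by 1+ ASCII letters."""
    # re.escape guards against prefixes containing regex metacharacters.
    pattern = r'\b' + re.escape(prefix) + r'[a-zA-Z]+\b'
    return re.findall(pattern, text)

print(words_starting_with("Am", "America's mom, American"))
# ['America', 'American']
```

Because of the + quantifier, a bare "Am" with nothing after it is not returned; switch to * if it should be.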

Don't use a character class:
import re
my_string = "America's mom, American"
re.findall(r'\bAm[a-zA-Z]+\b', my_string)

re.findall(r'(Am\w+)', my_string, re.I)

Python how to separate punctuation from text

I want to separate groups of punctuation from the text with spaces.
my_text = "!where??and!!or$$then:)"
I want to have a ! where ?? and !! or $$ then :) as a result.
I wanted something like in Javascript, where you can use $1 to get your matching string. What I have tried so far:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]*', my_text)
Here my_matches is empty so I had to delete \\\ from the expression:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=#@?\^_`{|}~]*', my_text)
I have this result:
['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']
So I delete all the redundant entries like this:
my_matches_distinct = list(set(my_matches))
And I have a better result:
['', '??', ':)', '$$', '!', '!!']
Then I replace every match with itself surrounded by spaces:
for match in my_matches:
    if match != '':
        my_text = re.sub(match, ' ' + match + ' ', my_text)
And of course it's not working! I tried to cast the match to a string, but that doesn't work either... When I pass the string to replace directly, it works, though.
But I think I'm not doing it right, because I will have problems with '!' and '!!', right?
Thanks :)

It is recommended to use raw string literals when defining a regex pattern. Also, do not escape arbitrary symbols inside a character class: only \ must always be escaped, and the others can be positioned so that they need no escaping. Your regex also matches the empty string because of the * quantifier, which is where the empty entries in your result come from; use the + quantifier instead. Finally, if you want to pad these symbols with spaces in your string, use re.sub directly.
import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
See the Python demo
Details: The []!"$%&'()*+,./:;=#@?[\\^_`{|}~-]+ pattern matches 1+ symbols from the set (note that only \ is escaped here, since - is placed at the end and ] at the start of the class). The replacement inserts a space, the whole match (\g<0> is the backreference to the whole match) and another space. Finally, .strip() removes the leading/trailing whitespace once the regex has finished processing the string.
string.punctuation NOTE
Using f"[{string.punctuation}]+" as the pattern is a mistake because it won't match \. Why? The resulting pattern looks like [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+, and the \] part does not match a backslash or a ] separately: it only matches a ], since the \ escapes the ] char.
If you plan to use string.punctuation, you need to escape ] and \ (it would also be correct to escape - and ^, the other chars that are special inside square brackets, but in this case it would be redundant):
import re
from string import punctuation
my_text = "!where??and!!or$$then:)"
pattern = "[" + punctuation.replace('\\','\\\\').replace(']', r'\]') + "]+"
print(re.sub(pattern, r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
See this Python demo.
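As an alternative sketch (a swapped-in technique, not what the answer above uses), re.escape can do the escaping of string.punctuation for you. The second input string is an assumption added to show that \ and ] are matched.

```python
import re
from string import punctuation

# re.escape backslash-escapes every regex-special character, including \ and ].
pattern = "[" + re.escape(punctuation) + "]+"

print(re.sub(pattern, r' \g<0> ', "!where??and!!or$$then:)").strip())
# ! where ?? and !! or $$ then :)

# A backslash and a closing bracket are both treated as punctuation.
print(re.sub(pattern, r' \g<0> ', r"a\b]c"))
# a \ b ] c
```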

Use the sub() function from the re library. You can do this as follows:
import re
my_text = '!where??and!!or$$then:)'
# $ and ? are added to the character class so that every symbol in the sample is covered
print(re.sub(r'([!@#$%\^&\*\(\):;"\',\./\\?]+)', r' \1 ', my_text).strip())
I hope this code solves your problem. If you are familiar with regex, the pattern itself is not a big deal; it is just a matter of using the right function.
Hope this helps! Please comment if you have any queries. :)
References:
Python re library

how remove special characters from the end of every word in a string?

I want it to match only at the end of every word.
example:
"i am test-ing., i am test.ing-, i am_, test_ing,"
output should be:
"i am test-ing i am test.ing i am test_ing"

>>> import re
>>> test = "i am test-ing., i am test.ing-, i am_, test_ing,"
>>> re.sub(r'([^\w\s]|_)+(?=\s|$)', '', test)
'i am test-ing i am test.ing i am test_ing'
This matches one or more non-alphanumeric characters, ([^\w\s]|_)+, followed by either whitespace (\s) or the end of the string ($). The (?= ) construct is a lookahead assertion: it makes sure that the following whitespace is not included in the match, so it doesn't get replaced; only the punctuation matched by ([^\w\s]|_)+ gets replaced.
Okay, but why [^\w\s]|_, you ask? The first part, [^\w\s], matches any character that is neither a word character (letter, digit or underscore) nor whitespace, i.e. punctuation characters. Except we do want to eliminate underscores, so we then include those with |_.
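Wrapped in a reusable helper, the same pattern might look like the sketch below. The names TRAILING_PUNCT and strip_trailing_punct are hypothetical, and the group is made non-capturing since it is never referenced.

```python
import re

# Same pattern as above, with a non-capturing group.
TRAILING_PUNCT = re.compile(r'(?:[^\w\s]|_)+(?=\s|$)')

def strip_trailing_punct(text):
    """Remove runs of punctuation/underscores that end a word."""
    return TRAILING_PUNCT.sub('', text)

print(strip_trailing_punct("i am test-ing., i am test.ing-, i am_, test_ing,"))
# i am test-ing i am test.ing i am test_ing

# Punctuation at the start or in the middle of a word is left alone.
print(strip_trailing_punct("(quoted) word!"))
# (quoted word
```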
