Regex to ignore specific characters - python

I am splitting a text on non-alphanumeric characters and would like to exclude specific characters like apostrophes, dashes/hyphens and commas.
I would like to build a regex for the following cases:
non-alphanumeric characters, excluding apostrophes and hyphens
non-alphanumeric characters, excluding commas, apostrophes and hyphens
This is what I have tried:
import re
def split_text(text):
    my_text = re.split(r'\W', text)
    # the following doesn't work.
    #my_text = re.split('([A-Z]\w*)',text)
    #my_text = re.split("^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$",text)
    return my_text
Case 1:
Sample Input: What's up? It's good to see you my-friend. "Hello" to-the world!.
Sample Output: ["What's", 'up', "It's", 'good', 'to', 'see', 'you', 'my-friend', 'Hello', 'to-the', 'world']
Case 2:
Sample Input: It means that, it's not good-to do such things.
Sample Output: ['It', 'means', 'that,', "it's", 'not', 'good-to', 'do', 'such', 'things']
Any ideas?

Is this what you want?
non-alphanumeric characters, excluding apostrophes and hyphens
my_text = re.split(r"[^\w'-]+",text)
non-alphanumeric characters, excluding commas, apostrophes and hyphens
my_text = re.split(r"[^\w',-]+",text)
Keep the hyphen at the end of the set: written as [^\w-',] it directly follows \w, and Python 3 rejects that as a bad character range.
The [] syntax defines a character class, and [^..] "complements" it, i.e. it negates it.
See the documentation about that:
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.

You can use a negated character class for this:
my_text = re.split(r"[^\w'-]+",text)
or
my_text = re.split(r"[^\w,'-]+",text) # also excludes commas
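To check both patterns against the sample inputs from the question, here is a minimal sketch. The helper name split_words and its keep_commas flag are made up for illustration. It also drops the empty string that re.split leaves behind when the text ends with a delimiter, which the expected outputs in the question do not contain.

```python
import re

def split_words(text, keep_commas=False):
    # Hypothetical helper; the hyphen sits last in the class so it is a literal.
    pattern = r"[^\w',-]+" if keep_commas else r"[^\w'-]+"
    # re.split yields '' at the edges when the text starts or ends with a delimiter.
    return [tok for tok in re.split(pattern, text) if tok]

case1 = split_words('What\'s up? It\'s good to see you my-friend. "Hello" to-the world!.')
case2 = split_words("It means that, it's not good-to do such things.", keep_commas=True)
print(case1)
print(case2)
```

Without the filter, both calls would end with an extra '' element, because each sample ends in punctuation.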

Related

remove only consecutive special characters but keep consecutive [a-zA-Z0-9] and single characters

How can I remove multiple consecutive occurrences of all the special characters in a string?
I can get the code like:
re.sub('\.\.+',' ',string)
re.sub('##+',' ',string)
re.sub('\s\s+',' ',string)
for each character individually or, at best, loop over all the characters in a list like:
from string import punctuation
for i in punctuation:
    to = ('\\' + i + '\\' + i + '+')
    string = re.sub(to, ' ', string)
but I'm sure there is a more efficient method.
I tried:
re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y.')
but it removes all the special characters except single ones preceded by alphanumerics.
The string can contain runs of different consecutive special characters, like 99#aaaa*!##$., which should be kept, but runs of the same character, like ++--...., should be replaced.
A pattern to match all non-alphanumeric characters in Python is [\W_].
So, all you need is to wrap the pattern with a capturing group and add \1+ after it to match 2 or more consecutive occurrences of the same non-alphanumeric characters:
text = re.sub(r'([\W_])\1+',' ',text)
In Python 3.x, if you wish to make the pattern ASCII aware only, use the re.A or re.ASCII flag:
text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
Mind the use of the r prefix that defines a raw string literal (so that you do not have to escape \ char).
Python demo (repr makes the remaining tab and newline visible):
import re
text = "\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y."
print(repr(re.sub(r'([\W_])\1+',' ',text)))
Output:
' .AAA.x. +* = xx000 x \t.x \nx*+Y.'
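To see why the backreference matters, here is a small sketch comparing it with a plain "two or more special characters" pattern on the sample string from the question. The variable names are only illustrative.

```python
import re

sample = "99#aaaa*!##$."

# Backreference: only runs of the SAME special character are replaced.
same_char = re.sub(r'([\W_])\1+', ' ', sample)

# No backreference: any run of two or more special characters is replaced.
any_run = re.sub(r'[\W_]{2,}', ' ', sample)

print(repr(same_char))  # '99#aaaa*! $.'
print(repr(any_run))    # '99#aaaa '
```

The second pattern behaves like the attempt in the question: it swallows the mixed run *!##$. as a whole, while the first one only touches the doubled #.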

Regex to get non-alphanumeric strings between alphanumeric strings

Let's say I have this string:
Alpha+*&Numeric%$^String%%$
I want to get the non-alphanumeric characters that are between alphanumeric characters:
+*& %$^
I have this regex: [^0-9a-zA-Z]+ but it's giving me
+*& %$^ %%$
which includes the trailing non-alphanumeric characters, which I do not want. I have also tried [0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z] but it's giving me
a+*&N c%$^S
which includes the characters a, N, c and S.
If you don't mind including the _ character as alpha-numeric data, you can extract all your non-alpha-numeric-data with this:
some_string = "A+*&N%$^S%%$"
import re
result = re.findall(r'\b\W+\b', some_string) # sets result to: ['+*&', '%$^']
Note my use of \b instead of something like \w or [^\W].
\w and [^\W] each match one character, so if your alpha-numeric string (between the text you want) is exactly one character, then what you think should be the next match won't match.
But since \b is a zero-width "word boundary," it doesn't care how many alpha-numeric characters there are, as long as there is at least one.
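The difference shows up on the shortened string above, where each alphanumeric run is a single character. The sketch below compares a consuming \w pattern with the \b version, and adds a lookaround variant (my own addition, not from the answer) for the case where _ should not count as alphanumeric.

```python
import re

s = "A+*&N%$^S%%$"

# \w on both sides consumes 'N', so the next match cannot start there.
consuming = re.findall(r'\w(\W+)\w', s)

# \b is zero-width, so 'N' stays available as the left edge of the next match.
boundary = re.findall(r'\b\W+\b', s)

# Assumed alternative: lookarounds are zero-width too, and the explicit
# classes treat '_' as a non-alphanumeric character.
lookaround = re.findall(r'(?<=[0-9a-zA-Z])[^0-9a-zA-Z]+(?=[0-9a-zA-Z])', s)

print(consuming)   # ['+*&']
print(boundary)    # ['+*&', '%$^']
print(lookaround)  # ['+*&', '%$^']
```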
The only problem with your second attempt is the location of the + quantifier: it should be inside the parentheses. You can also use the word character class \w and its inverse \W to pull out these items, which is the same as your second regex but treats underscores _ as parts of words:
import re
s = "Alpha+*&Numeric%$^String%%$"
print(re.findall(r"\w(\W+)\w", s)) # adds _ character
print(re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)) # your version fixed
print(re.findall(r"(?i)[0-9A-Z]([^0-9A-Z]+)[0-9A-Z]", s)) # same as above
Output:
['+*&', '%$^']
['+*&', '%$^']
['+*&', '%$^']

Split string at capital letter but only if no whitespace

Set-up
I've got a string of names which need to be separated into a list.
Following this answer, I have,
import re
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)
where the last line gives me,
['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']
Problems
1) Whitespace is ignored
'Prenzlauer Berg' is actually 1 name but the code splits according to the 'split-at-capital-letter' rule.
How can I prevent a split at a capital letter when the preceding character is whitespace?
2) Special characters not handled well
The code used cannot handle 'ö'. How do I include such 'German' characters?
I.e. I want to obtain,
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
You can use positive and negative lookbehinds and just list the umlauts explicitly:
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
(?<!\s)...: matches ... that is not preceded by \s
(?<=\s)...: matches ... that is preceded by \s
(?:...): non-capturing group so as to not mess with the findall results
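If you would rather not list the umlauts in the "lowercase" part, here is a rough sketch of two alternatives. split_names is a hypothetical helper that relies on str.isupper and str.isspace, which are Unicode-aware. The re.split variant assumes Python 3.7 or later, where splitting on a zero-width match is supported, and it still lists the uppercase umlauts.

```python
import re

def split_names(s):
    # Hypothetical helper: cut before every uppercase letter that does not
    # directly follow a whitespace character.
    parts, start = [], 0
    for i in range(1, len(s)):
        if s[i].isupper() and not s[i - 1].isspace():
            parts.append(s[start:i])
            start = i
    parts.append(s[start:])
    return parts

names = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
by_loop = split_names(names)

# Zero-width split: a position preceded by non-whitespace and followed by a capital.
by_regex = re.split(r'(?<=\S)(?=[A-ZÄÖÜ])', names)

print(by_loop)
print(by_regex)
```

Both print ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg'] for the sample string.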
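This works for the sample input:
import re
string="KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
# note: the range a-ü spans U+0061 to U+00FC, so it also matches characters such as '{' and 'Ä'
pattern=r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
#>>>['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']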

Python regex: not more than one special symbol in a row

I'm looking for a way to get those substrings that have no more than one special symbol in a row among the [a-z] characters.
Here is the example:
sp_sym = '/,# '
text1 = 'as for#you' # <- ok
text2 = 'as for# you ' # <- ok
text3 = 'as for##you ' # <- not good
An expression like [a-z(?:/,#){1}] is not working.
The regex below won't match a string in which the same /, , or # symbol appears two or more times in a row:
^(?:(?!([,\/#])\1+)[a-z\W])+$
Instead of searching for strings that do not have two characters in a row, why not search for those that do? Then, your result is all of the other strings.
result = []
for string in (text1, text2, text3):
    if not re.search(r'[/,#]{2,}', string):
        result.append(string)
If you prefer a one-liner:
result = [s for s in (text1,text2,text3) if not re.search(r'[/,#]{2,}', s)]
Try matching the characters followed by anything that is not one of the characters: [/,#][^/,#].
The brackets are sets that match any characters between them, so [/,#] matches / or , or #. But when the first character in the brackets is ^, this negates the set so it matches everything but the characters in the set.
Edit: of course you have to make sure that there is not one of these characters before the pattern as well. So then it becomes: [^/,#][/,#][^/,#]. Now the only problem might be that you cannot match a single special character at the beginning or end of the string. Do you need to match those?
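One possible way around the start/end limitation is to swap the surrounding negated sets for negative lookarounds, which also succeed at the string boundaries. This is a sketch of that idea rather than part of the answer above, and the name SINGLE_SPECIAL is made up.

```python
import re

# A special character that is neither preceded nor followed by another one.
SINGLE_SPECIAL = re.compile(r'(?<![/,#])[/,#](?![/,#])')

samples = ('as for#you', 'as for##you ', '#start', 'end,')
found = {text: SINGLE_SPECIAL.findall(text) for text in samples}

for text, hits in found.items():
    print(repr(text), hits)
```

The doubled ## yields no match at all, while a lone symbol is found even when it is the first or last character of the string.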

How to remove special characters from the end of every word in a string?

I want it to match only at the end of every word.
Example:
"i am test-ing., i am test.ing-, i am_, test_ing,"
output should be:
"i am test-ing i am test.ing i am test_ing"
>>> import re
>>> test = "i am test-ing., i am test.ing-, i am_, test_ing,"
>>> re.sub(r'([^\w\s]|_)+(?=\s|$)', '', test)
'i am test-ing i am test.ing i am test_ing'
This matches one or more characters that are either punctuation ([^\w\s]) or an underscore (_), followed by either whitespace (\s) or the end of the string ($). The (?= ) construct is a lookahead assertion: it makes sure that the matching space is not included in the match, so it doesn't get replaced; only the ([^\w\s]|_)+ part gets replaced.
Okay, but why [^\w\s]|_, you ask? The first part matches any character that is neither a word character (\w: alphanumeric or underscore) nor whitespace (\s), i.e. punctuation characters. Except we do want to eliminate underscores, so we then include those with |_.
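For comparison, here is a regex-free sketch of the same idea. strip_word_ends is a hypothetical helper, and it behaves slightly differently from the pattern above: it only knows the ASCII characters in string.punctuation, it collapses any whitespace into single spaces, and it drops words that consist entirely of punctuation.

```python
import string

def strip_word_ends(text):
    # Hypothetical helper: strip trailing ASCII punctuation from each
    # whitespace-separated word, then drop words that became empty.
    words = (w.rstrip(string.punctuation) for w in text.split())
    return ' '.join(w for w in words if w)

test = "i am test-ing., i am test.ing-, i am_, test_ing,"
cleaned = strip_word_ends(test)
print(cleaned)  # i am test-ing i am test.ing i am test_ing
```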
