Padding ascii characters with spaces in a mix unicode-ascii string - python

Given a mixed string of unicode and ascii chars, e.g.:
它看灵魂塑Nike造得和学问同等重要。
The goal is to pad the ascii substrings with spaces, i.e.:
它看灵魂塑 Nike 造得和学问同等重要。
I've tried using the ([^[:ascii:]]) regex, it looks fine in matching the substrings, e.g. https://regex101.com/r/FVHhU1/1
But in code, the substitution with ' \1 ' is not achieving the desired output.
>>> import re
>>> patt = re.compile('([^[:ascii:]])')
>>> s = u'它看灵魂塑Nike造得和学问同等重要。'
>>> print (patt.sub(' \1 ', s))
它看灵魂塑Nike造得和学问同等重要。
How to pad ascii characters with spaces in a mix unicode-ascii string?

The pattern should be:
([\x00-\x7f]+)
So you can use:
patt = re.compile('([\x00-\x7f]+)')
patt.sub(r' \1 ',s)
This generates:
>>> print(patt.sub(r' \1 ',s))
它看灵魂塑 Nike 造得和学问同等重要。
ASCII is defined as a range of characters with hex codes between 00 and 7f. So we define such a range as [\x00-\x7f], use + to denote one or more, and replace the matching group with r' \1 ' to add two spaces.

Related

remove only consecutive special characters but keep consecutive [a-zA-Z0-9] and single characters

How can I remove multiple consecutive occurrences of all the special characters in a string?
I can get the code like:
re.sub('\.\.+',' ',string)
re.sub('##+',' ',string)
re.sub('\s\s+',' ',string)
for individual and in best case, use a loop for all the characters in a list like:
from string import punctuation
for i in punctuation:
to = ('\\' + i + '\\' + i + '+')
string = re.sub(to, ' ', string)
but I'm sure there is an effective method too.
I tried:
re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y.')
but it removes all the special characters except one preceded by alphabets.
string can have different consecutive special characters like 99#aaaa*!##$. but not same like ++--....
A pattern to match all non-alphanumeric characters in Python is [\W_].
So, all you need is to wrap the pattern with a capturing group and add \1+ after it to match 2 or more consecutive occurrences of the same non-alphanumeric characters:
text = re.sub(r'([\W_])\1+',' ',text)
In Python 3.x, if you wish to make the pattern ASCII aware only, use the re.A or re.ASCII flag:
text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
Mind the use of the r prefix that defines a raw string literal (so that you do not have to escape \ char).
See the regex demo. See the Python demo:
import re
text = "\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y."
print(re.sub(r'([\W_])\1+',' ',text))
Output:
.AAA.x. +*##= xx000 x .x
x*+Y.

Adding space between characters of a string containing special accented character

Is there a way to add space between the characters of a string such as the following: 'abakə̃tə̃'?
The usual ' '.join('abakə̃tə̃') approach returns 'a b a k ə ̃ t ə ̃', I am looking for 'a b a k ə̃ t ə̃'.
Thanks in advance.
You can use re.findall with a pattern that matches a word character optionally followed by an non-word character (which matches an accent):
import re
s = 'abakə̃tə̃'
print(' '.join(re.findall(r'\w\W?', s)))
For Python 3.7+, where zero-width patterns are allowed in re.split, you can use a lookahead and a lookbehind pattern split the string at positions that are followed by a word character and preceded by any character:
print(' '.join(re.split(r'(?<=.)(?=\w)', s)))
Both of the above would output:
a b a k ə̃ t ə

Split stacked entities using regex re.split in python

I am having trouble splitting continuous strings into more reasonable parts:
E.g. 'MarieMüller' should become 'Marie Müller'
So far I've used this, which works if no special characters occur:
' '.join([a for a in re.split(ur'([A-Z][a-z]+)', ''.join(entity)) if a])
This outputs for e.g. 'TinaTurner' -> 'Tina Turner', but doesn't work
for 'MarieMüller', which outputs: 'MarieMüller' -> 'Marie M \utf8 ller'
Now I came accros using regex \p{L}:
' '.join([a for a in re.split(ur'([\p{Lu}][\p{Ll}]+)', ''.join(entity)) if a])
But this produces weird things like:
'JenniferLawrence' -> 'Jennifer L awrence'
Could anyone give me a hand?
If you work with Unicode and need to use Unicode categories, you should consider using PyPi regex module. There, you have support for all the Unicode categories:
>>> import regex
>>> p = regex.compile(ur'(?<=\p{Ll})(?=\p{Lu})')
>>> test_str = u"Tina Turner\nMarieM\u00FCller\nJacek\u0104cki"
>>> result = p.sub(u" ", test_str)
>>> result
u'Tina Turner\nMarie M\xfcller\nJacek \u0104cki'
^ ^ ^
Here, the (?<=\p{Ll})(?=\p{Lu}) regex finds all locations between the lower- (\p{Ll}) and uppercase (\p{Lu}) letters, and then the regex.sub inserts a space there. Note that regex module automatically compiles the regex with regex.UNICODE flag if the pattern is a Unicode string (u-prefixed).
It won't work for extended character
You can use re.sub() for this. It will be much simpler
(?=(?!^)[A-Z])
For handling spaces
print re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', ' Tina Turner'.strip())
For handling cases of consecutive capital letters
print re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', ' TinaTXYurner'.strip())
Ideone Demo
Regex Breakdown
(?= #Lookahead to find all the position of capital letters
(?!^) #Ignore the first capital letter for substitution
[A-Z]
)
Using a function constructed of Python's string operations instead of regular expressions, this should work:
def split_combined_words(combined):
separated = [combined[1]]
for letter in combined[1:]:
print letter
if (letter.islower() or (letter.isupper() and separated[-1].isupper())):
separated.append(letter)
else:
separated.extend((" ", letter))
return "".join(separated)

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
I explain a little bit based on #eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
print re.split(r'(?<=\D),\s*|\s*,(?=\D)',d)
re.split(pattern, string) will split string by the occurrences of regex pattern.
(plz read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two part: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can be appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbebind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace and followed by non-digit characters. Similarly, (?<=\D),\s* matches a , that is preceded by non-digit characters and followed by zero or more whitespace. The whole regex will find , that satisfy either case, which is equivalent to your requirement: ',' with only two non-numeric characters on either side.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

Extracting substrings between single quotes

I am new in python and trying to extract substrings between single quotes. Do you know how to do this with regex?
E.G input
text = "[(u'apple',), (u'banana',)]"
I want to extract apple and banana as list items like ['apple', 'banana']
In the general case, to extract any chars in between single quotes, the most efficient regex approach is
re.findall(r"'([^']*)'", text) # to also extract empty values
re.findall(r"'([^']+)'", text) # to only extract non-empty values
See the regex demo.
Details
' - a single quote (no need to escape inside a double quote string literal)
([^']*) - a capturing group that captures any 0+ (or 1+ if you use + quantifier) chars other than ' (the [^...] is a negated character class that matches any chars other than those specified in the class)
' - a closing single quote.
Note that re.findall only returns captured substrings if capturing groups are specified in the pattern:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Python demo:
import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"'([^']*)'", text))
# => ['apple', 'banana']
Escaped quote support
If you need to support escaped quotes (so as to match abc\'def in 'abc\'def' you will need a regex like
re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # in case the text contains only "valid" pairs of quotes
re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # if your text is too messed up and there can be "wild" single quotes out there
See regex variation 1 and regex variation 2 demos.
Pattern details
(?<!\\) - a negative lookbehind that fails the match if there is a backslash immediately to the left of the current position
(?:\\\\)* - 0 or more consecutive double backslashes (since these are not escaping the neighboring character)
' - an open '
([^'\\]*(?:\\.[^'\\]*)*) - Group 1 (what will be returned by re.findall)matching...
[^'\\]* - 0 or more chars other than ' and \
(?: - start of a non-capturing group that matches
\\. - any escaped char (a backslash and any char including line breaks due to the re.DOTALL modifier)
[^'\\]* - 0 or more chars other than ' and \
)* - ... zero or more times
' - a closing '.
See another Python demo:
import re
text = r"[(u'apple',), (u'banana',)] [(u'apple',), (u'banana',), (u'abc\'def',)] \\'abc''def' \\\'abc 'abc\\\\\'def'"
print(re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text))
# => apple, banana, apple, banana, abc\'def, abc, def, abc\\\\\'def
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"\(u'(.*?)',\)", text)
['apple', 'banana']
text = "[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"
print(re.findall(r"\(u'(.*?)',\)", text)[0])
this string contains' an escaped quote mark and \ an escaped slash
You may alternatively use ast.literal_eval then extract the first item by list comprehension:
from ast import literal_eval
text = "[(u'apple',), (u'banana',)]"
literal_eval(text)
Out[3]: [(u'apple',), (u'banana',)]
[t[0] for t in literal_eval(text)]
Out[4]: [u'apple', u'banana']

Categories