Split stacked entities using regex re.split in python - python

I am having trouble splitting continuous strings into more reasonable parts:
E.g. 'MarieMüller' should become 'Marie Müller'
So far I've used this, which works if no special characters occur:
' '.join([a for a in re.split(ur'([A-Z][a-z]+)', ''.join(entity)) if a])
This outputs for e.g. 'TinaTurner' -> 'Tina Turner', but doesn't work
for 'MarieMüller', which outputs: 'MarieMüller' -> 'Marie M \utf8 ller'
Now I came accros using regex \p{L}:
' '.join([a for a in re.split(ur'([\p{Lu}][\p{Ll}]+)', ''.join(entity)) if a])
But this produces weird things like:
'JenniferLawrence' -> 'Jennifer L awrence'
Could anyone give me a hand?

If you work with Unicode and need to use Unicode categories, you should consider using PyPi regex module. There, you have support for all the Unicode categories:
>>> import regex
>>> p = regex.compile(ur'(?<=\p{Ll})(?=\p{Lu})')
>>> test_str = u"Tina Turner\nMarieM\u00FCller\nJacek\u0104cki"
>>> result = p.sub(u" ", test_str)
>>> result
u'Tina Turner\nMarie M\xfcller\nJacek \u0104cki'
^ ^ ^
Here, the (?<=\p{Ll})(?=\p{Lu}) regex finds all locations between the lower- (\p{Ll}) and uppercase (\p{Lu}) letters, and then the regex.sub inserts a space there. Note that regex module automatically compiles the regex with regex.UNICODE flag if the pattern is a Unicode string (u-prefixed).

It won't work for extended character
You can use re.sub() for this. It will be much simpler
(?=(?!^)[A-Z])
For handling spaces
print re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', ' Tina Turner'.strip())
For handling cases of consecutive capital letters
print re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', ' TinaTXYurner'.strip())
Ideone Demo
Regex Breakdown
(?= #Lookahead to find all the position of capital letters
(?!^) #Ignore the first capital letter for substitution
[A-Z]
)

Using a function constructed of Python's string operations instead of regular expressions, this should work:
def split_combined_words(combined):
separated = [combined[1]]
for letter in combined[1:]:
print letter
if (letter.islower() or (letter.isupper() and separated[-1].isupper())):
separated.append(letter)
else:
separated.extend((" ", letter))
return "".join(separated)

Related

Add space between Persian numeric and letter with python re

I want to add space between Persian number and Persian letter like this:
"سعید123" convert to "سعید 123"
Java code of this procedure is like below.
str.replaceAll("(?<=\\p{IsDigit})(?=\\p{IsAlphabetic})", " ").
But I can't find any python solution.
There is a short regex which you may rely on to match boundary between letters and digits (in any language):
\d(?=[^_\d\W])|[^_\d\W](?=\d)
Live demo
Breakdown:
\d Match a digit
(?=[^_\d\W]) Preceding a letter from a language
| Or
[^_\d\W] Match a letter from a language
(?=\d) Preceding a digit
Python:
re.sub(r'\d(?![_\d\W])|[^_\d\W](?!\D)', r'\g<0> ', str, flags = re.UNICODE)
But according to this answer, this is the right way to accomplish this task:
re.sub(r'\d(?=[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی])|[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
I am not sure if this is a correct approach.
import re
k = "سعید123"
m = re.search("(\d+)", k)
if m:
k = " ".join([m.group(), k.replace(m.group(), "")])
print(k)
Output:
123 سعید
You may use
re.sub(r'([^\W\d_])(\d)', r'\1 \2', s, flags=re.U)
Note that in Python 3.x, re.U flag is redundant as the patterns are Unicode aware by default.
See the online Python demo and a regex demo.
Pattern details
([^\W\d_]) - Capturing group 1: any Unicode letter (literally, any char other than a non-word, digit or underscore chars)
(\d) - Capturing group 2: any Unicode digit
The replacement pattern is a combination of the Group 1 and 2 placeholders (referring to corresponding captured values) with a space in between them.
You may use a variation of the regex with a lookahead:
re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', s)
See this regex demo.

Regex subbing in Python leads to ASCII characters appearing

I am trying to use regex to replace some issues in some text.
Strings look like this:
a = "Here is a shortString with various issuesWith spacing"
My regex looks like this right now:
new_string = re.sub("[a-z][A-Z]", "\1 \2", a).
This takes those places with missing spaces (there is always a capital letter after a lowercase letter), and adds a space.
Unfortunately, the output looks like this:
Here is a shor\x01 \x02tring with various issue\x01 \x02ith spacing
I want it to look like this:
b = "Here is a short String with various issues With spacing"
It seems that the regex is properly matching the correct instances of things I want to change, but there is something wrong with my substitution. I thought \1 \2 meant replace with the first part of the regex, add a space, and then add the second matched item. But for some reason I get something else?
>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub("([a-z])([A-Z])", r"\1 \2", a)
'Here is a short String with various issues With spacing'
capturing group and backslash escaping was missing.
you can go even further:
>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub('([a-z])([A-Z])', r'\1 \2', a).lower().capitalize()
'Here is a short string with various issues with spacing'
You need to define capturing groups, and use raw string literals:
import re
a = "Here is a shortString with various issuesWith spacing"
new_string = re.sub(r"([a-z])([A-Z])", r"\1 \2", a)
print(new_string)
See the Python demo.
Note that without the r'' prefix Python interpreted the \1 and \2 as characters rather than as backreferences since the \ was parsed as part of an escape sequence. In raw string literals, \ is parsed as a literal backslash.
You can have a try like this:
>>>> import re
>>>> a = "Here is a shortString with various issuesWith spacing"
>>>> re.sub(r"(?<=[a-z])(?=[A-Z])", " ", a)
>>>> Here is a short String with various issues With spacing

regEx: To match two groups of chars

I want a regEx to match some text that contains both alpha and numeric chars. But I do NOT want it to match only alpha or numbers.
E.g. in python:
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
import re
rr = re.compile('([0-9a-z]{8})')
print 'sub=', rr.sub('########', s)
print 'findall=', rr.findall(s)
generates following output:
sub= [########: ########]: STARTED at ########ng job number ########
findall= ['mytaskid', '3fee46d2', 'processi', '10022001']
I want it to be:
sub= [mytaskid: ########]: STARTED at processing job number 10022001
findall= ['3fee46d2']
Any ideas... ??
In this case it's exactly 8 chars always, it would be even more wonderful to have a regEx that doesn't have {8} in it, i.e. it can match even if there are more or less than 8 chars.
-- edit --
Question is more to understand if there is a way to write a regEx such that I can combine 2 patterns (in this case [0-9] and [a-z]) and ensure the matched string matches both patterns, but number of chars matched from each set is variable. E.g. s could also be
s = 'mytaskid 3fee46d2 STARTED processing job number 10022001'
-- answer --
Thanks to all for the answers, all them give me what I want, so everyone gets a +1 and the first one to answer gets the accepted answer. Although jerry explains it the best. :)
If anyone is a stickler for performance, there is nothing to choose from, they're all the same.
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
def testIt(regEx):
from timeit import timeit
s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
assert (re.sub('\\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\\b', '########', s) ==
'[mytaskid: ########]: STARTED at processing job number 10022001'), '"%s" does not work.' % regEx
print 'sub() with \'', regEx, '\': ', timeit('rr.sub(\'########\', s)', number=500000, setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
)
print 'findall() with \'', regEx, '\': ', timeit('rr.findall(s)', setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
)
testIt('\\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\\b')
testIt('\\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\\b')
testIt('\\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\\b')
testIt('\\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\\b')
produced:
sub() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.328042736387
findall() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.350668751542
sub() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.314759661193
findall() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.35618526928
sub() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.322802906619
findall() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.35330467656
sub() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.320779061371
findall() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.347522144274
Try following regex:
\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b
This will match a word containing a digit followed an alphabet or vice versa.
Hence it will cover a complete set of those words which contain at-least one digit and one alphabet.
Note: Although it is not the case with python, I have observed that not all varieties of tools support lookahead and lookbehind. So I prefer to avoid them if possible.
You need to use the look ahead (?=...).
This one matches all words with at least one out of [123] and [abc].
>>> re.findall('\\b(?=[abc321]*[321])[abc321]*[abc][abc321]*\\b', ' 123abc 123 abc')
['123abc']
This way you can do AND for constraints to the same string.
>>> help(re)
(?=...) Matches if ... matches next, but doesn't consume the string.
An other way is to ground it and to say: with one of [abc] and one of [123] means there is at least a [123][abc] or a [abc][123] in the string resulting in
>>> re.findall('\\b[abc321]*(?:[abc][123]|[123][abc])[abc321]*\\b', ' 123abc 123 abc')
['123abc']
Not the most beautiful regular expression, but it works:
\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b
If the format is the same each time, that is:
[########: ########]: STARTED at ########ng job number ########
You can use:
([^\]\s]+)\]
With re.findall, or re.search and getting .group(1) if you use re.search.
[^\]\s]+ is a negated class and will match any character except space (and family) or closing square bracket.
The regex basically looks for characters (except ] or spaces) up until a closing square bracket.
If you want to match any string containing both alpha and numeric characters, you will need a lookahead:
\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b
Used like so:
result = re.search(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', text, re.I)
re.I is for ignorecase.
\b is a word boundary and will match only between a 'word' character and a 'non-word' character (or start/end of string).
(?=[0-9]*[a-z]) is a positive lookahead and makes sure there's at least 1 alpha in the part to be matched.
(?=[a-z]*[0-9]) is a similar lookahead but checks for digits.
You can use more specific regular expression and skip the findall.
import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
mo = re.search(':\s+(\w+)', s)
print mo.group(1)

Match single quotes from python re

How to match the following i want all the names with in the single quotes
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How to extract the name within the single quotes only
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve, this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"

Regex divide with upper-case

I would like to replace strings like 'HDMWhoSomeThing' to 'HDM Who Some Thing' with regex.
So I would like to extract words which starts with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in the fact the first letter of the word Who - and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regex' similar to [A-Z][a-z]+ but without success. The [A-Z][a-z]+ gives me 'Who Some Thing' - without 'HDM' of course.
Any ideas?
Thanks,
Rukki
#! /usr/bin/env python
import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
buf = chunks.popleft()
if len(buf) == 0:
continue
if re.match(r'^[A-Z]$', buf) and len(chunks):
buf += chunks.popleft()
result.append(buf)
print ' '.join(result)
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print ' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX'))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
one liner :
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
May be '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re
def find_stuff(str):
p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
m = p.findall(str)
result = ''
for x in m:
result += x + ' '
print result
find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing

Categories