Add space between Persian numeric and letter with python re - python

I want to add a space between Persian letters and numbers, like this:
"سعید123" should become "سعید 123".
The Java code for this looks like:
str.replaceAll("(?<=\\p{IsDigit})(?=\\p{IsAlphabetic})", " ")
But I can't find a Python solution.

There is a short regex you can rely on to match the boundary between letters and digits (in any language):
\d(?=[^_\d\W])|[^_\d\W](?=\d)
Breakdown:
\d Match a digit
(?=[^_\d\W]) Followed by a letter (in any language)
| Or
[^_\d\W] Match a letter (in any language)
(?=\d) Followed by a digit
Python:
re.sub(r'\d(?=[^_\d\W])|[^_\d\W](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
But according to this answer, this is the right way to accomplish this task:
re.sub(r'\d(?=[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی])|[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
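To check the generic boundary regex against the example from the question, here is a minimal sketch that wraps it in a helper. The name `space_letters_digits` is made up for illustration; the pattern is the positive-lookahead one from the breakdown above.

```python
import re

# A digit followed by a letter, or a letter followed by a digit.
# [^_\d\W] reads as "a word character that is neither a digit nor an underscore".
BOUNDARY = re.compile(r'\d(?=[^_\d\W])|[^_\d\W](?=\d)')

def space_letters_digits(text):
    """Insert one space at every letter/digit boundary (hypothetical helper)."""
    # \g<0> is the whole match, i.e. the single character before the boundary.
    return BOUNDARY.sub(r'\g<0> ', text)

print(space_letters_digits("سعید123"))
print(space_letters_digits("abc123def"))  # abc 123 def
```

Because both alternatives use a lookahead, a string that ends in a digit or a letter gets no trailing space.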

I am not sure if this is a correct approach.
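import re
k = "سعید123"
# Raw string so that \d is passed to the regex engine unchanged
m = re.search(r"(\d+)", k)
if m:
    # Put the digits first, then the rest of the string without them
    k = " ".join([m.group(), k.replace(m.group(), "")])
print(k)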
Output:
123 سعید

You may use
re.sub(r'([^\W\d_])(\d)', r'\1 \2', s, flags=re.U)
Note that in Python 3.x, re.U flag is redundant as the patterns are Unicode aware by default.
See the online Python demo and a regex demo.
Pattern details
([^\W\d_]) - Capturing group 1: any Unicode letter (literally, any char other than a non-word, digit or underscore chars)
(\d) - Capturing group 2: any Unicode digit
The replacement pattern is a combination of the Group 1 and 2 placeholders (referring to corresponding captured values) with a space in between them.
You may use a variation of the regex with a lookahead:
re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', s)
See this regex demo.
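As a quick check that the two variants behave the same, here is a small sketch that runs both on the name from the question, written this time with Extended Arabic-Indic digits (an input chosen for illustration, not taken from the question).

```python
import re

s = "سعید۱۲۳"  # the same name followed by Persian digits

# Variant 1: capture the letter and the digit, put a space between them.
grouped = re.sub(r'([^\W\d_])(\d)', r'\1 \2', s)

# Variant 2: match only the letter and look ahead for the digit.
lookahead = re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', s)

print(grouped == lookahead)

# Both variants only handle a letter followed by a digit.
digits_first = re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', "123سعید")
```

In Python 3, `\d` matches any Unicode decimal digit for `str` patterns, so Persian digits are covered without extra flags. A digit followed by a letter is left untouched by either variant.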

Related

Adding space between characters of a string containing special accented character

Is there a way to add space between the characters of a string such as the following: 'abakə̃tə̃'?
The usual ' '.join('abakə̃tə̃') approach returns 'a b a k ə ̃ t ə ̃', I am looking for 'a b a k ə̃ t ə̃'.
Thanks in advance.
You can use re.findall with a pattern that matches a word character optionally followed by a non-word character (which matches an accent):
import re
s = 'abakə̃tə̃'
print(' '.join(re.findall(r'\w\W?', s)))
For Python 3.7+, where zero-width patterns are allowed in re.split, you can use a lookahead and a lookbehind pattern to split the string at positions that are followed by a word character and preceded by any character:
print(' '.join(re.split(r'(?<=.)(?=\w)', s)))
Both of the above would output:
a b a k ə̃ t ə̃
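Both regexes rely on the accent not being a word character. As an alternative sketch, the Unicode database can be asked directly whether a character is a combining mark; the helper name `split_clusters` is hypothetical.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# The string from the question, with the combining tilde written as U+0303.
s = 'abak\u0259\u0303t\u0259\u0303'
print(' '.join(split_clusters(s)))
```

This is not full grapheme-cluster segmentation; it only attaches combining marks to the preceding character, which is enough for the example here.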

Split according to regex condition

This is another question of mine:
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
How can I capture everything that comes after "Organization: ", whether it is a full stop, a digit, or anything else, using regex?
result_organization = re.search(r"(Organization: )(\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*)", string)
My code above is far too long and not a sensible approach.
I would recommend using the str.find method, like this:
print(string[string.find("Organization")+14:])
You don't need regex for that; this simple code should give you the desired result:
# Named s rather than str, to avoid shadowing the built-in
s = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
if s.startswith("Organization: "):
    # Drop the 14-character prefix
    s = s[14:]
print(s)
You could also use the pattern (?<=Organization: ).+
Explanation:
(?<=Organization: ) - positive lookbehind; asserts that what precedes the current position is "Organization: "
.+ - matches one or more characters other than newline characters.
You could use a single capturing group instead of two.
Instead of spelling out all the words with (\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*), you could match any character except a newline using the dot, repeated one or more times, so that the match runs to the end of the line.
But note that this would also match strings like ##$$ ++
^Organization: (.+)
For example
import re
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
result_organization = re.search("Organization: (.*)", string)
print(result_organization.group(1))
If you want a somewhat more restrictive pattern you might use a character class and specify what you would allow to match. For example:
^Organization: ([\w.,]+(?: [\w.,]+)*)
Regex demo
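To compare the suggestions above side by side, here is a small sketch that extracts the organization name three ways. `str.partition` is used as a stand-in for the find/slice approach; the variable names are chosen for illustration.

```python
import re

line = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
prefix = "Organization: "

# Plain string method: everything after the first occurrence of the prefix.
_, found, by_partition = line.partition(prefix)

# Lookbehind: the match itself is the remainder of the line.
by_lookbehind = re.search(r'(?<=Organization: ).+', line).group(0)

# Single capturing group, anchored at the start of the line.
by_group = re.search(r'^Organization: (.+)', line).group(1)

print(by_partition)
print(by_partition == by_lookbehind == by_group)
```

`found` is the empty string when the prefix is absent, which gives a cheap way to detect lines that do not match without hard-coding the prefix length.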

Regular expressions: replace comma in string, Python

Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), is consuming a word char (= an alphanumeric or an underscore, [\w]) before a ,, then matches the comma, and then matches (and again, consumes) 1 character - a whitespace, a word character, or a literal | (as inside the character class, the pipe character is considered a literal pipe symbol, not alternation operator).
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
See the regex demo
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
See another regex demo
\w matches a-z, A-Z, 0-9 and the underscore, so your regex matches the commas between digits as well. You could try the following regex, and replace with \1\2.
([a-zA-Z]),(\s|[a-zA-Z])
Here is the DEMO.
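A short sketch to confirm the effect of the lookaround pattern on the two sample rows: after the substitution both rows should split into six columns. The variable names are illustrative.

```python
import re

rows = [
    "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15",
    "2015,2135,602832/09,DOYLE V ICON, LLC,15,15",
]

# A comma preceded by a letter and followed by a letter or whitespace.
comma = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')

# Removing the comma is enough here: the space after it is already present.
fixed = [comma.sub('', row) for row in rows]

for row in fixed:
    print(len(row.split(',')), row)
```

The comma after JORDAN and the one after LLC are kept because they are followed by a digit, so the lookahead fails there.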

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: find one or more word characters (the '+' means "at least once"), then a '-', then one or more word characters again.
str is a built-in name, so it is better not to use it as a variable name:
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match in words with hyphens at the beginning or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
ideone demo
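To see what the extra conditions buy, here is a sketch that runs the word-boundary pattern and the lookaround pattern on the same test string and compares the results.

```python
import re

text = ("-abc word semi-column peace -not-wanted- one-word "
        "dont-match- multi-hyphenated-word")

# Word boundaries alone: hyphens count as boundaries, so unwanted pieces match.
naive = re.findall(r'\b\w+-\w+\b', text)

# Lookarounds reject words that touch a hyphen on either side,
# and (?:-\w+)+ allows more than one hyphen inside the word.
strict = re.findall(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])', text)

print(naive)
print(strict)
```

The first pattern returns fragments such as 'not-wanted' and cuts 'multi-hyphenated-word' short, while the second returns only the three words that have hyphens strictly inside them.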
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match in either direction until I reach a space or another hyphen, so a word wrapped in hyphens (e.g. -test-cats-) is matched without the surrounding hyphens. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# [^ -] is any character other than a space or a hyphen
m = re.search(r'([^ -]+-[^ -]+)', st)
if m:
    print(m.group(1))

python-re: How do I match an alpha character

How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. I want it to be Unicode compatible, which is why I cannot use [a-zA-Z].
Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.
Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:
(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _
So what we don't want is anything in the character class [\W\d_] and consequently what we do want is anything in the character class [^\W\d_]
Here's a simple example (Python 2.6).
>>> import re
>>> rx = re.compile("[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
Further exploration reveals a few quirks of this approach:
>>> import unicodedata as ucd
>>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
... print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w) but it appears that Python interprets "digit" to mean "decimal digit" (category Nd) so it doesn't match \d
U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w
All CJK ideographs are classed as "letters" and thus match \w
Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
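For readers on Python 3, here is a sketch of the same exploration. It assumes Python 3 semantics, where `str` patterns are Unicode-aware by default, so no flag and no u'' prefixes are needed.

```python
import re
import unicodedata as ucd

# Raw string: word characters that are neither digits nor underscores.
rx = re.compile(r"[^\W\d_]+")
print(rx.findall("abc_def,k9"))

allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
for ch in allsorts:
    # ascii() shows the escape form, like repr() did in Python 2
    print(ascii(ch), ucd.category(ch), ucd.name(ch))

matched = rx.findall(allsorts)
print([ascii(m) for m in matched])
```

The quirk with U+3021 is still there: it counts as numeric (so `\w` matches it) but not as a decimal digit (so `\d` does not), and it therefore survives the `[^\W\d_]` filter.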
What about:
\p{L}
You can use this document as a reference: Unicode Regular Expressions
EDIT: It seems Python's re module doesn't handle Unicode property expressions such as \p{L}. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)
Other references:
re.UNICODE
python and regular expression with unicode
Unicode Technical Standard #18: Unicode Regular Expressions
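Since the standard library's re does not accept \p{L}, one way to approximate it without regex is to check the Unicode general category directly. This is a sketch with a hypothetical helper name, `letters_only`; the third-party regex package on PyPI does accept \p{L}, but it is not used here.

```python
import unicodedata

def letters_only(text):
    """Keep characters whose Unicode general category starts with 'L'."""
    return [ch for ch in text if unicodedata.category(ch).startswith('L')]

sample = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
print([ascii(ch) for ch in letters_only(sample)])
print(letters_only("abc_def,k9"))
```

Unlike `[^\W\d_]`, this filter drops U+3021 (category Nl), because it looks only at the letter categories rather than at what `\w` accepts.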
For posterity, here are the examples on the blog:
import re
string = 'riché'
print string
riché
richre = re.compile('([A-z]+)')
match = richre.match(string)
print match.groups()
('rich',)
richre = re.compile('(\w+)',re.LOCALE)
match = richre.match(string)
print match.groups()
('rich',)
richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)
richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)
richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)
string = 'richéñ'
match = richre.match(string)
print match.groups()
('rich\xe9\xf1',)
richre = re.compile('([\u00E9-\u00F8\w]+)')
print match.groups()
('rich\xe9\xf1',)
matched = match.group(1)
print matched
richéñ
You can use one of the following expressions to match a single letter:
(?![\d_])\w
or
\w(?<![\d_])
Here I match for \w, but check that [\d_] is not matched before/after that.
From the docs:
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
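As a quick sanity check, the sketch below runs the two lookaround forms next to the negated character class `[^\W\d_]` on a small sample string chosen for illustration.

```python
import re

# Letters, an underscore, a comma, a digit and an accented letter (U+00E9).
sample = "abc_def,k9\u00e9"

patterns = [
    r'[^\W\d_]',      # negated class
    r'(?![\d_])\w',   # negative lookahead, then a word character
    r'\w(?<![\d_])',  # word character, then negative lookbehind
]

results = [re.findall(p, sample) for p in patterns]
for p, r in zip(patterns, results):
    print(p, r)
```

All three patterns pick out the same single letters, so the choice between them is mostly a matter of readability.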
