Adding space between characters of a string containing special accented character - python

Is there a way to add space between the characters of a string such as the following: 'abakə̃tə̃'?
The usual ' '.join('abakə̃tə̃') approach returns 'a b a k ə ̃ t ə ̃', I am looking for 'a b a k ə̃ t ə̃'.
Thanks in advance.

You can use re.findall with a pattern that matches a word character optionally followed by an non-word character (which matches an accent):
import re
s = 'abakə̃tə̃'
print(' '.join(re.findall(r'\w\W?', s)))
For Python 3.7+, where zero-width patterns are allowed in re.split, you can use a lookahead and a lookbehind pattern split the string at positions that are followed by a word character and preceded by any character:
print(' '.join(re.split(r'(?<=.)(?=\w)', s)))
Both of the above would output:
a b a k ə̃ t ə

Related

Regex to get non-alphanumeric strings between alphanumeric strings

Let say I have this string:
Alpha+*&Numeric%$^String%%$
I want to get the non-alphanumeric characters that are between alphanumeric characters:
+*& %$^
I have this regex: [^0-9a-zA-Z]+ but it's giving me
+* %$^ %%$
which includes the tailing non-alphanumeric characters which I do not want. I have also tried [0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z] but it's giving me
a+*&N c%$^S
which include the characters a, N, c and S
If you don't mind including the _ character as alpha-numeric data, you can extract all your non-alpha-numeric-data with this:
some_string = "A+*&N%$^S%%$"
import re
result = re.findall(r'\b\W+\b', some_string) # sets result to: ['+*&', '%$^']
Note my use of \b instead of something like \w or [^\W].
\w and [^\W] each match one character, so if your alpha-numeric string (between the text you want) is exactly one character, then what you think should be the next match won't match.
But since \b is a zero-width "word boundary," it doesn't care how many alpha-numeric characters there are, as long as there is at least one.
The only problem with your second attempt is the location of the + qualifier--it should be inside of the parentheses. You can also use the word character class \w and its inverse \W to pull out these items, which is the same as your second regex but includes underscores _ as parts of words:
import re
s = "Alpha+*&Numeric%$^String%%$"
print(re.findall(r"\w(\W+)\w", s)) # adds _ character
print(re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)) # your version fixed
print(re.findall(r"(?i)[0-9A-Z]([^0-9A-Z]+)[0-9A-Z]", s)) # same as above
Output:
['+*&', '%$^']
['+*&', '%$^']
['+*&', '%$^']

Add space between Persian numeric and letter with python re

I want to add space between Persian number and Persian letter like this:
"سعید123" convert to "سعید 123"
Java code of this procedure is like below.
str.replaceAll("(?<=\\p{IsDigit})(?=\\p{IsAlphabetic})", " ").
But I can't find any python solution.
There is a short regex which you may rely on to match boundary between letters and digits (in any language):
\d(?=[^_\d\W])|[^_\d\W](?=\d)
Live demo
Breakdown:
\d Match a digit
(?=[^_\d\W]) Preceding a letter from a language
| Or
[^_\d\W] Match a letter from a language
(?=\d) Preceding a digit
Python:
re.sub(r'\d(?![_\d\W])|[^_\d\W](?!\D)', r'\g<0> ', str, flags = re.UNICODE)
But according to this answer, this is the right way to accomplish this task:
re.sub(r'\d(?=[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی])|[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی](?=\d)', r'\g<0> ', str, flags = re.UNICODE)
I am not sure if this is a correct approach.
import re
k = "سعید123"
m = re.search("(\d+)", k)
if m:
k = " ".join([m.group(), k.replace(m.group(), "")])
print(k)
Output:
123 سعید
You may use
re.sub(r'([^\W\d_])(\d)', r'\1 \2', s, flags=re.U)
Note that in Python 3.x, re.U flag is redundant as the patterns are Unicode aware by default.
See the online Python demo and a regex demo.
Pattern details
([^\W\d_]) - Capturing group 1: any Unicode letter (literally, any char other than a non-word, digit or underscore chars)
(\d) - Capturing group 2: any Unicode digit
The replacement pattern is a combination of the Group 1 and 2 placeholders (referring to corresponding captured values) with a space in between them.
You may use a variation of the regex with a lookahead:
re.sub(r'[^\W\d_](?=\d)', r'\g<0> ', s)
See this regex demo.

regex. Match all words starting with two letter

I wish to find all words that start with "Am" and this is what I tried so far with python
import re
my_string = "America's mom, American"
re.findall(r'\b[Am][a-zA-Z]+\b', my_string)
but this is the output that I get
['America', 'mom', 'American']
Instead of what I want
['America', 'American']
I know that in regex [Am] means match either A or m, but is it possible to match A and m as well?
The [Am], a positive character class, matches either A or m. To match a sequence of chars, you need to use them one after another.
Remove the brackets:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))
# => ['America', 'American']
See the Python demo
This pattern details:
\b - a word boundary
Am - a string of chars matched as a sequence Am
[a-zA-Z]+ - 1 or more ASCII letters
\b - a word boundary.
Don't use character class:
import re
my_string = "America's mom, American"
re.findall(r'\bAm[a-zA-Z]+\b', my_string)
re.findall(r'(Am\w+)', my_text, re.I)

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string = match it anwyhere but at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostophe is searched only inside a word (but in fact, a pair of words where the second one is shortened), then you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed with a word boundary, so that ' could be preceded with any word, letter or _ char. Yes, _ char and digits are allowed to be on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word (i.e. [a-zA-Z0-9_], can be adjusted from here) char after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, it only fetches a letter character as [^\W\d_] matches any char other than a non-word, digit and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.

Regular expressions: replace comma in string, Python

Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), is consuming a word char (= an alphanumeric or an underscore, [\w]) before a ,, then matches the comma, and then matches (and again, consumes) 1 character - a whitespace, a word character, or a literal | (as inside the character class, the pipe character is considered a literal pipe symbol, not alternation operator).
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
See the regex demo
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
See another regex demo
\w matches a-z,A-Z and 0-9, so your regex will replace all commas. You could try the following regex, and replace with \1\2.
([a-zA-Z]),(\s|[a-zA-Z])
Here is the DEMO.

Categories