Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 2 years ago.
I have some sentences containing emoji written as Unicode escapes, which start with a pattern like U0001. I need to extract every string that starts with U0001 into an array. This is the code that I have tried:
import re
pattern = re.compile(r"^U0001")
sentence = 'U0001f308 U0001f64b The dark clouds disperse the hail subsides and one neon lit rainbow with a faint second arches across the length of the A u2026'
print(pattern.match(sentence).group())  # This prints U0001 every time, but what I want is ['U0001f308']
matches = re.findall(r"^\w+", sentence)
print(matches) # This only prints the first match which is 'U0001f308'
Is there any way to extract these strings into an array? I don't have much experience with regex.
'U0001f308' is not an emoji code point! It's a 9-character string beginning with the letter 'U'.
The way to enter Unicode code points with more than 4 hex characters is \U0001f308 (a backslash, a capital U, and eight hex digits). Likewise, to enter a code point with 4 hexadecimal characters: \u0001.
But you can't look for code points that start with '0001' as if they were regular character strings. It seems to me you are looking either for the 4-hexadecimal-character code point \u0001 or for anything in the range \U00010000 - \U0001FFFF:
import re
sentence = '\U0001f308 \U0001f64b The dark clouds disperse the hail subsides and one neon lit rainbow with a faint second arches across the length of the A \u2026'
matches = re.findall('[\u0001\U00010000-\U0001FFFF]', sentence)
print(matches)
matches -> ['\U0001f308', '\U0001f64b']
If for some reason you really had strings beginning with 'U' and not actual codepoints, then:
matches = re.findall('U0001(?:[0-9a-fA-F]{4})?', sentence)
I have also assumed that emojis can be anywhere in the string and adjacent to any other characters.
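For the literal-text case (the question's sample really does contain the characters U0001f308), a minimal sketch of collecting the tokens and, optionally, turning them into real characters might look like this. The shortened sample sentence and the helper name to_char are illustrative assumptions, and the pattern assumes every token is 'U' followed by exactly eight hex digits.

```python
import re

# Shortened sample: the emoji are spelled out as plain text, not real code points.
sentence = 'U0001f308 U0001f64b The dark clouds disperse A u2026'

# Literal "U0001" followed by four more hex digits. There is no ^ anchor,
# so findall scans the whole sentence instead of only its start.
tokens = re.findall(r'U0001[0-9a-fA-F]{4}', sentence)
print(tokens)

def to_char(token):
    """Hypothetical helper: turn the text 'U0001f308' into the character U+1F308."""
    return chr(int(token[1:], 16))

emojis = [to_char(t) for t in tokens]
print(emojis)
```

The original attempt returned a single result because ^ only allows a match at the start of the string.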
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 10 months ago.
This is my string:
Hair ReplacementHair BraidingHair Supplies & Accessories
my expected result should be:
Hair Replacement,Hair Braiding,Hair Supplies & Accessories
If two words are joined like this, ReplacementHair, I want to split them and add a comma between them.
I tried this code:
re.sub(r"(\w)([A-Z])", r"\1 \2", text)
The above code splits the two words and adds a space between them. I want a comma instead of a space.
You can replace the space in the replacement pattern with a comma.
import re
text = "Hair ReplacementHair BraidingHair Supplies & Accessories"
text2 = re.sub(r"(\w)([A-Z])", r"\1,\2", text)
print(text2)
Output:
Hair Replacement,Hair Braiding,Hair Supplies & Accessories
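One caveat with (\w)([A-Z]): \w also matches uppercase letters and digits, so an acronym such as "USA" would get a comma inserted inside it. A possible alternative, sketched here under the assumption that a split is only wanted where a lowercase letter is immediately followed by an uppercase one, uses zero-width lookarounds:

```python
import re

text = "Hair ReplacementHair BraidingHair Supplies & Accessories"

# Match the empty position that is preceded by a lowercase letter and
# followed by an uppercase letter. Nothing is consumed, so no groups
# have to be copied back into the replacement.
boundary = re.compile(r"(?<=[a-z])(?=[A-Z])")

result = boundary.sub(",", text)
print(result)

# Illustrative comparison on an acronym (not from the original question).
print(re.sub(r"(\w)([A-Z])", r"\1,\2", "USA Today"))
print(boundary.sub(",", "USA Today"))
```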
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 1 year ago.
I want to split "Coffee Hello 咖啡 咖啡"
into
"Coffee Hello" and "咖啡 咖啡". How should I do it? I used isalpha and isspace, but it is not working: it splits into "Coffee Hello 咖啡 咖啡", "" instead. I found a simple fix by using a regexp to check the index of the first alphabet and splitting at that index.
Regex would do:
>>> import re
>>> string = "Coffee Hello 咖啡 咖啡"
>>> re.split(r"(?<=[A-Za-z])\s*(?=[\u4e00-\u9fa5])", string)
['Coffee Hello', '咖啡 咖啡']
>>>
#Lutz's and #U12-Forward's answers only work when English words precede Chinese words.
A better-rounded approach that works regardless of the order of English and Chinese words would be to use re.findall with an alternation pattern instead:
re.findall(r'[a-z]+(?:\s+[a-z]+)*|[\u4e00-\u9fa5]+(?:\s+[\u4e00-\u9fa5]+)*', string, re.I)
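As a quick check of the alternation pattern, here is a small sketch that runs it on the original string and on a made-up string where the Chinese text comes first (the second input is an assumption for illustration only):

```python
import re

# Either a run of English words separated by whitespace,
# or a run of Chinese words separated by whitespace.
pattern = re.compile(
    r'[a-z]+(?:\s+[a-z]+)*|[\u4e00-\u9fa5]+(?:\s+[\u4e00-\u9fa5]+)*',
    re.I,
)

english_first = pattern.findall("Coffee Hello 咖啡 咖啡")
mixed_order = pattern.findall("咖啡 Coffee Hello 咖啡 咖啡 Tea")

print(english_first)
print(mixed_order)
```

Because findall collects the runs themselves rather than splitting on a boundary, the result does not depend on which script comes first.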
If all you want to consider is plain English characters, you can use a combination of lookahead and lookbehind:
re.split(r"(?<=[a-zA-Z])\s*(?=[^a-zA-Z]*$)", "Coffee Hello 咖啡 咖啡")
This splits on any run of whitespace (\s*), but only if the character before it is from the English alphabet ((?<=[a-zA-Z]): (?<=...) is a lookbehind and [a-zA-Z] matches English letters) and if nothing that follows is from the English alphabet ((?=[^a-zA-Z]*$): (?=...) is a lookahead and [^a-zA-Z]*$ matches non-English characters up to the end of the string).
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 2 years ago.
I'm using Jupyter Notebook to try some simple regex patterns, but it keeps returning None for these two cases and I can't see why.
I want to search for a run of 3 to 5 digits:
digitRegex = re.compile('r(\d){3,5}')
digitRegex.search('123456789')
It should return '12345', but it returns None :(
Same problem here, when trying to find 3 consecutive US phone numbers, where the area code is optional and the numbers are separated by commas:
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,)?){3}')
phoneRegex.search('My numbers are 415-555-1234,555-4242,212-555-000')
It should return the 3 phone numbers, but it also returns None :(
Thank you...
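In your first snippet, you put the r prefix inside the string, so the pattern looks for a literal letter "r" before the digits and never matches. (The r prefix marks a raw string and belongs before the opening quote.)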
Working code:
digitRegex = re.compile(r'\d{3,5}')
digitRegex.search('123456789')
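To make the difference visible, here is a small sketch comparing the two patterns. The variable names broken and fixed, and the extra input 'r123456789', are illustrative assumptions.

```python
import re

# 'r(\\d){3,5}' is the same pattern as the question's 'r(\d){3,5}':
# a literal letter "r" followed by 3 to 5 digits.
broken = re.compile('r(\\d){3,5}')
# The raw-string prefix belongs outside the quotes.
fixed = re.compile(r'\d{3,5}')

no_match = broken.search('123456789')          # no "r" in the text
with_r = broken.search('r123456789').group()   # only matches after a literal "r"
digits = fixed.search('123456789').group()     # greedy: takes up to 5 digits

print(no_match, with_r, digits)
```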
In the second sample, the string won't match because the regex requires three complete phone numbers, and the last one ends with three digits instead of four. You need to fix either your regexp or your phone number.
Working sample with valid numbers matching the original regex:
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,)?){3}')
phoneRegex.search('My numbers are 415-555-1234,555-4242,212-555-0000')
Working sample with an edited regex matching the original numbers:
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d?(,)?){3,4}')
phoneRegex.search('My numbers are 415-555-1234,555-4242,212-555-000')
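The two fixes can be checked side by side. This is only a sketch; the names strict, relaxed, original and corrected are made up here to label the two regexes and the two input strings.

```python
import re

# The question's regex: every number must end with four digits.
strict = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,)?){3}')
# The edited regex: the fourth digit of the last block is optional.
relaxed = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d?(,)?){3,4}')

original = 'My numbers are 415-555-1234,555-4242,212-555-000'
corrected = 'My numbers are 415-555-1234,555-4242,212-555-0000'

print(strict.search(original))            # None: the last number is too short
print(strict.search(corrected).group())   # all three numbers
print(relaxed.search(original).group())   # all three numbers, short one included
```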
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Is there a best practice to remove weird whitespace Unicode characters from strings in Python?
For example, if a string contains one of the Unicode characters in this table, I would like to remove it.
I was thinking of putting the characters into a list and then looping over it with replace, but I'm sure there is a more Pythonic way of doing so.
You should be able to use this:
[''.join(letter for letter in word if not letter.isspace()) for word in word_list]
because the docs for str.isspace say:
Return True if there are only whitespace characters in the string and there is at least one character, False otherwise.
A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.
See the Unicode character list for category Zs.
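A short sketch of the isspace approach on a few sample strings follows. The contents of word_list are made up for illustration. Note that zero-width characters such as U+200B are not classified as whitespace, so this approach leaves them in place.

```python
# Sample data (an assumption): no-break space, em space, ideographic space.
word_list = ["foo\u00a0bar", "\u2003baz", "qux\u3000"]

# Keep every character for which str.isspace() is False.
cleaned = [
    "".join(letter for letter in word if not letter.isspace())
    for word in word_list
]
print(cleaned)

# U+200B ZERO WIDTH SPACE has general category Cf, so isspace() is False.
print("\u200b".isspace())
```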
Regex is your friend in cases like this: you can simply iterate over your list, applying a regex substitution.
import re
# No ^ anchor: match whitespace anywhere, not just at the start.
r = re.compile(r"\s+")
dirty_list = [...]
# Iterate over dirty_list, replacing every run of
# whitespace with an empty string.
clean_list = [
    r.sub("", s)
    for s in dirty_list
]
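For a concrete run of the regex approach, here is a sketch with made-up sample strings. For str patterns, \s matches Unicode whitespace by default, and a pattern anchored with ^ would only strip leading whitespace.

```python
import re

anywhere = re.compile(r"\s+")       # whitespace at any position
leading_only = re.compile(r"^\s+")  # whitespace at the start of the string only

# Sample data (an assumption): no-break space, em space, ideographic space.
dirty_list = [" \u00a0foo\u2003bar\u3000 ", "baz\u00a0qux"]

clean_list = [anywhere.sub("", s) for s in dirty_list]
print(clean_list)

# The anchored pattern leaves the inner em space untouched.
partly_clean = leading_only.sub("", "\u00a0foo\u2003bar")
print(repr(partly_clean))
```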
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I need some help writing a couple of complex regular expressions that are way over my head.
The first Regex, I want to exclude everything except:
The letters A to Z in both upper and lowercase
Single spaces
Single Dashes (-)
For the second, I want the same as above but also allow:
The numbers 0 to 9
Apostrophes (')
Question Marks (?)
Exclamation Marks (!)
Colons & Semi-Colons (: & ;)
Periods/Fullstops & commas (. & ,)
As a side note, are there any online generators where I can type in a list of allowed characters and have a regex generated for me?
Many thanks.
To satisfy the "single" requirement, you'll need a lookahead, along the lines of:
r1 = r"""(?xi)
    ^
    (
        [a-z]+
        |
        \x20(?!\x20)
        |
        -(?!-)
    )+
    $
"""
\x20(?!\x20) reads "a space, if not followed by another space".
For the second re, just add the extra characters to the first alternative's character class: [a-z0-9'?!:;.,].
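Put together, the two expressions could be sketched as follows. The names LETTERS_ONLY and EXTENDED are made up here, and the letter alternative is written as a single [a-z] rather than [a-z]+, a deliberate change to avoid nesting one repetition inside another.

```python
import re

# First pattern: letters, single spaces, single dashes.
LETTERS_ONLY = re.compile(r"""(?xi)
    ^
    (?:
        [a-z]            # a letter
      | \x20(?!\x20)     # a space, if not followed by another space
      | -(?!-)           # a dash, if not followed by another dash
    )+
    $
""")

# Second pattern: same rules, plus digits and the listed punctuation.
EXTENDED = re.compile(r"""(?xi)
    ^
    (?:
        [a-z0-9'?!:;.,]  # letters, digits, and allowed punctuation
      | \x20(?!\x20)
      | -(?!-)
    )+
    $
""")

for candidate in ["Mary-Jane Smith", "Mary  Jane", "a--b", "abc1"]:
    print(repr(candidate), bool(LETTERS_ONLY.match(candidate)))

print(bool(EXTENDED.match("Hello, world! It's 9:00.")))
```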