Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
How can I remove noise from the edges of a word (or sequence of words)? By noise I mean: 's, 're, ., ?, ,, ;, etc. In other words, punctuation and contractions. It must be removed only from the left and right edges; noise inside a word should remain.
examples:
Apple. Apple
Donald Trump's Trump
They're They
I'm I
¿Hablas espanol? Hablas espanhol
$12 12
H4ck3r H4ck3r
What's up What's up
So basically: remove apostrophes, verb contractions and punctuation, but only at the string edges (right/left). It seems str.strip doesn't work with full (multi-character) matches, and I couldn't find a suitable re method that works only on the edges.
What about this:
import re
strings = ['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"]
rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in strings for m in [rx.search(string)] if m]
print(filtered)
Yielding
['Apple', 'Trump', 'They', 'I', 'Hablas', '12', 'H4ck3r']
Instead of eating something away from the left or right, it simply takes the first run of word characters (\w, which covers [a-zA-Z0-9_] and, in Python 3, Unicode letters and digits as well).
To apply it "in the wild", you could split the sentence first, like so:
sentence = "Apple. Trump's They're I'm ¿Hablas $12 H4ck3r"
rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in sentence.split() for m in [rx.search(string)] if m]
print(filtered)
This obviously yields the same list as above.
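Note that taking the first run of word characters would turn "What's up" into "What", while the question asks for inner apostrophes to survive. If that matters, one alternative is to strip only at the edges. The following is a minimal sketch, not a definitive solution: the helper name strip_edges and the pattern are assumptions of mine, and it only recognises the plain ASCII apostrophe.

```python
import re

# Hypothetical pattern: leading non-word characters, OR an optional
# trailing contraction ('s, 're, 'm, ...) followed by trailing non-word
# characters, anchored at the end of the string.
EDGE_NOISE = re.compile(r"^\W+|(?:'\w+)?\W*$")

def strip_edges(text):
    """Remove punctuation and a trailing contraction from the string edges only."""
    return EDGE_NOISE.sub("", text)

examples = ["Apple.", "They're", "I'm", "¿Hablas espanol?", "$12", "H4ck3r", "What's up"]
cleaned = [strip_edges(e) for e in examples]
print(cleaned)
# ['Apple', 'They', 'I', 'Hablas espanol', '12', 'H4ck3r', "What's up"]
```

Because both alternatives are anchored to the start or end of the string, the apostrophe inside "What's up" is left alone.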
Use pandas:
import pandas as pd
s = pd.Series(['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"])
# extract() returns a DataFrame by default; [0] selects the first capture group as a Series
s.str.extract(r'(\w+)')[0]
Output:
0 Apple
1 Trump
2 They
3 I
4 Hablas
5 12
6 H4ck3r
Name: 0, dtype: object
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have a complex task that I want to accomplish:
From a string, I want to be able to classify words into particular categories.
s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
d = function(s)
print(d)
{"age": "63 years",
"gender": "female 35%",
"race": "White 68%, Black 5%, Asian 19%, other 8%"}
I must note that not all strings are in the same format, but there is a finite set of categories overall (age, gender, race, region), and some strings have only 1 or 2 of the 4 categories.
Here are some other toy strings:
s2 = 'Age 71 years, male 64%'
s3 = '''Age 64 years, female 21%,
Race or ethnicity: White 66%, Black 5%, Asian 18%, other 11%
Region: N. America 7%, Latin America 17%, W. Europe or other 24%, central Europe 33%, Asia-Pacific 18%'''
As you can see there are some patterns:
age is not preceded by any ':'.
gender is documented as either female or male.
race and region are followed by ':'.
I am interested in collecting all the information corresponding to each category, as seen in my toy example with the race category.
What I need:
Writing the RegEx pattern with the appropriate capturing groups to obtain the results.
Transform the matches to a dictionary: I have seen a solution using the .groupdict() method to do so.
I have a problem writing the regex pattern that will return the aforementioned groups.
I have seen this interesting solution for a similar problem: python regex: create dictionary from string.
But I have trouble applying it to mine.
Instead of finding one golden regex to handle all the cases, you could pass your input string through a set of regexes, each trying to extract one of the categories you mentioned in the question. Something like:
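import re
d = {}
ageMatch = re.search(r'Age\s+(\d+)\s+years?', s, re.I)
if ageMatch:
    # Use ageMatch.group(1) to form part of your dict
    d['age'] = ageMatch.group(1)
# re.search rather than re.match: the gender is not at the start of the string
genderMatch = re.search(r'\b(male|female)\s+(\d+)\s*%', s, re.I)
if genderMatch:
    # Use genderMatch.group(1) and genderMatch.group(2) to form part of your dict
    d['gender'] = (genderMatch.group(1), genderMatch.group(2))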
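The one-regex-per-category idea can be sketched as a small lookup table. This is only an illustration under assumptions: the PATTERNS table and the function name parse_demographics are hypothetical, and the patterns are fitted to the toy strings from the question, so real data may need different ones.

```python
import re

# Hypothetical patterns, one per category, fitted to the toy strings only.
PATTERNS = {
    "age": re.compile(r"Age\s+(\d+\s+years?)", re.I),
    "gender": re.compile(r"\b((?:fe)?male\s+\d+\s*%)", re.I),
    # race and region are labelled with a ':' and run to the end of the line
    "race": re.compile(r"race[^:\n]*:\s*([^\n;]+)", re.I),
    "region": re.compile(r"region[^:\n]*:\s*([^\n;]+)", re.I),
}

def parse_demographics(text):
    """Return a dict holding only the categories that were found in text."""
    result = {}
    for category, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[category] = match.group(1).strip()
    return result

s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
s2 = 'Age 71 years, male 64%'
print(parse_demographics(s))
print(parse_demographics(s2))
```

Categories missing from a string are simply absent from the resulting dict, which suits inputs that carry only 1 or 2 of the 4 categories.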
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have a string that looks like:
str_in = """Lemons: J2020, M2021. Eat by 9/03/28
Strawberries: N2023, O2024. Buy by 10/10/20"""
How do I get just "J2020, M2021, N2023, O2024"?
What I have so far is very hardcoded. It is:
str_in.replace("Lemon:","")
str_in.replace("Strawberries:", "")
str_in.replace("Buy by")
I don't know how to get rid of the date if the date changes from the number specified. Is there a RegEx form I could use?
Based on your original post and your follow-up comments, you can explicitly fetch the strings you want to keep by using this regex: \b[A-Z]+\d+\b. It allows for 1 or more uppercase letters followed by 1 or more digits, bounded as a single word. To test it and other regexes in the future, use an online regex tester.
The re.findall() function is the best fit here because it returns all non-overlapping matches of the pattern. For more on findall() and the other matching functions, see the re module documentation.
Putting all that together, the code would be:
values = re.findall(r'\b[A-Z]+\d+\b', str_in)
Be sure to import re first.
I just saw your edited question, so, here's my edited answer
import re
re_pattern = re.compile(r'(\w+),\s(\w+)\.')
data = [ 'Lemons: J2020, M2021. Eat by 9/03/28',
'Strawberries: N2023, O2024. Buy by 10/10/20',
'Peaches: N12345, O123456. Buy by 10/10/20'
]
for line in data:
    match = re_pattern.search(line)
    if match:
        print(match.group(1), match.group(2))
import re
string = "Lemons: J2020, M2021. Eat by 9/03/28 Strawberries: N2023, O2024. Buy by 10/10/20"
array = re.findall(r"\b[A-Z]\d{4}\b", string)
result = ', '.join(array)
The result string is "J2020, M2021, N2023, O2024"
The array is ['J2020', 'M2021', 'N2023', 'O2024']
The regex allows 1 or 2 uppercase characters at the beginning of the required text and then matches the digits that follow. I think the OP has enough information to test it on that basis.
import re
str_in = "Lemons: J2020, M2021. Eat by 9/03/28 \
Strawberries: N2023, O2024. Buy by 10/10/20"
result = re.findall(r'([A-Z]{1,2}\d+)', str_in)
print(result)
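The three patterns suggested in these answers are not interchangeable, and the difference shows up on codes that are longer than the ones in the question. A small comparison, using a made-up input string (the codes ABC123 and N12345 are my own test values, not from the question):

```python
import re

# Made-up input: ABC123 and N12345 are added only to show the differences.
text = "Lemons: J2020, M2021. Eat by 9/03/28 Peaches: ABC123, N12345."

# Any number of uppercase letters, any number of digits, as a whole word.
loose = re.findall(r"\b[A-Z]+\d+\b", text)
# Exactly one uppercase letter and exactly four digits, as a whole word.
strict = re.findall(r"\b[A-Z]\d{4}\b", text)
# No word boundaries: can start in the middle of a longer token.
unanchored = re.findall(r"[A-Z]{1,2}\d+", text)

print(loose)       # ['J2020', 'M2021', 'ABC123', 'N12345']
print(strict)      # ['J2020', 'M2021']
print(unanchored)  # ['J2020', 'M2021', 'BC123', 'N12345']
```

None of the three picks up the dates, since the dates contain no uppercase letter directly before the digits.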
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I'm using Python 3.7. I have an array of unique words ...
words = ["abc", "cat", "dog"]
Then I have other strings, which may or may not contain one or more instances of these words. How do I count how many of the words occur at least once in each string? For example, if I have
s = "bbb abc abc lll dog"
Given the above array, words, the result of counting unique words in "s" should be 2, because "abc" occurs at least once, and "dog" occurs at least once. Similarly,
s2 = "CATTL DOG mmm"
would only contain 1 unique word, "dog". The other words don't occur in the array "words".
A quick way would be:
set(words).intersection(s.split(" "))
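The intersection above is case-sensitive, so for s2 = "CATTL DOG mmm" it would find nothing, while the question expects "dog" to count. A minimal sketch of a case-insensitive count follows; the function name count_unique_matches is hypothetical, and lowercasing both sides is my assumption about what "matching" should mean here.

```python
def count_unique_matches(words, text):
    """Count how many of the given words occur at least once in text."""
    wanted = {w.lower() for w in words}
    # A set of tokens, so repeated occurrences are counted once.
    tokens = {t.lower() for t in text.split()}
    return len(wanted & tokens)

words = ["abc", "cat", "dog"]
print(count_unique_matches(words, "bbb abc abc lll dog"))  # 2
print(count_unique_matches(words, "CATTL DOG mmm"))        # 1
```

Whole-token comparison keeps "CATTL" from matching "cat", which a substring test would not.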
A set comprehension is a good choice here
words = ['abc', 'cat', 'dog']
s = 'bbb abc abc lll dog'
ss = {w for w in s.split() if w in words}
ss
> {'abc', 'dog'}
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I need to separate given words if they are surrounded by numbers. For example, the word is "x".
s = '''
1x 3 # OK
s1x2 # WRONG
2x # OK
s1 x2 # WRONG
x2 # OK
1sx3 # WRONG
'''
print(re.sub("(?<=\d)\s*x\s*(?=\d)", " x ", s))
This separates everything, even when the surrounding token is not purely a number. I mean, neither s1 x2 nor s1x3x should be matched.
On the other hand it doesn't work for "no" - only for the last 2 rows:
s = '''
2 no 3 # OK (but it's not needed to match)
2no # OK
3no2 # OK
no9 # OK
xno9 # WRONG
5 non # WRONG (for 'no')
'''
print(re.sub("(?<=\d)\s*no\s*(?=\d)", " x ", s))
I've edited examples a bit.
There's a need to use it within a sentence, for example:
Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 bedroom
bungalow. Installation of 2 non-illuminated fascia signs and 2no ad
signs.
Both occurrences in the 1st sentence should match, but only the second one in the 2nd sentence.
EDIT
Thanks to the below post I've found this to match:
\b(?:\d*\s*x\s*\d+|\d+\s*x\s*\d*)\b
but the problem is that it doesn't work for replacement. The idea is to add an extra space around words that are surrounded by numbers. So while this new pattern properly selects those phrases (both in single rows and in sentences), it doesn't work with replacement, because for that it should match only the words themselves:
s = "Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 Bedroom bungalow"
re.sub("\b(?:\d*\s*x\s*\d+|\d+\s*x\s*\d*)\b", " x ", s, flags=re.IGNORECASE)
data = '''
Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 bedroom bungalow. Installation of 2 non-illuminated fascia signs and 2no ad signs.
'''
cases = ['no', 'nos', 'x']
import re
l = data
for case in cases:
    l = re.sub(r'\s{2,}', ' ', re.sub(r'(?<=\d| ){}(?=\d| )'.format(case), r' {} '.format(case), l))
print(l)
Prints:
Sever land and erect 1 x 3 Bedroom chalet bungalow and 1 x 2 bedroom bungalow. Installation of 2 non-illuminated fascia signs and 2 no ad signs.
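The loop above would also rewrite tokens such as s1x2, which the question marks as WRONG. One way to avoid that is to capture the digits together with the keyword and require that the whole token is not glued to other word characters. This is a rough sketch under assumptions: the pattern and the name space_out are mine, and it only covers the case where a number comes before the keyword, so a bare x2 is left untouched.

```python
import re

# Hypothetical pattern: digits, keyword (x, no, nos), optional digits,
# with no word character directly before or after the whole token.
PATTERN = re.compile(r"(?<!\w)(\d+)\s*(x|nos?)\s*(\d*)(?!\w)", re.IGNORECASE)

def space_out(text):
    def repl(match):
        # Join only the non-empty pieces with single spaces.
        return " ".join(part for part in match.groups() if part)
    return PATTERN.sub(repl, text)

sentence = ("Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 bedroom "
            "bungalow. Installation of 2 non-illuminated fascia signs and 2no ad signs.")
print(space_out(sentence))
```

On the rows from the question, s1x2, s1 x2, 1sx3, xno9 and 5 non are left unchanged, while 1x 3 and 3no2 become 1 x 3 and 3 no 2.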
You might use an alternation using | to match a required digit at either side where x or no could be matched in the middle.
^(?:\d* *(?:x|no)\s*\d+|\d+\s*(?:x|no) *\d*)$
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I'm trying to extract relevant information from my text.
I'm now using python.
Since
if any(s in mytext for s in mylist):
matches too many irrelevant strings, I'm trying to find the strings that contain two or more words from mylist.
So, my question is: "How can I select only the sentences that contain two or more words from my list?"
Thank you for your help!
Have a good day!
I'm not sure exactly what you want, but this is what I understood:
sentences = ["This is testing", "hello", "My name", "This is not test"]
for sentence in sentences:
    if len(sentence.split()) >= 2:
        print(sentence)
OUTPUT :
This is testing
My name
This is not test
I hope this will help you...
1) Split the string.
2) Check occurrence of keyword in text.
3) If count greater than or equal to 2 print the text
keywords = ['the', 'apple', 'fruit']
text = ['apple is a fruit', 'orange is fruit', 'the apple', 'the orange', 'the orange fruit']
for element in text:
    # count the distinct keywords that appear in this element
    if len(set(keywords) & set(element.split())) >= 2:
        print(element)
Output:
apple is a fruit
the apple
the orange fruit
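The same idea can be wrapped in a function that also tolerates capital letters and punctuation. A minimal sketch, assuming that a "word" is any run of \w characters; the function name and the minimum parameter are hypothetical.

```python
import re

def sentences_with_keywords(sentences, keywords, minimum=2):
    """Keep the sentences that contain at least `minimum` distinct keywords."""
    wanted = {k.lower() for k in keywords}
    selected = []
    for sentence in sentences:
        # Tokenise on word characters so "fruit." still counts as "fruit".
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if len(wanted & tokens) >= minimum:
            selected.append(sentence)
    return selected

keywords = ['the', 'apple', 'fruit']
text = ['Apple is a fruit.', 'orange is fruit', 'The apple', 'the orange']
print(sentences_with_keywords(text, keywords))
# ['Apple is a fruit.', 'The apple']
```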