I need to separate given words if they are surrounded by numbers. For example, the word is "x".
s = '''
1x 3 # OK
s1x2 # WRONG
2x # OK
s1 x2 # WRONG
x2 # OK
1sx3 # WRONG
'''
print(re.sub("(?<=\d)\s*x\s*(?=\d)", " x ", s))
This separates every occurrence, even when the neighbouring token is not a pure number. I mean, neither s1 x2 nor s1x3x should be matched.
On the other hand it doesn't work for "no" - only for the last 2 rows:
s = '''
2 no 3 # OK (but it's not needed to match)
2no # OK
3no2 # OK
no9 # OK
xno9 # WRONG
5 non # WRONG (for 'no')
'''
print(re.sub("(?<=\d)\s*no\s*(?=\d)", " x ", s))
I've edited examples a bit.
There's a need to use it within a sentence, for example:
Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 bedroom
bungalow. Installation of 2 non-illuminated fascia signs and 2no ad
signs.
Both occurrences in the first sentence should match, but only the second one (2no) in the second sentence.
EDIT
Thanks to the below post I've found this to match:
\b(?:\d*\s*x\s*\d+|\d+\s*x\s*\d*)\b
but the problem is that it doesn't work for replacement. The idea is to add an extra space around words that are surrounded by numbers. So while this new pattern properly selects those phrases (both in single rows and in sentences), it doesn't work with replacement, because the replacement should affect only the words themselves, not the numbers around them:
s = "Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 Bedroom bungalow"
re.sub(r"\b(?:\d*\s*x\s*\d+|\d+\s*x\s*\d*)\b", " x ", s, flags=re.IGNORECASE)
data = '''
Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 bedroom bungalow. Installation of 2 non-illuminated fascia signs and 2no ad signs.
'''
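import re

cases = ['no', 'nos', 'x']
l = data
for case in cases:
    # Pad the keyword with spaces when it sits between digits/spaces, then collapse repeated whitespace
    l = re.sub(r'\s{2,}', ' ', re.sub(r'(?<=\d| ){}(?=\d| )'.format(case), r' {} '.format(case), l))
print(l)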
Prints:
Sever land and erect 1 x 3 Bedroom chalet bungalow and 1 x 2 bedroom bungalow. Installation of 2 non-illuminated fascia signs and 2 no ad signs.
You might use an alternation using | to match a required digit at either side where x or no could be matched in the middle.
^(?:\d* *(?:x|no)\s*\d+|\d+\s*(?:x|no) *\d*)$
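For the replacement itself, one option is to capture the numbers and the keyword separately and join them again with single spaces. The following is only a minimal sketch, not code from the question or the answers: the names KEYWORD_RE and space_out are made up for illustration, and it assumes the keyword always has a standalone number on its left, so leading forms such as x2 or no9 are deliberately left untouched.

```python
import re

# Hypothetical names; the keyword list mirrors the question ('x', 'no', 'nos').
KEYWORD_RE = re.compile(
    r"\b(\d+)\s*(nos|no|x)"   # a standalone number, then the keyword
    r"(?:\s*(\d+)\b|\b)",     # then either another number or a word boundary
    re.IGNORECASE,
)


def space_out(text):
    """Put single spaces between the numbers and the keyword they surround."""
    def _join(match):
        # Group 3 is None when no number follows the keyword.
        return " ".join(part for part in match.groups() if part)
    return KEYWORD_RE.sub(_join, text)


sentence = ("Sever land and erect 1x 3 Bedroom chalet bungalow and 1x2 bedroom "
            "bungalow. Installation of 2 non-illuminated fascia signs and 2no ad signs.")
print(space_out(sentence))
```

Because the digits are captured rather than replaced, they survive the substitution, and "2 non-illuminated" is skipped since no word boundary follows "no" there.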
I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8
This is what I want the results to look like:
Title                   Category  Rating
Wolf of Wall Street     A-13      9
Django Unchained        IMDB      8
The EXPL Haunted House  FEAR      7
Silver Lining           DC-23     8
Here is what I used to separate the cells.
For Rating, splitting on ' x ' worked:
data[['Movie Titles/Category/Rating', 'Rating']] = data['Movie Titles/Category/Rating'].str.split(' x ', expand=True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
Details:
^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<Category>\S+) - Group "Category": one or more non-whitespace chars
\s+x\s+ - x enclosed with one or more whitespaces
(?P<Rating>\d+) - Group "Rating": one or more digits
$ - end of string.
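To see how the pieces described above interact, here is a small sketch that applies the same pattern to one row with the plain re module instead of pandas. The name ROW_RE is just an illustrative choice.

```python
import re

# Same pattern as in the str.extract call; ROW_RE is a hypothetical name.
ROW_RE = re.compile(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$')

match = ROW_RE.match('The EXPL Haunted House FEAR x 7')
row = match.groupdict()
print(row)
# {'Movie': 'The EXPL Haunted House', 'Category': 'FEAR', 'Rating': '7'}
```

The lazy .*? keeps growing until the rest of the pattern fits, so EXPL stays in the title and only the last token before " x " becomes the category.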
Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d+)
I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)
I am trying to read a file that has the following structure.
Question 1 What is the weather today? Answer 1 It is hot and sunny. Question 2 What day is it today? Answer 2 Thursday Question 3 How many legs does a dog have? Answer 3 Four legs
I want to put the content in a dictionary with questions and answers, so something like this:
dict = {
"What is the weather today?": "It is hot and sunny.",
"What day is it today?": "Thursday",
"How many legs does a dog have?": "Four legs"
}
To find the questions and answers in the text, I created this regular expression:
\s?(Question|Answer)\s\d+\s?(.*)\s?(Question|Answer)\s\d+\s?
When I test this regex against the example, it finds one big match instead of multiple smaller matches. I assume the Question and Answer markers are each needed for two matches, because Question 2, for example, is both the end of the match for Answer 1 and the start of the match for Question 2. How can I get the questions and the answers themselves correctly, so that I can put them in a dictionary (including the last answer, after which no new 'Question X' follows), as shown in the example dictionary?
If there is a question followed by an answer, you don't have to use the alternation |, but you can first match Question and then match Answer
\bQuestion\s+\d+\s+(\S.*?)\s+Answer\s+\d+\s+(\S.*?)\s*(?=Question|$)
\bQuestion\s+\d+\s+ Match Question followed by 1+ digits between whitespace chars
(\S.*?) Capture group 1, match at least a single non whitespace char
\s+Answer\s+\d+\s+ Match Answer followed by 1+ digits between whitespace chars
(\S.*?) Capture group 2, match at least a single non whitespace char
\s*(?=Question|$) Match optional whitespace char asserting either another question to the right or the end of the string in case of the last question
Then you could for example use re.findall to get the group 1 and group 2 values and fill a dictionary.
import re

qa = {}  # avoid naming this `dict`, which would shadow the built-in type
regex = r"\bQuestion\s+\d+\s+(\S.*?)\s+Answer\s+\d+\s+(\S.*?)\s*(?=Question|$)"
s = "Question 1 What is the weather today? Answer 1 It is hot and sunny. Question 2 What day is it today? Answer 2 Thursday Question 3 How many legs does a dog have? Answer 3 Four legs"
for m in re.findall(regex, s):
    # m is a (question, answer) tuple built from groups 1 and 2
    qa[m[0]] = m[1]
print(qa)
Output
{'What is the weather today?': 'It is hot and sunny.', 'What day is it today?': 'Thursday', 'How many legs does a dog have?': 'Four legs'}
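Since re.findall returns a list of (question, answer) tuples when the pattern has two groups, the result can also be passed straight to the dict constructor. This is a sketch of that variant; the names QA_RE and parse_qa are hypothetical, and the re.DOTALL flag is an added assumption for files where a question or answer spans several lines.

```python
import re

# Hypothetical names; the pattern is the one from the answer above.
QA_RE = re.compile(
    r"\bQuestion\s+\d+\s+(\S.*?)\s+Answer\s+\d+\s+(\S.*?)\s*(?=Question|$)",
    re.DOTALL,  # assumption: let "." cross line breaks in multi-line files
)


def parse_qa(text):
    """Return a {question: answer} dict built from the (group 1, group 2) pairs."""
    return dict(QA_RE.findall(text))


s = ("Question 1 What is the weather today? Answer 1 It is hot and sunny. "
     "Question 2 What day is it today? Answer 2 Thursday "
     "Question 3 How many legs does a dog have? Answer 3 Four legs")
print(parse_qa(s))
```

Note that identical question texts would overwrite each other, as in any dictionary keyed by the question.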
I want to capture the bold parts (the quantity followed by its unit, e.g. 450 gr) in the sentences below.
Examples:
SmellNice Coffee 450 gr
Clean 2 k Rice
LukaLuka 1,5lt cold drink
Jumbo 7 gutgut eggs 12'li
Espresso 5 Klasik 10 Ad
The expression below works well except for the last two.
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)
I added \s|$ to the end of the expression, thinking that if the unit is not the last word, there should be a space after it. But it didn't work. Briefly, how can I capture all the bold expressions?
It works if you wrap the alternation in parentheses:
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)(\s+|$)
x2 = (
    r"\d+"      # one or more digits
    r"[,'\s]"   # a comma, apostrophe, or whitespace
    r"[\d\s]?"  # optionally one more digit or whitespace
    r"(gr|g|kg|k|adet|[Aa]d|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)"  # all the units to look for
    r"(\s+|$)"  # must be followed by whitespace or the end of the line
)
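Another way to express "the unit must end here" is a word boundary after the unit instead of consuming trailing whitespace. The sketch below is one possible variant rather than the answerers' code: UNIT_RE is a made-up name, the decimal part is tightened to (?:[.,]\d+)?, and re.IGNORECASE is an added assumption so that "10 Ad" matches.

```python
import re

# Hypothetical pattern name; the unit list is copied from the question.
UNIT_RE = re.compile(
    r"\d+(?:[.,]\d+)?"      # integer or decimal quantity such as 450 or 1,5
    r"\s*[’']?\s*"          # optional apostrophe, with optional spaces around it
    r"(?:gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)"
    r"\b",                  # the unit must not run on into a longer word
    re.IGNORECASE,
)

examples = [
    "SmellNice Coffee 450 gr",
    "Clean 2 k Rice",
    "LukaLuka 1,5lt cold drink",
    "Jumbo 7 gutgut eggs 12'li",
    "Espresso 5 Klasik 10 Ad",
]
results = [UNIT_RE.findall(text) for text in examples]
print(results)
# [['450 gr'], ['2 k'], ['1,5lt'], ["12'li"], ['10 Ad']]
```

The boundary is what rejects "7 gutgut" and "5 Klasik": a "g" or "K" followed by more letters is not the end of a word.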
I need to filter a text and select only a few terms from it.
For example, I have this sample text:
ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity
token = re.compile('a90[0-9][0-9][0-9][0-9][0-9]| [$][\d]+ |')
re.findall(token, filetext)
I get 'a9000006' and '$225024', but I do not know how to write a regex for the three uppercase letters right after "NSF Org :" (here "DMI") and for all the text after "Abstract :".
If you want to create a single regex which will match each of those 4 fields with explicit checks on each, then use this regex: :\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)
>>> token = re.compile(r':\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)', re.DOTALL) # flag needed
>>> re.findall(token, filetext)
['a9000006', 'DMI', '$225024', 'This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization, \n chemical stability and low viscosity']
>>>
However, since you're searching for all of them at the same time, it would be better to use a single pattern that matches all 4 fields together and generically.
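One way to match the four fields together is to anchor each value to its label and use named groups. This is a minimal sketch under the assumption that the labels always appear in this order and with this spelling; FIELD_RE and the group names are invented for illustration, and filetext is a copy of the sample from the question.

```python
import re

filetext = """ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity"""

# Hypothetical pattern: each value is tied to the label that precedes it.
FIELD_RE = re.compile(
    r"ID\s*:\s*(?P<id>\S+)\s+"
    r"NSF Org\s*:\s*(?P<org>[A-Z]{3})\s+"
    r"Total Amt\.\s*:\s*(?P<amount>\$\d+)\s+"
    r"Abstract\s*:\s*(?P<abstract>.*)",
    re.DOTALL,  # the abstract may continue over several lines
)

fields = FIELD_RE.search(filetext).groupdict()
print(fields["id"], fields["org"], fields["amount"])
# a9000006 DMI $225024
```

Tying each value to its label avoids accidental hits, for example an abstract that happens to start with three uppercase letters.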
This must do the job.
: .*
How can I remove noise from the edges of a word (or sequence of words)? By noise I mean: 's, 're, ., ?, ,, ;, etc. In other words, punctuation and abbreviations. But it should be removed only from the left and right edges; noise within the string should remain.
examples:
Apple. Apple
Donald Trump's Trump
They're They
I'm I
¿Hablas espanol? Hablas espanhol
$12 12
H4ck3r H4ck3r
What's up What's up
So basically, remove apostrophes, verb abbreviations and punctuation, but only at the string edges (right/left). It seems strip doesn't work with full matches, and I couldn't find a suitable re method that works only on the edges.
What about
import re
strings = ['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"]
rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in strings for m in [rx.search(string)] if m]
print(filtered)
Yielding
['Apple', 'Trump', 'They', 'I', 'Hablas', '12', 'H4ck3r']
Instead of eating something away from the left or right, it simply takes the first match of word characters (\w, i.e. letters, digits and the underscore; in Python 3 this also covers non-ASCII letters, not just [a-zA-Z0-9_]).
To apply it "in the wild", you could split the sentence first, like so:
sentence = "Apple. Trump's They're I'm ¿Hablas $12 H4ck3r"
rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in sentence.split() for m in [rx.search(string)] if m]
print(filtered)
This obviously yields the same list as above.
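Taking the first run of word characters changes multi-word inputs such as "What's up", so here is a sketch of the other route: deleting noise only at the two edges. The names EDGE_RE and strip_edges are hypothetical, and the list of contraction suffixes (s, re, m, ve, ll, d) is an assumption that would need adjusting for real data.

```python
import re

# Hypothetical pattern: leading non-word characters, or an optional
# contraction suffix plus trailing non-word characters at the very end.
EDGE_RE = re.compile(r"^\W+|(?:['’](?:s|re|m|ve|ll|d))?\W*$")


def strip_edges(text):
    """Remove punctuation and contraction suffixes from both ends only."""
    return EDGE_RE.sub("", text)


samples = ["Apple.", "Donald Trump's", "They're", "I'm",
           "¿Hablas espanol?", "$12", "H4ck3r", "What's up"]
cleaned = [strip_edges(sample) for sample in samples]
print(cleaned)
# ['Apple', 'Donald Trump', 'They', 'I', 'Hablas espanol', '12', 'H4ck3r', "What's up"]
```

The apostrophe in "What's up" survives because the suffix alternative only applies when nothing but non-word characters follows it up to the end of the string.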
Use pandas:
import pandas as pd
s = pd.Series(['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"])
s.str.extract(r'(\w+)')
Output:
0 Apple
1 Trump
2 They
3 I
4 Hablas
5 12
6 H4ck3r
Name: 0, dtype: object