Python: generic regex capture for 3 cases - python

Hi, can anyone help me improve my regular expression? It is not working.
String cases:
1) 120 lbs and is intended for riders ages 8 years and up. #catch : 8 years and up
2) 56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up. #catch : 12 and up
3) 4 users recorded speech for effective use language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above. #catch : 6 and above
I want a generic regular expression that works for all three strings.
My regular expression is :
\b\d+[\w+\s]?(?:\ban[a-z]\sup\b|\ban[a-z]\sabove\b|\ban[a-z]\sold[a-z]*\b|\b&\sup)
But it does not work well. Can anyone provide a generic regular expression that works for all 3 cases? I am using Python's re.findall().

Make it a habit and start with verbose regular expressions:
import re
rx = re.compile(r'''
ages\ # look for ages
(\d+(?:\ years)?\ and\ (?:above|up)) # capture the number, optionally "years",
# then "and" followed by one of above or up
''', re.VERBOSE)
string = '''
1) 120 lbs and is intended for riders ages 8 years and up. #catch : 8 years and up
2) 56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up. #catch : 12 and up
3) 4 users recorded speech for effective use language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above. #catch : 6 and above
'''
matches = rx.findall(string)
print(matches)
# ['8 years and up', '12 and up', '6 and above']
See a demo on ideone.com as well as on regex101.com.

(As the suggestion I made in a comment appears to have been what you wanted, I offer it as an answer.)
If your examples illustrate all possible strings (but I fear they don't ;) you could do it as simply as
\d+[^\d]*$
See it here at regex101.
It matches the last number, and everything after it.
Or a little bit more sophisticated - making sure it's preceded by age - here
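As a rough check of the last-number idea, the sketch below applies \d+[^\d]*$ to each sample string separately. It assumes the strings do not carry the "#catch" annotations shown in the question, and it strips the trailing period afterwards; the variable names are illustrative only.

```python
import re

# Sample strings from the question, without the "#catch" annotations (assumption).
samples = [
    "120 lbs and is intended for riders ages 8 years and up.",
    "56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up.",
    "4 users recorded speech for effective use language tutor pod measures "
    "11l x 9w x 5h inches recommended for ages 6 and above.",
]

# Last run of digits plus everything after it, up to the end of the string.
last_number = re.compile(r"\d+[^\d]*$")

results = []
for s in samples:
    m = last_number.search(s)
    if m:
        # Drop the sentence-ending period that the pattern also picks up.
        results.append(m.group(0).rstrip("."))

print(results)
```

This only works while the age is the last number in each string, which is the caveat the answer itself raises.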

Related

Regex split the string at \n but skip the first one if it is \n\n

I want to split some strings on Python by separating at \n and use them in that format, but some of those strings have unexpected newlines and I want to ignore them.
TO CLARIFY: Both examples have only one string.
For example this is a regular string with no unexpected newlines:
Step 1
Cut peppers into strips.
Step 2
Heat a non-stick skillet over medium-high heat. Add peppers and cook on stove top for about 5 minutes.
Step 3
Toast the wheat bread and then spread hummus, flax seeds, and spinach on top
Step 4
Lastly add the peppers. Enjoy!
but some of them are like this:
Step 1
Using a fork, mash up the tuna really well until the consistency is even.
Step 2
Mix in the avocado until smooth.
Step 3
Add salt and pepper to taste. Enjoy!
I have to say I am new at regex and if the solution is obvious, please forgive
Edit: Here is my regex
stepOrder = []
# STEPS
txtSteps = re.split("\n", directions.text)
listOfLists = [[] for i in range(len(txtSteps)) if i % 2 == 0]
for i in range(len(listOfLists)):
    listOfLists[i] = [txtSteps[i*2], txtSteps[i*2+1]]
recipe["steps"] = listOfLists
print(listOfLists)
directions.text is every one of these examples I gave. I can share what it is too, but I think it's irrelevant.
You can solve this problem by matching with the following regex:
(?<=\d\n).*
Basically, it matches the rest of a line (.*) when that line is preceded by one digit (\d) and one newline character (\n).
Check the regex demo here.
Your whole Python snippet then becomes simplified as follows using the re.findall method:
# STEPS
steps = re.findall(r"(?<=\d\n).*", directions.text)
out = [[{'order':i+1, 'step': step}] for i, step in enumerate(steps)]
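A minimal, self-contained sketch of the lookbehind approach follows. The directions_text variable is a stand-in for directions.text, filled with a shortened version of the well-formed example from the question, and the output is flattened to one dict per step for readability; both are assumptions made for illustration.

```python
import re

# Stand-in for directions.text (shortened example from the question).
directions_text = (
    "Step 1\n"
    "Cut peppers into strips.\n"
    "Step 2\n"
    "Lastly add the peppers. Enjoy!"
)

# Match the rest of any line that directly follows "<digit><newline>".
steps = re.findall(r"(?<=\d\n).*", directions_text)

# One dict per step, numbered from 1.
out = [{"order": i + 1, "step": step} for i, step in enumerate(steps)]
print(out)
```

Note that if a blank line directly follows "Step N", this pattern yields an empty string for that step, so such input would still need extra handling.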
with open("your_file_name") as f:
    content = f.read()
for line in content.split("\n"):
    if re.match("^&", line):
        continue
    print(line)

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract all the words just before 'Paid', in a single group if need be. (I will read more about groups, but for now I need a single group that I can later split to get the individual words.)
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex for this? Also, is there a good regex tutorial I can read?
For now I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (including the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
#('-75.76', 'ASIA DIRECT to ')
#('-400.00', 'PTE AXS to ')
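The question mentions splitting the second group into individual words later. A minimal sketch of that follow-up step, using a shortened copy of the statement text as the input (the text and rows names are illustrative):

```python
import re

# Shortened sample of the statement text from the question.
text = (
    "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
    "+100.00 3257 UpAmex Top PM 9:55 "
    "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57"
)

pattern = re.compile(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid")

# findall returns (amount, words) tuples; split() also discards the trailing space.
rows = [(amount, words.split()) for amount, words in pattern.findall(text)]
for row in rows:
    print(row)
```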
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything up to "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Multipart Regex: Mix of exact and non-exact phrases

I am building a ML training dataset from a corpus using some chemical named entities.
The reason I mention the chemical context is just to assure that this is a realistic example of what I am dealing with, not a made up one.
In doing so, I need a regex expression that has the following structure:
1 - Starts by the chemical formula string "2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)"
2 - followed by 0 up to 15 characters
3 - followed by the chemical code string "298-83-9"
4 - followed by 0 up to 15 characters
5 - followed by a non-alphanumerical character
6 - followed by the string "5"
7 - ends with a non-alphanumerical value.
The reason that I added the non-alphanumerical requirements #5 and #7 is that the text in which the regex search is to be performed is a long messy text and I wanted to ensure that the string "5" is not part of another entity such as these two examples: "bluh bluh 298-83-9 bluh bluh 564" or "bluh bluh 298-83-9 bluh bluh 645".
The way I approached was building an expression like the following:
reg_exp = name_entity[0] + r".{0,15}\s*" + name_entity[1] + r".{0,15}\s*" + r"[^a-zA-Z\d]+" + name_entity[2] + r"[^a-zA-Z\d]+"
where name_entity is the array that contains the strings in requirements 1, 3, and 6.
However, the issue is that the chemical formula and code in requirements 1 and 3 contain so many brackets, hyphens, and other special characters that my expression does not work. I need a way to tell the regex engine that the name_entity elements are to be treated as exact literal phrases, not as regex syntax.
In case it matters, I am coding in Python.
I would appreciate your help. Here is a portion of the multi-page text that the regex is intended to search. The part that my Python code re.findall(reg_exp, text) should find is bolded:
"composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
There are a few issues here, but it works with the following code:
import re

def new_regex(entity):
    return fr"{re.escape(entity[0])}.{{0,15}}\s*{re.escape(entity[1])}.{{0,15}}\s*[^a-zA-Z\d]+{re.escape(entity[2])}[^a-zA-Z\d]+"
entity = [
"2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2)",
'298-83-9',
'5'
]
n = "composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
regex = new_regex(entity)
re.findall(regex, n)  # new_regex returns a pattern string, so call re.findall on it
# ["2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 "]
This was fixed by using re.escape, as well as by fixing a few issues with whitespace in your chemical formula. However, you will likely want to change your entities to handle whitespace better.
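One possible way to handle the whitespace differences is sketched below. It is an assumption about what "handle whitespace better" could look like, not the answerer's method. The hypothetical helper literal_with_flexible_whitespace escapes every non-space character with re.escape and allows optional whitespace between characters, so the formula as typed in the question still matches the text with its stray spaces.

```python
import re

def literal_with_flexible_whitespace(phrase):
    """Escape each non-space character and allow optional whitespace between them."""
    return r"\s*".join(re.escape(ch) for ch in phrase if not ch.isspace())

# Formula as written in requirement 1 of the question.
formula = ("2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)"
           "bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)")

# Fragment of the sample text, which has extra spaces inside the formula.
text = ("70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)"
        "bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual")

plain = re.search(re.escape(formula), text)
flexible = re.search(literal_with_flexible_whitespace(formula), text)

print(plain)              # None: the spacing differs from the text
print(flexible.group(0))  # the formula as it appears in the text
```

The trade-off is that whitespace is tolerated anywhere inside the phrase, which may be looser than desired for short entities.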

Does anyone know a cleaner way to write this regex?

(?:reminder|Reminder)\s\d+\s\b(?:second|seconds|Second|Seconds|minute|minutes|Minute|Minutes|hour|hours|Hour|Hours|day|days|Day|Days|week|weeks|Week|Weeks|month|months|Month|Months|year|years|Year|Years)\b
Objective format: "Reminder 3 seconds", "Reminder 20 days", "Reminder 3 second" etc
[rR]eminder\s\d+\s(?:[sS]econd|[mM]inute|[hH]our|[dD]ay|[wW]eek|[mM]onth|[yY]ear)s?\b
I think this works. Most of the changes I made were putting the alternative letter cases into character classes. The rest was moving the optional s outside the group. Does this make sense to you?
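A quick check of this character-class version is sketched here; the test strings are made up for illustration. Because only the first letter of each word may vary in case, an input such as "Reminder 3 WEek" is rejected.

```python
import re

pattern = re.compile(
    r"[rR]eminder\s\d+\s(?:[sS]econd|[mM]inute|[hH]our|[dD]ay|[wW]eek|[mM]onth|[yY]ear)s?\b"
)

tests = [
    "Reminder 3 seconds",
    "reminder 20 Days",
    "Reminder 3 second",
    "Reminder 3 WEek",      # mixed case inside the word
    "Reminder 3 secondss",  # extra trailing letter
]

# fullmatch requires the whole string to match the pattern.
results = {t: bool(pattern.fullmatch(t)) for t in tests}
for t, ok in results.items():
    print(t, "->", ok)
```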
I'm guessing that maybe fewer boundaries might be OK here,
(?i)\breminder\s+\d+\s+\b(?:seconds?|minutes?|hours?|days?|weeks?|months?|years?)\b
or maybe not.
Demo
Test
import re
expression = r"(?i)\breminder\s+\d+\s+\b(?:seconds?|minutes?|hours?|days?|weeks?|months?|years?)\b"
string = """
Reminder 3 seconds some data here, Reminder 20 days and some more data, Reminder 3 second and Reminder 3 WEek
"""
print(re.findall(expression, string))
Output
['Reminder 3 seconds', 'Reminder 20 days', 'Reminder 3 second', 'Reminder 3 WEek']

How to write regular expression for all text after ":" [duplicate]

I need to filter the sentence and select only a few terms from the whole sentence.
For example, I have sample text:
ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity
token = re.compile('a90[0-9][0-9][0-9][0-9][0-9]| [$][\d]+ |')
re.findall(token, filetext)
I get 'a9000006','$225024', but I do not know how to write a regex for the three upper-case letters right after "NSF Org:", which is "DMI", and for all the text after "Abstract:"
If you want to create a single regex which will match each of those 4 fields with explicit checks on each, then use this regex: :\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)
>>> token = re.compile(r':\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)', re.DOTALL) # flag needed
>>> re.findall(token, filetext)
['a9000006', 'DMI', '$225024', 'This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization, \n chemical stability and low viscosity']
>>>
However, since you're searching for all of them at the same time, it would be better to use one that matches all 4 fields together and generically, such as the one in this answer here.
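The linked answer is not reproduced here, so the following is only one possible generic approach, sketched under assumptions: every field sits on its own line as "name : value", and continuation lines of a value contain no colon. The field pattern and fields dict are illustrative names.

```python
import re

filetext = """ID: a9000006
NSF Org : DMI
Total Amt. : $225024
Abstract :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,
chemical stability and low viscosity"""

field = re.compile(
    r"^([^:\n]+?)[ \t]*:[ \t]*"    # field name, then the colon
    r"(.*(?:\n(?![^:\n]*:).*)*)",  # value, plus following lines that have no colon
    re.MULTILINE,
)

# Each match is a (name, value) pair, so the result converts directly to a dict.
fields = dict(field.findall(filetext))
for name, value in fields.items():
    print(repr(name), "->", repr(value))
```

A continuation line that happens to contain a colon would be treated as a new field, so real data may need a stricter rule for what counts as a field name.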
This must do the job.
: .*
You can check this here.
