Regex pattern match for a given string - python

I'm working on a project that requires extracting all the case numbers from a given string. Can anyone please help me create a regex that matches all of them?
The pattern is: an alphanumeric part, followed by /, another alphanumeric part, followed by /, and a final alphanumeric part.
*Housekeeping Services For the period( 1‐03‐2020 to 31‐03‐2020) ‐ HDC ‐5i
SL.NO HSN/SAC
Code UOM
Facility
Approved
HC
Total Billing
Hours
Actual Manpower
HC
Unit Rate Per
Month Taxable Value
1 HK Supervisor 9985 HR 4 832 4.00 18,644.00 7 4,576.00*
Case no.**MH20/00285/VAS**
Case no. **MH20/00294/GVN1**
Case no. **MH20/000026/MUMR**
Case no. **KA20/00346/BN**
Case no. **DL20/0024/DLH39**
Case no. **MH20/003B30/GUR2**
Case no. **GJ20/001A75/GJ**
Case no. **GJ20/001A77/GJ**
Case no. **MH20/002CK89/GVN1**
*3,15,962.69
2 8,436.64
2 8,436.64
3,72,836.00
AMOUNT IN WORDS:‐ Rupees Three Lakhs Seventy Two Thousand Eight Hundred Thirty Six Only*

This one should do the job:
[\d\w]{4}/[\d\w]+/[\d\w]+
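As a rough sketch of how such a pattern could be applied in Python, the snippet below runs re.findall over a few lines taken from the sample. The names CASE_NO and sample are illustrative only, and the character class is narrowed to [A-Za-z0-9] on the assumption that underscores (which \w also matches) never occur in a case number.

```python
import re

# A few lines from the sample invoice text in the question.
sample = """SL.NO HSN/SAC
Case no.**MH20/00285/VAS**
Case no. **DL20/0024/DLH39**
Case no. **MH20/002CK89/GVN1**
*3,15,962.69"""

# Four alphanumerics, a slash, one or more alphanumerics, a slash,
# and one or more alphanumerics.
CASE_NO = re.compile(r"[A-Za-z0-9]{4}/[A-Za-z0-9]+/[A-Za-z0-9]+")

case_numbers = CASE_NO.findall(sample)
print(case_numbers)
# ['MH20/00285/VAS', 'DL20/0024/DLH39', 'MH20/002CK89/GVN1']
```

"HSN/SAC" is skipped because it contains only one slash.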

Related

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract all the words just before 'Paid', in a single group if need be. (I will read more about groups, but for now I need a single group, which I can later split to get the individual words.)
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would the regex be? Also, is there a good regex tutorial I can read?
For now, I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (which also includes the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# your_input_string holds the sentence from the question
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
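Since the question mentions splitting the second group into words later, here is a minimal sketch of that follow-up step. It uses the same regex, with named groups added for readability; the names amount, words, statement and rows are assumptions made for this example, and the input is a shortened copy of the sentence from the question.

```python
import re

# Shortened version of the sentence from the question.
statement = (
    "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
    "+100.00 3257 UpAmex Top PM 9:55 "
    "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time"
)

# Same pattern as above, with named groups instead of numbered ones.
pattern = re.compile(r"(?P<amount>-\d+\.?\d+) \w+ \w+ (?P<words>[\w ]+)?Paid")

rows = []
for match in pattern.finditer(statement):
    # The words group is optional, so fall back to an empty string.
    words = (match.group("words") or "").split()
    rows.append((match.group("amount"), words))

print(rows)
# [('-75.76', ['ASIA', 'DIRECT', 'to']), ('-400.00', ['PTE', 'AXS', 'to'])]
```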
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything and end with "Paid". Try to create a regex that does that.
If you'd like to brush up on regex, check out Regex101. It's a web tool for testing regexes, with a debugger and a cheat sheet.

Multipart Regex: Mix of exact and non-exact phrases

I am building a ML training dataset from a corpus using some chemical named entities.
The reason I mention the chemical context is just to assure that this is a realistic example of what I am dealing with, not a made up one.
In doing so, I need a regex expression that has the following structure:
1 - Starts by the chemical formula string "2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)"
2 - followed by 0 up to 15 characters
3 - followed by the chemical code string "298-83-9"
4 - followed by 0 up to 15 characters
5 - followed by a non-alphanumerical character
6 - followed by the string "5"
7 - ends with a non-alphanumerical value.
The reason that I added the non-alphanumerical requirements #5 and #7 is that the text in which the regex search is to be performed is a long messy text and I wanted to ensure that the string "5" is not part of another entity such as these two examples: "bluh bluh 298-83-9 bluh bluh 564" or "bluh bluh 298-83-9 bluh bluh 645".
The way I approached was building an expression like the following:
reg_exp = name_entity[0] + r".{0,15}\s*" + name_entity[1] + r".{0,15}\s*" + r"[^a-zA-Z\d]+" + name_entity[2] + r"[^a-zA-Z\d]+"
where name_entity is the array that contains the strings in requirements 1, 3, and 6.
However, the issue is that the chemical formula and code in requirements 1 and 3 contain so many special characters (brackets, parentheses, hyphens, etc.) that my expression does not work. I need a way to tell the regex engine to treat the name_entity elements as exact literal phrases, not as regex syntax.
In case it matters, I am coding in Python.
I would appreciate your help. Here is a portion of the multi-page text that contains what the regex is intended to find. The part that my Python code re.findall(reg_exp, text) should find is bolded:
"composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
There are a few issues here, but the following code works:
import re
def new_regex(entity):
    # re.escape makes each entity match literally; {{ and }} are literal braces inside an f-string
    return fr"{re.escape(entity[0])}.{{0,15}}\s*{re.escape(entity[1])}.{{0,15}}\s*[^a-zA-Z\d]+{re.escape(entity[2])}[^a-zA-Z\d]+"
entity = [
"2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2)",
'298-83-9',
'5'
]
n = "composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
regex = new_regex(entity)
re.findall(regex, n)  # regex is a plain string, so it is passed to re.findall
# ["2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 "]
This was fixed by using re.escape, and by fixing a few whitespace mismatches between your chemical formula and the text. However, you will likely want to change your entities so that they handle whitespace more robustly.
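One possible way to make the entities tolerant of stray whitespace is sketched below. It is not part of the answer above: the helper flexible_literal is a hypothetical name, and the approach assumes it is acceptable for whitespace to appear (or be missing) between any two characters of an entity. The formula is written as in the question, without the extra spaces that occur in the text.

```python
import re

def flexible_literal(literal):
    # Hypothetical helper: drop whitespace from the literal, escape every
    # remaining character, and allow optional whitespace between characters.
    chars = [ch for ch in literal if not ch.isspace()]
    return r"\s*".join(re.escape(ch) for ch in chars)

def new_regex(entity):
    name, code, value = (flexible_literal(e) for e in entity)
    # Same overall structure as the expression in the question.
    return re.compile(
        name + r".{0,15}\s*" + code + r".{0,15}\s*[^a-zA-Z\d]+" + value + r"[^a-zA-Z\d]+"
    )

entity = [
    "2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)"
    "bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)",
    "298-83-9",
    "5",
]
text = (
    "methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, "
    "2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)"
    "-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration"
)

matches = new_regex(entity).findall(text)
print(matches)
```

The trade-off is that the pattern becomes more permissive: it would also accept, for example, a CAS number with a space inside it.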

how to use positive and negative look ahead for multiple terms in Python?

I have a data frame as shown below:
df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
'department 6789 DOSE 8000 units on DIALYSIS days - IV Interm',
'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
'insulin MixTARD 30/70 - inJECTable 46 units',
'insulin ISOPHANE -- InsulaTARD Vial - inJECTable 56 units SC SubCutaneous',
'1-alfacalcidol DOSE 1 mcg - 3 times a week - IV Intermittent',
'jevity liquid - FEEDS PO Jevity - 237 mL - 1 times per day',
'1-alfacalcidol DOSE 1 mcg - 3 times per week - IV Intermittent',
'1-supported DOSE 1 mcg - 1 time/day - IV Intermittent',
'1-testpackage DOSE 1 mcg - 1 time a day - IV Intermittent']})
I would like to remove the words/strings which follow patterns such as 46 units, 3 times a week, 3 times per week, 1 time/day etc.
I was reading about positive and negative lookahead and lookbehind.
So I was trying something like the following:
[^([0-9\s]*(?=units))] #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants
time variants ex: 3 times a day, 3 time/wk, 3 times per day, 3 times a month, 3 times/month etc.
Basically, I expect my output to be something like below (remove terms like xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, xx times per week, etc)
You can consider a pattern like
\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))
See the regex demo
NOTE: the \d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers?, for example.
Pattern details
\s* - zero or more whitespace chars
\d+ - one or more digits
\s* - zero or more whitespaces
(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
units? - unit or units
| - or
times? - time or times
(?:\s+(?:a|per)\s+|\s*/\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day, or wk or week, or month, or y, yr, yea or year
If you need to match whole words only, use word boundaries, \b:
\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b
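In Pandas, use the following (regex=True is needed in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '', regex=True)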
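To check the pattern without pandas, the following sketch applies it with re.sub to a few strings taken from the question's data frame. The names DOSE_FREQ, samples and cleaned are illustrative and not from the answer.

```python
import re

# The word-boundary version of the pattern, split over two lines for readability.
DOSE_FREQ = re.compile(
    r"\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)"
    r"(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b"
)

samples = [
    "insulin MixTARD 30/70 - inJECTable 46 units",
    "1-alfacalcidol DOSE 1 mcg - 3 times a week - IV Intermittent",
    "1-supported DOSE 1 mcg - 1 time/day - IV Intermittent",
    "inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day",
]

cleaned = [DOSE_FREQ.sub("", s) for s in samples]
for line in cleaned:
    print(line)
```

Only the quantity phrases are removed, so the dash separators around them stay in place (for example "1 mcg - - IV Intermittent"); a second clean-up pass would be needed if those should go as well.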

Extracting numerical values from a string with at most 6 digits with optional 2 digits for decimal

I have a task in which I need to extract numerical values from a text. However, I am only interested in values that have at most 6 digits, with an optional decimal part.
For example, from the below text:
Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000.
I need to extract
["650,000", "2018", "8.78", "74,000", "56000"]
from this.
The regex I am using:
((\d{1,3})(?:,[0-9]{3}){0,1}|(\d{1,6}))(\.\d{1,2})?
It is correctly identifying 650,000 and 74,000 but doesn't identify others correctly.
I found this 7 digit money regex and worked around it to make one for 6 digit but wasn't successful. How do I correct my regex?
Try this : (?<![\d,.])(?:\d,?){0,5}\d(?:\.\d+)?(?!,?\d)
Here's a detailed explanation:
(?x) # flag for verbose mode: whitespace and comments are ignored
# Make sure not to start in the middle of a number: no digit, comma or dot before the match
(?<![\d,.])
# k-1 digits, each optionally followed by a comma. For simplicity the comma positions are not checked, so be aware that strings like 5,4,3,2 are also matched
(?:\d,?){0,5}
# The kth digit
\d
# Optional dot and decimal part
(?:\.\d+)?
# Make sure not to stop in the middle of a bigger number: no digit after the match. A comma is allowed, but only as a grammatical comma, so comma+digit is forbidden
(?!,?\d)
There is room for improvement, but I think this is what you wanted. Some cases might not be handled; tell me if you find any.
Test it here : https://regex101.com/r/Wxi5Sj/2
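For reference, a minimal sketch of how the commented pattern could be used from Python with the re.VERBOSE flag, run on the text from the question. The names NUMBER, text and values are chosen for this example only.

```python
import re

NUMBER = re.compile(r"""
    (?<![\d,.])      # not preceded by a digit, comma or dot
    (?:\d,?){0,5}    # up to five digits, each optionally followed by a comma
    \d               # the last digit
    (?:\.\d+)?       # optional decimal part
    (?!,?\d)         # not followed by a digit or by comma+digit
""", re.VERBOSE)

text = (
    "Total compensation for Mr. XYZ was $5,123,456 and other salary which was "
    "$650,000 in fiscal 2018, was determined to be approximately 8.78 times the "
    "median annual compensation for all of the firm's other employees, which was "
    "approximately $74,000. Some other salaries are 56000."
)

values = NUMBER.findall(text)
print(values)
# ['650,000', '2018', '8.78', '74,000', '56000']
```

The 7-digit amount $5,123,456 is rejected as a whole because no 6-digit prefix of it ends at a position allowed by the final lookahead.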
Try the code below:
import re
text = "Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000. "
print(re.findall(r'(?<=\s)\$?\d{0,3}\,?\d{1,3}(?:\.\d{2})?(?!,?\d)', text))
Output (note that the leading $ is kept):
['$650,000', '2018', '8.78', '$74,000', '56000']

regex python find dollar amount and few words at the same time

I need to find a dollar amount and a few (3 or 4) words surrounding that amount at the same time in one paragraph.
in-process research and development of $184.3 million and charges $120 of
million for the impairment of long-lived assets. See Notes 2, 16 and21 to the
Consolidated Financial Statements. Income from continuingoperations for the
fiscal year ended September 30, 2001 also includes a netgain on sale of
businesses and investments of $276.6 million and a net gainon the sale of
common shares of a subsidiary of $64.1 million.
What I want to get is something like below,
[amount, amount+ digit words, 3-4 words after to before amount].
[$184.3 $184.3 million, research and development of $184.3 million],[$120, $120 of million,charges $120 of
million for the impairment of long-lived assets ], [$276.6, $276.6 million, investments of $276.6 million] ,[ $64.1, $64.1 million, a subsidiary of $64.1 million.]
This is what I tried, and it only finds the dollar amount:
[\$]{1}\d+\.?\d{0,2}
Thanks!
So let's name the pattern you have:
amount_patt = r"[\$]{1}[\d,]+\.?\d{0,2}"
Digit word should be then defined using the above:
digit_word_patt = amount_patt + r" (\w+)"
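Now, for the surrounding 3-4 words, do the following (the quantifier must be written {3,4} without a space, and each word after the amount is preceded by its space):
words_patt = r"(?:\S+ ){3,4}" + amount_patt + r"(?: \S+){3,4}"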
You're done! Now simply use these with your re methods for your string extraction.
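Put together, the three patterns could be used roughly as follows. This is a sketch under some assumptions: the word counts are relaxed to {1,3} so that an amount near the end of the text still matches, the groups are non-capturing so that re.findall returns whole matches, and the sample is a single-line excerpt of the paragraph from the question.

```python
import re

amount_patt = r"[\$]{1}[\d,]+\.?\d{0,2}"
digit_word_patt = amount_patt + r" \w+"
# Assumption: 1 to 3 words on each side instead of 3 to 4.
words_patt = r"(?:\S+ ){1,3}" + amount_patt + r"(?: \S+){1,3}"

text = (
    "a netgain on sale of businesses and investments of $276.6 million "
    "and a net gainon the sale of common shares of a subsidiary of $64.1 million."
)

amounts = re.findall(amount_patt, text)
amounts_with_word = re.findall(digit_word_patt, text)
contexts = re.findall(words_patt, text)

print(amounts)            # ['$276.6', '$64.1']
print(amounts_with_word)  # ['$276.6 million', '$64.1 million']
print(contexts)
# ['and investments of $276.6 million and a', 'a subsidiary of $64.1 million.']
```

Because re.findall returns non-overlapping matches, two amounts that sit within a few words of each other would compete for the words between them.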
