I've written a script in Python to get certain lines from a block of text. I used the re module to do the job. However, it is giving me unnecessary output along with the required lines.
How can I modify my expression so that it sticks to the lines I want to grab?
This is my try:
import re
content = """
A Gross exaggeration,
-- Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Luke gross is an actor
"""
for item in re.finditer(r'Gross(?:[\d\s,]*)',content):
    print(item.group().strip())
Output I'm having:
Gross
Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Output I wish to have:
Gross 4 13,360,023
Gross 2 70,940,02
I changed the regex string to r'(?:^\s*?)Gross[\d\s,]*?(?=,$)' and added the multiline flag (online regex here):
import re
content = """
A Gross exaggeration,
-- Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Luke gross is an actor
"""
for item in re.finditer(r'(?:^\s*?)Gross[\d\s,]*?(?=,$)',content, flags=re.M):
    print(item.group().strip())
Output is:
Gross 4 13,360,023
Gross 2 70,940,02
^\s*Gross[\d ,]*(?=,) will capture what you want, provided you pass the multiline flag (re.M) so that ^ matches at the start of each line.
I just tacked on ^ to signal the start of the line, used \s* to allow optional whitespace before "Gross", and trimmed the , from the end with a lookahead. I also removed \s from your character class because it also matched newlines, and replaced it with a literal space.
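A minimal sketch of that pattern run against the sample text from the question. It assumes re.M is passed so that ^ anchors at each line start; the names pattern and matches are only illustrative.

```python
import re

content = """
A Gross exaggeration,
-- Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Luke gross is an actor
"""

# re.M lets ^ match at the start of every line, not just the start of the string.
pattern = re.compile(r'^\s*Gross[\d ,]*(?=,)', flags=re.M)

# The lookahead (?=,) keeps the trailing comma out of the match.
matches = [m.group().strip() for m in pattern.finditer(content)]
for line in matches:
    print(line)
```

The literal space in the character class keeps a match from running over a line break, which is what \s allowed in the original attempt.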
Demo
I have a huge PDF of invoices that is all very basic text. I need to create a regex or two so that, when I split the PDF into pages, I get the customer number and the invoice number to use in the file name. I am currently using Python 3 and PyPDF2.
Text example of two of the pages:
Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company: (Multiple Companies) Printed by Robert S on 8/11/2022 1:26:46PM
Donna Contact Cust# Name: Customer A 1234
Customer A Invoice Date Invoice Name 8/12/2015 241849
Item Description Qty Price Extended Price
Credit ($810.00) 1 ($810.00) 1
Due Paid Total Total Taxes Subtotal
($810.00) ($810.00) $0.00 ($810.00)
Balance: ($810.00) $0.00 $0.00
8/11/2022 1:26:46PM Page 1 of 340977
Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company: (Multiple Companies) Printed by Robert S on 8/11/2022 1:26:46PM
Customer B Cust# Name: Customer B 45678
Customer B Invoice Date Invoice Name 8/12/2015 241850
Item Description Qty Price Extended Price
credit ($49.99) 1 ($49.99) 1
Due Paid Total Total Taxes Subtotal
($49.99) ($49.99) $0.00 ($49.99)
Balance: ($49.99) $0.00 $0.00
8/11/2022 1:26:46PM Page 2 of 340977
Currently I have these two regexes, which each sort of work, but I do not know how to keep only the last group's match from them.
Note: the firstmatch regex breaks if the customer name has a number in it, which is an edge case but not uncommon in the data.
firstmatch=r"(Name:)(\D*)(\d+)"
secondmatch=r"(Name )(\d*.\d*.\d*..)(\d*)"
Each one is its own page. From the first page I would like the regex to pull 1234 and 241849, and from the second page 45678 and 241850.
You could get both values using a capture group that matches the last digits on the line.
For the first pattern:
\bName:.*?\b(\d+)[^\d\n]*$
Explanation
\bName: Match Name: preceded by a word boundary
.*? Match any character except a newline, as few times as possible
\b(\d+) A word boundary, then capture 1+ digits in group 1
[^\d\n]* Optionally match any characters except digits or a newline
$ End of the string (or end of the line when the multiline flag is set)
Regex demo
For the second pattern you can make it a bit more specific, where [^\S\n]+ matches 1+ whitespace chars without newlines:
\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
Regex demo
Or, if the two lines directly follow each other, you can also use 1 pattern with 2 capture groups and match the newline at the end of the first line:
\bName:.*?\b(\d+)[^\d\n]*\n\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
Regex demo
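A rough sketch of how the two separate patterns could be used on one page of extracted text. It assumes the multiline flag is set so that $ matches at the end of each line; extract_ids and the constant names are hypothetical helpers, not part of PyPDF2.

```python
import re

# Both patterns rely on re.M so that $ matches at the end of each line.
CUSTOMER_RE = re.compile(r'\bName:.*?\b(\d+)[^\d\n]*$', flags=re.M)
INVOICE_RE = re.compile(
    r'\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$', flags=re.M
)

def extract_ids(page_text):
    """Return (customer_number, invoice_number), or None if either is missing."""
    customer = CUSTOMER_RE.search(page_text)
    invoice = INVOICE_RE.search(page_text)
    if customer is None or invoice is None:
        return None
    return customer.group(1), invoice.group(1)

page = """Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company: (Multiple Companies) Printed by Robert S on 8/11/2022 1:26:46PM
Donna Contact Cust# Name: Customer A 1234
Customer A Invoice Date Invoice Name 8/12/2015 241849
Item Description Qty Price Extended Price"""

print(extract_ids(page))
```

Because the first pattern captures the last run of digits on the line, a customer name that itself contains a number should not throw it off.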
I am building a ML training dataset from a corpus using some chemical named entities.
The reason I mention the chemical context is just to assure that this is a realistic example of what I am dealing with, not a made up one.
In doing so, I need a regex expression that has the following structure:
1 - Starts with the chemical formula string "2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)"
2 - followed by 0 up to 15 characters
3 - followed by the chemical code string "298-83-9"
4 - followed by 0 up to 15 characters
5 - followed by a non-alphanumerical character
6 - followed by the string "5"
7 - ends with a non-alphanumerical value.
The reason that I added the non-alphanumerical requirements #5 and #7 is that the text in which the regex search is to be performed is a long messy text and I wanted to ensure that the string "5" is not part of another entity such as these two examples: "bluh bluh 298-83-9 bluh bluh 564" or "bluh bluh 298-83-9 bluh bluh 645".
My approach was to build an expression like the following:
reg_exp = name_entity[0] + r".{0,15}\s*" + name_entity[1] + r".{0,15}\s*" + r"[^a-zA-Z\d]+" + name_entity[2] + r"[^a-zA-Z\d]+"
where name_entity is the array that contains the strings in requirements 1, 3, and 6.
However, the issue is that the chemical formula and code in requirements 1 and 3 contain so many brackets, hyphens, and other special characters that my expression does not work. I need a way to tell the regex engine to treat the name_entity elements as exact literal strings, not as regex syntax.
In case it matters, I am coding in Python.
I would appreciate your help. Here, I copy a portion of the multi-page text that contains what the regex is intended to find. The part that my Python code re.findall(reg_exp, text) should find is bolded:
"composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
There are a few issues here, but it works with the following code:
import re

def new_regex(entity):
    # re.escape makes each entity match literally
    return fr"{re.escape(entity[0])}.{{0,15}}\s*{re.escape(entity[1])}.{{0,15}}\s*[^a-zA-Z\d]+{re.escape(entity[2])}[^a-zA-Z\d]+"
entity = [
"2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2)",
'298-83-9',
'5'
]
n = "composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
regex = re.compile(new_regex(entity))  # new_regex returns a string, so compile it first
print(regex.findall(n))
# ["2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 "]
This was fixed by using re.escape, as well as by fixing a few whitespace differences between your chemical formula and the text. You will likely want to change your entities to handle whitespace better, however.
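One possible way to handle the whitespace, sketched under the assumption that spaces inside the formula carry no meaning, so optional whitespace may be allowed between any two characters. flexible_literal and build_pattern are hypothetical helper names, and the text below is shortened from the sample above.

```python
import re

def flexible_literal(literal):
    # Escape every non-whitespace character and allow optional whitespace
    # between consecutive characters, so line-wrap spaces do not matter.
    return r"\s*".join(re.escape(ch) for ch in literal if not ch.isspace())

def build_pattern(entity):
    name, code, value = (flexible_literal(part) for part in entity)
    return re.compile(
        rf"{name}.{{0,15}}\s*{code}.{{0,15}}\s*[^a-zA-Z\d]+{value}[^a-zA-Z\d]+"
    )

# The formula exactly as written in the question, without the extra spaces.
entity = [
    "2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)",
    "298-83-9",
    "5",
]
text = (
    "methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, "
    "2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)"
    "-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration"
)

match = build_pattern(entity).search(text)
print(match.group(0) if match else None)
```

The trade-off is that this would also match a formula with stray spaces inside a word, which may or may not be acceptable for a training dataset.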
I'm trying to match a pattern with re in python, but I can't seem to get a match no matter how I try.
This is my matching pattern:
def get_report_date(report):
    report_data = {}
    with open(report, 'r') as f:
        report_date = re.findall(f'([Q\d \d\d\d\d\s])', f.read())[0]
        pprint(report_date)
        report_data.update({f"{report_date.replace(' ', '_')}": report})
    return report_data
and a piece of the file I'm trying to match:
(In millions, except number of shares which are reflected in thousands and per share amounts)
See accompanying Notes to Condensed Consolidated Financial Statements.
Apple Inc. | Q2 2018 Form 10-Q | 1 Apple Inc. CONDENSED CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (Unaudited)
I'm trying to scrape the Q2 2018
But I keep getting empty strings.
RegExp: r'(Q\d\s\d+\s)'
Explanation:
r prefix for raw string
Q to match the Q of quarter
\d to match the quarter number afterward
\s to match space
\d+ to match one or more digits, which form the year
\s to match space
Example:
import re
text = """(In millions, except number of shares which are reflected in thousands and per share amounts)
See accompanying Notes to Condensed Consolidated Financial Statements.
Apple Inc. | Q2 2018 Form 10-Q | 1 Apple Inc. CONDENSED CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (Unaudited)"""
x = re.findall(r'(Q\d\s\d+\s)', text)[0]
# prints "Q2 2018 " (the trailing space is captured by the final \s)
print(x)
Code Fix:
import re
from pprint import pprint

def get_report_date(report):
    report_data = {}
    with open(report, 'r') as f:
        # strip() drops the trailing space captured by the final \s
        report_date = re.findall(r'(Q\d\s\d+\s)', f.read())[0].strip()
        pprint(report_date)
        report_data.update({f"{report_date.replace(' ', '_')}": report})
    return report_data
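As an alternative sketch, a stricter pattern with named groups avoids capturing the trailing space at all. This is an assumption about what the reports look like (a quarter digit from 1 to 4 followed by a four-digit year); QUARTER_RE and find_report_key are hypothetical names.

```python
import re

# Hypothetical stricter pattern: a quarter digit 1-4, whitespace, a 4-digit year.
QUARTER_RE = re.compile(r'\bQ(?P<quarter>[1-4])\s+(?P<year>\d{4})\b')

def find_report_key(text):
    """Return a key such as 'Q2_2018', or None when no quarter is found."""
    match = QUARTER_RE.search(text)
    if match is None:
        return None
    return f"Q{match['quarter']}_{match['year']}"

text = "Apple Inc. | Q2 2018 Form 10-Q | 1 Apple Inc."
print(find_report_key(text))
```

Returning None instead of indexing [0] also avoids an IndexError on files that contain no quarter label.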
I'd like to remove the text between the string "Criteria Details" and either "\n{Some number}\n" or "\nPage {Some number}\n". My code is below:
test = re.search(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', input_text)
print(test)
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)
This works on regex101 for the string below, as I can see that the chunk between "Criteria Details" and "88" is detected, but the .search() in my code doesn't return anything, and nothing is replaced in .sub(). Am I missing something?
cyclobenzaprine oral tablet 10 mg, 5 mg,
7.5 mg
PA Criteria
Criteria Details
N/A
N/A
other
N/A
Exclusion
Criteria
Required
Medical
Information
Prescriber
Restrictions
Coverage
Duration
Other Criteria
Age Restrictions Patients aged less than 65 years, approve. Patients aged 65 years and older,
End of the Contract Year
PA does NOT apply to patients less than 65 yrs of age. High Risk
Medications will be approved if ALL of the following are met: a. Patient
has an FDA-approved diagnosis or CMS-approved compendia accepted
indication for the requested high risk medication AND b. the prescriber
has completed a risk assessment of the high risk medication for the patient
and has indicated that the benefits of the requested high risk medication
outweigh the risks for the patient AND c.Prescriber has documented that
s/he discussed risks and potential side effects of the medication with the
patient AND d. if patient is taking conconmitantly a muscle relaxant with
an opioid, the prescriber indicated that the benefits of the requested
combination therapy outweigh the risks for the patient.
Indications
All Medically-accepted Indications.
Off-Label Uses
N/A
88
Updated 06/2020
I would expect the output to be something like
cyclobenzaprine oral tablet 10 mg, 5 mg,
7.5 mg
PA Criteria
Updated 06/2020
You got it, it was just a silly mistake. Change your code to this:
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)
print(input_text)
Where you went wrong is
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE) # This is the necessary replacement well done
test = re.search(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', input_text) # This extracts a pattern which will never be found because you already removed it
print(test) # The result of the previous line which would never be found
Hope this helps! We all have bad days 😀
I figured it out. When using pdfminer to parse the PDF into text, there aren't actually newlines after the page number, but the trailing whitespace gets converted into newlines if I copy and paste the output into the regex website or Stack Overflow. I ended up using \s instead of \n to detect the trailing spaces after the page numbers.
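A small sketch of that change, assuming only the whitespace after the page number varies while the number still starts its own line. The sample string is made up for illustration, and drop_criteria_details is a hypothetical name.

```python
import re

# \s after the page number accepts a space, tab, or newline, since the
# extracted text may not contain a literal "\n" there.
CRITERIA_RE = re.compile(
    r'Criteria Details[\s\S]*?(?:\n[0-9]+\s|\nPAGE [0-9]+\s)',
    flags=re.IGNORECASE,
)

def drop_criteria_details(text):
    # Replace the whole block, page number included, with a single space.
    return CRITERIA_RE.sub(' ', text)

sample = (
    "PA Criteria\nCriteria Details\nN/A\n"
    "Patients aged less than 65 years\nOff-Label Uses\nN/A\n"
    "88 Updated 06/2020"
)
print(drop_criteria_details(sample))
```

Keeping the leading \n matters here: with \s on both sides, the lazy match would stop early at a number inside running text such as "less than 65 years".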
Here's what I want to happen:
input = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17"" 0.00000000,1.000000"
output = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17"" 0.00000000,1.000000"
How can I change the comma (,) to a dot (.) between ""...589,037.17..."" in Python using regex?
Extra: 589,037.17 => 589.037.17
I tried:
print(re.sub(r'(?<=\d),', '.', input))
But my output was:
output = "asdsad,200200-12964.0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17"" 0.00000000,1.000000"
First, don't call a variable input, because it shadows the built-in function input(). Also, the doubled quotes ("") do not put quote characters into the string: Python just concatenates the adjacent string literals into one string.
i = 'asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000'
To solve your specific case, you could match the currency code followed by the 3 digits in the first part of the price before the comma. That works here, but it isn't generic enough for any currency code and any price, because look-behinds in Python's re module must be of fixed width.
print(re.sub(r'(?<=USD \d{3}),', '.', i))
I would use a look-behind for the currency code and space, then capture the first part of the number in a group and put it back with a backreference.
print(re.sub(r'(?<=[A-Z]{3} )(\d+),', r'\1.', i))
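If an amount can contain more than one thousands separator, a replacement function is one option. The sketch below assumes every amount follows a three-letter uppercase currency code and a single space, and that thousands groups are exactly three digits; AMOUNT_RE and dot_thousands are hypothetical names.

```python
import re

# Fixed-width look-behind (3 letters + 1 space), then digit groups joined by commas.
AMOUNT_RE = re.compile(r'(?<=[A-Z]{3} )\d{1,3}(?:,\d{3})+')

def dot_thousands(text):
    # Replace every comma inside the matched amount, however many groups it has.
    return AMOUNT_RE.sub(lambda m: m.group().replace(',', '.'), text)

line = ('asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE '
        'ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000')
print(dot_thousands(line))
```

The commas that separate the CSV fields are left alone because none of them sits inside a number that follows a currency code.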
import re
input = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17"" 0.00000000,1.000000"
print(input)
print(re.sub(r'USD (\d+),(\d+)', r'USD \1.\2', input))
Output:
asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000
asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17 0.00000000,1.000000
You can go through this Search and Replace and this link for documentation on this.