I'd like to remove the text between the string "Criteria Details" and either "\n{some number}\n" or "\nPage {some number}\n". My code is below:
test = re.search(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', input_text)
print(test)
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)
This works on regex101 for the string below (I can see that the chunk between "Criteria Details" and "88" is detected), but the .search() in my code doesn't return anything, and nothing is replaced by .sub(). Am I missing something?
cyclobenzaprine oral tablet 10 mg, 5 mg,
7.5 mg
PA Criteria
Criteria Details
N/A
N/A
other
N/A
Exclusion
Criteria
Required
Medical
Information
Prescriber
Restrictions
Coverage
Duration
Other Criteria
Age Restrictions Patients aged less than 65 years, approve. Patients aged 65 years and older,
End of the Contract Year
PA does NOT apply to patients less than 65 yrs of age. High Risk
Medications will be approved if ALL of the following are met: a. Patient
has an FDA-approved diagnosis or CMS-approved compendia accepted
indication for the requested high risk medication AND b. the prescriber
has completed a risk assessment of the high risk medication for the patient
and has indicated that the benefits of the requested high risk medication
outweigh the risks for the patient AND c.Prescriber has documented that
s/he discussed risks and potential side effects of the medication with the
patient AND d. if patient is taking conconmitantly a muscle relaxant with
an opioid, the prescriber indicated that the benefits of the requested
combination therapy outweigh the risks for the patient.
Indications
All Medically-accepted Indications.
Off-Label Uses
N/A
88
Updated 06/2020
I would expect the output to be something like
cyclobenzaprine oral tablet 10 mg, 5 mg,
7.5 mg
PA Criteria
Updated 06/2020
You got it, just a silly mistake. Change your code to this:
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)
print(input_text)
Where you went wrong is:
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)  # this is the necessary replacement, well done
test = re.search(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', input_text)  # this searches for a pattern that can never be found, because the line above already removed it
print(test)  # so this prints None
Hope this helps! We all have bad days 😀
I figured it out. When using Pdfminer to parse the PDF into text, there aren't actually newlines after the page numbers; they only get converted into newlines when I copy and paste the output into regex101 or Stack Overflow. I ended up using \s instead of \n to match the trailing spaces after the page numbers.
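For anyone hitting the same thing, here is a minimal sketch of the adjusted pattern. The sample string is my assumption of what Pdfminer emits (a trailing space after the page number instead of a newline):
import re
# \s in place of \n tolerates the trailing spaces Pdfminer emits
# after page numbers instead of real newlines
sample = "PA Criteria\nCriteria Details\nN/A\nOther Criteria\n88 \nUpdated 06/2020"
cleaned = re.sub(r'Criteria Details[\s\S]*?(\s[0-9]+\s|\sPAGE [0-9]+\s)',
                 ' ', sample, flags=re.IGNORECASE)
print(cleaned)  # 'PA Criteria\n \nUpdated 06/2020'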
I'm working on parsing string text containing information on university, year, degree field, and whether or not a person graduated. Here are two examples:
ex1 = 'BYU: 1990 Bachelor of Arts Theater (Graduated):BYU: 1990 Bachelor of Science Mathematics (Graduated):UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)'
ex2 = 'UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science Economics (Graduated):UCSD 2010 Master of Science Economics'
What I am struggling to accomplish is to have an entry for each school experience regardless of whether specific information is missing. In particular, imagine I wanted to pull whether each degree was finished from ex1 and ex2 above. When I try to use re.findall I end up with something like the following for ex1:
# Code:
re.findall('[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex1)
# Output:
['Graduated', 'Graduated']
which is what I want: two entries for two Bachelor's degrees. For ex2, however, one of the Bachelor's degrees was unfinished and the text does not contain "(Graduated)", so the output is the following:
# Code:
re.findall('[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex2)
# Output:
['Graduated']
# Desired Output:
['', 'Graduated']
I have tried making the capture group optional, and also including the colon after "Graduated", but am not making much headway. The example I am using is the "Graduated" information, but in principle the more general question remains if there is an identifiable degree but it is missing one or two pieces of information (like graduation year or university). Ultimately I am just looking to have complete information on each degree, including whether certain pieces of information are missing. Thank you for any help you can provide!
You can use the ? quantifier to match "Graduated" (and the opening parenthesis) zero or one times:
re.findall('[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
Output:
>>> re.findall('[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
['', 'Graduated']
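For completeness, the same pattern still behaves as before on ex1, where both Bachelor's degrees were finished:
>>> re.findall('[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex1)
['Graduated', 'Graduated']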
How about this?
[re.sub('[(:)]', '', t) for t in [re.sub('^[^\(]+','', s) for s in re.findall('[A-Z ]+: \d+ Bachelor [^:]+:', ex1)]]
# output ['Graduated', 'Graduated']
[re.sub('[(:)]', '', t) for t in [re.sub('^[^\(]+','', s) for s in re.findall('[A-Z ]+: \d+ Bachelor [^:]+:', ex2)]]
# output ['', 'Graduated']
I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over the lines, and over each word in a line, check whether the word starts with '$', and extract it:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
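As a quick check, the same comprehension run on a two-line excerpt of the listing (my own reduction of the input above) shows the len(word) > 1 guard dropping the bare '$' in "several hundred $":
s = ("it was several hundred $ to purchase after the fact.\n"
     "Asking $2000 obo. PayPal shipped. CONUS.")
print([word[1:] for line in s.split('\n') for word in line.split()
       if word.startswith('$') and len(word) > 1])
# ['2000']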
Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided text is the string you wrote above):
price_start = text.find('$')
price = text[price_start:].split(' ')[0]
That works IF there is only one occurrence, like you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', text)[0]  # first chunk containing '$' and ending in a digit
price = price.replace('$', '')
I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies.
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
i.e.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but I am relatively new to regex in Python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]

for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both contain only dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: e.g. "13721842.php", "revenues.asp", "210002719.html"
you also need to substitute a space for separator characters besides '-', e.g. the '+' (see fourth, "General+News"); a sketch of this follows the list
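A small sketch of that last point (the character class here is my assumption; extend it with whatever other separators show up):
import re
bit = "BP-General+News"
# replace both '-' and '+' with spaces instead of only '-'
print(re.sub(r'[-+]', ' ', bit))  # BP General News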
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
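For comparison, a rough reconstruction of that earlier "trim last" ordering (an assumption based on the description above, reusing regex and urls from the snippet):
for url in urls:
    candidates = [x for x in url.split('/') if '-' in x]
    longest = max(candidates, key=lambda x: x.count('-'))  # dashiest part first
    print(regex.sub('', longest).replace('-', ' '))        # then trim trailing hex/extension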
Since the urls do not follow one consistent pattern (the first and the third urls, which end in '/', are structured differently from the rest), you can handle each case separately.
Using rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':           # 1st and 3rd urls (trailing '/')
        if url.rsplit('/', 2)[1].isdigit():   # 3rd url (trailing page number)
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])          # all the other urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision
I want to report the subject of each sentence, and also extract all its modifiers (e.g. "Donald Trump", not just "Trump"; "(The) average remaining lease term", not just "term").
Here is my test code:
import spacy
nlp = spacy.load('en_core_web_sm')
def handle(doc):
    for sent in doc.sents:
        shownSentence = False
        for token in sent:
            if token.dep_ == "nsubj":
                if not shownSentence:
                    print("----------")
                    print(sent)
                    shownSentence = True
                print("{0}/{1}".format(token.text, token.tag_))
                print([[t, t.tag_] for t in token.children])
handle(nlp('Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink." The average remaining lease term is six years, and Laura Palmer was killed by Bob. Trump added he will sell up soon.'))
The output is below. I'm wondering why I get "which/WDT" as a subject? Is it just model noise, or is it considered correct behaviour? (Incidentally, in my real sentence, which had the same structure, I also got "that/WDT" being marked as a subject.) (UPDATE: If I switch to 'en_core_web_md' then I do get "that/WDT" for my Trump example; that is the only difference switching from the small to the medium model makes.)
I can easily filter them out by looking at tag_; I'm more interested in the underlying reason.
(UPDATE: Incidentally, "Laura Palmer" doesn't get pulled out as a subject by this code, as the dep_ value is "nsubjpass", not "nsubj".)
----------
Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink."
Trump/NNP
[[Donald, 'NNP'], [,, ','], [legend, 'NN'], [,, ',']]
transaction/NN
[[This, 'DT']]
which/WDT
[]
----------
The average remaining lease term is six years, and Laura Palmer was killed by Bob.
term/NN
[[The, 'DT'], [average, 'JJ'], [remaining, 'JJ'], [lease, 'NN']]
----------
Trump added he will sell up soon.
Trump/NNP
[]
he/PRP
[]
(By the way, the bigger picture: pronoun resolution. I want to turn PRPs into the text they refer to.)
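Regarding the nsubjpass update above, a minimal sketch (my own tweak, not part of the code above) that widens the dependency check so passive subjects like "Laura Palmer" are reported too:
import spacy

nlp = spacy.load('en_core_web_sm')

def report_subjects(doc):
    for sent in doc.sents:
        for token in sent:
            # accept passive subjects as well as active ones
            if token.dep_ in ("nsubj", "nsubjpass"):
                print(token.text, token.dep_, [(t.text, t.tag_) for t in token.children])

report_subjects(nlp('Laura Palmer was killed by Bob.'))
# Palmer nsubjpass [('Laura', 'NNP')]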
I have an XML document. I want to pull all the text between the <p> ... </p> tags. Below is an example of the text. The problem is that in a sentence like:
"Because the <emph>raspberry</emph> and.."
the output is "Because theraspberryand...". Somehow, the emph tags are being dropped (which is good) but being dropped in a way that pushes together the adjacent word.
Here is the relevant code I am using:
xml = BeautifulSoup(xml, convertEntities=BeautifulSoup.HTML_ENTITIES)
for para in xml.findAll('p'):
    text = text + " " + para.text + " "
Here is the start of the text, in case the full document helps:
<!DOCTYPE art SYSTEM "keton.dtd">
<art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355">
<fm>
<doctopic>Developmental Biology</doctopic>
<dochead>Inaugural Article</dochead>
<docsubj>Biological Sciences</docsubj>
<atl>Suspensor-derived polyembryony caused by altered expression of
valyl-tRNA synthetase in the <emph>twn2</emph>
mutant of <emph>Arabidopsis</emph></atl>
<prs>This contribution is part of the special series of Inaugural
Articles by members of the National Academy of Sciences elected on
April 30, 1996.</prs>
<aug>
<au><fnm>James Z.</fnm><snm>Zhang</snm></au>
<au><fnm>Chris R.</fnm><snm>Somerville</snm></au>
<fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington,
290 Panama Street, Stanford CA 94305</aff>
</fnr></aug>
<acc>May 9, 1997</acc>
<con>Chris R. Somerville</con>
<pubfront>
<cpyrt><date><year>1997</year></date>
<cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt>
<issn>0027-8424</issn><extent>7</extent><price>2.00/0</price>
</pubfront>
<fn id="FN150"><p>To whom reprint requests should be addressed. e-mail:
<email>crs#andrew.stanford.edu</email>.</p>
</fn>
<abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph>
exhibits a defect in early embryogenesis where, following one or two
divisions of the zygote, the decendents of the apical cell arrest. The
basal cells that normally give rise to the suspensor proliferate
abnormally, giving rise to multiple embryos. A high proportion of the
seeds fail to develop viable embryos, and those that do, contain a high
proportion of partially or completely duplicated embryos. The adult
plants are smaller and less vigorous than the wild type and have a
severely stunted root. The <emph>twn2-1</emph> mutation, which is the
only known allele, was caused by a T-DNA insertion in the 5′
untranslated region of a putative valyl-tRNA synthetase gene,
<it>valRS</it>. The insertion causes reduced transcription of the
<it>valRS</it> gene in reproductive tissues and developing seeds but
increased expression in leaves. Analysis of transcript initiation sites
and the expression of promoter–reporter fusions in transgenic plants
indicated that enhancer elements inside the first two introns interact
with the border of the T-DNA to cause the altered pattern of expression
of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The
phenotypic consequences of this unique mutation are interpreted in the
context of a model, suggested by Vernon and Meinke &lsqbVernon, D. M. &
Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&rsqb, in
which the apical cell and its decendents normally suppress the
embryogenic potential of the basal cell and its decendents during early
embryo development.</p>
</abs>
</fm>
I think the problem here is that you're trying to write bs4 code with bs3.
The obvious fix is to use bs4 instead.
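For instance, a minimal bs4 sketch (the snippet string is my own reduction of your document) that keeps adjacent words apart by passing a separator to get_text:
from bs4 import BeautifulSoup

# hypothetical one-paragraph document reproducing the gluing problem
snippet = '<p>Because the <emph>raspberry</emph> and the rest.</p>'
soup = BeautifulSoup(snippet, 'html.parser')
# a separator in get_text() keeps 'the', 'raspberry' and 'and' apart
print(soup.p.get_text(' ', strip=True))
# Because the raspberry and the rest.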
But in bs3, the docs show two ways to get all of the text recursively from all contents of a soup:
''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))
You can obviously change either one of those to explicitly strip whitespace off the edges and add exactly one space between each node instead of relying on whatever space might be there:
' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
' '.join(map(unicode.strip, soup.findAll(text=True)))
I wouldn't want to guarantee that this will be exactly the same as the bs4 text property… but I think it's what you want here.