This is my string:
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
I have tried the regular expression below to extract the text that is between the h5 element tags:
reg = re.search(r'<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>([A-Za-z0-9%s]+)</h5></span></td></tr>' % string.punctuation,content)
It returns exactly what I want.
Is there a more Pythonic way to get this?
Dunno whether this qualifies as more Pythonic or not, but it handles the input as HTML data.
from lxml import html
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
HtmlData = html.fromstring(content)
ListData = HtmlData.xpath('//text()')
And to get the last element:
ListData[-1]
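If you only want the <h5> text, a more targeted XPath (a small variation, not part of the original answer) avoids indexing into the full list of text nodes:
print(HtmlData.xpath('//h5/text()')[0])  # 'Yadgiri'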
Related
How can I convert every combination of <strong>, <em>, and <u> tags to <strong> in Python in a smart way?
I tried writing code to cover this, but I don't think my approach is good because there are a lot of cases.
Does anyone know how to do this better?
I would really like to learn two ways of doing it:
Way 1: without regex.
Way 2: with regex.
My code now:
html = '''
<h1>Make bakery</h1>
<p>Step 1: Please use good quality product for make <strong><em>bakery</em></strong></p>
<p>Step 2: Make <em><u>bakery</u></em> hot</p>
<p>Step 2: Make <em><strong><u>bakery</u></strong></em> hot</p>
'''
def function_cover_all_strong(html):
    html = html.replace('<u><em><strong>', '<strong>')
    html = html.replace('</strong></em></u>', '</strong>')
    html = html.replace('<em><strong><u>', '<strong>')
    html = html.replace('</u></strong></em>', '</strong>')
    html = html.replace('<strong><u><em>', '<strong>')
    html = html.replace('</em></u></strong>', '</strong>')
    html = html.replace('<strong><em><u>', '<strong>')
    html = html.replace('</u></em></strong>', '</strong>')
    html = html.replace('<em><u>', '<strong>')
    html = html.replace('</u></em>', '</strong>')
    html = html.replace('<u><strong>', '<strong>')
    html = html.replace('</strong></u>', '</strong>')
    html = html.replace('<u><em>', '<strong>')
    html = html.replace('</em></u>', '</strong>')
    html = html.replace('<strong><u>', '<strong>')
    html = html.replace('</u></strong>', '</strong>')
    html = html.replace('<strong><em>', '<strong>')
    html = html.replace('</em></strong>', '</strong>')
    html = html.replace('<u>', '<strong>')
    html = html.replace('</u>', '</strong>')
    html = html.replace('<em>', '<strong>')
    html = html.replace('</em>', '</strong>')
    return html
html_cover = function_cover_all_strong(html = html)
print(html_cover)
Thanks for your support! All the best to you. I have tried to research this further, but I haven't found a case like this!
What about using a simple list and a for loop?
def function_cover_all_strong(html):
    # convert <strong class="some-thing" id="something"> to <strong>
    while '<strong class' in html:
        i = html.find('<strong class')  # start index of the opening tag
        j = html[i:].find('>')          # end index of the tag, relative to i
        html = html[:i] + '<strong>' + html[i + j + 1:]  # replace the whole opening tag
    # convert tag combinations (extend this list with the remaining combinations)
    tags = ['<u><em><strong>', '</strong></em></u>', '<em><strong><u>']
    for old_tag in tags:
        if '/' in old_tag:
            new_tag = '</strong>'
        else:
            new_tag = '<strong>'
        html = html.replace(old_tag, new_tag)
    return html
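For "Way 1" (no regex), a parser-based sketch is also possible. This assumes BeautifulSoup is acceptable (the question doesn't say) and the function name is mine: rename every <em> and <u> tag to <strong>, then unwrap the redundant nested <strong> tags.
from bs4 import BeautifulSoup

def cover_all_strong_parsed(html):
    soup = BeautifulSoup(html, 'html.parser')
    # rename every <em> and <u> tag to <strong>
    for tag in soup.find_all(['em', 'u']):
        tag.name = 'strong'
    # collapse <strong> tags nested directly inside another <strong>
    for tag in soup.find_all('strong'):
        while tag.find('strong') is not None:
            tag.find('strong').unwrap()
    return str(soup)

print(cover_all_strong_parsed(html))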
The webpage I'm scraping has paragraphs and headings structured this way:
<p>
<strong>
<a href="https://dummy.com" class="">This is a link heading
</strong>
</p>
<p>
Content To Be Pulled
</p>
I wrote the following code to pull the link heading's content:
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)
The next part is confusing me, because the text I want to collect next is in the <p> tag after the <p> tag which contains the link. I cannot use .next_sibling on the anchor here, because the following paragraph is outside of its parent <p> tag.
How do I choose the following paragraph given that the <p> before it contained a link?
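For the structure shown above, one approach (a sketch; the #pcl-full-content selector is taken from the question's own code) is to climb from the anchor to the <p> that wraps it and then take the next <p> sibling:
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)
    # climb to the <p> that wraps the link, then take the following <p>
    content_p = anchor.find_parent('p').find_next_sibling('p')
    if content_p is not None:
        print(content_p.get_text(strip=True))  # "Content To Be Pulled"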
One way seems to be to extract from the script tag, though you will need to split the text by horoscope:
import requests, re, json
r = requests.get('https://indianexpress.com/article/horoscope/weekly-horoscope-june-6-june-12-gemini-cancer-taurus-and-other-signs-check-astrological-prediction-7346080/',
headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
print(data['articleBody'])
You could get the horoscopes separately as follows. This dynamically determines which horoscopes are present, and in what order:
import requests, re, json
r = requests.get('https://indianexpress.com/article/horoscope/horoscope-today-april-6-2021-sagittarius-leo-aries-and-other-signs-check-astrological-prediction-7260276/',
headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
# print(data['articleBody'])
signs = ['ARIES', 'TAURUS', 'GEMINI', 'CANCER', 'LEO', 'VIRGO', 'LIBRA', 'SCORPIO', 'SAGITTARIUS', 'CAPRICORN', 'AQUARIUS', 'PISCES']
p = re.compile('|'.join(signs))
signs = p.findall(data['articleBody'])
for number, sign in enumerate(signs):
    if number < len(signs) - 1:
        print(re.search(f'({sign}.*?){signs[number + 1]}', data['articleBody']).group(1))
    else:
        print(re.search(f'({sign}.*)', data['articleBody']).group(1))
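An alternative sketch for the splitting step (an assumption on my part, not in the original answer): wrap the compiled pattern in a capturing group so re.split keeps the sign names, then pair each name with the text that follows it (this assumes each sign name appears only once in articleBody).
parts = re.split('(' + p.pattern + ')', data['articleBody'])
# parts[0] is the preamble; after that, sign names and their texts alternate
horoscopes = dict(zip(parts[1::2], (t.strip() for t in parts[2::2])))
for sign, text in horoscopes.items():
    print(sign, text[:80])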
I am doing an assignment where I need to scrape information from live sites.
For this I am using https://www.nintendo.com/games/nintendo-switch-bestsellers, and need to scrape the game titles, prices, and then the image sources. I have the titles working, but the prices and image sources are just returning empty lists, even though when put through pythex the patterns return the right matches.
Here is my code:
from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen
game_html_source = urlopen\
('https://www.nintendo.com/games/nintendo-switch-bestsellers').\
read().decode("UTF-8")
# game titles - working
game_title = findall(r'<h3 class="b3">([A-Z a-z:0-9]+)</h3>', game_html_source)
print(game_title)
# game prices - returning empty list
game_prices = findall(r'<p class="b3 row-price">(\$[.0-9]+)</p>', game_html_source)
print(game_prices)
# game images - returning empty list
game_images = findall(r'<img alt="[A-Z a-z:]+" src=("https://media.nintendo.com/nintendo/bin/[A-Za-z0-9-\/_]+.png")>',game_html_source)
print(game_images)
Parsing HTML with regex has too many pitfalls for reliable processing. BeautifulSoup and other HTML parsers work by building a complete document data structure, which you then navigate to extract the interesting bits - it's thorough and comprehensive, but if there is some erroneous HTML anywhere in the source, even if it's in a part you don't care about, it can defeat the parsing process. Pyparsing takes a middle approach - you can define mini-parsers that match just the bits you want, and skip over everything else (this simplifies the post-parsing navigation too). To address some of the variability in HTML styles, pyparsing provides a function makeHTMLTags which returns a pair of pyparsing expressions for the opening and closing tags:
foo_start, foo_end = pp.makeHTMLTags('foo')
foo_start will match:
<foo>
<foo/>
<foo class='bar'>
<foo href=something_not_in_quotes>
and many more variations of attributes and whitespace.
The foo_start expression (like all pyparsing expressions) will return a ParseResults object. This makes it easy to access the parts of the parsed tag:
foo_data = foo_start.parseString("<foo img='bar.jpg'>")
print(foo_data.img)
For your Nintendo page scraper, see the annotated source below:
import pyparsing as pp
# define expressions to match opening and closing tags <h3>
h3, h3_end = pp.makeHTMLTags("h3")
# define a specific type of <h3> tag that has the desired 'class' attribute
h3_b3 = h3().addCondition(lambda t: t['class'] == "b3")
# similar for <p>
p, p_end = pp.makeHTMLTags("p")
p_b3_row_price = p().addCondition(lambda t: t['class'] == "b3 row-price")
# similar for <img>
img, _ = pp.makeHTMLTags("img")
img_expr = img().addCondition(lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin"))
# define expressions to capture tag body for title and price - include negative lookahead for '<' so that
# tags with embedded tags are not matched
LT = pp.Literal('<')
title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)('title') + h3_end
price_expr = p_b3_row_price + ~LT + pp.SkipTo(p_end)('price') + p_end
# compose a scanner expression by '|'ing the 3 sub-expressions into one
scanner = title_expr | price_expr | img_expr
# not shown - read web page into variable 'html'
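# One way to fill in this step (an assumption on my part, mirroring the question's own urlopen call):
from urllib.request import urlopen
html = urlopen('https://www.nintendo.com/games/nintendo-switch-bestsellers').read().decode("UTF-8")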
# use searchString to search through the retrieved HTML for matches
for match in scanner.searchString(html):
    if 'title' in match:
        print("Title:", match.title)
    elif 'price' in match:
        print("Price:", match.price)
    elif 'src' in match:
        print("Img src:", match.src)
    else:
        print("???", match.dump())
The first few matches printed are:
Img src: //media.nintendo.com/nintendo/bin/SF6LoN-xgX1iT617eWfBrNcWH6RQXnSh/I_IRYaBzJ61i-3hnYt_k7hVxHtqGmM_w.png
Title: Hyrule Warriors: Definitive Edition
Price: $59.99
Img src: //media.nintendo.com/nintendo/bin/wcfCyAd7t2N78FkGvEwCOGzVFBNQRbhy/AvG-_d4kEvEplp0mJoUew8IAg71YQveM.png
Title: Donkey Kong Country: Tropical Freeze
Price: $59.99
Img src: //media.nintendo.com/nintendo/bin/QKPpE587ZIA5fUhUL4nSbH3c_PpXYojl/J_Wd79pnFLX1NQISxouLGp636sdewhMS.png
Title: Wizard of Legend
Price: $15.99
I'm trying to extract data from this webpage and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup
#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
    #    print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the nested <div> structure is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and the URL if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10
import requests
from bs4 import BeautifulSoup
import re
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    has_pmid = 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first <a> tag that matches the target URL (which ends with numbers), then check whether 'PMID' is in its previous element.
This site is very inconsistent, and I tried many times; hope this helps.
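If you want something closer to what the question literally asks - find any element whose text contains "PMID" and grab what comes after it - a rough sketch like this might work (soup is one parsed page; the number regex is my assumption, untested against the live site):
pmid_label = soup.find(text=re.compile("PMID"))  # text node containing "PMID"
if pmid_label is not None:
    # the PMID number is the next text node that looks like a long integer
    pmid_number = pmid_label.find_next(text=re.compile(r"\d{6,}"))
    if pmid_number is not None:
        print("PMID: " + pmid_number.strip())
else:
    print("Not available")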
I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags.
So far I have this EDITED & UPDATED CURRENT CODE:
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1) What do you suggest as the best way, for my special case, to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as doing #2.
2) I would like to strip <!-- --> tags and everything in between them. How would I go about that?
QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but they remain even when I use your example:
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
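For question 1, a rough sketch in the same BeautifulSoup 3 style as the code above (the joining logic is my assumption): collect the text nodes as before, but also pick up the title/alt attributes of <a> and <img> tags so they are not lost.
texts = soup.findAll(text=True)
# keep title/alt attributes from <a> and <img> tags
extra = [tag.get('title') or tag.get('alt')
         for tag in soup.findAll(['a', 'img'])
         if tag.get('title') or tag.get('alt')]
page = ' '.join(' '.join(texts + extra).split())
print page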
Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those backslashes cause certain tags to be overlooked.
This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:
import re, copy
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
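Once the markup is massaged, the comment-stripping from the first answer should work as before (a sketch; badString stands in for your page source):
soup = BeautifulSoup(badString, markupMassage=myNewMassage)
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]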
If you are looking for a solution in BeautifulSoup version 3, see the BS3 docs on Comment:
soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()
If mutation isn't your bag, you can:
[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
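For example, to rebuild the whitespace-normalized page text from that list, mirroring the joining done in the question:
texts = [t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
page = ' '.join(' '.join(texts).split())
print(page)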