Parse html in python using beautifulsoup

Parse html in python using beautifulsoup - python

How do i get output as below from give html page ?
>html_sting='''<td class="status_icon" rowspan="2"><img alt="QUEUED" src="images/arts/status_QUEUED.png" style="border:none" title="QUEUED"/></td>
><td class="test"> v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
> <div class="start">(04.02) 23:29</div>
> <div class="end">~
> <span style="color:green"> () </span>
> </div>
></td>
><td>mcordeix</td>
><td>1614809</td>
><td>0/0/0 of 0</td>
><td>high</td>
><td style="white-space:nowrap"><img class="pbar" src="images/arts/bar_green.gif" style="border-right:2px;border-right-style:solid;border-right-color:#ffffff" width="1%"/><img class="pbar" src="images/arts/bar_gray.gif" width="99%"/></td>
><td></td>
><td></td>
><td></td>
><td></td>
><td colspan="4">
><!-- Florent Vial: this can be alway shown if admin=1 -->
>XML
>Raw XML
>CINFO
></td>
><td></td>
><td><!-- <script type="text/javascript">DIVShowHideDetails('func:DoPrintArtsDetails')</script> --> </td>
><td></td>
><td></td>
><td></td>
><td></td>
'''
EXpected Output:
-------
Status="QUEUED"
test=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
start=(04.02) 23:29
end=~
user=mcordeix

Welcome to StackOverflow!
Please read the How to ask a question section of our FAQ.
Explain how you encountered the problem you're trying to solve, and any difficulties that have prevented you from solving it yourself.
What have you tried to solve this problem so far?
Let's give you a start.
All you're gonna need are the find and find_all functions.
soup = BeautifulSoup(html_string)
status = soup.find('img').get('alt') # get 'alt' content of the first <img> tag.
# find the first <td> tag with a class="test", get its content, split it using spaces,
version = soup.find('td', class_='test').text.split()[0] # and get the first substring
time_start = soup.find('div', class_='start').text
time_end = soup.find('div', class_='end').text
user = soup.find_all('td')[2].text # get a third <td>'s content.
print status # QUEUED
print version # v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
print time_start # (04.02) 23:29
print time_end # ~ > () >
print user # mcordeix
That's just reading the bs4's documentation for like 10 minutes and trying it yourself.
Just pop out the Python interpreter, assign the html_string variable, import the beautifulsoup library, and try.
I'm sure you could work out the problem left with the time_end content yourself. It's not that hard.

Related

Beautifulsoup4 - not selcting all instances of span class

I am attempting to scrape data from a website that uses non-specific span classes to format/display content. The pages present information about chemical products and each product is described within a single div class.
I first parsed by that div class and am working to pull the data I need from there. I have been able to get many things but the parts I cant seem to pull are within the span class "ppisreportspan"
If you look at the code, you will note that it appears multiple times within each chemical description.
<tr>
<td><h4 id='stateprod'>MAINE STATE PRODUCT REPORT</h4><hr class='report'><span style="color:Maroon;" Class="subtitle">Company Number: </span><span style='color:black;' Class="subtitle">38</span><br /><span Class="subtitle">MONSANTO COMPANY <br/>800 N. LINDBERGH BOULEVARD <br/>MAIL STOP FF4B <br/>ST LOUIS MO 63167-0001<br/></span><br/><span style="color:Maroon;" Class="subtitle">Number of Currently Registered Products: </span><span style='color:black; font-size:14px' class="subtitle">80</span><br /><br/><p class='noprint'><img alt='' src='images/epalogo.png' /> View the label in the US EPA Pesticide Product Label System (PPLS).<br /><img alt='' src='images/alstar.png' /> View the label in the Accepted Labels State Tracking and Repository (ALSTAR).<br /></p>
<hr class='report'>
<div class='nopgbrk'>
<span class='ppisreportspanprodname'>PRECEPT INSECTICIDE </span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:100-1075" target='blank'>100-1075-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/></a>
<span class='line-break'></span>
<span class=ppisProd>ME Product Number: </span>
<**span class="ppisreportspan"**>2014000996</span>
<br />Registration Year: <**span class="ppisreportspan"**>2019</span>
Type: <span class="ppisreportspan">RESTRICTED</span><br/><br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">3.0000</span></td>
<td><span class="ppisreportspan">Tefluthrin (128912)</span></td>
</tr>
</table><hr />
</div>
<div class='nopgbrk'>
<span class='ppisreportspanprodname' >ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN</span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:264-789" target='blank'>264-789-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/>
</a><span class='line-break'></span>
<span class=ppisProd>ME Product Number: <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span>
<img alt='ALSTAR Link' src='images/alstar.png'/></a>
<br />Registration Year: <span class="ppisreportspan">2019</span>
<br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">48.0000</span></td>
<td><span class="ppisreportspan">Clothianidin (44309)</span></td>
</tr>
</table><hr />
</div>
This sample includes two chemicals. One has an "alstar" ID and link and one does not. Both have registration years. Those are the data points that are hanging me up.
You may also note that there is a 10 digit code stored in "ppisreportspan" in the first example. I was able to extract that as part of the "ppisProd" span for nay record that doesn't have the Alstar link. I don't understand why, but it reinforces the point that it seems my parsing process ignores that span class.
I have tried various methods for the last 2 days based on all kinds of different answers on SO, so I can't possibly list them all. I seem to be able to either get anything from the first "span" to the end on the last span, or I get "nonetype" errors or empty lists.
This one gets the closest:
It returns the correct spans for many div chunks but it still skips (returns blank tuple []) for any of the ones that have alstar links like the second one in the example.
picture showing data and then a series of three sets of empty brackets where the data should be
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
url = input('Enter URL:')
hand = open(url)
soup = BeautifulSoup(hand, 'html.parser')
#create a list of chunks by product (div)
products = soup.find_all('div' , class_ ='nopgbrk')
print(type(products))
print(len(products))
tempalstars =[]
rptspanclasses = []
regyears = []
alstarIDs = []
asltrlinks = []
# read the span tags
for product in products:
tempalstar = product.find_all('span', class_= "ppisreportspan")
tempalstars.append(tempalstar)
print(tempalstar)
Ultimately, I want to be able to select the text for the year as well as the Alstar link out of these span statements for each div chunk, but I will be cross that bridge when I can get the code finding all the instances of that class.
Alternately - Is there some easier way I can get the Registration year and the Alstar link (eg. <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span> <img alt='ALSTAR Link' src='images/alstar.png'/></a>) rather than what I am trying to do?
I am using Python 3.7.2 and Thank you!

I managed to get some data from this site. All you need to know is the company number, in case of monsanto, the number is 38 (this number is shown in after selecting Maine and typing monsanto in the search box:
import re
import requests
from bs4 import BeautifulSoup
url_1 = 'http://npirspublic.ceris.purdue.edu/state/state_menu.aspx?state=ME'
url_2 = 'http://npirspublic.ceris.purdue.edu/state/company.aspx'
company_name = 'monsanto'
company_number = '38'
with requests.session() as s:
r = s.get(url_1)
soup = BeautifulSoup(r.text, 'lxml')
data = {i['name']: '' for i in soup.select('input[name]')}
for i in soup.select('input[value]'):
data[i['name']] = i['value']
data['ctl00$ContentPlaceHolder1$search'] = 'company'
data['ctl00$ContentPlaceHolder1$TextBoxInput1'] = company_name
r = s.post(url_1, data=data)
soup = BeautifulSoup(r.text, 'lxml')
data = {i['name']: '' for i in soup.select('input[name]')}
for i in soup.select('input[value]'):
data[i['name']] = i['value']
data = {k: v for k, v in data.items() if not k.startswith('ctl00$ContentPlaceHolder1$')}
data['ctl00$ContentPlaceHolder1${}'.format(company_number)] = 'Display+Products'
r = s.post(url_2, data=data)
soup = BeautifulSoup(r.text, 'lxml')
for div in soup.select('.nopgbrk'):
#extract name
print(div.select_one('.ppisreportspanprodname').text)
#extract ME product number:
s = ''.join(re.findall(r'\d{10}', div.text))
print(s)
#extract alstar link
s = div.select_one('a[href*="alstar_label.aspx"]')
if s:
print(s['href'])
else:
print('No ALSTAR link')
#extract Registration year:
s = div.find(text=lambda t: 'Registration Year:' in t)
if s:
print(s.next.text)
else:
print('No registration year.')
print('-' * 80)
Prints:
PRECEPT INSECTICIDE
2014000996
No ALSTAR link
2019
--------------------------------------------------------------------------------
ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN
2009005053
alstar_label.aspx?LabelId=117531
2019
--------------------------------------------------------------------------------
ACCELERON D-342 FUNGICIDE SEED TREATMENT
2015000498
alstar_label.aspx?LabelId=117538
2019
--------------------------------------------------------------------------------
ACCELERON DX-309
2009005026
alstar_label.aspx?LabelId=117559
2019
--------------------------------------------------------------------------------
... and so on.

Scraping <span>flow text</span> with BeautifulSoup and urllib

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span</div>
</div>
"""
My ultimate goal would be to able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting python and urllib to produce any text between the . Here is what I am running:
import urllib
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0,4,5] # list of Target IDs to scrape
for i in Search_List:
h = str(i)
root = 'target_' + h
taggr = soup.find("span", { "id" : root })
print taggr, ", ", taggr.text
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the outcome?

use this code :
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
your_data.append(line.text)
...
similarly add all class attributes which you need to extract data from and write your_data list in csv file. Hope this will help if this doesn't work out. let me know.

You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5] # list of Target IDs to scrape
for i in search_ids:
span = soup.find("span", id='target_{}'.format(i))
if span:
grouping = span.parent.parent
print list(grouping.stripped_strings)[:-1] # -1 to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']
Note, if the HTML you are getting back from your URL is different to that seen been viewing the source from your browser (i.e. the data you want is missing completely) then you will need to use a solution such as selenium to connect to your browser and extract the HTML. This is because in this case, the HTML is probably being generated locally via Javascript, and urllib does not have a Javascript processor.

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I’m trying to extract data from this webpage and I'm having some trouble due to inconsistancies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistantly get the gene name for each ID, but I'm having trouble with the literature part. Here's a couple search terms that highlight the inconsitancies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup
#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
for i in range(len(gene_listID)):
print gene_listID[i]
# Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
# Opens the URL as a page
page = urllib2.urlopen(dbOGAP)
# Reads the page and parses it through "lxml" format
soup = BeautifulSoup(page, "lxml")
gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
print gene_name[1:]
gene_list.append(gene_name[1:])
# PubMed IDs are located near the <td> tag with the term "Data and Source"
pmid = soup.find("span", text="Data and Source")
# Based on inspection of the website, need to move up to the parent <td> tag
pmid_p = pmid.parent
# Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
pmid_s = pmid_p.next_sibling
#for child in pmid_s.descendants:
# print child
# Now we search down the tree to find the next table data (<td>) tag
pmid_c = pmid_s.find("td")
temp_lit = []
# Next we print the text of the data
#print pmid_c.text
if "No literature is available" in pmid_c.text:
temp_lit.append("No literature is available")
print "Not available"
else:
# and then print out a list of urls for each pubmed ID we have
print "The following is available"
for link in pmid_c.find_all('a'):
# the <a> tag includes more than just the link address.
# for each <a> tag found, print the address (href attribute) and extra bits
# link.string provides the string that appears to be hyperlinked.
# In this case, it is the pubmedID
print link.string
temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
literature.append(temp_lit)
print "\n"
So it seems the element is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and url if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10

import requests
from bs4 import BeautifulSoup
import re
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
for url in urls:
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
a_tag = soup.find('a', href=regex)
has_pmid = 'PMID' in a_tag.previous_element
if has_pmid :
print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
else:
print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
find the first a tag that match the target url, which end with numbers, than check if 'PMID' in it's previous element.
this web is so inconsistancies , and i try many times, hope this would help

How to find text with a particular value BeautifulSoup python2.7

I have the following html: I'm trying to get the following numbers saved as variables Available Now,7,148.49,HatchBack,Good. The problem I'm running into is that I'm not able to pull them out independently since they don't have a class attached to it. I'm wondering how to solve this. The following is the html then my futile code to solve this.
</div>
<div class="car-profile-info">
<div class="col-md-12 no-padding">
<div class="col-md-6 no-padding">
<strong>Status:</strong> <span class="statusAvail"> Available Now </span><br/>
<strong>Min. Booking </strong>7 Days ($148.89)<br/>
<strong>Style: </strong>Hatchback<br/>
<strong>Transmission: </strong>Automatic<br/>
<strong>Condition: </strong>Good<br/>
</div>
Python 2.7 Code: - this gives me the entire html!
soup=BeautifulSoup(html)
print soup.find("span",{"class":"statusAvail"}).getText()
for i in soup.select("strong"):
if i.getText()=="Min. Booking ":
print i.parent.getText().replace("Min. Booking ","")

Find all the strong elements under the div element with class="car-profile-info" and, for each element found, get the .next_siblings until you meet the br element:
from bs4 import BeautifulSoup, Tag
for strong in soup.select(".car-profile-info strong"):
label = strong.get_text()
value = ""
for elm in strong.next_siblings:
if getattr(elm, "name") == "br":
break
if isinstance(elm, Tag):
value += elm.get_text(strip=True)
else:
value += elm.strip()
print(label, value)

You can use ".next_sibling" to navigate to the text you want like this:
for i in soup.select("strong"):
if i.get_text(strip=True) == "Min. Booking":
print(i.next_sibling) #this will print: 7 Days ($148.89)
See also http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways

Get following node in different ancestor using lxml and xpath

I'm writing a text-to-speech program that reads math equations. I have a thread that needs to pull math equations (as MathJax SVG's) and parse them to prose.
Because of how the content is laid out, the math equations can be arbitrarily nested in other elements, like paragraphs, bolds, tables, etc.
Using a reference to the current element, how do I get the next <span class="MathJax_SVG">, which may be embedded in some other parent/ancestor?
I tried to solve it using the following:
nextMath = currentElement.xpath('following::.//span[#class=\'MathJax_SVG\']')
Returns nothing, even though I can confirm visually that there is something following it. I tried removing the period, but lxml complains that my XPath is malformed.
Have you guys ran into this before?
P.S. Here is a test document to show my point:
<html>
<head>
<title>Test Document</title>
</head>
<body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p>The quadratic formula is used to solve quadratic equations. Here is the formula:</p>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
<p>Here are some possible values when you use the formula:</p>
<p>
<table>
<tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
</tr>
<tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_4">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_5">removed the SVG</span></td>
</tr>
</table>
</p>
</body>
</html>
Updates
Learned that lxml doesn't support absolute positions. This may be relevant.
Some Testing Code (assuming you saved HTML as test.html)
from lxml import html
# Get my html element
with open('test.html', 'r') as f:
myHtml = html.fromstring(f.read())
# Get the first MathJax element
start = myHtml.find('.//h1[#id=\'mainHeading\']')
print 'My start:', html.tostring(start)
# Get next math equation
nextXPath = 'following::.//span[#class=\'MathJax_SVG\']'
nextElem = start.xpath(nextXPath)
if len(nextElem) > 0:
print 'Next equation:', html.tostring(nextElem[0])
else:
print 'No next equation...'

Do you need to iterate through the document? You could also search for span elements of the class MathJax_SVG directly:
from lxml import etree
doc = etree.parse(open("test-document.html")).getroot()
maths = doc.xpath("//span[#class='MathJax_SVG']")

I ended up creating my own function to get what I want. I called it getNext(elem, xpathString). If there is a more efficient way to do this, I'm all ears. I'm not confident in its performance.
from lxml import html
def getNext(elem, xpathString):
'''
Gets the next element defined by XPath. The element returned
may be itself.
'''
myElem = elem
nextElem = elem.find(xpathString)
while nextElem is None:
if myElem.getnext() is not None:
myElem = myElem.getnext()
nextElem = myElem.find(xpathString)
else:
if myElem.getparent() is not None:
myElem = myElem.getparent()
else:
break
return nextElem
# Get my html element
with open('test.html', 'r') as f:
myHtml = html.fromstring(f.read())
# Get the first MathJax element
start = myHtml.find('.//span[#id=\'MathJax_Element_Frame_1\']')
print 'My start:', html.tostring(start)
# Get next math equation
nextXPath = './/span[#class=\'MathJax_SVG\']'
nextElem = getNext(start, nextXPath)
if nextElem is not None:
print 'Next equation:', html.tostring(nextElem)
else:
print 'No next equation...'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse html in python using beautifulsoup - python

Related

Beautifulsoup4 - not selcting all instances of span class

Scraping <span>flow text</span> with BeautifulSoup and urllib

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

How to find text with a particular value BeautifulSoup python2.7

Get following node in different ancestor using lxml and xpath

Categories

Resources