I am attempting to scrape data from a website that uses non-specific span classes to format/display content. The pages present information about chemical products and each product is described within a single div class.
I first parsed by that div class and am working to pull the data I need from there. I have been able to get many things, but the parts I can't seem to pull are within the span class "ppisreportspan".
If you look at the code, you will note that it appears multiple times within each chemical description.
<tr>
<td><h4 id='stateprod'>MAINE STATE PRODUCT REPORT</h4><hr class='report'><span style="color:Maroon;" Class="subtitle">Company Number: </span><span style='color:black;' Class="subtitle">38</span><br /><span Class="subtitle">MONSANTO COMPANY <br/>800 N. LINDBERGH BOULEVARD <br/>MAIL STOP FF4B <br/>ST LOUIS MO 63167-0001<br/></span><br/><span style="color:Maroon;" Class="subtitle">Number of Currently Registered Products: </span><span style='color:black; font-size:14px' class="subtitle">80</span><br /><br/><p class='noprint'><img alt='' src='images/epalogo.png' /> View the label in the US EPA Pesticide Product Label System (PPLS).<br /><img alt='' src='images/alstar.png' /> View the label in the Accepted Labels State Tracking and Repository (ALSTAR).<br /></p>
<hr class='report'>
<div class='nopgbrk'>
<span class='ppisreportspanprodname'>PRECEPT INSECTICIDE </span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:100-1075" target='blank'>100-1075-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/></a>
<span class='line-break'></span>
<span class=ppisProd>ME Product Number: </span>
<span class="ppisreportspan">2014000996</span>
<br />Registration Year: <span class="ppisreportspan">2019</span>
Type: <span class="ppisreportspan">RESTRICTED</span><br/><br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">3.0000</span></td>
<td><span class="ppisreportspan">Tefluthrin (128912)</span></td>
</tr>
</table><hr />
</div>
<div class='nopgbrk'>
<span class='ppisreportspanprodname' >ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN</span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:264-789" target='blank'>264-789-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/>
</a><span class='line-break'></span>
<span class=ppisProd>ME Product Number: <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span>
<img alt='ALSTAR Link' src='images/alstar.png'/></a>
<br />Registration Year: <span class="ppisreportspan">2019</span>
<br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">48.0000</span></td>
<td><span class="ppisreportspan">Clothianidin (44309)</span></td>
</tr>
</table><hr />
</div>
This sample includes two chemicals. One has an "alstar" ID and link and one does not. Both have registration years. Those are the data points that are hanging me up.
You may also note that there is a 10-digit code stored in "ppisreportspan" in the first example. I was able to extract that as part of the "ppisProd" span for any record that doesn't have the ALSTAR link. I don't understand why, but it reinforces the point that my parsing process seems to ignore that span class.
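That symptom is consistent with the malformed markup around the ALSTAR link: the <a> opens inside <span class=ppisProd>, but the </span> arrives before the </a>, and html.parser's error recovery can drop or relocate the spans that follow. A quick hedged check (page_html here is a placeholder for whatever HTML you already load) is to compare what each installed parser recovers:

from bs4 import BeautifulSoup

# page_html is a placeholder for the HTML you already load;
# the lxml and html5lib packages must be installed for those two parsers
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(page_html, parser)
    spans = soup.find_all('span', class_='ppisreportspan')
    print(parser, len(spans))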
I have tried various methods over the last two days based on all kinds of different answers on SO, so I can't possibly list them all. I seem to either get everything from the first span to the end of the last span, or I get NoneType errors or empty lists.
This one gets the closest:
It returns the correct spans for many div chunks, but it still skips (returns an empty list []) any of the ones that have ALSTAR links, like the second one in the example.
[Picture showing data, then a series of three empty lists where the data should be.]
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

url = input('Enter URL:')
hand = open(url)
soup = BeautifulSoup(hand, 'html.parser')

# create a list of chunks by product (div)
products = soup.find_all('div', class_='nopgbrk')
print(type(products))
print(len(products))

tempalstars = []
rptspanclasses = []
regyears = []
alstarIDs = []
asltrlinks = []

# read the span tags
for product in products:
    tempalstar = product.find_all('span', class_="ppisreportspan")
    tempalstars.append(tempalstar)
    print(tempalstar)
Ultimately, I want to be able to select the text for the year as well as the ALSTAR link out of these span statements for each div chunk, but I will cross that bridge when I can get the code finding all the instances of that class.
Alternatively: is there some easier way I can get the Registration Year and the ALSTAR link (e.g. <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span> <img alt='ALSTAR Link' src='images/alstar.png'/></a>) than what I am trying to do?
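A minimal sketch of that easier way, mirroring the markup above (hedged: it assumes soup is parsed from the full page as in the code above, and that the parser recovers the spans):

for product in soup.find_all('div', class_='nopgbrk'):
    name = product.find('span', class_='ppisreportspanprodname').get_text(strip=True)
    # the year is the first span after the "Registration Year:" text node
    year_label = product.find(string=lambda t: 'Registration Year:' in t)
    year = year_label.find_next('span').get_text(strip=True) if year_label else None
    # the ALSTAR link, when present, is an <a> whose href points at alstar_label.aspx
    alstar = product.select_one('a[href*="alstar_label.aspx"]')
    print(name, year, alstar['href'] if alstar else 'No ALSTAR link')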
I am using Python 3.7.2. Thank you!
I managed to get some data from this site. All you need to know is the company number; in the case of Monsanto, the number is 38 (this number is shown after selecting Maine and typing monsanto in the search box):
import re
import requests
from bs4 import BeautifulSoup

url_1 = 'http://npirspublic.ceris.purdue.edu/state/state_menu.aspx?state=ME'
url_2 = 'http://npirspublic.ceris.purdue.edu/state/company.aspx'

company_name = 'monsanto'
company_number = '38'

with requests.session() as s:
    # load the search page and copy every ASP.NET form field it contains
    r = s.get(url_1)
    soup = BeautifulSoup(r.text, 'lxml')
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']

    # fill in the company search and submit it
    data['ctl00$ContentPlaceHolder1$search'] = 'company'
    data['ctl00$ContentPlaceHolder1$TextBoxInput1'] = company_name
    r = s.post(url_1, data=data)

    # the result page is another form; keep its hidden fields and
    # press the "Display Products" button for our company number
    soup = BeautifulSoup(r.text, 'lxml')
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']
    data = {k: v for k, v in data.items() if not k.startswith('ctl00$ContentPlaceHolder1$')}
    data['ctl00$ContentPlaceHolder1${}'.format(company_number)] = 'Display+Products'
    r = s.post(url_2, data=data)

    # parse the product report
    soup = BeautifulSoup(r.text, 'lxml')
    for div in soup.select('.nopgbrk'):
        # extract name
        print(div.select_one('.ppisreportspanprodname').text)
        # extract ME product number (the 10-digit code)
        s = ''.join(re.findall(r'\d{10}', div.text))
        print(s)
        # extract ALSTAR link
        s = div.select_one('a[href*="alstar_label.aspx"]')
        if s:
            print(s['href'])
        else:
            print('No ALSTAR link')
        # extract Registration year
        s = div.find(text=lambda t: 'Registration Year:' in t)
        if s:
            print(s.next.text)
        else:
            print('No registration year.')
        print('-' * 80)
Prints:
PRECEPT INSECTICIDE
2014000996
No ALSTAR link
2019
--------------------------------------------------------------------------------
ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN
2009005053
alstar_label.aspx?LabelId=117531
2019
--------------------------------------------------------------------------------
ACCELERON D-342 FUNGICIDE SEED TREATMENT
2015000498
alstar_label.aspx?LabelId=117538
2019
--------------------------------------------------------------------------------
ACCELERON DX-309
2009005026
alstar_label.aspx?LabelId=117559
2019
--------------------------------------------------------------------------------
... and so on.
I want to scrape separate content, like the text in the 'a' tag (i.e. only the name, "42mm Architecture") and 'scope of services, types of built projects, Locations of Built Projects, Style of work, Website' as CSV file headers, along with their content, for a whole webpage.
The elements have no class or ID associated with them, so I am kind of stuck on how to extract those details properly; there are also those 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit of a time-consuming one! The webpage is not complete, and it has few tags and identifiers. On top of that, they haven't even spell-checked the content; e.g. one place has a heading Scope of Services and another has Scope of services, and there are many more like that! So what I have done is a crude extraction, and I'm sure it will also help you if you have pagination in mind.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')

# there are many h2 tags but we want the one without any class name
h2 = soup.find_all('h2', class_='')

headers = []
contents = []
header_len = []
a_tags = []

for i in h2:
    if i.find_next().name == 'a':  # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h = [j.text for j in p.find_all('strong')]  # some headings were bold in the website
        headers.append(h)
        header_len.append(len(h))

# since only some headings were in bold, the row with the most bold entries gives all the headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserted a new heading
headers.insert(0, 'Firm')

# n for traversing through the headers list
# k for traversing through the a_tags list
n = 1
k = 0

# this is the difficult part, where the content has all the details in one value, including the headings, like this:
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it using the ':' and then splicing it from the start of each heading
contents = [i.split(':') for i in contents]

for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n += 1
            k += 1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers)-1:
                n += 1
    n = 1
    # merging those extra values in the list, if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])

# writing into the csv file
# if you don't want a line space between each row then add the newline='' argument in the open function below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output: [screenshot of output.csv omitted]
If you want to paginate then just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num < 13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num += 1
Hope this helps, let me know if there's any error.
EDIT 1:
Sorry, I forgot to say the most important part: if there is a tag with no class name, you can still get it with what I have used in the code above
h2 = soup.find_all('h2', class_= '')
This simply says: give me all the h2 tags that do not have a class name. The absence of a class can itself sometimes be a unique identifier, and that is what we are using here.
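A tiny self-contained check of that behavior (hedged; it relies on BeautifulSoup matching an empty string against a missing class attribute):

from bs4 import BeautifulSoup

html = '<h2 class="fancy">skip me</h2><h2>grab me</h2>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('h2', class_=''))  # expected: [<h2>grab me</h2>]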
You can use this example as a basis for how to scrape the information from that page (note that it uses the walrus operator :=, so it needs Python 3.8+):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

parent = soup.select_one("div.govspeak")
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}

all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice omitted).
I have the following HTML code I am trying to scrape from a website:
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
What I am trying to accomplish is to search the page to find the text "Net Taxes Due" within its <td> tag, find the siblings of that tag, and send the results into a Pandas data frame.
I have the following code:
soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')
cells = table.find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
df = pd.DataFrame(np.array(cells))
print(df)
I've been all over the web looking for a solution and can't come up with something. Appreciate any help.
Thanks!
In the following, I expected to use indices 1 and 2, but 2 and 3 seem to work when using lxml.html and XPath.
import requests
from lxml.html import fromstring
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
'''
tree = fromstring(h)
links = [link.text for link in tree.xpath('//td[text() = "Net Taxes Due"]/following-sibling::td[2] | //td[text() = "Net Taxes Due"]/following-sibling::td[3]' )]
print(links)
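The off-by-one is most likely the unclosed <td> on the first line: the parser treats the stray <td> as opening a new, empty cell, so following-sibling::td[1] is that empty cell and the two value cells land at [2] and [3]. A quick hedged check is to print every td the parser actually recovered:

# reusing `tree` from the snippet above
for td in tree.xpath('//td'):
    print(repr(td.text))  # the recovered empty cell should show up between the label and the values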
Make sure to add the tag name along with your search string. This is how you can do that:
from bs4 import BeautifulSoup
htmldoc = """
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
"""
soup = BeautifulSoup(htmldoc, "html.parser")
item = soup.find('td',text='Net Taxes Due').find_next_sibling("td")
print(item)
Your .select() call is not correct. # in a selector is used to match an element's ID, not its text contents, so #Net means to look for an element with id="Net". Spaces in a selector mean to look for descendants that match each successive selector. So #Net Taxes Due searches for something like:
<div id="Net">
<taxes>
<due>...</due>
</taxes>
</div>
To search for an element containing a specific string, use .find() with the string keyword:
table = soup.find(string="Net Taxes Due")
Assuming that there's an actual HTML table involved:
<html>
<table>
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
</table>
</html>
soup = BeautifulSoup(html, "html.parser")  # `html` holds the page markup above, not the URL
table = soup.find('tr')
df = [x.text for x in table.findAll('td', {'class':'value-column'})]
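If you would rather start from the string match shown earlier, a hedged variant on the same corrected markup climbs from the NavigableString to its cell and then takes the siblings:

cell = soup.find(string="Net Taxes Due")  # a NavigableString, not a Tag
tds = cell.find_parent("td").find_next_siblings("td")
print([td.get_text(strip=True) for td in tds])  # ['$2,370.00', '$2,408.00']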
These should work. If you are using bs4 4.7.0, you "could" use select; if you are on an older version, or just prefer the find interface, you can use that. Basically, as stated earlier, you cannot reference content with #; that matches an ID.
import bs4
markup = """
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
"""
# Version 4.7.0
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.select('td:contains("Net Taxes Due") ~ td.value-column')
cells = [ele.text.strip() for ele in cells]
print(cells)
# Version < 4.7.0 or if you prefer find
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.find('td', text="Net Taxes Due").find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
print(cells)
You would get this
['$2,370.00', '$2,408.00']
['$2,370.00', '$2,408.00']
My task is to automate printing Wikipedia infobox data. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek) to extract the infobox section from the right-hand side and print it row by row on screen using Python. I specifically want the infobox. So far I have done this:
from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)
This gives me everything from the info box. A snippet is shown below:
[<tr><th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-style: italic; background: lavender;">
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59"
I want to extract the data only and print it on screen. So What i want is:
Created by Gene Roddenberry
Original work Star Trek: The Original Series
Print publications
Book(s)
List of reference books
List of technical manuals
Novel(s) List of novels
Comics List of comics
Magazine(s)
Star Trek: The Magazine
Star Trek Magazine
And so on till the end of the infobox. So basically I need a way of printing every row of the infobox data, so I can automate it for any wiki page. (The class of the infobox table is 'infobox vevent' on all wiki pages, as shown in the code.)
This page should help you parse your HTML as a simple string without the HTML tags: Using BeautifulSoup Extract Text without Tags.
This is the code from that page; it belongs to @0605002:
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print soup.text
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
With BeautifulSoup you then need to reformat the data as you want. Use fresult = [e.text for e in results] to get the text of each row.
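Spelled out against the question's own variables (a hedged sketch; results is the list of tr tags from the question's code):

# print each infobox row as label/value text
fresult = [e.get_text(' ', strip=True) for e in results]
for line in fresult:
    if line:  # skip rows that only contain an image
        print(line)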
If you want to read a table from HTML you can try some code like this, though it uses pandas:
import pandas

urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
data = pandas.read_html(urlpage)[0]
null = data.isnull()

for x in range(len(data)):
    first = data.iloc[x][0]
    second = data.iloc[x][1] if not null.iloc[x][1] else ""
    print(first, second, "\n")
I'm trying to extract data from this webpage, and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup
#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
for i in range(len(gene_listID)):
print gene_listID[i]
# Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
# Opens the URL as a page
page = urllib2.urlopen(dbOGAP)
# Reads the page and parses it through "lxml" format
soup = BeautifulSoup(page, "lxml")
gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
print gene_name[1:]
gene_list.append(gene_name[1:])
# PubMed IDs are located near the <td> tag with the term "Data and Source"
pmid = soup.find("span", text="Data and Source")
# Based on inspection of the website, need to move up to the parent <td> tag
pmid_p = pmid.parent
# Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
pmid_s = pmid_p.next_sibling
#for child in pmid_s.descendants:
# print child
# Now we search down the tree to find the next table data (<td>) tag
pmid_c = pmid_s.find("td")
temp_lit = []
# Next we print the text of the data
#print pmid_c.text
if "No literature is available" in pmid_c.text:
temp_lit.append("No literature is available")
print "Not available"
else:
# and then print out a list of urls for each pubmed ID we have
print "The following is available"
for link in pmid_c.find_all('a'):
# the <a> tag includes more than just the link address.
# for each <a> tag found, print the address (href attribute) and extra bits
# link.string provides the string that appears to be hyperlinked.
# In this case, it is the pubmedID
print link.string
temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
literature.append(temp_lit)
print "\n"
So it seems the <div> element is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and a URL, if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10
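One hedged route for the "any element with the text PMID" idea (Python 2, to match the question; the td/div tag names come from the two samples above, and soup is the parsed page from your loop):

import re

# find the PMID label no matter whether it sits in a <td> or a <div>
pmid_label = soup.find(text=re.compile(r'PMID'))
if pmid_label is not None:
    container = pmid_label.find_parent(['td', 'div'])
    # pull every PMID number out of that block's text
    print re.findall(r'PMID:\s*(\d+)', container.get_text())
else:
    print "Not available"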
import requests
from bs4 import BeautifulSoup
import re

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    has_pmid = 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first a tag that matches the target URL (which ends in digits), then check whether 'PMID' is in its previous element.
This site is really inconsistent; I tried many times, and I hope this helps.
I've stripped the following code from IMDB's mobile site using BeautifulSoup, with Python 2.7.
I want to create a separate object for the episode number '1', the title 'Winter is Coming', and the IMDB score '8.9'. I can't seem to figure out how to split apart the episode number and the title.
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
You can use find to locate the span with the class text-large and get to the specific element you need.
Once you have your desired span, you can use next to grab the next line, containing the episode number, and find to locate the strong containing the title:
html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
span = soup.find('span', class_='text-large')
ep = str(span.next).strip()
title = str(span.find('strong').text).strip()
print ep
print title
> 1.
> Winter Is Coming
Once you have each a class="btn-full", you can use the span classes to get the tags you want. The strong tag is a child of the span with the text-large class, so you just need to call .strong.text on that Tag; for the span with the css class mobile-sprite tiny-star, you need to find the next strong tag, as it is a sibling of the span, not a child:
h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
<span class="text-large">
1.
<strong>
Winter Is Coming
</strong>
</span>
<br/>
<span class="mobile-sprite tiny-star">
</span>
<strong>
8.9
</strong>
17 Apr. 2011
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
title = soup.select_one("span.text-large").strong.text.strip()
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(title, score)
Which gives you:
(u'Winter Is Coming', u'8.9')
If you really want to get the episode number, the simplest way is to split the text once:
soup = BeautifulSoup(h)
ep, title = soup.select_one("span.text-large").text.split(None, 1)
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()
print(ep, title.strip(), score)
Which will give you:
(u'1.', u'Winter Is Coming', u'8.9')
Using URL HTML scraping with requests and a regular-expression search.
import requests
import re

frame = 'http://www.imdb.com/title/tt1480055?ref_=m_ttep_ep_ep1'
f = requests.get(frame)
helpme = f.text

result = re.findall('itemprop="name" class="">(.*?) ', helpme)
result2 = re.findall('"ratingCount">(.*?)</span>', helpme)
result3 = re.findall('"ratingValue">(.*?)</span>', helpme)

print result[0].encode('utf-8')
print result2[0]
print result3[0]
output:
Winter Is Coming
24,474
9.0