Finding number of pages using Python BeautifulSoup

I want to extract the total page number (11 in this case) from a Steam page. I believe the following code should work (return 11), but it returns an empty list, as if the paged_items_paging_pagelink class were not being found.
import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
# find_all returns an empty list here, so indexing [-1] fails
total_pages = soup.find_all("span", {"class": "paged_items_paging_pagelink"})[-1].text

If you check the page source, the content you want is not there. That means it is generated dynamically through JavaScript.
The page numbers are located inside the <span id="NewReleases_links"> tag, but in the page source that span is empty:
<span id="NewReleases_links"></span>
The easiest way to handle this is with Selenium.
But if you look at the page source, the text Showing 1-20 of 213 results is available, so you can scrape that and calculate the number of pages.
Required HTML:
<div class="paged_items_paging_summary ellipsis">
Showing
<span id="NewReleases_start">1</span>
-
<span id="NewReleases_end">20</span>
of
<span id="NewReleases_total">213</span>
results
</div>
Code:
import math
import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
soup = BeautifulSoup(r.text, 'lxml')

def get_pages_no(soup):
    total_items = int(soup.find('span', id='NewReleases_total').text)
    items_per_page = int(soup.find('span', id='NewReleases_end').text)
    # use ceil rather than round so a partial final page is counted
    return math.ceil(total_items / items_per_page)

print(get_pages_no(soup))
# prints 11
(Note: I still recommend the use of Selenium, as most of the content from this site is dynamically generated. It'll be a pain to scrape all the data like this.)
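For reference, here is a minimal Selenium sketch of that approach (assuming chromedriver is on your PATH; the class name is the one from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('http://store.steampowered.com/tags/en-us/RPG/')
# wait until the JavaScript-rendered pagination exists
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'paged_items_paging_pagelink')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
pages = soup.find_all("span", {"class": "paged_items_paging_pagelink"})
print(pages[-1].text)  # should print 11
driver.quit()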

An alternative, faster way without using BeautifulSoup:
import math
import requests

# This endpoint returns the same listing as JSON
url = "http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/?query=&start=20&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG"
r = requests.get(url)
print(math.ceil(r.json()['total_count'] / 20))  # total_count = number of records, 20 = items per page
# 11
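Since start and count control the pagination, you can also walk the whole listing by stepping start in increments of 20 (a sketch; total_count is the only response field verified above, so how you process each page's payload is up to you):
import math
import requests

BASE = ("http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/"
        "?query=&start={start}&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG")

total = requests.get(BASE.format(start=0)).json()['total_count']
for page in range(math.ceil(total / 20)):
    data = requests.get(BASE.format(start=page * 20)).json()
    # each response covers items [start, start + 20); process data here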

Related

Python - Extracting info from website using BeautifulSoup

I am new to BeautifulSoup, and I'm trying to extract data from the following website.
https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx
I am trying to extract the availability of the hospital beds information (along with the detailed breakup) after choosing a particular district and also with the 'With available bed only' option selected.
Should I choose the table, the td, the tbody, or the div class for this instance?
My current code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
locations= soup.find('div', {'class': 'col-lg-12 col-md-12 col-sm-12'})
print(locations)
This only prints out a blank output.
I have also tried using tbody and table but still could not work it out.
Any help would be greatly appreciated!
EDIT: Trying to find a certain element returns []. The code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
location = soup.find_all('h5')
print(location)
It is probably a dynamic website, meaning that bs4 doesn't retrieve what you see in the browser, because the page updates or loads its content after the initial HTML load.
For dynamic webpages like this you should use Selenium and combine it with bs4.
https://selenium-python.readthedocs.io/index.html
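A minimal sketch of combining the two (chromedriver on your PATH is assumed; waiting for an h5 mirrors the edit in the question and may need adjusting):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx')
# wait until the dynamic content has rendered before grabbing the HTML
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h5')))
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find_all('h5'))
driver.quit()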

webscraping in python: copying specific part of HTML for each webpage

I am working on a web scraper using requests-html and Beautiful Soup (new to this). For one webpage (https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html) I am trying to scrape a part, which I will replicate for other products. The HTML looks like:
<span class="js-enhanced-ecommerce-data hidden" data-product-title="Illamasqua Expressionist Artistry Palette" data-product-id="12024086" data-product-category="" data-product-is-master-product-id="false" data-product-master-product-id="12024086" data-product-brand="Illamasqua" data-product-price="£39.00" data-product-position="1">
</span>
I want to select the data-product-brand="Illamasqua", specifically the Illamasqua. I am not sure how to grab this using requests-html or BeautifulSoup. I tried:
r.html.find("span.data-product-brand", first=True)
But this was unsuccessful. Any help would be appreciated.
Since you tagged beautifulsoup, here's a solution using that package:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")

# There are multiple matches for that class; loop through and read the
# brand from each (in this case there are three)
for l in soup.find_all(class_="js-enhanced-ecommerce-data hidden"):
    print(l.get('data-product-brand'))

# If it's always going to be the first match, you can just do this
soup.find(class_="js-enhanced-ecommerce-data hidden").get('data-product-brand')
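BeautifulSoup can also match on the presence of the data attribute itself, so the lookup doesn't depend on the class name (a small variation using the same soup object as above):
# True matches any span that carries a data-product-brand attribute
span = soup.find('span', attrs={'data-product-brand': True})
print(span.get('data-product-brand'))

# or with a CSS attribute selector
print(soup.select_one('span[data-product-brand]').get('data-product-brand'))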
You can get element(s) with specified data attribute directly:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
span = r.html.find('[data-product-brand]', first=True)
print(span)
There are 3 results, and you need just the first, I guess.
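To get the brand string itself out of the matched element, read its attrs mapping (continuing the snippet above):
print(span.attrs['data-product-brand'])  # 'Illamasqua'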

Scraping a Java Web-page

I have found and read quite a few articles about scraping, but as a beginner I am somewhat overwhelmed.
I want to get data from a table (https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo.php?estaciones=472CA750)
I tried around with BeautifulSoup and can get a list of the available option tags (see option_tags in the code below).
I am now struggling with how to access the table for each date / option and save the content into e.g. a pandas DataFrame.
Any advice on where to begin?
Here is my code to get the options:
from bs4 import BeautifulSoup
import requests
resp = requests.get("https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo.php?estaciones=472CA750")
html = resp.content
soup = BeautifulSoup(html)
option_tags = soup.find_all("option")
When I look at your given url, I think the table is embedded in the page through an iframe:
<iframe src="_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M" name="contenedor" width="600" marginwidth="0" height="560" marginheight="0" scrolling="NO" align="center" frameborder="0" id="interior"></iframe>
If you open the iframe's src directly (https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M), the page shows the same table, so you can scrape that page instead. I tried it for you and it gives the correct result.
All code:
from bs4 import BeautifulSoup
import requests

resp = requests.get("https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M")
html = resp.content
soup = BeautifulSoup(html, "lxml")  # pass "lxml" or "html.parser" here
# 'aling' is kept from the original answer; the page's markup apparently uses this spelling
option_tags = soup.find_all("tr", attrs={'aling': 'center'})
for a in option_tags:
    print(a.find('div').text)
Output:
Día/mes/año
Prom
01-02-2019
02-02-2019
03-02-2019
04-02-2019
05-02-2019
06-02-2019
07-02-2019
08-02-2019
09-02-2019
10-02-2019
11-02-2019
12-02-2019
13-02-2019
14-02-2019
15-02-2019
16-02-2019
17-02-2019
18-02-2019
The above code gets the dates only. If you want all the elements that go with each date, you can build a list and append each row to it; just change the loop as below:
array = []
for a in option_tags:
    array.append(a.text.split())
print(array)
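Since the question asked about saving the table into a pandas DataFrame, one option is to let pandas parse the page directly (a sketch, assuming lxml is installed; taking the first table on the page is an assumption):
import pandas as pd

url = ("https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo02.php"
       "?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M")
tables = pd.read_html(url)  # parses every <table> on the page into DataFrames
df = tables[0]  # assumption: the data table is the first one
print(df.head())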

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find_all('div', attrs={'class': 'seq gbff'})  # comes back empty
for each in div:
    print(each)
soup.find_all('span', attrs={'class': 'ff_line'})  # also comes back empty
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
With DevTools in Chrome/Firefox I found this url, and it contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this url in the HTML, because different pages will use different arguments in the url. Or you can compare a few urls, find the schema, and generate the url manually.
EDIT: if in the url you change retmode=html to retmode=xml, you get the data as XML. If you use retmode=text, you get it as text without HTML tags. retmode=json doesn't work.
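Putting that together, a minimal sketch that fetches the sequence as plain text (the url is the one above with retmode swapped to text):
import requests

url = ('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi'
       '?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0'
       '&retmode=text&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000')
print(requests.get(url).text)  # the FASTA record, no HTML markup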

Using Beautiful Soup to find specific class

I am trying to use Beautiful Soup to scrape housing price data from Zillow.
I get the web page by property id, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/
When I try the find_all() function, I do not get any results:
results = soup.find_all('div', attrs={"class":"home-summary-row"})
However, if I take the HTML and cut it down to just the bits I want, e.g.:
<html>
<body>
<div class=" status-icon-row for-sale-row home-summary-row">
</div>
<div class=" home-summary-row">
<span class=""> $1,342,144 </span>
</div>
</body>
</html>
I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?
Working example:
from bs4 import BeautifulSoup
import requests
zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")
results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)
Your HTML is not well-formed, and in cases like this choosing the right parser is crucial. In BeautifulSoup, there are currently 3 available HTML parsers, which handle broken HTML differently:
html.parser (built-in, no additional modules needed)
lxml (the fastest, requires lxml to be installed)
html5lib (the most lenient, requires html5lib to be installed)
The Differences between parsers documentation page describes the differences in more detail. In your case, to demonstrate the difference:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>>
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3
As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.
import requests
from bs4 import BeautifulSoup

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "home-summary-row"})
print(g_data[1].text)

# for item in g_data:
#     print(item("span")[0].text)
#     print('\n')
I got this working too, but it looks like someone beat me to it. Going to post anyway.
According to the W3.org Validator, there are a number of issues with the HTML such as stray closing tags and tags split across multiple lines. For example:
<a
href="http://www.zillow.com/danville-ca-94526/sold/" title="Recent home sales" class="" data-za-action="Recent Home Sales" >
This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.
You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:
from bs4 import BeautifulSoup

tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
