Scraping a Java Web-page - python

I have found and read quite a few articles about scraping, but as a beginner I am somewhat overwhelmed.
I want to get data from a table (https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo.php?estaciones=472CA750).
I tried BeautifulSoup and can get a list of the available option tags (see option_tags in the code below).
I am now struggling with how to access the table for each date/option and save the actual content into, e.g., a pandas DataFrame.
Any advice on where to begin?
Here is my code to get the options:
from bs4 import BeautifulSoup
import requests
resp = requests.get("https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo.php?estaciones=472CA750")
html = resp.content
soup = BeautifulSoup(html)
option_tags = soup.find_all("option")

When I look at your given url, I think the table is embedded in the page through an iframe:
<iframe src="_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M" name="contenedor" width="600" marginwidth="0" height="560" marginheight="0" scrolling="NO" align="center" frameborder="0" id="interior"></iframe>
When you follow that src, the page https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M opens and shows the same table, so you can scrape that page directly. I tried it for you and it gives the right result.
**All code:**
from bs4 import BeautifulSoup
import requests

resp = requests.get("https://www.senamhi.gob.pe/mapas/mapa-estaciones/_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M")
html = resp.content
soup = BeautifulSoup(html, "lxml")  # add "lxml" or "html.parser" in this line
# note: 'aling' matches the site's own (misspelled) attribute
option_tags = soup.find_all("tr", attrs={'aling': 'center'})
for a in option_tags:
    print(a.find('div').text)
Output:
Día/mes/año
Prom
01-02-2019
02-02-2019
03-02-2019
04-02-2019
05-02-2019
06-02-2019
07-02-2019
08-02-2019
09-02-2019
10-02-2019
11-02-2019
12-02-2019
13-02-2019
14-02-2019
15-02-2019
16-02-2019
17-02-2019
18-02-2019
The code above gets the dates only. If you want all the values that go with each date, you can create a list and append each row to it. Just change the code to:
array = []
for a in option_tags:
    array.append(a.text.split())
print(array)
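Since the end goal is a pandas DataFrame, it may be easiest to let pandas parse the table directly. A minimal sketch, assuming pandas and lxml are installed and that the station table is the first one read_html finds (the index may need adjusting); looping CBOFiltro over the values from your option_tags would then collect every month:
import pandas as pd
import requests

url = ("https://www.senamhi.gob.pe/mapas/mapa-estaciones/"
       "_dat_esta_tipo02.php?estaciones=472CA750&tipo=SUT&CBOFiltro=201902&t_e=M")

# read_html returns a list with one DataFrame per <table> on the page;
# inspect it to pick the right table -- index 0 is an assumption
tables = pd.read_html(requests.get(url).text)
df = tables[0]
print(df.head())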

Related

webscraping in python: copying specific part of HTML for each webpage

I am working on a web scraper using requests-html and Beautiful Soup (new to this). For one webpage (https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html) I am trying to scrape a part, which I will then replicate for other products. The html looks like:
<span class="js-enhanced-ecommerce-data hidden" data-product-title="Illamasqua Expressionist Artistry Palette" data-product-id="12024086" data-product-category="" data-product-is-master-product-id="false" data-product-master-product-id="12024086" data-product-brand="Illamasqua" data-product-price="£39.00" data-product-position="1">
</span>
I want to select the data-product-brand="Illamasqua", specifically the Illamasqua part. I am not sure how to grab this using requests-html or BeautifulSoup. I tried:
r.html.find("span.data-product-brand", first=True)
But this was unsuccessful. Any help would be appreciated.
Because you tagged beautifulsoup, here's a solution using that package:
from bs4 import BeautifulSoup
import requests
page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")
# there are multiple matches for the class, each carrying the brand
# in its data-product-brand attribute; in this case there are three
for l in soup.find_all(class_="js-enhanced-ecommerce-data hidden"):
    print(l.get('data-product-brand'))

# if it's always going to be the first, you can just do this
soup.find(class_="js-enhanced-ecommerce-data hidden").get('data-product-brand')
You can get element(s) with a specified data attribute directly:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
span = r.html.find('[data-product-brand]', first=True)
print(span)
There are 3 results, and you need just the first, I guess.
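Plain BeautifulSoup can do the same with a CSS attribute selector; a minimal sketch along those lines:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")

# [data-product-brand] matches any element carrying that attribute
span = soup.select_one('[data-product-brand]')
if span is not None:
    print(span.get('data-product-brand'))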

BeautifulSoup4 cannot locate table no matter what I try

I am trying to scrape 2 tables from a webpage simultaneously.
BeautifulSoup finds the first table no problem, but no matter what I try it cannot find the second one. Here is the webpage: Hockey Reference: Justin Abdelkader.
It is the table underneath the Playoffs header.
Here is my code.
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014', timeout=None).read()
soup = bs.BeautifulSoup(sauce, 'html5lib')
table = soup.find_all('table')
print(len(table))
Which always prints 1.
If I print(soup) and use the search function in my terminal, I can locate 2 separate table tags. I don't see any javascript that would be hindering BS4 from finding the tags. I have also tried finding the table by id and class; even the parent div of the table seems to be unfindable. Does anyone have any idea what I could be doing wrong?
That is because javascript loads the additional information.
These days requests_html can render a page's javascript content along with the html:
pip install requests-html
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014')
r.html.render()
res = r.html.find('table')
print(len(res))
4
The second table seems to be inside an HTML comment tag <!--... <table class=.... I guess that's why BeautifulSoup doesn't find it.
Looks like that table is a widget: click "Share & more" -> "Embed this Table" and you'll get a script with this link:
https://widgets.sports-reference.com/wg.fcgi?css=1&site=hr&url=%2Fplayers%2Fa%2Fabdelju01%2Fgamelog%2F2014&div=div_gamelog_playoffs
How can we parse it?
import requests
import bs4
url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=hr&url=%2Fplayers%2Fa%2Fabdelju01%2Fgamelog%2F2014&div=div_gamelog_playoffs'
widget = requests.get(url).text
# note: str.lstrip/rstrip strip character *sets*, not literal prefixes, so use
# removeprefix/removesuffix (Python 3.9+) to peel off the document.write wrapper
fixed = '\n'.join(s.removeprefix("document.write('").removesuffix("');") for s in widget.splitlines())
soup = bs4.BeautifulSoup(fixed, 'html.parser')
soup.find('td', {'data-stat': "date_game"}).text  # => '2014-04-18'
Voila!
You can reach the comment lines with bs4's Comment, like this:
from bs4 import BeautifulSoup, Comment
from urllib.request import urlopen

search_url = 'https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014'
page = urlopen(search_url)
soup = BeautifulSoup(page, "html.parser")

table = soup.findAll('table')  # html part with no comment
table_with_comment = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in table_with_comment]
# print(table_with_comment) prints all the comment lines

start = '<table class'
for c in range(0, len(table_with_comment)):
    if start in table_with_comment[c]:
        print(table_with_comment[c])  # print each comment that contains <table class
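Each extracted comment is just a string of HTML, so it can be fed back into BeautifulSoup to recover the hidden table. A minimal sketch continuing from the loop above:
for c in table_with_comment:
    if '<table class' in c:
        # re-parse the comment's text as its own document
        comment_soup = BeautifulSoup(c, "html.parser")
        for row in comment_soup.find('table').find_all('tr'):
            print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])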

Finding number of pages using Python BeautifulSoup

I want to extract the total page number (11 in this case) from a Steam page. I believe that the following code should work (return 11), but it is returning an empty list, as if it is not finding the paged_items_paging_pagelink class.
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
total_pages = soup.find_all("span",{"class":"paged_items_paging_pagelink"})[-1].text
If you check the page source, the content you want is not available. It means that it is generated dynamically through Javascript.
The page numbers are located inside the <span id="NewReleases_links"> tag, but in the page source the HTML shows only this:
<span id="NewReleases_links"></span>
Easiest way to handle this is using Selenium.
But if you look at the page source, the text Showing 1-20 of 213 results is available. So you can scrape that and calculate the number of pages.
Required HTML:
<div class="paged_items_paging_summary ellipsis">
Showing
<span id="NewReleases_start">1</span>
-
<span id="NewReleases_end">20</span>
of
<span id="NewReleases_total">213</span>
results
</div>
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
soup = BeautifulSoup(r.text, 'lxml')
def get_pages_no(soup):
    total_items = int(soup.find('span', id='NewReleases_total').text)
    items_per_page = int(soup.find('span', id='NewReleases_end').text)
    return round(total_items / items_per_page)
print(get_pages_no(soup))
# prints 11
(Note: I still recommend the use of Selenium, as most of the content from this site is dynamically generated. It'll be a pain to scrape all the data like this.)
An alternative, faster way without using BeautifulSoup:
import requests

url = "http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/?query=&start=20&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG"  # this endpoint returns the query results in json format
r = requests.get(url)
print(round(r.json()['total_count'] / 20))  # total_count = number of records, 20 = items shown per page
11
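One caveat for both answers: round() under-counts whenever the division lands just below .5 (e.g. 201 items at 20 per page rounds to 10, not 11), so math.ceil is the safer choice. A small variant of the json approach:
import math
import requests

url = ("http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/"
       "?query=&start=20&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG")

total = requests.get(url).json()['total_count']  # number of records
print(math.ceil(total / 20))                     # 20 items per page -> 11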

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class='ff_line' id='gi_344258949_1'> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find_all('div', attrs={'class', 'seq gbff'})
for each in div.children:
print(each)
soup.find_all('span', aatrs={'class', 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
Using DevTools in Chrome/Firefox, I found this url, which contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this url in the HTML, because different pages will use different arguments in the url. Alternatively, you can compare a few urls, work out the schema, and generate the url manually.
EDIT: if in the url you change retmode=html to retmode=xml, you get the data as XML. If you use retmode=text, you get it as plain text without HTML tags. retmode=json doesn't work.
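For example, a minimal sketch using the retmode=text variant of that url to print the raw sequence:
import requests

url = ('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi'
       '?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0'
       '&retmode=text&withmarkup=on&tool=portal&log$=seqview'
       '&maxdownloadsize=1000000')

r = requests.get(url)
print(r.text)  # FASTA header followed by the plain sequence lines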

Python: Getting text from html using Beautifulsoup

I am trying to extract the ranking text number from this link: kaggle user ranking no1 (shown more clearly in the image in the original post).
I am using the following code:
import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText, "html.parser")
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
The result is None. The problem is that soup.findAll('h4', {'data-bind': "text: rankingText"}) outputs:
[<h4 data-bind="text: rankingText"></h4>]
but in the html of the link, when inspecting, it is like:
<h4 data-bind="text: rankingText">1st</h4>
It's clear that the text is missing. How can I get around that?
Edit: Printing the soup variable in the terminal, I can see that this value exists, so there should be a way to access it through soup.
Edit 2: I tried, unsuccessfully, the most voted answer from this stackoverflow question. A solution could be somewhere around there.
If you aren't going to try browser automation through selenium, as #Ali suggested, you will have to parse the javascript that contains the desired information. You can do this in different ways. Here is working code that locates the script by a regular expression pattern, extracts the profile object, loads it with json into a Python dictionary, and prints the desired ranking:
import re
import json
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
profile_text = pattern.search(script.text).group(1)
profile = json.loads(profile_text)
print(profile["ranking"], profile["rankingText"])
Prints:
1 1st
The data is data-bound using javascript, as the "data-bind" attribute suggests.
However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on initial load:
<script type="text/javascript">
profile: {
...
"ranking": 96,
"rankingText": "96th",
"highestRanking": 3,
"highestRankingText": "3rd",
...
So you could use that instead.
I have solved your problem using a regex on the plain text:
import re
import requests

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    pattern = re.compile(r"ranking\": [0-9]+")
    name = pattern.search(plainText)
    ranking = name.group().split()[1]
    print(ranking)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
This returns only the rank number, but I think it will help you, since from what I see the rankingText just adds 'st', 'th', etc. to the right of the number.
This could be because of dynamic data filling.
Some javascript code fills this tag after the page loads, so if you fetch the html using requests it is not filled yet:
<h4 data-bind="text: rankingText"></h4>
Please take a look at the Selenium web driver. With it you can fetch the complete page and run the js as normal.
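A minimal Selenium sketch of that approach (assuming ChromeDriver is installed; the selector mirrors the h4 from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the page's javascript time to fill the tag
driver.get('https://www.kaggle.com/titericz')
element = driver.find_element(By.CSS_SELECTOR, 'h4[data-bind="text: rankingText"]')
print(element.text)  # e.g. '1st'
driver.quit()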
