How to get the string from "chrome://downloads" page - python

I used ChromeDriver to download the file, and then I would like to parse "chrome://downloads" to get the download status, but I can't get the string; please refer to the code and result below. I also checked the HTML in Chrome DevTools, where I could see <span id="name">Noto-hinted (1).zip</span>, but when I use "view page source" I can't find the string "Noto-hinted (1).zip"; instead it shows <span id="name" hidden="[[completelyOnDisk_]]">[[data.file_name]]</span>
import time, bs4
from selenium import webdriver
url = "https://noto-website.storage.googleapis.com/pkgs/Noto-hinted.zip"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(5)
browser.get("chrome://downloads/")
time.sleep(30)
soup = bs4.BeautifulSoup(browser.page_source,"lxml")
webElemlist = soup.find('span', id='name')
print(webElemlist)
time.sleep(300)
browser.quit()
Output:
<span id="name"> </span>

I changed 'lxml' to 'html', got the warning messages below, and still can't get the strings.
Warning (from warnings module):
File "C:\Python362\lib\site-packages\bs4\__init__.py", line 181
markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 1 of the file . To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "lxml")

Are you trying to get the downloading item from the screen?
Right-click the element you are trying to click, and select 'Inspect'.
This will open the console and you can see the specific tags for each element of the page, as you hover over them.
I found this for the package:
<div id="title-area">
<a is="action-link" id="file-link" tabindex="0" role="link" hidden="" href="https://noto-website.storage.googleapis.com/pkgs/Noto-hinted.zip">Noto-hinted.zip</a>
<span id="name">Noto-hinted.zip</span>
<span id="tag"></span>
</div>
All you need to do is get the text for these tags using the IDs. This also applies after you have downloaded the file.
Edit:
test = """
<div id="title-area">
<a is="action-link" id="file-link" tabindex="0" role="link" hidden="" href="https://noto-website.storage.googleapis.com/pkgs/Noto-hinted.zip">Noto-hinted.zip</a>
<span id="name">Noto-hinted.zip</span>
<span id="tag"></span>
</div>
"""
soup = BeautifulSoup(test, "lxml")
fileDiv = soup.find("span", {"id": "name"}).text
print(fileDiv)
If the above does not work, try doing this:
soup = bs4.BeautifulSoup(browser.page_source,"html.parser")
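A note on why page_source stays empty: chrome://downloads is a Polymer app whose visible text lives inside shadow DOM, so the flat page source only ever contains the unfilled template ([[data.file_name]]). A sketch of reading the name through the shadow roots with execute_script instead — the element names downloads-manager, downloads-item and #name are assumptions that match recent Chrome versions and may change between releases:

```python
# Assumption: Chrome's downloads page nests the file name as
# downloads-manager -> shadowRoot -> downloads-item -> shadowRoot -> #name.
JS_FIRST_DOWNLOAD_NAME = """
var manager = document.querySelector('downloads-manager');
if (!manager || !manager.shadowRoot) return null;
var item = manager.shadowRoot.querySelector('downloads-item');
if (!item || !item.shadowRoot) return null;
var name = item.shadowRoot.querySelector('#name');
return name ? name.textContent.trim() : null;
"""

def first_download_name(driver):
    # driver is a Selenium webdriver currently on chrome://downloads/
    return driver.execute_script(JS_FIRST_DOWNLOAD_NAME)
```

Call it right after browser.get("chrome://downloads/") in place of the BeautifulSoup parsing.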

Related

Python BeautifulSoup: extract value from div class

I want to build a program that automatically gets the live price of the German index (DAX). Therefore I use a website with the price provider FXCM.
In my code I use beautifulsoup and requests as packages. The div box where the current value is stored looks like this:
<div class="left" data-item="quoteContainer" data-bg_quotepush="133962:74:bid">
<div class="wrapper cf">
<div class="left">
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" data-bg_quotepush_c="40">13.599,24</span>
<span class="label" data-bg_quotepush="time" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="time" data-bg_quotepush_c="41">25.12.2020</span>
<span class="label"> • </span>
<span class="label" data-item="currency"></span>
</div>
<div class="right">
<span class="percent up" data-bg_quotepush="percent" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="percent" data-bg_quotepush_c="42">+0,00<span>%</span></span>
<span class="label up" data-bg_quotepush="change" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="change" data-bg_quotepush_c="43">0,00</span>
</div>
</div>
</div>
The value I want is the one after data-bg_quotepush_c="40", which has a value of 13.599,24.
My Python code looks like this:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
response = rq.get(url)
soup = bs(response.text, "lxml")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price["data-bg_quotepush_c"])
It returns the following error:
File "C:\Users\Felix\anaconda3\lib\site-packages\bs4\element.py", line 1406, in __getitem__
return self.attrs[key]
KeyError: 'data-bg_quotepush_c'
Use Selenium instead of requests when working with dynamically generated content.
What is going on?
Requesting the website with requests only provides the initial content, which does not contain all the dynamically generated information, so you cannot find what you are looking for.
To wait until the website has loaded completely, use Selenium with sleep() as a simple method, or Selenium waits as a more advanced one.
Avoiding the error
Use price.text to get the text of the element that looks like this:
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_c="40" data-bg_quotepush_f="quote" data-bg_quotepush_i="133962:74:bid">13.599,24</span>
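On the static snippet quoted above, both the working .text lookup and the attribute access can be reproduced with BeautifulSoup alone (a minimal sketch; the HTML is abridged from the question):

```python
from bs4 import BeautifulSoup

html = """
<div class="left" data-item="quoteContainer">
  <div class="wrapper cf">
    <div class="left">
      <span class="quote quote_standard" data-bg_quotepush_c="40">13.599,24</span>
      <span class="label">25.12.2020</span>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The first div.left is the outer container; its first <span> descendant
# is the quote span, so .text yields the price itself:
price = soup.find_all("div", {"class": "left"})[0].find("span")
print(price.text)                     # 13.599,24
print(price["data-bg_quotepush_c"])   # 40 (present here; a missing attribute raises KeyError)
```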
Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
driver.implicitly_wait(3)
soup = BeautifulSoup(driver.page_source,"html5lib")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price.text)
driver.close()
Output
13.599,24
If you are scraping the value of the div class, try this example:
from selenium import webdriver
from bs4 import BeautifulSoup
# create a variable to store the url string
url = 'https://news.guidants.com/#Ticker/Profil/?i=133962&e=74'
driver = webdriver.Chrome(YourPATH)  # path to your chromedriver
driver.get(url)
# scraping process: parse the rendered page source
soup = BeautifulSoup(driver.page_source, "html5lib")
prices = soup.find_all("div", attrs={"class": "left"})
for price in prices:
    total_price = price.find('span')
# close the driver
driver.close()
If you are using the requests module, try a different parser. You can install one with pip, for example html5lib:
pip install html5lib
Thanks

parse page with beautifulsoup

I'm trying to parse this webpage and extract some information:
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513
import requests
page = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
All_Information = soup.find(id="MainContent")
print(All_Information)
It seems all the information between the tags is hidden. When I run the code, this data is returned:
<div class="tabcontent content" id="MainContent">
<div id="TopBox"></div>
<div id="ThemePlace" style="text-align:center">
<div class="box1 olive tbl z2_4 h250" id="Section_relco" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_history" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_tcsconfirmedorders" style="display:none"></div>
</div>
</div>
Why is the information not there, and how can I find and/or access it?
The information that I assume you are looking for is not loaded in your request. The webpage makes additional requests after it has initially loaded. There are a few ways you can get that information.
You can try selenium. It is a python package that simulates a web browser. This lets the page load all the information before you try to scrape it.
Another way is to reverse engineer the website and find out where it is getting the information you need.
Have a look at this link.
http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+
It is called by your page every few seconds, and it appears to contain all the pricing information you are looking for. It may be easier to call that webpage to get your information.
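A sketch of calling that endpoint directly (the URL is the one linked above; the response format is undocumented, so treating the payload as plain delimited text is an assumption — inspect it before relying on any field):

```python
def fetch_instrument_info(instrument_id, c="57+", session=None):
    # Endpoint the page polls every few seconds (taken from the link above).
    if session is None:
        import requests  # real HTTP client when no stub is supplied
        session = requests
    url = "http://www.tsetmc.com/tsev2/data/instinfofast.aspx"
    resp = session.get(url, params={"i": instrument_id, "c": c}, timeout=10)
    resp.raise_for_status()
    # Assumption: the body is plain text, not HTML, so return it raw.
    return resp.text
```

For the instrument in the question, fetch_instrument_info("778253364357513") should return the raw pricing payload.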

HTML source while webscraping seems inconsistent for website

I checked out, say:
https://www.calix.com/search-results.html?searchKeyword=C7
If I inspect-element the first link, I get this:
<a class="title viewDoc" href="https://www.calix.com/content/dam/calix/mycalix-misc/ed-svcs/learning_paths/C7_lp.pdf" data-preview="/session/4e14b237-f19b-47dd-9bb5-d34cc4c4ce01/" data-preview-count="1" target="_blank"><i class="fa fa-file-pdf-o grn"></i><b>C7</b> Learning Path</a>
I coded:
import requests, bs4
res = requests.get('https://www.calix.com/search-results.html?searchKeyword=C7', headers={'User-Agent': 'test'})
print(res)
#res.raise_for_status()
bs_obj= bs4.BeautifulSoup(res.text, "html.parser")
elems = bs_obj.findAll('a', attrs={"class": "title viewDoc"})
print(elems)
And there was [] as output (empty list).
So, I thought about actually looking through the "view-source" for the page.
view-source:https://www.calix.com/search-results.html?searchKeyword=C7
If you search through the "view-source" you will not find the code for the "inspect element" I mentioned earlier.
There is no "a class="title viewDoc"" in the view-source of the page.
That is probably why my code isn't returning anything.
Then I went to www.nba.com and inspected a link:
<a class="content_list--item clearfix" href="/article/2018/07/07/demarcus-cousins-discusses-stacked-golden-state-warriors-roster"><h5 class="content_list--title">Cousins on Warriors' potential: 'Scary'</h5><time class="content_list--time">in 5 hours</time></a>
The content of "inspect" for this link was in the "view-source" of the page.
And, obviously my code was working for this page.
I have seen a few other examples of the first issue.
Just curious why the difference in HTML formats, or am I missing something?
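The difference can be checked mechanically: if a string you saw in DevTools' "inspect" is absent from the raw response body, that element is built by JavaScript after the page loads. A small helper sketch (the marker string is whatever fragment you are hunting for; nothing here is site-specific):

```python
def rendered_client_side(raw_html, marker):
    # True when `marker` (e.g. 'class="title viewDoc"') does not appear in
    # the server-sent HTML, i.e. the element is injected by JavaScript later.
    return marker not in raw_html
```

With res = requests.get(...) as in the question, rendered_client_side(res.text, 'title viewDoc') would come back True for the Calix page and False for the NBA link, matching what the view-source comparison showed.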

How to get specific data using BeautifulSoup

I'm not sure how to get a specific result from this:
<div class="videoPlayer">
<div class="border-radius-player">
<div id="allplayers" style="position:relative;width:100%;height:100%;overflow: hidden;">
<div id="box">
<div id="player_content" class="todo" style="text-align: center; display: block;">
<div id="player" class="jwplayer jew-reset jew-skin-seven jw-state-paused jw-flag-user-inactive" tabindex="0">
<div class="jw-media jw-reset">
<video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
</div>
How would I get the src in <video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
This is what I've tried so far:
import urllib.request
from bs4 import BeautifulSoup
url = "https://someurlhere"
a = urllib.request.Request(url, headers={'User-Agent' : "Cliqz"})
b = urllib.request.urlopen(a) # prevent "Permission denied"
soup = BeautifulSoup(b, 'html.parser')
for video_class in soup.select("div.videoPlayer"):
    print(video_class.text)
This returns parts of it, but not down to the video tag.
urllib and requests are simple HTTP clients; they cannot execute JavaScript.
You have three more options to try here, though!
Try going over the HTML source (b) and see if any of the scripts on the site contain the data you need. Usually the page holds the URL (which, I assume, is what you want to scrape) in some sort of holder (a JavaScript snippet or a JSON object) that you can extract.
Try looking at the XHR requests of the site and see if any of them query external sources for the video data. In that case, see if you can imitate that request to get the data you need.
(Last resort) Use a PhantomJS + Selenium browser to download the website (Link1, Link2). You can find out more about how to use Selenium in this SO post: https://stackoverflow.com/a/26440563/3986395
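Once you have the rendered HTML (e.g. from Selenium's driver.page_source, per the last option), pulling the src out of the quoted markup is straightforward; a minimal sketch on the fragment from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="jw-media jw-reset">
  <video class="jw-video jw-reset" x-webkit-playsinline=""
         src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# class_ matches any one of the element's classes, so "jw-video" is enough
video = soup.find("video", class_="jw-video")
print(video["src"])    # https:EXAMPLE-URL-HERE
```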

Python BeautifulSoup issue parsing table

Hi, I am using BeautifulSoup to parse tables on the following website, but not all the rows are getting returned. I am looking for article tags (http://itp.ne.jp/result/?kw=%92J%98e%8E%95%89%C8%83N%83%8A%83j%83b%83N)
url = 'http://itp.ne.jp/result/?kw=%92J%98e%8E%95%89%C8%83N%83%8A%83j%83b%83N'
page = requests.get(url)
prefsoup = BeautifulSoup(page.content,"html.parser")
art= prefsoup.find_all("article")
print(art)
[<article>
<section class="noimage">
<h4 class="clearfix">
<a class="blackText" href="/shop/KN0114031400001406/" target="_blank">谷脇歯科クリニック</a>
<a class="itrademark24" href="/stats_click/?s_bid=KN0114031400001406&s_sid=FSP-LSR-001&s_fr=V09&s_ck=C12&s_acd=7" target="_blank"><img alt="付加価値情報" src="/img/pc/shop/icon_itrade_7.gif"/></a>
</h4>
<p><span class="inlineSmallHeader">住所</span> 〒060-0042 北海道札幌市中央区大通西5丁目 <a class="boxedLink navigationLink" href="/shop/KN0114031400001406/map.html" target="_blank">地図・ナビ</a></p>
<p><span class="inlineSmallHeader">TEL</span>
<a class="whiteboxicon popup_04" href="/guide/phonemark.html">(代)</a>
<b>011-213-1184</b></p>
<p>
<span class="inlineSmallHeader">URL</span>
http://taniwaki-dental.com</p></section></article>]
However, it is missing the last paragraph, which holds the email information:
<p><span class="inlineSmallHeader">EMAIL</span>
taniwaki#kzh.biglobe.ne.jp<!-- br-->
</p>
Moreover, len(art) returns 2, and art[1] returns an index out of range error.
Tried several pages and got the same issue.
Use the parser html5lib instead of html.parser and it will work like a charm. You just need to change the following line of code -
prefsoup = BeautifulSoup(page.content,"html.parser")
to -
prefsoup = BeautifulSoup(page.content,"html5lib")
Of course, you will need to install html5lib using pip install html5lib.
Check this as well - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
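As a quick sanity check after installing it, the call is identical except for the parser name (this snippet only demonstrates the swap on a well-formed sample, not the original page):

```python
from bs4 import BeautifulSoup

# html5lib builds the tree the way a browser would, which is why it recovers
# elements that the stricter built-in parser drops on malformed pages.
sample = "<article><p>first</p></article><article><p>second</p></article>"
soup = BeautifulSoup(sample, "html5lib")   # requires: pip install html5lib
articles = soup.find_all("article")
print(len(articles))    # 2
```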
