HTML request does not show everything that the HTML in the browser shows - Python

I am trying to obtain the comments of a website using Python and urllib.
I am able to get the HTML; however, I noticed that the comment section is missing from the HTML I got using Python.
Here's what I get using Python:
<div data-bv-product-id="6810124" data-bv-show="reviews" id="BVReviewsContainer">
</div>
(what's in between the div tags is empty)
Whereas this is what it should look like (in the browser):
<div data-bv-product-id="6810124" data-bv-show="reviews" id="BVReviewsContainer">
<div id="BVRRContainer">
<div class="bv-cleanslate bv-cv2-cleanslate"> <div data-bv-v="contentList:1" class="bv-shared bv-core-container-437" data-product-id="6810124">
.
.
.
</div>
</div>
</div>
I am confounded as to why I am not getting the whole thing.

This post explains why scraped HTML isn't always the same as what you see in the browser: JavaScript can change the HTML of a website after it loads. One instance where I've seen this happen is, I believe, on Archive of Our Own, where the actual body of a work was not available in the raw HTML. According to that StackOverflow post, you should use Selenium to scrape it instead, as it essentially simulates what actually happens when a user accesses a page: the user opens a web browser (you can use your preferred browser, like Chrome), the browser opens the page, and the page's JavaScript runs (possibly through the onload event).
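As a minimal sketch (the product page URL below is a placeholder, and I'm assuming the reviews are injected into the BVReviewsContainer element by JavaScript), Selenium can wait for the widget to render and then hand you the full HTML:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.example.com/product/6810124")  # placeholder URL for the product page
# wait until the reviews widget has injected its content into the container
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#BVReviewsContainer #BVRRContainer"))
)
# grab the rendered HTML of the reviews container
reviews_html = driver.find_element(By.ID, "BVReviewsContainer").get_attribute("innerHTML")
print(reviews_html)
driver.quit()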

Related

Python Webscrape: hidden strange url link that is not available in page source

I am currently working on extracting some information from a portal. What I am trying to do is extract the URL link to an external PDF file. The website I am trying to scrape is https://232app.azurewebsites.net/Forms/ExclusionRequestItem/800. The information I am trying to scrape is the BIS Decision Memo section, which has a "View attachment file" button linked to an external PDF file:
"View attachment file" button
Here comes the question: when I looked into the page source, I did not find any url link related to the PDF file:
<div>
<h3>BIS Decision Memo</h3>
<div class="jumbotron">
<div class="row form-group">
<div class="col-sm-12" id="DMAttachment">
<span>Please wait...</span>
View attachment file<br />
</div>
</div>
</div>
</div>
</div>
<div class="row form-group">
<div class="col-xs-12 col-sm-4 col-md-4 col-lg-4 text-left">
However, when I clicked on the "View attachment file" button, I was able to download the PDF file. Looking at the downloaded PDF file, I found that its link address is as follows:
https://itaisinternationaltrade.sharepoint.com/sites/232App/_layouts/15/download.aspx?UniqueId=a18de65a-7092-4670-8c9a-9315a62f1814&Translate=false&tempauth=eyJ0eXAiOiJKV1QiLCJhbGciOiJub25lIn0.eyJhdWQiOiIwMDAwMDAwMy0wMDAwLTBmZjEtY2UwMC0wMDAwMDAwMDAwMDAvaXRhaXNpbnRlcm5hdGlvbmFsdHJhZGUuc2hhcmVwb2ludC5jb21AYTFkMTgzZjItNmM3Yi00ZDlhLWI5OTQtNWYyZjMxYjNmNzgwIiwiaXNzIjoiMDAwMDAwMDMtMDAwMC0wZmYxLWNlMDAtMDAwMDAwMDAwMDAwIiwibmJmIjoiMTYwOTEwNTMyNiIsImV4cCI6IjE2MDkxMDg5MjYiLCJlbmRwb2ludHVybCI6Ik8zcVZjS2N6WC9mSjlkeVU2SzlpcG1hVFRsQWNOblkvamZ5RUFPMmxUT2c9IiwiZW5kcG9pbnR1cmxMZW5ndGgiOiIxNDciLCJpc2xvb3BiYWNrIjoiVHJ1ZSIsImNpZCI6Ik5HSTJPVE5pTldRdE1EQTNOaTAwWWpFekxUZzVOek10WW1VME1qa3dNalF3TmpZMiIsInZlciI6Imhhc2hlZHByb29mdG9rZW4iLCJzaXRlaWQiOiJOMlUyT1dReU9HRXRabUUxT0MwME56SmtMVGxtWVdVdFlUZ3pZMk0zWldZME1HSXoiLCJhcHBfZGlzcGxheW5hbWUiOiIyMzJBcGkiLCJuYW1laWQiOiI3MDQ3M2M5OC0wNzIyLTQ1MDEtYWJhZi1kOWEyNWNmM2FlN2RAYTFkMTgzZjItNmM3Yi00ZDlhLWI5OTQtNWYyZjMxYjNmNzgwIiwicm9sZXMiOiJhbGxzaXRlcy5tYW5hZ2UgYWxsZmlsZXMud3JpdGUiLCJ0dCI6IjEiLCJ1c2VQZXJzaXN0ZW50Q29va2llIjpudWxsfQ.R2FXb3pYOE4yN1VFajRRMUs3ME50QlZjdHZ6ZnljNSs4VFlQaUhiQitYRT0&ApiVersion=2.0
Therefore, I am very curious about where this strange URL comes from. I split this strange URL into several parts and searched for them one by one in the page source, but I could not get any clues. Therefore, I would like to ask for some hints on how to get this URL.
In addition, I am trying to scrape more PDF url links like the above one: https://232app.azurewebsites.net/Forms/ExclusionRequestItem/801
Therefore, I would like to ask if there is any way I can scrape these PDF file links. How should I approach this question? What I have right now uses the requests package:
import requests
url = 'https://232app.azurewebsites.net/Forms/ExclusionRequestItem/800'
html_data = requests.get(url).text
Then I tried to slice the text to extract the PDF URL. However, since I am not able to find the PDF URLs as shown above, I do not know what I can do. Please give me some hints. Thank you very much in advance!
This page uses JavaScript when you click this button.
Using DevTools in Firefox/Chrome (and digging in the HTML) I found that this button sends a request to the URL
https://232app.azurewebsites.net/Forms/ExclusionRequestItem/800?handler=DownloadDM&ID=800
and it gets back JSON data with a link to the PDF.
So I used this to get the file using only requests:
import requests
import webbrowser
number = 800
# generate URL with `800` in two places
url = f'https://232app.azurewebsites.net/Forms/ExclusionRequestItem/{number}?handler=DownloadDM&ID={number}'
# send requests to get JSON data with `data["downloadURL"]`
r = requests.get(url)
data = r.json()
print('url:', data["downloadURL"])
# create unique filename for PDF
filename = f'output-{number}.pdf'
# get PDF and save it to a file (using bytes mode)
r = requests.get(data["downloadURL"])
with open(filename, 'wb') as fh:
    fh.write(r.content)
# open PDF in default program
webbrowser.open(filename)
Using a different number, I can get different documents.

parse page with beautifulsoup

I'm trying to parse this webpage and take some of information:
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513
import requests
page = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
All_Information = soup.find(id="MainContent")
print(All_Information)
It seems all the information between the tags is hidden. When I run the code, this is the data returned:
<div class="tabcontent content" id="MainContent">
<div id="TopBox"></div>
<div id="ThemePlace" style="text-align:center">
<div class="box1 olive tbl z2_4 h250" id="Section_relco" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_history" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_tcsconfirmedorders" style="display:none"></div>
</div>
</div>
Why is the information not there, and how can I find and/or access it?
The information that I assume you are looking for is not loaded in your request. The webpage makes additional requests after it has initially loaded. There are a few ways you can get that information.
You can try Selenium. It is a Python package that simulates a web browser. This allows the page to load all of its information before you try to scrape it.
Another way is to reverse engineer the website and find out where it is getting the information you need.
Have a look at this link.
http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+
It is called by your page every few seconds, and it appears to contain all the pricing information you are looking for. It may be easier to call that webpage to get your information.
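A minimal sketch of that approach (the response format of this endpoint is undocumented, so you would need to inspect the returned text and work out the delimiters yourself):
import requests
url = "http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+"
r = requests.get(url)
print(r.text)  # raw pricing payload; inspect it to see how the fields are delimited before parsing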

How can I access bulk-edit button "bulkedit_all"? Python / Selenium

I am trying to automate JIRA tasks but struggling to access the bulk edit option after a JQL filter. After reaching the correct screen I am stuck at this point:
[screenshot of the Bulk Change dropdown]
HTML code:
<div class="aui-list">
<h5>Bulk Change:</h5>
<ul class="aui-list-section aui-first aui-last">
<li class="aui-list-item active">
<a class="aui-list-item-link" id="bulkedit_all" href="/secure/views/bulkedit/BulkEdit1!default.jspa?reset=true&tempMax=4">all 4 issue(s)</a>
</li>
</ul>
</div>
My Python code:
bulkDropdown = browser.find_elements_by_xpath("//div[@class='aui-list']//aui-list[@class='aui-list-item.active']").click()
Try the following xpath -
bulkDropdown = browser.find_element_by_xpath("//li/a[@id='bulkedit_all']").click()
The link you want has an ID, you should use that unless you find that it's not unique on the page.
browser.find_element_by_id("bulkedit_all").click()
You will likely need to add a wait for clickable since from the screenshot it looks like a popup or tooltip of some kind. See the docs for more info on the different waits available.
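For example, a short sketch of an explicit wait (assuming browser is your WebDriver instance and the usual Selenium support imports):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait until the "all 4 issue(s)" link is clickable, then click it
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "bulkedit_all"))
).click()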

can't locate element in iframe selenium

I'm trying to switch to a frame on a web page to access a video in that frame, but an error always occurs saying the element is not found. I've tried many elements, all with the same error.
This is the code I used to switch to the frame and get the video URL:
WebDriverWait(browser,10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//*[@id = 'innerframe']/iframe")))
browser.find_element_by_xpath("//video[#id='mediaplayer']/soucre").text
Here is the HTML of the page:
<div id="innerframe"><iframe src="https://ops.cielo24.com/hitman/work/load_task/554097cdbe314edb9ad5d62edf5396ed/tasks/2547efb19fc5430c9f335fe165a46df3?active_task_uuid=44686eab1ad8448d97e5e74e484575ab" width="100%" height="100%" frameborder="0"></iframe>
video html:
<div id="mediacontent">
<video height="305" width="480" id="mediaplayer"><source src="https://c24cdn.co/restricted/sliced-media/790f319ee29c46e585c5ee585ed31580.mp4?Expires=1584401901&GoogleAccessId=microservice-writer%40coresystem-171219.iam.gserviceaccount.com&Signature=QtSAPQc5GMxPx9qAI8WnCurouFagNgRE2rto1B3af%2BrUhemeqoFnJZWmfQfQ2SGXKAhc5pXL68GhLINlshZ4yGEvy7SDMEr1l44Z%2FA9bFL3Xvlsii9MfZpkXaCeXT%2FKrMZZvH%2BpbiR%2BpgQjgqLysP68fODMsQ3zub9FCx8zD2Yw5bQZg12rzQWdlEcU5VHGktTSDAjpReWHIrmca63X6jQAYru5TQi12sy18UwSlpdrF1qFgXlTOEMKwB2iPHbLRPxxpFF%2FhOkYVrCcIi6OmJOXvy6arBZY9%2FYBP2vjIpDQ3UODyH8uFrEFdWbqVTHAe0G0pKly4NK1K30dKrSGYJw%3D%3D" type="video/mp4"><source src="https://c24cdn.co/restricted/sliced-media/59b2c60d2e764a25bd4a8e2d6f15cb31.webm?Expires=1584401901&GoogleAccessId=microservice-writer%40coresystem-171219.iam.gserviceaccount.com&Signature=JGbxZYS0u2rI2gY%2BjXThKj9KkIMBDfLvW9XEImWdtfzMFNpUBBm33B7wM3XYD01JLKcMD%2BlqfWf%2FqzMFAgW2zQH07NvGKzdkYFIgwxgCUQha8ws%2FLqoJyLMiz8UeXr5Smqqjr%2FiFrLLc6HmCnYfP8g7Y%2BJ%2FJoQuHmVeZjJIKxz957SZEOQ8QIQqtbIusK%2B0uqQzvyyW4vStDF7RvjZwp44b1H0pqzsby2bjCYspacgv9JM712Z72sZdercFFczC5BR%2FxT0jXFxYn6XiRhfE0HO1e24qFiR1A%2B78Ems3A3ZdQylaVDZ4UfVX13iofy2l0LWdXMjEynLxSz7cNPGtDpg%3D%3D" type="video/webm"></video>
</div>
The XPath you are using is incorrect, as there is a typo in it: you have used soucre instead of source, and the structure you have used to get it is also incorrect.
So, try the code below; it should work fine.
WebDriverWait(browser,10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//div[@id='innerframe']/iframe")))
browser.find_element_by_xpath("//div[@id='mediacontent']//video").text
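Note that .text on the video element will most likely come back empty; if what you actually want is the video URL, reading the src attribute of the first source child (while still switched into the iframe) should work, for example:
video_url = browser.find_element_by_xpath("//div[@id='mediacontent']//video/source").get_attribute("src")
print(video_url)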

How to get the link of next page when the next page loads itself while scrolling down

I am trying to scrape data from an e-commerce site, but I am only able to scrape data from one page. So I tried to paginate through the pages, but the problem is that there is no next-page button on the current page; the next page loads itself when I scroll down to the bottom of the current page. I am using BeautifulSoup in Python to scrape the data.
Scraping page URL:
http://www.shopclues.com/mobiles-smartphones.html
When I inspected the page before scrolling down to the end, I found something like:
<div class="load_more">
<a id="moreProduct" catid="1431" class="btn btn_effect" href="javascript:void(0);">Load 875 More Products</a>
</div>
So I am assuming that this <div> tag is responsible for loading the next page.
If yes, please tell me how to get the link to the next page.
If not, then please inspect the URL I have provided and suggest how to go about it.
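As a rough sketch of one common approach: since the extra products are loaded by JavaScript, you can either watch the XHR calls in the browser's DevTools Network tab to find the endpoint the "Load 875 More Products" button hits, or drive the page with Selenium, scroll until nothing new loads, and then pass the rendered HTML to BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("http://www.shopclues.com/mobiles-smartphones.html")
# keep scrolling to the bottom until the page height stops growing (no more products load)
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the newly loaded products time to appear
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()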
