I have been exploring ways to use Python to log into a secure website (e.g. Salesforce), navigate to a certain page, and print (save) the page as a PDF at a prescribed location.
I have tried using:
pdfkit.from_url: Use Requests to get a session cookie, parse it, then pass it into wkhtmltopdf's options settings. This method does not work because pdfkit does not recognise the cookie I passed.
pdfkit.from_file: Use requests.get to get the HTML of the page I want to print, then use pdfkit to convert the HTML file to PDF. This works, but the page formatting and images are all missing.
Selenium: Use a webdriver to log in, navigate to the wanted page, and call the window.print() function. This does not work because I can't pass any arguments to the browser's Save As dialog.
Does anyone have any idea how to get around this?
log in using Requests
use the Requests session mechanism to keep track of the cookies
use the session to retrieve the HTML page
parse the HTML (use BeautifulSoup)
identify the img tags and CSS links
download the images and CSS documents locally
rewrite the img src attributes to point to the locally downloaded images
rewrite the CSS links to point to the locally downloaded CSS
serialize the new HTML tree to a local .html file
use whatever "HTML to PDF" solution to render the local .html file
When I right-click a page, click Save As, and save it as HTML, it saves everything on the page, including images. However, when I use Python's requests.get, it saves the HTML page without saving the images. They appear as broken links, like so:
Broken images
Instead of like so:
Working images
How can I get requests.get to get all the data on the webpage? Appreciate any advice.
Edit: This is the code I'm using to scrape the website:
import requests

for link in aosa1:  # aosa1 is the list of page names defined earlier
    res = requests.get("http://aosabook.org/en/" + link)
    print(res.status_code)
    print(len(res.text))
    res.raise_for_status()
    playfile = open(link, 'wb')
    for chunk in res.iter_content(100000):
        playfile.write(chunk)
    playfile.close()
You don't understand how HTML works. Those images are not part of the page. When a browser downloads an HTML file, it then scans the HTML looking for <img> tags. For each <img> tag, it makes a new HTTP request to fetch that image so it can display it. Now, if the <img> tags had absolute URLs, they would still show for you. But if they have relative URLs (<img src="/images/abc.png">), then the browser will try to fetch them from your local machine, where they do not exist. You can try to scan the HTML yourself and fetch those images.
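For example, a rough sketch of that idea using BeautifulSoup and urljoin to resolve relative src values (the page URL here is just an illustrative aosabook page):

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "http://aosabook.org/en/index.html"   # illustrative page
res = requests.get(page_url)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
os.makedirs("images", exist_ok=True)

for img in soup.find_all("img", src=True):
    # resolve relative src values against the page URL
    img_url = urljoin(page_url, img["src"])
    img_res = requests.get(img_url)
    img_res.raise_for_status()
    local_name = os.path.join("images", os.path.basename(img["src"]))
    with open(local_name, "wb") as f:
        f.write(img_res.content)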
I'm looking for a way to list all loaded files with the requests module.
Like in Chrome's DevTools Network tab, where you can see all kinds of files that have been loaded by the webpage.
The problem is that the file (in this case a .pdf file) I want to fetch does not appear in a specific tag; the webpage loads it via JavaScript and AJAX, I guess, because even after the page has loaded completely, I couldn't find a tag with a link to the .pdf file or anything like that. So every time I have to go to the Network tab, reload the page, and find the file in the list of loaded resources.
Is there any way to catch all the loaded files and list them using the Requests module?
When a browser loads an HTML file, it then interprets the contents of that file. It may discover that there is a <script> tag referencing an external JavaScript URL. The browser will then issue a GET request to retrieve that file. When that file is received, it then interprets the JavaScript by executing the code within. That code might contain AJAX calls that in turn fetch more files. Or the HTML file may reference an external CSS file with a <link> tag, or an image file with an <img> tag. These files will also be loaded by the browser and can be seen when you run the browser's inspector.
In contrast, when you do a get request with the requests module for a particular URL, only that one page is fetched. There is no logic to interpret the contents of the returned page and fetch those images, style sheets, JavaScript files, etc. that are referenced within the page.
You can, however, use Python to automate a browser using a tool such as Selenium WebDriver, which can be used to fully download a page.
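For instance, a minimal Selenium sketch (assuming a working Chrome/chromedriver setup; the URL and the fixed sleep are placeholders for your own page and a proper WebDriverWait):

import time
from selenium import webdriver

driver = webdriver.Chrome()   # assumes chromedriver/Chrome are installed
driver.get("https://example.com/page-with-ajax")   # placeholder URL

# crude pause so the AJAX requests have a chance to finish;
# an explicit WebDriverWait on a known element would be more robust
time.sleep(5)

rendered_html = driver.page_source   # the HTML after JavaScript has run
driver.quit()

# rendered_html can now be searched for the link to the .pdf resource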
I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
When you try to access the website, it redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit an end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data; for example, consider mobile accessories. Now I want to get the HTML source of that page directly, as rendered after the dynamic content has loaded and with the choose-country and select-location popups bypassed, so that I can use my Scrapy XPath selectors to extract data.
If you suggest using Selenium, PhantomJS, Ghost or something else to deal with the dynamic content, please understand that I want the final HTML source as it appears in a web browser after all dynamic content has been processed, which will then be fed to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still shows it along with the select-delivery-location popup.
I've also tried Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies set by the web page and found that it stores three cookies, CurrencyCode, ServerId and Site_Config, in my local storage. I used the plugin to copy the cookies in JSON format, and I referred to this manual for setting cookies in requests.
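A minimal sketch of building that cookie jar (the values below are placeholders for the ones copied from the EditThisCookie export); jar and url are what the snippet further down passes to requests.get:

import requests

# placeholder values; copy the real ones from the EditThisCookie JSON export
jar = requests.cookies.RequestsCookieJar()
jar.set("CurrencyCode", "<value>", domain="www.luluwebstore.com", path="/")
jar.set("ServerId", "<value>", domain="www.luluwebstore.com", path="/")
jar.set("Site_Config", "<value>", domain="www.luluwebstore.com", path="/")

url = "http://www.luluwebstore.com/<category-page>"   # e.g. the mobile accessories listing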
Now I'm able to skip the location and delivery-address popups. After that, I found that the dynamic pages are loaded via a <script type="text/javascript"> block, with part of the page URL stored in a variable, and I extracted that value using split(). Here is the script to get the dynamic page URL.
from lxml import html

page_source = requests.get(url, cookies=jar)
tree = html.fromstring(page_source.content)
# the entire JavaScript snippet that loads the product page
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]
# pull the dynamic page URL out of the script text
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.
I am using Python and the lxml library to parse a saved webpage.
The docinfo of a saved webpage shows the disk location of the saved file:
storedHtmlDoc.docinfo.URL
Is there any way to extract the original URL from the saved page?
If you have not stored the URL of the downloaded page somewhere yourself, it is not available to you.
If you can control the downloading process, you could put the URL of the downloaded page inside a META tag of the page.
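For instance, a minimal sketch of that approach with lxml, where the source-url meta name and the URLs are just conventions chosen for illustration:

import requests
from lxml import html

url = "http://www.example.com/some/page.html"   # the page being saved
doc = html.fromstring(requests.get(url).content)

# record the original URL inside the document before writing it to disk
meta = html.Element("meta", name="source-url", content=url)
doc.head.insert(0, meta)

with open("saved_page.html", "wb") as f:
    f.write(html.tostring(doc))

# later, when parsing the saved copy, read the URL back out:
saved = html.parse("saved_page.html")
original_url = saved.xpath('//meta[@name="source-url"]/@content')[0]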