I am using Python and the lxml library to parse a saved webpage.
The docinfo of the parsed document only shows the disk location of the saved file:
storedHtmlDoc.docinfo.URL
Is there any way to extract the original URL from the saved page?
If you have not stored the URL of the downloaded page somewhere yourself, it is not available to you.
If you can control the downloading process, you could put the URL of the downloaded page inside a META tag of the page.
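For example, here is a minimal sketch of that approach; the meta name original-url is an arbitrary choice of mine, not any standard, and the page URL is a placeholder:

import requests
from lxml import etree, html

def save_with_source_url(url, path):
    # Download the page and record where it came from in a custom META tag.
    doc = html.fromstring(requests.get(url).content)
    etree.SubElement(doc.head, "meta", name="original-url", content=url)  # "original-url" is an arbitrary name
    with open(path, "wb") as f:
        f.write(html.tostring(doc))

def read_source_url(path):
    # Later, recover the original URL from the saved copy.
    root = html.parse(path).getroot()
    values = root.xpath('//meta[@name="original-url"]/@content')
    return values[0] if values else None

save_with_source_url("http://example.com/page.html", "saved.html")  # placeholder URL
print(read_source_url("saved.html"))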
When I right-click a page, choose Save As, and save it as HTML, it saves everything on the page, including images. However, when I use Python's requests.get, it saves the HTML page without saving the images; they show up as broken image icons instead of rendering properly.
How can I get requests.get to get all the data on the webpage? Appreciate any advice.
Edit: This is the code I'm using to scrape the website:
import requests

for link in aosa1:
    res = requests.get("http://aosabook.org/en/" + link)
    print(res.status_code)
    print(len(res.text))
    res.raise_for_status()
    playfile = open(link, 'wb')
    for chunk in res.iter_content(100000):
        playfile.write(chunk)
    playfile.close()
You don't understand how HTML works. Those images are not part of the page. When a browser downloads an HTML file, it then scans the HTML looking for <img> tags. For each <img> tag, it makes a new HTTP request to fetch that image so it can display it. Now, if the <img> tags had absolute URLs, they would still show for you. But if they have relative URLs (<img src="/images/abc.png">), then the browser is going to try to fetch them from your localhost web server, which does not exist. You can try to scan the HTML and fetch those images yourself.
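Here is a minimal sketch of that idea applied to one page from the question's loop; the page name, the use of BeautifulSoup and urljoin, and the local "images" folder are my own assumptions:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://aosabook.org/en/"
link = "intro1.html"  # hypothetical entry from the aosa1 list in the question
page = requests.get(base_url + link)
soup = BeautifulSoup(page.text, "html.parser")

os.makedirs("images", exist_ok=True)
for img in soup.find_all("img", src=True):
    img_url = urljoin(base_url + link, img["src"])         # resolve relative src values
    local_name = os.path.join("images", os.path.basename(img_url))
    with open(local_name, "wb") as f:
        f.write(requests.get(img_url).content)             # fetch each referenced image

If you want the saved page to actually display those local copies, you would also need to rewrite each img src attribute to point at the downloaded file.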
I'm looking for a way to list all loaded files with the requests module.
Like there is in chrome's Inspector Network tab, you can see all kinds of files that have been loaded by the webpage.
The problem is that the file I want to fetch (in this case a .pdf file) does not have a specific tag; the webpage loads it by JavaScript and AJAX, I guess, because even after the page has loaded completely, I couldn't find a tag that links to the .pdf file or anything like that. So every time I have to go to the Network tab, reload the page, and find the file in the list of loaded resources.
Is there any way to catch all the loaded files and list them using the Requests module?
When a browser loads an HTML file it then interprets the contents of that file. It may discover that there is a <script> tag referencing an external JavaScript URL. The browser will then issue a GET request to retrieve that file. When said file is received, it then interprets the JavaScript file by executing the code within. That code might contain AJAX code that in turn fetches more files. Or the HTML file may reference an external CSS file with a <link> tag or an image file with an <img> tag. These files will also be loaded by the browser and can be seen when you run the browser's inspector.
In contrast, when you do a get request with the requests module for a particular URL, only that one page is fetched. There is no logic to interpret the contents of the returned page and fetch those images, style sheets, JavaScript files, etc. that are referenced within the page.
You can, however, use Python to automate a browser using a tool such as Selenium WebDriver, which can be used to fully download a page.
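A minimal sketch of that approach, assuming Chrome and a matching chromedriver are installed; the URL and the five-second wait are placeholders:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/page-with-dynamic-content")  # placeholder URL
time.sleep(5)  # crude wait for AJAX to finish; WebDriverWait is the more robust option
# The browser has now fetched the HTML plus the CSS, JavaScript and AJAX resources it references.
rendered_html = driver.page_source
driver.quit()

Note that page_source only gives you the rendered HTML, not a list of every network request; to enumerate the individual resources (such as a .pdf fetched via AJAX) you would need something like the browser's performance logs or a request-capturing wrapper such as selenium-wire.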
I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with BeautifulSoup, there is no img tag (see the second image).
I don't understand why the website has an img tag for each image, but the BeautifulSoup result does not.
It is possible that the images do not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the HTML on the site.
Another issue might be JavaScript. Websites commonly have JavaScript code that runs after the page has first been loaded. The JavaScript then makes additional requests to update elements on the page.
To see what is happening on the site, go to the site in your browser and press F12. Go to the Network tab and reload the page. You will see all the URLs that are requested.
If you need to get data that loads by Javascript requests, try using Selenium.
UPDATE
I went to the website you posted and pulled just the HTML using the following code.
import requests
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The request returns the HTML you would get before any JavaScript or other requests run. You can see it here.
The image URLs are not in this either, which means the image HTML is not returned in the initial request. What we do see are data attributes; see line 2192 of the pastebin. These are commonly used by JavaScript to make additional requests so it knows which images to go and get.
Result: the img tags you are looking for are not in the HTML returned from your request. Selenium will help you here, or you can investigate how their JavaScript is using those data-ids to determine which images to request.
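For example, a minimal sketch of the Selenium route, assuming Chrome and chromedriver are available; the five-second wait is an arbitrary choice:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://auburn.craigslist.org/search/sss?query=test")
time.sleep(5)  # crude wait so the JavaScript can fetch and insert the images
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# The rendered DOM should now contain the <img> elements the raw requests.get() response lacked.
for img in soup.find_all("img", src=True):
    print(img["src"])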
I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
When you try to access the website, it redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit an end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to
Get the HTML source of that page directly, as it is after the dynamic content has loaded and with the choose-country and select-location popups bypassed, so that I can use my Scrapy XPath selectors to extract data.
If you suggest using Selenium, PhantomJS, Ghost, or something else to deal with the dynamic content, please understand that I want the final HTML source as a web browser would produce it after processing all dynamic content, which will then be fed to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still shows it and the select-delivery-location one.
I've tried using Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies loaded by the web page. I found that it stores 3 cookies, CurrencyCode, ServerId and Site_Config, in my local storage. I used the above-mentioned plugin to copy the cookies in JSON format. I referred to this manual for setting cookies in requests.
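A minimal sketch of that step; the values below are placeholders for whatever EditThisCookie exported:

import requests

# Placeholder values; use the ones exported by EditThisCookie in JSON format.
jar = requests.cookies.RequestsCookieJar()
jar.set("CurrencyCode", "<copied value>", domain="www.luluwebstore.com", path="/")
jar.set("ServerId", "<copied value>", domain="www.luluwebstore.com", path="/")
jar.set("Site_Config", "<copied value>", domain="www.luluwebstore.com", path="/")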
Now I'm able to skip those location and delivery address popups. After that I found that the dynamic pages are loaded via a <script type=text/javascript> block and that part of the page URL is stored in a variable. I extracted the value using split(). Here is the script part to get the dynamic page URL.
import requests
from lxml import html

page_source = requests.get(url, cookies=jar)   # url is the category page; jar holds the cookies set above
tree = html.fromstring(page_source.content)
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]  # entire JavaScript that loads the product pages
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()          # obtains the dynamic page URL
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
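From there, presumably the handler URL can be fetched with the same cookie jar to get the dynamically loaded HTML (a guess based on the flow described above):

product_page = requests.get(page_link, cookies=jar)   # fetch the dynamically loaded showcase HTML
product_tree = html.fromstring(product_page.content)  # ready for the XPath selectors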
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.
I have been exploring ways to use Python to log into a secure website (e.g. Salesforce), navigate to a certain page, and print (save) that page as a PDF at a prescribed location.
I have tried using:
pdfkit.from_url: Use Requests to get a session cookie, parse it, then pass it as a cookie into wkhtmltopdf's options settings. This method does not work because pdfkit does not recognise the cookie I passed.
pdfkit.from_file: Use requests.get to get the HTML of the page I want to print, then use pdfkit to convert the HTML file to PDF. This works, but the page formatting and images are all missing.
Selenium: Use a webdriver to log in, then navigate to the wanted page and call the window.print function. This does not work because I can't pass any arguments to the browser's Save As dialog.
Does anyone have any idea how to get around this?
log in using requests
use requests session mechanism to keep track of the cookie
use session to retrieve the HTML page
parse the HTML (use beautifulsoup)
identify img tags and css links
download locally the images and css documents
rewrite the img src attributes to point to the locally downloaded images
rewrite the css links to point to the locally downloaded css
serialize the new HTML tree to a local .html file
use whatever "HTML to PDF" solution to render the local .html file
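A rough sketch of those steps, assuming a plain form-based login; the login URL, form field names, and target page below are placeholders, and a site like Salesforce may require token- or OAuth-based authentication instead:

import os
import requests
import pdfkit
from bs4 import BeautifulSoup
from urllib.parse import urljoin

LOGIN_URL = "https://example.com/login"    # placeholder
PAGE_URL = "https://example.com/report"    # placeholder page to print
ASSET_DIR = "assets"

session = requests.Session()               # the session keeps track of the cookie
session.post(LOGIN_URL, data={"username": "me", "password": "secret"})  # placeholder form fields

resp = session.get(PAGE_URL)
soup = BeautifulSoup(resp.text, "html.parser")
os.makedirs(ASSET_DIR, exist_ok=True)

def localise(tag, attr):
    # Download the referenced resource and rewrite the attribute to point at the local copy.
    url = urljoin(PAGE_URL, tag[attr])
    local_path = os.path.join(ASSET_DIR, os.path.basename(url.split("?")[0]) or "resource")
    with open(local_path, "wb") as f:
        f.write(session.get(url).content)
    tag[attr] = local_path

for img in soup.find_all("img", src=True):
    localise(img, "src")
for link_tag in soup.find_all("link", href=True):
    if "stylesheet" in (link_tag.get("rel") or []):
        localise(link_tag, "href")

with open("page.html", "w", encoding="utf-8") as f:
    f.write(str(soup))

# Render the local copy; depending on the wkhtmltopdf version, local file access must be enabled.
pdfkit.from_file("page.html", "page.pdf", options={"enable-local-file-access": ""})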