when I right click a page and click save as and save it as html, it saves everything on the page, including images. However, when I use python's requests.get, it saves the html page without saving the images. They appear as broken links, like so:
Broken images
Instead of like so:
Working images
How can I get requests.get to get all the data on the webpage? Appreciate any advice.
Edit: This is the code I'm using to scrape the website:
for link in aosa1:
res=requests.get("http://aosabook.org/en/"+link)
print(res.status_code)
print(len(res.text))
res.raise_for_status()
playfile=open(link,'wb')
for chunk in res.iter_content(100000):
playfile.write(chunk)
playfile.close
You don't understand how HTML works. Those images are not part of the page. When a browser downloads an HTML file, it then scans the HTML looking for <img> tags. For each <img> tag, it make a new HTML request to fetch that so it can display it. Now, if the <img> tags had absolute URLs, it would still show for you. But if they have relative URLs (<img src="/images/abc.png">), then the browser is going to try to fetch them from your localhost web server, which does not exist. You can try to scan the HTML and fetch those images.
Related
I'm trying to web scrape the amazon deals page but the problem is that I'm unable to get the URL for the next page. Here is the link to the Amazon today's deals page. At the bottom of the page, there is pagination but when I inspected the page, there is no URL. The href tag only contains "#" in the URL which should only load the page to the top. How is Amazon able to move to the next page? is there any hidden URL? I couldn't find anything using the Network tab in the Inspect menu as well. I'm adding the picture below to show the code of pagination.
Probably some JavaScript wizardry they are running in the background. # seems like a placeholder. Check out the JavaScript code, and there might be more clues there.
I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with beautifulsoup, there is not a img tag (in image 2).
I don't understand why the website has an img tag for each, but beautifulsoup does not.
It is possible that the images does not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the html on the site.
Another issue might be JavaScript. Websites commonly have JavaScript code that runs after the page has first been loaded. The Javascript then mades additional requests to update elements on the page.
To see what is happending on the site, in your browers go to the site press F12. Go to the Network tab and reload the page. You will see all the urls that are requested.
If you need to get data that loads by Javascript requests, try using Selenium.
UPDATE
I went to the webiste you posted and pulled just the html using the following code.
import requests
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The requests return the html you would get before any Javascript and other requests run. You can see it here
The image urls are not in this either. This means that in the initial request the image html is not returned. What we do see are data tags, see line 2192 of the pastebin. These are commonly used by JavaScript to make additional requests so it knows which images to go and get.
Result: The img tags you are looking for are not in the html returned from your request. Selenium will help you here, or investigate how thier javascript is using those data-ids to determine which images to request.
I have been exploring ways to use python to log into a secure website (eg. Salesforce), navigate to a certain page and print (save) the page as pdf at a prescribed location.
I have tried using:
pdfkit.from_url: Use Request to get a session cookie, parse it then pass it as cookie into the wkhtmltopdf's options settings. This method does not work due to pdfkit not being able to recognise the cookie I passed.
pdfkit.from_file: Use Request.get to get the html of the page I want to print, then use pdfkit to convert the html file to pdf. This works but the page format and images are all missing.
Selenium: Use a webdriver to log in then navigate to the wanted page, call the windows.print function. This does not work because I can't pass any arguments to the window's SaveAs dialog.
Does anyone have any idea to get around?
log in using requests
use requests session mechanism to keep track of the cookie
use session to retrieve the HTML page
parse the HTML (use beautifulsoup)
identify img tags and css links
download locally the images and css documents
rewrite the img src attributes to point to the locally downloaded images
rewrite the css links to point to the locally downloaded css
serialize the new HTML tree to a local .html file
use whatever "HTML to PDF" solution to render the local .html file
I'm trying to fetch all the visible text from a website, I'm using python-scrapy for this work. However what i observe scrapy only works with HTML tags such as div,body,head etc. and not with angular js tags such as ng-view, if there is any element within ng-view tags and when I do a right-click on the page and do view source then the content inside the tag doesn't appear and it displays like <ng-view> </ng-view>, So how can I use python to scrap the elements within this ng-view tags.Thanks in advance..
To answer your question
how can I use python to scrap the elements within this ng-view tags
You can't.
The content you want to scrape renders on the client side(browser), what scrapy get's you is just static content from server, your browser than interprets the HTML code and renders the JS code. And JS code than fetches different content from server again and makes some stuff with it.
Can it be done?
Yes!
One of the ways is to use some sort oh headless browser like http://phantomjs.org/ to fetch all the content. Once you have the content you can save it and scrape it as you wish. The thing is that this kind of web scraping is not as easy and straight forward as just scraping regular HTML. There is a reason why Google still doesn't scrape web pages that render their content via JS.
I am using Python and lxml library to parse a saved webpage.
The docinfo of a saved webpage shows the disk location of a saved webpage.
storedHtmlDoc.docinfo.URL
Is there any way to extract the original URl from the saved page?
If you have not stored somewhere yourself the URL of the downloaded page, it's not available to you.
If you can control the downloading process, you could put the URL of the downloaded page inside a META tag of the page.