Understanding Google's HTML - Python

first time poster here.
I am just getting into Python and coding in general, and I am looking into the requests and BeautifulSoup libraries. I am trying to grab image URLs from Google Images. When inspecting the site in Chrome I can find the "div" and the correct img src URL. But when I open the HTML that requests gives me, I can find the same "div", but the img src URL is something completely different and only leads to a black page if used.
Image of the HTML that requests gets
Image of the HTML found in Chrome's inspect tool
What I wonder, and want to understand is:
Why are these two HTML documents different?
How do I get the img src that the inspect tool shows, using requests?
Hope the question makes sense and thank you in advance for any help!

The differences between the response HTML and the code in the Chrome inspector probably stem from updates the page's JavaScript makes after load. For example, when a script uses innerHTML to edit a div element, the code it adds becomes part of the DOM, and therefore shows up in the inspector, but it has no influence on the original response.
You can search for the http:// at the beginning and the .png, .jpg, or other image extension at the end.

Simply put, your code retrieves a single HTML page, and lets you access it, as it was retrieved. The browser, on the other hand, retrieves that HTML, but then lets the scripts embedded in (or linked from) it run, and these scripts often make significant modifications to the HTML (also known as DOM - Document Object Model). The browser's inspector inspects the fully modified DOM.
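To make that concrete, here is a toy sketch of what the static HTML often looks like before the scripts run. The markup and attribute names below are illustrative, not Google's actual HTML: many sites ship a placeholder src and stash the real URL in a data-* attribute that JavaScript later copies into src.

```python
from bs4 import BeautifulSoup

# Toy snippet imitating what requests sees: a placeholder src plus the
# real URL in a data-* attribute (names are illustrative, not Google's).
static_html = """
<div class="result">
  <img src="data:image/gif;base64,R0lGOD" data-src="https://example.com/real-owl.jpg">
</div>
"""

soup = BeautifulSoup(static_html, "html.parser")
img = soup.find("img")
print(img["src"])          # the placeholder your script sees
print(img.get("data-src"))  # sometimes the real URL is still in the static HTML
```

If a data-* attribute like this exists in the response you get from requests, you can scrape it directly without running any JavaScript; otherwise you need a real browser.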

Related

Mechanics of the <img> tag in HTML, for Selenium scraping

This is a bit of a long theoretical question about how img tags really work, for the purposes of web scraping. I've done a lot of research and have seen a bunch of working solutions, but I haven't felt that the core question was answered.
First off, my task: I wish to efficiently scrape ~100k HTML pages from a website and also download images on these pages, while respecting their robots.txt crawl rate of 3 seconds per page.
First, I built a scraper intending to just crawl the HTML and get a long list of image urls to download on a second pass. But then, I realized that, with ~10 images per page this would be ~1M images. At a 3-second crawl rate, this would take ~30 days.
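The back-of-envelope arithmetic behind that estimate:

```python
# ~10 images per page over ~100k pages, one request every 3 seconds
# per the site's robots.txt crawl rate.
pages = 100_000
images = pages * 10           # ~1,000,000 image requests
seconds = images * 3          # 3-second delay between requests
days = seconds / 86_400       # 86,400 seconds per day
print(round(days, 1))         # on the order of a month
```

That works out to roughly 35 days for the images alone, consistent with the ~30-day, order-of-a-month estimate above.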
So I thought: "if I'm scraping using Selenium, the images are getting downloaded anyway! I can just download the images on page-scrape."
Now, my background research: This sent me down a rabbit hole, and I learned that the following options exist to download images on a page without making additional calls:
You can right-click and "Save as" (SO post)
You can screenshot the image (SO post)
Sometimes, weirdly, the image data is loaded into src anyway (SO post)
Selenium Wire exists, which is really the best way to address this. (SO Post)
These all seem like viable answers, but on a deeper level, they all (even Selenium Wire**) seem like hacks.
** Selenium Wire allows you to access the data in the requests made by Selenium. This is great, but I naively assumed that when a page is rendered and the images are placed in the img tags, they're in the page and I shouldn't have to worry about the requests that retrieved them.
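The third option in the list above can sometimes be handled with no extra requests at all. A minimal sketch, assuming the rendered src already holds a base64 data: URI (the fake bytes and output filename here are placeholders):

```python
import base64

# Fabricate a data: URI like the ones sometimes found in rendered src
# attributes; in practice you'd read this from the <img> element.
data_uri = "data:image/png;base64," + base64.b64encode(b"\x89PNG fake bytes").decode()

# Split off the "data:image/png;base64" header and decode the payload.
header, _, payload = data_uri.partition(",")
image_bytes = base64.b64decode(payload)

with open("scraped.png", "wb") as f:  # hypothetical output filename
    f.write(image_bytes)
```

No network call is involved: the image bytes were delivered inline with the page, which is exactly why this case feels less like a hack than the others.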
Now, finally, my question. I clearly have a fundamental misunderstanding about how the img tag works.
Why can't one directly access image data through the Selenium driver, which is loading and rendering the images anyway? The images are there; I see the images when the driver loads. Theoretically, naively, I would expect to be able to download whatever is loaded on the page.
The one parallel I know of is with iframes -- you can visually see the content of the iframe, but you can only scrape it after directing Selenium to switch into the frame (background). So naively I assumed there would be a switch method for img's as well. The fact that there isn't, and it's not clear how to use Selenium to download the image data, tells me that I'm not really understanding how a browser handles an img tag.
I understand all the hacks and the workarounds, but my question here is why?

Beautifulsoup scrape not showing everything

I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with beautifulsoup, there is not a img tag (in image 2).
I don't understand why the website has an img tag for each, but beautifulsoup does not.
It is possible that the images do not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the HTML on the site.
Another possibility is JavaScript. Websites commonly have JavaScript code that runs after the page has first been loaded. The JavaScript then makes additional requests to update elements on the page.
To see what is happening on the site, go to the site in your browser and press F12. Go to the Network tab and reload the page. You will see all the URLs that are requested.
If you need data that is loaded by JavaScript requests, try using Selenium.
UPDATE
I went to the website you posted and pulled just the HTML using the following code.
import requests
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The request returns the HTML you would get before any JavaScript and other requests run. You can see it here.
The image URLs are not in this either, which means the image HTML is not returned in the initial request. What we do see are data tags; see line 2192 of the pastebin. These are commonly used by JavaScript to make additional requests so it knows which images to go and get.
Result: the img tags you are looking for are not in the HTML returned from your request. Selenium will help you here; alternatively, investigate how their JavaScript uses those data-ids to determine which images to request.
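If you go the data-ids route, the idea is to pull those attributes out of the static HTML and rebuild the image URLs yourself. A sketch with BeautifulSoup; the markup and the URL pattern below are illustrative, not Craigslist's real scheme:

```python
from bs4 import BeautifulSoup

# Toy version of a result row carrying a data-ids attribute like the
# ones visible in the pastebin (format and URL template are assumptions).
static_html = """
<p class="result-row" data-ids="1:00101_abc,1:00202_def"></p>
"""

soup = BeautifulSoup(static_html, "html.parser")
row = soup.find("p", class_="result-row")

# Each entry looks like "<type>:<id>"; keep only the id part.
ids = [part.split(":", 1)[1] for part in row["data-ids"].split(",")]

# Rebuild thumbnail URLs from a hypothetical template.
urls = [f"https://images.example.com/{i}_300x300.jpg" for i in ids]
print(urls)
```

You would need to confirm the real URL template by comparing a data-id against the image requests shown in the Network tab.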

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using Python and Beautiful Soup. I have found that on some sites, the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
TRY THIS FIRST!
Perhaps the data technically is embedded in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
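A sketch of re-creating such a request with the requests library, built but not yet sent so you can inspect it first. The endpoint and parameters are placeholders; copy the real ones from your network logger:

```python
import requests

# Set a browser-like User-Agent on the session so the server treats us
# as a "real" browser (the UA string here is a placeholder).
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; my-scraper)"})

# Rebuild the XHR by hand: endpoint and query params copied from the
# Network tab (these values are illustrative).
req = requests.Request("GET", "https://example.com/api/search", params={"q": "owl"})
prepared = session.prepare_request(req)

print(prepared.url)                  # final URL with encoded query string
print(prepared.headers["User-Agent"])
# response = session.send(prepared)  # then response.json() for the data
```

If the response really is JSON, you get your data parsed for free via response.json(), with no HTML scraping at all.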

getting specific images from page

I am pretty new with BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl:
redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
    print div.find('a')['t1']     # works fine
    print div.find('img')['src']  # raises KeyError: 'src'
But this gives only the title, not the image source.
Is there anything wrong?
Edit:
I have edited my source, still could not get image url.
Bing uses some techniques to block automated scrapers. I tried printing
div.find('img')
and found that they send the source in an attribute named src2, so the following should work:
div.find('img')['src2']
This is working for me. Hope it helps.
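To avoid the KeyError entirely, you can use tag.get(), which returns None for a missing attribute instead of raising, and fall back across candidate names. A minimal sketch on toy markup (src2 is Bing-specific per the answer above; data-src is a common alternative on other sites):

```python
from bs4 import BeautifulSoup

# Toy result markup where the image URL lives in src2, not src.
html = '<div class="dg_u"><img src2="http://example.com/owl.jpg" t1="Owl"></div>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
# get() returns None instead of raising KeyError, so chained fallbacks work.
src = img.get("src") or img.get("src2") or img.get("data-src")
print(src)
```

This pattern also keeps your scraper working if the site switches attribute names, which scraper-hostile sites tend to do.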
If you open up browser develop tools, you'll see that there is an additional async XHR request issued to the http://www.bing.com/images/async endpoint which contains the image search results.
Which leads to the 3 main options you have:
simulate that XHR request in your code. You might want to use something more suitable for humans than urllib2; see the requests module. This is the so-called "low-level" approach: going down to the bare metal of a web-site-specific implementation, which makes this option unreliable, difficult, "heavy", error-prone, and fragile
automate a real browser using selenium - stay on the high-level. In other words, you don't care how the results are retrieved, what requests are made, what javascript needs to be executed. You just wait for search results to appear and extract them.
use Bing Search API (this should probably be option #1)

How can I use python to get rendered ASP pages?

I'm trying to use Python to navigate through a website that has auth forms on its landing page, rendered by ASP scripts.
But when I use Python (with mechanize, requests, or urllib) to get the HTML of that site, I always end up with a semi-blank HTML file, due to those ASP scripts.
Would anyone know any method that I can use to get the final (as displayed on a browser) version of an ASP site?
Your target page is a frameset. There is nothing fancy going on from the server side that I can tell. When I use requests or urllib to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn't do anything. Basically, all there is here is a frameset with a single frame in it.
The frame target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it's not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren't ASCII, which means you need to guess what charset to decode if you want to process it as Unicode. But if you're in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won't matter.
I suspect your real problem is just that you don't know how HTML framesets work. You're downloading the frameset, not downloading the referenced frame, and wondering why the resulting page looks like an empty frameset.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5 they're finally gone. But as long as ancient websites are out there and need to be scraped, you need to know how they work.
This isn't a substitute for the full documentation, but here's the short version of what a web browser does with a frameset: For each frame tag, it follows the src attribute, then it replaces the contents of the frame tag with a #document tag with no attributes, with the results of reading the src URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn't affect you.
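The scraping consequence of this is a simple two-step fetch: download the frameset, pull each frame's src, then download those URLs separately. A sketch with BeautifulSoup on toy markup (the frame URL is illustrative, and the actual fetches are left commented out):

```python
from bs4 import BeautifulSoup

# A minimal frameset like the landing page described above.
frameset_html = """
<frameset rows="100%">
  <frame src="login.asp" name="main">
</frameset>
"""

soup = BeautifulSoup(frameset_html, "html.parser")
frame_srcs = [frame["src"] for frame in soup.find_all("frame")]
print(frame_srcs)  # these are the pages you actually want to download

# for src in frame_srcs:
#     inner_html = requests.get(urljoin(base_url, src)).text
```

Note that frame src values are resolved relative to the frameset page's URL, so urljoin (or equivalent) is needed before fetching them.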
Meanwhile, if you're trying to learn web scraping, you really want to install your browser's "Web Developer Tools" (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from wget). So, next time you can say "In Chrome's Inspect Page, I see a #document under the frame, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the frame has no children".
