Beautifulsoup scrape not showing everything - python

I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with beautifulsoup, there is not a img tag (in image 2).
I don't understand why the website has an img tag for each, but beautifulsoup does not.

It is possible that the images does not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the html on the site.
Another issue might be JavaScript. Websites commonly have JavaScript code that runs after the page has first been loaded. The Javascript then mades additional requests to update elements on the page.
To see what is happending on the site, in your browers go to the site press F12. Go to the Network tab and reload the page. You will see all the urls that are requested.
If you need to get data that loads by Javascript requests, try using Selenium.
UPDATE
I went to the webiste you posted and pulled just the html using the following code.
import requests
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The requests return the html you would get before any Javascript and other requests run. You can see it here
The image urls are not in this either. This means that in the initial request the image html is not returned. What we do see are data tags, see line 2192 of the pastebin. These are commonly used by JavaScript to make additional requests so it knows which images to go and get.
Result: The img tags you are looking for are not in the html returned from your request. Selenium will help you here, or investigate how thier javascript is using those data-ids to determine which images to request.

Related

Requests.get not saving images on the webpage

when I right click a page and click save as and save it as html, it saves everything on the page, including images. However, when I use python's requests.get, it saves the html page without saving the images. They appear as broken links, like so:
Broken images
Instead of like so:
Working images
How can I get requests.get to get all the data on the webpage? Appreciate any advice.
Edit: This is the code I'm using to scrape the website:
for link in aosa1:
res=requests.get("http://aosabook.org/en/"+link)
print(res.status_code)
print(len(res.text))
res.raise_for_status()
playfile=open(link,'wb')
for chunk in res.iter_content(100000):
playfile.write(chunk)
playfile.close
You don't understand how HTML works. Those images are not part of the page. When a browser downloads an HTML file, it then scans the HTML looking for <img> tags. For each <img> tag, it make a new HTML request to fetch that so it can display it. Now, if the <img> tags had absolute URLs, it would still show for you. But if they have relative URLs (<img src="/images/abc.png">), then the browser is going to try to fetch them from your localhost web server, which does not exist. You can try to scan the HTML and fetch those images.

Understanding google's HTML

first time poster here.
I am just getting into python and coding in general and I am looking into the requests and BeutifulSoup libraries. I am trying to grab image url’s from google images. When inspecting the site in chrome i can find the “div” and the correct img src url. But when I open the HTML that “requests” gives me I can find the same “div” but the img src url is something completely different and only leads to a black page if used.
Img of the HTML requests get
Img of the HTML found in chrome's inspect tool
What I wonder, and want to understand is:
why are these HTML's different
How do I get the img src that is found with the inspect tool with requests?
Hope the question makes sense and thank you in advance for any help!
Maybe differences between the the response HTML and the code in chrome inspector stems for updates to the page when JS changes it . for example when you use innerHTML() to edit div element so the code you add will add to DOM stack so as the code in the inspector but it would have no influence on the response.
You may search the http:// in the begging and the .png or .jpg or any other image format in the end.
Simply put, your code retrieves a single HTML page, and lets you access it, as it was retrieved. The browser, on the other hand, retrieves that HTML, but then lets the scripts embedded in (or linked from) it run, and these scripts often make significant modifications to the HTML (also known as DOM - Document Object Model). The browser's inspector inspects the fully modified DOM.

How to find the Request URL from a webpage in chromes's Inspect Element> Network > XHR > Response > Headers > General

Sorry if I don't know the correct terminology, I'm new to web scraping, please correct my terminology if you feel like it.
I'm working on a project to scrape images off all the pieces by an artist given the URL for the artist's gallery. What I am doing is finding the unique id of each page of the gallery that will lead me to the webpage hosting the original image. I can already scrape from the art page I just need the id's of each page from gallery.
Artist Gallery --> Art Page --> scrape image
The id's of each page on the gallery page is not available in the page source, since it is being loaded in separately through JavaScript I think, so I can not grab them using:
import requests
import urllib.request
response = requests.get(pageurl)
print(response.text)
But I have found that by going to Chrome Inspect Element> Network > XHR > Response > Headers > General, there is a request URL that has all the id's that I need, and below that is a Query String Parameters section that has all the id's I need.
Picture of Query String Parameters
Picture of Request URL
I am using BeautifulSoup, but the problem just lies with how to get the data.
I have also used urllib.request.urlopen(pageurl) with similar results. I have also tried Selenium, but was still unable to get the ids, although I may not have done so correctly, I was able to get to the webpage, but maybe I did not use the right method. For now, this is what I want to try. EDIT: I have since figured it out using Selenium. (I just wasn't trying hard enough), but would still like some input regarding intercepting XHR's.
Link to site if you really want to see it, but you may have to login

Scrape data from JavaScript-rendered website

I want to scrap Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
The website when tried to access, redirects to choose country page.
After choosing country, it pops up select delivery location and then redirects to home page.
When you try to hit end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to
Get the HTML source of that page directly, which is loaded dynamically bypassing choose country, select location popups, so that I can use my Scrapy Xpath selectors to extract data.
If you suggest me to use Selenium, PhantomJS, Ghost or something else to deal with dynamic content, please understand that I want the end HTML source as in a web browser after processing all dynamic content which will be sent to Scrapy.
Also, I tried using proxies to skip choose country popup but still it loads it and select delivery location.
I've tried using Splash, but it returns me the source of choose country page.
At last I found answer. I used EditThisCookie plugin to view the cookies that are loaded by the Web Page. I found that it stores 3 cookies CurrencyCode,ServerId,Site_Config in my local storage. I used the above mentioned plugin to copy the cookies in JSON format. I referred this manual for setting cookies in the requests.
Now I'm able to skip those location,delivery address popups. After that I found that the dynamic pages are loaded via <script type=text/javascript> and found that part of page url is stored in a variable. I extracted the value using split(). Here is the script part to get the dynamic page url.
from lxml import html
page_source=requests.get(url,cookies=jar)
tree=html.fromstring(page_source.content)
dynamic_pg_link=tree.xpath('//div[#class="col3_T02"]/div/script/text()')[0] #entire javascript to load product pages
dynamic_pg_link=dynamic_pg_link.split("=")[1].split(";")[0].strip()#obtains the dynamic page url.
page_link="http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput="+dynamic_pg_link
Now I'm able to extract data from these LInks.
Thanks to #Cal Eliacheff for the previous guidance.

How to use Python mechanize to save a page inlcuding images on the page

So I understand that mechanize can make python script behave exactly like a browser.
With a browser, I can save a loaded page into local disk, including the images in the page
How can I do this with mechanize ?
It doesn't seem to be able to save images on the page
I'm using example from This page
Thanks in advance
There isn't a direct way to do this however here are the steps you'll need to take
open the page with mechanize or urllib2
get the html
parse html with lxml
find all image urls
for each image url open with mechanize or urllib2 and save
In this situation mechanize doesn't really offer you anything over urllib2 if you only want to deal with one page and you don't need to login to get to it. You might find requests useful for this task.

Categories