How to scrape an image URL out of HTML? - python

I'm having trouble getting the URL of an image on a website and I was wondering if I could get some help.
I want to get the image URL of the card on the page, but using XPath only gives me the image URL of the website logo.
scrapy shell https://db.ygoprodeck.com/card/?search=7%20Colored%20Fish
response.xpath('//img')
Out[2]: [<Selector xpath='//img' data='<img src="https://db.ygoprodeck.com/sear'>]
There should be another img tag linking to the card picture, but it is not showing up.

So there is some logic to how the images are done. Each card has an ID listed on the page. The ID is the name of the image. They also hide this ID from you.
They load much of this information in via the meta attributes at the top of the page. Oftentimes the JS data will be put at the top of the page in script tags or meta attributes. This is particularly true of Shopify stores.
If you ever have trouble finding something, as with this image, take the image name and search the rest of the document for references to that keyword. You will often be able to track down the information, or at least figure out how it is loaded. This is also useful when websites require a "token": they will often supply the token somewhere on the previous page.
# with css
In [6]: response.css('meta[property="og:image"]::attr(content)').extract_first()
Out[6]: 'https://ygoprodeck.com/pics/23771716.jpg'
# with xpath
In [8]: response.xpath('//meta[@property="og:image"]/@content').extract_first()
Out[8]: 'https://ygoprodeck.com/pics/23771716.jpg'

Related

How do I get the list of all images on a page?

In Firefox, I can get a list of all images from the "Media" tab of the Page Info window.
How can I obtain such a list using Python Selenium? In addition to getting such a list of image URLs, I would also like to be able to get each image's data (i.e. the image itself) without needing to make additional network requests.
Please DO NOT suggest that I parse the HTML to look for <img ... /> tags. That is clearly not what I'm looking for. I am looking for image responses. Not all image responses are present in the DOM. Example: some image responses from AJAX requests.
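
No answer was given here, but one possible approach, not built into plain Selenium, is the third-party selenium-wire package, which records the browser's network traffic so you can read image responses (including ones triggered by AJAX) without making additional requests. A minimal sketch, assuming selenium-wire and a Chrome driver are installed, with example.com as a placeholder URL:
from seleniumwire import webdriver  # third-party drop-in replacement for selenium's webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder; use the page you are inspecting

image_responses = []
for request in driver.requests:
    if request.response and request.response.headers.get("Content-Type", "").startswith("image/"):
        # request.url is the image URL; request.response.body holds the raw bytes,
        # so no extra network request is needed (the body may still be compressed
        # depending on the Content-Encoding header)
        image_responses.append((request.url, request.response.body))

print([url for url, _ in image_responses])
driver.quit()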

Understanding Google's HTML

first time poster here.
I am just getting into Python and coding in general and I am looking into the requests and BeautifulSoup libraries. I am trying to grab image URLs from Google Images. When inspecting the site in Chrome I can find the "div" and the correct img src URL. But when I open the HTML that "requests" gives me, I can find the same "div", but the img src URL is something completely different and only leads to a black page if used.
Image of the HTML that requests gets
Image of the HTML found in Chrome's inspect tool
What I wonder, and want to understand is:
Why are these HTMLs different?
How do I get the img src that is found with the inspect tool using requests?
Hope the question makes sense and thank you in advance for any help!
The differences between the response HTML and the code in Chrome's inspector likely stem from updates to the page when JavaScript changes it. For example, when a script uses innerHTML() to edit a div element, the code it adds goes into the DOM and shows up in the inspector, but it has no influence on the original response.
You can search for http:// at the beginning and .png, .jpg, or any other image format at the end.
Simply put, your code retrieves a single HTML page, and lets you access it, as it was retrieved. The browser, on the other hand, retrieves that HTML, but then lets the scripts embedded in (or linked from) it run, and these scripts often make significant modifications to the HTML (also known as DOM - Document Object Model). The browser's inspector inspects the fully modified DOM.
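
Following the search suggestion above, here is a minimal sketch that scans the raw response for anything starting with http(s):// and ending in a common image extension; it only finds URLs that are present in the HTML requests returns, not ones injected later by JavaScript (the search URL is just an example):
import re
import requests

html = requests.get("https://www.google.com/search?q=python&tbm=isch").text

# Rough pattern: http(s)://... ending in a common image extension
image_urls = re.findall(r'https?://[^\s<>"]+?\.(?:png|jpe?g|gif|webp)', html)
print(image_urls[:10])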

BeautifulSoup scrape not showing everything

I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with BeautifulSoup, there is no img tag (in image 2).
I don't understand why the website has an img tag for each image, but BeautifulSoup does not.
It is possible that the images do not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the HTML on the site.
Another issue might be JavaScript. Websites commonly have JavaScript code that runs after the page has first been loaded. The JavaScript then makes additional requests to update elements on the page.
To see what is happening on the site, go to the site in your browser and press F12. Go to the Network tab and reload the page. You will see all the URLs that are requested.
If you need to get data that is loaded by JavaScript requests, try using Selenium.
UPDATE
I went to the website you posted and pulled just the HTML using the following code.
import requests
# Fetch the raw HTML exactly as the server returns it, before any JavaScript runs
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The request returns the HTML you would get before any JavaScript and other requests run. You can see it here.
The image URLs are not in this either. This means that the image HTML is not returned in the initial request. What we do see are data tags (see line 2192 of the pastebin). These are commonly used by JavaScript to make additional requests so it knows which images to go and get.
Result: The img tags you are looking for are not in the HTML returned from your request. Selenium will help you here, or investigate how their JavaScript is using those data-ids to determine which images to request.
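
For completeness, a minimal Selenium sketch along those lines; the plain img tag lookup is an assumption, so you will likely want to narrow the selector down after inspecting the rendered page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://auburn.craigslist.org/search/sss?query=test")

# Wait until the JavaScript has put at least one img tag into the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "img"))
)

image_urls = [img.get_attribute("src") for img in driver.find_elements(By.TAG_NAME, "img")]
print(image_urls)
driver.quit()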

How to find the Request URL from a webpage in Chrome's Inspect Element > Network > XHR > Response > Headers > General

Sorry if I don't know the correct terminology; I'm new to web scraping, so please correct my terminology if you feel like it.
I'm working on a project to scrape images of all the pieces by an artist, given the URL for the artist's gallery. What I am doing is finding the unique ID of each page of the gallery that will lead me to the webpage hosting the original image. I can already scrape from the art page; I just need the IDs of each page from the gallery.
Artist Gallery --> Art Page --> scrape image
The IDs of each page on the gallery page are not available in the page source, since they are being loaded in separately through JavaScript I think, so I cannot grab them using:
import requests
import urllib.request
response = requests.get(pageurl)
print(response.text)
But I have found that by going to Chrome's Inspect Element > Network > XHR > Response > Headers > General, there is a request URL that has all the IDs that I need, and below that is a Query String Parameters section that has all the IDs I need.
Picture of Query String Parameters
Picture of Request URL
I am using BeautifulSoup, but the problem just lies with how to get the data.
I have also used urllib.request.urlopen(pageurl) with similar results. I have also tried Selenium, but was still unable to get the IDs, although I may not have done so correctly; I was able to get to the webpage, but maybe I did not use the right method. For now, this is what I want to try. EDIT: I have since figured it out using Selenium (I just wasn't trying hard enough), but would still like some input regarding intercepting XHRs.
Link to site if you really want to see it, but you may have to login
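
No answer was posted here, and since the actual request URL and query string parameters appear only in the screenshots, here is a generic, hypothetical sketch of the usual approach: copy the XHR's URL and parameters out of the Network tab and replay the request with requests. Every name below is a placeholder to adapt to what the inspector actually shows:
import requests

# Hypothetical values: replace with the request URL and query string
# parameters shown under Network > XHR in the inspector
xhr_url = "https://www.example.com/api/gallery"   # placeholder
params = {"artist_id": "12345", "page": 1}        # placeholder

response = requests.get(xhr_url, params=params)
data = response.json()  # XHR endpoints often return JSON rather than HTML
print(data)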

Scraping the content of a box contains infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at its AJAX requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question would be how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I know how to scrape certain requests with Scrapy though.
It is indeed an AJAX request. If you take a look at the Network tab in your browser's inspector, you can see that it's making a POST request to download the URLs of the articles.
Every value here is self-explanatory except maybe for docId and timestamp. docId seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So all in all you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
You can mess around more to reverse engineer how the website does it and what every parameter is used for, but this is a good start.
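
As a concrete starting point, here is a minimal sketch that requests that URL with requests and pulls the article links out with Beautiful Soup; the li a selector is an assumption about the returned markup, so adjust it after looking at the actual response:
import requests
from bs4 import BeautifulSoup

# Parameters taken from the URL above: timestamp omitted, pullCount raised to 100
params = {
    "blogs": "true",
    "commentary": "true",
    "docId": "1275261016",
    "premium": "true",
    "pullCount": "100",
    "pulse": "true",
    "rtheadlines": "true",
    "topic": "All Topics",
    "topstories": "true",
    "video": "true",
}

response = requests.get("http://www.marketwatch.com/newsviewer/mktwheadlines", params=params)
soup = BeautifulSoup(response.text, "html.parser")

# Assumption about the markup: article links sit in <a> tags inside the returned <li> elements
links = [a["href"] for a in soup.select("li a[href]") if a["href"].startswith("http")]
print(links[:10])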