I'm trying to implement functionality similar to Facebook's thumbnail preview. The idea is that a user enters the URL of a product and selects the best image of that product.
To filter out images that obviously aren't a product, I want to keep only images whose height and width are greater than 150px.
I'm using Python and BeautifulSoup to download the HTML and extract images, but I can't find a way to get the height or width when it is specified in CSS.
GD is a library that's been around for quite some time, and it has a pretty easy interface to work with.
See the "size" method to get width and height.
EDIT
Ah, how about this?
1. Parse the HTML content and retrieve the URLs of the CSS file(s) and any inline styles.
2. Download and parse the CSS file(s), in order, building a rule set of the CSS rules.
3. Next, parse the rest of the HTML from Step 1, gathering IMG tags; if an IMG tag has a class name, look up the class name in your CSS rules and check for width or height.
It might sound a little complicated, but I bet downloading a few CSS stylesheets is much lighter than downloading images and having to use an image library on the server side.
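As a rough sketch of the steps above, assuming the stylesheet text has already been downloaded into a string and that only simple class selectors with pixel values matter. It uses only the standard library (regular expressions and html.parser) instead of a real CSS parser, so it ignores specificity, media queries, and non-pixel units. The names parse_class_sizes, ImgCollector, and large_images are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Matches simple class rules such as ".thumb { width: 200px; height: 80px }".
RULE_RE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")
# Matches pixel width/height declarations, but not max-width or min-height.
DIM_RE = re.compile(r"(?<![\w-])(width|height)\s*:\s*(\d+)px")


def parse_class_sizes(css_text):
    """Map class name -> {'width': px, 'height': px} for simple class rules."""
    sizes = {}
    for class_name, body in RULE_RE.findall(css_text):
        dims = {prop: int(px) for prop, px in DIM_RE.findall(body)}
        if dims:
            # Later rules overwrite earlier ones for the same class.
            sizes.setdefault(class_name, {}).update(dims)
    return sizes


class ImgCollector(HTMLParser):
    """Collects (src, [class names]) for every img tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            classes = (attr.get("class") or "").split()
            self.images.append((attr.get("src"), classes))


def large_images(html_text, css_text, min_px=150):
    """Return img src values whose CSS width and height both exceed min_px."""
    sizes = parse_class_sizes(css_text)
    collector = ImgCollector()
    collector.feed(html_text)
    keep = []
    for src, classes in collector.images:
        dims = {}
        for name in classes:
            dims.update(sizes.get(name, {}))
        if dims.get("width", 0) > min_px and dims.get("height", 0) > min_px:
            keep.append(src)
    return keep
```

Images with no matching class rule are dropped here; in practice you would probably fall back to the width and height attributes on the tag, or to downloading the image, for those.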
Related
In Firefox, I can get a list of all images from the "Media" tab of the Page Info window:
How can I obtain such a list using Python Selenium? In addition to getting such a list of image URLs, I would also like to be able to get each image's data (i.e. the image itself) without needing to make additional network requests.
Please DO NOT suggest that I parse the HTML to look for <img ... /> tags. That is clearly not what I'm looking for. I am looking for image responses. Not all image responses are present in the DOM. Example: some image responses from AJAX requests.
I'm having trouble getting the url of an image on a website and I was wondering if I could get some help.
I want to get the image url of the card on the website, but using xpath only gives me the image url of the website logo.
scrapy shell https://db.ygoprodeck.com/card/?search=7%20Colored%20Fish
response.xpath('//img')
Out[2]: [<Selector xpath='//img' data='<img src="https://db.ygoprodeck.com/sear'>]
There should be another img link for the card picture, but it is not showing up.
So there is some logic to how the images are served. Each card has an ID listed on the page, and the ID is the name of the image. They also hide this ID from you.
They load much of this information in via the meta attributes at the top of the page. Often the data the JS uses will be put at the top, in a script tag or in meta attributes. This is particularly true of Shopify stores.
If you ever have trouble finding something, for example this image, take the image name and search the rest of the document for references to that keyword. You will often be able to track down the information, or at least figure out how it is loaded. This is also useful when websites require a "token"; often they will supply the token somewhere on the previous page.
# with css
In [6]: response.css('meta[property="og:image"]::attr(content)').extract_first()
Out[6]: 'https://ygoprodeck.com/pics/23771716.jpg'
# with xpath
In [8]: response.xpath('//meta[@property="og:image"]/@content').extract_first()
Out[8]: 'https://ygoprodeck.com/pics/23771716.jpg'
I'm working on a scraper project, and one of the goals is to get every image link from the HTML and CSS of a website. I was using BeautifulSoup and TinyCSS to do that, but now I'd like to switch everything to Selenium so that I can load the JS.
I can't find a way in the docs to target CSS properties without having to know the tag/id/class. I can get the images from the HTML easily, but I need to target every "background-image" property in the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it, or should I loop over each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the style tags and parse them, searching for what you're looking for.
You can also download the CSS files using their resource URLs and parse them.
You can also create an XPath/CSS rule to search for nodes that contain the property you're looking for.
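For the first two suggestions, here is a minimal sketch of pulling the URLs out of CSS text once you have it as a string (from a style tag or a downloaded stylesheet). It relies on a regular expression rather than a full CSS parser, so it only catches the first url(...) in each background or background-image declaration; the function name background_image_urls is just an illustration.

```python
import re

# Matches "background-image: url(...)" and the "background: ... url(...)" shorthand.
BG_URL_RE = re.compile(
    r"background(?:-image)?\s*:[^;{}]*?url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)",
    re.IGNORECASE,
)


def background_image_urls(css_text):
    """Return the URLs found in background / background-image declarations."""
    return [url.strip() for url in BG_URL_RE.findall(css_text)]
```

For example, background_image_urls('body { background-image: url("paper.gif"); }') gives ['paper.gif']. The URLs are returned as written in the CSS, so relative ones still need to be resolved against the stylesheet's own URL.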
I got the following content from an HTML page:
str='http://www.ralphlauren.com/graphics/product_images/pPOLO2-24922076_alternate1_v360x480.jpg\', zoom: \'s7-1251098_alternate1\' }]\n\n\nEnlarge Image\n\n\n\n\n\n\nCotton Canvas Utility Jacket\nStyle Number : 112933196\n\n\n\n$125.00'
I have many HTML pages like this. I need some way to read the content just BEFORE the style number. In this case, I need "Cotton Canvas Utility Jacket". Is there a regex in Python to do that? Note that I can start looking for the pattern "Enlarge Image" and read whatever comes before I hit "Style Number". The issue is that there are many occurrences of "Enlarge Image" on the HTML page. What I have shown above is only part of the HTML page.
In short, I need to find the product name from the linked HTML page.
Thanks.
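One possible approach, sketched under the assumption that the product name is always the last non-empty line before "Style Number :" in the extracted text, as it is in the sample above. Anchoring on "Style Number" instead of "Enlarge Image" sidesteps the problem of "Enlarge Image" appearing many times. The function name product_name is hypothetical.

```python
import re

# Capture the last non-empty line that precedes "Style Number :".
NAME_RE = re.compile(r"([^\n]+)\n\s*Style Number\s*:")


def product_name(text):
    """Return the line just before 'Style Number :', or None if absent."""
    match = NAME_RE.search(text)
    return match.group(1).strip() if match else None
```

If a page lists several style numbers, NAME_RE.findall(text) would return the line preceding each of them.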
I am pretty new to Django, so I am creating a project to learn more about how it works. Right now I have a model that contains a URL field. I want to automatically generate a thumbnail from this URL field by taking an appropriate image from the website, like Facebook or Reddit does. I'm guessing that I should store this image in an image field. What would be a good way to select an ideal image from the website, and how can I accomplish this?
EDIT - I'm trying to take actual images from the website rather than a picture of the website.
First off you can check if the site uses any Facebook open graph tags - namely <meta property="og:image" content="http://..."/>.
You'll first need to parse the html content for img src urls with something like lxml or BeautifulSoup. Then, you can feed one of those img src urls into sorl-thumbnail or easy-thumbnails as Edmon suggests.
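A minimal sketch of that order of preference: use og:image if the page declares it, otherwise fall back to the img src URLs. It uses the standard library's html.parser so it stays self-contained; lxml or BeautifulSoup would work the same way. The names ImageCandidateParser and candidate_images are made up for this example, and the page HTML is assumed to be already downloaded into a string.

```python
from html.parser import HTMLParser


class ImageCandidateParser(HTMLParser):
    """Records the first og:image meta tag and every img src on the page."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.img_srcs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:image":
            if self.og_image is None:
                self.og_image = attr.get("content")
        elif tag == "img" and attr.get("src"):
            self.img_srcs.append(attr["src"])


def candidate_images(html_text):
    """Return [og:image] if present, otherwise the img src URLs in page order."""
    parser = ImageCandidateParser()
    parser.feed(html_text)
    if parser.og_image:
        return [parser.og_image]
    return parser.img_srcs
```

Whichever URL you pick can then be downloaded and handed to the thumbnail library; relative src values need to be joined with the page URL first.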
One option, which is not specific to Django, is to take a snapshot of the page using webkit2png and then use Sorl or Easy Thumbnails to generate the image URL.
Sorl - https://github.com/sorl/sorl-thumbnail
Easy Thumbnails - https://github.com/SmileyChris/easy-thumbnails