Saving HTML Element Tree including CSS properties using Selenium - python

I'm using Python with Selenium.
I am attempting to do some web scraping. I have a WebElement (which contains child elements) that I would like to save to a offline file. So far, I have managed to get the raw HTML for my WebElement using WebElement.get_attribute('innerHTML'). This works, but, no CSS is present in the final product because a stylesheet is used on the website. So I'd like to get these CSS properties converted to inline.
I found this stackoverflow solution which shows how to get the CSS properties of an element. However, getting this data, then parsing the HTML as a string to add these properties inside the HTML tag's style attribute, then doing this for all the child elements, feels like it'd be a significant undertaking.
So, I was wondering whether there was a more straightforward way of doing this.

Related

Scrapy: extract text from span without class or id

I have the following html structure:
I would like to extract the text ("“Business-Thinking”-Fokus im Master-Kurs") from the span highlighted (using Scrapy), however I have trouble reaching to it as it does not contain any specific class or id.
I tried to access it with the following absolute xPath:
sel.xpath('/html/body/div[4]/div[1]/div/div/h1/span/text()').extract()
I don't get any error, however it returns a blank file, meaning the text is not extracted.
Note: The parent classes are not unique, that's why I'm not using a relative path. As the text varies, I also cannot reach the span by looking for the text it contains.
Do you have any suggestion on how I should modify my xPath to extract the text? Thanks!
If you load the page using scrapy shell url it loads without javascript.
When you look at source without javascript, the xpath to the span is /html/body/div/div[1]/div/div/h1/span
To load webpages with javascript in Scrapy use Splash.

What is the best way to get the size of text in the tag when parsing HTML+CSS with python?

I'm scraping HTML pages of live websites using python and beautifulsoup4. I want to be able get the size of the text of any html tag. I tried to use cssutils to parse the CSS and find font-size param but real life CSS is pretty complicated like this
.some_div_class a span {font-size: 20px}
So I can find all tags that correspond to this selector using bs.select(selector) but trying every selector in stylesheet will take way too much time. So how is it possible to find font-size for any tag efficiently? Browsers do it pretty fast, so it shouldn't be impossible.
I don't want to use headless browser.

Selenium Python: How to get css without targetting a specific class/id/tag

I'm working on a scraper project and one of the goals is to get every image link from HTML & CSS of a website. I was using BeautifulSoup & TinyCSS to do that but now I'd like to switch everything on Selenium as I can load the JS.
I can't find in the doc a way to target some CSS parameters without having to know the tag/id/class. I can get the images from the HTML easily but I need to target every "background-image" parameter from the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it or should I loop into each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the Style tags and parse them, searching what you look.
Also you can download the css file, using the resource URL and parse them.
Also you can create a XPATH/CSS rule for searching nodes that contain the parameter that you're looking for.

Scrapy Python Web Scraping - creating XPath

I am trying to create "universal" Xpath, so when I run spider, it will be able to download the hotel name for each hotel on the list.
This is the XPath that I need to convert:
//*[#id="offerPage"]/div[3]/div[1]/div[1]/div/div/div/div/div[2]/div/div[1]/h3/a
Can anyone point me the right direction?
This is the example how they did it in the scrapy docs:
https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py
for text: they have :
'text': quote.xpath('./span[#class="text"]/text()').extract_first(),
When you open "http://quotes.toscrape.com/" and copy Xpath for text you will get :
/html/body/div/div[2]/div[1]/div[1]/span[1]
When you look at the html that your are scraping just using "copy xpath" from the browser source viewer is not enough.
You need to look at the attributes that the html tags have.
Of course, using just tag types as an xpath can work, but what if not every page you are going to scrape follows that pattern?
The Scrapy example you are using uses the span's class attribute to precisely point to the target tag.
I suggest reading a bit more about Xpath (for example here) to understand how flexible your search patterns can be.
If you want to go even broader, reading about DOM structure will also be useful. Let us know if you need more pointers.

Is it possible to locate elements by CSS properties in Scrapy?

Am wondering if Scrapy has methods to scrape data based on their colors defined in CSS. For example, select all elements with background-color: #ff0000.
I have tried this:
response.css('td::attr(background-color)').extract()
I was expecting a list with all background colors set for the table data elements but it returns an empty list.
Is it generally possible to locate elements by their CSS properties in Scrapy?
Short answer is No, this is not possible to do with Scrapy alone.
Why No?
the :attr() selector allows you to access element attributes, but background-color is a CSS property.
an important thing to understand now is that there are multiple different ways to define CSS properties of elements on a page and, to actually get a CSS property value of an element, you need a browser to fully render the page and all the defined stylesheets
Scrapy itself is not a browser, not a javascript engine, it is not able to render a page
Exceptions
Sometimes, though, CSS properties are defined in style attributes of the elements. For instance:
<span style="background-color: green"/>
If this is the case, when, yes, you would be able to use the style attributes value to filter elements:
response.xpath("//span[contains(#style, 'background-color: green')]")
This would though be quite fragile and may generate false positives.
What can you do?
look for other things to base your locators on. In general, strictly speaking, locating elements by the background color is not the best way to get to the desired elements unless, in some unusual circumstances, this property is the only distinguishing factor
scrapy-splash project allows you to automate a lightweight Splash browser which may render the page. In that case, you would need some Lua scripts to be executed to access CSS properties of elements on a rendered page
selenium browser automation tool is probably the most straightforward tool for this problem as it gives you direct control and access to the page and its elements and their properties and attributes. There is this .value_of_css_property() method to get a value of a CSS property.
Response.css() is a shortcut to TextResponse.selector.css(query)
http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.css

Categories