I am wondering whether Scrapy has a way to select elements based on colors defined in CSS. For example, select all elements with background-color: #ff0000.
I have tried this:
response.css('td::attr(background-color)').extract()
I was expecting a list of all the background colors set on the table data elements, but it returns an empty list.
Is it generally possible to locate elements by their CSS properties in Scrapy?
The short answer is no, this is not possible with Scrapy alone.
Why not?
The ::attr() selector gives you access to element attributes, but background-color is a CSS property, not an attribute.
It is also important to understand that there are several different ways to define the CSS properties of elements on a page. To actually get the computed CSS property value of an element, you need a browser to fully render the page and apply all the defined stylesheets.
Scrapy itself is not a browser and has no JavaScript engine; it is not able to render a page.
Exceptions
Sometimes, though, CSS properties are defined in style attributes of the elements. For instance:
<span style="background-color: green"/>
If this is the case, then, yes, you can use the style attribute's value to filter elements:
response.xpath("//span[contains(@style, 'background-color: green')]")
This would, though, be quite fragile and may generate false positives.
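A slightly less fragile variant is to normalize the inline declarations before comparing. This is a sketch, not built-in Scrapy functionality; the helper name and the normalization rules are my own, and it still only covers styles defined inline:

```python
def parse_inline_style(style):
    """Parse an inline style attribute string into a {property: value} dict.

    Normalizes whitespace and letter case so that
    'background-color:GREEN ;' and 'background-color: green'
    compare equal. Does NOT see external stylesheets.
    """
    props = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        name, _, value = declaration.partition(":")
        props[name.strip().lower()] = value.strip().lower()
    return props

# Usage with a Scrapy response (assumed markup):
# for span in response.xpath("//span[@style]"):
#     if parse_inline_style(span.attrib["style"]).get("background-color") == "green":
#         ...
```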
What can you do?
look for other things to base your locators on. Strictly speaking, locating elements by background color is rarely the best way to get to the desired elements, unless in some unusual circumstances this property is the only distinguishing factor
the scrapy-splash project lets you drive the lightweight Splash browser, which can render the page. In that case, you would need to execute Lua scripts to access the CSS properties of elements on the rendered page
the selenium browser automation tool is probably the most straightforward choice for this problem, as it gives you direct control of the page and access to its elements, their properties, and their attributes. Its .value_of_css_property() method returns the value of a CSS property.
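As a sketch of the Selenium route (the helper name is mine; it works on any objects exposing Selenium's value_of_css_property method). Note that browsers report computed colors as rgba(...) strings, not the hex form from the stylesheet, so compare against that form:

```python
def filter_by_css_property(elements, prop, value):
    """Return the elements whose computed CSS property equals value.

    Intended for Selenium WebElement objects (each exposes
    value_of_css_property), e.g. the result of
    driver.find_elements_by_css_selector("td").
    """
    return [el for el in elements if el.value_of_css_property(prop) == value]

# Usage (assumed page):
# reds = filter_by_css_property(
#     driver.find_elements_by_css_selector("td"),
#     "background-color", "rgba(255, 0, 0, 1)")
```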
Response.css() is a shortcut to TextResponse.selector.css(query)
http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.css
Related
I'm using Python with Selenium.
I am attempting to do some web scraping. I have a WebElement (which contains child elements) that I would like to save to an offline file. So far, I have managed to get the raw HTML for my WebElement using WebElement.get_attribute('innerHTML'). This works, but no CSS is present in the final product because the website uses a stylesheet. So I'd like to get these CSS properties converted to inline styles.
I found this Stack Overflow solution which shows how to get the CSS properties of an element. However, getting this data, then parsing the HTML as a string to add these properties to each tag's style attribute, then repeating this for all the child elements, feels like a significant undertaking.
So, I was wondering whether there was a more straightforward way of doing this.
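One more direct route, sketched below under assumptions (the function name and the JavaScript are mine; it presumes you can run scripts through Selenium's execute_script), is to let the browser itself copy each element's computed style into its style attribute before reading the HTML out:

```python
# JavaScript that inlines the computed style of an element and all
# of its descendants, so the saved HTML no longer needs the stylesheet.
INLINE_STYLES_JS = """
var root = arguments[0];
var nodes = [root].concat(Array.prototype.slice.call(root.querySelectorAll('*')));
nodes.forEach(function (node) {
    var cs = window.getComputedStyle(node);
    for (var i = 0; i < cs.length; i++) {
        node.style.setProperty(cs.item(i), cs.getPropertyValue(cs.item(i)));
    }
});
"""

def save_with_inline_styles(driver, element, path):
    """Inline computed styles on element and its children, then save its HTML."""
    driver.execute_script(INLINE_STYLES_JS, element)
    with open(path, "w") as f:
        f.write(element.get_attribute("outerHTML"))
```

Inlining every computed property makes the file large; if that matters, restrict the loop to the properties you care about.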
I am using an inclusion tag in the sidebar of an HTML page. If I hide the sidebar using CSS for the mobile view, will Django still process the inclusion tag on the backend when the page is visited from a mobile device?
Yes, it will process and include it. But if your CSS hides it, it won't be displayed.
You can use your browser's developer tools (Inspect) to see the included code.
The display: none; CSS declaration tells the browser not to render the referenced element; however, the element is still acknowledged and processed by the backend.
Therefore, yes, Django will process inclusion tags even if they are hidden using display: none;.
To use the property:
Firstly, identify the HTML ID or class of the element you wish to hide.
If you'd like to hide an element by HTML ID, use: #html-id { display: none; }
If you'd like to hide elements by CSS class, use: .class-name { display: none; }
If you are not sure how to reference the element in CSS, use the Inspect tool of Google Chrome.
For further details on how you can make advanced selections check out:
https://www.w3schools.com/css/css_selectors.asp
This can come in handy if you employ a complex identification methodology or wish to minimize your CSS.
Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (the same is true when trying to click the a tag directly, or to driver.get() the href of the a element).
Is there another way to access the hidden elements? For instance, when reading the HTML of the entire page, the elements of the other pagination pages are shown, but they are apparently inaccessible with BeautifulSoup.
Pagination was added for humans. Maybe you used the wrong XPath or CSS selector; check it.
Use this XPath:
//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
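When the results load asynchronously, the click alone is not enough; you also have to wait for the new content. A hand-rolled sketch of that idea (the helper name and the polling loop are mine; Selenium's own WebDriverWait does the same thing more robustly):

```python
import time

def go_to_page(driver, page_index, is_loaded, timeout=10.0, poll=0.5):
    """Click the page_index-th pagination link, then poll until
    is_loaded(driver) reports that the new results are present."""
    driver.find_elements_by_css_selector('.pagination li a')[page_index].click()
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_loaded(driver):
            return True
        time.sleep(poll)
    raise TimeoutError("results did not load within %.0f seconds" % timeout)

# Usage (assumed markup):
# go_to_page(driver, 1,
#            lambda d: bool(d.find_elements_by_css_selector('#listing-basic article')))
```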
I wish to use python-webdriver to select an element based on its background highlight colour. Normally, for this bit of HTML:
<div class="line-highlight" style="background:#FD71B5;">
I would select it with the following:
.line-highlight[style*='background:#FD71B5']
However, in this case I have different inline styling:
<div class="line-highlight" style=top:130px;height:28px;left:506px;width:434px;">
but the highlight colour (which is the same) is set in an external stylesheet, so the above selector does not work.
Is there any way webdriver can select by style if that style is not inline?
Thanks,
Darren
Because the CSS you're looking for isn't part of the HTML markup, you can't select the element(s) the way you usually would.
Instead, select by the class name ".line-highlight", then loop through the resulting element objects and, for each element, get the CSS background property value using:
value_of_css_property("background")
(or any CSS property, for that matter). Once an element's background matches the one you're looking for, break from the loop and, tada, you've found the element you were looking for.
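Spelled out in Python (a sketch; the helper name is mine). Two caveats: the computed "background" shorthand serializes to a long string, so comparing the "background-color" longhand is usually easier, and Selenium reports colors as rgba(...) strings rather than the #FD71B5 hex from the stylesheet:

```python
def find_by_background(driver, color):
    """Return the first .line-highlight element whose computed
    background-color matches color, or None if there is no match."""
    for el in driver.find_elements_by_class_name("line-highlight"):
        if el.value_of_css_property("background-color") == color:
            return el
    return None

# Usage (assumed: #FD71B5 computes to the rgba string below):
# el = find_by_background(driver, "rgba(253, 113, 181, 1)")
```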
Note: if you're using Java, use:
getCssValue(property-name)
Here is what this link http://www.w3.org/2002/07/26-dom-article.html says about style:
Accessing the Style Associated With the Document
Each node in the document is associated with stylistic effects such as color, position, and borders. These stylistic effects are not always part of the document and might be defined in a separate section called a style sheet.
So, since Selenium's locators work against the DOM only, it can't locate elements the way you want.
I've written a simple web parser using Beautiful Soup. But I want to know whether a class's CSS "position" property is fixed or absolute.
All the CSS rules are in stylesheets linked from the HTML header, with class rules in the CSS file like this:
.slideABox , .slideBBox{
max-width:320px; position:relative; margin: 0 auto;
}
So how can I check the properties of a CSS class in Python, just like in JavaScript?
BeautifulSoup is a poor choice for what you want to accomplish, because it does not come with a rendering engine (therefore, CSS rules are not even taken into consideration when parsing the page).
Obviously you could parse the CSS rules manually using an appropriate tool (e.g. http://pythonhosted.org/tinycss/), but I wouldn't recommend that, because CSS properties can also be inherited and overridden, and you'll end up with false results unless your HTML page is a very simple one.
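If you do attempt the manual-parsing route anyway, here is a minimal stdlib-only sketch (the function is mine and deliberately naive: it handles only flat rules like the one above and ignores specificity, inheritance, and @media blocks):

```python
import re

def css_property(css_text, class_name, prop):
    """Return prop's value from the first rule whose selector list
    mentions .class_name, or None. Ignores the cascade entirely."""
    for rule in re.finditer(r"([^{}]+)\{([^}]*)\}", css_text):
        selectors, body = rule.group(1), rule.group(2)
        if "." + class_name in selectors:
            decl = re.search(prop + r"\s*:\s*([^;]+)", body)
            if decl:
                return decl.group(1).strip()
    return None
```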
Instead, I suggest taking a look at Selenium, which is essentially a browser wrapper and has Python bindings. Its element object has a value_of_css_property method, which will probably suit your needs (see the APIs here).