I am wondering whether Scrapy has a way to select elements based on colors defined in CSS. For example, select all elements with background-color: #ff0000.
I have tried this:
response.css('td::attr(background-color)').extract()
I was expecting a list of all the background colors set on the table data elements, but it returns an empty list.
Is it generally possible to locate elements by their CSS properties in Scrapy?
The short answer is no, this is not possible with Scrapy alone.
Why not?
The ::attr() selector gives you access to element attributes, but background-color is a CSS property, not an attribute.
It is also important to understand that there are several different ways to define the CSS properties of elements on a page. To actually get the computed CSS property value of an element, you need a browser to fully render the page and apply all the defined stylesheets.
Scrapy itself is not a browser and has no JavaScript engine; it is not able to render a page.
Exceptions
Sometimes, though, CSS properties are defined in style attributes of the elements. For instance:
<span style="background-color: green"/>
If this is the case, then, yes, you can use the style attribute's value to filter elements:
response.xpath("//span[contains(@style, 'background-color: green')]")
This would, though, be quite fragile and may generate false positives.
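A slightly less fragile variant is to normalize the inline declarations before comparing. This is a sketch, not built-in Scrapy functionality; the helper name and the normalization rules are my own, and it still only covers styles defined inline:

```python
def parse_inline_style(style):
    """Parse an inline style attribute string into a {property: value} dict.

    Normalizes whitespace and letter case so that
    'background-color:GREEN ;' and 'background-color: green'
    compare equal. Does NOT see external stylesheets.
    """
    props = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        name, _, value = declaration.partition(":")
        props[name.strip().lower()] = value.strip().lower()
    return props

# Usage with a Scrapy response (assumed markup):
# for span in response.xpath("//span[@style]"):
#     if parse_inline_style(span.attrib["style"]).get("background-color") == "green":
#         ...
```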
What can you do?
look for other things to base your locators on. Strictly speaking, locating elements by background color is rarely the best way to get to the desired elements, unless in some unusual circumstances this property is the only distinguishing factor
the scrapy-splash project lets you drive the lightweight Splash browser, which can render the page. In that case, you would need to execute Lua scripts to access the CSS properties of elements on the rendered page
the selenium browser automation tool is probably the most straightforward choice for this problem, as it gives you direct control of the page and access to its elements, their properties, and their attributes. Its .value_of_css_property() method returns the value of a CSS property.
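As a sketch of the Selenium route (the helper name is mine; it works on any objects exposing Selenium's value_of_css_property method). Note that browsers report computed colors as rgba(...) strings, not the hex form from the stylesheet, so compare against that form:

```python
def filter_by_css_property(elements, prop, value):
    """Return the elements whose computed CSS property equals value.

    Intended for Selenium WebElement objects (each exposes
    value_of_css_property), e.g. the result of
    driver.find_elements_by_css_selector("td").
    """
    return [el for el in elements if el.value_of_css_property(prop) == value]

# Usage (assumed page):
# reds = filter_by_css_property(
#     driver.find_elements_by_css_selector("td"),
#     "background-color", "rgba(255, 0, 0, 1)")
```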
Response.css() is a shortcut to TextResponse.selector.css(query)
http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.css
Related
I'm using Python with Selenium.
I am attempting to do some web scraping. I have a WebElement (which contains child elements) that I would like to save to an offline file. So far, I have managed to get the raw HTML for my WebElement using WebElement.get_attribute('innerHTML'). This works, but no CSS is present in the final product because the website uses a stylesheet. So I'd like to get these CSS properties converted to inline styles.
I found this Stack Overflow solution which shows how to get the CSS properties of an element. However, getting this data, then parsing the HTML as a string to add these properties to each tag's style attribute, then repeating this for all the child elements, feels like a significant undertaking.
So, I was wondering whether there was a more straightforward way of doing this.
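One more direct route, sketched below under assumptions (the function name and the JavaScript are mine; it presumes you can run scripts through Selenium's execute_script), is to let the browser itself copy each element's computed style into its style attribute before reading the HTML out:

```python
# JavaScript that inlines the computed style of an element and all
# of its descendants, so the saved HTML no longer needs the stylesheet.
INLINE_STYLES_JS = """
var root = arguments[0];
var nodes = [root].concat(Array.prototype.slice.call(root.querySelectorAll('*')));
nodes.forEach(function (node) {
    var cs = window.getComputedStyle(node);
    for (var i = 0; i < cs.length; i++) {
        node.style.setProperty(cs.item(i), cs.getPropertyValue(cs.item(i)));
    }
});
"""

def save_with_inline_styles(driver, element, path):
    """Inline computed styles on element and its children, then save its HTML."""
    driver.execute_script(INLINE_STYLES_JS, element)
    with open(path, "w") as f:
        f.write(element.get_attribute("outerHTML"))
```

Inlining every computed property makes the file large; if that matters, restrict the loop to the properties you care about.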
I am using an inclusion tag in the sidebar of an HTML page. If I hide the sidebar using CSS for the mobile view, will Django still process the inclusion tag on the backend when the page is visited from a mobile device?
Yes, it will process and include it. But if your CSS hides it, it won't be displayed.
You can use your browser's developer tools (Inspect) to see the included code.
The display: none; CSS declaration tells the browser not to render the referenced element; however, the element is still acknowledged and processed by the backend.
Therefore, yes, Django will process inclusion tags even if they are hidden using display: none;.
To use the property:
Firstly, identify the HTML ID or class of the element you wish to hide.
If you'd like to hide an element by HTML ID, use: #html-id { display: none; }
If you'd like to hide elements by CSS class, use: .class-name { display: none; }
If you are not sure how to reference the element in CSS, use the Inspect tool of Google Chrome.
For further details on how you can make advanced selections check out:
https://www.w3schools.com/css/css_selectors.asp
This can come in handy if you employ a complex identification methodology or wish to minimize your CSS.
Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (the same is true when trying to click the a tag directly, or to driver.get() the href of the a element).
Is there another way to access the hidden elements? For instance, when reading the HTML of the entire page, the elements of the other pagination pages are shown, but they are apparently inaccessible with BeautifulSoup.
Pagination was added for humans. Maybe you used the wrong XPath or CSS selector; check it.
Use this XPath:
//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
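When the results load asynchronously, the click alone is not enough; you also have to wait for the new content. A hand-rolled sketch of that idea (the helper name and the polling loop are mine; Selenium's own WebDriverWait does the same thing more robustly):

```python
import time

def go_to_page(driver, page_index, is_loaded, timeout=10.0, poll=0.5):
    """Click the page_index-th pagination link, then poll until
    is_loaded(driver) reports that the new results are present."""
    driver.find_elements_by_css_selector('.pagination li a')[page_index].click()
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_loaded(driver):
            return True
        time.sleep(poll)
    raise TimeoutError("results did not load within %.0f seconds" % timeout)

# Usage (assumed markup):
# go_to_page(driver, 1,
#            lambda d: bool(d.find_elements_by_css_selector('#listing-basic article')))
```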
I wish to use python-webdriver to select an element based on its background highlight colour. Normally, for this bit of HTML:
<div class="line-highlight" style="background:#FD71B5;">
I would select it with the following:
.line-highlight[style*='background:#FD71B5']
However, in this case I have different inline styling:
<div class="line-highlight" style=top:130px;height:28px;left:506px;width:434px;">
but the highlight colour (which is the same) is set in an external stylesheet, so the above selector does not work.
Is there any way webdriver can select by style if that style is not inline?
Thanks,
Darren
Because the CSS you're looking for isn't part of the HTML markup, you can't select the element(s) the way you usually would.
Instead, select by the class name ".line-highlight", then loop through the resulting element objects and, for each element, get the CSS background property value using:
value_of_css_property("background")
(or any CSS property, for that matter). Once an element's background matches the one you're looking for, break from the loop and, tada, you've found the element you were looking for.
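Spelled out in Python (a sketch; the helper name is mine). Two caveats: the computed "background" shorthand serializes to a long string, so comparing the "background-color" longhand is usually easier, and Selenium reports colors as rgba(...) strings rather than the #FD71B5 hex from the stylesheet:

```python
def find_by_background(driver, color):
    """Return the first .line-highlight element whose computed
    background-color matches color, or None if there is no match."""
    for el in driver.find_elements_by_class_name("line-highlight"):
        if el.value_of_css_property("background-color") == color:
            return el
    return None

# Usage (assumed: #FD71B5 computes to the rgba string below):
# el = find_by_background(driver, "rgba(253, 113, 181, 1)")
```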
Note: if you're using Java, use:
getCssValue(property-name)
Here is what this link http://www.w3.org/2002/07/26-dom-article.html says about style:
Accessing the Style Associated With the Document
Each node in the document is associated with stylistic effects such as color, position, and borders. These stylistic effects are not always part of the document and might be defined in a separate section called a style sheet.
So, since Selenium's locators work against the DOM only, it can't locate elements the way you want.
I've written a simple web parser using Beautiful Soup. But I want to know whether a class's CSS "position" property is fixed or absolute.
All the CSS rules are in stylesheets linked from the HTML header, with class rules in the CSS file like this:
.slideABox , .slideBBox{
max-width:320px; position:relative; margin: 0 auto;
}
So how can I check the properties of a CSS class in Python, just like in JavaScript?
BeautifulSoup is a poor choice for what you want to accomplish, because it does not come with a rendering engine (therefore, CSS rules are not even taken into consideration when parsing the page).
Obviously you could parse the CSS rules manually using an appropriate tool (e.g. http://pythonhosted.org/tinycss/), but I wouldn't recommend that, because CSS properties can also be inherited and overridden, and you'll end up with false results unless your HTML page is a very simple one.
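If you do attempt the manual-parsing route anyway, here is a minimal stdlib-only sketch (the function is mine and deliberately naive: it handles only flat rules like the one above and ignores specificity, inheritance, and @media blocks):

```python
import re

def css_property(css_text, class_name, prop):
    """Return prop's value from the first rule whose selector list
    mentions .class_name, or None. Ignores the cascade entirely."""
    for rule in re.finditer(r"([^{}]+)\{([^}]*)\}", css_text):
        selectors, body = rule.group(1), rule.group(2)
        if "." + class_name in selectors:
            decl = re.search(prop + r"\s*:\s*([^;]+)", body)
            if decl:
                return decl.group(1).strip()
    return None
```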
Instead, I suggest taking a look at Selenium, which is essentially a browser wrapper and has Python bindings. Its element object has a value_of_css_property method, which will probably suit your needs (see the APIs here).