Check if a CSS class's position is fixed using Python - python

I've written a simple web parser to parse a website using Beautiful Soup, but I want to know whether a CSS class's "position" property is fixed or absolute.
All the CSS is in stylesheets linked in the HTML header, with class rules defined in the CSS files like this:
.slideABox , .slideBBox{
max-width:320px; position:relative; margin: 0 auto;
}
So how can I check the properties of a CSS class in Python, just like in JavaScript?

BeautifulSoup is a poor choice for what you want to accomplish, because it does not come with a rendering engine (therefore, CSS rules are not even taken into consideration when parsing the page).
Obviously you could parse the CSS rules manually using an appropriate tool (e.g. http://pythonhosted.org/tinycss/), but I wouldn't recommend that, because CSS properties can also be inherited, and you'll end up with false results unless your HTML page is a very simple one.
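If you do go the manual-parsing route despite those caveats, a minimal sketch with the standard library alone might look like the following. This is an illustrative assumption, not a robust parser: it ignores inheritance, specificity, and @media blocks, which is exactly why it can give false results.

```python
import re

# Minimal sketch: look up the `position` declared for a class in raw CSS text.
# Fragile by design -- no inheritance, specificity, or @media handling.
def position_of_class(css_text, class_name):
    # Split the stylesheet into "selectors { body }" rule blocks.
    pattern = re.compile(r'([^{}]+)\{([^}]*)\}')
    result = None
    for selectors, body in pattern.findall(css_text):
        # Does the selector list mention .class_name?
        if re.search(r'\.%s(\s|,|$)' % re.escape(class_name), selectors.strip()):
            m = re.search(r'position\s*:\s*([\w-]+)', body)
            if m:
                result = m.group(1)  # crudely, later rules win
    return result

css = """
.slideABox , .slideBBox{
max-width:320px; position:relative; margin: 0 auto;
}
"""
print(position_of_class(css, "slideABox"))  # relative
```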
Instead, I suggest taking a look at Selenium, which is essentially a browser wrapper and has Python bindings. Its WebElement object has a value_of_css_property method, which will probably suit your needs (see the APIs here).

Related

Saving HTML Element Tree including CSS properties using Selenium

I'm using Python with Selenium.
I am attempting to do some web scraping. I have a WebElement (which contains child elements) that I would like to save to an offline file. So far, I have managed to get the raw HTML for my WebElement using WebElement.get_attribute('innerHTML'). This works, but no CSS is present in the final product, because the website uses stylesheets. So I'd like to get these CSS properties converted to inline styles.
I found this stackoverflow solution which shows how to get the CSS properties of an element. However, getting this data, then parsing the HTML as a string to add these properties inside the HTML tag's style attribute, then doing this for all the child elements, feels like it'd be a significant undertaking.
So, I was wondering whether there was a more straightforward way of doing this.
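The per-element step described above (merging fetched CSS properties into a tag's style attribute) can at least be factored into a small helper. A rough sketch, where `props` stands in for values you would fetch with value_of_css_property or a getComputedStyle() call through driver.execute_script() (the helper name is hypothetical):

```python
# Sketch of the "convert computed CSS to inline styles" step. In real use,
# `props` would be built from Selenium calls; here it is a plain dict.
def inline_style(existing_style, props):
    """Merge computed properties into a style-attribute string.

    Existing inline declarations win over computed ones.
    """
    merged = dict(props)
    for decl in existing_style.split(";"):
        if ":" in decl:
            name, _, value = decl.partition(":")
            merged[name.strip()] = value.strip()
    return "; ".join("%s: %s" % (k, v) for k, v in sorted(merged.items()))

computed = {"background-color": "rgb(255, 0, 0)", "position": "fixed"}
print(inline_style("color: blue", computed))
# background-color: rgb(255, 0, 0); color: blue; position: fixed
```

You would still need to walk the child elements and rewrite each tag, which is the "significant undertaking" part the question worries about.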

Selenium Python: How to get CSS without targeting a specific class/id/tag

I'm working on a scraper project and one of the goals is to get every image link from HTML & CSS of a website. I was using BeautifulSoup & TinyCSS to do that but now I'd like to switch everything on Selenium as I can load the JS.
I can't find in the doc a way to target some CSS parameters without having to know the tag/id/class. I can get the images from the HTML easily but I need to target every "background-image" parameter from the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it or should I loop into each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the <style> tags and parse them, searching for what you're looking for.
You can also download the CSS files, using their resource URLs, and parse those.
You can also create an XPath/CSS rule to search for nodes that contain the parameter you're looking for.
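To illustrate the first two suggestions: once you have the raw CSS text (from a <style> tag, or from a downloaded stylesheet), the background-image URLs can be pulled out with a regular expression. A minimal sketch, which handles optional quotes but does not resolve relative URLs against the page:

```python
import re

# Extract url(...) values from background / background-image declarations
# in raw CSS text. Quotes around the URL are optional in CSS.
def background_image_urls(css_text):
    pattern = re.compile(
        r'background(?:-image)?\s*:[^;}]*url\(\s*["\']?([^"\')]+)["\']?\s*\)',
        re.IGNORECASE,
    )
    return pattern.findall(css_text)

css = ('.hero { background-image: url("paper.gif"); } '
       '.btn { background: #fff url(icons/star.png) no-repeat; }')
print(background_image_urls(css))  # ['paper.gif', 'icons/star.png']
```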

Extracting links with scrapy that have a specific css class

Conceptually simple question/idea.
Using Scrapy, how do I use LinkExtractor so that it only follows links matching a given CSS selector?
Seems trivial and like it should already be built in, but I don't see it? Is it?
It looks like I can use an XPath, but I'd prefer using CSS selectors. It seems like they are not supported?
Do I have to write a custom LinkExtractor to use CSS selectors?
From what I understand, you want something similar to restrict_xpaths, but provide a CSS selector instead of an XPath expression.
This is actually a built-in feature in Scrapy 1.0 (currently in a release candidate state), the argument is called restrict_css:
restrict_css
a CSS selector (or a list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
The initial feature request:
CSS support in link extractors

Is it possible to locate elements by CSS properties in Scrapy?

I'm wondering if Scrapy has methods to select elements based on colors defined in CSS. For example, select all elements with background-color: #ff0000.
I have tried this:
response.css('td::attr(background-color)').extract()
I was expecting a list with all background colors set for the table data elements but it returns an empty list.
Is it generally possible to locate elements by their CSS properties in Scrapy?
Short answer is No, this is not possible to do with Scrapy alone.
Why No?
The :attr() selector allows you to access element attributes, but background-color is a CSS property, not an HTML attribute.
An important thing to understand is that there are multiple ways to define the CSS properties of elements on a page, and to actually get an element's CSS property value, you need a browser to fully render the page and all the defined stylesheets.
Scrapy itself is not a browser and not a JavaScript engine; it is not able to render a page.
Exceptions
Sometimes, though, CSS properties are defined in style attributes of the elements. For instance:
<span style="background-color: green"/>
If this is the case, then yes, you would be able to use the style attribute's value to filter elements:
response.xpath("//span[contains(@style, 'background-color: green')]")
This would be quite fragile, though, and may generate false positives.
What can you do?
Look for other things to base your locators on. Generally speaking, locating elements by background color is not the best way to get to the desired elements, unless in some unusual circumstance this property is the only distinguishing factor.
The scrapy-splash project lets you automate a lightweight Splash browser, which can render the page. In that case, you would need to execute some Lua scripts to access the CSS properties of elements on the rendered page.
The Selenium browser automation tool is probably the most straightforward tool for this problem, as it gives you direct control of, and access to, the page, its elements, and their properties and attributes. There is a .value_of_css_property() method to get the value of a CSS property.
Response.css() is a shortcut for TextResponse.selector.css(query): http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.css

How do I query XHTML using python?

I have created a simple test harness in python for my ASP .net web site.
I would like to look up some HTML tags in the resulting page to find certain values.
What would be the best way of doing this in python?
eg (returned page):
<div id="ErrorPanel">An error occurred......</div>
would display (in std out from python):
Error: .....
or
<td id="dob">23/3/1985</td>
would display:
Date of birth: 23/3/1985
Do you want to parse XML, as you state in your question's title, or HTML, as you show in the text of the question? For the latter, I recommend BeautifulSoup -- download and install it; then, once you've made a soup object out of the HTML, you can easily locate the tag with a certain id (or other attribute), e.g.:
errp = soup.find(attrs={'id': 'ErrorPanel'})
if errp is not None:
    print('Error:', errp.string)
and similarly for the other case (easily tweakable e.g. into a loop if you're looking for non-unique attributes, and so on).
You can also do it with lxml. It handles HTML very well, and you can use CSS selectors for querying DOM, which makes it particularly attractive if you use libraries like jQuery regularly.
