I wrote a function that parses all headers based on their tags (h1, h2, ...). Now I want to extend it with a feature that extracts text based on font size, say 20px or 1.5em, regardless of the header tags. It should return any text whose font size is greater than X, wherever it is on the page. The function takes a JSON file as input, which contains arbitrary HTML (and whatever else a website could have, e.g. CSS).
Based on the Beautiful Soup documentation on crummy.com, one possible option seems to be soup.fetch(); however, I haven't found many examples using it for this purpose.
Since font-size may well be set in a CSS component, I'm not sure that bs4 is the right package for it. I assume the answer involves cssutils or tinycss, but I haven't been able to find the best way to use them for this task.
For reference, my code for header tags was posted for review: https://codereview.stackexchange.com/questions/166671/extract-html-content-based-on-tags-specifically-headers/166674?noredirect=1#comment317280_166674.
Posts I've checked:
What is the pythonic way to implement a css parser/replacer ;
Find all the span styles with font size larger than the most common one via beautiful soup python ;
Search in HTML page using Regex patterns with python ;
How to parse a web page containing CSS and HTML using python ;
how to extract text within font tag using beautifulsoup ;
Extract text with bold content from css selector
Thanks much,
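A minimal sketch of the font-size idea, for the simplest case only: sizes set in inline style attributes. It uses the standard library's html.parser rather than bs4 so that it runs without extra packages. The names large_text, LargeTextParser and BASE_PX are made up for illustration, 1em is assumed to equal 16px (real em values depend on the parent element), and sizes coming from stylesheets or style blocks are not handled here. It also assumes every non-void tag is closed.

```python
import re
from html.parser import HTMLParser

# Assumption: 1em == 16px (common browser default); real pages may differ.
BASE_PX = 16.0
FONT_SIZE_RE = re.compile(r"font-size\s*:\s*(\d*\.?\d+)\s*(px|em)", re.IGNORECASE)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def inline_font_px(style):
    """Return the inline font-size in px, or None if the style sets none."""
    match = FONT_SIZE_RE.search(style or "")
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    return value * BASE_PX if unit == "em" else value


class LargeTextParser(HTMLParser):
    """Collect text whose nearest inline font-size exceeds min_px."""

    def __init__(self, min_px):
        super().__init__()
        self.min_px = min_px
        self.stack = []   # effective size per open element (None = unknown)
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        size = inline_font_px(dict(attrs).get("style"))
        inherited = self.stack[-1] if self.stack else None
        # Children inherit the nearest ancestor's size unless they set their own.
        self.stack.append(size if size is not None else inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        size = self.stack[-1] if self.stack else None
        if size is not None and size > self.min_px and data.strip():
            self.found.append(data.strip())


def large_text(html, min_px):
    """Return text chunks rendered with an inline font-size above min_px."""
    parser = LargeTextParser(min_px)
    parser.feed(html)
    return parser.found
```

Called as large_text(html, 20), this returns the text chunks whose effective inline size is above 20px. Covering sizes defined in CSS rules would need a CSS parser such as cssutils or tinycss on top of this.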
Related
I have the following html structure:
I would like to extract the text ("“Business-Thinking”-Fokus im Master-Kurs") from the highlighted span (using Scrapy), but I have trouble reaching it because it does not have any specific class or id.
I tried to access it with the following absolute XPath:
sel.xpath('/html/body/div[4]/div[1]/div/div/h1/span/text()').extract()
I don't get any error, however it returns a blank file, meaning the text is not extracted.
Note: The parent classes are not unique, which is why I'm not using a relative path. As the text varies, I also cannot reach the span by looking for the text it contains.
Do you have any suggestions on how I should modify my XPath to extract the text? Thanks!
If you load the page using scrapy shell url, it loads without JavaScript.
When you look at the source without JavaScript, the XPath to the span is /html/body/div/div[1]/div/div/h1/span, so the div[4] step from the browser-rendered page matches nothing.
To load web pages with JavaScript in Scrapy, use Splash.
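To illustrate why the absolute path comes back empty, here is a small sketch using the standard library's ElementTree instead of Scrapy. The page string is a made-up stand-in for the JavaScript-free source, shaped to match the path given above. ElementTree only supports a subset of XPath, and in Scrapy the relative form would look more like //h1/span/text().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page source as served without JavaScript.
page = (
    "<html><body>"
    "<div><div><div><div><h1>"
    "<span>“Business-Thinking”-Fokus im Master-Kurs</span>"
    "</h1></div></div></div></div>"
    "</body></html>"
)

root = ET.fromstring(page)  # root is the <html> element

# Absolute path copied from the browser: there is no fourth div under body.
absolute = root.findall("body/div[4]/div[1]/div/div/h1/span")

# Relative path: any span directly inside any h1, wherever it sits.
relative = root.findall(".//h1/span")

absolute_texts = [span.text for span in absolute]
relative_texts = [span.text for span in relative]
```

Here absolute_texts is empty while relative_texts holds the heading text, which is the same symptom as the blank output file.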
I'm scraping HTML pages of live websites using Python and beautifulsoup4. I want to be able to get the text size of any HTML tag. I tried to use cssutils to parse the CSS and find the font-size parameter, but real-life CSS is pretty complicated, like this:
.some_div_class a span {font-size: 20px}
I can find all tags that match this selector using bs.select(selector), but trying every selector in the stylesheet would take far too much time. So how can I find the font-size for any tag efficiently? Browsers do it pretty fast, so it shouldn't be impossible.
I don't want to use a headless browser.
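One way to cut the work down, sketched under simplifying assumptions: only rules whose declaration block actually sets font-size need to be tried with bs.select(selector), so the stylesheet can be pre-filtered first. The function name font_size_rules is made up, the regex only understands flat rules (no @media nesting, no comments), and the cascade and specificity are ignored entirely.

```python
import re

# Flat "selector { declarations }" rules only; nested at-rules are not handled.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
# font-size at the start of the block or right after a semicolon.
FONT_SIZE_RE = re.compile(r"(?:^|;)\s*font-size\s*:\s*([^;]+)", re.IGNORECASE)


def font_size_rules(css):
    """Return (selector, font-size value) pairs for rules that set font-size."""
    rules = []
    for selector_group, body in RULE_RE.findall(css):
        match = FONT_SIZE_RE.search(body)
        if not match:
            continue  # this rule never affects font-size, so skip its selectors
        for selector in selector_group.split(","):
            rules.append((selector.strip(), match.group(1).strip()))
    return rules
```

Each returned selector could then be passed to bs.select(selector), which usually leaves far fewer selectors to evaluate than the full stylesheet.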
I'm working on a scraper project, and one of the goals is to get every image link from the HTML and CSS of a website. I was using BeautifulSoup and TinyCSS to do that, but now I'd like to switch everything to Selenium so that I can load the JS.
I can't find a way in the docs to target CSS parameters without knowing the tag/id/class. I can get the images from the HTML easily, but I need to target every "background-image" parameter in the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it or should I loop into each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the style tags and parse them, searching for what you're looking for.
You can also download the CSS file using its resource URL and parse it.
You can also create an XPath/CSS rule to search for nodes that contain the parameter you're looking for.
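For the first two suggestions, once the CSS text is in hand (from a style tag or a downloaded file), a regular expression can pull out the URLs. This is a rough sketch, not a full CSS parser: background_image_urls is a made-up name, and it only captures the first url(...) in each background or background-image declaration.

```python
import re

# Matches background: or background-image: followed by a url(...) value,
# with or without quotes around the URL.
BG_URL_RE = re.compile(
    r"background(?:-image)?\s*:[^;{}]*?url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)",
    re.IGNORECASE,
)


def background_image_urls(css):
    """Return the URLs used in background / background-image declarations."""
    return BG_URL_RE.findall(css)
```

Relative URLs such as paper.gif would still need to be resolved against the stylesheet's own URL, for example with urllib.parse.urljoin.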
We use NLTK to extract text from HTML pages, but we only want the most trivial text analysis, e.g. word count.
Is there a faster way to extract visible text from HTML using Python?
Understanding HTML (and ideally CSS) at some minimal level, like visible / invisible nodes, images' alt texts, etc, would be additionally great.
Ran into the same problem at my previous workplace. You'll want to check out beautifulsoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html holds the page source as a string
print(soup.text)  # all text in the document, tags stripped
You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can ignore elements based on attributes. As for understanding external stylesheets, I'm not too sure. However, something you could do there that would not be too slow (depending on the page) is to render the page with something like phantomjs and then select the rendered text :)
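If pulling in a library is not wanted, a rough standard-library sketch of the same idea looks like this. It skips the contents of tags that never render as text, keeps image alt texts, and counts words. The names VisibleTextParser and visible_text are made up for illustration, and it does not look at CSS at all, so elements hidden with display: none are still included.

```python
from html.parser import HTMLParser

# Tags whose contents are not shown as page text.
HIDDEN_TAGS = {"script", "style", "head", "title", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collect text outside hidden tags, plus alt texts of images."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1
        elif tag == "img" and not self.hidden_depth:
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A word count is then just len(visible_text(html).split()).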
I got the following content from an HTML page:
str='http://www.ralphlauren.com/graphics/product_images/pPOLO2-24922076_alternate1_v360x480.jpg\', zoom: \'s7-1251098_alternate1\' }]\n\n\nEnlarge Image\n\n\n\n\n\n\nCotton Canvas Utility Jacket\nStyle Number : 112933196\n\n\n\n$125.00'
I have many HTML pages like this. I need some way to read the content BEFORE the style number. In this case, I need Cotton Canvas Utility Jacket. Is there a regex in Python to do that? Note that I can start looking for the pattern Enlarge Image and read whatever comes before I hit Style Number. The issue is that Enlarge Image appears many times on the HTML page. What I have shown above is only part of the HTML page. full html page is here
In short, I need to find the product name from the linked HTML page.
Thanks.
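One possible approach, sketched against the sample string above: since Enlarge Image appears many times, anchor on Style Number instead and capture the last non-empty line before it. The helper name product_name is made up, and this assumes the product name always sits on a single line directly above the style number, which may not hold for every page.

```python
import re

# Sample content from the question.
text = (
    "http://www.ralphlauren.com/graphics/product_images/"
    "pPOLO2-24922076_alternate1_v360x480.jpg', zoom: 's7-1251098_alternate1' }]"
    "\n\n\nEnlarge Image\n\n\n\n\n\n\nCotton Canvas Utility Jacket"
    "\nStyle Number : 112933196\n\n\n\n$125.00"
)

# One line of text, then a newline (plus optional whitespace), then the marker.
NAME_RE = re.compile(r"([^\n]+)\n\s*Style Number")


def product_name(content):
    """Return the line just before 'Style Number', or None if absent."""
    match = NAME_RE.search(content)
    return match.group(1).strip() if match else None


name = product_name(text)
```

With the sample above, name ends up as Cotton Canvas Utility Jacket. If the input is the raw HTML rather than extracted text, stripping the tags first is likely to be more reliable than a regex over markup.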