Extract text from HTML faster than NLTK? - python

We use NLTK to extract text from HTML pages, but we only need the most trivial text analysis, e.g. word counts.
Is there a faster way to extract visible text from HTML using Python?
Some minimal understanding of HTML (and ideally CSS), such as visible vs. invisible nodes, images' alt text, etc., would be a bonus.

Ran into the same problem at my previous workplace. You'll want to check out BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.text)
You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can ignore elements based on their attributes. As for honoring external stylesheets, I'm not too sure. What you could do there, and something that would not be too slow (depending on the page), is to render the page with something like PhantomJS and then select the rendered text :)
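If you want to stay with BeautifulSoup and approximate visibility without rendering, one rough approach is to drop script/style tags and anything hidden by an inline style, and substitute images' alt text. This is only a heuristic sketch (it ignores external stylesheets, and the sample HTML is made up):

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Extract roughly the text a browser would display."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-displayed containers outright
    for tag in soup(["script", "style", "head", "title"]):
        tag.decompose()
    # Crude inline-style visibility check (external CSS is ignored)
    for tag in soup.find_all(style=True):
        style = tag["style"].replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            tag.decompose()
    # Substitute images with their alt text
    for img in soup.find_all("img", alt=True):
        img.replace_with(img["alt"])
    return " ".join(soup.get_text(separator=" ").split())

html = ('<p>Hello <span style="display:none">secret</span>'
        '<img src="x.png" alt="logo"> world</p><script>var x=1;</script>')
print(visible_text(html))  # -> Hello logo world
```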

Related

Python web scraping: websites from google search result

A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) found on a Google search page. I just want the key info, e.g. text in <h1>, <h2>, <b> or <li> tags, etc., but not entire <p> paragraphs.
I know how to gather a list of website URLs from that Google search, and I know how to scrape an individual website after looking at the page's HTML. I use Requests and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all these (100+ !) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while some may use <b>, or something else...
All I can think of is to come up with a list of possible "emphasis-typed" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
It would seem that you should first learn how to write loops and functions. Every website is different, and scraping a site to extract useful information is daunting. I'm a newb myself, but if I had to extract info from headers like you do, this is what I would do (this is just concept code, but I hope you'll find it useful):
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # find the page header, then every link under it whose href
    # matches the pattern
    return bs.find('h1', {'class': 'header'}).find_all(
        'a', href=re.compile('^(/web/)((?!:).)*$'))
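Since find_all() accepts a list of tag names, a single call can pull every "emphasis-typed" tag at once, so you don't need to inspect each site's HTML by hand. A rough sketch (the tag list and sample HTML are just assumptions, extend as needed):

```python
from bs4 import BeautifulSoup

# Assumed list of tags that commonly carry "key" messages
EMPHASIS_TAGS = ["h1", "h2", "h3", "b", "strong", "li"]

def extract_key_info(html, tags=EMPHASIS_TAGS):
    """Return the text of every emphasis tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tags)]

sample = "<h1>Title</h1><p>long paragraph</p><b>key point</b>"
print(extract_key_info(sample))  # ['Title', 'key point']
```

You would then loop this over the text of each response from your 100+ URLs; the <p> content is skipped automatically because "p" is not in the tag list.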

What is the best way to get the size of text in the tag when parsing HTML+CSS with python?

I'm scraping HTML pages of live websites using Python and beautifulsoup4. I want to be able to get the size of the text of any HTML tag. I tried to use cssutils to parse the CSS and find the font-size param, but real-life CSS is pretty complicated, like this:
.some_div_class a span {font-size: 20px}
So I can find all tags that correspond to this selector using bs.select(selector) but trying every selector in stylesheet will take way too much time. So how is it possible to find font-size for any tag efficiently? Browsers do it pretty fast, so it shouldn't be impossible.
I don't want to use headless browser.
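Without a headless browser, one way to avoid re-testing every selector per tag is to invert the work: walk the stylesheet once, run each rule's selector through soup.select() a single time, and annotate the matched elements so later lookups are O(1). The sketch below uses a toy regex "parser" for brace-delimited rules; a real stylesheet needs cssutils or tinycss2 plus cascade/specificity handling, which this ignores. The data-font-size attribute name and sample markup are my own assumptions:

```python
import re
from bs4 import BeautifulSoup

def annotate_font_sizes(html, css):
    """One pass over the stylesheet: for each rule that sets a
    font-size, select the matching elements once and record the
    size on them as a synthetic attribute."""
    soup = BeautifulSoup(html, "html.parser")
    for selector, body in re.findall(r"([^{}]+)\{([^}]*)\}", css):
        m = re.search(r"font-size\s*:\s*([^;]+)", body)
        if not m:
            continue
        for el in soup.select(selector.strip()):
            # later rules overwrite earlier ones, mimicking source order
            el["data-font-size"] = m.group(1).strip()
    return soup

html = ('<div class="some_div_class"><a><span>big</span></a>'
        '<span>small</span></div>')
css = ".some_div_class a span {font-size: 20px}"
soup = annotate_font_sizes(html, css)
print(soup.find("span")["data-font-size"])  # 20px
```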

Extract text from CSS based on font-size

I wrote a function that parses all headers based on the header tags (h1/2...). Now I want to expand on it and add a feature that parses text based on font-size, say either 20px or 1.5em, regardless of the headers. I want a feature that brings back any text written in a font size greater than X, wherever it is on the page. The function takes a JSON file as input, containing arbitrary HTML (plus whatever else a website could have, i.e. CSS, etc.).
Based on the crummy docs it seems like one possible option is to use soup.fetch(); however, I haven't found many examples using it for this purpose.
Since font-size may well appear in a CSS component, I'm not sure that bs4 is the right package for this. I assume the answer involves cssutils or tinycss, but I haven't been able to find the best way to use those for this task.
As a reference - My code for header's tags was posted for a review: https://codereview.stackexchange.com/questions/166671/extract-html-content-based-on-tags-specifically-headers/166674?noredirect=1#comment317280_166674.
Posts I've checked:
What is the pythonic way to implement a css parser/replacer ;
Find all the span styles with font size larger than the most common one via beautiful soup python ;
Search in HTML page using Regex patterns with python ;
How to parse a web page containing CSS and HTML using python ;
how to extract text within font tag using beautifulsoup ;
Extract text with bold content from css selector
Thanks much,
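One possible shape for such a feature, sketched without cssutils: parse rules with a toy regex, normalize px/em to pixels (assuming a 16px base for em), and collect the text of elements whose rule sets a size above the threshold. The helper names, sample HTML, and the 16px assumption are mine; inheritance, inline styles, and the cascade are ignored:

```python
import re
from bs4 import BeautifulSoup

def to_px(size, base=16.0):
    """Convert '20px' or '1.5em' to pixels (assumes a 16px em base)."""
    m = re.match(r"([\d.]+)\s*(px|em)", size)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * base if unit == "em" else value

def text_above(html, css, threshold_px):
    """Text of elements whose CSS rule sets font-size > threshold_px."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for selector, body in re.findall(r"([^{}]+)\{([^}]*)\}", css):
        m = re.search(r"font-size\s*:\s*([^;]+)", body)
        if not m:
            continue
        px = to_px(m.group(1).strip())
        if px is not None and px > threshold_px:
            found.extend(el.get_text(strip=True)
                         for el in soup.select(selector.strip()))
    return found

html = "<p class='big'>Headline</p><p class='small'>fine print</p>"
css = ".big {font-size: 1.5em} .small {font-size: 12px}"
print(text_above(html, css, 20))  # ['Headline'] (1.5em = 24px > 20px)
```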

Counting content only in HTML page

Is there any way I can parse a website by just viewing the content as displayed to the user in their browser? That is, instead of downloading "page.html" and parsing the whole page with all the HTML/JavaScript tags, I would like to retrieve the version as displayed to users in their browsers. I would like to "crawl" websites and rank them according to keyword popularity (viewing the HTML source version is problematic for that purpose).
Thanks!
Joel
A browser also downloads page.html and then renders it. You should work the same way. Use an HTML parser like lxml.html or BeautifulSoup; with those you can ask for only the text enclosed within tags (and the attributes you care about, like title and alt).
You could get the source and strip the tags out, leaving only non-tag text, which works for almost all pages, except those where JavaScript-generated content is essential.
The pyparsing wiki Examples page includes an HTML tag stripper.
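Putting the two answers together, a minimal keyword counter over the browser-visible text might look like this (the sample page is invented, and real ranking would also want stop-word filtering):

```python
from collections import Counter
from bs4 import BeautifulSoup

def keyword_counts(html):
    """Word counts over roughly the text a browser would display:
    strip script/style, then tokenize what's left."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # their contents never show up as page text
    words = soup.get_text(separator=" ").lower().split()
    return Counter(words)

html = ("<h1>Python tips</h1><p>Python parsing tips</p>"
        "<script>var python;</script>")
print(keyword_counts(html).most_common(2))  # [('python', 2), ('tips', 2)]
```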

(python) sgmlparser and how to extract data between tags, not attributes/values

Every example I see for sgmlparser involves finding a tag, then reading that tag's attributes/values, e.g. extracting 'google.com' out of a link. But I want the data between tags. So if I used sgmlparser, I would look for a div and extract everything inside it, up to its closing tag. Is that the job of sgmlparser, or am I using the wrong library?
Because you mention divs, I gather you want to parse HTML. For doing that, your best choice is BeautifulSoup.
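For the "data between tags" part specifically, a small BeautifulSoup sketch (the div id and markup are made up):

```python
from bs4 import BeautifulSoup

html = '<div id="content"><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, "html.parser")

# everything between <div id="content"> and its closing tag
div = soup.find("div", id="content")
print(div.get_text())                         # firstsecond
print([p.string for p in div.find_all("p")])  # ['first', 'second']
```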
