Scraping news article data and formatting the results in Word - python

For a list of article URLs I need to scrape the title, author, date, publication, and body of the article. Then each article needs to appear in Word, formatted according to a template (bold title, pub in italics, table of contents at the top with hyperlinks etc).

I've done some of this in the past, and I would recommend two things for cleaning up HTML and getting at the text:
html2text: a library that extracts text from HTML
Regular expressions: the re module, which you can use to strip HTML markup
(Be careful with regex; in some cases it can miss data or strings.)
For Word I would recommend this:
python-docx: a library for working with MS Word documents from Python.
PS: This is just a short summary. You would find plenty of results just by using the SO search.
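A minimal sketch of how those pieces could fit together, using requests to fetch the page (the URL is a placeholder, and the hyperlinked table of contents is not covered here):
import requests
import html2text
from docx import Document

html = requests.get('http://example.com/article').text  # placeholder URL

# Convert the HTML body to plain text.
h = html2text.HTML2Text()
h.ignore_links = True
body = h.handle(html)

# Apply the template's basic formatting in Word.
doc = Document()
doc.add_paragraph().add_run('Article title').bold = True       # bold title
doc.add_paragraph().add_run('Publication name').italic = True  # publication in italics
doc.add_paragraph(body)
doc.save('articles.docx')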

Related

Matching text group(s) between tags from multiline html using Python's Regex

I am trying to use Python's regex to collect all text matches from multiline HTML.
Here is the HTML (I kept it as it is, messy):
<p class="SearchCard">
Hi,
Need AWS devops with 10+ experience
If anyone interested plz revert
</p>
If I use regex='\"SearchCard\">(\n*\s*.*)' I get a match up to the word Hi, inclusive.
But if I use regex='\"SearchCard\">(\n*\s*.*)*?<' I get everything up to the end of the HTML file, but I want to capture the matched group before the first '<' symbol.
The Python code is below:
re.findall(regex, html)
I know it can be (and probably should be) done with other modules (e.g. BS4 or similar), but I am looking for a solution with re.
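A minimal sketch of one way to do this with re alone (a suggestion, not from the original thread): a negated character class can never cross a '<', so the capture stops right before the closing tag.
import re

html = '''<p class="SearchCard">
Hi,
Need AWS devops with 10+ experience
If anyone interested plz revert
</p>'''

# [^<]* matches any run of characters containing no '<' (newlines
# included), so the group ends just before the first '<'.
matches = re.findall(r'"SearchCard">([^<]*)', html)
print(matches[0].strip())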

Finding structures from web articles with Python

I'm looking for a Python tool that can help me determine the content structure of an article website such as http://www.bbc.co.uk/. I used a boilerplate-removal library, Boilerpipe, to clean the web page of unwanted stuff (banners, links, pictures, etc.).
Now that I have only the relevant content, I want to automatically determine which string is the title, the author, the date, the date the article was updated, and which is the article itself. The problem is that I am not only going to use it on well-structured article pages that carry most of this information in HTML tags such as <title>Title</title>. I'd like to be able to determine it from markup like <div>28.11.2011<p>John Cusack on Syria conflict</div>.
Is there any tool that can help me with that?
Isn't scrapy meant for that kind of stuff? http://scrapy.org/
You can easily get content from articles with the following tools:
scrapy (recommended, but it has a steeper learning curve)
newspaper (immediately gives you the title, author, text, images, videos, etc.)
goose-extractor (similar to newspaper)
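For example, a minimal sketch with newspaper (the URL is a placeholder):
from newspaper import Article

article = Article('http://www.bbc.co.uk/news/some-story')  # placeholder URL
article.download()
article.parse()

print(article.title)
print(article.authors)       # list of author names
print(article.publish_date)
print(article.text[:200])    # start of the article body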

Regex to extract all URLs from a page

This question has been asked a few times on SO, but I couldn't get any of the answers to work correctly. I need to extract all the URLs in a page, both in href links and in the plain text. I don't need the individual groups of the regex; I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?
I'd like to do this using Regexs and not BeautifulSoup, etc.
Thank you.
HTML is not a regular language, and thus cannot be parsed by regular expressions.
It's possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).
That said, if you're willing to go that path, see John Gruber's regex for the purpose:
import re

def extract_urls(your_text):
    # John Gruber's URL-matching pattern. Note that [[:punct:]] is a POSIX
    # character class that Python's re module does not support, so this port
    # does not behave exactly like the original pattern.
    url_re = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))')
    for match in url_re.finditer(your_text):
        yield match.group(0)
This can be used as follows:
>>> for uri in extract_urls('http://foo.bar/baz irc://freenode.org/bash'):
...     print(uri)
http://foo.bar/
irc://freenode.org
I know you can use the DOM object in PHP to parse an HTML document. I'm not familiar with Python, but this might help: http://docs.python.org/library/xml.dom.html
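If you are willing to relax the regex-only constraint for the href part, a small sketch using the standard library's html.parser collects the href URLs (the sample markup is made up); you would still run a URL regex over the remaining text to catch plain-text URLs:
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    # Collects the value of every href attribute seen in the page.
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'href' and value:
                self.urls.append(value)

collector = HrefCollector()
collector.feed('<a href="http://foo.bar/baz">link</a> plain text')
print(collector.urls)  # ['http://foo.bar/baz']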

Counting content only in HTML page

Is there any way I can parse a website by just viewing the content as it is displayed to the user in a browser? That is, instead of downloading "page.html" and parsing the whole page with all the HTML/JavaScript tags, I would like to retrieve the version displayed to users in their browsers. I would like to "crawl" websites and rank them according to keyword popularity (viewing the raw HTML source is problematic for that purpose).
Thanks!
Joel
A browser also downloads page.html and then renders it, so you should work the same way. Use an HTML parser like lxml.html or BeautifulSoup; with those you can ask for only the text enclosed within tags (plus the attributes you care about, such as title and alt).
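A minimal sketch of that approach with BeautifulSoup (the file name is a placeholder):
from bs4 import BeautifulSoup

with open('page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Drop tags whose contents are never rendered as visible text.
for tag in soup(['script', 'style']):
    tag.decompose()

text = soup.get_text(separator=' ', strip=True)
print(text)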
You could get the source and strip the tags out, leaving only non-tag text, which works for almost all pages, except those where JavaScript-generated content is essential.
The pyparsing wiki Examples page includes this html tag stripper.

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there is an almost infinite number of ways to build a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text) you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a similar points system to spam assassin. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules that add up as you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
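As a rough sketch of that scoring idea using lxml (the rule weights and file name are invented for illustration, not a tested implementation):
import lxml.html

def score_node(node):
    score = 0
    score += len(node.text_content().split()) // 100    # +1 per 100 words
    for child in node:
        if len(child.text_content().split()) > 100:
            score += 1                                   # +1 per long child element
    section = ' '.join(filter(None, [node.get('id'), node.get('class')])).lower()
    if 'nav' in section:
        score -= 1                                       # penalize navigation blocks
    if 'advert' in section:
        score -= 2                                       # penalize adverts
    return score

root = lxml.html.parse('page.html').getroot()            # placeholder file name
best = max(root.iter('div', 'p'), key=score_node)
print(best.text_content()[:200])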
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google AppEngine.)
Cheers,
Christian
What is meaningful and what is not depends on the semantics of the page. If the semantics are poor, your code won't be able to "guess" what is meaningful. I use readability, which you linked in the comment, and I find that on many pages I try it does not produce any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
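A minimal usage sketch of goose-extractor (the URL is a placeholder):
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-news-story')  # placeholder URL

print(article.title)
print(article.meta_description)
print(article.cleaned_text[:200])  # start of the main text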
