Parsing HTML in python3, re, html.parser, or something else?

I'm trying to get a list of craigslist states and their associated URLs. Don't worry, I have no intention of spamming; if you're wondering what this is for, see the * below.
What I'm trying to extract begins on the line after 'us states' and is the next 50 <li> elements. I read through html.parser's docs and it seemed too low level for this, more aimed at building a DOM parser or doing syntax highlighting/formatting in an IDE than at searching, which makes me think my best bet is regular expressions. I would like to keep myself contained to what's in the standard library, just for the sake of learning. I'm not asking for help writing a regular expression, I'll figure that out on my own; I'm just making sure there's not a better way to do this before spending the time on it.
*This is my first program beyond simple Python scripts. I'm making a C++ program to manage my posts and remind me when they've expired in case I want to repost them, and a Python script to download a list of all of the US states and cities/areas in order to populate a combobox in the GUI. I don't really need it, but I'm aiming to make this 'production ready'/feature complete, both as a learning exercise and to build a portfolio that might help me get a job. I don't know if I'll make the program publicly available; there's obvious potential for misuse and it's probably against their ToS anyway.

There is xml.etree, an XML parser available in the Python standard library itself. You should not use regex for parsing XML or HTML. Navigate to the particular node that holds the information and extract the links from there.
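If the fragment you care about happens to be well-formed markup, a minimal standard-library sketch might look like this. The URLs and list structure below are made up for illustration; on the real page you would first need to isolate the 'us states' <ul> fragment, since the full page is unlikely to parse as strict XML:

import xml.etree.ElementTree as ET

# Placeholder fragment standing in for the 'us states' list cut out of the page.
fragment = """<ul>
<li><a href="https://geo.craigslist.org/iso/us/al">alabama</a></li>
<li><a href="https://geo.craigslist.org/iso/us/ak">alaska</a></li>
</ul>"""

root = ET.fromstring(fragment)
# Collect (state name, url) pairs from every <a> in the fragment.
states = [(a.text, a.get('href')) for a in root.iter('a')]
print(states)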

Use lxml.html. It's the best python html parser. It supports xpath!
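Not standard library, but if you do reach for lxml, a rough sketch of the XPath approach could look like the following; the URL and the XPath expression are guesses at the page layout, not something verified against craigslist's actual markup:

from lxml import html
import urllib.request

page = urllib.request.urlopen('https://www.craigslist.org/about/sites').read()
tree = html.fromstring(page)

# Hypothetical XPath: links in the first section following a "US" heading.
for a in tree.xpath('//h1[text()="US"]/following-sibling::div[1]//li/a'):
    print(a.text_content(), a.get('href'))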

Related

Webscraping a page: notification on update stock prices

Question
I am interested in web scraping and data analysis and I would like to develop my skills by writing a program using Python 2.7 that will monitor changes in stock prices. My goal is to be able to compare two stocks (for the time being) at certain points throughout the day, save that info into a document format easily handled by pandas (which I will learn how to use after I get this front end working). In the end I would like to map relationship trends between chosen stocks (when this one goes up by x what effect does that have on the other one). This is just a hobby project so it doesn't matter if the code is production quality.
My Experience
I am a brand new Python programmer (I have a very basic understanding of python and no real experience with any non-included modules) but I do have a technical background so if the answer to my question requires reading and understanding documentation intended for intermediate level programmers that should be OK.
For the basics I am working my way through Learning Python: Powerful Object-Oriented Programming by Mark Lutz if this helps any.
What I'm Looking For
I recognize this is a very broad subject and I am not asking for anyone to write any actual code examples or anything. I just want some direction as to where to go to get information more specific to my interests and goals.
This is actually my first post on this forum so please forgive me if this doesn't follow best practices for posting. I did search for other questions like mine and read the posting tips docs prior to writing this.
So, you want to web-scrape? If you're using Python 2.7, then you'll want to look into the urllib2, requests, and BeautifulSoup libraries. If you're using Python 3.x, then you'll want to peek at urllib.request (the replacement for urllib2) and, again, BeautifulSoup. Together, these libraries should accomplish everything you're looking to do in terms of web scraping.
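As a rough illustration of how those pieces fit together (Python 3 flavour; the URL and the 'price' selector are placeholders, since any real quote page has its own structure you would have to inspect first):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/quote/XYZ')  # placeholder URL
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
# Suppose the price sits in <span class="price">...</span> on the page.
price_tag = soup.find('span', class_='price')
if price_tag:
    print(price_tag.get_text(strip=True))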
If you're interested in scraping stock data, might I suggest the yahoo_finance package? This is a Python wrapper for the Yahoo Finance API. Whenever I've done things with stock data in the past, this module was invaluable. There's also googlefinance, too. It's much easier to use these already-developed wrappers to extract stock info, rather than scraping hundreds (if not thousands) of web pages to get the data you want.
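If you go the yahoo_finance route, the wrapper's published interface looked roughly like this; treat the method names as an assumption to verify against the package's own docs, since the package has changed over time:

from yahoo_finance import Share

stock = Share('AAPL')
print(stock.get_price())

# Later in the day: refresh the cached quote and compare again.
stock.refresh()
print(stock.get_price())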

Sublime Text editor plug-in, scan div id's and classes

I'm aware I'm supposed to show some starting code to give you a clue as to what I'm trying to do, but I'm really at a basic level and can't find any resources that show me what I'm after. Basically, I'm trying to write a plug-in for the Sublime Text editor which selects all div IDs and then outputs them to a file. What's the best approach? It seems like it should be easy, but I'm not too sure.
Thanks in advance for your help,
Ewan
This looks like a good place to start: http://www.sublimetext.com/docs/plugin-basics
Look at http://www.sublimetext.com/docs/2/api_reference.html, though be advised that Sublime Text 3 is currently in beta. It introduces changes to the plugin api, and a requirement to support Python 3. See http://www.sublimetext.com/docs/3/porting_guide.html
Assuming you have some familiarity with Python, I would start with this tutorial on writing plugins (Link). The author of that tutorial wrote, among other things, Package Control. Granted, it is for ST2, but for what you are trying to do, I don't foresee any major issues with writing a plugin that is compatible with both ST2 and ST3.
How you go about writing your particular plugin is up to you. One approach may be leveraging the view.find_all() method. This takes a regular expression and returns a set of regions. From these regions, you can grab the text, and subsequently the IDs for the divs. There may be a better way, but that might work as an initial attempt. Writing to a file can be done through the usual python means.
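A bare-bones sketch of that idea follows; the command name, the output path, and the regex are all made up for illustration, but view.find_all() and view.substr() are the relevant API calls:

import os
import re
import sublime
import sublime_plugin

ID_PATTERN = r'<div[^>]*\bid\s*=\s*["\']([^"\']+)["\']'

class ExtractDivIdsCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        # find_all returns the regions whose text matches the pattern.
        regions = self.view.find_all(ID_PATTERN)
        ids = []
        for region in regions:
            match = re.search(ID_PATTERN, self.view.substr(region))
            if match:
                ids.append(match.group(1))
        # Write the collected ids to a file (placeholder path).
        out_path = os.path.join(os.path.expanduser('~'), 'div_ids.txt')
        with open(out_path, 'w') as f:
            f.write('\n'.join(ids))
        sublime.status_message('Wrote %d div ids to %s' % (len(ids), out_path))

Saved as a .py file in your Packages/User directory, it can then be triggered with view.run_command('extract_div_ids') from the console or a key binding.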

Identifying large bodies of text via BeautifulSoup or other python based extractors

Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on the page.
The original plan was to use BeautifulSoup's findAll(True) command (which extracts all HTML tags) and to sort each tag by its .getText() value. EDIT: don't use BeautifulSoup for this kind of HTML work; use the lxml library instead, it's usable from Python and much faster than BeautifulSoup.
But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph dividers for example.
Does anyone have any experience with this? Any help with something like this would be amazing.
At the moment I'm using BeautifulSoup along with python, but willing to explore other possibilities.
EDIT: Came back to this question a few months later (wow, I sounded like an idiot ^), and solved it with a combination of libraries and my own code.
Here are some deadly helpful Python libraries for the task, in sorted order of how much each helped me:
#1 goose library: fast, powerful, consistent (rough usage sketch below).
#2 readability library: the extracted content is passable; slower on average than goose, but faster than boilerpipe.
#3 python-boilerpipe: slower and harder to install, through no fault of the boilerpipe library itself (originally in Java), but because this library is built on top of another Java library, which adds IO time, errors, etc.
Perhaps I'll release benchmarks if there is interest.
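For goose, the #1 pick above, a rough usage sketch (the URL is a placeholder, and the attribute names are as documented at the time, so double-check them against your installed version):

from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-news-story')  # placeholder URL
print(article.title)
print(article.cleaned_text[:500])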
Indirectly related libraries; you should probably install them and read their docs:
NLTK text processing library: this is too good not to install. They provide text analysis tools along with HTML tools (cleanup, etc.).
lxml html/xml parser: mentioned above. This beats BeautifulSoup in every aspect but usability; it's a bit harder to learn, but the results are worth it. HTML parsing takes much less time, and it's very noticeable.
python webscraper library: I think the value of this code isn't the lib itself, but using it as a reference manual to build your own crawlers/extractors. It's very nicely coded and documented!
A lot of the value and power in using Python, a rather slow language, comes from its open-source libraries. They are especially awesome when combined and used together, and everyone should take advantage of them to solve whatever problems they may have!
The goose library gets lots of solid maintenance; they just added Arabic support, it's great!
You might look at the python-readability package which does exactly this for you.
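A quick sketch of what that looks like in practice; note that Document.summary() returns the main article body as an HTML fragment, not plain text, and the URL is a placeholder:

import requests
from readability import Document

html = requests.get('http://example.com/some-news-story').text  # placeholder URL
doc = Document(html)
print(doc.title())
print(doc.summary())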
You're really not going about it the right way, I would say, as all the comments above would attest to.
That said, this does what you're looking for.
from bs4 import BeautifulSoup as BS
import requests

# Fetch the article's HTML.
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
soup = BS(html, 'html.parser')

# Find the story's main container, then join the text of all <p> tags inside it.
print('\n\n'.join(k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')))
It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text; ignoring the <script> and other irrelevant ones.
As was mentioned in the comments, this will only work for CNN--and possibly, only this page. You might need a different strategy for every new webpage.

Suggest semantic tags for short snippets of text

I am interested in generating a list of suggested semantic tags (via links to Freebase, Wikipedia or another system) to a user who is posting a short text snippet. I'm not looking to "understand" what the text is really saying, or even to automatically tag it, I just want to suggest to the user the most likely semantic tags for his/her post. My main goal is to force users to tag semantically and therefore consistently and not to write in ambiguous text strings. If there were a reasonably functional and reasonably priced tool on the market, I would use it. I have not found such a tool so I am looking in to writing my own.
My question is first of all, if there is such a tool that I have not encountered. I've looked at Zemanta, AlchemyAPI and OpenCalais and none of them seemed to offer the service I need.
Assuming that I'm writing my own, I'd be doing it in Python (unless there was a really compelling reason to use something else). My first guess would be to search for n-grams that match "entities" in Freebase and suggest them as tags, perhaps searching in descriptions of entities as well to get a little "smarter." If that proved insufficient, I'd read up and dip my toes into the ontological water. Since this is a very hard problem and I don't think that my application requires its solution, I would like to refrain from real semantic analysis as much as possible.
Does anyone have experience working with a semantic database system and could give me some pointers regarding where to begin and what sort of pitfalls to expect?
Take a look at the NLTK Python library. It contains a vast number of tools, dictionaries, and algorithms.
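For instance, a first pass at generating candidate phrases to look up against Freebase or Wikipedia might be nothing more than the toy sketch below; the entity lookup itself is left out, and the part-of-speech filter is just a guess at what makes a reasonable candidate:

# Requires the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'
# (downloaded via nltk.download(...)) before first use.
from nltk import word_tokenize, pos_tag
from nltk.util import ngrams

snippet = "Apple is said to be opening a new campus in Austin next year"
tokens = word_tokenize(snippet)
tagged = pos_tag(tokens)

# Keep 1-3 word n-grams made entirely of nouns, proper nouns, or adjectives.
keep = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ'}
candidates = set()
for n in (1, 2, 3):
    for gram in ngrams(tagged, n):
        if all(tag in keep for _, tag in gram):
            candidates.add(' '.join(word for word, _ in gram))
print(candidates)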

What are some of the Artificial Intelligence (AI) related techniques one would use for parsing a webpage?

I would like to scrape several different discussion forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of learning algorithm that could identify the different messages (i.e. structures) on each page and individually parse them, while ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area?
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Build a solution that:
takes some sample webpages with the same structure (e.g. forum threads),
analyzes the DOM tree of each to find which parts are the same and which differ;
the parts that differ are the dynamic content you are after (posts, user names, etc.).
This technique is known as wrapper induction; a toy sketch of the diffing idea follows.
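Real wrapper induction systems are considerably more involved, but as a toy illustration of the "diff the DOM trees and keep what differs" idea (using lxml, and assuming the two pages really do share a template):

from lxml import html

def text_nodes(page):
    # Map each element's path in the tree to the text it carries.
    tree = html.fromstring(page)
    roottree = tree.getroottree()
    return {roottree.getpath(el): (el.text or '').strip()
            for el in tree.iter() if (el.text or '').strip()}

def dynamic_parts(page_a, page_b):
    a, b = text_nodes(page_a), text_nodes(page_b)
    # Same position in both pages but different text => likely dynamic content.
    return {path: (a[path], b[path])
            for path in a.keys() & b.keys() if a[path] != b[path]}

page_a = "<html><body><div class='post'><h2>Thread one</h2><p>by alice</p></div></body></html>"
page_b = "<html><body><div class='post'><h2>Thread two</h2><p>by bob</p></div></body></html>"
print(dynamic_parts(page_a, page_b))

This only catches text differences at identical positions; real systems also have to cope with repeated records, missing fields, and structural variation between pages.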
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly correct, but there's an O'Reilly book called 'Programming Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in Python :)
