How can I store the results of parsed HTML? - python

I'm using Python's HTMLParser and BeautifulSoup to parse Yahoo finance data. There is a very nice package written to do this already, but it doesn't get "tangible price/book value", which is to say that it includes Goodwill and other intangibles in the calculation of book value. Hence, I'm forced to roll my own solution.
It hasn't been pretty. Here's the code:
from BeautifulSoup import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

class data(HTMLParser):
    def handle_data(self, data):
        print data

parser = data()
url = 'http://finance.yahoo.com/q/bs?s=BAC&annual'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tangibles = [str(parser.feed(str(soup('strong')[24:26])))]
Two problems with this:
1) I'm relying on the data always being in the same place on Yahoo's page, which isn't the biggest problem but doesn't make me happy,
and,
2) The real problem:
tangibles = [str(parser.feed(str(soup('strong')[24:26])))]
is an empty list, because the data class just prints the stuff I want instead of storing it.
I'll be happy if you answer part 2) for me. I don't understand classes yet.

Get rid of the data class, the parser, and the supporting imports, then do this:
tangibles = [''.join(node(text=True)).strip() for node in soup('strong')[24:26]]
I basically changed this to use a Python list comprehension; if you are not familiar with list comprehensions in Python, they are worth reading up on.
In essence, it does these things (a complete sketch follows the list):
Tells soup to find your elements labeled strong and names each instance node: for node in soup('strong')[24:26] (soup('strong') is shorthand for soup.findAll('strong'))
Extracts just the text inside each node, dropping the strong tags completely: node(text=True), a shorthand for node.findAll(text=True) (see the Beautiful Soup docs about text=True)
Joins the pieces in each node so it is one string and not a list one element in length: ''.join() (a Python trick)
i.e. ['Net Stuff', '152,113,000'] vs [['Net Stuff'], ['152,113,000']]
Removes superfluous whitespace (trailing and leading): .strip()
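Putting it together, a minimal sketch of the whole fixed script (same URL and slice as above; the [24:26] indices are still position-dependent):

from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://finance.yahoo.com/q/bs?s=BAC&annual'
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Collect the text of the two <strong> elements instead of printing it.
tangibles = [''.join(node(text=True)).strip() for node in soup('strong')[24:26]]
print tangibles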

Related

Stripping HTML Tags from Forum using Python/bs4

I am a (very) new Python user, and decided some of my first work would be to grab some lyrics from a forum and sort according to word frequency. I obviously haven't gotten to the frequency part yet, but the following is the code that does not work for obtaining the string values I want, resulting in an "AttributeError: 'ResultSet' object has no attribute 'getText' ":
from bs4 import BeautifulSoup
import urllib.request
url = 'http://www.thefewgoodmen.com/thefgmforum/threads/gdr-marching-songs-section-b.14998'
wp = urllib.request.urlopen(url)
soup = BeautifulSoup(wp.read())
message = soup.findAll("div", {"class": "messageContent"})
words = message.getText()
print(words)
If I alter the code to have getText() operate on the soup object:
words = soup.getText()
I, of course, get all of the string values throughout the webpage, rather than those limited to only the class messageContent.
My question, therefore, is two-fold:
1) Is there a simple way to limit the tag-stripping to only the intended sections?
2) What simple thing do I not understand in that I cannot have getText() operate on the message object?
Thanks.
The message in this case is a BeautifulSoup ResultSet, which is a list of BeautifulSoup Tags. What you need to do is call getText on each element of message, like so:
words = [item.getText() for item in message]
Similarly, if you are just interested in a single Tag (let's say the first one for the sake of argument), you could get its content with,
words = message[0].getText()
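Since your end goal is word frequency, here is a minimal sketch of that next step (assuming the words list from the first snippet; collections.Counter does the counting):

from collections import Counter

# Split every extracted block into words and tally them up.
frequencies = Counter(word.lower() for text in words for word in text.split())
print(frequencies.most_common(10))  # the ten most common words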

Python 3 - Getting some strings from an HTTP request response

I'm having a hard time extracting data from a httprequest response.
Can somebody help me? Here's a part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
>>> 42136
The value 42136 basically means that the string 'loginfield' exists in response.text. But how do I extract specific strings from it?
Like for example I want to extract these exact strings:
<title>Some title here</title>
or this one:
<div id='bla...' # and keep extracting strings until the point where I want it to stop.
Anybody got an idea on how should I approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, 'html.parser')  # specifying a parser avoids bs4's warning
print(soup.find('title').text)
Should print:
Some title here
That is, provided it's the first title tag on the page.
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful Soup. Your program will be less fragile and more maintainable that way.
str.find will return -1 if the string does not exist.
There is no string "loginfield" in the page you retrieved.
Once you have the correct index for your string, the returned value is the position of the first character of that string.
Since you edited your question:
>>> r.text.find('loginfield')
42136
That means the string "loginfield" starts at offset 42136 in the text. You could display, say, 200 characters starting at that position this way:
>>> print(r.text[42136:42136+200])
To find the various values you are looking for, you have to figure out where they are relative to that position.
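For instance, a minimal sketch of pulling out the <title> text with plain string methods (no parsing library, so it is brittle, but it shows the offset arithmetic):

start = r.text.find('<title>') + len('<title>')
end = r.text.find('</title>', start)
print(r.text[start:end])  # e.g. 'Some title here'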

Extract information from a webpage in a particular format

I am trying to make a simple python script to extract certain links from a webpage. I am able to extract the links successfully, but now I want to extract some more information, like the bitrate, size, and duration given on that webpage.
I am using the XPath below to extract the above-mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that, for a particular link, the info I require is generated in the form of a tuple like (bitrate, size, duration).
The XPath I mentioned above generates the required info, but it is ill-formatted; that is, it is not possible to achieve my required format with any logic, at least I am not able to do that.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
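To go after the specific block from the question, a minimal sketch (the song_html id is borrowed from the question's XPath; the exact page structure is an assumption):

import bs4
import urllib

soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
div = soup.find(id='song_html')  # id taken from the question's XPath
# Keep only the non-empty text fragments, e.g. ['3.71 mb', '192 kbps', '2:41', ...]
info = [t.strip() for t in div.find_all(text=True) if t.strip()]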
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for the whole thing, or:
info.rfind(" ")
since the translate leaves a space character, but you could replace that with whatever you wanted.
Additional info on translate() can be found in the XPath documentation.
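For reference, a quick sketch of issuing that XPath from lxml; note that XPath 1.0's translate() takes a string, so a node-set argument is reduced to the string value of its first matching node only:

import lxml.html

doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
# Returns a single cleaned string (from the first text node), not a list.
cleaned = doc.xpath("translate(.//*[@id='song_html']/div[1]/text(), '\n\t', '')")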
How are you with regular expressions and Python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice; as far as the triple tuple goes, the Python tuple syntax takes care of it. Simply match members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re,info[1])
# etc.
triple = (incoming_value_1, incoming_value_2, incoming_value_3)
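For instance, a minimal sketch with patterns tuned to the values shown in the question (the indices assume the info list printed above):

import re

size = re.search(r'[\d.]+ mb', info[1]).group()    # '3.71 mb'
bitrate = re.search(r'\d+ kbps', info[5]).group()  # '192 kbps'
duration = info[6].strip()                         # '2:41'
print (bitrate, size, duration)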

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link: <a href="http://example.com">example.com</a>
By a simple Python replace function which replaces example with <a href="http://example.com">example</a>, it would output:
Here is an <a href="http://example.com">example</a> link: <a href="http://<a href="http://example.com">example</a>.com">example.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link: <a href="http://example.com">example.com</a>
Is there any Python plugin capable of this? Thanks a lot!
This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link: <a href="http://example.com">example.com</a>
"""
soup = BeautifulSoup(html_body)

# Mark the text of existing links so the replacement below skips it.
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# Remove the '|' markers again.
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm stripping out all the text from the html_body, replacing the word example with the given link, without touching the text of the existing links, which is protected by the '|' characters during the parsing.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
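A minimal sketch of the re.sub approach on plain text (the keyword dictionary here is hypothetical, and this deliberately skips the HTML-parsing step the answer warns about):

import re

keywords = {'example': 'http://example.com'}  # hypothetical keyword/link pairs

def linkify(match):
    word = match.group(0)
    return '<a href="%s">%s</a>' % (keywords[word.lower()], word)

text = 'Here is an example link.'
print re.sub(r'(?i)\bexample\b', linkify, text)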

Scraping with Python?

I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows the following URL returns my desired contents, including both the index and its definition, for 'a':
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
What are the modules used? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2

def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # parse the xml here (using BeautifulSoup or an xml parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)
    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse xml
    return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
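For the TODO above, a rough sketch using Python's own xml.etree (the <article> element and its name/id attributes are pure assumptions; inspect the XML the CGI actually returns and adjust accordingly):

import xml.etree.ElementTree as ET

def parseArticles(xml_text):
    root = ET.fromstring(xml_text)
    # Hypothetical element/attribute names -- check the real response first.
    return [(el.get('name'), int(el.get('id'))) for el in root.iter('article')]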
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to heavily scrape articles. If you're just getting a few articles for yourself, they may not care (though the dictionary copyright owner may, if it's not public domain), but if you're going to scrape the entire dictionary, that is going to be some heavy usage.
