Replace BeautifulSoup with another (standard) HTML parsing module in this Python script - python

I have made a script with BeautifulSoup that works fine and is very readable, but I want to redistribute it some day, and BeautifulSoup is an external dependency I would like to avoid, especially for Windows users.
Here is the code; it fetches every user-map link for a given Google Maps user. The lines marked with #### are the ones using BeautifulSoup:
# coding: utf-8
import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)                                            ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))       ####
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):              ####
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1
            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')
    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
As you can see, only three lines of code use BeautifulSoup, but I am not a programmer and I had a lot of difficulty trying to use the other standard HTML and XML parsing tools, probably because I went about it the wrong way.
EDIT: This question is more about replacing those three lines of code in this script than about solving generic HTML parsing problems.
Any help will be much appreciated, thanks for reading!

Unfortunately, Python does not have a useful HTML parser in the standard library, so the only reasonable way to parse HTML is with a third-party module such as lxml.html or BeautifulSoup. That does not mean you must impose a separate install on your users: these modules are free software, and if you do not want an external dependency you are welcome to bundle them with your code, which makes them no more of a dependency than the code you write yourself.
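For example, a minimal sketch of the bundling approach (the layout and file names here are just an assumption; BeautifulSoup 3.x ships as a single BeautifulSoup.py file that can sit next to the script):
# Assumed layout:
#   mapgrabber/
#       grab_maps.py       <- your script
#       BeautifulSoup.py   <- copied verbatim from the BeautifulSoup 3.x release
#
# Python puts the script's own directory first on sys.path, so the bundled
# copy is found without any separate installation step.
try:
    from BeautifulSoup import BeautifulSoup as bs  # bundled copy
except ImportError:
    raise SystemExit('BeautifulSoup.py must be shipped alongside this script')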

To parse the HTML, I see three solutions:
use simple string searches (.find(), ...). Fast!
use regular expressions (regex)
use HTMLParser from the standard library (see the sketch after this list)
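As a hedged sketch of the HTMLParser option (not tested against Google's actual markup; it assumes the anchors look like <a href="..." class="maptitle">Name</a>, as the BeautifulSoup version suggests), this collects the href and text of each maptitle link, which is what the three BeautifulSoup lines extract:
from HTMLParser import HTMLParser  # Python 2 module name

class MapTitleParser(HTMLParser):
    """Collects (href, link text) pairs for every <a class="maptitle">."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._in_maptitle = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('class') == 'maptitle':
            self._in_maptitle = True
            self.links.append([attrs.get('href', ''), ''])

    def handle_data(self, data):
        if self._in_maptitle:
            self.links[-1][1] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_maptitle = False

# usage:
# parser = MapTitleParser()
# parser.feed(source)
# for href, title in parser.links:
#     print href, title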

I have tried the code below and it prints a list of links. Since I do not have BeautifulSoup installed and do not want to install it, it is hard for me to check the results against what your code gives.
The "pure" Python code without any "soup" is even shorter and more readable.
Anyway, here it is. Tell me what you think! Friendly, Louis.
# coding: utf-8
import urllib, re

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    pos = 0
    while True:
        # plain string search: jump to the next 'maptitle' anchor, or stop
        hit = source.find('maptitle', pos)
        if hit == -1:
            break
        # take a window around the hit so it contains both the href
        # (which carries the map id) and the link text
        chunk = source[max(0, hit - 300):hit + 300]
        mapid = re.search(uid+'\.([^"]*)', chunk).group(1)
        mapname = re.search('>(.*)</a>', chunk).group(1).strip()[:-3]
        print shown, mapid, '\t', mapname
        shown += 1
        urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                           '&msa=0&output=kml', mapname + '.kml')
        pos = hit + len('maptitle')
    if '<span>Next</span>' in source:
        start += 5
    else:
        break

Related

Adding chevrons and underscores with BeautifulSoup

I want BeautifulSoup to insert strings like this into my HTML pages:
{{< Transfer/component_short_name >}}
(If you are interested why, this is a Hugo shortcode, a kind of variable for markdown)
When I build it programmatically in Python and add it using tag.insert_after(), what ends up in the document looks like this:
{{< Transfer/component\_short\_name >}}
which of course does not work the same.
I managed a workaround for the chevrons < and > using string replaces, but the underscores '_' would require going into regex, leaving complicated code for a simple operation, so I'm wondering whether there's an option in BeautifulSoup for this.
I tried various approaches, such as var_name = var_name.replace("\\_", "_") , but that does not work.
I don't see a way to avoid the < and > conversion using BeautifulSoup, but as you say they could be converted afterwards. In the following example there is no underscore escaping:
from bs4 import BeautifulSoup
import re
shortcode = "{{< Transfer/component_short_name >}}"
html = "<html><body><h1>hello world</h1></body>"
soup = BeautifulSoup(html, "html.parser")
soup.h1.insert_after(shortcode)
fixed = re.sub(r'\{\{&lt;|&gt;\}\}|\\_', lambda x: {'{{&lt;': '{{<', '&gt;}}': '>}}', r'\_': '_'}[x.group(0)], str(soup))
print(fixed)
Giving the HTML as:
<html><body><h1>hello world</h1>{{< Transfer/component_short_name >}}</body></html>
Here, the \_ replacement does not appear to be needed but I have included it for completeness.

Scraping with lxml and python requests.

Okay, I am at it again, really trying to figure this stuff out with lxml and Python. The last time I asked a question I was using XPath and had to work out how to cope with the XPath itself changing. I have edited my code to go after the class instead. I keep running into a problem where it pulls up the object's address in memory rather than the text I want. Before anyone says there is a library for what I want to do: this is not about that, but rather about letting me understand this code. Here is what I have so far, but when I print it I get an error, and even if I index with [0], as in print(prices[0].text), it still gives me nothing. Any help would be cool.
from lxml import html
import requests
import time
while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    # This will create a list of prices:
    prices = content.find_class('price')
    print(prices.text)
    time.sleep(.5)
Probably a formatting issue from posting, but the body of your while loop is not indented.
Try my code below:
import time
import requests
from lxml import html

while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    prices = content.find_class('price')
    # You need to access the 'text_content' method
    text = [p.text_content() for p in prices]
    for t in text:
        if t.strip():  # prevents the multiple blank lines
            print(t)
    time.sleep(0.5)

Scraping with Python?

I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows that the following URL returns the content I want, including both the index word and its definition, for 'a':
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
What modules are used? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2

def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # parse the xml here (using BeautifulSoup or an xml parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)
    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse xml
    return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
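If you do go with the standard library, a hedged sketch of the omitted parsing step with xml.etree could look like this (the 'article', 'name' and 'id' element names are assumptions; check the XML the CGI script actually returns and adjust accordingly):
import xml.etree.ElementTree as ET

def parse_articles(xml_text):
    # Turn the search-result XML into (article_name, article_id) pairs.
    root = ET.fromstring(xml_text)
    articles = []
    for node in root.iter('article'):   # assumed element name
        name = node.findtext('name')    # assumed child element
        art_id = node.findtext('id')    # assumed child element
        articles.append((name, art_id))
    return articles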
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before heavily scraping articles. If you're just getting a few articles for yourself they may not care (the dictionary's copyright owner may, though, if it's not public domain), but if you're going to scrape the entire dictionary, that is heavy usage.
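For instance, a minimal robots.txt check with the standard robotparser module (urllib.robotparser in Python 3); the URLs simply reuse the site from the question:
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://pali.hum.ku.dk/robots.txt')
rp.read()
# True if the rules allow an anonymous crawler to fetch the CGI endpoint
print rp.can_fetch('*', 'http://pali.hum.ku.dk/cgi-bin/cpd/pali')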

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using XPath with lxml, but I have no experience and every construction I try comes back with an empty list.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute, "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If you don't need to implement this via XPath, you can use the BeautifulSoup library like this (assuming the myXml variable contains the page's HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation, it may be very useful for some scraping needs that don't require the power of XPath.
You could open the HTML source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page, you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue

Find following tag with pyparsing

I'm using pyparsing to parse HTML. I'm grabbing all embed tags, but in some cases there's an a tag directly following that I also want to grab if it's available.
example:
import pyparsing
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br />blah
""")
I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.
EDIT:
Someone asked why I don't use BeautifulSoup. That's a good question, let me show you why I chose not to use it with a code sample:
import BeautifulSoup
import urllib
import re
import socket
socket.setdefaulttimeout(3)
# get some random blogs
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()
success, failure = 0.0, 0.0
for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1
print failure / (failure + success)
When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.
If there is an optional <a> tag that you are interested in when it directly follows an <embed> tag, then add it to your search pattern:
embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br />blah
""")
print result.dump()
If you want to capture the character location of an expression within your parser, insert one of these, with a results name:
loc = pyparsing.Empty().setParseAction(lambda s, locn, toks: locn)
target = (loc("beforeEmbed") + embedTag + loc("afterEmbed") +
          pyparsing.Optional(aTag))
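A hedged usage sketch (html_source is a placeholder for whatever string you scan): the named locations come back on each match's tokens, so you can slice the original input around an <embed> tag:
for tokens, start_loc, end_loc in target.scanString(html_source):
    # beforeEmbed/afterEmbed are the integer offsets recorded by loc()
    snippet = html_source[tokens.beforeEmbed:tokens.afterEmbed]
    print snippet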
Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.
Do you not prefer using a normal regex? Or is it because parsing HTML with regexes is a bad habit? :D
re.findall("<object.*?</object>(?:<br /><a.*?</a>)?",a)
I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a
Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).
