Scraping with Python?

I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows that the following URL returns the content I want, including both the index word and its definition, for the entry 'a'.
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
Which modules are used for this? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner at programming.

You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup

# fetch the page and hand it to BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)

# soup('h3') is shorthand for soup.findAll('h3')
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.

You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2

def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # parse the XML here (using BeautifulSoup or an XML parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)
    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse the XML
    return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
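To give a rough idea of what that parsing step could look like with the standard library, here is a minimal sketch using xml.etree; the <article>, <name> and <id> element names are assumptions made for illustration, so adjust them to whatever the server actually returns:
import urllib2
import xml.etree.ElementTree as ET

def getArticles(query, start_index, count):
    data = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                           'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                           (query, start_index, count)).read()
    root = ET.fromstring(data)
    # assumed schema: one <article> element per hit, with <name> and <id> children
    return [(a.findtext('name'), int(a.findtext('id'))) for a in root.findall('article')]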
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to heavily scrape articles. If you're just getting a few articles for yourself they may not care (though the dictionary copyright owner, if it's not public domain, might), but if you're going to scrape the entire dictionary, this is going to be some heavy usage.
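If you want to check robots.txt from Python itself, the standard library's robotparser module can do it; a small sketch, assuming the file lives at the usual location on the site root:
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://pali.hum.ku.dk/robots.txt')
rp.read()
# True if a generic crawler is allowed to fetch the CGI path used above
print rp.can_fetch('*', 'http://pali.hum.ku.dk/cgi-bin/cpd/pali')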

Related

BeautifulSoup to access available bikes in DC bikeshare

I'm new to programming and python and am trying to access the number of available bikes at a given station in the DC bikeshare program. I believe that the best way to do that is with BeautifulSoup. The good news is that the data is available in what appears to be a clean format here: https://www.capitalbikeshare.com/data/stations/bikeStations.xml
Here's an example of a station:
<station>
<id>1</id>
<name>15th & S Eads St</name>
<terminalName>31000</terminalName>
<lastCommWithServer>1460217337648</lastCommWithServer>
<lat>38.858662</lat>
<long>-77.053199</long>
<installed>true</installed>
<locked>false</locked>
<installDate>0</installDate>
<removalDate/>
<temporary>false</temporary>
<public>true</public>
<nbBikes>7</nbBikes>
<nbEmptyDocks>8</nbEmptyDocks>
<latestUpdateTime>1460192501598</latestUpdateTime>
</station>
I'm looking for the <nbBikes> value. I had what I thought would be the start of a python script that would show me the value for the first 5 stations (I'll tackle picking the station I want once I get this under control) but it doesn't return any values. Here's the script:
# bikeShareParse.py - parses the capital bikeshare info page
import bs4, requests

url = "https://www.capitalbikeshare.com/data/stations/bikeStations.xml"
res = requests.get(url)
res.raise_for_status()

# create the soup element from the file
soup = bs4.BeautifulSoup("res.text", "lxml")

# defines the part of the page we are looking for
nbikes = soup.select('#text')

# limits number of results for testing
numOpen = 5
for i in range(numOpen):
    print nbikes
I believe that my problem (besides not understanding how to format code correctly in a stack overflow question) is that the value for nbikes = soup.select('#text') is incorrect. However, I can't seem to substitute anything for '#text' to get any values, let alone the ones I want.
Am I approaching this the right way? If so, what am I missing?
thanks
This script builds a list of (station_ID, bikes_remaining) tuples. It is modified from the beginning of this: http://www.plotsofdots.com/archives/68
# from http://www.plotsofdots.com/archives/68
import xml.etree.ElementTree as ET
import urllib2

# we fetch the data with urllib2 and parse it with xml.etree
site = 'https://www.capitalbikeshare.com/data/stations/bikeStations.xml'
htm = urllib2.urlopen(site)
doc = ET.parse(htm)

# we get the root tag
root = doc.getroot()

# we define empty lists for the station IDs and the available bikes
sID = []
embikes = []

# we now use a for loop to extract the information we are interested in
for station in root.findall('station'):
    sID.append(station.find('id').text)
    embikes.append(int(station.find('nbBikes').text))

# this just tests that the process above works, can be commented out
#print embikes
#print sID

# use zip to create (station_ID, bikes_remaining) tuples
prov = zip(sID, embikes)
print prov[0]
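If you want the count for one particular station rather than just the first tuple, you can turn that list into a dict and look the station up by its <id>; the '1' below is just the example ID from the XML snippet in the question:
# build a dict keyed by station ID for direct lookup
bikes_by_station = dict(prov)
print bikes_by_station['1']  # e.g. 7 for the station shown above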

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
This just prints an empty list: []
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []

while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea if this is against the terms of service of the website, so you would need to check it out.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It'll most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data, because it will be in a nice format.
Edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working"? An empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
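One quick way to confirm that yourself is to fetch the page and search the raw HTML for the class name; if it is not there, the names are most likely loaded later via JavaScript/AJAX rather than being part of the initial page:
import urllib2

html = urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read()
# str.find returns -1 when the substring is absent from the downloaded HTML
print html.find('name_location')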

How can I store the results of parsed html?

I'm using Python's HTMLParser and BeautifulSoup to parse Yahoo finance data. There is a very nice package written to do this already, but it doesn't get "tangible price/book value", which is to say that it includes Goodwill and other intangibles in the calculation of book value. Hence, I'm forced to roll my own solution.
It hasn't been pretty. Here's the code
from BeautifulSoup import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

class data(HTMLParser):
    def handle_data(self, data):
        print data

parser = data()

url = 'http://finance.yahoo.com/q/bs?s=BAC&annual'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tangibles = [str(parser.feed(str(soup('strong')[24:26])))]
Two problems with this:
1) I'm relying on the data always being in the same place on Yahoo's page, which isn't the biggest problem but doesn't make me happy
and,
2) The real problem;
tangibles=[str(parser.feed(str(soup('strong')[24:26])))]
is an empty list, because the "data" class is just printing the stuff I want and not storing it.
I'll be happy if you answer part 2) for me. I don't understand classes yet.
Get rid of the data class, the parser, and the supporting imports, then do this:
tangibles = [''.join(node(text=True)).strip() for node in soup('strong')[24:26]]
I basically changed this to use a Python list comprehension. Read up on list comprehensions if you are not familiar with them in Python.
In essence it does these things:
Tells soup to find your strong elements and names each one node: for node in soup('strong')[24:26] (soup(...) is shorthand for soup.findAll(...))
For each node it extracts just the text, dropping the strong tags completely: node(text=True) (see the BeautifulSoup docs about text=True)
Joins the text pieces so you get one string instead of a list of strings: ''.join() (a Python trick)
i.e. ['Net Stuff', '152,113,000'] vs [['Net Stuff'], ['152,113,000']]
Removes superfluous leading and trailing whitespace: .strip()
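If you do want to keep the HTMLParser approach instead, the usual fix for part 2) is to have the handler append to a list on the instance rather than print; a minimal sketch (the class name and the results attribute are just illustrative), reusing the soup object from the question:
from HTMLParser import HTMLParser

class DataCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.results = []  # parsed text is stored here instead of printed

    def handle_data(self, data):
        if data.strip():
            self.results.append(data.strip())

parser = DataCollector()
for tag in soup('strong')[24:26]:
    parser.feed(str(tag))
tangibles = parser.results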

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link:<a href='http://example.com'>example.com</a>
By a simple Python replace function which replaces example with <a href='http://example.com'>example</a>, it would output:
Here is an <a href='http://example.com'>example</a> link:<a href='http://<a href='http://example.com'>example</a>.com'>example.com</a>
but I want:
Here is an <a href='http://example.com'>example</a> link:<a href='http://example.com'>example.com</a>
Is there any Python plugin that capable of this? Thanks a lot!
This is roughly what you could do using Beautifulsoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)

# temporarily mark existing link text with '|' so it is not replaced again
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

for text in soup.findAll(text=True):
    text_formatted = ["<a href='http://example.com'>example</a>"
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# remove the temporary '|' markers from the existing links
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm walking over all the text in the html_body, replacing the word example with the given link, without touching the link texts that are protected by the '|' characters during the parsing.
This is not 100% perfect, for example it does not work if the word you are trying to replace ends with a period; with some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
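As a rough sketch of the re.sub-with-a-function idea (the keyword map here is hypothetical, and this only makes sense once you are operating on the plain-text parts of the document):
import re

keywords = {'example': 'http://example.com'}  # hypothetical keyword/link pairs

def linkify(match):
    word = match.group(0)
    url = keywords.get(word.lower())
    # wrap known keywords in an anchor tag, leave every other word untouched
    return '<a href="%s">%s</a>' % (url, word) if url else word

print re.sub(r'\w+', linkify, 'Here is an example link')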

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say the type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using XPath with lxml, but I have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute "wholeText" or "textContent" -- but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib

site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search': userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation, it may be very useful for some scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html

url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue

# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
