Scraper yields no results - python

I am very new to python (this is my first Python project, in fact) and I am having a bit of trouble writing this web scraper. I used a tutorial to figure this out, but the code is yielding no results. I would really appreciate some help.
from lxml import html
import requests
page = requests.get('http://openbook.sfgov.org/openbooks/cgi-bin/cognosisapi.dll?b_action=cognosViewer&ui.action=run&ui.object=/content/folder%5B%40name%3D%27Reports%27%5D/report%5B%40name%3D%27Budget%27%5D&ui.name=20Budget&run.outputFormat=&run.prompt=false')
tree = html.fromstring(page.content)
#This will find the table headers:
categories = tree.xpath('//*[@id="rt_NS_"]/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[2]/td/div/div/table/tbody/tr/td[2]/table/tbody/tr[2]/td[1]')
# This will find the budgets
category_budget = tree.xpath('//*[@id="rt_NS_"]/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[2]/td/div/div/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]/span[1]')
print 'Categories: ', categories
print 'Budget: ', category_budget

It looks like the contents of the table id="rt_NS_" are generated by JavaScript.
In that case requests won't help you:
page = requests.get('http://openbook.sfgov.org/openbooks/cgi-bin/cognosisapi.dll?b_action=cognosViewer&ui.action=run&ui.object=/content/folder%5B%40name%3D%27Reports%27%5D/report%5B%40name%3D%27Budget%27%5D&ui.name=20Budget&run.outputFormat=&run.prompt=false')
ctx = page.content
if "id=\"rt_NS_\"" in ctx:
    print "Found!"
else:
    print "Not Found!"
Not Found!
You'll need to use another approach. Selenium with Python could be an option.
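If Selenium renders the page, the extraction itself can still be done with lxml. Here is a minimal sketch, assuming selenium and a matching browser driver are installed; the simplified XPaths below (and the cell layout they assume) are illustrative placeholders rather than the question's full path, and note that a valid XPath id test is @id, not #id:

```python
# Hypothetical sketch: let Selenium execute the JavaScript, then reuse
# lxml for extraction. The XPaths here are simplified placeholders;
# the real rendered table will need the full path from the question
# (with @id instead of the invalid #id).
from lxml import html

def extract_budget_rows(rendered_html):
    """Pair up first-column and second-column cells of the rt_NS_ table."""
    tree = html.fromstring(rendered_html)
    categories = tree.xpath('//*[@id="rt_NS_"]//tr/td[1]//text()')
    budgets = tree.xpath('//*[@id="rt_NS_"]//tr/td[2]//text()')
    return list(zip(categories, budgets))

# Rendering step (untested sketch; requires selenium and a chromedriver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(url)  # the openbook.sfgov.org URL from the question
# print(extract_budget_rows(driver.page_source))
```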

Related

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
1. Use a library capable of executing JavaScript and giving you the resulting HTML.
2. Try a different approach that doesn't require JavaScript support.
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, then we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation here) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
So the way the site works is that it sends you the page with no word in the span box and fills it in later with JavaScript; that's why you get a span box with nothing inside.
However, since you're after the word itself, I'd suggest a different method than scraping it off the page: simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and you'll receive the word in the response.
You're using Python 2, but in Python 3 (shown here to demonstrate that this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
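For completeness, in Python 3 the standard-library route is urllib.request; supplying a data payload, even an empty one, turns the request into a POST. A minimal sketch, with the network call itself left commented out:

```python
# Standard-library sketch (Python 3): urllib.request issues a POST
# whenever a data payload is supplied, even an empty one.
import urllib.request

url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
req = urllib.request.Request(url, data=b'')  # empty body, still a POST

# Network call (commented out to keep the sketch offline):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode('utf-8'))
```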

How to get data from a website's search option using Python?

I'm trying to extract data from the website
http://maps.jocogov.org/ims/
This website has a search option, and I want to get information corresponding to specific property ids like DP14000001 0001.
When we search for a property id, a pop-up window appears, and from that window I need to extract data from the link "Tax Bill Info Click Here".
I'm storing the property ids in a text file so that the ids can be iterated from there, used in the search option, and the data retrieved from the link in the pop-up window.
I'm new to web scraping and written some starting code...
import re
import urllib
propertyids = "/home/NewYork/PropertyId.txt"
url = "http://maps.jocogov.org/ims/"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = 'class="ui-autocomplete-input" type="+propertyids+"'
pattern = re.compile(regex)
locationidinfo = re.findall(pattern,htmltext)
print locationidinfo
After executing this code I am getting a result like this: []. I don't know what it means, so I'm lost in setting up further code to get data from the website...
Can anyone assist with the next step?
Thanks in advance !! :)
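An empty list from re.findall simply means the pattern matched nothing, and one reason it matches nothing here is that the pattern above is a single literal string, so +propertyids+ is never substituted with the variable's value. A small illustration (the sample HTML line is made up):

```python
# re.findall returns [] when the pattern matches nothing. In the code
# above, 'class="ui-autocomplete-input" type="+propertyids+"' is one
# literal string, so the propertyids variable never enters the pattern.
import re

html_text = '<input class="ui-autocomplete-input" type="text">'  # made-up sample

literal = re.findall('type="+propertyids+"', html_text)  # never matches
built = re.findall(r'class="([\w-]+)"', html_text)       # real pattern
```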

How to gather specific information using urllib2 in python

As of now I have created a basic program in Python 2.7 using urllib2 and re that gathers the HTML code of a website, prints it out for you, and indexes a keyword. I would like to create a much more complex and dynamic program which could gather data from websites such as sports or stock statistics and aggregate it into lists which could then be used in analysis in something such as an Excel document. I'm not asking for someone to literally write the code; I simply need help understanding how I should approach it: whether I require extra libraries, etc. Here is the current code. It is very simplistic as of now:
import urllib2
import re
y = 0
while y == 0:
    x = str(raw_input("[[[Enter URL]]]"))
    keyword = str(raw_input("[[[Enter Keyword]]]"))
    wait = 0
    try:
        req = urllib2.Request(x)
        response = urllib2.urlopen(req)
        page_content = response.read()
        idall = [m.start() for m in re.finditer(keyword, page_content)]
        wait = raw_input("")
        print(idall)
        wait = raw_input("")
        print(page_content)
    except urllib2.HTTPError as e:
        print e.reason
You can use requests to handle the interaction with the website. Here is the link for it: http://docs.python-requests.org/en/latest/
Then you can use BeautifulSoup to parse the HTML content. Here is the link for it: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
They're easier to use than urllib2 and re.
Hope it helps.
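A minimal sketch of that combination, assuming the requests and beautifulsoup4 packages are installed; the keyword search runs on the page's visible text rather than the raw HTML:

```python
# Sketch of the requests + BeautifulSoup combination suggested above.
# find_keyword() is a pure function, so it works on any HTML string;
# fetch() does the network step.
import requests
from bs4 import BeautifulSoup

def find_keyword(html_text, keyword):
    """Return the offsets of keyword in the page's visible text."""
    text = BeautifulSoup(html_text, 'html.parser').get_text()
    return [i for i in range(len(text)) if text.startswith(keyword, i)]

def fetch(url):
    return requests.get(url).text

# Usage (network):
# print(find_keyword(fetch('http://example.com'), 'Example'))
```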

Getting the number of like counts from a facebook album

I was trying to develop a Python script for my friend which would take a link to a public album and count the likes and comments of every photo with the "requests" module. This is the code of my script:
import re
import requests
def get_page(url):
    r = requests.get(url)
    content = r.text.encode('utf-8', 'ignore')
    return content

if __name__ == "__main__":
    url = 'https://www.facebook.com/media/set/?set=a.460132914032627.102894.316378325074754&type=1'
    content = get_page(url)
    content = content.replace("\n", '')
    chehara = "(\d+) likes and (\d+) comments"
    cpattern = re.compile(chehara)
    result = re.findall(cpattern, content)
    for jinish in result:
        print "likes " + jinish[0] + " comments " + jinish[1]
But the problem here is that it only parses the likes and comments for the first 28 photos and no more. What is the problem? Can somebody please help?
[Edit: the "requests" module just loads the web page; that is, the variable content contains the full HTML source of the Facebook page of the linked album]
Use the Facebook Graph API.
For albums it's documented here:
https://developers.facebook.com/docs/reference/api/album/
Use the limit attribute for testing since it's rather slow:
http://graph.facebook.com/460132914032627/photos/?limit=10
EDIT: I just realized that like_count is not part of the JSON; you may have to use FQL for that.
If you want to see the next page, you need to add the after attribute to your request, as in this URL:
https://graph.facebook.com/albumID/photos?fields=likes.summary(true),comments.summary(true)&after=XXXXXX&access_token=XXXXXX
You could take a look at this JavaScript project for reference.
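A sketch of how the paging might look. The field names, cursor value, and token below are assumptions rather than values verified against the current API, and only the URL construction is shown (no network call):

```python
# Hypothetical helper for paging through an album's photos via the
# Graph API endpoint mentioned above. The 'fields' string and the
# access token are assumptions, not verified against the live API.
import urllib.parse

def album_photos_url(album_id, access_token, after=None, limit=10):
    params = {
        'fields': 'likes.summary(true),comments.summary(true)',
        'limit': str(limit),
        'access_token': access_token,
    }
    if after:
        params['after'] = after  # cursor from the previous page's 'paging' block
    return 'https://graph.facebook.com/%s/photos?%s' % (
        album_id, urllib.parse.urlencode(params))

# Feeding each response's paging.cursors.after value back in is what
# gets you past the first page of results.
```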

Searching through a website directory, validate, then place URL in a list depending on content

I've been working on a script and I thought I would ask for help. I'm looking to search a series of websites and check whether each site is valid. The next step is to check for specific content on the site; if the site holds that content, place the URL in a list.
import urllib2

National = []
Local = []
Sports = []
Culture = []

# I would like to set up an iteration to check the entry id from 1-100.
# If the term is found on the page, place the url in the list.
def getPage():
    for i in range(1, 101):
        url = "http://readingeagle.com/section.aspx?id=%d" % i
        req = urllib2.Request(url)
        response = urllib2.urlopen(req).read()
        if "national" in response:
            National.append(url)
    return National

if __name__ == "__main__":
    namesPage = getPage()
    print (namesPage)
Here's my answer to the question of how to validate a given web site:
python check html valid
For checking the content of the page, the tools range from basic string methods and regex to more sophisticated tools like lxml or BeautifulSoup.
matchingSites = []
matchingSites.append(url) #Since you asked. :-p
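Putting the pieces together, one sketch of the loop described in the question: fetch each section id, skip pages that fail to load, and bucket each URL by a simple content check. The keyword-to-list mapping is an assumption, and categorize() is kept pure so it runs without the network (Python 3 shown):

```python
# Sketch (Python 3): iterate over section ids, skip invalid pages, and
# bucket each URL by keyword. The keyword buckets mirror the question's
# National/Local/Sports/Culture lists.
import urllib.request
import urllib.error

BASE = 'http://readingeagle.com/section.aspx?id=%d'
buckets = {'national': [], 'local': [], 'sports': [], 'culture': []}

def categorize(url, page_text, buckets):
    """Append url to every bucket whose keyword appears in the page."""
    for word, urls in buckets.items():
        if word in page_text.lower():
            urls.append(url)

# Network loop (commented out to keep the sketch offline):
# for i in range(1, 101):
#     url = BASE % i
#     try:
#         page = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
#     except urllib.error.URLError:
#         continue  # invalid site, skip it
#     categorize(url, page, buckets)
```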
