Python requests fails to get webpages - python

I am using Python3 and the package requests to fetch HTML data.
I have tried running the line
r = requests.get('https://github.com/timeline.json')
which is the example from their tutorial, to no avail. However, when I run
request = requests.get('http://www.math.ksu.edu/events/grad_conf_2013/')
it works fine. I am getting errors such as
AttributeError: 'MockRequest' object has no attribute 'unverifiable'
Error in sys.excepthook:
I suspect the errors have something to do with the type of page I am requesting, since the page that works is just basic HTML that I wrote.
I am very new to requests and Python in general. I am also new to Stack Overflow.

As a small example, here is a little tool I wrote that fetches data from a website (in this case, your IP address) and shows it:
# Import the requests module
# (install it first, e.g. with: pip install requests)
import requests
# Get the raw page source from the website
r = requests.get('http://whatismyipaddress.com')
# r.text is already a single string, so no joining is needed
text = r.text
# Get the starting position of the IP address string
# (the offset of 62 depends on the page's markup and may need adjusting)
ip_text_pos = text.find('IP Information') + 62
# Now extract the IP address and store it
ip_address = text[ip_text_pos : ip_text_pos + 12]
# Python 2: print 'Your IP address is: %s' % ip_address
# Python 3:
print('Your IP address is: %s' % ip_address)
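The fixed offset in the snippet above depends on the exact markup of the page, so it can break whenever the site changes. A minimal sketch of a less brittle alternative, assuming the address appears somewhere in the page text as a dotted IPv4 string, is to search for it with a regular expression. The function name find_first_ipv4 and the sample HTML are made up for illustration and are not taken from the real site.

```python
import re

# Four groups of 1-3 digits separated by dots. This is a loose pattern:
# it does not check that each group is <= 255.
IPV4_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def find_first_ipv4(page_text):
    """Return the first IPv4-looking string in page_text, or None."""
    match = IPV4_PATTERN.search(page_text)
    return match.group(0) if match else None


# Hypothetical page fragment, used only to exercise the function.
sample_html = '<p>IP Information: <b>203.0.113.7</b></p>'
print('Your IP address is: %s' % find_first_ipv4(sample_html))
```

With a real page you would pass r.text instead of sample_html.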

Related

Trying to parse XML from a URL and store it in a string so I can output it to IRC elsewhere

The following is the XML from the remote URL:
<SHOUTCASTSERVER>
<CURRENTLISTENERS>0</CURRENTLISTENERS>
<PEAKLISTENERS>0</PEAKLISTENERS>
<MAXLISTENERS>100</MAXLISTENERS>
<UNIQUELISTENERS>0</UNIQUELISTENERS>
<AVERAGETIME>0</AVERAGETIME>
<SERVERGENRE>variety</SERVERGENRE>
<SERVERGENRE2/>
<SERVERGENRE3/>
<SERVERGENRE4/>
<SERVERGENRE5/>
<SERVERURL>http://localhost/</SERVERURL>
<SERVERTITLE>Wicked Radio WIKD/WPOS</SERVERTITLE>
<SONGTITLE>Unknown - Haxor Radio Show 08</SONGTITLE>
<STREAMHITS>0</STREAMHITS>
<STREAMSTATUS>1</STREAMSTATUS>
<BACKUPSTATUS>0</BACKUPSTATUS>
<STREAMLISTED>0</STREAMLISTED>
<STREAMLISTEDERROR>200</STREAMLISTEDERROR>
<STREAMPATH>/stream</STREAMPATH>
<STREAMUPTIME>448632</STREAMUPTIME>
<BITRATE>128</BITRATE>
<CONTENT>audio/mpeg</CONTENT>
<VERSION>2.4.7.256 (posix(linux x64))</VERSION>
</SHOUTCASTSERVER>
All I am trying to do is store the contents of the <SONGTITLE> element so I can post it to IRC using a bot that I have.
import urllib2
from lxml import etree
url = "http://142.4.217.133:9203/stats?sid=1&mode=viewxml&page=0"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//SONGTITLE'):
    for x in record.xpath("./subfield/text()"):
        print "\t", x
That is what I have so far; I am not sure what I am doing wrong here. I am quite new to Python, but the IRC bot works and does some other utility-type things. I just want to add this as a feature to it.
You don't need to include ./subfield/:
for x in record.xpath("text()"):
Output:
Unknown - Haxor Radio Show 08
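To keep the title in a string rather than printing it, a minimal sketch using only the standard library's xml.etree.ElementTree might look like the following. It is written for Python 3, parses a shortened copy of the XML shown in the question instead of fetching the URL, and the function name get_song_title is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, used instead of a live request.
SAMPLE_XML = """<SHOUTCASTSERVER>
<SERVERTITLE>Wicked Radio WIKD/WPOS</SERVERTITLE>
<SONGTITLE>Unknown - Haxor Radio Show 08</SONGTITLE>
</SHOUTCASTSERVER>"""


def get_song_title(xml_text):
    """Return the text of the SONGTITLE child of the root, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.findtext('SONGTITLE')


song_title = get_song_title(SAMPLE_XML)
print(song_title)
```

The song_title variable can then be handed to whatever function the bot uses to send a message.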

Using CGI in python 3.4 I keep getting "End of script output before headers: measurement.py" error in my error log

The purpose is to have the area method return JSON-serialized data using CGI and RESTful services. When I run request_area(), my console displays a 500 Internal Server Error, and when I check my error log it says 'End of script output before headers: measurement.py'.
Here is measurement.py
#!/usr/bin/python3
__author__ = 'charles'
import json
import logging
import db_utility, db_access
def send_data(data):
    logging.debug(str(data))
    dataJ = json.dumps(data)
    logging.debug("custJ " + str(dataJ))
    lng = len(dataJ)
    logging.debug("len " + str(lng))
    print("Content-Type: application/json; charset=UTF-8")
    print("Content-Length: " + str(lng))
    print()
    print(dataJ)
def area():
    areas = db_access.get_all_areas()
    # print(areas)
    send_data(areas)
And here is request_area()
import requests
r = requests.get("http://localhost/cgi-bin/measurements_rest/measurement.py/area")
print(r.text)
Oh and the function being called in area(), get_all_areas()
def get_all_areas():
    """
    Returns a list of dictionaries representing all the rows in the
    area table.
    """
    cmd = 'select * from area'
    crs.execute(cmd)
    return crs.fetchall()
I can not figure out what I am doing wrong.
Since you are using Apache to call your program, Apache places several
variables in the environment with information about the CGI call and the URL.
The one of interest to you is PATH_INFO, which contains the part of the URL
left unparsed by Apache, i.e. area. In your Python main code you need to call
os.getenv('PATH_INFO'), recognise the word, and call the matching function.
Alternatively, use a framework such as CherryPy, which
does this sort of work for you.
Also, you are printing output before the Content-Type etc. headers. Remove
print(areas)
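The PATH_INFO dispatch described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the ROUTES table, build_response, and dispatch are hypothetical names, and area() returns made-up rows in place of the real db_access.get_all_areas() call.

```python
import json
import os


def build_response(data, status="200 OK"):
    """Return a full CGI response: headers, a blank line, then the JSON body."""
    body = json.dumps(data)
    headers = [
        "Status: " + status,
        "Content-Type: application/json; charset=UTF-8",
        # json.dumps escapes non-ASCII by default, so len(body) is the byte count.
        "Content-Length: " + str(len(body)),
    ]
    return "\n".join(headers) + "\n\n" + body


def area():
    # Placeholder for db_access.get_all_areas(); these rows are invented.
    return [{"id": 1, "name": "example"}]


# Hypothetical routing table: PATH_INFO word -> handler function.
ROUTES = {"area": area}


def dispatch(path_info):
    """Look up the handler named by PATH_INFO and return the response text."""
    name = path_info.strip("/")
    handler = ROUTES.get(name)
    if handler is None:
        return build_response({"error": "unknown resource"}, "404 Not Found")
    return build_response(handler())


if __name__ == "__main__":
    # Apache sets PATH_INFO to the part of the URL after the script name.
    print(dispatch(os.environ.get("PATH_INFO", "")), end="")
```

Building the whole response as a string before printing makes it harder to emit stray output ahead of the headers.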

How to access header information sent to http.server from an AJAX client using python 3+?

I am trying to read data sent to Python's http.server from a local JavaScript program posted with AJAX. Everything works in Python 2.7 as in this example, but now in Python 3+ I can't access the header anymore to get the file length.
# python 2.7 (works!)
class handler_class(SimpleHTTPServer.SimpleHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/data':
            length = int(self.headers.getheader('Content-Length'))
            NewData = self.rfile.read(length)
I've discovered I could use urllib.request, as I have mocked up below. However, I am running on localhost and don't have a full URL as I've seen in the examples, and I am starting to second-guess whether this is even the right way to go. Frustratingly, I can see the Content-Length printed out in the console, but I can't access it.
# python 3+ (url?)
import urllib.request
class handler_class(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/data':
            print(self.headers) # I can see the content length here but cannot access it!
            d = urllib.request.urlopen(url) # what url?
            length = int(d.getheader('Content-Length'))
            NewData = self.rfile.read(length)
The various URLs I have tried are:
self.path
http://localhost:8000/data
/data
and I generally get this error:
ValueError: unknown url type: '/data'
So why is 'urllib.request' failing me and, more importantly, how does one access 'self.headers' in this Python 3 world?
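One possible approach, sketched under the assumption that only the request body is needed: urllib.request is a client for making new outgoing requests, so it is not needed inside the handler. In Python 3, self.headers is an http.client.HTTPMessage, which supports dictionary-style access, so self.headers.get('Content-Length') or self.headers['Content-Length'] takes the place of getheader(). The helper post_once and the class attribute received are made up here so that the sketch can be run end to end.

```python
import http.client
import http.server
import threading


class HandlerClass(http.server.SimpleHTTPRequestHandler):
    received = []  # request bodies seen so far (for illustration only)

    def do_POST(self):
        if self.path == '/data':
            # getheader() no longer exists; use get() or indexing instead.
            length = int(self.headers.get('Content-Length', 0))
            new_data = self.rfile.read(length)
            HandlerClass.received.append(new_data)
            self.send_response(200)
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; remove to restore request logging


def post_once(payload):
    """Start a throwaway server on a free port, POST payload to /data,
    and return the bytes the handler read from the request body."""
    server = http.server.HTTPServer(('127.0.0.1', 0), HandlerClass)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection('127.0.0.1', server.server_address[1])
        conn.request('POST', '/data', body=payload)
        conn.getresponse().read()
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return HandlerClass.received[-1]


print(post_once(b'hello from the client'))
```

In a real server only the do_POST method is needed; post_once exists purely to demonstrate the round trip.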

Parse content from select menu, Python+BeautifulSoup

I am trying to parse data from a page using Python. This would normally be pretty straightforward, but all the data is hidden under jQuery elements and the like, which makes it harder to grab. Please forgive me, as I am new to Python and to programming as a whole, so I am still getting familiar with it. The website I am getting it from is http://www.asusparts.eu/partfinder/Asus/All In One/E Series so I just need all the data from the E Series. This is the code I have so far:
import string, urllib2, csv, urlparse, sys
from bs4 import BeautifulSoup
changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)
redirects = []
model_info = []
select = soup.find(id='myselectListModel')
print select.get_text()
options = select.findAll('option')
for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])
for r in redirects:
    rpage = urllib2.urlopen(base_url + r.replace(' ', '%20'))
    s = BeautifulSoup(rpage)
    print s
    sys.exit()
However, the only problem is that it just prints out the data for the first model, which is
Asus->All In One->E Series->ET10B->AC Adapter. The actual HTML page prints out like the following... (output was too long - just pasted the main output needed)
I am unsure how I would grab the data for all the E Series parts, as I assumed this would grab everything. Also, I would appreciate it if any answers relate to the method I am currently using, as this is the way the person in charge would like it done. Thanks.
[EDIT]
This is how I am trying to parse the HTML:
for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    print s
    data = soup.find(id='accordion')
    selection = data.findAll('td')
    for s in selections:
        if(selection.has_attr('class', 'ProduktLista')):
            redirects.append(td['class', 'ProduktLista'])
This is the error I get:
Traceback (most recent call last):
File "C:\asus.py", line 31, in <module>
selection = data.findAll('td')
AttributeError: 'NoneType' object has no attribute 'findAll'
You need to remove the sys.exit() call you have in your loop:
for r in redirects:
    rpage = urllib2.urlopen(base_url + r.replace(' ', '%20'))
    s = BeautifulSoup(rpage)
    print s
    # sys.exit() # remove this line, no need to exit your program
You also may want to use urllib.quote to properly quote the URLs you get from the option dropdown; this removes the need to manually replace spaces with '%20'. Use urlparse.urljoin() to construct the final URL:
from urllib import quote
from urlparse import urljoin
for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    print s
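For readers on Python 3, where quote and urljoin both live in urllib.parse, the same URL construction can be sketched like this. The function name build_url is made up, and the example path is only an assumption about what a redirectvalue attribute might contain.

```python
from urllib.parse import quote, urljoin

BASE_URL = 'http://www.asusparts.eu'


def build_url(redirect_value):
    """Percent-encode a path taken from an <option> tag and join it to the base URL."""
    # quote() leaves '/' alone by default and turns spaces into %20.
    return urljoin(BASE_URL, quote(redirect_value))


# Hypothetical redirectvalue, modelled on the URL in the question.
print(build_url('/partfinder/Asus/All In One/E Series'))
```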

Getting my IP using python

I am trying to update my Rackspace DNS with my IP using a Python script.
My script works when I manually enter an IP in it, but doesn't when I get it from an outside service. Why?
This WORKS:
#!/usr/bin/env python
import clouddns
import requests
r= requests.get(r'http://curlmyip.com/')
ip= '4.4.4.4'
dns = clouddns.connection.Connection('******','********************')
domain = dns.get_domain(name='reazem.net')
record = domain.get_record(name='ssh.reazem.net')
record.update(data=ip, ttl=600)
This DOESN'T:
#!/usr/bin/env python
import clouddns
import requests
r= requests.get(r'http://curlmyip.com/')
ip = '{}'.format(r.text)
dns = clouddns.connection.Connection('******','********************')
domain = dns.get_domain(name='reazem.net')
record = domain.get_record(name='ssh.reazem.net')
record.update(data=ip, ttl=600)
Note: print '{}'.format(r.text) successfully outputs my IP.
Helping you help me: I just noticed that print '{}'.format(r.text) adds an extra line. How do I avoid that?
For those interested: https://github.com/rackspace/python-clouddns
Try ip = r.text.strip() to remove the extra newline.
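A small sketch of that idea, with an extra sanity check before the value is sent on to the DNS API. The function name clean_ip is made up, the sample address is a documentation address rather than a real response, and the check uses the standard library's ipaddress module (Python 3).

```python
import ipaddress


def clean_ip(raw_text):
    """Strip surrounding whitespace and confirm the result is a valid IP address.

    Raises ValueError if the cleaned text is not an IPv4 or IPv6 address.
    """
    candidate = raw_text.strip()
    ipaddress.ip_address(candidate)  # validation only; the result is discarded
    return candidate


# Simulated service response with the trailing newline from the question.
ip = clean_ip('203.0.113.7\n')
print(repr(ip))
```

In the script from the question, r.text would be passed in place of the simulated response.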
