Why the browser display the entity name? - python

import urllib2
url = "http://www.reddit.com/r/pics/hot.json"
hdr = { 'User-Agent' : 'super happy flair bot by /u/spladug' }
req = urllib2.Request(url, headers=hdr)
html = urllib2.urlopen(req).read()
data = '{0}({1})'.format("callback", html)
print data
I want to use some data scraped from some sites and make the return value support the JSONP.But the browser outputs:
callback({"kind": "Listing", "data":........
Actually I want:
callback({"kind": "Listing", "data":........
I don't want the quotes convert to ".Maybe I should not use function read(),but I don't know how to deal with it?
Can someone help me?

Related

Why won't python request pagination work?

I'm trying to use pagination to request multiple pages of rent listing from zillow. Otherwise I'm limited to the first page only. However, my code seems to load the first page only even if I specify specific pages manually.
# Rent
import requests
from bs4 import BeautifulSoup as soup
import json
url = 'https://www.zillow.com/torrance-ca/rentals'
params = {
'q': {"pagination":{"currentPage": 1},"isMapVisible":False,"filterState":{"fore":{"value":False},"mf":{"value":False},"ah":{"value":True},"auc":{"value":False},"nc":{"value":False},"fr":{"value":True},"land":{"value":False},"manu":{"value":False},"fsbo":{"value":False},"cmsn":{"value":False},"fsba":{"value":False}},"isListVisible":True}
}
headers = {
# headers were copied from network tab on developer tools in chrome
}
html = requests.get(url=url,headers=headers, params=params)
html.status_code
bsobj = soup(html.content, 'lxml')
for script in bsobj.find_all('script'):
inner_text_with_string = str(script.string)
if inner_text_with_string[:18] == '<!--{"queryState":':
my_query = inner_text_with_string
my_query = my_query.strip('><!-')
data = json.loads(my_query)
data = data['cat1']['searchResults']['listResults']
print(data)
This returns about 40 listings. However, if I change "pagination":{"currentPage": 1} to "pagination":{"currentPage": 2}, it returns the same listings! It's as if the pagination parameter isn't recognized.
I believe these are the correct parameters, as I took them straight from the url string query and used http://urlprettyprint.com/ to make it pretty.
Any thoughts on what I'm doing wrong?
Using the params argument with requests is sending the wrong data, you can confirm this by printing response.url. what i would do is use urllib.parse.urlencode:
from urllib.parse import urlencode
...
html = requests.get(url=f"{url}?{urlencode(params)}", headers=headers)

Python Requests Post login

Yes, I know I'm green. I'm trying to learn how to POST into websites though I cannot seem pic the right fields to pass into the POST request.
Below you'll see the HTML for the site that I'm trying to grab everything from:
HTML picture
I've tried the following code to log into the website for Greetly but been having a hell of a time. I'm sure the values I'm passing must have the wrong keys but I can't seem to figure out what it is that I'm doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://app.greetly.com'
urlVisitorLog = 'https://app.greetly.com/locations/00001/check_in_records'
values = {
'user[email]':'email',
'user[password]':'password'
}
c = requests.Session()
results = c.get(url)
soup = BeautifulSoup(results.content, 'html.parser')
key = soup.find(name="authenticity_token")
authenticity_token = key['value']
values["authenticity_token"] = authenticity_token
c.post(urlVisitorLog, headers= values)
r = c.get(urlVisitorLog)
soup2 = BeautifulSoup(r.content, 'html.parser')
Also once I get the username and password I noticed the authenticity token isn't bound to a specific id but I also kind of need that login in order to parse through and see where that is.

Missing Attribute in a Python HTML request

I'm a newbie to Python and I'm actually working on a little Python script that request and read the HTML of an URL.
For Information the web page that i'm working on is http://bitcoinity.org/markets ,
I would like with my script to fetch the Current Price of the market.
I checked the HTML code and i found that the Price was in a balise :
<span id="last_price" value="447.77"</span>
Here is the code of my Python script :
import urllib2
import urllib
from bs4 import BeautifulSoup
url = "http://bitcoinity.org/markets"
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
HTML = urllib2.urlopen(req)
soup = BeautifulSoup(HTML)
HTMLText = HTML.read()
HTML.close()
#print soup.prettify()
#print HTMLText
So the problem is that the output of this script ( with the 2 methods BeautifulSoup and read() ) is like this :
</span>
<span id="last_price">
</span>
The "value=" attribute is missing and the syntax changed , so I don't know if the server doesn't allow me to make a request of this value or if there is a problem with my code.
All Help is welcome ! :)
( Sorry for my bad english , i'm not a native )
The price is calculated via a set of javascript functions, urllib2+BeautifulSoup approach would not work in this case.
Consider using a tool that utilizes a real browser, like selenium:
>>> from selenium import webdriver
>>> driver = webdriver.Firefox()
>>> driver.get('http://bitcoinity.org/markets')
>>> driver.find_element_by_id('last_price').text
u'0.448'
I'm not sure beautifulsoup or selenium are the tools for this task. They're actually a very poor solution.
Since we're talking about "stock" prices (bitcoin in this case), it is much better if you feed your app/script with real-time market data. Bitcoinity's default "current price" is actually Bitstamp's price... You can also get it directly from the Bitstamp's API via 2 ways.
HTTP API
Here's the ticker you need to feed your app with: https://www.bitstamp.net/api/ticker/ and here how you can get the last price (It is the 'last' value of that JSON what you really are looking for)
import urllib2
import json
req = urllib2.Request("https://www.bitstamp.net/api/ticker/")
opener = urllib2.build_opener()
f = opener.open(req)
json = json.loads(f.read())
print 'Bitcoin last price is = '+json['last']
Websockets API
This is how bitcoinity, bitcoinwisdom, etc grab the prices and market info in order to show it to you in real-time. For this you'll need pusher package for python, since Bitstamp uses pusher for websockets.

Trouble Getting a clean text file from HTML

I have looked at these previous questions
I am trying to consolidate news and notes from websites.
Reputed News service websites allow Users to post comments and views.
I am trying to get only the news content without the users comments. I tried working with BeautifulSoup and html2text. But user-comments are being included in the text file. I have even tried developing a custom program but with no useful progress than the above two.
Can anybody provide some clue how to proceed?
The code:
import urllib2
from bs4 import BeautifulSoup
URL ='http://www.example.com'
print 'Following: ',URL
print "Loading..."
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
identify_as = { 'User-Agent' : user_agent }
print "Reading URL:"+str(URL)
def process(URL,identify_as):
req = urllib2.Request(URL,data=None,headers=identify_as)
response = urllib2.urlopen(req)
_BSobj = BeautifulSoup(response).prettify(encoding='utf-8')
return _BSobj #return beauifulsoup object
print 'Processing URL...'
new_string = process(URL,identify_as).split()
print 'Buidiing requested Text'
tagB = ['<title>','<p>']
tagC = ['</title>','</p>']
reqText = []
for num in xrange(len(new_string)):
buffText = [] #initialize and reset
if new_string[num] in tagB:
tag = tagB.index(new_string[num])
while new_string[num] != tagC[tag]:
buffText.append(new_string[num])
num+=1
reqText.extend(buffText)
reqText= ''.join(reqText)
fileID = open('reqText.txt','w')
fileID.write(reqText)
fileID.close()
Here's a quick example I wrote using urllib which gets the contents of a page to a file:
import urllib
import urllib.request
myurl = "http://www.mysite.com"
sock = urllib.request.urlopen(myurl)
pagedata = str(sock.read())
sock.close()
file = open("output.txt","w")
file.write(pagedata)
file.close()
Then with a lot of string formatting you should be able to extract the parts of the html you want. This gives you something to get started from.

Submitting to a web form using python

I have seen questions like this asked many many times but none are helpful
Im trying to submit data to a form on the web ive tried requests, and urllib and none have worked
for example here is code that should search for the [python] tag on SO:
import urllib
import urllib2
url = 'http://stackoverflow.com/'
# Prepare the data
values = {'q' : '[python]'}
data = urllib.urlencode(values)
# Send HTTP POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
# Print the result
print html
yet when i run it i get the html soure of the home page
here is an example of using requests:
import requests
data= {
'q': '[python]'
}
r = requests.get('http://stackoverflow.com', data=data)
print r.text
same result! i dont understand why these methods arent working i've tried them on various sites with no success so if anyone has successfully done this please show me how!
Thanks so much!
If you want to pass q as a parameter in the URL using requests, use the params argument, not data (see Passing Parameters In URLs):
r = requests.get('http://stackoverflow.com', params=data)
This will request https://stackoverflow.com/?q=%5Bpython%5D , which isn't what you are looking for.
You really want to POST to a form. Try this:
r = requests.post('https://stackoverflow.com/search', data=data)
This is essentially the same as GET-ting https://stackoverflow.com/questions/tagged/python , but I think you'll get the idea from this.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
This makes a POST request with the data specified in the values. we need urllib to encode the url and then urllib2 to send a request.
Mechanize library from python is also great allowing you to even submit forms. You can use the following code to create a browser object and create requests.
import mechanize,re
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
br.open( "http://google.com" )
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'
br.submit()
resp = None
for link in br.links():
siteMatch = re.compile( 'www.foofighters.com' ).search( link.url )
if siteMatch:
resp = br.follow_link( link )
break
content = resp.get_data()
print content

Categories