I'm trying to programmatically access a website.
from robobrowser import RoboBrowser
import sys
browser = RoboBrowser(history=True)
browser.open('https://test.com/login')
loginForm = browser.get_form()
loginForm['UserName']='username'
loginForm['Password']='*'
browser.submit_form(loginForm)
if browser.response.ok:
    if browser.response.content[2]=='false':
        print browser.response.content[4]
        sys.exit(1)
The website returned JSON (at least I think it's JSON), but I can't seem to find a RoboBrowser API for dealing with JSON.
{"RedirectUrl":null,"IsSuccess":false,"Message":null,"CustomMessage":null,"Errors":[{"Key":"CaptchaValue","Value":["Your response did not match. Please try again."]}],"Messages":{},"HasView":true.......}
As you can see, I want to test "IsSuccess" and print the error message. How can I proceed in this case?
thanks
Found a solution using the json module:
import json
from StringIO import StringIO
data = json.load(StringIO(browser.response.content))
and for Python 3.x this works:
import io
import json
json.load(io.BytesIO(browser.response.content))
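Once the body is parsed, checking "IsSuccess" and pulling out the error messages is plain dictionary work. A minimal sketch using the sample payload from the question (no network needed, so `json.loads` on the raw bytes is enough):

```python
import json

# The (truncated) JSON body the login endpoint returned, per the question.
raw = b'{"RedirectUrl":null,"IsSuccess":false,"Message":null,"Errors":[{"Key":"CaptchaValue","Value":["Your response did not match. Please try again."]}]}'

data = json.loads(raw.decode('utf-8'))
if not data['IsSuccess']:
    for error in data['Errors']:
        # Each entry maps a field name to a list of messages for that field.
        print(error['Key'], '->', '; '.join(error['Value']))
# → CaptchaValue -> Your response did not match. Please try again.
```

With a live response you would pass `browser.response.content` through the same `json.loads` call instead of the hard-coded bytes.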
Background:
Typically, if I want to see what kind of requests a website is making, I open Chrome Developer Tools (F12), go to the Network tab and filter the requests I want to see.
Example:
Once I have the request URL, I can simply parse the URL for the query string parameters I want.
This is a very manual task and I thought I could write a script that does this for any URL I provide. I thought Python would be great for this.
Task:
I have found a library called requests that I use to validate the URL before opening.
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure of how to get the requests that the URL I enter receives. Is this possible in Python? A pointer in the right direction would be great. Once I know how to access these request headers, I can easily parse through them.
Thank you.
You can use the urlparse function to fetch the query params.
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in Python 2.7.
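In Python 3, `urlparse` moved into `urllib.parse`, and `parse_qs` will split the query string into a dict for you. A sketch reusing the example query string from the result above:

```python
from urllib.parse import urlparse, parse_qs

# Python 3: urlparse lives in urllib.parse now.
url = 'http://www.google.com/?gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI'

query = urlparse(url).query
print(query)             # gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI

# parse_qs maps each parameter name to a list of values.
params = parse_qs(query)
print(params['gfe_rd'])  # ['cr']
```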
I was using BeautifulSoup (bs4) in Python to download a wallpaper from nmgncp.com.
However, the code downloads only a 16KB file, whereas the full image is around 300KB.
Please help me. I have even tried the wget.download method.
PS:- I am using Python 3.6 on Windows 10.
Here is my code:
from bs4 import BeautifulSoup
import requests
import datetime
import time
import re
import wget
import os
url='http://www.nmgncp.com/dark-wallpaper-1920x1080.html'
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
a = soup.findAll('img')[0].get('src')
newurl='http://www.nmgncp.com/'+a
print(newurl)
response = requests.get(newurl)
if response.status_code == 200:
with open("C:/Users/KD/Desktop/Python_practice/newwww.jpg", 'wb') as f:
f.write(response.content)
The source of your problem is a protection mechanism: the image URL requires a Referer header, otherwise the server redirects to the HTML page.
Fixed source code:
from bs4 import BeautifulSoup
import requests
url='http://www.nmgncp.com/dark-wallpaper-1920x1080.html'
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
a = soup.findAll('img')[0].get('src')
newurl='http://www.nmgncp.com'+a
print(newurl)
response = requests.get(newurl, headers={'referer': url})  # the gallery page acts as the Referer
if response.status_code == 200:
with open("C:/Users/KD/Desktop/Python_practice/newwww.jpg", 'wb') as f:
f.write(response.content)
First of all, http://www.nmgncp.com/dark-wallpaper-1920x1080.html is an HTML document. Second, when you try to download an image by its direct URL (like http://www.nmgncp.com/data/out/95/4351795-dark-wallpaper-1920x1080.jpg), it will also redirect you to an HTML document. This is most probably because the host (nmgncp.com) does not want to provide direct links to its images. It can check whether the image was requested directly by looking at the HTTP Referer header and deciding whether it is valid. So in this case you have to put in some more effort to make the host think you are a valid caller of direct URLs.
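The same Referer trick works with the standard library too. A minimal sketch (reusing the URLs from the question) that attaches the header with `urllib.request`; with requests this is what the `headers=` argument does:

```python
import urllib.request

# Attach a Referer so the server believes the image was loaded
# from its own gallery page (URLs taken from the question above).
page_url = 'http://www.nmgncp.com/dark-wallpaper-1920x1080.html'
image_url = 'http://www.nmgncp.com/data/out/95/4351795-dark-wallpaper-1920x1080.jpg'

req = urllib.request.Request(image_url, headers={'Referer': page_url})
print(req.get_header('Referer'))
# urllib.request.urlopen(req) would then fetch the actual image bytes.
```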
I want to retrieve data from a website called myip.ms. I'm using requests to send data to the form, and then I want the response page back. When I run the script it returns the same page (the homepage) in the response, instead of the results for the query I provide. I'm new to web scraping. Here's the code I'm using:
import requests
from urllib.parse import urlencode, quote_plus
payload={
'name':'educationmaza.com',
'value':'educationmaza.com',
}
payload=urlencode(payload)
r=requests.post("http://myip.ms/s.php",data=payload)
infile=open("E://abc.html",'wb')
infile.write(r.content)
infile.close()
I'm no expert, but it appears that the form post is handled by JavaScript (jQuery), which requests does not execute.
As such, you would have to use the Selenium module to interact with the page.
The following code should do what you want:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("https://myip.ms/s.php")
driver.find_element_by_id("home_txt").send_keys('educationmaza.com')
driver.find_element_by_id("home_submit").click()
html = driver.page_source
infile=open("stack.html",'w')
infile.write(html)
infile.close()
You will have to install the Selenium package, as well as PhantomJS. (Note that PhantomJS is no longer maintained; with current Selenium releases you would use headless Chrome or Firefox instead.)
I have tested this code, and it works fine. Let me know if you need any further help!
The printed HTML comes back as garbled text, instead of what I expect to see with "view source" in the browser.
Why is that? How to fix it easily?
Thank you for your help.
Same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html
I got the same garbled text using curl
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped. So this shows the correct HTML for me.
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised answer after reading above is:
import urllib
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
html now holds the decompressed HTML (print it to see).
Try the requests library: Python Requests.
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is that the site uses gzip encoding. To my knowledge urllib2 doesn't decompress gzip automatically, so you end up with compressed HTML responses for sites that use that encoding. You can confirm this by printing the content headers from the response like so:
print response.headers
There you will see that the "Content-Encoding" is gzip. To get around this using the standard urllib2 library you'd need to use the gzip module. Mechanize has the same problem because it uses the same urllib machinery. Requests handles this encoding and formats the response nicely for you.
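For reference, this is roughly what requests does behind the scenes. A sketch using the stdlib gzip module in Python 3; a sample body is compressed locally so the example needs no network:

```python
import gzip

# Stand-in for a gzipped HTTP body (normally this would be response.read()
# from a server that sent Content-Encoding: gzip).
body = b'<html><body>NCERT textbooks</body></html>'
compressed = gzip.compress(body)

# Decompressing recovers the original HTML.
html = gzip.decompress(compressed).decode('utf-8')
print(html)
# → <html><body>NCERT textbooks</body></html>
```

In Python 3, `gzip.decompress` replaces the `StringIO` + `GzipFile` dance shown in the earlier answer.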
Can someone tell me why this doesn't work?
import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
data = urllib.urlencode({'session[username_or_email]':'twitter handle' , 'session[password]':'password'})
opener.open('https://twitter.com' , data)
stuff = opener.open('https://twitter.com')
print stuff.read()
Why doesn't this give the html of the page after logging in?
Please consider using an OAuth library for your task. Scraping the site this way is not recommended because Twitter can change its HTML-specific markup at any time, and then your code will break.
Check this out: Python-twitter at http://code.google.com/p/python-twitter/
Simplest example to post an update:
>>> import twitter
>>> api = twitter.Api(
consumer_key='yourConsumerKey',
consumer_secret='consumerSecret',
access_token_key='accessToken',
access_token_secret='accessTokenSecret')
>>> api.PostUpdate('Blah blah lbah!')
There can be many reasons why it is failing:
Twitter probably expects a User-Agent header, which you are not providing.
I didn't look at the HTML, but maybe there's some JavaScript at play before the form is actually submitted. (I actually think this is the case; I vaguely remember writing a very detailed answer on this exact thing, but I can't seem to find the link to it!)
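On the first point, attaching a User-Agent to a cookie-aware opener is straightforward. A Python 3 sketch of the question's setup (http.cookiejar and urllib.request replace cookielib and urllib2; the User-Agent string is just an illustrative browser-like value):

```python
import http.cookiejar
import urllib.request

# Cookie-aware opener, as in the question, but with a User-Agent header
# attached so the server doesn't see the default Python client string.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')]

# Every request made through opener.open() now sends this header
# (and carries cookies between requests via the jar).
print(dict(opener.addheaders)['User-Agent'])
```

This addresses the missing header, but not any JavaScript-driven form handling; for that, an OAuth API client or a real browser driver is still the robust route.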