Browser simulation - Python 3

I need to access a few HTML pages through a Python 3 script. The problem is that I need cookie functionality, so a simple urllib HTTP request won't work.
Any ideas?

Python 3's urllib has cookie support; look at urllib.request.HTTPCookieProcessor and http.cookiejar.
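A minimal sketch of wiring those two together (the httpbin URL is just an example endpoint that sets a cookie):
import http.cookiejar
import urllib.request

# A jar to hold cookies and an opener that stores/sends them automatically
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Requests made through this opener carry any cookies accumulated in the jar
opener.open('http://httpbin.org/cookies/set/example/value')
print([(c.name, c.value) for c in cj])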

Use requests.
>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print(r.cookies)
{'requests-is': 'awesome'}
Reference: http://docs.python-requests.org/en/latest/user/quickstart/#cookies
As of a few days ago, requests supports Python 3, though you might have to use one of the development branches; I'm not entirely sure about the status of upstream integration.

Prefix "http://" valid but actually "https://"

I have a long list of incomplete website addresses, some missing a prefix like "http://www.", etc.:
pewresearch.org
narod.ru
intel.com
xda-developers.com
oecd.org
I tried:
import requests
from lxml.html import fromstring
to_check = [
    "pewresearch.org",
    "narod.ru",
    "intel.com",
    "xda-developers.com",
    "oecd.org"]
for each in to_check:
    r = requests.get("http://www." + each)
    tree = fromstring(r.content)
    title = tree.findtext('.//title')
    print(title)
They returned:
Pew Research Center | Pew Research Center
Лучшие конструкторы сайтов | Народный рейтинг конструкторов для создания сайтов
Intel | Data Center Solutions, IoT, and PC Innovation
XDA Portal & Forums
Home page - OECD
It seems they all start with "http://www.", but that's not actually the case; for example, the correct address for the first one is "https://www.pewresearch.org/".
What's the quickest way, with an online tool or Python, to find their complete and correct addresses instead of keying them into a web browser one by one? (Some might be http, some https.)
Write a script / short program to send a HEAD request to each site. The server should respond with a redirect (e.g. to HTTPS). Follow each redirect until no further redirects are received.
The C# HttpClient can follow redirects automatically.
For Python, see jterrace's answer using the requests library, with the code snippet below:
>>> import requests
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r
<Response [200]>
>>> r.history
[<Response [301]>]
>>> r.url
u'https://github.com/'
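Applied to the list from the question, a rough sketch might look like this (note that some servers answer HEAD requests differently from GET, or not at all, so falling back to requests.get may be necessary):
import requests

to_check = ["pewresearch.org", "narod.ru", "intel.com",
            "xda-developers.com", "oecd.org"]
for each in to_check:
    # allow_redirects=True follows the redirect chain; the final address ends up in r.url
    r = requests.head("http://www." + each, allow_redirects=True)
    print(each, "->", r.url)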

Python - Request being blocked by Cloudflare

I am trying to log into a website. When I look at print(g.text) I am not getting back the web page I expect, but instead a Cloudflare page that says 'Checking your browser before accessing'.
import requests
import time

s = requests.Session()
s.get('https://www.off---white.com/en/GB/')
headers = {'Referer': 'https://www.off---white.com/en/GB/login'}
payload = {
    'utf8': '✓',
    'authenticity_token': '',
    'spree_user[email]': 'EMAIL#gmail.com',
    'spree_user[password]': 'PASSWORD',
    'spree_user[remember_me]': '0',
    'commit': 'Login'
}
r = s.post('https://www.off---white.com/en/GB/login', data=payload, headers=headers)
print(r.status_code)
g = s.get('https://www.off---white.com/en/GB/account')
print(g.status_code)
print(g.text)
Why is this occurring when I have set the session?
You might want to try this:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
It does not require a Node.js dependency.
All credit goes to the cloudscraper PyPI page.
This happens because the page uses Cloudflare's anti-bot page (or IUAM).
Bypassing this check is quite difficult on your own, since Cloudflare changes its techniques periodically. Currently, they check whether the client supports JavaScript, which can be spoofed.
I would recommend using the cfscrape module for bypassing this. To install it, use pip install cfscrape. You'll also need to install Node.js.
You can pass a requests session into create_scraper() like so:
import cfscrape
import requests

session = requests.Session()
session.headers = ...
scraper = cfscrape.create_scraper(sess=session)
I had the same problem because they implemented Cloudflare in the API; I solved it this way:
import cloudscraper
import json
scraper = cloudscraper.create_scraper()
r = scraper.get("MY API").text
y = json.loads(r)
print(y)
You can scrape any Cloudflare-protected page by using this tool. Node.js is mandatory in order for the code to work correctly.
Download Node from https://nodejs.org/en/
import cfscrape #pip install cfscrape
scraper = cfscrape.create_scraper()
res = scraper.get("https://www.example.com").text
print(res)
curl and hx avoid this problem. But how?
I found that they use HTTP/2 by default, while the requests library only uses HTTP/1.1.
So, for testing, I installed httpx with the h2 Python library to support HTTP/2 requests, and it works if I do: httpx --http2 'https://some.url'.
So the solution is to use a library that supports HTTP/2, for example httpx with h2.
It's not a complete solution, since it won't help to solve Cloudflare's anti-bot ("I'm Under Attack Mode", or IUAM) challenge.
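A minimal sketch of the same idea from a Python script, assuming httpx is installed with its HTTP/2 extra (pip install "httpx[http2]"); the URL is a placeholder:
import httpx

# http2=True needs the optional h2 dependency
with httpx.Client(http2=True) as client:
    r = client.get("https://some.url")
    print(r.http_version)   # e.g. "HTTP/2"
    print(r.status_code)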

Curl to Python Conversion

I'm trying to use the Twitch API in a Django [python] web application. I want to send a request and get information back, but I don't really know what I'm doing.
curl -H 'Accept: application/vnd.twitchtv.v2+json' -X GET \
https://api.twitch.tv/kraken/streams/test_channel
How do I convert this to Python?
Thanks
Using the builtin urllib2:
>>> import urllib2
>>> req = urllib2.Request('https://api.twitch.tv/kraken/streams/test_channel')
>>> req.add_header('Accept', 'application/vnd.twitchtv.v2+json')
>>> resp = urllib2.urlopen(req)
>>> content = resp.read()
If you're using Python 3.x, the module is called urllib.request, but otherwise you can do everything the same.
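For reference, the same request on Python 3 would look roughly like this (and, like the urllib2 version, it will raise an HTTPError if the API responds with an error status such as the 422 shown below):
>>> import urllib.request
>>> req = urllib.request.Request('https://api.twitch.tv/kraken/streams/test_channel')
>>> req.add_header('Accept', 'application/vnd.twitchtv.v2+json')
>>> resp = urllib.request.urlopen(req)
>>> content = resp.read()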
You could also use a third-party library for HTTP, like requests, which has a simpler API:
>>> import requests
>>> r = requests.get('https://api.twitch.tv/kraken/streams/test_channel',
headers={'Accept': 'application/vnd.twitchtv.v2+json'})
>>> print(r.status_code)
422 # <- on my machine, YMMV
>>> print(r.text)
{"status":422,"message":"Channel 'test_channel' is not available on Twitch",
"error":"Unprocessable Entity"}
I usually use urllib2 for my api requests in (blocking) python apps.
>>> import urllib2
>>> req = urllib2.Request('https://api.twitch.tv/kraken/streams/test_channel', None, {'Accept': 'application/vnd.twitchtv.v2+json'})
>>> response = urllib2.urlopen(req)
You can then access the text returned with response.read(). From there you can parse the JSON with your preferred library, though I generally just use json.loads(response.read()).
Keep in mind, though, that this is for Python 2.7; if you are using Python 3, the libraries have been moved around and this functionality can be found in urllib.request.

urllib2 does not read entire page

A portion of code that I have that parses a web site does not work.
I can trace the problem to the .read() function of my urllib2.urlopen object.
page = urllib2.urlopen('http://magiccards.info/us/en.html')
data = page.read()
Until yesterday this worked fine, but now the length of the data is always 69496 instead of 122989. When I open smaller pages, my code works fine.
I have tested this on Ubuntu, Linux Mint and windows 7. All have the same behaviour.
I'm assuming that something has changed on the web server, but the page is complete when I load it in a web browser. I have tried to diagnose the issue with Wireshark, but there the page is also received as complete.
Does anybody know why this may be happening or what I could try to determine the issue?
The page seems to be misbehaving unless you request the content encoded as gzip. Give this a shot:
import urllib2
import zlib
request = urllib2.Request('http://magiccards.info/us/en.html')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
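On Python 3, where urllib2 has become urllib.request, an equivalent sketch would be:
import gzip
import urllib.request

request = urllib.request.Request('http://magiccards.info/us/en.html')
request.add_header('Accept-Encoding', 'gzip')
response = urllib.request.urlopen(request)
# gzip.decompress handles the gzip-wrapped body directly
data = gzip.decompress(response.read())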
As Nathan suggested, you could also use the great Requests library, which accepts gzip by default.
import requests
data = requests.get('http://magiccards.info/us/en.html').text
Yes, the server is closing the connection and you need keep-alive to be sent. urllib2 does not have that facility (:-(). There used to be urlgrabber, which provided an HTTPHandler that works alongside a urllib2 opener, but unfortunately I don't find that working either. At the moment, you could use other libraries, like requests as demonstrated in the other answer, or httplib2.
import httplib2
h = httplib2.Http(".cache")
resp, content = h.request("http://magiccards.info/us/en.html", "GET")
print len(content)

Twitter authentication using cookielib

Can someone tell me why this doesn't work?
import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
data = urllib.urlencode({'session[username_or_email]': 'twitter handle', 'session[password]': 'password'})
opener.open('https://twitter.com', data)
stuff = opener.open('https://twitter.com')
print stuff.read()
Why doesn't this give the html of the page after logging in?
Please consider using an OAuth library for your task. Scraping the site using mechanize is not recommended, because Twitter can change the HTML-specific parts at any time and then your code will break.
Check this out: Python-twitter at http://code.google.com/p/python-twitter/
Simplest example to post an update:
>>> import twitter
>>> api = twitter.Api(consumer_key='yourConsumerKey',
                      consumer_secret='consumerSecret',
                      access_token_key='accessToken',
                      access_token_secret='accessTokenSecret')
>>> api.PostUpdate('Blah blah lbah!')
There can be many reasons why it is failing:
Twitter probably expects a User-Agent header, which you are not providing.
I didn't look at the HTML, but maybe there's some JavaScript at play before the form is actually submitted (I actually think this is the case, because I vaguely remember writing a very detailed answer on this exact thing, and I don't seem to find the link to it!).
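For the first point, a rough sketch of adding a User-Agent header to the opener from the question (the header value is only an example string):
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Replace the default Python-urllib User-Agent with a browser-like one
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36')]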
