I want to use cookies copied from my Chrome browser, but my code raises a lot of errors.
import urllib.request
import re

def open_url(url):
    header = {"User-Agent": r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    Cookies = {'Cookie': r"xxxxx"}
    Request = urllib.request.Request(url=url, headers=Cookies)
    response = urllib.request.urlopen(Request, timeout=100)
    return response.read().decode("utf-8")
Where does my code go wrong? Is it the headers=Cookies part?
The correct way when using urllib.request is to use an OpenerDirector populated with an HTTPCookieProcessor:
cookieProcessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(cookieProcessor)
Then you use the opener, and it will automagically process the cookies:
response = opener.open(request,timeout=100)
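Putting the pieces together with your original user-agent header, a minimal sketch could look like this:
# Keep the browser-like User-Agent header, but let the opener's
# HTTPCookieProcessor manage cookies instead of a hand-built Cookie header.
import urllib.request

def open_url(url):
    headers = {"User-Agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    request = urllib.request.Request(url=url, headers=headers)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
    response = opener.open(request, timeout=100)
    return response.read().decode("utf-8")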
By default, the CookieJar (http.cookiejar.CookieJar) used is a simple in-memory store, but you can use a FileCookieJar if you need long-term storage of persistent cookies, or even an http.cookiejar.MozillaCookieJar if you want to use persistent cookies stored in the now-legacy Mozilla cookies.txt format.
If you want to use the cookies that already exist in your web browser, you must first store them in a cookies.txt-compatible file and load them into a MozillaCookieJar (a loading sketch follows the field list below). For Mozilla Firefox, you can find an add-on called Cookie Exporter. For other browsers, you must manually create a cookies.txt file by reading the content of the cookies you need in your browser. The format can be found in The Unofficial Cookie FAQ. Extracts:
... each line contains one name-value pair. An example cookies.txt file may have an entry that looks like this:
.netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 100103
Each line represents a single piece of stored information. A tab is inserted between each of the fields.
From left-to-right, here is what each field represents:
domain - The domain that created AND that can read the variable.
flag - A TRUE/FALSE value indicating if all machines within a given domain can access the variable. This value is set automatically by the browser, depending on the value you set for domain.
path - The path within the domain that the variable is valid for.
secure - A TRUE/FALSE value indicating if a secure connection with the domain is needed to access the variable.
expiration - The UNIX time that the variable will expire on. UNIX time is defined as the number of seconds since Jan 1, 1970 00:00:00 GMT.
name - The name of the variable.
value - The value of the variable.
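Once you have such a file, a minimal sketch to load it (assuming the exported file is named cookies.txt in the working directory) looks like this:
# Load browser-exported cookies from a cookies.txt file into a
# MozillaCookieJar and attach it to an opener.
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.MozillaCookieJar('cookies.txt')
cookie_jar.load(ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
response = opener.open('http://example.com', timeout=100)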
But the normal way is to mimic a full session and extract the cookies automatically from the responses.
"When receiving an HTTP request, a server can send a Set-Cookie header with the response. The cookie is usually stored by the browser and, afterwards, the cookie value is sent along with every request made to the same server as the content of a Cookie HTTP header" extracted from mozilla site.
This link will help give you some knowledge about headers and HTTP requests. Please go through it; it might answer a lot of your questions.
You can use a better library (IMHO) - requests.
import requests
headers = {
'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
cookies = dict(c1="cookie_numba_one")
r = requests.get('http://example.com', headers = headers, cookies = cookies)
print(r.text)
I'm trying to scrape data from an Autotrader page. I managed to grab the link to every offer on that page, but when I try to get the data from each offer I get a 403 response status even though I'm using a header.
What more can I do to get past it?
headers = {"User Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/85.0.4183.121 Safari/537.36'}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code) # 403 forbidden
content_of_page = page.content
soup = bs4.BeautifulSoup(content_of_page, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text)
[for people in the same position: Autotrader uses Cloudflare to protect every "car-details" page, so I would suggest using Selenium, for example]
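As an illustration of that suggestion, a hedged Selenium sketch (assuming Chrome and a matching chromedriver are installed) might look like this:
# Drive a real browser so the Cloudflare JavaScript check can run,
# then hand the rendered HTML to BeautifulSoup as before.
import bs4
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.autotrader.co.uk/car-details/202010145012219")
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()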
If you can manage to get the data via your browser, i.e. you can somehow see this data on a website, then you can likely replicate that with requests.
Briefly, you need the headers in your request to match the browser's request:
Open dev tools in your browser (e.g. F12, or cmd+opt+I, or click through the menu)
Open the Network tab
Reload the page (the whole website or the target request's URL only, whatever provides a desired response from the server)
Find an HTTP request to the desired URL in the Network tab. Right-click it, click 'Copy...', and choose the option (e.g. curl) you need.
Your browser sends tons of extra headers, and you never know which ones are actually checked by the server, so this technique will save you a lot of time.
However, this might fail if there's some protection against blunt request copies, e.g. temporary tokens, so the requests cannot be reused. In that case you need Selenium (browser emulation/automation); it's not difficult, so it's worth using.
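For example, a sketch with a few headers copied from such a cURL export (the header values below are placeholders; copy the real ones from your own browser):
# Replay the browser's request by reusing its headers. This may still
# be blocked if the site uses token- or JavaScript-based protection.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.9',
    'Referer': 'https://www.autotrader.co.uk/',
}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code)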
I am trying to log in to a website using the Python requests module.
Website: http://www.way2sms.com/
I use POST to submit the form data. Following is the code that I use.
import requests as r

URL = "http://www.way2sms.com"
data = {'mobileNo': '###', 'password': '#####'}
header = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.42 Safari/537.36'}
sess = r.Session()
x = sess.post(URL, data=data, headers=header)
print x.status_code
I don't seem to find a way to validate whether the login was successful. Also, the response status is always 200 whether I enter the right login details or not.
My whole intention is to log in and then send text messages using this website (I know I could have used some API), but I am unable to tell whether I have logged in successfully.
Also, this website uses some kind of JSESSIONID (I don't know much about that) to maintain the session.
If you inspect the traffic in your browser's dev tools, you can see the site submits an AJAX request to www.way2sms.com/re-login, so it would be better to submit your request directly there and then check the response (returned content).
Something like this would help:
import requests

session = requests.Session()
URL = 'http://www.way2sms.com/re-login'
data = {'mobileNo': '94########', 'password': 'pass'}  # Make sure to remove '+' from your number
post = session.post(URL, data=data)
if post.text != 'login-reg':  # 'login-reg' was returned when I input invalid credentials
    print('Login successful')
else:
    print(post.text)
Since I don't have an account there, you may also need to check what a successful response looks like.
Check if the response object contains the cookie you're looking for, namely JSESSIONID.
if x.cookies.get('JSESSIONID'):
print 'Login successful.'
I am using Python (http://python-wordpress-xmlrpc.readthedocs.io/en/latest/) to connect to WordPress to post content.
I have a few WordPress sites to which I connect using sitename.com/xmlrpc.php.
However, one of my sites recently started reporting a problem while connecting, saying the response is not valid XML. When I view the page in a browser I see the usual "XML-RPC server accepts POST requests only.", but when I connect using Python I see the following message:
function toNumbers(d){var e=[];d.replace(/(..)/g,function(d){e.push(parseInt(d,16))});return e}
function toHex(){for(var d=[],d=1==arguments.length&&arguments[0].constructor==Array?arguments[0]:arguments,e="",f=0;f<d.length;f++)e+=(16>d[f]?"0":"")+d[f].toString(16);return e.toLowerCase()}
var a=toNumbers("f655ba9d09a112d4968c63579db590b4"),b=toNumbers("98344c2eee86c3994890592585b49f80"),c=toNumbers("c299e542498206cd9cff8fd57dfc56df");
document.cookie="__test="+toHex(slowAES.decrypt(c,2,a,b))+"; expires=Thu, 31-Dec-37 23:55:55 GMT; path=/";
location.href="http://targetDomainNameHere.com/xmlrpc.php?i=1";
This site requires Javascript to work, please enable Javascript in your browser or use a browser with Javascript support
I searched for the file aes.js, with no luck.
How do I get this working? How do I remove this? I am using the latest version of WordPress as of 7 Nov 2017.
You can try to pass a "User-Agent" header in the request. Generally, a Java or Python library will use its own name and version as the User-Agent, which allows the WordPress server to block it.
Overriding the User-Agent header with a browser-like value can help get data from some WordPress servers. The value can look like: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
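With python-wordpress-xmlrpc, one way to try this is a custom XML-RPC transport. A hedged sketch, assuming Client accepts a transport argument and with placeholder URL and credentials (Python 2 shown; on Python 3 use xmlrpc.client instead of xmlrpclib):
import xmlrpclib

from wordpress_xmlrpc import Client

class BrowserLikeTransport(xmlrpclib.Transport):
    # xmlrpclib.Transport sends something like "xmlrpclib.py/x.y" by
    # default; user_agent is a class attribute that can be overridden.
    user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/62.0.3202.94 Safari/537.36')

client = Client('http://sitename.com/xmlrpc.php', 'username', 'password',
                transport=BrowserLikeTransport())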
I am using urlfetch.fetch in App engine using Python 2.7.
I tried fetching 2 URLs belonging to 2 different domains. For the first one, the result of urlfetch.fetch includes results after resolving XHR queries that are made for getting recommended products.
However for the other page belonging to another domain, the XHR queries are not resolved and I just get the plain HTML for the most part. The XHR queries for this page are also made for purposes of getting recommended products to show, etc.
Here is how I use urlfetch:
fetch_result = urlfetch.fetch(url, deadline=5, validate_certificate=True)
URL 1 (the one where XHR is resolved and the response is complete)
https://www.walmart.com/ip/HP-15-f222wm-ndash-15.6-Laptop-Touchscreen-Windows-10-Home-Intel-Pentium-Quad-Core-Processor-4GB-Memory-500GB-Hard-Drive/53853531
URL 2 (the one where I just get the plain HTML for the most part)
https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU
Can someone please advise what I may be missing regarding this inconsistency?
The server is serving different output based on the user-agent string supplied in the request headers.
By default, urlfetch.fetch will send requests with the user agent header set to something like AppEngine-Google; (+http://code.google.com/appengine; appid: myapp.appspot.com).
A browser will send a user agent header like this: Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
If you override the default headers for urlfetch.fetch
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}
urlfetch.fetch(url, headers=headers)
you will find that the html that you receive is almost identical to that served to the browser.
Is there a way to find the user agent and global (public) IP in a particular JSON format? Help me out on this.
Here is what I am trying. I have partial success: I can get the global IP, but no information about the user agent.
import requests, json
r = requests.get('http://httpbin.org/ip').json()
print r['origin']
The above code returns me the global IP, but I also want some information about the platform from which I am connected to the particular URL, e.g. 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'.
Considering you have a JSON object as a string (possibly read from a file?), you would first want to convert it into a Python dictionary object:
import json
request_details = json.loads('{"user-agent": "Chrome", "remote_address": "64.10.1.1"}')
print request_details["user-agent"]
print request_details["remote_address"]
OR
If you are talking about a request that comes to the server, the user agent is part of the request headers, and the remote address is added later in the network layer. Different web frameworks have different ways of letting you access these values. For example, Django lets you access them through the HttpRequest.META dictionary, while Flask gives you request.headers.get("user-agent") and request.remote_addr.
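For instance, a minimal Flask sketch (the /whoami route is hypothetical) that echoes both values back as JSON:
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/whoami')
def whoami():
    # Echo the caller's user agent and remote address as JSON.
    return jsonify({
        'user-agent': request.headers.get('user-agent'),
        'remote_address': request.remote_addr,
    })

if __name__ == '__main__':
    app.run()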
You can
import requests, json
r = requests.get('https://httpbin.org/user-agent').json()
print r['user-agent']
but I would do that only when I want to verify the user-agent I'm setting in my request header.