Is there a way to get the user-agent and global IP in a particular JSON format? Please help me out with this.
Here is what I am trying. I have partial success in getting the global IP, but no information about the user-agent.
import requests, json
r = requests.get('http://httpbin.org/ip').json()
print(r['origin'])
The above code returns the global IP, but I also want information about the platform from which I am connecting to the URL, e.g. 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'.
Considering you have a JSON object as a string (possibly read from a file?), you would first want to convert it into a Python dictionary:
import json
request_details = json.loads('{"user-agent": "Chrome", "remote_address": "64.10.1.1"}')
print(request_details["user-agent"])
print(request_details["remote_address"])
OR
If you are talking about a request that comes to the server, the user-agent is part of the request headers, while remote_address is added later in the network layer. Different web frameworks give you different ways to access these values. For example, Django lets you read them from the HttpRequest.META dictionary. Flask gives you request.headers.get("user-agent") and request.remote_addr.
You can also ask httpbin which User-Agent it received:
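As a framework-neutral illustration of the server-side case, here is a minimal sketch of a plain WSGI application that returns both values as JSON. The names `client_info` and `app` are made up for this example; `HTTP_USER_AGENT` and `REMOTE_ADDR` are the standard WSGI/CGI environ keys, which is also how Django's META exposes them.

```python
import json


def client_info(environ):
    """Pick the user-agent and remote address out of a WSGI environ dict."""
    return {
        "user-agent": environ.get("HTTP_USER_AGENT", ""),
        "remote_address": environ.get("REMOTE_ADDR", ""),
    }


def app(environ, start_response):
    """Tiny WSGI app (hypothetical) that answers every request with the client info."""
    body = json.dumps(client_info(environ)).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

It could be served locally with wsgiref.simple_server.make_server('', 8000, app). Behind a reverse proxy, REMOTE_ADDR would typically be the proxy's address rather than the client's.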
import requests, json
r = requests.get('https://httpbin.org/user-agent').json()
print(r['user-agent'])
but I would do that only when I want to verify the user-agent I'm setting in my request header.
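To get both values in one JSON object, one option is httpbin's /get endpoint, which echoes the caller's IP under "origin" and the request headers under "headers". The sketch below uses only the standard library; the function names and the output keys "ip" and "user-agent" are choices made for this example.

```python
import json
import urllib.request


def summarize(payload):
    """Reduce an httpbin /get style payload to the IP and user-agent."""
    headers = payload.get("headers", {})
    return {"ip": payload.get("origin"), "user-agent": headers.get("User-Agent")}


def fetch_summary(url="https://httpbin.org/get", timeout=10):
    """Call the echo endpoint and summarize its JSON reply (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize(json.load(response))
```

Calling print(json.dumps(fetch_summary())) would then print a single JSON object. The user-agent it reports is the one your HTTP client sends, not a browser's.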
I am trying to send an HTTP GET request to a certain website, for example https://www.united.com, but it gets stuck with no response.
Here is the code:
from urllib.request import urlopen
url = 'https://www.united.com'
resp = urlopen(url,timeout=10 )
Every time, it times out. But the same code works for other URLs, for example https://www.aa.com.
So I wonder what is behind https://www.united.com that keeps my HTTP request from getting through. Thank you!
Update:
Adding a request header still doesn't work for this site:
import urllib.request
from urllib.request import urlopen
url = 'https://www.united.com'
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
)
resp = urlopen(req,timeout=3 )
The united.com server may respond only to requests that carry certain User-Agent strings or other headers, and drop the rest. In that case you have to send headers that the server accepts. This varies from site to site: some sites add this kind of filtering for security and are strict about which clients may access them.
I have recently been working on occasional data-intensive projects and need to gather data from e-commerce platforms like Amazon, so I created a web scraping program in Python. I'm using the requests library along with a list of user agents and proxies; however, I think they are not working, and this is causing the program to fail. Note that the Amazon API is limited in terms of content and access rates and is not suitable for my needs.
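As an illustration of sending a fuller set of browser-like headers, here is a small sketch that only builds the request object. The Accept and Accept-Language values are assumptions typical of a desktop browser, and `build_request` is a name made up for this example; whether any particular header set satisfies a given site is not something this code can guarantee.

```python
import urllib.request

# Browser-like defaults; the Accept* values are illustrative assumptions.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Return a Request carrying the default headers plus any overrides."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

The result can be passed to urlopen(build_request(url), timeout=10). Note that urllib stores header names capitalized, so req.get_header('User-agent') is the lookup that works.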
Here's how I send requests:
import requests
import random
session = requests.session()
proxies = [{'https:': 'https://' + item.rstrip(), 'http':
'http://' + item.rstrip()} for item in open('proxies.txt').readlines()]
user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
print(session.get('https://icanhazip.com', proxies=random.choice(proxies), headers=user_agent).text)
However, I keep getting the same IP address printed, which means the proxies are not working this way. The proxies.txt file contains proxies in this format:
ex:
178.168.19.139:30736
342.552.34.456:8080
...
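What is the best way to work around the captchas and robot checks presented by Amazon using these tools (or extra tools, if you have any suggestions), and why are the proxies failing to work?
I'm not sure if this will work for you, but I found that removing the protocol at the start of the IP within the dictionary solved the problem. Note that the key must be 'https', not 'https:' as in your code: requests looks up proxies by URL scheme, so a key with a trailing colon is never matched and the request goes out directly.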
proxies = [{'https': item.rstrip(), 'http': item.rstrip()} for item in open('proxies.txt').readlines()]
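The same idea as a small helper, written as a sketch: it skips blank lines and gives every proxy an explicit scheme. It assumes the entries in proxies.txt are plain HTTP proxies, which is why both keys point to an http:// URL; `build_proxy_maps` is a name made up for this example.

```python
def build_proxy_maps(lines, scheme="http"):
    """Turn 'host:port' lines into proxy dicts usable as requests' proxies= argument."""
    maps = []
    for line in lines:
        address = line.strip()
        if not address:
            continue  # ignore blank lines in the file
        proxy_url = f"{scheme}://{address}"
        # Keys are URL schemes without a colon: 'http' and 'https'.
        maps.append({"http": proxy_url, "https": proxy_url})
    return maps
```

With the file from the question this could be used as proxies = build_proxy_maps(open('proxies.txt')), followed by random.choice(proxies) as before.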
I have a Flask application running on an IIS server. Everything works fine; however, I always get a timeout error when using requests.
import requests
r = requests.get('https://github.com')
Using web services is therefore impossible.
I have tried using headers with the requests. But still the same result:
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get('https://github.com', headers=headers)
I also tried increasing the timeout limits, both in code and in IIS.
I also tried changing the Identity field under the Process Model section to LocalSystem.
I'm not familiar with IIS and I cannot think of anything else. I need help.
According to your description, I think this issue is not related to IIS. It looks like a network issue.
I suggest you first check your server's firewall to make sure the server is allowed to access the internet.
If you need to use a proxy to access the internet, I suggest you try adding the settings below to the web.config of your Flask application.
<system.net>
<defaultProxy>
<proxy
proxyaddress="The IP address"
bypassonlocal="true"
/>
</defaultProxy>
</system.net>
For details, see this article.
I want to use cookies copied from my Chrome browser, but I get many errors.
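On the Python side, a quick check is to look at which proxies the process actually picks up from its environment. The sketch below relies on urllib.request.getproxies(), which reads variables such as http_proxy and https_proxy; requests honors the same variables by default. The helper names and the example proxy address are hypothetical.

```python
import os
import urllib.request


def set_proxy_env(proxy_url):
    """Export the proxy for both schemes for the current process only."""
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url


def proxy_for(scheme):
    """Return the proxy URL the standard library would use for a scheme, or None."""
    return urllib.request.getproxies().get(scheme)
```

For example, set_proxy_env('http://proxy.example:8080') followed by proxy_for('https') shows what an outgoing HTTPS request would be routed through. If proxy_for returns None in the deployed application, no proxy is configured for that process.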
import urllib.request
import re
def open_url(url):
    header={"User-Agent":r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    Cookies={'Cookie':r"xxxxx"}
    Request=urllib.request.Request(url=url,headers=Cookies)
    response=urllib.request.urlopen(Request,timeout=100)
    return response.read().decode("utf-8")
Where does my code go wrong? Is it headers=Cookies?
The correct way when using urllib.request is to use an OpenerDirector populated with an HTTPCookieProcessor:
cookieProcessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(cookieProcessor)
then you use opener and it will automagically process the cookies:
response = opener.open(request,timeout=100)
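Put together, the opener setup might look like the sketch below. The function name `build_cookie_opener` is made up for this example; setting `opener.addheaders` is one way to send the User-Agent that the question defined but never passed along.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(user_agent=None):
    """Create an opener that stores and resends cookies via an in-memory jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    if user_agent:
        # Replaces the default 'Python-urllib/x.y' User-Agent for every request.
        opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

After opener.open(url, timeout=100), iterating over jar lists the cookies the server set, and later calls through the same opener send them back.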
By default, the CookieJar (http.cookiejar.CookieJar) used is a simple in-memory store, but you can use a FileCookieJar if you need long-term storage of persistent cookies, or even an http.cookiejar.MozillaCookieJar if you want to use persistent cookies stored in the now legacy Mozilla cookies.txt format.
If you want to use cookies that exist in your web browser, you must first store them in a cookies.txt compatible file and load them into a MozillaCookieJar. For Mozilla, you can find an add-on, Cookie Exporter. For other browsers, you must manually create a cookies.txt file by reading the content of the cookies you need from your browser. The format can be found in The Unofficial Cookie FAQ. Extracts:
... each line contains one name-value pair. An example cookies.txt file may have an entry that looks like this:
.netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 100103
Each line represents a single piece of stored information. A tab is inserted between each of the fields.
From left-to-right, here is what each field represents:
domain - The domain that created AND that can read the variable.
flag - A TRUE/FALSE value indicating if all machines within a given domain can access the variable. This value is set automatically by the browser, depending on the value you set for domain.
path - The path within the domain that the variable is valid for.
secure - A TRUE/FALSE value indicating if a secure connection with the domain is needed to access the variable.
expiration - The UNIX time that the variable will expire on. UNIX time is defined as the number of seconds since Jan 1, 1970 00:00:00 GMT.
name - The name of the variable.
value - The value of the variable.
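To make the format concrete, here is a sketch that writes the sample entry above to a temporary file and loads it with MozillaCookieJar. The first line of the file is the header that MozillaCookieJar expects. Because the sample cookie expired in 1999, the load passes ignore_expires=True; the helper names are made up for this example.

```python
import http.cookiejar
import os
import tempfile

# Header line required by MozillaCookieJar, then one tab-separated cookie entry.
SAMPLE = (
    "# Netscape HTTP Cookie File\n"
    ".netscape.com\tTRUE\t/\tFALSE\t946684799\tNETSCAPE_ID\t100103\n"
)


def load_cookie_file(path):
    """Load a cookies.txt file, keeping session and already expired cookies."""
    jar = http.cookiejar.MozillaCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar


def demo():
    """Write SAMPLE to a temporary file and return the cookies as a dict."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(SAMPLE)
        return {cookie.name: cookie.value for cookie in load_cookie_file(path)}
    finally:
        os.remove(path)
```

A jar loaded this way can be handed to urllib.request.HTTPCookieProcessor(jar) so the opener sends the stored cookies.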
But the normal way is to mimic a full session and automatically extract the cookies from the responses.
"When receiving an HTTP request, a server can send a Set-Cookie header with the response. The cookie is usually stored by the browser and, afterwards, the cookie value is sent along with every request made to the same server as the content of a Cookie HTTP header" extracted from mozilla site.
This link will give you some background on headers and HTTP requests. Please go through it; it might answer a lot of your questions.
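The Set-Cookie to Cookie round trip described in the quote can be sketched with the standard library's http.cookies module. The function name and the sample header value are made up for this example; a real client would also honor attributes such as Path, Domain and expiry, which a cookie jar does for you.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie header value into the value of a Cookie request header."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    # Only name=value pairs are sent back; attributes such as Path stay client-side.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in parsed.items())
```

For example, a response header value of 'sessionid=abc123; Path=/; HttpOnly' leads to later requests carrying the header Cookie: sessionid=abc123.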
You can use a better library (IMHO) - requests.
import requests
headers = {
'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
cookies = dict(c1="cookie_numba_one")
r = requests.get('http://example.com', headers = headers, cookies = cookies)
print(r.text)
I have a URL such as
http://www.example-url.com/content?param1=1&param2=2
in particular I am testing it on
http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json
How do I get the content of such a URL so that the GET parameters are taken into account as well?
How can I save it to file?
How can I access multiple URLs like this either in parallel or asynchronously (saving to file on response received callback like in JavaScript)?
I have tried
import urllib
urllib.urlretrieve("http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json", "file.json")
but I am getting the content of http://ws.parlament.ch/votes/councillors instead of the JSON I want.
You can use urllib, but there are other libraries I know of which make it a lot easier in different situations. For example, if you also need user authentication, you can use Requests.
For this situation you can use httplib2, for example. Here is a clean, small piece of code which takes the GET into consideration (source).
import httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://example.org/", "GET")
It seems that you need to set the user agent of the connection; otherwise the server will refuse to give you the data. I also use urllib2.Request() instead of the standard urlretrieve() or urlopen() alone, mostly because it allows both GET and POST requests and lets the programmer set the user agent.
import urllib2, json
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
header = { 'User-Agent' : user_agent }
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"
request = urllib2.Request(fullurl, headers=header)
data = urllib2.urlopen(request)
print json.loads(data.read())
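If the query parameters are assembled in code rather than typed into the URL, urllib.parse.urlencode (Python 3) takes care of the ? and & separators and of escaping. A minimal sketch, with `build_url` as a made-up helper name:

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL that has no query yet."""
    return f"{base}?{urlencode(params)}"
```

For the URL above this would be called as build_url('http://ws.parlament.ch/votes/councillors', {'councillorNumberFilter': 2565, 'format': 'json'}).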
Some extra information about headers in python
If you want to keep using httplib2, here is the code for it:
import httplib2
header = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"
http = httplib2.Http(".cache")
response, content = http.request(fullurl, "GET", headers=header)
print content
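The data printed by my last example can be saved to a file: parse it first with json.loads(content), then call json.dump(data, f), where f is a file object opened for writing. Note that the object to serialize comes first and the file object second.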
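For the remaining parts of the question (saving to a file and fetching several URLs in parallel), here is a Python 3 sketch using a thread pool. The `fetch` argument is a placeholder for whatever function downloads and parses one URL, for example one built on the code above; the function names and the response_N.json file naming are choices made for this example.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def save_json(data, path):
    """Write already parsed JSON data to a file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)  # object first, file object second


def fetch_all(urls, fetch, out_dir, workers=4):
    """Run fetch(url) for each URL in a thread pool and save every result.

    Returns the list of written file paths, in the same order as urls.
    """
    def job(item):
        index, url = item
        path = os.path.join(out_dir, f"response_{index}.json")
        save_json(fetch(url), path)
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, enumerate(urls)))
```

Each result is written as soon as its download finishes, which is roughly the effect of a response callback; an exception raised by fetch for any URL is re-raised when the results are collected.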