To read content from a given URL I do the following:
import requests
proxies = {'http':'http://user:pswd@foo-webproxy.foo.com:7777'}
url = 'http://example.com/foo/bar'
r = requests.get(url, proxies = proxies)
print r.text.encode('utf-8')
And it works fine! I get the content.
However, if I use another URL:
url = 'https://en.wikipedia.org/wiki/Mestisko'
It does not work. I get an error message that starts with:
requests.exceptions.ConnectionError: ('Connection aborted.', error(10060
Is Wikipedia blocking automatic requests?
ADDED
I tried to set a user agent in the following way:
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get(url, proxies = proxies, headers = headers)
Unfortunately it does not help. I still get the same error.
ADDED 2
Now I am confused. If I try to get content from http://example.com/foo/bar with the proxy set, I get it. If I do not set the proxy, I get content generated by the proxy. This behavior I can understand. However, if I try to get content from Wikipedia, I get the same error message regardless of whether I set the proxy or not. So I do not understand whether this error message comes from Wikipedia or from the proxy (both cannot be true).
The problem was resolved by replacing:
proxies = {'http':'http://user:pswd@foo-webproxy.foo.com:7777'}
with the following line:
proxies = {'http':'http://user:pswd@foo-webproxy.foo.com:7777', 'https':'http://user:pswd@foo-webproxy.foo.com:7777'}
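For reference, requests selects the proxy entry by the scheme of the URL being requested, so an https:// URL only goes through the proxy if an 'https' key is present. A minimal sketch of the working setup (the proxy credentials and host are just the placeholders from above):
import requests

proxy_url = 'http://user:pswd@foo-webproxy.foo.com:7777'  # placeholder proxy from the question

# requests picks the proxy by the scheme of the requested URL,
# so HTTPS URLs need their own 'https' entry.
proxies = {'http': proxy_url, 'https': proxy_url}

r = requests.get('https://en.wikipedia.org/wiki/Mestisko', proxies=proxies)
print(r.text[:200])  # print the start of the page to confirm it worked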
Related
My scraper was working without any problem, but access is suddenly being denied.
The error code is as below.
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.investing.com/equities/alibaba
Most of the solutions on Google say to set a User-Agent header to avoid bot detection. That was already applied to the code and the scraping worked well, but from one day on it started being rejected.
So I tried fake-useragent and random user agents to send a random User-Agent, but the request was always rejected.
Secondly, to rule out IP-based blocking, I tried using a VPN to switch IPs, but it was still denied.
The last thing I found was regarding Referer Control. So I installed the Referer Control extension, but I can't find any information on how to use it.
My code is below.
url = "https://www.investing.com/equities/alibaba"
ua = generate_user_agent()
print(ua)
headers = {"User-agent":ua}
res = requests.get(url,headers=headers)
print("Response Code :", res.status_code)
res.raise_for_status()
print("Start")
Any help will be appreciated.
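As a side note on the Referer idea: you do not need a browser extension for that; requests can send a Referer header directly. A minimal sketch, assuming the site checks the Referer (whether this actually satisfies investing.com's bot detection is not guaranteed):
import requests

url = "https://www.investing.com/equities/alibaba"

# Illustrative header values; the site may still block based on other signals
# (cookies, rate limits, TLS fingerprint, etc.).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.investing.com/",
    "Accept-Language": "en-US,en;q=0.9",
}

res = requests.get(url, headers=headers)
print("Response Code :", res.status_code)
res.raise_for_status()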
I'm working on a webscrape function that's going to be pulling HTML data from internal (non-public) servers. I have a connection through a VPN and a proxy server, so when I hit any public site I get a 200 with no problem, but our internal ones return 401.
Here's my code:
import requests
from requests.auth import HTTPBasicAuth

http_str = f'http://{username}:{password}@proxy.yourorg.com:80'
https_str = http_str  # the same proxy string is used for each scheme (see the edit below)

proxyDict = {
    'http': http_str,
    'https': https_str,
    'ftp': https_str
}

html_text = requests.get(url, verify=True, proxies=proxyDict, auth=HTTPBasicAuth(user, pwd))
I've tried flushing my DNS and using different certificate chains (which had a whole new list of problems). I'm using urllib3 version 1.23 because that seemed to help with SSL errors. I've considered using a requests Session, but I'm not sure what that would change.
Also, the URLs we're trying to access DO NOT require a login. I'm not sure why it's throwing 401 errors; the auth is for the proxy server, I think. Any help or ideas are appreciated, along with questions, as at this point I'm not even sure what to ask to move this along.
Edit: proxyDict has a string with the user and pwd passed in for each type: https, http, ftp, etc.
To use HTTP Basic Auth with your proxy, use the http://user:password@host/ syntax in any of the proxy configuration entries. See the API docs.
import requests
proxyDict = {
"http": "http://username:password#proxy.yourorg.com:80",
"https": "http://username:password#proxy.yourorg.com:80"
}
url = 'http://myorg.com/example'
response = requests.get(url, proxies=proxyDict)
If, however, you are accessing internal URLs via VPN (i.e., internal to your organization on your intranet) then you should NOT need the proxy to access them.
Try:
import requests
url = 'http://myorg.com/example'
response = requests.get(url, verify=False)
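Also note that requests reads HTTP_PROXY/HTTPS_PROXY from the environment even when you don't pass proxies=. If the internal hosts are reachable directly over the VPN, one way to make sure they bypass the proxy entirely is to disable that lookup on a Session (a sketch under that assumption):
import requests

session = requests.Session()
session.trust_env = False  # ignore HTTP_PROXY / HTTPS_PROXY / NO_PROXY from the environment

response = session.get('http://myorg.com/example', verify=False)
print(response.status_code)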
I can log in to a website and post some messages. However, some of the POST calls fail, and the website returns a message asking me to log in again.
I think the reason is that I do not authenticate myself when posting, so the server notices this and breaks the connection. But how do I do that? I'm using a Mac and Python 2.7.
I use this code to login:
Connection = requests.session()
result = Connection.post(url, headers = headd, data = data)
and it succeeds.
These are the other POST calls after I log in:
result = Connection.post(url, headers = headers, data = data)
but they fail.
I also tried this:
result = Connection.post(url, headers = headers, data = data, verify=False)
But it failed again. The URL here is an HTTPS website. Does that matter? I mean, how do I authenticate myself if necessary? I think it is the website that rejects the POST and breaks the session.
Try using:
s = requests.Session()
s.post(url, headers=headers, data=data)
instead of
Connection.post(...)
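The key point, in any case, is to create one Session, log in with it once, and reuse the same object for every later POST so the cookies set at login are sent automatically. A minimal sketch (the URLs, headers, and payloads are placeholders, not the asker's actual values):
import requests

session = requests.Session()

# Log in once; any cookies the server sets are stored on the session.
login_data = {'username': 'me', 'password': 'secret'}  # placeholder payload
session.post('https://example.com/login', data=login_data)

# Later POSTs through the same session carry the login cookies automatically.
result = session.post('https://example.com/post_message', data={'msg': 'hello'})
print(result.status_code)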
Possible Duplicate:
How to use Python to login to a webpage and retrieve cookies for later usage?
I want to download the whole webpage source from a service that handles cookies in an unusual way. I wrote a script that actually works and seems to be fine; however, at some point it returned this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
My script works in a loop and changes the link to the subpage whose content I am interested in downloading.
I get a cookie, send a package of data, and then I am able to get to the proper link and download the HTML.
The script looks like this:
import urllib2
data = 'some_string'
url = "http://example/index.php"
url2 = "http://example/source"
req1 = urllib2.Request(url)
response = urllib2.urlopen(req1)
cookie = response.info().getheader('Set-Cookie')
## Use the cookie in subsequent requests
req2 = urllib2.Request(url, data)
req2.add_header('cookie', cookie)
response = urllib2.urlopen(req2)
## reuse again
req3 = urllib2.Request(url2)
req3.add_header('cookie', cookie)
response = urllib2.urlopen(req3)
html = response.read()
I've been reading a bit about cookiejar/cookielib, because using that library I am supposed to get rid of the error mentioned above; however, I have no clue how to rewrite my code to use http.cookiejar and urllib.request.
I tried something like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener( urllib.request.HTTPCookieProcessor(cj) )
r = opener.open(url) # now cookies are stored in cj
r1 = urllib.request(url, data) #TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
r2 = opener.open(url2)
print( r2.read() )
But it's not working like my first script.
PS: Sorry for my English, I am not a native speaker.
@Piotr Dobrogost thanks for the link, it solved the issue.
The TypeError was solved by using data=b"string" instead of data="string".
I've still got some issues due to porting to Python 3, but this issue can be closed.
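For anyone porting the same pattern, here is a minimal Python 3 sketch following the fix above: keep one opener with an HTTPCookieProcessor and pass the POST body as bytes (the URLs and the payload are the placeholders from the question):
import http.cookiejar
import urllib.request

url = "http://example/index.php"
url2 = "http://example/source"
data = b"some_string"  # POST data must be bytes in Python 3

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

opener.open(url)          # first request: cookies are stored in cj
opener.open(url, data)    # passing data makes this a POST; the same cookies are reused
html = opener.open(url2).read()
print(html[:200])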
I have a Perl program that retrieves data from my university library's database, and it works well. Now I want to rewrite it in Python, but I encounter this problem:
<urlopen error [errno 104] connection reset by peer>
The Perl code is:
my $ua = LWP::UserAgent->new;
$ua->cookie_jar( HTTP::Cookies->new() );
$ua->timeout(30);
$ua->env_proxy;
my $response = $ua->get($url);
The Python code I wrote is:
import urllib2
from cookielib import CookieJar

cj = CookieJar()
request = urllib2.Request(url)  # url: target web page
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # install_opener returns None, so don't reassign opener
data = urllib2.urlopen(request)
I use a VPN (virtual private network) to log in to my university's library from home, and I tried both the Perl code and the Python code. The Perl code works as I expected, but the Python code always encounters the "urlopen error".
I googled the problem, and it seems that urllib2 fails to pick up the environment proxy. But according to the urllib2 documentation, the urlopen() function works transparently with proxies which do not require authentication. Now I feel quite confused. Can anybody help me with this problem?
I tried faking the User-Agent header as Uku Loskit and Mikko Ohtamaa suggested, and that solved my problem. The code is as follows:
proxy = "YOUR_PROXY_GOES_HERE"
proxies = {"http":"http://%s" % proxy}
headers={'User-agent' : 'Mozilla/5.0'}
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, headers)
html = urllib2.urlopen(req).read()
print html
Hope it is useful for someone else!
Firstly, as Steve said, you need response.read(), but that's not your problem:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Can you give details of the error? You can get it like this:
from urllib2 import URLError

try:
    urllib2.urlopen(req)
except URLError, e:
    print e.code
    print e.read()
Source: http://www.voidspace.org.uk/python/articles/urllib2.shtml
(I put this in a comment but it ate my newlines)
You might find that the requests module is a much easier-to-use replacement for urllib2.
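For comparison, roughly the same request with requests, including an explicit proxy and a faked User-Agent (the proxy address is a placeholder):
import requests

proxies = {'http': 'http://your_proxy_ip:port'}   # placeholder proxy address
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get('http://www.uni-database.com', proxies=proxies, headers=headers)
print(response.status_code)
print(response.text[:200])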
Did you try specifying the proxy manually?
proxy = urllib2.ProxyHandler({'http': 'your_proxy_ip'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
urllib2.urlopen('http://www.uni-database.com')
If it still fails, try faking your User-Agent header to make it seem that the request is coming from a real browser.