I'm trying to use requests and bs4 to get info from a website, but I am receiving status code 304 and no content from requests.get(). I've done some reading and understand that this code indicates the resource is already in my cache. How can I either access the resource from my cache, or preferably, clear my cache so that I can receive the resource fresh?
I've tried adding the following header: headers={'Cache-Control': 'no-cache'} to requests.get() but still have the same issue.
Additionally I've looked into the requests-cache module, but am unclear on how or if this could be used to solve the problem.
code:
import requests
r = requests.get('https://smsreceivefree.com/')
print(r.status_code)
print(r.content)
output:
304
b''
A server should send a 304 Not Modified reply if the client sent a conditional request, such as one carrying an If-Modified-Since header. This makes sense when the client already has a cached version of the page and wants to avoid downloading the content again if it already has the newest version.
In this case, the website appears to send a 304 to certain kinds of clients: ones whose User-Agent suggests automation (which is true in your case).
The server should arguably send a 4xx error code instead, probably a 403 Forbidden, but it likely uses a 304 to throw bot writers off the track and send them to Stack Overflow.
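Since the 304 here seems to be keyed on the client rather than on a real cache, a minimal sketch of a workaround is to present a browser-like User-Agent (the exact string below is only an example and may need updating) alongside the Cache-Control header you already tried:
import requests

# Browser-like headers; the User-Agent value is just an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Cache-Control': 'no-cache',
}
r = requests.get('https://smsreceivefree.com/', headers=headers)
print(r.status_code)   # expect 200 if the server was keying on the User-Agent
print(len(r.content))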
Related
I need to make a POST request from a modem to one of my REPLs.
The POST request works fine when sent from a sufficiently advanced client (one that follows redirects automatically); however, low-level requests always get stopped by HTTP 308 (Permanent Redirect) responses.
Is there any way to fix this on the Repl?
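As a quick diagnostic (a sketch only; the URL and payload below are placeholders, not the modem's code), you can reproduce the 308 from Python with redirects disabled and read the Location header; pointing the device's POST directly at that final URL avoids the redirect entirely:
import requests

# Placeholder endpoint and body; substitute the Repl's actual URL and payload.
resp = requests.post('https://example-user.example.repl.co/endpoint',
                     json={'value': 42},
                     allow_redirects=False)
print(resp.status_code)               # e.g. 308
print(resp.headers.get('Location'))   # the URL the device should POST to instead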
I am using the googlesearch module for web scraping, but I got this 429 error. I tried uninstalling and reinstalling the module, but it didn't help. My next idea is to delete cookies, but I don't know how. Can you help me, please?
from googlesearch import search

query = 'site:https://stackoverflow.com urllib.error.HTTPError: HTTP Error 429: Too Many Requests'
search_query = search(query=query, stop=10)
for url in search_query:
    print(url)
429 Too Many Requests
The HTTP 429 Too Many Requests response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request.
Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers. Likewise, it might identify the user by its authentication credentials, or a stateful cookie.
You're sending too many requests in a short period of time. The Custom Search API might be useful, depending on your use case. If not, then you might have to use proxies for your calls, or implement a wait-and-retry mechanism.
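A rough sketch of that wait-and-retry idea, assuming the 429 surfaces as urllib.error.HTTPError (as the query above suggests) and that any Retry-After value is given in seconds:
import random
import time
import urllib.error

def search_with_backoff(do_search, max_tries=5):
    """Call do_search() and back off whenever Google answers 429."""
    for attempt in range(max_tries):
        try:
            return do_search()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            retry_after = err.headers.get('Retry-After') if err.headers else None
            # Honour Retry-After when it is a plain number of seconds,
            # otherwise fall back to exponential backoff with jitter.
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)
            else:
                delay = 2 ** attempt + random.random()
            time.sleep(delay)
    raise RuntimeError('still rate limited after %d tries' % max_tries)

# Usage sketch with the googlesearch call from the question:
# results = search_with_backoff(lambda: list(search(query=query, stop=10)))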
Here is the resolution of my problem: https://support.google.com/gsa/answer/4411411#requests
I am trying to send a PATCH request to a database using the urllib and urllib2 libraries in Python 2.7 (I cannot use the requests library because it does not work on this server and nobody has found a solution for that, so please do not suggest requests; that path is already closed).
The code looks like this:
data={"name":"whatever name"}
data=urllib.urlencode(data,'utf-8')#Encoding the dictionary of the data to make the request
req=urllib2.Request(url=next_url,headers={"Authorization": auth_header,"Content-Type": "application/json"})#Creating a request object of urllib library
req.add_data=data
req.get_method = lambda: 'PATCH'
resp = urllib2.urlopen(req)
If I don't set both req.get_method = lambda: 'PATCH' and req.add_data = data, the Request class automatically sends a GET request, which gets a 200 response, so I guess the problem is not with the authorization credentials. Using Python 3 and the urllib.request library works as well, so the server definitely accepts PATCH requests.
I hope that somebody can find the solution... I cannot picture why this is happening.
Update (SOLVED): I found the problem; it was related to the URL I was making the request to.
The "Moved Permanently" error would indicate that the server responded with a HTTP 301 error, meaning that the URL you are requesting has been moved to another URL (https://en.wikipedia.org/wiki/HTTP_301).
I would suggest to take a network traffic capture with tools like tcpdump or wireshark, to check the HTTP conversation and confirm . If the server is actually replying with a 301 and this is not urllib raising a wrong error code, the server response should include a "Location" header with another URL, and you should try this one instead.
Note that urllib has problems when dealing with redirects., so you might want to reconsider trying to make the "requests" module work instead.
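A minimal Python 2 sketch of that check, assuming urllib2 surfaces the redirect as an HTTPError (it generally refuses to auto-follow redirects for non-GET/POST methods such as PATCH). The URL, credentials and payload below are placeholders:
import urllib
import urllib2

# Placeholder URL, credentials and payload; substitute your own.
next_url = 'http://example.com/api/record/1'
headers = {'Authorization': 'Basic ...', 'Content-Type': 'application/x-www-form-urlencoded'}
data = urllib.urlencode({'name': 'whatever name'})

def patch(url):
    req = urllib2.Request(url=url, data=data, headers=headers)
    req.get_method = lambda: 'PATCH'
    return urllib2.urlopen(req)

try:
    resp = patch(next_url)
except urllib2.HTTPError as e:
    if e.code in (301, 302, 307, 308):
        location = e.info().getheader('Location')
        print('Redirected to: %s' % location)
        # Retry against the Location URL (often just a missing trailing
        # slash or an http -> https difference).
        resp = patch(location)
    else:
        raise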
I have come across an interesting phenomenon when scraping a particular site on the web. The issue arises when using Python's urllib2 (Python 2.7). For example:
import urllib2
LINK = "http://www.sample.com/article/1"
HEADERS = {'User-Agent': 'Mozilla/5.0 ...'}  # Ellipsis for brevity
req = urllib2.Request(LINK, data=None, headers=HEADERS)
resp = urllib2.urlopen(req, timeout=20).read()
Here are the strange outcomes:
1) When a valid user agent is passed in the request headers, the server returns a status of 200 and a page saying there was an issue processing the request (invalid HTML). In other words, I get a successful response from the server, but with corrupted data.
2) When an invalid user agent is passed (empty headers {}), the request times out. However, if the timeout is set to a large value (20 seconds in this example), the server does return the valid data, just slowly.
The problem only arises when there have been no prior requests, so I believe the server may be expecting a certain cookie in the request before it serves valid data. Does anyone have insight into why this is happening?
You can't know what's going on on the server side.
You can only keep experimenting and guessing how it works on the other side. If this is a User-Agent-only issue, just keep sending it (and maybe change it once in a while) for ALL of your requests, including the first one.
Also, I would open the Chrome dev tools in a new (incognito) session and record all the actions you perform; that way you can see the structure of the requests being made by a real browser.
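A small Python 2 sketch of both suggestions combined: one opener that keeps whatever cookies the first response sets and sends the same User-Agent on every request (the URLs and the User-Agent string are placeholders):
import cookielib
import urllib2

# One opener for the whole session: cookies set by earlier responses are
# replayed on later requests, and the User-Agent stays constant.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 ...')]

# Hit the front page first so any session cookie is established,
# then request the article with the same opener.
front = opener.open('http://www.sample.com/', timeout=20).read()
article = opener.open('http://www.sample.com/article/1', timeout=20).read()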
I am trying to use Python to write a client that connects to a custom HTTP server that uses digest authentication. I can connect and pull the first request without a problem. Using tcpdump (I am on Mac OS X; I am both a Mac and a Python noob) I can see that the first request is actually two HTTP requests, as you would expect if you are familiar with RFC 2617. The first results in a 401 UNAUTHORIZED. The header information sent back by the server is correctly used to generate headers for a second request with some custom Authorization header values, which yields a 200 OK response and the payload.
Everything is great. My HTTPDigestAuthHandler opener is working, thanks to urllib2.
In the same program I attempt to request a second, different page from the same server. I expect, per the RFC, that the tcpdump will show only one request this time, reusing almost all of the same Authorization header information (nc should increment).
Instead it starts from scratch and first gets the 401 and regenerates the information needed for a 200.
Is it possible with urllib2 to have subsequent requests with digest authentication recycle the known Authorization Header values and only do one request?
[Re-read that a couple times until it makes sense, I am not sure how to make it any more plain]
Google has yielded surprisingly little, so I guess not. I looked at the code for urllib2.py and it's really messy (comments like: "This isn't a fabulous effort"), so I wouldn't be shocked if this were a bug. I noticed that my Connection header is set to close, and even if I set it to keep-alive, it gets overwritten. That led me to keepalive.py, but that didn't work for me either.
Pycurl won't work either.
I can hand code the entire interaction, but I would like to piggy back on existing libraries where possible.
In summary, is it possible with urllib2 and digest authentication to get 2 pages from the same server with only 3 HTTP requests executed (2 for the first page, 1 for the second)?
If you happen to have tried this before and already know it's not possible, please let me know. If you have an alternative, I am all ears.
Thanks in advance.
Although it's not available out of the box, urllib2 is flexible enough to let you add it yourself. Subclass HTTPDigestAuthHandler, hack it (the retry_http_digest_auth method, I think) to remember the authentication information, and define an http_request(self, request) method that uses it for all subsequent requests (adding the Authorization header).
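A rough sketch of that idea, leaning on urllib2's undocumented internals: cache the server's Digest challenge after the first 401, then compute an Authorization header up front for every later request, so only the first page costs the extra round trip. The class name, URLs and credentials below are mine, and a strict server may still reject this (for example if it insists on fresh nonces):
import urllib2

class PreemptiveDigestAuthHandler(urllib2.HTTPDigestAuthHandler):
    """Remember the Digest challenge from the first 401 and use it to
    authorize later requests preemptively (single round trip)."""

    def __init__(self, *args, **kwargs):
        urllib2.HTTPDigestAuthHandler.__init__(self, *args, **kwargs)
        self._challenge = None

    def retry_http_digest_auth(self, req, auth):
        # Cache the parsed challenge (realm, nonce, qop, ...) for reuse.
        scheme, challenge = auth.split(' ', 1)
        self._challenge = urllib2.parse_keqv_list(urllib2.parse_http_list(challenge))
        return urllib2.HTTPDigestAuthHandler.retry_http_digest_auth(self, req, auth)

    def http_request(self, req):
        # Runs for every outgoing request: add Authorization before sending
        # so the 401 round trip is skipped once a challenge is known.
        if self._challenge is not None and not req.has_header('Authorization'):
            auth = self.get_authorization(req, self._challenge)
            if auth:
                req.add_unredirected_header('Authorization', 'Digest %s' % auth)
        return req

    https_request = http_request

# Usage sketch (placeholder URL and credentials):
passwords = urllib2.HTTPPasswordMgrWithDefaultRealm()
passwords.add_password(None, 'http://example.com/', 'user', 'secret')
opener = urllib2.build_opener(PreemptiveDigestAuthHandler(passwords))
page1 = opener.open('http://example.com/page1').read()  # 401 + authorized retry
page2 = opener.open('http://example.com/page2').read()  # one authorized request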