Troubleshooting 404 received by python script

I have a python script that pings 12 pages on someExampleSite.com every 3 minutes. It's been working for a couple months but today I started receiving 404 errors for 6 of the pages every time it runs.
So I tried going to those URLs on the PC the script runs on, and they load fine in Chrome and Safari. I've also tried changing the user agent string the script uses, and removing the 'If-Modified-Since' header; neither made any difference.
Why would the server send my script a 404 for these 6 pages when I can load them in Chrome and Safari on that same computer just fine? (I made sure to do a hard refresh in Chrome and Safari, and they still loaded.)
I'm using urllib2 to make the request.

There could be multiple reasons for this, such as the server rejecting your request because of missing headers, or throttling.
You could record the request headers Chrome sends (using its developer tools or the HTTP Headers extension), then replay the request with the Python requests library, including all of your browser's headers. From there, change or remove headers one at a time to see exactly what is happening.
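A minimal sketch of that approach (the header values below are placeholders; copy the ones your own browser actually sends, and the URL stands in for one of the failing pages):
import requests

# Headers copied from the browser's request -- placeholder values,
# replace them with what Chrome's developer tools show.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'http://someExampleSite.com/',
}

resp = requests.get('http://someExampleSite.com/some-page', headers=headers)
print(resp.status_code)
Then delete one header at a time and re-run until the 404 comes back; the last header you removed is the one the server cares about.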

So I figured out what the problem was.
The website is returning an erroneous response code for these 6 pages: even though it returns a 404, it also returns the web page itself. Chrome and Safari appear to ignore the status code and display the page anyway, while my script aborts on the 404.
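For anyone hitting the same thing: urllib2 raises an HTTPError for any 4xx status, but the HTTPError object can itself be read like a response, so a sketch of a workaround (with a placeholder URL) looks like this:
import urllib2

try:
    resp = urllib2.urlopen('http://someExampleSite.com/some-page')
    body = resp.read()
except urllib2.HTTPError as e:
    # The HTTPError doubles as a response object: the page the
    # server sent alongside the 404 is still readable from it.
    print 'Got %d, reading the body anyway' % e.code
    body = e.read()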

Related

How to send get requests to nclt website using Python requests

I have an issue related to technical support. I am trying to send a GET request using Python; the code is given below:
import requests
res = requests.get('https://nclt.gov.in/')
but this request gets stuck for a long time on my droplet server, while on my local system it works fine and I get a response within a second. I don't know what is going on with this site.
I have tested with different websites and I get a response from all of them except this one, and I have no idea why.
I have also tried the following:
setting the user agent in the headers
using cookies
but I am still not getting a response. I have been trying for the last 24 hours and have not been able to find the exact reason behind this.
Is there any issue with the droplet, and do I have to configure anything? I don't think there is any validation on 'http://nclt.gov.in', because I am just sending a GET request and it works fine on my local machine without any problem.
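For what it's worth, a quick way to tell a hang apart from a slow response is to set an explicit timeout (a sketch; the header values are placeholders):
import requests

# Placeholder browser-like headers; the timeout makes the request
# fail fast instead of blocking indefinitely.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ...',
    'Accept': 'text/html,application/xhtml+xml',
}

try:
    res = requests.get('https://nclt.gov.in/', headers=headers, timeout=10)
    print(res.status_code)
except requests.exceptions.Timeout:
    # No bytes at all within 10 seconds suggests the connection is
    # being dropped (e.g. the droplet's IP range being filtered).
    print('timed out')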

Selenium gets response code of 429 but firefox private mode does not

I used Selenium in Python 3 to open a page. It does not open under Selenium, but it does open in a Firefox private window.
What is the difference, and how do I fix it?
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox()
driver.get('https://google.com') # creating a google cookie
driver.get_cookies() # check google gets cookies
sleep(3.0)
url='https://www.realestate.com.au/buy/in-sydney+cbd%2c+nsw/list-1'
driver.get(url)
Creating a Google cookie is not necessary: a Firefox private window doesn't have it either, yet the page still loads there. Under Selenium, however, the behavior is different.
I also see the website return an [HTTP/2 429 Too Many Requests 173ms] status, and the page is blank white. This does not happen in Firefox private mode.
UPDATE:
I turned on the persistent log. Firefox in private mode receives a 429 response too, but it seems the JavaScript then resumes from another URL; it only happens the first time.
Under Selenium, however, the request does not survive the 429 response. It does report something to the cdndex website; I have blocked that site, so you do not see the request go through there. This is still a difference in behavior between Firefox and Selenium.
(screenshot: Selenium with persistent log)
(screenshot: Firefox with persistent log)
This is just my hunch after working with Selenium and WebDriver for a while: I suspect the default user agent of Selenium is set to something lame, the server side recognizes this, and it serves you a silly HTTP code and a blank page as a result.
Try setting the user agent to something reasonable and/or disabling Selenium's interference with the browser defaults.
Another tip is to look at the request with Wireshark or a similar tool, to see exactly what is sent over the wire.
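A sketch of the first suggestion with Firefox (the UA string is a placeholder; copy the one your normal Firefox sends, and note the dom.webdriver.enabled tweak is widely cited but not guaranteed to fool every detector):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
# Placeholder UA string -- replace with your real browser's value.
options.set_preference(
    'general.useragent.override',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
)
# Often-cited preference to stop Firefox advertising WebDriver control.
options.set_preference('dom.webdriver.enabled', False)

driver = webdriver.Firefox(options=options)
driver.get('https://www.realestate.com.au/buy/in-sydney+cbd%2c+nsw/list-1')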
429 Too Many Requests
The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests within a short period of time. The 429 status code is intended for use with rate-limiting schemes.
Root Cause
When your server detects that a user agent is trying to access a specific page too often in a short period of time, it triggers a rate-limiting feature. The most common example of this is when a user (or an attacker) repeatedly tries to log into a web application.
The server can also identify a client by its cookies rather than by login credentials, and requests may be counted per client, per server, or across several servers. So there are a variety of situations that can result in you seeing an error like one of these:
429 Too Many Requests
429 Error
HTTP 429
Error 429 (Too Many Requests)
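As an aside, a client that plays nicely with rate limiting honours the Retry-After header when one of these comes back. A minimal sketch (assuming Retry-After is given in seconds; it can also be an HTTP date):
import time
import requests

def get_with_backoff(url, max_retries=3):
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Fall back to exponential backoff if Retry-After is absent.
        wait = int(resp.headers.get('Retry-After', 2 ** attempt))
        time.sleep(wait)
    return resp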
This use case
This use case seems to be a classic one: a Selenium-driven, GeckoDriver-initiated Firefox browsing context being detected as a bot, due to the fact that:
Selenium identifies itself
References
You can find a couple of relevant detailed discussions in:
How to Conceal WebDriver in Geckodriver from BotD in Java?
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?

TD Ameritrade API :: Unable to connect Firefox can’t establish a connection to the server at 127.0.0.1

Using documentation from https://pypi.org/project/td-ameritrade-python-api/
I'm trying to get started with the TD Ameritrade API in Python...
The problem I am having is with authentication of my account which is done via this Url:
(note: the client_id has been changed because it is private)
https://auth.tdameritrade.com/auth/?response_type=code&redirect_uri=https%3A%2F%2F127.0.0.1&client_id=[Private]%40AMER.OAUTHAP
So everything works:
I get the login screen
After successful login, I get the permissions page
EXCEPT...
When everything is completed, I get this error from Firefox (or Chrome, whatever):
Unable to connect
Firefox can’t establish a connection to the server at 127.0.0.1.
Given the above issue, I searched Google for info and did the following:
Cleared Cache
Made sure correct IIS settings were configured
It does not work at this point.
I have no idea what is going on. Any help would be greatly appreciated.
This is probably one of the few times when getting an error message like you did is actually part of the process of authenticating your account. At the very bottom of the PyPI page for that library, he explains that you're supposed to copy and paste the resulting URL of the error page you're currently on into your terminal. It was confusing for me as well, and it took me a while to really understand what's going on, so I will explain it as best I can.
Alex Reed is the guy who made the td-ameritrade-python-api library, and he has an awesome YouTube channel called Sigma Coding. One of his video series guides you through the whole process of connecting directly to the TD Ameritrade API without using his library, and another series covers building the library itself.
In the video How to Use the TD Ameritrade API | Part 2 he demonstrates how to access the API. The link should have a timestamp of 16:36; if not, skip ahead to that section and you will see an error similar to what you are experiencing, except he is using Chrome, not Firefox, so the error is the same but worded differently.
Here's a picture to better explain the rest: (screenshot: the error page with the redirect URL in the address bar)
What he does next is copy the current URL of the error page, which contains the code needed for the next step. The URL in the picture starts with https://localhost/test?code=siVrfqPLdQ... and you can see that it has code= followed by a very long authorization code that TD Ameritrade needs to generate your access token.
Your URL should have a similar structure; don't worry if it doesn't have /test after localhost, he made a specific folder for the video series. Just copy and paste the whole thing into your terminal, where you should have a line that says:
Paste the full redirect url here:
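If you ever need to pull the code out programmatically rather than pasting the whole URL, the query string can be parsed like this (a sketch; the URL below is a truncated placeholder in the same shape as the one in the video):
from urllib.parse import urlparse, parse_qs

redirect_url = 'https://localhost/test?code=siVrfqPLdQ...'

# parse_qs splits the query string and percent-decodes the value.
auth_code = parse_qs(urlparse(redirect_url).query)['code'][0]
print(auth_code)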

How to view what python requests is doing on browser

I just started experimenting with the Requests library in Python to interact with different sites. However, sometimes I want to see whether the POST requests I'm sending are actually working. Is there any way to open a browser to see what is actually happening when I send a POST request?
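One common trick (a sketch, using httpbin.org as a stand-in endpoint) is to dump the response body to a file and open that file in your default browser:
import os
import webbrowser
import requests

resp = requests.post('https://httpbin.org/post', data={'q': 'test'})

# Save the response body and open it in the default browser.
with open('response.html', 'w') as f:
    f.write(resp.text)
webbrowser.open('file://' + os.path.abspath('response.html'))
Note the page renders without its original JavaScript or relative assets, so it is a snapshot of the HTML rather than a live session.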

Python Web Scraping HTTP 400

I'm doing a web scrape with Python (using the Scrapy framework). The scrape works successfully until it gets about an hour into the process, and then every request comes back with an HTTP 400 error code.
Is this just likely to be an IP-based rate limiter or scrape-detection tool? Any advice on how I might investigate the root cause further?
I think the problem is the request rate. Try adding some download_delay. If you can request more pages before the 400 error appears, adjust download_delay until you can fetch the full site. Some websites publish a suggested Crawl-delay in their robots.txt file.
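For example, in the project's settings.py (illustrative values; tune them against the target site):
# Wait between requests to the same domain, with some jitter.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# AutoThrottle adapts the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10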
It could be a rate limiter.
However, a 400 error generally means that the client request was malformed and was therefore rejected by the server.
You should investigate this first. When your requests start failing, exit your program and immediately start it again. If it starts working, you know you aren't being rate-limited, and that there is in fact something wrong with how your requests are formed later in the run.
