I have a Flask application running on an IIS server. Everything works fine, except that I always get a timeout error when using requests:
import requests
r = requests.get('https://github.com')
Using web services is therefore impossible.
I have tried sending headers with the request, but I get the same result:
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get('https://github.com', headers=headers)
I also tried increasing the timeout limits, both in code and in IIS.
I also tried changing the Identity field under the Process Model section to LocalSystem.
I'm not familiar with IIS and cannot think of anything else. Any help is appreciated.
According to your description, I don't think this issue is related to IIS. It looks like a network issue.
I suggest you first check your server's firewall to make sure the server is allowed to access the internet.
If you need a proxy to access the internet, you could try adding the settings below to the web.config of your Flask application.
<system.net>
<defaultProxy>
<proxy
proxyaddress="The IP address"
bypassonlocal="true"
/>
</defaultProxy>
</system.net>
For details, see this article.
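If the proxy has to be set on the Python side rather than in web.config, the requests library accepts a proxies mapping (it also honours the HTTP_PROXY and HTTPS_PROXY environment variables). A minimal sketch, assuming a proxy at the hypothetical address http://10.0.0.1:8080; the helper name make_proxied_session is also made up for this example:

```python
import requests

# Hypothetical proxy address, used only for illustration. Replace with your own.
PROXY_URL = "http://10.0.0.1:8080"


def make_proxied_session(proxy_url):
    """Return a requests.Session that routes HTTP and HTTPS traffic through proxy_url."""
    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


session = make_proxied_session(PROXY_URL)
# Example call, commented out so the sketch does not touch the network:
# r = session.get("https://github.com", timeout=10)
```

Using a session keeps the proxy setting in one place, so every request made through it goes the same way.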
I am trying to webscrape data from CME exchange:
https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021
I have the following code snippet:
import json
import requests as r
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
header = {'User-Agent': user_agent}
link = 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021'
page = r.get(link,headers=header)
raw_json = json.loads(page.text)
While it works perfectly well on a local computer, it hangs completely on remote hosting servers (Digital Ocean, Hetzner). I have also tried to curl the URL, but it times out without any additional details.
Do I need to use selenium for this? I wonder what can be different between scraping data from a local machine and the hosting server.
I don't know how to resolve this. Hope you can give me some clues.
Apparently, some hosting providers are blocked by CME. You should look for one that is not blocked and use it as a proxy server. That's the solution that worked for me. However, I now think this could be related to the IPv6 settings on the server. Try disabling IPv6; connections will then automatically fall back to IPv4.
On Ubuntu:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
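Before disabling IPv6 system-wide, it may be worth checking from Python whether the host resolves to IPv6 addresses at all and whether a plain TCP connection succeeds over each address family. A rough sketch using only the standard library; the function names are mine, and a successful TCP connect does not prove that the site will answer HTTP requests:

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def resolved_families(host, port=443):
    """Return the set of address families ("IPv4", "IPv6") that host resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {FAMILY_NAMES.get(info[0], str(info[0])) for info in infos}


def can_connect(host, port=443, family=socket.AF_INET, timeout=5.0):
    """Try a TCP connection to host:port using only the given address family."""
    try:
        infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family for the host
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address
    return False
```

For example, if can_connect("www.cmegroup.com", 443, socket.AF_INET6) returns False while the same call with socket.AF_INET returns True, that would point towards the IPv6 route rather than a block on the provider.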
Just found the solution for this problem.
The reason for this behaviour is the HTTP/2 protocol.
One way to test this is to upgrade curl: since version 7.47.0, the curl tool enables HTTP/2 by default for HTTPS connections.
Hope it helps!
You can get the JSON from the response object itself; there is no need to parse page.text with json.loads.
Just use this directly; it may work:
data=page.json()
I am attempting to run my code on an AWS EC2 (Ubuntu) instance. The code works perfectly fine on my local machine but doesn't seem to be able to connect to the website from the server.
I'm assuming it has something to do with the headers. I have installed Firefox and Chrome on the server, but that doesn't seem to change anything.
Any ideas on how to fix this problem would be appreciated.
import requests
HEADERS = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
# Making a get request
response = requests.get("https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v", headers=HEADERS)  # hangs here; the request never completes on the server
# print response
print(response.status_code)
Output:
Doesn't give me one, just stays blank until I KeyboardInterrupt.
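One thing worth noting: requests has no default timeout, so a call that never gets a reply blocks until it is interrupted, which matches the blank output described above. A small sketch that at least turns the hang into a visible error; the helper name fetch_status and the (5, 15) second connect/read timeouts are arbitrary choices of mine:

```python
import requests


def fetch_status(url, headers=None, timeout=(5, 15)):
    """Return the HTTP status code, or None if the request fails or times out.

    timeout is a (connect, read) pair in seconds, as accepted by requests.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout} seconds: {url}")
        return None
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.status_code
```

Calling fetch_status(url, headers=HEADERS) with the URL from the question would then print a timeout message instead of hanging. It does not explain why the server never answers, but it separates "connection never established" from "connected but no response" if you vary the two timeout values.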
I'm doing a requests.get(url='url', verify=False) from my Django application, hosted on an Ubuntu server on AWS, to a URL served by Django REST Framework. There are no permissions or authentication on the DRF endpoint, because I'm the one who made it. I've added headers such as
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}, but wasn't able to get any content.
BUT when I do run ./manage.py shell and run the exact same command, I get the output that I need!
EDIT 1:
So I've started using subprocess.check_output("curl <url> --insecure", shell=True) and it works, but I know this is not a very "nice" way to do things.
I know what the problem was.
My application, as deployed, was single-threaded, not multithreaded.
I increased the number of workers and that fixed everything.
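One plausible reading of this fix, assuming the view was calling an endpoint served by the same deployment: a server that handles one request at a time cannot answer a request it makes to itself, so the inner call waits until it times out. The standard-library sketch below reproduces that effect; Handler, probe and the /outer and /inner paths are made up for the demonstration and are not Django code:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer

# Opener that ignores proxy environment variables, so calls stay on localhost.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/outer":
            # The "view" calls another endpoint on the very same server.
            port = self.server.server_address[1]
            try:
                with OPENER.open(f"http://127.0.0.1:{port}/inner", timeout=1) as resp:
                    body = resp.read()
            except OSError:  # covers URLError and socket timeouts
                body = b"inner call timed out"
        else:
            body = b"inner ok"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the caller already gave up and closed the connection

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def probe(server_class):
    """Start server_class on a free port, request /outer, and return the body."""
    server = server_class(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        with OPENER.open(f"http://127.0.0.1:{port}/outer", timeout=5) as resp:
            return resp.read().decode()
    finally:
        server.shutdown()
        server.server_close()


single_threaded_result = probe(HTTPServer)       # one request at a time
multi_threaded_result = probe(ThreadingHTTPServer)  # one thread per request
print(single_threaded_result)
print(multi_threaded_result)
```

The single-threaded run should report that the inner call timed out, while the threaded run should return "inner ok". Adding workers or threads to the real deployment has the same effect as switching server classes here.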
I'm running Scrapy 0.24.4, and have encountered quite a few sites that shut down the crawl very quickly, typically within 5 requests. The sites return 403 or 503 for every request, and Scrapy gives up. I'm running through a pool of 100 proxies, with the RotateUserAgentMiddleware enabled.
Does anybody know how a site could identify Scrapy that quickly, even with the proxies and user agents changing? Scrapy doesn't add anything to the request headers that gives it away, does it?
Some sites incorporate JavaScript code that needs to be run.
Scrapy doesn't execute JavaScript, so the web app can tell very quickly that it's dealing with a bot.
http://scraping.pro/javascript-protected-content-scrape/
Try using Selenium for the sites that return 403. If crawling with Selenium works, you can assume that the problem is JavaScript.
I think crunchbase.com uses such protection against scraping.
It appears that the primary problem was not having cookies enabled. Having enabled cookies, I'm having more success now. Thanks.
For me cookies were already enabled.
What fixed it was using another user agent, one that is common.
Replace in settings.py file of your project USER_AGENT with this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
I simply set AUTOTHROTTLE_ENABLED to True and my script was able to run.
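Pulling the three fixes from this thread together, a settings.py fragment could look like the following. COOKIES_ENABLED, USER_AGENT, AUTOTHROTTLE_ENABLED and DOWNLOAD_DELAY are Scrapy setting names, but check them against the documentation for your Scrapy version; the DOWNLOAD_DELAY value is my own assumption and not something the answers above tested:

```python
# settings.py (fragment)

# Cookies are on by default in Scrapy; make sure nothing has turned them off.
COOKIES_ENABLED = True

# A common desktop browser user agent instead of Scrapy's default one.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)

# Let Scrapy adapt the crawl rate to the server's response times.
AUTOTHROTTLE_ENABLED = True

# Assumption: a small base delay between requests; tune or drop as needed.
DOWNLOAD_DELAY = 1
```

Which of these matters will depend on the site, so it is probably easiest to enable them one at a time and see where the 403 or 503 responses stop.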
I'm trying to find out which browsers my users are using, and I'm running into a problem.
If I try to read the "User-Agent" header, it usually gives me lots of text that tells me nothing.
For example, if I visit the site with Chrome, in "User-Agent" header there is:
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing, since it mentions Mozilla, Safari, Chrome, etc., even though I visited with Chrome.
The framework I'm using is Bottle (Python).
Any help would be appreciated, thanks.
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of
Mozilla, Safari, Chrome etc. even though I visited with Chrome.
Your conclusion above is wrong. The UA tells you many things including the type and version of the web browser.
The post below explains why Mozilla and Safari exist in Chrome's UA.
History of the browser user-agent string
You can try to analyze it manually on user-agent-string-db.
There's a Python API for it.
from uasparser2 import UASparser
uas_parser = UASparser()
# Instead of fetching data over the network every time, you can cache the db locally
# uas_parser = UASparser('/path/to/your/cache/folder', mem_cache_size=1000)
# Updating data is simple: uas_parser.updateData()
result = uas_parser.parse('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36')
# result
{'os_company': u'',
'os_company_url': u'',
'os_family': u'Linux',
'os_icon': u'linux.png',
'os_name': u'Linux',
'os_url': u'http://en.wikipedia.org/wiki/Linux',
'typ': u'Browser',
'ua_company': u'Google Inc.',
'ua_company_url': u'http://www.google.com/',
'ua_family': u'Chrome',
'ua_icon': u'chrome.png',
'ua_info_url': u'http://user-agent-string.info/list-of-ua/browser-detail?browser=Chrome',
'ua_name': u'Chrome 31.0.1650.57',
'ua_url': u'http://www.google.com/chrome'}
Thank you everyone for your answers, I found something really simple that works.
Download the httpagentparser module from:
https://pypi.python.org/pypi/httpagentparser
After that, just import it in your Python program:
import httpagentparser
Then you can write a function like this that returns the browser name; it works like a charm:
def detectBrowser(request):
    agent = request.environ.get('HTTP_USER_AGENT')
    browser = httpagentparser.detect(agent)
    if not browser:
        browser = agent.split('/')[0]
    else:
        browser = browser['browser']['name']
    return browser
That's it
As you can see, this tells me nothing since there is mention of
Mozilla, Safari, Chrome etc. even though I visited with Chrome.
It's not that the User Agent string tells you "nothing;" it's that it's telling you too much.
If you want a report that breaks down your users' browsers, your best bet is to analyze your logs. Several programs are available to help. (One caveat, if you're using Bottle's "raw" web server, is that it won't log in Common Log Format out of the box. You have options.)
If you need to know in real time, you'll need to spend time learning user agent strings (useragentstring.com might help here) or use an API like this one.
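For a feel of what learning user agent strings involves, here is a deliberately naive sketch that checks product tokens in a fixed order. The order matters because Chrome's string also contains "Safari", and Chromium-based Edge and Opera also contain "Chrome". The function name and the token list are my own and far from complete; a maintained parser will be more reliable:

```python
# Tokens are checked in order: browsers that imitate others come first.
BROWSER_TOKENS = [
    ("Edg/", "Edge"),
    ("OPR/", "Opera"),
    ("Firefox/", "Firefox"),
    ("Chrome/", "Chrome"),
    ("Safari/", "Safari"),
]


def guess_browser(user_agent):
    """Return a browser family name based on the first matching token."""
    if not user_agent:
        return "Unknown"
    for token, name in BROWSER_TOKENS:
        if token in user_agent:
            return name
    return "Unknown"


chrome_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
)
print(guess_browser(chrome_ua))  # prints "Chrome"
```

In Bottle the header value could be passed in as request.environ.get('HTTP_USER_AGENT'), the same lookup used in the httpagentparser answer.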