I am trying to webscrape data from CME exchange:
https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021
I have the following code snippet:
import requests as r
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
header = {'User-Agent': user_agent}
link = 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021'
page = r.get(link,headers=header)
raw_json = json.loads(page.text)
While it works perfectly well on a local computer, it totally hangs on remote hosting servers (Digital Ocean, Hetzner). I have also tried to curl url but it gives a timeout error without additional details.
Do I need to use selenium for this? I wonder what can be different between scraping data from a local machine and the hosting server.
I don't know how to resolve this. Hope you can give me some clues.
Apparently, some hosting providers are blocked by CME. You should look for one which is not blocked and you can use it as a proxy server. That's the solution that worked for me. However, now I am thinking that this could be related to IPv6 settings on the server. Try to disable IPv6 connection and it will automatically fall back into IPv4.
on Ubuntu
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
Just found the solution for this problem.
Reason for this behaviour its due to the protocol HTTP/2.
A way to test this its upgrading curl, since 7.47.0, the curl tool enables HTTP/2 by default for HTTPS connections.
Hope it helps!
You can get json response from URL itself not requried page.text to transform in to json
Just use this directly may be it could work
data=page.json()
Related
I am trying to download a file using the python requests module, my code works for some urls/hosts but I've come across one that does not work.
Based on other similar questions it may be related to the User-Agent request header, I have tried to remedy by adding the chrome user-agent but the connection still times out for this particular url (it does work for others).
I have tested opening the url in chrome browser (which works all OK) and inspecting the request headers, but I still can't figure out why my code is failing:
import requests
url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'
headers = {'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
session = requests.Session()
session.headers.update(headers)
response = session.get(url, stream=True)
# !!! code fails here for this particular url !!!
with open('test.csv', "wb") as fh:
for x in response.iter_content(chunk_size=1024):
if x: fh.write(x)
Update 2020-08-14
I have figured out what was wrong; on the instances where the code was working the urls were using https protocol. This url is http protocol, and my proxy settings were not configured for http only https. After providing a http proxy to requests my code did work as written.
The code you posted worked for me, it saved the file (129007 lines). It could be that the host is rate-limiting you, try again later to see if it works.
# count lines
$ wc -l test.csv
129007 test.csv
# inspect headers
$ head -n 4 test.csv
Date,Region_Name,Area_Code,Index
1968-04-01,Wales,W92000004,2.11932727
1968-04-01,Scotland,S92000003,2.108087275
1968-04-01,Northern Ireland,N92000001,3.300419757
You can disable requests' timeouts by passing timeout=None. Here is the official documentation: https://requests.readthedocs.io/en/master/user/advanced/#timeouts
I have a flask application running on a IIS server. Everything works fine, however I always get a timeout error when using requests.
import requests
r = requests.get('https://github.com')
Using web services is therefore impossible.
I have tried using headers with the requests. But still the same result:
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get('https://github.com', headers=headers)
Also tried increasing the timeouts limits, both in code and in the IIS.
Also tried changing the Identity field under Process Model section to LocalSystem.
I'm not familiar with IIS and I cannot think of anything else. Need help.
According to your description, I think this issue is not related with the IIS. It seems your network issue.
I suggest you could firstly check your server's firewall to make sure you let your server could access the internet.
If you need to use proxy to access the internet, I suggest you could try to add below settings in your web.config for your flask application.
<system.net>
<defaultProxy>
<proxy
proxyaddress="The IP address"
bypassonlocal="true"
/>
</defaultProxy>
</system.net>
Details, you could see this article.
I'm doing a requests.get(url='url', verify=False), from my django application hosted on an Ubuntu server from AWS, to a url that has a Django Rest Framework. There are no permissions or authentication on the DRF, because I'm the one that made it. I've added headers such as
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}, but wasn't able to get any content.
BUT when I do run ./manage.py shell and run the exact same command, I get the output that I need!
EDIT 1:
So I've started using subprocess.get_output("curl <url> --insecure", shell=True) and it works, but I know this is not a very "nice" way to do things.
I know what the problem was.
My application when it was being deployed was single threaded, not multithreaded.
I changed my worker number and that fixed everything.
I am using python (http://python-wordpress-xmlrpc.readthedocs.io/en/latest/) to connect to wordpress to post contents.
I have a few wordpress sites to which I connect using sitename.com/xmlrpc.php
However one of my sites recently started reporting a problem while connection mentioning not a valid xml. When I view the page in browser I see the usual "XML-RPC server accepts POST requests only." but when I connect using python I see the following message:
funct ion toNumbers(d){var e=[];d.replace(/(..)/g,function(d){e.push(parseInt(d,16))}) ;return e}function toHex(){for(var d=[],d=1==arguments.length&&arguments[0].cons tructor==Array?arguments[0]:arguments,e="",f=0;fd[f]?"0":"" )+d[f].toString(16);return e.toLowerCase()}var a=toNumbers("f655ba9d09a112d4968c 63579db590b4"),b=toNumbers("98344c2eee86c3994890592585b49f80"),c=toNumbers("c299 e542498206cd9cff8fd57dfc56df");document.cookie="__test="+toHex(slowAES.decrypt(c ,2,a,b))+"; expires=Thu, 31-Dec-37 23:55:55 GMT; path=/"; location.href="http://targetDomainNameHere.com/xmlrpc.php?i=1";This site requires Javascript to work, please enable Javascript in your browser or use a browser with Javascript support
I searched for the file aes.js, no luck.
How to get this working ? How do I remove this? I am using the latest version of Wordpress as of 07.NOV.2017
You can try to pass "User-Agent" header in the request. Generally the Java or Python library would use their version in the User-Agent allowing the word-press server to block.
Over-riding User-Agent header with browser-like value can help get data for some word-press servers. Value can look like: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
I'm trying to find out which browsers are my users using and I'm running into a problem.
If I try to read header "User-Agent" it usually gives me lots of text, and tells me nothing.
For example, if I visit the site with Chrome, in "User-Agent" header there is:
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
Framework I've been using is Bottle (Python).
Any help would be appreciated, thanks.
User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36".
As you can see, this tells me nothing since there is mention of
Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
Your conclusion above is wrong. The UA tells you many things including the type and version of the web browser.
The post below explains why Mozilla and Safari exist in Chrome's UA.
History of the browser user-agent string
You can try to analyze it manually on user-agent-string-db.
There's a Python API for it.
from uasparser2 import UASparser
uas_parser = UASparser()
# Instead of fecthing data via network every time, you can cache the db in local
# uas_parser = UASparser('/path/to/your/cache/folder', mem_cache_size=1000)
# Updating data is simple: uas_parser.updateData()
result = ua_parser.parse('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36')
# result
{'os_company': u'',
'os_company_url': u'',
'os_family': u'Linux',
'os_icon': u'linux.png',
'os_name': u'Linux',
'os_url': u'http://en.wikipedia.org/wiki/Linux',
'typ': u'Browser',
'ua_company': u'Google Inc.',
'ua_company_url': u'http://www.google.com/',
'ua_family': u'Chrome',
'ua_icon': u'chrome.png',
'ua_info_url': u'http://user-agent-string.info/list-of-ua/browser-detail?browser=Chrome',
'ua_name': u'Chrome 31.0.1650.57',
'ua_url': u'http://www.google.com/chrome'}
Thank you everyone for your answers, I found something really simple that works.
Download httpagentparser module from:
https://pypi.python.org/pypi/httpagentparser
after that, just import it in your pythong program
import httpagentparser
Then you can write a function like this that returns browser, works like a charm:
def detectBrowser(request):
agent = request.environ.get('HTTP_USER_AGENT')
browser = httpagentparser.detect(agent)
if not browser:
browser = agent.split('/')[0]
else:
browser = browser['browser']['name']
return browser
That's it
As you can see, this tells me nothing since there is mention of
Mozzila, Safari, Chrome etc.. even though I visited with Chrome.
It's not that the User Agent string tells you "nothing;" it's that it's telling you too much.
If you want a report that breaks down your users browser, your best bet is to analyze your logs. Several programs are available to help. (One caveat, if you're using Bottle's "raw" web server, is that it won't log in Common Log Format out of the box. You have options.)
If you need to know in real time, you'll need to spend time learning user agent strings (useragentstring.com might help here) or use an API like this one.