I'm trying to download Icecast JSON status data from a server using Python.
This is my code (after several different attempts):
import urllib2

def checkStream(url):
    request = urllib2.Request(url)
    request.add_header("Connection", "keep-alive")
    request.add_header("Cache-Control", "max-age=0")
    request.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
    request.add_header("Upgrade-Insecure-Requests", "1")
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    request.add_header("Accept-Encoding", "gzip, deflate, sdch")
    response = urllib2.urlopen(request)
    line = response.read()
    print line
    return

checkStream("http://108.168.175.149:10128/status-json.xsl")
The problem is that the response is printed like this:
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type
Access-Control-Allow-Methods: GET, OPTIONS, HEAD
{"icestats":{"admin":"icemaster#localhost","banned_IPs":0,"build":20141112090605,"host":"pro02.caster.fm","location":"Earth","outgoing_kbitrate":3799,"server_id":"Icecast 2.3.3-kh11","server_start":"05/Oct/2015:10:43:46 -0500","stream_kbytes_read":104422400,"stream_kbytes_sent":5123403693,"source":[{"audio_codecid":2,"audio_info":"ice-samplerate=44100;ice-bitrate=96;ice-channels=2","bitrate":96,"connected":33748,"genre":"Various","ice-bitrate":96,"ice-channels":2,"ice-samplerate":44100,"incoming_bitrate":95920,"listener_peak":153,"listeners":42,"listenurl":"http://pro02.caster.fm:10128/live","mpeg_channels":2,"mpeg_samplerate":44100,"outgoing_kbitrate":3883,"queue_size":358609,"se
The end of the JSON response is cut short by 272 bytes, which is exactly the size of the response headers that are being returned as part of the data.
If I open the link in Chrome the response appears fine.
I also tested with the requests library, with no luck.
>>> import requests
>>> r = requests.get("http://108.168.175.149:10128/status-json.xsl")
>>> r.text
u'Expires: Thu, 19 Nov 1981 08:52:00 GMT\r\nCache-Control: no-store, no-cache, must-revalidate\r\nPragma: no-cache\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type\r\nAccess-Control-Allow-Methods: GET, OPTIONS, HEAD\r\n\r\n{"icestats":{"admin":"icemaster#localhost","banned_IPs":0,"build":20141112090605,"host":"pro02.caster.fm","location":"Earth","outgoing_kbitrate":3844,"server_id":"Icecast 2.3.3-kh11","server_start":"05/Oct/2015:10:43:46 -0500","stream_kbytes_read":104438630,"stream_kbytes_sent":5124109510,"source":[{"audio_codecid":2,"audio_info":"ice-samplerate=44100;ice-bitrate=96;ice-channels=2","bitrate":96,"connected":35133,"genre":"Various","ice-bitrate":96,"ice-channels":2,"ice-samplerate":44100,"incoming_bitrate":95920,"listener_peak":153,"listeners":43,"listenurl":"http://pro02.caster.fm:10128/live","mpeg_channels":2,"mpeg_samplerate":44100,"outgoing_kbitrate":3837,"queue_size":164258,"se'
>>>
How can I retrieve the complete data?
The server you are requesting this from is running an ancient version of an Icecast fork.
This bug was fixed, and the fix was released in mainline long ago. I'd recommend upgrading (or telling the operator to upgrade) the server to the latest official Icecast version from http://icecast.org.
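If upgrading the server is not possible, a rough client-side workaround (a Python 3 sketch, not a guaranteed fix) is to read the raw response over a plain socket, so the too-short Content-Length is ignored, and then strip everything before the first '{' of the JSON body:

# Workaround sketch: speak HTTP/1.0 over a plain socket so the whole response
# is read until the server closes the connection, then cut away the real and
# leaked headers and parse what remains as JSON.
import json
import socket

def fetch_icecast_status(host, port, path="/status-json.xsl"):
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall("GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host).encode("ascii"))
        raw = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
    body = raw[raw.index(b"{"):]  # the JSON body starts at the first '{'
    return json.loads(body.decode("utf-8", errors="replace"))

print(fetch_icecast_status("108.168.175.149", 10128))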
# selenium-request.py
from seleniumwire import webdriver # Import from seleniumwire
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.get('https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json')
for request in driver.requests:
    if request.response:
        print(request.response.headers)
When I run that code I get the headers Selenium uses:
$ python selenium-request.py
Accept-Ranges: bytes
Access-Control-Allow-Origin: http://star-website.com
Content-Type: application/json
ETag: W/"36b8a-5d3d28ed9cc43"
Last-Modified: Thu, 23 Dec 2021 16:16:16 GMT
Referrer-Policy: no-referrer-when-downgrade
Server: Apache
ServerID: e1
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=86400
Date: Thu, 23 Dec 2021 16:16:16 GMT
Content-Length: 46236
Connection: keep-alive
Content-Security-Policy: frame-ancestors 'self' *.cmegroup.com *.quikstrike.net commodex.co.il openexchange.community.cmegroup.com staging.tickertocker.com http://www.straitsfinancial.com www.straitsfinancial.com http://straitsfinancial.com https://www.home.saxo https://app.topsteptrader.com https://help.topsteptrader.com https://staging.topsteptrader.com https://blueeditsitecore.sys.dom https://bluesitecore.sys.dom https://sitecoredev.orange.saxobank.com https://sitecoredev-nocache.orange.saxobank.com https://sitecoredevedit.orange.tst2.dom http://star-website.com https://www.investing.com https://*.benzinga.com https://bz.zingbot.bz https://www.zingbot.bz https://gdcdyn.interactivebrokers.com https://www.interactivebrokers.com https://zingbot.bz https://www.zingbot.bz https://m.zingbot.bz https://bz.zingbot.bz https://dev.futuresfirstacademy.com https://uat.futuresfirstacademy.com https://futuresfirstacademy.com http://stage.barchart.com http://www.barchart.com https://www.infinityfutures.com https://kilofutures.com https://m.cqg.com https://mdemo.cqg.com *.chicago.cme.com:7822 https://uatm.cqg.com https://local.zingbot.bz https://www.gulfbondsukuk.org www.kgieworld.sg https://www.propex24.wpcomstaging.com https://www.propex24.com *.straitsfinancial.gate39tech.com us.straitsfinancial.com https://*.kapcoclients.com https://kapcoclients.com https://*.wallstreetbound.org https://wallstreetbound.org https://cofcointl.plateau.com https://rise.articulate.com https://members.tradeday.com http://blf-django.herokuapp.com https://www.bluelinefutures.com https://www.bluelinefutures.live https://www.bluelinefutures.trade https://login.chicago.cme.com https://loginnr.chicago.cme.com https://logincert.chicago.cme.com https://login-ny.chicago.cme.com https://ampfutures.com https://cme.ampfutures.com https://*.advantagefutures.com https://*.e-futures.com https://*.etrade.com https://*.gffbrokers.com https://infinityfutures-cn.com https://sweetfutures.com https://*.tradovate.com https://home.saxo https://*.tickmill.co.uk https://*.directa.it https://big.pt https://*.tradestation-international.com https://*.stonex.com http://tradinglesson.com https://tradinglesson.com *.ibroker.it *.ibroker.es *.cornertrader.ch *.whselfinvest.com *.banxbroker.de *.ameritrade.com *.sweetfutures.com *.danielstrading.com *.gainfutures.com *.futuresonline.com *.tdainc.com *.lsvp.com *.schwab.com *.schwab.co.uk *.us.global.schwab.com *.dev.schwab.com;
Set-Cookie: ak_bmsc=AB0A9701302106EABE2E195C6AC2A074~000000000000000000000000000000~YAAQLtERAvOZVN19AQAA7C8U6A7AWr7StAmiphZPltguFftPSOXgfa2NAq7Vts+40k7AdnPG55ULK1vyBRhPRdqWbtYml3JTC3RjHLu31l8kWBFvysYyuY2uz4GpkvmOWoBSN/Dl/2bQ9bEgbiYj3tCZ1o+wEvMfsiAWiJeMY3M1ozu6nyQz0JVpdvfsqun3z5wGhpJWhkjrJjeIyHvVdzx2uyIb1azRFlHT+nRCR6NHGoaMM/G2sI1DqPOXPB5btXjdncvB739c2Beh7RgWD/zvb78qpAJDUR1KOenDy1EwN2Bg8pqH1sxlsoVrl7i7r/pAOaWKfd4U1FKP7p730GfOp/m2VRBIdYgHDPHPvGeITPKrR/G22aR886r9Lerhug==; Domain=.cmegroup.com; Path=/; Expires=Thu, 23 Dec 2021 18:16:01 GMT; Max-Age=7185; HttpOnly
I copied these exact headers into a Python dict and made the request as follows:
# python-request.py
import requests
headers = {
"Accept-Ranges": "bytes",
"Access-Control-Allow-Origin": "http://star-website.com",
"Content-Type": "application/json",
"ETag": 'W/"36b8a-5d3d28ed9cc43"',
"Last-Modified": "Thu, 23 Dec 2021 16:16:16 GMT",
"Referrer-Policy": "no-referrer-when-downgrade",
"Server": "Apache",
"ServerID": "e1",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
"Vary": "Accept-Encoding",
"Content-Encoding": "gzip",
"Cache-Control": "max-age=86400",
"Date": "Thu, 23 Dec 2021 16:16:16 GMT",
"Content-Length": "46236",
"Connection": "keep-alive",
"Content-Security-Policy": "frame-ancestors 'self' *.cmegroup.com *.quikstrike.net commodex.co.il openexchange.community.cmegroup.com staging.tickertocker.com http://www.straitsfinancial.com www.straitsfinancial.com http://straitsfinancial.com https://www.home.saxo https://app.topsteptrader.com https://help.topsteptrader.com https://staging.topsteptrader.com https://blueeditsitecore.sys.dom https://bluesitecore.sys.dom https://sitecoredev.orange.saxobank.com https://sitecoredev-nocache.orange.saxobank.com https://sitecoredevedit.orange.tst2.dom http://star-website.com https://www.investing.com https://*.benzinga.com https://bz.zingbot.bz https://www.zingbot.bz https://gdcdyn.interactivebrokers.com https://www.interactivebrokers.com https://zingbot.bz https://www.zingbot.bz https://m.zingbot.bz https://bz.zingbot.bz https://dev.futuresfirstacademy.com https://uat.futuresfirstacademy.com https://futuresfirstacademy.com http://stage.barchart.com http://www.barchart.com https://www.infinityfutures.com https://kilofutures.com https://m.cqg.com https://mdemo.cqg.com *.chicago.cme.com:7822 https://uatm.cqg.com https://local.zingbot.bz https://www.gulfbondsukuk.org www.kgieworld.sg https://www.propex24.wpcomstaging.com https://www.propex24.com *.straitsfinancial.gate39tech.com us.straitsfinancial.com https://*.kapcoclients.com https://kapcoclients.com https://*.wallstreetbound.org https://wallstreetbound.org https://cofcointl.plateau.com https://rise.articulate.com https://members.tradeday.com http://blf-django.herokuapp.com https://www.bluelinefutures.com https://www.bluelinefutures.live https://www.bluelinefutures.trade https://login.chicago.cme.com https://loginnr.chicago.cme.com https://logincert.chicago.cme.com https://login-ny.chicago.cme.com https://ampfutures.com https://cme.ampfutures.com https://*.advantagefutures.com https://*.e-futures.com https://*.etrade.com https://*.gffbrokers.com https://infinityfutures-cn.com https://sweetfutures.com https://*.tradovate.com https://home.saxo https://*.tickmill.co.uk https://*.directa.it https://big.pt https://*.tradestation-international.com https://*.stonex.com http://tradinglesson.com https://tradinglesson.com *.ibroker.it *.ibroker.es *.cornertrader.ch *.whselfinvest.com *.banxbroker.de *.ameritrade.com *.sweetfutures.com *.danielstrading.com *.gainfutures.com *.futuresonline.com *.tdainc.com *.lsvp.com *.schwab.com *.schwab.co.uk *.us.global.schwab.com *.dev.schwab.com;",
"Set-Cookie": "ak_bmsc=AB0A9701302106EABE2E195C6AC2A074~000000000000000000000000000000~YAAQLtERAvOZVN19AQAA7C8U6A7AWr7StAmiphZPltguFftPSOXgfa2NAq7Vts+40k7AdnPG55ULK1vyBRhPRdqWbtYml3JTC3RjHLu31l8kWBFvysYyuY2uz4GpkvmOWoBSN/Dl/2bQ9bEgbiYj3tCZ1o+wEvMfsiAWiJeMY3M1ozu6nyQz0JVpdvfsqun3z5wGhpJWhkjrJjeIyHvVdzx2uyIb1azRFlHT+nRCR6NHGoaMM/G2sI1DqPOXPB5btXjdncvB739c2Beh7RgWD/zvb78qpAJDUR1KOenDy1EwN2Bg8pqH1sxlsoVrl7i7r/pAOaWKfd4U1FKP7p730GfOp/m2VRBIdYgHDPHPvGeITPKrR/G22aR886r9Lerhug==; Domain=.cmegroup.com; Path=/; Expires=Thu, 23 Dec 2021 18:16:01 GMT; Max-Age=7185; HttpOnly"
}
requests.get(
    "https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json",
    headers=headers)
When I run this it just hangs indefinitely, so there is some issue with the request.
Apart from the headers, what is the difference between the requests made by Python and by Selenium? How could I identify the issue and hopefully get this working with the Python requests library?
Update
I updated the code to get the request.headers instead:
Host: www.cmegroup.com
Connection: keep-alive
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
... but the Python requests script has the same result when using these headers: it just hangs (or times out if I set a timeout parameter).
Further update
Debug output is as follows:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.cmegroup.com:443
send: b'GET /content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12%7C07%7C2021.01%7C01%7C2008.json HTTP/1.1\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nConnection: keep-alive\r\nHost: www.cmegroup.com\r\nsec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"\r\nsec-ch-ua-mobile: ?0\r\nsec-ch-ua-platform: Linux\r\nUpgrade-Insecure-Requests: 1\r\nSec-Fetch-Site: none\r\nSec-Fetch-Mode: navigate\r\nSec-Fetch-User: ?1\r\nSec-Fetch-Dest: document\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n'
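For reference, a wire-level trace like the one above can be produced by enabling http.client's debug output and urllib3's logger before making the request, roughly like this:

# Roughly how the debug trace above can be produced (Python 3):
import logging
import http.client

http.client.HTTPConnection.debuglevel = 1  # emits the "send: b'GET ...'" lines
logging.basicConfig(level=logging.DEBUG)   # shows the urllib3.connectionpool messages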
It looks like it only needs a compatible User-Agent header.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0',
}
url = 'https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json'
response = requests.get(url, headers=headers, timeout=30)
print(response.status_code) # Prints 200 (OK).
print(response.json()) # Prints the output as JSON. "item" key has 50 values in a list.
^ This snippet did the trick for me.
It looks like you are using the response headers, not the request headers.
Try
print(request.headers)
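A minimal corrected version of the selenium-wire loop (same URL as above) that prints the request headers the browser actually sends, alongside the response headers, might look like this:

# Print the browser's outgoing request headers, not just the server's response headers.
from seleniumwire import webdriver

driver = webdriver.Chrome()
driver.get('https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json')
for request in driver.requests:
    if request.response:
        print(request.url)
        print(request.headers)           # headers the browser sent
        print(request.response.headers)  # headers the server returned
driver.quit()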
I am attempting to download a .xlsx file from https://matmatch.com/ using Python requests, but my script downloads the HTML content of the page rather than the .xlsx document I want. I used LiveHTTPHeaders to look at the requests the page makes, and I found the GET request the page sends to download the file:
http://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=cd21d8a7-ba5e-037e-f830-08600989cb0a
Using this in Python with a requests.get call still results in downloading the raw HTML instead of the file. Here is the code I am currently using:
import requests
url = "https://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=360f528b-53be-72c1-25dd-dcf9401633a3"
r = requests.get(url, allow_redirects=True)
open('copper.csv', 'wb').write(r.content)
Are there any glaring issues in my code, or am I just not understanding how HTML works?
Cheers
EDIT: Here is the page I am trying to download from using the "download as .xlsx" button
https://matmatch.com/materials/mitf1194-astm-b196-grade-c17200-tb00
EDIT: Here is the LiveHTTPHeaders summary of the GET request. Do I need to include some of these headers in my request?
GET /api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=41db0a0e-5091-0b06-8195-f0e0639286d6 HTTP/1.1
Host: matmatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
HTTP/1.1 200
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Connection: keep-alive
Content-Disposition: attachment; filename="mitf1194-astm-b196-grade-c17200-tb00.xlsx"
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Date: Mon, 22 Mar 2021 00:37:42 GMT
Expires: 0
Filename: mitf1194-astm-b196-grade-c17200-tb00.xlsx
Pragma: no-cache
Server: nginx
Strict-Transport-Security: max-age=15724800; includeSubDomains;
transfer-encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
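For reference, replaying that captured GET with requests and the same browser-style headers would look roughly like the sketch below. The download token is a placeholder (the tokens in the URLs above expire after about 30 minutes, so a fresh one has to be taken from the page), and whether the server returns the spreadsheet or an HTML page will depend on that token being valid:

# Hedged sketch: replay the captured GET with browser-style headers.
import requests

url = ("https://matmatch.com/api/downloads/materials/MITF1194/xlsx"
       "?expires-in=30min&download-token=REPLACE_WITH_FRESH_TOKEN")
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/89.0.4389.90 Safari/537.36"),
}
r = requests.get(url, headers=headers, allow_redirects=True, timeout=30)
print(r.status_code, r.headers.get("Content-Type"))
# Only save the body if the server actually returned a spreadsheet, not an HTML page.
if "spreadsheetml" in r.headers.get("Content-Type", ""):
    with open("mitf1194-astm-b196-grade-c17200-tb00.xlsx", "wb") as f:
        f.write(r.content)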
I'm trying to download a file from a website using Python's requests module.
However, the site will only let me download the file if the download link is clicked directly from the download page.
So, using requests, I tried hitting the download page's URL first with requests.get() and then proceeding to download the file. Unfortunately this doesn't work: a piece of text asking me to open the download page first simply gets written into file.torrent.
import requests

def download(username, password):
    with requests.Session() as session:
        session.post('https://website.net/forum/login.php', data={'login_username': username, 'login_password': password})
        # Download page URL
        requests.get('https://website.net/forum/viewtopic.php?t=2508126')
        # The download URL itself
        response = requests.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)

download(username='XXXXX', password='YYYYY')
Response when downloading directly from the download page (works) :
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Disposition: attachment; filename="[website.net].t2508126.torrent"
Content-Length: 33641
Content-Type: application/x-bittorrent; name="[website.net].t2508126.torrent"
Date: Thu, 14 Feb 2019 07:57:08 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 07:57:09 GMT
Pragma: no-cache
Server: nginx
Set-Cookie: bb_dl=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/forum/; domain=.website.net
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131012-3061288864-1; bb_dl=2508126
Host: website.net
Referer: https://website.net/forum/viewtopic.php?t=2508126
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
Response when opening the download link on its own (doesn't work) :
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Type: text/html; charset=windows-1251
Date: Thu, 14 Feb 2019 08:03:29 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 08:03:29 GMT
Pragma: no-cache
Server: nginx
Transfer-Encoding: chunked
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131390-3061288864-1
Host: website.net
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
This works for me:
data={'login_username': username, 'login_password': password, 'login': ''}
and using session.get() instead of requests.get()
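Put together, a sketch of the fixed function (the extra 'login' field in the POST data and session.get() for the two follow-up requests are the only changes):

import requests

def download(username, password):
    with requests.Session() as session:
        # The login form apparently also expects a 'login' field.
        session.post('https://website.net/forum/login.php',
                     data={'login_username': username,
                           'login_password': password,
                           'login': ''})
        # Visit the download page first, with the same session, so the
        # session picks up the cookies the server expects.
        session.get('https://website.net/forum/viewtopic.php?t=2508126')
        # Then fetch the file itself with that same session.
        response = session.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)

download(username='XXXXX', password='YYYYY')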
I'm trying to perform a request in Python 3 to a URL that should return JSON. Instead it's returning a sequence of bytes that I'm unable to convert. Why am I receiving this type of response, and how can I convert it into human-readable data?
Below is a snippet of my code:
from urllib import request  # the snippet below uses urllib.request as `request`

headers = {}
headers['Host']= 'XXXXX' # hidden
headers['Connection']= 'keep-alive'
headers['Content-Length']= '122'
headers['Accept']= 'application/json, text/javascript, */*; q=0.01'
headers['Origin']= 'XXXXX' # hidden
headers['X-Requested-With']= 'XMLHttpRequest'
headers['User-Agent']= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
headers['Content-Type']= 'application/json'
headers['Referer']= 'XXXXX' # hidden
headers['Accept-Encoding']= 'gzip, deflate, br'
headers['Accept-Language']= 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
headers['Cookie'] = 'XXXXX' # hidden
try:
    req = request.Request(url, post_data, headers)
    x = request.urlopen(req)
    print(x.read())
    print(x.info())
except Exception as e:
    print(e)
Below is the response received:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03L\x8fAK\x031\x10\x85\xff\xca0\x07Q\x88\x899(\xb2\xd0\x93\xf4\xe2\xa1-z]\x90\xecf\xb6\x1b\xd9d\xca$-H\xe9\x7f7\x91\x8a^\x86\x997\xef\x1b\xde\x9c\xf1D\x92\x03\'\xec\xd0j\x8b\nI\x84\x05\xbb\xf3_\x13)g\xb7\xa7\xea\x88n\x99X"yx}\xdfn \x17\ti\xaf Q(3\t8\x11\xf7\xa5\x80\x87O\x1aK\x95\x8fq QW\x1bp5\x14\x8e\xaaV\x18g\'n,\x95\xe1i\xcaT\xe0\x01n\x07\xaa\xb7\t\xfa\xdfD\xab\x9a\xe7&R\x99\xd9\xaf\xd6Z\xeb\x1e\xef\x1aj\x8eYL\xae<\x99\x03\xc9\xf2hN\x94<\xcbG\x1bL\x8b\xa5\x0f\x11\x96\x90\x08\xec\xd3\xb3\xee\x13^\x14&\x17[\xfc\xb6}\xdb\xbd\xac\x7f\x1eS\xff\xfe\xda9\xc9\x04t\xd5G\xf6M\xb4\x8d\x0c\x1e\xbb{{\xf9\x06\x00\x00\xff\xff\x03\x00\xc4\xd9gg\'\x01\x00\x00'
Date: Wed, 26 Dec 2018 16:46:48 GMT
Server: Apache
Strict-Transport-Security: max-age=16070400
X-UA-Compatible: IE=Edge,chrome=1, IE=Edge,chrome=1
Content-Type: application/json; charset=utf-8
Vary: Accept-Encoding
Content-Encoding: gzip
X-Frame-Options: SAMEORIGIN
Connection: close
Transfer-Encoding: chunked
The response is gzip-compressed: Content-Encoding: gzip.
Decompress it and then parse it with json.loads().
Example:
import zlib
# x is the urlopen() response object from the question
decompressed_data = zlib.decompress(x.read(), 16 + zlib.MAX_WBITS)
Another option: tell the server you don't want compressed content by removing gzip (and any other compression types) from the Accept-Encoding request header.
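Concretely, that could be as simple as asking for an uncompressed response before sending the request (most servers honor this, though they are not strictly required to):

headers['Accept-Encoding'] = 'identity'  # ask the server not to compress the response body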
Or try something like this; requests handles gzip decompression transparently, so .json() returns the parsed data directly:
import requests
r = requests.post('your URL', data=YourData)
r.json()
I am trying to write a script that will download a bunch of files from a website that has REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check whether the response was a 302. If it is not, it will "pass"; if it is, it will need to parse out the location parameter shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the location parameter, I will have to make another GET request to download that file. I will also have to maintain the cookie for my session in order to download the file.
Can someone point me in the right direction for which library is best to use for this? I am having trouble finding out how to parse the 302 response and add a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
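For what the question describes (check for a 302, pull out the Location header, keep the session cookie), the requests library covers all of it. A minimal sketch, using the URL and the PHPSESSID value from the request shown above purely as an illustration:

import requests

session = requests.Session()
session.cookies.set("PHPSESSID", "fa844952890e9091d968c541caa6965f", domain="testServer.com")

# Ask for the file but do not follow the redirect automatically,
# so the 302 and its Location header can be inspected first.
resp = session.get(
    "http://testServer.com/test/download/id/5774/format/testTitle",
    allow_redirects=False,
)

if resp.status_code == 302:
    file_url = resp.headers["Location"]
    # Second GET with the same session (cookies are reused) to fetch the actual file.
    file_resp = session.get(file_url)
    with open("picture.jpg", "wb") as f:
        f.write(file_resp.content)
else:
    pass  # not a redirect; skip, as described in the question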
import urllib.request as ur
import urllib.error as ue
'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response body, or up to
the next amt bytes. This is because there is no way for urlopen() to automatically determine
the encoding of the byte stream it receives from the http server.
'''
url = "http://www.example.org/images/{}.jpg"
dst = ""
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)
try:
    for x in arr:
        print(str(x)+"). ".ljust(4), end="")
        hrio = ur.urlopen(url.format(x))  # HTTPResponse object; .read() returns the response body as bytes
        fh = open(dst+str(x)+".jpg", "b+w")
        fh.write(hrio.read())
        fh.close()
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    print("\t[REQUEST INCOMPLETE]\t", end="")
    print("<Error ~ [{}]>".format(e))