I am trying to write a simple download manager using python with concurrency. The aim is to use the Content-Length header from the response of a file url and splits the file into chunks then download those chunks concurrently. The idea works for all the urls which have Content-Length header but recently I came across some urls which doesn't serve a Content-Length header.
https://filesamples.com/samples/audio/mp3/sample3.mp3
HTTP/1.1 200 OK
Date: Sat, 08 Aug 2020 11:53:15 GMT
Content-Type: audio/mpeg
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=d2a4be3535695af67cb7a7efe5add19bf1596887595; expires=Mon, 07-Sep-20 11:53:15 GMT; path=/; domain=.filesamples.com; HttpOnly; SameSite=Lax
Cache-Control: public, max-age=86400
Display: staticcontent_sol, staticcontent_sol
Etag: W/"5def04f1-19d6dd-gzip"
Last-Modified: Fri, 31 Jul 2020 21:52:34 GMT
Response: 200
Vary: Accept-Encoding
Vary: User-Agent,Origin,Accept-Encoding
X-Ezoic-Cdn: Miss
X-Middleton-Display: staticcontent_sol, staticcontent_sol
X-Middleton-Response: 200
CF-Cache-Status: HIT
Age: 24
cf-request-id: 046f8413ab0000e047449da200000001
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5bf90932ae53e047-SEA
How can I get the content-length of the file without downloading the whole file?
Related
These are the homework assignment and what it is asking.
to prompt the user for the URL so it can read any web page.
You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call.
Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL
the code only needs to accept 300 bytes from the URL page - in the code in 12.2 it would be len(data)to determine how much data was read. In the the while loop the mysock.recv only reads 128 bytes at a time
When putting a website in, it does print on what it has within that website,
I need something like this for my output:
HTTP/1.1 200 OK
Date: Thu, 15 Apr 2021 20:46:29 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Please help me
I am attempting to make a web scraping tool to scrape facebook marketplace and ran into an issue when using my code on an AWS E2 Server.
It seems as though when I run the code locally, things work okay. But when I run them from my AWS E2 Server, I get redirected to a login page:
Locally
HTTP/1.1 200 OK
Vary: Accept-Encoding
accept-ch-lifetime: 4838400
accept-ch: sec-ch-prefers-color-scheme
x-fb-rlafr: 0
document-policy: force-load-at-top
Pragma: no-cache
Cache-Control: private, no-cache, no-store, must-revalidate
Expires: Sat, 01 Jan 2000 00:00:00 GMT
X-Content-Type-Options: nosniff
X-XSS-Protection: 0
X-Frame-Options: DENY
Strict-Transport-Security: max-age=15552000; preload
Content-Type: text/html; charset="utf-8"
X-FB-Debug: hRIETkwli3E8bVAgkC5W4QkI8QUsRTBrYr3VNlSA2uhlf6t+W79d7uwuiu7bO4XjOZSc40kGAmrfUXC1/3bHCg==
Date: Tue, 03 May 2022 20:38:25 GMT
Priority: u=3,i
Transfer-Encoding: chunked
Alt-Svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400
Connection: keep-alive
AWS
HTTP/2 302
location: https://www.facebook.com/login/?next=https%3A%2F%2Fwww.facebook.com%2Fmarketplace%2Fnyc
x-fb-rlafr: 0
document-policy: force-load-at-top
pragma: no-cache
cache-control: private, no-cache, no-store, must-revalidate
expires: Sat, 01 Jan 2000 00:00:00 GMT
x-content-type-options: nosniff
x-xss-protection: 0
x-frame-options: DENY
strict-transport-security: max-age=15552000; preload
content-type: text/html; charset="utf-8"
x-fb-debug: 0BY9brNcXJnElH5BpyMTVoppfjVs+5MRBpn+HlJd477H/E3T3P6XZNh/bu3oEMT1+Np/zProRfu4XEqkU2CAOA==
content-length: 0
date: Tue, 03 May 2022 20:38:15 GMT
priority: u=3,i
alt-svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400
Do I need to run my requests through a proxy server in order to interact with Facebook from my AWS E2 Server in order to send requests anonymously? Or is there something configured on my server that is causing this 302 response?
Any help would be greatly appreciated here!
I tried through several means, but nowhere do I find a satisfatory answer to this -
What are the differences between Python "requests" module and Linux "curl" command? Does "requests" use "curl" underlying, or is it totally different way of dealing with HTTP request/response?
For most of the requests, they both behave in the same way (as it should be), but sometimes, I find a difference in response and it is really hard to figure out why is it so.
eg. Using curl for HEAD request:
curl --head https://historia.sherpadesk.com
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:30 GMT
access-control-expose-headers: Request-Context
cache-control: private
location: /login/?ref=portal
set-cookie: ASP.NET_SessionId=nghpw4qp5cw2ntwmwfuxw3oi; path=/; HttpOnly; SameSite=Lax
content-length: 135
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
and if I use -L to follow redirects,
curl --head https://historia.sherpadesk.com -L
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:37 GMT
access-control-expose-headers: Request-Context
cache-control: private
location: /login/?ref=portal
set-cookie: ASP.NET_SessionId=trzp0bql4nibswux5z5wfayy; path=/; HttpOnly; SameSite=Lax
content-length: 135
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:38 GMT
access-control-expose-headers: Request-Context
location: https://app.sherpadesk.com/login/?ref=portal
content-length: 161
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
HTTP/2 200
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:39 GMT
access-control-expose-headers: Request-Context
cache-control: no-store, no-cache
expires: -1
pragma: no-cache
set-cookie: ASP.NET_SessionId=aqmnxu2s3qkri3sravsrs1cq; path=/; HttpOnly; SameSite=Lax
content-length: 8935
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
and here is the (debug) output when I use Python's requests module requests.head(url):
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): historia.sherpadesk.com:443
send: b'HEAD / HTTP/1.1\r\nHost: historia.sherpadesk.com\r\nUser-Agent: python-requests/2.26.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden: Access is denied.\r\n'
header: Content-Length: 58
header: Content-Type: text/html
header: Date: Mon, 28 Feb 2022 20:36:18 GMT
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1
header: X-Content-Type-Options: nosniff
header: Strict-Transport-Security: max-age=31536000
DEBUG:urllib3.connectionpool:https://historia.sherpadesk.com:443 "HEAD / HTTP/1.1" 403 0
INFO:root:URL: https://historia.sherpadesk.com/
INFO:root:<Response [403]>
which just results in 403 response code. Response is same whether allow_redirects is True/False. I have also tried using proxy with python code, as I thought maybe its getting blocked as this URL might be recognising Python's request to be a bot, but that also fails. Also, if that was the case, why does curl succeed?
So, my main question here is: what are the major differences between curl and requests, which might cause difference in responses in certain cases? If possible, I would really like thorough explanation which could help me debug and resolve these issues.
The two libraries are different but the problem here is related to user agent.
When I try with curl, specifying the python-requests user agent:
$ curl --head -A "python-requests/2.26.0" https://historia.sherpadesk.com/
HTTP/2 403
content-type: text/html
date: Mon, 28 Feb 2022 22:30:02 GMT
content-length: 58
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
With curl default user agent:
$ curl --head https://historia.sherpadesk.com/
HTTP/2 302
...
Apparently, they have some type of website security that is blocking HTTP clients like python-requests, but not curl for some reason.
I want to retrieve all headers from a certain site, in this example "https://www.facebook.com" as following:
import urllib2
enter code here`req = urllib2.Request('https://www.facebook.com/')
res = urllib2.urlopen(req)
print res.info()
res.close();
that results in this response:
X-XSS-Protection: 0
Pragma: no-cache
Cache-Control: private, no-cache, no-store, must-revalidate
X-Frame-Options: DENY
Strict-Transport-Security: max-age=15552000; preload
X-Content-Type-Options: nosniff
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Set-Cookie: sb=1GyeWkJzGbmX-VUyBi26; expires=Thu, 05-Mar-2020 10:26:28 GMT; Max-Age=63071999; path=/; domain=.facebook.com; secure; httponly
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
X-FB-Debug: X9aSOOKs6/aER1yuY4iUUIZrj4yTKtKSUAZ/AFE37IieCe8O4MSsFc5xlQ0LoQyHnbrSL4DaYiTVUUkFZeDrsqqg==
Date: Tue, 06 Mar 2018 10:26:29 GMT
Connection: close
I can retrieve all headers except for the Content-Security-Policy (csp);
But whenever I test on geekflare csp test
It succesfully retrieved all headers including the csp one.
Seems like I forgot to set the User-Agent within the request.
I am trying to download only range of bytes of a file and I am trying the following process:
r = requests.get('https://stackoverflow.com', headers={'Range':'bytes=0-999'})
But it give status code 200 as opposed to 206 and I am getting the entire file.
I tried following this Python: How to download file using range of bytes? but it also gave me status code 200. What is the reason and how do I download files partially using python?
Headers for stackoverflow:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Last-Modified: Fri, 04 Aug 2017 05:28:29 GMT
X-Frame-Options: SAMEORIGIN
X-Request-Guid: 86fd186e-b5ac-472f-9d79-c45149343776
Strict-Transport-Security: max-age=15552000
Content-Length: 107699
Accept-Ranges: bytes
Date: Wed, 06 Sep 2017 11:48:16 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-sin18023-SIN
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1504698496.659820,VS0,VE404
Vary: Fastly-SSL
X-DNS-Prefetch-Control: off
Set-Cookie: prov=6df82a45-2405-b1dd-1430-99e27827b360; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Cache-Control: private
This requires server-side support. Probably the server does not support it.
To find out, make a HEAD request and see if the response has Accept-Ranges.