Download a URL only if it is a HTML Webpage

I want to write a Python script that downloads a web page only if it contains HTML. I know that the Content-Type header is the thing to check, but I can't find a way to get the headers before the file is downloaded. Please suggest a way to do it.

Use http.client to send a HEAD request to the URL. This returns only the headers for the resource, so you can look at the Content-Type header and see whether it is text/html. If it is, send a GET request to the URL to get the body.
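The HEAD-then-GET approach could be sketched roughly as follows. This is only an illustration: the function names is_html and fetch_if_html are made up for this example, redirects are not followed, and it assumes the server answers HEAD requests with the same Content-Type it would send for GET.

```python
import http.client
from urllib.parse import urlsplit


def is_html(content_type):
    """Return True if a Content-Type header value denotes HTML."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def fetch_if_html(url, timeout=10):
    """Return the body as bytes if the URL serves HTML, otherwise None.

    Hypothetical helper: no redirect handling, no retries.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        # HEAD returns only the status line and headers, never a body.
        conn.request("HEAD", path)
        head = conn.getresponse()
        head.read()  # finish this response before sending the next request
        if not is_html(head.getheader("Content-Type")):
            return None
        conn.request("GET", path)
        return conn.getresponse().read()
    finally:
        conn.close()


# Example call (needs network access):
# body = fetch_if_html("https://www.python.org/")
```

Some servers reject HEAD or report a different Content-Type for it, so treat the HEAD result as a hint and be prepared to fall back to a plain GET.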

Related

Upload PDF using requests in Python

I am trying to upload a PDF file to an API using requests in Python.
In Postman, for example, I can open the body tab and select "binary" to upload the file. How can I do this in Python? (payload="")
The Content-Type needs to be application/pdf.
The PDF needs to be downloaded from a URL.
Thanks in advance
I tried to use PyPDF to read the file from the URL.

First, define the headers that your upload request will have. Then get the PDF from the remote URL and keep its content in memory. After that, just use a POST request with the specified headers to upload the document.
import requests
headers = {'content-type': 'application/pdf'}
pdf = requests.get('http://mypdfurl').content
requests.post('http://myserver/upload', headers=headers, data=pdf)

Parameters are ignored in python web request for JSON data

I am trying to read JSON-formatted data from the following public URL: http://ws-old.parlament.ch/factions?format=json. Unfortunately, I have not been able to parse the response as JSON, because I always get HTML-formatted content back from my request. The request seems to completely ignore the JSON formatting parameter passed with the URL:
import urllib.request
response = urllib.request.urlopen('http://ws-old.parlament.ch/factions?format=json')
response_text = response.read()
print(response_text) #why is this HTML?
Does somebody know how I am able to get the JSON formatted content as displayed in the web browser?

You need to add "Accept": "text/json" to the request headers.
For example, using the requests package:
import requests
r = requests.get('http://ws-old.parlament.ch/factions?format=json',
                 headers={'Accept': 'text/json'})
print(r.json())
Result:
[{'id': 3, 'updated': '2022-02-22T14:59:17Z', 'abbreviation': ...

Unfortunately, these web services have a misleading implementation. The format query parameter is useless. As pointed out by @maciek97x, only the Accept: <format> header is considered for the formatting.
So you can call the endpoint directly without ?format=json but with the header Accept: text/json.
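Since the question uses urllib.request rather than requests, the same Accept header can be set there too. The following is a minimal sketch: the helper names build_json_request and fetch_json are invented for illustration, and it assumes the service behaves as the answers describe and returns UTF-8 encoded JSON.

```python
import json
import urllib.request


def build_json_request(url):
    """Build a GET request that asks the server for JSON via the Accept header."""
    return urllib.request.Request(url, headers={"Accept": "text/json"})


def fetch_json(url, timeout=10):
    """Send the request and parse the response body as JSON.

    Assumes the body is UTF-8 encoded JSON (an assumption, not verified here).
    """
    with urllib.request.urlopen(build_json_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs network access):
# factions = fetch_json("http://ws-old.parlament.ch/factions")
```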

Twitter scraping using Python

I've been working on a project to reverse-engineer Twitter's app in order to scrape public posts from Twitter through an unofficial API, with Python. (I want to create an "alternative" app, which is simply a localhost site that can search for a user and get their posts.)
I've been searching and reading everything related to REST, AJAX, and the python modules requests, requests-html, BeautifulSoup, and more.
I can see when looking at twitter on the devtools (for example on Marvel's profile page) that the only relevant requests being sent (by POST and GET) are the following: client_event.json and UserTweets?variables=... .
I understood that these are the relevant messages being received by cleaning the network tab and recording only when I scroll down and load new tweets - these are the only messages that came up which aren't random videos (I cleaned the search using -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).
I am new to this field, so I am probably doing something wrong. I can see on UserTweets the response I'm looking for - a beautiful JSON with all the data I need - but I am unable, no matter how much I've been trying to, to access it.
I tried different modules and different headers, and I get nothing. I DON'T want to use Selenium since it's tiresome, and I know where the data I need is stored.
I've been trying to send a GET request to:
https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D
by doing:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html, 'lxml')
but I get back an HTML script that either says Chromium is unsupported, or just a static page without the javascript updating the DOM.
All help appreciated.
Thank you
P.S
I've posted the same question on reverseengineering.stackexchange, just to be safe (overflow has more appropriate tags :-))

Before you dive into the actual code, I would first build the correct request to Twitter. I would use a third-party tool focused on REST and APIs, such as Postman, to build and test the required request, and only then write the actual code.
From your question it seems that you'll be using an open API of Twitter, which means you only need to send x-guest-token and a basic Bearer authorization in your request headers.
The Bearer is static: you can just browse to Twitter and copy/paste it from the dev tools network monitor.
To get the x-guest-token you'll need something dynamic, because it expires. What I would suggest is to send a curl request to Twitter, parse the token from the response, and put it in your header before sending the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api).
After you have both of the above, build the required GET request in Postman and test whether you get back the correct response. Only after you have everything working in Postman, write the same in Python or any other language**
**You can use Postman snippets, which automatically generate the code needed in many programming languages.

@TripleS, here is an example of how one might extract the JSON data from __INITIAL_STATE__ and write it to a text file.
import json
import re
import requests
# Get the page
result = requests.get('https://twitter.com/ThePSF')
# Extract the JSON from "window.__INITIAL_STATE__={...};"
match = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text)
if match is None:
    raise SystemExit('window.__INITIAL_STATE__ was not found in the page')
# Convert the text string to structured JSON data
twitter_json = json.loads(match.group(1))
# Save the structured JSON data to a text file that may help
# you orient yourself and possibly pick out the parts you
# are interested in (if there are any)
with open('twitter_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(twitter_json, indent=4, sort_keys=True))

I've just tried the same, but with the requests module rather than requests_html. I could get the whole site content, but I would not call it "beautiful".
Also, I am now blocked from accessing the site without logging in.
Here is my small example.
Use the official Twitter API instead.
I also think that I will probably be blocked after a few runs of this script. I've tried it only 2 times.
import requests
import bs4
def example():
    result = requests.get("https://twitter.com/childrightscnct")
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup)
if __name__ == '__main__':
    example()
To select a single element with bs4, use
some_text = soup.select_one('locator').get_text()
I found one tool for scraping Twitter, that has quite a lot of stars on Github https://github.com/twintproject/twint I did not try it myself and hope it is legal.

What you're missing is the bearer and guest token needed to make your request. If I just hit your endpoint with curl and no headers I get no response. However, if I add headers for the bearer token and guest token then I get that json you're looking for:
curl 'https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D' -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'x-guest-token: 1452696114205847552'
You can get the bearer token (which may not expire that often) and the guest token (which does expire, I think) like this:
The HTML of the Twitter page you go to links a file called main.some random numbers.js. Within that JavaScript file is the bearer token. You can recognize it because it is a long string starting with lots of A's.
Take the bearer token and call https://api.twitter.com/1.1/guest/activate.json using the bearer token as an authorization header:
curl 'https://api.twitter.com/1.1/guest/activate.json' -X POST -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
In python this looks like:
import requests
import json
url = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D"
headers = {"authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA", "x-guest-token": "1452696114205847552"}
resp = requests.get(url, headers=headers)
j = json.loads(resp.text)
And now that variable, j, holds your beautiful JSON. One warning: sometimes the response can be so big that it doesn't seem to fit into a single response. If this happens, you'll notice that resp.text isn't valid JSON, but just some portion of a big blob of JSON. To fix this, you'll need to adapt the request to use "stream=True" and stream out the whole response before you try to parse it as JSON.
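The variables query parameter in the URLs above is just URL-encoded JSON, so it can be generated instead of pasted by hand. Here is a small sketch using only the standard library; the helper names are invented for illustration, the endpoint path is copied from the question, and only a subset of the original keys is shown.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the question; the ID segment may change over time.
BASE = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets"


def build_user_tweets_url(variables):
    """Serialize the variables dict to compact JSON and URL-encode it."""
    encoded = urlencode({"variables": json.dumps(variables, separators=(",", ":"))})
    return BASE + "?" + encoded


def read_variables(url):
    """Decode the variables parameter of such a URL back into a dict."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["variables"][0])


variables = {"userId": "15687962", "count": 20, "withVoice": False}
url = build_user_tweets_url(variables)
print(url)
print(read_variables(url))
```

This makes it easier to change userId, count, or cursor between requests without editing the percent-encoded string directly.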

Is there a way to inspect full post request header and body before sending?

I'm sending a POST request, with python-requests in Python 3.5, using:
r = requests.post(apiEndpoint, data=jsonPayload, headers=headersToMergeIn)
I can inspect the headers and body after sending the request like this:
print(r.request.headers)
print(r.request.body)
Is there any way to inspect the full request headers (not just the ones I'm merging in) and the body before sending the request?
Note: I need this for an api which requires me to build a hash off of a subset of the headers, and another hash off the full body.

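You probably want Prepared Requests. Build a requests.Request, call prepare_request() on your Session to get a PreparedRequest, inspect its headers and body, and then pass it to Session.send().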
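A minimal sketch of that flow is below. The endpoint URL, the payload, and the choice of SHA-256 are placeholders for illustration, since the question does not say which API or hash is involved. Nothing is sent until session.send() is called.

```python
import hashlib

import requests

# Placeholder values; substitute your own endpoint, payload and headers.
api_endpoint = "https://api.example.com/items"
json_payload = '{"name": "demo"}'
headers_to_merge_in = {"Content-Type": "application/json"}

session = requests.Session()
request = requests.Request(
    "POST", api_endpoint, data=json_payload, headers=headers_to_merge_in
)

# prepare_request() merges the session defaults (User-Agent, Accept, ...)
# with the headers passed in, without touching the network.
prepared = session.prepare_request(request)

print(prepared.headers)
print(prepared.body)

# Hash the exact body that will be sent (SHA-256 is only an example).
body = prepared.body
body_bytes = body.encode("utf-8") if isinstance(body, str) else body
body_hash = hashlib.sha256(body_bytes).hexdigest()

# Add the computed value as a header (hypothetical header name), then send:
# prepared.headers["X-Body-Hash"] = body_hash
# response = session.send(prepared)
```

A few headers, such as Host, are added by the lower HTTP layers when the request is actually sent, so they will not appear in prepared.headers.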

Establishing the name associated with a HTML link

I wish to download the files associated with a set of links in an HTML document.
A link might appear like this:
<a href="d?kjdfer87">
But when I click on it in my browser, I get the following file downloaded:
file2.txt
The following will download the file via python:
opener = urllib.request.build_opener()
r = opener.open("unknown.txt")
r.read()
but how do I establish that the file was actually called file2.txt?

Check the Content-Disposition header on the response. It can suggest a filename. With urllib.request in Python 3 this is available as r.headers['Content-Disposition'] (r.info() returns the same headers object).

It's actually this simple:
r.info().get_filename()
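To see what get_filename() does without making a request, here is a small offline sketch. The object returned by r.info() is an http.client.HTTPMessage, which is a subclass of email.message.Message, so a plain Message stands in for it here. The helper name and the fallback filename are invented for illustration.

```python
from email.message import Message


def filename_from_headers(headers, fallback):
    """Return the filename suggested by Content-Disposition, or the fallback."""
    name = headers.get_filename()
    return name if name else fallback


# Stand-in for r.info(): a header set like the one a server might send.
headers = Message()
headers["Content-Disposition"] = 'attachment; filename="file2.txt"'
print(filename_from_headers(headers, "download.bin"))

# No Content-Disposition at all: fall back to a name of your own choosing.
print(filename_from_headers(Message(), "download.bin"))
```

The server-supplied name is untrusted input, so it is worth stripping any directory components before using it as a local path.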

I'm not sure why you think you need the name. You should request it in exactly the same way as the browser does, i.e. with the value in the href.

The Content-Disposition header in the HTTP response is what specifies that the response should be downloaded with a specific filename.
See:
How to encode the filename parameter of Content-Disposition header in HTTP?
