Looking at the requests documentation, I know that I can use response.content for binary content (such as a .jpg file) and response.text for a regular html page. However, when the source is an image, and I try to access r.text, the script hangs. How can I determine in advance if the response contains html?
I have considered checking the url for an image extension, but that does not seem fool-proof.
The content type is sent as a response header. See this page in the documentation.
Example code:
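import requests

r = requests.get(url)
# The header may carry parameters, e.g. 'text/html; charset=utf-8'.
content_type = r.headers.get('content-type', '')
if content_type.startswith('text/html'):
    data = r.text
elif content_type.startswith('application/ogg'):
    data = r.content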
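Servers often append parameters such as a charset to the Content-Type value, so an exact string comparison can miss HTML responses. Below is a minimal sketch of a more tolerant check. The helper names media_type and is_html are hypothetical (they are not part of requests), and the sample headers are invented for illustration.

```python
def media_type(content_type_header):
    """Return the bare media type, e.g. 'text/html', without parameters."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_html(headers):
    """Guess whether a response is HTML from its headers (a mapping)."""
    # Assumption: these two media types are the ones treated as HTML here.
    return media_type(headers.get("Content-Type", "")) in {
        "text/html",
        "application/xhtml+xml",
    }


# Invented sample headers, standing in for r.headers.
html_headers = {"Content-Type": "text/html; charset=UTF-8"}
image_headers = {"Content-Type": "image/jpeg"}

print(is_html(html_headers))
print(is_html(image_headers))
```

With requests, r.headers is a case-insensitive mapping, and passing stream=True to requests.get delays downloading the body until r.content or r.text is accessed, so the header can be inspected first.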
I suspect this has happened due to my misunderstanding of how either lxml or html works and I'd appreciate if someone could fill in this blank in my knowledge.
My code is:
url = "https://prnt.sc/ca0000"
response = requests.get(url,headers={'User-Agent': 'Chrome'})
# Navigate to the correct img src.
tree = html.fromstring(response.content)
xpath = '/html/body/div[3]/div/div/img/@src'
imageURL = tree.xpath(xpath)[0]
print(imageURL)
I expect when I do this to get a result such as:
data:image/png;base64,iVBORw0KGgoAAA...((THIS IS REALLY LONG))...Jggg==
Which if I understand correctly is where the image is stored locally on my computer.
However when I run the code I get:
"https://prnt.sc/ca0000"
Why are these different?
The problem is that this page uses JavaScript to replace https://prnt.sc/ca0000 with data:image/png;base64 ..., but requests does not execute JavaScript.
However, the page has two img elements with different src values: the first has a regular image URL (https://...) and the other has the placeholder https://prnt.sc/ca0000.
So this XPath works for me even without JavaScript:
xpath = '//img[@id="screenshot-image"]/@src'
This code gets the correct URL and downloads the image.
import requests
from lxml import html
url = "https://prnt.sc/ca0000"
response = requests.get(url, headers={'User-Agent': 'Chrome'})
tree = html.fromstring(response.content)
image_url = tree.xpath('//img[@id="screenshot-image"]/@src')[0]
print(image_url)
# -- download ---
response = requests.get(image_url, headers={'User-Agent': 'Chrome'})
with open('image.png', 'wb') as fh:
    fh.write(response.content)
Result
https://image.prntscr.com/image/797501c08d0a46ae93ff3a477b4f771c.png
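Selecting an element by its id attribute, rather than by its position in the tree, can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml, so it needs no extra install. ElementTree supports only a subset of XPath, and the markup and example.com URLs are invented stand-ins, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup (not the real page).
SAMPLE = """
<html><body>
  <div><img src="https://example.com/placeholder.png"/></div>
  <div><img id="screenshot-image" src="https://example.com/real-image.png"/></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Attribute predicate: match the img by id, wherever it sits in the tree.
img = root.find(".//img[@id='screenshot-image']")
image_url = img.get("src")
print(image_url)
```

A positional path such as /html/body/div[3]/div/div/img breaks as soon as the page layout changes, while the id-based lookup keeps working as long as the id stays the same.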
I am working on a scraping project using requests.get. The html file contains a relative path of the form href="../_static/new_page.html" for the css file.
I am using the below code to get the html file
import requests
url = "http://www.example.com"  # requests needs the scheme (http:// or https://)
req = requests.get(url)
req.content
All the href values containing "../_static" become "_static/...". I tried req.text and changed the encoding to utf-8, which is the encoding of the page, but I always get the same result. I also tried urllib.request.get and had the same problem.
Any suggestions?
Adam.
Yes, this is related to the encoding used when you write the response content to an HTML file,
so you have to consider the encoding of the response content itself.
First check which encoding requests has assigned to the response.
response = requests.get("url")
print(response.encoding)
Then set the right encoding yourself, for example:
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
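To see why the chosen encoding matters, here is a small offline sketch that decodes the same bytes in several ways. The bytes are a made-up stand-in for response.content; setting response.encoding controls which codec response.text uses for this step.

```python
# Hypothetical stand-in for the raw bytes of a response body (response.content).
raw = "caf\u00e9".encode("utf-8")

as_utf8 = raw.decode("utf-8")         # correct codec
as_latin1 = raw.decode("ISO-8859-1")  # wrong codec: each byte becomes one character

# The same bytes preceded by a UTF-8 byte order mark (BOM).
with_bom = b"\xef\xbb\xbf" + raw
plain = with_bom.decode("utf-8")         # keeps the BOM as U+FEFF
stripped = with_bom.decode("utf-8-sig")  # drops the BOM

print(repr(as_utf8), repr(as_latin1), repr(plain), repr(stripped))
```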
Hope my answer helps you.
Regards
I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code but the output that comes back is not properly decoded it seems:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
r.text
The output looks like below:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get the html. I also tried with beautifulsoup but it does not decode it either.. I hope someone can help. Thank you, BR
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the binary data.
PDF files are not easy to deal with programmatically; but you might (for example) save it to a file:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)
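The difference between binary and text handling can be checked without any network access. In this sketch, payload is a hypothetical stand-in for r.content, and looks_like_pdf is an illustrative helper, not a requests function. It shows that writing bytes in 'wb' mode preserves them exactly, while decoding them as text loses information.

```python
import os
import tempfile

# Hypothetical stand-in for r.content: a PDF header followed by non-text bytes.
payload = b"%PDF-1.3\n%\xc7\xec\x8f\xa2\n"


def looks_like_pdf(data):
    """PDF files start with the marker %PDF- ."""
    return data.startswith(b"%PDF-")


# Binary mode writes the bytes unchanged, so reading them back gives the same data.
path = os.path.join(tempfile.mkdtemp(), "sample.pdf")
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    round_trip = f.read()

# Decoding as text replaces undecodable bytes, so the original cannot be rebuilt.
as_text = payload.decode("utf-8", errors="replace")
lossy = as_text.encode("utf-8")

print(looks_like_pdf(payload), round_trip == payload, lossy == payload)
```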
This is my code thus far.
import json
from urllib.request import urlopen

url = 'https://www.endomondo.com/rest/v1/users/3014732/workouts/357031682'
response = urlopen(url)
print(response)
data = json.load(response)
print(data)
The problem is that when I look at the json in the browser it is long and contains more features than I see when printing it.
To be more exact, I'm looking for the 'points' part which should be
data['points']['points']
however
data['points']
has only 2 attributes and doesn't contain the second 'points' that I do see in the url in the browser.
Could it be that I can only load 1 "layer" deep and not 2?
You need to add a user-agent to your request.
Using requests (which the urllib documentation recommends as a higher-level alternative to using urllib directly), you can do:
import requests
url = 'https://www.endomondo.com/rest/v1/users/3014732/workouts/357031682'
response = requests.get(url, headers={'user-agent': 'Mozilla 5.0'})
print(response.json())
# long output....
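On the "1 layer deep" question: the json module parses nested objects to any depth, so a missing inner key means the server sent a different document, not that parsing stopped early. The sketch below shows this with an invented JSON string. It also builds a standard-library Request with a User-Agent header, without sending it, just to show where the header goes. The example.com URL is a placeholder.

```python
import json
from urllib.request import Request

# Invented JSON with two nested "points" levels (not the real API response).
raw = '{"points": {"id": 1, "points": [{"lat": 1.0}, {"lat": 2.0}]}}'
data = json.loads(raw)
inner = data["points"]["points"]
print(len(inner))

# Build (but do not send) a request that carries a User-Agent header.
req = Request("https://www.example.com/", headers={"User-Agent": "Mozilla/5.0"})
# urllib stores header names capitalized as "User-agent".
ua = req.get_header("User-agent")
print(ua)
```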
I am trying to use the requests library in Python to post the text content of a text file to a website, submit the text for analysis on that website, and pull the results back into Python. I have read through a number of responses here and on other websites, but have not yet figured out how to adapt the code to a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's submitting the data that I don't understand.
My code currently is:
import requests

fileName = "texttoAnalyze.txt"
# Read the whole file; the with block closes it afterwards.
with open(fileName) as fileHandle:
    url_text = fileHandle.read()
url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value': url_text}
r = requests.post(url, data=payload)
print(r.text)
This code comes back with the HTML of the website, but the site hasn't recognized that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request that the website itself sends. You can usually find it with web debugging tools (such as the Chrome or Firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput': url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print(r.text)
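To check that a payload dict matches the params string seen in the developer tools, it can be form-encoded offline with the standard library. This is only a sketch of the encoding step: requests does its own encoding when given data=payload, and SOME_RANDOM_TEXT is the placeholder from the params above.

```python
from urllib.parse import parse_qs, urlencode

# Same two fields as the form; dict order determines the order in the string.
payload = {"tab": "Test by Direct Link", "directInput": "SOME_RANDOM_TEXT"}

# Form encoding turns spaces into '+' and joins fields with '&'.
encoded = urlencode(payload)
print(encoded)

# Decoding the string recovers the original field values (as lists).
decoded = parse_qs(encoded)
print(decoded["tab"][0])
```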
Good luck!
There are two post parameters, tab and directInput:
import requests
post = "http://www.webpagefx.com/tools/read-able/check.php"
with open("in.txt") as f:
    data = {"tab": "Test by Direct Link",
            "directInput": f.read()}
r = requests.post(post, data=data)
print(r.content)