I have created a page on my site, http://shedez.com/test.html, which redirects users to a JPG on my server.
I want to copy this image to my local drive using a Python script. The script should go to the main URL first, follow it to the destination URL of the picture,
and then copy the image. For now the destination URL is hardcoded, but in the future it will be dynamic: I will use geocoding to find the user's city from their IP and then redirect them to the picture of the day for that city.
== my present script ===
import urllib2, os
req = urllib2.urlopen("http://shedez.com/test.html")
final_link = req.info()
print req.info()
def get_image(remote, local):
    imgData = urllib2.urlopen(final_link).read()
    output = open(local,'wb')
    output.write(imgData)
    output.close()
    return local
fn = os.path.join(self.tmp, 'bells.jpg')
firstimg = get_image(final_link, fn)
It doesn't seem to be header redirection. This is the body of the url -
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<html>\n<head>\n<title>Your Page Title</title>\n<meta http-equiv="REFRESH" content="0;url=http://2.bp.blogspot.com/-hF8PH92aYT0/TnBxwuDdcwI/AAAAAAAAHMo/71umGutZhBY/s1600/Professional%2BBusiness%2BCard%2BDesign%2B1.jpg"></HEAD>\n<BODY>\nOptional page text here.\n</BODY>\n</HTML>
You can easily fetch the content with urllib or requests and parse the HTML with BeautifulSoup or lxml to get the image url from the meta tag.
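A minimal sketch of that approach, assuming Python 3 and using only the standard library's html.parser in place of BeautifulSoup or lxml. The names MetaRefreshParser, find_meta_refresh and fetch_redirected_image are made up for this illustration.

```python
import re
import shutil
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remembers the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0;url=http://host/picture.jpg".
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                          attributes.get("content") or "", re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(page_url, html):
    """Return the absolute refresh target found in html, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return urljoin(page_url, parser.target) if parser.target else None


def fetch_redirected_image(page_url, local_path):
    """Fetch page_url, follow its meta refresh, and save the target locally."""
    with urllib.request.urlopen(page_url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    image_url = find_meta_refresh(page_url, html)
    if image_url is None:
        raise ValueError("no meta refresh tag found in " + page_url)
    with urllib.request.urlopen(image_url, timeout=30) as response, \
            open(local_path, "wb") as output:
        shutil.copyfileobj(response, output)
    return local_path
```

With this in place, something like fetch_redirected_image("http://shedez.com/test.html", "bells.jpg") would keep working when the page later points at a different picture, since the target is read from the page on every call.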
You seem to be using an HTML meta http-equiv refresh rather than an HTTP redirect. To have Python handle the redirect transparently, send an HTTP 302 response from the server instead. Otherwise, you'll have to parse the HTML and follow the redirect manually, or use something like mechanize.
As the answers mention: either redirect to the image itself, or parse out the url from the html.
Concerning the former (redirecting): if you're using nginx or HAProxy on the server side, you can set the X-Accel-Redirect header to the image's URI, and it will be served appropriately. See http://wiki.nginx.org/X-accel for more info.
By default, urllib2's urlopen function follows redirects signalled by 3XX HTTP status codes. In your case, however, the redirect is done with an HTML meta tag, so you will have to use what Bibhas is proposing.
Related
I am trying to perform a GET request on TCG Player via Requests in Python. I checked the site's robots.txt, which specifies:
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
This is my first time seeing a robots.txt file.
My code is as follows:
import requests
url = "http://www.tcgplayer.com"
r = requests.get(url)
print(r.text)
I cannot include r.text in my post because the character limit would be exceeded.
I expected to receive the HTML content of the webpage, but I got an 'unfriendly' response instead. What is the meaning of the text above? Is there a way to get the HTML so I can scrape the site?
By 'unfriendly' I mean:
The HTML that is returned does not match the HTML that is produced by typing the URL into my web browser.
This is probably due to client-side rendering of the web content, as indicated by the empty <div id="app"></div> block in the scraped result: the browser fills that element in by running JavaScript, which Requests does not do. To properly handle such content, you will need to use a more advanced web scraping tool, like Selenium. I'd recommend this tutorial to get started.
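As for the robots.txt quoted in the question, the standard library can interpret it. The sketch below, assuming Python 3.6 or later, feeds those same lines to urllib.robotparser without any network access; the variable names are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt lines quoted in the question.
robots_lines = [
    "User-agent: *",
    "Crawl-Delay: 10",
    "Allow: /",
    "Sitemap: https://www.tcgplayer.com/sitemap/index.xml",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "Allow: /" under "User-agent: *" permits every crawler to fetch any path.
allowed = parser.can_fetch("*", "https://www.tcgplayer.com/")

# "Crawl-Delay: 10" asks crawlers to wait 10 seconds between requests.
delay = parser.crawl_delay("*")

print(allowed, delay)
```

In other words, the file allows scraping but asks for a pause between requests, so a polite scraper would sleep for that many seconds after each fetch.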
Given the following code:
import requests
url = "https://signal.bz/"
response = requests.get(url)
print(response.text)
The output is JS code.
But what I want to get is the HTML code I see when I open my browser's Developer Tools at https://signal.bz/.
For other sites I get the HTML code fine, so why do I only get JS code for this site?
How can I get HTML code for this site?
What you get is HTML. Look at the start of the file: <!DOCTYPE html><html lang=en>
The JavaScript is a script that is part of the HTML document and is supposed to create the content when the document is rendered by the web browser.
To see the content, you can install and use the requests-html library instead of requests.
To render the page, fetch it through an HTMLSession, call response.html.render(), and then read the rendered markup from response.html.html (response.text still holds the unrendered source).
I'm interested in using Python to retrieve a file that exists at an HTTPS url.
I have credentials for the site, and when I access it in my browser I'm able to download the file. How do I use those credentials in my Python script to do the same thing?
So far I have:
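import urllib.request
response = urllib.request.urlopen('https:// (some url with an XML file)')
html = response.read()
# read() returns bytes, which have no write() method; write them to a file instead
with open('C:/Users/mhurley/Portable_Python/notebooks/XMLOut.xml', 'wb') as f:
    f.write(html)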
This code works for non-secured pages, but (understandably) returns 401:Unauthorized for the https address. I don't understand how urllib handles credentials, and the docs aren't as helpful as I'd like.
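One possibility, assuming the site uses HTTP Basic authentication (a site that logs you in through an HTML form would need a different approach), is to give urllib a password manager. This is only a sketch; build_opener_with_basic_auth and download are names made up for this illustration, and the URL and credentials in the usage note are placeholders.

```python
import urllib.request


def build_opener_with_basic_auth(url, username, password):
    """Return an opener that answers Basic auth challenges for url."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL".
    password_manager.add_password(None, url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_manager)
    return urllib.request.build_opener(handler)


def download(opener, url, local_path):
    """Fetch url through opener and write the response body to local_path."""
    with opener.open(url, timeout=30) as response, \
            open(local_path, "wb") as output:
        output.write(response.read())
    return local_path
```

Usage would look like opener = build_opener_with_basic_auth(url, "my_user", "my_password") followed by download(opener, url, "XMLOut.xml"). If the server still answers 401, it is probably not using Basic authentication.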
I want to make requests with the Python requests module. I have a large database of URLs I wish to download. The URLs in the database are of the form page.be/something/something.html
I get a lot of ConnectionErrors. If I open the URL in my browser, the page exists.
My Code:
if not webpage.url.startswith('http://www.'):
    new_html = requests.get(webpage.url, verify=True, timeout=10).text
An example of a page I'm trying to download is carlier.be/categorie/jobs.html. This gives me a ConnectionError, logged as below:
Connection error, Webpage not available for
"carlier.be/categorie/jobs.html" with webpage_id "229998"
What seems to be the problem here? Why can't requests make the connection, while I can find the page in the browser?
The Requests library requires that you supply a scheme for it to connect with (the 'http://' part of the URL, which Requests calls a "schema"). Make sure that every URL has http:// or https:// in front of it. You may want a try/except block where you catch requests.exceptions.MissingSchema and try again with "http://" prepended to the URL.
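A variation on that idea is to normalise the URLs before handing them to Requests, instead of retrying after an exception. The sketch below uses only the standard library; ensure_scheme is a name made up for this illustration, and defaulting to http is an assumption about the sites in the database.

```python
from urllib.parse import urlsplit


def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it already starts with http(s)://, else prepend a scheme."""
    url = url.strip()
    if urlsplit(url).scheme in ("http", "https"):
        return url
    return default_scheme + "://" + url


print(ensure_scheme("carlier.be/categorie/jobs.html"))
```

The result can then be passed on, for example requests.get(ensure_scheme(webpage.url), timeout=10).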
I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME':'myusername',
           'PASSWORD':'mypassword'
          }
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But the print page.text is showing that the site is trying to forward me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I am not sure how to log in to a JavaScript page. Thanks for the help!
Firstly you're posting to the wrong page. If you view the HTML from your link you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
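To find out where a login form really submits, you can read the action attribute and resolve it against the page URL. The following is a minimal Python 3 sketch using only the standard library; FormActionParser and form_targets are names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Collects (method, action) pairs for every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attributes = dict(attrs)
            method = (attributes.get("method") or "get").lower()
            action = attributes.get("action") or ""
            self.forms.append((method, action))


def form_targets(page_url, html):
    """Return (method, absolute_url) for each form found in html."""
    parser = FormActionParser()
    parser.feed(html)
    return [(method, urljoin(page_url, action)) for method, action in parser.forms]


login_page = "http://www.whispernumber.com/signIn.jsp?source=calendar.jsp"
snippet = '<form action="ValidatePassword.jsp" method="post">'
print(form_targets(login_page, snippet))
```

The credentials would then be posted to that resolved URL. Doing the POST and the later GET through a single requests.Session() object is one way to have cookies carried over between the calls.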
Requests isn't a web browser; it is an HTTP client that simply grabs the raw text of the page. You will want to use something like Selenium or another headless browser to programmatically log in to the site.