Take a screenshot of a website from the command line or with Python - python

I want to take a screenshot of this page: http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500 (or save the image that it outputs), but I can't find a way. With wget/curl I get an "unavailable" error, and the same happens with other tools such as webkit2png/wkhtmltoimage/wkhtmltopng.
Is there a clean way to do it with Python or from the command line?
Best regards!

You can use ghost.py if you like.
https://github.com/jeanphix/Ghost.py
Here is an example of how to use it.
from ghost import Ghost

# Start a headless WebKit session; wait_timeout bounds page-load waits (in seconds)
ghost = Ghost(wait_timeout=4)
ghost.open('http://www.google.com')
ghost.capture_to('screen_shot.png')
The last line saves the image in your current directory.
Hope this helps
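Note that Ghost.py drives an embedded WebKit view, so it needs PyQt or PySide installed; on a display-less server you may additionally need a virtual framebuffer such as Xvfb.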

I had difficulty getting Ghost to take a screenshot consistently on a headless CentOS VM. Selenium and PhantomJS worked for me:
from selenium import webdriver

# PhantomJS is a headless WebKit browser, so no display server is required
br = webdriver.PhantomJS()
br.get('http://www.stackoverflow.com')
br.save_screenshot('screenshot.png')
br.quit()  # note the parentheses: br.quit alone does nothing
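If the capture comes out at an odd size, you can set an explicit viewport before taking the shot (set_window_size is standard WebDriver API); with PhantomJS this also controls the rendered page width:
br.set_window_size(1366, 768)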

Sometimes you need extra HTTP headers, such as User-Agent, to get downloads to work. In Python 2.7, you can:
import urllib2

request = urllib2.Request(
    r'http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 firefox/2.0.0.11'})
page = urllib2.urlopen(request)
with open('somefile.png', 'wb') as f:
    f.write(page.read())
Or you can look at the options for adding HTTP headers in wget (--user-agent=...) or curl (-A/--user-agent).
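For Python 3, where urllib2 was merged into urllib.request, an equivalent sketch:
import urllib.request

request = urllib.request.Request(
    'http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/60.0'})
# urlopen returns a file-like response; stream it straight into a local file
with urllib.request.urlopen(request) as page, open('somefile.png', 'wb') as f:
    f.write(page.read())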

Related

Link opens in a download pop-up window but doesn't work when copied and pasted

I am trying to scrape and download files from the search results displayed after clicking the search button on https://elibrary.ferc.gov/eLibrary/search. When the search results are displayed, the links look like https://elibrary.ferc.gov/eLibrary/filedownload?fileid=823F0FAB-E23A-CD5B-9FDD-7B3B7A700000. Clicking such a link on the results page forces a download (content-disposition: attachment). I am saving the search results as an HTML page and then scraping the links, but when I try to fetch the file behind a link and store it locally, my code doesn't work.
#!/usr/bin/env python3
import requests

session = requests.Session()
session.headers.update(
    {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"}
)
# Load the search page first so the session picks up any cookies it sets
r1 = session.get("https://elibrary.ferc.gov/eLibrary/search", verify=False)
dl_url = "https://elibrary.ferc.gov/eLibrary/filedownload?fileid=020E6084-66E2-5005-8110-C31FAFC91712"
req = session.get(dl_url, verify=False)
# Note: this writes to a local file literally named "dl_url"
with open("dl_url", "wb") as outfile:
    outfile.write(req.content)
I am not able to download the file contents at all (PDF, DOCX, etc.). The code above is just to try to solve the local download issue. Thanks in advance for any help.
Solved by using a JSON POST request; the original URL won't work.
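The poster didn't include the final code, but here is a minimal sketch of the idea, assuming the site's search API accepts a JSON payload (the endpoint path and field names below are placeholders; copy the real ones from the request your browser sends when you click Search):
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Hypothetical endpoint and payload, for illustration only
resp = session.post(
    "https://elibrary.ferc.gov/eLibrary/search/someApiEndpoint",
    json={"searchText": "", "resultsPerPage": 20},
)
resp.raise_for_status()
results = resp.json()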

downloading data using a blob link

Using data from this website: https://ourworldindata.org/grapher/total-daily-covid-deaths?tab=map
I am trying to interact with the link 'total-daily-covid-deaths.csv', which has the href 'blob:https://ourworldindata.org/b1c6f69e-4df4-4458-8aa0-35173733b364'. Clicking the link takes me to a page with a lot of data, and I simply want a Python script that takes that data and puts it into a CSV file for me to use. While researching this I found an overwhelming amount of information and got confused quickly. I have experience web scraping with Beautiful Soup and requests, but I haven't been able to get this working, since the blob link isn't an actual website. I was hoping someone could shed some light and point me in the right direction.
This is the code I'm using:
import urllib.request as request
url = 'https://ourworldindata.org/grapher/total-daily-covid-deaths?tab=map'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)
# print or write
print(f.read())
Blob URLs are already explained at the link below:
Convert blob URL to normal URL
Can you share your code snippet, so we can get a better idea of your problem?
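One workaround worth noting here: a blob: URL exists only inside the browser tab that created it, so requests can never fetch it directly. Our World in Data's grapher pages, however, generally serve the same data when you add a .csv suffix to the grapher URL; a sketch, under the assumption that this works for this particular chart (verify the URL in a browser first):
import requests

url = 'https://ourworldindata.org/grapher/total-daily-covid-deaths.csv'  # assumed endpoint
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()
with open('total-daily-covid-deaths.csv', 'wb') as f:
    f.write(resp.content)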

using Selenium, Firefox, Python to save download of EPS files to disk after automated clicking of download link

Tools: Ubuntu, Python, Selenium, Firefox
I am trying to automate the downloading of image files from a subscription web site. I do not have access to the server other than through my paid subscription. To avoid having to click a button for each file download, I decided to automate it using Python, Selenium, and Firefox. (I have been using these three together for the first time for two days now. I also know very little about cookies.)
I am interested in downloading the following three formats, in order of preference: ['EPS', 'PNG', 'JPG']. A button for each format is available on the web site.
I have managed to automate the downloading of the 'PNG' and 'JPG' files to disk by setting the Firefox preferences by hand, as suggested in this post: python webcrawler downloading files
However, when the file is in an 'EPS' format, the "You have chosen to save" dialog box still pops open in the Firefox window.
As you can see from my code, I have set the preferences to save 'EPS' files to disk. (Again, 'JPG' and 'PNG' files are saved as expected.)
from selenium import webdriver

profile = webdriver.firefox.firefox_profile.FirefoxProfile()
profile.set_preference("browser.download.folderList", 1)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                       'image/jpeg,image/png,application/postscript,'
                       'application/eps,application/x-eps,image/x-eps,'
                       'image/eps')
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference("plugin.disable_full_page_plugin_for_types",
                       "application/eps,application/x-eps,image/x-eps,"
                       "image/eps")
profile.set_preference(
    "general.useragent.override",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0)"
    " Gecko/20100101 Firefox/26.0")
driver = webdriver.Firefox(firefox_profile=profile)

# I then log in and begin automated clicking to download files. 'JPG' and 'PNG'
# files are saved to disk as expected. The 'EPS' files present a save dialog
# box in Firefox.
I tried installing a Firefox extension called "download-statusbar" that claims to suppress any save dialog box. The extension loads in the Selenium Firefox browser, but it doesn't function. (A lot of reviews say the extension is broken, despite the developers' insistence that it works.) It isn't working for me either, so I gave up on it.
I added this to the Firefox profile in that attempt:
# The extension loads, but it doesn't function.
# Note: '$USER' is not expanded by Python; substitute the real path.
download_statusbar = ('/home/$USER/Downloads/'
                      'download_statusbar_fixed-1.2.00-fx.xpi')
profile.add_extension(download_statusbar)
From reading other stackoverflow.com posts, I decided to see whether I could download the file via its URL with urllib2. As I understand it, I would need to add cookies to the headers in order to authenticate the download of the 'EPS' file via a URL.
I am unfamiliar with this technique, but here is the code I tried. It failed with a '403 Forbidden' response despite my attempts to set cookies in the urllib2 opener.
import urllib2
import cookielib
import logging
import sys

cookie_jar = cookielib.LWPCookieJar()
handlers = [
    urllib2.HTTPHandler(),
    urllib2.HTTPSHandler(),
]
[h.set_http_debuglevel(1) for h in handlers]
handlers.append(urllib2.HTTPCookieProcessor(cookie_jar))

# using selenium driver cookies, returns a list of dictionaries
cookies = driver.get_cookies()

opener = urllib2.build_opener(*handlers)
opener.addheaders = [(
    'User-agent',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) '
    'Gecko/20100101 Firefox/26.0'
)]

logger = logging.getLogger("cookielib")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

for item in cookies:
    opener.addheaders.append(('Cookie', '{}={}'.format(
        item['name'], item['value']
    )))
    logger.info('{}={}'.format(item['name'], item['value']))

response = opener.open('http://path/to/file.eps')
# Fails with a 403 Forbidden response
Any thoughts or suggestions? Am I missing something easy or do I need to give up hope on an automated download of the EPS files? Thanks in advance.
Thank you to @unutbu for helping me solve this. I just didn't understand the anatomy of a file download; I understand it a little better now.
I ended up installing a Firefox extension called "Live HTTP Headers" to examine the headers sent by the server. As it turned out, the 'EPS' files were being sent with a 'Content-Type' of 'application/octet-stream'.
Now the EPS files are saved to disk as expected. I modified the Firefox preferences to the following:
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                       'image/jpeg,image/png,'
                       'application/octet-stream')
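If you also want the files to land in a specific directory instead of the default Downloads folder, Firefox exposes that through two more preferences; a small sketch (the path is a hypothetical example):
profile.set_preference("browser.download.folderList", 2)  # 2 = use the custom directory below
profile.set_preference("browser.download.dir", "/home/me/eps_downloads")  # hypothetical path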

Unable to pull HTML from website

I'm pulling HTML from web sites by sending headers that make the site think I'm just a user surfing the site, like so:
def page(goo):
    import fileinput
    import sys, heapq, array, urllib
    import BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    import re
    from urllib import FancyURLopener

    class MyOpener(FancyURLopener):
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    myopener = MyOpener()
    filehandle = myopener.open(goo)
    return filehandle.read()

page = page(WebSite)
This works perfectly with most websites, even Google and Wikipedia, but not with Tmart.com. Somehow Tmart can tell it's not a web browser and returns an error. How can I fix this?
They might be detecting that you don't have a JavaScript interpreter? It's hard to tell without seeing the error message you're receiving. There is one method that is guaranteed to work, though: directly driving a browser with Selenium WebDriver.
Selenium is normally used for functional testing of web sites, but it works very well for scraping sites that use JavaScript, too.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.someurl.com')
html = browser.page_source
See all the methods available on browser here: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webdriver.py
For this to work you will also need to have the chromedriver executable available: http://code.google.com/p/chromedriver/downloads/list
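If the page builds its content with JavaScript, the HTML may not be complete the moment get() returns. A small sketch of waiting for a specific element first (the element ID here is a made-up example):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://www.someurl.com')
# Block for up to 10 seconds until the element appears; 'content' is hypothetical
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'content')))
html = browser.page_source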

Python - The request headers for mechanize

I am looking for a way to view the request (not response) headers, specifically what browser mechanize claims to be. Also, how would I go about manipulating them, e.g. reporting a different browser?
Example:
import mechanize
browser = mechanize.Browser()
# Now I want to make a request to eg example.com with custom headers using browser
The purpose is of course to test a website and see whether or not it shows different pages depending on the reported browser.
It has to be the mechanize browser, as the rest of the code depends on it (but that code is left out, as it's irrelevant here).
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
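In context, a minimal sketch (the URL is just an example):
import mechanize

browser = mechanize.Browser()
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
response = browser.open('http://example.com')
print(response.read())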
You've got an answer on how to change the headers, but if you want to see the exact headers that are being used, try a proxy that displays the traffic, e.g. Fiddler2 on Windows, or see this question for some Linux alternatives.
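Alternatively, mechanize can log the raw request itself through the standard logging module; a sketch using its built-in debug switch:
import sys
import logging
import mechanize

# Route mechanize's HTTP debug output to stdout
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

browser = mechanize.Browser()
browser.set_debug_http(True)  # log outgoing request and incoming response headers
browser.open('http://example.com')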
You can modify the Referer too:
br.addheaders = [('Referer', 'http://google.com')]
