First post here, any help would be greatly appreciated :).
I'm trying to scrape from a website with an embedded pdf viewer. As far as I can tell, there is no way to directly download the PDF file.
The browser displays the PDF as multiple PNG images; the problem is that the PNG files aren't directly accessible either. They are rendered from the original PDF on the server and then displayed.
The relevant URLs are included in the code block below: the link to a rendered PNG, the original URL to the PDF viewer, and the viewer URL with the heading stripped out (the one I actually request).
My strategy here is to pull the __VIEWSTATE and __EVENTVALIDATION fields using urllib2, then use wget to download all files from the page. This method does work without POST data (it fetches page 1). I got the rest of the POST parameters from Fiddler (an HTTP sniffing tool).
But when I use POST data to specify the page, I get 405 errors when wget tries to download the image files. It downloads the actual HTML page without a problem, just none of the PNG files that go along with it. Here is an example of the wget errors:
HTTP request sent, awaiting response... 405 Method Not Allowed
2014-03-27 17:09:38 ERROR 405: Method Not Allowed.
Since I can't access the image file link directly, I thought grabbing the entire page with wget would be my best bet. If anyone knows some better alternatives, please let me know :). The POST data seems to work, at least partially, since the downloaded HTML file shows the page I specified in the parameters.
According to Fiddler, the browser automatically does a GET request for each image file. I'm not quite sure how to emulate this, however.
Any help is appreciated, thanks for your time!
import os
import re
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

imglink = 'http://201.150.36.178/consultaexpedientes/render/2132495e-863c-4b96-8135-ea7357ff41511.png'
origurl = 'http://201.150.36.178/consultaexpedientes/sistemas/boletines/wfBoletinVisor.aspx?tomo=1&numero=9760&fecha=14/03/2014%2012:40:00'
url = 'http://201.150.36.178/consultaexpedientes/usercontrol/Default.aspx?name=e%3a%5cBoletinesPdf%5c2014%5c3%5cBol_9760.pdf%7c0'

# Fetch the viewer page and pull the hidden ASP.NET form fields out of it
f = urllib2.urlopen(url)
html = f.read()
soup = BeautifulSoup(html)
eventargs = soup.findAll(attrs={'type': 'hidden'})
reValue = re.compile(r'value=\"(.*)\"', re.DOTALL)
viewstate = re.findall(reValue, str(eventargs[0]))[0]
validation = re.findall(reValue, str(eventargs[1]))[0]

# Remaining parameters captured with Fiddler
params = urllib.urlencode({'__VIEWSTATE': viewstate,
                           '__EVENTVALIDATION': validation,
                           'PDFViewer1$PageNumberTextBox': 6,
                           'PDFViewer1_BookmarkPanelScrollX': 0,
                           'PDFViewer1_BookmarkPanelScrollY': 0,
                           'PDFViewer1_ImagePanelScrollX': 0,
                           'PDFViewer1_ImagePanelScrollY': 0,
                           'PDFViewer1$HiddenPageNumber': 6,
                           'PDFViewer1$HiddenAplicaMarcaAgua': 0,
                           'PDFViewer1$HiddenBrowserWidth': 1920,
                           'PDFViewer1$HiddenBrowserHeight': 670,
                           'PDFViewer1$HiddenPageNav': ''})

command = '/usr/bin/wget -E -H -k -K -p --post-data="%s" %s' % (params, url)
print command
os.system(command)
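A possible workaround, sketched below: wget appears to reuse the --post-data for every URL it fetches in a run, including the page requisites, which would explain why the image requests come back 405 Method Not Allowed. An alternative that stays closer to what the browser does is to POST the form once with urllib2, parse the returned HTML for the <img> sources, and fetch each one with a plain GET. This is only a sketch; it assumes the viewer embeds the rendered PNG URLs in <img> tags under /render/, which I haven't verified.
import urlparse

# Sketch: POST the form once, then GET each rendered image like the browser does.
req = urllib2.Request(url, data=params)        # passing data makes this a POST
page = urllib2.urlopen(req).read()
pagesoup = BeautifulSoup(page)
for img in pagesoup.findAll('img'):
    src = urlparse.urljoin(url, img.get('src', ''))
    if '/render/' in src:                      # assumed pattern for the rendered PNGs
        png = urllib2.urlopen(src).read()      # plain GET, no POST data
        with open(src.split('/')[-1], 'wb') as out:
            out.write(png)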
I'm trying to grab snowfall data from the National Weather Service at this site:
https://www.nohrsc.noaa.gov/snowfall/
The data can be downloaded from the webpage by clicking on the file type in the drop-down, but I can't figure out how to automate this using Python. They have an FTP archive, but it requires a login and I can't access it for some reason.
However, since the files can be downloaded via a "click" on the webpage interface, I imagine there must be a way to grab them using wget or urlopen. But I can't figure out what the exact URL would be in order to use those functions. Does anyone have ideas on how to download this data straight from the website listed above?
Thanks!
You can inspect the links with Chrome DevTools. Press F12, then click on the file type; the request that shows up in the Network panel gives a URL like:
https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc
You can download it with Python using the Requests library:
import requests
r = requests.get('https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc')
data = r.content  # file content as bytes
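To save those bytes to disk, you can then write them to a local file (the name data.nc is just an example):
with open('data.nc', 'wb') as f:
    f.write(r.content)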
Or you can save it straight to a file with urlretrieve:
from urllib.request import urlretrieve
url = 'https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc'
dst = 'data.nc'
urlretrieve(url, dst)
I want to download about 1000 PDF files from a web page, but I've run into an awkward PDF URL format. Neither requests.get() nor urllib.request.urlretrieve() works for me.
A usual PDF URL looks like:
https://webpage.com/this_file.pdf
But this URL looks like:
https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9000001&fileSn=1&wrtFileTy=01
So it doesn't have .pdf in the URL. If you click on it, the file downloads fine, but using Python's urllib you get a corrupt file.
At first I thought it was redirected to some other URL, so I tried requests.get(url, allow_redirects=True); the resulting URL was the same as before.
import urllib.request

filename = './novel/pdf1.pdf'
url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9031938&fileSn=1&wrtFileTy=01'
urllib.request.urlretrieve(url, filename)
This code downloads a corrupt PDF file.
I solved it by using the content field of the response object.
import requests

filename = 'pdf1.pdf'
url = . . .
response = requests.get(url)
with open('./novels/' + filename, 'wb') as f:
    f.write(response.content)
Referred to this Q&A: Download and save PDF file with Python requests module
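Since the goal is about 1000 files, a minimal batch version might look like the sketch below. It assumes you already have the list of wrtFileDownload.do URLs from the page; the urls list and the output naming scheme are placeholders.
import requests

urls = []  # fill with the wrtFileDownload.do URLs collected from the page
session = requests.Session()  # reuse one connection for all downloads
for i, u in enumerate(urls):
    r = session.get(u)
    r.raise_for_status()  # stop early on a failed download
    with open('./novels/pdf%d.pdf' % i, 'wb') as f:
        f.write(r.content)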
I'm trying to fetch an Excel file with urllib, as seen below:
import urllib.request as url
request = url.urlopen("url").geturl()
url.urlretrieve(request,"excelfile.xls")
However, the URL is not a direct link to the file, but to an HTML page which triggers the download after a small delay (without any redirects). This causes the above code to retrieve the HTML file instead.
I've worked out a temporary fix to this, but it is very unreliable. See below.
req1 = url.urlopen("url").geturl()
url.urlretrieve(req1,"excelfile.xls")
time.sleep(5)
req2 = url.urlopen("url").geturl()
url.urlretrieve(req2,"excelfile.xls")
time.sleep(5) sometimes makes up for the delay and the correct file gets downloaded.
Is there a more reliable way to be sure to get the correct file?
I've tried using .info() to have the code retry until I get the correct file, but in the code below the info printed doesn't correlate with the actual response from urlretrieve. I'm probably using it wrong.
req1 = url.urlopen("url")
url.urlretrieve(req1.geturl(),"excelfile.xls")
info = req1.info()
print(info.get_content_type())
time.sleep(5)
req2 = url.urlopen("url")
url.urlretrieve(req2.geturl(),"excelfile.xls")
info = req2.info()
print(info.get_content_type())
Any suggestions?
The url to the excel file can be found here.
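One likely source of the inconsistency: urlopen and urlretrieve each make their own request, so the .info() you print and the file urlretrieve saves come from two different responses. A more reliable pattern (a sketch, assuming the server eventually serves the spreadsheet at the same URL; the MIME types below are assumptions about this particular server) is to read the body and the Content-Type from a single response, retrying until it looks like an Excel file:
import time
import urllib.request

def fetch_excel(file_url, dest, retries=5, delay=5):
    # Poll until the response looks like a spreadsheet rather than an HTML page.
    excel_types = ('application/vnd.ms-excel', 'application/octet-stream')
    for _ in range(retries):
        with urllib.request.urlopen(file_url) as resp:
            if resp.headers.get_content_type() in excel_types:
                with open(dest, 'wb') as f:
                    f.write(resp.read())  # same response whose type was checked
                return True
        time.sleep(delay)
    return False

fetch_excel("url", "excelfile.xls")  # "url" is the placeholder from above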
I am new to web crawling, thanks for helping out. The task I need to perform is to obtain the full returned HTTP response from a Google search. When searching on Google with a keyword in a browser, the returned page contains a section:
Searches related to XXXX (where XXXX is the searched words)
I need to extract this section of the web page. From my research, most of the current packages for Google crawling can't extract this section of information. I tried to use urllib2 with the following code:
import urllib2

url = "https://www.google.com.sg/search?q=test&ie=&oe=#q=international+business+machine&spf=187"
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
con = urllib2.urlopen(req)
strs = con.read()
print strs
I am getting a large chunk of text which looks like a legitimate HTTP response, but within it there isn't any content related to my search key "international business machine". I know Google probably detects that this is not a request from an actual browser and hides this info. Is there any way to bypass this and obtain the "related searches" section of the Google result? Thanks.
As pointed out by @anonyXmous, the useful post to refer to is here:
Google Search Web Scraping with Python
With

from requests import get

keyword = "international business machine"
url = "https://google.com/search"
raw = get(url, params={"q": keyword}).text
print raw

I am able to get the needed text in raw.
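If the returned page is still missing the section, sending a browser-like User-Agent header (as the original urllib2 attempt did) may help; whether Google returns the full page this way isn't guaranteed:
from requests import get

headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a regular browser
raw = get("https://google.com/search",
          params={"q": "international business machine"},
          headers=headers).text
print raw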
I have created a page on my site, http://shedez.com/test.html. This page redirects users to a JPG on my server. I want to copy this image to my local drive using a Python script. I want the script to go to the main URL first, then follow it to the destination URL of the picture, and then copy the image. For now the destination URL is hardcoded, but in the future it will be dynamic, because I will be using geocoding to find the city via IP and then redirect my users to the picture of the day from their city.
=== my present script ===
import urllib2, os

req = urllib2.urlopen("http://shedez.com/test.html")
final_link = req.info()  # note: this returns the response headers, not a link
print req.info()

def get_image(remote, local):
    imgData = urllib2.urlopen(remote).read()
    output = open(local, 'wb')
    output.write(imgData)
    output.close()
    return local

fn = os.path.join(self.tmp, 'bells.jpg')  # self.tmp comes from the surrounding class
firstimg = get_image(final_link, fn)
It doesn't seem to be a header redirect. This is the body of the URL:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Your Page Title</title>
<meta http-equiv="REFRESH" content="0;url=http://2.bp.blogspot.com/-hF8PH92aYT0/TnBxwuDdcwI/AAAAAAAAHMo/71umGutZhBY/s1600/Professional%2BBusiness%2BCard%2BDesign%2B1.jpg">
</HEAD>
<BODY>
Optional page text here.
</BODY>
</HTML>
You can easily fetch the content with urllib or requests and parse the HTML with BeautifulSoup or lxml to get the image URL from the meta tag.
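A minimal sketch of that approach (the output filename is just an example):
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://shedez.com/test.html").read()
soup = BeautifulSoup(page)
meta = soup.find('meta', attrs={'http-equiv': 'REFRESH'})
img_url = meta['content'].split('url=', 1)[1]  # content is "0;url=http://..."
img_data = urllib2.urlopen(img_url).read()
with open('bells.jpg', 'wb') as f:
    f.write(img_data)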
You seem to be using an HTML http-equiv redirect. To have Python handle redirects transparently, use an HTTP 302 response on the server side instead. Otherwise, you'll have to parse the HTML and follow the redirect manually, or use something like mechanize.
As the other answers mention: either redirect to the image itself, or parse the URL out of the HTML.
Concerning the former (redirecting): if you're using nginx or HAProxy server-side, you can set the X-Accel-Redirect header to the image's URI and it will be served appropriately. See http://wiki.nginx.org/X-accel for more info.
The urllib2 urlopen function follows 3XX HTTP redirects by default. But in your case you are using an HTML meta-based redirect, for which you will have to use what Bibhas proposes.