I'm using Python 3.7.3 and the requests_pkcs12 library to scrape a website where I must pass a certificate and password, then download and extract zip files from links on the page. I've got the first part working fine. But when I try to read the files using urllib, I get an error.
import urllib.request
from bs4 import BeautifulSoup
import requests
from requests_pkcs12 import get
# get page and setup BeautifulSoup
# r = requests.get(url) # old non-cert method
r = get(url, pkcs12_filename=certpath, pkcs12_password=certpwd)
# find zip files to download
soup = BeautifulSoup(r.content, "html.parser")
# Read files
i = 1
for td in soup.find_all(lambda tag: tag.name=='td' and tag.text.strip().endswith('DAILY.zip')):
    link = td.find_next('a')
    print(td.get_text(strip=True), link['href'] if link else '')  # good
    zipurl = 'https://my.downloadsite.com' + (link['href'] if link else '')
    print(zipurl)  # good
    # Read zip file from URL
    url = urllib.request.urlopen(zipurl)  # ERROR on this line SSLv3 alert handshake failure
    zippedData = url.read()
I've seen various older Python 2.x posts on ways to handle this, but I'm wondering what the best way to do this is now, with the newer libraries in Python 3.7.x.
Below is the stack trace of the error.
The answer was to not use urllib and instead use the same requests_pkcs12 get() replacement, which accepts a .pfx certificate and password.
The last two lines:
url = urllib.request.urlopen(zipurl) # ERROR on this line SSLv3 alert handshake failure
zippedData = url.read()
should be replaced with:
url = get(zipurl, pkcs12_filename=certpath, pkcs12_password=certpwd)
zippedData = url.content
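Not in the original answer, but since the goal is to extract the zip files as well, here is a minimal sketch of unpacking the downloaded bytes in memory with the standard library, reusing zipurl, certpath and certpwd from the question (the "downloads" folder is hypothetical):
import io
import zipfile
from requests_pkcs12 import get
# same client certificate parameters as above (certpath / certpwd)
resp = get(zipurl, pkcs12_filename=certpath, pkcs12_password=certpwd)
# wrap the zip bytes in a file-like object and extract in memory
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    zf.extractall(path="downloads")  # hypothetical output folder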
Related
I am new to web scraping. When I go to "https://pancakeswap.finance/prediction?token=BNB", right-click on the page and inspect it, I see a complex HTML page.
But when I try to get the same HTML page through Python with:
from urllib.request import urlopen
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context
url = 'https://pancakeswap.finance/prediction?token=BNB'
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode('utf-8')
print(html)
I get a different HTML page. My goal is to scrape a specific value from the page, but I cannot access that value in the HTML I get through Python.
Thanks for your help!
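(Not part of the original post.) The prediction page is rendered client-side by JavaScript, so urlopen only returns the initial HTML shell, not the DOM you see in the browser's inspector. One common approach is to let a real browser render the page first, for example with Selenium; a rough sketch, assuming pip install selenium and a matching ChromeDriver:
from selenium import webdriver
url = 'https://pancakeswap.finance/prediction?token=BNB'
driver = webdriver.Chrome()   # assumes a local ChromeDriver is installed
driver.get(url)
# a time.sleep or WebDriverWait may still be needed for values that load asynchronously
html = driver.page_source     # the DOM after JavaScript has rendered it
driver.quit()
print(html)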
I want to download the ipranges.json (which is updated weekly) from https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
I have this Python code, which keeps running forever.
import wget
URL = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
response = wget.download(URL, "ips.json")
print(response)
How can I download the JSON file in Python?
Because https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 is a page that triggers the download via JavaScript, you just download the page itself, not the file.
If you check the downloaded file, the source will look like this.
Since the real file URL changes after a while, we have to scrape it in a generic way.
For convenience, I will not use wget. Two libraries are used here: requests to request the page and download the file, and BeautifulSoup to parse the HTML.
# pip install requests
# pip install bs4
import requests
from bs4 import BeautifulSoup
# request page
URL = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
page = requests.get(URL)
# parse HTML to get the real link
soup = BeautifulSoup(page.content, "html.parser")
link = soup.find('a', {'data-bi-containername':'download retry'})['href']
# download
file_download = requests.get(link)
# save as azure_ips.json
with open("azure_ips.json", "wb") as f:
    f.write(file_download.content)
Hi, I want to download delimited text that is hosted behind an HTML link. (The link is accessible on a private network only, so I can't share it here.)
In R, the following code solves the problem (all other functions gave an "Unauthorized access" or "401" error):
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
download.file(url, "~/insights_dashboard/testing_file.tsv")
a = read.csv("~/insights_dashboard/testing_file.tsv",header = T,stringsAsFactors = F,sep='\t')
I want to do the same thing in Python, for which I used:
(A) urllib and requests.get():
import requests
import urllib.request
url_get = requests.get(url, verify=False)
urllib.request.urlretrieve(url_get, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
(B) requests.get() and pd.read_html():
import io
import pandas as pd
import requests
url='https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
s = requests.get(url, verify=False)
a = pd.read_html(io.StringIO(s.decode('utf-8')))
(C) Using wget:
import wget
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
wget.download(url,--auth-no-challenge, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
OR
wget --server-response -owget.log "https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain"
NOTE: The URL doesn't ask for any credentials; it is accessible in a browser and downloadable in R with download.file. I am looking for a solution in Python.
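Not part of the original question: a minimal sketch of the usual requests-based equivalent of R's download.file plus read.csv, reusing the job URL and Windows path from the attempts above (verify=False is kept only because the attempts use it):
import io
import pandas as pd
import requests
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
# fetch the raw bytes and save them to disk, mirroring R's download.file
r = requests.get(url, verify=False)
r.raise_for_status()
with open(r'C:\Users\cssaxena\Desktop\24.tsv', 'wb') as f:
    f.write(r.content)
# read the tab-delimited text into a DataFrame, mirroring read.csv(..., sep='\t')
a = pd.read_csv(io.StringIO(r.text), sep='\t')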
def geturls(path):
    # read the raw html and coerce the bytes to a string
    yy = open(path, 'rb').read()
    yy = "".join(str(yy))
    # split on anchor tags and pull out each href="..." value
    yy = yy.split('<a')
    out = []
    for d in yy:
        z = d.find('href="')
        if z > -1:
            x = d[z + 6:len(d)]
            r = x.find('"')
            x = x[:r]
            x = x.strip(' ./')
            # keep only non-trivial links and skip hrefs containing entities
            if (len(x) > 2) and (x.find(";") == -1):
                out.append(x.strip(" /"))
    out = set(out)
    return out

pg = "./test.html"  # your html file
url = geturls(pg)
print(url)
I am using simple Python code to fetch a URL and scrape out all the other URLs mentioned in every webpage (all HTML sub-pages, if any, under the home/root page) of that URL. Here is my code:
import urllib
import urllib2
import re
import socks
import socket
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket
req = urllib2.Request('http://www.python.org')
#connect to a URL
try:
    website = urllib2.urlopen(req)
except urllib2.URLError as e:
    print "Error Reason:", e.reason
else:
    # read html code
    html = website.read()
    # use re.findall to get all the links
    links = re.findall('"((http|ftp)s?://.*?)"', html)
    print links
Right now I am getting a simple error: the socks module is not recognized. I figured out I have to copy "socks.py" to the correct path under Python's lib/site-packages directory.
I've added the socks module to my code because my Python script was not otherwise able to connect to the URL http://www.python.org. My question is: am I using socks correctly?
Also, will my script take care of all the webpages under the root URL? I want to scrape all URLs from all such pages under the root URL.
Also, how can I check which port to use in the setdefaultproxy line of my code?
I would suggest using BeautifulSoup for web scraping. Below is the code for it, with a much simpler approach.
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.python.org")
c = r.content
soup = BeautifulSoup(c, "html.parser")
anchor_list = [a['href'] for a in soup.find_all('a', href=True) if a.text.strip()]
print(anchor_list)
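If the Tor/SOCKS proxy from the question is still needed, requests can route through it directly; a small sketch, assuming PySocks is installed (pip install requests[socks]) and Tor's default SOCKS port 9050 as in the question:
import requests
from bs4 import BeautifulSoup
# socks5h:// resolves DNS through the proxy as well (Tor's default SOCKS port is 9050)
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
r = requests.get("http://www.python.org", proxies=proxies)
soup = BeautifulSoup(r.content, "html.parser")
anchor_list = [a['href'] for a in soup.find_all('a', href=True) if a.text.strip()]
print(anchor_list)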
Hope it helps!
How can I list files and folders if I only have an IP-address?
With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?
I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).
Use requests to get page content and BeautifulSoup to parse the result.
For example, if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:
from bs4 import BeautifulSoup
import requests
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
def listFD(url, ext=''):
    page = requests.get(url).text
    print(page)
    soup = BeautifulSoup(page, 'html.parser')
    return [url + '/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]

for file in listFD(url, ext):
    print(file)
You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.
For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.
Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "ip address" you want to access would do the job), or set up an HTTP server on that machine that provides for each path (http://192.168.2.100/directory) a list of files in it (in whatever format) and parse that through Python.
If the server provides an "index of /bla/bla" kind of page (like Apache servers do with directory listings), you could parse the HTML output to find the names of files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck: you can't do it.
Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:
import requests
from bs4 import BeautifulSoup
def get_url_paths(url, ext='', params={}):
    response = requests.get(url, params=params)
    if response.ok:
        response_text = response.text
    else:
        return response.raise_for_status()
    soup = BeautifulSoup(response_text, 'html.parser')
    parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
    return parent

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)
HTTP does not work with "files" and "directories". Pick a different protocol.
You can use the following script to get the names of all files in the directories and sub-directories of an HTTP server. A file writer can be used to download them (see the sketch after the script).
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
def read_url(url):
    url = url.replace(" ", "%20")
    req = Request(url)
    a = urlopen(req).read()
    soup = BeautifulSoup(a, 'html.parser')
    x = soup.find_all('a')
    for i in x:
        file_name = i.extract().get_text()
        url_new = url + file_name
        url_new = url_new.replace(" ", "%20")
        # recurse into sub-directories (links ending in '/'), skipping parent/hidden links
        if file_name[-1] == '/' and file_name[0] != '.':
            read_url(url_new)
        print(url_new)

read_url("http://www.example.com/")