InvalidSchema("No connection adapters were found for '%s'" % url) - python

I was able to gather data from a web page using this code:
import requests
import lxml.html
import re
url = "http://animesora.com/flying-witch-episode-7-english-subtitle/"
r = requests.get(url)
page = r.content
dom = lxml.html.fromstring(page)
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    down = re.findall('https://.*', link)
    print(down)
When I try to gather more data from the results of the above code, I am presented with this error:
Traceback (most recent call last):
File "/home/sven/PycharmProjects/untitled1/.idea/test4.py", line 21, in <module>
r2 = requests.get(down)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 590, in send
adapter = self.get_adapter(url=request.url)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 672, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757']'
This is the code I was using:
for link2 in down:
    r2 = requests.get(down)
    page2 = r.url
    dom2 = lxml.html.fromstring(page2)
    for link2 in dom2('//div[@class="button green"]//onclick'):
        down2 = re.findall('.*',down2)
        print (down2)

You are passing in the whole list:
for link2 in down:
    r2 = requests.get(down)
Note how you passed in down, not link2. down is a list, not a single URL string.
Pass in link2:
for link2 in down:
    r2 = requests.get(link2)
I'm not sure why you are using regular expressions. In the loop
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
each link is already a fully qualified URL:
>>> for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
...     print link
...
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9FZEk2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC95Tmg2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC93dFBmVFg=&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757
You don't need to do any further processing on that.
Your remaining code has more errors; you confused r2.url with r2.content, and forgot the .xpath part in your dom2.xpath(...) query.
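Putting all of that together, a minimal sketch of the corrected scraper might look like this (it keeps the XPath expressions from the question; whether the target pages still have this structure is an assumption):
import requests
import lxml.html

url = "http://animesora.com/flying-witch-episode-7-english-subtitle/"
dom = lxml.html.fromstring(requests.get(url).content)

# the hrefs are already complete URLs, so no regex is needed
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    r2 = requests.get(link)                      # one URL string at a time, not the whole list
    dom2 = lxml.html.fromstring(r2.content)      # parse the response body, not r2.url
    for onclick in dom2.xpath('//div[@class="button green"]//@onclick'):
        print(onclick)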


Get HTML of new Chrome tab with Python

I'm trying to scrape the HTML code of a new Chrome tab, but I can't find a way that works using Python.
Here's what I've tried:
I've tried the requests module, but this code:
import requests
URL = "chrome://newtab"
page = requests.get(URL)
print(page.text)
Yields this error:
Traceback (most recent call last):
File "c:\Users\Ben Bistline\Code\PythonFiles\PythonFiles\chromescrape.py", line 4, in <module>
page = requests.get(URL)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 649, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'chrome://newtab'
I suppose this result makes sense, but I'm not sure how/if I can get around it.
I've also tried using the webbrowser module with this code:
import requests, webbrowser
URL = "chrome://newtab"
chromePath = 'C:/Program Files/Google/Chrome/Application/chrome.exe %s'
webbrowser.get(chromePath).open(URL)
Unfortunately, although successful, this method does not seem to offer a way of gathering the HTML.
Anyone know of any other ways using Python to grab the HTML of a new Chrome tab?
Thanks!
You can use the Selenium WebDriver with Chrome to do that:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('chrome://newtab')
content = browser.page_source  # note: the variable is browser, not driver
browser.close()
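For background, requests raises InvalidSchema here because it only ships connection adapters for http:// and https:// URLs, so nothing is mounted that can handle a chrome:// address. You can inspect the mounted adapters yourself; a quick check (output abbreviated):
import requests

session = requests.Session()
# only the 'https://' and 'http://' prefixes are mounted by default,
# which is why a chrome:// URL has no adapter and raises InvalidSchema
print(session.adapters)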

Requests Error: No connection adapters were found

I'm currently new to programming and I'm following along with one of Qazi's tutorials. I'm on a section about web scraping, but unfortunately I'm getting errors that I can't seem to find a solution for. Can you please help me out? Thanks.
The error is below.
Traceback (most recent call last):
File "D:\Users\Vaughan\Qazi\Web Scrapping\webscraping.py", line 6, in <module>
page = requests.get(
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk']'
[Finished in 1.171s]
My code is as follows:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import lxml
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-forecast-body')
print(week)
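Note that the traceback shows the URL wrapped in a list ('['https://forecast.weather.gov/...']') while the code above passes a plain string, so the line that actually ran most likely passed a list. As in the first question on this page, requests.get() needs a single URL string, for example:
import requests

url = 'https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk'
page = requests.get(url)  # pass the string itself, not [url]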

Python web crawling is throwing connection errors

I have the following Python code intended for web crawling. When I try to run it, it throws the following error. Code:
import lxml.html
import requests
from bs4 import BeautifulSoup
url1='http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;'
url2 ='page='
url3 ='size=200;template=results;type=batting'
url5 = ['http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;size=200;template=results;type=batting']
for i in range(2,3854):
    url4 = url1 + url2 + str(i) + ';' + url3
    url5.append(url4)
for page in url5:
    source_code = requests.get(page, verify=False)
    # just get the code, no headers or anything
    plain_text = source_code.text
    # BeautifulSoup objects can be sorted through easy
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': 'data-link'}):
        href = "https://www.espncricinfo.com" + link.get('href')
        title = link.string  # just the text, not the HTML
        source_code = requests.get(href)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "lxml")
        # if you want to gather information from that page
        for item_name in soup.findAll('span', {'class': 'ciPlayerinformationtxt'}):
            print(item_name.string)
The error:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 782, in _validate_conn
conn.connect()
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connection.py", line 266, in connect
match_hostname(cert, self.assert_hostname or hostname)
File "C:\Python34\lib\ssl.py", line 285, in match_hostname
% (hostname, ', '.join(map(repr, dnsnames))))
ssl.CertificateError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 369, in send
timeout=timeout
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 588, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Python34/intplayername.py", line 23, in <module>
source_code = requests.get(href)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 471, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 579, in send
r = adapter.send(request, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 430, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
It's due to a misconfiguration of the HTTPS certificates on the site you want to crawl. As a workaround, you can turn off certificate checking in the requests library:
requests.get(href, verify=False)
Please be advised that this is not a recommended practice when you work with sensitive information.
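As a side note, with verify=False requests (via urllib3) emits an InsecureRequestWarning on every call; if you have deliberately accepted the risk, the warning can be silenced, for example:
import requests
import urllib3

# acknowledge that certificate verification is intentionally disabled
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get("https://www.espncricinfo.com", verify=False)
print(resp.status_code)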

Requests : No connection adapters were found for, error in Python3 [duplicate]

This question already has answers here: Python Requests - No connection adapters (4 answers). Closed 6 months ago.
import requests
import xml.etree.ElementTree as ET
import re
gen_news_list=[]
r_milligenel = requests.get('http://www.milliyet.com.tr/D/rss/rss/Rss_4.xml')
root_milligenel = ET.fromstring(r_milligenel.text)
for entry in root_milligenel:
    for channel in entry:
        for item in channel:
            title = re.search(".*title.*",item.tag)
            if title:
                gen_news_list.append(item.text)
            link = re.search(".*link.*",item.tag)
            if link:
                gen_news_list.append(item.text)
                r = requests.get(item.text)
                print(r.text)
I have a list named gen_news_list and I'm trying to append titles, summaries, links, etc. to this list. But an error occurs when I try to request a link:
Traceback (most recent call last):
File "/home/deniz/Masaüstü/Çalışmalar/Python/Bot/xmlcek.py", line 23, in <module>
r = requests.get(item.text)
File "/usr/lib/python3/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 553, in send
adapter = self.get_adapter(url=request.url)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 608, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '
http://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm
The first link worked successfully, but the second one threw an error, so I can't add the content to the list. Is it a problem with my loop? What's wrong with the code?
If you add the line print(repr(item.text)) before the problematic line r = requests.get(item.text), you will see that, starting the second time through, item.text has \n at the beginning and the end, which is not allowed in a URL:
'\nhttp://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm\n'
I use repr because it literally shows the newline as the string \n in its output.
The solution to your problem is to call strip on item.text to remove those newlines:
r = requests.get(item.text.strip())
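As a quick self-contained check (using the URL from the traceback above), stripping the surrounding whitespace is enough for the request to go through:
import requests

raw = '\nhttp://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm\n'
r = requests.get(raw.strip())  # strip() removes the leading and trailing newlines
print(r.status_code)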

Python: requests module throws exception with Gevent

The following code:
import gevent
import gevent.monkey
gevent.monkey.patch_socket()
import requests
import json
base_url = 'https://api.getclever.com'
section_url = base_url + '/v1.1/sections'
#get all sections
sections = requests.get(section_url, auth=('DEMO_KEY', '')).json()
urls = [base_url+data['uri']+'/students' for data in sections['data']]
#get students for each section
threads = [gevent.spawn(requests.get, url, auth=('DEMO_KEY', '')) for url in urls]
gevent.joinall(threads)
students = [thread.value for thread in threads]
#get number of students in each section
num_students = [len(student.json()['data']) for student in students]
print (sum(num_students)/len(num_students))
results in this error:
Traceback (most recent call last):
File "clever.py", line 12, in <module>
sections = requests.get(section_url, auth=('DEMO_KEY', '')).json()
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 379, in send
raise SSLError(e)
requests.exceptions.SSLError: [Errno 2] _ssl.c:503: The operation did not complete (read)
What am I doing wrong here?
Here's a similar question: [Errno 2] _ssl.c:504: The operation did not complete (read).
When you comment out
gevent.monkey.patch_socket()
or use
gevent.monkey.patch_all()
or use
gevent.monkey.patch_socket()
gevent.monkey.patch_ssl()
then the problem disappears.
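A minimal sketch of the fixed setup, assuming the rest of the script stays the same: patch everything (or at least sockets and SSL) before requests is imported.
import gevent
import gevent.monkey
gevent.monkey.patch_all()  # patches ssl as well as sockets

import requests

# same request as before, now running over gevent-patched sockets/ssl
sections = requests.get('https://api.getclever.com/v1.1/sections', auth=('DEMO_KEY', '')).json()
print(len(sections['data']))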
