I was able to gather data from a web page using this:
import requests
import lxml.html
import re
url = "http://animesora.com/flying-witch-episode-7-english-subtitle/"
r = requests.get(url)
page = r.content
dom = lxml.html.fromstring(page)
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    down = re.findall('https://.*', link)
    print(down)
When I tried to gather more data from the results of the above code, I was presented with this error:
Traceback (most recent call last):
File "/home/sven/PycharmProjects/untitled1/.idea/test4.py", line 21, in <module>
r2 = requests.get(down)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 590, in send
adapter = self.get_adapter(url=request.url)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 672, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757']'
This is the code I was using:
for link2 in down:
    r2 = requests.get(down)
    page2 = r.url
    dom2 = lxml.html.fromstring(page2)
    for link2 in dom2('//div[@class="button green"]//onclick'):
        down2 = re.findall('.*', down2)
        print(down2)
You are passing in the whole list:
for link2 in down:
    r2 = requests.get(down)
Note how you passed in down, not link2. down is a list, not a single URL string.
Pass in link2:
for link2 in down:
    r2 = requests.get(link2)
I'm not sure why you are using regular expressions. In the loop
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
each link is already a fully qualified URL:
>>> for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
...     print link
...
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9FZEk2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC95Tmg2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC93dFBmVFg=&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757
You don't need to do any further processing on that.
Your remaining code has more errors: you confused r2.url with r2.content, and you forgot the .xpath part of your dom2.xpath(...) query.
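Putting it all together, and dropping the regular expressions as suggested above, the whole thing might look like this (a minimal sketch; the //@onclick attribute query is an assumption on my part, since your original //onclick selects elements rather than an attribute):

for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    r2 = requests.get(link)                   # one URL string at a time
    dom2 = lxml.html.fromstring(r2.content)   # parse the response body, not r2.url
    for onclick in dom2.xpath('//div[@class="button green"]//@onclick'):
        print(onclick)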
I'm trying to scrape the HTML code of a new Chrome tab, but I can't find a way that works using Python.
Here's what I've tried:
I've tried the requests module, but this code:
import requests
URL = "chrome://newtab"
page = requests.get(URL)
print(page.text)
Yields this error:
Traceback (most recent call last):
File "c:\Users\Ben Bistline\Code\PythonFiles\PythonFiles\chromescrape.py", line 4, in <module>
page = requests.get(URL)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 649, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'chrome://newtab'
I suppose this result makes sense, but I'm not sure how/if I can get around it.
I've also tried using the webbrowser module with this code:
import requests, webbrowser
URL = "chrome://newtab"
chromePath = 'C:/Program Files/Google/Chrome/Application/chrome.exe %s'
webbrowser.get(chromePath).open(URL)
Unfortunately, although successful, this method does not seem to offer a way of gathering the HTML.
Anyone know of any other ways using Python to grab the HTML of a new Chrome tab?
Thanks!
You can use the Selenium WebDriver with Chrome to do that:
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('chrome://newtab')
content = browser.page_source
browser.close()
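This assumes a chromedriver binary matching your installed Chrome is available on your PATH. content then holds the tab's full HTML as a string, which you can hand to any parser; a minimal follow-up sketch, assuming BeautifulSoup is installed:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')   # parse the source grabbed by Selenium
print(soup.title)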
I'm currently new to programming. I'm following along with one of Qazi's tutorials and I'm on a section about web scraping, but unfortunately I'm getting errors that I can't seem to find a solution for. Can you please help me out? Thanks
The error is below.
Traceback (most recent call last):
File "D:\Users\Vaughan\Qazi\Web Scrapping\webscraping.py", line 6, in <module>
page = requests.get(
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk']'
[Finished in 1.171s]
My code is as follows:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import lxml
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-forecast-body')
print(week)
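The traceback shows the same problem as in the first question above: the URL that reaches requests.get is wrapped in a list (note the ['...'] in the error message), and requests only accepts a single URL string. A minimal sketch of the difference (the list-wrapped call is a guess at what the failing line looked like; the code as posted passes a plain string):

import requests

url = 'https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997'
page = requests.get(url)       # works: a single URL string
# page = requests.get([url])   # raises InvalidSchema: the "URL" is actually a list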
I have the following Python code intended for web crawling. When I try to run it, it throws the following error. Code:
import lxml.html
import requests
from bs4 import BeautifulSoup

url1 = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;'
url2 = 'page='
url3 = 'size=200;template=results;type=batting'
url5 = ['http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;size=200;template=results;type=batting']

for i in range(2, 3854):
    url4 = url1 + url2 + str(i) + ';' + url3
    url5.append(url4)

for page in url5:
    source_code = requests.get(page, verify=False)
    # just get the code, no headers or anything
    plain_text = source_code.text
    # BeautifulSoup objects can be sorted through easy
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': 'data-link'}):
        href = "https://www.espncricinfo.com" + link.get('href')
        title = link.string  # just the text, not the HTML
        source_code = requests.get(href)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "lxml")
        # if you want to gather information from that page
        for item_name in soup.findAll('span', {'class': 'ciPlayerinformationtxt'}):
            print(item_name.string)
The error:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 782, in _validate_conn
conn.connect()
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connection.py", line 266, in connect
match_hostname(cert, self.assert_hostname or hostname)
File "C:\Python34\lib\ssl.py", line 285, in match_hostname
% (hostname, ', '.join(map(repr, dnsnames))))
ssl.CertificateError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 369, in send
timeout=timeout
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 588, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Python34/intplayername.py", line 23, in <module>
source_code = requests.get(href)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 471, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 579, in send
r = adapter.send(request, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 430, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
It's due to a misconfiguration of the HTTPS certificates on the site you want to crawl. As a workaround, you can turn off certificate checking in the requests library:
requests.get(href, verify=False)
Please be advised that this is not recommended practice when you work with sensitive information.
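If you do use this workaround, every unverified request will also emit an InsecureRequestWarning. A minimal sketch of applying the workaround once per session and silencing that warning (assuming a reasonably recent requests/urllib3):

import requests
import urllib3

# Disabling verification exposes you to man-in-the-middle attacks; use with care.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False   # applies to every request made through this session
response = session.get('https://www.espncricinfo.com')
print(response.status_code)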
import requests
import xml.etree.ElementTree as ET
import re

gen_news_list = []

r_milligenel = requests.get('http://www.milliyet.com.tr/D/rss/rss/Rss_4.xml')
root_milligenel = ET.fromstring(r_milligenel.text)

for entry in root_milligenel:
    for channel in entry:
        for item in channel:
            title = re.search(".*title.*", item.tag)
            if title:
                gen_news_list.append(item.text)
            link = re.search(".*link.*", item.tag)
            if link:
                gen_news_list.append(item.text)
                r = requests.get(item.text)
                print(r.text)
I have a list named gen_news_list and I'm trying to append titles, summaries, links, etc. to this list. But an error occurs when I try to request a link:
Traceback (most recent call last):
File "/home/deniz/Masaüstü/Çalışmalar/Python/Bot/xmlcek.py", line 23, in <module>
r = requests.get(item.text)
File "/usr/lib/python3/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 553, in send
adapter = self.get_adapter(url=request.url)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 608, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '
http://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm
'
The first link worked successfully, but the second one throws an error, and because of it I can't add the content to the list. Is it a problem with my loop? What's wrong with the code?
If you add the line print(repr(item.text)) before the problematic line r = requests.get(item.text), you'll see that from the second iteration on, item.text has \n at the beginning and the end, which is not allowed in a URL:
'\nhttp://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm\n'
I use repr because it literally shows the newline as the string \n in its output.
The solution to your problem is to call strip on item.text to remove those newlines:
r = requests.get(item.text.strip())
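In context, the tail of your loop might then read like this (a sketch; storing the stripped value in the list is a small extra cleanup beyond the one-line fix):

link = re.search(".*link.*", item.tag)
if link:
    url = item.text.strip()     # remove the surrounding newlines before requesting
    gen_news_list.append(url)
    r = requests.get(url)
    print(r.text)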
The following code:
import gevent
import gevent.monkey
gevent.monkey.patch_socket()
import requests
import json
base_url = 'https://api.getclever.com'
section_url = base_url + '/v1.1/sections'
#get all sections
sections = requests.get(section_url, auth=('DEMO_KEY', '')).json()
urls = [base_url+data['uri']+'/students' for data in sections['data']]
#get students for each section
threads = [gevent.spawn(requests.get, url, auth=('DEMO_KEY', '')) for url in urls]
gevent.joinall(threads)
students = [thread.value for thread in threads]
#get number of students in each section
num_students = [len(student.json()['data']) for student in students]
print (sum(num_students)/len(num_students))
results in this error:
Traceback (most recent call last):
File "clever.py", line 12, in <module>
sections = requests.get(section_url, auth=('DEMO_KEY', '')).json()
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 379, in send
raise SSLError(e)
requests.exceptions.SSLError: [Errno 2] _ssl.c:503: The operation did not complete (read)
What am I doing wrong here?
Here's a similar question: [Errno 2] _ssl.c:504: The operation did not complete (read).
When you comment out
gevent.monkey.patch_socket()
or use
gevent.monkey.patch_all()
or use
gevent.monkey.patch_socket()
gevent.monkey.patch_ssl()
then the problem disappears.
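That is because patch_socket() on its own replaces the socket module but leaves the ssl module bound to the original blocking sockets, so HTTPS requests break. A minimal sketch of the fixed start of the script, with the patch applied before requests is imported:

import gevent
import gevent.monkey
gevent.monkey.patch_all()   # patch socket, ssl, and friends together, before importing requests

import requests
import json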