I'm trying to scrape the HTML code of a new Chrome tab, but I can't find a way that works using Python.
Here's what I've tried:
I've tried the requests module, but this code:
import requests
URL = "chrome://newtab"
page = requests.get(URL)
print(page.text)
Yields this error:
Traceback (most recent call last):
File "c:\Users\Ben Bistline\Code\PythonFiles\PythonFiles\chromescrape.py", line 4, in <module>
page = requests.get(URL)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 649, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'chrome://newtab'
I suppose this result makes sense, but I'm not sure how/if I can get around it.
I've also tried using the webbrowser module with this code:
import requests, webbrowser
URL = "chrome://newtab"
chromePath = 'C:/Program Files/Google/Chrome/Application/chrome.exe %s'
webbrowser.get(chromePath).open(URL)
Unfortunately, although successful, this method does not seem to offer a way of gathering the HTML.
Does anyone know of any other way to grab the HTML of a new Chrome tab using Python?
Thanks!
You can use the Selenium WebDriver with Chrome to do that:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('chrome://newtab')
content = browser.page_source
browser.close()
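If the goal is to scrape the result, you can hand page_source to a parser. A quick sketch using BeautifulSoup (an extra dependency, assumed installed; the loop over links is just an example):
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('chrome://newtab')
# page_source is the rendered HTML of whatever the driver is showing
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

# e.g. print every link found on the new-tab page
for a in soup.find_all('a'):
    print(a.get('href'))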
I'm new to programming and I'm following along with one of Qazi's tutorials. I'm on a section about web scraping, but unfortunately I'm getting errors that I can't seem to find a solution for. Can you please help me out? Thanks.
The error is below.
Traceback (most recent call last):
File "D:\Users\Vaughan\Qazi\Web Scrapping\webscraping.py", line 6, in <module>
page = requests.get(
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk']'
[Finished in 1.171s]
My code is as follows:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import lxml
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-forecast-body')
print(week)
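Judging from the last line of the traceback, the URL that actually reached requests.get appears to be a one-element list rather than a plain string (note the square brackets inside the quotes), even though the snippet above passes a string. If the script that produced the error built the URL as a list, unwrapping it avoids the InvalidSchema error. A minimal sketch of the difference, using a hypothetical urls list:
import requests

urls = ['https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk']
# requests.get(urls)           # raises InvalidSchema: a list is not a URL
page = requests.get(urls[0])   # works: pass a single string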
I'm trying to learn web scraping in Python with the requests-html package. First, I render the main page and pull out all the necessary links; that works just fine. Then I iterate over the links and render the specific subpage for each one. Two iterations are successful, but on the third I get an error that I am unable to solve.
Here is my code:
# import HTMLSession from requests_html
from requests_html import HTMLSession
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
baseurl = 'http://www.möbelfreude.de/'
resp = session.get(baseurl+'alle-boxspringbetten')
# Run JavaScript code on webpage
resp.html.render()
links = resp.html.find('a.image-wrapper.text-center')
for link in links:
    print('Rendering... {}'.format(link.attrs['href']))
    r = session.get(baseurl + link.attrs['href'])
    r.html.render()
    print('Completed rendering... {}'.format(link.attrs['href']))
    # do stuff
Error:
Completed rendering... bett/boxspringbett-bea
Rendering... bett/boxspringbett-valina
Completed rendering... bett/boxspringbett-valina
Rendering... bett/boxspringbett-benno-anthrazit
Traceback (most recent call last):
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 603, in urlopen
chunked=chunked)
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1336, in getresponse
response.begin()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 306, in begin
version, status, reason = self._read_status()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
The error is due to the server closing the connection, which may come down to how the server is configured.
Have you tried scraping the site and appending the links to a list, then requesting each link individually to find which specific page is causing the issue?
Using dev mode in Chrome under the Network tab can help identify the headers needed for requests that require them.
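A rough sketch of that approach, reusing the setup from the question (the User-Agent header is an assumption; copy the real headers from the Network tab if the server wants more):
from requests_html import HTMLSession

session = HTMLSession()
baseurl = 'http://www.möbelfreude.de/'
headers = {'User-Agent': 'Mozilla/5.0'}

resp = session.get(baseurl + 'alle-boxspringbetten', headers=headers)
resp.html.render()
hrefs = [link.attrs['href'] for link in resp.html.find('a.image-wrapper.text-center')]

# request each link one at a time so the URL that drops the connection is easy to spot
for href in hrefs:
    try:
        r = session.get(baseurl + href, headers=headers, timeout=30)
        r.html.render()
        print('OK      {}'.format(href))
    except Exception as e:  # broad catch on purpose: this is only a diagnostic loop
        print('Failed  {}: {}'.format(href, e))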
I was able to gather data from a web page using this code:
import requests
import lxml.html
import re
url = "http://animesora.com/flying-witch-episode-7-english-subtitle/"
r = requests.get(url)
page = r.content
dom = lxml.html.fromstring(page)
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    down = re.findall('https://.*',link)
    print (down)
When I tried to gather more data from the results of the above code, I was presented with this error:
Traceback (most recent call last):
File "/home/sven/PycharmProjects/untitled1/.idea/test4.py", line 21, in <module>
r2 = requests.get(down)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 590, in send
adapter = self.get_adapter(url=request.url)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 672, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757']'
This is the code I was using:
for link2 in down:
    r2 = requests.get(down)
    page2 = r.url
    dom2 = lxml.html.fromstring(page2)
    for link2 in dom2('//div[@class="button green"]//onclick'):
        down2 = re.findall('.*',down2)
        print (down2)
You are passing in the whole list:
for link2 in down:
    r2 = requests.get(down)
Note how you passed in down, not link2. down is a list, not a single URL string.
Pass in link2:
for link2 in down:
    r2 = requests.get(link2)
I'm not sure why you are using regular expressions. In the loop
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
each link is already a fully qualified URL:
>>> for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
...     print link
...
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9FZEk2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC95Tmg2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC93dFBmVFg=&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757
You don't need to do any further processing on that.
Your remaining code has more errors: you used r.url where you meant r2.content, and you forgot the .xpath part in your dom2.xpath(...) query.
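Putting those fixes together, a rough sketch of the corrected loop (Python 2 syntax to match the question, reusing dom, requests and lxml.html from your first snippet; the XPath for the onclick attribute is a guess at what was intended):
for link2 in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    r2 = requests.get(link2)
    dom2 = lxml.html.fromstring(r2.content)
    for onclick in dom2.xpath('//div[@class="button green"]//@onclick'):
        print onclick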
I am currently trying to do some QA/form submissions using a headless browser in Python, and I don't think my libraries are able to submit/complete the form. What am I doing wrong here?
import mechanize
import cookielib
cj = cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
response1 = br.open("http://www.nike.com/us/en_us/")
assert br.viewing_html()
print br.title()
print response1.geturl()
html = response1.read()
for forms in br.forms():
    print forms
# Select the second (index one) form
br.select_form('login-form')
# User credentials
br.form['email'] = 'example@email.com'
br.form['password'] = 'test-password'
br.submit
If I try robobrowser, this is my error:
Traceback (most recent call last):
File "/Users/cmw/PycharmProjects/Nike_Bot/nike_bot_py.py", line 44, in <module>
browser.submit_form(signin_form)
File "/Library/Python/2.7/site-packages/robobrowser/browser.py", line 341, in submit_form
response = self.session.request(method, url, **payload.to_requests(method))
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 553, in send
adapter = self.get_adapter(url=request.url)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 608, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'javascript:void(0);'
The website you are trying to access uses JavaScript to submit its forms: action="javascript:void(0);". Your mechanize library tries to mimic a browser without actually being able to run JavaScript, and fails. If you submit the content of the form directly with a POST, that may work, unless the site uses some form of request authentication, in which case you are out of luck.
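A rough sketch of that idea with requests (the login URL and field names below are placeholders, not the real endpoint; the actual URL and payload have to be read out of Chrome's Network tab after submitting the form manually):
import requests

login_url = 'https://www.nike.com/path/to/login'  # placeholder; find the real POST URL in the Network tab
payload = {'email': 'example@email.com', 'password': 'test-password'}

session = requests.Session()
resp = session.post(login_url, data=payload)
print(resp.status_code)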
I'm making a web page scraper using the BeautifulSoup4 and requests libraries. I had some trouble getting BeautifulSoup working, but got some help and was able to fix that. Now I've run into a new problem and I'm not sure how to fix it. I'm using requests 2.2.1 and trying to run this program on Python 3.1.2. When I do, I get a traceback error.
Here is my code:
from bs4 import BeautifulSoup
import requests
url = input("Enter a URL (start with www): ")
link = "http://" + url
page = requests.get(link).content
soup = BeautifulSoup(page)
for url in soup.find_all('a'):
    print(url.get('href'))
    print()
and the error:
Enter a URL (start with www): www.google.com
Traceback (most recent call last):
File "/Users/user/Desktop/project.py", line 8, in <module>
page = requests.get(link).content
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 349, in request
prep = self.prepare_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 287, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 287, in prepare
self.prepare_url(url,params)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 321, in prepare_url
url = str(url)
TypeError: 'tuple' object is not callable
I've done some looking, and when others have gotten this error (mostly in Django) it was because a comma was missing, but I'm not sure where a comma would go here. Any help will be appreciated.