The following Python code usually gives me the HTML of the website specified in the url variable, and urlopen normally works correctly. The problem: with one particular website, I can no longer get this code to work.
from urllib.request import urlopen
url = "http://www.somewebsite.com"
html = urlopen(url)
print(html.read())
The error is the following:
File "C:/script.py", line 4, in <module>
html = urlopen(url)
[...]
File "C:\Python\Python35\lib\http\client.py", line 251, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
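One common cause of RemoteDisconnected (only a guess here, since the failing site isn't named) is a server that silently drops clients without a browser-like User-Agent header; urllib announces itself as Python-urllib by default, which some sites reject. A minimal sketch:
from urllib.request import Request, urlopen

url = "http://www.somewebsite.com"
# Example User-Agent string; any common browser UA should do.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req)
print(html.read())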
I'm trying to import the JSON data generated by an Impinj R420 reader.
The code I use is:
# import urllib library
import urllib.request
from urllib.request import urlopen
# import json
import json
# store the URL in url as
# parameter for urlopen
url = "http://10.234.92.19:14150"
# store the response of URL
response = urllib.request.urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response())
# print the json response
print(data_json)
When I execute the program, it gives the following error:
Traceback (most recent call last):
File "C:\Users\V\PycharmProjects\Stapelcontrole\main.py", line 13, in <module>
response = urllib.request.urlopen(url)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 519, in open
response = self._open(req, data)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1377, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1352, in do_open
r = h.getresponse()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1374, in getresponse
response.begin()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 300, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: {"epc":"3035307B2831B383E019E8EA","firstSeenTimestamp":"2022-04-11T11:24:23.592434Z","isHeartBeat":false}
Process finished with exit code 1
I know this is an error in the response, where it gets a faulty HTTP status line.
Yet I don't know how to fix the error.
Could you advise me how to fix this?
The {"epc":"3035307B2831B383E019E8EA","firstSeenTimestamp":"2022-04-11T11:24:23.592434Z","isHeartBeat":false} is the answer I expect.
Thanks in advance
Edit:
Even with
with urllib.request.urlopen(url) as f:
    data_json = json.load(f)
I get the same BadStatusLine error.
I can't set up the reader any differently; it can only send a JSON response through the IP address of the device. Is there a way to import the data without the HTTP protocol?
# store the response of URL
response = urllib.request.urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response())
Here you are actually calling response; I do not know what you meant to achieve by that, but the examples in the urllib.request docs suggest that the object returned by urllib.request.urlopen should be treated like a local file handle, so please replace the above with:
with urllib.request.urlopen(url) as f:
    data_json = json.load(f)
Note that I used json.load, not json.loads: json.load reads from a file-like object, while json.loads parses a string that is already in memory.
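A small self-contained illustration of the difference (the payload string here is made up):
import io
import json

payload = '{"isHeartBeat": false}'
print(json.loads(payload))              # loads: parses a str/bytes already in memory
print(json.load(io.StringIO(payload)))  # load: reads from a file-like object, like the urlopen response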
EDIT: After Is there a way to import the data without the HTTP protocol? I conclude that a more low-level solution is needed; hopefully the socket module will allow you to get what you want. Using the echo client example from the docs as a starting point, I prepared the following code:
import socket

# open a plain TCP connection to the reader and ask for its output
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(("10.234.92.19", 14150))
    s.sendall(b'GET / HTTP/1.1\r\n')
    data = s.recv(1024)   # read at most 1024 bytes of the reply
    print(data)
If everything works as intended, you should get the first 1024 bytes of the answer printed. If so, change 1024 to a value that will always be larger than the number of bytes in the response, and use json.loads(data) to parse it.
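If the device keeps the connection open and streams more records, here is a slightly fuller sketch of the same idea, under one assumption: that the reader emits one JSON object per line, as the BadStatusLine output above suggests.
import json
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(("10.234.92.19", 14150))
    s.sendall(b'GET / HTTP/1.1\r\n')   # the reader seems to answer this with raw JSON
    buffer = b""
    while b"\n" not in buffer:
        chunk = s.recv(1024)
        if not chunk:                  # connection closed before a full record arrived
            break
        buffer += chunk
    record, _, _ = buffer.partition(b"\n")
    if record:
        print(json.loads(record))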
I'm trying to scrape the HTML code of a new Chrome tab, but I can't find a way that works using Python.
Here's what I've tried:
I've tried the requests module, but this code:
import requests
URL = "chrome://newtab"
page = requests.get(URL)
print(page.text)
Yields this error:
Traceback (most recent call last):
File "c:\Users\Ben Bistline\Code\PythonFiles\PythonFiles\chromescrape.py", line 4, in <module>
page = requests.get(URL)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 649, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'chrome://newtab'
I suppose this result makes sense, but I'm not sure how/if I can get around it.
I've also tried using the webbrowser module with this code:
import requests, webbrowser
URL = "chrome://newtab"
chromePath = 'C:/Program Files/Google/Chrome/Application/chrome.exe %s'
webbrowser.get(chromePath).open(URL)
Unfortunately, although successful, this method does not seem to offer a way of gathering the HTML.
Anyone know of any other ways using Python to grab the HTML of a new Chrome tab?
Thanks!
You can use the Selenium webdriver with Chrome to do that:
from selenium import webdriver

browser = webdriver.Chrome()       # launches a Chrome window driven by chromedriver
browser.get('chrome://newtab')
content = browser.page_source      # the rendered HTML of the page
browser.close()
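If you'd rather not have a visible browser window pop up, a headless variant should also work. This is a sketch assuming Selenium 4.6+ (which downloads chromedriver automatically) and a recent Chrome for the --headless=new flag:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')   # plain '--headless' on older Chrome versions
browser = webdriver.Chrome(options=options)
browser.get('chrome://newtab')
html = browser.page_source
browser.quit()
print(html[:500])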
I'm trying to learn web scraping in Python with the requests-html package. First, I render a main page and pull out all the necessary links; that works just fine. Then I iterate over all the links and render the specific subpage for each one. Two iterations are successful, but on the third I get an error that I am unable to solve.
Here is my code:
# import HTMLSession from requests_html
from requests_html import HTMLSession

# create an HTML Session object
session = HTMLSession()

# Use the object above to connect to the needed webpage
baseurl = 'http://www.möbelfreude.de/'
resp = session.get(baseurl + 'alle-boxspringbetten')

# Run JavaScript code on webpage
resp.html.render()

links = resp.html.find('a.image-wrapper.text-center')
for link in links:
    print('Rendering... {}'.format(link.attrs['href']))
    r = session.get(baseurl + link.attrs['href'])
    r.html.render()
    print('Completed rendering... {}'.format(link.attrs['href']))
    # do stuff
Error:
Completed rendering... bett/boxspringbett-bea
Rendering... bett/boxspringbett-valina
Completed rendering... bett/boxspringbett-valina
Rendering... bett/boxspringbett-benno-anthrazit
Traceback (most recent call last):
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 603, in urlopen
chunked=chunked)
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1336, in getresponse
response.begin()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 306, in begin
version, status, reason = self._read_status()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
The error is due to the connection being closed, and may be caused by some configuration on the server side.
Have you tried scraping the site and appending the links to a list, then requesting each link individually to locate the specific URL that is causing the issue?
Using dev mode in Chrome, under the Network tab, can help identify the necessary headers for requests that require them.
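As a sketch of that suggestion, applied to the code from the question (the User-Agent value is a placeholder; copy the real headers from Chrome's Network tab):
from requests_html import HTMLSession
from requests.exceptions import RequestException

session = HTMLSession()
baseurl = 'http://www.möbelfreude.de/'
headers = {'User-Agent': 'Mozilla/5.0'}   # placeholder browser-like header

resp = session.get(baseurl + 'alle-boxspringbetten')
resp.html.render()
hrefs = [link.attrs['href'] for link in resp.html.find('a.image-wrapper.text-center')]

failed = []
for href in hrefs:
    try:
        session.get(baseurl + href, headers=headers)
    except RequestException as exc:       # RemoteDisconnected surfaces as a ConnectionError
        failed.append((href, exc))

print('Failing links:', failed)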
Here is my Python code, which I'm using to extract specific HTML from the page links I pass as a parameter. I'm using BeautifulSoup. This code works fine sometimes, and sometimes it gets stuck!
import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='

for i in range(1, 49):
    # iterate over the URLs and capture the content
    sock = urllib.urlopen(url + str(i))
    html = sock.read()
    sock.close()
    rawHtml += html
    print i
Here I'm printing the loop variable to find out where it gets stuck. It shows me that it gets stuck at a random point in the loop.
soup = BeautifulSoup(rawHtml, 'html.parser')
t = ''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"

f = open("Link.txt", 'w+')
f.write(t)
f.close()
What could be the possible issue? Is it a problem with the socket configuration or something else?
This is the error I got. I checked these links for a solution: python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed, but I didn't find them very helpful.
d:\python>python ext.py
Traceback (most recent call last):
File "ext.py", line 8, in <module>
sock = urllib.urlopen(url+ str(i))
File "d:\python\lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "d:\python\lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "d:\python\lib\urllib.py", line 350, in open_http
h.endheaders(data)
File "d:\python\lib\httplib.py", line 1049, in endheaders
self._send_output(message_body)
File "d:\python\lib\httplib.py", line 893, in _send_output
self.send(msg)
File "d:\python\lib\httplib.py", line 855, in send
self.connect()
File "d:\python\lib\httplib.py", line 832, in connect
self.timeout, self.source_address)
File "d:\python\lib\socket.py", line 557, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
It runs perfectly fine on my personal laptop, but it gives this error on my office desktop. Also, my version of Python is 2.7. I hope this information helps.
Finally, it worked! The same script worked when I checked it on other PCs too. So the problem was probably the firewall or proxy settings of my office desktop, which were blocking this website.
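For anyone stuck behind a similar corporate proxy, Python 2's urllib.urlopen accepts an explicit proxies mapping; a sketch with a placeholder proxy address:
import urllib

url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
# Placeholder proxy; substitute your office proxy host and port.
proxies = {'http': 'http://proxy.example.com:8080'}

sock = urllib.urlopen(url + str(1), proxies=proxies)
html = sock.read()
sock.close()
print html[:200]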
import requests
import xml.etree.ElementTree as ET
import re

gen_news_list = []

r_milligenel = requests.get('http://www.milliyet.com.tr/D/rss/rss/Rss_4.xml')
root_milligenel = ET.fromstring(r_milligenel.text)

for entry in root_milligenel:
    for channel in entry:
        for item in channel:
            title = re.search(".*title.*", item.tag)
            if title:
                gen_news_list.append(item.text)
            link = re.search(".*link.*", item.tag)
            if link:
                gen_news_list.append(item.text)
                r = requests.get(item.text)
                print(r.text)
I have a list named gen_news_list, and I'm trying to append titles, summaries, links, etc. to this list. But an error occurs when I try to request a link:
Traceback (most recent call last):
File "/home/deniz/Masaüstü/Çalışmalar/Python/Bot/xmlcek.py", line 23, in <module>
r = requests.get(item.text)
File "/usr/lib/python3/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 553, in send
adapter = self.get_adapter(url=request.url)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 608, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '
http://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm
The first link worked successfully, but the second one threw an error. I can't add the content to the list because of this error. Is it a problem with my loop? What's wrong with the code?
If you add the line print(repr(item.text)) before the problematic line r = requests.get(item.text), you will see that from the second iteration onward, item.text has \n at the beginning and the end, which is not allowed in a URL:
'\nhttp://www.milliyet.com.tr/tbmm-baskani-cicek-programlarini/siyaset/detay/2037301/default.htm\n'
I use repr because it literally shows the newlines as the string \n in its output.
The solution to your problem is to call strip on item.text to remove those newlines:
r = requests.get(item.text.strip())
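Applied to the loop from the question, only the if link: branch changes:
link = re.search(".*link.*", item.tag)
if link:
    cleaned = item.text.strip()   # drop the surrounding '\n' characters
    gen_news_list.append(cleaned)
    r = requests.get(cleaned)
    print(r.text)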