Unknown URL type using urllib2.urlopen() - python

I am trying to do the following:
open a CSV file containing a list of URLs (GET requests)
read the CSV file and write the entries to a list
open every single URL and read the answer
write the answers back to a new CSV file
I get the following error:
Traceback (most recent call last):
File "C:\Users\l.buinui\Desktop\request2.py", line 16, in <module>
req = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 404, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 427, in _open
'unknown_open', req)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1247, in unknown_open
raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: ['http>
Here is the code I am using:
import urllib2
import urllib
import csv
# Open and read the source file and write entries to a list called link_list
source_file=open("source_new.csv", "rb")
source = csv.reader(source_file, delimiter=";")
link_list = [row for row in source]
source_file.close()
# Create an output file which contains the answer of the GET-Request
out=open("output.csv", "wb")
output = csv.writer(out, delimiter=";")
for row in link_list:
    url = str(row)
    req = urllib2.urlopen(url)
    output.writerow(req.read())
out.close()
What is going wrong there?
Thanks in advance for any hints.
Cheers

Using the row variable passes a list (containing only one element, the URL) to urlopen, whereas passing row[0] passes the string containing the URL.
The csv.reader returns a list for each row it reads, no matter how many items are in the row.
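A quick illustration of the difference (the URL here is just a made-up example):
row = ['http://example.com/page1']  # what csv.reader yields for a one-column row
str(row)   # "['http://example.com/page1']" -- brackets and quotes included,
           # which is why urlopen reports: unknown url type: ['http
row[0]     # 'http://example.com/page1' -- the plain string urlopen expects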

It's working now. If I directly reference row[0] in the loop there are no problems.
import urllib2
import urllib
import csv
# Open and read the source file and write entries to a list called link_list
source_file=open("source.csv", "rb")
source = csv.reader(source_file, delimiter=";")
link_list = [row for row in source]
source_file.close()
# Create an output file which contains the answer of the GET-Request
out=open("output.csv", "wb")
output = csv.writer(out)
for row in link_list:
    req = urllib2.urlopen(row[0])
    answer = req.read()
    output.writerow([answer])
out.close()

Related

How to print JSON data to a Google Sheet using GSpread

I have tried every possible fix I can find online; unfortunately, I'm new to this and not sure whether I'm getting closer or not.
Ultimately, all I am trying to do is print a JSON feed into a Google Sheet.
GSpread is working (I've appended just number values as a test), but I simply cannot get the JSON feed to print there.
I've gotten it printing to terminal, so I know it's accessible, but writing the loop to append the data becomes the issue.
This is my current script:
# import urllib library
import json
from urllib.request import urlopen
import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1-1aiGMn2yUWRlh_jnIebcMNs-6phzUNxkktAFH7uY9o')
worksheet = sh.sheet1
# import json
# store the URL in url as
# parameter for urlopen
url = 'https://api.chucknorris.io/jokes/random'
# store the response of URL
response = urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response.read())
# print the json response
# print(data_json)
result = []
for key in data_json:
    result.append([key, data_json[key]])
worksheet.update('a1', result)
I've hit a complete brick wall - any advice would be greatly appreciated
Update - suggested script with new error:
# import urllib library
import json
from urllib.request import urlopen
import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1-1aiGMn2yUWRlh_jnIebcMNs-6phzUNxkktAFH7uY9o')
worksheet = sh.sheet1
url = 'https://api.chucknorris.io/jokes/random'
# store the response of URL
response = urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response.read())
# print the json response
# print(data_json)
result = []
for key in data_json:
    result.append([key, data_json[key] if not isinstance(
        data_json[key], list) else ",".join(map(str, data_json[key]))])
worksheet.update('a1', result)
Error:
Traceback (most recent call last):
File "c:\Users\AMadle\NBA-JSON-Fetch\PrintToSheetTest.py", line 17, in <module>
response = urlopen(url)
File "C:\Python\python3.10.5\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Python\python3.10.5\lib\urllib\request.py", line 525, in open
response = meth(req, response)
File "C:\Python\python3.10.5\lib\urllib\request.py", line 634, in http_response
response = self.parent.error(
File "C:\Python\python3.10.5\lib\urllib\request.py", line 563, in error
return self._call_chain(*args)
File "C:\Python\python3.10.5\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Python\python3.10.5\lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I can confirm it is not a permissions issue: the script below fetches the same URL and prints it to the terminal with no problem. I also have no problem writing other data to the sheet:
import requests as rq
from bs4 import BeautifulSoup
url = 'https://api.chucknorris.io/jokes/random'
req = rq.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print(soup)
In your script, the JSON data needs to be converted to a 2-dimensional array before it can be written to the sheet. And when I looked at the value of data_json, I noticed that one of its values is itself an array, which also has to be handled; I suspect this is the reason for your issue. When this is reflected in your script, how about the following modification?
From:
result.append([key, data_json[key]])
To:
result.append([key, data_json[key] if not isinstance(data_json[key], list) else ",".join(map(str, data_json[key]))])
In this modification, the array value is converted to a string using join.
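As a rough illustration of what the join conversion does (the sample values are made up, not the real API response):
data_json = {"categories": ["dev", "nerdy"], "value": "some joke text"}
result = []
for key in data_json:
    value = data_json[key]
    # flatten list values to one comma-separated string so each cell holds a single value
    result.append([key, ",".join(map(str, value)) if isinstance(value, list) else value])
# result -> [['categories', 'dev,nerdy'], ['value', 'some joke text']]
# a 2-dimensional list that worksheet.update('a1', result) can write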

PythonAnywhere Issues

I am a new Python user. I'm getting the following error when trying to run my code in PythonAnywhere, despite it working fine on my local PC.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zachfeatherstone/imputations.py", line 7, in <module>
html = urlopen(url).read()
File "/usr/local/lib/python3.9/urllib/request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/lib/python3.9/urllib/request.py", line 517, in open
response = self._open(req, data)
File "/usr/local/lib/python3.9/urllib/request.py", line 534, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/local/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/usr/local/lib/python3.9/urllib/request.py", line 1389, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/usr/local/lib/python3.9/urllib/request.py", line 1349, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error Tunnel connection failed: 403 Forbidden>
It's similar to: urllib.request.urlopen: ValueError: unknown url type.
CODE
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import os.path
url = input("Enter the URL you want to analyse: ")
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
wordList = text.split()
#imputations
imputations = []
if "unprofessional" in wordList:
    unprofessional_imputation = "That our Client is unprofessional."
    imputations.append(unprofessional_imputation)
print(imputations)
#print(wordList)
#print(text)
#saving file
save_path = 'C:/Users/team/downloads'
name_of_file = input("What do you want to save the file as? ")
completeName = os.path.join(save_path, name_of_file+".txt")
f = open(completeName, "w")
# traverse paragraphs from soup
for words in imputations:
    f.write(words)
    f.write("\n")
My apologies if this has been answered before. How do I manage to run this in PythonAnywhere so that I can deploy over the web?
It COULD be helpful to send header info along with your request. You can pass a Request object to urlopen like this:
import urllib.request
url = input("Enter the URL you want to analyse: ")
header = {'User-Agent': 'Gandalf'}
req = urllib.request.Request(url, None, header)
html = urllib.request.urlopen(req)
html = html.read()
soup = BeautifulSoup(html, features="html.parser")
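If the 403 persists, it may help to see whether it comes from the target server or from the connection itself; a minimal sketch of that check, assuming the same url and header values as above:
import urllib.error
import urllib.request

url = input("Enter the URL you want to analyse: ")
header = {'User-Agent': 'Gandalf'}
req = urllib.request.Request(url, None, header)
try:
    html = urllib.request.urlopen(req).read()
except urllib.error.HTTPError as e:
    # the server answered with an error status; its body often explains the refusal
    print(e.code, e.read()[:500])
except urllib.error.URLError as e:
    # no HTTP response at all, e.g. the tunnel/proxy refused the connection
    print(e.reason)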

Import JSON in Python error BadStatusLine

I'm trying to import the Json data generated by an Impinj R420 reader.
The code I use is:
# import urllib library
import urllib.request
from urllib.request import urlopen
# import json
import json
# store the URL in url as
# parameter for urlopen
url = "http://10.234.92.19:14150"
# store the response of URL
response = urllib.request.urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response())
# print the json response
print(data_json)
When I execute the program it gives the following error:
Traceback (most recent call last):
File "C:\Users\V\PycharmProjects\Stapelcontrole\main.py", line 13, in <module>
response = urllib.request.urlopen(url)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 519, in open
response = self._open(req, data)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1377, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1352, in do_open
r = h.getresponse()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1374, in getresponse
response.begin()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 300, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: {"epc":"3035307B2831B383E019E8EA","firstSeenTimestamp":"2022-04-11T11:24:23.592434Z","isHeartBeat":false}
Process finished with exit code 1
I know this is an error where the response contains a faulty HTTP status line.
Yet I don't know how to fix the error.
Could you advise me how to fix this?
The {"epc":"3035307B2831B383E019E8EA","firstSeenTimestamp":"2022-04-11T11:24:23.592434Z","isHeartBeat":false} is the answer I expect.
Thanks in advance
Edit:
Even with
with urllib.request.urlopen(url) as f:
    data_json = json.load(f)
I get the same BadStatusLine error.
I can't set up the reader any differently; it can only send a JSON response through the IP address of the device. Is there a way to import the data without the HTTP protocol?
# store the response of URL
response = urllib.request.urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response())
Here you are actually calling response; I do not know what you want to achieve by that. The examples in the urllib.request docs suggest that the object returned by urllib.request.urlopen should be treated like a local file handle, so please replace the above with:
with urllib.request.urlopen(url) as f:
    data_json = json.load(f)
Observe that I used json.load, not json.loads.
EDIT: Following "Is there a way to import the data without the HTTP protocol?", I conclude that a more low-level solution is needed. Hopefully socket will allow you to get what you want; using the echo client example as a starting point, I prepared the following code:
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(("10.234.92.19", 14150))
    s.sendall(b'GET / HTTP/1.1\r\n')
    data = s.recv(1024)
    print(data)
If everything works as intended, you should get the first 1024 bytes of the answer printed. If so, change 1024 to a value that will always be bigger than the number of bytes in the response, and use json.loads to parse the data.
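A slightly fuller sketch along those lines, assuming the device closes the connection (or stops sending) once the answer is complete; the header-stripping step is a guess about what the reader actually transmits:
import json
import socket

chunks = []
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(("10.234.92.19", 14150))
    s.sendall(b'GET / HTTP/1.1\r\nHost: 10.234.92.19\r\n\r\n')
    while True:
        chunk = s.recv(4096)
        if not chunk:  # empty read: the peer closed the connection
            break
        chunks.append(chunk)
        # if the reader streams continuously, break here after the first chunk instead

raw = b"".join(chunks)
# keep only the part after any header block; if the device sends bare JSON, this is a no-op
body = raw.split(b"\r\n\r\n", 1)[-1]
print(json.loads(body))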

No connection adapters were found for '%s'" % url

Sorry to ask this! I'm a newbie, so feel free to teach me anything you know.
I'm making a scraping tool for marketing purposes, to scrape contact information from websites. I'm using Python 3.
This is my code:
import requests, bs4, os, codecs, csv
import pandas as pd
import sys
os.path.join('usr', 'bin', 'spam')
openFile = open('C:\\Users\\hdtra\\Desktop\\Test_1.csv',encoding='utf-8-sig')
read_test = csv.reader(openFile)
for link in read_test :
    res = requests.get(link)
    res.raise_for_status
    facebookSpider = bs4.BeautifulSoup(res.text)
    email = facebookSpider.select("._4-u2._3xaf._3-95._4-u8")
    helloFile = open('C:\\Users\\hdtra\\Desktop\\In processing\\information.txt','w')
    helloFile.write(str(email[3].encode('utf-8')) + '\n')
    helloFile.close()
I have no idea why it gives me something like this:
Traceback (most recent call last):
File "C:\Users\hdtra\Desktop\In processing\Facebook_spider.py", line 12, in <module>
res = requests.get(link)
File "C:\Program Files\Python36\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 612, in send
adapter = self.get_adapter(url=request.url)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 703, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['http://www.facebook.com/D2Streetwear/?ref=br_rs']'
I know that get() only accepts a string, but I have no idea how to convert these links into strings. This is my CSV file:
only one column with 5 rows:
http://www.facebook.com/D2Streetwear/?ref=br_rs
https://www.facebook.com/RealClothes/?ref=br_rs
https://www.facebook.com/Lecamelliaclothing/?ref=br_rs
https://www.facebook.com/TaTclothing-285844471884952/?ref=br_rs
https://www.facebook.com/Dai-Clothing-130675847640538/?ref=br_rs
I tried to put str(link()) but it does not work.
You should understand that csv.reader returns an iterator that iterates over each row to return a list of columns for each one.
csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given
csvfile.[...]
Each row read from the csv file is returned as a list of strings.
Bold emphasis mine. Your CSV appears to contain a single column, so you can access the first column using link[0].
with open('test.csv') as f:
    r = csv.reader(f)
    for row in r:
        r = requests.get(row[0])
        ...
I consider it good practice to always use a with...as context manager when handling file I/O, as it automatically closes your file and results in cleaner code.
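Applied to the script in the question, the loop could look roughly like this (same file path and CSS selector as the question; the selector and what you do with email are left as they were, so treat this as an untested sketch):
import csv
import bs4
import requests

with open('C:\\Users\\hdtra\\Desktop\\Test_1.csv', encoding='utf-8-sig') as openFile:
    for link in csv.reader(openFile):
        res = requests.get(link[0])  # link[0] is the URL string, not the whole row list
        res.raise_for_status()       # note the parentheses: actually call the method
        facebookSpider = bs4.BeautifulSoup(res.text, 'html.parser')
        email = facebookSpider.select("._4-u2._3xaf._3-95._4-u8")
        print(email)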

how to apply "catch-all" exception clause to complex python web-scraping script?

I've got a list of 100 websites in CSV format. All of the sites have the same general format, including a large table with 7 columns. I wrote this script to extract the data from the 7th column of each of the websites and then write this data to file. The script below partially works, however: opening the output file (after running the script) shows that something is being skipped because it only shows 98 writes (clearly the script also registers a number of exceptions). Guidance on how to implement a "catching exception" in this context would be much appreciated. Thank you!
import csv, urllib2, re
def replace(variab): return variab.replace(",", " ")
urls = csv.reader(open('input100.txt', 'rb')) #access list of 100 URLs
for url in urls:
    html = urllib2.urlopen(url[0]).read() #get HTML starting with the first URL
    col7 = re.findall('td7.*?td', html) #use regex to get data from column 7
    string = str(col7) #stringify data
    neat = re.findall('div3.*?div', string) #use regex to get target text
    result = map(replace, neat) #apply function to remove ','s from elements
    string2 = ", ".join(result) #separate list elements with ', ' for export to csv
    output = open('output.csv', 'ab') #open file for writing
    output.write(string2 + '\n') #append output to file and create new line
    output.close()
Return:
Traceback (most recent call last):
File "C:\Python26\supertest3.py", line 6, in <module>
html = urllib2.urlopen(url[0]).read()
File "C:\Python26\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python26\lib\urllib2.py", line 383, in open
response = self._open(req, data)
File "C:\Python26\lib\urllib2.py", line 401, in _open
'_open', req)
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python26\lib\urllib2.py", line 1103, in do_open
r = h.getresponse()
File "C:\Python26\lib\httplib.py", line 950, in getresponse
response.begin()
File "C:\Python26\lib\httplib.py", line 390, in begin
version, status, reason = self._read_status()
File "C:\Python26\lib\httplib.py", line 354, in _read_status
raise BadStatusLine(line)
BadStatusLine
Make the body of your for loop into:
for url in urls:
    try:
        ...the body you have now...
    except Exception, e:
        print >> sys.stderr, "Url %r not processed: error (%s)" % (url, e)
(Or, use logging.error instead of the goofy print>>, if you're already using the logging module of the standard library [and you should;-)]).
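Put together with the original script, the whole loop would look roughly like this (same regexes and file handling as in the question, just wrapped so that one bad URL cannot stop the run):
import csv, urllib2, re, sys

def replace(variab): return variab.replace(",", " ")

urls = csv.reader(open('input100.txt', 'rb'))  # access list of 100 URLs
for url in urls:
    try:
        html = urllib2.urlopen(url[0]).read()       # get HTML for this URL
        col7 = re.findall('td7.*?td', html)         # data from column 7
        neat = re.findall('div3.*?div', str(col7))  # target text
        string2 = ", ".join(map(replace, neat))     # join elements with ', ' for the csv
        output = open('output.csv', 'ab')
        output.write(string2 + '\n')
        output.close()
    except Exception, e:
        print >> sys.stderr, "Url %r not processed: error (%s)" % (url, e)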
I'd recommend reading the Errors and Exceptions Python documentation, especially section 8.3 -- Handling Exceptions.
