I want to get data from the NCBI website using Python 3. When I use

import urllib.request

fp = urllib.request.urlopen("https://www.ncbi.nlm.nih.gov/gene/?term=50964")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)  # executes without any error
but when I pass the id as a variable in the url, it throws an error.
id_pool = [50964, 4552, 7845, 987, 796]
for id in id_pool:
    id = str(id)
    url = f'"https://www.ncbi.nlm.nih.gov/gene/?term={id}"'
    print(url)  # "https://www.ncbi.nlm.nih.gov/gene/?term=50964" -- same as above
    fp = urllib.request.urlopen(url)
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
    print(mystr)  # shows the following error
    break
" raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: "https>"
There was a typo in the URL, and I've corrected it for you:

url = f'https://www.ncbi.nlm.nih.gov/gene/?term={id}'

It looks like you enclosed it in quotes twice. With this change I was able to retrieve the output.
You have redundant double quotes (") around the URL in

f'"https://www.ncbi.nlm.nih.gov/gene/?term={id}"'

so the string became "https://www.ncbi.nlm.nih.gov/gene/?term={id}" instead of https://www.ncbi.nlm.nih.gov/gene/?term={id}. Remove them and the code will work fine.
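For completeness, a minimal sketch of the corrected loop (renaming id to gene_id is my own choice, to avoid shadowing the id() builtin):

import urllib.request

id_pool = [50964, 4552, 7845, 987, 796]
for gene_id in id_pool:
    # plain f-string, no extra quotes around the URL
    url = f'https://www.ncbi.nlm.nih.gov/gene/?term={gene_id}'
    with urllib.request.urlopen(url) as fp:
        mystr = fp.read().decode("utf8")
    print(mystr)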
I know this has been asked in many forms already, but I can't seem to find my answer and hope to receive some help here.
I'm trying to download files that are stored behind a list of URLs. I've found the following code that should do what I want:
import os.path
import urllib.request

# links: the list of URLs to download
for link in links:
    link = link.strip()
    name = link.rsplit('/', 1)[-1]
    filename = os.path.join('downloads', name)
    if not os.path.isfile(filename):
        print('Downloading: ' + filename)
        try:
            urllib.request.urlretrieve(link, filename)
        except Exception as inst:
            print(inst)
            print('  Encountered unknown error. Continuing.')
I always receive: HTTP Error 400: Bad Request.
I tried setting a user agent to fake a browser visit (I use Google Chrome), but it did not help at all. The links work when copied into the browser, so I wonder how to solve this.
Spaces have to be quoted. I've used the quote function to quote the filename part of your link, and rindex to cut off the last part of the URL path. There are urlsplit and urlunsplit functions that should be used instead of string operations (see the sketch after the code), but... I'm too lazy :D
import os.path
import urllib.request
from urllib.parse import quote

links = ['https://undpgefpims.org/attachments/6222/216410/1717887/1724973/6222_4NC_3BUR_Macedonia_Final ProDoc 30 July 2018.doc',
         'https://undpgefpims.org/attachments/6214/216405/1719672/1729436/6214_4NC_Niger_ProDoc final for DoA.doc']

for link in links:
    link = link.strip()
    name = link.rsplit('/', 1)[-1]
    filename = os.path.join('downloads', name)
    if not os.path.isfile(filename):
        print('Downloading: ' + filename)
        try:
            urllib.request.urlretrieve(link[:link.rindex('/') + 1] + quote(link[link.rindex('/') + 1:]), filename)
        except Exception as inst:
            print(inst)
            print('  Encountered unknown error. Continuing.')
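For reference, a minimal sketch of the urlsplit/urlunsplit approach mentioned above (my own illustration, not part of the original answer):

from urllib.parse import quote, urlsplit, urlunsplit

def quote_path(url):
    # split the URL into components, percent-encode only the path,
    # and reassemble it
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, quote(parts.path),
                       parts.query, parts.fragment))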
I found the answer to my own question.
The problem was that the URLs contained spaces, which urllib.request apparently cannot handle. The solution is to first percent-encode the URLs with urllib.parse.quote and then request the quoted URL.
Here is the working code for anyone who runs into the same problem:
import os.path
import urllib.request
import urllib.parse

# urls: the list of URLs to download
for link in urls:
    link = link.strip()
    name = link.rsplit('/', 1)[-1]
    filename = os.path.join(name)
    quoted_url = urllib.parse.quote(link, safe=":/")
    if not os.path.isfile(filename):
        print('Downloading: ' + filename)
        try:
            urllib.request.urlretrieve(quoted_url, filename)
        except Exception as inst:
            print(inst)
            print('  Encountered unknown error. Continuing.')
I am trying to download a raw file from GitHub and then run it as a .sql file.
import snowflake.connector
from codecs import open
import logging
import requests
from os import getcwd
import os
import sys

# logging
logging.basicConfig(
    filename='C:/Users/abc/Documents/Test.log',
    level=logging.INFO
)

url = "https://github.com/raw/abc/master/file_name?token=Anvn3lJXDks5ciVaPwA%3D%3D"
directory = getcwd()
filename = os.path.join(getcwd(), 'VIEWS.SQL')
r = requests.get(url)
filename.decode("utf-8")
f = open(filename, 'w')
f.write(str(r.content))

with open(filename, 'r') as theFile, open(filename, 'w') as outFile:
    data = theFile.read().split('\n')
    data = theFile.read().replace('\n', '')
    data = theFile.read().replace("b'", "")
    data = theFile.read()
    outFile.write(data)
However, I get this error:
syntax error line 1 at position 0 unexpected 'b'
My converted SQL file has a b at the beginning and a bunch of newline \n characters in the file. Also, the entire output file is wrapped in single quotes ('text'). Can anyone help me get rid of these? It looks like replace isn't working.
OS: Windows
Python Version: 3.7.0
You introduced a b'.. prefix by converting the response.content bytes value to a string with str():
>>> import requests
>>> r = requests.get("https://github.com/raw/abc/master/file_name?token=Anvn3lJXDks5ciVaPwA%3D%3D")
>>> r.content
b'Not Found'
>>> str(r.content)
"b'Not Found'"
Of course, the specific dummy URL you gave in your question produces a 404 Not Found response, hence the Not Found content of the response body:
>>> r.status_code
404
so the contents in this demonstration are not actually all that useful. However, even for your real URL you probably want to test for a 200 status code before moving to write the data to a file!
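For instance, a sketch using requests' built-in check, which raises instead of silently continuing:

r = requests.get(url)
r.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses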
What is going wrong in the above is that str(bytesvalue) converts a bytes object to its representation. You'd normally want to decode a bytes value with a text codec, using the bytes.decode() method. But because you are writing the data to a file here, you should instead just open the file in binary mode and write the bytes object without decoding:
r = requests.get(url)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(r.content)
The 'wb' mode opens the file for writing in binary mode. Writing binary content to a binary file is the most efficient; decoding it first then writing to a text file requires that it is encoded again. Better to avoid doing double work.
As a side note: there is no need to join a local filename with getcwd(); relative paths always end up in the current working directory, and otherwise it's better to use os.path.abspath(filename).
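A quick illustration of that side note (VIEWS.SQL is the name from the question):

import os.path

filename = 'VIEWS.SQL'
# a bare relative name already resolves against the current working directory
print(os.path.abspath(filename))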
You could also trust that GitHub sets the correct character set in the Content-Type headers and have requests decode the value to str for you in the form of the response.text attribute:
r = requests.get(url)
if r.status_code == 200:
    with open(filename, 'w') as f:
        f.write(r.text)
but again, that's really doing extra work for nothing, first decoding the binary content from the request, then encoding again when writing to a text file.
Finally, for larger file responses it is better to stream the data and copy it directly to a file. The shutil.copyfileobj() function can take a raw response fileobject directly, provided you enable transparent transport decompression:
import shutil

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        # enable transparent transport decompression handling
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
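An equivalent sketch using requests' own streaming API, which avoids touching the raw urllib3 file object (my own variant, not from the original answer):

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        # iter_content handles transport decompression automatically
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)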
Depending on your version of Python/OS, it could be as simple as changing the file to read/write in binary (and, if the stray characters are still there, moving the replaces so they operate on bytes):

# write the downloaded bytes, then re-read and clean them in binary mode
with open(filename, 'wb') as outFile:
    outFile.write(r.content)
with open(filename, 'rb') as theFile:
    data = theFile.read()
data = data.replace(b'\n', b'')
data = data.replace(b"b'", b"")
with open(filename, 'wb') as outFile:
    outFile.write(data)
It would help to have a copy of the file and the line the error is occurring on.
Invalid argument error while reading an external JSON file's values in Python
I tried:
import json

with open('https://www.w3schools.com/js/json_demo.txt') as json_file:
    data = json.load(json_file)
    #for p in data['people']:
    print('Name: ' + data['name'])
This gave me the error:

with open('https://www.w3schools.com/js/json_demo.txt') as json_file:
OSError: [Errno 22] Invalid argument: 'https://www.w3schools.com/js/json_demo.txt'
As open is for opening local files, not URLs (as commented by jonrsharpe), go with urllib as commented by fl00r, though the link he provided was for Python 2.
Try this (Python 3):

import json
from urllib.request import urlopen

with urlopen('https://www.w3schools.com/js/json_demo.txt') as json_file:
    data = json.load(json_file)
    #for p in data['people']:
    print('Name: ' + data['name'])

Output:
Name: John
Use requests:

import requests

response = requests.get('https://www.w3schools.com/js/json_demo.txt')
response.encoding = "utf-8-sig"
data = response.json()
print(data['name'])

Output:
John
I want to save these email results to my results.txt file in the directory.
def parseAddress():
    try:
        website = urllib2.urlopen(getAddress())
        html = website.read()
        addys = re.findall('''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''', html, flags=re.IGNORECASE)
        print addys
    except urllib2.HTTPError, err:
        print "Cannot retrieve URL: HTTP Error Code: ", err.code
    except urllib2.URLError, err:
        print "Cannot retrive URL: " + err.reason[1]
 # need to write the addys data to results.txt
 with open('results.txt', 'w') as f:
     result_line = f.writelines(addys)
Use return addys at the end of your function. print will only output to your screen.
In order to retrieve addys, you would need to call the function in your with statement or create a variable that contains the result of parseAddress().
You can save the memory that a variable would use by simply calling the function, like so:
with open('results.txt', 'w') as f:
    f.writelines(parseAddress())
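Putting both suggestions together, a sketch of the modified function (EMAIL_RE stands in for the long regex from the question):

def parseAddress():
    try:
        website = urllib2.urlopen(getAddress())
        html = website.read()
        return re.findall(EMAIL_RE, html, flags=re.IGNORECASE)
    except urllib2.HTTPError, err:
        print "Cannot retrieve URL: HTTP Error Code: ", err.code
        return []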
You mistakenly indented the with statement one space. This makes it subordinate to an earlier block; I would think any self-respecting Python interpreter would flag it as not matching any earlier indentation, but it seems to be fouling your output.
Also, please consider adding some tracing print statements to see where your code did execute. That output alone can often show you the problem, or lead us to it. You should always provide actual output for us, rather than just a general description.
You need to fix your indentation, which is important in Python as it is the only way to define a block of code.
You also have too many statements in your try block.
def parseAddress():
    website = None
    try:
        website = urllib2.urlopen(getAddress())
    except urllib2.HTTPError, err:
        print "Cannot retrieve URL: HTTP Error Code: ", err.code
    except urllib2.URLError, err:
        print "Cannot retrive URL: " + err.reason[1]

    if website is not None:
        html = website.read()
        addys = re.findall('''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''', html, flags=re.IGNORECASE)
        print addys

        # need to write the addys data to results.txt
        with open('results.txt', 'w') as f:
            result_line = f.writelines(addys)
I am getting an error when trying to open a URL I obtained by reading data from a .txt file in Python using match.group(). Below is the code where the error comes up. Any help as to how this can be corrected would be very much appreciated.
with open('output.txt') as f:
    for line in f:
        match = re.search("(?P<url>https?://docs.google.com/file[^\s]+)", line)
        if match is not None:
            urltest = match.group()
            print urltest
            print "[*] Opening Map in the web browser..."
            kml_url = "urltest"
            try:
                webbrowser.get().open_new_tab(kml_url)
Since you have not provided what you are trying to parse, I can only guess, but this should pretty much work for your URL:
>>> import re
>>> match = re.search('(?P<url>https:\/\/docs.google.com\/file[a-zA-Z0-9-]*)', 'https://docs.google.com/fileCharWithnumbers123')
>>> match.group("url")
'https://docs.google.com/fileCharWithnumbers123'
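Applied to your loop, the key fix is passing the matched variable rather than the string literal "urltest" (a sketch, assuming the same re and webbrowser imports as your script):

import re
import webbrowser

with open('output.txt') as f:
    for line in f:
        match = re.search(r"(?P<url>https?://docs\.google\.com/file\S+)", line)
        if match is not None:
            kml_url = match.group("url")  # the variable, not the literal string "urltest"
            print "[*] Opening Map in the web browser..."
            webbrowser.get().open_new_tab(kml_url)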