Error when I retrieve data from DBpedia - Python

I am trying to retrieve data from DBpedia, but I get an error every time I run the code.
The code in Python is:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject
WHERE { <http://dbpedia.org/resource/Musée_du_Louvre> dcterms:subject ?subject }
""")
# JSON example
print '\n\n*** JSON Example'
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
    print result["subject"]["value"]
I believe that I must use a different character for "é" in "Musée_du_Louvre", but I can't figure out which.
Thanks!

The first problem is that SPARQLWrapper seems to expect its query to be in Unicode, but you're passing it a UTF-8 encoded string - that's why you get a UnicodeDecodeError. Instead you should pass it a unicode object, either by decoding your UTF-8 string
unicode_obj = some_utf8_string.decode('utf-8')
or by using a unicode literal:
unicode_obj = u'Hello World'
Passing it a unicode object avoids that UnicodeDecodeError, but doesn't yield any results. So it looks like the DBpedia API expects URLs containing non-ASCII characters to be percent-encoded. Therefore you need to encode the URL beforehand using urllib.quote_plus:
from urllib import quote_plus
encoded_url = quote_plus(url, safe='/:')
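For example, a quick check (assuming a UTF-8 source or terminal encoding, so the é arrives as UTF-8 bytes; safe='/:' keeps the scheme and slashes unescaped):
>>> from urllib import quote_plus
>>> quote_plus('http://dbpedia.org/resource/Musée_du_Louvre', safe='/:')
'http://dbpedia.org/resource/Mus%C3%A9e_du_Louvre'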
With these two changes your code could look like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from SPARQLWrapper import SPARQLWrapper, JSON
from urllib import quote_plus
url = 'http://dbpedia.org/resource/Musée_du_Louvre'
encoded_url = quote_plus(url, safe='/:')
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
query = u"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject
WHERE { <%s> dcterms:subject ?subject }
""" % encoded_url
sparql.setQuery(query)
# JSON example
print '\n\n*** JSON Example'
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
    print result["subject"]["value"]

Related

How to parse query string parameters in Python?

I am working on a REST API using Python. Say, for a GET request (sample below), I am assuming that anyone who makes a call will URL-encode the URL. What is the correct way to decode and read query parameters in Python?
'https://someurl.com/query_string_params?id=1&type=abc'
import requests
import urllib
def get():
    # parse query string parameters here
Here's an example of how to split a URL and get the query parameters:
import urllib.parse

url = 'https://someurl.com/query_string_params?id=1&type=abc'
url_parts = urllib.parse.urlparse(url)
print(f"{url_parts=}")
query_parts = urllib.parse.parse_qs(url_parts.query)
print(f"{query_parts=}")
Result:
url_parts=ParseResult(scheme='https', netloc='someurl.com', path='/query_string_params', params='', query='id=1&type=abc', fragment='')
query_parts={'id': ['1'], 'type': ['abc']}
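Note that parse_qs also percent-decodes the parameter values for you, so an encoded query string round-trips cleanly. A quick check in the interpreter (with hypothetical values):
>>> import urllib.parse
>>> urllib.parse.parse_qs('id=1&type=a%20b')
{'id': ['1'], 'type': ['a b']}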
The documentation is here: https://docs.python.org/3/library/urllib.parse.html?highlight=url%20decode

How can I properly serialize Wikidata SPARQL query answers?

I have the following example of querying Wikidata via Python's SPARQLWrapper:
import rdflib, urllib
from SPARQLWrapper import SPARQLWrapper, JSON, XML, TURTLE, RDF, N3
from rdflib import Graph, Namespace, URIRef, RDF#, RDFS, Literal

def graph_full(uri, f):
    sparql = SPARQLWrapper('https://query.wikidata.org/sparql')
    sparql.setQuery('''
        PREFIX entity: <http://www.wikidata.org/entity/>
        SELECT ?predicate ?object WHERE {
            <'''+urllib.unquote(uri).encode("utf8")+'''> ?predicate ?object .
        } LIMIT 100
    ''')
    sparql.setReturnFormat(N3)
    results = sparql.query().convert()
    #print results.serialize()
    print type(results)
    g = Graph()
    g.parse(results)
    print g
    #g.serialize(f, format="n3")

if __name__ == '__main__':
    graph_full("entity:Q76", "wikidata/output.nt")
I want to serialize the result of the SPARQL query and save it to a file. This always seems to throw the following error:
Exception: Unexpected type '<type 'instance'>' for source '<xml.dom.minidom.Document instance at 0x7fa11e3715a8>'
Using similar code against DBpedia SPARQL endpoints throws no errors.
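One possible cause, sketched here as an assumption rather than a confirmed answer: a SELECT query returns variable bindings rather than triples, so the endpoint ignores the N3 return format and falls back to SPARQL XML results, which SPARQLWrapper converts to the xml.dom.minidom.Document that Graph.parse then rejects. Asking for actual RDF with a CONSTRUCT query should give bytes that rdflib can parse and serialize:
from SPARQLWrapper import SPARQLWrapper, N3
from rdflib import Graph

sparql = SPARQLWrapper('https://query.wikidata.org/sparql')
sparql.setQuery('''
    CONSTRUCT { <http://www.wikidata.org/entity/Q76> ?predicate ?object }
    WHERE { <http://www.wikidata.org/entity/Q76> ?predicate ?object . } LIMIT 100
''')
sparql.setReturnFormat(N3)
data = sparql.query().convert()  # raw N3/Turtle bytes for CONSTRUCT queries

g = Graph()
g.parse(data=data, format='n3')
g.serialize(destination='wikidata/output.nt', format='nt')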

Python Requests URL with Unicode Parameters

I'm currently trying to hit the Google TTS URL, http://translate.google.com/translate_tts, with Japanese characters and phrases in Python using the requests library.
Here is an example:
http://translate.google.com/translate_tts?tl=ja&q=ひとつ
However, when I try to use the Python requests library to download the mp3 that the endpoint returns, the resulting mp3 is blank. I have verified that I can hit this URL in requests using non-Unicode characters (via romaji) and have gotten correct responses back.
Here is a part of the code I am using to make the request
langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Also, if I print text or url within this snippet, the kana/kanji is rendered correctly in my console.
Edit:
If I attempt to encode the unicode and quote it as such, I still get the same response.
# -*- coding: utf-8 -*-
from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    text = urllib.quote(text.encode('utf-8'))
    url = 'http://translate.google.com/translate_tts?tl=%(glang)s&q=%(text)s' % locals()
    print url
    if download:
        result = requests.get(url)
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Which returns this:
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Which seems like it should work, but doesn't.
Edit 2:
If I attempt to use urllib/urllib2, I get a 403 error.
Edit 3:
So, it seems that this problem/behavior is simply limited to this endpoint. If I try the following URL, a different endpoint:
http://www.kanjidamage.com/kanji/13-un-%E4%B8%8D
From within requests and my browser, I get the same response (they match). The same holds if I send plain ASCII characters to the server, like this URL:
http://translate.google.com/translate_tts?tl=ja&q=sayonara
I get the same response as well (they match again). But if I attempt to send Unicode characters to this URL, I get a correct audio file in my browser, but not from requests, which returns an audio file with no sound.
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
So, it seems like this behavior is limited to the Google TTS URL?
The user agent can be part of the problem; however, it is not in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, wget, and possibly others. That is why you are seeing an HTTP 403 response when using urllib2.urlopen() - it sets the user agent to Python-urllib/2.7 (the version might vary).
You found that setting the user agent to Mozilla/5.0 fixed the problem, but that might work because the API might assume a particular encoding based on the user agent.
What you should actually do is explicitly specify the URL character encoding with the ie field. Your URL request should look like this:
http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec does state that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
The API supports kanji, hiragana, and katakana (possibly others?). These URLs all produce "nihongo", although the audio produced for hiragana input has a slightly different inflection to the others.
import requests

one = u'\u3072\u3068\u3064'
kanji = u'\u65e5\u672c\u8a9e'
hiragana = u'\u306b\u307b\u3093\u3054'
katakana = u'\u30cb\u30db\u30f3\u30b4'
url = 'http://translate.google.com/translate_tts'

for text in one, kanji, hiragana, katakana:
    r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
    print u"{} -> {}".format(text, r.url)
    open(u'/tmp/{}.mp3'.format(text), 'wb').write(r.content)
I made this little method a while back to help me with UTF-8 encoding. I was having issues printing Cyrillic and CJK languages to CSVs and this did the trick.
def assist(unicode_string):
    utf8 = unicode_string.encode('utf-8')
    read = utf8.decode('string_escape')
    return read  # UTF-8 encoded string
Also, make sure you have these two lines at the beginning of your .py.
#!/usr/bin/python
# -*- coding: utf-8 -*-
The first line is just a good Python habit; it specifies which interpreter to use on the .py (really only useful if you have more than one version of Python loaded on your machine). The second line specifies the encoding of the Python file. A slightly longer answer for this is given here.
Setting the User-Agent to Mozilla/5.0 fixes this issue.
from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text}, headers={'User-Agent': 'Mozilla/5.0'})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
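The same header fix applies to the urllib2 attempt from Edit 2 - a minimal sketch, with Mozilla/5.0 standing in for any accepted user agent:
import urllib2

url = 'http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
mp3_data = urllib2.urlopen(req).read()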

sending utf-8 address to urlretrieve in python

While trying to access a file whose name contains UTF-8 characters from the browser, I get the error:
The requested URL /images/0/04/×¤×ª×¨×•× ×•×ª_תרגילי×_על_משטחי×_דיפ'_2014.pdf was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
In order to access the files I wrote the following python script:
# encoding: utf8
__author__ = 'Danis'
__date__ = '20/10/14'
import urllib
curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
urllib.urlretrieve(curr_link, 'home/danisf/targil4.pdf')
but when I run the code I get the error URLError: <curr_link appears here> contains non-ASCII characters.
How can I fix the code to get it to work? (By the way, I don't have access to the server or to the webmaster.) Or maybe the browser failed not because of the bad encoding of the file name?
You cannot just pass Unicode URLs into urllib functions; URLs must be valid bytestrings instead. You'll need to encode to UTF-8, then URL-quote the path of your URL:
import urllib
import urlparse
curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
urllib.urlretrieve(encoded_link, 'home/danisf/targil4.pdf')
The specific URL you provided in your question produces a 404 error, however.
Demo:
>>> import urllib
>>> import urlparse
>>> curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
>>> parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> print parsed_link.geturl()
http://math-wiki.com/images/0/04/2014_%27%D7%93%D7%99%D7%A4_%D7%9E%D7%A9%D7%98%D7%97%D7%99%D7%9D_%D7%A2%D7%9C_%D7%A4%D7%AA%D7%A8%D7%95%D7%A0%D7%95%D7%AA.nn%20uft8pdf
Your browser usually decodes UTF-8 bytes encoded like this, to present a readable URL, but when sending the URL to the server to retrieve, it is encoded in the exact same manner.
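For reference, a sketch of the same approach on Python 3, where urllib.parse and urllib.request replace the Python 2 modules used above and quote() accepts Unicode directly (encoding it as UTF-8 by default):
from urllib.parse import urlsplit, quote
from urllib.request import urlretrieve

curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
parsed_link = urlsplit(curr_link)
encoded_link = parsed_link._replace(path=quote(parsed_link.path)).geturl()
urlretrieve(encoded_link, 'home/danisf/targil4.pdf')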

ValueError: need more than 1 value to unpack, PoolManager request

The following code in utils.py
manager = PoolManager()
data = json.dumps(dict) #takes in a python dictionary of json
manager.request("POST", "https://myurlthattakesjson", data)
Gives me ValueError: need more than 1 value to unpack when the server is run. Does this most likely mean that the JSON is incorrect, or is it something else?
Your JSON data needs to be URL-encoded for it to be POST (or GET) safe.
# import the parser
import urllib.parse

manager = PoolManager()
# stringify your data
data = json.dumps(dict)  # takes in a Python dictionary of JSON
# percent-encode your data string (urllib.parse.urlencode expects a dict of
# key/value pairs, so use quote for an already-serialized string)
encdata = urllib.parse.quote(data)
manager.request("POST", "https://myurlthattakesjson", encdata)
I believe that in Python 3 they made some changes so that the data needs to be binary. See unable to Post data to a login form using urllib python v3.2.1.
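That said, a sketch of a more direct fix, assuming the ValueError comes from passing the JSON string as urllib3's positional fields argument: send it as the request body instead, with a JSON Content-Type (the URL and payload are just the examples from above):
import json
from urllib3 import PoolManager

manager = PoolManager()
payload = {'key': 'value'}  # hypothetical dictionary
resp = manager.request(
    'POST',
    'https://myurlthattakesjson',
    body=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)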
