How to unpack DMOZ URLs from an RDF dump with Python and rdflib?


I tried to open an RDF file (the DMOZ RDF dump), but I get this error message:
Traceback (most recent call last):
File "/media/_dev_/ODP_RDF_get_links.py", line 4, in <module>
result = g.parse("data/content.rdf")
File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 1033, in parse
parser.parse(source, self, **args)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 577, in parse
self._parser.parse(source)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 352, in end_element_ns
self._cont_handler.endElementNS(pair, None)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
self.current.end(name, qname)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 331, in node_element_end
self.error("Repeat node-elements inside property elements: %s"%"".join(name))
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 185, in error
raise ParserError(info + message)
file:///media/_dev_/data/content.rdf:5:12: Repeat node-elements inside property elements: http://dmoz.org/rdf/catid
My simple code is as follows:
import rdflib
g = rdflib.Graph()
result = g.parse("data/content.rdf")
print("graph has %s statements." % len(g))
I need to be able to read the file and extract all links in the World category.
Thanks for any help.
EDIT:
PS: I found the Wikipedia section on the DMOZ RDF dumps; it seems that a custom script is necessary to use this dump.
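One possible shape for such a custom script is a streaming XML pass that bypasses rdflib entirely. The following is a minimal sketch, not a tested solution for the real dump. Only the http://dmoz.org/rdf/ namespace and the catid element appear in the error message above; the Topic and link element names, the r:id and r:resource attributes, the RDF namespace URI, and the sample data are assumptions about the dump layout, and iter_links is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

DMOZ_NS = "http://dmoz.org/rdf/"      # namespace taken from the error message
RDF_NS = "http://www.w3.org/TR/RDF/"  # ASSUMPTION: namespace bound to the r: prefix

TOPIC = "{%s}Topic" % DMOZ_NS         # ASSUMPTION: category element name
LINK = "{%s}link" % DMOZ_NS           # ASSUMPTION: link element name
R_ID = "{%s}id" % RDF_NS
R_RESOURCE = "{%s}resource" % RDF_NS


def iter_links(source, prefix="Top/World"):
    """Yield (topic_id, url) for links in topics whose id starts with prefix."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != TOPIC:
            continue
        topic_id = elem.get(R_ID, "")
        if topic_id.startswith(prefix):
            for child in elem:
                url = child.get(R_RESOURCE)
                if child.tag == LINK and url:
                    yield topic_id, url
        # Drop the children of the finished topic to limit memory use.
        elem.clear()


# Tiny made-up sample that mimics the assumed layout of content.rdf.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/World/Deutsch">
    <catid>1</catid>
    <link r:resource="http://example.org/a"/>
  </Topic>
  <Topic r:id="Top/Arts">
    <catid>2</catid>
    <link r:resource="http://example.org/b"/>
  </Topic>
</RDF>"""

links = list(iter_links(io.BytesIO(SAMPLE)))
print(links)
```

For the real file, iter_links("data/content.rdf") would replace the in-memory sample, provided the dump is well-formed XML and matches the assumed layout.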

Related

error while generating two log.html files in robot framework by Rebot model

I am trying to generate an error log HTML file with the “rebot” package of Robot Framework, and it is generated successfully.
But if I use the rebot function in my module, it affects the default log and report HTML files that are generated after script execution.
[ ERROR ] Unexpected error: AttributeError: 'NoneType' object has no attribute 'encode'
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/robot/utils/application.py", line 83, in _execute
rc = self.main(arguments, **options)
File "/usr/local/lib/python3.5/dist-packages/robot/run.py", line 445, in main
result = suite.run(settings)
File "/usr/local/lib/python3.5/dist-packages/robot/running/model.py", line 248, in run
self.visit(runner)
File "/usr/local/lib/python3.5/dist-packages/robot/model/testsuite.py", line 161, in visit
visitor.visit_suite(self)
File "/usr/local/lib/python3.5/dist-packages/robot/model/visitor.py", line 87, in visit_suite
suite.tests.visit(self)
File "/usr/local/lib/python3.5/dist-packages/robot/model/itemlist.py", line 76, in visit
item.visit(visitor)
File "/usr/local/lib/python3.5/dist-packages/robot/model/testcase.py", line 74, in visit
visitor.visit_test(self)
File "/usr/local/lib/python3.5/dist-packages/robot/running/runner.py", line 159, in visit_test
self._output.end_test(ModelCombiner(test, result))
File "/usr/local/lib/python3.5/dist-packages/robot/output/output.py", line 59, in end_test
LOGGER.end_test(test)
File "/usr/local/lib/python3.5/dist-packages/robot/output/logger.py", line 183, in end_test
logger.end_test(test)
File "/usr/local/lib/python3.5/dist-packages/robot/output/console/verbose.py", line 51, in end_test
self._writer.status(test.status, clear=True)
File "/usr/local/lib/python3.5/dist-packages/robot/output/console/verbose.py", line 114, in status
self._clear_status()
File "/usr/local/lib/python3.5/dist-packages/robot/output/console/verbose.py", line 124, in _clear_status
self._write_info()
File "/usr/local/lib/python3.5/dist-packages/robot/output/console/verbose.py", line 90, in _write_info
self._stdout.write(self._last_info)
File "/usr/local/lib/python3.5/dist-packages/robot/output/console/highlighting.py", line 51, in write
self._write(console_encode(text, stream=self.stream))
File "/usr/local/lib/python3.5/dist-packages/robot/utils/encoding.py", line 60, in console_encode
return string.encode(encoding, errors).decode(encoding)

The error message is rather clear. It looks like you're trying to encode a variable which contains a None instead of a string. You need to make sure that this variable always contains a string and handle cases where something else is inside. You can do it for example using the try ... except statement.
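As a generic illustration of that advice (not a patch for Robot Framework itself), a guard like the following treats None as an empty string before encoding. The function name encode_text and its defaults are hypothetical.

```python
def encode_text(text, encoding="utf-8", errors="replace"):
    """Encode text to bytes, treating None as an empty string."""
    if text is None:
        # Calling None.encode() is what raises the AttributeError.
        text = ""
    return text.encode(encoding, errors)


print(encode_text(None))
print(encode_text("log.html"))
```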

How to debug vague error lxml.etree.SerialisationError: unknown error -2029930774 in Python

I am using some legacy Python 2 code that has to work with Python 3. So far so good; most things work as they should. However, I get a very vague error from a library called lxml.
As I understand it, this is a library that binds to a binary written in C.
The problem comes from this piece of code:
with etree.xmlfile(self.temp_file, encoding='utf-8') as xf:
    with xf.element('{http://www.opengis.net/citygml/2.0}CityModel', nsmap=nsmap):
        with open(input_gml, mode='rb') as f:
            context = etree.iterparse(f)
            for action, elem in context:
                if action == 'end' and elem.tag == '{http://www.opengis.net/citygml/2.0}cityObjectMember':
                    # Duplicate feature and subfeatures
                    self.duplicateFeature(xf, elem)
                    # Clean up the original element and the node of its previous sibling
                    # (https://www.ibm.com/developerworks/xml/library/x-hiperfparse/)
                    elem.clear()
                    while elem.getprevious() is not None:
                        del elem.getparent()[0]
            del context
        xf.flush()
It processes this XML file and gets the following error:
Traceback (most recent call last):
File "/usr/local/bin/stetl", line 4, in <module>
__import__('pkg_resources').run_script('Stetl==2.0', 'stetl')
File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1446, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/EGG-INFO/scripts/stetl", line 43, in <module>
main()
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/EGG-INFO/scripts/stetl", line 36, in main
etl.run()
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/etl.py", line 157, in run
chain.run()
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/chain.py", line 172, in run
packet = self.first_comp.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 213, in process
packet = self.next.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 213, in process
packet = self.next.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 213, in process
packet = self.next.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 199, in process
packet = self.invoke(packet)
File "/app/bgt/etl/stetlbgt/subfeaturehandler.py", line 144, in invoke
del context
File "src/lxml/serializer.pxi", line 925, in lxml.etree.xmlfile.__exit__
File "src/lxml/serializer.pxi", line 1263, in lxml.etree._IncrementalFileWriter._close
File "src/lxml/serializer.pxi", line 1269, in lxml.etree._IncrementalFileWriter._handle_error
File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: unknown error -2029930774
I'm not sure what's going wrong here. It seems that the problem is caused by a strangely encoded character.
How can I debug this?

loop through and load a zipped folder of yaml files

I have a zipped folder containing 15 000 YAML files. I'd like to iterate through the folder using yaml.safe_load so that each file is loaded as a dictionary and I can extract the information I need from each file. I've written some code so far using zipfile.ZipFile and yaml.safe_load, but it only works for the first file in the zipped folder. Would anyone mind taking a look and explaining what I'm misunderstanding?
import zipfile
import yaml

zip_file = zipfile.ZipFile("D:/export.zip")
files = zip_file.namelist()
print(files)
for i in range(10):
    with zip_file.open(files[i]) as yamlfile:
        yamlreader = yaml.safe_load(yamlfile)
        print(yamlreader["identifier"])
For now I'm just iterating through 10 files to make life easier. Eventually I'd like to do all 15 000. "identifier" is a key in each YAML file.
This is the error:
10.5281/zenodo.1014773
Traceback (most recent call last):
File "C:/Users/estho/PycharmProjects/GSOC3/testing_dataextraction.py", line 20, in <module>
yamlreader = yaml.safe_load(yamlfile)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\__init__.py", line 162, in safe_load
return load(stream, SafeLoader)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\__init__.py", line 114, in load
return loader.get_single_data()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\constructor.py", line 41, in get_single_data
node = self.get_single_node()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 36, in get_single_node
document = self.compose_document()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\parser.py", line 98, in check_event
self.current_event = self.state()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\scanner.py", line 260, in fetch_more_tokens
self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
in "yamlfile_10_5281_zenodo_1745362.yaml", line 4, column 1
Thank you.

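The loop itself works: the identifier of the first file is printed before the traceback. The error comes from the file "yamlfile_10_5281_zenodo_1745362.yaml", which contains a tab character at line 4, column 1. YAML does not allow tabs for indentation, so the scanner rejects the file. Try running the loop without this file, or catch the error and skip files that fail to parse.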
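A minimal sketch of the skip-on-error approach, assuming PyYAML is installed. The helper name load_yaml_members and the in-memory sample archive are made up for illustration; with the real data, the path "D:/export.zip" would be passed instead.

```python
import io
import zipfile

import yaml


def load_yaml_members(archive_source):
    """Return (loaded, failed): parsed documents and error messages by member name."""
    loaded, failed = {}, {}
    with zipfile.ZipFile(archive_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as member:
                try:
                    loaded[name] = yaml.safe_load(member)
                except yaml.YAMLError as exc:
                    # Covers scanner errors such as a tab used for indentation.
                    failed[name] = str(exc)
    return loaded, failed


# Build a small archive in memory: one valid file, one with a leading tab.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("good.yaml", 'identifier: "10.5281/zenodo.1014773"\n')
    archive.writestr("bad.yaml", "identifier: x\n\tcreator: y\n")
buffer.seek(0)

loaded, failed = load_yaml_members(buffer)
print(loaded)
print(sorted(failed))
```

Collecting the failures instead of stopping at the first one makes it possible to list every malformed file in the archive in a single run.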

What is the 'rdt' object in the error: 'KeyError: "Unable to open object (Object 'rdt' doesn't exist)"?

I am trying to execute the following code:
import spacepy.time as spt
import spacepy.omni as om
ticks = spt.Ticktock(['2002-02-02T12:00:00', '2002-02-02T12:10:00'], 'ISO')
d = om.get_omni(ticks)
d.tree(levels=1)
This is the example from the SpacePy documentation.
I got the error:
Traceback (most recent call last):
File "<ipython-input-28-bd1a52c0010b>", line 1, in <module>
data = om.get_omni(ticks)
File "/usr/local/lib/python2.7/dist-packages/spacepy-0.1.6-py2.7.egg/spacepy/omni.py", line 252, in get_omni
enval, stval = omnirange(dbase=ldb)[1], omnirange(dbase=ldb)[0]
File "/usr/local/lib/python2.7/dist-packages/spacepy-0.1.6-py2.7.egg/spacepy/omni.py", line 377, in omnirange
start, end = hfile['RDT'][0], hfile['RDT'][-1]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "~/.local/lib/python2.7/site-packages/h5py/_hl/group.py", line 166, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/tmp/pip-4rPeHA-build/h5py/h5o.c:3570)
KeyError: "Unable to open object (Object 'rdt' doesn't exist)"
I don't know how to fix this.
The same problem occurs when executing other SpacePy code.

If you run SpacePy for the first time, a special dataset of OMNI data (more details on it here) needs to be downloaded. To obtain it, simply execute:
import spacepy
spacepy.toolbox.update()
For this function to work properly, make sure that all dependencies listed in the Installation Guideline are met; in particular, the NASA CDF library is needed.

How to fix or make an exception for this error

I'm writing code that gets image URLs from any web page. The code is in Python and uses BeautifulSoup and httplib2.
When I run the code, I get the following error:
Look me http://movies.nytimes.com (this line is printed by the code)
Traceback (most recent call last):
File "main.py", line 103, in <module>
visit(initialList,profundidad)
File "main.py", line 98, in visit
visit(dodo[indice], bottom -1)
File "main.py", line 94, in visit
getImages(w)
File "main.py", line 34, in getImages
iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 942, column 118
Can someone explain how to fix this error or catch it as an exception?

Are you using the latest version of BeautifulSoup?
This seems to be a known issue with version 3.1.x, which started using a new parser (HTMLParser instead of SGMLParser) that is much worse at processing malformed HTML. You can find more information about this on the BeautifulSoup website.
As a quick solution, you can simply use an older version (3.0.7a).

To catch that error specifically, change your code to look like this:
from HTMLParser import HTMLParseError  # Python 2 module, as in the traceback

try:
    iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
except HTMLParseError:
    # Do something intelligent here, e.g. log the URL and skip the page
    iSoupList = []
Here's some more reading on Python's try/except blocks:
http://docs.python.org/tutorial/errors.html

I got that error when I had the string =& in my HTML document. When I replaced that string (in my case with =and) then I no longer received that parsing error.
