I'm trying to parse several big graphs with RDFLib 3.0. Apparently it handles the first one and dies on the second with a MemoryError... It also looks like MySQL is no longer supported as a store. Can you suggest a way to parse these files?
Traceback (most recent call last):
File "names.py", line 152, in <module>
main()
File "names.py", line 91, in main
locals()[graphname].parse(filename, format="nt")
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 938, in parse
location=location, file=file, data=data, **args)
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 757, in parse
parser.parse(source, self, **args)
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/nt.py", line 24, in parse
parser.parse(f)
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 124, in parse
self.line = self.readline()
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 151, in readline
m = r_line.match(self.buffer)
MemoryError
How many triples are in those RDF files? I have tested rdflib and it won't scale much beyond a few tens of thousands of triples, if you are lucky. There is no way it really performs well for files with millions of triples.
The best parser out there is rapper from the Redland libraries. My first piece of advice is to not use RDF/XML and go for ntriples, which is a lighter format than RDF/XML. You can transform RDF/XML into ntriples using rapper:
rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
If you like Python you can use the Redland python bindings:
import RDF

# Parse the ntriples file into an in-memory Redland model
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
stream = parser.parse_into_model(model, "file://file_path",
                                 "http://your_base_uri.org")

# Iterate over all parsed triples
for triple in model:
    print triple.subject, triple.predicate, triple.object
I have parsed fairly big files (a couple of gigabytes) with the Redland libraries with no problem.
Eventually, if you are handling big datasets, you might need to assert your data into a scalable triple store; the one I normally use is 4store, which internally uses Redland to parse RDF files. In the long term, I think a scalable triple store is what you'll have to go for, and with it you'll be able to use SPARQL to query your data and SPARQL/Update to insert and delete triples.
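Once the data is in a store like 4store, querying it from Python could look roughly like the sketch below. This uses the SPARQLWrapper package (my suggestion, not something mentioned above), and the endpoint URL is an assumption that depends on how you start 4store's HTTP server:

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed endpoint: wherever your 4store HTTP server (4s-httpd) is listening
sparql = SPARQLWrapper("http://localhost:8080/sparql/")
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
sparql.setReturnFormat(JSON)

# Run the query and walk the JSON result bindings
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print row["s"]["value"], row["p"]["value"], row["o"]["value"]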
I am trying to read some Canadian census data from Statistics Canada (the XML option for the "Canada, provinces and territories" geographic level). I see that the XML file is in the SDMX format and that there is a structure file provided, but I cannot figure out how to read the data from the XML file.
It seems there are 2 options in Python, pandasdmx and sdmx1, both of which say they can read local files. When I try
import sdmx
datafile = '~/Documents/Python/Generic_98-401-X2016059.xml'
canada = sdmx.read_sdmx(datafile)
It appears to read the first 903 lines and then produces the following:
Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 238, in read_message
raise NotImplementedError(element.tag, event) from None
NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/__init__.py", line 126, in read_sdmx
return reader().read_message(obj, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 259, in read_message
raise XMLParseError from exc
sdmx.exceptions.XMLParseError: NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
Is this happening because I've not loaded the structure of the sdmx file (Structure_98-401-X2016059.xml in the zip file from the StatsCan link above)? If so, how do I go about loading that and telling sdmx to use that when reading datafile?
The documentation for sdmx and pandasdmx only shows examples of loading files from online providers, not from local files, so I'm stuck. I have limited familiarity with Python, so any help is much appreciated.
For reference, I can read the file in R using the instructions from the rsdmx github. I would like to be able to do the same/similar in Python.
Thanks in advance.
From a cursory inspection of the documentation, it seems that Statistics Canada is not one of the sources that is included by default. There is however an sdmx.add_source function. I suggest you try that (before loading the data).
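A rough, untested sketch of what that might look like is below; the id, name, and url values are placeholders, not a real Statistics Canada endpoint configuration:

import sdmx

# Register Statistics Canada as a custom source before reading.
# All of these field values are placeholders / assumptions.
sdmx.add_source({
    "id": "STC",
    "name": "Statistics Canada",
    "url": "https://www12.statcan.gc.ca",
})

canada = sdmx.read_sdmx('~/Documents/Python/Generic_98-401-X2016059.xml')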
As per the sdmx1 developer, StatsCan is using an older, unsupported version of SDMX (v2.0). The current version is 2.1, and sdmx1 only supports that (work is also going towards the upcoming v3.0).
I'm writing a task with mrjob to compute various statistics using the Google Ngrams data: https://aws.amazon.com/datasets/8172056142375670
I developed & tested my script locally using an uncompressed subset of the data in tab-delimited text. Once I tried to run the job, I got this error:
Traceback (most recent call last):
File "ngram_counts.py", line 74, in <module>
MRNGramCounts.run()
File "/usr/lib/python2.6/dist-packages/mrjob/job.py", line 500, in run
mr_job.execute()
File "/usr/lib/python2.6/dist-packages/mrjob/job.py", line 509, in execute
self.run_mapper(self.options.step_num)
File "/usr/lib/python2.6/dist-packages/mrjob/job.py", line 574, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "ngram_counts.py", line 51, in mapper
(ngram, year, _mc, _pc, _vc) = line.split('\t')
ValueError: need more than 2 values to unpack
(while reading from s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-1M/5gram/data)
Presumably this is because of the public data set's compression scheme (from the URL link above):
We store the datasets in a single object in Amazon S3. The file is in
sequence file format with block level LZO compression. The sequence
file key is the row number of the dataset stored as a LongWritable and
the value is the raw data stored as TextWritable.
Any guidance on how to set up a workflow that can process these files? I've searched exhaustively for tips but haven't turned up anything useful...
(I'm a relative n00b to mrjob and Hadoop.)
I finally figured this out. It looks like EMR takes care of the LZO compression for you, but for the sequence file format, you need to add the following HADOOP_INPUT_FORMAT field to your MRJob class:
from mrjob.job import MRJob

class MyMRJob(MRJob):

    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'

    def mapper(self, _, line):
        # mapper code...
        pass

    def reducer(self, key, value):
        # reducer code...
        pass
There's another gotcha, too (quoting from the AWS-hosted Google NGrams page):
The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable.
That means each row is prepended with an extra Long plus a tab, so any line parsing you do in your mapper method needs to account for the prepended info as well.
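A rough sketch of a mapper that accounts for that prepended key (assuming the five tab-separated columns of the 1M 5-gram data; the yield line is just illustrative, not tested code):

def mapper(self, _, line):
    # SequenceFileAsTextInputFormat emits "<row number>\t<original row>",
    # so drop the first tab-separated field (the sequence file key) before
    # unpacking the five ngram columns.
    fields = line.split('\t')
    ngram, year, match_count, page_count, volume_count = fields[1:6]
    yield (ngram, int(year)), int(match_count)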
I am trying to load a .net file using the python-igraph library. Here is the sample code:
import igraph
g = igraph.read("s.net",format="pajek")
But when I tried to run this script I got the following errors:
Traceback (most recent call last):
File "demo.py", line 2, in <module>
g = igraph.read('s.net',format="pajek")
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 3703, in read
return Graph.Read(filename, *args, **kwds)
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 2062, in Read
return reader(f, *args, **kwds)
igraph._igraph.InternalError: Error at .\src\foreign.c:574: Parse error in Pajek
file, line 1 (syntax error, unexpected ARCSLINE, expecting VERTICESLINE), Parse error
Kindly provide some hints about this.
Either your file is not a regular Pajek file or igraph's Pajek parser is not able to read this particular Pajek file. (Writing a Pajek parser is a bit hit-and-miss since the Pajek file format has no formal specification.) If you send me your Pajek file via email, I'll take a look at it.
Update: you were missing the *Vertices section of the Pajek file. Adding a line like *Vertices N (where N is the number of vertices in the graph) resolves your problem. I cannot state that this line is mandatory in Pajek files because of the lack of a formal specification for the file format, but all the Pajek files I have seen so far included this line, so I guess it's pretty standard.
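For illustration, here is a made-up, minimal three-vertex example (not your data) showing the *Vertices section that igraph expects:

import igraph

# A minimal Pajek file: *Vertices with labels, then the *Arcs section
pajek = """*Vertices 3
1 "a"
2 "b"
3 "c"
*Arcs
1 2
2 3
"""
with open("s.net", "w") as f:
    f.write(pajek)

g = igraph.read("s.net", format="pajek")
print g.summary()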
I'm in the process of writing a Python module to POST files to a server. I can upload files of up to 500 MB, but when I tried to upload a 1 GB file the upload failed; if I use something like cURL it doesn't fail. I got the code after googling how to upload multipart form data using Python; the code can be found here. I just compiled and ran that code, and the error I'm getting is this:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
opener.open("http://127.0.0.1/test_server/upload",params)
File "C:\Python27\lib\urllib2.py", line 392, in open
req = meth(req)
File "C:\Python27\MultipartPostHandler.py", line 35, in http_request
boundary, data = self.multipart_encode(v_vars, v_files)
File "C:\Python27\MultipartPostHandler.py", line 63, in multipart_encode
buffer += '\r\n' + fd.read() + '\r\n'
MemoryError
I'm new to Python and having a hard time grasping it. I also came across another program here; I'll be honest, I don't know how to run it. I tried running it by guessing based on the function names, but that didn't work.
The script in question isn't very smart and builds the POST body in memory.
Thus, to POST a 1 GB file, you'll need 1 GB of memory just to hold that data, plus room for the HTTP headers, the boundaries, Python itself, and your code.
You'd have to rework the script to use mmap instead: first construct the whole body in a temporary file, then wrap that file in an mmap.mmap object and pass it to request.add_data.
See Python: HTTP Post a large file with streaming for hints on how to achieve that.
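Very roughly, the rework could look like the sketch below. build_multipart_body and BOUNDARY are hypothetical stand-ins for the encoding logic in the script you linked (which works on v_vars and v_files); this is an illustration of the mmap idea, not tested code:

import mmap
import tempfile
import urllib2

# Write the multipart body to a temp file instead of one big string in memory.
# build_multipart_body() and BOUNDARY are placeholders for the encoding
# logic from the original script.
body_file = tempfile.TemporaryFile()
build_multipart_body(body_file, v_vars, v_files)
body_file.flush()

# Memory-map the temp file; urllib2 can take len() of it and read from it
# in blocks without ever holding the whole body as a single Python string.
body = mmap.mmap(body_file.fileno(), 0, access=mmap.ACCESS_READ)

request = urllib2.Request("http://127.0.0.1/test_server/upload")
request.add_header("Content-Type",
                   "multipart/form-data; boundary=%s" % BOUNDARY)
request.add_data(body)
response = urllib2.urlopen(request)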
I am running a classification/feature extraction task on a Windows server with 64 GB of RAM, and somehow Python thinks I am running out of memory:
misiti#fff /cygdrive/c/NaiveBayes
$ python run_classify_comments.py > tenfoldcrossvalidation.txt
Traceback (most recent call last):
File "run_classify_comments.py", line 70, in <module>
run_classify_comments()
File "run_classify_comments.py", line 51, in run_classify_comments
NWORDS = get_all_words("./data/HUGETEXTFILE.txt")
File "run_classify_comments.py", line 16, in get_all_words
def get_all_words(path): return words(file(path).read())
File "run_classify_comments.py", line 15, in words
def words(text): return re.findall('[a-z]+', text.lower())
File "C:\Program Files (x86)\Python26\lib\re.py", line 175, in findall
return _compile(pattern, flags).findall(string)
MemoryError
So the re module is crashing with 64 GB of RAM...I do not think so...
Why is this happening, and how can I configure python to use all available RAM on my machine?
Just rewrite your program to read your huge text file one line at a time. This is easily done by just changing get_all_words(path) to:
def get_all_words(path):
    # Concatenate the per-line word lists; the empty list is the start value
    # sum() needs in order to add lists rather than numbers.
    return sum((words(line) for line in open(path)), [])
Note the use of a generator expression inside the parentheses: it is lazy and evaluated on demand by the sum function, so the file is read one line at a time (the trailing [] is the start value sum needs, since it is concatenating lists here rather than adding numbers).
It looks to me as if the problem is in using re.findall() to read the whole text into memory as one list of words. Are you reading more than 64 GB of text this way? Depending on how your Naive Bayes algorithm is implemented, you may do better to build your frequency dictionary incrementally, so that only the dictionary (not the whole text) is held in memory. Some more information about your implementation might help answer your question more directly.
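For instance, an incremental version of the word counting might look like this (a sketch that assumes NWORDS is meant to be a word-frequency dictionary):

import re
from collections import defaultdict

def get_word_counts(path):
    # Build the frequency dictionary one line at a time, so only the
    # dictionary (not the whole file) is ever held in memory.
    counts = defaultdict(int)
    with open(path) as handle:
        for line in handle:
            for word in re.findall('[a-z]+', line.lower()):
                counts[word] += 1
    return counts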