Processing LZO sequence files with mrjob - Python

I'm writing a task with mrjob to compute various statistics using the Google Ngrams data: https://aws.amazon.com/datasets/8172056142375670
I developed and tested my script locally using an uncompressed subset of the data in tab-delimited text. Once I tried to run the job against the actual data set, I got this error:
Traceback (most recent call last):
File "ngram_counts.py", line 74, in <module>
MRNGramCounts.run()
File "/usr/lib/python2.6/dist-packages/mrjob/job.py", line 500, in run
mr_job.execute()
File "/usr/lib/python2.6/dist-packages/mrjob/job.py", line 509, in execute
self.run_mapper(self.options.step_num)
File "/usr/lib/python2.6/dist-packages/mrjob/job.py", line 574, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "ngram_counts.py", line 51, in mapper
(ngram, year, _mc, _pc, _vc) = line.split('\t')
ValueError: need more than 2 values to unpack
(while reading from s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-1M/5gram/data)
Presumably this is because of the public data set's compression scheme (quoting from the page linked above):
We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable.
Any guidance on how to set up a workflow that can process these files? I've searched exhaustively for tips but haven't turned up anything useful...
(I'm a relative n00b to mrjob and Hadoop.)

I finally figured this out. It looks like EMR takes care of the LZO decompression for you, but for the sequence file format you also need to add the following HADOOP_INPUT_FORMAT field to your MRJob class:
class MyMRJob(MRJob):

    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'

    def mapper(self, _, line):
        # mapper code...

    def reducer(self, key, value):
        # reducer code...
There's another gotcha, too (quoting from the AWS-hosted Google NGrams page):
The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable.
That means each row is prepended with an extra Long + TAB, so any line parsing you do in your mapper method needs to account for the prepended info as well.
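To make that concrete, here is a minimal sketch (reusing the MRNGramCounts name from the traceback, but the mapper/reducer logic is just an illustration, not the original script's) that drops the prepended row number before unpacking the five ngram columns:

from mrjob.job import MRJob


class MRNGramCounts(MRJob):

    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'

    def mapper(self, _, line):
        # The sequence-file key (the row number) arrives as an extra leading
        # field, so each line has six tab-separated columns, not five.
        (_row, ngram, year, match_count,
         _page_count, _volume_count) = line.split('\t')
        yield (ngram, year), int(match_count)

    def reducer(self, key, counts):
        yield key, sum(counts)


if __name__ == '__main__':
    MRNGramCounts.run()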

Related

RuntimeError: dictionary keys changed during iteration

I am trying to use the following script, but there have been some errors that I've needed to fix. I was able to get it running, but for most of the data it tries to process, the following error arises:
C:\Users\Alexa\OneDrive\Skrivbord\Database Samples\LIDC-IDRI-0001\1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511\1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818074696859567662357\068.xml
Traceback (most recent call last):
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 370, in <module>
parse_xml_file(xml_file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 311, in parse_xml_file
root=xmlHelper.create_xml_tree(file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidcXmlHelper.py", line 23, in create_xml_tree
for at in el.attrib.keys(): # strip namespaces of attributes too
RuntimeError: dictionary keys changed during iteration
This corresponds to the following code:
def create_xml_tree(filepath):
    """
    Method to ignore the namespaces if ElementTree is used.
    Necessary because ElementTree, by default, extends
    tag names with the namespace, but the namespaces used in the
    LIDC-IDRI dataset are not consistent.
    Solution based on https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma
    instead of ET.fromstring(xml)
    """
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
        for at in el.attrib.keys():  # strip namespaces of attributes too
            if '}' in at:
                newat = at.split('}', 1)[1]
                el.attrib[newat] = el.attrib[at]
                del el.attrib[at]
    return it.root
I am not at all familiar with XML file reading in Python, and this problem has had me stuck for the last two days. I tried reading up on threads both here and on other forums, but they did not give me significant insight. From what I understand, the problem arises because we're trying to modify the same dictionary we are iterating over, which is not allowed? I tried to fix this by making a copy of the file and then having that change depending on what is read in the original file, but I did not get it working properly.
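For what it's worth, the usual fix for this exact error is to iterate over a snapshot of the attribute keys, so the loop no longer walks the same dictionary it is modifying. A sketch of the helper rewritten that way (untested against the LIDC-IDRI files):

import xml.etree.ElementTree as ET

def create_xml_tree(filepath):
    """Parse an XML file, stripping namespaces from tags and attributes."""
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]   # strip tag namespace
        for at in list(el.attrib.keys()):      # list() takes a snapshot
            if '}' in at:
                # re-insert under the stripped name and drop the old key;
                # the snapshot we iterate over is unaffected
                el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
    return it.root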

Python loadarff fails for string attributes

I am trying to load an ARFF file using Python's loadarff function from scipy.io.arff. The file has string attributes, and loading it gives the following error.
>>> data,meta = arff.loadarff(fpath)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/home/eex608/conda3_envs/PyT3/lib/python3.6/site-packages/scipy/io/arff/arffread.py", line 805, in loadarff
return _loadarff(ofile)
File "/data/home/eex608/conda3_envs/PyT3/lib/python3.6/site-packages/scipy/io/arff/arffread.py", line 838, in _loadarff
raise NotImplementedError("String attributes not supported yet, sorry")
NotImplementedError: String attributes not supported yet, sorry
How to read the arff successfully?
Since SciPy's loadarff converts the contents of the ARFF file into a NumPy array, it does not support string attributes.
In 2020, you can use the liac-arff package instead.
import arff
data = arff.load(open('your_document.arff', 'r'))
However, make sure your ARFF document does not contain inline comments after meaningful text.
So there won't be inputs such as:
@ATTRIBUTE class {F,A,L,LF,MN,O,PE,SC,SE,US,FT,PO} %Check and make sure that FT and PO should be there
Delete the comment or move it to the next line.
I had made that mistake in one document and it took some time to figure out what was wrong.
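In case it helps, a rough sketch of what liac-arff hands back (as far as I recall its API, it is a plain dict rather than a NumPy record array, so string attributes come through as ordinary Python strings; 'your_document.arff' is a placeholder):

import arff

dataset = arff.load(open('your_document.arff', 'r'))

print(dataset['relation'])          # name from the @RELATION line
print(dataset['attributes'][:3])    # list of (name, type) pairs
print(dataset['data'][0])           # rows as plain Python lists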

Read .net pajek file using python igraph library

I am trying to load a .net file using python igraph library. Here is the sample code:
import igraph
g = igraph.read("s.net",format="pajek")
But when I tried to run this script I got the following errors:
Traceback (most recent call last):
File "demo.py", line 2, in <module>
g = igraph.read('s.net',format="pajek")
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 3703, in read
return Graph.Read(filename, *args, **kwds)
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 2062, in Read
return reader(f, *args, **kwds)
igraph._igraph.InternalError: Error at .\src\foreign.c:574: Parse error in Pajek
file, line 1 (syntax error, unexpected ARCSLINE, expecting VERTICESLINE), Parse error
Kindly provide some hints on this.
Either your file is not a regular Pajek file or igraph's Pajek parser is not able to read this particular Pajek file. (Writing a Pajek parser is a bit hit-and-miss since the Pajek file format has no formal specification.) If you send me your Pajek file via email, I'll take a look at it.
Update: you were missing the *Vertices section of the Pajek file. Adding a line like *Vertices N (where N is the number of vertices in the graph) resolves your problem. I cannot state that this line is mandatory in Pajek files because of the lack of a formal specification for the file format, but all the Pajek files I have seen so far include this line, so I guess it's pretty standard.
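For illustration, a made-up minimal Pajek file that the parser accepts, written out and read back from Python (the filename and graph contents are placeholders):

import igraph

# The *Vertices section (with the vertex count) must come before *Arcs/*Edges.
pajek = """*Vertices 3
1 "a"
2 "b"
3 "c"
*Arcs
1 2
2 3
"""

with open("s_fixed.net", "w") as f:
    f.write(pajek)

g = igraph.read("s_fixed.net", format="pajek")
print(g.summary())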

Large file upload fails

I'm in the process of writing a Python module to POST files to a server. I can upload files of up to 500MB, but when I tried to upload a 1GB file the upload failed; if I use something like cURL it doesn't fail. I got the code after googling how to upload multipart form data using Python; the code can be found here. I just ran that code, and the error I'm getting is this:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
opener.open("http://127.0.0.1/test_server/upload",params)
File "C:\Python27\lib\urllib2.py", line 392, in open
req = meth(req)
File "C:\Python27\MultipartPostHandler.py", line 35, in http_request
boundary, data = self.multipart_encode(v_vars, v_files)
File "C:\Python27\MultipartPostHandler.py", line 63, in multipart_encode
buffer += '\r\n' + fd.read() + '\r\n'
MemoryError
I'm new to Python and having a hard time grasping it. I also came across another program here; I'll be honest, I don't know how to run it. I tried running it by guessing based on the function names, but that didn't work.
The script in question isn't very smart and builds the POST body in memory.
Thus, to POST a 1GB file, you'll need 1GB of memory just to hold that data, plus room for the HTTP headers, the boundaries, and Python and the code itself.
You'd have to rework the script to use mmap instead: first construct the whole body in a temp file, then wrap that file in a mmap.mmap value and pass it to request.add_data.
See Python: HTTP Post a large file with streaming for hints on how to achieve that.
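A rough sketch of that approach (not a drop-in fix for MultipartPostHandler; write_multipart_body is a hypothetical callback that streams the boundaries, headers and file contents into the temp file):

import mmap
import tempfile
import urllib2

def post_large_body(url, write_multipart_body, content_type):
    tmp = tempfile.TemporaryFile()
    write_multipart_body(tmp)   # the body is built on disk, not in RAM
    tmp.flush()
    # mmap supports both len() and read(), so urllib2 can set Content-Length
    # and httplib can stream the body in chunks instead of copying it.
    body = mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ)
    request = urllib2.Request(url, data=body)
    request.add_header('Content-Type', content_type)
    return urllib2.urlopen(request)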

How to parse big datasets using RDFLib?

I'm trying to parse several big graphs with RDFLib 3.0. Apparently it handles the first one and dies on the second (MemoryError)... It looks like MySQL is not supported as a store anymore. Can you please suggest a way to parse these?
Traceback (most recent call last):
File "names.py", line 152, in <module>
main()
File "names.py", line 91, in main
locals()[graphname].parse(filename, format="nt")
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 938, in parse
location=location, file=file, data=data, **args)
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 757, in parse
parser.parse(source, self, **args)
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/nt.py", line 24, in parse
parser.parse(f)
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 124, in parse
self.line = self.readline()
File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 151, in readline
m = r_line.match(self.buffer)
MemoryError
How many triples are in those RDF files? I have tested rdflib and it won't scale much beyond a few tens of thousands of triples - if you are lucky. There's no way it really performs well for files with millions of triples.
The best parser out there is rapper from the Redland libraries. My first piece of advice is not to use RDF/XML and to go for ntriples instead. Ntriples is a lighter format than RDF/XML. You can transform RDF/XML into ntriples using rapper:
rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
If you like Python you can use the Redland python bindings:
import RDF

parser = RDF.Parser(name="ntriples")
model = RDF.Model()
stream = parser.parse_into_model(model, "file://file_path",
                                 "http://your_base_uri.org")

for triple in model:
    print triple.subject, triple.predicate, triple.object
I have parsed fairly big files (a couple of gigabytes) with the Redland libraries with no problem.
Eventually, if you are handling big datasets, you might need to assert your data into a scalable triple store; the one I normally use is 4store. 4store internally uses Redland to parse RDF files. In the long term, I think going for a scalable triple store is what you'll have to do, and with it you'll be able to use SPARQL to query your data and SPARQL/Update to insert and delete triples.
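Once the data is in a store, queries go over the standard SPARQL HTTP protocol; a sketch (the endpoint URL is a placeholder, point it at wherever your 4store instance is actually listening):

import urllib
import urllib2

endpoint = "http://localhost:8000/sparql/"   # placeholder address
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

response = urllib2.urlopen(endpoint, urllib.urlencode({"query": query}))
print(response.read())   # results come back as SPARQL XML (or JSON if requested)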
