Convert Wikipedia dump to text using python -m gensim.scripts.make_wiki

I want to use gensim to convert a Wikipedia dump to plain text with the python -m gensim.scripts.make_wiki script.
I run it as:
python -m gensim.scripts.make_wiki ./enwiki-latest-pages-articles.xml.bz2 ./results
It gives me an error at the end:
2016-04-06 20:43:46,471 : INFO : storing corpus in Matrix Market format to ./results/_bow.mm
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/scripts/make_wiki.py", line 88, in <module>
MmCorpus.serialize(outp + '_bow.mm', wiki, progress_cnt=10000) # another ~9h
File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/indexedcorpus.py", line 89, in serialize
offsets = serializer.save_corpus(fname, corpus, id2word, progress_cnt=progress_cnt, metadata=metadata)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/mmcorpus.py", line 49, in save_corpus
return matutils.MmWriter.write_corpus(fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 486, in write_corpus
mw = MmWriter(fname)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 436, in __init__
self.fout = utils.smart_open(self.fname, 'wb+') # open for both reading and writing
File "build/bdist.linux-x86_64/egg/smart_open/smart_open_lib.py", line 111, in smart_open
NotImplementedError: unknown file mode wb+
Does anybody know what is going on?

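I'm not sure about the command-line script, but the following works for me:
import logging
from gensim.corpora import WikiCorpus

logger = logging.getLogger(__name__)

def parse_wiki(wiki_bz_file):
    # An empty dictionary skips the vocabulary-building pass, which is not needed here.
    wiki = WikiCorpus(wiki_bz_file, lemmatize=False, dictionary={})
    i = 0
    with open('./wiki_text_dump.txt', 'w') as output:
        for text in wiki.get_texts():
            # Each text is a list of tokens; write one article per line.
            # (The original called an undefined helper, u.listToStr(chunk);
            # joining the tokens with spaces is assumed to be the intent.)
            output.write(' '.join(text) + '\n')
            i += 1
            if i % 50000 == 0:
                logger.info("Saved %d articles", i)
    logger.info("Finished; saved %d articles", i)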

Related

What is correct Cypher Syntax?

I am following "Neo4j create nodes and relationships from pandas dataframe with py2neo" and using the code below.
But I get a py2neo.database.status.CypherSyntaxError. Please verify whether the approach below is right and let me know the correct Cypher syntax.
My code:
for line in reader:
    print(line['word'], line['similar_word'], line['probability'] )
    w1 = Node("Word", name = line['word'])
    w2 = Node("Word", name = line['similar_word'])
    graph.merge(w1|w2)
    graph.run('''
    MATCH (a:Paper),(b:Word)
    WHERE (a.name = 'Paper10' AND b.name = {$word1})
    CREATE (a)<-[o:ORIGINAL]-(b)
    ''', parameters = {'word1':line['word']})
py2neo.database.status.CypherSyntaxError:
Traceback (most recent call last):
File "test.py", line 24, in <module>
''', parameters = {'word1':line['word']})
File "/root/miniconda3/lib/python3.6/site-packages/py2neo/database/__init__.py", line 731, in run
return self.begin(autocommit=True).run(statement, parameters, **kwparameters)
File "/root/miniconda3/lib/python3.6/site-packages/py2neo/database/__init__.py", line 1277, in run
self.finish()
File "/root/miniconda3/lib/python3.6/site-packages/py2neo/database/__init__.py", line 1296, in finish
self._sync()
File "/root/miniconda3/lib/python3.6/site-packages/py2neo/database/__init__.py", line 1286, in _sync
connection.fetch()
File "/root/miniconda3/lib/python3.6/site-packages/py2neo/packages/neo4j/v1/bolt.py", line 344, in fetch
handler(*fields)
File "/root/miniconda3/lib/python3.6/site-packages/py2neo/database/__init__.py", line 961, in on_failure
raise GraphError.hydrate(metadata)
py2neo.database.status.CypherSyntaxError: Invalid input '$': expected whitespace, an identifier, UnsignedDecimalInteger, a property key name or '}' (line 3, column 53 (offset: 78))
" WHERE (a.name = 'Paper10' AND b.name = {$word1})"

To reference a parameter, you can use either the $ syntax ($word1) or the curly-brace syntax ({word1}), but not both at once. Try this with just $word1.
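
As a minimal sketch of that fix, the query from the question can be written with the plain $word1 form. The names LINK_QUERY and link_word_to_paper are made up for illustration, and graph is assumed to be a connected py2neo Graph whose run method accepts a parameters dictionary, as in the question.

```python
# Hypothetical names: LINK_QUERY and link_word_to_paper are for illustration only.
LINK_QUERY = '''
MATCH (a:Paper), (b:Word)
WHERE a.name = 'Paper10' AND b.name = $word1
CREATE (a)<-[o:ORIGINAL]-(b)
'''


def link_word_to_paper(graph, word):
    # graph is assumed to be a connected py2neo Graph. The parameter is
    # referenced as $word1 in the query and supplied by name here.
    return graph.run(LINK_QUERY, parameters={'word1': word})
```

Inside the loop from the question, this would be called as link_word_to_paper(graph, line['word']).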

Python connectivity with HBase using happybase

Can someone help me with the stack trace generated while using the happybase library?
I am trying to pass a Python 3.4 dictionary to the put method, and the following stack trace is generated:
x
b"TWLb'25-Jan-13'"
data_values
{b'Low': b'0.10', b'Date': b'25-Jan-13', b'Volume': b'-', b'Close': b'0.12', b'High': b'0.12', b'Open': b'0.12'}
Traceback (most recent call last):
File "/home/msingal/Desktop/asd/Daily.py", line 63, in insert
hbase_test.insert_data(code, data_format)
File "/home/msingal/Desktop/asd/hbase_test.py", line 56, in insert_data
con.table(ticker, use_prefix=False).put(x, data_values)
File "/usr/lib/python3.4/site-packages/happybase/table.py", line 464, in put
batch.put(row, data)
File "/usr/lib/python3.4/site-packages/happybase/batch.py", line 137, in __exit__
self.send()
File "/usr/lib/python3.4/site-packages/happybase/batch.py", line 60, in send
self._table.connection.client.mutateRows(self._table.name, bms, {})
File "/usr/lib64/python3.4/site-packages/thriftpy/thrift.py", line 198, in _req
return self._recv(_api)
File "/usr/lib64/python3.4/site-packages/thriftpy/thrift.py", line 210, in _recv
fname, mtype, rseqid = self._iprot.read_message_begin()
File "thriftpy/protocol/cybin/cybin.pyx", line 429, in cybin.TCyBinaryProtocol.read_message_begin (thriftpy/protocol/cybin/cybin.c:6325)
File "thriftpy/protocol/cybin/cybin.pyx", line 60, in cybin.read_i32 (thriftpy/protocol/cybin/cybin.c:1546)
File "thriftpy/transport/buffered/cybuffered.pyx", line 65, in thriftpy.transport.buffered.cybuffered.TCyBufferedTransport.c_read (thriftpy/transport/buffered/cybuffered.c:1881)
File "thriftpy/transport/buffered/cybuffered.pyx", line 69, in thriftpy.transport.buffered.cybuffered.TCyBufferedTransport.read_trans (thriftpy/transport/buffered/cybuffered.c:1948)
File "thriftpy/transport/cybase.pyx", line 61, in thriftpy.transport.cybase.TCyBuffer.read_trans (thriftpy/transport/cybase.c:1472)
File "/usr/lib64/python3.4/site-packages/thriftpy/transport/socket.py", line 125, in read
message='TSocket read 0 bytes')
thriftpy.transport.TTransportException: TTransportException(message='TSocket read 0 bytes', type=4)
The relevant lines of code are:
ticker = 'TWLB'
data_values = {b'Low': b'0.10', b'Date': b'25-Jan-13', b'Volume': b'-', b'Close': b'0.12', b'High': b'0.12', b'Open': b'0.12'}
x = (ticker + str(data_values.get(b'Date'))).encode('ASCII')
print('x')
print(x)
print('data_values')
print(data_values)
con.table(ticker, use_prefix=False).put(x, data_values)
Any help with a solution and an explanation would be appreciated.
I am new to StackOverflow, so if my language feels offensive, please forgive me. I have tried to provide all the relevant info, but if something is missing, let me know and I will update the question.
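
One detail is visible in the printed output and is separate from the exception itself: in Python 3, calling str() on a bytes object returns its repr, which is why the printed row key contains a literal b'25-Jan-13'. The sketch below builds the key by decoding first, assuming the intended key is the ticker followed by the date. The helper name make_row_key is hypothetical, and this change alone is not claimed to resolve the TTransportException.

```python
def make_row_key(ticker, data_values):
    # Decode the bytes value first; str(b'...') in Python 3 would produce
    # the text "b'...'" rather than the date itself.
    date = data_values[b'Date'].decode('ascii')
    return (ticker + date).encode('ascii')


data_values = {b'Date': b'25-Jan-13', b'Close': b'0.12'}
row_key = make_row_key('TWLB', data_values)
```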

Python/Pyomo with glpk Solver - Error

I am trying to run a simple example with Pyomo and the glpk solver (Anaconda2 64-bit, Spyder):
from pyomo.environ import *
model = ConcreteModel()
model.x_1 = Var(within=NonNegativeReals)
model.x_2 = Var(within=NonNegativeReals)
model.obj = Objective(expr=model.x_1 + 2*model.x_2)
model.con1 = Constraint(expr=3*model.x_1 + 4*model.x_2 >= 1)
model.con2 = Constraint(expr=2*model.x_1 + 5*model.x_2 >= 2)
opt = SolverFactory("glpk")
instance = model.create()
#results = opt.solve(instance)
#results.write()
But I get the following error message:
invalid literal for int() with base 10: 'c'
Traceback (most recent call last):
File "<ipython-input-5-e074641da66d>", line 1, in <module>
runfile('D:/..../Exampe.py', wdir='D:.../exercises/pyomo')
File "C:\...\Continuum\Anaconda21\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\....\Continuum\Anaconda21\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "D:/...pyomo/Exampe.py", line 34, in <module>
results = opt.solve(instance)
File "C:\....\Continuum\Anaconda21\lib\site-packages\pyomo\opt\base\solvers.py", line 580, in solve
result = self._postsolve()
File "C:\...Continuum\Anaconda21\lib\site-packages\pyomo\opt\solver\shellcmd.py", line 267, in _postsolve
results = self.process_output(self._rc)
File "C:\...\Continuum\Anaconda21\lib\site-packages\pyomo\opt\solver\shellcmd.py", line 329, in process_output
self.process_soln_file(results)
File "C:\....\Continuum\Anaconda21\lib\site-packages\pyomo\solvers\plugins\solvers\GLPK.py", line 454, in process_soln_file
raise ValueError(msg)
ValueError: Error parsing solution data file, line 1
I downloaded glpk from http://winglpk.sourceforge.net/, unzipped it, and added its path to the "path" environment variable.
I hope someone can help me. Thank you!

This is a known problem with GLPK 4.60: glpsol changed its output format, which broke the parser in Pyomo 4.3. You can either download an older release of GLPK or upgrade Pyomo to 4.4.1, which contains an updated parser.

use polyglot package for Named Entity Recognition in hebrew

I am trying to use the polyglot package for Named Entity Recognition in Hebrew.
This is my code:
# -*- coding: utf8 -*-
import polyglot
from polyglot.text import Text, Word
from polyglot.downloader import downloader
downloader.download("embeddings2.iw")
text = Text(u"in france and in germany")
print(type(text))
text2 = Text(u"נסעתי מירושלים לתל אביב")
print(type(text2))
print(text.entities)
print(text2.entities)
This is the output:
<class 'polyglot.text.Text'>
<class 'polyglot.text.Text'>
[I-LOC([u'france']), I-LOC([u'germany'])]
Traceback (most recent call last):
File "C:/Python27/Lib/site-packages/IPython/core/pyglot.py", line 15, in <module>
print(text2.entities)
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "C:\Python27\lib\site-packages\polyglot\text.py", line 132, in entities
for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "C:\Python27\lib\site-packages\polyglot\text.py", line 100, in ne_chunker
return get_ner_tagger(lang=self.language.code)
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
cache[key] = obj(*args, **kwargs)
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 191, in get_ner_tagger
return NEChunker(lang=lang)
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 104, in __init__
super(NEChunker, self).__init__(lang=lang)
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 40, in __init__
self.predictor = self._load_network()
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 109, in _load_network
self.embeddings = load_embeddings(self.lang, type='cw', normalize=True)
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
cache[key] = obj(*args, **kwargs)
File "C:\Python27\lib\site-packages\polyglot\load.py", line 61, in load_embeddings
p = locate_resource(src_dir, lang)
File "C:\Python27\lib\site-packages\polyglot\load.py", line 43, in locate_resource
if downloader.status(package_id) != downloader.INSTALLED:
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 738, in status
info = self._info_or_id(info_or_id)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 508, in _info_or_id
return self.info(info_or_id)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 934, in info
raise ValueError('Package %r not found in index' % id)
ValueError: Package u'embeddings2.iw' not found in index
The English text worked, but the Hebrew text did not.
Whether or not I try to download the package u'embeddings2.iw', I get:
ValueError: Package u'embeddings2.iw' not found in index

I got it!
It seems like a bug to me.
The language detection identified the language as 'iw', which is the former ISO 639 language code for Hebrew; it was later changed to 'he'.
text.entities did not recognize the 'iw' code, so I changed it like so:
text2.hint_language_code = 'he'
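
The same idea can be expressed as a small normalization step. This is only a sketch and not part of polyglot: the helper normalize_language_code and its mapping table are assumptions for illustration, covering the ISO 639-1 codes that were withdrawn and replaced (iw by he, in by id, ji by yi).

```python
# Withdrawn ISO 639-1 codes and their current replacements.
DEPRECATED_CODES = {'iw': 'he', 'in': 'id', 'ji': 'yi'}


def normalize_language_code(code):
    # Return the current code for a withdrawn one; leave other codes unchanged.
    return DEPRECATED_CODES.get(code, code)
```

The result could then be assigned the same way as in the answer, for example text2.hint_language_code = normalize_language_code('iw').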

python lxml save not working

I have the following script:
count = 1
for line in temp:
    if (str(count) + '=') in line:
        job = re.findall(re.escape('=')+"(.*)",line)[0]
        fullsrcurl = self.srcjson + '?format=xml&jobname=' + job
        srcfile = urllib2.urlopen(fullsrcurl)
        srcdoc = etree.parse(srcfile)
        srcdata = etree.tostring(srcdoc, pretty_print=True)
        srcjobmst_id = srcdoc.xpath('//jobmst_id/text()')[0]
        srcxml = 'c:\\temp\\deployments\\%s\\%s.xml' % (source_env, srcjobmst_id)
        srcxmlsave = open(srcxml, 'w')
        srcxmlsave.write(srcdata)
        srcxmlsave.close
        fulldsturl = self.targetjson + '?format=xml&jobname=' + job
        dstfile = urllib2.urlopen(fulldsturl)
        dstdoc = etree.parse(dstfile)
        dstdata = etree.tostring(dstdoc, pretty_print=True)
        dstjobmst_id = dstdoc.xpath('//jobmst_id/text()')[0]
        dstxml = 'c:\\temp\\deployments\\%s\\%s.xml' % (target_env, dstjobmst_id)
        dstxmlsave = open(dstxml, 'w')
        dstxmlsave.write(dstdata)
        dstxmlsave.close
        print "Job = " + job
        count += 1
It hits two separate APIs in two environments, but the data is almost identical. The source works fine, but as soon as it tries to do anything with the destination data I get the following error:
Traceback (most recent call last):
File "S:\Operations\Tidal\deployment\deployv2.py", line 213, in <module>
main()
File "S:\Operations\Tidal\deployment\deployv2.py", line 209, in main
auto_deploy.deploy()
File "S:\Operations\Tidal\deployment\deployv2.py", line 173, in deploy
dstdoc = etree.parse(dstfile)
File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src\lxml\lxml.etree.c:69970)
File "parser.pxi", line 1770, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:102272)
File "parser.pxi", line 1790, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:102531)
File "parser.pxi", line 1685, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:101457)
File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:97084)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 4, column 1
So there has to be something different about the destination/target XML, but I'm having a hard time understanding what. When I look at both responses in a browser, they're identical except for a few values (jobmst_id).

You aren't closing the files. Change srcxmlsave.close to srcxmlsave.close(), or use a context manager, as in:
with open(srcxml, 'w') as srcxmlsave:
    srcxmlsave.write(srcdata)
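
The difference is easy to see in isolation. In the sketch below (the temporary file and the variable names are only for illustration), referring to close without parentheses evaluates the method object and discards it, so the file stays open and buffered data may not have reached disk yet. The with block, by contrast, closes the file when the block exits.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.xml')

f = open(path, 'w')
f.write('<root/>')
f.close                      # no parentheses: the method is never called
still_open = not f.closed    # the file is still open at this point
f.close()                    # now the file is actually closed

with open(path, 'w') as g:
    g.write('<root/>')
closed_by_with = g.closed    # the context manager closed the file
```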

If anyone experiences an issue like this in the future: I found the problem, and it has nothing to do with lxml or the XML I'm generating. My source environment had been productionized using mod_wsgi, but the target environment was still using runserver.
I guess something in the encoding was breaking with the target. I productionized the target environment as well, and it now works fine.
