Transform docx to HTML raises Python MemoryError

I have a function that converts a docx file to HTML, and a large docx file that needs to be converted.
The problem is that this function is part of a bigger program and the converted HTML is parsed afterwards, so I cannot switch to another converter without impacting the rest of the code (which is not wanted). I am running Python 2.7.13 (32-bit), and changing to 64-bit is also not desired.
This is the function:
import logging
import ooxml
from ooxml import serialize

def trasnformDocxtoHtml(inputFile, outputFile):
    logging.basicConfig(filename='ooxml.log', level=logging.INFO)
    dfile = ooxml.read_from_file(inputFile)
    with open(outputFile, 'w') as htmlFile:
        htmlFile.write(serialize.serialize(dfile.document))
and here's the error:
>>> import library
>>> library.trasnformDocxtoHtml(r'large_file.docx', 'output.html')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "library.py", line 9, in trasnformDocxtoHtml
dfile = ooxml.read_from_file(inputFile)
File "C:\Python27\lib\site-packages\ooxml\__init__.py", line 52, in read_from_file
dfile.parse()
File "C:\Python27\lib\site-packages\ooxml\docxfile.py", line 46, in parse
self._doc = parse_from_file(self)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 655, in parse_from_file
document = parse_document(doc_content)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 463, in parse_document
document.elements.append(parse_table(document, elem))
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 436, in parse_table
for p in tc.xpath('./w:p', namespaces=NAMESPACES):
File "src\lxml\etree.pyx", line 1583, in lxml.etree._Element.xpath
MemoryError
no mem for new parser
MemoryError
Could I somehow increase the memory available to Python? Or fix the function without impacting the HTML output format?
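A note on one possible mitigation (my own suggestion, not from the original post): a 32-bit Python process is capped at roughly 2 GB of address space, and there is no setting that simply grants lxml more memory. If the conversion has to stay exactly as-is, one hedged option is to run it in a child process so that a MemoryError there does not take down the bigger program; the wrapper names below are hypothetical, and on Windows this must be called from code guarded by if __name__ == '__main__'. It does not give the conversion more memory, it only isolates the failure.
import multiprocessing

def _convert_worker(inputFile, outputFile):
    # Runs in a separate process so a MemoryError here cannot crash the parent.
    trasnformDocxtoHtml(inputFile, outputFile)

def convertDocxIsolated(inputFile, outputFile):
    # Hypothetical wrapper: lets the main program detect and handle a failed
    # conversion instead of dying with it.
    proc = multiprocessing.Process(target=_convert_worker,
                                   args=(inputFile, outputFile))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError('docx conversion failed (exit code %s)' % proc.exitcode)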

Related

Reading TDMS File with python nptdms, cannot open tdms file

I am having issues getting basic functions of the nptdms module working.
First, I am just trying to open a TDMS file and print the contents of specific channels within specific groups.
I am using Python 2.7 and the nptdms quick start here.
Following this, I will be writing these specific pieces of data into a new TDMS file. Then, my ultimate goal is to be able to take a set of source files, open each, and write (append) to a new file. The source data files contain far more information than is needed, so I am breaking out the specifics into their own file.
The problem I have is that I cannot get past a basic error.
When running this code, I get:
Traceback (most recent call last):
File "PullTDMSdataIntoNewFile.py", line 27, in <module>
tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms","r")
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 94, in __init__
self._read_segments(f)
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 119, in _read_segments
object._initialise_data(memmap_dir=self.memmap_dir)
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 709, in _initialise_data
mode='w+b', prefix="nptdms_", dir=memmap_dir)
File "C:\Anaconda2\lib\tempfile.py", line 475, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
File "C:\Anaconda2\lib\tempfile.py", line 244, in _mkstemp_inner
fd = _os.open(file, flags, 0600)
OSError: [Errno 2] No such file or directory: 'r\\nptdms_yjfyam'
Here is my code:
from nptdms import TdmsFile
import numpy as np
import pandas as pd
#set Tdms file path
tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms","r")
# set variable for TDMS groups
group_nameone = '101'
group_nametwo = '752'
# set objects for TDMS channels
channel_dataone = tdms_file.object(group_nameone, 'Payload_1')
channel_datatwo = tdms_file.object(group_nametwo, 'Payload_2')
# set data from channels
data_dataone = channel_dataone.data
data_datatwo = channel_datatwo.data
print data_dataone
print data_datatwo
Big thanks to anyone who may have encountered this before and can help point to what I am missing.
Best,
- Dan
Edit:
I solved the read issue by removing the "r" argument that followed the file path.
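For reference, the traceback shows why: the second positional argument of TdmsFile is memmap_dir, not a file mode, so passing "r" made nptdms try to create temporary files in a directory literally named r. The working call then looks like this:
# Pass only the path; nptdms opens the file itself.
tdms_file = TdmsFile(r"C:\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms")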
Now I am having another error I can't trace when trying to write.
from nptdms import TdmsFile, TdmsWriter, RootObject, GroupObject, ChannelObject
import numpy as np
import pandas as pd
newfilepath = r"C:\\Users\daniel.worts\Desktop\Mined.tdms"
datetimegroup101_channel_object = ChannelObject('101', DateTime, data_datetimegroup101)
with TdmsWriter(newfilepath) as tdms_writer:
    tdms_writer.write_segment([datetimegroup101_channel_object])
Returns error:
Traceback (most recent call last):
File "PullTDMSdataIntoNewFile.py", line 82, in <module>
tdms_writer.write_segment([datetimegroup101_channel_object])
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 68, in write_segment
segment = TdmsSegment(objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 88, in __init__
paths = set(obj.path for obj in objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 88, in <genexpr>
paths = set(obj.path for obj in objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 254, in path
self.channel.replace("'", "''"))
AttributeError: 'TdmsObject' object has no attribute 'replace'
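A hedged guess at the write error (my own note, not from the post): the traceback fails on self.channel.replace("'", "''"), which suggests the second argument given to ChannelObject is not a string. Assuming DateTime was meant to be the channel's name and data_datetimegroup101 is the data array already extracted, the fix would be to pass the name as a string:
# Sketch: channel name passed as a string so nptdms can build the object path.
datetimegroup101_channel_object = ChannelObject('101', 'DateTime', data_datetimegroup101)

with TdmsWriter(newfilepath) as tdms_writer:
    tdms_writer.write_segment([datetimegroup101_channel_object])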

NBT Parser Minecraft mca file not a gzipped file error

I am trying to read a Minecraft world from the filesystem with Python, including the .mca region/anvil files, using the NBT 1.4.1 module (Named Binary Tag reader/writer), which is supposed to read the NBT format used by Minecraft. It works fine for files such as level.dat, but throws an error for region files such as r.0.0.mca.
Edit: I am referring to the auto-generated world files that Minecraft stores in the .minecraft/saves/"MyWorld"/ folder, such as level.dat (which works), and the .mca files stored in the .minecraft/saves/"MyWorld"/region/ folder, such as r.0.0.mca, which don't work. I uploaded two sample files from one of my worlds.
Code:
from nbt import nbt
level_file = nbt.NBTFile("level.dat", "rb") # works
region_file = nbt.NBTFile("r.0.0.mca", "rb")# does not work
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 508, in __init__
self.parse_file()
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 532, in parse_file
type = TAG_Byte(buffer=self.file)
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 85, in __init__
self._parse_buffer(buffer)
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 90, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "/usr/lib/python3.5/gzip.py", line 274, in read
return self._buffer.read(size)
File "/usr/lib/python3.5/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.5/gzip.py", line 461, in read
if not self._read_gzip_header():
File "/usr/lib/python3.5/gzip.py", line 409, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Any suggestions how to get this working?
r.0.0.mca is most definitely not compressed. About 80% of the bytes are zeros.
It turns out that the NBT library only supports .mcr region files, which were replaced by .mca files about 6 years ago. However, mcedit is written in Python and supports those files. Due to the changes in the Minecraft save format, the interpretation of the content needs to be adjusted, but the files can be successfully read.
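For background (my own sketch, not from the original post): an .mca region file is not a single gzip stream, which is why NBTFile's gzip path fails on it. The file starts with a 4096-byte chunk location table and a 4096-byte timestamp table; each stored chunk begins with a 4-byte big-endian length and a 1-byte compression type (2 means zlib), followed by the compressed NBT payload. A minimal sketch for pulling out one chunk's raw NBT bytes, assuming at least one chunk is present:
import struct
import zlib

def read_first_chunk_nbt(path):
    # Minimal Anvil (.mca) reader sketch: returns the decompressed NBT bytes
    # of the first chunk listed in the region header.
    with open(path, 'rb') as f:
        header = f.read(4096)                      # chunk location table, 1024 entries
        for i in range(1024):
            entry = header[i * 4:i * 4 + 4]
            offset = struct.unpack('>I', b'\x00' + entry[:3])[0]  # in 4 KiB sectors
            if offset:                             # 0 means the chunk is not present
                f.seek(offset * 4096)
                length, compression = struct.unpack('>IB', f.read(5))
                data = f.read(length - 1)          # length includes the type byte
                if compression == 2:               # 2 = zlib, 1 = gzip
                    return zlib.decompress(data)
                return data
    return None

nbt_bytes = read_first_chunk_nbt("r.0.0.mca")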

'Search for pattern exhausted' happens when processing WARC file in python3

I'm trying to fetch some plain text from a WARC dataset (Yahoo! Webscope L2) and keep getting ValueError: Search for pattern exhausted when using the load() function of the Python 3 module warcat. I have tried some random WARC example files and everything worked well.
The dataset did ask for a further license agreement (after which a password would be provided, according to the readme file; do WARC files come with passwords?), but for now I'm not equipped to send a fax.
I also checked the warcat source code and found that the ValueError is raised when file_obj.read(size) returns a falsy (empty) result, i.e. the file ends before the expected pattern is found. That makes no sense to me, so I'm asking here.
The code:
>>> import warcat
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('ydata-embedded-metadata-v1_0.warc')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
self.read_file_object(f)
File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
record, has_more = self.read_record(file_object)
File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
check_block_length=check_block_length)
File "/usr/local/lib/python3.4/site-packages/warcat/model/record.py", line 59, in load
inclusive=True)
File "/usr/local/lib/python3.4/site-packages/warcat/util.py", line 66, in find_file_pattern
raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted
Thanks in advance.
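A hedged first check (my own suggestion, not from the post): "Search for pattern exhausted" while other WARCs load fine often points at a file that is not actually plain or gzipped WARC data, for example a truncated download or an encrypted/packed archive that still needs the password mentioned in the readme. Peeking at the first bytes is a quick sanity check:
# Sanity-check sketch: a readable WARC starts with b"WARC/" (uncompressed)
# or b"\x1f\x8b" (gzip-compressed records).
with open('ydata-embedded-metadata-v1_0.warc', 'rb') as f:
    magic = f.read(5)

if magic.startswith(b'WARC/'):
    print('looks like an uncompressed WARC')
elif magic.startswith(b'\x1f\x8b'):
    print('looks like a gzipped WARC')
else:
    print('unexpected leading bytes: %r' % (magic,))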

Python MemoryError When Querying ArcGIS Web Service

I am running into the following MemoryError when running the script below. Any assistance would be greatly appreciated. The layer I am querying contains 235,896 features, which I'm afraid is the problem.
Script
import json
from arcgis import ArcGIS

service = ArcGIS("http://mapping.dekalbcountyga.gov/arcgis/rest/services/LandUse/MapServer")
query = service.get(0, count_only=False)
json_query = json.dumps(query)  # dumps, not dump: we want a string, not a file write
with open("dekalb_parcels.geojson", "w") as f:
    f.write(json_query)
Error
Traceback (most recent call last):
File "G:/Python/Scripts/dekalb_parcel_query.py", line 8, in <module>
query = service.get(0, count_only=False)
File "C:\Python27\lib\site-packages\arcgis\arcgis.py", line 146, in get
jsobj = self.get_json(layer, where, fields, count_only, srid)
File "C:\Python27\lib\site-packages\arcgis\arcgis.py", line 90, in get_json
return response.json(strict=False)
File "C:\Python27\lib\site-packages\requests\models.py", line 802, in json
return json.loads(self.text, **kwargs)
File "C:\Python27\lib\site-packages\requests\models.py", line 769, in text
content = str(self.content, encoding, errors='replace')
MemoryError
I was able to rectify this issue by switching to 64-bit Python. The process was crashing when it reached 2 GB of RAM usage; 64-bit Python removes that limit.
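If 64-bit Python is not an option, a hedged alternative is to page through the layer via the ArcGIS REST API directly instead of pulling all 235,896 features in one response. The endpoint path, page size, and output format below are assumptions for illustration, and paging requires a server version that supports resultOffset/resultRecordCount:
import json
import requests

url = ("http://mapping.dekalbcountyga.gov/arcgis/rest/services/"
       "LandUse/MapServer/0/query")
page_size = 1000  # assumed; the server's maxRecordCount may be lower
offset = 0

with open("dekalb_parcels.jsonl", "w") as out:
    while True:
        params = {
            "where": "1=1",
            "outFields": "*",
            "f": "json",
            "resultOffset": offset,
            "resultRecordCount": page_size,
        }
        batch = requests.get(url, params=params).json().get("features", [])
        if not batch:
            break
        for feature in batch:          # one feature per line keeps memory use flat
            out.write(json.dumps(feature) + "\n")
        offset += page_size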

Parser.pxi problems when pretty printing xml file

I'm getting an error when trying to pretty-print an XML file. I've looked everywhere and tried installing the latest version of lxml, but I still get this error.
My script is pretty simple; it looks like this.
import lxml.etree as etree

fname = r'C:\Test_folder\SlutR_20150218.xml'  # raw string avoids backslash escapes
x = etree.parse(fname)
print etree.tostring(x, pretty_print=True)
And the error I'm getting is the following.
Traceback (most recent call last):
  File "C:\Users\a.curcic\Desktop\Övriga_python_script\Pretty_print_example.py", line 5, in <module>
    x = etree.parse(fname)
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
XMLSyntaxError: Extra content at the end of the document, line 2, column 909
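For what it's worth (my own note, not part of the original post): "Extra content at the end of the document" usually means the file contains more than one top-level element, i.e. it is not a single well-formed XML document. One hedged workaround, assuming the file really is several XML fragments back to back, is to wrap the content in a synthetic root before parsing:
import lxml.etree as etree

fname = r'C:\Test_folder\SlutR_20150218.xml'

with open(fname, 'rb') as f:
    data = f.read()
if data.lstrip().startswith(b'<?xml'):
    data = data.split(b'?>', 1)[1]   # drop any XML declaration before wrapping
wrapped = b'<root>' + data + b'</root>'

x = etree.fromstring(wrapped)
print etree.tostring(x, pretty_print=True)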
