'Search for pattern exhausted' when processing a WARC file in Python 3

I'm trying to extract plain text from a WARC dataset (Yahoo! Webscope L2), but I keep getting ValueError: Search for pattern exhausted when calling load() from the Python 3 module warcat. I have tried a few other WARC example files, and they all loaded fine.
According to the readme file, the dataset requires a further license agreement to be submitted, after which a password would be provided (do WARC files come with passwords?), but for now I'm not equipped to send a fax.
I also checked the warcat source code and found that the ValueError is raised when file_obj.read(size) returns a false value. That makes no sense to me, so I'm asking here.
The code:
>>> import warcat
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('ydata-embedded-metadata-v1_0.warc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
    self.read_file_object(f)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
    record, has_more = self.read_record(file_object)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
    check_block_length=check_block_length)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/record.py", line 59, in load
    inclusive=True)
  File "/usr/local/lib/python3.4/site-packages/warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted
Thanks in advance.
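A note on the error itself: file_obj.read(size) returns an empty bytes object once the end of the file is reached, so the exception means the parser read to the end of the file without finding the delimiter it was looking for. That would be consistent with a file that is not plain WARC data (for example, one that is encrypted or otherwise wrapped), although that is only a guess about this dataset. The sketch below is not warcat's code: find_pattern and describe_header are hypothetical helpers that mimic the behaviour and check whether some bytes start with the WARC/ version line or the gzip magic bytes.

```python
import io


def find_pattern(file_obj, pattern, chunk_size=4096):
    """Return the offset just past the first `pattern` found in file_obj.

    Hypothetical stand-in for a delimiter search; not warcat's implementation.
    """
    start = file_obj.tell()
    buffer = b""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            # read() returns b"" at end of file: the delimiter never appeared.
            raise ValueError("Search for pattern exhausted")
        buffer += chunk
        index = buffer.find(pattern)
        if index != -1:
            return start + index + len(pattern)


def describe_header(head):
    """Classify the first bytes of a file that is supposed to be WARC."""
    if head.startswith(b"WARC/"):
        return "warc"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    return "unknown"


# Made-up sample record, used only to exercise the helpers.
record = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\npayload"
print(describe_header(record[:8]))
print(find_pattern(io.BytesIO(record), b"\r\n\r\n"))
```

To inspect the real file, read its first few bytes in binary mode and pass them to describe_header; a result of "unknown" would suggest the file needs to be decrypted or unpacked before warcat can parse it.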

Related

Transform docx to html raises python MemoryError

I have a function that converts a docx to html and a large docx file to be converted.
The problem is that this function is part of a bigger program, and the converted HTML is parsed afterwards, so I cannot switch to another converter without affecting the rest of the code, which I want to avoid. I am running 32-bit Python 2.7.13, and moving to 64-bit is also not desired.
This is the function:
import logging
import ooxml
from ooxml import serialize
def trasnformDocxtoHtml(inputFile, outputFile):
    # Send log messages to a file
    logging.basicConfig(filename='ooxml.log', level=logging.INFO)
    # Parse the whole .docx file
    dfile = ooxml.read_from_file(inputFile)
    with open(outputFile, 'w') as htmlFile:
        htmlFile.write(serialize.serialize(dfile.document))
and here's the error:
>>> import library
>>> library.trasnformDocxtoHtml(r'large_file.docx', 'output.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "library.py", line 9, in trasnformDocxtoHtml
    dfile = ooxml.read_from_file(inputFile)
  File "C:\Python27\lib\site-packages\ooxml\__init__.py", line 52, in read_from_file
    dfile.parse()
  File "C:\Python27\lib\site-packages\ooxml\docxfile.py", line 46, in parse
    self._doc = parse_from_file(self)
  File "C:\Python27\lib\site-packages\ooxml\parse.py", line 655, in parse_from_file
    document = parse_document(doc_content)
  File "C:\Python27\lib\site-packages\ooxml\parse.py", line 463, in parse_document
    document.elements.append(parse_table(document, elem))
  File "C:\Python27\lib\site-packages\ooxml\parse.py", line 436, in parse_table
    for p in tc.xpath('./w:p', namespaces=NAMESPACES):
  File "src\lxml\etree.pyx", line 1583, in lxml.etree._Element.xpath
MemoryError
no mem for new parser
MemoryError
Can I somehow increase the memory available to Python, or fix the function without changing the HTML output format?
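As far as I know, CPython has no setting that raises a memory limit of its own; a 32-bit interpreter is bounded by the address space of a 32-bit process, which is the likely reason a large parse tree fails here. A quick way to confirm which build is actually running, as a minimal sketch (the helper name interpreter_bits is made up for illustration):

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print("%d-bit Python %s" % (bits, sys.version.split()[0]))
if bits == 32:
    print("This process can address a few GB of memory at most")
```

The check works on both Python 2.7 and Python 3.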

How to append keywords to IPTC data in a JPG image?

I'm trying to add keywords to the IPTC data in a JPG file and failing miserably. I can read the keywords with the iptcinfo3 library and, seemingly, append a keyword to the current list, but writing the keywords back to the JPG file fails, if not something earlier. The error message is a bit unclear to me and may actually refer to appending the new keyword (although a print statement seems to indicate that it took).
I've tried three different metadata libraries (there doesn't seem to be one standard), and this is the furthest I've gotten with any of them: one failed to install, and I couldn't get a second one to run. This seems so basic, but I can't figure it out and haven't been able to adapt the few other code examples I've seen online, including iptcinfo3's own example fragment.
The current error message is:
| => pipenv run python editMetadata.py
WARNING: problems with charset recognition (b'\x1b')
[b'Gus']
[b'Gus', b'frog']
Traceback (most recent call last):
  File "editMetadata.py", line 22, in <module>
    info.save_as('Gus2.jpg')
  File "/Users/Scott/.local/share/virtualenvs/editPhotoMetadata-tx0JAOmI/lib/python3.7/site-packages/iptcinfo3.py", line 635, in save_as
    jpeg_parts = jpeg_collect_file_parts(fh)
  File "/Users/Scott/.local/share/virtualenvs/editPhotoMetadata-tx0JAOmI/lib/python3.7/site-packages/iptcinfo3.py", line 324, in jpeg_collect_file_parts
    adobeParts = collect_adobe_parts(partdata)
  File "/Users/Scott/.local/share/virtualenvs/editPhotoMetadata-tx0JAOmI/lib/python3.7/site-packages/iptcinfo3.py", line 433, in collect_adobe_parts
    out = [''.join(out)]
TypeError: sequence item 0: expected str instance, bytes found
Code:
from iptcinfo3 import IPTCInfo
import os
# Create new info object
info = IPTCInfo('Gus.jpg')
# Print list of keywords
print(info['keywords'])
# Append the keyword I want to add
info['keywords'].append(b'frog')
# Print to test keyword has been added
print(info['keywords'])
# Save new info to file
info.save()
info.save_as('Gus2.jpg')
Instead of appending, assign the keyword list with "=":
from iptcinfo3 import IPTCInfo
info = IPTCInfo('Gus.jpg')
print(info['keywords'])
# add keyword
info['keywords'] = ['new keyword']
info.save()
info.save_as('Gus_2.jpg')
I have the same error. It seems to be an issue with saving that depends on the file.
from iptcinfo3 import IPTCInfo
info = IPTCInfo('image.jpg', force=True)
info.save()
This gives me the same error:
WARNING: problems with charset recognition (b'\x1b')
WARNING: problems with charset recognition (b'\x1b')
Traceback (most recent call last):
  File "./searchimages.py", line 123, in <module>
    main(sys.argv[1:])
  File "./searchimages.py", line 119, in main
    find_photos(str(sys.argv[1]))
  File "./searchimages.py", line 46, in find_photos
    write_keywords(image, current_keywords, new_keywords)
  File "./searchimages.py", line 109, in write_keywords
    info.save_as('out.jpg')
  File "/usr/local/lib/python3.7/site-packages/iptcinfo3.py", line 635, in save_as
    jpeg_parts = jpeg_collect_file_parts(fh)
  File "/usr/local/lib/python3.7/site-packages/iptcinfo3.py", line 324, in jpeg_collect_file_parts
    adobeParts = collect_adobe_parts(partdata)
  File "/usr/local/lib/python3.7/site-packages/iptcinfo3.py", line 433, in collect_adobe_parts
    out = [''.join(out)]
TypeError: sequence item 0: expected str instance, bytes found
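Both tracebacks end at the same line, where an empty str is used to join a list that holds bytes. In Python 3 that combination always raises this TypeError, and it can be reproduced without iptcinfo3 at all. The sketch below uses a hypothetical join_parts helper and made-up sample data; it only illustrates the str/bytes mismatch and is not a patch for the library.

```python
def join_parts(parts):
    """Join byte chunks with a bytes separator, as Python 3 requires."""
    return b"".join(parts)


parts = [b"8BIM", b"\x04\x04"]  # made-up sample chunks

try:
    "".join(parts)  # str separator with bytes items
except TypeError as exc:
    print("str join failed:", exc)

print(join_parts(parts))
```

This would explain why the failure depends on the file: only images that reach that code path trigger the join.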

Failed to download file using pafy

I am using Python 2.7 and pafy to download an audio file from YouTube:
import pafy
video = pafy.new("https://www.youtube.com/watch?v=dcNlEn1LrrE")
print video.m4astreams
filename = video.m4astreams[0].download(quiet=False)
I get the following error:
Traceback (most recent call last):
  File "E:\work\Python\2017\pafy\work_with_pafy.py", line 27, in <module>
    filename = video.m4astreams[0].download(quiet=False)#.encode('utf-8')
  File "c:\python27\lib\site-packages\pafy\backend_shared.py", line 586, in download
    filename = self.generate_filename(meta=meta, max_length=256-len('.temp'))
  File "c:\python27\lib\site-packages\pafy\backend_shared.py", line 458, in generate_filename
    return xenc(filename)
  File "c:\python27\lib\site-packages\pafy\util.py", line 63, in xenc
    return utf8_replace(stuff) if not_utf8_environment else stuff
  File "c:\python27\lib\site-packages\pafy\util.py", line 57, in utf8_replace
    txt = txt.encode(sse, "replace").decode(sse)
TypeError: encode() argument 1 must be string, not None
Please Help!
Thanks in advance.
I have found the solution.
The problem is solved by changing one line in C:\Python27\Lib\site-packages\pafy\util.py.
I replaced this line in util.py:
txt = txt.encode(sse, "replace").decode(sse)
with this one:
txt = txt.encode('utf-8')
After that, the file downloaded without any problems.
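The TypeError says that sse was None when encode() was called. Assuming sse is taken from something like sys.stdout.encoding, which can be None when output is piped or redirected, a less invasive change than hard-coding UTF-8 might be to fall back only when no encoding is available. A minimal sketch with a hypothetical replace_unencodable helper (not pafy's code):

```python
import sys


def replace_unencodable(txt, encoding):
    """Round-trip txt through `encoding`, replacing characters it cannot hold.

    Falls back to UTF-8 when no encoding is known (encoding is None).
    """
    encoding = encoding or "utf-8"
    return txt.encode(encoding, "replace").decode(encoding)


# With an ASCII target, the accented character becomes "?".
print(replace_unencodable(u"caf\u00e9", "ascii"))
# With no encoding available, the text passes through UTF-8 unchanged.
print(replace_unencodable(u"video title", None))
```

In a real script the second argument would be sys.stdout.encoding, so the original behaviour is kept whenever the console reports an encoding.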

Cannot determine type of file

Hi, I have just started learning image processing with Python.
When I try to open an image that I downloaded from the net, I keep getting this error, and I have no idea how to resolve it. Can anyone please help me with this?
>>> dna=mahotas.imread('dna.jpeg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\mahotas\io\freeimage.py", line 773, in imread
    img = read(filename)
  File "C:\Python27\lib\site-packages\mahotas\io\freeimage.py", line 444, in read
    bitmap = _read_bitmap(filename, flags)
  File "C:\Python27\lib\site-packages\mahotas\io\freeimage.py", line 490, in _read_bitmap
    'mahotas.freeimage: cannot determine type of file %s' % filename)
ValueError: mahotas.freeimage: cannot determine type of file dna.jpeg
Hello, this looks like a pretty old thread, but I found it recently because I had the same problem.
I think the error message is misleading because it implies that the file type is incorrect.
I fixed the problem by giving the full path to the image file. For example, it could look something like:
dna = mahotas.imread(r'C:\Documents\dna.jpeg')
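A relative name such as 'dna.jpeg' is resolved against the current working directory, so the read can fail when the interpreter was started somewhere else, which would explain why a full path helped. A minimal sketch that checks the path before handing it to an image reader; resolve_image_path is a hypothetical helper, and the mahotas call is left as a comment:

```python
import os


def resolve_image_path(filename):
    """Return the absolute path of filename, or raise if no such file exists."""
    path = os.path.abspath(filename)
    if not os.path.isfile(path):
        raise IOError("No such file: %s (cwd is %s)" % (path, os.getcwd()))
    return path


try:
    path = resolve_image_path("dna.jpeg")
    print("Found", path)
    # dna = mahotas.imread(path)
except IOError as exc:
    print(exc)
```

If the file is missing, the message shows both the path that was tried and the working directory, which is more direct than the "cannot determine type of file" error.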

Strange exception with python's cgitb and inspect.py

I have a function that decodes an exception and writes the info to a file. This is basically what I do:
exc_info = sys.exc_info
txt = cgitb.text(exc_info)
Using this, I got the following exception trace:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\job_queue\utils\start_workers.py", line 40, in start_worker
    worker_loop(r_jq, worktype, worker_id)
  File "C:\Python27\lib\site-packages\job_queue\server\jq_worker.py", line 55, in worker_loop
    _job_machine(*job)
  File "C:\Python27\lib\site-packages\job_queue\server\jq_worker.py", line 34, in _job_machine
    do_verbose_exception()
  File "C:\Python27\lib\site-packages\job_queue\server\errors.py", line 23, in do_verbose_exception
    txt = cgitb.text(exc_info)
  File "C:\Python27\lib\cgitb.py", line 214, in text
    formatvalue=lambda value: '=' + pydoc.text.repr(value))
  File "C:\Python27\lib\inspect.py", line 885, in formatargvalues
    specs.append(strseq(args[i], convert, join))
  File "C:\Python27\lib\inspect.py", line 840, in strseq
    return convert(object)
  File "C:\Python27\lib\inspect.py", line 882, in convert
    return formatarg(name) + formatvalue(locals[name])
KeyError: 'connection'
I ran the code multiple times after this exception but couldn't reproduce it. I also didn't find any reference in cgitb.py or inspect.py to a dict with a 'connection' key.
Does anybody know whether this is an issue with Python's cgitb or inspect modules? Any helpful input is welcome.
You passed the wrong type to the text function.
The correct way is below:
cgitb.text((sys.last_type, sys.last_value, sys.last_traceback))
I'm not sure specifically why this exception is happening, but have you read the docs for the cgitb module? It seems that since Python 2.2 it has supported writing exceptions to a file:
http://docs.python.org/library/cgitb.html
Probably something like:
cgitb.enable(0, "/my/log/directory") # or 1 if you want to see it in browser
As for your actual traceback, are you sure 'connection' isn't a name you are using in your own code? The inspect module is most likely examining your own code to build the traceback info and hitting a bad key somewhere.
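One detail worth checking in the question's snippet: sys.exc_info without parentheses is the function object itself, while sys.exc_info() returns the (type, value, traceback) tuple for the exception currently being handled. The sys.last_type family is not always defined; it is filled in when an unhandled exception is reported, so inside an except block sys.exc_info() is the usual source. Because cgitb was removed from the standard library in Python 3.13, the sketch below swaps in the traceback module to show the same pattern; format_current_exception is a hypothetical helper name.

```python
import sys
import traceback


def format_current_exception():
    """Format the exception being handled; call this inside an except block."""
    exc_info = sys.exc_info()  # note the call: this returns a 3-tuple
    return "".join(traceback.format_exception(*exc_info))


try:
    {}["connection"]  # deliberately raise a KeyError for the demo
except KeyError:
    txt = format_current_exception()
    print(txt)
```

The resulting text can then be written to a log file in the same place where the original code called cgitb.text.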
