Python HTMLParser - stop parsing

I am using Python's HTMLParser from the html.parser module.
I am looking for a single tag and when it is found it would make sense to stop the parsing. Is this possible? I tried to call close() but I am not sure if this is the way to go.
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        login_form = False
        if tag == "form":
            print("finished")
            self.close()
However, this seems to have recursive effects, ending with:
  File "/usr/lib/python3.4/re.py", line 282, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
RuntimeError: maximum recursion depth exceeded in comparison

According to the docs, the close() method does this:
Force processing of all buffered data as if it were followed by an end-of-file mark.
You're still inside the handle_starttag and haven't finished working with the buffer yet, so you definitely do not want to process all the buffered data - that's why you're getting stuck with recursion. You can't stop the machine from inside the machine.
From the description of reset() this sounds more like what you want:
Reset the instance. Loses all unprocessed data.
but this too can't be called from inside the handlers that the parser itself invokes, so it runs into the same recursion.
It sounds like you have two options:
raise an exception (for example a StopIteration, or a small custom exception) and catch it around your call to the parser; a sketch of this follows the list. Depending on what else you're doing in the parsing, this may retain the information you need. You may need to do some checks to make sure files aren't left open.
use a simple flag (True / False) to signify whether you have aborted or not. At the very start of handle_starttag just exit if aborted. So the machinery will still go through all the tags of the html, but do nothing for each one. Obviously if you're processing handle_endtag as well then this would also check the flag. You can set the flag back to normal either when you receive a <html> tag or by overwriting the feed method.
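A minimal sketch of the exception approach, assuming the markup to scan is in a string html_text (the StopParsing name is purely illustrative, not part of html.parser):

from html.parser import HTMLParser

class StopParsing(Exception):
    """Raised to abandon parsing once the tag of interest is found."""

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "form":
            print("finished")
            raise StopParsing()

parser = MyHTMLParser()
try:
    parser.feed(html_text)
except StopParsing:
    pass  # the rest of the document is simply never processed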

How do I ignore characters using the python pty module?

I want to write a command-line program that communicates with other interactive programs through a pseudo-terminal. In particular I want to be able to cause keystrokes received to conditionally be sent to the underlying process. Let's say for an example that I would like to silently ignore any "e" characters that are sent.
I know that Python has a pty module for working with pseudo-terminals and I have a basic version of my program working using it:
import os
import pty

def script_read(stdin):
    data = os.read(stdin, 1024)
    if data == b"e":
        return ...  # What goes here?
    return data

pty.spawn(["bash"], script_read)
From experimenting, I know that returning an empty bytes object b"" causes the pty.spawn implementation to think that the underlying file descriptor has reached the end of file and should no longer be read from, which causes the terminal to become totally unresponsive (I had to kill my terminal emulator!).
For interactive use, the simplest way to do this is probably to just return a bytes object containing a single null byte: b"\0". The terminal emulator will not print anything for it and so it will look like that input is just completely ignored.
This probably isn't great for certain usages of pseudo-terminals. In particular, if the content written to the pseudo-terminal is going to be written again by the attached program this would probably cause random null bytes to appear in the file. Testing with cat as the attached program, the sequence ^# is printed to the terminal whenever a null byte is sent to it.
So, YMMV.
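Putting the workaround together, a sketch of the same script with the return filled in (pty.spawn and os.read are used exactly as above):

import os
import pty

def script_read(stdin):
    data = os.read(stdin, 1024)
    if data == b"e":
        # A single null byte: the terminal emulator prints nothing for it,
        # and pty.spawn does not mistake it for end-of-file the way b"" is.
        return b"\0"
    return data

pty.spawn(["bash"], script_read)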
A more proper solution would be to create a wrapper type that can masquerade as an empty string for the purposes of os.write but that would evaluate as "truthy" in a boolean context to not trigger the end of file conditional. I did some experimenting with this and couldn't figure out what needs to be faked to make os.write fully accept the wrapper as a string type. I'm unclear if it's even possible. :(
Here's my initial attempt at creating such a wrapper type:
class EmptyBytes():
    def __init__(self):
        self.sliced = False

    def __class__(self):
        return type(b"")

    def __getitem__(self, _key):
        return b""

Segmentation fault when initializing array

I am getting a segmentation fault when initializing an array.
I have a callback function from when an RFID tag gets read
IDS = []

def readTag(e):
    epc = str(e.epc, 'utf-8')
    if not epc in IDS:
        now = datetime.datetime.now().strftime('%m/%d/%Y %H:%M:%S')
        IDS.append([epc, now, "name.instrument"])
and a main function from which it's called
def main():
    for x in vals:
        IDS.append([vals[0], vals[1], vals[2]])
    for x in IDS:
        print(x[0])
    r = mercury.Reader("tmr:///dev/ttyUSB0", baudrate=9600)
    r.set_region("NA")
    r.start_reading(readTag, on_time=1500)
    input("press any key to stop reading: ")
    r.stop_reading()
The error occurs because of the line IDS.append([epc, now, "name.instrument"]). I know because when I replace it with a print call instead, the program runs just fine. I've tried using different types for the array objects (integers), creating an array of the same objects outside of the append function, etc. For some reason just creating an array inside the readTag function causes the segmentation fault, e.g. row = [1, 2, 3].
Does anyone know what causes this error and how I can fix it? To be a little more specific: the readTag function works fine for the first two calls (only ever two), but then it crashes. The Reader object that provides start_reading() comes from the mercury-api.
This looks like a scoping issue to me; the mercury library doesn't have permission to access your list's memory address, so when it invokes your callback function readTag(e), a segfault occurs. I don't think the behavior you want is supported by that library.
To extend Michael's answer, this appears to be an issue with scoping and the API you're using. In general pure-Python doesn't seg-fault. Or at least, it shouldn't seg-fault unless there's a bug in the interpreter, or some extension that you're using. That's not to say pure-Python won't break, it's just that a genuine seg-fault indicates the problem is probably the result of something messy outside of your code.
I'm assuming you're using this Python API.
In that case, the README.md mentions that the Reader.start_reading() method you're using is "asynchronous", meaning it spawns a new thread or process and returns immediately; the background thread then calls your callback each time something is scanned.
I don't really know enough about the nitty-gritty of CPython to say exactly what's going on, but you've declared IDS = [] as a global variable and it seems like the background thread is running the callback with a different context to the main program. So when it attempts to access IDS it's reading memory it doesn't own, hence the seg-fault.
Because of how restrictive the callback is and the apparent lack of a buffer, this might be an oversight on the behalf of the developer. If you really need asynchronous reads it's worth sending them an issue report.
Otherwise, considering you're just waiting for input you probably don't need the asynchronous reads, and you could use the synchronous Reader.read() method inside your own busy loop instead with something like:
try:
    while True:
        readTags(r.read(timeout=10))
except KeyboardInterrupt:  # break loop on SIGINT (Ctrl-C)
    pass
Note that r.read() returns a list of tags rather than just one, so you'd need to modify your callback slightly (a sketch follows below). If you're writing more than just a quick script you probably want to use threads to interrupt the loop properly, as catching SIGINT is pretty hacky.
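A rough sketch of the adjusted callback, assuming each tag object exposes .epc the same way it does in the asynchronous callback above:

def readTags(tags):
    # r.read() hands back a list of tags, so apply the original
    # de-duplication logic to each one in turn.
    for tag in tags:
        epc = str(tag.epc, 'utf-8')
        if epc not in IDS:
            now = datetime.datetime.now().strftime('%m/%d/%Y %H:%M:%S')
            IDS.append([epc, now, "name.instrument"])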

python threading as a way to complete a script that always crashes

I've been struggling for many days now with a class PublicationSaver() that I wrote. It has a method for loading XML documents as strings (not shown here) and then passes each loaded string to self.savePublication(self, publication, myDirPath).
Every time I have used it, it crashed after about 25,000 strings, and it saves the last string on which it crashed. I was able to parse that string separately, so I suppose the problem is not bad XML.
I asked here but got no answers.
I googled a lot and it seems that I'm not the only one having this problem: here
So, since I really need to complete this task, I thought this: can I wrap everything in a Thread started from main, so that when the lxml parse throws an exception I catch it, send a result back to main to kill the thread, and then start it again?
# threading
result_q = Queue.Queue()

# Create the thread
xmlSplitter = XmlSplitter_Thread(result_q=result_q)
xmlSplitter.run(toSplit_DirPath, target_DirPath)

print "Hello !!!\n"
toSplitDirEmptyB = False
while not toSplitDirEmptyB:
    splitterAlive = True
    while splitterAlive:
        sleep(120)
        splitterAlive = result_q.get()
    xmlSplitter.join()
    print "*** KILLED XmlSplitter_Thread !!! ***\n"
    if not os.listdir(toSplit_DirPath):
        toSplitDirEmptyB = True
    else:
        xmlSplitter.run(toSplit_DirPath, target_DirPath)
Is this a valid approach? When I run the code above it doesn't work; I never get the "Hello !!!" printed, and the xmlSplitter just keeps going even when it starts to fail (there's an exception handler that keeps it going).
Probably the thread fails and it's blocking on the join() method; take a look here. Split the XML into chunks and try to parse each chunk to avoid memory errors; a sketch of that approach is below.
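For the chunking suggestion, a hedged sketch using lxml's iterparse; the tag name "publication" and the save_publication callable are placeholders for whatever PublicationSaver actually does:

from lxml import etree

def process_in_chunks(xml_path, save_publication, tag="publication"):
    # Stream the document element by element instead of parsing it whole,
    # clearing each element once handled so memory stays flat on huge files.
    for _event, elem in etree.iterparse(xml_path, tag=tag):
        save_publication(etree.tostring(elem))
        elem.clear()
        # Also drop references to already-processed preceding siblings.
        while elem.getprevious() is not None:
            del elem.getparent()[0]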

Why does my generator hang instead of throwing exception?

I have a generator that returns lines from a number of files, through a filter. It looks like this:
def line_generator(self):
    # Find the relevant files
    files = self.get_files()
    # Read lines
    input_object = fileinput.input(files)
    for line in input_object:
        # Apply filter and yield if it is not *None*
        filtered = self.__line_filter(input_object.filename(), line)
        if filtered is not None:
            yield filtered
    input_object.close()
The method self.get_files() returns a list of file paths or an empty list.
I have tried to do s = fileinput.input([]) and then call s.next(). This is where it hangs, and I cannot understand why. I'm trying to be pythonic and not handle all errors myself, but I guess this is one where there is no way around it. Or is there?
Unfortunately I have no means of testing this on Linux right now, but could someone please try the following on Linux, and comment what they get?
import fileinput
s = fileinput.input([])
s.next()
I'm on Windows with Python 2.7.5 (64 bit).
All in all, I'd really like to know:
Is this a bug in Python, or am I doing something wrong?
Shouldn't .next() always return something, or raise a StopIteration?
fileinput defaults to stdin if the list is empty, so it's just waiting for you to type something.
An obvious fix would be to get rid of fileinput (it is not terribly useful anyway) and to be explicit, as the Zen of Python suggests:
for path in self.get_files():
    with open(path) as fp:
        for line in fp:
            # etc.
As others have already answered, I'll try to answer one specific sub-item:
Shouldn't .next() always return something, or raise a StopIteration?
Yes, but it is not specified when this return is supposed to happen: within some milliseconds, seconds or even longer.
If you have a blocking iterator, you can define some wrapper around it so that it runs inside a different thread, filling a list or something, and the originating thread gets an interface to determine if there are data, if there are currently no data or if the source is exhausted.
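One possible shape for such a wrapper, as a sketch only (the module names are the Python 3 spellings; on Python 2 the queue module is called Queue):

import queue
import threading

_DONE = object()  # sentinel meaning "the source is exhausted"

def in_background(blocking_iterable, maxsize=100):
    # Consume the blocking iterator in a worker thread and expose a queue
    # that the originating thread can poll without blocking on the source.
    q = queue.Queue(maxsize=maxsize)

    def worker():
        for item in blocking_iterable:
            q.put(item)
        q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    return q

The originating thread can then call q.get_nowait(), which raises queue.Empty while there is currently no data; receiving the _DONE sentinel means the source is exhausted.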
I can elaborate on this even more if needed.

File open and close in python

I have read that when file is opened using the below format
with open(filename) as f:
    # My Code
f.close()
explicit closing of the file is not required. Can someone explain why that is? Also, if someone does explicitly close the file, will it have any undesirable effect?
The mile-high overview is this: When you leave the nested block, Python automatically calls f.close() for you.
It doesn't matter whether you leave by just falling off the bottom, or calling break/continue/return to jump out of it, or raise an exception; no matter how you leave that block. It always knows you're leaving, so it always closes the file.*
One level down, you can think of it as mapping to the try:/finally: statement:
f = open(filename)
try:
# My Code
finally:
f.close()
One level down: How does it know to call close instead of something different?
Well, it doesn't really. It actually calls special methods __enter__ and __exit__:
f = open(filename)
f.__enter__()
try:
    # My Code
finally:
    f.__exit__()
And the object returned by open (a file in Python 2, one of the wrappers in io in Python 3) has something like this in it:
def __exit__(self):
    self.close()
It's actually a bit more complicated than that last version, which makes it easier to generate better error messages, and lets Python avoid "entering" a block that it doesn't know how to "exit".
To understand all the details, read PEP 343.
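To make the protocol concrete, here is a stripped-down context manager that behaves like the file wrapper described above (purely illustrative; the real io objects do considerably more error handling):

class ManagedFile:
    def __init__(self, filename):
        self.filename = filename

    def __enter__(self):
        self.f = open(self.filename)
        return self.f

    def __exit__(self, exc_type, exc_value, traceback):
        self.f.close()
        return False  # never suppress exceptions raised inside the block

with ManagedFile("example.txt") as f:
    for line in f:
        pass  # My Code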
Also if someone does explicitly close the file, will it have any undesirable effect ?
In general, this is a bad thing to do.
However, file objects go out of their way to make it safe. It's an error to do anything to a closed file—except to close it again.
* Unless you leave by, say, pulling the power cord on the server in the middle of it executing your script. In that case, obviously, it never gets to run any code, much less the close. But an explicit close would hardly help you there.
Closing is not required because the with statement automatically takes care of that.
Within the with statement, the __enter__ method of open(...) is called, and as soon as you leave that block, the __exit__ method is called.
So closing it manually is just futile since the __exit__ method will take care of that automatically.
As for the f.close() after, it's not wrong but useless. It's already closed so it won't do anything.
Also see this blogpost for more info about the with statement: http://effbot.org/zone/python-with-statement.htm
