I have a pretty annoying issue at the moment. When I make an httplib2 request for a page that is far too large, I would like to be able to stop it cleanly.
For example:
from httplib2 import Http
url = 'http://media.blubrry.com/podacademy/p/content.blubrry.com/podacademy/Neuroscience_and_Society_1.mp3'
h = Http(timeout=5)
h.request(url, 'GET')
In this example, the URL is a podcast and it will keep downloading forever. My main process will hang indefinitely in this situation.
I have tried to run the request in a separate thread using the code below and then delete my object straight away.
import Queue
from threading import Thread

def http_worker(url, q):
    h = Http()
    print 'Http worker getting %s' % url
    q.put(h.request(url, 'GET'))

def process(url):
    q = Queue.Queue()
    t = Thread(target=http_worker, args=(url, q))
    t.start()
    tid = t.ident
    t.join(3)
    if t.isAlive():
        try:
            del t
            print 'deleting t'
        except:
            print 'error deleting t'
    else:
        print q.get()
    check_thread(tid)

process(url)
Unfortunately, the thread is still active and will continue to consume CPU and memory.
def check_thread(tid):
    import sys
    print 'Thread id %s is still active ? %s' % (tid, tid in sys._current_frames().keys())
Thank you.
OK, I found a hack to deal with this issue.
The best solution so far is to set a maximum amount of data to read and then stop reading from the socket. The data is read in the _safe_read method of the httplib module. In order to override this method, I used this lib: http://blog.rabidgeek.com/?tag=wraptools
And voilà:
from httplib import HTTPResponse, IncompleteRead, MAXAMOUNT
from wraptools import wraps

@wraps(HTTPResponse._safe_read)
def _safe_read(original_method, self, amt):
    """Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).

    Note that we cannot distinguish between EOF and an interrupt when zero
    bytes have been read. IncompleteRead() will be raised in this
    situation.

    This function should be used when <amt> bytes "should" be present for
    reading. If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
    # NOTE(gps): As of svn r74426 socket._fileobject.read(x) will never
    # return less than x bytes unless EOF is encountered. It now handles
    # signal interruptions (socket.error EINTR) internally. This code
    # never caught that exception anyways. It seems largely pointless.
    # self.fp.read(amt) will work fine.
    s = []
    total = 0
    MAX_FILE_SIZE = 3 * 10**6
    while amt > 0 and total < MAX_FILE_SIZE:
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        if not chunk:
            raise IncompleteRead(''.join(s), amt)
        total = total + len(chunk)
        s.append(chunk)
        amt -= len(chunk)
    return ''.join(s)
In this case, MAX_FILE_SIZE is set to 3 MB.
Hopefully, this will help others.
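If you would rather not pull in wraptools, roughly the same idea can be expressed with plain monkey-patching. The sketch below is only an approximation (the cap is tracked per response instance, and the names _bounded_safe_read and _bytes_read are mine, not part of httplib); it assumes Python 2, where httplib2 reads the body through httplib.HTTPResponse:

import httplib

_original_safe_read = httplib.HTTPResponse._safe_read
MAX_FILE_SIZE = 3 * 10**6  # same 3 MB cap as above

def _bounded_safe_read(self, amt):
    # Track how much this response has already read and stop at the cap,
    # returning a short (possibly empty) string once the limit is reached.
    read_so_far = getattr(self, '_bytes_read', 0)
    remaining = MAX_FILE_SIZE - read_so_far
    if remaining <= 0:
        return ''
    chunk = _original_safe_read(self, min(amt, remaining))
    self._bytes_read = read_so_far + len(chunk)
    return chunk

httplib.HTTPResponse._safe_read = _bounded_safe_read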
I'm trying to construct a man-in-the-middle attack on a webpage (i.e. HTTP traffic). I'm doing this by using a Linux machine attached to Ethernet and a client attached to the Linux box via its WiFi hotspot.
What I've done so far is use NFQueue from within the IPTables Linux firewall to route all TCP packets on the FORWARD chain to the NFQueue queue, which a Python script picks up and processes. I'm able to read the data off the HTTP response packets, but whenever I try to modify them and pass them back (accept the packets), I get an error regarding strings:
Exception AttributeError: "'str' object has no attribute 'build_padding'" in 'netfilterqueue.global_callback' ignored
My code is here, which includes things that I've tried that didn't work. Notably, I'm using a third-party extension for scapy called scapy_http that may be interfering with things, and I'm using a webpage that is not being compressed by gzip because that was messing with things as well. The test webpage that I'm using is here.
#scapy
from scapy.all import *
#nfqueue import
from netfilterqueue import NetfilterQueue
#scapy http extension, not really needed
import scapy_http.http
#failed gzip decoding, also tried some other stuff
#import gzip

def print_and_accept(packet):
    #convert nfqueue datatype to scapy-compatible
    pkt = IP(packet.get_payload())
    #is this an HTTP response?
    if pkt[TCP].sport == 80:
        #legacy trial that doesn't work
        #data = packet.get_data()
        print('HTTP Packet Found')
        #check what's in the payload
        stringLoad = str(pkt[TCP].payload)
        #deleted because printing stuff out clogs output
        #print(stringLoad)
        #we only want to modify a specific packet:
        if "<title>Acids and Bases: Use of the pKa Table</title>" in stringLoad:
            print('Target Found')
            #strings kind of don't work, I think this is a me problem
            #stringLoad.replace('>Acids and Bases: Use of the pK<sub>a</sub>', 'This page has been modified: a random ')
            #pkt[TCP].payload = stringLoad
            #https://stackoverflow.com/questions/27293924/change-tcp-payload-with-nfqueue-scapy
            payload_before = len(pkt[TCP].payload)
            # I suspect this line is a problem: the string assigns,
            # but maybe under the hood scapy doesn't like that very much
            pkt[TCP].payload = str(pkt[TCP].payload).replace("Discussion", "This page has been modified")
            #recalculate length
            payload_after = len(pkt[TCP].payload)
            payload_dif = payload_after - payload_before
            pkt[IP].len = pkt[IP].len + payload_dif
            #recalculate checksum
            del pkt[TCP].chksum
            del pkt[IP].chksum
            del pkt.chksum
            print('Packet Modified')
            #redundant
            #print(stringLoad)
            #this throws an error (I think)
            print(str(pkt[TCP].payload))
            #no clue if this works or not yet
            #goal here is to reassign modified packet to original parameter
            packet.set_payload(str(pkt))
            #this was also throwing the error, so tried to move away from it
            #print(pkt.show2())
        #bunch of legacy code that didn't work
        #print(GET_print(pkt))
        #print(pkt.show())
        #decompressed_data = zlib.decompress(str(pkt[TCP].payload), 16 + zlib.MAX_WBITS)
        #print(decompressed_data)
        #print(str(gzip.decompress(pkt[TCP].payload)))
        #print(pkt.getlayer(Raw).load)
        #print('HTTP Contents Shown')
    packet.accept()

def GET_print(packet1):
    ret = "***************************************GET PACKET****************************************************\n"
    ret += "\n".join(packet1.sprintf("{Raw:%Raw.load%}\n").split(r"\r\n"))
    ret += "*****************************************************************************************************\n"
    return ret

print('Test: Modify a very specific target')
print('Program Starting')
nfqueue = NetfilterQueue()
nfqueue.bind(1, print_and_accept)
try:
    print('Packet Interface Starting')
    nfqueue.run()
except KeyboardInterrupt:
    print('\nProgram Ending')
    nfqueue.unbind()
print('Test: Modify a very specific target')
print('Program Starting')
nfqueue = NetfilterQueue()
nfqueue.bind(1, print_and_accept)
try:
print('Packet Interface Starting')
nfqueue.run()
except KeyboardInterrupt:
print('\nProgram Ending')
nfqueue.unbind()
Apologies in advance if this is hard to read or badly formatted code; Python isn't a language that I write in often. Any help is greatly appreciated!
I have just learned the basics of Python, and I am trying to make a few projects so that I can increase my knowledge of the programming language.
Since I am rather paranoid, I created a script that uses PycURL to fetch my current IP address every x seconds, for VPN security. Here is my code [EDITED]:
import requests

enterIP = str(input("What is your current IP address?"))

def getIP():
    while True:
        try:
            result = requests.get("http://ipinfo.io/ip")
            print(result.text)
        except KeyboardInterrupt:
            print("\nProccess terminated by user")
        return result.text

def checkIP():
    while True:
        if enterIP == result.text:
            pass
        else:
            print("IP has changed!")

getIP()
checkIP()
Now I would like to expand the idea so that the script asks the user to enter their current IP, saves that address as a string, and then uses a loop to keep checking it against the PycURL output to make sure that their IP hasn't changed. The only problem is that I am completely stumped: I cannot come up with a function that would take the output of PycURL and compare it to a string. How could I achieve that?
As @holdenweb explained, you do not need pycurl for such a simple task, but nevertheless, here is a working example:
import pycurl
import time
from StringIO import StringIO

def get_ip():
    buffer = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "http://ipinfo.io/ip")
    c.setopt(c.WRITEDATA, buffer)
    c.perform()
    c.close()
    return buffer.getvalue()

def main():
    initial = get_ip()
    print 'Initial IP: %s' % initial
    try:
        while True:
            current = get_ip()
            if current != initial:
                print 'IP has changed to: %s' % current
            time.sleep(300)
    except KeyboardInterrupt:
        print("\nProccess terminated by user")

if __name__ == '__main__':
    main()
As you can see, I moved the logic of getting the IP into a separate function, get_ip, and added a few missing pieces, like capturing the buffer into a string and returning it. Otherwise it is pretty much the same as the first example in the pycurl quickstart.
The main function is called at the bottom, when the script is run directly (not imported).
First it calls get_ip to get the initial IP and then runs the while loop, which checks whether the IP has changed and lets you know if so.
EDIT:
Since you changed your question, here is your new code in a working example:
import requests

def getIP():
    result = requests.get("http://ipinfo.io/ip")
    return result.text

def checkIP():
    initial = getIP()
    print("Initial IP: {}".format(initial))
    while True:
        current = getIP()
        if initial == current:
            pass
        else:
            print("IP has changed!")

checkIP()
As I mentioned in the comments above, you do not need two loops; one is enough. You don't even need two functions, but it is better to have them: one for getting the data and one for the loop. In the latter, first get the initial value and then run the loop, inside which you check whether the value has changed.
It seems, from reading the pycurl documentation, like you would find it easier to solve this problem using the requests library. Curl is more to do with file transfer, so the library expects you to provide a file-like object into which it writes the contents. This would greatly complicate your logic.
requests allows you to access the text of the server's response directly:
>>> import requests
>>> result = requests.get("http://ipinfo.io/ip")
>>> result.text
'151.231.192.8\n'
As @PeterWood suggested, a function would be more appropriate than a class for this - or, if the script is going to run continuously, just a simple loop as the body of the program.
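For instance, the "simple loop as the body of the program" could look something like this (the five-minute interval is an arbitrary choice for illustration):

import time
import requests

initial = requests.get("http://ipinfo.io/ip").text.strip()
print("Initial IP: {}".format(initial))

while True:
    current = requests.get("http://ipinfo.io/ip").text.strip()
    if current != initial:
        print("IP has changed to: {}".format(current))
    time.sleep(300)  # wait five minutes before checking again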
I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at roughly 30 kB per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of the file (already covered for Python here).
Just start several threads, fetch a different range with each, and you're done ;)
import threading
import urllib2

url = 'http://www.python.org/'   # the file to fetch (example)
chunk_size = 1024 * 1024         # bytes per request

def download(url, start):
    req = urllib2.Request(url)
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
parts = {}
# Initialize threads
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)
# Join threads back (order doesn't matter, you just want them all)
for i in threads:
    i.join()
# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (in particular, servers where PHP scripts are responsible for data fetching often don't implement handling of it).
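If you want to check up front whether a server advertises range support, a HEAD request is usually enough; an Accept-Ranges: bytes header is a good (though not guaranteed) indicator. A small sketch along those lines, using the usual get_method override to issue a HEAD request with urllib2:

import urllib2

def supports_ranges(url):
    # Ask for the headers only; note that some servers reject HEAD entirely
    req = urllib2.Request(url)
    req.get_method = lambda: 'HEAD'
    resp = urllib2.urlopen(req)
    return resp.headers.get('Accept-Ranges', '').lower() == 'bytes'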
Here's a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4) # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
The end of file is detected if a server returns an empty body or a 416 HTTP code, or if the response size is not exactly chunksize.
It supports servers that don't understand the Range header (everything is downloaded in a single request in this case; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread, instead of the file content itself).
It allows you to change independently the number of concurrent connections (pool size) and the number of bytes requested in a single HTTP request.
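For the large-file variant mentioned above, download_chunk() could be changed along these lines. This is only a sketch: download_chunk_to_file is a made-up name, an empty temporary file stands in for EOF, and the main loop would then have to copy each returned file into the output (and delete it) instead of writing the chunk directly:

import tempfile
from urllib2 import HTTPError, Request, urlopen

def download_chunk_to_file(url, byterange):
    # Spool the chunk to a temporary file and return its name (None on error).
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        resp = urlopen(req)
    except HTTPError as e:
        if e.code == 416:  # requested range past EOF: signal it with an empty file
            resp = None
        else:
            return None
    except EnvironmentError:
        return None
    with tempfile.NamedTemporaryFile(delete=False) as f:
        if resp is not None:
            while True:
                data = resp.read(1 << 14)  # copy in small pieces, never the whole chunk in RAM
                if not data:
                    break
                f.write(data)
        return f.name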
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
This solution requires the Linux utility named "aria2c", but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the HTTP directory listing for location MY_HTTP_LOC. I tested this script against a lighttpd/1.4.26 HTTP server, but you can easily modify it to work with other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess

MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"

# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()

# extract relevant URL segments from source code
rgxp = r'(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp, str(page))
files = []
for match in results:
    files.append(match[1])

# download (using aria2c) files
for afile in files:
    if os.path.exists(afile) and not os.path.exists(afile + '.aria2'):
        print 'Skipping already-retrieved file: ' + afile
    else:
        print 'Downloading file: ' + afile
        subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC + str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads, can do a lot more, and shows a download progress bar by default.
For more info, check this answer.
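A minimal usage sketch, assuming pySmartDL is installed (check its documentation for the exact options; the URL and destination here are just placeholders):

from pySmartDL import SmartDL

url = "http://bigfile.com/bigfile.bin"
dest = "."                # directory (or full path) to save into
obj = SmartDL(url, dest)  # splits the download over several connections
obj.start()               # blocks until finished, showing a progress bar
print(obj.get_dest())     # path of the downloaded file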
Recently I have been working on a tiny crawler for downloading images from a URL.
I use urlopen() from urllib2 together with open()/f.write():
Here is the code snippet:
# the list for the images' urls
imglist = re.findall(regImg, pageHtml)

# iterate to download images
for index in xrange(1, len(imglist) + 1):
    img = urllib2.urlopen(imglist[index - 1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)
In the code above, img.read() can potentially block for a long time; I would like to retry / re-open the image URL when that happens.
I am also concerned about the efficiency of the code above: if the number of images to be downloaded is somewhat large, using a thread pool to download them seems better.
Any suggestions? Thanks in advance.
P.S. I found that the read() method on the img object may block, so adding a timeout parameter to urlopen() alone seemed useless. But I found that the file object has no timeout version of read(). Any suggestions on this? Thanks very much.
urllib2.urlopen has a timeout parameter which is used for all blocking operations (connection buildup, reads, and so on).
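For example, a simple retry wrapper around urlopen with a timeout might look like this (a sketch only; the 30-second timeout and 3 attempts are arbitrary, and a connection-phase timeout may surface as urllib2.URLError rather than socket.timeout):

import socket
import urllib2

def fetch_with_retry(url, timeout=30, retries=3):
    # The timeout applies to the connection attempt and to each blocking read
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except socket.timeout:
            print('Timed out, retrying (%d/%d)...' % (attempt + 1, retries))
    raise IOError('Giving up on %s after %d attempts' % (url, retries))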
This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same. The url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist in the filesystem.
import Queue
import threading
import urllib
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    threads = []
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            # grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            # signals to queue job is done
            self.queue.task_done()
When you create the connection with urllib2.urlopen(), you can give it a timeout parameter.
As described in the doc:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
With this you will be able to manage a maximum waiting duration and catch the exception raised.
The way I crawl a huge batch of documents is by having a batch processor which crawls and dumps constant-sized chunks.
Suppose you have to crawl a pre-known batch of, say, 100K documents. You can have some logic to generate constant-size chunks of, say, 1000 documents, which would be downloaded by a thread pool. Once the whole chunk is crawled, you can do a bulk insert into your database, and then proceed with the next 1000 documents, and so on (a rough sketch follows the list below).
Advantages you get by following this approach:
You get the advantage of the thread pool speeding up your crawl rate.
It's fault tolerant in the sense that you can continue from the chunk where it last failed.
You can have chunks generated on the basis of priority, i.e. important documents are crawled first. So if you are unable to complete the whole batch, the important documents are already processed and the less important ones can be picked up on the next run.
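A rough sketch of that idea is below; crawl_one and bulk_insert are placeholders for your own download and database code, and the chunk size and thread count are arbitrary:

from multiprocessing.dummy import Pool  # thread pool

CHUNK_SIZE = 1000

def crawl_in_chunks(urls, crawl_one, bulk_insert, num_threads=10):
    pool = Pool(num_threads)
    for start in range(0, len(urls), CHUNK_SIZE):
        chunk = urls[start:start + CHUNK_SIZE]
        documents = pool.map(crawl_one, chunk)  # download one chunk in parallel
        bulk_insert(documents)                  # one bulk write per chunk
        print('Finished chunk starting at %d' % start)
    pool.close()
    pool.join()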
An ugly hack that seems to work.
import os, socket, threading, errno

def timeout_http_body_read(response, timeout=60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    try:
        t.start()
        body = response.read()
        t.cancel()
    except socket.error as se:
        if se.errno == errno.EBADF: # murha happened
            return (False, None)
        raise
    return (True, body)
So I have two Python 3.2 processes that need to communicate with each other. Most of the information that needs to be communicated consists of standard dictionaries. Named pipes seemed like the way to go, so I made a Pipe class that can be instantiated in both processes. This class implements a very basic protocol for getting information around.
My problem is that sometimes it works, sometimes it doesn't. There seems to be no pattern to this behavior except the place where the code fails.
Here are the bits of the Pipe class that matter. Shout if you want more code:
class Pipe:
    """
    there are a bunch of constants set up here. I dont think it would be useful to include them. Just think like this: Pipe.WHATEVER = 'WHATEVER'
    """
    def __init__(self, sPath):
        """
        create the fifo. if it already exists just associate with it
        """
        self.sPath = sPath
        if not os.path.exists(sPath):
            os.mkfifo(sPath)
        self.iFH = os.open(sPath, os.O_RDWR | os.O_NONBLOCK)
        self.iFHBlocking = os.open(sPath, os.O_RDWR)

    def write(self, dMessage):
        """
        write the dict to the fifo
        if dMessage is not a dictionary then there will be an exception here. There never is
        """
        self.writeln(Pipe.MESSAGE_START)
        for k in dMessage:
            self.writeln(Pipe.KEY)
            self.writeln(k)
            self.writeln(Pipe.VALUE)
            self.writeln(dMessage[k])
        self.writeln(Pipe.MESSAGE_END)

    def writeln(self, s):
        os.write(self.iFH, bytes('{0} : {1}\n'.format(Pipe.LINE_START, len(s)+1), 'utf-8'))
        os.write(self.iFH, bytes('{0}\n'.format(s), 'utf-8'))
        os.write(self.iFH, bytes(Pipe.LINE_END + '\n', 'utf-8'))

    def readln(self):
        """
        look for LINE_START, get line size
        read until LINE_END
        clean up
        return string
        """
        iLineStartBaseLength = len(self.LINE_START) + 3  #'{0} : '
        try:
            s = os.read(self.iFH, iLineStartBaseLength).decode('utf-8')
        except:
            return Pipe.READLINE_FAIL
        if Pipe.LINE_START in s:
            #get the length of the line
            sLineLen = ''
            while True:
                try:
                    sCurrent = os.read(self.iFH, 1).decode('utf-8')
                except:
                    return Pipe.READLINE_FAIL
                if sCurrent == '\n':
                    break
                sLineLen += sCurrent
            try:
                iLineLen = int(sLineLen.strip(string.punctuation + string.whitespace))
            except:
                raise Exception('Not a valid line length: "{0}"'.format(sLineLen))
            #read the line
            sLine = os.read(self.iFHBlocking, iLineLen).decode('utf-8')
            #read the line terminator
            sTerm = os.read(self.iFH, len(Pipe.LINE_END + '\n')).decode('utf-8')
            if sTerm == Pipe.LINE_END + '\n':
                return sLine
            return Pipe.READLINE_FAIL
        else:
            return Pipe.READLINE_FAIL

    def read(self):
        """
        read from the fifo, make a dict
        """
        dRet = {}
        sKey = ''
        sValue = ''
        sCurrent = None

        def value_flush():
            nonlocal dRet, sKey, sValue, sCurrent
            if sKey:
                dRet[sKey.strip()] = sValue.strip()
            sKey = ''
            sValue = ''
            sCurrent = ''

        if self.message_start():
            while True:
                sLine = self.readln()
                if Pipe.MESSAGE_END in sLine:
                    value_flush()
                    return dRet
                elif Pipe.KEY in sLine:
                    value_flush()
                    sCurrent = Pipe.KEY
                elif Pipe.VALUE in sLine:
                    sCurrent = Pipe.VALUE
                else:
                    if sCurrent == Pipe.VALUE:
                        sValue += sLine
                    elif sCurrent == Pipe.KEY:
                        sKey += sLine
        else:
            return Pipe.NO_MESSAGE
It sometimes fails here (in readln):
try:
    iLineLen = int(sLineLen.strip(string.punctuation + string.whitespace))
except:
    raise Exception('Not a valid line length: "{0}"'.format(sLineLen))
It doesn't fail anywhere else.
An example error is:
Not a valid line length: "KE 17"
The fact that it's intermittent suggests to me that it's due to some kind of race condition; I'm just struggling to figure out what it might be. Any ideas?
EDIT: added stuff about the calling processes
The Pipe is used by instantiating it in process A and process B, calling the constructor with the same path. Process A will then intermittently write to the Pipe and process B will try to read from it. At no point do I ever try to make the thing act as a two-way channel.
Here is a more long-winded explanation of the situation. I've been trying to keep the question short, but I think it's about time I gave up on that. Anyhow, I have a daemon and a Pyramid process that need to play nice. There are two Pipe instances in use: one that only Pyramid writes to, and one that only the daemon writes to. The stuff Pyramid writes is really short; I have experienced no errors on this pipe. The stuff the daemon writes is much longer; this is the pipe that's giving me grief. Both pipes are implemented in the same way. Both processes only write dictionaries to their respective Pipes (if this were not the case then there would be an exception in Pipe.write).
The basic algorithm is: Pyramid spawns the daemon; the daemon loads a crazy object hierarchy of doom with vast RAM consumption. Pyramid sends POST requests to the daemon, which then does a whole bunch of calculations and sends data to Pyramid so that a human-friendly page can be rendered. The human can then respond to what's in the hierarchy by filling in HTML forms and suchlike, thus causing Pyramid to send another dictionary to the daemon, and the daemon to send back a dictionary response.
So: only one pipe has exhibited any problems, the problem pipe has a lot more traffic than the other one, and it is guaranteed that only dictionaries are written to either.
EDIT: in response to a question and comment
Before you tell me to take out the try...except stuff, read on.
The fact that the exception gets raised at all is what is bothering me. iLineLen = int(stuff) looks to me like it should always be passed a string that looks like an integer. This is the case only most of the time, not all of it. So if you feel the urge to comment about how it's probably not an integer, please please don't.
To paraphrase my question: Spot the race condition and you will be my hero.
EDIT: a little example:
process_1.py:
oP = Pipe(some_path)
while 1:
    oP.write({'a':'foo','b':'bar','c':'erm...','d':'plop!','e':'etc'})
process_2.py:
oP = Pipe(same_path_as_before)
while 1:
    print(oP.read())
After playing around with the code, I suspect the problem is coming from how you are reading the file.
Specifically, lines like this:
os.read(self.iFH, iLineStartBaseLength)
That call doesn't necessarily return iLineStartBaseLength bytes - it might consume "LI", then return READLINE_FAIL and retry. On the second attempt, it will get the remainder of the line, and somehow end up giving the non-numeric string to the int() call.
The unpredictability likely comes from how the fifo is being flushed - if it happens to flush when the complete line is written, all is fine. If it flushes when the line is half-written, weirdness.
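In other words, a read of N bytes from a pipe generally has to be done in a loop. A helper along these lines (a sketch, not part of the original class, and assuming a blocking descriptor; the non-blocking one would also need EAGAIN handling) reads until it really has N bytes or the writer goes away:

import os

def read_exactly(fd, n):
    # Keep calling os.read until exactly n bytes have arrived.
    chunks = []
    remaining = n
    while remaining > 0:
        data = os.read(fd, remaining)
        if not data:  # writer closed the fifo
            raise EOFError('expected %d more bytes' % remaining)
        chunks.append(data)
        remaining -= len(data)
    return b''.join(chunks)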
At least in the hacked-up version of the script I ended up with, the oP.read() call in process_2.py often got a different dict from the one sent (where the KEY might bleed into the previous VALUE, among other strangeness).
I might be mistaken, as I had to make a bunch of changes to get the code running on OS X, and further while experimenting. My modified code here
Not sure exactly how to fix it, but... with the json module or similar, the protocol/parsing can be greatly simplified - newline-separated JSON data is much easier to parse:
import os
import time
import json
import errno

def retry_write(*args, **kwargs):
    """Like os.write, but retries until EAGAIN stops appearing
    """
    while True:
        try:
            return os.write(*args, **kwargs)
        except OSError as e:
            if e.errno == errno.EAGAIN:
                time.sleep(0.5)
            else:
                raise

class Pipe(object):
    """FIFO based IPC based on newline-separated JSON
    """

    ENCODING = 'utf-8'

    def __init__(self, sPath):
        self.sPath = sPath
        if not os.path.exists(sPath):
            os.mkfifo(sPath)
        self.fd = os.open(sPath, os.O_RDWR | os.O_NONBLOCK)
        self.file_blocking = open(sPath, "r", encoding=self.ENCODING)

    def write(self, dmsg):
        serialised = json.dumps(dmsg) + "\n"
        dat = bytes(serialised.encode(self.ENCODING))
        # This blocks until data can be read by other process.
        # Can just use os.write and ignore EAGAIN if you want
        # to drop the data
        retry_write(self.fd, dat)

    def read(self):
        serialised = self.file_blocking.readline()
        return json.loads(serialised)
Try getting rid of the try:, except: blocks and seeing what exception is actually being thrown.
So replace your sample with just:
iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
I bet it'll now throw a ValueError, and it's because you're trying to cast "KE 17" to an int.
You'll need to strip more than string.whitespace and string.punctuation if you're going to cast the string to an int.
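For example, if you really only want the digits, something like this would do it (a quick sketch; the underlying protocol problem is still better fixed as discussed in the other answer):

import re

sLineLen = 'KE 17'
iLineLen = int(re.sub(r'\D', '', sLineLen))  # strips everything but digits -> 17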